The Joys of Cross-Platform Development

I just made an amazing discovery yesterday that I thought I'd share, regarding the Raspberry Pi and the performance cost of calling the clock() function (from time.h) in C++.

As many of you know, I've been working on a high performance gcode pre-processor for my plugin Octolapse. The routine started out written in Python, but had abysmal performance (about 0.1MB/second) on the Pi, and not much better on my Windows machine. I decided C++ was the way to go.

After many versions, revisions, and enhancements, I was able to increase the throughput to 32MB/sec on my dev box (x86) when running a test program (exc), and about 7.4MB/sec when running through Python (I'm not sure exactly why the performance is so much lower there, but that's another discussion).

However, these gains were not translating linearly to the RPi 3B+, where I was getting about 1.615MB/sec in the best case, so I knew something else was going on. For example, when I increased performance on my dev box by 50%, I was seeing maybe a 10% increase on the Pi.

My solution was to run some profiling on the Pi, but I had a lot of trouble doing this. I ended up with a poor man's approach: trying to isolate which functions were taking the most time (reading from a file, parsing, position processing, or other) by directly measuring and outputting the time spent in the main functions of my inner loop. To do this I added some calls to clock() (and to the high performance timer, but the results were similar). I then noticed that the execution times went through the roof, but only on the Pi! There were 8 calls to clock() deep within the inner loop that I had added just to measure performance, plus one that was there for functional reasons. When I removed the 8 I had added, the execution time plummeted.

The one remaining call to clock() was being used to determine when to give a status update (% complete) to the UI. Since I now knew how slow this call was on the Pi, I added some code that only allowed the call once per 1000 lines of gcode processed. My throughput then increased from 1.61MB/sec to 7.44MB/sec! This call had virtually no effect (a fraction of a percent of total time) on my dev machine.

The moral of the story: when doing cross-platform development and you run into performance problems, don't assume that a bottleneck on one system will be a bottleneck on another. If I had taken the time to get profiling working on the Pi in the first place, I would have saved myself SO much work. When I added the timers, I was almost sure I'd see an issue with IO (reading from the SD card), poor floating point performance, or something similar. It turns out it was one call to clock()!

In retrospect, I'm glad I had this issue, because it forced me to make some performance gains that I wouldn't have otherwise made, and I learned something important about cross platform development.

/end_rant :slight_smile:

Thanks for sharing.
At some point you were on a mission.
Now some of us can avoid the same thing altogether on the Pi.

Very interesting...I'll dig it!...thank you for sharing.

Cool. I'm going back a few decades, but when I was doing embedded systems development back then (and the Pi is an embedded system), the hardware abstraction layer is likely where the difference lies. Your x86 dev machine and the Pi have different physical logic and constraints for timer functions.

Great find. BTW, were you able to set up profiling on the Pi?

Your x86 dev machine and the Pi have different physical logic and constraints for timer functions.

After I discovered this I did some research on the topic. Apparently this is a known issue when running Linux, and it has something to do with how the kernel handles these calls. Getting the clock time on Linux can reportedly take 200X longer than it does on Windows!

BTW, were you able to set up profiling on the Pi?

No, but it's on my list. It's so easy to do on my development machine that I've been dreading setting it up on the Pi. Profiling a C++ routine called from Python turns out to be a bit of a pain in the neck. I'm planning to take the plunge and create a full-fledged development environment in Linux so I can get more comfortable there. I've learned a lot over the last couple of years, but I still feel like a total noob :slight_smile:

That's how I feel when it comes to linux too. Great write up though.

Feel the impostor syndrome flow through you! I think anyone who gets into a specialty feels this at some point.

Thank you for your compliment!

Look at the comparison table at the end of this for the file sizes of various executables. I was rather surprised at how much bloat some of these added. One of these days I should add assembly language to that.

I wonder if this would work on the ARM7. It looks like greased lightning to me.

Very interesting. Size is one thing, and speed is another.

Here is a good article I found about a similar call for acquiring the system time on Windows vs. Linux. Warning: it is a deep dive.

I know, right? I'm working with someone now who's added a RTC to their board which I'm expected to query over a serial interface to get the time, presumably. And then, I'm guessing that they want me to update the time on the Pi. (Talk about "synchronicity at a distance".) I forgot to add that it would be Python on both sides of the conversation.

Wow, so many likes. You guys and girls are awesome! I love this community!
