I just made an amazing discovery yesterday that I thought I'd share, regarding the Raspberry Pi and the performance cost of calling the clock() function (from time.h) in C++.
As many of you know, I've been working on a high-performance gcode pre-processor for my plugin Octolapse. The routine started out written in Python, but performance was abysmal (about 0.1MB/sec) on the Pi, and not much better on my Windows machine. I decided C++ was the way to go.
After many versions, revisions, and enhancements, I was able to increase throughput to 32MB/sec on my dev box (x86) when running a standalone test executable, and about 7.4MB/sec when running through Python (not sure exactly why the performance is so much lower there, but that's another discussion).
However, these gains did not carry over linearly to the Raspberry Pi 3B+, where I was getting about 1.615MB/sec in the best case. For example, when I increased performance on my dev box by 50%, I'd see maybe a 10% increase on the Pi, so I knew something else was going on.
My plan was to run some profiling on the Pi, but I had a lot of trouble getting it working. I ended up with a poor man's approach: isolating which functions were taking the most time (reading from a file, parsing, position processing, or other) by directly measuring and outputting the time spent in the main functions of my inner loop. To do this I added some calls to clock() (and to the high-resolution timer, with similar results). The execution times then went through the roof, but only on the Pi! There were 8 calls to clock() deep within the inner loop that I had added purely to measure performance, plus one that was there for functional reasons. When I removed the 8 I had added, the execution time plummeted.
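For anyone curious, the instrumentation looked roughly like this. This is a minimal sketch, not the actual Octolapse code; parse_line() and update_position() are made-up stand-ins for the real inner-loop functions:

    // Poor man's profiling: accumulate clock() deltas per phase.
    // Each clock() call here is cheap on x86 but (as it turned out)
    // very expensive on the Pi -- and they sit in the inner loop.
    #include <cstdio>
    #include <ctime>

    static double parse_seconds    = 0.0;
    static double position_seconds = 0.0;

    // Stand-in for the real gcode parser.
    static void parse_line(const char* /*line*/) { /* ... */ }

    // Stand-in for the real position processor.
    static void update_position() { /* ... */ }

    static void process_line(const char* line)
    {
        std::clock_t t0 = std::clock();
        parse_line(line);
        parse_seconds += double(std::clock() - t0) / CLOCKS_PER_SEC;

        t0 = std::clock();
        update_position();
        position_seconds += double(std::clock() - t0) / CLOCKS_PER_SEC;
    }

    int main()
    {
        for (int i = 0; i < 1000000; ++i)
            process_line("G1 X10 Y10 E0.1");
        std::printf("parsing: %.3fs, position: %.3fs\n",
                    parse_seconds, position_seconds);
    }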
The one remaining call to clock() was used to decide when to send a status update (% complete) to the UI. Now that I knew how slow this call was on the Pi, I added some code that only allows the check once per 1000 lines of gcode processed. My throughput jumped from 1.61MB/sec to 7.44MB/sec! That call had virtually no effect (a fraction of a percent of total time) on my dev machine.
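The fix was essentially a cheap modulo guard in front of the clock() call. Here's a sketch of the idea (again not the actual plugin code; the function names and the 1-second UI interval are my own illustration):

    #include <cstdio>
    #include <ctime>

    static const long   kCheckEveryLines = 1000; // line interval from the post
    static const double kUpdateSeconds   = 1.0;  // hypothetical UI interval

    // Stand-in for the real "send % complete to the UI" callback.
    static void report_progress(long done, long total)
    {
        std::printf("%.1f%% complete\n", 100.0 * double(done) / double(total));
    }

    static void maybe_report_progress(long lines_done, long lines_total)
    {
        static std::clock_t last_update = std::clock();

        // Cheap integer guard: 999 of every 1000 lines never touch clock().
        if (lines_done % kCheckEveryLines != 0)
            return;

        std::clock_t now = std::clock();
        if (double(now - last_update) / CLOCKS_PER_SEC >= kUpdateSeconds)
        {
            last_update = now;
            report_progress(lines_done, lines_total);
        }
    }

The point is that the expensive call drops out of the hot path almost entirely, at the cost of the progress report being at most 1000 lines stale.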
The moral of the story: when doing cross-platform development, if you run into performance problems, don't assume that a bottleneck on one system will be a bottleneck on another. If I had taken the time to get profiling working on the Pi in the first place, I would have saved myself SO much work. When I added the timers, I was almost sure I'd find some issue with IO (reading from the SD card), poor floating-point performance, or something similar. It turns out it was one call to clock()!
In retrospect, I'm glad I had this issue, because it forced me to make some performance gains I wouldn't otherwise have made, and I learned something important about cross-platform development.
/end_rant