I've written a simple program and want to evaluate its runtime performance on a real machine, e.g. my MacBook. The source code is:
#include <stdio.h>
#include <vector>
#include <ctime>
int main() {
    auto beg = std::clock();
    for (int i = 0; i < 1e8; ++i) {
    }
    auto end = std::clock();
    printf("CPU time used: %lf ms\n", 1000.0 * (end - beg) / CLOCKS_PER_SEC);
}
It's compiled with gcc at the default optimization level. Using a bash script, I ran it 1000 times on my MacBook and recorded the runtimes, as follows:
[130.000000, 136.000000): 0
[136.000000, 142.000000): 1
[142.000000, 148.000000): 234
[148.000000, 154.000000): 116
[154.000000, 160.000000): 138
[160.000000, 166.000000): 318
[166.000000, 172.000000): 139
[172.000000, 178.000000): 40
[178.000000, 184.000000): 11
[184.000000, 190.000000): 3
"[a, b): n" means that the runtime of the same program fell between a ms and b ms in n of the runs.
It's clear that the actual runtime varies considerably, and the distribution does not look normal. Could someone kindly tell me what causes this and how I can measure the runtime correctly?
Thanks for responding to this question.
Benchmarking is hard!
Short answer: use google benchmark
Long answer: There are many things that will interfere with timings, such as CPU frequency scaling, the state of the caches, scheduling priority, and migration between CPUs.
The only way to reduce these effects is to disable CPU scaling, run "cache-flush" functions (normally just touching a lot of memory before starting), run at high priority, and lock yourself to a single CPU. Even after all that, your timings will still be noisy, so the last step is simply to repeat the measurement many times and use the average.
This is why tools like Google Benchmark are probably your best bet.
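If pulling in a library is not an option, the "repeat a lot" step can be roughed out with the standard library alone. The following is only a minimal sketch, not how Google Benchmark works internally: `time_runs` and `counting_loop` are hypothetical names, the warm-up count is an arbitrary assumption, and it measures elapsed time with `std::chrono::steady_clock` rather than the CPU time that `std::clock` reports in the question.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Hypothetical helper: calls `work` a few times untimed, then times it
// `repeats` times and returns each run's duration in milliseconds.
template <typename F>
std::vector<double> time_runs(F&& work, std::size_t repeats, std::size_t warmups = 3) {
    assert(repeats > 0);
    using clock = std::chrono::steady_clock;  // monotonic, unaffected by clock adjustments

    // Untimed warm-up runs: fill the caches and give the CPU time to ramp up.
    for (std::size_t i = 0; i < warmups; ++i) {
        work();
    }

    std::vector<double> ms;
    ms.reserve(repeats);
    for (std::size_t i = 0; i < repeats; ++i) {
        const auto beg = clock::now();
        work();
        const auto end = clock::now();
        ms.push_back(std::chrono::duration<double, std::milli>(end - beg).count());
    }
    return ms;
}

// Hypothetical workload resembling the loop in the question. Writing to a
// volatile variable keeps an optimizing compiler from deleting the loop.
inline void counting_loop(int n) {
    volatile int sink = 0;
    for (int i = 0; i < n; ++i) {
        sink = i;
    }
    (void)sink;
}
```

For instance, `time_runs([] { counting_loop(100000000); }, 1000)` would collect 1000 samples inside a single process, which also avoids paying process start-up cost on every run.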
video from CPPCon
Also available live online
The average is not necessarily the best choice. The median, for example, filters out outliers at either extreme. The lowest time is also worth considering, as the run with the least interference (even if it is less reproducible/comparable between runs).
@Jarod42: github.com/google/benchmark#reporting-statistics. You can always argue about which average is best, it mostly depends on what you are using it for. Expect variance though. It's almost impossible to remove.
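To compare the candidates mentioned in these comments on your own samples, something like the following could be used. It is a sketch only: `Summary` and `summarize` are hypothetical names, and the input is assumed to be a non-empty list of per-run times in milliseconds.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical result type holding the three summaries discussed above.
struct Summary {
    double min;
    double median;
    double mean;
};

// Takes the samples by value because sorting reorders them.
inline Summary summarize(std::vector<double> samples) {
    assert(!samples.empty());
    std::sort(samples.begin(), samples.end());

    const std::size_t n = samples.size();
    // Middle element for odd n, mean of the two middle elements for even n.
    const double median = (n % 2 == 1)
        ? samples[n / 2]
        : 0.5 * (samples[n / 2 - 1] + samples[n / 2]);
    const double mean =
        std::accumulate(samples.begin(), samples.end(), 0.0) / static_cast<double>(n);

    return Summary{samples.front(), median, mean};
}
```

With the samples 150, 145, 190 and 147 ms, this gives a minimum of 145, a median of 148.5 and a mean of 158: the single slow run pulls the mean up but barely moves the median.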