Tags: c++, linux

Huge latency spikes while running simple code

Posted on 2020-03-27 10:20:47

I have a simple benchmark that demonstrates the performance of busy-wait threads. It runs in two modes: the first simply takes two timepoints sequentially; the second iterates through a vector and measures the duration of each iteration. I see that two sequential calls of clock::now() take about 50 nanoseconds on average, and one average iteration through the vector takes about 100 nanoseconds. But sometimes these operations are executed with a huge delay: about 50 microseconds in the first case and 10 milliseconds (!) in the second case.
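In other words, in the first mode the timed region is empty, so the measurement reduces to the gap between two back-to-back clock reads, roughly like this (a minimal standalone sketch, not the actual test code, which follows below):

    #include <chrono>
    #include <cstdio>

    int main()
    {
        // Two consecutive steady_clock reads; the gap is normally ~50 ns
        // on this machine, but occasionally spikes to tens of microseconds.
        auto t1 = std::chrono::steady_clock::now();
        auto t2 = std::chrono::steady_clock::now();
        auto ns = std::chrono::duration_cast<std::chrono::nanoseconds>(t2 - t1).count();
        std::printf("%lld ns\n", static_cast<long long>(ns));
    }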

The test runs on a single isolated core, so context switches do not occur. I also call mlockall at the beginning of the program, so I assume that page faults do not affect the performance.

The following additional optimizations were also applied (a code sketch for two of the items follows the list):

  • kernel boot parameters: intel_idle.max_cstate=0 idle=halt irqaffinity=0,14 isolcpus=4-13,16-27 pti=off spectre_v2=off audit=0 selinux=0 nmi_watchdog=0 nosoftlockup=0 rcu_nocb_poll rcu_nocbs=19-20 nohz_full=19-20;
  • rcu[^c] kernel threads moved to housekeeping CPU core 0;
  • network card RxTx queues moved to housekeeping CPU core 0;
  • writeback kernel workqueue moved to housekeeping CPU core 0;
  • transparent_hugepage disabled;
  • Intel CPU HyperThreading disabled;
  • swap file/partition is not used.
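For reference, the writeback-workqueue and transparent-hugepage items can be applied at runtime through standard sysfs interfaces. A minimal sketch, assuming root privileges; WriteSysfs is a hypothetical helper, and the actual tuning may well have been done from a shell instead:

    #include <fstream>
    #include <stdexcept>
    #include <string>

    // Hypothetical helper: write a value to a sysfs/procfs file.
    static void WriteSysfs(const std::string& aPath, const std::string& aValue)
    {
        std::ofstream file { aPath };
        if (!file || !(file << aValue))
        {
            throw std::runtime_error("failed to write " + aPath);
        }
    }

    int main()
    {
        // Keep the writeback workqueue on housekeeping CPU 0 (cpumask "1").
        WriteSysfs("/sys/bus/workqueue/devices/writeback/cpumask", "1");
        // Disable transparent hugepages.
        WriteSysfs("/sys/kernel/mm/transparent_hugepage/enabled", "never");
    }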

Environment:

System details:
Default Archlinux kernel:
5.1.9-arch1-1-ARCH #1 SMP PREEMPT Tue Jun 11 16:18:09 UTC 2019 x86_64 GNU/Linux

which has the following PREEMPT and HZ settings:
CONFIG_HZ_300=y
CONFIG_HZ=300
CONFIG_PREEMPT=y

Hardware details:

RAM: 256GB

CPU(s):              28
On-line CPU(s) list: 0-27
Thread(s) per core:  1
Core(s) per socket:  14
Socket(s):           2
NUMA node(s):        2
Vendor ID:           GenuineIntel
CPU family:          6
Model:               79
Model name:          Intel(R) Xeon(R) CPU E5-2690 v4 @ 2.60GHz
Stepping:            1
CPU MHz:             3200.011
CPU max MHz:         3500.0000
CPU min MHz:         1200.0000
BogoMIPS:            5202.68
Virtualization:      VT-x
L1d cache:           32K
L1i cache:           32K
L2 cache:            256K
L3 cache:            35840K
NUMA node0 CPU(s):   0-13
NUMA node1 CPU(s):   14-27

Example code:


    #include <algorithm>
    #include <chrono>
    #include <cstddef>
    #include <iostream>
    #include <stdexcept>
    #include <vector>

    #include <sys/mman.h>

    // Measurings, SetCpuAffinity and WaitForKey are helpers from the full
    // project; a sketch of SetCpuAffinity follows the listing.

    struct TData
    {
        std::vector<char> Data;

        TData() = default;
        TData(size_t aSize)
        {
            for (size_t i = 0; i < aSize; ++i)
            {
                Data.push_back(static_cast<char>(i));
            }
        }
    };

    using TBuffer = std::vector<TData>;

    TData DoMemoryOperation(bool aPerform, const TBuffer& aBuffer, size_t& outBufferIndex)
    {
        if (!aPerform)
        {
            return TData {};
        }

        const TData& result = aBuffer[outBufferIndex];

        if (++outBufferIndex == aBuffer.size())
        {
            outBufferIndex = 0;
        }

        return result;
    }

    void WarmUp(size_t aCyclesCount, bool aPerform, const TBuffer& aBuffer)
    {
        size_t bufferIndex = 0;
        for (size_t i = 0; i < aCyclesCount; ++i)
        {
            auto data = DoMemoryOperation(aPerform, aBuffer, bufferIndex);
        }
    }

    void TestCycle(size_t aCyclesCount, bool aPerform, const TBuffer& aBuffer, Measurings& outStatistics)
    {
        size_t bufferIndex = 0;
        for (size_t i = 0; i < aCyclesCount; ++i)
        {
            auto t1 = std::chrono::steady_clock::now();
            {
                auto data = DoMemoryOperation(aPerform, aBuffer, bufferIndex);
            }
            auto t2 = std::chrono::steady_clock::now();
            auto diff = std::chrono::duration_cast<std::chrono::nanoseconds>(t2 - t1).count();
            outStatistics.AddMeasuring(diff, t2);
        }
    }

    int Run(int aCpu, size_t aDataSize, size_t aBufferSize, size_t aCyclesCount, bool aAllocate, bool aPerform)
    {
        if (mlockall(MCL_CURRENT | MCL_FUTURE))
        {
            throw std::runtime_error("mlockall failed");
        }

        std::cout << "Test parameters"
            << ":\ndata size=" << aDataSize
            << ",\nnumber of elements=" << aBufferSize
            << ",\nbuffer size=" << aBufferSize * aDataSize
            << ",\nnumber of cycles=" << aCyclesCount
            << ",\nallocate=" << aAllocate
            << ",\nperform=" << aPerform
            << ",\nthread ";

        SetCpuAffinity(aCpu);

        TBuffer buffer;

        if (aPerform)
        {
            buffer.resize(aBufferSize);
            std::fill(buffer.begin(), buffer.end(), TData { aDataSize });
        }

        WaitForKey();
        std::cout << "Running..."<< std::endl;

        WarmUp(aBufferSize * 2, aPerform, buffer);

        Measurings statistics;
        TestCycle(aCyclesCount, aPerform, buffer, statistics);
        statistics.Print(aCyclesCount);

        WaitForKey();

        if (munlockall())
        {
            throw std::runtime_error("munlockall failed");
        }

        return 0;
    }
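A minimal sketch of what SetCpuAffinity presumably does, pinning the calling thread with sched_setaffinity (an assumption; the original implementation may differ):

    #include <sched.h>
    #include <stdexcept>

    void SetCpuAffinity(int aCpu)
    {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(aCpu, &set);
        // pid 0 means the calling thread.
        if (sched_setaffinity(0, sizeof(set), &set) != 0)
        {
            throw std::runtime_error("sched_setaffinity failed");
        }
    }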

The following results were obtained. First run (perform=0):

StandaloneTests --run_test=MemoryAccessDelay --cpu=19 --data-size=280 --size=67108864 --count=1000000000 --allocate=1 --perform=0
Test parameters:
data size=280,
number of elements=67108864,
buffer size=18790481920,
number of cycles=1000000000,
allocate=1,
perform=0,
thread 14056 on cpu 19

Statistics: min: 16: max: 18985: avg: 18
0 - 10 : 0 (0 %): -
10 - 100 : 999993494 (99 %): min: 40: max: 117130: avg: 40
100 - 1000 : 946 (0 %): min: 380: max: 506236837: avg: 43056598
1000 - 10000 : 5549 (0 %): min: 56876: max: 70001739: avg: 7341862
10000 - 18985 : 11 (0 %): min: 1973150818: max: 14060001546: avg: 3644216650

Second run (perform=1):

StandaloneTests --run_test=MemoryAccessDelay --cpu=19 --data-size=280 --size=67108864 --count=1000000000 --allocate=1 --perform=1
Test parameters:
data size=280,
number of elements=67108864,
buffer size=18790481920,
number of cycles=1000000000,
allocate=1,
perform=1,
thread 3264 on cpu 19

Statistics: min: 36: max: 4967479: avg: 48
0 - 10 : 0 (0 %): -
10 - 100 : 964323921 (96 %): min: 60: max: 4968567: avg: 74
100 - 1000 : 35661548 (3 %): min: 122: max: 4972632: avg: 2023
1000 - 10000 : 14320 (0 %): min: 1721: max: 33335158: avg: 5039338
10000 - 100000 : 130 (0 %): min: 10010533: max: 1793333832: avg: 541179510
100000 - 1000000 : 0 (0 %): -
1000000 - 4967479 : 81 (0 %): min: 508197829: max: 2456672083: avg: 878824867

Any ideas what the reason for such huge delays might be, and how it can be investigated?

Questioner: Dmitriy Chizhov
Viewed: 123
Answer by Maxim Egorushkin, 2019-07-04 01:19

In:

    TData DoMemoryOperation(bool aPerform, const TBuffer& aBuffer, size_t& outBufferIndex);

it returns TData, which holds a std::vector<char>, by value. That involves a memory allocation and data copying. The memory allocation can make a syscall (brk or mmap), and memory-mapping syscalls are notorious for being slow.
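One way to keep the allocator, and any syscalls it makes, out of the timed region is to return a pointer to the buffer element instead of a copy. A minimal sketch reusing the question's TBuffer and TData types (assuming callers only need to read the element):

    // No allocation and no copy on the hot path.
    const TData* DoMemoryOperationNoCopy(bool aPerform, const TBuffer& aBuffer, size_t& outBufferIndex)
    {
        if (!aPerform)
        {
            return nullptr;
        }

        const TData* result = &aBuffer[outBufferIndex];

        if (++outBufferIndex == aBuffer.size())
        {
            outBufferIndex = 0;
        }

        return result;
    }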

When the timed region can include syscalls, one cannot expect low variance.

You may like to run your application with /usr/bin/time --verbose <app> or perf stat -ddd <app> to see the number of page faults and context switches.
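The same counters can also be read from inside the process via getrusage; a small sketch (PrintRusage is an illustrative name):

    #include <sys/resource.h>
    #include <cstdio>

    // Print page-fault and context-switch counters for the current process.
    void PrintRusage()
    {
        rusage usage {};
        if (getrusage(RUSAGE_SELF, &usage) == 0)
        {
            std::printf("minor faults: %ld, major faults: %ld, "
                        "voluntary ctx switches: %ld, involuntary: %ld\n",
                        usage.ru_minflt, usage.ru_majflt,
                        usage.ru_nvcsw, usage.ru_nivcsw);
        }
    }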