温馨提示:本文翻译自stackoverflow.com,查看原文请点击:c++ - Raw Loop on an array of bool is 5 times faster than transform or for_each
c++ c++17 performance clang++

c++ - 布尔数组上的Raw Loop比transform或for_each快5倍

发布于 2020-03-27 11:45:24

根据我之前对transform和for_each进行基准测试的经验,它们的执行速度通常比原始循环要快一些,并且它们当然更安全,因此我尝试将所有原始循环替换为transform,generate和for_each。今天,我比较了使用for_each,transform和raw循环可以快速翻转布尔值的结果,我得到了非常令人惊讶的结果。raw_loop的执行速度是其他两个循环的5倍。我真的没有找到找到如此巨大差异的充分理由吗?

#include <array>
#include <algorithm>


static void ForEach(benchmark::State& state) {
  std::array<bool, sizeof(short) * 8> a;
  std::fill(a.begin(), a.end(), true);

  for (auto _ : state) {
    std::for_each(a.begin(), a.end(), [](auto & arg) { arg = !arg; });
    benchmark::DoNotOptimize(a);
  }
}
BENCHMARK(ForEach);

static void Transform(benchmark::State& state) {
  std::array<bool, sizeof(short) * 8> a;
  std::fill(a.begin(), a.end(), true);

  for (auto _ : state) {
    std::transform(a.begin(), a.end(), a.begin(), [](auto arg) { return !arg; });
    benchmark::DoNotOptimize(a);
  }
}
BENCHMARK(Transform);


static void RawLoop(benchmark::State& state) {
  std::array<bool, sizeof(short) * 8> a;
  std::fill(a.begin(), a.end(), true);

  for (auto _ : state) {
    for (int i = 0; i < a.size(); i++) {
      a[i] = !a[i];
    }
    benchmark::DoNotOptimize(a);
  }
}
BENCHMARK(RawLoop);

clang ++(7.0)-O3 -libc ++(LLVM)

在此处输入图片说明

查看更多

查看更多

提问者
apramc
被浏览
144
J. Antonio Perez 2019-07-06 06:46

In this example, clang vectorizes indexing but (mistakenly) fails to vectorize iterating.

To summarize the results, there is no difference between using a raw loop and using std::transform or std::for_each. There IS, however, a difference between using indexing and using iterating, and for the purposes of this particular problem, clang is better at optimizing indexing than it is at optimizing iterating because indexing gets vectorized. std::transform and std::for_each use iterating, so they end up being slower (when compiled under clang).

What's the difference between indexing and iterating? - Indexing is when you use an integer to index into an array - Iterating is when you increment a pointer from begin() to end().

Let's write the raw loop using indexing and using iterating, and compare the performance of iterating (with a raw loop) to indexing.

// Indexing
for(int i = 0; i < a.size(); i++) {
    a[i] = !a[i];
}
// Iterating, used by std::for_each and std::transform
bool* begin = a.data();
bool* end   = begin + a.size(); 
for(; begin != end; ++begin) {
    *begin = !*begin; 
}

The example using indexing is better-optimized, and runs 4-5 times faster when compiled with clang.

In order to demonstrate this, let's add two additional tests, both using a raw loop. One will use an iterator, and the other one will use raw pointers.


static void RawLoopIt(benchmark::State& state) {
  std::array<bool, 16> a;
  std::fill(a.begin(), a.end(), true); 

  for(auto _ : state) {
    auto scan = a.begin(); 
    auto end = a.end(); 
    for (; scan != end; ++scan) {
      *scan = !*scan; 
    }
    benchmark::DoNotOptimize(a); 
  }
 }

BENCHMARK(RawLoopIt); 

static void RawLoopPtr(benchmark::State& state) {
  std::array<bool, 16> a;
  std::fill(a.begin(), a.end(), true); 

  for(auto _ : state) {
    bool* scan = a.data(); 
    bool* end = scan + a.size(); 
    for (; scan != end; ++scan) {
      *scan = !*scan; 
    }
    benchmark::DoNotOptimize(a); 
  } 
}

BENCHMARK(RawLoopPtr); 

When using a pointer or an iterator from begin to end, these functions identical in performance to using std::for_each or std::transform.

Clang Quick-bench results:

在此处输入图片说明

This is confirmed by running the clang benchmark locally:

me@K2SO:~/projects/scratch$ clang++ -O3 bench.cpp -lbenchmark -pthread -o clang-bench
me@K2SO:~/projects/scratch$ ./clang-bench
2019-07-05 16:13:27
Running ./clang-bench
Run on (8 X 4000 MHz CPU s)
CPU Caches:
  L1 Data 32K (x4)
  L1 Instruction 32K (x4)
  L2 Unified 256K (x4)
  L3 Unified 8192K (x1)
Load Average: 0.44, 0.55, 0.59
-----------------------------------------------------
Benchmark           Time             CPU   Iterations
-----------------------------------------------------
ForEach          8.32 ns         8.32 ns     83327615
Transform        8.29 ns         8.28 ns     82536410
RawLoop          1.92 ns         1.92 ns    361745495
RawLoopIt        8.31 ns         8.31 ns     81848945
RawLoopPtr       8.28 ns         8.28 ns     82504276

GCC does not have this problem.

出于本示例的目的,索引编制或迭代之间没有根本区别。它们都对数组应用了相同的转换,并且编译器应该能够对它们进行相同的编译。

确实,GCC能够做到这一点,所有方法的运行速度都比在clang下编译的相应版本

GCC快速基准测试结果: 在此处输入图片说明

GCC本地结果:

2019-07-05 16:13:35
Running ./gcc-bench
Run on (8 X 4000 MHz CPU s)
CPU Caches:
  L1 Data 32K (x4)
  L1 Instruction 32K (x4)
  L2 Unified 256K (x4)
  L3 Unified 8192K (x1)
Load Average: 0.52, 0.57, 0.60
-----------------------------------------------------
Benchmark           Time             CPU   Iterations
-----------------------------------------------------
ForEach          1.43 ns         1.43 ns    484760981
Transform        1.44 ns         1.44 ns    485788409
RawLoop          1.43 ns         1.43 ns    484973417
RawLoopIt        1.44 ns         1.44 ns    482685685
RawLoopPtr       1.44 ns         1.44 ns    483736235

索引实际上比迭代快吗?不会。索引是更快的,因为clang将其向量化。

在引擎盖下,不会发生迭代建立索引。取而代之的是,gcc和clang 通过将数组视为两个64位整数,并对它们使用按位异或运算来对操作进行矢量化我们可以在用于翻转位的程序集中看到这一点:

       movabs $0x101010101010101,%rax
       nopw   %cs:0x0(%rax,%rax,1)
       xor    %rax,(%rsp)
       xor    %rax,0x8(%rsp)
       sub    $0x1,%rbx

使用clang编译时,迭代速度较慢因为由于某种原因,使用迭代时clang无法向量化操作。这是clang的缺陷,是专门针对此问题的缺陷。随着铛声的提高,这种差异应该消失,这不是我现在要担心的事情。

不要微优化。让编译器处理该问题,并在必要时测试gcc或clang是否针对您的特定用例生成了更快的代码并非所有情况下都更好。例如,clang更好地向量化了一些数学运算。