811c173523
Currently, when ctpop is supported for scalar types, the expansion of @llvm.ctpop.vXiY uses vector element extractions, insertions and individual calls to @llvm.ctpop.iY. When not, expansion with bit-math operations is used for the scalar calls. Local haswell measurements show that we can improve vector @llvm.ctpop.vXiY expansion in some cases by using a using a vector parallel bit twiddling approach, based on: v = v - ((v >> 1) & 0x55555555); v = (v & 0x33333333) + ((v >> 2) & 0x33333333); v = ((v + (v >> 4) & 0xF0F0F0F) v = v + (v >> 8) v = v + (v >> 16) v = v & 0x0000003F (from http://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetParallel) When scalar ctpop isn't supported, the approach above performs better for v2i64, v4i32, v4i64 and v8i32 (see numbers below). And even when scalar ctpop is supported, this approach performs ~2x better for v8i32. Here, x86_64 implies -march=corei7-avx without ctpop and x86_64h includes ctpop support with -march=core-avx2. == [x86_64h - new] v8i32: 0.661685 v4i32: 0.514678 v4i64: 0.652009 v2i64: 0.324289 == [x86_64h - old] v8i32: 1.29578 v4i32: 0.528807 v4i64: 0.65981 v2i64: 0.330707 == [x86_64 - new] v8i32: 1.003 v4i32: 0.656273 v4i64: 1.11711 v2i64: 0.754064 == [x86_64 - old] v8i32: 2.34886 v4i32: 1.72053 v4i64: 1.41086 v2i64: 1.0244 More work for other vector types will come next. llvm-svn: 224725 |
||
---|---|---|
clang | ||
clang-tools-extra | ||
compiler-rt | ||
debuginfo-tests | ||
libclc | ||
libcxx | ||
libcxxabi | ||
lld | ||
lldb | ||
llgo | ||
llvm | ||
openmp | ||
polly |