Introduction: Implementing VectorAdd
Let's start with a simple example. This is C/C++ implementation of addition of two vectors.
void vadd(double* a, double const* b, double const* c, int n) {
for (int i = 0; i < n; ++i)
a[i] = b[i] + c[i];
}
Vector instructions can operate 256 elements in a single instruction. Strip-mining to make an innermost loop with 256 iterations is first step to use intrinsics.
#define VL 256
void vadd(double* a, double const* b, double const* c, int n) {
for (int i = 0; i < n; i += VL) {
int vl = min(VL, n - i);
for (int j = 0; j < vl; ++j)
a[j] = b[j] + c[j];
a += vl;
b += vl;
c += vl;
}
}
Then replace the innermost loop with intrinsic functions and __vr
(vector register) type
variables after including velintrin.h
.
#include <velintrin.h>
#define VL 256
void vadd(double* a, double const* b, double const* c, int n) {
for (int i = 0; i < n; i += VL) {
int vl = min(VL, n - i);
__vr vb = _vel_vld_vssl(8, b, vl); // load b to vb
__vr vc = _vel_vld_vssl(8, c, vl); // load c to vc
__vr va = _vel_vfaddd_vvvl(vb, bc, vl); // va = vb + vc
_vel_vst_vssl(va, 8, a, vl); // store va to a
a += vl;
b += vl;
c += vl;
}
}
A vector register can hold 256 64bit elements and an intrinsic function
operates vl
elements in it. vl
is passed as a last argument of an intrinsic
function.
The four intrinsics used in this example are compiled to four vector
instructions vld, vld, vfadd.d, vst
. This is an important benefit of
intrinsics. You can use the instructions you want to use.
Other examples can be found: