Introduction: Implementing VectorAdd
Let's start with a simple example. This is a C/C++ implementation of the addition of two vectors.
void vadd(double* a, double const* b, double const* c, int n) {
  for (int i = 0; i < n; ++i)
    a[i] = b[i] + c[i];
}
A vector instruction can operate on up to 256 elements at once. The first step toward using intrinsics is therefore strip-mining: splitting the loop so that the innermost loop runs at most 256 iterations.
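#define VL 256
void vadd(double* a, double const* b, double const* c, int n) {
  for (int i = 0; i < n; i += VL) {
    // length of this strip: VL, or fewer for the last one
    int vl = (n - i < VL) ? (n - i) : VL;
    for (int j = 0; j < vl; ++j)
      a[j] = b[j] + c[j];
    // advance the pointers to the next strip
    a += vl;
    b += vl;
    c += vl;
  }
}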
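To see how strip-mining splits a loop, the following sketch computes only the strip lengths, without touching any data. It is plain C and needs no vector hardware. The helper names strip_length and strip_lengths are hypothetical and exist only for this illustration.

```c
#include <assert.h>

#define VL 256  /* maximum vector length, as in the text */

/* Hypothetical helper: length of the strip that starts at index i. */
static int strip_length(int n, int i) {
  assert(i >= 0 && i < n);
  int remaining = n - i;
  return remaining < VL ? remaining : VL;
}

/* Hypothetical helper: stores the length of every strip for an
 * n-element loop into out (capacity cap) and returns the strip count. */
static int strip_lengths(int n, int* out, int cap) {
  int count = 0;
  for (int i = 0; i < n; i += VL) {
    assert(count < cap);
    out[count++] = strip_length(n, i);
  }
  return count;
}
```

For n = 600 this yields three strips of 256, 256, and 88 elements. Only the last strip is shorter than VL, which is why vl has to be recomputed on every iteration of the outer loop.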
Next, include velintrin.h and replace the innermost loop with intrinsic functions
that operate on variables of type __vr (vector register).
#include <velintrin.h>
#define VL 256
void vadd(double* a, double const* b, double const* c, int n) {
  for (int i = 0; i < n; i += VL) {
    // length of this strip: VL, or fewer for the last one
    int vl = (n - i < VL) ? (n - i) : VL;
    __vr vb = _vel_vld_vssl(8, b, vl);       // load b to vb (stride of 8 bytes)
    __vr vc = _vel_vld_vssl(8, c, vl);       // load c to vc (stride of 8 bytes)
    __vr va = _vel_vfaddd_vvvl(vb, vc, vl);  // va = vb + vc
    _vel_vst_vssl(va, 8, a, vl);             // store va to a (stride of 8 bytes)
    a += vl;
    b += vl;
    c += vl;
  }
}
A vector register can hold 256 64-bit elements, and an intrinsic function
operates on vl of those elements. vl is passed as the last argument of each
intrinsic function.
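The role of vl can be illustrated with a scalar model in plain C. This is only a sketch of the semantics described above, not the intrinsic itself: model_vfaddd is a hypothetical name, the "registers" are ordinary arrays of 256 doubles, and leaving the elements from index vl onward untouched is an assumption of this model, not a statement about the hardware.

```c
#include <assert.h>

#define MAX_VL 256  /* elements per vector register, as in the text */

/* Hypothetical scalar model of a vl-limited vector add.
 * dst, x, and y each stand in for one vector register.
 * Only the first vl elements are computed; the model leaves
 * dst[vl..MAX_VL-1] unchanged (an assumption of this sketch). */
static void model_vfaddd(double* dst, const double* x, const double* y,
                         int vl) {
  assert(vl >= 0 && vl <= MAX_VL);
  for (int k = 0; k < vl; ++k)
    dst[k] = x[k] + y[k];
}
```

Calling model_vfaddd with vl = 3 computes three sums and ignores the other 253 element slots, which is how the last, shorter strip of a strip-mined loop is handled without a separate scalar clean-up loop.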
The four intrinsics used in this example are compiled to four vector
instructions: vld, vld, vfadd.d, and vst. This is an important benefit of
intrinsics: you can use the instructions you want to use.
Other examples can be found: