WMMX is buggy in Visual Studio 2008

I implemented an object recognition algorithm for Windows Mobile 6 using Visual Studio 2008. Once it was working, I thought about improving its performance so that it could process more images per second. One part of my implementation computes the sum of squared differences (SSD) of two byte vectors (several million times per second, of course). My device is an ASUS P535 with an XScale processor, so I opted for Wireless MMX (WMMX). Since inline assembler is not supported for ARM processors, I used the corresponding WMMX intrinsics.
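For reference, the quantity being computed can be written as a plain C loop. This is only a scalar sketch for checking results against, not the code from my implementation; the name ssd_scalar and its signature are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Scalar reference: sum of squared differences of two byte vectors.
   ssd_scalar is a hypothetical helper name used for illustration only. */
static unsigned int ssd_scalar(const unsigned char *a,
                               const unsigned char *b,
                               size_t n)
{
    unsigned int sum = 0;
    size_t i;

    assert(a != NULL && b != NULL);
    for (i = 0; i < n; ++i) {
        int d = (int)a[i] - (int)b[i];   /* difference lies in [-255, 255] */
        sum += (unsigned int)(d * d);    /* at most 65025 per element */
    }
    return sum;
}
```

A loop like this is far too slow for the inner loop of the recognizer, but it gives a known-good value to compare any vectorized version against.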

My initial attempt to compute the SSD of two 8-byte vectors looked as follows:


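//Computes the sum of squared differences for eight values
int squared_distance(unsigned char *a, unsigned char *b) {
    //zero and result were not declared in the original listing;
    //both are assumed here to start out as all-zero __m64 values
    __m64 zero=_mm_setzero_si64();
    __m64 result=_mm_setzero_si64();
    //load eight bytes from each input vector
    __m64 v1=*((__m64*)(a));
    __m64 v2=*((__m64*)(b));
    //widen the lower four bytes to 16 bit, subtract, square and accumulate
    __m64 v3=_mm_subs_pi16(_mm_unpacklo_pi8(v2, zero), _mm_unpacklo_pi8(v1, zero));
    result=_mm_mac_pi16(result,v3, v3);
    //same for the upper four bytes
    __m64 v4=_mm_subs_pi16(_mm_unpackhi_pi8(v2, zero), _mm_unpackhi_pi8(v1, zero));
    result=_mm_mac_pi16(result,v4, v4);
    //the sum is in the lower 32 bits of the accumulator
    return result.m64_i32[0];
}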

Of course, to be useful the function must be adapted to the actual length of the vectors. However, the function returned completely random results, and it took me a while to puzzle out why. It turns out that the values loaded into v1 and v2 are already wrong. __m64 v1=*((__m64*)(a)); should load 8 bytes into v1, but it loads only 4 bytes into the lower half of v1. The other 4 bytes seem to be random. I tested a bunch of other ways to load values into a __m64 variable, and all of them failed in the same way. Looking at the assembler code generated by the compiler reveals that instead of the wldrd instruction (which actually loads 8 bytes), the compiler emits a wldrw instruction (which loads only 4 bytes).

This might be a compiler bug, and I assume that it's related to the alignment of the arrays. Intel’s assembler reference manual says that in order to load 8 bytes into a WMMX register, the data must be aligned to 8 bytes. However, Microsoft’s documentation of the WMMX intrinsics tells us that if “data is not appropriately aligned, the program will throw an exception”. I got no exception, and I also tried aligning the data properly, to no avail.
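For anyone who wants to repeat the alignment experiment, here is a minimal sketch of how the inputs could be checked and copied to an 8-byte boundary in plain C. The helper names (is_aligned_8, align_up_8, copy_to_aligned_8) are hypothetical, and the code assumes that casting a pointer to size_t yields its address, which holds on common platforms but is not guaranteed by the C standard. It only rules alignment in or out as a cause; it does not change which load instruction the compiler emits.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Returns 1 if p lies on an 8-byte boundary, 0 otherwise.
   Assumes the pointer-to-size_t cast yields the address. */
static int is_aligned_8(const void *p)
{
    return ((size_t)p & 7u) == 0;
}

/* Returns the first 8-byte-aligned address at or after p. */
static unsigned char *align_up_8(unsigned char *p)
{
    size_t misalign = (size_t)p & 7u;
    return misalign ? p + (8u - misalign) : p;
}

/* Copies 8 bytes from src into an 8-byte-aligned slot inside scratch.
   scratch must hold at least 15 bytes (8 payload + up to 7 padding).
   Returns the aligned pointer, which points into scratch. */
static unsigned char *copy_to_aligned_8(unsigned char *scratch,
                                        const unsigned char *src)
{
    unsigned char *dst;

    assert(scratch != NULL && src != NULL);
    dst = align_up_8(scratch);
    memcpy(dst, src, 8);
    return dst;
}
```

Calling is_aligned_8 on both pointers before the load shows whether the inputs meet the 8-byte requirement, and copy_to_aligned_8 provides a copy that certainly does.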