
The beauty of ARM assembler

After realizing that Visual Studio’s support for the XScale intrinsics is somewhat buggy, I took a look at ARM assembler. The SSD function I need is simple, so implementing it was easy (actually, it took me longer to find the assembler on my disk). Since my data is only aligned to 32 bits, I had to stick to loading 4 bytes at a time. It looks like this:

squared_distance_asm proc
wldrw wR0, [r0] ; load 4 bytes in wR0
wzero wR10 ; wR10 == 0
wldrw wR1, [r1] ; load 4 bytes in wR1
wunpckilb wR2, wR0, wR10
wunpckilb wR3, wR1, wR10
wsubhss wR2, wR2, wR3
wldrw wR0, [r0, #4] ; load 4 bytes in wR0
wmacsz wR13, wR2, wR2
wldrw wR1, [r1, #4] ; load 4 bytes in wR1
wunpckilb wR2, wR0, wR10
wunpckilb wR3, wR1, wR10
wsubhss wR2, wR2, wR3
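wmacs wR13, wR2, wR2 ; accumulate without the z suffix so wR13 is not cleared again
; repeat the above as often as necessary, using wmacs instead of wmacsz after the first block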

; return the result
tmrrc r0, r1, wR13
mov pc, lr ; return to C with the return value in r0
endp

Loads and calculations are interleaved to reduce pipeline stalls. I haven’t looked at it in detail, but the assembler version needs about 25% less time than the intrinsics version, which in turn needs about 25% less time than the naive C version. Still, both the assembler and the intrinsics versions are slower than I expected. Probably they are not inlined properly.
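To make the data flow of the instruction sequence easier to follow, here is a rough scalar C model of one round (load 4 bytes, widen to 16 bit, subtract with saturation, multiply-accumulate). It is only a sketch for checking results on a desktop machine, not the code that runs on the device; the names sub_sat16 and ssd4_model are made up for this illustration.

```c
#include <stdint.h>

/* Saturating signed 16-bit subtraction, modelling one lane of wsubhss. */
static int16_t sub_sat16(int16_t x, int16_t y) {
    int32_t d = (int32_t)x - (int32_t)y;
    if (d > INT16_MAX) d = INT16_MAX;
    if (d < INT16_MIN) d = INT16_MIN;
    return (int16_t)d;
}

/* Scalar model of one round on 4 bytes: the caller passes in the running
   accumulator (0 for the first round, matching wmacsz) and gets back the
   updated value (hypothetical helper, not part of any WMMX API). */
static int64_t ssd4_model(const uint8_t *a, const uint8_t *b, int64_t acc) {
    for (int lane = 0; lane < 4; ++lane) {
        int16_t wa = (int16_t)a[lane];   /* wunpckilb with a zero register: byte -> halfword */
        int16_t wb = (int16_t)b[lane];
        int16_t d = sub_sat16(wa, wb);   /* wsubhss: difference is in -255..255, never saturates here */
        acc += (int32_t)d * (int32_t)d;  /* wmacs: signed multiply, add to the accumulator */
    }
    return acc;
}
```

Calling ssd4_model twice, first on a and b with an accumulator of 0 and then on a + 4 and b + 4 with the returned value, corresponds to the two rounds in the listing.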

WMMX is buggy in Visual Studio 2008

I implemented an object recognition algorithm for Windows Mobile 6 using Visual Studio 2008. Once it more or less worked, I thought about improving its performance to process more images per second. One part of my implementation computes the sum of squared differences (SSD) of two byte vectors (some million times per second, of course). My device is an ASUS P535 with an XScale processor, so I opted for Wireless MMX (WMMX). Since inline assembler is not supported for ARM processors, I used the corresponding WMMX intrinsics.
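For reference, the operation itself can be written in plain C roughly like this. This is a minimal sketch of a straightforward scalar version, useful as a baseline to check an optimized variant against; the name ssd_reference and the exact signature are chosen for illustration only.

```c
#include <stddef.h>
#include <stdint.h>

/* Plain C baseline: sum of squared differences of two byte vectors of length n.
   Each term is at most 255 * 255 = 65025, so the 32-bit sum cannot overflow
   for vectors of up to about 66000 elements. */
static uint32_t ssd_reference(const uint8_t *a, const uint8_t *b, size_t n) {
    uint32_t sum = 0;
    for (size_t i = 0; i < n; ++i) {
        int32_t d = (int32_t)a[i] - (int32_t)b[i];
        sum += (uint32_t)(d * d);
    }
    return sum;
}
```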

My initial attempt to compute the SSD of two 8-byte vectors looked as follows:

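//Computes the sum of squared differences for eight values
int squared_distance(unsigned char *a, unsigned char *b) {
    //zero and result were not declared in the snippet as posted;
    //they are assumed here to be __m64 values that start out as all zeros
    __m64 zero=_mm_setzero_si64();
    __m64 result=_mm_setzero_si64();
    __m64 v1=*((__m64*)(a));
    __m64 v2=*((__m64*)(b));
    //widen the lower four bytes to 16 bit, subtract, then multiply-accumulate
    __m64 v3=_mm_subs_pi16(_mm_unpacklo_pi8(v2, zero), _mm_unpacklo_pi8(v1, zero));
    result=_mm_mac_pi16(result,v3, v3);
    //same for the upper four bytes
    __m64 v4=_mm_subs_pi16(_mm_unpackhi_pi8(v2, zero), _mm_unpackhi_pi8(v1, zero));
    result=_mm_mac_pi16(result,v4, v4);
    return result.m64_i32[0];
}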

Of course, to be useful the function must be adapted to the actual length of the vectors. However, the function returned seemingly random results, and it took me a while to puzzle out why. It turns out that the values loaded into v1 and v2 are already wrong: __m64 v1=*((__m64*)(a)); should load 8 bytes into v1 but loads only 4 bytes into the lower half of v1. The other 4 bytes seem to be random. I tested a bunch of other ways to load values into a __m64 variable, and all failed in the same way.

Looking at the assembler code generated by the compiler reveals that instead of the wldrd instruction (which loads 8 bytes) the compiler emits a wldrw instruction (which loads only 4 bytes). It might be a compiler bug, and I assume that it’s related to the alignment of the arrays. Intel’s assembler reference manual says that the data must be aligned to 8 bytes in order to load 8 bytes into a WMMX register. However, Microsoft’s documentation of the WMMX intrinsics tells us that if “data is not appropriately aligned, the program will throw an exception”. I got no exception, and I also tried to align the data properly.
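When chasing this kind of problem, it can help to rule out alignment explicitly before blaming the compiler. The following is a small portable C sketch, not WMMX-specific: one helper checks whether a pointer sits on an 8-byte boundary, and another copies 8 bytes from an arbitrary address using memcpy, which has no alignment requirement. The helper names is_aligned8 and load8_any are hypothetical, and whether such a copy avoids the wrong wldrw load in Visual Studio 2008 is something I have not verified.

```c
#include <stdint.h>
#include <string.h>

/* Returns 1 if p lies on an 8-byte boundary, 0 otherwise. */
static int is_aligned8(const void *p) {
    return ((uintptr_t)p & 7u) == 0;
}

/* Copies 8 bytes starting at p into a 64-bit integer.
   memcpy works for any address, aligned or not. */
static uint64_t load8_any(const unsigned char *p) {
    uint64_t v;
    memcpy(&v, p, sizeof v);
    return v;
}
```

Checking is_aligned8(a) and is_aligned8(b) at the top of squared_distance would at least confirm whether the inputs satisfy the 8-byte requirement that the Intel manual states.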