After realizing that Visual Studio’s support for XScale intrinsics is somewhat buggy I took a look at ARM assembler. The needed SSD function is quite simple so it was quite easy to implement it (actually it took me quite some time to find the assembler on my disk). Since my data is only aligned to 32 bit I had to stick loading 4 bytes at a time. It looks likes this
squared_distance_asm proc
wldrw wR0, [r0] ; load 4 bytes in wR0
wzero wR10 ; rW10 == 0
wldrw wR1, [r1] ; load 4 bytes in wR1
wunpckilb wR2, wR0, wR10
wunpckilb wR3, wR1, wR10
wsubhss wR2, wR2, wR3
wldrw wR0, [r0, #4] ; load 4 bytes in wR0
wmacsz wR13, wR2, wR2
wldrw wR1, [r1, #4] ; load 4 bytes in wR1
wunpckilb wR2, wR0, wR10
wunpckilb wR3, wR1, wR10
wsubhss wR2, wR2, wR3
; repeat the above as often as necessary
; return the result
tmrrc r0, r1, wR13
end mov pc,lr ; return to C with the return value in R0
Loads and calculation are interleaved to have less pipeline stalls. I haven’t looked at it in detail but the assembler version need ~25% less time than the intrinsics which needs around 25% less time than the naive C version. Still the assembler and the intrinsic versions are slower than I expected. Probably they are not properly inlined.