Essentially, I'm having a really hard time believing that memcpy is faster in the 4-byte case. Take a look at the actual implementation of memcpy versus the code the compiler generates for a plain 4-byte copy.
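To make the comparison concrete, here is a minimal sketch of the two 4-byte copies side by side. The function names (`copy4_memcpy`, `copy4_assign`, `copies_agree`) are made up for illustration and are not from any library. Compiling it with something like `gcc -O2 -S` and reading the assembly should show what each one turns into. Keep in mind that optimizing compilers often expand a fixed-size `memcpy` inline, so the library routine may never actually be called; whether that happens depends on your compiler and flags.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Copy exactly 4 bytes through memcpy. Works for unaligned pointers too. */
void copy4_memcpy(void *dst, const void *src) {
    memcpy(dst, src, 4);
}

/* Copy 4 bytes as a single 32-bit assignment.
   Assumes both pointers are suitably aligned for uint32_t. */
void copy4_assign(uint32_t *dst, const uint32_t *src) {
    *dst = *src;
}

/* Hypothetical sanity check: returns 1 if both routines copy the value intact. */
int copies_agree(uint32_t value) {
    uint32_t a = 0, b = 0;
    copy4_memcpy(&a, &value);
    copy4_assign(&b, &value);
    return a == value && b == value;
}
```

The two functions do the same job, so any speed difference comes down to the instructions emitted at the call site, which is exactly what the assembly listing lets you check.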