I realized that access to "temp" was the bottleneck. Declaring it inside the for loop makes it local to each thread, and that gives the speedup. Declaring it outside the loop makes it shared between threads, which slows the program down.
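As a minimal sketch of the difference, assuming the loop is parallelized with OpenMP in C (the post itself does not name the framework): the function names, the arrays, and the arithmetic below are hypothetical and only illustrate where "temp" is declared. The first function declares it inside the loop body, so each thread gets its own copy. The second keeps the declaration outside but marks it `private`, which should have the same effect.

```c
/* Hypothetical example: names and arithmetic are illustrative only. */

/* "temp" is declared inside the loop body, so every iteration (and
 * therefore every thread) works on its own copy. */
void transform_local(const double *in, double *out, int n)
{
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        double temp = in[i] * 2.0;
        out[i] = temp * temp + 1.0;
    }
}

/* "temp" is declared outside the loop, which would make it shared by
 * default. The private(temp) clause gives each thread its own copy. */
void transform_private_clause(const double *in, double *out, int n)
{
    double temp;
    #pragma omp parallel for private(temp)
    for (int i = 0; i < n; i++) {
        temp = in[i] * 2.0;
        out[i] = temp * temp + 1.0;
    }
}
```

With GCC this would be built with `-fopenmp`; without that flag the pragmas are ignored and the loops simply run serially. If "temp" were left shared with no `private` clause, all threads would write to the same memory location, which is a data race as well as a source of contention.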