Speed of csvReader

data pulverizer via Digitalmars-d-learn <digitalmars-d-learn at puremagic.com>
Thu Jan 21 11:25:01 PST 2016


On Thursday, 21 January 2016 at 19:08:38 UTC, data pulverizer 
wrote:
> On Thursday, 21 January 2016 at 18:46:03 UTC, Justin Whear 
> wrote:
>> On Thu, 21 Jan 2016 18:37:08 +0000, data pulverizer wrote:
>>
>>> It's interesting that the first output array is not the same 
>>> as the input
>>
>> byLine reuses a buffer (for speed) and the subsequent split 
>> operation just returns slices into that buffer.  So when 
>> byLine progresses to the next line the strings (slices) 
>> returned previously now point into a buffer with different 
>> contents.  You should either use byLineCopy or .idup to create 
>> copies of the relevant strings.  If your use-case allows for 
>> streaming and doesn't require having all the data present at 
>> once, you could continue to use byLine and just be careful not 
>> to refer to previous rows.
>
> Thanks. It now works with byLineCopy()
>
> Time (s): 1.128
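
For reference, a minimal sketch of the byLineCopy approach described 
above; the file name and '|' delimiter are assumed from the pandas 
comparison below:

// Sketch (D): byLine reuses one internal buffer, so slices produced
// by split point into memory that is overwritten on the next
// iteration. byLineCopy allocates a fresh string per line, so the
// rows collected here remain valid after reading finishes.
import std.stdio : File, writeln;
import std.array : array, split;
import std.algorithm : map;

void main()
{
    auto rows = File("Acquisition_2009Q2.txt")
        .byLineCopy
        .map!(line => line.split("|"))
        .array;

    writeln(rows[0]);  // the first row still holds its own copy of the data
}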

Currently the timing is similar to that of Python's pandas:

# Script (Python 2.7.6)
import pandas as pd
import time

col_types = {'col1': str, 'col2': str, 'col3': str, 'col4': str,
             'col5': str, 'col6': str, 'col7': str, 'col8': str,
             'col9': str, 'col10': str, 'col11': str, 'col12': str,
             'col13': str, 'col14': str, 'col15': str, 'col16': str,
             'col17': str, 'col18': str, 'col19': str, 'col20': str,
             'col21': str, 'col22': str}

begin = time.time()
x = pd.read_csv('Acquisition_2009Q2.txt', sep='|', dtype=col_types)
end = time.time()

print end - begin

$ python file_read.py
1.19544792175
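
For comparison, the D timing quoted above (1.128 s) can be measured 
the same way with a monotonic clock around the byLineCopy read; a 
sketch, again assuming the same file and delimiter:

// Sketch (D): time the read with MonoTime, mirroring the Python script.
import std.stdio : File, writeln;
import std.array : array, split;
import std.algorithm : map;
import core.time : MonoTime;

void main()
{
    auto begin = MonoTime.currTime;

    auto rows = File("Acquisition_2009Q2.txt")
        .byLineCopy
        .map!(line => line.split("|"))
        .array;

    auto elapsed = MonoTime.currTime - begin;
    writeln(rows.length, " rows, time (s): ", elapsed.total!"msecs" / 1000.0);
}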
