Description
I'm using PyCall (1.7.2) to run an SQL query against a database and then get the results into Julia. Converting the Python list of tuples to a Julia array of tuples appears to be very slow, and all of the slowness seems to be in iterating through the list elements, i.e., the conversion time scales linearly (O(n)) with the number of items in the list.
Here's some example code. The SQL statement returns exactly 100,000 rows:
Using automatic type conversion
@time rows = cs.cursor[:fetchall]() # PyCall automatically converts to a Julia array of tuples
157.480336 seconds (24.49 M allocations: 649.669 MB, 0.86% gc time)
@time rowarray = map(collect, rows) # Convert the tuples to arrays
0.338664 seconds (1.86 M allocations: 65.706 MB, 37.67% gc time)
length(rowarray)
100000

Getting a PyObject and then converting with map
@time rows = pycall(cs.cursor[:fetchall], PyObject)
7.685769 seconds (73 allocations: 4.031 KB)
@time rowarray = map(collect, rows)
119.437264 seconds (27.05 M allocations: 745.472 MB, 1.29% gc time)
length(rowarray)
100000

As you can see, calling fetchall() takes 157 seconds and the subsequent map is very fast, whereas calling pycall(fetchall, PyObject) takes 7 seconds and the subsequent map is very slow.
So, wise PyCall devs, is there a way for me to combine the fastest parts of the two approaches? I'm not averse to going as low-level as necessary, since this level of database slowness is causing us a lot of grief.
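For concreteness, here's a minimal sketch of the kind of thing I'm hoping is possible: grab the result set as a PyObject (the fast path above), then convert each row individually via get with an explicit target type, so PyCall doesn't have to infer a type for every element. The row type Tuple{Int, Float64} is a made-up stand-in for my actual column types, and I haven't verified that this dodges the slow path:

using PyCall

# Sketch only: cs.cursor is the same cursor as above; Tuple{Int, Float64}
# is a placeholder for the actual column types of the query.
rows_py = pycall(cs.cursor[:fetchall], PyObject)   # fast path: no conversion yet
n = length(rows_py)                                # length of the Python list

# get(o, T, k) indexes the Python list (zero-based k) and converts the
# row directly to T, skipping the generic PyAny tuple conversion
rowarray = [get(rows_py, Tuple{Int, Float64}, i - 1) for i in 1:n]

If get still pays the same per-element conversion cost, I'd also welcome pointers to whatever lower-level PyCall machinery would avoid it.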
PS: I've tried parallelising this with pmap and other low-level Julia parallel functions, but that involves copying the object, which hits the same issue.