Description
By default, NumPy stores arrays in row-major (C) order, so the values of each row are contiguous in memory. In this example:

```python
numpy.array([[1, 2, 3, 4],
             [5, 6, 7, 8]])
```

operating (e.g. adding up) over `1, 2, 3, 4` will be fast, since the values are adjacent in the buffer and CPU caches are used efficiently. If we operate at the column level over `1, 5`, caches won't be used efficiently, since the values are far apart in memory, and performance will be significantly worse.
The problem is that if I create a dataframe from a 2D numpy array (e.g. `pandas.DataFrame(2d_numpy_array_with_default_strides)`), pandas will build the dataframe with the same shape, meaning that every column in the array will be a column in pandas, and operating over the columns in the dataframe will be inefficient.
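For illustration, one way a user can work around this today (a sketch, not a proposed pandas change) is to convert the array to column-major order with `numpy.asfortranarray` before constructing the dataframe:

```python
import numpy as np
import pandas as pd

arr = np.random.rand(1_000_000, 2)   # C-ordered: values within a column are strided
arr_f = np.asfortranarray(arr)       # column-major copy: each column is contiguous

# Same shape and values, but column-wise operations now walk contiguous memory.
df = pd.DataFrame(arr_f)
```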
Given a dataframe with 2 columns and 10M rows, summing the column values is one order of magnitude slower with the default layout:
```python
>>> import numpy
>>> import pandas
>>> df_default = pandas.DataFrame(numpy.random.rand(10_000_000, 2))
>>> df_efficient = pandas.DataFrame(numpy.random.rand(2, 10_000_000).T)
>>> %timeit df_default.sum()
340 ms ± 2.35 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
>>> %timeit df_efficient.sum()
23.4 ms ± 200 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
```

I'm not sure what a good solution is. I don't think we want to change the behavior to load numpy arrays transposed, and I'm not sure that rewriting the numpy array to fix the strides is a great option either. Personally, I'd check the strides of the provided array and, if they are inefficient, show the user a warning with a descriptive message explaining the problem and how to fix it.
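A minimal sketch of what that check could look like (the helper name and warning text are hypothetical, not pandas API):

```python
import warnings

import numpy as np


def warn_on_inefficient_layout(arr):
    """Warn when a 2D array's columns are not contiguous in memory.

    Hypothetical helper illustrating the proposed check; not part of pandas.
    """
    if arr.ndim == 2 and arr.shape[1] > 1 and not arr.flags['F_CONTIGUOUS']:
        warnings.warn(
            "DataFrame created from a row-major (C-ordered) array; "
            "column operations may be slow. Consider passing "
            "numpy.asfortranarray(arr) instead.",
            stacklevel=2,
        )


warn_on_inefficient_layout(np.random.rand(10, 2))    # triggers the warning
warn_on_inefficient_layout(np.random.rand(2, 10).T)  # column-contiguous view: silent
```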