DOC: low_memory in read_csv #13293

chris-b1 · 2016-05-26T11:06:58Z

closes API/DOC: status of low_memory kwarg of read_csv/table #5888, xref ENH/DOC/CLN: Document arguments and reconcile C and Python engines for read_csv #12686
passes git diff upstream/master | flake8 --diff

jreback · 2016-05-26T11:34:15Z

doc/source/io.rst

double back-ticks on False

dtype as well

codecov-io · 2016-05-26T12:16:53Z

Current coverage is 84.20%

Merging #13293 into master will not change coverage

@@             master     #13293   diff @@
==========================================
  Files           138        138          
  Lines         50587      50587          
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
  Hits          42593      42593          
  Misses         7994       7994          
  Partials          0          0

Powered by Codecov. Last updated by b4e2d34...282f1f3

chris-b1 · 2016-05-26T22:33:33Z

@jreback - updated

jreback · 2016-05-26T23:56:27Z

thanks!

jondo · 2016-11-11T15:42:31Z

doc/source/io.rst

  Number of rows of file to read. Useful for reading pieces of large files.
+low_memory : boolean, default ``True``
+  Internally process the file in chunks, resulting in lower memory use
+  while parsing, but possibly mixed type inference.  To ensure no mixed


What does 'resulting in [...] mixed type inference' mean? Is it implementation-specific which type is chosen? Is it deterministic?

chris-b1 · 2016-11-11

It is deterministic: types are consistently inferred based on what's in the data. That said, the internal chunk size is not a fixed number of rows but a number of bytes, so whether you get a mixed dtype warning or not can feel a bit random.

import pandas as pd
from io import StringIO

# just ints - no warning
pd.read_csv(StringIO(
    '\n'.join(str(x) for x in range(10000)))
).dtypes
Out[24]:
0    int64
dtype: object

# mixed dtype - but fits into a single chunk, so no warning and all parsed as strings
pd.read_csv(StringIO(
    '\n'.join([str(x) for x in range(10000)] + ['a string']))
).dtypes
Out[26]:
0    object
dtype: object

# mixed dtype - doesn't fit into a single chunk
pd.read_csv(StringIO(
    '\n'.join([str(x) for x in range(1000000)] + ['a string']))
).dtypes
DtypeWarning: Columns (0) have mixed types. Specify dtype option on import or set low_memory=False.
Out[27]:
0    object
dtype: object

jondo · 2016-11-11

I see. Thank you for the explanation and the example!

jondo · 2016-11-14

To complete the example, I see:

df = pd.read_csv(StringIO('\n'.join([str(x) for x in range(1000000)] + ['a string'])))
DtypeWarning: Columns (0) have mixed types. Specify dtype option on import or set low_memory=False.

type(df.loc[524287, '0'])
Out[50]: int

type(df.loc[524288, '0'])
Out[51]: str

The first part of the CSV data contained only integers, so it was converted to int.
The second part also contained a string, so all of its entries were kept as strings.
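
For reference, here is a minimal sketch of the two remedies that the warning message itself names: passing `dtype`, or setting `low_memory=False`. It reuses the data from the example above. It assumes a pandas version that exposes the warning class as `pd.errors.DtypeWarning`; the variable names are only illustrative.

```python
import warnings
from io import StringIO

import pandas as pd

# Same data as in the thread: one million integers followed by one string.
# The first line ("0") is consumed as the header, so the column is named "0".
csv_text = "\n".join([str(x) for x in range(1000000)] + ["a string"])

with warnings.catch_warnings():
    # Turn DtypeWarning into an error, so reaching the end of this block
    # shows that neither call below triggered it.
    warnings.simplefilter("error", pd.errors.DtypeWarning)

    # Remedy 1: state the dtype up front, so no per-chunk inference is needed.
    df_dtype = pd.read_csv(StringIO(csv_text), dtype={"0": str})

    # Remedy 2: infer the type from the whole column at once,
    # at the cost of higher memory use while parsing.
    df_whole = pd.read_csv(StringIO(csv_text), low_memory=False)

# Collect the Python types actually stored in the column.
types_dtype = {type(v) for v in df_dtype["0"]}
types_whole = {type(v) for v in df_whole["0"]}
print(types_dtype, types_whole)
```

In both cases every entry in the column should come back as a string, rather than the int/str split shown above. Converting the numeric part afterwards is then an explicit step instead of a side effect of where a chunk boundary happened to fall.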

chris-b1 force-pushed the low-memory-doc branch from 282f1f3 to be7d83d Compare May 26, 2016 11:09

jreback reviewed May 26, 2016
jreback added Docs IO CSV read_csv, to_csv labels May 26, 2016

jreback added this to the 0.18.2 milestone May 26, 2016

DOC: low_memory in read_csv

daf9bca

chris-b1 force-pushed the low-memory-doc branch from be7d83d to daf9bca Compare May 26, 2016 22:33

jreback closed this in 4b05055 May 26, 2016

kawochen mentioned this pull request May 26, 2016

ENH/DOC/CLN: Document arguments and reconcile C and Python engines for read_csv #12686

Open

22 tasks

chris-b1 deleted the low-memory-doc branch June 3, 2016 21:56

jondo reviewed Nov 11, 2016
