@chris-b1
Contributor

Contributor

double back-ticks on False

Contributor

dtype as well

@jreback added Docs IO CSV read_csv, to_csv labels May 26, 2016
@jreback added this to the 0.18.2 milestone May 26, 2016
@codecov-io

Current coverage is 84.20%

Merging #13293 into master will not change coverage

@@             master     #13293   diff @@
==========================================
  Files           138        138          
  Lines         50587      50587          
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
  Hits          42593      42593          
  Misses         7994       7994          
  Partials          0          0          

Powered by Codecov. Last updated by b4e2d34...282f1f3

@chris-b1
Contributor Author

@jreback - updated

@jreback closed this in 4b05055 May 26, 2016
@jreback
Contributor

jreback commented May 26, 2016

thanks!

Number of rows of file to read. Useful for reading pieces of large files.
low_memory : boolean, default ``True``
Internally process the file in chunks, resulting in lower memory use
while parsing, but possibly mixed type inference. To ensure no mixed

What does 'resulting in [...] mixed type inference' mean? Is it implementation-specific which type is chosen? Is it deterministic?

Contributor Author

It is deterministic: types are consistently inferred based on what's in the data. That said, the internal chunk size is measured in bytes rather than a fixed number of rows, so whether you get a mixed dtype warning or not can feel a bit random.

import pandas as pd
from io import StringIO

# just ints - no warning
pd.read_csv(StringIO(
    '\n'.join(str(x) for x in range(10000)))
).dtypes
Out[24]: 
0    int64
dtype: object

# mixed dtype - but fits into a single chunk, so no warning and all parsed as strings
pd.read_csv(StringIO(
    '\n'.join([str(x) for x in range(10000)] + ['a string']))
).dtypes
Out[26]: 
0    object
dtype: object

# mixed dtype - doesn't fit into a single chunk
pd.read_csv(StringIO(
    '\n'.join([str(x) for x in range(1000000)] + ['a string']))
).dtypes
DtypeWarning: Columns (0) have mixed types. 
Specify dtype option on import or set low_memory=False.

Out[27]: 
0    object
dtype: object
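
As a rough illustration of the chunk-wise behaviour described above, here is a toy model in plain Python. It is not pandas' actual parser: `infer_chunk` and `parse_in_chunks` are hypothetical helpers, and chunks are split by row count purely for simplicity, whereas the real internal chunk size is in bytes.

```python
def infer_chunk(values):
    """Convert a chunk to ints if every value parses as an int, else keep strings."""
    try:
        return [int(v) for v in values]
    except ValueError:
        return list(values)


def parse_in_chunks(values, chunk_rows):
    """Toy stand-in for chunked parsing: infer types per chunk, then concatenate."""
    out = []
    for start in range(0, len(values), chunk_rows):
        out.extend(infer_chunk(values[start:start + chunk_rows]))
    return out


data = [str(x) for x in range(10)] + ["a string"]

# One chunk covers everything: the stray string keeps every value as str.
single = parse_in_chunks(data, chunk_rows=len(data))
print({type(v).__name__ for v in single})  # {'str'}

# Several chunks: all-int chunks become int, the chunk with the string stays str.
multi = parse_in_chunks(data, chunk_rows=4)
print(sorted({type(v).__name__ for v in multi}))  # ['int', 'str']
```

The same input gives either a uniform or a mixed column depending only on where the chunk boundaries fall, which is why the warning can look random even though the inference itself is deterministic.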

I see. Thank you for the explanation and the example!

To complete the example, I see:

df=pd.read_csv(StringIO('\n'.join([str(x) for x in range(1000000)] + ['a string'])))
DtypeWarning: Columns (0) have mixed types. Specify dtype option on import or set low_memory=False.

type(df.loc[524287,'0'])
Out[50]: int

type(df.loc[524288,'0'])
Out[51]: str

The first chunk of the CSV data contained only integers, so its values were converted to int;
the second chunk also contained a string, so all of its entries were kept as strings.
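
For completeness, a minimal sketch of the two remedies the warning text suggests, reusing the data from this thread. The column name `'0'` comes from the first row being read as the header, as in the examples above. The expectation that every value ends up as `str` is my reading of the behaviour discussed here, not something checked against every pandas version.

```python
from io import StringIO

import pandas as pd

csv_text = "\n".join([str(x) for x in range(1000000)] + ["a string"])

# Option 1: state the dtype up front, so no per-chunk inference is needed.
df_str = pd.read_csv(StringIO(csv_text), dtype=str)

# Option 2: parse the whole file in one pass instead of in internal chunks.
df_whole = pd.read_csv(StringIO(csv_text), low_memory=False)

# Report which Python types actually appear in the column for each option.
for name, df in [("dtype=str", df_str), ("low_memory=False", df_whole)]:
    kinds = {type(v).__name__ for v in df["0"]}
    print(name, kinds)
```

Passing `dtype` is the more explicit choice when the intended type is known. `low_memory=False` keeps inference but applies it to the column as a whole, at the cost of higher memory use while parsing.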

Development

Successfully merging this pull request may close these issues.

API/DOC: status of low_memory kwarg of read_csv/table

4 participants