- 
                Notifications
    You must be signed in to change notification settings 
- Fork 18
Workflow: CSV to parquet #841
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…embarrassing-parallel
…embarrassing-parallel
| Just to clarify, this is a WIP and not ready for a proper review... only posted to show the reproducer of the TCP error when running on Coiled is all. :) | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @milesgranger
This LGTM, left a couple of comments to consider.
Co-authored-by: Naty Clementi <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left some comments, we can also chat in Slack or Google Meet, if that helps.
| I just noticed that CI doesn't actually test workflows, unless it's a nightly job, or pr is marked with  benchmarks/.github/workflows/tests.yml Lines 87 to 89 in 3062113 
 You probably want that label (TIL). | 
| Sorry, I was sort of under the impression that (the label) was added towards the end when not many more adjustments were required so as to not waste runs(?). I've been running it from local machine though. | 
| 
 Ah. :) In that case you're all set. Just making sure. | 
5b1b947    to
    ed1d9e3      
    Compare
  
    dc85285    to
    e3145c6      
    Compare
  
    | The failing test is unrelated it seems. | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, but before you merge, I'd still like to resolve the dataframe.convert-string issue. If it's getting applied, you should not need to explicitly convert object columns to string[pyarrow]. If it's not getting applied, take it out, it's misleading.
Not applied in current version
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
small comment
2e99d46    to
    e701cc1      
    Compare
  
    |  | ||
| df = dd.read_csv( | ||
| files, | ||
| sep="\t", | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This makes read_csv fall back to the python engine. 2 things:
- if this is intended, we should be explicit
- the python engine is really slow. So if this is not intended we should choose a file that has a 1 character separator
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suppose it's quite likely the average user (including me) would not know this.
Therefore, IMO, I'd think it ought to stay this way so when it changes in the future, the bump in performance would be evident historically because it's what the average user might expect.
The second option I think would be a slightly hacky work-around that the benchmarks aren't designed for. My understanding is these workflows should represent somewhat realistic use cases and not as a means for maximizing performance itself. Could be mistaken though. cc @fjetter @jrbourbeau
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Totally fine by me if this is intended. Just wanted to check.
read_csv raises a warning that tells you that it will use the Python engine if you execute this code. That's why I suggested switching it explicitly. There aren't any plans to support regex or multi char seps in the c engine and I doubt that there ever will be, so we don't have to plan for that scenario
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In a follow-up ticket, we could parametrize this test with comma-separated and tab-separated files if we care about the performance of the different engines.
Continuation from #738