By default, when creating a table from a Dask DataFrame, the create_table function persists the Dask DataFrame in memory.
This was surprising (partly because I didn't see it mentioned in the doc examples) while working on a large ETL job. I suspect most users, especially those running larger jobs on systems where the data is bigger than memory, will find it surprising as well. In my case, I spent time trying to debug which compute tasks might be using more memory than expected, only to eventually realize that it was the persist call itself that was failing. Once I re-created the tables with persist=False, my workflow completed successfully.
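
For reference, here is a minimal sketch of the workaround I ended up with, assuming a dask_sql Context; the Parquet path is just a hypothetical stand-in for a dataset larger than cluster memory:

```python
import dask.dataframe as dd
from dask_sql import Context

# Hypothetical path; stands in for a dataset larger than memory
df = dd.read_parquet("s3://my-bucket/large-dataset/")

c = Context()

# Default behavior: create_table persists the DataFrame,
# pulling the whole thing into memory up front
# c.create_table("my_table", df)

# Workaround: register the table lazily, so nothing is
# materialized until a query is actually computed
c.create_table("my_table", df, persist=False)

result = c.sql("SELECT COUNT(*) FROM my_table").compute()
```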
Is there a rationale for persisting by default?