A common pattern when using xarray with dask is to have a large number of tasks writing to a smaller number of files, e.g., an xarray.Dataset consisting of a handful of dask arrays gets stored into a single netCDF file.
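For concreteness, a minimal sketch of that pattern (the variable names, shapes, and output filename here are made up):

```python
import dask.array as da
import xarray as xr

# A Dataset backed by a few chunked dask arrays; writing it to one netCDF
# file turns every chunk into a separate write task targeting the same file.
ds = xr.Dataset(
    {
        "temperature": (("time", "x"), da.random.random((1000, 1000), chunks=(100, 1000))),
        "pressure": (("time", "x"), da.random.random((1000, 1000), chunks=(100, 1000))),
    }
)
ds.to_netcdf("output.nc")
```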
This works pretty well with the non-distributed version of dask, but doing it with dask-distributed presents two challenges:
- There's significant overhead associated with opening/closing netCDF files (comparable to the cost of writing a single chunk), so we'd really prefer to avoid doing so for every write task -- yet this is what we currently do. I haven't measured this directly, but I have a strong suspicion this is part of why xarray can be so slow at writing netCDF files with dask-distributed.
- We need to coordinate a bunch of distributed locks to ensure that we don't try to write to the same file from multiple processes at the same time. These are tricky to reason about (see "fix distributed writes" pydata/xarray#1793 and "xarray.backends refactor" pydata/xarray#2261), and unfortunately we still don't have them right in xarray -- we only ever got this working with netCDF4-Python, not h5netcdf or scipy. (A rough sketch of this lock-based approach follows this list.)
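To illustrate, here is a simplified sketch of per-task locking with a distributed lock (the helper, filename, and write logic are hypothetical, and xarray's actual backend locking is considerably more involved):

```python
from dask.distributed import Lock
import netCDF4

def write_chunk(chunk, region, filename="output.nc"):
    # A named distributed lock serializes access to the file across workers.
    # Every task still pays for acquiring the lock and reopening/closing the file.
    with Lock(filename):
        with netCDF4.Dataset(filename, mode="a") as nc:
            nc["temperature"][region] = chunk
```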
It would be nice if we could simply consolidate all tasks that involve writing to a single file onto a single worker. This would avoid the need to reopen files, pass open files between processes, or worry about distributed locks.
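One way to approximate this with the existing API might be to pin all write tasks for a given file to one worker address via the `workers=` argument to `Client.submit` (a sketch, reusing the hypothetical `write_chunk` above):

```python
from dask.distributed import Client

client = Client()

# Pick a single worker and route every write task for this file to it, so the
# file only ever needs to be opened in one process.
writer_addr = list(client.scheduler_info()["workers"])[0]
futures = [
    client.submit(write_chunk, chunk, region, workers=[writer_addr])
    for chunk, region in chunks_and_regions  # hypothetical iterable of work
]
client.gather(futures)
```

This only addresses the coordination half of the problem, though: each task still reopens the file.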
Does dask-distributed have any sort of existing machinery that would facilitate this? In particular, I wonder if this could be a good use-case for actors (#2133).
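For example, something along these lines with the actors API, where a single actor keeps the file open on one worker and all writes are routed to it (the class, method names, and filename are hypothetical):

```python
from dask.distributed import Client
import netCDF4

class NetCDFWriter:
    """Hypothetical actor that owns a single open netCDF file."""

    def __init__(self, filename):
        self.nc = netCDF4.Dataset(filename, mode="a")

    def write(self, varname, region, data):
        # Runs on the actor's worker, so the file stays open between writes
        # and no inter-process locking is needed.
        self.nc[varname][region] = data

    def close(self):
        self.nc.close()

client = Client()
writer = client.submit(NetCDFWriter, "output.nc", actor=True).result()
writer.write("temperature", slice(0, 100), some_chunk).result()  # some_chunk is a placeholder
writer.close().result()
```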