A common pattern when using xarray with dask is to have a large number of tasks writing to a smaller number of files, e.g., an xarray.Dataset consisting of a handful of dask arrays gets stored into a single netCDF file.
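For concreteness, a minimal sketch of that pattern (the variable names, shapes, and output filename here are made up):

```python
import dask.array as da
import xarray as xr

# A Dataset backed by a few chunked dask arrays; writing it to one netCDF
# file turns every chunk into a separate write task targeting the same file.
ds = xr.Dataset(
    {
        "temperature": (("time", "x"), da.random.random((1000, 1000), chunks=(100, 1000))),
        "pressure": (("time", "x"), da.random.random((1000, 1000), chunks=(100, 1000))),
    }
)
ds.to_netcdf("output.nc")
```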
This works pretty well with the non-distributed version of dask, but doing it with dask-distributed presents two challenges:
- There's significant overhead associated with opening/closing netCDF files (comparable to the cost of writing a single chunk), so we'd really prefer to avoid doing so for every write task -- yet this is what we currently do. I haven't measured this directly, but I have a strong suspicion this is part of why xarray can be so slow at writing netCDF files with dask-distributed.
- We need to coordinate a bunch of distributed locks to ensure that we don't try to write to the same file from multiple processes at the same time. These are tricky to reason about (see "fix distributed writes" pydata/xarray#1793 and "xarray.backends refactor" pydata/xarray#2261), and unfortunately we still don't have them right in xarray -- we only ever got this working with netCDF4-Python, not h5netcdf or scipy. (A rough sketch of this lock-based approach follows this list.)
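To illustrate, here is a simplified sketch of per-task locking with a distributed lock (the helper, filename, and write logic are hypothetical, and xarray's actual backend locking is considerably more involved):

```python
from dask.distributed import Lock
import netCDF4

def write_chunk(chunk, region, filename="output.nc"):
    # A named distributed lock serializes access to the file across workers.
    # Every task still pays for acquiring the lock and reopening/closing the file.
    with Lock(filename):
        with netCDF4.Dataset(filename, mode="a") as nc:
            nc["temperature"][region] = chunk
```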
It would be nice if we could simply consolidate all tasks that involve writing to a single file onto a single worker. This would avoid the need to reopen files, pass open files between processes, or worry about distributed locks.
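One way to approximate this with the existing API might be to pin all write tasks for a given file to one worker address via the `workers=` argument to `Client.submit` (a sketch, reusing the hypothetical `write_chunk` above):

```python
from dask.distributed import Client

client = Client()

# Pick a single worker and route every write task for this file to it, so the
# file only ever needs to be opened in one process.
writer_addr = list(client.scheduler_info()["workers"])[0]
futures = [
    client.submit(write_chunk, chunk, region, workers=[writer_addr])
    for chunk, region in chunks_and_regions  # hypothetical iterable of work
]
client.gather(futures)
```

This only addresses the coordination half of the problem, though: each task still reopens the file.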
Does dask-distributed have any sort of existing machinery that would facilitate this? In particular, I wonder if this could be a good use-case for actors (#2133).
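For example, something along these lines with the actors API, where a single actor keeps the file open on one worker and all writes are routed to it (the class, method names, and filename are hypothetical):

```python
from dask.distributed import Client
import netCDF4

class NetCDFWriter:
    """Hypothetical actor that owns a single open netCDF file."""

    def __init__(self, filename):
        self.nc = netCDF4.Dataset(filename, mode="a")

    def write(self, varname, region, data):
        # Runs on the actor's worker, so the file stays open between writes
        # and no inter-process locking is needed.
        self.nc[varname][region] = data

    def close(self):
        self.nc.close()

client = Client()
writer = client.submit(NetCDFWriter, "output.nc", actor=True).result()
writer.write("temperature", slice(0, 100), some_chunk).result()  # some_chunk is a placeholder
writer.close().result()
```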