Consolidating all tasks that write to a file on a single worker #2163

@shoyer

A common pattern when using xarray with dask is to have a large number of tasks writing to a smaller number of files, e.g., an xarray.Dataset consisting of a handful of dask arrays gets stored into a single netCDF file.

This works pretty well with the non-distributed version of dask, but doing it with dask-distributed presents two challenges:

  1. There's significant overhead associated with opening/closing netCDF files (comparable to the cost of writing a single chunk), so we'd really prefer to avoid doing so for every write task -- yet this is what we currently do. I haven't measured this directly, but I have a strong suspicion it is part of why xarray can be so slow at writing netCDF files with dask-distributed.
  2. We need to coordinate a bunch of distributed locks to ensure that we don't try to write to the same file from multiple processes at the same time. These are tricky to reason about (see fix distributed writes pydata/xarray#1793 and xarray.backends refactor pydata/xarray#2261), and unfortunately we still don't have them right in xarray -- we only ever got this working with netCDF4-Python, not h5netcdf or scipy. (A simplified sketch of the current per-task pattern follows this list.)
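
For concreteness, here is a simplified version of that per-task pattern (not xarray's actual backend code, just an illustration): each write task reopens the file and serializes on a cluster-wide lock.

```python
# Simplified illustration of the current per-task pattern: every chunk
# write reopens the file and must serialize on a distributed Lock.
import netCDF4
from dask.distributed import Lock

def write_chunk(path, variable, region, data):
    lock = Lock(path)                 # one named lock per output file
    with lock:                        # only one writer may hold the file
        with netCDF4.Dataset(path, mode="a") as ds:   # reopened for every task
            ds.variables[variable][region] = data
```

Both costs above show up here: the `netCDF4.Dataset` open/close for every chunk, and the `Lock` that every task must acquire.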

It would be nice if we could simply consolidate all tasks that involve writing to a single file onto a single worker. This would avoid the need to reopen files, pass open files between processes, or worry about distributed locks.
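
One way to approximate this with existing machinery might be worker restrictions, i.e. pinning every write task to one chosen worker via the `workers=` keyword to `Client.submit`. A rough sketch (the `write_chunk` helper and `chunks` iterable are placeholders for illustration):

```python
from dask.distributed import Client

client = Client("scheduler-address:8786")

# Pick one worker to own the output file (any selection policy would do).
writer_addr = sorted(client.scheduler_info()["workers"])[0]

# Pin every write task to that worker so the file is only touched there.
futures = [
    client.submit(write_chunk, "out.nc", "t2m", region, block,
                  workers=[writer_addr], pure=False)
    for region, block in chunks
]
client.gather(futures)
```

That keeps all writes on one machine, but each task still reopens the file unless the open handle is somehow cached on that worker.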

Does dask-distributed have any sort of existing machinery that would facilitate this? In particular, I wonder if this could be a good use-case for actors (#2133).
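
For example, if actors land roughly as proposed in #2133, a writer actor could hold the open file handle on one worker and serialize writes through method calls. A hedged sketch with hypothetical names (the `actor=True` keyword and method-call futures are assumed to follow the interface sketched in that proposal):

```python
import netCDF4
from dask.distributed import Client

class NetCDFWriter:
    """Owns a single open netCDF file; all writes run on the actor's worker."""
    def __init__(self, path):
        self.ds = netCDF4.Dataset(path, mode="a")     # opened exactly once

    def write(self, variable, region, data):
        # Runs in the actor's thread on one worker, so no lock is needed.
        self.ds.variables[variable][region] = data

    def close(self):
        self.ds.close()

client = Client("scheduler-address:8786")
writer = client.submit(NetCDFWriter, "out.nc", actor=True).result()

futures = [writer.write("t2m", region, block) for region, block in chunks]
for future in futures:
    future.result()       # block until each write has completed
writer.close().result()
```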
