Parameter Server #171

@mrocklin

Description

I'm restarting the discussion on parameter servers from dask/dask-glm#57

When doing distributed training it is often useful to rapidly update and share parameters between distributed workers. The frequency with which these updates can be shared seems to strongly affect the rate of convergence. The common practice today seems to be to have dedicated nodes within the network that serve as parameter servers, accepting and aggregating updates from workers and periodically publishing the aggregated parameters back out to the workers.
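To make the push/aggregate/pull cycle concrete, here is a minimal in-process sketch of that pattern, with plain Python threads standing in for distributed workers. All names here are illustrative assumptions, not an existing Dask API:

```python
import threading

class ParameterServer:
    """Toy parameter server: workers push gradient updates and the
    server applies them atomically; workers pull current parameters."""

    def __init__(self, params):
        self._params = list(params)
        self._lock = threading.Lock()

    def push(self, grads, lr=0.1):
        # Apply a (possibly stale) gradient update atomically.
        with self._lock:
            self._params = [p - lr * g for p, g in zip(self._params, grads)]

    def pull(self):
        with self._lock:
            return list(self._params)

def worker(ps, steps):
    # Each worker repeatedly pulls parameters, computes a gradient of
    # f(p) = sum(p**2), and pushes it back, asynchronously with respect
    # to the other workers (so some gradients are stale).
    for _ in range(steps):
        params = ps.pull()
        grads = [2 * p for p in params]  # d/dp of p**2
        ps.push(grads)

ps = ParameterServer([1.0, -2.0])
threads = [threading.Thread(target=worker, args=(ps, 50)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(ps.pull())  # values shrink toward 0.0 despite stale updates
```

In a real deployment `push` and `pull` would be network messages to one or more dedicated server nodes, and updates would typically be batched or averaged rather than applied one at a time; the point of the sketch is that workers coordinate through the server directly, not through a central task scheduler.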

Dask's existing task scheduling model is not ideal for this. Every computation or communication checks in with the scheduler, which adds unnecessary network roundtrips and creates a central bottleneck for metadata tracking. To do parameter servers well we probably need to break out of this model and engage in more peer-to-peer coordination, without frequent check-ins with the scheduler.

Building this infrastructure is not hard, but if possible it would be good to build general machinery that solves this problem and potentially others as well. I would appreciate a conversation with a few people to identify what programming and networking constructs might suffice for this.

cc @stsievert @fabianp @MLnick

Labels: Roadmap (larger, self-contained pieces of work)