Parameter Server #171

@mrocklin

Description

I'm restarting the discussion on parameter servers from dask/dask-glm#57

When doing distributed training it is often useful to rapidly update and share parameters between distributed workers. The frequency with which these updates can be shared seems to strongly affect the rate of convergence. The common practice today seems to be to have dedicated nodes within the network that serve as parameter servers, accepting and aggregating updates from workers and periodically publishing the aggregated parameters back out to the workers.
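To make the push/aggregate/pull cycle concrete, here is a minimal in-process sketch of that pattern, with plain Python threads standing in for distributed workers. All names here are illustrative assumptions, not an existing Dask API:

```python
import threading

class ParameterServer:
    """Toy parameter server: workers push gradient updates and the
    server applies them atomically; workers pull current parameters."""

    def __init__(self, params):
        self._params = list(params)
        self._lock = threading.Lock()

    def push(self, grads, lr=0.1):
        # Apply a (possibly stale) gradient update atomically.
        with self._lock:
            self._params = [p - lr * g for p, g in zip(self._params, grads)]

    def pull(self):
        with self._lock:
            return list(self._params)

def worker(ps, steps):
    # Each worker repeatedly pulls parameters, computes a gradient of
    # f(p) = sum(p**2), and pushes it back, asynchronously with respect
    # to the other workers (so some gradients are stale).
    for _ in range(steps):
        params = ps.pull()
        grads = [2 * p for p in params]  # d/dp of p**2
        ps.push(grads)

ps = ParameterServer([1.0, -2.0])
threads = [threading.Thread(target=worker, args=(ps, 50)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(ps.pull())  # values shrink toward 0.0 despite stale updates
```

In a real deployment `push` and `pull` would be network messages to one or more dedicated server nodes, and updates would typically be batched or averaged rather than applied one at a time; the point of the sketch is that workers coordinate through the server directly, not through a central task scheduler.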

Dask's existing task scheduling model is not ideal for this. Every computation or communication checks in with the scheduler, which adds unnecessary network roundtrips and creates a central bottleneck for metadata tracking. To do parameter servers well we probably need to break out of this model and engage in more peer-to-peer coordination, without frequent check-ins with the scheduler.

Building this infrastructure is not hard, but if possible it would be good to build general machinery that solves this problem and potentially others as well. I would appreciate a conversation with a few people to identify what programming and networking constructs might suffice for this.

cc @stsievert @fabianp @MLnick

Labels: Roadmap (larger, self-contained pieces of work)