
ggml : add optional CPU backend context, support reusing threads, async compute #721

@slaren

Description

As recently seen in llama.cpp (ggml-org/llama.cpp#5226), the cost of starting the threads of the CPU backend is significant. To address this, I propose adding a new CPU context object that holds the threads and reuses them between invocations. Additionally, this CPU context would behave as an asynchronous queue, so that multiple graph evaluations can be queued into it. This would enable the implementation of pipeline parallelism with the CPU and GPU backends (ref: ggml-org/llama.cpp#4918 (comment)).
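
To illustrate the mechanism, here is a minimal sketch of what the context could look like internally, assuming a pthreads implementation. Everything below (`cpu_ctx`, `job`, the function names) is a hypothetical placeholder rather than existing ggml code, and a single dispatcher thread stands in for the full thread pool: the thread is created once, blocks on a condition variable, and pulls queued graphs from a FIFO, so no thread startup cost is paid per graph.

```c
// hypothetical sketch of a persistent-thread compute queue (not ggml code)
#include <pthread.h>
#include <stdbool.h>
#include <stdlib.h>

typedef struct job {
    void (*fn)(void *);  // stand-in for evaluating a ggml_cgraph
    void *arg;
    struct job *next;
} job;

typedef struct {
    pthread_t       thread;     // one dispatcher thread, for brevity
    pthread_mutex_t mutex;
    pthread_cond_t  cond_push;  // signaled when a job is queued
    pthread_cond_t  cond_done;  // signaled when the queue drains
    job  *head, *tail;
    int   n_pending;            // queued + currently running jobs
    bool  stop;                 // set by a teardown function (omitted)
} cpu_ctx;

static void *worker(void *p) {
    cpu_ctx *ctx = p;
    for (;;) {
        pthread_mutex_lock(&ctx->mutex);
        while (!ctx->head && !ctx->stop) {
            pthread_cond_wait(&ctx->cond_push, &ctx->mutex);
        }
        if (!ctx->head) { // stop requested and queue drained
            pthread_mutex_unlock(&ctx->mutex);
            return NULL;
        }
        job *j = ctx->head;
        ctx->head = j->next;
        if (!ctx->head) {
            ctx->tail = NULL;
        }
        pthread_mutex_unlock(&ctx->mutex);

        j->fn(j->arg); // evaluate one graph outside the lock
        free(j);

        pthread_mutex_lock(&ctx->mutex);
        if (--ctx->n_pending == 0) {
            pthread_cond_broadcast(&ctx->cond_done);
        }
        pthread_mutex_unlock(&ctx->mutex);
    }
}

static void cpu_ctx_init(cpu_ctx *ctx) {
    pthread_mutex_init(&ctx->mutex, NULL);
    pthread_cond_init(&ctx->cond_push, NULL);
    pthread_cond_init(&ctx->cond_done, NULL);
    ctx->head = ctx->tail = NULL;
    ctx->n_pending = 0;
    ctx->stop = false;
    pthread_create(&ctx->thread, NULL, worker, ctx); // started once, reused
}

// enqueue work and return immediately (the async part)
static void cpu_ctx_push(cpu_ctx *ctx, void (*fn)(void *), void *arg) {
    job *j = malloc(sizeof *j);
    j->fn = fn; j->arg = arg; j->next = NULL;
    pthread_mutex_lock(&ctx->mutex);
    if (ctx->tail) ctx->tail->next = j; else ctx->head = j;
    ctx->tail = j;
    ctx->n_pending++;
    pthread_cond_signal(&ctx->cond_push);
    pthread_mutex_unlock(&ctx->mutex);
}

// block until every queued graph has been evaluated
static void cpu_ctx_synchronize(cpu_ctx *ctx) {
    pthread_mutex_lock(&ctx->mutex);
    while (ctx->n_pending > 0) {
        pthread_cond_wait(&ctx->cond_done, &ctx->mutex);
    }
    pthread_mutex_unlock(&ctx->mutex);
}
```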

Possible API:

```c
ggml_compute_context_t ggml_compute_context_init(int n_threads);
void ggml_graph_compute_async(ggml_compute_context_t context, struct ggml_cgraph * graph);
void ggml_synchronize(ggml_compute_context_t context);
```
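
Usage could then look like the following. This is a sketch only, since the API does not exist yet; `graph_cpu_0` and `graph_cpu_1` stand for `ggml_cgraph` objects built elsewhere with the usual graph-building calls.

```c
// hypothetical usage of the proposed API
ggml_compute_context_t ctx = ggml_compute_context_init(8);

// both calls return immediately; the context's persistent threads
// evaluate the graphs in FIFO order
ggml_graph_compute_async(ctx, graph_cpu_0);
ggml_graph_compute_async(ctx, graph_cpu_1);

// the caller can submit GPU work here, overlapping with the CPU queue,
// which is what enables pipeline parallelism between the two backends

ggml_synchronize(ctx); // blocks until every queued graph has finished
```

The shape of the API mirrors the CUDA stream model, with `ggml_graph_compute_async` playing the role of a kernel launch and `ggml_synchronize` that of a stream synchronize.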
