Build and execute our own computation graph #137
Description
At present, we are using GGML's computation graph. This works well, but it has a few flaws:
- We're reliant on whatever support GGML has for threading; the Rust threading ecosystem is more versatile/OS-agnostic
- Adding new operations requires patching GGML
- We're coupled pretty tightly to GGML, so switching to an alternate backend would be quite difficult; this will only get worse as we support more models
- Abstraction of shared pieces of functionality gets a little finicky with the exposed API
After reading ggml-org/llama.cpp#915, I had a flash of inspiration and realised we could address these problems by using our own computation graph.
The code would be fairly similar to what it is now - but instead of building up a GGML computation graph, we build up our own in Rust code with all of the usual strong-typing guarantees.
To begin with, this computation graph would then be "compiled" to a GGML computation graph, so that it works identically.
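As a rough sketch of what a strongly-typed graph could look like (all names here are hypothetical, not an existing API in this codebase): nodes only ever reference earlier nodes, so insertion order is already a valid execution order, and "compiling" is just a walk that emits one backend op per node.

```rust
// Hypothetical sketch of a typed computation graph; none of these names
// exist in the codebase. "Compiling" to GGML would mean emitting one GGML
// op per node inside the `compile` callback.
#[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
struct NodeId(usize);

#[derive(Debug)]
enum Op {
    /// A leaf tensor (weights or input), identified by name.
    Input(String),
    /// Element-wise addition of two earlier nodes.
    Add(NodeId, NodeId),
    /// Matrix multiplication of two earlier nodes.
    MatMul(NodeId, NodeId),
}

#[derive(Default, Debug)]
struct Graph {
    nodes: Vec<Op>,
}

impl Graph {
    /// Append an op; it may only reference NodeIds returned earlier,
    /// so the Vec is topologically ordered by construction.
    fn push(&mut self, op: Op) -> NodeId {
        self.nodes.push(op);
        NodeId(self.nodes.len() - 1)
    }

    /// Walk the graph in dependency order, handing each node to a
    /// backend-specific emitter (e.g. one that builds a GGML graph).
    fn compile<F: FnMut(NodeId, &Op)>(&self, mut emit: F) {
        for (i, op) in self.nodes.iter().enumerate() {
            emit(NodeId(i), op);
        }
    }
}
```

The appeal is that the model description stays in safe, strongly-typed Rust, and only the emitter knows anything about GGML.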
Once that's done, we would look at reimplementing the actual execution of the graph in Rust, using GGML's low-level operations to do so (e.g. its vec_dot_q4_0, etc).
This would allow us to decouple from GGML in the future (#3), and gives us freedom to implement new operations that aren't supported by GGML without having to maintain our own patched version.
Ideally, we would just use burn or something similar directly, but none of the existing libraries are in a position to serve our needs (GGML-like performance with quantization support). This lets us side-step that issue for now, and focus on describing models that could be executed by anything once support is available.
Constructing our own computation graph and compiling it to GGML should be fairly simple (this could be done with petgraph or our own graph implementation; it's not that difficult).
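If we didn't want the petgraph dependency, the core of the compile step is just a dependency-ordered walk, which is small enough to hand-roll. A minimal sketch (a stand-in for petgraph's toposort, not actual project code; no cycle detection, which a real version would need):

```rust
use std::collections::HashMap;

/// Depth-first topological walk over an adjacency list: keys are node ids,
/// values are the ids each node depends on. Every node's dependencies are
/// emitted before the node itself. Sketch only; assumes the graph is acyclic.
fn topo_order(deps: &HashMap<usize, Vec<usize>>) -> Vec<usize> {
    fn visit(
        n: usize,
        deps: &HashMap<usize, Vec<usize>>,
        seen: &mut Vec<bool>,
        out: &mut Vec<usize>,
    ) {
        if seen[n] {
            return;
        }
        seen[n] = true;
        for &d in deps.get(&n).map(|v| v.as_slice()).unwrap_or(&[]) {
            visit(d, deps, seen, out);
        }
        out.push(n);
    }

    let max = deps
        .keys()
        .chain(deps.values().flatten())
        .copied()
        .max()
        .unwrap_or(0);
    let mut seen = vec![false; max + 1];
    let mut out = Vec::new();
    for &n in deps.keys() {
        visit(n, deps, &mut seen, &mut out);
    }
    out
}
```

Emitting GGML ops (or running our own executor) in this order is all "compiling" the graph would really mean.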
The main problem comes in the executor reimplementation - a lot of GGML's more complex operations are coupled to the executor, so we'd have to reimplement them (e.g. all the ggml_compute_forward_... functions). Additionally, a lot of the base operations are declared static and not exposed to the outside world, so it's likely we'd have to patch GGML anyway.
An alternate approach to full graph reimplementation might be to patch GGML once to support custom elementwise operations (as @KerfuffleV2 has done in their fork), so that we can polyfill custom operations from our computation graph.
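The polyfill idea boils down to the graph carrying a plain function pointer that the executor maps over a tensor's elements. A hypothetical Rust-side sketch of that shape (this mirrors the idea of a map-style custom op, not KerfuffleV2's actual patch or any real GGML API):

```rust
/// Sketch of a custom elementwise op: the graph node stores a function
/// pointer, and the executor applies it to every element of the input.
/// Hypothetical; not an existing API in this project or in GGML.
struct MapUnary {
    f: fn(f32) -> f32,
}

impl MapUnary {
    fn execute(&self, input: &[f32]) -> Vec<f32> {
        input.iter().map(|&x| (self.f)(x)).collect()
    }
}

/// Example of an op we could supply ourselves without patching GGML
/// further: SiLU / swish, x * sigmoid(x).
fn silu(x: f32) -> f32 {
    x / (1.0 + (-x).exp())
}
```

With one such hook in place, new activations and other elementwise ops become pure Rust code on our side, which is exactly the decoupling this issue is after.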