This issue tracks various action items we would like to complete regarding function calling and embeddings.
Function calling (beta)
We call this beta because multiple iterations may be needed: it may be hard to conform different open-source models' function calling formats to the OpenAI API. We will try to make each iteration non-breaking.
- F1 Enable manual function calling (completed via [Tool] Support manual function calling #527, supported in npm 0.2.53)
  - That is, function calling with only `system`, `user`, `assistant`, `tool` messages, without using the `tools` and `tool_calls` fields of the OpenAI API
  - Reach parity with the examples provided in MLC-LLM: [Tool] Prelim support for function calling with Llama3.1 and Hermes2 mlc-llm#2744
    - This requires various runtime changes as initiated in Function calling complete example #467
    - Add examples for Llama3.1 and Hermes2 after they are supported
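The manual flow above can be sketched with a small helper (hypothetical code, not part of the web-llm API): the function schemas travel inside the `system` message, and the assistant's plain-text reply is parsed by hand for a JSON call.

```typescript
// Hypothetical sketch of manual function calling without `tools`/`tool_calls`:
// schemas are injected into the system prompt, and the reply is parsed by hand.

interface ManualCall {
  name: string;
  arguments: Record<string, unknown>;
}

// Build a system prompt that asks the model to reply with a JSON call.
function buildSystemPrompt(schemas: object[]): string {
  return (
    "You may call these functions by replying with JSON " +
    '{"name": ..., "arguments": ...}:\n' +
    JSON.stringify(schemas)
  );
}

// Parse the assistant's free-form reply; return null if it is not a call.
function parseManualCall(reply: string): ManualCall | null {
  const match = reply.match(/\{[\s\S]*\}/); // first JSON-looking span
  if (!match) return null;
  try {
    const obj = JSON.parse(match[0]);
    if (typeof obj.name === "string" && typeof obj.arguments === "object") {
      return obj as ManualCall;
    }
  } catch {
    // Not valid JSON: treat the reply as natural language.
  }
  return null;
}
```

This keeps the message roles fully standard; the trade-off is that parsing is best-effort, which is exactly what F3's grammar enforcement later addresses.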
- F2 Support function calling following the OpenAI API with `tools` and `tool_calls`
  - Essentially supporting the example in https://platform.openai.com/docs/guides/function-calling
  - There are new fields in the official OpenAI API, which we should support as well if possible
  - This may limit flexibility for the user. For instance, while Llama3.1 offers roughly three formats for function calling, using `tools` will force us to use only one of them
  - The previous PR only offers minimal one-round function calling support: [OpenAI] Function calling API for Hermes-2-Pro #451
  - We want to allow the model to make tool calls or respond in natural language at its own discretion
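To make the target concrete, here is the OpenAI-style shape F2 aims for, plus a small dispatcher for the returned `tool_calls` (the `get_weather` function and the dispatcher are illustrative, not web-llm code):

```typescript
// OpenAI-style `tools` declaration (shape per the OpenAI function calling guide).
const tools = [
  {
    type: "function" as const,
    function: {
      name: "get_weather",
      description: "Get the current weather for a city",
      parameters: {
        type: "object",
        properties: { city: { type: "string" } },
        required: ["city"],
      },
    },
  },
];

// Shape of one tool call in an OpenAI-compatible response.
interface ToolCall {
  id: string;
  type: "function";
  function: { name: string; arguments: string }; // arguments is a JSON string
}

// Run each tool call against a table of local implementations and produce
// the `tool` role messages to send back in the next round.
function dispatch(
  calls: ToolCall[],
  impls: Record<string, (args: any) => string>
): { role: "tool"; tool_call_id: string; content: string }[] {
  return calls.map((c) => ({
    role: "tool" as const,
    tool_call_id: c.id,
    // A real implementation should also handle unknown function names.
    content: impls[c.function.name](JSON.parse(c.function.arguments)),
  }));
}
```

Supporting multi-round calling then amounts to appending the assistant message (with its `tool_calls`) and these `tool` messages before the next `chat.completions.create` request.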
- F3 Use BNFGrammar to guarantee tool call generation correctness
  - This requires the model to use a special token to signify the beginning of a function call, `<tool_call>` in the case of Hermes2. Upon such a token being generated, we instantiate a BNFGrammar instance; when the call ends, we force the model to generate `</tool_call>`. Before and after this tool call, the model can generate either natural language or other tool calls.
  - The previous PR forces the response to be a tool call, which limits flexibility a lot: [OpenAI] Function calling API for Hermes-2-Pro #451
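The contract F3 enforces can be illustrated with a plain post-hoc validator (not the actual token-level BNFGrammar integration): text outside `<tool_call>`/`</tool_call>` is free-form, while every delimited span must be a valid JSON object, and every opening tag must be closed.

```typescript
// Illustrative check of the F3 contract. The real mechanism constrains
// decoding token-by-token with BNFGrammar; this only validates a finished
// output string against the same rules.

function validateToolCallSpans(output: string): boolean {
  const re = /<tool_call>([\s\S]*?)<\/tool_call>/g;
  let m: RegExpExecArray | null;
  while ((m = re.exec(output)) !== null) {
    try {
      const obj = JSON.parse(m[1]);
      if (typeof obj !== "object" || obj === null) return false;
    } catch {
      return false; // span is not valid JSON
    }
  }
  // An unmatched opening tag means the forced closing tag never came.
  const opens = (output.match(/<tool_call>/g) ?? []).length;
  const closes = (output.match(/<\/tool_call>/g) ?? []).length;
  return opens === closes;
}
```

With grammar-constrained decoding, outputs failing this check become unrepresentable rather than merely detectable.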
- Related issues
Embedding, Multi-model Engine, Concurrency
- E1 Loading multiple models within an engine (completed via [API][Engine] Support loading multiple models in a single engine #542, supported in npm 0.2.59)
  - For applications like RAG, two models are needed: one embedding model and one LLM. We'd like to hold all models in a single `MLCEngine` instead of instantiating multiple engines. This makes `MLCEngine` behave like an endpoint and offers the possibility of intra-engine optimizations in the future.
  - Each model can process requests concurrently if needed: [Fix] Allow concurrent inference for multi model in WebWorker #546 (published in npm 0.2.60)
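The endpoint-like design in E1 can be sketched abstractly (hypothetical names, not the actual `MLCEngine` internals): one engine owns several model pipelines and routes each request by its `model` field.

```typescript
// Hypothetical sketch of E1's design: one engine, many model pipelines,
// requests routed by model id. Names here are illustrative only.

type Pipeline = (prompt: string) => Promise<string>;

class MultiModelEngine {
  private pipelines = new Map<string, Pipeline>();

  // Register a loaded model's pipeline under its model id.
  load(modelId: string, pipeline: Pipeline): void {
    this.pipelines.set(modelId, pipeline);
  }

  // Route a request to the named model, like an OpenAI endpoint does.
  async complete(modelId: string, prompt: string): Promise<string> {
    const p = this.pipelines.get(modelId);
    if (!p) throw new Error(`model not loaded: ${modelId}`);
    return p(prompt); // each pipeline can serve requests independently
  }
}
```

This is the shape that lets a RAG app keep an embedding model and an LLM behind one object instead of two engines.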
- E2 Fix concurrent request issues (completed via [Fix] Implement lock to ensure FCFS of requests to same model #549, supported in npm 0.2.61)
  - With a single model, we encounter correctness issues with multiple concurrent requests, as brought up in Support concurrent requests to a single model instance #522
  - After E1, we need to pay close attention to potential concurrency issues across models as well.
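The FCFS approach in E2 can be sketched as a promise-chain lock (a generic pattern, not the exact web-llm implementation): each request to the same model queues behind the previous one instead of interleaving with it.

```typescript
// Sketch of a first-come-first-served lock: tasks queue on a promise chain,
// so requests to the same model never interleave.

class FcfsLock {
  private tail: Promise<void> = Promise.resolve();

  // Queue `task` behind every previously submitted task.
  run<T>(task: () => Promise<T>): Promise<T> {
    const result = this.tail.then(task);
    // Keep the chain alive even if `task` rejects.
    this.tail = result.then(
      () => undefined,
      () => undefined
    );
    return result;
  }
}
```

One lock per model preserves per-model ordering while still letting different models (E1) run concurrently.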
- E3 Implement `engine.embeddings.create()` (completed via [Embeddings][OpenAI] Support embeddings via engine.embeddings.create() #538, supported in npm 0.2.58)
- E4 Add an example for RAG (completed via [RAG] Add example for RAG with Langchain.js #550)
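In an E4-style RAG flow, once `engine.embeddings.create()` returns vectors, ranking retrieved chunks reduces to cosine similarity over those vectors (a generic sketch, not web-llm-specific code):

```typescript
// Generic RAG-ranking sketch: cosine similarity over embedding vectors,
// as one would apply to the vectors returned by an embeddings endpoint.

function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Rank document embeddings against a query embedding, best match first.
function rank(query: number[], docs: number[][]): number[] {
  return docs
    .map((d, i) => ({ i, score: cosineSimilarity(query, d) }))
    .sort((x, y) => y.score - x.score)
    .map((x) => x.i);
}
```

The top-ranked chunks are then stuffed into the LLM prompt, which is the pattern the Langchain.js example in E4 demonstrates end to end.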
- Related issues