Motivation.
Snowflake AI Research has recently released several optimizations, such as Suffix Decoding, LSTM Speculation, Sequence Parallelism, and SwiftKV, that improve TTFT, TPOT, and throughput for vLLM via a plugin called Arctic Inference (repo: https://github.com/snowflakedb/arcticinference).
Performance Improvements
- 4x faster generation with Suffix Decoding for SWEBench
- 2.4x faster generation with LSTM Speculator for ShareGPT
- 2.8x faster coding with LSTM Speculator for HumanEval
- 2x higher throughput with SwiftKV
- 1.4x higher throughput than TP=8, with the same TTFT as TP=8, using Arctic Ulysses
These optimizations are designed to improve vLLM performance for real production workloads at Snowflake, and they will continue to be expanded, maintained, and improved over time.
Currently, Arctic Inference is implemented as an external plugin for vLLM. This means users must install both vLLM and Arctic Inference, and then run vLLM with additional configuration, for example:
```bash
pip install vllm==v0.8.4 arctic-inference
vllm serve ... \
    --sequence-parallel-size ... \
    --speculative-config '{"method": "suffix"}'
```
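For context on how the plugin hooks in: vLLM discovers external plugins through setuptools entry points in the `vllm.general_plugins` group, so installing the plugin package is enough for vLLM to pick it up at startup. Below is a minimal sketch of that mechanism; the package and function names are illustrative placeholders, not Arctic Inference’s actual ones.

```python
# Sketch of a plugin package's setup.py. Names are illustrative only.
from setuptools import setup

setup(
    name="my_vllm_plugin",
    version="0.1.0",
    packages=["my_vllm_plugin"],
    entry_points={
        # vLLM scans this entry-point group at startup and calls each
        # referenced function, which can then register or patch components.
        "vllm.general_plugins": [
            "register_my_plugin = my_vllm_plugin:register",
        ],
    },
)
```

This entry-point mechanism is what lets a plugin like Arctic Inference add its optimizations without any changes to vLLM’s code base.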
Problem
Currently, Arctic Inference’s primary user is Snowflake’s own production Cortex inference service. However, we believe that other users in the community can also benefit from its features (some users expressed interest to us after our most recent speculative decoding blog post).
First, how can we make these optimizations more accessible to the vLLM community? While some features (e.g. SwiftKV) are more model-specific, others like Suffix Decoding can provide value to users more generally. We believe more users can benefit if they can use these features through vLLM directly.
Second, how can we build a smooth path for “graduating” features from Arctic Inference to vLLM, if desired, as they become available, more generally used, and as their interfaces stabilize?
Proposed Change.
Possible Solutions
1. All Arctic Inference features remain available only through its plugin. Basically, keep things the way they are. Arctic Inference features do not get integrated into vLLM directly, and interested users are directed to the Arctic Inference project to install and use it with their installation of vLLM. Users will need to go through extra discovery and installation steps, but this is the simplest option. Perhaps Arctic Inference can be highlighted in the documentation as an ecosystem project that helps boost performance.
2. Some Arctic Inference features are directly integrated into vLLM. This way, vLLM users can benefit from those features without first having to discover the Arctic Inference project (some users have asked us about this). The integration can be done in two ways:

   i. Copy the code from Arctic Inference directly into vLLM. This does not require bringing additional dependencies into vLLM, but it does require extra effort to port each feature and maintain it. The Snowflake team will also keep maintaining and improving the same features as part of Arctic Inference, which makes this a duplicated effort. This strategy also does not provide a smooth “graduation” ramp for features that vLLM might want to integrate in the future. It also defeats the purpose of supporting vLLM plugins, which are meant to let vLLM leverage optimizations directly without incorporating them into vLLM’s code base.

   ii. Install Arctic Inference as a dependency of vLLM. Then, certain features (e.g. there is interest in Suffix Decoding) can be integrated directly into vLLM, making them immediately available to all vLLM users. Other Arctic Inference features can still be used by optionally enabling its plugin, defaulting to off. The rest of Arctic Inference can even be installed just-in-time, if desired. There is a question of maintenance burden due to the dependency on Arctic Inference. However, this burden is much lower than for other libraries because Arctic Inference exists purely as an extension to vLLM, which means:
      - A part of the Arctic Inference package can be explicitly designated for importing into vLLM, with its interface stability guaranteed (e.g. in an `arctic_inference.core` submodule); see the sketch after this list.
      - Secondary dependencies are minimal (in fact, they are currently non-existent), since Arctic Inference inherits all its major dependencies, such as torch, from vLLM itself. For optional features (e.g. those outside `arctic_inference.core`), the Arctic Inference team is responsible for upgrading them after each substantial vLLM release.
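To make the dependency boundary concrete, here is a hedged sketch of how vLLM-side code could import only from the stability-guaranteed core submodule and fail gracefully when the optional package is missing. The module and class names below are assumptions for illustration, not the package’s actual layout.

```python
# Hypothetical vLLM-side glue: depend only on the designated stable submodule
# and give an actionable message when the optional package is absent.
try:
    # Assumed names; the real arctic_inference.core layout may differ.
    from arctic_inference.core.suffix_decoding import SuffixDecodingCache
except ImportError as err:
    raise ImportError(
        "Suffix decoding requires the arctic-inference package. "
        "Install it with: pip install arctic-inference"
    ) from err


def build_suffix_decoder() -> SuffixDecodingCache:
    # The integration code that would live in the vLLM repo under option 2.ii
    # touches only the guaranteed-stable interface of arctic_inference.core.
    return SuffixDecodingCache()
```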
Recommended Solution
Based on the tradeoffs between these three options, we propose option 2.ii:
- Arctic Inference is installed as a dependency of vLLM.
- Arctic Inference creates the core submodule, starting with suffix decoding, with guaranteed interface stability and minimal sub-dependencies.
- The code for integrating suffix decoding with vLLM is contributed to vLLM directly, enabling users to use it natively in vLLM with no extra steps (see the sketch after this list).
- Other features in Arctic Inference can be used on an opt-in basis by enabling the Arctic Inference plugin (off by default).
- Once this integration is established, future features can be smoothly “graduated” into native vLLM usage simply by writing the integration code in vLLM.
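To illustrate the intended end state, here is a hedged sketch of what “native” suffix decoding could look like through the offline `LLM` API, assuming the existing `speculative_config` engine argument carries over unchanged; the model name is just an example.

```python
from vllm import LLM, SamplingParams

# Hypothetical usage after native integration: no separate plugin install or
# extra flags beyond the same speculative decoding config shown in the CLI
# example above.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # any supported model
    speculative_config={"method": "suffix"},
)

outputs = llm.generate(
    ["Write a Python function that reverses a string."],
    SamplingParams(temperature=0.0, max_tokens=128),
)
print(outputs[0].outputs[0].text)
```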
Feedback Period.
No response
CC List.
@simon-mo @WoosukKwon @LiuXiaoxuanPKU @sfc-gh-jrasley @sfc-gh-srajbhandari
Any Other Things.
No response
Before submitting a new issue...
- Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.