
Conversation


@WoosukKwon WoosukKwon commented Jun 5, 2024

This PR implements the initial integration of the Google TPU backend. It uses PyTorch XLA for maximal reuse of the existing code base.

The PR features:

  • Seamless support for popular HF models such as Llama, Mistral, and Gemma (the model's head size must be either 128 or 256)
  • Core vLLM functionality, including continuous batching
  • Optimized Pallas kernels for FlashAttention and PagedAttention
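To make the PagedAttention item above concrete, here is a minimal, illustrative sketch (not vLLM's actual code) of the block-table indirection that a PagedAttention kernel relies on: each sequence's KV cache lives in fixed-size blocks scattered through memory, and a per-sequence block table maps logical token positions to physical blocks. The function name and block size here are assumptions for illustration; vLLM's default block size happens to be 16 tokens.

```python
BLOCK_SIZE = 16  # tokens per KV-cache block (illustrative; vLLM defaults to 16)

def lookup_kv_slot(block_table, position, block_size=BLOCK_SIZE):
    """Translate a logical token position into (physical_block, offset).

    block_table: per-sequence list mapping logical block index -> physical block id.
    """
    logical_block = position // block_size
    offset = position % block_size
    return block_table[logical_block], offset

# A 40-token sequence occupying three non-contiguous physical blocks:
block_table = [7, 2, 11]
print(lookup_kv_slot(block_table, 0))   # (7, 0)  -> first token
print(lookup_kv_slot(block_table, 17))  # (2, 1)  -> second block, offset 1
print(lookup_kv_slot(block_table, 39))  # (11, 7) -> last token
```

The attention kernel performs this lookup for every key/value it reads, which is what lets vLLM allocate KV-cache memory block by block instead of reserving one contiguous region per sequence.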

TODOs (next steps):

  • (Fast) top-p sampling (disabled for now due to performance issues)
  • Distributed (tensor-parallel) inference
  • INT8 quantization
  • MoE
  • Support best_of > 1

@WoosukKwon
Collaborator Author

@alanwaketan Please take a look!

@WoosukKwon WoosukKwon changed the title [WIP][Hardware] Initial TPU integration [Hardware] Initial TPU integration Jun 11, 2024
@WoosukKwon WoosukKwon marked this pull request as ready for review June 11, 2024 17:47
@WoosukKwon WoosukKwon requested a review from JackCaoG June 11, 2024 17:57

@JackCaoG JackCaoG left a comment

LGTM, thanks!

Collaborator

@rkooo567 rkooo567 left a comment


Mostly asking for comments to clarify some of the code!

For testing, are we planning to add relevant CI in the future?

@WoosukKwon
Collaborator Author

@rkooo567 Thanks for the quality review!

@WoosukKwon WoosukKwon merged commit 1a8bfd9 into main Jun 12, 2024
@WoosukKwon WoosukKwon deleted the torch-xla branch June 12, 2024 18:53
robertgshaw2-redhat pushed a commit to neuralmagic/nm-vllm that referenced this pull request Jun 16, 2024
joerunde pushed a commit to joerunde/vllm that referenced this pull request Jun 17, 2024
xjpang pushed a commit to xjpang/vllm that referenced this pull request Jun 27, 2024
xjpang pushed a commit to xjpang/vllm that referenced this pull request Jul 8, 2024
xjpang pushed a commit to xjpang/vllm that referenced this pull request Jul 24, 2024

Labels

tpu Related to Google TPUs

3 participants