-
Notifications
You must be signed in to change notification settings - Fork 0
[WIP] drafting a fix - cropping the kv cache #6
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Contrarily to UAG, I don't think we need to set remove_from_pkv > 0 in USD (like in regular AG/SD) since draft/target tokens are aligned
How does the KV cache get updated after rejecting draft tokens? |
Here, |
|
@keyboardAnt - Is this change still applicable if #7 is merged? |
|
* gptqmodel Signed-off-by: jiqing-feng <[email protected]> * fix format Signed-off-by: jiqing-feng <[email protected]> * update readme Signed-off-by: jiqing-feng <[email protected]> * gptqmodel need use checkpoint_format (#1) * gptqmodel need use checkpoint_format * fix quantize * Update quantization_config.py * Update quantization_config.py * Update quantization_config.py --------- Co-authored-by: ZX-ModelCloud <[email protected]> Co-authored-by: Qubitium-ModelCloud <[email protected]> * Revert quantizer_gptq.py (#2) * revert quantizer_gptq.py change * pass **kwargs * limit gptqmodel and optimum version Signed-off-by: jiqing-feng <[email protected]> * fix format Signed-off-by: jiqing-feng <[email protected]> * fix warning Signed-off-by: jiqing-feng <[email protected]> * fix version check Signed-off-by: jiqing-feng <[email protected]> * revert unrelated changes Signed-off-by: jiqing-feng <[email protected]> * enable gptqmodel tests Signed-off-by: jiqing-feng <[email protected]> * fix requires gptq Signed-off-by: jiqing-feng <[email protected]> * Fix Transformer compat (#3) * revert quantizer_gptq.py change * pass **kwargs * add meta info * cleanup * cleanup * Update quantization_config.py * hf_select_quant_linear pass checkpoint_format and meta * fix GPTQTestCUDA * Update test_gptq.py * gptqmodel.hf_select_quant_linear() now does not select ExllamaV2 * cleanup * add backend * cleanup * cleanup * no need check exllama version * Update quantization_config.py * lower checkpoint_format and backend * check none * cleanup * Update quantization_config.py * fix self.use_exllama == False * spell * fix unittest * fix unittest --------- Co-authored-by: LRL <[email protected]> Co-authored-by: Qubitium-ModelCloud <[email protected]> * fix format Signed-off-by: jiqing-feng <[email protected]> * fix format again Signed-off-by: jiqing-feng <[email protected]> * update gptqmodel version (#6) * update gptqmodel version * update gptqmodel version * fix unit test (#5) * update gptqmodel version * update gptqmodel version * "not self.use_exllama" is not equivalent to "self.use_exllama==False" * fix unittest * update gptqmodel version * backend is loading_attibutes (#7) * fix format and tests Signed-off-by: jiqing-feng <[email protected]> * fix memory check Signed-off-by: jiqing-feng <[email protected]> * fix device mismatch Signed-off-by: jiqing-feng <[email protected]> * fix result check Signed-off-by: jiqing-feng <[email protected]> * Update src/transformers/quantizers/quantizer_gptq.py Co-authored-by: Marc Sun <[email protected]> * Update src/transformers/quantizers/quantizer_gptq.py Co-authored-by: Marc Sun <[email protected]> * Update src/transformers/quantizers/quantizer_gptq.py Co-authored-by: Marc Sun <[email protected]> * update tests Signed-off-by: jiqing-feng <[email protected]> * review: update docs (#10) * review: update docs (#12) * review: update docs * fix typo * update tests for gptqmodel Signed-off-by: jiqing-feng <[email protected]> * update document (#9) * update overview.md * cleanup * Update overview.md * Update overview.md * Update overview.md * update gptq.md * Update gptq.md * Update gptq.md * Update gptq.md * Update gptq.md * Update gptq.md * Update gptq.md --------- Co-authored-by: Qubitium-ModelCloud <[email protected]> * typo * doc note for asymmetric quant * typo with apple silicon(e) * typo for marlin * column name revert: review * doc rocm support * Update docs/source/en/quantization/gptq.md Co-authored-by: Steven Liu <[email protected]> * Update docs/source/en/quantization/gptq.md Co-authored-by: Steven Liu <[email protected]> * Update docs/source/en/quantization/gptq.md Co-authored-by: Steven Liu <[email protected]> * Update docs/source/en/quantization/gptq.md Co-authored-by: Steven Liu <[email protected]> * Update docs/source/en/quantization/overview.md Co-authored-by: Steven Liu <[email protected]> * Update docs/source/en/quantization/overview.md Co-authored-by: Steven Liu <[email protected]> --------- Signed-off-by: jiqing-feng <[email protected]> Co-authored-by: LRL-ModelCloud <[email protected]> Co-authored-by: ZX-ModelCloud <[email protected]> Co-authored-by: Qubitium-ModelCloud <[email protected]> Co-authored-by: ZX-ModelCloud <[email protected]> Co-authored-by: LRL <[email protected]> Co-authored-by: Marc Sun <[email protected]> Co-authored-by: Mohamed Mekkouri <[email protected]> Co-authored-by: Steven Liu <[email protected]>
* Resolve vptq conflict * Rename spqr package to spqr_quant * Get rid of aqlm mention * Start working on tests * Resolve ruff code checks * Ruff format * Isort * Test updates * Add gpu tag * Rename to modules_to_not_convert * Config update * Docs and config update * Docs and config update * Update to update_torch_dtype * spqr config parameter validation * Ruff update * Apply ruff fixes * Test fixes * Ruff update * Mark tests as @slow again; Ruff; Docstring update * Ruff * Remove absolute path * Resolve typo * Remove redundandt log * Check accelerate/spqr availability * Ruff fix * Check if the config contains proper shapes * Ruff test * Documentation update * overview update * Ruff checks * Ruff code quality * Make style * Update docs/source/en/quantization/spqr.md Co-authored-by: Steven Liu <[email protected]> * Update spqr.md * Enable gptqmodel (huggingface#35012) * gptqmodel Signed-off-by: jiqing-feng <[email protected]> * fix format Signed-off-by: jiqing-feng <[email protected]> * update readme Signed-off-by: jiqing-feng <[email protected]> * gptqmodel need use checkpoint_format (#1) * gptqmodel need use checkpoint_format * fix quantize * Update quantization_config.py * Update quantization_config.py * Update quantization_config.py --------- Co-authored-by: ZX-ModelCloud <[email protected]> Co-authored-by: Qubitium-ModelCloud <[email protected]> * Revert quantizer_gptq.py (#2) * revert quantizer_gptq.py change * pass **kwargs * limit gptqmodel and optimum version Signed-off-by: jiqing-feng <[email protected]> * fix format Signed-off-by: jiqing-feng <[email protected]> * fix warning Signed-off-by: jiqing-feng <[email protected]> * fix version check Signed-off-by: jiqing-feng <[email protected]> * revert unrelated changes Signed-off-by: jiqing-feng <[email protected]> * enable gptqmodel tests Signed-off-by: jiqing-feng <[email protected]> * fix requires gptq Signed-off-by: jiqing-feng <[email protected]> * Fix Transformer compat (#3) * revert quantizer_gptq.py change * pass **kwargs * add meta info * cleanup * cleanup * Update quantization_config.py * hf_select_quant_linear pass checkpoint_format and meta * fix GPTQTestCUDA * Update test_gptq.py * gptqmodel.hf_select_quant_linear() now does not select ExllamaV2 * cleanup * add backend * cleanup * cleanup * no need check exllama version * Update quantization_config.py * lower checkpoint_format and backend * check none * cleanup * Update quantization_config.py * fix self.use_exllama == False * spell * fix unittest * fix unittest --------- Co-authored-by: LRL <[email protected]> Co-authored-by: Qubitium-ModelCloud <[email protected]> * fix format Signed-off-by: jiqing-feng <[email protected]> * fix format again Signed-off-by: jiqing-feng <[email protected]> * update gptqmodel version (#6) * update gptqmodel version * update gptqmodel version * fix unit test (#5) * update gptqmodel version * update gptqmodel version * "not self.use_exllama" is not equivalent to "self.use_exllama==False" * fix unittest * update gptqmodel version * backend is loading_attibutes (#7) * fix format and tests Signed-off-by: jiqing-feng <[email protected]> * fix memory check Signed-off-by: jiqing-feng <[email protected]> * fix device mismatch Signed-off-by: jiqing-feng <[email protected]> * fix result check Signed-off-by: jiqing-feng <[email protected]> * Update src/transformers/quantizers/quantizer_gptq.py Co-authored-by: Marc Sun <[email protected]> * Update src/transformers/quantizers/quantizer_gptq.py Co-authored-by: Marc Sun <[email protected]> * Update src/transformers/quantizers/quantizer_gptq.py Co-authored-by: Marc Sun <[email protected]> * update tests Signed-off-by: jiqing-feng <[email protected]> * review: update docs (#10) * review: update docs (#12) * review: update docs * fix typo * update tests for gptqmodel Signed-off-by: jiqing-feng <[email protected]> * update document (#9) * update overview.md * cleanup * Update overview.md * Update overview.md * Update overview.md * update gptq.md * Update gptq.md * Update gptq.md * Update gptq.md * Update gptq.md * Update gptq.md * Update gptq.md --------- Co-authored-by: Qubitium-ModelCloud <[email protected]> * typo * doc note for asymmetric quant * typo with apple silicon(e) * typo for marlin * column name revert: review * doc rocm support * Update docs/source/en/quantization/gptq.md Co-authored-by: Steven Liu <[email protected]> * Update docs/source/en/quantization/gptq.md Co-authored-by: Steven Liu <[email protected]> * Update docs/source/en/quantization/gptq.md Co-authored-by: Steven Liu <[email protected]> * Update docs/source/en/quantization/gptq.md Co-authored-by: Steven Liu <[email protected]> * Update docs/source/en/quantization/overview.md Co-authored-by: Steven Liu <[email protected]> * Update docs/source/en/quantization/overview.md Co-authored-by: Steven Liu <[email protected]> --------- Signed-off-by: jiqing-feng <[email protected]> Co-authored-by: LRL-ModelCloud <[email protected]> Co-authored-by: ZX-ModelCloud <[email protected]> Co-authored-by: Qubitium-ModelCloud <[email protected]> Co-authored-by: ZX-ModelCloud <[email protected]> Co-authored-by: LRL <[email protected]> Co-authored-by: Marc Sun <[email protected]> Co-authored-by: Mohamed Mekkouri <[email protected]> Co-authored-by: Steven Liu <[email protected]> * Fix : Nemotron Processor in GGUF conversion (huggingface#35708) * fixing nemotron processor * make style * Update docs/source/en/quantization/spqr.md Co-authored-by: Arthur <[email protected]> * Add missing TOC to doc --------- Signed-off-by: jiqing-feng <[email protected]> Co-authored-by: Steven Liu <[email protected]> Co-authored-by: jiqing-feng <[email protected]> Co-authored-by: LRL-ModelCloud <[email protected]> Co-authored-by: ZX-ModelCloud <[email protected]> Co-authored-by: Qubitium-ModelCloud <[email protected]> Co-authored-by: ZX-ModelCloud <[email protected]> Co-authored-by: LRL <[email protected]> Co-authored-by: Marc Sun <[email protected]> Co-authored-by: Mohamed Mekkouri <[email protected]> Co-authored-by: Arthur <[email protected]>
It seems like we haven't been cropping the KV cache. This PR adds the missing functionality and hopefully addresses the issue raised in huggingface#35029 (comment)