Conversation

@keyboardAnt (Owner) commented Dec 6, 2024

It seems we haven't been cropping the KV cache. This PR adds the missing functionality and hopefully addresses the issue raised in huggingface#35029 (comment).

@jmamou (Collaborator) left a comment

Unlike UAG, I don't think we need to set `remove_from_pkv > 0` in USD (as in regular AG/SD), since the draft and target tokens are aligned.

@keyboardAnt (Owner, Author) replied:

> Unlike UAG, I don't think we need to set `remove_from_pkv > 0` in USD (as in regular AG/SD), since the draft and target tokens are aligned.

How does the KV cache get updated after rejecting draft tokens?

@jmamou (Collaborator) commented Dec 8, 2024

> Unlike UAG, I don't think we need to set `remove_from_pkv > 0` in USD (as in regular AG/SD), since the draft and target tokens are aligned.
>
> How does the KV cache get updated after rejecting draft tokens?

```python
new_cache_size = input_ids.shape[-1] - 1
self.assistant_kwargs["past_key_values"] = _crop_past_key_values(
    self.assistant_model, self.assistant_kwargs["past_key_values"], new_cache_size - 1
)
```

Here, `input_ids` corresponds to `assistant_input_ids` with validated tokens only, plus the additional token(s) from the target validation.
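To illustrate what the cropping above does, here is a minimal sketch of a `_crop_past_key_values`-style helper for the legacy tuple cache format. The function name and the one-(key, value)-pair-per-layer layout are assumptions for illustration, not the library's actual implementation:

```python
import torch

def crop_past_key_values(past_key_values, max_length):
    # Keep only the first `max_length` positions along the sequence axis.
    # Assumes the legacy tuple format: one (key, value) pair per layer,
    # each of shape (batch, num_heads, seq_len, head_dim).
    return tuple(
        (key[:, :, :max_length, :], value[:, :, :max_length, :])
        for key, value in past_key_values
    )

# Toy cache: one layer, batch 1, 2 heads, 10 cached tokens, head_dim 4.
cache = ((torch.zeros(1, 2, 10, 4), torch.zeros(1, 2, 10, 4)),)
cropped = crop_past_key_values(cache, 7)
print(cropped[0][0].shape)  # torch.Size([1, 2, 7, 4])
```

After the target model rejects draft tokens, slicing the cache back to the validated length like this lets the next assistant forward pass recompute only from the first unverified position.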

@gauravjain14 mentioned this pull request Dec 9, 2024

@gauravjain14 (Collaborator) commented:

@keyboardAnt - Is this change still applicable if #7 is merged?

@jmamou (Collaborator) commented Dec 10, 2024

> @keyboardAnt - Is this change still applicable if #7 is merged?

@gauravjain14 no

keyboardAnt pushed a commit that referenced this pull request Jan 16, 2025
* gptqmodel

Signed-off-by: jiqing-feng <[email protected]>

* fix format

Signed-off-by: jiqing-feng <[email protected]>

* update readme

Signed-off-by: jiqing-feng <[email protected]>

* gptqmodel need use checkpoint_format (#1)

* gptqmodel need use checkpoint_format

* fix quantize

* Update quantization_config.py

* Update quantization_config.py

* Update quantization_config.py

---------

Co-authored-by: ZX-ModelCloud <[email protected]>
Co-authored-by: Qubitium-ModelCloud <[email protected]>

* Revert quantizer_gptq.py (#2)

* revert quantizer_gptq.py change

* pass **kwargs

* limit gptqmodel and optimum version

Signed-off-by: jiqing-feng <[email protected]>

* fix format

Signed-off-by: jiqing-feng <[email protected]>

* fix warning

Signed-off-by: jiqing-feng <[email protected]>

* fix version check

Signed-off-by: jiqing-feng <[email protected]>

* revert unrelated changes

Signed-off-by: jiqing-feng <[email protected]>

* enable gptqmodel tests

Signed-off-by: jiqing-feng <[email protected]>

* fix requires gptq

Signed-off-by: jiqing-feng <[email protected]>

* Fix Transformer compat (#3)

* revert quantizer_gptq.py change

* pass **kwargs

* add meta info

* cleanup

* cleanup

* Update quantization_config.py

* hf_select_quant_linear pass checkpoint_format and meta

* fix GPTQTestCUDA

* Update test_gptq.py

* gptqmodel.hf_select_quant_linear() now does not select ExllamaV2

* cleanup

* add backend

* cleanup

* cleanup

* no need check exllama version

* Update quantization_config.py

* lower checkpoint_format and backend

* check none

* cleanup

* Update quantization_config.py

* fix self.use_exllama == False

* spell

* fix unittest

* fix unittest

---------

Co-authored-by: LRL <[email protected]>
Co-authored-by: Qubitium-ModelCloud <[email protected]>

* fix format

Signed-off-by: jiqing-feng <[email protected]>

* fix format again

Signed-off-by: jiqing-feng <[email protected]>

* update gptqmodel version (#6)

* update gptqmodel version

* update gptqmodel version

* fix unit test (#5)

* update gptqmodel version

* update gptqmodel version

* "not self.use_exllama" is not equivalent to "self.use_exllama==False"

* fix unittest

* update gptqmodel version

* backend is loading_attibutes (#7)

* fix format and tests

Signed-off-by: jiqing-feng <[email protected]>

* fix memory check

Signed-off-by: jiqing-feng <[email protected]>

* fix device mismatch

Signed-off-by: jiqing-feng <[email protected]>

* fix result check

Signed-off-by: jiqing-feng <[email protected]>

* Update src/transformers/quantizers/quantizer_gptq.py

Co-authored-by: Marc Sun <[email protected]>

* Update src/transformers/quantizers/quantizer_gptq.py

Co-authored-by: Marc Sun <[email protected]>

* Update src/transformers/quantizers/quantizer_gptq.py

Co-authored-by: Marc Sun <[email protected]>

* update tests

Signed-off-by: jiqing-feng <[email protected]>

* review: update docs (#10)

* review: update docs (#12)

* review: update docs

* fix typo

* update tests for gptqmodel

Signed-off-by: jiqing-feng <[email protected]>

* update document (#9)

* update overview.md

* cleanup

* Update overview.md

* Update overview.md

* Update overview.md

* update gptq.md

* Update gptq.md

* Update gptq.md

* Update gptq.md

* Update gptq.md

* Update gptq.md

* Update gptq.md

---------

Co-authored-by: Qubitium-ModelCloud <[email protected]>

* typo

* doc note for asymmetric quant

* typo with apple silicon(e)

* typo for marlin

* column name revert: review

* doc rocm support

* Update docs/source/en/quantization/gptq.md

Co-authored-by: Steven Liu <[email protected]>

* Update docs/source/en/quantization/gptq.md

Co-authored-by: Steven Liu <[email protected]>

* Update docs/source/en/quantization/gptq.md

Co-authored-by: Steven Liu <[email protected]>

* Update docs/source/en/quantization/gptq.md

Co-authored-by: Steven Liu <[email protected]>

* Update docs/source/en/quantization/overview.md

Co-authored-by: Steven Liu <[email protected]>

* Update docs/source/en/quantization/overview.md

Co-authored-by: Steven Liu <[email protected]>

---------

Signed-off-by: jiqing-feng <[email protected]>
Co-authored-by: LRL-ModelCloud <[email protected]>
Co-authored-by: ZX-ModelCloud <[email protected]>
Co-authored-by: Qubitium-ModelCloud <[email protected]>
Co-authored-by: ZX-ModelCloud <[email protected]>
Co-authored-by: LRL <[email protected]>
Co-authored-by: Marc Sun <[email protected]>
Co-authored-by: Mohamed Mekkouri <[email protected]>
Co-authored-by: Steven Liu <[email protected]>
keyboardAnt pushed a commit that referenced this pull request Feb 15, 2025
* Resolve vptq conflict

* Rename spqr package to spqr_quant

* Get rid of aqlm mention

* Start working on tests

* Resolve ruff code checks

* Ruff format

* Isort

* Test updates

* Add gpu tag

* Rename to modules_to_not_convert

* Config update

* Docs and config update

* Docs and config update

* Update to update_torch_dtype

* spqr config parameter validation

* Ruff update

* Apply ruff fixes

* Test fixes

* Ruff update

* Mark tests as @slow again; Ruff; Docstring update

* Ruff

* Remove absolute path

* Resolve typo

* Remove redundandt log

* Check accelerate/spqr availability

* Ruff fix

* Check if the config contains proper shapes

* Ruff test

* Documentation update

* overview update

* Ruff checks

* Ruff code quality

* Make style

* Update docs/source/en/quantization/spqr.md

Co-authored-by: Steven Liu <[email protected]>

* Update spqr.md

* Enable gptqmodel (huggingface#35012)

* Fix : Nemotron Processor in GGUF conversion (huggingface#35708)

* fixing nemotron processor

* make style

* Update docs/source/en/quantization/spqr.md

Co-authored-by: Arthur <[email protected]>

* Add missing TOC to doc

---------

Signed-off-by: jiqing-feng <[email protected]>
Co-authored-by: Steven Liu <[email protected]>
Co-authored-by: jiqing-feng <[email protected]>
Co-authored-by: LRL-ModelCloud <[email protected]>
Co-authored-by: ZX-ModelCloud <[email protected]>
Co-authored-by: Qubitium-ModelCloud <[email protected]>
Co-authored-by: ZX-ModelCloud <[email protected]>
Co-authored-by: LRL <[email protected]>
Co-authored-by: Marc Sun <[email protected]>
Co-authored-by: Mohamed Mekkouri <[email protected]>
Co-authored-by: Arthur <[email protected]>