|
| 1 | +# Copyright 2024 The HuggingFace Inc. team. All rights reserved. |
| 2 | +# |
| 3 | +# Licensed under the Apache License, Version 2.0 (the "License"); |
| 4 | +# you may not use this file except in compliance with the License. |
| 5 | +# You may obtain a copy of the License at |
| 6 | +# |
| 7 | +# http://www.apache.org/licenses/LICENSE-2.0 |
| 8 | +# |
| 9 | +# Unless required by applicable law or agreed to in writing, software |
| 10 | +# distributed under the License is distributed on an "AS IS" BASIS, |
| 11 | +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. |
| 12 | +# See the License for the specific language governing permissions and |
| 13 | +# limitations under the License. |
1 | 14 | """ |
2 | 15 | Adapted from |
3 | 16 | https://github.com/huggingface/transformers/blob/c409cd81777fb27aadc043ed3d8339dbc020fb3b/src/transformers/integrations/bitsandbytes.py |
@@ -216,18 +229,13 @@ def _replace_with_bnb_linear( |
216 | 229 |
|
217 | 230 | def replace_with_bnb_linear(model, modules_to_not_convert=None, current_key_name=None, quantization_config=None): |
218 | 231 | """ |
219 | | - A helper function to replace all `torch.nn.Linear` modules by `bnb.nn.Linear8bit` modules from the `bitsandbytes` |
220 | | - library. This will enable running your models using mixed int8 precision as described by the paper `LLM.int8(): |
221 | | - 8-bit Matrix Multiplication for Transformers at Scale`. Make sure `bitsandbytes` compiled with the correct CUDA |
222 | | - version of your hardware is installed before running this function. `pip install -i https://test.pypi.org/simple/ |
223 | | - bitsandbytes`. |
224 | | -
|
225 | | - The function will be run recursively and replace all `torch.nn.Linear` modules except for `modules_to_not_convert` |
226 | | - that should be kept as a `torch.nn.Linear` module. The replacement is done under `init_empty_weights` context |
227 | | - manager so no CPU/GPU memory is required to run this function. Int8 mixed-precision matrix decomposition works by |
228 | | - separating a matrix multiplication into two streams: (1) and systematic feature outlier stream matrix multiplied in |
229 | | - fp16 (0.01%), (2) a regular stream of int8 matrix multiplication (99.9%). With this method, int8 inference with no |
230 | | - predictive degradation is possible for very large models (>=176B parameters). |
| 232 | + Helper function to replace the `nn.Linear` layers within `model` with either `bnb.nn.Linear8bitLt` or |
| 233 | + `bnb.nn.Linear4bit` using the `bitsandbytes` library. |
| 234 | +
|
| 235 | + References: |
| 236 | + * `bnb.nn.Linear8bitLt`: [LLM.int8(): 8-bit Matrix Multiplication for Transformers at |
| 237 | + Scale](https://arxiv.org/abs/2208.07339) |
| 238 | + * `bnb.nn.Linear4bit`: [QLoRA: Efficient Finetuning of Quantized LLMs](https://arxiv.org/abs/2305.14314) |
231 | 239 |
|
232 | 240 | Parameters: |
233 | 241 | model (`torch.nn.Module`): |
|
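
For orientation, here is a minimal sketch of how this helper might be invoked. The `TinyModel` module and both import paths are assumptions made for illustration; only the `replace_with_bnb_linear` signature shown in the diff above is taken from the source.

```python
# Sketch only: `TinyModel` and both import paths are assumptions for
# illustration, not part of this diff. Requires `bitsandbytes` to be
# installed (and, in practice, a CUDA GPU for actual inference).
import torch.nn as nn

from diffusers import BitsAndBytesConfig  # assumed public export
from diffusers.quantizers.bitsandbytes import replace_with_bnb_linear  # assumed path


class TinyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.proj_in = nn.Linear(64, 128)
        self.proj_out = nn.Linear(128, 64)  # deliberately left unquantized below

    def forward(self, x):
        return self.proj_out(self.proj_in(x))


model = TinyModel()
config = BitsAndBytesConfig(load_in_8bit=True)  # or load_in_4bit=True for the QLoRA-style path

# Swap every `nn.Linear` except `proj_out` for its bitsandbytes counterpart.
# The replacement layers are created without materialized weights, so real
# checkpoint weights still have to be loaded before running inference.
model = replace_with_bnb_linear(
    model,
    modules_to_not_convert=["proj_out"],
    quantization_config=config,
)
print(model)  # `proj_in` is now a `bnb.nn.Linear8bitLt`; `proj_out` is unchanged
```

Passing `load_in_8bit=True` selects the LLM.int8() path, while `load_in_4bit=True` selects the 4-bit QLoRA-style path; `modules_to_not_convert` is the escape hatch for layers that are numerically sensitive and should stay in full precision.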