Your current environment
n/a
🐛 Describe the bug
When using `MLPSpeculator` as the speculative model, each model has an upper limit on how high `num_speculative_tokens` can be set. This corresponds to the value of `n_predict` in the config of the speculative model. Currently, if the user tries to set `num_speculative_tokens` to a value higher than what is supported, we get a confusing message. For example, if one uses `ibm-fms/llama-13b-accelerator` and sets `num_speculative_tokens=4`, we will get the following message:

```
ValueError: Expected both speculative_model and num_speculative_tokens to be provided, but found speculative_model='ibm-fms/llama-13b-accelerator' and num_speculative_tokens=4.
```
This model supports a maximum of `num_speculative_tokens=3` (e.g., see config here). It would be better if we explicitly told the user to reduce the value of `num_speculative_tokens`.
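A minimal sketch of the kind of validation this issue is asking for. Note this is a hypothetical illustration, not vLLM code: the function name `check_num_speculative_tokens` is invented, and only the parameter names `num_speculative_tokens` and `n_predict` come from the report above.

```python
# Hypothetical sketch of the requested check; not actual vLLM code.
def check_num_speculative_tokens(num_speculative_tokens: int, n_predict: int) -> None:
    """Raise a clear, actionable error when the requested number of
    speculative tokens exceeds the speculator's n_predict limit."""
    if num_speculative_tokens > n_predict:
        raise ValueError(
            f"num_speculative_tokens={num_speculative_tokens} exceeds the "
            f"maximum supported by this speculative model "
            f"(n_predict={n_predict}). Please reduce num_speculative_tokens "
            f"to at most {n_predict}."
        )


# The ibm-fms/llama-13b-accelerator config has n_predict=3, so requesting
# 4 speculative tokens would fail with the clearer message:
try:
    check_num_speculative_tokens(num_speculative_tokens=4, n_predict=3)
except ValueError as e:
    print(e)
```

With a message like this, the user immediately knows both the limit and which argument to change.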