Description
Hello there,
I was wondering whether it is possible to run the self-speculative decoding with an IQ2 quant as the draft model and FP8 as the core model, since FP8 has been shown to be very rarely any different in accuracy from FP16.
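For context, here is a rough sketch of the self-speculative loop I have in mind (Python, with hypothetical `draft` / `target` objects standing in for the low-bit and higher-precision copies of the same model; this is not this project's actual API, just an illustration):

```python
# Sketch of greedy self-speculative decoding: the same model in two precisions,
# a low-bit "draft" (e.g. IQ2) and a higher-precision "target" (e.g. FP8).
# `draft.next_token()` and `target.verify()` are hypothetical placeholders.

def speculative_step(draft, target, tokens, k=4):
    # 1. Draft model greedily proposes k candidate tokens.
    proposal, ctx = [], list(tokens)
    for _ in range(k):
        t = draft.next_token(ctx)          # hypothetical greedy decode
        proposal.append(t)
        ctx.append(t)

    # 2. Target model scores the whole proposed block in one forward pass
    #    and returns its own greedy choice at each position.
    verified = target.verify(tokens, proposal)   # hypothetical batched call

    # 3. Accept the longest prefix where draft and target agree; at the first
    #    disagreement, keep the target's token instead and stop.
    accepted = []
    for drafted, checked in zip(proposal, verified):
        if drafted == checked:
            accepted.append(drafted)
        else:
            accepted.append(checked)
            break
    return tokens + accepted
```

The appeal is that the draft and target share weights conceptually, so the low-bit copy should agree with the FP8 copy often enough to make the extra verification pass worthwhile.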
It would also be interesting to look into integrating the following 1.58-bit quantisation method:
ggml-org/llama.cpp#5999
I was also curious whether llama.cpp quants other than 4-bit are compatible at all, as I noticed you only provided examples using 4-bit quantisations. My reason for asking is the ability to offload a given number of layers to the GPU and keep the remaining layers on the CPU, which is an incredibly useful feature for working with much larger models and/or longer context lengths (see the sketch below).
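As an illustration of the partial-offload use case, here is how it looks with llama-cpp-python (used purely as an example binding, not necessarily this project; the model path, quant, and layer count are placeholders):

```python
# Partial GPU offload of a non-4-bit GGUF quant via llama-cpp-python.
from llama_cpp import Llama

llm = Llama(
    model_path="models/llama-2-13b.Q5_K_M.gguf",  # placeholder: any GGUF quant, not just 4-bit
    n_gpu_layers=20,   # offload 20 layers to the GPU, keep the remainder on the CPU
    n_ctx=4096,        # longer context becomes feasible when only some layers sit in VRAM
)

print(llm("Hello,", max_tokens=16)["choices"][0]["text"])
```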
Thanks