Quantization in LLM
If we can quantize a 16-bit floating point number down to 8 bits or 4 bits, why not go all the way to 1 bit? It turns out you can, but you'll see significant degradation in the output quality of the language model. One bit is not enough, so we need a little more than that.
You can't have a fraction of a bit, though. Quantization is about reducing the number of bits each floating point number uses so the model fits into the available memory of the device doing the computation. When you have billions of 16-bit floating point parameters, that's a lot of memory.
So we round each number to the nearest value representable with fewer bits. This reduces the precision of every parameter in the model but lets it fit into GPU memory, which is a scarce resource. Doing this allows for faster computation, lower hardware requirements, and less energy consumption.
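To make the rounding idea concrete, here is a minimal NumPy sketch of symmetric post-training quantization to 8-bit integers. The function names and the single per-tensor scale are assumptions for illustration, not the method of any particular library (real LLM quantizers such as GPTQ or AWQ are considerably more sophisticated).

```python
import numpy as np

def quantize_int8(weights: np.ndarray):
    # Choose a scale so the largest-magnitude weight maps to the int8 limit (127).
    scale = np.abs(weights).max() / 127.0 + 1e-12
    # Round to the nearest integer step and clip into the int8 range.
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q: np.ndarray, scale: float) -> np.ndarray:
    # Recover an approximation of the original float weights.
    return q.astype(np.float32) * scale

# Example: 16-bit floats shrink to 8-bit ints at the cost of some precision.
w = np.random.randn(4, 4).astype(np.float16)
q, s = quantize_int8(w.astype(np.float32))
w_hat = dequantize_int8(q, s)
print("max rounding error:", np.abs(w.astype(np.float32) - w_hat).max())
```

Each weight now occupies half the memory of a 16-bit float, and the printed error shows how much precision the rounding cost.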
The term "1.58 bits" refers to using three states per weight: -1, 0, and 1. One bit gives two states (0 and 1); two bits give four, but we use only three of them and throw one away. Since log2(3) ≈ 1.58, the ternary scheme is called 1.58-bit. In reality the hardware still stores 2 bits per weight; the algorithm simply uses 3 of the 4 available states, and the 4th state is never used.
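And a sketch of the ternary case: each weight is snapped to -1, 0, or +1 with a single scale factor, which is roughly the spirit of the 1.58-bit BitNet work. The absmean scaling rule used here is an assumption for illustration only.

```python
import numpy as np

def quantize_ternary(weights: np.ndarray):
    # One scale for the whole tensor (assumed absmean rule for this sketch).
    scale = np.abs(weights).mean() + 1e-8
    # Round and clip so every weight lands in {-1, 0, +1}.
    q = np.clip(np.round(weights / scale), -1, 1).astype(np.int8)
    return q, scale

w = np.random.randn(4, 4).astype(np.float32)
q, s = quantize_ternary(w)
print(np.unique(q))   # only values drawn from {-1, 0, 1}
print(q * s)          # dequantized approximation of w
```

Even though only three states are used, each weight is still stored in 2 bits on real hardware, which is where the gap between 1.58 "information" bits and 2 stored bits comes from.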
What is LLM quantization?
Which Quantization Method is Right for You? (GPTQ vs. GGUF vs. AWQ)
Quantization vs Pruning vs Distillation: Optimizing NNs for Inference
5. Comparing Quantizations of the Same Model - Ollama Course
Quantization in Deep Learning (LLMs)
Part 1-Road To Learn Finetuning LLM With Custom Data-Quantization,LoRA,QLoRA Indepth Intuition
LoRA explained (and a bit about precision and quantization)
Quantization in deep learning | Deep Learning Tutorial 49 (Tensorflow, Keras & Python)
What is LLM Quantization?
LLMs Quantization Crash Course for Beginners
LLaMa GPTQ 4-Bit Quantization. Billions of Parameters Made Smaller and Smarter. How Does it Work?
AWQ for LLM Quantization
Quantization explained with PyTorch - Post-Training Quantization, Quantization-Aware Training
How to Quantize an LLM with GGUF or AWQ
Quantization Explained in 60 Seconds #AI
Understanding: AI Model Quantization, GGML vs GPTQ!
Llama 1-bit quantization - why NVIDIA should be scared
Understanding 4bit Quantization: QLoRA explained (w/ Colab)
QLoRA - Efficient Finetuning of Quantized LLMs
New Tutorial on LLM Quantization w/ QLoRA, GPTQ and Llamacpp, LLama 2
LLM model quantization and how it impacts model performance
Quantization in LLM
Day 61/75 LLM Quantization | How Accuracy is maintained? | How FP32 and INT8 calculations same?
What is Quantization? - LLM Concepts ( EP - 3 ) #quantization #llm #ml #ai #artificialintelligence