2-bit Quantization is Insane! See How to Run Mixtral-8x7B on Free-tier Colab.
A Quick Tutorial on AQLM 2-bit Quantization and Its Implementation
There has been increasing demand to make open-source language models more accessible to end users for local inference and fine-tuning. However, this requires significantly reducing the models' computational resource demands so that they can run on more affordable hardware.
A particularly effective method is quantization: a technique that lowers the bit-width used to represent a model's weights, which shrinks the model's overall size and reduces the amount of data that has to move through RAM.
The traditional approach to quantization involves selecting a quantization grid and a normalizer (scale) for different parts of the model and then mapping the weights onto that grid. The mapping can be simple rounding or a more elaborate allocation scheme. Either way, it inevitably creates a trade-off between size and accuracy, especially under heavy compression, and this trade-off shows up clearly in perplexity, a metric indicating the predictive capability of the model.
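To make this concrete, here is a minimal NumPy sketch of the round-to-nearest variant of that approach: a per-group scale (the normalizer) maps the weights onto a small integer grid, and dequantizing reveals the reconstruction error that drives the size-accuracy trade-off. The variable names and group size are illustrative only.

```python
import numpy as np

# Toy round-to-nearest (RTN) quantization of one weight group to 4 bits.
# The "grid" is the set of integer levels; the "normalizer" is the
# per-group scale that maps float weights onto that grid.
rng = np.random.default_rng(0)
w = rng.normal(size=128).astype(np.float32)       # one group of FP32 weights

bits = 4
levels = 2 ** bits                                # 16 grid points
scale = np.abs(w).max() / (levels / 2 - 1)        # per-group normalizer

q = np.clip(np.round(w / scale), -levels // 2, levels // 2 - 1)  # map onto the grid
w_hat = q * scale                                 # dequantized weights

print("mean squared reconstruction error:", np.mean((w - w_hat) ** 2))
```

At 4 bits the error is already noticeable; pushing the same scheme down to 2 bits is where naive rounding starts to fall apart, which is the regime AQLM targets.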
AQLM
Against this backdrop, a team from Austria recently released AQLM (Additive Quantization of Language Models), a new state-of-the-art method for 2-3 bit LLM quantization. It differs fundamentally from previous methods, which typically require complicated storage formats to handle outlier weights.
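As a rough mental model (not the authors' actual code), additive quantization stores each small group of weights as a few codebook indices and reconstructs the group as the sum of the selected codebook vectors. The sketch below, with made-up shapes and names, shows why this lands at roughly 2 bits per weight.

```python
import numpy as np

# Simplified illustration of the additive-quantization idea behind AQLM:
# each group of g weights is stored as M small codebook indices, and the
# group is reconstructed as the SUM of the selected codebook vectors.
# Shapes and names here are illustrative, not taken from the paper's code.
g, M, K = 8, 2, 256   # group size, codebooks per group, entries per codebook

rng = np.random.default_rng(0)
codebooks = rng.normal(size=(M, K, g)).astype(np.float32)  # learned offline in AQLM
codes = rng.integers(0, K, size=M)                         # two 1-byte indices per group

# Decoding a group: sum the chosen vector from each codebook.
w_hat = codebooks[np.arange(M), codes].sum(axis=0)         # shape (g,)

# Effective bit-width: M * log2(K) index bits spread over g weights.
bits_per_weight = M * np.log2(K) / g
print(f"~{bits_per_weight:.1f} bits per weight (plus codebook overhead)")
```

Because the group is a sum of several code vectors rather than a single rounded value, the representation can absorb large or unusual weights without a separate outlier format.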
To evaluate a quantization method, two factors are typically considered: perplexity (a proxy for the model's intelligence) and speed. Comparing QuIP#-2bit, AQLM-2bit, and the original FP16 weights across three Llama-2 model scales, it is striking that AQLM-2bit consistently beats QuIP#-2bit, and that the 2-bit 70B model reaches a perplexity of about 4 on WikiText2, noticeably better than the original 13B model. In my experience, once perplexity drops below 4, the model's inference quality is good enough for practical applications.
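With numbers like these, a 2-bit Mixtral-8x7B becomes small enough to try on a free-tier Colab T4. The sketch below uses the standard transformers loading path; the model id is an assumption on my part, so check the AQLM authors' Hugging Face page for the currently published 2-bit Mixtral checkpoint.

```python
# Minimal inference sketch for an AQLM-quantized model on a free Colab T4.
# Requires: pip install aqlm[gpu] transformers accelerate
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed checkpoint name -- verify the exact id on the Hugging Face Hub.
model_id = "ISTA-DASLab/Mixtral-8x7b-AQLM-2Bit-1x16-hf"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",   # keep non-quantized parts in their stored dtype
    device_map="auto",    # place layers on the GPU (and offload if VRAM is tight)
)

inputs = tokenizer("Quantization makes large models", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Expect generation to be slower than an FP16 deployment on server hardware, but the point is that a 47B-parameter mixture-of-experts model runs at all within free-tier memory limits.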