2-bit Quantization is Insane! See How to Run Mixtral-8x7B on Free-tier Colab.
A Quick Tutorial on AQLM 2-bit Quantization and Its Implementation
There has been increasing demand to make open-source language models more accessible to end users for local inference and fine-tuning. However, this requires significantly reducing the models' computational resource demands so that they can run on more affordable hardware.
A particularly effective method is quantization: a technique that lowers the bit-width used to represent a model's weights, which shrinks the model's overall size and reduces the amount of data that has to move through RAM.
The traditional approach to quantization involves selecting a quantization grid and a normalizer (scale) for different parts of the model and then mapping the weights onto that grid. The mapping can be simple rounding or a more elaborate allocation scheme. Either way, it inevitably creates a trade-off between size and accuracy, especially under heavy compression, and this trade-off shows up clearly in perplexity, a metric indicating the predictive capability of the model.
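To make this concrete, here is a minimal NumPy sketch of the round-to-nearest variant of that approach: a per-group scale (the normalizer) maps the weights onto a small integer grid, and dequantizing reveals the reconstruction error that drives the size-accuracy trade-off. The variable names and group size are illustrative only.

```python
import numpy as np

# Toy round-to-nearest (RTN) quantization of one weight group to 4 bits.
# The "grid" is the set of integer levels; the "normalizer" is the
# per-group scale that maps float weights onto that grid.
rng = np.random.default_rng(0)
w = rng.normal(size=128).astype(np.float32)       # one group of FP32 weights

bits = 4
levels = 2 ** bits                                # 16 grid points
scale = np.abs(w).max() / (levels / 2 - 1)        # per-group normalizer

q = np.clip(np.round(w / scale), -levels // 2, levels // 2 - 1)  # map onto the grid
w_hat = q * scale                                 # dequantized weights

print("mean squared reconstruction error:", np.mean((w - w_hat) ** 2))
```

At 4 bits the error is already noticeable; pushing the same scheme down to 2 bits is where naive rounding starts to fall apart, which is the regime AQLM targets.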
AQLM
Against this backdrop, a team from Austria recently released AQLM (Additive Quantization of Language Models), a new state-of-the-art method for 2-3 bit LLM quantization. It differs fundamentally from previous methods, which typically require complicated storage formats to handle outlier weights.
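As a rough mental model (not the authors' actual code), additive quantization stores each small group of weights as a few codebook indices and reconstructs the group as the sum of the selected codebook vectors. The sketch below, with made-up shapes and names, shows why this lands at roughly 2 bits per weight.

```python
import numpy as np

# Simplified illustration of the additive-quantization idea behind AQLM:
# each group of g weights is stored as M small codebook indices, and the
# group is reconstructed as the SUM of the selected codebook vectors.
# Shapes and names here are illustrative, not taken from the paper's code.
g, M, K = 8, 2, 256   # group size, codebooks per group, entries per codebook

rng = np.random.default_rng(0)
codebooks = rng.normal(size=(M, K, g)).astype(np.float32)  # learned offline in AQLM
codes = rng.integers(0, K, size=M)                         # two 1-byte indices per group

# Decoding a group: sum the chosen vector from each codebook.
w_hat = codebooks[np.arange(M), codes].sum(axis=0)         # shape (g,)

# Effective bit-width: M * log2(K) index bits spread over g weights.
bits_per_weight = M * np.log2(K) / g
print(f"~{bits_per_weight:.1f} bits per weight (plus codebook overhead)")
```

Because the group is a sum of several code vectors rather than a single rounded value, the representation can absorb large or unusual weights without a separate outlier format.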
To evaluate a quantization method, two factors are typically considered: perplexity (a proxy for the model's intelligence) and speed. Comparing QuIP#-2bit, AQLM-2bit, and the original FP16 weights across three Llama-2 model scales, it is striking that AQLM-2bit consistently beats QuIP#-2bit, and that the 2-bit 70B model reaches a perplexity of about 4 on WikiText2, noticeably better than the original 13B model. In my experience, once perplexity drops below 4, the model's inference quality is good enough for practical applications.
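With numbers like these, a 2-bit Mixtral-8x7B becomes small enough to try on a free-tier Colab T4. The sketch below uses the standard transformers loading path; the model id is an assumption on my part, so check the AQLM authors' Hugging Face page for the currently published 2-bit Mixtral checkpoint.

```python
# Minimal inference sketch for an AQLM-quantized model on a free Colab T4.
# Requires: pip install aqlm[gpu] transformers accelerate
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed checkpoint name -- verify the exact id on the Hugging Face Hub.
model_id = "ISTA-DASLab/Mixtral-8x7b-AQLM-2Bit-1x16-hf"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",   # keep non-quantized parts in their stored dtype
    device_map="auto",    # place layers on the GPU (and offload if VRAM is tight)
)

inputs = tokenizer("Quantization makes large models", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Expect generation to be slower than an FP16 deployment on server hardware, but the point is that a 47B-parameter mixture-of-experts model runs at all within free-tier memory limits.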