The Quantization Underground: How a 4-Line Hack Taught AI Models to Forget Selectively

How a GitHub issue about variable bit rates spawned an underground compression movement that lets 7-billion-parameter models run on a phone.


On April 30, 2023, a user named ikawrakow opened issue #1256 on the llama.cpp GitHub repository. The title was dry: "Variable bit rate quantization."

The post ran a few hundred words. The core argument fit in a sentence: variable bit rates are standard practice in audio and video compression, so why does neural network quantization still treat every weight the same?

ikawrakow ran the experiment. Take a 7-billion parameter model, crush every tensor (a multi-dimensional array of numerical weights that stores what the model has learned) down to 2-bit precision, then selectively restore certain layers to higher fidelity. Measure which restorations actually improve output quality.

The results were lopsided. Restoring feed_forward.w2 to higher precision improved perplexity (a standard measure of prediction accuracy, where lower is better) by 1.08 points. Restoring attention.wv was worth 0.78. But tok_embeddings, the token embedding layer that everyone assumed was critical enough to warrant its own dedicated 16-bit storage format? Less than 0.05 points. You could crush it to 2-bit and the model wouldn't flinch.

The proof of concept was four lines of C++:

if (tensor.name == "output.weight" ||
    tensor.name.find("attention.wv.weight") != std::string::npos ||
    tensor.name.find("feed_forward.w2.weight") != std::string::npos) {
  new_type = GGML_TYPE_Q5_1;
}

A mixed-precision model using Q3 and Q5 tensors beat a standard Q4 quantization across the board. The file was smaller and the quality was higher. Not all weights are created equal, and anyone treating them uniformly was leaving performance on the table.

Most current quantization methods in the local AI ecosystem trace back to this issue.

The Hack Becomes a Method

In 2023, if you downloaded a quantized language model from HuggingFace, it was almost certainly made by one person.

TheBloke, a prolific community quantizer, operated like a one-person factory. Every major open-source model release, from Llama to Mistral to Vicuna, got the treatment: converted to GGML (GPT-Generated Model Language, the original llama.cpp tensor format), quantized at every precision level from Q2 through Q8, uploaded with standardized naming conventions and model cards. The r/LocalLLaMA subreddit treated TheBloke links as canonical. Need a quantized Llama 2 13B? TheBloke had it posted before the discussion thread hit fifty comments.

The approach was blanket quantization. Pick a bit depth, apply it uniformly to every tensor in the model, package the result. Q4_0 for the budget-constrained. Q5_1 for anyone with headroom. Q8_0 for the quality purists. Simple, consistent, and effective enough to bootstrap a community of people running language models on gaming GPUs.

But ikawrakow's variable bit rate experiment had opened a question the project couldn't ignore. The llama.cpp project started developing K-quant formats (Q2_K, Q3_K, Q4_K), which used mixed precision within individual tensors, varying the bit depth at the sub-block level. It was a step toward acknowledging that compression should be uneven. Still, every layer in the model received identical treatment. The K-quants were smarter about how they compressed each layer, but they didn't ask whether some layers deserved more attention than others.
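
To picture what sub-block mixed precision looks like in memory, here is a simplified C++ sketch modeled on ggml's block_q4_K super-block. The field names and sizes follow the public ggml source, but treat the details as illustrative rather than authoritative:

// Simplified sketch of a K-quant super-block, modeled on ggml's block_q4_K;
// illustrative, not a drop-in copy of the real definition.
#include <cstdint>
#include <cstdio>

struct BlockQ4K {
    uint16_t d;           // fp16 super-block scale for the sub-block scales
    uint16_t dmin;        // fp16 super-block scale for the sub-block mins
    uint8_t  scales[12];  // 8 sub-blocks x (6-bit scale + 6-bit min), packed
    uint8_t  qs[128];     // 256 weights x 4 bits, two per byte
};
static_assert(sizeof(BlockQ4K) == 144, "expected tightly packed layout");

int main() {
    // Every weight is still 4-bit, but each 32-weight sub-block carries its
    // own scale and minimum, so effective precision adapts within one tensor.
    printf("bits per weight: %.2f\n", sizeof(BlockQ4K) * 8.0 / 256);  // 4.50
}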

The community was assembling on r/LocalLLaMA, swapping benchmarks, arguing about perplexity measurements, and discovering that the difference between a good and bad quantization could mean the difference between a model that held a conversation and one that devolved into repetitive nonsense. The fundamental question driving all of it was the one ikawrakow had asked: which parts of these models can you throw away?

From Blanket to Calibrated

TheBloke eventually stopped posting. The role of most prolific quantizer passed to bartowski, a community member who maintains one of the most-downloaded quantization collections on HuggingFace.

Where TheBloke had set the standard, bartowski industrialized it. Every new model release, from DeepSeek-R1 to Qwen 3 to Gemma 3 to Llama 3 to Phi-4, got GGUF (GPT-Generated Unified Format, the successor to GGML with better metadata and extensibility) quantizations within hours of dropping. The operation scaled to match the accelerating pace of open-source model releases through 2024. On r/LocalLLaMA, recommending bartowski's quants became reflexive.

The more important shift that year was methodological.

Unsloth, a startup founded in 2023 by Daniel Han and his brother Michael, had started as a fine-tuning optimization project. Their work led them into quantization territory through a technique called the importance matrix, or imatrix. The concept: don't guess which weights matter most. Measure them. Run calibration data through the model, record which channels carry the most signal, then use that map to decide where compression can be aggressive and where it needs to be gentle.
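
A minimal sketch of that measurement step, assuming the common formulation where a channel's importance is its mean squared activation over the calibration set; the class and names here are illustrative, not Unsloth's or llama.cpp's actual internals:

// Accumulate mean squared activation per input channel over a calibration
// run; channels with the most signal earn gentler quantization.
#include <cstdio>
#include <vector>

struct ImportanceAccumulator {
    std::vector<double> sum_sq;  // per-channel running sum of activation^2
    long long tokens = 0;

    explicit ImportanceAccumulator(size_t channels) : sum_sq(channels, 0.0) {}

    // Call once per token with the activations feeding a weight matrix.
    void observe(const std::vector<float>& act) {
        for (size_t c = 0; c < act.size(); ++c)
            sum_sq[c] += (double)act[c] * act[c];
        ++tokens;
    }

    // Channels with large mean squared activation multiply against weights
    // whose rounding errors hurt most, so they matter most to preserve.
    std::vector<double> importance() const {
        std::vector<double> imp(sum_sq.size());
        for (size_t c = 0; c < imp.size(); ++c)
            imp[c] = tokens ? sum_sq[c] / (double)tokens : 0.0;
        return imp;
    }
};

int main() {
    ImportanceAccumulator acc(4);
    acc.observe({0.1f, 2.0f, 0.0f, -3.0f});  // fake calibration activations
    acc.observe({0.2f, 1.5f, 0.1f, -2.5f});
    for (double v : acc.importance()) printf("%.3f ", v);  // channel scores
    printf("\n");
}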

The imatrix approach changed lower-bitrate quantizations significantly. A Q2 or Q3 model guided by importance data retained far more coherence than the same model quantized blind. The calibration step added time to the process but removed guesswork, and the results at the extreme low end of the bit-depth spectrum went from barely functional to surprisingly usable.

GGUF became the universal container format. Models were getting bigger. Llama 3 70B dropped in April 2024. Running it on consumer hardware without quantization was impossible. Running it at Q4 on a 48GB setup was totally doable.

The ecosystem had established its supply chain: labs train models at full precision, community quantizers compress them for consumer hardware, users download and run them through Ollama or LM Studio or koboldcpp. Each link in the chain got more sophisticated in 2024, but the basic structure hasn't changed since.

Surgical Precision

By 2025, the llama.cpp project had added --tensor-type flags that let quantizers specify different precision for individual tensor types within a model. The four-line hack from issue #1256 had become a first-class feature. Instead of saying "make everything Q4," a quantizer could now say "keep attention layers at bfloat16 (bf16, a 16-bit floating point format), crush feed-forward layers to IQ4_XS (a 4-bit information-theoretic quantization variant), and leave the router at full precision."
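
Conceptually, this turns the hardcoded conditional from issue #1256 into a configurable policy table. Here is a hypothetical C++ sketch of that mechanism, not llama.cpp's actual implementation; the tensor-name patterns are illustrative:

// Hypothetical per-tensor override table; first matching rule wins,
// anything unmatched falls back to the default type.
#include <cstdio>
#include <regex>
#include <string>
#include <vector>

enum class QuantType { BF16, IQ4_XS, Q4_K, F32 };

struct TensorRule {
    std::regex pattern;  // matched against the tensor's name
    QuantType  type;     // precision to apply when the pattern matches
};

QuantType pick_type(const std::string& name,
                    const std::vector<TensorRule>& rules,
                    QuantType fallback) {
    for (const auto& r : rules)
        if (std::regex_search(name, r.pattern)) return r.type;
    return fallback;
}

int main() {
    // The mix described above: router at full precision, attention at bf16,
    // feed-forward at IQ4_XS, Q4_K for everything else. The router rule
    // comes first so a name like "ffn_gate_inp" isn't swallowed by "ffn".
    std::vector<TensorRule> rules = {
        {std::regex("gate_inp"), QuantType::F32},
        {std::regex("attn"),     QuantType::BF16},
        {std::regex("ffn"),      QuantType::IQ4_XS},
    };
    QuantType t = pick_type("blk.0.attn_v.weight", rules, QuantType::Q4_K);
    printf("%s\n", t == QuantType::BF16 ? "bf16" : "other");  // prints bf16
}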

dinerburger, a community quantizer specializing in coding models, applied this to Qwen3-Coder-Next. Their quantization kept every attention tensor in bf16 while compressing feed-forward layers to IQ4_XS, using Unsloth's imatrix calibration data to guide the decisions. A 50GB full-precision model shrank to 16GB. The logic: attention layers handle the precision work of tracking relationships between tokens, which is exactly what a coding model needs to get right. Feed-forward layers do the heavy numerical lifting and tolerate more approximation.

nohurry, another community quantizer, took a different angle. Their "heretic" GGUF variants quantized community fine-tunes of Gemma 4 before Google shipped official GGUF versions. The name was self-aware. They published a full reproduction guide so others could learn the process, and credited the chain of people who made it possible: mradermacher, Unsloth, bartowski, ubergarm, aessedai, and "thebloke for the OG quants." On r/LocalLLaMA, someone posted a "quantizers appreciation post" and the response was genuine: nohurry replied, "Feedback is much appreciated, I still have a lot to learn!"

Unsloth's Dynamic 2.0, released in early 2026, pushed the per-layer approach to its logical conclusion. Every possible layer in a model gets a custom-tailored quantization scheme. The scheme is model-specific: the optimal compression map for Gemma 3 differs from Llama 4 differs from Qwen 3.5, because different architectures distribute their intelligence differently across tensor types. Dynamic 2.0 added specialized quantization types (Q4_NL, Q5.1, Q5.0, Q4.1, Q4.0) optimized for Apple Silicon, and worked across both dense and mixture-of-experts (MoE, where only a subset of the model's parameters activate per input) architectures. Their Dynamic 3-bit DeepSeek V3.1 GGUF scored 75.6% on Aider Polyglot, a benchmark that tests code generation across multiple programming languages, beating many full-precision models while being approximately 8GB smaller than standard quants.

The people making these per-layer decisions are volunteers, spending their own GPU hours compressing models for strangers and publishing the results for anyone to download.

The Numbers

Eight quantized models ran through a 14-dimension evaluation suite on a Mac Mini M4 with 24GB unified memory and 14GB allocated to the GPU, using Ollama for inference. Four were Qwen 3.5 9B at standard GGUF quantizations (Q3_K_M through Q8_0). Four were Gemma 4 26B-A4B at Unsloth Dynamic quantizations (Q2_K_XL through IQ4_XS). Claude Opus 4.6 served as the LLM judge for subjective tasks (writing quality, reasoning, multi-turn conversation). Programmatic scoring handled objective tasks (structured output, instruction following, logic, long-context retrieval).

Content Quality

| Model | Bits | Size | Writing | Reasoning | Struct | Instruct | Logic | Long Ctx | Multi-Turn | Score |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Qwen 3.5 9B Q8 | 8 | 9.5G | 70% | 72% | 100% | 71% | 100% | 100% | 77% | 84% |
| Qwen 3.5 9B Q4 | 4 | 5.7G | 70% | 72% | 100% | 86% | 100% | 100% | 77% | 85% |
| Qwen 3.5 9B Q3 | 3 | 4.7G | 73% | 28% | 92% | 43% | 100% | 100% | 87% | 77% |
| Qwen 3.5 9B Q6 | 6 | 7.5G | 73% | 44% | 92% | 43% | 100% | 100% | 50% | 72% |
| Gemma 4 26B IQ4 | 4 | 13.4G | 50% | 67% | 0% | 86% | 100% | 100% | 80% | 71% |
| Gemma 4 26B Q3 | 3 | 12.9G | 40% | 67% | 0% | 0% | 100% | 100% | 93% | 61% |
| Gemma 4 26B IQ3 | 3 | 11.2G | 37% | 67% | 0% | 0% | 100% | 100% | 93% | 60% |
| Gemma 4 26B Q2 | 2 | 10.5G | 30% | 56% | 0% | 0% | 100% | 100% | 100% | 59% |

Format Compliance

| Model | Bits | Writing | Reasoning | Struct | Instruct | Logic | Long Ctx | Multi-Turn | Score |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Qwen 3.5 9B Q8 | 8 | 100% | 67% | 100% | 100% | 100% | 100% | 100% | 96% |
| Qwen 3.5 9B Q4 | 4 | 100% | 50% | 100% | 100% | 100% | 100% | 100% | 94% |
| Qwen 3.5 9B Q3 | 3 | 50% | 67% | 67% | 33% | 100% | 100% | 100% | 73% |
| Qwen 3.5 9B Q6 | 6 | 50% | 67% | 67% | 33% | 100% | 100% | 100% | 73% |
| Gemma 4 26B IQ4 | 4 | 100% | 50% | 0% | 100% | 100% | 100% | 100% | 76% |
| Gemma 4 26B Q3 | 3 | 100% | 50% | 0% | 67% | 100% | 100% | 100% | 71% |
| Gemma 4 26B IQ3 | 3 | 100% | 50% | 0% | 67% | 100% | 100% | 100% | 71% |
| Gemma 4 26B Q2 | 2 | 100% | 50% | 0% | 67% | 100% | 100% | 100% | 71% |

What Broke and What Didn't

Q4 is effectively free. Qwen Q4_K_M scored 85% on content quality versus Q8's 84%. Format compliance: 94% versus 96%. Speed: 10.3 tokens per second versus 7.5. The four-bit model delivered equivalent quality roughly 40% faster while consuming 40% less memory. Across 14 evaluation dimensions, the gap between Q4 and Q8 was indistinguishable from measurement noise. If you're running local models and haven't benchmarked your own use case, Q4_K_M is the safe default.

Logic and long-context retrieval are bulletproof. Every model, including the 2-bit Gemma, scored 100% on logic tasks and long-context retrieval. These are attention-based lookup operations where the model finds a specific piece of information in context and applies a clear rule to it. Quantization noise doesn't compound in these tasks because the reasoning chain is short and the answer is deterministic.

Reasoning has the sharpest cliff. Qwen dropped from 72% at Q4 and Q8 down to 28% at Q3. Three of the benchmark's six high-variance flags landed on reasoning tasks. Chain-of-thought prompting requires the model to maintain coherent logic across a long output sequence, and quantization errors accumulate with each step. The Q4-to-Q3 boundary is where this accumulation crosses a threshold. Research from Liu et al. (April 2025) on quantization's impact on reasoning found similar patterns: W8A8 (8-bit weights with 8-bit activations) and W4A16 (4-bit weights with 16-bit activations) quantization was near-lossless, but lower bit-widths introduced significant accuracy risks, particularly on smaller models.

Format compliance is the canary. Instruction following dropped from 86% at Q4 to 43% at Q6 for Qwen, and from 86% to 0% across most Gemma quants. The model retains the knowledge to produce correct content but loses the discipline to produce it in the requested format. For anyone building pipelines that parse model output, format compliance breaks before content quality degrades. This is the metric that kills your production system while your content benchmarks still look fine.

The 9B model beat the 26B model. Qwen Q4 at 85% overall outperformed Gemma IQ4_XS at 71%, despite Gemma having nearly three times the total parameters. Part of this is architectural: Gemma 4's MoE design activates only 4 billion parameters per inference pass. But the larger factor is format discipline. Gemma scored 0% on JSON structured output at every quantization level. Asked to extract information into JSON, it produced prose descriptions of JSON structure instead. This is a model-level limitation that persists at full precision, not a quantization artifact. Red Hat's study of 500,000 quantized evaluations found "no discernible difference" between 4-bit and full-precision on MMLU-style benchmarks (Massive Multitask Language Understanding, a standardized test of model knowledge). The benchmark data here confirms that finding while adding a caveat: parameter count doesn't compensate for architectural gaps in format compliance.

The Q6 Curse

| Rank | Model | Bits | Content | Format | Combined |
| --- | --- | --- | --- | --- | --- |
| 1 | Qwen 3.5 9B Q8 | 8 | 84% | 96% | 90% |
| 2 | Qwen 3.5 9B Q4 | 4 | 85% | 94% | 90% |
| 3 | Qwen 3.5 9B Q3 | 3 | 77% | 73% | 75% |
| 4 | Gemma 4 26B IQ4 | 4 | 71% | 76% | 74% |
| 5 | Qwen 3.5 9B Q6 | 6 | 72% | 73% | 72% |
| 6 | Gemma 4 26B Q3 | 3 | 61% | 71% | 66% |
| 7 | Gemma 4 26B IQ3 | 3 | 60% | 71% | 65% |
| 8 | Gemma 4 26B Q2 | 2 | 59% | 71% | 65% |

Qwen Q6_K, using 6 bits per weight, ranked fifth out of eight models. Below the 3-bit Qwen Q3. Content quality: 72% for Q6 versus 77% for Q3. Format compliance: identical at 73%. Multi-turn conversation: 50% for Q6 versus 87% for Q3. The six-bit model was measurably worse than the three-bit model across multiple dimensions.

Quantization quality is supposed to degrade monotonically. More bits should mean more precision should mean better output. The entire Q2-Q3-Q4-Q5-Q6-Q8 hierarchy rests on that assumption. Q6_K broke it.

The variance data offers a partial explanation. Q6 had the benchmark's highest variance flag on reasoning (38% standard deviation relative to max score) and a 26% flag on multi-turn. The model's outputs were erratic, swinging between competent and incoherent across runs. Q3, by contrast, was consistently mediocre. It scored lower on average for reasoning but its other dimensions were stable enough to produce a higher total.

No published paper predicted this. Latitude's quantization benchmarks found 0.51% quality loss at Q4_K_M and monotonic degradation thereafter. The broader literature on quantization treats bit-depth as a smooth quality gradient. The Q6 anomaly suggests that certain quantization schemes interact poorly with certain model architectures in ways that perplexity measurements don't capture. Q6_K's specific block structure may introduce a pattern of rounding errors that is more disruptive than Q3_K_M's coarser but more uniform approximation.

Or it could be noise. Three runs per subjective task is thin. The result needs replication. But it's consistent across dimensions: content, format, and multi-turn all point the same direction. Something about Q6_K and Qwen 3.5 9B doesn't mix.

The Other Bottleneck

Weight quantization compresses what a model knows. KV (key-value) cache quantization compresses what a model remembers during a conversation.

The KV cache stores attention computations for every token the model has processed. It grows linearly with context length. On a 24GB GPU running a Q4 model, roughly 6-8GB goes to model weights, leaving 12-16GB for the KV cache. In long conversations or document-heavy workflows, the cache is often what exhausts memory before the model weights become an issue. Weight quantization is the optimization everyone talks about. KV cache quantization is the one that determines whether a model can hold a 30-page document in context.
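
The arithmetic is easy to run yourself. A back-of-envelope sketch using the standard formula (2 tensors, K and V, per layer per token), with made-up but plausible dimensions for a dense grouped-query-attention model:

// Back-of-envelope KV cache sizing; the model dimensions are hypothetical.
#include <cstdio>

int main() {
    const double layers   = 32;
    const double kv_heads = 8;      // grouped-query attention
    const double head_dim = 128;
    const double ctx      = 32768;  // tokens held in context

    double elems_per_token = 2 * layers * kv_heads * head_dim;  // K and V

    const double GiB = 1024.0 * 1024 * 1024;
    double fp16 = elems_per_token * ctx * 2.0 / GiB;  // 2 bytes per element
    double q4   = elems_per_token * ctx * 0.5 / GiB;  // 4 bits per element

    printf("fp16 KV cache at 32K context: %.1f GiB\n", fp16);  // ~4.0 GiB
    printf("4-bit KV cache at 32K context: %.1f GiB\n", q4);   // ~1.0 GiB
}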

Google DeepMind's TurboQuant (arXiv 2504.19874), presented at ICLR (International Conference on Learning Representations) 2026, compressed KV caches from 16-bit to 3.5 bits per channel with what the authors described as "absolute quality neutrality." At 2.5 bits, degradation was marginal. The technique uses random rotations to simplify the geometry of input vectors (a step the authors call PolarQuant), applies scalar quantizers per coordinate, then corrects residual bias with a 1-bit error-correction pass. It's data-oblivious, meaning it works during inference without a calibration step. The authors tested it on Gemma and Mistral across long-context benchmarks including LongBench (a multi-task benchmark for long document understanding), Needle In A Haystack (a test that hides a specific fact in a long document and checks whether the model can retrieve it), and RULER (a synthetic benchmark of configurable long-context retrieval tasks).
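
A toy sketch makes the rotate-then-quantize idea concrete. This is not TurboQuant: random sign flips plus a fast Walsh-Hadamard transform stand in for the paper's rotations, the quantizer is a plain 4-bit uniform grid, and the 1-bit bias-correction pass is omitted. The point is only to show why rotating first helps: an orthonormal rotation preserves quantization error while spreading outlier coordinates across the whole vector, taming the range the quantizer must cover.

// Toy rotate-then-quantize demo; NOT the TurboQuant algorithm.
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <cstdio>
#include <random>
#include <vector>

// In-place fast Walsh-Hadamard transform; size must be a power of two.
void fwht(std::vector<float>& v) {
    size_t n = v.size();
    for (size_t len = 1; len < n; len <<= 1)
        for (size_t i = 0; i < n; i += len << 1)
            for (size_t j = i; j < i + len; ++j) {
                float a = v[j], b = v[j + len];
                v[j] = a + b;
                v[j + len] = a - b;
            }
    float s = 1.0f / std::sqrt((float)n);  // orthonormal scaling
    for (float& x : v) x *= s;
}

int main() {
    const size_t dim = 128;  // e.g. one attention head's key vector
    std::mt19937 rng(42);
    std::normal_distribution<float> gauss;

    // Fixed random sign flips + Hadamard = a cheap structured rotation
    // that spreads any outlier coordinate across the whole vector.
    std::vector<float> sign(dim);
    for (auto& s : sign) s = (rng() & 1) ? 1.0f : -1.0f;

    std::vector<float> key(dim);
    for (auto& x : key) x = gauss(rng);
    key[3] = 8.0f;  // plant an outlier

    std::vector<float> rot = key;
    for (size_t i = 0; i < dim; ++i) rot[i] *= sign[i];
    fwht(rot);

    // Per-vector 4-bit scalar quantization: 16 levels over [-amax, amax].
    float amax = 0.0f;
    for (float x : rot) amax = std::max(amax, std::fabs(x));
    float step = 2.0f * amax / 15.0f;
    double err = 0.0;
    for (size_t i = 0; i < dim; ++i) {
        uint8_t q = (uint8_t)std::lround((rot[i] + amax) / step);
        float   d = q * step - amax;
        err += (rot[i] - d) * (rot[i] - d);
    }

    // The rotation is orthonormal, so error measured here equals error in
    // the original space -- but amax is far smaller after rotating.
    printf("amax after rotation: %.2f\n", amax);
    printf("RMS quantization error: %.4f\n", std::sqrt(err / dim));
}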

The practical implications: a 6x memory reduction on the KV cache means a model that currently handles 32K tokens of context could handle 128K or more on the same hardware. No new GPU required. No model swap. Just a different way of storing what the model has already seen.

The catch is that nobody can run it yet. Someone forked llama.cpp to attempt an implementation and immediately hit CUDA (Nvidia's GPU computing framework) shared memory overflow errors. The llama.cpp discussion thread #20969 is a record of engineering pain. vLLM, a popular open-source inference engine, supports FP8 (8-bit floating point) KV cache quantization today, a modest 2x reduction from 16-bit, but TurboQuant's 3.5-bit scheme hasn't been integrated into any public inference engine.

The weight quantization community spent three years going from blanket compression to surgical per-tensor precision. KV cache quantization is earlier on that same curve. FP8 is the current standard. TurboQuant is the research frontier. And the llama.cpp hackers who turned ikawrakow's four-line experiment into a methodology are already trying to close the gap.

A Q4 model with TurboQuant KV cache running 128K context on a Mac Mini is plausible near-term. The pieces exist separately. The integration work hasn't happened yet.

The Night Shift

The people who maintain this ecosystem don't work at Google or Meta or Anthropic. They work on HuggingFace at odd hours, calibrating importance matrices and arguing about perplexity measurements in comment threads that rarely break a hundred upvotes.

bartowski has quantized virtually every significant open-source model release of the past two years. No public profile. No conference talks. nohurry named their quants "heretic" and published a reproduction guide so others could learn the process. dinerburger developed a selective quantization scheme for coding models and shared it with model cards more detailed than most academic papers. Daniel Han at Unsloth built Dynamic 2.0 while simultaneously filing bug reports against Qwen, Llama, Gemma, and Phi, finding errors in the original models during the process of learning how to compress them.

Epoch AI's data shows that frontier AI performance becomes accessible on consumer hardware within six to twelve months of release. Quantization is the primary mechanism that closes that gap. Every time a lab releases a model that requires a data center to run at full precision, someone in this community figures out how to make it run on a gaming GPU, and the quality loss is smaller than most people expect.

The r/LocalLLaMA appreciation posts for quantizers are a genuine folk tradition. People thank bartowski and nohurry and TheBloke by name, in threads that read like credits at the end of a film. The work is invisible to anyone who downloads a model and runs it. The fact that a 70-billion parameter model fits on a laptop at all is treated as a given, a feature of the format, rather than the result of thousands of hours of experimentation by people who will never see a dime from it.

Three years ago, ikawrakow asked why neural networks don't get variable bit rates. Four lines of C++. The answer is still being written — in calibration scripts, tensor maps, and HuggingFace uploads at 2 AM.

But knowing how quantization works raises the obvious follow-up: what can you actually do with a model running on your own hardware?