Mix-Quant: Quantized Prefilling, Precise Decoding for Agentic LLMs

Haiquan Lu, Zigeng Chen, Gongfan Fang, Xinyin Ma, Xinchao Wang
National University of Singapore
haiquanlu@u.nus.edu, xinchao@nus.edu.sg
Corresponding author
[Figure: Mix-Quant overview]

Abstract

LLM agents have recently emerged as a powerful paradigm for solving complex tasks through planning, tool use, memory retrieval, and multi-step interaction. However, these agentic workflows often introduce substantial input-side overhead, making the compute-intensive prefilling stage a key bottleneck in long-context, multi-turn inference. In this work, we propose Mix-Quant, a simple and effective phase-aware quantization framework for fast agentic inference. We first investigate FP4 quantization in agentic LLM workflows and observe that quantizing the entire inference process can incur significant performance degradation. In contrast, the prefilling stage exhibits substantial quantization redundancy and can therefore be quantized with minimal accuracy loss, despite being the dominant source of computation. Based on this insight, we apply high-throughput NVFP4 quantization to the prefilling phase while preserving BF16 precision for decoding. By decoupling prefilling acceleration from decoding quality, Mix-Quant combines phase-aware algorithmic quantization with hardware-efficient NVFP4 execution to alleviate the inference bottleneck in LLM agents. Extensive experiments across long-context and agentic benchmarks demonstrate that Mix-Quant largely preserves task performance while delivering significant efficiency improvements, achieving up to a 3x speedup during prefilling.

Framework

[Figure: Mix-Quant framework]

Mix-Quant decouples NVFP4 prefilling from BF16 decoding for fast and stable agentic LLM inference.
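
To make the phase split concrete, here is a minimal pure-Python sketch of a layer that takes a quantized path during prefilling and a full-precision path during decoding. It is an illustration under stated assumptions, not the authors' implementation: `fake_quant_nvfp4` and `PhaseAwareLinear` are hypothetical names, Python floats stand in for BF16, the quantizer only simulates rounding to the FP4 (E2M1) value grid with one scale per 16-element block, and no real FP4 kernels are involved. Keeping both weight copies in memory is a choice of this sketch; the text above does not say how weights are stored.

```python
# Illustrative sketch only: simulates NVFP4-style rounding in pure Python.
# Names and structure are hypothetical, not taken from the Mix-Quant code.

E2M1_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)  # magnitudes representable in FP4 E2M1
BLOCK = 16  # NVFP4 shares one scale across each 16-element block


def fake_quant_nvfp4(values):
    """Round each 16-element block to the E2M1 grid using a per-block scale.

    Simplification (assumption): the scale stays a Python float, whereas real
    NVFP4 stores it in FP8 (E4M3) alongside a per-tensor FP32 scale.
    """
    out = []
    for start in range(0, len(values), BLOCK):
        block = values[start:start + BLOCK]
        amax = max(abs(v) for v in block)
        scale = amax / 6.0 if amax > 0 else 1.0  # map the block maximum to 6.0
        for v in block:
            mag = min(E2M1_GRID, key=lambda g: abs(abs(v) / scale - g))
            out.append((mag if v >= 0 else -mag) * scale)
    return out


def linear(x, weight):
    """Compute y = W x for a weight given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weight]


class PhaseAwareLinear:
    """Hypothetical layer: quantized path for prefill, full precision for decode."""

    def __init__(self, weight):
        self.weight = weight  # stands in for the BF16 weights
        self.weight_q = [fake_quant_nvfp4(row) for row in weight]

    def forward(self, x, phase):
        if phase == "prefill":
            # Both activations and weights go through the simulated FP4 rounding.
            return linear(fake_quant_nvfp4(x), self.weight_q)
        if phase == "decode":
            # Decoding keeps the original weights and activations untouched.
            return linear(x, self.weight)
        raise ValueError(f"unknown phase: {phase}")


if __name__ == "__main__":
    weight = [[0.1 * (i - 8) for i in range(16)], [0.05 * i for i in range(16)]]
    x = [0.3 * ((i % 5) - 2) for i in range(16)]
    layer = PhaseAwareLinear(weight)
    print("prefill:", layer.forward(x, "prefill"))
    print("decode: ", layer.forward(x, "decode"))
```

Comparing the two printed outputs shows the rounding error that the prefill path tolerates, while the decode path reproduces the unquantized result exactly.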

Experimental Results

Table 1

Agentic benchmark performance of Mix-Quant

Model                      BFCL v4   LongMemEval   τ2-bench   Avg.

Qwen3-8B                   40.50     57.00         31.06      42.85
Qwen3-8B-NVFP4             38.77     49.82         27.34      38.64
Mix-Quant                  40.63     54.85         28.86      41.45

Qwen3.5-9B                 58.96     86.27         86.69      77.31
Qwen3.5-9B-NVFP4           56.86     78.00         76.26      70.37
Mix-Quant                  57.89     84.27         81.89      74.68

Gemma-4-26B-A4B-it         53.07     80.50         64.63      66.07
Gemma-4-26B-A4B-it-NVFP4   48.13     62.42         57.31      55.95
Mix-Quant                  51.94     72.45         60.62      61.67

Gemma-4-31B-it             68.60     90.80         73.50      77.63
Gemma-4-31B-it-NVFP4       68.69     89.20         70.74      76.21
Mix-Quant                  68.19     90.40         72.84      77.14
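
The Avg. column in Table 1 matches the unweighted mean of the three benchmark scores. As a quick check, the sketch below recomputes it for the Qwen3-8B rows and also reports how much of the BF16-to-NVFP4 average drop Mix-Quant closes. The "gap recovered" figure is an illustrative summary introduced here, not a metric reported by the authors.

```python
from statistics import mean

# Scores copied from Table 1 (BFCL v4, LongMemEval, τ2-bench) for Qwen3-8B.
scores = {
    "BF16": (40.50, 57.00, 31.06),
    "NVFP4": (38.77, 49.82, 27.34),
    "Mix-Quant": (40.63, 54.85, 28.86),
}

# Unweighted mean per configuration, as in the Avg. column.
avg = {name: mean(vals) for name, vals in scores.items()}

# Illustrative metric (not from the source): share of the BF16-to-NVFP4
# average drop that Mix-Quant closes.
gap_recovered = (avg["Mix-Quant"] - avg["NVFP4"]) / (avg["BF16"] - avg["NVFP4"])

for name, value in avg.items():
    print(f"{name:10s} avg = {value:.2f}")
print(f"gap recovered = {gap_recovered:.0%}")
```

For Qwen3-8B, Mix-Quant recovers roughly two thirds of the average drop caused by full NVFP4 inference; swapping in another model's rows from the table repeats the check.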

Table 2

Reasoning and long-context benchmark performance of Mix-Quant

                           Reasoning                   Long Context
Model                      MATH500   AIME24   AIME25   LongBench-V2   AA-LCR   Avg.

Qwen3-8B                   93.73     75.54    67.77    39.86          33.67    62.11
Qwen3-8B-NVFP4             94.12     66.53    55.33    35.46          24.67    55.22
Mix-Quant                  94.40     76.66    66.66    38.60          28.67    61.00

Qwen3.5-9B                 94.85     68.89    60.00    55.47          81.00    72.04
Qwen3.5-9B-NVFP4           93.46     54.44    40.00    50.40          78.00    63.26
Mix-Quant                  94.32     70.33    56.67    52.29          79.33    70.59

Gemma-4-26B-A4B-it         95.86     77.67    65.33    53.83          67.00    71.94
Gemma-4-26B-A4B-it-NVFP4   95.20     75.33    62.22    48.15          50.67    66.31
Mix-Quant                  95.40     78.67    69.67    51.57          64.33    71.93

Gemma-4-31B-it             97.33     92.22    82.22    63.35          76.67    82.36
Gemma-4-31B-it-NVFP4       97.75     83.33    80.00    58.55          71.67    78.26
Mix-Quant                  97.20     93.33    81.11    61.64          73.67    81.39

Figure 1

End-to-end prefill latency speedup of Mix-Quant over the BF16 baselines on NVIDIA RTX 5090 GPUs.


BibTeX

@article{lu2026mixquant,
  title={Mix-Quant: Quantized Prefilling, Precise Decoding for Agentic LLMs},
  author={Lu, Haiquan and Chen, Zigeng and Fang, Gongfan and Ma, Xinyin and Wang, Xinchao},
  journal={arXiv preprint arXiv:2605.20315},
  year={2026}
}