Mix-Quant

Abstract

LLM agents have recently emerged as a powerful paradigm for solving complex tasks through planning, tool use, memory retrieval, and multi-step interaction. However, these agentic workflows often introduce substantial input-side overhead, making the compute-intensive prefilling stage a key bottleneck in long-context, multi-turn inference. In this work, we propose Mix-Quant, a simple and effective phase-aware quantization framework for fast agentic inference. We first investigate FP4 quantization in agentic LLM workflows and observe that quantizing the entire inference process can incur significant performance degradation. In contrast, the prefilling stage exhibits substantial quantization redundancy and can therefore be quantized with minimal accuracy loss, despite being the dominant source of computation. Based on this insight, we apply high-throughput NVFP4 quantization to the prefilling phase while preserving BF16 precision for decoding. By decoupling prefilling acceleration from decoding quality, Mix-Quant combines phase-aware algorithmic quantization with hardware-efficient NVFP4 execution to alleviate the inference bottleneck in LLM agents. Extensive experiments across long-context and agentic benchmarks demonstrate that Mix-Quant largely preserves task performance while delivering significant efficiency improvements, achieving up to a 3x speedup during prefilling.

Framework

Mix-Quant decouples NVFP4 prefilling from BF16 decoding for fast and stable agentic LLM inference.
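
As a rough illustration of this phase-aware split, the Python sketch below simulates it with NumPy. It is not the authors' implementation. The names `fake_quant_fp4`, `linear`, and the `phase` argument are hypothetical, float32 stands in for BF16, and quantizing both activations and weights during prefill is an assumption. The per-block scale is also kept in full precision here, whereas real NVFP4 stores it in FP8.

```python
import numpy as np

# Magnitudes representable by a 4-bit E2M1 float (the sign is a separate bit).
E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0], dtype=np.float32)
BLOCK = 16  # NVFP4 shares one scale across each block of 16 elements


def fake_quant_fp4(x: np.ndarray) -> np.ndarray:
    """Simulated FP4 rounding: quantize, then dequantize back to float32.

    Assumes the last dimension of x is a multiple of BLOCK.
    """
    flat = x.astype(np.float32).reshape(-1, BLOCK)
    amax = np.abs(flat).max(axis=1, keepdims=True)
    # Map each block's largest magnitude onto the top grid value (6.0).
    scale = np.where(amax > 0, amax / E2M1_GRID[-1], 1.0)
    scaled = flat / scale
    # Pick the nearest representable magnitude for every element.
    idx = np.abs(np.abs(scaled)[..., None] - E2M1_GRID).argmin(axis=-1)
    quantized = np.sign(scaled) * E2M1_GRID[idx]
    return (quantized * scale).reshape(x.shape)


def linear(x: np.ndarray, w: np.ndarray, phase: str) -> np.ndarray:
    """Compute x @ w.T, using simulated FP4 operands only during prefill."""
    if phase == "prefill":
        return fake_quant_fp4(x) @ fake_quant_fp4(w).T
    if phase == "decode":
        return x @ w.T  # full-precision path (float32 stands in for BF16)
    raise ValueError(f"unknown phase: {phase!r}")


rng = np.random.default_rng(0)
w = rng.standard_normal((32, 64)).astype(np.float32)
prompt = rng.standard_normal((8, 64)).astype(np.float32)  # 8 prompt tokens
step = rng.standard_normal((1, 64)).astype(np.float32)    # 1 decode token

prefill_err = float(np.abs(linear(prompt, w, "prefill") - prompt @ w.T).mean())
decode_err = float(np.abs(linear(step, w, "decode") - step @ w.T).max())
```

In this sketch the decode path matches the unquantized product exactly, while the prefill path carries a nonzero rounding error. On real hardware the prefill branch would run on FP4 tensor cores rather than on simulated rounding.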

Experimental Results

Table 1

Agentic benchmark performance of Mix-Quant

Model	BFCL v4	LongMemEval	τ²-bench	Avg.
Qwen3-8B	40.50	57.00	31.06	42.85
Qwen3-8B-NVFP4	38.77	49.82	27.34	38.64
Mix-Quant	40.63	54.85	28.86	41.45
Qwen3.5-9B	58.96	86.27	86.69	77.31
Qwen3.5-9B-NVFP4	56.86	78.00	76.26	70.37
Mix-Quant	57.89	84.27	81.89	74.68
Gemma-4-26B-A4B-it	53.07	80.50	64.63	66.07
Gemma-4-26B-A4B-it-NVFP4	48.13	62.42	57.31	55.95
Mix-Quant	51.94	72.45	60.62	61.67
Gemma-4-31B-it	68.60	90.80	73.50	77.63
Gemma-4-31B-it-NVFP4	68.69	89.20	70.74	76.21
Mix-Quant	68.19	90.40	72.84	77.14
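
To make the comparison in Table 1 easier to read, the short Python sketch below computes how many average points each quantized variant loses relative to its BF16 baseline. The numbers are copied from the Avg. column above. The names `table1_avg` and `gaps` are illustrative only.

```python
# (BF16 baseline, full NVFP4, Mix-Quant) values from the Avg. column of Table 1.
table1_avg = {
    "Qwen3-8B": (42.85, 38.64, 41.45),
    "Qwen3.5-9B": (77.31, 70.37, 74.68),
    "Gemma-4-26B-A4B-it": (66.07, 55.95, 61.67),
    "Gemma-4-31B-it": (77.63, 76.21, 77.14),
}


def gaps(rows):
    """Return {model: (points lost by full NVFP4, points lost by Mix-Quant)}."""
    return {
        model: (round(bf16 - nvfp4, 2), round(bf16 - mix, 2))
        for model, (bf16, nvfp4, mix) in rows.items()
    }


per_model = gaps(table1_avg)
mean_nvfp4_gap = sum(g[0] for g in per_model.values()) / len(per_model)
mean_mix_gap = sum(g[1] for g in per_model.values()) / len(per_model)

for model, (nvfp4_gap, mix_gap) in per_model.items():
    print(f"{model}: NVFP4 -{nvfp4_gap:.2f}, Mix-Quant -{mix_gap:.2f}")
```

Averaged over the four models, full NVFP4 loses about 5.67 points against BF16, while Mix-Quant loses about 2.23.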

Table 2

Reasoning and long-context benchmark performance of Mix-Quant

Model	Reasoning			Long Context		Avg.
Model	MATH500	AIME24	AIME25	LongBench-V2	AA-LCR	Avg.
Qwen3-8B	93.73	75.54	67.77	39.86	33.67	62.11
Qwen3-8B-NVFP4	94.12	66.53	55.33	35.46	24.67	55.22
Mix-Quant	94.40	76.66	66.66	38.60	28.67	61.00
Qwen3.5-9B	94.85	68.89	60.00	55.47	81.00	72.04
Qwen3.5-9B-NVFP4	93.46	54.44	40.00	50.40	78.00	63.26
Mix-Quant	94.32	70.33	56.67	52.29	79.33	70.59
Gemma-4-26B-A4B-it	95.86	77.67	65.33	53.83	67.00	71.94
Gemma-4-26B-A4B-it-NVFP4	95.20	75.33	62.22	48.15	50.67	66.31
Mix-Quant	95.40	78.67	69.67	51.57	64.33	71.93
Gemma-4-31B-it	97.33	92.22	82.22	63.35	76.67	82.36
Gemma-4-31B-it-NVFP4	97.75	83.33	80.00	58.55	71.67	78.26
Mix-Quant	97.20	93.33	81.11	61.64	73.67	81.39

Figure 1

End-to-end prefill latency speedup of Mix-Quant over the BF16 baselines on NVIDIA RTX 5090 GPUs.


BibTeX

@article{lu2026mixquant,
  title={Mix-Quant: Quantized Prefilling, Precise Decoding for Agentic LLMs},
  author={Lu, Haiquan and Chen, Zigeng and Fang, Gongfan and Ma, Xinyin and Wang, Xinchao},
  journal={arXiv preprint arXiv:2605.20315},
  year={2026}
}