Quantization of Large Language Models for Integer-Only Hardware
Master's Thesis · ~96 pages · English
Abstract
Deploying Large Language Models on edge devices poses significant computational challenges. This thesis evaluates quantization as a compression method enabling efficient inference on integer-only hardware accelerators. Comparing post-training quantization (PTQ) and quantization-aware training (QAT), the analysis finds that INT8 keeps accuracy within 1% of the FP32 baseline, while INT4 typically requires mixed-precision strategies, particularly for attention mechanisms.
1. Introduction
The proliferation of Large Language Models has created unprecedented demand for efficient deployment strategies. While models like GPT-4 and LLaMA demonstrate remarkable capabilities, their computational requirements—often exceeding 70 billion parameters—preclude direct deployment on resource-constrained devices.
Quantization emerges as a promising solution, reducing precision from standard FP32 to INT8 or INT4 formats. This thesis systematically evaluates quantization methodologies across multiple LLM architectures, providing actionable guidelines for hardware-aware deployment.
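To make the FP32-to-INT8 reduction concrete, the following is a minimal sketch of symmetric per-tensor quantization (one of several schemes a thesis like this would evaluate); the function names and the choice of a symmetric scheme are illustrative, not taken from the thesis itself:

```python
import numpy as np

def quantize_int8(x: np.ndarray):
    """Symmetric per-tensor quantization: map FP32 values to INT8 codes.

    The largest magnitude in the tensor is mapped to 127, so the
    round-off error of any in-range value is at most half a step (scale/2).
    """
    scale = np.max(np.abs(x)) / 127.0  # one scale for the whole tensor
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover an FP32 approximation from the INT8 codes."""
    return q.astype(np.float32) * scale

# Example: quantize a random weight tensor and measure reconstruction error.
rng = np.random.default_rng(0)
weights = rng.standard_normal(256).astype(np.float32)
q, scale = quantize_int8(weights)
recovered = dequantize(q, scale)
max_err = float(np.max(np.abs(weights - recovered)))
```

On integer-only hardware, the INT8 codes feed integer matrix-multiply units directly, and the per-tensor `scale` is folded into a single rescaling step at the output.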
2. Research Questions
RQ1: What is the minimum bit-width that preserves acceptable accuracy across diverse NLP tasks?
RQ2: How do PTQ and QAT approaches compare for integer-only hardware deployment?
RQ3: What mixed-precision configurations optimize accuracy-latency trade-offs for different layer types?
RQ4: How does quantization impact performance across varying model scales (7B–70B parameters)?
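Central to RQ1 and RQ2 is how PTQ chooses quantization ranges without retraining. A common PTQ technique is calibration: estimate activation scales from a small calibration set, clipping rare outliers via a percentile rather than the absolute maximum. The sketch below illustrates the idea under assumed names (`calibrate_scale` and the 99.9th-percentile threshold are illustrative choices, not the thesis's method):

```python
import numpy as np

def calibrate_scale(activations: np.ndarray, percentile: float = 99.9) -> float:
    """PTQ-style calibration: derive an INT8 scale from observed activations.

    Clipping at a high percentile trades a little saturation error on
    outliers for a finer step size on the bulk of the distribution.
    """
    clip = float(np.percentile(np.abs(activations), percentile))
    return clip / 127.0

def quantize(x: np.ndarray, scale: float) -> np.ndarray:
    """Quantize with a precomputed calibration scale; outliers saturate."""
    return np.clip(np.round(x / scale), -127, 127).astype(np.int8)

# Example: calibrate on one batch of activations, quantize another.
rng = np.random.default_rng(1)
calib_batch = rng.standard_normal(10_000)   # stand-in for real activations
scale = calibrate_scale(calib_batch)
q = quantize(rng.standard_normal(1_000), scale)
```

QAT, by contrast, simulates this rounding inside the training loop so the weights adapt to it, which is why it typically tolerates lower bit-widths than calibration-only PTQ.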
3. Key Contributions
This thesis offers four primary contributions to the field:
1. Comprehensive quantization benchmarking across multiple LLM architectures including LLaMA, Mistral, and Falcon
2. Hardware-aware analysis targeting ARM Cortex processors, NVIDIA Tensor Cores, and custom NPUs
3. Practical INT4/INT8 deployment guidelines with empirical accuracy guarantees
4. Novel mixed-precision strategies optimizing the accuracy-latency frontier
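A mixed-precision strategy of the kind contribution 4 refers to can be sketched as a per-layer precision plan: attention layers, which the abstract notes are sensitive to aggressive quantization, stay at INT8, while feed-forward weights drop to group-wise INT4. The layer names, group size, and plan structure below are hypothetical illustrations, not the thesis's actual configuration:

```python
import numpy as np

def quantize_groupwise_int4(w: np.ndarray, group_size: int = 128):
    """Group-wise symmetric INT4 quantization: one scale per group of
    `group_size` weights, with codes restricted to [-7, 7].

    Per-group scales localize the effect of outlier weights, which is why
    group-wise schemes are standard for INT4 weight quantization.
    """
    groups = w.reshape(-1, group_size)
    scales = np.max(np.abs(groups), axis=1, keepdims=True) / 7.0
    scales = np.where(scales == 0, 1.0, scales)  # guard all-zero groups
    q = np.clip(np.round(groups / scales), -7, 7).astype(np.int8)
    return q, scales

# Hypothetical per-layer precision plan (illustrative names):
precision_plan = {"attn.qkv": 8, "attn.out": 8, "mlp.up": 4, "mlp.down": 4}

rng = np.random.default_rng(2)
w = rng.standard_normal(512).astype(np.float32)
q, scales = quantize_groupwise_int4(w)
```

The accuracy-latency frontier is then explored by varying which layers appear in the INT4 set and how small the groups are.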