Efficient Deployment of Low-Bit Large Language Models on CPU
The key innovation of T-MAC lies in adopting a lookup table (LUT) based computation paradigm, rather than the traditional multiply-accumulate (MAC) computation paradigm. T-MAC directly supports low-bit computation using lookup tables, thereby eliminating the dequantization operations necessary in other systems, and significantly reducing the number of multiplication and addition operations.
Experiments have shown that T-MAC delivers excellent performance: on a Surface AI PC equipped with the latest Qualcomm Snapdragon X Elite chipset, the 3B BitNet-b1.58 model can generate 48 tokens per second, the 2-bit 7B llama model 30 tokens per second, and the 4-bit 7B llama model 20 tokens per second.
This even surpasses NPU performance!
When deploying the llama-2-7b-4bit model, the NPU can generate 10.4 tokens per second, while the CPU with T-MAC reaches 12.6 tokens per second using only two cores, and up to 22 tokens per second at its peak.
These all far exceed the average human reading speed and represent a 4 to 5-fold improvement over the original llama.cpp framework.
Even on lower-end devices like the Raspberry Pi 5, T-MAC can achieve a generation rate of 11 tokens per second for the 3B BitNet-b1.58. T-MAC also has significant power consumption advantages: to achieve the same generation rate, T-MAC requires only 1/4 to 1/6 of the cores compared to the original llama.cpp, reducing energy consumption while leaving computational resources for other applications.
Notably, T-MAC's computational performance increases linearly as the number of bits decreases, a phenomenon difficult to observe on GPUs and NPUs that rely on dequantization. At 2 bits, T-MAC achieves 10 tokens per second on a single core and 28 tokens per second on four cores, greatly surpassing the NPU.
Matrix Multiplication Without Multiplication, Just Table Lookup (LUT)
For low-bit parameters (weights), T-MAC splits them into individual bits and groups the bits (for example, 4 bits per group), precomputes all possible partial sums between these bit patterns and the activation vector, and stores them in a LUT.
T-MAC then uses shift and accumulate operations to scale from 1 bit up to 4 bits. In this way, T-MAC abandons the inefficient FMA (fused multiply-add) instructions on CPUs in favor of the more power-efficient and higher-throughput TBL/PSHUF (table lookup) instructions.
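To make the idea concrete, here is a minimal sketch in Python/NumPy (not T-MAC's actual C++/TVM kernels) for a single row of 1-bit weights: the partial sums for every possible 4-bit weight pattern are precomputed once from the activations, and the dot product then reduces to table lookups. Real low-bit weights are signed and carry scales; those details are omitted here.

```python
import numpy as np

def build_lut(activations, g=4):
    """Precompute, for each group of g activations, the partial sum for
    every possible g-bit (0/1) weight pattern: 2**g entries per group."""
    K = len(activations)
    assert K % g == 0
    groups = activations.reshape(-1, g)                     # (K//g, g)
    patterns = np.array([[(p >> i) & 1 for i in range(g)]   # (2**g, g)
                         for p in range(2 ** g)], dtype=activations.dtype)
    return groups @ patterns.T                              # (K//g, 2**g)

def lut_dot(weight_bits, lut, g=4):
    """Dot product of a 1-bit (0/1) weight row with the activations,
    done purely by indexing into the precomputed table."""
    idx = weight_bits.reshape(-1, g) @ (1 << np.arange(g))  # pack each group into a 4-bit index
    return lut[np.arange(len(idx)), idx].sum()

# Tiny usage example.
rng = np.random.default_rng(0)
a = rng.standard_normal(16).astype(np.float32)
w = rng.integers(0, 2, 16)                                  # 1-bit weights
lut = build_lut(a)
print(np.allclose(lut_dot(w, lut), np.dot(w, a)))           # True
```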
Bit-Centric Computation, Replacing Data Type-Centric Computation
Traditional dequantization-based computation is actually data type-centric computation, which requires customization for each different data type.
Each combination of activation and weight bit widths, such as W4A16 (int4 weights with float16 activations) and W2A8, requires a specific weight layout and computation kernel.
For example, the W3 layout needs to pack 2 bits and the remaining 1 bit separately, using different interleaving or shuffling methods for memory alignment or fast decoding; the corresponding computation kernel then has to unpack this specific layout into data types the hardware supports before it can execute.
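For contrast, here is a schematic sketch of such a dequantization-based path for W3 (an illustrative layout, not the exact layout or kernel of any particular library): the weights are stored as a 2-bit plane plus a 1-bit plane and must be unpacked and dequantized before a conventional multiply-accumulate can run.

```python
import numpy as np

def pack_w3(w):
    """Store 3-bit weight codes as a 2-bit plane plus a 1-bit plane (illustrative)."""
    low = w & 0b011           # 2-bit plane
    high = (w >> 2) & 0b001   # 1-bit plane
    return low, high

def dequantize_and_dot(low, high, scale, activations):
    """The kernel must unpack and dequantize before the usual FMA-style dot product."""
    w = (high << 2) | low                        # unpack back to 3-bit integer codes
    w_fp = (w.astype(np.float32) - 4) * scale    # dequantize to float (assumed zero point 4)
    return np.dot(w_fp, activations)             # conventional multiply-accumulate

rng = np.random.default_rng(0)
w3 = rng.integers(0, 8, 16)                      # 3-bit unsigned codes
a = rng.standard_normal(16).astype(np.float32)
low, high = pack_w3(w3)
print(dequantize_and_dot(low, high, 0.1, a))
```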
By viewing low-bit matrix multiplication from a bit perspective, T-MAC only needs to design an optimal data structure for a single bit and then scale to 2/3/4 bits by stacking.
At the same time, for activation vectors of different precisions (float16/float32/int8), only the table-building step needs to change; the table-lookup stage no longer has to handle different data structures.
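Continuing the earlier NumPy sketch (and reusing its build_lut and lut_dot helpers), multi-bit weights can be handled by splitting them into 1-bit planes and combining the per-plane lookups with shifts, so only the single-bit structure ever needs to be designed; signed weights and scales are again omitted.

```python
import numpy as np

def lut_dot_multibit(weights, lut, bits, g=4):
    """Handle b-bit weights by decomposing them into b one-bit planes and
    reusing the same 1-bit lookup kernel, combining the planes with shifts."""
    total = 0.0
    for b in range(bits):
        plane = (weights >> b) & 1                            # one bit plane
        total += float(lut_dot(plane, lut, g)) * (1 << b)     # shift and accumulate
    return total

rng = np.random.default_rng(1)
a = rng.standard_normal(16).astype(np.float32)
w4 = rng.integers(0, 16, 16)                                  # 4-bit unsigned weight codes
lut = build_lut(a)                                            # same table as before
print(np.allclose(lut_dot_multibit(w4, lut, bits=4), np.dot(w4, a)))  # True
```

Because each extra bit adds exactly one more pass of table lookups over the same table, the lookup work in this sketch scales with the bit count, which is the linear behavior described below.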
Moreover, with traditional dequantization-based methods, when reducing from 4-bit to 3/2/1-bit, although memory usage is less, the computational load does not decrease, and performance may even worsen due to the increased overhead of dequantization.
However, T-MAC's computational load decreases linearly as the number of bits decreases, thus bringing better acceleration at lower bits, providing an efficient deployment solution for the latest 1-bit/2-bit models released by works such as BitNet and EfficientQAT.
Highly Optimized Operator Implementation
Bit-centric computation has many advantages, but implementing it on CPUs still poses significant challenges:
(1) Compared to the sequential access of activations and weights, table access is random, so keeping the tables resident in fast on-chip memory is particularly important for final inference performance;
(2) On-chip memory is limited, however, and the lookup table (LUT) method uses more of it than traditional mpGEMV, because the tables must store the partial sums of the activation vector with all possible bit patterns, which is far more data than the activations themselves.
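A rough accounting under the grouping used in the sketches above (group size g = 4, a float16 table assumed purely for illustration) shows why: each group of g activations expands into 2^g precomputed partial sums.

```python
# Rough LUT footprint accounting for group size g = 4 (illustrative float16 table).
K, g = 4096, 4
activation_bytes = K * 2                  # float16 activations
lut_bytes = (K // g) * (2 ** g) * 2       # (K/g) groups x 2**g float16 entries each
print(lut_bytes / activation_bytes)       # 4.0 -> the table is 4x the activations
```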
To address this, researchers at Microsoft Research Asia delved into the data flow of table-based computation and designed efficient data structures and computational processes for this computational paradigm, including:
- Storing LUTs in on-chip memory to leverage the table lookup vector instructions (TBL/PSHUF) on CPUs and improve random access performance.
- Changing the matrix axis computation order to maximize the data reuse rate of the limited LUTs placed in on-chip memory (see the loop-nest sketch after this list).
- Designing optimal matrix tiling methods specifically for table lookup, combined with autotvm to search for the optimal tiling parameters.
- Layout optimization for parameter weights:
a) Weight reordering to maximize continuous access and improve cache hit rate
b) Weight interleaving to improve decoding efficiency
- Targeted optimizations for Intel/ARM CPUs, including:
a) Register reordering for fast table building
b) Fast 8-bit accumulation using averaging instructions
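As referenced above, here is a loop-nest sketch of the reuse idea, again built on the simplified NumPy kernel from earlier (build_lut and lut_dot) rather than T-MAC's actual tiled code: the table for one K-tile of activations is built once and then reused across every output row before moving to the next tile.

```python
import numpy as np

def lut_mpgemv(weight_bits, activations, g=4, k_tile=256):
    """1-bit weight matrix (M x K) times activation vector (K,), with the loop
    order chosen so that each K-tile's lookup table is built once and reused
    across all M rows before moving on."""
    M, K = weight_bits.shape
    out = np.zeros(M, dtype=np.float32)
    for k0 in range(0, K, k_tile):                            # outer loop over K tiles
        lut = build_lut(activations[k0:k0 + k_tile], g)       # build the tile's table once
        for m in range(M):                                    # inner loop reuses it for every row
            out[m] += lut_dot(weight_bits[m, k0:k0 + k_tile], lut, g)
    return out

rng = np.random.default_rng(2)
M, K = 8, 512
W = rng.integers(0, 2, (M, K))
a = rng.standard_normal(K).astype(np.float32)
print(np.allclose(lut_mpgemv(W, a), W @ a))                   # True
```

Keeping the row loop innermost means each table built for a tile is read M times before it is discarded, which is the reuse property the optimizations above aim to preserve within on-chip memory.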
The researchers applied various optimizations step by step on a basic implementation, ultimately achieving significant acceleration relative to SOTA low-bit operators: