20x inference acceleration for long sequence length tasks on Intel Xeon Max Series CPUs
Intel recently released the Xeon Max Series CPU, a version of the 4th Gen Intel Xeon Scalable processors. It is the first x86 CPU with high bandwidth memory (HBM), offering 3x the memory bandwidth of processors without this technology.
CHALLENGE
Processing Long Text Sequences Is Often Bandwidth-Limited
Large language models (LLMs), such as GPT and BERT, are becoming increasingly sophisticated, with the ability to generate human-like responses and analyze large amounts of unstructured data. However, running these models efficiently in production for long sequence length tasks, such as analysis of long articles or documents, has become a challenge, as the models often face memory bandwidth limitations. As a result, the processor’s computational power is frequently underutilized, forcing many to run inference on GPUs instead of CPUs, which is far less cost-efficient and much more time-intensive to maintain.
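To make the bandwidth argument concrete, here is a minimal roofline-style sketch in Python, using assumed order-of-magnitude hardware numbers rather than published specs: when a kernel's arithmetic intensity (FLOPs per byte moved) falls below the machine balance, the cores sit idle waiting on memory, and higher memory bandwidth lowers that threshold.

```python
# Roofline-style back-of-envelope with ASSUMED order-of-magnitude numbers
# (not measured or published specs). A kernel whose arithmetic intensity
# falls below the machine balance is memory-bandwidth-bound.
PEAK_FLOPS = 100e12   # assumed peak bf16 FLOP/s for a modern server CPU
DDR_BW     = 300e9    # assumed DDR5 memory bandwidth, bytes/s
HBM_BW     = 1000e9   # assumed HBM memory bandwidth, bytes/s

for name, bw in [("DDR5", DDR_BW), ("HBM", HBM_BW)]:
    balance = PEAK_FLOPS / bw  # FLOPs/byte needed to stay compute-bound
    print(f"{name}: kernels need >= {balance:.0f} FLOPs/byte "
          f"to avoid being bandwidth-bound")
```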
SOLUTION
Running Numenta’s AI Inference Server on Intel Xeon CPU Max Series processors
To demonstrate the advantages of Numenta running on Intel's latest processor, we chose to show BERT-Large inference throughput improvements at a long, 512-token sequence length. Leveraging Intel Advanced Matrix Extensions (Intel AMX), Numenta applied our neuroscience-based techniques to a custom-trained version of the BERT-Large model on the Intel Xeon CPU Max Series processor and compared it with a standard BERT-Large model running on a current AMD Milan system.
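For readers who want to run a comparable measurement, the sketch below is a minimal CPU throughput benchmark, not Numenta's optimized model or actual harness; it uses the public bert-large-uncased checkpoint as a stand-in. On 4th Gen Xeon processors, PyTorch's bfloat16 autocast can route matrix multiplies to Intel AMX through the oneDNN backend.

```python
# Minimal CPU inference-throughput benchmark at a 512-token sequence length.
# bert-large-uncased is a public stand-in for the custom-trained model.
import time
import torch
from transformers import AutoModel, AutoTokenizer

MODEL, SEQ_LEN, BATCH, ITERS = "bert-large-uncased", 512, 8, 20

tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModel.from_pretrained(MODEL).eval()

# Dummy batch padded to the full 512-token sequence length.
inputs = tokenizer(["example text"] * BATCH, padding="max_length",
                   max_length=SEQ_LEN, truncation=True, return_tensors="pt")

# bfloat16 autocast lets oneDNN dispatch matmuls to AMX on 4th Gen Xeon.
with torch.inference_mode(), torch.autocast("cpu", dtype=torch.bfloat16):
    for _ in range(3):                      # warm-up passes
        model(**inputs)
    start = time.perf_counter()
    for _ in range(ITERS):
        model(**inputs)
    elapsed = time.perf_counter() - start

print(f"{BATCH * ITERS / elapsed:.2f} sequences/sec at seq_len={SEQ_LEN}")
```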
RESULTS
20x inference throughput for long sequence lengths
This synergistic combination of Numenta and Intel technology led to a 20x gain in inference throughput for LLMs with extended sequence lengths compared to AMD Milan processors. Furthermore, Numenta's optimized model still achieves an order-of-magnitude speedup on the standard 4th Gen Xeon CPU without the HBM capability, a processor better suited to shorter sequence lengths.
BENEFITS
Slash Costs, Boost Efficiency for Large Language Models
Running Numenta models on Intel's high-bandwidth CPUs enables unparalleled performance speedups for longer sequence length tasks, dramatically reducing the overall cost of running language models in production. This speedup allows customers to:
- Process large documents with high sequence lengths at impressive speeds without sacrificing accuracy
- Realize significant cost savings
- Eliminate the need for resource-intensive GPUs
- Unlock new NLP capabilities for diverse applications without breaking the bank
ADDITIONAL RESOURCES
- Press Release: Numenta Achieves 123x Inference Performance Improvement for BERT Transformers on Intel Xeon Processor Family
- Technical Blog: Numenta and Intel Accelerate Inference
- Xeon Series Product Brief: Intel® Xeon® CPU Max Series Product Brief
Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries.