This article is part of Intel’s Parallel Universe Magazine, originally published on Intel.com.
Problem Statement
As we ring in 2024, businesses across industries have witnessed the transformative potential of large language models (LLMs). Despite this potential, businesses still struggle to integrate LLMs into their operations, mainly due to cost, complexity, high power consumption, and concerns over data protection. GPUs have traditionally been the go-to hardware for LLMs, but they introduce significant IT complications and escalating expenses that are difficult to sustain as operations grow. Furthermore, due to worldwide demand and shortages, businesses now must wait over a year to secure GPUs for their AI projects.
A New Approach
Numenta and Intel are setting a new precedent for LLM deployment, freeing businesses from their dependency on GPUs. The Numenta Platform for Intelligent Computing (NuPIC™) maps neuroscience-based concepts to the Intel® Advanced Matrix Extensions (Intel® AMX) instruction set. With NuPIC, it is finally possible to deploy LLMs at scale on Intel® CPUs and deliver dramatic price/performance improvements. (See benchmarks P6 and P11 in Performance Index for 4th Generation Intel® Xeon® Scalable processors for examples.)
Technologies
Intel® AMX
Intel AMX accelerates matrix operations by providing dedicated hardware support (Figure 1). It is particularly beneficial for applications that rely heavily on matrix computations, such as deep learning inference. Intel AMX was first introduced in 4th Generation Intel Xeon Scalable processors and delivers even greater performance on the new 5th Generation Intel Xeon processors.
NuPIC™
Built on a foundation of decades of neuroscience research, NuPIC allows you to build powerful language-based applications quickly and easily (Figure 2). At the heart of NuPIC is a highly optimized inference server that takes advantage of Intel AMX to enable you to run LLMs efficiently on Intel CPUs. With a library of production-ready pretrained models that can be customized to a variety of natural language processing use-cases, you can directly run the models in the NuPIC Inference Server. You can also fine-tune the models on your data with the NuPIC Training Module, and then deploy your custom fine-tuned model to the NuPIC Model Library and run it in the inference server.
Built on standard inference protocols, NuPIC can be seamlessly integrated into any standard MLOps pipeline. Deployed as Docker containers, the solution stays within your infrastructure in a scalable, secure, highly available, high-performance environment. You retain full control of your custom models. Your data remains fully private and never needs to leave your system.
Why NuPIC on Intel® Xeon® Makes CPUs Ideal for AI Inference
Numenta and Intel are rewriting the assumption that LLMs require GPUs, making it possible to deploy LLMs at scale on CPUs in an extremely cost-effective manner. Here are a few reasons why.
Performance: 17x Faster than NVIDIA A100 Tensor Core* GPUs
With minimal changes to the Transformers structure, NuPIC achieves over two orders of magnitude improvement in inference throughput on Intel AMX-enabled CPUs compared to previous-generation CPUs and significant speedups compared to GPUs (Table 1). For BERT-Large, our platform on 5th Gen Intel Xeon outperformed the NVIDIA A100 GPU by up to 17x. GPUs require higher batch sizes for best parallel performance. However, batching leads to a more complex inference implementation and introduces undesirable latency in real-time applications. In contrast, NuPIC does not require batching for high performance, making applications flexible, scalable, and simple to manage. Despite the disadvantages of batching, we list the performance of the NVIDIA A100 at batch size 8. NuPIC at batch size 1 still outperforms the batched NVIDIA GPU implementation by more than 2x.
Scalability: Managing Multiple Clients
Most LLM applications need to support many clients, and each requires inference results. NuPIC on a single 5th Gen Intel Xeon server enables applications to handle a large number of client requests while maintaining high throughput (Figure 3). None of the clients need to batch their inputs, and each client can be completely asynchronous.
Flexibility: Running Multiple Models on the Same Server
Many business applications are powered by multiple LLMs that each solve a different task. In addition to managing asynchronous clients, NuPIC enables you to run separate models concurrently and asynchronously on the same server, something that is hard to manage efficiently on GPUs. As shown in Figure 3, we ran 128 independent models on a single server. Different models can have different sets of weights and can vary in their computation requirements, but this is all easily handled by NuPIC on Intel CPUs. Moreover, as your data and computational needs grow, it is much simpler and less restrictive to add CPUs to your system than to try to integrate additional GPUs. This is especially true in cloud-based environments, where adding more computational power is often just a click away. From an IT standpoint, CPUs are easier to deploy and manage than GPUs.
Cost and Energy Efficient
CPUs are also generally more affordable and accessible than GPUs, making them the preferred choice for many businesses. However, their effectiveness in running AI inference has traditionally been limited by slower processing speeds. NuPIC accelerates AI inference on Intel CPUs, leading to advantages in inferences per dollar and inferences per watt. Overall, with NuPIC and Intel AMX, AI applications are substantially lower in cost, consume significantly less energy, and are more sustainable and scalable.
Implementation Example
To show how you can leverage NuPIC’s capabilities and features, we provide example code for different use-cases, ranging from those ready for immediate deployment to those that need fine-tuning. In this example, we show how you can use NuPIC to perform sentiment analysis in a financial context, where our NuPIC-optimized BERT model processes text data like news articles and market reports and categorizes them into different sentiments (positive, negative, and neutral). This can help financial analysts quickly understand the overall market sentiment, allowing them to make informed decisions about investment strategies or forecasting market movements.
Install and Launch NuPIC
Numenta provides a download script, which includes example code and all the necessary components for fine-tuning and deployment. After you install NuPIC, run the following commands to start the training module and inference server:
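(The script names below are placeholders; the actual launch commands ship with Numenta's download package and may differ in your installation.)

```bash
# Hypothetical launch commands -- substitute the scripts provided by
# Numenta's download package for your installation.
./nupic_training --start     # training module (API server on port 8321)
./nupic_inference --start    # inference server (API server on port 8000)
```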
This will download the Docker images and start the Docker containers. By default, the training module and inference server expose API servers on ports 8321 and 8000, respectively. You can stop the servers and Docker containers at any time with the --stop command.
Prepare Your Dataset
In this example, we will be using the financial sentiment dataset for fine-tuning. First, we split the data into train and test sets:
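(A minimal sketch, assuming the financial sentiment data is a CSV file with text and label columns; the file name below is a placeholder.)

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Placeholder file name for the financial sentiment dataset.
df = pd.read_csv("financial_sentiment.csv")

# Hold out 20% of the examples for evaluation, stratified by sentiment label.
train_df, test_df = train_test_split(
    df, test_size=0.2, stratify=df["label"], random_state=42
)

train_df.to_csv("train.csv", index=False)
test_df.to_csv("test.csv", index=False)
```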
Fine-Tune Your Data with the NuPIC Training Module
After splitting the dataset, you can now fine-tune the model via NuPIC’s training module. This process involves adjusting parameters like learning rate, batch size, etc. to optimize performance. In this case, we’re fine-tuning a NuPIC-optimized BERT model with the financial sentiment dataset:
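(The training module's API server listens on port 8321, as noted above, but the endpoint paths, payload fields, and model name in this sketch are illustrative placeholders rather than the documented NuPIC API.)

```python
import requests

payload = {
    "model": "nupic-bert",        # placeholder name for the NuPIC-optimized BERT model
    "train_file": "train.csv",
    "test_file": "test.csv",
    "hyperparameters": {"learning_rate": 2e-5, "batch_size": 16, "epochs": 3},
}

# Submit the fine-tuning job to the training module (port 8321); the endpoint
# path and payload schema are hypothetical.
response = requests.post("http://localhost:8321/finetune", json=payload)
response.raise_for_status()
result = response.json()
print(result)                     # e.g., accuracy on the held-out test split

# Retrieve the fine-tuned model archive and store it locally (hypothetical endpoint).
model_id = result["model_id"]
archive = requests.get(f"http://localhost:8321/models/{model_id}/download")
archive.raise_for_status()
with open(f"model_{model_id}.tar.gz", "wb") as f:
    f.write(archive.content)
```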
NuPIC's training module will then evaluate the fine-tuned model on the test portion of the financial sentiment dataset and print an accuracy score. You can adjust hyperparameters if necessary and retrain until you're happy with the results. Once fine-tuning is complete, a new model is generated. The above script automatically retrieves the model from the training module and stores it in your local directory as a .tar.gz file, from which you can then deploy it to NuPIC's inference server.
Import Your Fine-Tuned Model into the NuPIC Inference Server
To deploy your fine-tuned model, you need to first make it available on the inference server. This means that the output from the last phase (model_xxx.tar.gz) should be uncompressed and moved to the “models” directory, which is mapped into the inference server container. Use the following commands, replacing xxx with the details of your fine-tuned model:
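(A minimal sketch, assuming the archive unpacks into a directory named after the model.)

```bash
# Unpack the fine-tuned model archive into the "models" directory, which is
# mapped into the inference server container (replace xxx with your model's name).
mkdir -p models
tar -xzf model_xxx.tar.gz -C models/
```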
Deployment: Real-Time Inference on Intel Xeon CPUs
Now, it’s time to deploy your model. NuPIC’s Python client makes deployment simple. The client manages communication between your end-user application and model. A simple API enables you to generate embeddings automatically and make requests to the inference server in real time. Set up the NuPIC client and perform inference as follows:
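(The client classes ship with NuPIC; as an illustrative stand-in, the sketch below sends a raw HTTP request to the inference server on port 8000 using a hypothetical endpoint and payload.)

```python
import requests

texts = [
    "Quarterly earnings beat analyst expectations by a wide margin.",
    "The company issued a profit warning amid weakening demand.",
]

# Request sentiment predictions from the inference server (port 8000); the
# endpoint path and response schema shown here are hypothetical.
response = requests.post(
    "http://localhost:8000/models/model_xxx/predict",
    json={"inputs": texts},
)
response.raise_for_status()

for text, sentiment in zip(texts, response.json()["predictions"]):
    print(f"{sentiment:>8}: {text}")
```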
You can send a request with text data to your API endpoint and retrieve the resulting sentiment predictions. For real-time analysis, you can continuously feed market reports and news articles to your API and receive sentiment predictions in return.
Concluding Remarks
With NuPIC and Intel AMX, you get the versatility and security required by enterprise IT, combined with unparalleled scaling of LLMs on Intel CPUs. With this synergistic combination of software and hardware technologies, running LLMs on CPUs becomes more than a possibility; it becomes a strategic advantage. To see how your organization can benefit from NuPIC, you can request a demo from Numenta.