How RAG Will Usher In the Next Generation of LLMs and Generative AI

This article originally appeared on TDWI.org

 

If you’ve been keeping up with the steady stream of developments in artificial intelligence — particularly with generative AI — you’ll have noticed that it usually involves a veritable alphabet soup of abbreviations. The latest of these is RAG, or retrieval-augmented generation. However, this is not just another jumble of letters — it may be a big step forward in addressing many of the lingering issues facing AI adoption for business.

What is RAG?

RAG is an emerging AI technique designed to improve the output of large language models (LLMs) by accessing and incorporating information outside their training data sets before generating a response. RAG is an important tool for combating the nagging issue of hallucinations, as well as for enhancing data security and privacy.

A typical AI request (called an inference) involves six basic steps:

  1. Input data preparation. This could involve normalization, tokenization (for text), resizing images, or converting the data into a specific format.
  2. Model loading. This model has already been trained on a data set and has learned patterns that it can apply to new data.
  3. Inference execution. The prepared input data is fed into the model.
  4. Output generation. The model produces its raw output; the nature of this output depends on the task (for an LLM, a sequence of generated tokens).
  5. Post-processing. The raw output from the model may undergo post-processing to convert it into a more interpretable or useful form.
  6. Result interpretation and action. Finally, the post-processed output is interpreted within the context of the application, leading to an action or decision. For example, in a medical diagnosis application, the output might be interpreted by a healthcare professional to inform a treatment plan.
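
To make these steps concrete, here is a minimal sketch of a plain (non-RAG) inference in Python using the Hugging Face transformers library; the library choice and the small gpt2 checkpoint are illustrative assumptions, not something the steps above prescribe.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Step 2: model loading (done first here so the tokenizer is available);
# the model has already been trained and has learned patterns to apply.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Step 1: input data preparation (tokenization for text).
prompt = "What are the key benefits of retrieval-augmented generation?"
inputs = tokenizer(prompt, return_tensors="pt")

# Steps 3 and 4: inference execution and output generation.
output_ids = model.generate(**inputs, max_new_tokens=50)

# Step 5: post-processing (decode token IDs back into readable text).
answer = tokenizer.decode(output_ids[0], skip_special_tokens=True)

# Step 6: result interpretation and action (here, simply print the answer;
# a real application would act on it in its own context).
print(answer)
```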

In a RAG-augmented inference, RAG most affects steps 3 and 4. In step 3, the application also searches whatever external data it has been given access to (internal company databases, external documents, and so on), rather than relying solely on what the model learned from its training data. Then, in step 4, RAG passes the top-matched documents from the retrieval step to the LLM as context, and the LLM generates a response suited to the specific use case (e.g., question answering or summarization).
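
To make the retrieval step concrete, here is a minimal sketch of the online path in Python. It uses the sentence-transformers library for embeddings; the model name ("all-MiniLM-L6-v2"), the example documents, and the llm_generate() stand-in for the final generation call are illustrative assumptions rather than details from the workflow above.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")

# External documents the application has been given access to (step 3).
documents = [
    "Our refund policy allows returns within 30 days of purchase.",
    "Premium support is available 24/7 for enterprise customers.",
    "The 2024 price list was updated in March.",
]
doc_embeddings = embedder.encode(documents, normalize_embeddings=True)

def retrieve(query: str, k: int = 2) -> list[str]:
    """Return the k documents most similar to the query (cosine similarity)."""
    query_embedding = embedder.encode([query], normalize_embeddings=True)
    scores = doc_embeddings @ query_embedding[0]
    top_indices = np.argsort(scores)[::-1][:k]
    return [documents[i] for i in top_indices]

def llm_generate(prompt: str) -> str:
    # Stand-in for whatever LLM call the application actually uses
    # (e.g., the generate() call from the previous sketch).
    return f"[LLM response to a prompt of {len(prompt)} characters]"

# Step 4: the top-matched documents become context for the LLM's answer.
query = "Can I get my money back after three weeks?"
context = "\n".join(retrieve(query))
augmented_prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
print(llm_generate(augmented_prompt))
```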

To optimize inference performance, RAG often includes an offline process that builds embeddings for all external documents, indexes them, and stores them. A popular architectural choice is to use a vector database for indexing, storage, and retrieval.
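
As a sketch of that offline process, the example below embeds a small corpus and stores it in a FAISS index. The article describes vector databases generally; FAISS, the embedding model, the sample documents, and the documents.index file path are illustrative choices.

```python
import faiss
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")

# In practice these would come from internal databases, document stores, etc.
corpus = [
    "Q3 sales grew 12% year over year, driven by the APAC region.",
    "The employee travel policy was revised in January 2024.",
    "Model X-200 requires firmware version 3.1 or later.",
]

# Build embeddings for all external documents.
embeddings = embedder.encode(corpus, normalize_embeddings=True)

# Index and store them; an inner-product index over normalized vectors
# gives cosine-similarity search.
index = faiss.IndexFlatIP(embeddings.shape[1])
index.add(embeddings)
faiss.write_index(index, "documents.index")

# At query time, the application reloads the index and searches it.
index = faiss.read_index("documents.index")
query_embedding = embedder.encode(["What firmware does the X-200 need?"],
                                  normalize_embeddings=True)
scores, ids = index.search(query_embedding, 2)
print([corpus[i] for i in ids[0]])
```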

RAG’s Pros and Cons

RAG provides many advantages over pre-trained or fine-tuned LLMs. It allows models to access up-to-date external data, mitigating the knowledge-cutoff limitations of LLM training data sets, and it gives models the context to keep up with the latest information without constant retraining.

RAG also helps reduce LLM hallucinations by providing relevant and accurate data sources as context for generation. What’s more, RAG enhances data security and privacy by letting enterprises give their applications access to sensitive data while keeping that data separate from the model and under additional protection.

Depending on the use case, RAG often provides a more cost-efficient solution. New data can be embedded and added to the vector database, giving the application access to the latest information without having to continually retrain the LLM. At inference time, this additional context is retrieved by finding the documents most relevant to the input query, which also reduces the need for a very large token context window of the kind offered by models such as Gemini 1.5 Pro.
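
As a brief illustration of that point, new content can be embedded and appended to the existing index without touching the LLM at all. This continues the hypothetical FAISS example from the previous section; the new document and file path are again illustrative.

```python
import faiss
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")

# Embed the new documents and append them to the existing index;
# the LLM itself is untouched.
new_docs = ["The Q1 2025 roadmap adds support for on-premises deployment."]
new_embeddings = embedder.encode(new_docs, normalize_embeddings=True)

index = faiss.read_index("documents.index")  # index built in the earlier sketch
index.add(new_embeddings)
faiss.write_index(index, "documents.index")
```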

Meanwhile, introducing RAG to the mix brings several challenges. For example, RAG presents new architectural requirements — a vector database is usually needed to perform the indexing, storage, and retrieval functions, making design and implementation of a RAG system more complicated.

RAG also requires additional steps to generate embeddings of queries and to retrieve similar documents. These steps increase inference latency compared with inference on a pre-trained or fine-tuned LLM alone.

Of course, as with all things AI, the quality of the RAG result depends on the quality of the data sources. When there are data quality issues in the external data sources, the RAG application’s quality will be negatively affected.

Leveraging RAG in Business

RAG’s ability to access external data sources during inference time makes it an excellent fit for many applications. Common use cases include:

  • Enterprise internal knowledge bases. RAG enables enterprises to keep their sensitive internal data safe and alleviates the need to keep LLMs updated. Furthermore, it helps provide accurate answers by minimizing hallucinations.
  • Customer service assistants. RAG allows LLMs to use customer-specific information as context to provide personalized answers. It can also enable enterprises to avoid using sensitive customer data to train their LLMs.
  • Domain-specific research tools. RAG can supplement LLMs with domain-specific knowledge from external data sources, providing an alternative to training a domain-specific LLM.

A Case in Point

One example illustrates how valuable RAG can be in practice. A top publisher wanted to make use of its immense archive of valuable content but faced the daunting task of efficiently researching relevant materials to provide suggestions and insights to its writers and editors, a virtually impossible process without advanced AI technology.

Powered by the GPT and BERT models in a model library, the customer quickly built a powerful RAG application on its existing AI platform. The solution automatically sifts through the publisher’s extensive content repository, identifies pertinent information, and makes timely, AI-driven recommendations to the editorial team.

The introduction of the RAG application dramatically improved the efficiency and depth of the publisher’s reporting, allowing the customer to deliver richer, more insightful narratives and maintain its position as a leader in its field with cutting-edge, data-backed storytelling.

The Retrieval-Augmented Future

Once companies become familiar with RAG, they can combine a variety of off-the-shelf or custom LLMs with internal or external knowledge bases to create a wide range of assistants that help their employees and customers. Chatbots and other conversational systems that use natural language processing can benefit significantly from RAG and generative AI. For example, a generative AI model supplemented with a medical database could be a great assistant for doctors and nurses.

In the future, RAG technology may help generative AI take appropriate action based on contextual information and user prompts. For example, a RAG-augmented AI system might identify the highest-rated vacation rental in Kihei and then initiate booking a two-bedroom beach house within walking distance of your favorite snorkeling spot.

Although RAG might add some complexity to your enterprise AI undertakings, it is worth the effort for many use cases. Retrieval-augmented generation builds on the benefits of LLMs by making them more timely, accurate, secure, and contextual. For business applications of generative AI, RAG is an important capability to understand and incorporate into your AI applications.

Author
Andy Xu • VP Engineering