Generative AI for Structured Knowledge
Introducing SCIRAG (Structured Content-Integrated Retrieval Augmented Generation)
Overview
When using a Large Language Model (LLM) to answer questions about an existing corpus of material, it is common to use an approach called Retrieval Augmented Generation (RAG). RAG is an established method for increasing the relevance of the answers an LLM generates and reducing the likelihood of hallucinations (a notable shortcoming of LLMs).
RAG can be accomplished fairly simply using existing tools (e.g., LangChain), which work extremely well for a general-purpose text corpus. For instance, creating a chatbot to answer user questions about a product based on extensive product documentation is an almost trivial task for which these tools are ideally suited. Unfortunately, for highly rigorous niche topics (such as scientific literature), these tools fall short in several ways.
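To illustrate how simple the general-purpose case is, the sketch below wires product documentation into a question-answering chain using the classic LangChain retrieval pattern. The file path, chunking parameters, and example question are illustrative, and the exact import paths and class names vary between LangChain versions.

```python
# Minimal general-purpose RAG chatbot sketch using the classic LangChain pattern.
# Import paths are those of legacy LangChain releases and may differ in newer versions.
from langchain.document_loaders import TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import FAISS
from langchain.chat_models import ChatOpenAI
from langchain.chains import RetrievalQA

# Load and chunk the product documentation (the path is illustrative).
docs = TextLoader("product_docs.txt").load()
chunks = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100).split_documents(docs)

# Embed the chunks and index them in an in-memory vector store.
index = FAISS.from_documents(chunks, OpenAIEmbeddings())

# Wire the retriever and the LLM into a question-answering chain.
qa = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(),
    retriever=index.as_retriever(search_kwargs={"k": 4}),
)

print(qa.run("How do I reset the device to factory settings?"))
```

This handful of lines is essentially the whole application, which is why the general-purpose case is described above as almost trivial.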
This document describes the pitfalls of using RAG for scientific research, along with the techniques FAIR Atlas uses to overcome them in SCIRAG (Structured Content-Integrated Retrieval Augmented Generation), a specialized RAG framework for assisting and accelerating the scientific research process with generative AI.
Introduction to RAG (Retrieval Augmented Generation)
Large Language Models do an impressive job of communicating in natural language. While the details of how they work are beyond the scope of this document, for our purposes it is important to know that they are probabilistic word prediction engines. Consider the following query that one might make to an LLM:
“I am cooking lasagna tonight for four people, how long will it take me?”
In the absence of any further information, an LLM can only draw on recipes, cooking techniques, and so forth, so the best it could possibly do is give a wide range (from ten minutes for microwaving something to several hours for a dish cooked entirely from scratch, for instance).
However, what if a person kept a log of everything that they cooked, and how long it took them to cook it? Let’s look at that query again:
“[5 weeks ago I cooked lasagna for 4 people and it took 3 hours total time to prepare and cook]
I am cooking lasagna tonight for four people, how long will it take me?”
The LLM is almost certainly going to give a highly relevant answer this time. The added cooking log entry is known as context in LLM parlance. Functionally, this context shifts the probabilities of which words come next in the conversation.
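To make the mechanics concrete, the following sketch shows the log entry being prepended to the question before the combined prompt is sent to an LLM. The OpenAI client and model name are illustrative stand-ins for whatever model is actually used.

```python
# Sketch: prepending a retrieved log entry as context ahead of the user's question.
# The log entry and question are the examples from the text; the API call is illustrative.
from openai import OpenAI

log_entry = "5 weeks ago I cooked lasagna for 4 people and it took 3 hours total time to prepare and cook"
question = "I am cooking lasagna tonight for four people, how long will it take me?"

# The context is simply placed ahead of the question in the prompt,
# shifting the model's next-word probabilities toward the logged experience.
prompt = f"[{log_entry}]\n{question}"

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
response = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative model choice
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)
```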
Of course, requiring a person to provide this context by hand would defeat the entire purpose, which is where retrieval comes in.
Retrieval
When a person asks a question of a RAG system (as opposed to a bare LLM), the question is first encoded with an embedding model, which returns a numeric representation of the question across many dimensions of meaning. This representation is used as a search pattern in a vector database, which returns the most semantically similar content. That content is added before the original question, and the whole string (context plus question) is sent to the LLM.
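The sketch below walks through this retrieval step end to end, assuming a small in-memory corpus, the sentence-transformers library for embeddings, and a plain cosine-similarity search standing in for a full vector database.

```python
# Sketch of the retrieval step: embed the question, find the most similar chunks,
# and prepend them to the question before calling the LLM.
# The corpus, model name, and final LLM call are illustrative.
import numpy as np
from sentence_transformers import SentenceTransformer

corpus = [
    "5 weeks ago I cooked lasagna for 4 people; total time 3 hours.",
    "Last month I roasted a chicken; total time 90 minutes.",
    "Yesterday I microwaved leftovers; total time 10 minutes.",
]

encoder = SentenceTransformer("all-MiniLM-L6-v2")
corpus_vectors = encoder.encode(corpus, normalize_embeddings=True)

def retrieve(question: str, k: int = 2) -> list[str]:
    """Return the k corpus chunks most semantically similar to the question."""
    q = encoder.encode([question], normalize_embeddings=True)[0]
    scores = corpus_vectors @ q           # cosine similarity (vectors are normalized)
    top = np.argsort(scores)[::-1][:k]
    return [corpus[i] for i in top]

question = "I am cooking lasagna tonight for four people, how long will it take me?"
context = "\n".join(retrieve(question))
prompt = f"{context}\n\n{question}"       # context plus question
print(prompt)                             # in a real system this string would be sent to the LLM
```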
RAG Shortcomings
RAG is an excellent tool for helping an LLM answer questions about topics suitable for a general audience. However, for content intended for a specialist audience, RAG performance often degrades, for several reasons:
- General-purpose embedding models may not be well suited to a given specialist topic.
- Specialist topics often use terms not encountered outside of their domain, or use terms in idiosyncratic ways.
- Terms which are only loosely related in general might be highly related in a specialist topic, or vice versa.
- The overlap between the terms in scientific questions and the terms in their answers is often smaller than for questions and answers of a more general nature.
- For general-purpose content, a chunk of relevant text is all that is needed, but specialist domains often also require additional metadata (e.g., source identification, catalog identifiers, or ontology identifiers).
- When only text chunks are provided as context, an LLM will simply answer, which is acceptable for general-purpose applications; specialist applications, however, often require answers to be annotated with their source(s) so that the answer can be validated and the sources examined for follow-up questions (see the sketch after this list).
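To illustrate the last two points, the hypothetical sketch below attaches source, catalog, and ontology identifiers to each chunk and instructs the model to cite them. The field names, identifiers, object, and prompt wording are invented for illustration only and do not represent the SCIRAG implementation.

```python
# Hypothetical illustration: chunks carry source metadata, and the prompt instructs
# the model to cite those sources so its answer can be validated and followed up on.
# All identifiers and field names below are placeholders.
chunks = [
    {
        "text": "Object X-123 is a globular cluster at an estimated distance of 2.5 kpc.",
        "source": "doi:10.0000/example-paper",    # placeholder source identifier
        "catalog_id": "X-123",                    # placeholder catalog identifier
        "ontology_id": "ONT:GlobularCluster",     # placeholder ontology identifier
    },
]

question = "How far away is Object X-123?"

# Each chunk is serialized together with its metadata before being added as context.
context = "\n".join(
    f"[source: {c['source']} | catalog: {c['catalog_id']} | ontology: {c['ontology_id']}]\n{c['text']}"
    for c in chunks
)
prompt = (
    "Answer the question using only the context below, and cite the source "
    "identifier for every claim.\n\n"
    f"{context}\n\nQuestion: {question}"
)
print(prompt)  # this annotated prompt is what would be sent to the LLM
```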
SCIRAG: How FAIR Atlas Solves the Shortcomings of RAG for Scientific Queries