Generative AI for Structured Knowledge

Introducing SCIRAG (Structured Content-Integrated Retrieval Augmented Generation)

Overview

When using a Large Language Model (LLM) to answer questions about an existing corpus of material, it is common to use an approach called Retrieval Augmented Generation (RAG). RAG is an established method for increasing the relevance of the answers an LLM generates and reducing the likelihood of hallucinations (a notable shortcoming of LLMs).
RAG can be accomplished fairly simply using existing tools (e.g., LangChain), which work extremely well for a general-purpose text corpus. For instance, creating a chatbot to answer user questions about a product based on extensive product documentation is an almost trivial task for which these tools are ideally suited. Unfortunately, for highly rigorous niche topics (such as scientific literature), these tools fall short in several ways.
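For illustration only, the following is a minimal sketch of such a documentation chatbot built with LangChain. Import paths and class names vary between LangChain releases, and the snippet assumes an OpenAI API key and the faiss-cpu package are available, so it should be read as indicative rather than definitive.

    # Minimal RAG chatbot sketch; requires langchain, langchain-openai,
    # langchain-community, faiss-cpu, and an OPENAI_API_KEY in the environment.
    from langchain_openai import OpenAIEmbeddings, ChatOpenAI
    from langchain_community.vectorstores import FAISS
    from langchain.chains import RetrievalQA

    # Hypothetical product documentation, already split into chunks.
    doc_chunks = [
        "To reset the device, hold the power button for ten seconds.",
        "The warranty covers hardware defects for two years.",
    ]

    # Embed the chunks and index them in an in-memory vector store.
    vectorstore = FAISS.from_texts(doc_chunks, OpenAIEmbeddings())

    # Wire the retriever and an LLM into a question-answering chain.
    qa_chain = RetrievalQA.from_chain_type(
        llm=ChatOpenAI(),
        retriever=vectorstore.as_retriever(),
    )

    print(qa_chain.invoke({"query": "How do I reset the device?"}))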
This document describes the pitfalls of using RAG for scientific research and the techniques FAIR Atlas uses to overcome them in SCIRAG (Structured Content-Integrated Retrieval Augmented Generation), a specialized RAG framework for assisting and accelerating the scientific research process with generative AI.

Introduction to RAG (Retrieval Augmented Generation)

Large Language Models do an impressive job of communicating in natural language. While the details of how they work are beyond the scope of this document, for our purposes it is important to know that they are probabilistic word prediction engines. Consider the following query that one might make to an LLM:
“I am cooking lasagna tonight for four people, how long will it take me?”
In the absence of any further information, an LLM can only estimate from recipes, cooking techniques, and so forth, so the best it could possibly do is give a wide range (from ten minutes for microwaving something to hours for something cooked entirely from scratch, for instance).
However, what if a person kept a log of everything that they cooked, and how long it took them to cook it? Let’s look at that query again:
“[5 weeks ago I cooked lasagna for 4 people and it took 3 hours total time to prepare and cook]
I am cooking lasagna tonight for four people, how long will it take me?”
The LLM is almost certainly going to give a highly relevant answer this time. The cooking log entry added before the question is known as context in LLM parlance. Functionally, this context alters the probabilities of which words come next in the conversation.
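A minimal sketch of this mechanism is simply string concatenation: the context is placed ahead of the question before the combined prompt is sent to the model. In the sketch below, call_llm is a hypothetical stand-in for whatever chat-completion API is in use.

    # Build a prompt by prepending context entries to the user's question.
    def build_prompt(context_entries, question):
        context_block = "\n".join(f"[{entry}]" for entry in context_entries)
        return f"{context_block}\n{question}"

    cooking_log = [
        "5 weeks ago I cooked lasagna for 4 people and it took 3 hours total time to prepare and cook",
    ]
    prompt = build_prompt(
        cooking_log,
        "I am cooking lasagna tonight for four people, how long will it take me?",
    )
    # call_llm(prompt)  # hypothetical call; the added context has altered the word probabilities
    print(prompt)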
Of course, making a person provide this context would defeat the entire purpose, which is where retrieval comes in.

Retrieval

When a person submits a question to a RAG system (as opposed to the LLM alone), the question is first encoded with an embeddings model, which returns a numeric representation of the question across multiple dimensions of meaning. This representation is used as a search pattern in a vector database, which returns the most semantically similar content from the corpus. That content is placed before the original question, and the whole string (context plus question) is sent to the LLM.
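The sketch below illustrates this flow. The toy_embed function is a stand-in for a real embeddings model, and the linear scan stands in for a vector database; a production pipeline would embed and index the corpus once, then search the index for each query.

    # Retrieval sketch: embed the question, rank corpus chunks by cosine
    # similarity, and prepend the best match to the question as context.
    import numpy as np

    def toy_embed(text, dim=64):
        """Toy hashed bag-of-words embedding; a real system uses a trained model."""
        vec = np.zeros(dim)
        for word in text.lower().split():
            vec[hash(word) % dim] += 1.0
        return vec

    def cosine_similarity(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

    def retrieve(question, corpus, top_k=1):
        """Return the corpus chunks most semantically similar to the question."""
        query_vec = toy_embed(question)
        scored = sorted(
            ((cosine_similarity(query_vec, toy_embed(chunk)), chunk) for chunk in corpus),
            reverse=True,
        )
        return [chunk for _, chunk in scored[:top_k]]

    corpus = [
        "5 weeks ago I cooked lasagna for 4 people and it took 3 hours total.",
        "Last month I repainted the garage door.",
    ]
    question = "I am cooking lasagna tonight for four people, how long will it take me?"
    context = "\n".join(retrieve(question, corpus))
    print(f"{context}\n{question}")  # context plus question, sent to the LLM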

RAG Shortcomings

RAG is an excellent tool for helping an LLM answer questions about topics that are suitable for a general audience. However, for content that is intended for a specialist audience, RAG performance often decreases. This is for several reasons:
  • General-purpose embeddings models may not be well suited to a given specialist topic.
  • Specialist topics often use terms not encountered outside of their domain, or use terms in idiosyncratic ways.
  • Terms which are only loosely related in general might be highly related in a specialist topic, or vice versa.
  • The overlap of terms in scientific questions and scientific answers is often less than in questions and answers of a more general nature.
  • For general-purpose content, a chunk of relevant text is all that is needed, but for specialist domains, additional metadata (e.g., source identification, catalog identifiers, or ontology identifiers) is often needed.
  • When providing text-only chunks as context, an LLM will simply answer; this is acceptable for general-purpose applications, but specialist applications often require answers to be annotated with their source(s) so that the answer can be validated and sources can be examined for follow-up questions (see the sketch after this list).
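To make the last two points concrete, compare a text-only chunk with a chunk record that also carries source and ontology metadata. The field names and identifier values below are illustrative placeholders, not a prescribed schema.

    # A text-only chunk, as a general-purpose RAG pipeline might store it.
    plain_chunk = "Overexpression of the gene was associated with increased lifespan in the cohort."

    # The same chunk carrying the metadata a specialist application needs so
    # that answers can be cited and validated; all values are placeholders.
    annotated_chunk = {
        "text": plain_chunk,
        "source": {
            "journal": "Example Journal of Gerontology",
            "year": 2021,
            "pmc_id": "PMC0000000",
        },
        "ontology_terms": {"EXAMPLEGENE": "HGNC:0000"},
    }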

SCIRAG: How FAIR Atlas Solves the Shortcomings of RAG for Scientific Queries

SCIRAG ensures citation information is preserved and ontological terms (with identifiers) are provided in responses. This closely parallels how professionals respond to document corpus review requests. Structuring responses as tables organized around standardized terms and identifiers greatly increases accuracy over the capabilities of RAG alone. Specifically, the utility of structured, tabulated responses is threefold. First, human review is much easier with well-organized, properly referenced and linked table entries. Second, links to citations and ontology term identifiers keep AI responses FAIR. Finally, LLM summaries, evaluations, and classifications are more accurate and useful in the SCIRAG context than in the ordinary RAG context.
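As a rough illustration only (not the actual SCIRAG schema), a structured response can be thought of as rows keyed by standardized terms and identifiers, with citations attached, which are then rendered as a table for human review. The field names and identifier values below are placeholders.

    # Placeholder rows of the kind a structured, tabulated response is built from.
    rows = [
        {
            "gene": "EXAMPLEGENE",
            "hgnc_id": "HGNC:0000",
            "sources": ["PMC0000001"],
            "note": "Reported to modulate mitochondrial function in aging models.",
        },
    ]

    def render_table(rows):
        """Format grounded answer rows as a plain-text table for human review."""
        header = f"{'Gene':<14}{'HGNC ID':<12}{'Sources':<14}Note"
        lines = [header, "-" * len(header)]
        for row in rows:
            lines.append(
                f"{row['gene']:<14}{row['hgnc_id']:<12}{','.join(row['sources']):<14}{row['note']}"
            )
        return "\n".join(lines)

    print(render_table(rows))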
To demonstrate that SCIRAG solves the problems that arise when using RAG for specialized content, FAIR Atlas has developed a proof-of-principle implementation comprising more than 75,000 scientific papers selected for their discussion of the genetic components of longevity. These papers were loaded into SCIRAG, a completely custom RAG architecture, and the system was tested with questions about the functional linkage of genes to mitochondrial effects on aging and longevity.
SCIRAG employs the following components to achieve superior performance in a specialized domain:
  • Domain-focused embeddings model – General-purpose embeddings models underperformed a specialized model at recognizing the relevance of domain concepts. In the demonstration, an embeddings model emphasizing biological concepts was selected.
  • Metadata tracking – In addition to embedding text from the papers, all available metadata (authors, journal, date of publication, PMC ID, etc.) was tracked.
  • Ontology or catalog term grounding – FAIR Atlas employs a multi-step pipeline to ground terms to stable term or document identifiers, HGNC (human gene) IDs in the case of the demonstration (a single grounding step is sketched after this list).
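As an illustration of the grounding idea only (the actual FAIR Atlas pipeline is multi-step and is not detailed here), a single grounding step can be sketched as an alias lookup that maps gene mentions to stable identifiers. The alias table and identifiers below are placeholders.

    # Sketch of one grounding step: map gene mentions (including aliases) found
    # in a chunk of text to stable HGNC identifiers via a lookup table.
    import re

    alias_to_hgnc = {
        "EXAMPLEGENE": "HGNC:0001",  # placeholder identifier
        "EG1": "HGNC:0001",          # alias resolving to the same gene
    }

    def ground_terms(text):
        """Return HGNC identifiers for gene mentions found in a chunk of text."""
        grounded = {}
        for token in re.findall(r"[A-Za-z0-9-]+", text):
            hgnc_id = alias_to_hgnc.get(token.upper())
            if hgnc_id:
                grounded[token.upper()] = hgnc_id
        return grounded

    print(ground_terms("Variants in EG1 were linked to extended lifespan."))
    # -> {'EG1': 'HGNC:0001'}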
The result is a coherent and useful answer with a well-organized table containing all targeted entities, along with notes explaining their relevance to the question. In the demonstration, all genes relevant to the question are documented with clear notes and summaries.