Exploring DataGemma: An Overview
Despite the advancements in large language models (LLMs), AI hallucinations remain a significant challenge. On September 12, Google took a major step by releasing DataGemma as open source. DataGemma grounds model outputs in real-world data to tackle these hallucinations, reflecting Google’s commitment to addressing the issue. In this blog, I will provide an overview of DataGemma and explore the two distinct approaches it uses to improve LLM accuracy and reasoning.
What is Data Commons?
Google’s Data Commons project serves as a vast repository of public data, designed to streamline the access and use of important global statistics. It consolidates information from a wide range of trusted sources, including the United Nations, government agencies, environmental organizations, and universities. With over 250 billion data points and more than 2.5 trillion triples, it represents a significant open-source initiative aimed at making global data more accessible and useful.
Data Commons features two notable innovations. First, it has dedicated years to curating diverse public datasets, understanding their underlying assumptions, and organizing them using Schema.org, a universal language for structured data. This effort results in a comprehensive Knowledge Graph that integrates data from various sources.
Second, Data Commons incorporates a natural language interface powered by large language models (LLMs). This allows users to pose questions in everyday language, with the LLM translating these queries into the Data Commons’ format. This interface facilitates the exploration of charts and graphs without altering or fabricating the underlying data.
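To make this concrete, below is a minimal sketch of pulling a statistic from Data Commons programmatically with its Python client (pip install datacommons). The place identifiers and the Count_Person variable are illustrative choices on my part; the natural language interface described above essentially sits on top of lookups like these.

```python
# Minimal sketch: querying Data Commons directly with its Python client.
# The place DCIDs and the statistical variable are illustrative examples.
import datacommons as dc

# Latest observed value of a statistical variable for a place.
us_population = dc.get_stat_value("country/USA", "Count_Person")
print(f"US population (latest observation): {us_population}")

# Full time series for the same variable in California (DCID geoId/06).
ca_series = dc.get_stat_series("geoId/06", "Count_Person")
for date, value in sorted(ca_series.items()):
    print(date, value)
```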
Interfacing LLMs with Data Commons
Two different approaches have been described for interfacing LLMs with Data Commons.
The first approach, called Retrieval Interleaved Generation (RIG), fine-tunes the LLM so that, alongside the statistics it generates, it also produces natural language queries for Data Commons. A multi-step pipeline then converts these queries into structured data requests. The results are compared against how the base models, Gemma 7B IT and 27B IT, perform.
The second approach, Retrieval Augmented Generation (RAG), follows a more classic retrieval pattern. It extracts variables from the query, retrieves the relevant data from Data Commons, and adds that context to the original question. An LLM (Gemini 1.5 Pro) then produces the answer, which is compared against the baseline results.
Retrieval Interleaved Generation (RIG)
Retrieval Interleaved Generation (RIG) is a three-step process designed to enhance the accuracy and reliability of language model responses. First, a fine-tuned model generates natural language queries for Data Commons. Next, a post-processor converts these queries into structured data formats. Finally, the system retrieves statistical answers from Data Commons and presents them alongside the original LLM-generated results.
In this process, when the LLM provides a numerical answer, it is matched with the most relevant value from the Data Commons database, known as the Data Commons Statistical Value (DC-SV). The original output from the LLM is referred to as the LLM Statistical Value (LLM-SV). Instead of generating formal queries like SQL, the LLM is fine-tuned to produce natural language queries. This method is more efficient given the vast array of variables in Data Commons and helps maintain the natural and fluent quality of the model’s responses.
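The sketch below illustrates that matching step under stated assumptions: the bracketed annotation format and the query_data_commons helper are hypothetical placeholders rather than DataGemma’s actual output format, but they show how an LLM-SV can be surfaced next to the DC-SV retrieved from Data Commons.

```python
# Sketch of RIG post-processing: the fine-tuned model wraps each statistic in
# an annotation holding a natural language Data Commons query plus its own
# guess (the LLM-SV). The annotation syntax and query_data_commons helper are
# hypothetical placeholders for illustration only.
import re

ANNOTATION = re.compile(r'\[__DC__\("(?P<query>[^"]+)"\) --> "(?P<llm_sv>[^"]+)"\]')

def query_data_commons(nl_query: str) -> str:
    """Stand-in for the natural-language-to-structured-query pipeline."""
    return "39.0 million"  # pretend Data Commons returned this DC-SV

def postprocess(generated_text: str) -> str:
    """Show each LLM statistical value alongside the verified Data Commons value."""
    def _sub(match: re.Match) -> str:
        dc_sv = query_data_commons(match.group("query"))
        return f'{match.group("llm_sv")} [Data Commons: {dc_sv}]'
    return ANNOTATION.sub(_sub, generated_text)

text = ('California has roughly '
        '[__DC__("what is the population of California") --> "38 million"] residents.')
print(postprocess(text))
# -> California has roughly 38 million [Data Commons: 39.0 million] residents.
```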
Query Conversion
In their pipeline, natural language queries generated by the LLM are transformed into structured queries for the Data Commons database. Despite the extensive range of variables and properties in Data Commons, most queries can be categorized into a few types, which streamlines the extraction process. Each query is broken down into key components: statistical variables or topics, places, and attributes. Specific NLP techniques are applied to these components: semantic search for variables, named entity recognition for places, and regex-based heuristics for attributes.
The identified components are then mapped to a small set of predefined query templates.
Structured queries are generated based on these templates and submitted to the Data Commons API. The response, typically a numeric value, is presented alongside the original LLM-generated statistic, facilitating verification of the LLM’s output. Future developments will explore various presentation methods for these results, including side-by-side comparisons and highlighted differences.
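As a rough illustration of this conversion, the sketch below stubs out the extractors (semantic search for variables, named entity recognition for places) and maps their output onto a template before calling the Data Commons API; the templates and resolver logic are simplified assumptions, not the exact set used by DataGemma.

```python
# Sketch of the query conversion step. The resolver functions are stubs that
# stand in for semantic search (variables) and named entity recognition
# (places); the templates are illustrative only.
import datacommons as dc

TEMPLATES = {
    "value": "latest value of {var} in {place}",
    "ranking": "ranking of {place} by {var}",
}

def resolve_variable(nl_query: str) -> str:
    # Stand-in for semantic search over Data Commons statistical variables.
    return "Count_Person"

def resolve_place(nl_query: str) -> str:
    # Stand-in for NER plus place resolution to a Data Commons DCID.
    return "geoId/06"  # California

def convert_and_fetch(nl_query: str):
    var = resolve_variable(nl_query)
    place = resolve_place(nl_query)
    structured_query = TEMPLATES["value"].format(var=var, place=place)
    # The structured query corresponds to a concrete Data Commons API call.
    return structured_query, dc.get_stat_value(place, var)

print(convert_and_fetch("what is the population of California"))
```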
Retrieval Augmented Generation (RAG)
In the RAG pipeline, the process begins with a fine-tuned LLM managing the user’s query. This model generates relevant queries for Data Commons, which are used to retrieve pertinent tables from the Data Commons interface. Finally, a long-context LLM, such as Gemini 1.5 Pro, generates a response based on both the original query and the retrieved tables.
Extracting Data Commons Queries
An LLM is fine-tuned to transform user queries into Data Commons queries, with Gemini 1.5 Pro used to generate the training targets in a specific format. Although effectiveness is ultimately limited by which statistics Data Commons actually covers, this approach generally yields better results than alternatives that pass the model a full list of available variables.
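A minimal sketch of that extraction step is shown below; finetuned_generate stands in for a call to the fine-tuned model, and the prompt wording and one-query-per-line convention are my own assumptions.

```python
# Sketch of RAG query extraction. finetuned_generate is a hypothetical
# stand-in for the fine-tuned DataGemma model; the prompt wording and the
# one-query-per-line output convention are assumptions.
def finetuned_generate(prompt: str) -> str:
    # Pretend the fine-tuned model returned these Data Commons queries.
    return ("What is the life expectancy in the United States?\n"
            "How has life expectancy changed over time in the United States?")

def extract_dc_queries(user_query: str) -> list[str]:
    prompt = ("List the Data Commons statistical queries needed to answer the "
              "question, one per line.\n\n"
              f"Question: {user_query}")
    return [line.strip() for line in finetuned_generate(prompt).splitlines() if line.strip()]

print(extract_dc_queries("Is life expectancy improving in the US?"))
```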
Retrieving Tables
The extracted queries are processed with the same conversion pipeline used in RIG to identify variables and map them to Data Commons API calls. These calls return relevant tables, such as life expectancy by country, which are then used for generating the response.
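For instance, once a query has been resolved to a statistical variable and a set of places, a table like the one described above can be fetched with the datacommons_pandas client; the LifeExpectancy_Person variable name below is an assumption used for illustration.

```python
# Sketch of the table retrieval step with the datacommons_pandas client
# (pip install datacommons_pandas). The statistical variable name is an
# assumption standing in for whatever the extracted query resolves to.
import datacommons_pandas as dcpd

def retrieve_table(places: list[str], stat_var: str):
    # One row per place, one column per observation date.
    return dcpd.build_time_series_dataframe(places, stat_var)

table = retrieve_table(["country/USA", "country/CAN"], "LifeExpectancy_Person")
print(table.head())
```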
Prompting
Once the tables are retrieved, a prompt is created combining the original query and serialized table data. This prompt is then processed by long-context LLMs like Gemini 1.5 Pro to generate and return a comprehensive response.
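The sketch below shows one way this assembly might look, serializing the retrieved tables as CSV and sending them together with the question to Gemini 1.5 Pro via the google-generativeai client; the prompt wording and model name string are assumptions rather than DataGemma’s exact setup.

```python
# Sketch of the final prompting step: retrieved tables are serialized (here as
# CSV) and prepended to the user's question before calling a long-context
# model. The prompt wording and model name are assumptions.
import google.generativeai as genai

def answer_with_tables(user_query: str, tables) -> str:
    serialized = "\n\n".join(df.to_csv() for df in tables)
    prompt = ("Using only the statistics in the tables below, answer the question.\n\n"
              f"Tables:\n{serialized}\n\n"
              f"Question: {user_query}")
    genai.configure(api_key="YOUR_API_KEY")  # placeholder
    model = genai.GenerativeModel("gemini-1.5-pro")
    return model.generate_content(prompt).text

# Example usage with the table retrieved above:
# print(answer_with_tables("Is life expectancy improving in the US?", [table]))
```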
Conclusion
I hope this information proves helpful. I plan to continue exploring DataGemma, focusing on evaluating RIG and RAG approaches and their implementation in code.
Stay tuned for the next updates on this topic. Take care!