Your Data's Diary: Why AI Needs to Read It
Tal Segalov, Solid's Co-Founder and CTO, shares insights on how to leverage AI to help make the vision of "AI for Analytics" a reality. This post covers both the why and the how.
In the industrial age, people were often seen as cogs in a giant machine, a concept famously satirized by Charlie Chaplin. Today, in the information age, that machine has evolved. It's no longer just a collection of mechanical parts; it has a brain, and that brain is your company’s data warehouse. It holds the collective memory of your entire organization—every transaction, customer interaction, and operational event. But just like a human brain, without context, this information can be fragmented and unreliable. We might remember a face but not a name, an event but not the reason it happened.
This is the challenge we face today. We're on the cusp of a new AI revolution, where we expect AI to query our data stacks and give us brilliant, insightful answers. But how can we expect an AI to be a genius if we’re handing it a library with no card catalog? For an AI to work its magic, it can’t just have access to raw data; it needs to understand the data's story. It requires a diary—a detailed, granular documentation of every table, metric, and model. It needs to know not just what a column is, but why it matters, and how it has been used successfully in the past.
The main tenets of AI-ready documentation
So, what does this AI-ready documentation look like? It has to be far more than a simple description. It needs to be fine-grained, down to the column level, and it must be smart enough to filter out the noise. We all know that well over 95% of the queries running in a data warehouse are of low quality, or contain no new information to be learned. The key is to identify the high-quality assets—the trusted tables, the insightful queries, the reliable dashboards—and focus our documentation efforts there. It’s about documenting not just the tables, but also providing qualified examples of how they are used correctly to answer real business questions.
This sounds like a monumental task, and doing it manually is a non-starter. Even if you could buy enough pizza to lure every data engineer into documenting a schema, the maintenance is impossible. Within months, that documentation would degrade to the point where no one trusts it anymore. The beautiful irony is that we can use AI to build the documentation that the next generation of AI needs. By analyzing usage patterns, user history, and other metadata, an AI can automatically identify the "golden paths" in your data. It can learn to distinguish high-quality queries from ad-hoc noise, reuse existing documentation from tools like dbt, and refresh this knowledge daily. Crucially, it must also know what it doesn’t know, and have a feedback loop to learn from human experts.
To be truly effective, this new breed of documentation must have several key properties, pulled together in the sketch after this list:
Fine-grained detail: It must go beyond tables to document at the column and individual query component level.
Strong quality filters: It needs to automatically identify and surface the high-quality, trusted data assets and filter out the noise.
Qualified queries: It’s not enough to just document tables; it must also capture how they are used correctly by providing unique, informative queries as a reference.
Smart contextual search: It requires a search capability that understands user intent, combining semantic and keyword approaches to deliver relevant results.
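To make these properties a bit more concrete, here is a minimal sketch, in Python, of what a single documentation record could carry. The table name, columns, scores, and example query are invented for illustration; a real system would generate and refresh them automatically.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class ColumnDoc:
    name: str
    description: str
    sample_values: List[str] = field(default_factory=list)  # a few distinct values

@dataclass
class QualifiedQuery:
    sql: str
    business_question: str   # the real question this query answers
    quality_score: float     # 0..1, how much the system trusts this example

@dataclass
class TableDoc:
    table: str
    description: str
    columns: List[ColumnDoc]
    example_queries: List[QualifiedQuery]

# An illustrative record for a hypothetical orders table
doc = TableDoc(
    table="analytics.orders",
    description="One row per completed customer order.",
    columns=[
        ColumnDoc("country_code", "ISO 3166-1 alpha-2 country of the buyer",
                  sample_values=["US", "CA", "FR", "DE"]),
    ],
    example_queries=[
        QualifiedQuery(
            sql="SELECT country_code, COUNT(*) FROM analytics.orders GROUP BY 1",
            business_question="How many orders did we receive per country?",
            quality_score=0.92,
        ),
    ],
)
```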
A proposed way to automate this
So, how do we build this automated, intelligent documentation system? It’s a multi-layered process that starts with language and ends with a deeply interconnected understanding of your data.
First, Learn the Business Lexicon: The process doesn’t begin with SQL schemas. It starts by learning the unique language of your business. An AI-driven process should read through your company's unstructured knowledge bases—internal wikis, product documentation, support tickets, and even relevant Slack channels. From this, it builds a dynamic business glossary and feeds it into a searchable vector database. This ensures that when a user asks a question, the system understands the query in the context of your company's specific terminology.
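As a rough illustration, here is a minimal sketch of the glossary-to-vector-index step in Python. It assumes the sentence-transformers library for embeddings and uses a simple in-memory cosine search in place of a dedicated vector database; the glossary entries are made-up examples.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Hypothetical glossary distilled from wikis, docs, and Slack
glossary = {
    "ARR": "Annual recurring revenue, summed over active subscriptions.",
    "activated user": "A user who completed onboarding and ran a first query.",
}

model = SentenceTransformer("all-MiniLM-L6-v2")
terms = list(glossary)
# Embed "term: definition" pairs; unit-normalized so dot product = cosine similarity
vectors = model.encode([f"{t}: {d}" for t, d in glossary.items()],
                       normalize_embeddings=True)

def lookup(question: str, top_k: int = 1):
    """Return the glossary terms most relevant to a user question."""
    q = model.encode([question], normalize_embeddings=True)[0]
    scores = vectors @ q
    best = np.argsort(scores)[::-1][:top_k]
    return [(terms[i], float(scores[i])) for i in best]

print(lookup("how do we define recurring revenue?"))
```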
Next, Map the Data Stack: Once the system speaks your language, it begins to map your entire data landscape. This involves ingesting metadata from every level of the stack: the structure of tables and columns, the history of queries run against them, the logic within dbt models, and the composition of BI reports and dashboards. This creates a comprehensive, multi-layered inventory of all data assets.
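A simple sketch of what that unified inventory could look like in Python follows. The dbt snippet mirrors the general shape of a manifest.json (exact field names vary across dbt versions), and the table and dashboard entries stand in for whatever your warehouse and BI APIs actually return.

```python
# Hypothetical inputs from three layers of the stack
manifest = {
    "nodes": {
        "model.shop.orders_enriched": {
            "resource_type": "model",
            "name": "orders_enriched",
            "depends_on": {"nodes": ["source.shop.raw_orders"]},
        },
    }
}
warehouse_tables = [{"schema": "analytics", "table": "orders_enriched",
                     "columns": ["order_id", "country_code", "amount"]}]
dashboards = [{"tool": "looker", "name": "Revenue Overview",
               "tables": ["analytics.orders_enriched"]}]

# Fold everything into one multi-layered asset inventory
inventory = []
for node_id, node in manifest["nodes"].items():
    if node["resource_type"] == "model":
        inventory.append({"kind": "dbt_model", "id": node_id,
                          "upstream": node["depends_on"]["nodes"]})
for t in warehouse_tables:
    inventory.append({"kind": "table", "id": f'{t["schema"]}.{t["table"]}',
                      "columns": t["columns"]})
for d in dashboards:
    inventory.append({"kind": "dashboard", "id": d["name"], "reads": d["tables"]})

print(len(inventory), "assets inventoried")
```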
Apply a Strong Quality Filter: This is perhaps the most crucial step. Not all data is created equal. The system must intelligently filter out the noise—the ad-hoc queries, the deprecated tables, the test dashboards. This is a proprietary step, one that needs to be configured in partnership with the users, to identify and prioritize the high-trust, high-quality sources that the organization relies on.
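There is no single right way to score quality, but as an illustration, here is a toy heuristic in Python. The weights and thresholds are invented, not a recommended calibration; in practice they would be tuned together with the people who actually use the warehouse.

```python
from datetime import datetime, timedelta

def quality_score(q):
    """q: dict with run_count, distinct_users, last_run, sql."""
    score = 0.0
    score += min(q["run_count"], 50) / 50 * 0.4       # run repeatedly
    score += min(q["distinct_users"], 10) / 10 * 0.3  # used by many people
    if datetime.now() - q["last_run"] < timedelta(days=30):
        score += 0.2                                   # still in active use
    if "select *" not in q["sql"].lower():
        score += 0.1                                   # explicit column list
    return score

queries = [
    {"sql": "SELECT country_code, SUM(amount) FROM analytics.orders GROUP BY 1",
     "run_count": 120, "distinct_users": 14,
     "last_run": datetime.now() - timedelta(days=2)},
    {"sql": "select * from analytics.orders limit 10",
     "run_count": 1, "distinct_users": 1,
     "last_run": datetime.now() - timedelta(days=200)},
]

trusted = [q for q in queries if quality_score(q) > 0.6]
print(len(trusted), "trusted queries kept")
```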
Build an Interconnected Knowledge Graph: With the quality assets identified, the system can begin to connect the dots. It iteratively builds and refines a knowledge graph that maps the relationships between assets: this table is the source for that dbt model, which in turn feeds these three BI dashboards. This graph is the foundation for true contextual understanding.
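Here is a small sketch of that graph using networkx, with invented asset names, just to show the table-to-model-to-dashboard chain and the kind of traversal questions it makes cheap to answer.

```python
import networkx as nx

g = nx.DiGraph()
g.add_edge("raw.orders", "dbt:orders_enriched", relation="source_for")
g.add_edge("dbt:orders_enriched", "analytics.orders_enriched", relation="materializes")
for dash in ["Revenue Overview", "Country Breakdown", "Weekly KPIs"]:
    g.add_edge("analytics.orders_enriched", f"dashboard:{dash}", relation="feeds")

# Everything downstream of the raw table -- useful for impact analysis
print(nx.descendants(g, "raw.orders"))

# Immediate context for documenting one table: what produces it, what consumes it
print(list(g.predecessors("analytics.orders_enriched")),
      list(g.successors("analytics.orders_enriched")))
```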
Generate Context-Rich Documentation: Now, when the system documents an asset, it doesn't just describe the asset in isolation. It leverages the knowledge graph to pull in relevant context from closely related components. For a given table, its documentation might include the key queries that generate it, the most insightful queries that use it, and the critical BI reports that source data from it.
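One simple way to do this is to fold the related context into the prompt that generates the documentation. The sketch below is deliberately minimal: the asset names and example query are invented, and the actual model call is left as a placeholder for whichever LLM provider you use.

```python
def build_doc_prompt(table, upstream, downstream, example_queries):
    """Assemble a documentation prompt from graph context and trusted queries."""
    examples = "\n\n".join(example_queries)
    return (
        f"Document the table {table}.\n"
        f"It is produced by: {upstream}\n"
        f"It feeds: {downstream}\n"
        f"Representative, trusted queries:\n{examples}\n"
        "Explain what the table contains, why it matters, and how to use it."
    )

prompt = build_doc_prompt(
    "analytics.orders_enriched",
    upstream=["dbt:orders_enriched"],
    downstream=["dashboard:Revenue Overview"],
    example_queries=[
        "SELECT country_code, SUM(amount) FROM analytics.orders_enriched GROUP BY 1"
    ],
)
# response = llm.complete(prompt)  # placeholder for your model call
```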
Leverage Sample Values for Clarity: Finally, to make the documentation truly intuitive, the system should include sample data. Seeing a few distinct values in a column can often tell you more than a lengthy description. For example, a field containing values like "US," "CA," "FR," and "DE" is almost certainly a country code, providing instant clarity to the user.
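As a toy illustration, here is how sampling a few distinct values might look, using an in-memory SQLite table as a stand-in for the warehouse (identifiers are assumed trusted in this sketch).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (country_code TEXT, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [("US", 12.0), ("CA", 8.5), ("FR", 20.0), ("DE", 9.9), ("US", 3.0)])

def sample_values(conn, table, column, limit=5):
    """Pull a handful of distinct values so documentation can show them inline."""
    rows = conn.execute(
        f"SELECT DISTINCT {column} FROM {table} LIMIT {limit}"
    ).fetchall()
    return [r[0] for r in rows]

print(sample_values(conn, "orders", "country_code"))  # ['US', 'CA', 'FR', 'DE']
```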
Investing in this kind of automated, intelligent documentation isn't just about cleaning house. It's about laying the foundation for true data intelligence. It’s the difference between giving your AI a messy pile of facts and handing it a well-organized, annotated library. By teaching our AI to read our data's diary, we are unlocking a future where it can move beyond simple queries and become a genuine partner in strategic thinking.