"Sorry for the mess" - everyone's data is messy, and it's Okay
Our CEO & Co-Founder explains why everyone's data is (partially) a mess, why no one has good documentation, and why AI can still make sense of it
A few months ago, Benn Stancil said on a call, “People should stop saying ‘garbage in, garbage out’ with AI. ChatGPT was trained on Reddit, which is full of garbage, and is still producing quality output (hallucinations aside).” (I’m paraphrasing.)
I urge data and analytics leaders to start changing their approach.
Yes, your data stack is somewhat messy. You have all kinds of pipelines, and no one still employed by your firm knows how they work. You have a single table with both STATUS and STATUS2 columns in it, and no one knows why. You started a documentation project in Confluence, but no one really wanted to document, and 50% of it is stale (you don’t know which 50%, of course).
You should stop apologizing for the mess.
Everyone’s house has a bit of a mess in it: shoes in the wrong place; a bowl of Cheerios your 9-year-old left on the table; your 11-year-old’s bike blocking the front door. (And that’s just this morning.)
We all have that. It’s ok.
In the world of data, you don’t need to clean up everything before you can use AI. It’s impossible. It’s a pipe dream. Let’s stop treating it as a prerequisite for AI. You also can’t document everything. No one wants to do that. Humans are terrible at it.
So what can we do? (read below)

Build AI that’s ready to take on the mess
Let’s take a page from Waymo’s book. They’re clearly very successful. (1 2 3)
They didn’t wait for San Francisco to clean itself up: to sort out the streets, the potholes, the weird turns. They didn’t ask San Francisco to straighten Lombard Street.
They built AI that can deal with the mess of a large city.
So I urge data and analytics leaders, as well as the vendors who provide them with software, to assume that this is the reality. We can’t spend the next few years tidying up our house. We live in it, so it’s bound to have some mess at any given time.
That’s exactly what we’re focused on at Solid. Our AI takes in metadata and unstructured information from a variety of sources (the DWH schema and SQL query log, BI metadata, JIRA tickets, Slack messages, dbt code, Confluence docs, etc.) and starts by cleaning it up. “Keep the things that spark joy, toss those that don’t.”
A SQL query that does “SELECT * FROM TABLE LIMIT 100”? Toss.
A BI dashboard no one touched in 3 years? Toss.
A JIRA ticket no one actually worked on from last year? Toss.
A good SQL query run recently by a senior analyst? Keep.
A BI dashboard your execs view daily? Keep.
A JIRA ticket with great comments detailing the changes done to an important data set? Keep!
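To make the keep/toss idea above a bit more concrete, here is a minimal Python sketch. It is not how Solid's AI works; it only hard-codes three of the examples from the list as simple rules. The `Artifact` type, its fields, the `keep` function, and the three-year cutoff constant are all hypothetical names invented for illustration.

```python
import re
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional


@dataclass
class Artifact:
    """Hypothetical record for one piece of metadata pulled from the stack."""
    kind: str                         # "sql_query", "dashboard", or "ticket"
    text: str = ""                    # query text, if any
    last_used: Optional[date] = None  # last time anyone ran or viewed it
    was_worked_on: bool = False       # for tickets: did anyone actually pick it up?


# Matches exploratory peeks such as "SELECT * FROM TABLE LIMIT 100".
PEEK_QUERY = re.compile(
    r"^\s*select\s+\*\s+from\s+\S+\s+limit\s+\d+\s*;?\s*$", re.IGNORECASE
)

# Assumed cutoff, taken from the "no one touched in 3 years" example.
STALE_AFTER = timedelta(days=3 * 365)


def keep(artifact: Artifact, today: date) -> bool:
    """Return True if the artifact looks worth learning from."""
    if artifact.kind == "sql_query":
        # A bare SELECT * ... LIMIT n says nothing about business logic.
        return PEEK_QUERY.match(artifact.text) is None
    if artifact.kind == "dashboard":
        # Dashboards nobody has opened within the cutoff are tossed.
        return (
            artifact.last_used is not None
            and today - artifact.last_used < STALE_AFTER
        )
    if artifact.kind == "ticket":
        # Tickets no one actually worked on carry no signal.
        return artifact.was_worked_on
    # Unknown kinds are tossed by default in this sketch.
    return False


if __name__ == "__main__":
    today = date(2025, 1, 1)
    artifacts = [
        Artifact("sql_query", text="SELECT * FROM TABLE LIMIT 100"),
        Artifact("dashboard", last_used=date(2024, 12, 31)),
        Artifact("ticket", was_worked_on=False),
    ]
    for a in artifacts:
        print(a.kind, "keep" if keep(a, today) else "toss")
```

Real signals (who ran the query, how senior they are, how good the ticket comments are) are much fuzzier than these three rules, which is the reason to use AI for the filtering rather than a hand-written checklist.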
Once the AI has filtered through our customers’ mess and found the gold in it, it can learn from that gold and build on top of it. Then magic happens.
This is terrific and 'spot on'. GIGO becomes an anachronism if and as one discovers and uncovers ways to swiftly, economically and intelligently sort through the trash to find what's worth keeping. This is the 'essence/enabler' of the 'vibe analytics' exchange we had earlier and, with luck, will incent people to revisit and rethink how and where 'real signal' may reside in real messes.