What I've been up to lately

2026-05-15

I've been deep into semantic layers for the past... 6 months? Both at work and in my free time. It's a topic that I've been really excited about for a decade, since it aligns well with my career goal: make data easier to consume.

But first...

...what is a semantic layer?

A semantic layer is a layer that adds semantics. In my work, a semantic layer adds semantics to tables in a database. If all you have is a database with many tables, it's hard to start working on the data. Imagine you want to calculate how many sales you made in the last month; you look at your database and you see a table called fact_sales and another called fact_orders. Which one should you use?

You ask a co-worker and they tell you to use fact_orders. You write some SQL:

SELECT COUNT(*) AS total_sales
FROM fact_orders
WHERE
  year = 2026 AND
  month = 4;

The result you get back looks suspiciously high. You ask another co-worker and learn that the table has records used for testing, so you need to exclude the test application ID:

SELECT COUNT(*) AS total_sales
FROM fact_orders
WHERE
  year = 2026 AND
  month = 4 AND
  app_id <> 12345;

Your manager likes the numbers, but asks you to break them down by age buckets, to better understand the demographics of your customers. Now you need to find out how to JOIN the table to a dimension table that has that information. Which table should you use?

A semantic layer is a way of formalizing this tribal knowledge so that it can be reused with confidence. You use the semantic layer to formally define metrics, entities, and their relationships. It eliminates the guesswork, and provides a curated set of the things that are needed when asking important questions about your product or business.

With a semantic layer you no longer write SQL; instead, you make semantic requests:

What metrics are available?
What dimensions can I use with metrics a and b?
What is the value of metric c by country and gender for the last year?

It's a much better user experience! And if you're using AI agents to answer questions about your data, you're much more likely to get correct answers if the agents are talking to the semantic layer instead of to the database directly. (Unless you used gen AI to define your semantic layer, that is.)
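
To make that concrete, here's a minimal sketch in Python of what could sit behind a semantic request. It isn't the API of any real semantic layer: the Metric class, the METRICS registry, and the list_metrics and build_query helpers are all made up for illustration. The point is just that the tribal knowledge from the fact_orders example (which table to use, which test app to exclude) is written down once, and the SQL is generated from it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Metric:
    """A curated metric: which table to use, how to aggregate, what to exclude."""

    name: str
    table: str
    expression: str
    filters: tuple[str, ...] = ()


# Hypothetical registry. The single definition mirrors the fact_orders example.
METRICS = {
    "total_sales": Metric(
        name="total_sales",
        table="fact_orders",
        expression="COUNT(*)",
        filters=("app_id <> 12345",),  # always exclude the test application
    ),
}


def list_metrics() -> list[str]:
    """Answer the request 'what metrics are available?'."""
    return sorted(METRICS)


def build_query(metric_name: str, where: tuple[str, ...] = ()) -> str:
    """Generate SQL for a metric, combining its built-in filters with the caller's."""
    metric = METRICS[metric_name]
    predicates = metric.filters + tuple(where)
    sql = f"SELECT {metric.expression} AS {metric.name}\nFROM {metric.table}"
    if predicates:
        sql += "\nWHERE " + " AND ".join(predicates)
    return sql


print(list_metrics())
print(build_query("total_sales", where=("year = 2026", "month = 4")))
```

The caller only names the metric and the time range; the test-record filter comes along automatically, so nobody has to remember it.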

Apache Superset

Superset is the modern business intelligence web application that I work on. It connects to 50+ different databases, and lets you run SQL and build charts and interactive dashboards. It can also enforce data permissions, applying row-level security and ensuring that users can only access certain tables.

Superset is database-centric. It can support so many databases because it leverages SQLAlchemy (though it has its own abstraction layer on top of it). Historically, in order to connect Superset to a "database", you needed a SQLAlchemy dialect. This is why I wrote Shillelagh, a Python library that allows querying APIs via SQL: it was created to allow users to use Google Sheets as if it were a database. In the past I have also written SQLAlchemy dialects for Apache Druid and Apache Pinot, all for Superset.

Previous attempts to connect Superset to semantic layers, like dbt MetricFlow, were made in a similar way: a SQLAlchemy dialect presented the semantic layer to the users as a pseudo-database. This approach had many flaws, mostly because Superset has its own semantic layer, so we were stacking semantics in a way that didn't really make sense. None of the integrations built (dbt MetricFlow, Snowflake, DataJunction) made it to the public, and as far as I know the only big integration is an in-house one at Airbnb, with their semantic layer Minerva.

Last year I wrote a Superset Improvement Proposal (SIP) to introduce semantic layers as a first-class citizen in Superset, alongside databases. I spent a few months working on a clean foundation that allows us to quickly add new semantic layers, even though they are much more heterogeneous than databases. The vote passed, and the implementation was merged. Before, Superset had databases and datasets; now it has data connections (databases and semantic layers) and data sources (datasets and semantic views).

Yes, the terminology is a mess. I know. ¯\_(ツ)_/¯

💫 Cantrip

I have built semantic layers in the past. Back in 2016 I worked on DJ, a semantic layer used internally at Facebook mostly for experimentation. DJ was pretty unique because it allowed you to define metrics using SQL. This allowed for a lot of expressiveness, but because the metric layer was database-agnostic the metrics had to be defined using a neutral ANSI SQL dialect. This made it really hard to translate complex real-world metric definitions into DJ SQL. Data engineers who were used to their functions and dialects had to learn something new and less powerful, which made adoption hard.

DJ was still relevant in 2021, so I re-wrote it from scratch as open-source, and helped a team of engineers from Netflix to adopt it in their data platform. Eventually, the team took over the project and the development, and the project is still maintained today.

In 2023 I brainstormed a new semantic layer with the creator of Superset (and my CEO), which we called All ⭐ Stars. We asked ourselves the question: what if we just infer semantics from the database, and allow users to progressively enrich them with metadata? The project was more of a manifesto than a semantic layer, but we did write some code.

Snowflake released their semantic layer in 2025. It immediately caught my attention because it was clever: it was just SQL, native SQL. It's not standard SQL, so it requires a custom tokenizer to parse it. And it doesn't offer the full expressiveness of SQL when defining metrics. But it's incredibly easy to adopt for someone who's already using Snowflake, because no additional service is needed: the semantic layer lives in the database! It's great if you're OK with the vendor lock-in.

This led me to start a new open-source semantic layer that I called 💫 Cantrip. Cantrip draws inspiration from all of these semantic layers, with none of their drawbacks:

Metrics are defined as VIEWs in the database. You can use the full expressiveness of SQL, and you can use native SQL. No need to learn a new dialect to define metrics or relationships.
It's database-agnostic. I've tested it with 14 different databases. Most of the work is done by manipulating an AST, so adding a new database is trivial, as long as it's supported by sqlglot.
Relationships and other metadata are inferred, and can be added manually by creating VIEWs. VIEWs can indicate relationships, time grains, geospatial grains, preferred join keys, and much more.

Cantrip is not a service. It's a library that fetches metadata from the database and builds a semantic graph of metrics and dimensions (and other entities). You can then make semantic requests, and it will generate the SQL needed to run the query. Super lightweight and simple!
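
Here's a toy sketch of the general idea of "the semantic layer lives in the database", using Python's built-in sqlite3 module. To be clear, this is not Cantrip's API or how it works internally: the metric_ naming prefix and the discover_metrics and metric_value helpers are assumptions invented for this example. It only shows that a metric can be stored as a plain VIEW, and that a library with no service behind it can find those definitions by reading the database catalog.

```python
import sqlite3

METRIC_PREFIX = "metric_"  # hypothetical naming convention, for this sketch only

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE fact_orders (order_id INTEGER, app_id INTEGER, year INTEGER, month INTEGER);

    -- Three real orders and one test order (app_id 12345).
    INSERT INTO fact_orders VALUES
        (1, 1, 2026, 4),
        (2, 1, 2026, 4),
        (3, 12345, 2026, 4),
        (4, 2, 2026, 3);

    -- The metric definition is stored in the database as an ordinary view.
    CREATE VIEW metric_total_sales AS
    SELECT COUNT(*) AS total_sales
    FROM fact_orders
    WHERE app_id <> 12345;
    """
)


def discover_metrics(conn: sqlite3.Connection) -> dict[str, str]:
    """Map metric names to their view definitions, read from the SQLite catalog."""
    rows = conn.execute(
        "SELECT name, sql FROM sqlite_master WHERE type = 'view' ORDER BY name"
    ).fetchall()
    return {
        name[len(METRIC_PREFIX):]: sql
        for name, sql in rows
        if name.startswith(METRIC_PREFIX)
    }


def metric_value(conn: sqlite3.Connection, metric: str):
    """Evaluate a discovered metric by selecting from its view."""
    if metric not in discover_metrics(conn):
        raise KeyError(f"unknown metric: {metric}")
    (value,) = conn.execute(f'SELECT * FROM "{METRIC_PREFIX}{metric}"').fetchone()
    return value


print(list(discover_metrics(conn)))
print(metric_value(conn, "total_sales"))
```

A real semantic layer does far more than this (dimensions, joins, grains, SQL generation across dialects), but the deployment story is the same: nothing to run besides the database you already have.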

I'm currently working on the documentation, and as I write it I go back and fix things. I like writing code that is easy to explain; if it's hard to explain, the code needs to be changed. As soon as I have the documentation up I will finish integrating it with Superset and make a release. I believe it's going to be really useful for Superset users, since they will be able to benefit from all the semantic layer work that I have been doing for the past 6 months without having to pay for or run a service.