Forma Engineering Blog Series: From EAV to Zero-Dirty-Read Lakehouse
Three posts that explain a flexible data storage engine designed for the AI era
The 3 AM Wake-Up Call
It's 3 AM. Your pipeline crashed. Why? Because the model you deployed last week now outputs a `confidence_score` field that didn't exist yesterday. Your database rejected it. Your monitoring didn't catch it. Your customers did.
This is the reality of AI-powered applications: your data structures evolve faster than your database schema ever could.
The Receipt Analogy
Think about your grocery receipt. A traditional SQL database is like a receipt that lists every possible item—bananas, steaks, shampoo, sushi—and prints "0" next to the ones you didn't buy. Want to sell a new item? Reprint every receipt format.
An EAV-based system is different: it just lists what you actually bought. Chips, soda, done. New item? Just add it to the list. No reprinting.
This is the core insight behind Forma: store only what exists, and let the schema evolve with your AI's outputs.
A Note to the Skeptics
If you've been in data engineering for a while, you're probably thinking: "EAV? That anti-pattern that destroys performance and makes queries a nightmare?"
You're right to be skeptical. EAV has earned its bad reputation. But we've found a way to tame it—and this series will show you exactly how. We'll address the performance problems (Post 2) and the consistency fears (Post 3), and show you a production-ready architecture that's already handling billions of records.
What is Forma?
Forma is a flexible data storage engine designed for the AI era. It's built on three core technology choices:
| Technology | Purpose | Problem Solved |
|---|---|---|
| EAV Pattern | Attributes stored as rows, no DDL for new fields | Schema flexibility |
| JSON Schema | AI-native data contracts, validation on write | Type safety + AI integration |
| PostgreSQL + DuckDB | OLTP and OLAP working together, hot/cold separation | Performance + cost balance |
Three Problems We're Solving
Problem One: Rapid AI Data Structure Iteration
Your AI Agent outputs 12 fields today, 30 fields tomorrow, and adds 5 new fields next week. Traditional database DDL workflows (file ticket → approval → downtime → ALTER TABLE) simply can't keep up with this pace.
→ Post One explains why the combination of JSON Schema + EAV + Hot Table is the ideal architecture for AI applications: zero DDL, instant effect, type-safe.
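To make "zero DDL, instant effect" concrete, here is a minimal sketch of the EAV idea using Python's built-in `sqlite3` as a stand-in for PostgreSQL. The table and column names (`attributes`, `entity_id`, `attr_name`, `attr_value`) are illustrative assumptions, not Forma's actual schema:

```python
import sqlite3

# One generic attributes table instead of one column per model field.
# All names here are illustrative, not Forma's real schema.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE attributes (
        entity_id  TEXT,
        attr_name  TEXT,
        attr_value TEXT
    )
""")

def store_output(entity_id: str, output: dict) -> None:
    """Persist whatever fields the model emitted -- no ALTER TABLE needed."""
    conn.executemany(
        "INSERT INTO attributes VALUES (?, ?, ?)",
        [(entity_id, k, str(v)) for k, v in output.items()],
    )

# Yesterday's model output:
store_output("doc-1", {"summary": "...", "sentiment": "positive"})
# Today's model adds confidence_score -- it is just another row:
store_output("doc-2", {"summary": "...", "confidence_score": 0.93})

rows = conn.execute(
    "SELECT attr_name FROM attributes WHERE entity_id = 'doc-2'"
).fetchall()
stored_fields = sorted(r[0] for r in rows)  # new field stored with zero DDL
```

The new field lands as a row, not a column, so the schema never changes; Post One covers how JSON Schema validation sits in front of this write path.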
Problem Two: The N+1 Query Performance Nightmare
The EAV pattern is flexible—adding new fields just means inserting rows, no schema changes. But its query performance is notoriously bad: fetching 100 records might require 101 database round-trips, easily pushing latency past one second.
→ Post Two shows how to use PostgreSQL's CTE + JSON_AGG to reduce queries from 101 to 1, cutting latency from 1000ms to 25ms.
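The shape of that fix can be sketched in a few lines. This toy version uses SQLite's `json_group_object` as a stand-in for PostgreSQL's `JSON_AGG`/`JSON_OBJECT_AGG` and a plain `JOIN` rather than the full CTE from Post Two; the schema and names are illustrative assumptions:

```python
import json
import sqlite3

# Illustrative schema, not Forma's actual one.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE entities (id TEXT PRIMARY KEY);
    CREATE TABLE attributes (entity_id TEXT, attr_name TEXT, attr_value TEXT);
    INSERT INTO entities VALUES ('a'), ('b');
    INSERT INTO attributes VALUES
        ('a', 'sentiment', 'positive'),
        ('a', 'confidence_score', '0.93'),
        ('b', 'sentiment', 'negative');
""")

# N+1 shape: one query for the ids, then one query PER entity.
ids = [r[0] for r in conn.execute("SELECT id FROM entities")]
n_plus_1 = {
    eid: dict(conn.execute(
        "SELECT attr_name, attr_value FROM attributes WHERE entity_id = ?",
        (eid,),
    ))
    for eid in ids
}  # 1 + len(ids) round-trips

# Single-query shape: aggregate each entity's attributes server-side.
single = {
    eid: json.loads(pairs)
    for eid, pairs in conn.execute("""
        SELECT e.id, json_group_object(a.attr_name, a.attr_value)
        FROM entities e
        JOIN attributes a ON a.entity_id = e.id
        GROUP BY e.id
    """)
}  # 1 round-trip, same result
```

Both shapes return identical data; the difference is that the second pays the network round-trip cost once instead of once per record.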
Problem Three: Consistency with Massive Historical Data
When data reaches billions of records, hot/cold separation becomes inevitable. But while "Lakehouse" sounds great, every engineer has the same nagging question: How do I know the data I'm querying isn't dirty?
→ Post Three explains in detail how Forma uses Anti-Join + Dirty Set mechanisms to ensure federated queries never read uncommitted or inconsistent data.
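The intuition behind that mechanism fits in a few lines of plain Python. This is a deliberately simplified sketch, not Forma's implementation: `hot` stands for the PostgreSQL side, `cold` for the Parquet/lake side, and `dirty` for the set of ids whose cold copies are known to be stale:

```python
# Minimal sketch of the anti-join + dirty-set idea. All names are
# illustrative: "hot" is the authoritative OLTP side, "cold" is the
# archived lake side, "dirty" marks ids whose cold copies are stale.
hot   = {"doc-3": {"sentiment": "neutral"}}                  # current writes
cold  = {"doc-1": {"sentiment": "positive"},
         "doc-2": {"sentiment": "negative"},
         "doc-3": {"sentiment": "STALE"}}                    # archived rows
dirty = {"doc-3"}                                            # stale in cold

def federated_read() -> dict:
    # Anti-join: keep only cold rows whose id is NOT in the dirty set,
    # then overlay the hot side, which always wins.
    clean_cold = {eid: row for eid, row in cold.items() if eid not in dirty}
    return {**clean_cold, **hot}

result = federated_read()
```

In SQL terms, the dict comprehension is the anti-join (`WHERE id NOT IN dirty_set`): the stale cold copy of `doc-3` can never leak into the result because its id is filtered out before the hot and cold sides are merged.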
Reading Guide: Choose Based on Your Scenario
Have common questions? Check the FAQ.
| Your Scenario | Start Here |
|---|---|
| Building AI applications, need flexible data storage | Post 1: AI Architecture |
| Struggling with N+1 queries, want quick performance gains | Post 2: Killing N+1 |
| Data growing, considering hot/cold separation | Post 3: Serverless Lakehouse |
| Want comprehensive understanding of Forma architecture | Read all three in order |
The Series
[Post 1] Why EAV is the Most Underrated Data Model for AI
TL;DR: JSON Schema isn't just a validation tool—it's the core of AI-Ready infrastructure. Combined with hot table design, achieve "AI output → instant validation → zero-DDL storage."
→ Read in English | Read the Chinese version
[Post 2] Killing N+1: How One SQL Trick Cut Our Latency by 40x
TL;DR: Using PostgreSQL CTE + JSON_AGG, we reduced database round-trips from 101 to 1, cutting latency by 97%.
→ Read in English | Read the Chinese version
[Post 3] Zero Dirty Reads: Building a Trustworthy Lakehouse with DuckDB
TL;DR: PostgreSQL handles "the present," DuckDB + Parquet handles "the past." Anti-Join + Dirty Set mechanisms ensure zero dirty reads in federated queries.
→ Read in English | Read the Chinese version
About Forma
Forma is an open-source project dedicated to providing flexible, high-performance, and cost-effective data storage solutions for the AI era.
If this series has been helpful, please consider starring our project on GitHub or joining the community discussion!