
Forma Engineering Blog Series: From EAV to Zero-Dirty-Read Lakehouse

Three posts that explain a flexible data storage engine designed for the AI era

The 3 AM Wake-Up Call

It's 3 AM. Your pipeline crashed. Why? Because the model you deployed last week now outputs a confidence_score field that didn't exist yesterday. Your database rejected it. Your monitoring didn't catch it. Your customers did.

This is the reality of AI-powered applications: your data structures evolve faster than your database schema ever could.

The Receipt Analogy

Think about your grocery receipt. A traditional SQL database is like a receipt that lists every possible item—bananas, steaks, shampoo, sushi—and prints "0" next to the ones you didn't buy. Want to sell a new item? Reprint every receipt format.

An EAV-based system is different: it just lists what you actually bought. Chips, soda, done. New item? Just add it to the list. No reprinting.

This is the core insight behind Forma: store only what exists, and let the schema evolve with your AI's outputs.
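To make the receipt analogy concrete, here is a minimal Python sketch of the idea, not Forma's actual storage layout. The function name `to_eav_rows` and the `(entity_id, attribute, value)` row shape are assumptions chosen for illustration.

```python
from typing import Any

def to_eav_rows(entity_id: str, record: dict[str, Any]) -> list[tuple[str, str, Any]]:
    """Turn one record into (entity_id, attribute, value) rows, skipping absent values."""
    return [(entity_id, attr, value) for attr, value in record.items() if value is not None]

# Only what was actually bought gets a row.
rows = to_eav_rows("receipt-1", {"chips": 2, "soda": 1})

# A field that did not exist yesterday is just one more row, with no schema change.
rows_v2 = to_eav_rows("receipt-2", {"chips": 1, "confidence_score": 0.93})
```

In a wide table, `confidence_score` would have required a new column before the second record could be stored.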

A Note to the Skeptics

If you've been in data engineering for a while, you're probably thinking: "EAV? That anti-pattern that destroys performance and makes queries a nightmare?"

You're right to be skeptical. EAV has earned its bad reputation. But we've found a way to tame it, and this series will show you exactly how. We'll address the performance problems (Post 2) and the consistency fears (Post 3), and show you a production-ready architecture that's already handling billions of records.


What is Forma?

Forma is a flexible data storage engine designed for the AI era. It's built on three core technology choices:

| Technology | Purpose | Problem Solved |
| --- | --- | --- |
| EAV Pattern | Attributes stored as rows, no DDL for new fields | Schema flexibility |
| JSON Schema | AI-native data contracts, validation on write | Type safety + AI integration |
| PostgreSQL + DuckDB | OLTP and OLAP working together, hot/cold separation | Performance + cost balance |

Three Problems We're Solving

Problem One: Rapid AI Data Structure Iteration

Your AI agent outputs 12 fields today, 30 tomorrow, and 5 more next week. Traditional database DDL workflows (file ticket → approval → downtime → ALTER TABLE) simply can't keep up with this pace.

→ Post One explains why the combination of JSON Schema + EAV + Hot Table is the ideal architecture for AI applications: zero DDL, instant effect, type-safe.
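The "validate on write, store without DDL" flow can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Forma's implementation: the hand-rolled `validate` covers only a small subset of JSON Schema (`properties`, `type`, `required`, `additionalProperties`), where a real system would use a full validator, and the names `SCHEMA`, `validate`, and `write` are hypothetical.

```python
from typing import Any

# Hypothetical data contract for an AI agent's output.
SCHEMA: dict[str, Any] = {
    "properties": {
        "label": {"type": "string"},
        "confidence_score": {"type": "number"},
    },
    "required": ["label"],
    "additionalProperties": False,
}

def _matches(value: Any, type_name: str) -> bool:
    """Check a value against the few JSON Schema type names this sketch supports."""
    if type_name == "string":
        return isinstance(value, str)
    if type_name == "boolean":
        return isinstance(value, bool)
    if type_name == "number":
        # bool is a subclass of int in Python, so exclude it explicitly.
        return isinstance(value, (int, float)) and not isinstance(value, bool)
    raise ValueError(f"unsupported type in this sketch: {type_name}")

def validate(record: dict[str, Any], schema: dict[str, Any]) -> list[str]:
    """Return a list of validation errors; an empty list means the record is valid."""
    errors = [f"missing required field: {name}"
              for name in schema.get("required", []) if name not in record]
    for name, value in record.items():
        spec = schema.get("properties", {}).get(name)
        if spec is None:
            if not schema.get("additionalProperties", True):
                errors.append(f"unknown field: {name}")
        elif not _matches(value, spec["type"]):
            errors.append(f"{name}: expected {spec['type']}")
    return errors

def write(store: list, entity_id: str, record: dict[str, Any], schema: dict[str, Any]) -> None:
    """Validate first, then append one EAV row per attribute."""
    errors = validate(record, schema)
    if errors:
        raise ValueError("; ".join(errors))
    store.extend((entity_id, attr, value) for attr, value in record.items())

store: list = []
write(store, "run-1", {"label": "cat", "confidence_score": 0.93}, SCHEMA)

# Evolving the contract is a data change, not an ALTER TABLE.
SCHEMA["properties"]["reasoning"] = {"type": "string"}
write(store, "run-2", {"label": "dog", "reasoning": "floppy ears"}, SCHEMA)
```

The point of the sketch is the ordering: the schema rejects malformed output before it reaches storage, and adding a field touches only the schema document.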

Problem Two: The N+1 Query Performance Nightmare

The EAV pattern is flexible: adding a new field just means inserting rows, with no schema changes. But its query performance is notoriously bad. Fetched naively, 100 records can take 101 database round-trips (one query for the list of records, then one per record for its attributes), easily pushing latency past one second.

→ Post Two shows how to use a PostgreSQL CTE with JSON_AGG to cut the number of queries from 101 to 1 and latency from 1000 ms to 25 ms.
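The round-trip arithmetic is easy to reproduce. The Python sketch below uses an in-memory SQLite database as a stand-in for PostgreSQL, so it does not use JSON_AGG: it fetches all attribute rows through one CTE-based query and groups them client-side, whereas the post's approach aggregates inside PostgreSQL. The table names `entity` and `eav` are hypothetical.

```python
import sqlite3
from collections import defaultdict

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE entity (id INTEGER PRIMARY KEY);
    CREATE TABLE eav (entity_id INTEGER, attr TEXT, value TEXT);
""")
conn.executemany("INSERT INTO entity (id) VALUES (?)", [(i,) for i in range(1, 101)])
conn.executemany(
    "INSERT INTO eav VALUES (?, ?, ?)",
    [(i, attr, f"{attr}-{i}") for i in range(1, 101) for attr in ("label", "source")],
)

def fetch_n_plus_one(conn):
    """One query for the ids, then one query per id."""
    queries = 1
    ids = [row[0] for row in conn.execute("SELECT id FROM entity ORDER BY id")]
    records = {}
    for entity_id in ids:
        queries += 1
        rows = conn.execute(
            "SELECT attr, value FROM eav WHERE entity_id = ?", (entity_id,)
        )
        records[entity_id] = dict(rows)
    return records, queries

def fetch_single_query(conn):
    """One CTE-based query returns every attribute row; grouping happens in Python."""
    sql = """
        WITH page AS (SELECT id FROM entity ORDER BY id LIMIT 100)
        SELECT page.id, eav.attr, eav.value
        FROM page LEFT JOIN eav ON eav.entity_id = page.id
    """
    records = defaultdict(dict)
    for entity_id, attr, value in conn.execute(sql):
        record = records[entity_id]  # creates an empty record even with no attributes
        if attr is not None:
            record[attr] = value
    return dict(records), 1

slow_records, slow_queries = fetch_n_plus_one(conn)
fast_records, fast_queries = fetch_single_query(conn)
```

Both functions return the same 100 records; only the number of queries differs.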

Problem Three: Consistency with Massive Historical Data

When data reaches billions of records, hot/cold separation becomes inevitable. But while "Lakehouse" sounds great, every engineer has the same nagging question: How do I know the data I'm querying isn't dirty?

→ Post Three explains in detail how Forma uses Anti-Join + Dirty Set mechanisms to ensure federated queries never read uncommitted or inconsistent data.
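Post Three has the actual mechanism. As a rough mental model only, the Python sketch below assumes that the "dirty set" holds the ids of records changed in the hot store since the last export to cold storage, and that the anti-join drops those ids from the cold side before the two sides are merged. The function name `federated_read` and the dict-based stores are hypothetical.

```python
from typing import Any

Row = dict[str, Any]

def federated_read(hot: dict[int, Row], cold: dict[int, Row], dirty_ids: set[int]) -> dict[int, Row]:
    """Merge hot and cold rows, never returning a cold row whose id is dirty."""
    # Anti-join: keep only cold rows with no matching id in the dirty set.
    clean_cold = {rid: row for rid, row in cold.items() if rid not in dirty_ids}
    # Hot rows are current by definition, so they win on any overlap.
    return {**clean_cold, **hot}

cold = {1: {"status": "draft"}, 2: {"status": "done"}, 3: {"status": "old"}}
hot = {1: {"status": "published"}, 4: {"status": "new"}}
dirty_ids = {1, 3}  # assumed: record 1 was updated and record 3 deleted after the last export

result = federated_read(hot, cold, dirty_ids)
```

Under these assumptions the stale cold copy of record 1 and the deleted record 3 never reach the caller, while the untouched record 2 is still served from cold storage.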

Reading Guide: Choose Based on Your Scenario

Have common questions? Check the FAQ.

| Your Scenario | Start Here |
| --- | --- |
| Building AI applications, need flexible data storage | Post 1: AI Architecture |
| Struggling with N+1 queries, want quick performance gains | Post 2: Killing N+1 |
| Data growing, considering hot/cold separation | Post 3: Serverless Lakehouse |
| Want comprehensive understanding of Forma architecture | Read all three in order |

The Series

[Post 1] Why EAV is the Most Underrated Data Model for AI

TL;DR: JSON Schema isn't just a validation tool; it's the core of AI-ready infrastructure. Combined with a hot-table design, it gives you "AI output → instant validation → zero-DDL storage."

Read in English | Read in Chinese

[Post 2] Killing N+1: How One SQL Trick Cut Our Latency by 40x

TL;DR: Using PostgreSQL CTE + JSON_AGG, we reduced database round-trips from 101 to 1, cutting latency by 97%.

Read in English | Read in Chinese

[Post 3] Zero Dirty Reads: Building a Trustworthy Lakehouse with DuckDB

TL;DR: PostgreSQL handles "the present," DuckDB + Parquet handles "the past." Anti-Join + Dirty Set mechanisms ensure zero dirty reads in federated queries.

Read in English | Read in Chinese

About Forma

Forma is an open-source project dedicated to providing flexible, high-performance, and cost-effective data storage solutions for the AI era.

If this series has been helpful, please consider starring our project on GitHub or joining the community discussion!