Why EAV is the Most Underrated Data Model for AI
Forma Engineering Blog · Series Part 1
TL;DR
Your AI pipeline shouldn't crash at 3 AM because your model learned a new field. Your data layer shouldn't require a DBA ticket to accept new attributes. And your downstream models shouldn't hallucinate because they're reading partial data from a mid-migration table.
Forma's combination of EAV + JSON Schema + Hot Tables solves all three problems:
- Instant schema evolution: New fields take effect in seconds, not days
- Type-safe validation: Bad data is rejected before it corrupts your training set
- Query performance that doesn't tank: Hot fields get B-tree indexes, cold fields stay flexible
This post explains why this "old-school" data model is actually the most practical choice for AI-era applications.
Starting with a Real Scenario
Imagine you're building an AI-powered CRM system. A user speaks into their microphone:
"Log this—I just had a call with Mr. Zhang. He's very interested in our new proposal, budget around $500K, let's follow up next Tuesday."
Your AI Agent transforms this into structured data:
{
  "contact_name": "Mr. Zhang",
  "interaction_type": "phone_call",
  "sentiment": "positive",
  "budget_estimate": 500000,
  "next_followup": "2024-01-16",
  "notes": "Interested in new proposal"
}

Now here's the problem: can your database accept this data?
If you're using traditional relational tables:
- No `sentiment` column? Stop the service, run `ALTER TABLE ADD COLUMN`
- A new customer needs an `industry` field? Stop again
- Different customers need different custom fields? One table per customer?
This clearly doesn't scale. In our surveys of 50+ enterprise customers, a single DDL change typically takes 3-7 business days from ticket to deployment.
But a moderately complex AI Agent can produce 10-50 field combination variations per day. That's a two-order-of-magnitude mismatch in cycle time.
Translating DBA Concerns to AI Engineer Problems
If you've read database literature, you've heard terms like "ACID compliance" and "transaction isolation." Here's what they actually mean for your AI pipelines:
| What DBAs Say | What AI Engineers Experience |
|---|---|
| "Zero DDL" | Your ingestion script won't crash because a new field appeared |
| "Schema validation" | Bad data won't silently corrupt your training set |
| "Transaction isolation" | Your model won't train on half-written records |
| "Data consistency" | No more debugging why your embeddings are drifting mysteriously |
When we say "zero dirty reads," we're not trying to impress database academics. We're preventing the scenario where your embedding model trains on a record that was mid-update—leading to subtle, impossible-to-debug model drift.
This is about pipeline stability, not database theory.
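To make "no half-written records" concrete, here is a minimal sketch using Python's stdlib sqlite3 as a stand-in for PostgreSQL (the setup is simpler, but the guarantee illustrated is the same): a concurrent reader never observes a record whose attributes are only partially committed.

```python
import os
import sqlite3
import tempfile

# Two connections to the same database file: one writer, one reader.
path = os.path.join(tempfile.mkdtemp(), "demo.db")
writer = sqlite3.connect(path, isolation_level=None)  # manual transaction control
reader = sqlite3.connect(path, isolation_level=None)

writer.execute("CREATE TABLE eav (entity_id TEXT, attr TEXT, value TEXT)")

writer.execute("BEGIN")
writer.execute("INSERT INTO eav VALUES ('uuid-1', 'contact_name', 'Mr. Zhang')")
# Mid-update: the second attribute has not been written yet.
# The reader sees ZERO rows, not one orphaned attribute.
assert reader.execute("SELECT COUNT(*) FROM eav").fetchone()[0] == 0
writer.execute("INSERT INTO eav VALUES ('uuid-1', 'budget_estimate', '500000')")
writer.execute("COMMIT")

# After commit, the complete record becomes visible atomically.
assert reader.execute("SELECT COUNT(*) FROM eav").fetchone()[0] == 2
```

An ingestion pipeline built on this guarantee can never feed an embedding job a contact that has a name but no budget simply because it read between two writes.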
JSON Schema: The "Type System" for AI Output
This is where JSON Schema's value shines.
It's Not Just Validation—It's a Contract
JSON Schema has become the de facto standard for structured output from Large Language Models (LLMs):
- OpenAI Structured Outputs: Uses JSON Schema to define function return formats
- Anthropic Tool Use: Uses JSON Schema to describe tool parameters
- Google Gemini Function Declarations: Also based on JSON Schema
This means when you define a data structure with JSON Schema, you're simultaneously defining:
- AI output format: The LLM knows what structure to return
- Validation rules: Automatic type, format, and range checking before writes
- Database schema: Forma uses it directly to organize storage
One definition, three purposes.
A JSON Schema Example
{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "type": "object",
  "properties": {
    "contact_name": {
      "type": "string",
      "minLength": 1,
      "x-ltbase-column": "text_01"
    },
    "budget_estimate": {
      "type": "integer",
      "minimum": 0,
      "x-ltbase-column": "integer_01"
    },
    "sentiment": {
      "type": "string",
      "enum": ["positive", "neutral", "negative"]
    },
    "next_followup": {
      "type": "string",
      "format": "date"
    },
    "notes": {
      "type": "string"
    }
  },
  "required": ["contact_name"]
}

Notice the `"x-ltbase-column": "integer_01"`? That's a Forma extension keyword; we'll explain its purpose shortly.
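What "validated before write" means in practice can be shown with a hand-rolled checker for exactly the keywords used in the schema above. A real deployment would use a full JSON Schema validator (such as the `jsonschema` package); this stdlib-only sketch exists purely to make the checks concrete.

```python
from datetime import date

def validate_contact(record: dict) -> list[str]:
    """Return a list of violations; an empty list means the record is valid."""
    errors = []
    name = record.get("contact_name")
    if not isinstance(name, str) or len(name) < 1:  # required + minLength 1
        errors.append("contact_name: non-empty string required")
    if "budget_estimate" in record:
        b = record["budget_estimate"]
        if not isinstance(b, int) or isinstance(b, bool) or b < 0:  # integer, minimum 0
            errors.append("budget_estimate: non-negative integer required")
    if "sentiment" in record and record["sentiment"] not in ("positive", "neutral", "negative"):
        errors.append("sentiment: not in enum")
    if "next_followup" in record:
        try:
            date.fromisoformat(record["next_followup"])  # format: date
        except (TypeError, ValueError):
            errors.append("next_followup: not an ISO date")
    return errors

assert validate_contact({"contact_name": "Mr. Zhang", "budget_estimate": 500000}) == []
assert validate_contact({"budget_estimate": -1}) != []  # missing name, negative budget
```

The point is that these checks run before the write, so a malformed LLM response is rejected at the door rather than discovered weeks later in a training set.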
Flexibility Without DDL
What happens when AI produces a new field?
Traditional approach:
AI outputs new field → Developer notices → Files ticket → DBA approves → Stop service → ALTER TABLE → Deploy

Timeline: 1 day to 1 week
Forma approach:
AI outputs new field → Update JSON Schema → Immediately effective

Timeline: Seconds
Because in the EAV pattern, adding a new field just means inserting new rows in the EAV table—no table structure changes needed. JSON Schema updates are pure metadata operations with no data migration.
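The "new field = new rows" claim is easy to demonstrate. Below is a deliberately simplified EAV sketch using stdlib sqlite3 (real Forma stores typed values and integer attr_ids, but the zero-DDL property is the same):

```python
import sqlite3

# One EAV table absorbs any field the model invents: a "new column"
# is just new rows, so there is no ALTER TABLE and no downtime.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE eav (
        entity_id TEXT NOT NULL,
        attr      TEXT NOT NULL,
        value     TEXT,
        PRIMARY KEY (entity_id, attr)
    )
""")

def write_record(entity_id: str, record: dict) -> None:
    conn.executemany(
        "INSERT INTO eav (entity_id, attr, value) VALUES (?, ?, ?)",
        [(entity_id, k, str(v)) for k, v in record.items()],
    )

# Day 1: the model emits three fields.
write_record("uuid-1", {"contact_name": "Mr. Zhang",
                        "sentiment": "positive",
                        "budget_estimate": 500000})
# Day 2: the model learns a brand-new field. Same table, no DDL.
write_record("uuid-2", {"contact_name": "Mr. Li",
                        "industry": "manufacturing"})

attrs = {row[0] for row in conn.execute("SELECT DISTINCT attr FROM eav")}
assert "industry" in attrs  # the new field simply exists
```

Nothing about the table changed between day 1 and day 2; only the JSON Schema metadata needs updating to declare `industry`.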
The Pareto Principle: Why We Still Need a "Hot Table"
EAV solves the flexibility problem, but it has an inherent problem: All attributes live in the same table, requiring scans through massive amounts of irrelevant data on every query.
Imagine a CRM system:
- 1 million contact records
- 30 attributes per record on average
- EAV table total rows: 30 million
Every time a user searches for "contacts with budget over $100K," the database scans 30 million rows, even if only 100 records match.
But careful analysis of user behavior reveals a pattern:
80% of queries involve only 20% of fields.
The fields users search and sort by most often are always the same few: contact_name, created_at, budget_estimate, status. Fields like notes or custom_field_42 are only needed when viewing detail pages.
This is the Pareto Principle (80/20 rule) manifesting in database queries.
Hot Table: "Promoting" the Top 20% of Fields
Forma's solution is to introduce a "hot table" (entity_main) dedicated to storing frequently-accessed fields:
Hot table structure example:
entity_main table structure:
┌─────────────┬──────────────┬──────────────┬──────────────┐
│ row_id │ text_01 │ integer_01 │ created_at │
├─────────────┼──────────────┼──────────────┼──────────────┤
│ uuid-1 │ "Mr. Zhang" │ 500000 │ 2024-01-09 │
│ uuid-2 │ "Mr. Li" │ 200000 │ 2024-01-08 │
└─────────────┴──────────────┴──────────────┴──────────────┘

- `text_01` maps to `contact_name` (via JSON Schema's `x-ltbase-column` marker)
- `integer_01` maps to `budget_estimate`
- These columns have B-tree indexes
When a user searches for "budget over $100K":
Pure EAV path: Scan 30 million rows → aggregate → return
Hot table path: Index scan integer_01 > 100000 → hit 1,000 rows → aggregate only those 1,000 records' EAV data
Performance difference: 99% reduction in scan volume, latency drops from 200-500ms to 20-50ms.
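The two-step hot-table path can be sketched end to end with stdlib sqlite3 (a stand-in for PostgreSQL; table and column names follow the examples in this post):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hot table: pre-allocated typed columns with a B-tree index.
conn.execute("CREATE TABLE entity_main (row_id TEXT PRIMARY KEY, text_01 TEXT, integer_01 INTEGER)")
conn.execute("CREATE INDEX idx_integer_01 ON entity_main (integer_01)")
# Cold storage: every attribute also lives in EAV.
conn.execute("CREATE TABLE eav (entity_id TEXT, attr TEXT, value TEXT)")

conn.execute("INSERT INTO entity_main VALUES ('uuid-1', 'Mr. Zhang', 500000)")
conn.execute("INSERT INTO eav VALUES ('uuid-1', 'notes', 'Interested in new proposal')")
conn.execute("INSERT INTO entity_main VALUES ('uuid-2', 'Mr. Li', 50000)")

# Step 1: index scan on the hot column narrows the candidate set.
hits = [r[0] for r in conn.execute(
    "SELECT row_id FROM entity_main WHERE integer_01 > 100000")]
# Step 2: aggregate EAV details only for those few matching rows.
placeholders = ",".join("?" * len(hits))
details = conn.execute(
    f"SELECT attr, value FROM eav WHERE entity_id IN ({placeholders})", hits
).fetchall()

assert hits == ["uuid-1"]
assert ("notes", "Interested in new proposal") in details
```

The expensive scan happens on an indexed, typed column; the flexible EAV data is touched only for rows that already matched.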
Why Not Just Use JSONB?
PostgreSQL's JSONB is often the first thing developers reach for when they need flexibility. It's a reasonable instinct—but it hits walls quickly in production AI pipelines. Let's examine the specific failure modes.
The Index Bottleneck: GIN vs. B-Tree
This is JSONB's Achilles' heel, and it's not immediately obvious until you're debugging slow queries in production.
GIN indexes excel at containment queries:
-- PostgreSQL: "Find records where tags contain 'urgent'"
SELECT * FROM records WHERE data @> '{"tags": ["urgent"]}'; -- ✅ GIN works great

But they fail at range queries, the bread and butter of business analytics:
-- PostgreSQL: "Find high-confidence predictions"
SELECT * FROM records
WHERE (data->>'confidence_score')::float > 0.9; -- ❌ Full table scan
-- PostgreSQL: "Find records from the last 24 hours"
SELECT * FROM records
WHERE (data->>'timestamp')::timestamptz > now() - interval '1 day'; -- ❌ Full table scan

The fix requires expression indexes, which means DDL:
-- PostgreSQL: You need DDL for each field that needs range queries
CREATE INDEX idx_confidence ON records ((data->>'confidence_score')::float);
CREATE INDEX idx_timestamp ON records ((data->>'timestamp')::timestamptz);

And MySQL? It's even more limited:
-- MySQL 8.0+: JSON extraction works...
SELECT * FROM records
WHERE JSON_EXTRACT(data, '$.confidence_score') > 0.9;
-- But functional indexes on JSON are restricted:
-- MySQL requires a VIRTUAL generated column first
ALTER TABLE records
ADD COLUMN confidence_score FLOAT
GENERATED ALWAYS AS (JSON_EXTRACT(data, '$.confidence_score')) VIRTUAL;
CREATE INDEX idx_confidence ON records (confidence_score);
-- That's TWO DDL statements per field!

Every new field that needs range queries or sorting requires DBA intervention. You're right back at the "3-7 business days per field" bottleneck that flexible JSON storage was supposed to eliminate.
How EAV Solves This:
Forma uses typed storage—pre-allocated columns with existing B-tree indexes:
entity_main table (Hot Table):
┌─────────────┬──────────────┬──────────────┬────────────────┐
│ row_id │ float_01 │ float_02 │ timestamp_01 │
├─────────────┼──────────────┼──────────────┼────────────────┤
│ uuid-1 │ 0.95 │ 0.87 │ 2024-01-09 │
│ uuid-2 │ 0.72 │ 0.91 │ 2024-01-08 │
└─────────────┴──────────────┴──────────────┴────────────────┘
↑ ↑ ↑
B-tree index B-tree index B-tree index
(pre-existing)  (pre-existing)  (pre-existing)

When a new numeric attribute appears (say, `confidence_score`), you map it to `float_01` via JSON Schema's `x-ltbase-column`. The B-tree index already exists. No DDL required.
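In other words, "promotion" is a metadata write, not a schema change. The sketch below illustrates the idea; `promote` and `compile_filter` are hypothetical helper names, not Forma's actual API:

```python
# attr_path -> physical hot-table column; in Forma this mapping comes
# from the compiled JSON Schema (the x-ltbase-column markers).
HOT_COLUMNS: dict[str, str] = {}

def promote(attr_path: str, physical_column: str) -> None:
    """Map a logical attribute onto a pre-indexed column. Metadata only; no ALTER TABLE."""
    HOT_COLUMNS[attr_path] = physical_column

def compile_filter(attr_path: str, op: str, value) -> tuple[str, list]:
    """Translate a logical filter into SQL against the physical hot column."""
    col = HOT_COLUMNS[attr_path]  # e.g. float_01, whose B-tree index already exists
    return f"SELECT row_id FROM entity_main WHERE {col} {op} ?", [value]

promote("confidence_score", "float_01")
sql, params = compile_filter("confidence_score", ">", 0.9)
assert sql == "SELECT row_id FROM entity_main WHERE float_01 > ?"
assert params == [0.9]
```

The query planner never knows `confidence_score` exists; it only ever sees `float_01`, which was indexed the day the table was created.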
| Scenario | JSONB (PostgreSQL) | JSON (MySQL) | EAV + Hot Table |
|---|---|---|---|
| New field needs range query | CREATE INDEX (DDL) | 2× DDL (column + index) | Metadata mapping only |
| New field needs sorting | CREATE INDEX (DDL) | 2× DDL | Metadata mapping only |
| Time to production | 3-7 days | 3-7 days | Seconds |
Write Amplification: The Hidden Cost
JSONB stores the entire document as a single binary blob. Update one field? PostgreSQL rewrites the whole thing.
| Operation | JSONB | EAV |
|---|---|---|
| Update 1 field in 50-field record | Rewrite ~4KB blob | Insert/update 1 row (~100 bytes) |
| Add embedding vector to 1M records | 1M × 4KB = 4GB written | 1M × 100B = 100MB written |
| Write amplification factor | 40× | 1× |
For AI pipelines with frequent partial updates—enrichment jobs adding embeddings, sentiment scores, extracted entities—this compounds into:
- I/O costs: Cloud storage bills scale with bytes written
- WAL bloat: PostgreSQL's Write-Ahead Log grows 40× faster
- Replication lag: More data to sync to replicas
- Vacuum pressure: More dead tuples to clean up
If your pipeline updates millions of records with small changes (common in embedding enrichment), JSONB's write amplification becomes a silent budget killer.
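The back-of-envelope arithmetic behind the table above, assuming a ~4 KB serialized document and a ~100-byte EAV row:

```python
# Enrichment job: update ONE field on every record.
records   = 1_000_000
blob_size = 4_000   # bytes rewritten per JSONB document update (~4 KB)
row_size  = 100     # bytes written per EAV value-row update

jsonb_written = records * blob_size  # the whole blob is rewritten each time
eav_written   = records * row_size   # only the changed value row is written

assert jsonb_written == 4_000_000_000      # ~4 GB of writes
assert eav_written == 100_000_000          # ~100 MB of writes
assert jsonb_written // eav_written == 40  # 40x write amplification
```

Multiply that 40x factor across WAL, replication, and vacuum, and a nightly enrichment job becomes a noticeable line item on the cloud bill.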
Cross-Database Portability
JSONB locks you into PostgreSQL's ecosystem. This matters more than you might think.
The reality of database diversity in enterprise environments:
| Capability | PostgreSQL | MySQL 8.0+ | Aurora DSQL | CockroachDB | Spanner |
|---|---|---|---|---|---|
| JSONB + GIN index | ✅ | ❌ | ❌ | Partial | ❌ |
| Expression indexes on JSON | ✅ | Limited | ❌ | ✅ | ❌ |
| Functional generated columns | ✅ | ✅ | ❌ | ✅ | ❌ |
| Standard EAV tables | ✅ | ✅ | ✅ | ✅ | ✅ |
| B-tree on typed columns | ✅ | ✅ | ✅ | ✅ | ✅ |
EAV is pure standard SQL. The pattern—entity table, attribute table, value table with typed columns—works on every relational database since the 1980s.
Interestingly, many cloud-native storage services are internally built on EAV-like models:
- Amazon SimpleDB: Key-attribute-value structure
- Azure Table Storage: Entity-property model
- Google Cloud Datastore: Entity-property-value design
Choosing EAV aligns your architecture with these platforms, making future migrations smoother. If your customer suddenly requires MySQL (common in enterprise), or you need to scale to CockroachDB, or AWS releases a compelling new database service—your storage layer travels with you.
The OLTP/OLAP Split Strategy
Forma's architecture isn't just "EAV instead of JSONB"—it's a deliberate separation of concerns:
┌─────────────────────────────────────────────────────────────┐
│ OLTP Side (Writes) │ OLAP Side (Analytics) │
│ ───────────────── │ ────────────────── │
│ EAV + Hot Table │ DuckDB + Parquet │
│ • Max compatibility │ • Columnar processing │
│ • Runs on ANY SQL database │ • Complex aggregations │
│ • Zero DDL writes │ • Serverless lakehouse │
└─────────────────────────────────────────────────────────────┘

If you bet everything on JSONB:
- Your writes are optimized for PostgreSQL only
- Range queries require DDL per field
- Database migration = rewrite your storage layer
With EAV + DuckDB:
- Writes work on any SQL database (PostgreSQL today, Aurora DSQL tomorrow)
- Range queries use pre-existing typed indexes
- Heavy analytics offload to DuckDB's columnar engine (covered in Part 3)
This "wide entry (EAV writes), strict exit (Parquet/DuckDB analysis)" strategy is more resilient than coupling to PostgreSQL-specific features.
When JSONB Does Make Sense
To be fair, JSONB wins in specific scenarios:
- Truly unstructured data: Log entries, raw API responses you never query
- Infrequent queries: Data accessed only for display, never filtered/sorted
- Single-document lookups: Fetch by ID, return the whole blob
- PostgreSQL-only environments: If you're certain you'll never migrate
Forma's position isn't "JSONB bad"—it's "know where each tool excels." Use JSONB for opaque storage blobs. Use EAV + Hot Table for queryable, evolving structures that need to survive database migrations and support efficient range queries without DDL.
The Complete AI Workflow Loop
Let's connect all the modules and trace a complete AI data write flow:
┌───────────────────────────────────────────────────────────────┐
│ 1. AI generates structured data │
│ LLM output: {"contact_name": "Zhang", "budget": 500000} │
└───────────────────────────────────────────────────────────────┘
↓
┌───────────────────────────────────────────────────────────────┐
│ 2. JSON Schema validation │
│ - contact_name: string, minLength 1 ✓ │
│ - budget: integer, minimum 0 ✓ │
│ - sentiment: enum [positive/neutral/negative] ✓ │
└───────────────────────────────────────────────────────────────┘
↓
┌───────────────────────────────────────────────────────────────┐
│ 3. Forma write │
│ - Hot fields → entity_main table (contact_name → text_01) │
│ - All fields → EAV table (maintains flexibility) │
└───────────────────────────────────────────────────────────────┘
↓
┌───────────────────────────────────────────────────────────────┐
│ 4. Query optimization │
│ - Filter/sort → Hot table + B-tree index (milliseconds) │
│ - Detail aggregation → EAV + JSON_AGG (Part 2's trick) │
└───────────────────────────────────────────────────────────────┘

Key characteristics of this flow:
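Step 3 of the flow, the dual write, can be sketched as a single routing function. `HOT_MAP` stands in for the mapping produced by schema compilation; the names are illustrative, not Forma's internals:

```python
# From the compiled schema: which logical fields have hot columns.
HOT_MAP = {"contact_name": "text_01", "budget_estimate": "integer_01"}

def route(record: dict) -> tuple[dict, list[tuple[str, str]]]:
    """Split a validated record into hot-table columns and EAV rows."""
    hot = {HOT_MAP[k]: v for k, v in record.items() if k in HOT_MAP}
    cold = [(k, str(v)) for k, v in record.items()]  # EAV keeps the FULL record
    return hot, cold

hot, eav_rows = route({"contact_name": "Mr. Zhang",
                       "budget_estimate": 500000,
                       "notes": "Interested"})
assert hot == {"text_01": "Mr. Zhang", "integer_01": 500000}
assert ("notes", "Interested") in eav_rows  # cold field: EAV only
assert ("contact_name", "Mr. Zhang") in eav_rows  # hot field: EAV too
```

Hot fields are written twice on purpose: the hot-table copy serves filters and sorts, while the EAV copy keeps the record complete and flexible.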
- Zero DDL: New fields take effect instantly via JSON Schema updates
- Type safety: AI output is automatically validated before writes
- Controlled performance: Hot field index scans + cold field on-demand aggregation
Under the Hood: JSON Schema Compilation
What does Forma do behind the scenes when you create or update a Schema?
1. Parse and Validate
Input: JSON Schema definition
Output: Validation pass / error messages (circular references, type conflicts, etc.)

2. attr_id Assignment
Each attribute gets a unique integer ID within the schema:
contact_name → attr_id: 1
budget → attr_id: 2
sentiment → attr_id: 3

This way, queries compare integers instead of matching strings: faster, and immune to attribute-name typos.
3. Hot Table Column Mapping
Fields marked with x-ltbase-column get assigned to hot table columns:
contact_name (string) → text_01
budget (integer) → integer_01

Forma compiles the schema when it's created or updated. The flattened field mappings are generated at that point, cached, and used directly at query time, so the JSON Schema is never re-parsed per query.
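A minimal sketch of that compile-once-cache-forever step (illustrative names, not Forma's internals): parse the schema a single time, flatten it into an attr_id and hot-column mapping, and serve repeat calls from cache.

```python
import json

_compiled_cache: dict[str, dict] = {}  # schema text -> flattened mapping

def compile_schema(schema_json: str) -> dict:
    """Parse once; later calls for the same schema text hit the cache."""
    if schema_json in _compiled_cache:
        return _compiled_cache[schema_json]
    schema = json.loads(schema_json)
    mapping = {}
    # Deterministic attr_id assignment: sorted property order, starting at 1.
    for attr_id, (path, spec) in enumerate(sorted(schema["properties"].items()), start=1):
        mapping[path] = {"attr_id": attr_id,
                         "hot_column": spec.get("x-ltbase-column")}  # None = cold field
    _compiled_cache[schema_json] = mapping
    return mapping

raw = ('{"properties": {'
       '"contact_name": {"type": "string", "x-ltbase-column": "text_01"}, '
       '"notes": {"type": "string"}}}')
m = compile_schema(raw)
assert m["contact_name"]["hot_column"] == "text_01"
assert m["notes"]["hot_column"] is None
assert compile_schema(raw) is m  # cached: second call does no re-parsing
```

Invalidation is simple because schemas are immutable per version: a schema update produces a new cache key rather than mutating an old entry.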
Managing the Metadata Tax
EAV's flexibility comes with a cost: attribute sprawl. Without discipline, you can end up with hundreds of attributes—some duplicates, some typos, some abandoned experiments.
This is the "Metadata Tax" that critics rightfully point out. Here's how Forma addresses it.
The Schema Registry
Forma maintains a schema_attributes table that serves as a centralized registry:
-- Simplified schema_attributes structure
CREATE TABLE schema_attributes (
schema_id UUID NOT NULL,
attr_id INTEGER NOT NULL,
attr_path TEXT NOT NULL, -- "contact.name", "budget_estimate"
json_type TEXT NOT NULL, -- "string", "integer", "boolean"
hot_column TEXT, -- "text_01", "integer_01", NULL for cold
created_at TIMESTAMPTZ DEFAULT now(),
last_used_at TIMESTAMPTZ, -- For identifying stale attributes
usage_count BIGINT DEFAULT 0, -- For identifying hot candidates
PRIMARY KEY (schema_id, attr_id)
);

This gives you:
- Duplicate detection: Same path can't be registered twice
- Type consistency: `budget` can't be `integer` in one record and `string` in another
- Usage tracking: know which attributes are actually being used
Preventing Attribute Sprawl
Three strategies for keeping your attribute space clean:
1. JSON Schema Strict Mode
By default, Forma rejects any field not declared in the schema:
{
"additionalProperties": false, // Reject undeclared fields
"properties": {
"contact_name": { "type": "string" },
"budget": { "type": "integer" }
}
}

AI outputs `sentment` (a typo)? Rejected. This forces the pipeline to be explicit about new fields.
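Strict mode boils down to one check: with `additionalProperties` set to `false`, any key absent from `properties` is rejected outright. A minimal sketch:

```python
SCHEMA = {
    "additionalProperties": False,
    "properties": {"contact_name": {}, "budget": {}, "sentiment": {}},
}

def undeclared_fields(record: dict, schema: dict) -> list[str]:
    """Return keys that strict mode would reject; empty list means accepted."""
    if schema.get("additionalProperties", True):  # permissive mode: anything goes
        return []
    return [k for k in record if k not in schema["properties"]]

# Declared fields pass.
assert undeclared_fields({"contact_name": "Mr. Zhang", "sentiment": "positive"}, SCHEMA) == []
# The typo'd key is caught before it pollutes the attribute space.
assert undeclared_fields({"contact_name": "Mr. Zhang", "sentment": "positive"}, SCHEMA) == ["sentment"]
```

The rejected write surfaces as a pipeline error immediately, which is exactly where you want to discover that the model's output schema drifted.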
2. Attribute Aliasing
When you discover duplicates, you can merge them without data migration:
-- "budget_estimate" and "estimated_budget" both exist
-- Point them to the same hot column
UPDATE schema_attributes
SET hot_column = 'integer_01'
WHERE attr_path IN ('budget_estimate', 'estimated_budget');

3. Planned Feature: Schema Management CLI
Note: The following CLI tool is on Forma's roadmap and not yet available.
We're building an admin interface to make metadata management easier:
# List all attributes with usage stats
$ forma schema attributes list --schema crm_contacts
┌─────────────────────┬──────────┬────────────┬─────────────┬───────────────┐
│ Attribute │ Type │ Hot Column │ Usage Count │ Last Used │
├─────────────────────┼──────────┼────────────┼─────────────┼───────────────┤
│ contact_name │ string │ text_01 │ 1,234,567 │ 2 minutes ago │
│ budget_estimate │ integer │ integer_01 │ 987,654 │ 5 minutes ago │
│ sentiment │ string │ text_02 │ 543,210 │ 1 hour ago │
│ legacy_field_xyz │ string │ NULL │ 0 │ 6 months ago │ ← Candidate for removal
└─────────────────────┴──────────┴────────────┴─────────────┴───────────────┘
# Find potential duplicates
$ forma schema attributes duplicates --schema crm_contacts
Potential duplicates found:
- "budget_estimate" vs "estimated_budget" (87% string similarity)
- "contact_name" vs "contactName" (camelCase variant)
# Promote a cold attribute to hot
$ forma schema attributes promote sentiment --hot-column text_02
✓ Attribute 'sentiment' promoted to hot column 'text_02'
✓ Backfill job queued (ETA: 3 minutes for 543,210 records)

Until this CLI is available, you can query schema_attributes directly for the same insights.
The Bottom Line on Metadata Tax
Yes, EAV requires you to manage your attribute space. But consider the alternative:
| Approach | Adding New Field | Removing Unused Field | Finding Duplicates |
|---|---|---|---|
| Traditional SQL | ALTER TABLE + migrate | ALTER TABLE + careful | Manual code review |
| JSONB | Just write it | Fields never really "go away" | grep through JSON blobs |
| EAV + Registry | Update JSON Schema | Query last_used_at | Query schema_attributes |
EAV with a proper registry doesn't eliminate metadata management—it makes it queryable and automatable.
Summary: Why EAV Fits the AI Era
| Traditional Relational Tables | Forma (EAV + Hot Table) |
|---|---|
| New fields require ALTER TABLE | New fields effective instantly |
| Schema changes require downtime | Zero downtime |
| AI output needs manual adaptation | JSON Schema direct integration |
| Index design requires upfront planning | Hot fields auto-indexed |
The EAV pattern was once considered an "anti-pattern" because it sacrificed query performance for flexibility. But through:
- Hot Table design: Promoting high-frequency fields to physical columns, restoring B-tree index speed
- JSON Schema: Providing type safety and AI integration capabilities
- Single-query optimization: Eliminating N+1 problems (covered in Part 2)
We've given EAV both flexibility and performance.
In the AI era, data structure changes happen far faster than traditional software development cycles. Your database either adapts to this speed or becomes a bottleneck.
EAV + JSON Schema is the answer we've found.
How Does Forma Compare to NoSQL?
If flexibility is the goal, why not use MongoDB or DynamoDB? Here's a balanced comparison:
| Capability | MongoDB | DynamoDB | Forma (EAV + Hot Table) |
|---|---|---|---|
| Schema flexibility | ✅ Excellent (schemaless) | ✅ Excellent (schemaless) | ✅ Excellent (JSON Schema) |
| Range queries | ✅ Good (with indexes) | ⚠️ Limited (requires GSI) | ✅ Good (B-tree on hot columns) |
| ACID transactions | ⚠️ Single-doc only by default | ⚠️ Limited (25 items max) | ✅ Full PostgreSQL ACID |
| JOIN support | ❌ Manual aggregation | ❌ No native JOINs | ✅ Full SQL JOINs |
| Existing SQL ecosystem | ❌ New tooling required | ❌ New tooling required | ✅ Standard SQL, existing tools |
| Cost at scale | ⚠️ Compute-heavy | ⚠️ RCU/WCU can spike | ✅ Predictable (PostgreSQL + S3) |
| Cold data archival | ⚠️ Manual sharding | ⚠️ TTL + manual export | ✅ Built-in (DuckDB + Parquet) |
When NoSQL wins:
- Pure document workloads with no relational queries
- Globally distributed apps needing multi-region writes (DynamoDB Global Tables)
- Teams already invested in MongoDB/DynamoDB ecosystem
When Forma wins:
- AI pipelines needing relational joins (enrichment, cross-entity analysis)
- Teams with existing PostgreSQL infrastructure
- Workloads mixing OLTP (real-time) and OLAP (analytics) queries
- Cost-sensitive cold data storage (S3 + Parquet vs. MongoDB Atlas archival)
A Preview of What's Possible
Before we move on, here's a taste of the performance gains covered in Part 2:
| Metric | Before Optimization | After Optimization |
|---|---|---|
| Database round-trips | 101 | 1 |
| Latency (100 records) | 1000ms | 25ms |
| Improvement | — | 97% |
This isn't theoretical. These are real numbers from production systems using PostgreSQL's CTE + JSON_AGG—features that have existed since version 9.4 but remain criminally underused.
Part 2 shows exactly how to achieve this, with copy-paste-ready SQL.
What's Next: Solving EAV's Performance Problem
This post introduced the architecture choices of EAV + JSON Schema + Hot Table. But EAV has a well-known problem: N+1 queries.
The next post will show how we use PostgreSQL's CTE + JSON_AGG to solve this problem, reducing query count from 101 to 1, and latency from 1 second to 25 milliseconds.
And when historical data accumulates to billions of records and a single PostgreSQL instance can't handle it, Part 3 will introduce how we use DuckDB + CDC + Parquet to build a Serverless lakehouse architecture—and most critically, how we solve the trust crisis around "Lakehouse reading dirty data."
Series Navigation
- [Part 1] Why EAV is the Most Underrated Data Model for AI ← You are here
- [Part 2] Killing N+1: How One SQL Trick Cut Our Latency by 40x
- [Part 3] Zero Dirty Reads: Building a Trustworthy Lakehouse with DuckDB (Finale)
This post is based on engineering practices from the Forma project. Forma is a flexible data storage engine designed for the AI era.