Why EAV is the Most Underrated Data Model for AI

Forma Engineering Blog · Series Part 1

TL;DR

Your AI pipeline shouldn't crash at 3 AM because your model learned a new field. Your data layer shouldn't require a DBA ticket to accept new attributes. And your downstream models shouldn't hallucinate because they're reading partial data from a mid-migration table.

Forma's combination of EAV + JSON Schema + Hot Tables solves all three problems:

  • Instant schema evolution: New fields take effect in seconds, not days
  • Type-safe validation: Bad data is rejected before it corrupts your training set
  • Query performance that doesn't tank: Hot fields get B-tree indexes, cold fields stay flexible

This post explains why this "old-school" data model is actually the most practical choice for AI-era applications.

Starting with a Real Scenario

Imagine you're building an AI-powered CRM system. A user speaks into their microphone:

"Log this—I just had a call with Mr. Zhang. He's very interested in our new proposal, budget around $500K, let's follow up next Tuesday."

Your AI Agent transforms this into structured data:

```json
{
  "contact_name": "Mr. Zhang",
  "interaction_type": "phone_call",
  "sentiment": "positive",
  "budget_estimate": 500000,
  "next_followup": "2024-01-16",
  "notes": "Interested in new proposal"
}
```

Now here's the problem: Can your database accept this data?

If you're using traditional relational tables:

  1. No sentiment column? Stop the service, ALTER TABLE ADD COLUMN
  2. New customer needs an industry field? Stop again
  3. Different customers need different custom fields? One table per customer?

This clearly doesn't scale. Based on our surveys of 50+ enterprise customers, a single DDL change takes an average of 3-7 business days from ticket to deployment.

But a moderately complex AI agent can produce 10-50 new field combinations per day. That's a two-order-of-magnitude mismatch in cycle time.

Translating DBA Concerns to AI Engineer Problems

If you've read database literature, you've heard terms like "ACID compliance" and "transaction isolation." Here's what they actually mean for your AI pipelines:

| What DBAs Say | What AI Engineers Experience |
|---|---|
| "Zero DDL" | Your ingestion script won't crash because a new field appeared |
| "Schema validation" | Bad data won't silently corrupt your training set |
| "Transaction isolation" | Your model won't train on half-written records |
| "Data consistency" | No more debugging why your embeddings are drifting mysteriously |

When we say "zero dirty reads," we're not trying to impress database academics. We're preventing the scenario where your embedding model trains on a record that was mid-update—leading to subtle, impossible-to-debug model drift.

This is about pipeline stability, not database theory.

JSON Schema: The "Type System" for AI Output

This is where JSON Schema's value shines.

It's Not Just Validation—It's a Contract

JSON Schema has become the de facto standard for structured output from Large Language Models (LLMs):

  • OpenAI Structured Outputs: Uses JSON Schema to define function return formats
  • Anthropic Tool Use: Uses JSON Schema to describe tool parameters
  • Google Gemini Function Declarations: Also based on JSON Schema

This means when you define a data structure with JSON Schema, you're simultaneously defining:

  1. AI output format: The LLM knows what structure to return
  2. Validation rules: Automatic type, format, and range checking before writes
  3. Database schema: Forma uses it directly to organize storage

One definition, three purposes.

A JSON Schema Example

```json
{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "type": "object",
  "properties": {
    "contact_name": {
      "type": "string",
      "minLength": 1,
      "x-ltbase-column": "text_01"
    },
    "budget_estimate": {
      "type": "integer",
      "minimum": 0,
      "x-ltbase-column": "integer_01"
    },
    "sentiment": {
      "type": "string",
      "enum": ["positive", "neutral", "negative"]
    },
    "next_followup": {
      "type": "string",
      "format": "date"
    },
    "notes": {
      "type": "string"
    }
  },
  "required": ["contact_name"]
}
```

Notice that x-ltbase-column: "integer_01"? That's a Forma extension field—we'll explain its purpose shortly.
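To make the "contract" concrete, here is a minimal sketch of the validation step against the schema above. It hand-rolls a small subset of JSON Schema keywords purely for illustration; a real pipeline would use a full validator library rather than this toy checker.

```python
# Minimal sketch: validate LLM output against a JSON Schema subset.
# Supports only: type, minLength, minimum, enum, required, additionalProperties.
# Illustrative — use a complete JSON Schema validator in production.

TYPE_MAP = {"string": str, "integer": int, "number": (int, float), "boolean": bool}

def validate(data: dict, schema: dict) -> list[str]:
    errors = []
    props = schema.get("properties", {})
    for field in schema.get("required", []):
        if field not in data:
            errors.append(f"missing required field: {field}")
    if schema.get("additionalProperties") is False:
        for field in data:
            if field not in props:
                errors.append(f"undeclared field: {field}")
    for field, value in data.items():
        rules = props.get(field)
        if rules is None:
            continue  # undeclared fields already reported above
        expected = TYPE_MAP.get(rules.get("type"))
        if expected and not isinstance(value, expected):
            errors.append(f"{field}: expected {rules['type']}")
            continue
        if "minLength" in rules and len(value) < rules["minLength"]:
            errors.append(f"{field}: shorter than minLength")
        if "minimum" in rules and value < rules["minimum"]:
            errors.append(f"{field}: below minimum")
        if "enum" in rules and value not in rules["enum"]:
            errors.append(f"{field}: not in enum")
    return errors
```

The same schema the LLM was prompted with is the one the write path enforces — there is no second, hand-maintained validation layer to drift out of sync.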

Flexibility Without DDL

What happens when AI produces a new field?

Traditional approach:

AI outputs new field → Developer notices → Files ticket → DBA approves → Stop service → ALTER TABLE → Deploy

Timeline: 1 day to 1 week

Forma approach:

AI outputs new field → Update JSON Schema → Immediately effective

Timeline: Seconds

Because in the EAV pattern, adding a new field just means inserting new rows in the EAV table—no table structure changes needed. JSON Schema updates are pure metadata operations with no data migration.
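The point is easy to demonstrate. The sqlite3 sketch below uses a simplified stand-in for the EAV table (table and column names are illustrative, not Forma's actual DDL): a field the model has never produced before goes through the exact same INSERT path as every other field.

```python
import sqlite3

# Simplified EAV table — illustrative only, not Forma's real schema.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE eav (
        entity_id TEXT NOT NULL,
        attr_name TEXT NOT NULL,
        value     TEXT,
        PRIMARY KEY (entity_id, attr_name)
    )
""")

def write_record(entity_id: str, record: dict) -> None:
    # New attributes are just new rows — the table structure never changes.
    conn.executemany(
        "INSERT OR REPLACE INTO eav VALUES (?, ?, ?)",
        [(entity_id, k, str(v)) for k, v in record.items()],
    )

# Day 1: the AI emits two fields.
write_record("uuid-1", {"contact_name": "Mr. Zhang", "budget_estimate": 500000})
# Day 2: the model learns a new field — same INSERT path, zero DDL.
write_record("uuid-1", {"sentiment": "positive"})

rows = conn.execute("SELECT attr_name FROM eav WHERE entity_id = 'uuid-1'").fetchall()
```

Compare that with a wide table, where the day-2 write would have failed with "no such column: sentiment" until someone ran an ALTER TABLE.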

The Pareto Principle: Why We Still Need a "Hot Table"

EAV solves the flexibility problem, but it has an inherent weakness: all attributes live in the same table, so every query must wade through massive amounts of irrelevant rows.

Imagine a CRM system:

  • 1 million contact records
  • 30 attributes per record on average
  • EAV table total rows: 30 million

Every time a user searches for "contacts with budget over $100K," the database scans 30 million rows, even if only 100 records match.

But careful analysis of user behavior reveals a pattern:

80% of queries involve only 20% of fields.

The fields users search and sort by most often are always the same few: contact_name, created_at, budget_estimate, status. Fields like notes or custom_field_42 are only needed when viewing detail pages.

This is the Pareto Principle (80/20 rule) manifesting in database queries.

Hot Table: "Promoting" the Top 20% of Fields

Forma's solution is to introduce a "hot table" (entity_main) dedicated to storing frequently-accessed fields:

Hot table structure example:

entity_main table structure:
┌─────────────┬──────────────┬──────────────┬──────────────┐
│ row_id      │ text_01      │ integer_01   │ created_at   │
├─────────────┼──────────────┼──────────────┼──────────────┤
│ uuid-1      │ "Mr. Zhang"  │ 500000       │ 2024-01-09   │
│ uuid-2      │ "Mr. Li"     │ 200000       │ 2024-01-08   │
└─────────────┴──────────────┴──────────────┴──────────────┘
  • text_01 maps to contact_name (via JSON Schema's x-ltbase-column marker)
  • integer_01 maps to budget_estimate
  • These columns have B-tree indexes

When a user searches for "budget over $100K":

Pure EAV path: Scan 30 million rows → aggregate → return
Hot table path: Index scan integer_01 > 100000 → hit 1,000 rows → aggregate only those 1,000 records' EAV data

Performance difference: 99% reduction in scan volume, latency drops from 200-500ms to 20-50ms.
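The two-step query path can be sketched in a few lines. This sqlite3 example uses stand-ins for entity_main and the EAV table (names and shapes simplified for illustration): the indexed hot column narrows the candidate set first, and only the matching rows' EAV data is touched.

```python
import sqlite3

# Simplified stand-ins for Forma's hot table and EAV table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE entity_main (row_id TEXT PRIMARY KEY, integer_01 INTEGER);
    CREATE INDEX idx_int01 ON entity_main (integer_01);  -- "pre-existing" B-tree
    CREATE TABLE eav (entity_id TEXT, attr_name TEXT, value TEXT);
""")
conn.executemany("INSERT INTO entity_main VALUES (?, ?)",
                 [("uuid-1", 500000), ("uuid-2", 200000), ("uuid-3", 50000)])
conn.executemany("INSERT INTO eav VALUES (?, ?, ?)",
                 [("uuid-1", "notes", "Interested in new proposal"),
                  ("uuid-2", "notes", "Asked for a discount"),
                  ("uuid-3", "notes", "Cold lead")])

# Step 1: index scan on the hot column narrows millions of rows to the hits.
hits = [r[0] for r in conn.execute(
    "SELECT row_id FROM entity_main WHERE integer_01 > 100000")]

# Step 2: aggregate EAV data only for those hits.
placeholders = ",".join("?" * len(hits))
detail = dict(conn.execute(
    f"SELECT entity_id, value FROM eav WHERE entity_id IN ({placeholders})", hits))
```

The EAV table is never scanned wholesale; it is probed only for row IDs the B-tree already selected.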

Why Not Just Use JSONB?

PostgreSQL's JSONB is often the first thing developers reach for when they need flexibility. It's a reasonable instinct—but it hits walls quickly in production AI pipelines. Let's examine the specific failure modes.


The Index Bottleneck: GIN vs. B-Tree

This is JSONB's Achilles' heel, and it's not immediately obvious until you're debugging slow queries in production.

GIN indexes excel at containment queries:

```sql
-- PostgreSQL: "Find records where tags contain 'urgent'"
SELECT * FROM records WHERE data @> '{"tags": ["urgent"]}';  -- ✅ GIN works great
```

But they fail at range queries—the bread and butter of business analytics:

```sql
-- PostgreSQL: "Find high-confidence predictions"
SELECT * FROM records
WHERE (data->>'confidence_score')::float > 0.9;  -- ❌ Full table scan

-- PostgreSQL: "Find records from the last 24 hours"
SELECT * FROM records
WHERE (data->>'timestamp')::timestamptz > now() - interval '1 day';  -- ❌ Full table scan
```

The fix requires expression indexes—which means DDL:

```sql
-- PostgreSQL: You need DDL for each field that needs range queries
CREATE INDEX idx_confidence ON records (((data->>'confidence_score')::float));
CREATE INDEX idx_timestamp ON records (((data->>'timestamp')::timestamptz));
```

And MySQL? It's even more limited:

```sql
-- MySQL 8.0+: JSON extraction works...
SELECT * FROM records
WHERE JSON_EXTRACT(data, '$.confidence_score') > 0.9;

-- But functional indexes on JSON are restricted:
-- MySQL requires a VIRTUAL generated column first
ALTER TABLE records
ADD COLUMN confidence_score FLOAT
GENERATED ALWAYS AS (JSON_EXTRACT(data, '$.confidence_score')) VIRTUAL;

CREATE INDEX idx_confidence ON records (confidence_score);
-- That's TWO DDL statements per field!
```

Every new field that needs range queries or sorting requires DBA intervention. You're right back to the "3-7 business days per field" bottleneck that JSONB was supposed to solve.

How EAV Solves This:

Forma uses typed storage—pre-allocated columns with existing B-tree indexes:

entity_main table (Hot Table):
┌─────────────┬──────────────┬──────────────┬────────────────┐
│ row_id      │ float_01     │ float_02     │ timestamp_01   │
├─────────────┼──────────────┼──────────────┼────────────────┤
│ uuid-1      │ 0.95         │ 0.87         │ 2024-01-09     │
│ uuid-2      │ 0.72         │ 0.91         │ 2024-01-08     │
└─────────────┴──────────────┴──────────────┴────────────────┘
         ↑              ↑              ↑
    B-tree index   B-tree index   B-tree index
    (pre-existing) (pre-existing) (pre-existing)

When a new numeric attribute appears (say, confidence_score), you map it to float_01 via JSON Schema's x-ltbase-column. The B-tree index already exists. No DDL required.
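That "metadata mapping only" step can be sketched as a simple column-pool allocator. The pools, registry, and `promote` helper below are hypothetical illustrations of the idea, not Forma's actual API:

```python
# Pools of pre-allocated, pre-indexed hot-table columns, keyed by JSON type.
# (Illustrative names — Forma's real column inventory is larger.)
COLUMN_POOLS = {"number": ["float_01", "float_02"], "string": ["text_01", "text_02"]}

registry: dict[str, str] = {}  # attr_path -> assigned hot column

def promote(attr_path: str, json_type: str) -> str:
    """Assign the next free typed column. No DDL: the index already exists."""
    if attr_path in registry:
        return registry[attr_path]
    used = set(registry.values())
    column = next(c for c in COLUMN_POOLS[json_type] if c not in used)
    registry[attr_path] = column
    return column

promote("confidence_score", "number")  # assigned "float_01"
promote("model_version", "string")     # assigned "text_01"
```

Promotion is an update to a mapping table, which is why it takes effect in seconds rather than a DBA ticket cycle.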

| Scenario | JSONB (PostgreSQL) | JSON (MySQL) | EAV + Hot Table |
|---|---|---|---|
| New field needs range query | CREATE INDEX (DDL) | 2× DDL (column + index) | Metadata mapping only |
| New field needs sorting | CREATE INDEX (DDL) | 2× DDL | Metadata mapping only |
| Time to production | 3-7 days | 3-7 days | Seconds |

Write Amplification: The Hidden Cost

JSONB stores the entire document as a single binary blob. Update one field? PostgreSQL rewrites the whole thing.

| Operation | JSONB | EAV |
|---|---|---|
| Update 1 field in a 50-field record | Rewrite ~4KB blob | Insert/update 1 row (~100 bytes) |
| Add embedding vector to 1M records | 1M × 4KB = 4GB written | 1M × 100B = 100MB written |
| Write amplification factor | ~40× | 1× (baseline) |

For AI pipelines with frequent partial updates—enrichment jobs adding embeddings, sentiment scores, extracted entities—this compounds into:

  • I/O costs: Cloud storage bills scale with bytes written
  • WAL bloat: PostgreSQL's Write-Ahead Log grows 40× faster
  • Replication lag: More data to sync to replicas
  • Vacuum pressure: More dead tuples to clean up

If your pipeline updates millions of records with small changes (common in embedding enrichment), JSONB's write amplification becomes a silent budget killer.
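The arithmetic behind the table above is worth making explicit. The document and row sizes are the same assumptions used throughout this section (~4 KB per JSONB document, ~100 bytes per EAV value row):

```python
# Back-of-envelope check of JSONB vs. EAV write amplification.
# Assumed sizes: ~4 KB per JSONB document, ~100 bytes per EAV row.
records = 1_000_000
jsonb_doc_bytes = 4_000   # whole blob rewritten on every partial update
eav_row_bytes = 100       # only the touched value row is written

jsonb_written = records * jsonb_doc_bytes        # ~4 GB per enrichment pass
eav_written = records * eav_row_bytes            # ~100 MB per enrichment pass
amplification = jsonb_doc_bytes / eav_row_bytes  # ~40x
```

Every enrichment pass over the corpus pays this 40× tax again — it is a recurring cost, not a one-time migration.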


Cross-Database Portability

JSONB locks you into PostgreSQL's ecosystem. This matters more than you might think.

The reality of database diversity in enterprise environments:

| Capability | PostgreSQL | MySQL 8.0+ | Aurora DSQL | CockroachDB | Spanner |
|---|---|---|---|---|---|
| JSONB + GIN index | ✅ | ❌ | ❌ | Partial | ❌ |
| Expression indexes on JSON | ✅ | Limited | ❌ | ✅ | ❌ |
| Functional generated columns | ✅ | ✅ | ❌ | ✅ | ✅ |
| Standard EAV tables | ✅ | ✅ | ✅ | ✅ | ✅ |
| B-tree on typed columns | ✅ | ✅ | ✅ | ✅ | ✅ |

EAV is pure standard SQL. The pattern—entity table, attribute table, value table with typed columns—works on every relational database since the 1980s.

Interestingly, many cloud-native storage services are internally built on EAV-like models:

  • Amazon SimpleDB: Key-attribute-value structure
  • Azure Table Storage: Entity-property model
  • Google Cloud Datastore: Entity-property-value design

Choosing EAV aligns your architecture with these platforms, making future migrations smoother. If your customer suddenly requires MySQL (common in enterprise), or you need to scale to CockroachDB, or AWS releases a compelling new database service—your storage layer travels with you.


The OLTP/OLAP Split Strategy

Forma's architecture isn't just "EAV instead of JSONB"—it's a deliberate separation of concerns:

┌─────────────────────────────────────────────────────────────┐
│  OLTP Side (Writes)           │  OLAP Side (Analytics)      │
│  ─────────────────            │  ──────────────────         │
│  EAV + Hot Table              │  DuckDB + Parquet           │
│  • Max compatibility          │  • Columnar processing      │
│  • Runs on ANY SQL database   │  • Complex aggregations     │
│  • Zero DDL writes            │  • Serverless lakehouse     │
└─────────────────────────────────────────────────────────────┘

If you bet everything on JSONB:

  • Your writes are optimized for PostgreSQL only
  • Range queries require DDL per field
  • Database migration = rewrite your storage layer

With EAV + DuckDB:

  • Writes work on any SQL database (PostgreSQL today, Aurora DSQL tomorrow)
  • Range queries use pre-existing typed indexes
  • Heavy analytics offload to DuckDB's columnar engine (covered in Part 3)

This "wide entry (EAV writes), strict exit (Parquet/DuckDB analysis)" strategy is more resilient than coupling to PostgreSQL-specific features.


When JSONB Does Make Sense

To be fair, JSONB wins in specific scenarios:

  • Truly unstructured data: Log entries, raw API responses you never query
  • Infrequent queries: Data accessed only for display, never filtered/sorted
  • Single-document lookups: Fetch by ID, return the whole blob
  • PostgreSQL-only environments: If you're certain you'll never migrate

Forma's position isn't "JSONB bad"—it's "know where each tool excels." Use JSONB for opaque storage blobs. Use EAV + Hot Table for queryable, evolving structures that need to survive database migrations and support efficient range queries without DDL.

The Complete AI Workflow Loop

Let's connect all the modules and trace a complete AI data write flow:

┌───────────────────────────────────────────────────────────────┐
│  1. AI generates structured data                              │
│     LLM output: {"contact_name": "Zhang", "budget": 500000}   │
└───────────────────────────────────────────────────────────────┘

┌───────────────────────────────────────────────────────────────┐
│  2. JSON Schema validation                                    │
│     - contact_name: string, minLength 1  ✓                    │
│     - budget: integer, minimum 0         ✓                    │
│     - sentiment: enum [positive/neutral/negative]  ✓          │
└───────────────────────────────────────────────────────────────┘

┌───────────────────────────────────────────────────────────────┐
│  3. Forma write                                               │
│     - Hot fields → entity_main table (contact_name → text_01) │
│     - All fields → EAV table (maintains flexibility)          │
└───────────────────────────────────────────────────────────────┘

┌───────────────────────────────────────────────────────────────┐
│  4. Query optimization                                        │
│     - Filter/sort → Hot table + B-tree index (milliseconds)   │
│     - Detail aggregation → EAV + JSON_AGG (Part 2's trick)    │
└───────────────────────────────────────────────────────────────┘

Key characteristics of this flow:

  • Zero DDL: New fields take effect instantly via JSON Schema updates
  • Type safety: AI output is automatically validated before writes
  • Controlled performance: Hot field index scans + cold field on-demand aggregation

Under the Hood: JSON Schema Compilation

What does Forma do behind the scenes when you create or update a Schema?

1. Parse and Validate

Input: JSON Schema definition
Output: Validation pass / Error messages (circular references, type conflicts, etc.)

2. attr_id Assignment

Each attribute gets a unique integer ID within the schema:

contact_name → attr_id: 1
budget       → attr_id: 2
sentiment    → attr_id: 3

This way, queries use integer comparisons instead of string matching—faster, and avoids typos.

3. Hot Table Column Mapping

Fields marked with x-ltbase-column get assigned to hot table columns:

contact_name (string)  → text_01
budget (integer)       → integer_01

Forma compiles the schema when it's created or updated. The flattened field mappings are generated at this time, cached, and used directly during queries without re-parsing the JSON Schema every time.
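The compile-once idea can be sketched as a cached flattening pass. The representation below (and the alphabetical attr_id assignment) is illustrative, not Forma's internal format:

```python
import json
from functools import lru_cache

@lru_cache(maxsize=128)
def compile_schema(schema_json: str) -> dict:
    """Flatten a JSON Schema into attr_id / hot-column mappings, once."""
    schema = json.loads(schema_json)
    compiled = {}
    # attr_id assignment order here is alphabetical for determinism —
    # purely an illustration choice.
    for attr_id, (path, rules) in enumerate(sorted(schema["properties"].items()), 1):
        compiled[path] = {
            "attr_id": attr_id,
            "json_type": rules.get("type"),
            "hot_column": rules.get("x-ltbase-column"),  # None => cold field
        }
    return compiled

schema_text = json.dumps({
    "properties": {
        "contact_name": {"type": "string", "x-ltbase-column": "text_01"},
        "budget_estimate": {"type": "integer", "x-ltbase-column": "integer_01"},
        "notes": {"type": "string"},
    }
})
mapping = compile_schema(schema_text)  # subsequent calls hit the cache
```

Because the cache key is the schema text itself, updating the schema naturally produces a fresh compilation while old queries keep their mappings.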

Managing the Metadata Tax

EAV's flexibility comes with a cost: attribute sprawl. Without discipline, you can end up with hundreds of attributes—some duplicates, some typos, some abandoned experiments.

This is the "Metadata Tax" that critics rightfully point out. Here's how Forma addresses it.

The Schema Registry

Forma maintains a schema_attributes table that serves as a centralized registry:

```sql
-- Simplified schema_attributes structure
CREATE TABLE schema_attributes (
    schema_id       UUID NOT NULL,
    attr_id         INTEGER NOT NULL,
    attr_path       TEXT NOT NULL,        -- "contact.name", "budget_estimate"
    json_type       TEXT NOT NULL,        -- "string", "integer", "boolean"
    hot_column      TEXT,                 -- "text_01", "integer_01", NULL for cold
    created_at      TIMESTAMPTZ DEFAULT now(),
    last_used_at    TIMESTAMPTZ,          -- For identifying stale attributes
    usage_count     BIGINT DEFAULT 0,     -- For identifying hot candidates
    PRIMARY KEY (schema_id, attr_id)
);
```

This gives you:

  • Duplicate detection: Same path can't be registered twice
  • Type consistency: budget can't be integer in one record and string in another
  • Usage tracking: Know which attributes are actually being used
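Here is a small sqlite3 sketch of the type-consistency guarantee. The table is a cut-down version of the schema_attributes DDL above, and `register` is a hypothetical helper, not Forma's API:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE schema_attributes (
        schema_id TEXT NOT NULL,
        attr_path TEXT NOT NULL,
        json_type TEXT NOT NULL,
        UNIQUE (schema_id, attr_path)  -- duplicate paths rejected at the DB level
    )
""")

def register(schema_id: str, attr_path: str, json_type: str) -> None:
    """Register an attribute; raise if it exists with a different type."""
    existing = conn.execute(
        "SELECT json_type FROM schema_attributes WHERE schema_id=? AND attr_path=?",
        (schema_id, attr_path)).fetchone()
    if existing and existing[0] != json_type:
        raise TypeError(f"{attr_path}: already registered as {existing[0]}")
    if not existing:
        conn.execute("INSERT INTO schema_attributes VALUES (?, ?, ?)",
                     (schema_id, attr_path, json_type))

register("crm", "budget", "integer")
try:
    register("crm", "budget", "string")  # type conflict → rejected
    conflict = False
except TypeError:
    conflict = True
```

A record claiming `budget` is a string never reaches the EAV table; the conflict surfaces at registration time instead of at training time.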

Preventing Attribute Sprawl

Three strategies for keeping your attribute space clean:

1. JSON Schema Strict Mode

By default, Forma rejects any field not declared in the schema:

```json
{
  "additionalProperties": false,  // Reject undeclared fields
  "properties": {
    "contact_name": { "type": "string" },
    "budget": { "type": "integer" }
  }
}
```

The AI outputs sentment (a typo of sentiment)? Rejected. This forces the pipeline to be explicit about new fields.

2. Attribute Aliasing

When you discover duplicates, you can merge them without data migration:

```sql
-- "budget_estimate" and "estimated_budget" both exist
-- Point them to the same hot column
UPDATE schema_attributes
SET hot_column = 'integer_01'
WHERE attr_path IN ('budget_estimate', 'estimated_budget');
```

3. Planned Feature: Schema Management CLI

Note: The following CLI tool is on Forma's roadmap and not yet available.

We're building an admin interface to make metadata management easier:

```bash
# List all attributes with usage stats
$ forma schema attributes list --schema crm_contacts
┌─────────────────────┬──────────┬────────────┬─────────────┬───────────────┐
│ Attribute           │ Type     │ Hot Column │ Usage Count │ Last Used     │
├─────────────────────┼──────────┼────────────┼─────────────┼───────────────┤
│ contact_name        │ string   │ text_01    │ 1,234,567   │ 2 minutes ago │
│ budget_estimate     │ integer  │ integer_01 │ 987,654     │ 5 minutes ago │
│ sentiment           │ string   │ text_02    │ 543,210     │ 1 hour ago    │
│ legacy_field_xyz    │ string   │ NULL       │ 0           │ 6 months ago  │  ← candidate for removal
└─────────────────────┴──────────┴────────────┴─────────────┴───────────────┘

# Find potential duplicates
$ forma schema attributes duplicates --schema crm_contacts
Potential duplicates found:
  - "budget_estimate" vs "estimated_budget" (87% string similarity)
  - "contact_name" vs "contactName" (camelCase variant)

# Promote a cold attribute to hot
$ forma schema attributes promote sentiment --hot-column text_02
Attribute 'sentiment' promoted to hot column 'text_02'
Backfill job queued (ETA: 3 minutes for 543,210 records)
```

Until this CLI is available, you can query schema_attributes directly for the same insights.
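The duplicate-detection pass, for instance, can be approximated today with a similarity sweep over registered attribute paths. The normalization and the 0.8 threshold below are illustrative choices, not Forma defaults:

```python
from difflib import SequenceMatcher

def normalize(path: str) -> str:
    # Sort underscore-separated tokens so word order doesn't matter:
    # "budget_estimate" and "estimated_budget" normalize to near-identical strings.
    return " ".join(sorted(path.lower().split("_")))

def duplicate_candidates(paths: list[str], threshold: float = 0.8):
    """Return (path_a, path_b, similarity) pairs above the threshold."""
    pairs = []
    for i, a in enumerate(paths):
        for b in paths[i + 1:]:
            ratio = SequenceMatcher(None, normalize(a), normalize(b)).ratio()
            if ratio >= threshold:
                pairs.append((a, b, round(ratio, 2)))
    return pairs

attrs = ["contact_name", "budget_estimate", "estimated_budget", "sentiment"]
dupes = duplicate_candidates(attrs)
```

Run periodically against `SELECT attr_path FROM schema_attributes`, this gives you a poor-man's version of the planned `duplicates` subcommand.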

The Bottom Line on Metadata Tax

Yes, EAV requires you to manage your attribute space. But consider the alternative:

| Approach | Adding a New Field | Removing an Unused Field | Finding Duplicates |
|---|---|---|---|
| Traditional SQL | ALTER TABLE + migrate | ALTER TABLE + careful migration | Manual code review |
| JSONB | Just write it | Fields never really "go away" | grep through JSON blobs |
| EAV + Registry | Update JSON Schema | Query last_used_at | Query schema_attributes |

EAV with a proper registry doesn't eliminate metadata management—it makes it queryable and automatable.

Summary: Why EAV Fits the AI Era

| Traditional Relational Tables | Forma (EAV + Hot Table) |
|---|---|
| New fields require ALTER TABLE | New fields effective instantly |
| Schema changes require downtime | Zero downtime |
| AI output needs manual adaptation | JSON Schema direct integration |
| Index design requires upfront planning | Hot fields auto-indexed |

The EAV pattern was once considered an "anti-pattern" because it sacrificed query performance for flexibility. But through:

  1. Hot Table design: Promoting high-frequency fields to physical columns, restoring B-tree index speed
  2. JSON Schema: Providing type safety and AI integration capabilities
  3. Single-query optimization: Eliminating N+1 problems (covered in Part 2)

We've given EAV both flexibility and performance.

In the AI era, data structure changes happen far faster than traditional software development cycles. Your database either adapts to this speed or becomes a bottleneck.

EAV + JSON Schema is the answer we've found.

How Does Forma Compare to NoSQL?

If flexibility is the goal, why not use MongoDB or DynamoDB? Here's a balanced comparison:

| Capability | MongoDB | DynamoDB | Forma (EAV + Hot Table) |
|---|---|---|---|
| Schema flexibility | ✅ Excellent (schemaless) | ✅ Excellent (schemaless) | ✅ Excellent (JSON Schema) |
| Range queries | ✅ Good (with indexes) | ⚠️ Limited (requires GSI) | ✅ Good (B-tree on hot columns) |
| ACID transactions | ⚠️ Single-doc only by default | ⚠️ Limited (100 items max) | ✅ Full PostgreSQL ACID |
| JOIN support | ❌ Manual aggregation | ❌ No native JOINs | ✅ Full SQL JOINs |
| Existing SQL ecosystem | ❌ New tooling required | ❌ New tooling required | ✅ Standard SQL, existing tools |
| Cost at scale | ⚠️ Compute-heavy | ⚠️ RCU/WCU can spike | ✅ Predictable (PostgreSQL + S3) |
| Cold data archival | ⚠️ Manual sharding | ⚠️ TTL + manual export | ✅ Built-in (DuckDB + Parquet) |

When NoSQL wins:

  • Pure document workloads with no relational queries
  • Globally distributed apps needing multi-region writes (DynamoDB Global Tables)
  • Teams already invested in MongoDB/DynamoDB ecosystem

When Forma wins:

  • AI pipelines needing relational joins (enrichment, cross-entity analysis)
  • Teams with existing PostgreSQL infrastructure
  • Workloads mixing OLTP (real-time) and OLAP (analytics) queries
  • Cost-sensitive cold data storage (S3 + Parquet vs. MongoDB Atlas archival)

A Preview of What's Possible

Before we move on, here's a taste of the performance gains covered in Part 2:

| Metric | Before Optimization | After Optimization |
|---|---|---|
| Database round-trips | 101 | 1 |
| Latency (100 records) | 1000ms | 25ms |
| Improvement | — | 97% |

This isn't theoretical. These are real numbers from production systems using PostgreSQL's CTE + JSON_AGG—features that have existed since version 9.4 but remain criminally underused.

Part 2 shows exactly how to achieve this, with copy-paste-ready SQL.

What's Next: Solving EAV's Performance Problem

This post introduced the architecture choices of EAV + JSON Schema + Hot Table. But EAV has a well-known problem: N+1 queries.

The next post will show how we use PostgreSQL's CTE + JSON_AGG to solve this problem, reducing query count from 101 to 1, and latency from 1 second to 25 milliseconds.

And when historical data accumulates to billions of records and a single PostgreSQL instance can't handle it, Part 3 will introduce how we use DuckDB + CDC + Parquet to build a Serverless lakehouse architecture—and most critically, how we solve the trust crisis around "Lakehouse reading dirty data."

Series Navigation

This post is based on engineering practices from the Forma project. Forma is a flexible data storage engine designed for the AI era.