Killing N+1: How One SQL Trick Cut Our Latency by 40x
Forma Engineering Blog · Series Part 2
TL;DR
We reduced database round-trips from 101 to 1, and latency from 1000ms to 25ms—a 97% improvement. The secret isn't black magic; it's a criminally underrated PostgreSQL feature: CTE + JSON_AGG.
If you're using the EAV (Entity-Attribute-Value) pattern for flexible data storage and getting killed by N+1 queries, this post is for you.
This isn't legacy tech. The CTE + JSON_AGG pattern works seamlessly with modern data stacks. Forma uses DuckDB for cold data queries and exports to Parquet on S3. The same query optimization principles apply whether you're hitting PostgreSQL directly or running federated queries across hot and cold storage. If you're building on a lakehouse architecture, these techniques are still relevant.
The Villain: The N+1 Query Nightmare
Let's set the scene.
Your SaaS app has a "Contacts" feature. Different customers need different fields: Customer A wants 12 fields, Customer B wants 30 fields, and Customer C changes their mind every week. To avoid running ALTER TABLE every time someone adds a field, you chose the EAV pattern—storing attributes as rows instead of columns.
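For concreteness, here is a minimal sketch of what such an EAV layout might look like. The table and column names match the queries later in this post; the extra columns (`name`, `created_at`) and the index are illustrative assumptions, not Forma's actual DDL:

```sql
-- Illustrative EAV schema (not Forma's exact DDL)
CREATE TABLE entity_main (
    row_id     BIGINT PRIMARY KEY,
    schema_id  BIGINT NOT NULL,
    name       TEXT,                       -- a few fixed "hot" columns live here
    created_at TIMESTAMPTZ DEFAULT now()
);

CREATE TABLE eav_table (
    row_id        BIGINT NOT NULL REFERENCES entity_main(row_id),
    schema_id     BIGINT NOT NULL,
    attr_id       BIGINT NOT NULL,          -- which custom field this row stores
    value_text    TEXT,                     -- one typed value column per type
    value_numeric NUMERIC,
    PRIMARY KEY (row_id, attr_id)
);

-- Index the filter step can lean on
CREATE INDEX idx_eav_schema ON eav_table (schema_id, attr_id, row_id);
```

Adding a new custom field is now just a new `attr_id`, with no `ALTER TABLE` required.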
Smart move. But when you write the query code, the nightmare begins:
// Step 1: Query EAV table to get matching row IDs (1 query)
rowIDs := db.Query("SELECT DISTINCT row_id FROM eav_table WHERE ...")
// Step 2: Loop through each row_id to fetch main table data (N queries!)
for _, rowID := range rowIDs {
record := db.Query("SELECT * FROM entity_main WHERE row_id = ?", rowID)
results = append(results, record)
}

This code looks straightforward, but it has a fatal flaw: fetching 100 records requires 101 database round-trips.
Let's do the math:
| Records | Queries | Round-trip Latency | Total Latency |
|---|---|---|---|
| 10 | 11 | 10ms | 110ms |
| 50 | 51 | 10ms | 510ms |
| 100 | 101 | 10ms | 1010ms |
One full second! Users click "Search" and wait a full second for results. It gets worse:
- Connection pool exhaustion: Under high concurrency, 101 queries per request will drain your connection pool fast
- Blocked on I/O: The application layer spends most of each request waiting on network I/O in a loop, doing no useful work
- Cross-region deployment? Multiply it: If your database is in another region, a single round-trip might be 50ms. That's 5 seconds for 100 records.
This is the N+1 query problem—a notorious performance killer in the ORM world.
The Hero: Let the Database Do What It's Good At
What's the root cause?
We're making the application layer assemble the data. The app loops through records, asking the database one by one: "Give me the details for this record." The database can only respond one by one.
But the database was born to do this! It has indexes, query optimizers, and vectorized execution engines—it's designed for batch data processing.
The solution: Move the loop into the database. Return all data in one query.
CTE + JSON_AGG: One Query to Rule Them All
PostgreSQL's CTE (Common Table Expression) and JSON_AGG function are the core of this solution.
Let's see the complete SQL first, then explain the mechanics:
WITH
-- Step 1: Find matching record IDs
filtered_ids AS (
SELECT DISTINCT row_id
FROM eav_table
WHERE schema_id = $1 AND /* filter conditions */
),
-- Step 2: Sort + Paginate
paginated AS (
SELECT row_id
FROM filtered_ids
ORDER BY /* sort conditions */
LIMIT $2 OFFSET $3
),
-- Step 3: Fetch main table data
main_data AS (
SELECT m.*
FROM paginated p
JOIN entity_main m ON m.row_id = p.row_id
),
-- Step 4: Aggregate EAV attributes into JSON
eav_json AS (
SELECT
p.row_id,
JSON_AGG(
JSON_BUILD_OBJECT(
'attr_id', e.attr_id,
'value', COALESCE(e.value_text, e.value_numeric::text)
)
) AS attributes
FROM paginated p
JOIN eav_table e ON e.row_id = p.row_id
GROUP BY p.row_id
)
-- Final result: main table + EAV attributes, returned in one shot
SELECT m.*, COALESCE(e.attributes, '[]') AS attributes_json
FROM main_data m
LEFT JOIN eav_json e ON e.row_id = m.row_id;

What Does This SQL Do?
- CTE as a pipeline: Each `WITH` clause is like a station on an assembly line, with data flowing from one step to the next
- JSON_AGG magic: Aggregates multiple EAV rows into a single JSON array, one array per `row_id`
- One round-trip: The entire query involves only one database interaction; all data comes back in a single package
How CTE Eliminates Round-Trips
A Common Table Expression (CTE) is like setting up an assembly line inside the database. Without CTE, your application plays ping-pong with the database:
App: "Give me the IDs" → DB responds (round-trip 1)
App: "Give me details for ID 1" → DB responds (round-trip 2)
App: "Give me details for ID 2" → DB responds (round-trip 3)
...101 times...

With CTE, you send one instruction:
App: "Database, figure out the IDs, fetch the details,
aggregate them into JSON, and give me everything
in one response."

The database's query planner optimizes the entire pipeline as a single execution unit. It can:
- Use indexes efficiently across all steps
- Parallelize independent operations
- Avoid serialization/deserialization overhead between steps
This is computation pushdown—letting the database do what it was built to do.
The application layer just needs to parse JSON:
// One query, all data (pseudocode; error handling omitted for brevity)
rows := db.Query(cteSQL, schemaID, pageSize, offset)
for rows.Next() {
    var record Record
    var attributesJSON string
    rows.Scan(&record, &attributesJSON) // scan main-table columns plus the JSON
    json.Unmarshal([]byte(attributesJSON), &record.Attributes)
    results = append(results, record)
}

Performance Comparison: Numbers Don't Lie
Before vs. after optimization:
| Records | Queries (Before) | Queries (After) | Latency (Before) | Latency (After) | Improvement |
|---|---|---|---|---|---|
| 10 | 11 | 1 | 110ms | ~15ms | 86% |
| 50 | 51 | 1 | 510ms | ~20ms | 96% |
| 100 | 101 | 1 | 1010ms | ~25ms | 97% |
In cross-region deployments (50ms per round-trip):
| Records | Latency (Before) | Latency (After) | Improvement |
|---|---|---|---|
| 100 | 5050ms | ~80ms | 98% |
From 5 seconds to 80 milliseconds. User experience goes from "Is this site down?" to "Instant."
The Bottom Line:
- Before: 100 records = 101 queries = 1+ second latency
- After: 100 records = 1 query = 25ms latency
- Cross-region: 5 seconds → 80ms (98% improvement)
Why Does This Work?
1. Network Round-Trips Are the Enemy
In modern systems, CPU and memory speeds are measured in nanoseconds, while network round-trips are measured in milliseconds—a difference of 6 orders of magnitude. Reducing network round-trips is often more effective than optimizing CPU computations.
2. The Database Is a Data Processing Expert
PostgreSQL's query optimizer has been refined over decades. It knows how to execute JOINs efficiently, how to leverage indexes, and how to parallelize operations. Delegating data aggregation to the database is far faster than looping in application code.
3. JSON_AGG Is an Underrated Superpower
Many people treat PostgreSQL as just a relational database, forgetting that since version 9.4, it's also an excellent JSON database. JSON_AGG + JSON_BUILD_OBJECT can perform complex data reshaping at the database layer, avoiding inefficient loop-based assembly in application code.
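On the application side, decoding the aggregated column is a one-liner. Here's a sketch of how the `attributes_json` value produced by `JSON_AGG(JSON_BUILD_OBJECT(...))` in the query above might be unpacked in Go; the struct and function names are illustrative, but the JSON shape matches the SQL:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Attribute mirrors one element of the JSON_AGG array:
// {"attr_id": ..., "value": ...}
type Attribute struct {
	AttrID int64  `json:"attr_id"`
	Value  string `json:"value"`
}

// decodeAttributes turns one row's attributes_json column
// into an attr_id -> value map for O(1) lookups.
func decodeAttributes(raw string) (map[int64]string, error) {
	var attrs []Attribute
	if err := json.Unmarshal([]byte(raw), &attrs); err != nil {
		return nil, err
	}
	byID := make(map[int64]string, len(attrs))
	for _, a := range attrs {
		byID[a.AttrID] = a.Value
	}
	return byID, nil
}

func main() {
	// What a single row's attributes_json might look like
	raw := `[{"attr_id": 101, "value": "Ada"}, {"attr_id": 102, "value": "42"}]`
	byID, err := decodeAttributes(raw)
	if err != nil {
		panic(err)
	}
	fmt.Println(byID[101]) // prints "Ada"
}
```

Because `COALESCE(e.value_text, e.value_numeric::text)` casts everything to text in the SQL, a plain `string` value type is enough here; a typed-value variant would need a richer struct.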
The CPU vs. Network Trade-off: When to Pivot
The CTE + JSON_AGG approach trades CPU for reduced network round-trips. But there's no free lunch—it's important to understand when this trade-off makes sense.
The Break-Even Analysis
Every optimization has a crossover point. Here's how to think about it:
In-database aggregation wins when:
Network_Latency × N_Records > CPU_Aggregation_Time + 1_Round_Trip

Let's plug in real numbers:
| Scenario | Network Round-Trip | Records | N+1 Latency | CTE Latency | Winner |
|---|---|---|---|---|---|
| Same machine (localhost) | 0.5ms | 10 | ~5ms | ~8ms | N+1 |
| Same datacenter | 2ms | 10 | ~20ms | ~10ms | CTE |
| Same datacenter | 2ms | 100 | ~200ms | ~25ms | CTE |
| Cross-region | 50ms | 10 | ~500ms | ~60ms | CTE |
| Cross-region | 50ms | 100 | ~5000ms | ~80ms | CTE |
Key insight: The crossover point is around 5-10 records when database and app share a datacenter, but even 2-3 records make CTE worthwhile in cross-region scenarios.
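The inequality above can be turned into a tiny latency model for sanity-checking your own numbers. This is a sketch with assumed constants: the CTE side is modeled as one round-trip plus a base query cost plus a per-record aggregation charge, and both constants are made up for illustration:

```go
package main

import "fmt"

// n1Latency models the N+1 pattern: one ID query plus one query per record.
func n1Latency(roundTripMs float64, records int) float64 {
	return roundTripMs * float64(records+1)
}

// cteLatency models CTE + JSON_AGG: one round-trip plus an assumed
// base query cost and per-record in-database aggregation cost.
func cteLatency(roundTripMs float64, records int) float64 {
	const baseQueryMs = 5.0      // assumption: planner + JOIN overhead
	const perRecordAggMs = 0.05  // assumption: JSON_BUILD_OBJECT per row
	return roundTripMs + baseQueryMs + perRecordAggMs*float64(records)
}

func main() {
	for _, rt := range []float64{0.5, 2, 50} {
		for _, n := range []int{10, 100} {
			fmt.Printf("rt=%4.1fms n=%3d  N+1=%7.1fms  CTE=%6.1fms\n",
				rt, n, n1Latency(rt, n), cteLatency(rt, n))
		}
	}
}
```

Plugging in the table's scenarios reproduces its verdicts: on localhost with 10 records N+1 still wins, while at 2ms or 50ms round-trips CTE wins by a widening margin. Swap in your own measured constants before trusting the crossover point.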
When to Keep Aggregation in the App Layer
CTE + JSON_AGG isn't always the answer:
1. Very small result sets with co-located services
If you're fetching 3-5 records and your app runs on the same machine as the database (common in development or small deployments), the overhead of building JSON in the database may exceed the N+1 penalty.
// For small, co-located workloads, simple queries may be faster
if recordCount < 5 && isLocalhost {
// Simple N+1 is fine here
}

2. Complex per-record transformations
If each record needs business logic that can't be expressed in SQL (calling external APIs, complex validation, ML inference), you'll loop anyway. Pushing aggregation to the database doesn't help.
// Example: Each record needs an external API call
for _, record := range records {
record.EnrichedData = callExternalAPI(record.ID) // Can't do this in SQL
}

3. Memory-constrained database servers
JSON_AGG builds the entire JSON array in memory before returning. For very large result sets (10K+ rows with complex nested data), this can cause memory pressure on the database server. In these cases, streaming results row-by-row may be safer.
But in practice, this rarely matters. Forma's architecture provides three layers of protection:
Layer 1: Hot Table eliminates aggregation for list queries
| Query Type | Fields Needed | Data Source | JSON_AGG? |
|---|---|---|---|
| Paginated list | 5-10 hot fields | entity_main only | No |
| Detail view | All fields, 1 record | entity_main + EAV | Yes, but tiny |
Most queries (80%+) are list views that only need hot fields—no JSON_AGG at all.
Layer 2: Real-world records have limited fields
Even when you aggregate EAV attributes, a typical record has 30-50 fields at most. A CRM contact with name, phone, email, company, address, and 20 custom fields? That's ~2-5KB of JSON per record. For a 100-record batch, that's 200-500KB—well within any database's comfort zone.
Layer 3: DuckDB handles complex analytics
For scenarios that would stress JSON_AGG—bulk exports, complex aggregations across thousands of records, OLAP-style queries—you shouldn't be using PostgreSQL anyway. That's exactly what the DuckDB + Parquet layer is for (see Part 3).
| Workload | Solution | Why |
|---|---|---|
| List views | Hot Table (PostgreSQL) | B-tree indexed, no aggregation |
| Detail views | CTE + JSON_AGG (PostgreSQL) | Single record, tiny payload |
| Bulk export / OLAP | DuckDB + Parquet | Columnar storage, designed for this |
The "memory explosion" scenario—aggregating hundreds of attributes across tens of thousands of records in PostgreSQL—simply doesn't occur in a properly architected system. Each layer handles what it's best at.
The Hybrid Approach
In practice, Forma uses a hybrid strategy:
| Query Type | Records | Strategy | Reason |
|---|---|---|---|
| Paginated list | 20-100 | CTE + JSON_AGG | Sweet spot for in-database aggregation |
| Single record detail | 1 | Direct query | No aggregation needed |
| Bulk export | 1000+ | Streaming cursor | Avoid memory pressure |
| Real-time dashboard | Variable | Depends on latency budget | Measure, then decide |
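The strategy table above can be sketched as a small dispatch function. The thresholds and strategy names here are illustrative defaults, not Forma's actual code; the point is that the decision is mechanical once you've measured your workload:

```go
package main

import "fmt"

type Strategy string

const (
	DirectQuery     Strategy = "direct query"
	CTEJsonAgg      Strategy = "CTE + JSON_AGG"
	StreamingCursor Strategy = "streaming cursor"
)

// chooseStrategy picks a query strategy from the expected record count
// and whether the database is co-located with the app.
// Thresholds are assumptions; measure your own crossover points.
func chooseStrategy(records int, coLocated bool) Strategy {
	switch {
	case records <= 1:
		return DirectQuery // single record, no aggregation needed
	case records <= 5 && coLocated:
		return DirectQuery // N+1 overhead is negligible on localhost
	case records <= 500:
		return CTEJsonAgg // sweet spot for in-database aggregation
	default:
		return StreamingCursor // avoid memory pressure on bulk reads
	}
}

func main() {
	fmt.Println(chooseStrategy(50, false))  // CTE + JSON_AGG
	fmt.Println(chooseStrategy(3, true))    // direct query
	fmt.Println(chooseStrategy(5000, true)) // streaming cursor
}
```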
Decision Flowchart
┌─────────────────┐
│ How many records│
│ are you fetching│
└────────┬────────┘
│
┌──────────────┼──────────────┐
│ │ │
1-5 records 5-500 records 500+ records
│ │ │
▼ ▼ ▼
┌─────────────────┐ ┌─────────────┐ ┌─────────────────┐
│ Is DB co-located│ │ Use CTE + │ │ Use streaming │
│ with app? │ │ JSON_AGG │ │ cursor or │
└────────┬────────┘ │ (default) │ │ pagination │
│ └─────────────┘ └─────────────────┘
Yes │ No
│ │
▼ ▼
┌─────────┐ ┌─────────┐
│ N+1 OK │ │ Use CTE │
└─────────┘ └─────────┘

Measuring Your Specific Workload
Don't trust rules of thumb. Measure:
-- Add EXPLAIN ANALYZE to see actual execution time
EXPLAIN ANALYZE
WITH filtered_ids AS (...)
SELECT ...;

Compare against your N+1 implementation under real network conditions. The numbers above are guidelines; your mileage will vary based on:
- Actual network latency (use `ping` or distributed tracing)
- Database connection pool configuration
- Complexity of your EAV schema
- Index coverage
When to Use This (and When Not To)
Good Fit
- EAV pattern: Attributes stored as rows that need to be aggregated into records
- Batch queries: Fetching multiple records at once (paginated lists, bulk exports, etc.)
- Latency-sensitive: Database and application not co-located
Limitations
- PostgreSQL 9.4+: Requires the `JSON_AGG` and `JSON_BUILD_OBJECT` functions
- Complex query maintainability: Deeply nested CTEs can hurt readability; consider encapsulating them in views or stored procedures
- Memory usage: `JSON_AGG` builds JSON arrays in memory; watch out for memory limits with very large result sets
What's Next: When Data Exceeds a Single Machine?
This post solved the N+1 query problem, and the previous post introduced hot table and JSON Schema design. But one question remains:
When historical data accumulates to billions of records and a single PostgreSQL instance can't handle it—what then?
Part 3 will introduce how we use DuckDB + CDC + Parquet to build a Serverless lakehouse architecture, and most critically—how we solve the trust crisis around "Lakehouse reading dirty data."
Series Navigation
- [Part 1] Why EAV is the Most Underrated Data Model for AI
- [Part 2] Killing N+1: How One SQL Trick Cut Our Latency by 40x ← You are here
- [Part 3] Zero Dirty Reads: Building a Trustworthy Lakehouse with DuckDB (Finale)
This post is based on engineering practices from the Forma project. Forma is a flexible data storage engine designed for the AI era.