Killing N+1: How One SQL Trick Cut Our Latency by 40x
Forma Engineering Blog · Series Part 2
TL;DR
We reduced database round-trips from 101 to 1, and latency from 1000ms to 25ms—a 97% improvement. The secret isn't black magic; it's a criminally underrated PostgreSQL feature: CTE + JSON_AGG.
If you're using the EAV (Entity-Attribute-Value) pattern for flexible data storage and getting killed by N+1 queries, this post is for you.
This isn't legacy tech. The CTE + JSON_AGG pattern works seamlessly with modern data stacks. Forma uses DuckDB for cold data queries and exports to Parquet on S3. The same query optimization principles apply whether you're hitting PostgreSQL directly or running federated queries across hot and cold storage. If you're building on a lakehouse architecture, these techniques are still relevant.
The Villain: The N+1 Query Nightmare
Let's set the scene.
Your SaaS app has a "Contacts" feature. Different customers need different fields: Customer A wants 12 fields, Customer B wants 30 fields, and Customer C changes their mind every week. To avoid running ALTER TABLE every time someone adds a field, you chose the EAV pattern—storing attributes as rows instead of columns.
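For concreteness, here is a minimal sketch of what such an EAV layout might look like. The table and column names match the queries later in this post; the extra columns (`name`, `created_at`) and the index are illustrative assumptions, not Forma's actual DDL:

```sql
-- Illustrative EAV schema (not Forma's exact DDL)
CREATE TABLE entity_main (
    row_id     BIGINT PRIMARY KEY,
    schema_id  BIGINT NOT NULL,
    name       TEXT,                       -- a few fixed "hot" columns live here
    created_at TIMESTAMPTZ DEFAULT now()
);

CREATE TABLE eav_table (
    row_id        BIGINT NOT NULL REFERENCES entity_main(row_id),
    schema_id     BIGINT NOT NULL,
    attr_id       BIGINT NOT NULL,          -- which custom field this row stores
    value_text    TEXT,                     -- one typed value column per type
    value_numeric NUMERIC,
    PRIMARY KEY (row_id, attr_id)
);

-- Index the filter step can lean on
CREATE INDEX idx_eav_schema ON eav_table (schema_id, attr_id, row_id);
```

Adding a new custom field is now just a new `attr_id`, with no `ALTER TABLE` required.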
Smart move. But when you write the query code, the nightmare begins:
// Step 1: Query EAV table to get matching row IDs (1 query)
rowIDs := db.Query("SELECT DISTINCT row_id FROM eav_table WHERE ...")
// Step 2: Loop through each row_id to fetch main table data (N queries!)
for _, rowID := range rowIDs {
record := db.Query("SELECT * FROM entity_main WHERE row_id = ?", rowID)
results = append(results, record)
}

This code looks straightforward, but it has a fatal flaw: fetching 100 records requires 101 database round-trips.
Let's do the math:
| Records | Queries | Round-trip Latency | Total Latency |
|---|---|---|---|
| 10 | 11 | 10ms | 110ms |
| 50 | 51 | 10ms | 510ms |
| 100 | 101 | 10ms | 1010ms |
One full second! Users click "Search" and wait a full second for results. It gets worse:
- Connection pool exhaustion: Under high concurrency, 101 queries per request will drain your connection pool fast
- Blocked on I/O: The application layer spends most of each request waiting on network I/O in a loop, doing no useful work
- Cross-region deployment? Multiply it: If your database is in another region, a single round-trip might be 50ms. That's 5 seconds for 100 records.
This is the N+1 query problem—a notorious performance killer in the ORM world.
The Hero: Let the Database Do What It's Good At
What's the root cause?
We're making the application layer assemble the data. The app loops through records, asking the database one by one: "Give me the details for this record." The database can only respond one by one.
But the database was born to do this! It has indexes, query optimizers, and vectorized execution engines—it's designed for batch data processing.
The solution: Move the loop into the database. Return all data in one query.
CTE + JSON_AGG: One Query to Rule Them All
PostgreSQL's CTE (Common Table Expression) and JSON_AGG function are the core of this solution.
Let's see the complete SQL first, then explain the mechanics:
WITH
-- Step 1: Find matching record IDs
filtered_ids AS (
SELECT DISTINCT row_id
FROM eav_table
WHERE schema_id = $1 AND /* filter conditions */
),
-- Step 2: Sort + Paginate
paginated AS (
SELECT row_id
FROM filtered_ids
ORDER BY /* sort conditions */
LIMIT $2 OFFSET $3
),
-- Step 3: Fetch main table data
main_data AS (
SELECT m.*
FROM paginated p
JOIN entity_main m ON m.row_id = p.row_id
),
-- Step 4: Aggregate EAV attributes into JSON
eav_json AS (
SELECT
p.row_id,
JSON_AGG(
JSON_BUILD_OBJECT(
'attr_id', e.attr_id,
'value', COALESCE(e.value_text, e.value_numeric::text)
)
) AS attributes
FROM paginated p
JOIN eav_table e ON e.row_id = p.row_id
GROUP BY p.row_id
)
-- Final result: main table + EAV attributes, returned in one shot
SELECT m.*, COALESCE(e.attributes, '[]') AS attributes_json
FROM main_data m
LEFT JOIN eav_json e ON e.row_id = m.row_id;

What Does This SQL Do?
- CTE as a pipeline: Each `WITH` clause is like a station on an assembly line, with data flowing from one step to the next
- JSON_AGG magic: Aggregates multiple EAV rows into a single JSON array, one array per `row_id`
- One round-trip: The entire query involves only one database interaction; all data comes back in a single package
How CTE Eliminates Round-Trips
A Common Table Expression (CTE) is like setting up an assembly line inside the database. Without CTE, your application plays ping-pong with the database:
App: "Give me the IDs" → DB responds (round-trip 1)
App: "Give me details for ID 1" → DB responds (round-trip 2)
App: "Give me details for ID 2" → DB responds (round-trip 3)
...101 times...

With CTE, you send one instruction:
App: "Database, figure out the IDs, fetch the details,
aggregate them into JSON, and give me everything
in one response."

The database's query planner optimizes the entire pipeline as a single execution unit. It can:
- Use indexes efficiently across all steps
- Parallelize independent operations
- Avoid serialization/deserialization overhead between steps
This is computation pushdown—letting the database do what it was built to do.
The application layer just needs to parse JSON:
// One query, all data (pseudocode; error handling omitted for brevity)
rows := db.Query(cteSQL, schemaID, pageSize, offset)
for rows.Next() {
    var record Record
    var attributesJSON string
    rows.Scan(&record, &attributesJSON) // scan main-table columns plus the JSON
    json.Unmarshal([]byte(attributesJSON), &record.Attributes)
    results = append(results, record)
}

Performance Comparison: Numbers Don't Lie
Before vs. after optimization:
| Records | Queries (Before) | Queries (After) | Latency (Before) | Latency (After) | Improvement |
|---|---|---|---|---|---|
| 10 | 11 | 1 | 110ms | ~15ms | 86% |
| 50 | 51 | 1 | 510ms | ~20ms | 96% |
| 100 | 101 | 1 | 1010ms | ~25ms | 97% |
In cross-region deployments (50ms per round-trip):
| Records | Latency (Before) | Latency (After) | Improvement |
|---|---|---|---|
| 100 | 5050ms | ~80ms | 98% |
From 5 seconds to 80 milliseconds. User experience goes from "Is this site down?" to "Instant."
The Bottom Line:
- Before: 100 records = 101 queries = 1+ second latency
- After: 100 records = 1 query = 25ms latency
- Cross-region: 5 seconds → 80ms (98% improvement)
Why Does This Work?
1. Network Round-Trips Are the Enemy
In modern systems, CPU and memory speeds are measured in nanoseconds, while network round-trips are measured in milliseconds—a difference of 6 orders of magnitude. Reducing network round-trips is often more effective than optimizing CPU computations.
2. The Database Is a Data Processing Expert
PostgreSQL's query optimizer has been refined over decades. It knows how to execute JOINs efficiently, how to leverage indexes, and how to parallelize operations. Delegating data aggregation to the database is far faster than looping in application code.
3. JSON_AGG Is an Underrated Superpower
Many people treat PostgreSQL as just a relational database, forgetting that since version 9.4, it's also an excellent JSON database. JSON_AGG + JSON_BUILD_OBJECT can perform complex data reshaping at the database layer, avoiding inefficient loop-based assembly in application code.
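On the application side, decoding the aggregated column is a one-liner. Here's a sketch of how the `attributes_json` value produced by `JSON_AGG(JSON_BUILD_OBJECT(...))` in the query above might be unpacked in Go; the struct and function names are illustrative, but the JSON shape matches the SQL:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Attribute mirrors one element of the JSON_AGG array:
// {"attr_id": ..., "value": ...}
type Attribute struct {
	AttrID int64  `json:"attr_id"`
	Value  string `json:"value"`
}

// decodeAttributes turns one row's attributes_json column
// into an attr_id -> value map for O(1) lookups.
func decodeAttributes(raw string) (map[int64]string, error) {
	var attrs []Attribute
	if err := json.Unmarshal([]byte(raw), &attrs); err != nil {
		return nil, err
	}
	byID := make(map[int64]string, len(attrs))
	for _, a := range attrs {
		byID[a.AttrID] = a.Value
	}
	return byID, nil
}

func main() {
	// What a single row's attributes_json might look like
	raw := `[{"attr_id": 101, "value": "Ada"}, {"attr_id": 102, "value": "42"}]`
	byID, err := decodeAttributes(raw)
	if err != nil {
		panic(err)
	}
	fmt.Println(byID[101]) // prints "Ada"
}
```

Because `COALESCE(e.value_text, e.value_numeric::text)` casts everything to text in the SQL, a plain `string` value type is enough here; a typed-value variant would need a richer struct.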
The CPU vs. Network Trade-off: When to Pivot
The CTE + JSON_AGG approach trades CPU for reduced network round-trips. But there's no free lunch—it's important to understand when this trade-off makes sense.
The Break-Even Analysis
Every optimization has a crossover point. Here's how to think about it:
In-database aggregation wins when:
Network_Latency × N_Records > CPU_Aggregation_Time + 1_Round_Trip

Let's plug in real numbers:
| Scenario | Network Round-Trip | Records | N+1 Latency | CTE Latency | Winner |
|---|---|---|---|---|---|
| Same machine (localhost) | 0.5ms | 10 | ~5ms | ~8ms | N+1 |
| Same datacenter | 2ms | 10 | ~20ms | ~10ms | CTE |
| Same datacenter | 2ms | 100 | ~200ms | ~25ms | CTE |
| Cross-region | 50ms | 10 | ~500ms | ~60ms | CTE |
| Cross-region | 50ms | 100 | ~5000ms | ~80ms | CTE |
Key insight: The crossover point is around 5-10 records when database and app share a datacenter, but even 2-3 records make CTE worthwhile in cross-region scenarios.
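The inequality above can be turned into a tiny latency model for sanity-checking your own numbers. This is a sketch with assumed constants: the CTE side is modeled as one round-trip plus a base query cost plus a per-record aggregation charge, and both constants are made up for illustration:

```go
package main

import "fmt"

// n1Latency models the N+1 pattern: one ID query plus one query per record.
func n1Latency(roundTripMs float64, records int) float64 {
	return roundTripMs * float64(records+1)
}

// cteLatency models CTE + JSON_AGG: one round-trip plus an assumed
// base query cost and per-record in-database aggregation cost.
func cteLatency(roundTripMs float64, records int) float64 {
	const baseQueryMs = 5.0      // assumption: planner + JOIN overhead
	const perRecordAggMs = 0.05  // assumption: JSON_BUILD_OBJECT per row
	return roundTripMs + baseQueryMs + perRecordAggMs*float64(records)
}

func main() {
	for _, rt := range []float64{0.5, 2, 50} {
		for _, n := range []int{10, 100} {
			fmt.Printf("rt=%4.1fms n=%3d  N+1=%7.1fms  CTE=%6.1fms\n",
				rt, n, n1Latency(rt, n), cteLatency(rt, n))
		}
	}
}
```

Plugging in the table's scenarios reproduces its verdicts: on localhost with 10 records N+1 still wins, while at 2ms or 50ms round-trips CTE wins by a widening margin. Swap in your own measured constants before trusting the crossover point.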
When to Keep Aggregation in the App Layer
CTE + JSON_AGG isn't always the answer:
1. Very small result sets with co-located services
If you're fetching 3-5 records and your app runs on the same machine as the database (common in development or small deployments), the overhead of building JSON in the database may exceed the N+1 penalty.
// For small, co-located workloads, simple queries may be faster
if recordCount < 5 && isLocalhost {
// Simple N+1 is fine here
}

2. Complex per-record transformations
If each record needs business logic that can't be expressed in SQL (calling external APIs, complex validation, ML inference), you'll loop anyway. Pushing aggregation to the database doesn't help.
// Example: Each record needs an external API call
for _, record := range records {
record.EnrichedData = callExternalAPI(record.ID) // Can't do this in SQL
}

3. Memory-constrained database servers
JSON_AGG builds the entire JSON array in memory before returning. For very large result sets (10K+ rows with complex nested data), this can cause memory pressure on the database server. In these cases, streaming results row-by-row may be safer.
But in practice, this rarely matters. Forma's architecture provides three layers of protection:
Layer 1: Hot Table eliminates aggregation for list queries
| Query Type | Fields Needed | Data Source | JSON_AGG? |
|---|---|---|---|
| Paginated list | 5-10 hot fields | entity_main only | No |
| Detail view | All fields, 1 record | entity_main + EAV | Yes, but tiny |
Most queries (80%+) are list views that only need hot fields—no JSON_AGG at all.
Layer 2: Real-world records have limited fields
Even when you aggregate EAV attributes, a typical record has 30-50 fields at most. A CRM contact with name, phone, email, company, address, and 20 custom fields? That's ~2-5KB of JSON per record. For a 100-record batch, that's 200-500KB—well within any database's comfort zone.
Layer 3: DuckDB handles complex analytics
For scenarios that would stress JSON_AGG—bulk exports, complex aggregations across thousands of records, OLAP-style queries—you shouldn't be using PostgreSQL anyway. That's exactly what the DuckDB + Parquet layer is for (see Part 3).
| Workload | Solution | Why |
|---|---|---|
| List views | Hot Table (PostgreSQL) | B-tree indexed, no aggregation |
| Detail views | CTE + JSON_AGG (PostgreSQL) | Single record, tiny payload |
| Bulk export / OLAP | DuckDB + Parquet | Columnar storage, designed for this |
The "memory explosion" scenario—aggregating hundreds of attributes across tens of thousands of records in PostgreSQL—simply doesn't occur in a properly architected system. Each layer handles what it's best at.
The Hybrid Approach
In practice, Forma uses a hybrid strategy:
| Query Type | Records | Strategy | Reason |
|---|---|---|---|
| Paginated list | 20-100 | CTE + JSON_AGG | Sweet spot for in-database aggregation |
| Single record detail | 1 | Direct query | No aggregation needed |
| Bulk export | 1000+ | Streaming cursor | Avoid memory pressure |
| Real-time dashboard | Variable | Depends on latency budget | Measure, then decide |
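The strategy table above can be sketched as a small dispatch function. The thresholds and strategy names here are illustrative defaults, not Forma's actual code; the point is that the decision is mechanical once you've measured your workload:

```go
package main

import "fmt"

type Strategy string

const (
	DirectQuery     Strategy = "direct query"
	CTEJsonAgg      Strategy = "CTE + JSON_AGG"
	StreamingCursor Strategy = "streaming cursor"
)

// chooseStrategy picks a query strategy from the expected record count
// and whether the database is co-located with the app.
// Thresholds are assumptions; measure your own crossover points.
func chooseStrategy(records int, coLocated bool) Strategy {
	switch {
	case records <= 1:
		return DirectQuery // single record, no aggregation needed
	case records <= 5 && coLocated:
		return DirectQuery // N+1 overhead is negligible on localhost
	case records <= 500:
		return CTEJsonAgg // sweet spot for in-database aggregation
	default:
		return StreamingCursor // avoid memory pressure on bulk reads
	}
}

func main() {
	fmt.Println(chooseStrategy(50, false))  // CTE + JSON_AGG
	fmt.Println(chooseStrategy(3, true))    // direct query
	fmt.Println(chooseStrategy(5000, true)) // streaming cursor
}
```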
Decision Flowchart
┌─────────────────┐
│ How many records│
│ are you fetching│
└────────┬────────┘
│
┌──────────────┼──────────────┐
│ │ │
1-5 records 5-500 records 500+ records
│ │ │
▼ ▼ ▼
┌─────────────────┐ ┌─────────────┐ ┌─────────────────┐
│ Is DB co-located│ │ Use CTE + │ │ Use streaming │
│ with app? │ │ JSON_AGG │ │ cursor or │
└────────┬────────┘ │ (default) │ │ pagination │
│ └─────────────┘ └─────────────────┘
Yes │ No
│ │
▼ ▼
┌─────────┐ ┌─────────┐
│ N+1 OK │ │ Use CTE │
└─────────┘ └─────────┘

Measuring Your Specific Workload
Don't trust rules of thumb. Measure:
-- Add EXPLAIN ANALYZE to see actual execution time
EXPLAIN ANALYZE
WITH filtered_ids AS (...)
SELECT ...;

Compare against your N+1 implementation under real network conditions. The numbers above are guidelines; your mileage will vary based on:
- Actual network latency (use `ping` or distributed tracing)
- Database connection pool configuration
- Complexity of your EAV schema
- Index coverage
When to Use This (and When Not To)
Good Fit
- EAV pattern: Attributes stored as rows that need to be aggregated into records
- Batch queries: Fetching multiple records at once (paginated lists, bulk exports, etc.)
- Latency-sensitive: Database and application not co-located
Limitations
- PostgreSQL 9.4+: Requires the `JSON_AGG` and `JSON_BUILD_OBJECT` functions
- Complex query maintainability: Deeply nested CTEs can hurt readability; consider encapsulating them in views or stored procedures
- Memory usage: `JSON_AGG` builds JSON arrays in memory; watch out for memory limits with very large result sets
What's Next: When Data Exceeds a Single Machine?
This post solved the N+1 query problem, and the previous post introduced hot table and JSON Schema design. But one question remains:
When historical data accumulates to billions of records and a single PostgreSQL instance can't handle it—what then?
Part 3 will introduce how we use DuckDB + CDC + Parquet to build a Serverless lakehouse architecture, and most critically—how we solve the trust crisis around "Lakehouse reading dirty data."
Series Navigation
- [Part 1] Why EAV is the Most Underrated Data Model for AI
- [Part 2] Killing N+1: How One SQL Trick Cut Our Latency by 40x ← You are here
- [Part 3] Zero Dirty Reads: Building a Trustworthy Lakehouse with DuckDB (Finale)
This post is based on engineering practices from the Forma project. Forma is a flexible data storage engine designed for the AI era.