EmergentDB

What if Your Database Learned Like an LLM?

51x
Faster than ChromaDB
82x
Faster than LanceDB
100%
Recall
5.6M
Vectors/sec Insert
Rach Pradhan · Machine Learning Singapore

What are Embeddings?

Numerical representations that capture meaning. Similar concepts = similar numbers.

Text → Numbers
"How do I fix a flat tire?"
[0.12, -0.34, 0.56, 0.23, -0.11, ...] (768 dims)
"Changing a punctured wheel"
[0.11, -0.33, 0.55, 0.22, -0.10, ...] (768 dims)
Very similar! (cosine: 0.98)
"Best pizza in NYC"
[0.89, 0.12, -0.45, 0.67, 0.33, ...] (768 dims)
Very different! (cosine: 0.12)
[Figure: Semantic space, 2D projection, with separate Automotive and Food clusters. Similar meanings cluster together in vector space.]

EmergentDB stores these vectors and finds the most similar ones at blazing speed.
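The comparison above is cosine similarity, the operation a vector database runs millions of times per query. A minimal NumPy sketch with synthetic stand-in vectors (not real model embeddings):

import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Cosine of the angle between two vectors: 1.0 = same direction.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
flat_tire = rng.normal(size=768)                    # stand-in embedding
punctured = flat_tire + 0.1 * rng.normal(size=768)  # near-duplicate meaning
pizza = rng.normal(size=768)                        # unrelated meaning

print(cosine_similarity(flat_tire, punctured))  # close to 1.0
print(cosine_similarity(flat_tire, pizza))      # close to 0.0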

The Retrieval Cousin to Transformers

LLMs learn by evolving weights through backpropagation. What if databases could learn by evolving their structure?

Transformers (LLMs)
  • Learn: Weights. Billions of parameters adjusted via gradient descent.
  • Objective: Loss Function. Minimize cross-entropy, maximize likelihood.
  • Method: Backpropagation. Compute gradients, update weights:
    θ ← θ - α∇L(θ)
EmergentDB
  • Learn: Structure. Index type, hyperparameters, insertion strategy.
  • Objective: Fitness Function. Maximize recall, minimize latency and memory.
  • Method: MAP-Elites Evolution. Select, mutate, evaluate, place in archive:
    archive[cell] ← best(mutate(elite))

Both are self-improving systems that adapt to data.

Manual Tuning Hell

It started with a simple question: "What if we made a database that evolves?"

HNSW M=16? M=32? ef=100?
Most teams guess and hope
Workload Mismatch
Optimal for 1K ≠ optimal for 100K
Recall vs Speed
You shouldn't have to choose
[Interactive demo: tune the HNSW M parameter by hand. At M=34: recall (accuracy) 75.0%, search speed 55%. Still guessing...]

Dual Quality-Diversity System

EmergentDB runs two independent evolution processes simultaneously.

IndexQD
Evolves the Search Index
3D Behavior Grid:
Recall (0-100%)
Latency (μs)
Memory (MB)
Evolves: Index type (HNSW/Flat/IVF), M, ef_construction, ef_search, nlist, nprobe
InsertQD
Evolves Insertion Strategy
2D Behavior Grid:
Throughput (vec/s)
CPU Efficiency (%)
Evolves: SIMD strategy, batch size, parallelism settings
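
A rough sketch of the split, with illustrative names (these dictionaries are not EmergentDB's actual types); each process keeps its own archive and evolves independently:

# Illustrative only: names and shapes are assumptions, not EmergentDB's API.
index_qd = {
    "behavior_dims": ("recall", "latency_us", "memory_mb"),     # 3D grid
    "genome": ("index_type", "m", "ef_construction",
               "ef_search", "nlist", "nprobe"),
}
insert_qd = {
    "behavior_dims": ("throughput_vps", "cpu_efficiency_pct"),  # 2D grid
    "genome": ("simd_strategy", "batch_size", "parallelism"),
}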

What is Quality Diversity?

Most optimization finds the single best solution. QD finds a diverse archive of high-performers.

"Don't just find the needle in the haystack; find every type of needle."
Standard Optimization
Converges to one point. If that point doesn't fit your constraints, you fail.
Quality Diversity
Illuminates the entire fitness landscape. Gives you a menu of options.

Traditional vs QD

Traditional
  • Single best solution
  • May get stuck in local optima
  • One config for all workloads
  • Requires manual tuning
MAP-Elites
  • Archive of diverse solutions
  • Explores entire behavior space
  • Multiple configs for different needs
  • Self-discovers optimal params

MAP-Elites Grid

Multi-dimensional Archive of Phenotypic Elites

[Interactive grid: Latency × Recall axes; each cell shows its elite configuration]
1. Define Behavior Space
Choose dimensions: Recall × Latency × Memory
2. Discretize into Grid
6³ = 216 cells, each a unique trade-off niche
3. Keep the Elite
Best solution per cell wins, replaced if beaten
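
The three steps above fit in a few lines of Python. A minimal sketch (the bin count comes from this slide; the behavior bounds and helper names are assumptions, not EmergentDB's internals):

import numpy as np

BINS = 6                                  # 6 bins per dim -> 6^3 = 216 cells
LOWS = np.array([0.0, 0.0, 0.0])          # recall, latency (μs), memory (MB): assumed bounds
HIGHS = np.array([1.0, 1000.0, 1024.0])   # upper bounds, for illustration only
archive = {}                              # grid cell -> (fitness, config)

def to_cell(behavior):
    # Steps 1-2: normalize a (recall, latency, memory) point and bin it.
    b = (np.asarray(behavior) - LOWS) / (HIGHS - LOWS)
    return tuple(np.clip((b * BINS).astype(int), 0, BINS - 1))

def place(config, fitness, behavior):
    # Step 3: keep the elite; the best solution per cell wins, replaced if beaten.
    cell = to_cell(behavior)
    if cell not in archive or fitness > archive[cell][0]:
        archive[cell] = (fitness, config)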

IndexQD Deep Dive

3D Behavior Space: Recall × Latency × Memory

Genome Structure
What gets evolved
Parameter        Range / Options     Default
index_type       HNSW | Flat | IVF   HNSW
m                4-64                16
ef_construction  50-500              100
ef_search        10-200              50
nlist            1-256               128
nprobe           1-32                8
Fitness Function
Geometric mean ensures ALL metrics matter
fitness = (recall^w1 × speed^w2 × memory^w3)^(1/Σw)
search_first(): 50% recall, 40% speed, 5% memory
balanced(): 30% recall, 30% speed, 20% memory
99% Recall Floor
Configs with recall < 99% get cubic penalty:
penalty = (recall / 0.99)³
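
Combining the geometric mean with the recall floor, a minimal Python sketch (weights mirror the balanced() preset above; speed and memory scores are assumed pre-normalized to [0, 1], higher is better):

def fitness(recall, speed, memory, w=(0.3, 0.3, 0.2)):
    w1, w2, w3 = w
    # Weighted geometric mean: a near-zero score in ANY metric sinks the total.
    score = (recall**w1 * speed**w2 * memory**w3) ** (1 / (w1 + w2 + w3))
    # 99% recall floor: below it, apply the cubic penalty from the slide.
    if recall < 0.99:
        score *= (recall / 0.99) ** 3
    return score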

InsertQD Deep Dive

2D Behavior Space: Throughput × CPU Efficiency

6 SIMD Strategies Compete:
SimdSequential
One vector at a time
SimdBatch
Batch normalization
SimdParallel
Multi-threaded + SIMD
SimdChunked
L2 cache-friendly chunks
SimdUnrolled
4-way loop unrolling
SimdInterleaved (winner)
Two-pass (norms then scale)
Winner: SimdInterleaved
5.6M
vectors per second on modern CPUs
ARM NEON Optimization
128-bit NEON registers (4x f32)
Fused multiply-accumulate (vfmaq_f32)
4x loop unrolling for pipelining
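
The winner's two-pass structure is easy to show in NumPy; this sketch captures the shape of the idea, while the real throughput comes from the NEON intrinsics above:

import numpy as np

def insert_interleaved(batch: np.ndarray) -> np.ndarray:
    # Pass 1: compute every vector's norm. Pass 2: scale by it.
    # Keeping the passes separate makes each hot loop a pure streaming
    # operation that vectorizes cleanly (NEON fused multiply-accumulate
    # with 4x unrolling in the actual engine).
    norms = np.linalg.norm(batch, axis=1, keepdims=True)
    return batch / norms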

The Evolutionary Loop

Select → Mutate → Evaluate → Place
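
One generation of the loop in Python, as a compact sketch (helper signatures are assumptions; place is the keep-the-elite rule from the MAP-Elites sketch above, and the archive is assumed to be seeded with a few random configs):

import random

def evolve(archive, evaluate, mutate, place, generations=10_000):
    for _ in range(generations):
        _, parent = random.choice(list(archive.values()))  # Select a random elite
        child = mutate(parent)                             # Mutate its genome
        fit, behavior = evaluate(child)                    # Evaluate on real queries
        place(child, fit, behavior)                        # Place: keep if it beats the cell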

Benchmark Results

768-dim Gemini embeddings (real semantic vectors)

[Bar chart: search latency (μs, 0-3600 scale) for EmergentDB (m=8), EmergentDB (m=16), ChromaDB, LanceDB]
51x
vs ChromaDB
82x
vs LanceDB
44μs
Best Latency
100%
Recall

The Curse of Dimensionality

In high dimensions, random vectors are all nearly equidistant. Real embeddings have semantic structure.

Random Vectors

No structure. All points equidistant. HNSW graphs become random.

~35% Recall

Real Embeddings

Clustered by meaning. HNSW exploits local structure for speed.

100% Recall
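
The effect is easy to reproduce in NumPy with synthetic stand-ins for real embeddings; the spread of distances (std/mean) is exactly the structure HNSW needs to navigate:

import numpy as np

rng = np.random.default_rng(0)

# Random 768-dim vectors: pairwise distances concentrate around one value.
random_vecs = rng.normal(size=(1000, 768))
d = np.linalg.norm(random_vecs[0] - random_vecs[1:], axis=1)
print(d.std() / d.mean())    # tiny: every point looks equally far away

# Clustered vectors (a crude stand-in for real embeddings): structure returns.
centers = 5 * rng.normal(size=(10, 768))
clustered = centers[rng.integers(0, 10, size=1000)] + rng.normal(size=(1000, 768))
d2 = np.linalg.norm(clustered[0] - clustered[1:], axis=1)
print(d2.std() / d2.mean())  # much larger: near vs far is meaningful again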

Instant Optimization

Don't want to wait for evolution? We shipped a pre-computed grid of industry standards.

Speed

m=8, ef=50 → 75-100% recall

Balanced

m=16, ef=100 → 92-100% recall

Accuracy

m=24, ef=200 → 98-100% recall
grid.recommend(10_000, "balanced")
Now Live at emergentdb.com

Introducing Bolt

What happens when you let evolution find the fastest possible vector search?
Bolt is the answer — the elite configuration discovered by EmergentDB's QD optimization.

MAP-Elites
Explored 1000s of configs
Natural Selection
Tested on real workloads
Bolt
The evolved champion
885M
vec/sec scan rate
2.5x
faster than FAISS
28x
faster than ChromaDB
100%
exact recall

"We didn't hand-tune Bolt. We let the algorithm discover what humans struggle to optimize.
The architecture is proprietary — born from evolution, not engineering."

Bolt vs FAISS

100K vectors, Inner Product, Top-10, 100 queries • Apple Silicon

Batched Latency (ms/query) — Lower is better
Detailed Comparison
Dim     Bolt (ms)   FAISS (ms)   Speedup
768d    0.11        0.23         2.1x
1536d   0.15        0.38         2.5x
3072d   0.28        0.66         2.4x
Sequential mode: consistently 1.5x faster. Batched mode: up to 2.5x faster.
885M
vec/s peak scan
10M
vectors tested
100%
exact match

Bolt vs ChromaDB

Bolt searches 100K vectors while ChromaDB searches only 10K — and Bolt is still faster.

Batched Latency (ms/query)
Dim                      Bolt (100K)   Chroma (10K)   Speedup
768d (Gemini/Cohere)     0.11 ms       1.97 ms        17.9x
1536d (OpenAI ada-002)   0.15 ms       3.89 ms        25.9x
3072d (OpenAI large)     0.28 ms       7.7 ms         27.5x
18-28x
faster in batched mode — scanning 10x more data
Bolt: Evolved proprietary engine (exact search) • ChromaDB: Python + HNSW (default) • Inner Product, Top-10, 100 queries

What's Next

grep-code (Coming Soon)
QD-optimized code search at scale

Extends EmergentDB's evolution to discrete code-search strategies: sift through codebases with automatically selected optimal strategies.

60K
files/sec
44
precomputed elites
LoadQD: 12 file loading strategies
SearchQD: 32 search strategies
EmergentSearch: 100% optimal auto-selection
// Exact match + semantic in one DB
grep: "class AuthError" → 49K files/sec
vector: "auth failure" → 42μs/query
Python Bindings (Next)
Native Python integration via PyO3

Use EmergentDB directly in Python with zero-copy data transfer and full async support.

# Coming soon
from emergentdb import VectorDB

db = VectorDB()
db.insert(embeddings)             # embeddings: e.g. a NumPy (n, dims) array
results = db.search(query, k=10)  # top-10 nearest neighbors for a query vector
NumPy/PyTorch tensor support
Async/await compatible

"Best of both worlds: exact search + semantic search in one database."

Connect

Join the conversation and check out the code.

Special thanks to the QD pioneers: Jean-Baptiste Mouret, Jeff Clune, and Kenneth Stanley.