EmergentDB

What if Your Database Learned Like an LLM?

51x
Faster than ChromaDB
82x
Faster than LanceDB
100%
Recall
5.6M
Vectors/sec Insert
Rach Pradhan · Machine Learning Singapore

What are Embeddings?

Numerical representations that capture meaning. Similar concepts = similar numbers.

Text → Numbers
"How do I fix a flat tire?"
[0.12, -0.34, 0.56, 0.23, -0.11, ...] (768 dims)
"Changing a punctured wheel"
[0.11, -0.33, 0.55, 0.22, -0.10, ...] (768 dims)
Very similar! (cosine: 0.98)
"Best pizza in NYC"
[0.89, 0.12, -0.45, 0.67, 0.33, ...] (768 dims)
Very different! (cosine: 0.12)
[Figure: Semantic space, 2D projection, with separate Automotive and Food clusters. Similar meanings cluster together in vector space.]

EmergentDB stores these vectors and finds the most similar ones at blazing speed.
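The comparison above is cosine similarity, the operation a vector database runs millions of times per query. A minimal NumPy sketch with synthetic stand-in vectors (not real model embeddings):

import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Cosine of the angle between two vectors: 1.0 = same direction.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
flat_tire = rng.normal(size=768)                    # stand-in embedding
punctured = flat_tire + 0.1 * rng.normal(size=768)  # near-duplicate meaning
pizza = rng.normal(size=768)                        # unrelated meaning

print(cosine_similarity(flat_tire, punctured))  # close to 1.0
print(cosine_similarity(flat_tire, pizza))      # close to 0.0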

The Retrieval Cousin to Transformers

LLMs learn by evolving weights through backpropagation. What if databases could learn by evolving their structure?

Transformers (LLMs)
  • Learn: Weights. Billions of parameters adjusted via gradient descent.
  • Objective: Loss Function. Minimize cross-entropy, maximize likelihood.
  • Method: Backpropagation. Compute gradients, update weights:
    θ ← θ - α∇L(θ)
EmergentDB
  • Learn: Structure. Index type, hyperparameters, insertion strategy.
  • Objective: Fitness Function. Maximize recall, minimize latency and memory.
  • Method: MAP-Elites Evolution. Select, mutate, evaluate, place in archive:
    archive[cell] ← best(mutate(elite))

Both are self-improving systems that adapt to data.

Manual Tuning Hell

It started with a simple question: "What if we made a database that evolves?"

HNSW M=16? M=32? ef=100?
Most teams guess and hope
Workload Mismatch
Optimal for 1K ≠ optimal for 100K
Recall vs Speed
You shouldn't have to choose
[Interactive demo: tune the HNSW M parameter by hand. At M=34: recall (accuracy) 75.0%, search speed 55%. Still guessing...]

Dual Quality-Diversity System

EmergentDB runs two independent evolution processes simultaneously.

IndexQD
Evolves the Search Index
3D Behavior Grid:
Recall (0-100%)
Latency (μs)
Memory (MB)
Evolves: Index type (HNSW/Flat/IVF), M, ef_construction, ef_search, nlist, nprobe
InsertQD
Evolves Insertion Strategy
2D Behavior Grid:
Throughput (vec/s)
CPU Efficiency (%)
Evolves: SIMD strategy, batch size, parallelism settings
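
A rough sketch of the split, with illustrative names (these dictionaries are not EmergentDB's actual types); each process keeps its own archive and evolves independently:

# Illustrative only: names and shapes are assumptions, not EmergentDB's API.
index_qd = {
    "behavior_dims": ("recall", "latency_us", "memory_mb"),     # 3D grid
    "genome": ("index_type", "m", "ef_construction",
               "ef_search", "nlist", "nprobe"),
}
insert_qd = {
    "behavior_dims": ("throughput_vps", "cpu_efficiency_pct"),  # 2D grid
    "genome": ("simd_strategy", "batch_size", "parallelism"),
}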

What is Quality Diversity?

Most optimization finds the single best solution. QD finds a diverse archive of high-performers.

"Don't just find the needle in the haystack; find every type of needle."
Standard Optimization
Converges to one point. If that point doesn't fit your constraints, you fail.
Quality Diversity
Illuminates the entire fitness landscape. Gives you a menu of options.

Traditional vs QD

Traditional
  • Single best solution
  • May get stuck in local optima
  • One config for all workloads
  • Requires manual tuning
MAP-Elites
  • Archive of diverse solutions
  • Explores entire behavior space
  • Multiple configs for different needs
  • Self-discovers optimal params

MAP-Elites Grid

Multi-dimensional Archive of Phenotypic Elites

[Interactive grid: Latency × Recall axes; each cell shows its elite configuration]
1. Define Behavior Space
Choose dimensions: Recall × Latency × Memory
2. Discretize into Grid
6³ = 216 cells, each a unique trade-off niche
3. Keep the Elite
Best solution per cell wins, replaced if beaten
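
The three steps above fit in a few lines of Python. A minimal sketch (the bin count comes from this slide; the behavior bounds and helper names are assumptions, not EmergentDB's internals):

import numpy as np

BINS = 6                                  # 6 bins per dim -> 6^3 = 216 cells
LOWS = np.array([0.0, 0.0, 0.0])          # recall, latency (μs), memory (MB): assumed bounds
HIGHS = np.array([1.0, 1000.0, 1024.0])   # upper bounds, for illustration only
archive = {}                              # grid cell -> (fitness, config)

def to_cell(behavior):
    # Steps 1-2: normalize a (recall, latency, memory) point and bin it.
    b = (np.asarray(behavior) - LOWS) / (HIGHS - LOWS)
    return tuple(np.clip((b * BINS).astype(int), 0, BINS - 1))

def place(config, fitness, behavior):
    # Step 3: keep the elite; the best solution per cell wins, replaced if beaten.
    cell = to_cell(behavior)
    if cell not in archive or fitness > archive[cell][0]:
        archive[cell] = (fitness, config)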

IndexQD Deep Dive

3D Behavior Space: Recall × Latency × Memory

Genome Structure
What gets evolved
Parameter        Range / Options     Default
index_type       HNSW | Flat | IVF   HNSW
m                4-64                16
ef_construction  50-500              100
ef_search        10-200              50
nlist            1-256               128
nprobe           1-32                8
Fitness Function
Geometric mean ensures ALL metrics matter
fitness = (recall^w1 × speed^w2 × memory^w3)^(1/Σw)
search_first(): 50% recall, 40% speed, 5% memory
balanced(): 30% recall, 30% speed, 20% memory
99% Recall Floor
Configs with recall < 99% get cubic penalty:
penalty = (recall / 0.99)³
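
Combining the geometric mean with the recall floor, a minimal Python sketch (weights mirror the balanced() preset above; speed and memory scores are assumed pre-normalized to [0, 1], higher is better):

def fitness(recall, speed, memory, w=(0.3, 0.3, 0.2)):
    w1, w2, w3 = w
    # Weighted geometric mean: a near-zero score in ANY metric sinks the total.
    score = (recall**w1 * speed**w2 * memory**w3) ** (1 / (w1 + w2 + w3))
    # 99% recall floor: below it, apply the cubic penalty from the slide.
    if recall < 0.99:
        score *= (recall / 0.99) ** 3
    return score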

InsertQD Deep Dive

2D Behavior Space: Throughput × CPU Efficiency

6 SIMD Strategies Compete:
SimdSequential
One vector at a time
SimdBatch
Batch normalization
SimdParallel
Multi-threaded + SIMD
SimdChunked
L2 cache-friendly chunks
SimdUnrolled
4-way loop unrolling
SimdInterleaved (winner)
Two-pass (norms then scale)
Winner: SimdInterleaved
5.6M
vectors per second on modern CPUs
ARM NEON Optimization
128-bit NEON registers (4x f32)
Fused multiply-accumulate (vfmaq_f32)
4x loop unrolling for pipelining
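
The winner's two-pass structure is easy to show in NumPy; this sketch captures the shape of the idea, while the real throughput comes from the NEON intrinsics above:

import numpy as np

def insert_interleaved(batch: np.ndarray) -> np.ndarray:
    # Pass 1: compute every vector's norm. Pass 2: scale by it.
    # Keeping the passes separate makes each hot loop a pure streaming
    # operation that vectorizes cleanly (NEON fused multiply-accumulate
    # with 4x unrolling in the actual engine).
    norms = np.linalg.norm(batch, axis=1, keepdims=True)
    return batch / norms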

The Evolutionary Loop

Select → Mutate → Evaluate → Place
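
One generation of the loop in Python, as a compact sketch (helper signatures are assumptions; place is the keep-the-elite rule from the MAP-Elites sketch above, and the archive is assumed to be seeded with a few random configs):

import random

def evolve(archive, evaluate, mutate, place, generations=10_000):
    for _ in range(generations):
        _, parent = random.choice(list(archive.values()))  # Select a random elite
        child = mutate(parent)                             # Mutate its genome
        fit, behavior = evaluate(child)                    # Evaluate on real queries
        place(child, fit, behavior)                        # Place: keep if it beats the cell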

Benchmark Results

768-dim Gemini embeddings (real semantic vectors)

[Bar chart: search latency (μs, 0-3600 scale) for EmergentDB (m=8), EmergentDB (m=16), ChromaDB, LanceDB]
51x
vs ChromaDB
82x
vs LanceDB
44μs
Best Latency
100%
Recall

The Curse of Dimensionality

In high dimensions, random vectors are all nearly equidistant. Real embeddings have semantic structure.

Random Vectors

No structure. All points equidistant. HNSW graphs become random.

~35% Recall

Real Embeddings

Clustered by meaning. HNSW exploits local structure for speed.

100% Recall
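
The effect is easy to reproduce in NumPy with synthetic stand-ins for real embeddings; the spread of distances (std/mean) is exactly the structure HNSW needs to navigate:

import numpy as np

rng = np.random.default_rng(0)

# Random 768-dim vectors: pairwise distances concentrate around one value.
random_vecs = rng.normal(size=(1000, 768))
d = np.linalg.norm(random_vecs[0] - random_vecs[1:], axis=1)
print(d.std() / d.mean())    # tiny: every point looks equally far away

# Clustered vectors (a crude stand-in for real embeddings): structure returns.
centers = 5 * rng.normal(size=(10, 768))
clustered = centers[rng.integers(0, 10, size=1000)] + rng.normal(size=(1000, 768))
d2 = np.linalg.norm(clustered[0] - clustered[1:], axis=1)
print(d2.std() / d2.mean())  # much larger: near vs far is meaningful again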

Instant Optimization

Don't want to wait for evolution? We shipped a pre-computed grid of industry standards.

Speed

m=8, ef=50 → 75-100% recall

Balanced

m=16, ef=100 → 92-100% recall

Accuracy

m=24, ef=200 → 98-100% recall
grid.recommend(10_000, "balanced")
Now Live at emergentdb.com

Introducing Bolt

What happens when you let evolution find the fastest possible vector search?
Bolt is the answer — the elite configuration discovered by EmergentDB's QD optimization.

MAP-Elites
Explored 1000s of configs
Natural Selection
Tested on real workloads
Bolt
The evolved champion
885M
vec/sec scan rate
2.5x
faster than FAISS
28x
faster than ChromaDB
100%
exact recall

"We didn't hand-tune Bolt. We let the algorithm discover what humans struggle to optimize.
The architecture is proprietary — born from evolution, not engineering."

Bolt vs FAISS

100K vectors, Inner Product, Top-10, 100 queries • Apple Silicon

Batched Latency (ms/query) — Lower is better
Detailed Comparison
Dim     Bolt (ms)   FAISS (ms)   Speedup
768d    0.11        0.23         2.1x
1536d   0.15        0.38         2.5x
3072d   0.28        0.66         2.4x
Sequential mode: consistently 1.5x faster. Batched mode: up to 2.5x faster.
885M
vec/s peak scan
10M
vectors tested
100%
exact match

Bolt vs ChromaDB

Bolt searches 100K vectors while ChromaDB searches only 10K — and Bolt is still faster.

Batched Latency (ms/query)
Dim                      Bolt (100K)   Chroma (10K)   Speedup
768d (Gemini/Cohere)     0.11 ms       1.97 ms        17.9x
1536d (OpenAI ada-002)   0.15 ms       3.89 ms        25.9x
3072d (OpenAI large)     0.28 ms       7.7 ms         27.5x
18-28x
faster in batched mode — scanning 10x more data
Bolt: Evolved proprietary engine (exact search) • ChromaDB: Python + HNSW (default) • Inner Product, Top-10, 100 queries

What's Next

grep-code (Coming Soon)
QD-optimized code search at scale

Extends EmergentDB's evolution to discrete code-search strategies: sift through codebases with automatically selected optimal strategies.

60K
files/sec
44
precomputed elites
LoadQD: 12 file loading strategies
SearchQD: 32 search strategies
EmergentSearch: 100% optimal auto-selection
// Exact match + semantic in one DB
grep: "class AuthError" → 49K files/sec
vector: "auth failure" → 42μs/query
Python Bindings (Next)
Native Python integration via PyO3

Use EmergentDB directly in Python with zero-copy data transfer and full async support.

# Coming soon
from emergentdb import VectorDB

db = VectorDB()
db.insert(embeddings)             # embeddings: e.g. a NumPy (n, dims) array
results = db.search(query, k=10)  # top-10 nearest neighbors for a query vector
NumPy/PyTorch tensor support
Async/await compatible

"Best of both worlds: exact search + semantic search in one database."

Connect

Join the conversation and check out the code.

Special thanks to the QD pioneers: Jean-Baptiste Mouret, Jeff Clune, and Kenneth Stanley.