Why Vector Search is Memory Abuse (And How to Fix It)

The Hardware is Begging You to Stop

You just bought an H100. It has HBM3e memory capable of 3.35 TB/s of bandwidth. That is enough to stream the entire Library of Congress in about four seconds.

But when you run your vector search, you get a fraction of that. Maybe 5%.

Why? Because graph-based vector search is memory abuse. And the memory controller is suffering silently.

The Physics of a Cache Line

Here is the thing about DRAM (and HBM is just fancy 3D-stacked DRAM): it hates individuality.

When you ask for a single byte, the memory controller doesn’t just fetch that byte. It fetches a cache line. On NVIDIA GPUs, that is typically 128 bytes.

If you read a float (4 bytes) and then jump to a random location to read another float, you are throwing away 124 bytes of bandwidth for every 4 bytes you use.

Effective bandwidth = 4 / 128 = 3.125%

You are paying for a firehose and using it to fill a thimble.

The Graph Traversal Problem

Graph algorithms are defined by pointer chasing.

struct Node {
    float value;
    Node* next;  // could be anywhere in memory
};

while (node) {
    process(node->value);
    node = node->next;  // jump to random memory address
}

In a graph index, node->next could be anywhere in the address space. It is almost certainly not in the same cache line. It is almost certainly not in the same DRAM page.

Every time you follow a pointer, the memory controller has to:

Close the current DRAM row (Precharge)
Open a new DRAM row (Activate)
Read the data (CAS)

This takes time. Nanoseconds add up. And while the memory controller is doing this administrative paperwork, your CUDA cores are starving.

The Metaphor: The Library Ladder

Imagine a library with huge bookshelves. To get a book, you have to climb a rolling ladder.

Linear scan (array): You climb the ladder once and grab every book on the shelf, left to right. You are efficient. You are a machine.

Graph traversal (pointer chasing): You climb the ladder, grab one book. Then you climb down, move the ladder 20 feet to the left, climb up, grab another book. Then you move the ladder 50 feet to the right.

You are spending 99% of your time moving the ladder and 1% of your time reading.

The Solution: Sort It Out

If you want to go fast, you must respect the hardware. The hardware wants coalesced memory access: adjacent threads reading adjacent memory addresses.

How do you turn a graph problem into a linear problem? Sorting.

Instead of chasing pointers individually:

Gather all the addresses you want to visit
Sort them by memory address
Visit them in linear order

// Gather
int requests[N] = { ... random indices ... };

// Sort, align with memory layout
sort(requests, requests + N);

// Read linearly, memory controller stays open
for (int i = 0; i < N; i++) {
    data[i] = big_array[requests[i]];
}

This sounds counter-intuitive. “Wait, you want me to add a sorting step? Isn’t that slow?”

Yes, sorting costs instructions. But on a GPU, compute is cheap. Bandwidth is expensive.

Spending 1ms to sort requests can save you 10ms of memory latency. The minivan on the open highway beats the Ferrari in traffic. The minivan wins because the Ferrari is stuck in first gear.

Why HNSW is an Anti-Pattern on GPU

This matters for vector search because the dominant index format, Hierarchical Navigable Small World (HNSW), is a graph algorithm.

HNSW is the gold standard for vector search on CPUs. On a CPU with large caches and sophisticated hardware prefetchers, graph traversal works well. The CPU’s prefetcher looks ahead in the graph and speculatively loads the next set of nodes before they’re needed.

On a GPU, there is no such prefetcher for random access patterns. The GPU is built for throughput on regular access patterns, not latency hiding on irregular ones.

This is why brute-force linear scan (or inverted file indices like IVF) often beats HNSW on GPU. A linear scan reads memory sequentially. It saturates bandwidth. Even if it does 10× more floating-point work than HNSW, it can finish faster, because it is not waiting for the memory controller to move the ladder.

The Big-O analysis tells you HNSW is asymptotically faster. The memory subsystem doesn’t care about asymptotes. It cares about cache line utilization.

The Fix: Topology-Aware Sorting

This is exactly the pathology that MASH was built to address. MASH sorts vector indices by their topology (by the structure of their nearest-neighbor relationships) before any query runs. Instead of following pointer chains at query time, the index layout reflects the data’s actual geometry. Related vectors are adjacent in memory. Adjacent threads read adjacent addresses. The memory controller stays open instead of bouncing between DRAM rows.

The result is visceral: on a 1B-vector index, MASH query time drops from 992ms to 109ms on presorted data. The improvement isn’t from a faster algorithm. It’s from respecting the memory bus.

Coherence uses the same principle at the semantic memory layer. Every retrieval operation in Coherence is designed around sequential reads. The similarity computation doesn’t chase pointers. It reads contiguous memory and lets the GPU’s hardware prefetcher do what it was engineered for.

The insight generalizes: any system that treats GPU memory as a random-access store is leaving bandwidth on the floor. Your ANN index, your embedding lookup table, your KV cache: all of them benefit from topology-aware layout. The GPU is a sequential-reading machine that happens to be very parallel. Design your data for it.