From the Kernel.

Technical deep-dives on GPU memory, sorting, and deterministic compute.

GPU Compute

Sweat Capital: Ripping Out TEI for a 27× Embedding Speedup with Go and Custom CUDA

Why I ripped out Hugging Face TEI and built qwen-embed-native, a Go + custom-CUDA embedding engine for Qwen3 that runs fast on sub-H100 hardware.


Model

Ingot Poured: Our MTEB Submission Model

Voxell's MTEB(eng, v2) submission: architecture, training methodology, contamination defense, and API access.


Retrieval

Why Your RAG Pipeline Doesn't Learn

Your retrieval system reruns the same expensive searches and never tracks which results helped. The retrieval N+1, and how closing the loop changes things.


Caching

The Cache Feedback Gap: Why Your Prefetcher Doesn't Learn

Your retrieval cache predicts what to preload and never learns if it was right. The open-loop problem, and what closing the feedback loop unlocks.


Determinism

Why Your Vector Search Lies

Same query, different results. The floating-point problem that quietly breaks reproducible embeddings, and what bit-exact determinism takes in production.

// The problem with parallel reduction
// a = 0.1, b = 0.2, c = 0.3

Thread A: (a + b) + c = 0.6000000000000001
Thread B: a + (b + c) = 0.6

// Same inputs. Different outputs. Nondeterminism.
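The grouping effect is easy to reproduce on a CPU. The sketch below is a minimal illustration, not the article's method: it uses 0.1, 0.2, and 0.3 as illustrative values, and the helper names `sequential_sum` and `canonical_sum` are hypothetical. It shows that regrouping changes the result, and that forcing one fixed summation order removes the dependence on arrival order.

```python
import random


def sequential_sum(values):
    """Add values left to right, the way a single thread would."""
    total = 0.0
    for v in values:
        total += v
    return total


def canonical_sum(values):
    """Sort into one fixed order before adding, so arrival order cannot matter."""
    return sequential_sum(sorted(values))


# Floating-point addition is not associative: the grouping changes the rounding.
a, b, c = 0.1, 0.2, 0.3
left_first = (a + b) + c    # 0.6000000000000001
right_first = a + (b + c)   # 0.6

# Simulate the same partial results arriving in two different orders.
rng = random.Random(0)
values = [rng.uniform(-1.0, 1.0) for _ in range(1000)]
shuffled = values[:]
rng.shuffle(shuffled)

# Arrival-order sums may differ in the last bits; canonical-order sums cannot.
print(sequential_sum(values), sequential_sum(shuffled))
same_after_canonical = canonical_sum(values) == canonical_sum(shuffled)
print(same_after_canonical)
```

Sorting is only the simplest way to pin the order, and it costs O(n log n). On a GPU the usual equivalent is a reduction tree whose shape is fixed ahead of time instead of depending on thread scheduling.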


Memory Architecture

Why Vector Search is Memory Abuse (And How to Fix It)

Your vector index burns GPU bandwidth chasing pointers through random memory, and a brute-force linear scan often beats HNSW on the GPU.

Random access pattern: cache misses, stalled cores.

Sequential access pattern: coalesced reads, saturated bus.
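The sequential pattern behind a brute-force scan can be sketched in a few lines. This is an illustration under stated assumptions, not the article's implementation: the function name `linear_scan_top_k`, the flat row-major layout, and dot-product scoring are all choices made for the example. Pure Python will not show the bandwidth effect; the point is only the access order, one front-to-back pass over a contiguous buffer with no pointers to follow.

```python
from array import array
import heapq


def linear_scan_top_k(flat, dim, query, k):
    """Score every vector in a single front-to-back pass over a flat buffer."""
    count = len(flat) // dim
    scores = []
    for i in range(count):
        base = i * dim  # vectors sit back to back, so reads are sequential
        dot = 0.0
        for j in range(dim):
            dot += flat[base + j] * query[j]
        scores.append((dot, i))
    # Keep the k highest-scoring (score, index) pairs, best first.
    return heapq.nlargest(k, scores)


# Four 2-dimensional vectors stored contiguously as float32.
vectors = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7], [-1.0, 0.0]]
flat = array("f", [x for vec in vectors for x in vec])

top = linear_scan_top_k(flat, 2, [1.0, 0.0], 2)
top_indices = [index for _, index in top]
print(top_indices)
```

A graph index such as HNSW instead hops between neighbor lists scattered across memory, which is the random pattern described above. In a GPU kernel the same flat layout would let adjacent threads read adjacent addresses, which is what allows the reads to coalesce.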


GPU Compute

Starved Cores: Listening to the Silence Between Epochs

Your H100 sits idle 40% of the time, starved by a data supply chain that can't keep up. Here is the physics, and why the real fix is architectural.


Algorithms

Sorting on Blackwell: 9× Faster Where It Actually Matters

MASH is a data-aware GPU sorting engine for NVIDIA Blackwell that beats NVIDIA's CUB DeviceRadixSort on real workloads.


See the code in action.

Get hands-on with Coherence. Benchmark it against your current infrastructure.
