Engineering Insights on Voxell

Sweat Capital: Ripping Out TEI for a 27x Embedding Speedup with Go and Custom CUDA

Sat, 30 May 2026 00:00:00 +0000

If you want to build truly resilient, high-performance infrastructure on a budget, you eventually hit a wall where “industry standard” stops meaning “best in class” and starts meaning “built for someone else’s hardware.”

When I set out to build Forge, the goal was simple: serve state-of-the-art Qwen3 embedding models (from 0.6B up to 8B parameters) fast, reliably, and without bleeding cash on GPU idleness.

The obvious first answer was Hugging Face’s TEI (Text Embeddings Inference). It’s the darling of the embedding ecosystem. It promises everything out of the box. So, I spun it up across my mix of hardware: ARM nodes, DGX GB10s, and a scrappy RTX 5080.

Ingot Poured: Our MTEB Submission Model

Sat, 23 May 2026 00:00:00 +0000

Ingot-8B-R3 is Voxell’s research text embedding model, built on top of Qwen3-Embedding-8B and submitted to the MTEB(eng, v2) leaderboard as a public API. The model is in production today at api-mteb.voxell.ai and scores Mean(41) = 75.9795 across all 41 tasks in the official MTEB English v2 benchmark.

To our knowledge, Ingot-8B-R3 is the first successful application of a mixture-of-experts architecture to text embedding as a routed, multi-specialist system: different specialists activate per input, selected at inference time from content alone. This post documents in full the architecture, the training data methodology, the contamination defenses, and how to access the public API.

The Cache Feedback Gap: Why Your Prefetcher Doesn't Learn

Fri, 02 Jan 2026 00:00:00 +0000

The Cache That Guesses and Forgets

Your retrieval pipeline caches embeddings. It has to: recomputing a vector for a document you’ve already indexed is wasteful, and at scale, it’s fatal to latency. So you cache. The cache warms up. You congratulate yourself.

Then you look more carefully. Your cache doesn’t know which of its entries were ever retrieved. It doesn’t know which cached vectors are still semantically correct versus stale from a model update. It doesn’t know which prefetched entries were evicted without being used.
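
To make those three blind spots concrete, here is a minimal sketch of a cache that records them. It is an illustration and not the design from the post: the `FeedbackCache` class, its field names, and the LRU eviction policy are all assumptions chosen for brevity.

```python
from collections import OrderedDict
from dataclasses import dataclass


@dataclass
class Entry:
    vector: list
    model_version: str  # embedding model version that produced the vector (assumed label)
    hits: int = 0       # how many times this entry was actually read


class FeedbackCache:
    """Hypothetical LRU cache that keeps the signals a plain cache throws away."""

    def __init__(self, capacity, model_version):
        self.capacity = capacity
        self.model_version = model_version
        self.entries = OrderedDict()
        self.evicted_unused = 0  # entries evicted without ever being read

    def put(self, key, vector):
        self.entries[key] = Entry(vector, self.model_version)
        self.entries.move_to_end(key)
        while len(self.entries) > self.capacity:
            _, old = self.entries.popitem(last=False)  # drop least recently used
            if old.hits == 0:
                self.evicted_unused += 1

    def get(self, key):
        entry = self.entries.get(key)
        if entry is None or entry.model_version != self.model_version:
            return None  # miss, or a vector left over from an older model
        entry.hits += 1
        self.entries.move_to_end(key)
        return entry.vector

    def stale_keys(self):
        """Keys whose vectors were produced by a different model version."""
        return [k for k, e in self.entries.items()
                if e.model_version != self.model_version]
```

With counters like these, "which entries were ever retrieved" becomes `hits > 0`, "stale from a model update" becomes `stale_keys()`, and "evicted without being used" becomes `evicted_unused`.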

Why Your RAG Pipeline Doesn't Learn

Fri, 02 Jan 2026 00:00:00 +0000

The Loop That Doesn’t Close

Every RAG system has two sides. The retrieval side finds relevant context. The generation side uses that context to produce a response. Between them, something is conspicuously missing: signal.

The LLM knows which retrieved chunks it actually used. The user knows whether the answer was good. But the retrieval system knows nothing. It made a prediction, “these five chunks are relevant,” and it never found out if it was right.
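
As a rough sketch of what closing that loop could look like, the snippet below labels each retrieved chunk by whether the generation side reported using it. The function name `retrieval_feedback` and the idea that the generator returns a list of used chunk IDs are assumptions for illustration, not something the post specifies.

```python
def retrieval_feedback(retrieved_ids, used_ids):
    """Label each retrieved chunk as used or unused and compute precision.

    retrieved_ids: chunk IDs the retriever predicted were relevant.
    used_ids: chunk IDs the generation side reports it drew on (assumed to exist).
    """
    used = set(used_ids)
    labels = {cid: cid in used for cid in retrieved_ids}
    if not retrieved_ids:
        return labels, 0.0
    precision = sum(labels.values()) / len(retrieved_ids)
    return labels, precision


# Five chunks retrieved, two of them actually used by the model.
labels, precision = retrieval_feedback(
    ["c1", "c2", "c3", "c4", "c5"], ["c2", "c5"]
)
```

The per-chunk labels are the missing signal: they can be logged next to the query and fed back into ranking or evaluation later.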

Starved Cores: Listening to the Silence Between Epochs

Thu, 01 Jan 2026 00:00:00 +0000

Why Your H100 is Idle 40% of the Time

You paid $30,000 for an NVIDIA H100. You optimized your CUDA kernels until your eyes bled. You engaged mixed-precision Tensor Cores. You are feeling pretty good about yourself.

Then you run nvidia-smi and see the utilization graph. It looks like a heartbeat monitor:

Spike (100%), Flatline (0%), Spike (100%).

That flatline is the sound of money burning. It is the silence of 14,592 CUDA cores twiddling their thumbs.
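
To put a number on the flatline, a small sketch like the one below can turn a series of utilization samples into an idle fraction. It assumes one integer percentage per line, for example as collected with `nvidia-smi --query-gpu=utilization.gpu --format=csv,noheader,nounits -l 1`; the 5% idle threshold and the function names are arbitrary choices for illustration.

```python
def parse_samples(text):
    """Parse one integer utilization percentage per non-empty line."""
    return [int(line) for line in text.splitlines() if line.strip()]


def idle_fraction(samples, idle_below=5):
    """Fraction of samples (percent, 0-100) below the assumed idle threshold."""
    if not samples:
        return 0.0
    idle = sum(1 for s in samples if s < idle_below)
    return idle / len(samples)


# Spike, flatline, spike: two idle samples out of five.
samples = parse_samples("100\n0\n100\n0\n100\n")
print(f"idle fraction: {idle_fraction(samples):.0%}")
```

One-second sampling is coarse, so short stalls between kernels can be averaged away; treat the result as a lower bound on idleness rather than a precise measurement.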

Why Vector Search is Memory Abuse (And How to Fix It)

Thu, 01 Jan 2026 00:00:00 +0000

The Hardware is Begging You to Stop

You just bought an H100. It has HBM3 memory capable of 3.35 TB/s of bandwidth. That is enough to stream the entire Library of Congress in about four seconds.

But when you run your vector search, you get a fraction of that. Maybe 5%.

Why? Because graph-based vector search is memory abuse. And the memory controller is suffering silently.

The Physics of a Cache Line

Here is the thing about DRAM (and HBM is just fancy 3D-stacked DRAM): it hates individuality.

Why Your Vector Search Lies

Thu, 01 Jan 2026 00:00:00 +0000

There is a class of bug that doesn’t look like a bug. No exception. No error log. No changed code. Just this: you run the same query against the same index twice, and you get different results. Not wrong results, just results that look plausible both times. Different ones.

You restart the service. Same thing. You roll back a deployment. Same thing. You pin every library version, freeze the model weights, feed in identical inputs byte-for-byte.

Sorting on Blackwell: 9× Faster Where It Actually Matters

Sat, 15 Nov 2025 00:00:00 +0000