Voxell - Embedding API & GPU Infrastructure for AI at Scale

Sweat Capital: Ripping Out TEI for a 27x Embedding Speedup with Go and Custom CUDA

Sat, 30 May 2026 00:00:00 +0000

If you want to build a truly resilient, high-performance infrastructure on a budget, you eventually hit a wall where “industry standard” stops meaning “best in class” and starts meaning “built for someone else’s hardware.”

When I set out to build Forge, the goal was simple: serve state-of-the-art Qwen3 embedding models (from 0.6B up to 8B parameters) fast, reliably, and without bleeding cash on GPU idleness.

The obvious first answer was Hugging Face’s TEI (Text Embeddings Inference). It’s the darling of the embedding ecosystem. It promises everything out of the box. So, I spun it up across my mix of hardware: ARM nodes, DGX GB10s, and a scrappy RTX 5080.

How We Stopped Lying to Ourselves About Quality

Thu, 28 May 2026 00:00:00 +0000

Test coverage is a proxy. It tells you what lines of code a test runner touched, not whether your product works. A payment system can report 94% coverage while silently dropping charges on every third invoice. A GPU inference API can show all green while its engine health gate is literally just running grep "/healthz" internal/engine/server.go against the source code.

That second one is us. This is the story of how we found it, why it happened, and what we replaced it with.

Ingot Poured: Our MTEB Submission Model

Sat, 23 May 2026 00:00:00 +0000

Ingot-8B-R3 is Voxell’s research text embedding model, built on top of Qwen3-Embedding-8B and submitted to the MTEB(eng, v2) leaderboard as a public API. The model is in production today at api-mteb.voxell.ai and scores Mean(41) = 75.9795 across all 41 tasks in the official MTEB English v2 benchmark.

To our knowledge, Ingot-8B-R3 is the first successful mixture-of-experts architecture applied to text embedding as a routed, multi-specialist system: different specialists activate per input, selected at inference time from content alone. This post documents in full the architecture, the training-data methodology, the contamination defenses, and how to access the public API.

Vectors Are Not Ontologies

Tue, 12 May 2026 00:00:00 +0000

A few months ago I was debugging a search-relevance pipeline backed by a state-of-the-art embedding model. The query was an unambiguous negation; the top result was an unambiguous affirmation; the cosine similarity was 0.92.

You can guess the pair. “I love it” vs “I don’t love it.” Or near enough. The model could distinguish them in some abstract sense — the vectors weren’t identical — but in the only sense that mattered to the pipeline, they were neighbors.
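
To make the gap between "not identical" and "not neighbors" concrete, here is a minimal sketch. The three-dimensional vectors are made up for illustration and are not output from any real embedding model; the `cosine` helper is likewise an assumption, not code from the pipeline described above.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy vectors: they agree on the two dominant "topic"
# dimensions and differ in sign on a small third dimension that
# stands in for polarity.
affirm = [0.9, 0.3, 0.1]
negate = [0.9, 0.3, -0.3]

score = cosine(affirm, negate)
print(f"cosine similarity: {score:.3f}")
```

The two vectors are distinguishable, yet any nearest-neighbor threshold set below the printed score would treat them as a match.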

Your Vector Database Is a Coping Mechanism

Mon, 27 Apr 2026 00:00:00 +0000

In March 2025, a team out of Stanford’s RegLab and Computer Science department published the first preregistered audit of the AI legal research tools that LexisNexis and Thomson Reuters (the company behind Westlaw) had been quietly selling into law firms (Magesh et al. 2025). The vendors had promised, in writing, “hallucination-free” output. The empirical reality was that the best of the three tools was wrong about one in six times, the worst about one in three.

CSSGuard: The Missing Half of CSS Tooling

Sat, 03 Jan 2026 00:00:00 +0000

Where Did the Breadcrumbs Go?

My wife is a UX designer. She was paging through our new site on her phone when she asked: “What happened to the breadcrumbs? They disappeared.”

I had just finished cleaning up the codebase. Removed dead code, consolidated templates, tidied the CSS. The kind of housekeeping that makes you feel productive.

But somewhere in that cleanup, I broke the breadcrumbs on mobile. No console errors. No failed builds. Our CI was green (55 checks), because this failure mode is structural—HTML still contains the class tokens, but the purge step removed the CSS definitions. The classes were present. The styles just… weren’t.

The Cache Feedback Gap: Why Your Prefetcher Doesn't Learn

Fri, 02 Jan 2026 00:00:00 +0000

The Cache That Guesses and Forgets

Your retrieval pipeline caches embeddings. It has to: recomputing a vector for a document you’ve already indexed is wasteful, and at scale, it’s fatal to latency. So you cache. The cache warms up. You congratulate yourself.

Then you look more carefully. Your cache doesn’t know which of its entries were ever retrieved. It doesn’t know which cached vectors are still semantically correct versus stale from a model update. It doesn’t know which prefetched entries were evicted without being used.
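
The three blind spots listed above can each be covered by a small amount of bookkeeping. The sketch below is one possible shape for that, under stated assumptions: `FeedbackCache`, its fields, and the string model versions are hypothetical names chosen for illustration, not an API from any particular caching library.

```python
from collections import OrderedDict
from dataclasses import dataclass


@dataclass
class Entry:
    vector: list
    model_version: str   # which model produced this vector
    prefetched: bool     # inserted speculatively rather than on demand
    hits: int = 0        # how many times a lookup returned this entry


class FeedbackCache:
    """LRU cache that records the outcomes a plain cache throws away."""

    def __init__(self, capacity, model_version):
        self.capacity = capacity
        self.model_version = model_version
        self.entries = OrderedDict()
        self.wasted_prefetches = 0  # prefetched, then evicted with zero hits

    def put(self, key, vector, prefetched=False):
        self.entries[key] = Entry(vector, self.model_version, prefetched)
        self.entries.move_to_end(key)
        while len(self.entries) > self.capacity:
            _, old = self.entries.popitem(last=False)  # least recently used
            if old.prefetched and old.hits == 0:
                self.wasted_prefetches += 1

    def get(self, key):
        entry = self.entries.get(key)
        if entry is None or entry.model_version != self.model_version:
            return None  # miss, or a stale vector from an older model
        entry.hits += 1
        self.entries.move_to_end(key)
        return entry.vector

    def never_retrieved(self):
        return [k for k, e in self.entries.items() if e.hits == 0]


cache = FeedbackCache(capacity=2, model_version="v1")
cache.put("doc-a", [0.1, 0.2], prefetched=True)
cache.put("doc-b", [0.3, 0.4])
cache.get("doc-b")
cache.put("doc-c", [0.5, 0.6])  # evicts doc-a, which was never used
print(cache.wasted_prefetches, cache.never_retrieved())
```

Bumping `cache.model_version` after a model update makes every older entry read as a miss, so stale vectors are never served silently.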

Why Your RAG Pipeline Doesn't Learn

Fri, 02 Jan 2026 00:00:00 +0000

The Loop That Doesn’t Close

Every RAG system has two sides. The retrieval side finds relevant context. The generation side uses that context to produce a response. Between them, something is conspicuously missing: signal.

The LLM knows which retrieved chunks it actually used. The user knows whether the answer was good. But the retrieval system knows nothing. It made a prediction, “these five chunks are relevant,” and it never found out if it was right.
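
As a rough illustration of what closing that loop could look like, the sketch below logs each retrieval and later records which chunks were used. It assumes the generation side can report the chunk ids it relied on (for example through citations); `RetrievalLog` and its method names are hypothetical, not part of any RAG framework.

```python
from collections import defaultdict


class RetrievalLog:
    """Records what was retrieved and, later, what was actually used."""

    def __init__(self):
        self.retrieved = {}            # query id -> chunk ids returned
        self.shown = defaultdict(int)  # chunk id -> times retrieved
        self.used = defaultdict(int)   # chunk id -> times used in an answer

    def record_retrieval(self, query_id, chunk_ids):
        self.retrieved[query_id] = list(chunk_ids)
        for cid in chunk_ids:
            self.shown[cid] += 1

    def record_usage(self, query_id, used_chunk_ids):
        # Only count chunks that really were part of this retrieval.
        returned = set(self.retrieved.get(query_id, []))
        for cid in set(used_chunk_ids) & returned:
            self.used[cid] += 1

    def usage_rate(self, chunk_id):
        shown = self.shown.get(chunk_id, 0)
        return self.used.get(chunk_id, 0) / shown if shown else 0.0


log = RetrievalLog()
log.record_retrieval("q1", ["c1", "c2", "c3", "c4", "c5"])
log.record_usage("q1", ["c2", "c4"])
log.record_retrieval("q2", ["c2", "c6"])
log.record_usage("q2", ["c6"])

for cid in ("c1", "c2", "c6"):
    print(cid, log.usage_rate(cid))
```

A per-chunk usage rate like this is the simplest signal the retrieval side could receive; user ratings could be logged the same way.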

Sorting on Blackwell: 9× Faster Where It Actually Matters

Thu, 01 Jan 2026 00:00:00 +0000

If your data is already almost sorted, why are you paying to sort it again?

For more than a decade, NVIDIA’s CUB DeviceRadixSort has been the default answer to that question: fast, battle-tested, and completely indifferent to the shape of your data. It treats a perfectly monotonic HFT order book the same way it treats cryptographic white noise. Your latency budget and your power bill pay for that indifference.
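
One simple way to put a number on the "shape" of the input is to count its maximal non-decreasing runs: one run means already sorted, and the count grows as the data gets noisier. The sketch below is only an illustration of that measure on the CPU. `count_runs` is a hypothetical helper, and nothing here is claimed to be how the sorter discussed in this post works.

```python
def count_runs(keys):
    """Number of maximal non-decreasing runs in the sequence."""
    if not keys:
        return 0
    # Every position where a key is smaller than its predecessor starts a new run.
    return 1 + sum(1 for prev, cur in zip(keys, keys[1:]) if cur < prev)


monotonic = [0, 1, 2, 3, 4, 5, 6, 7]
nearly_sorted = [0, 1, 2, 3, 5, 4, 6, 7]  # one adjacent pair swapped
shuffled = [5, 2, 7, 0, 6, 1, 4, 3]

for name, keys in [("monotonic", monotonic),
                   ("nearly sorted", nearly_sorted),
                   ("shuffled", shuffled)]:
    print(name, count_runs(keys))
```

A data-oblivious sort does the same work for all three inputs; a sort that looks at a measure like this can do less work on the first two.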

Starved Cores: Listening to the Silence Between Epochs

Thu, 01 Jan 2026 00:00:00 +0000

Why Your H100 is Idle 40% of the Time

You paid $30,000 for an NVIDIA H100. You optimized your CUDA kernels until your eyes bled. You engaged mixed-precision Tensor Cores. You are feeling pretty good about yourself.

Then you run nvidia-smi and see the utilization graph. It looks like a heartbeat monitor:

Spike (100%), Flatline (0%), Spike (100%).

That flatline is the sound of money burning. It is the silence of 14,592 CUDA cores twiddling their thumbs.

Why Vector Search is Memory Abuse (And How to Fix It)

Thu, 01 Jan 2026 00:00:00 +0000

The Hardware is Begging You to Stop

You just bought an H100. It has HBM3 memory capable of 3.35 TB/s of bandwidth. That is enough to stream the entire Library of Congress in about four seconds.

But when you run your vector search, you get a fraction of that. Maybe 5%.

Why? Because graph-based vector search is memory abuse. And the memory controller is suffering silently.
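
The arithmetic behind those figures is worth spelling out. The sketch below only multiplies the numbers quoted above (3.35 TB/s peak, about four seconds, roughly 5% achieved); the constant names are made up, and decimal units (1 TB = 1000 GB) are an assumption.

```python
PEAK_TB_PER_S = 3.35      # peak bandwidth quoted above
STREAM_SECONDS = 4        # "about four seconds"
ACHIEVED_FRACTION = 0.05  # "Maybe 5%"

# Data volume implied by streaming at peak for four seconds.
implied_corpus_tb = PEAK_TB_PER_S * STREAM_SECONDS

# Effective bandwidth when only 5% of peak is reached (decimal units assumed).
achieved_gb_per_s = PEAK_TB_PER_S * ACHIEVED_FRACTION * 1000

print(f"streamed in {STREAM_SECONDS} s at peak: {implied_corpus_tb:.1f} TB")
print(f"effective bandwidth at 5%: {achieved_gb_per_s:.1f} GB/s")
```

At 5% of peak, the same four-second transfer would take twenty times as long.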

The Physics of a Cache Line

Here is the thing about DRAM (and HBM is just fancy 3D-stacked DRAM): it hates individuality.

Why Your Vector Search Lies

Thu, 01 Jan 2026 00:00:00 +0000

There is a class of bug that doesn’t look like a bug. No exception. No error log. No changed code. Just this: you run the same query against the same index twice, and you get different results. Not wrong results, just results that look plausible both times. Different ones.

You restart the service. Same thing. You roll back a deployment. Same thing. You pin every library version, freeze the model weights, feed in identical inputs byte-for-byte.

Sorting on Blackwell: 9× Faster Where It Actually Matters

Sat, 15 Nov 2025 00:00:00 +0000