Sweat Capital: Ripping Out TEI for a 27x Embedding Speedup with Go and Custom CUDA
Why I ripped out Hugging Face TEI and built qwen-embed-native, a Go + custom-CUDA embedding engine for Qwen3 that runs fast on sub-H100 hardware.
Technical deep-dives on GPU memory, sorting, and deterministic compute.
Voxell's MTEB(eng, v2) submission: architecture, training methodology, contamination defense, and API access.
Your retrieval system reruns the same expensive searches and never tracks which results helped. The retrieval N+1, and how closing the loop changes things.
Your retrieval cache predicts what to preload and never learns if it was right. The open-loop problem, and what closing the feedback loop unlocks.
Same query, different results. The floating-point problem that quietly breaks reproducible embeddings, and what bit-exact determinism takes in production.
Your vector index burns GPU bandwidth chasing pointers through random memory, and a brute-force linear scan often beats HNSW on the GPU.
[Figure: random access pattern (cache misses, stalled cores) versus sequential access pattern (coalesced reads, saturated bus).]
Your H100 sits idle 40% of the time, starved by a data supply chain that can't keep up. Here is the physics, and why the real fix is architectural.
MASH is a data-aware GPU sorting engine for NVIDIA Blackwell that beats NVIDIA's CUB DeviceRadixSort on real workloads.
Get hands-on with Coherence. Benchmark it against your current infrastructure.
Download Developer Preview