<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>Engineering Insights on Voxell</title>
    <link>https://voxell.ai/engineering/</link>
    <description>Recent content in Engineering Insights on Voxell</description>
    <generator>Hugo</generator>
    <language>en-us</language>
    <lastBuildDate>Sat, 30 May 2026 00:00:00 +0000</lastBuildDate>
    <atom:link href="https://voxell.ai/engineering/index.xml" rel="self" type="application/rss+xml" />
    <item>
      <title>Sweat Capital: Ripping Out TEI for a 27x Embedding Speedup with Go and Custom CUDA</title>
      <link>https://voxell.ai/engineering/qwen-native-cuda-engine/</link>
      <pubDate>Sat, 30 May 2026 00:00:00 +0000</pubDate>
      <guid>https://voxell.ai/engineering/qwen-native-cuda-engine/</guid>
      <description>&lt;p&gt;If you want to build truly resilient, high-performance infrastructure on a budget, you eventually hit a wall where &amp;ldquo;industry standard&amp;rdquo; stops meaning &amp;ldquo;best in class&amp;rdquo; and starts meaning &amp;ldquo;built for someone else&amp;rsquo;s hardware.&amp;rdquo;&lt;/p&gt;&#xA;&lt;p&gt;When I set out to build Forge, the goal was simple: serve state-of-the-art Qwen3 embedding models (from 0.6B up to 8B parameters) fast, reliably, and without bleeding cash on GPU idleness.&lt;/p&gt;&#xA;&lt;p&gt;The obvious first answer was Hugging Face&amp;rsquo;s TEI (Text Embeddings Inference). It&amp;rsquo;s the darling of the embedding ecosystem. It promises everything out of the box. So, I spun it up across my mix of hardware: ARM nodes, DGX GB10s, and a scrappy RTX 5080.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Ingot Poured: Our MTEB Submission Model</title>
      <link>https://voxell.ai/engineering/ingot_poured/</link>
      <pubDate>Sat, 23 May 2026 00:00:00 +0000</pubDate>
      <guid>https://voxell.ai/engineering/ingot_poured/</guid>
      <description>&lt;p&gt;Ingot-8B-R3 is Voxell&amp;rsquo;s research text embedding model, built on top of Qwen3-Embedding-8B and submitted to the MTEB(eng, v2) leaderboard as a public API. The model is in production today at &lt;code&gt;api-mteb.voxell.ai&lt;/code&gt; and scores Mean(41) = 75.9795 across all 41 tasks in the official MTEB English v2 benchmark.&lt;/p&gt;&#xA;&lt;p&gt;To our knowledge, Ingot-8B-R3 is the first successful mixture-of-experts architecture applied to text embedding as a routed, multi-specialist system: different specialists activate per input, selected at inference time from content alone. This post documents in full the architecture, the training data methodology, the contamination defenses, and how to access the public API.&lt;/p&gt;</description>
    </item>
    <item>
      <title>The Cache Feedback Gap: Why Your Prefetcher Doesn&#39;t Learn</title>
      <link>https://voxell.ai/engineering/cache-feedback-gap/</link>
      <pubDate>Fri, 02 Jan 2026 00:00:00 +0000</pubDate>
      <guid>https://voxell.ai/engineering/cache-feedback-gap/</guid>
      <description>&lt;h2 id=&#34;the-cache-that-guesses-and-forgets&#34;&gt;The Cache That Guesses and Forgets&lt;/h2&gt;&#xA;&lt;p&gt;Your retrieval pipeline caches embeddings. It has to: recomputing a vector for a document you&amp;rsquo;ve already indexed is wasteful, and at scale, it&amp;rsquo;s fatal to latency. So you cache. The cache warms up. You congratulate yourself.&lt;/p&gt;&#xA;&lt;p&gt;Then you look more carefully. Your cache doesn&amp;rsquo;t know which of its entries were ever retrieved. It doesn&amp;rsquo;t know which cached vectors are still semantically correct versus stale from a model update. It doesn&amp;rsquo;t know which prefetched entries were evicted without being used.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Why Your RAG Pipeline Doesn&#39;t Learn</title>
      <link>https://voxell.ai/engineering/rag-pipeline-feedback/</link>
      <pubDate>Fri, 02 Jan 2026 00:00:00 +0000</pubDate>
      <guid>https://voxell.ai/engineering/rag-pipeline-feedback/</guid>
      <description>&lt;h2 id=&#34;the-loop-that-doesnt-close&#34;&gt;The Loop That Doesn&amp;rsquo;t Close&lt;/h2&gt;&#xA;&lt;p&gt;Every RAG system has two sides. The retrieval side finds relevant context. The generation side uses that context to produce a response. Between them, something is conspicuously missing: signal.&lt;/p&gt;&#xA;&lt;p&gt;The LLM knows which retrieved chunks it actually used. The user knows whether the answer was good. But the retrieval system knows nothing. It made a prediction, &amp;ldquo;these five chunks are relevant,&amp;rdquo; and it never found out if it was right.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Starved Cores: Listening to the Silence Between Epochs</title>
      <link>https://voxell.ai/engineering/starved-cores/</link>
      <pubDate>Thu, 01 Jan 2026 00:00:00 +0000</pubDate>
      <guid>https://voxell.ai/engineering/starved-cores/</guid>
      <description>&lt;h2 id=&#34;why-your-h100-is-idle-40-of-the-time&#34;&gt;Why Your H100 is Idle 40% of the Time&lt;/h2&gt;&#xA;&lt;p&gt;You paid $30,000 for an NVIDIA H100. You optimized your CUDA kernels until your eyes bled. You engaged mixed-precision Tensor Cores. You are feeling pretty good about yourself.&lt;/p&gt;&#xA;&lt;p&gt;Then you run &lt;code&gt;nvidia-smi&lt;/code&gt; and see the utilization graph. It looks like a heartbeat monitor:&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Spike (100%), Flatline (0%), Spike (100%).&lt;/strong&gt;&lt;/p&gt;&#xA;&lt;p&gt;That flatline is the sound of money burning. It is the silence of 14,592 CUDA cores twiddling their thumbs.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Why Vector Search is Memory Abuse (And How to Fix It)</title>
      <link>https://voxell.ai/engineering/graph-traversal-anti-pattern/</link>
      <pubDate>Thu, 01 Jan 2026 00:00:00 +0000</pubDate>
      <guid>https://voxell.ai/engineering/graph-traversal-anti-pattern/</guid>
      <description>&lt;h2 id=&#34;the-hardware-is-begging-you-to-stop&#34;&gt;The Hardware is Begging You to Stop&lt;/h2&gt;&#xA;&lt;p&gt;You just bought an H100. It has HBM3 memory capable of 3.35 TB/s of bandwidth. That is enough to stream the entire Library of Congress in about four seconds.&lt;/p&gt;&#xA;&lt;p&gt;But when you run your vector search, you get a fraction of that. Maybe 5%.&lt;/p&gt;&#xA;&lt;p&gt;Why? Because graph-based vector search is memory abuse. And the memory controller is suffering silently.&lt;/p&gt;&#xA;&lt;h2 id=&#34;the-physics-of-a-cache-line&#34;&gt;The Physics of a Cache Line&lt;/h2&gt;&#xA;&lt;p&gt;Here is the thing about DRAM (and HBM is just fancy 3D-stacked DRAM): it hates individuality.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Why Your Vector Search Lies</title>
      <link>https://voxell.ai/engineering/deterministic-embeddings/</link>
      <pubDate>Thu, 01 Jan 2026 00:00:00 +0000</pubDate>
      <guid>https://voxell.ai/engineering/deterministic-embeddings/</guid>
      <description>&lt;p&gt;There is a class of bug that doesn&amp;rsquo;t look like a bug. No exception. No error log. No changed code. Just this: you run the same query against the same index twice, and you get different results. Not wrong results, just results that look plausible both times. Different ones.&lt;/p&gt;&#xA;&lt;p&gt;You restart the service. Same thing. You roll back a deployment. Same thing. You pin every library version, freeze the model weights, feed in identical inputs byte-for-byte.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Sorting on Blackwell: 9× Faster Where It Actually Matters</title>
      <link>https://voxell.ai/engineering/sorting-on-blackwell/</link>
      <pubDate>Sat, 15 Nov 2025 00:00:00 +0000</pubDate>
      <guid>https://voxell.ai/engineering/sorting-on-blackwell/</guid>
      <description></description>
    </item>
  </channel>
</rss>
