The dataset that fixes embedding failure modes.
Every embedding model on the leaderboard was trained on data where surface similarity stands in for semantic similarity. Negated claims. Role-inverted sentences. Opinion asymmetries. The benchmarks catch it. The training data doesn't.
This corpus was built specifically to fill that gap — 1.48M QA-gated pairs targeting the exact structural failure modes identified by MTEB PairClassification forensics. It is the training signal behind our #1 MTEB ranking.
MTEB PairClassification tests whether a model can tell if two sentences mean the same thing. The failure cases are not random: they cluster around four structural patterns that standard corpora almost never include with the right label.
Role inversion ("The dog bit the man" vs. "The man bit the dog") is lexically near-identical but semantically opposite. Most training pairs don't capture this. Opinion asymmetry shares an entity but reverses polarity, a difference that cosine distance cannot pick up unless training has explicitly pushed such pairs apart in embedding space.
Without explicit training on these patterns, embedding models learn to treat surface similarity as semantic similarity — exactly the failure that MTEB's PairClassification tasks expose.
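To make the surface-similarity problem concrete, here is a minimal sketch using a plain bag-of-words cosine, not any real embedding model. The function name bow_cosine and the unrelated comparison sentence are illustrative assumptions. Because the role-inverted pair contains exactly the same words, a purely lexical measure scores it as a perfect match.

```python
import math
from collections import Counter


def bow_cosine(a: str, b: str) -> float:
    """Cosine similarity between bag-of-words count vectors of two sentences."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    # Dot product over the words of the first sentence (missing words count as 0).
    dot = sum(ca[w] * cb[w] for w in ca)
    # Product of the squared vector norms; take one square root at the end.
    norm_sq = sum(v * v for v in ca.values()) * sum(v * v for v in cb.values())
    return dot / math.sqrt(norm_sq) if norm_sq else 0.0


# Same words, opposite meaning: word order is invisible to a bag-of-words view.
role_inverted = bow_cosine("The dog bit the man", "The man bit the dog")
# No shared words at all (illustrative sentence, not from the corpus).
unrelated = bow_cosine("The dog bit the man", "Rates rose again today")
print(role_inverted, unrelated)
```

A trained embedding model is far richer than this, but the sketch shows why a pair like this needs an explicit "different meaning" label to be separated during training.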
Every pair must pass all six gates (AND logic). A single failure ejects the pair. The result is a corpus with zero exact duplicates after normalization, enforced by SHA-256 hashing, and independently validated structural quality on each dimension.
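The AND logic can be sketched as a list of predicates that a pair must satisfy in full. The six actual gates are not listed here, so the three checks below (non_empty, not_identical, valid_label), the Pair type, and its label field are hypothetical stand-ins that only illustrate the control flow.

```python
from typing import Callable, List, NamedTuple


class Pair(NamedTuple):
    text_a: str
    text_b: str
    label: int  # hypothetical field: 1 = same meaning, 0 = different


Gate = Callable[[Pair], bool]


# Hypothetical stand-in gates; the real six gates are not specified in this page.
def non_empty(p: Pair) -> bool:
    return bool(p.text_a.strip()) and bool(p.text_b.strip())


def not_identical(p: Pair) -> bool:
    return p.text_a.strip().lower() != p.text_b.strip().lower()


def valid_label(p: Pair) -> bool:
    return p.label in (0, 1)


GATES: List[Gate] = [non_empty, not_identical, valid_label]


def passes_all(p: Pair, gates: List[Gate] = GATES) -> bool:
    # all() stops at the first failing gate, so one failure ejects the pair.
    return all(gate(p) for gate in gates)


def filter_pairs(pairs: List[Pair]) -> List[Pair]:
    return [p for p in pairs if passes_all(p)]
```

Keeping each gate as an independent predicate means a rejected pair can be traced to the specific check it failed.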
Generator: Gemma-3-27b-it via vLLM on a dedicated H100. All pairs are generated from seeded FMEA specs: not scraped, not synthetic-from-synthetic. The seed structure is verified at generation time and again at QA.
Final dedup: SHA-256 on normalized (lowercased, stripped) text_a + NUL + text_b across the merged corpus. Zero-duplicate guarantee is cryptographic, not probabilistic.
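The dedup key described above can be sketched as follows. This is an illustration rather than the production pipeline: the function names, the UTF-8 encoding, and reading "stripped" as trimming leading and trailing whitespace are assumptions.

```python
import hashlib
from typing import Iterable, List, Tuple


def pair_key(text_a: str, text_b: str) -> str:
    """SHA-256 hex digest of normalized text_a + NUL + text_b."""
    def norm(s: str) -> str:
        # Assumed normalization: trim surrounding whitespace, then lowercase.
        return s.strip().lower()

    # The NUL separator keeps ("ab", "c") distinct from ("a", "bc").
    payload = norm(text_a) + "\0" + norm(text_b)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def dedup(pairs: Iterable[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Keep the first occurrence of each normalized pair, preserving order."""
    seen = set()
    unique = []
    for a, b in pairs:
        key = pair_key(a, b)
        if key not in seen:
            seen.add(key)
            unique.append((a, b))
    return unique
```

Note that this key is order-sensitive: (text_a, text_b) and (text_b, text_a) hash differently, which matches the concatenation order given above.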
Parquet files compressed with Zstandard. Drop-in compatible with HuggingFace Datasets and PyArrow.
No email required. Direct download.
all_pairs.parquet (128 MB), pc_pairs.parquet, neg_pairs.parquet, taxonomy metadata, and DATA_CARD.
Per-seat and enterprise license tiers available. Includes the exact training corpus used to achieve our MTEB #1 position.
Voxell Ore · 1,480,145 pairs · SHA-256 deduplicated · Packaged 2026-04-21
All pairs generated and quality-gated by Voxell. Not scraped. Not redistributed from existing datasets.
Commercial licensing: contact us