VOXELL ORE · PC-V1

The dataset that fixes embedding failure modes.

Every embedding model on the leaderboard was trained on data in which surface similarity passes for semantic similarity. Negated claims. Role-inverted sentences. Opinion asymmetries. The benchmarks catch these cases. The training data doesn't cover them.

This corpus was built specifically to fill that gap — 1.48M QA-gated pairs targeting the exact structural failure modes identified by MTEB PairClassification forensics. It is the training signal behind our #1 MTEB ranking.

corpus — stats
1.48M verified pairs
6-gate QA pipeline
#1 MTEB ranking
Zero duplicates

Why standard training data fails

MTEB PairClassification tests whether a model can correctly judge whether two sentences mean the same thing. The failure cases are not random — they cluster around four structural patterns that standard corpora almost never include with the right label.

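Role inversion ("The dog bit the man" vs. "The man bit the dog") keeps the same words but reverses who did what, so the two sentences are lexically near-identical yet semantically opposite. Most training pairs don't capture this. In opinion asymmetry, the two sentences share an entity but take opposite polarity, a difference that cosine similarity cannot reflect unless training has explicitly pushed such pairs apart in embedding space.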

Without explicit training on these patterns, embedding models learn to treat surface similarity as semantic similarity — exactly the failure that MTEB's PairClassification tasks expose.
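
To make the failure concrete, here is a minimal sketch that uses a bag-of-words cosine as a stand-in for surface similarity. The `bow_cosine` helper is hypothetical and is not how any embedding model scores pairs; it only shows that the role-inverted example shares every token, so a purely lexical measure scores the pair as near-identical.

```python
import math
from collections import Counter


def bow_cosine(a: str, b: str) -> float:
    """Cosine similarity between lowercase bag-of-words count vectors."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(
        sum(c * c for c in vb.values())
    )
    return dot / norm if norm else 0.0


# Same tokens, opposite meaning: the lexical score is (approximately) 1.0.
score = bow_cosine("The dog bit the man", "The man bit the dog")
print(round(score, 6))
```

A pair like this needs label 0 (dissimilar) in training for a model to learn that token overlap alone does not imply shared meaning.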

failure-mode distribution
Polarity flip (SA3) 685,707 pairs
Opinion asymmetry (SA2) 567,075 pairs
Temporal / NER shift 345,734 pairs
Role inversion (SA1) 220,655 pairs
Negation XOR 155,367 pairs
Negation types (H/C/L) 230,601 pairs
Four semantic aspects. One corpus.
SA1 — Role Inversion 208,383 pairs · 14.1%
Syntactic reordering that reverses meaning
Subject and object swapped. Agent and patient inverted. The sentences share every token but carry opposite truth values. Hardest failure mode for cosine-based similarity. 35.8% clean rate after strict QA — only high-confidence role-inversion examples survive.
SA2 — Opinion Asymmetry 663,283 pairs · 44.8%
Shared entity, opposite polarity
Same named entity or concept. One sentence affirms a property; the other denies or contradicts it. The model must learn that shared entity is not sufficient for semantic similarity. Largest aspect — 81%+ QA clean rate, 7 domains (biomedical, financial, news, tech, consumer, scientific, STS).
SA3 — Polarity Flip 369,974 pairs · 25.0%
Direct affirmation vs. negation
One sentence affirms a proposition; the other negates it. Minimal lexical difference, maximal semantic difference. The most common failure mode in MTEB PairClassification and the first signal that standard contrastive training misses.
SA4 + NEG — Numeric & Negation 238,505 pairs · 16.1%
Quantity shifts and structured negation (H/C/L)
Numeric or tense change that flips truth value (SA4), plus 20 structured negation types spanning Hard (H1–H10), Compositional (C1–C7), and Lexical (L1–L3) patterns. Coverage across 7 domains ensures the model encounters negation in every register.
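
As a quick consistency check, the per-aspect counts listed above can be summed and converted to shares. This is a minimal sketch; the dictionary keys and variable names are illustrative, and the numbers are copied from the aspect listing.

```python
# Pair counts per semantic aspect, copied from the listing above.
aspect_counts = {
    "SA1": 208_383,
    "SA2": 663_283,
    "SA3": 369_974,
    "SA4+NEG": 238_505,
}

total = sum(aspect_counts.values())
shares = {k: round(100 * v / total, 1) for k, v in aspect_counts.items()}

print(total)   # 1480145
print(shares)  # {'SA1': 14.1, 'SA2': 44.8, 'SA3': 25.0, 'SA4+NEG': 16.1}

# The 20 structured negation codes: H1-H10, C1-C7, L1-L3.
neg_types = (
    [f"H{i}" for i in range(1, 11)]
    + [f"C{i}" for i in range(1, 8)]
    + [f"L{i}" for i in range(1, 4)]
)
print(len(neg_types))  # 20
```

The four aspect counts sum to the corpus total of 1,480,145 pairs.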
Six-gate quality assurance

Every pair must pass all six gates (AND logic). A single failure ejects the pair. The result is a corpus with zero exact duplicates after normalization, enforced by SHA-256 hashing, and independently validated structural quality on each dimension.

cross-encoder: DeBERTa-v3 cross-encoder margin between positive and negative must meet threshold.
nli-label: Bi-directional NLI entailment must be consistent with the assigned label (both directions).
pattern-adhere: VADER polarity + shared-entity checks verify the intended structural pattern is present.
length-sanity: Token count plausibility bounds reject degenerate pairs that are too short or too long.
semantic-dedup: Within-chunk near-dedup at cosine threshold 0.95. Corpus-wide SHA-256 dedup at packaging.
feature-carry: Verifies the seed structural feature (role-inversion, NER, tense) survives generation.
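
The AND logic can be sketched as a list of predicate functions applied to each pair. Everything here is illustrative: the `Pair` type, the `length_sanity` bounds (3 to 128 whitespace tokens), the `not_identical` stand-in gate, and the `passes_all_gates` helper are assumptions, and the model-based gates are omitted because they depend on the actual models and thresholds.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class Pair:
    text_a: str
    text_b: str
    label: int


Gate = Callable[[Pair], bool]


def length_sanity(pair: Pair, lo: int = 3, hi: int = 128) -> bool:
    """Hypothetical bounds: each side must have between lo and hi tokens."""
    return all(lo <= len(t.split()) <= hi for t in (pair.text_a, pair.text_b))


def not_identical(pair: Pair) -> bool:
    """Stand-in gate: reject pairs whose two sides are the same string."""
    return pair.text_a.strip().lower() != pair.text_b.strip().lower()


def passes_all_gates(pair: Pair, gates: Iterable[Gate]) -> bool:
    """AND logic: all() stops at the first failing gate."""
    return all(gate(pair) for gate in gates)


gates = [length_sanity, not_identical]
good = Pair("The dog bit the man", "The man bit the dog", 0)
bad = Pair("Yes", "No", 0)  # too short for the assumed length bounds
kept = [p for p in (good, bad) if passes_all_gates(p, gates)]
print(len(kept))
```

Ordering cheap gates first means expensive model-based checks only run on pairs that have already survived the simple ones.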

Generator: Gemma-3-27b-it via vLLM on dedicated H100. All pairs generated from seeded FMEA specs — not scraped, not synthetic-from-synthetic. The seed structure is verified at generation time and again at QA.

Final dedup: SHA-256 on normalized (lowercased, stripped) text_a + NUL + text_b across the merged corpus. Exact duplicates are removed by hash match, not by a similarity threshold.
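
The final dedup step can be sketched as follows. This is a minimal illustration under stated assumptions: "stripped" is taken to mean removing leading and trailing whitespace, the text is encoded as UTF-8, and the 24-character id is taken to be the first 24 hex characters of the digest. The function names are hypothetical.

```python
import hashlib


def normalize(text: str) -> str:
    """Lowercase and strip surrounding whitespace (assumed normalization)."""
    return text.lower().strip()


def pair_digest(text_a: str, text_b: str) -> str:
    """Full SHA-256 hex digest of normalized text_a + NUL + text_b."""
    payload = normalize(text_a) + "\0" + normalize(text_b)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def pair_id(text_a: str, text_b: str) -> str:
    """Assumed id format: first 24 hex characters of the digest."""
    return pair_digest(text_a, text_b)[:24]


def dedup(pairs):
    """Keep the first occurrence of each normalized (text_a, text_b) pair."""
    seen = set()
    unique = []
    for text_a, text_b in pairs:
        digest = pair_digest(text_a, text_b)
        if digest not in seen:
            seen.add(digest)
            unique.append((text_a, text_b))
    return unique


pairs = [
    ("The dog bit the man", "The man bit the dog"),
    ("  the DOG bit the man ", "The man bit the dog"),  # duplicate after normalization
    ("The man bit the dog", "The dog bit the man"),     # swapped order, distinct digest
]
unique = dedup(pairs)
print(len(unique))  # 2
```

The NUL separator keeps pairs such as ("ab", "c") and ("a", "bc") from hashing to the same value. Under this sketch, a pair with text_a and text_b swapped counts as a different pair.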

Schema
taxonomy_v2 — parquet
field              type    description
id                 string  SHA-256[:24] of normalized text_a + NUL + text_b
text_a             string  First sentence in the pair
text_b             string  Second sentence in the pair
label              int8    0 = dissimilar · 1 = similar (PairClass convention)
semantic_aspect    string  SA1 / SA2 / SA3 / SA4 or NEG_{type}
taxonomy_v2_tags   string  Comma-separated structural failure tags
dataset_source     string  Source pool identifier
generator_model    string  Model that generated this pair

Parquet files compressed with Zstandard. Drop-in compatible with HuggingFace Datasets and PyArrow.
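
A row-level check against the schema above can be sketched with the standard library alone. The `validate_row` helper, the lowercase-hex id pattern, and the loose `NEG_` prefix match are assumptions, and the sample record contains made-up values purely for illustration. The sketch also assumes JSONL records use the same field names as the Parquet schema.

```python
import json
import re

STRING_FIELDS = (
    "id", "text_a", "text_b", "semantic_aspect",
    "taxonomy_v2_tags", "dataset_source", "generator_model",
)
ASPECT_RE = re.compile(r"^(SA[1-4]|NEG_.+)$")  # NEG_ suffix format is assumed
ID_RE = re.compile(r"^[0-9a-f]{24}$")          # assumes lowercase hex ids


def validate_row(row: dict) -> list[str]:
    """Return a list of problems; an empty list means the row looks valid."""
    problems = []
    for field in STRING_FIELDS:
        if not isinstance(row.get(field), str):
            problems.append(f"{field}: expected string")
    label = row.get("label")
    if isinstance(label, bool) or label not in (0, 1):
        problems.append("label: expected 0 or 1")
    if isinstance(row.get("id"), str) and not ID_RE.match(row["id"]):
        problems.append("id: expected 24 hex characters")
    aspect = row.get("semantic_aspect")
    if isinstance(aspect, str) and not ASPECT_RE.match(aspect):
        problems.append("semantic_aspect: unexpected value")
    return problems


# Made-up record, for illustration only.
line = json.dumps({
    "id": "0123456789abcdef01234567",
    "text_a": "The dog bit the man",
    "text_b": "The man bit the dog",
    "label": 0,
    "semantic_aspect": "SA1",
    "taxonomy_v2_tags": "role_inversion",
    "dataset_source": "example_pool",
    "generator_model": "example_model",
})
row = json.loads(line)
print(validate_row(row))  # []
```

For the Parquet files, the same check could be run over rows obtained with `pyarrow.parquet.read_table(path).to_pylist()`, where `path` is whichever file from the bundle is being inspected.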

Get access
Free teaser
1,000-pair stratified sample
Stratified across the hardest failure modes, including 300 role-inversion pairs, 300 opinion-asymmetry, 200 numeric/temporal, and 150 negation. Validate quality and schema compatibility before licensing.

No email required. Direct download.
Download teaser (JSONL, 335 KB)
Commercial license
Full 1.48M-pair corpus
Complete Parquet bundle: all_pairs.parquet (128 MB), pc_pairs.parquet, neg_pairs.parquet, taxonomy metadata, and DATA_CARD.

Per-seat and enterprise license tiers available. Includes the exact training corpus used to achieve our MTEB #1 position.
Contact for license

Voxell Ore · 1,480,145 pairs · SHA-256 deduplicated · Packaged 2026-04-21
All pairs generated and quality-gated by Voxell. Not scraped. Not redistributed from existing datasets. Commercial licensing: contact us