VOXELL ORE · HARD NEGATIVES

Your retriever underperforms on your domain — and you have no labeled data to fix it.

You see it in the recall metrics and in the user complaints. Off-the-shelf hard negatives are random BM25 noise — they don't reflect how your users actually search or how your documents actually fail. And building a datagen pipeline is three months you don't have, on a team without a PhD to spare.

Ore is the answer: MTEB-proven hard negatives with named failure modes, designed to drop straight into a HuggingFace training script. It is the exact training signal behind our #1 MTEB ranking — the proof it works.


corpus, stats

1.48M verified pairs

6-gate QA pipeline

#1 MTEB ranking

Zero duplicates

Use Ore when…

If any of these sound like your week, the corpus below is built for you. These are the problems Ore exists to solve — in the words of the engineers who reach out.

No labeled data

“My retriever underperforms on my domain and I have no labeled data to fix it.”

You can measure the gap — recall numbers, user complaints — but you can't close it without training signal. Ore gives you hard negatives that target the exact structural failures, no in-house labeling project required.

BM25 noise isn't enough

“I need hard negatives that reflect how my users actually search and how my documents actually fail.”

Random BM25 negatives teach surface lexical contrast, not the role inversions, polarity flips, and opinion asymmetries that break real retrieval. Ore's negatives carry named failure modes, not noise.

No three months, no PhD

“I don't have three months to build a datagen pipeline, or a PhD on the team.”

Ore is the pipeline output, not the pipeline. Drop-in Parquet, compatible with HuggingFace Datasets and PyArrow — the same corpus that took us to #1 on MTEB, ready for your training script.

You need proof it works

“Before I bet a release on it, show me the data is real.”

Fair. Below is the named failure-mode taxonomy, the six-gate QA pipeline, and the pair schema — the same signal behind our #1 MTEB ranking. When it fits, reach out and we'll talk through your corpus.

Why standard training data fails

MTEB PairClassification tests whether a model can tell if two sentences mean the same thing. The failure cases are not random: they cluster around four structural patterns that standard corpora almost never include with the right label.

Role inversion ("The dog bit the man" vs. "The man bit the dog") uses the same words but reverses who did what to whom, and most training pairs don't capture it. Opinion asymmetry shares an entity but reverses polarity, which cosine similarity cannot separate unless the model was trained on pairs that push such sentences apart.

Without explicit training on these patterns, embedding models learn to treat surface similarity as semantic similarity, which is exactly the failure that MTEB's PairClassification tasks expose.
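A minimal sketch of the surface-similarity trap, using the role-inversion sentences above. Bag-of-words cosine is only a stand-in for lexical overlap here, not an embedding model, and the helper name `bow_cosine` and the unrelated comparison sentence are illustrative assumptions.

```python
from collections import Counter
import math


def bow_cosine(a: str, b: str) -> float:
    """Cosine similarity between lowercase bag-of-words count vectors."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)  # Counter returns 0 for missing tokens
    norm_sq = sum(c * c for c in va.values()) * sum(c * c for c in vb.values())
    return dot / math.sqrt(norm_sq) if norm_sq else 0.0


inverted = bow_cosine("The dog bit the man", "The man bit the dog")
unrelated = bow_cosine("The dog bit the man", "Rates rose again in March")
print(f"role inversion: {inverted:.2f}, unrelated: {unrelated:.2f}")
# prints "role inversion: 1.00, unrelated: 0.00"
```

Any signal built purely on token overlap scores the inverted pair as identical, which is why such pairs need an explicit "dissimilar" label in training data.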

failure-mode distribution

Polarity flip (SA3) 685,707 pairs

Opinion asymmetry (SA2) 567,075 pairs

Temporal / NER shift 345,734 pairs

Role inversion (SA1) 220,655 pairs

Negation XOR 155,367 pairs

Negation types (H/C/L) 230,601 pairs

A pair can exhibit more than one failure mode, so these counts overlap and sum to more than the 1.48M total.
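The overlap can be checked with simple arithmetic on the counts listed above. The sketch below also shows one way to tally a comma-separated tag column; the tag names in the sample rows are hypothetical, since the real tag vocabulary is not listed on this page.

```python
from collections import Counter

TOTAL_PAIRS = 1_480_145  # corpus size stated in the page footer

# Counts copied from the failure-mode distribution above.
listed_counts = {
    "Polarity flip (SA3)": 685_707,
    "Opinion asymmetry (SA2)": 567_075,
    "Temporal / NER shift": 345_734,
    "Role inversion (SA1)": 220_655,
    "Negation XOR": 155_367,
    "Negation types (H/C/L)": 230_601,
}
tag_assignments = sum(listed_counts.values())
excess = tag_assignments - TOTAL_PAIRS  # assignments beyond one per pair
print(tag_assignments, excess)  # 2205139 724994


def count_tags(tag_strings):
    """Tally every tag in a column of comma-separated tag strings."""
    counts = Counter()
    for tags in tag_strings:
        counts.update(t.strip() for t in tags.split(",") if t.strip())
    return counts


# Hypothetical tag values, for illustration only.
sample_rows = ["polarity_flip,negation_xor", "role_inversion", "polarity_flip"]
sample_counts = count_tags(sample_rows)
print(sum(sample_counts.values()), len(sample_rows))  # 4 3
```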

Four semantic aspects. One corpus.

SA1, Role Inversion 208,383 pairs · 14.1%

Syntactic reordering that reverses meaning

Subject and object swapped. Agent and patient inverted. The sentences share every token but carry opposite truth values. Hardest failure mode for cosine-based similarity. 35.8% clean rate after strict QA: only high-confidence role-inversion examples survive.

SA2, Opinion Asymmetry 663,283 pairs · 44.8%

Shared entity, opposite polarity

Same named entity or concept. One sentence affirms a property; the other denies or contradicts it. The model must learn that shared entity is not sufficient for semantic similarity. Largest aspect, 81%+ QA clean rate, 7 domains (biomedical, financial, news, tech, consumer, scientific, STS).

SA3, Polarity Flip 369,974 pairs · 25.0%

Direct affirmation vs. negation

One sentence affirms a proposition; the other negates it. Minimal lexical difference, maximal semantic difference. The most common failure mode in MTEB PairClassification and the first signal that standard contrastive training misses.

SA4 + NEG, Numeric & Negation 238,505 pairs · 16.1%

Quantity shifts and structured negation (H/C/L)

Numeric or tense change that flips truth value (SA4), plus 20 structured negation types spanning Hard (H1–H10), Compositional (C1–C7), and Lexical (L1–L3) patterns. Coverage across 7 domains ensures the model encounters negation in every register.
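Unlike the overlapping tag counts, the four aspect counts partition the corpus. A quick check, using only the numbers quoted in the four aspect headings above:

```python
# Pair counts copied from the four aspect headings.
aspect_counts = {
    "SA1": 208_383,
    "SA2": 663_283,
    "SA3": 369_974,
    "SA4+NEG": 238_505,
}

total = sum(aspect_counts.values())
shares = {name: round(100 * n / total, 1) for name, n in aspect_counts.items()}

print(total)   # 1480145
print(shares)  # {'SA1': 14.1, 'SA2': 44.8, 'SA3': 25.0, 'SA4+NEG': 16.1}
```

The total matches the 1,480,145 pairs in the footer, so each pair appears to carry exactly one semantic aspect even when it carries several failure-mode tags.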

Six-gate quality assurance

Every pair passes all six gates simultaneously (AND logic). A single failure ejects the pair. The result is a corpus with zero exact duplicates, enforced by SHA-256 hashing, and independently validated structural quality on each dimension.

cross-encoder

DeBERTa-v3 cross-encoder margin between positive and negative must meet threshold.

nli-label

Bi-directional NLI entailment consistency with the assigned label (both directions).

pattern-adhere

VADER polarity + shared-entity checks verify the intended structural pattern is present.

length-sanity

Token-count plausibility bounds reject degenerate pairs at both extremes, too short and too long.

semantic-dedup

Within-chunk near-dedup at cosine threshold 0.95. Corpus-wide SHA-256 dedup at packaging.

feature-carry

Verifies the seed structural feature (role-inversion, NER, tense) survives generation.
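The AND logic across the six gates can be sketched as below. This is an illustration, not Voxell's pipeline: only length-sanity is implemented, with made-up bounds and whitespace tokenization, and the five model-backed gates are replaced by a hypothetical `placeholder_gate` that always passes.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class Pair:
    text_a: str
    text_b: str
    label: int  # 0 = dissimilar, 1 = similar


Gate = Callable[[Pair], bool]


def length_sanity(pair: Pair, min_tokens: int = 3, max_tokens: int = 128) -> bool:
    """Reject pairs with a side that is too short or too long.

    The bounds and the whitespace tokenization are illustrative assumptions.
    """
    return all(
        min_tokens <= len(text.split()) <= max_tokens
        for text in (pair.text_a, pair.text_b)
    )


def placeholder_gate(pair: Pair) -> bool:
    """Stand-in for a model-backed gate; always passes in this sketch."""
    return True


GATES: Dict[str, Gate] = {
    "cross-encoder": placeholder_gate,
    "nli-label": placeholder_gate,
    "pattern-adhere": placeholder_gate,
    "length-sanity": length_sanity,
    "semantic-dedup": placeholder_gate,
    "feature-carry": placeholder_gate,
}


def run_gates(
    pairs: Iterable[Pair], gates: Dict[str, Gate]
) -> Tuple[List[Pair], Dict[str, int]]:
    """Keep a pair only if every gate passes; tally the first gate that fails."""
    kept: List[Pair] = []
    rejected_by = {name: 0 for name in gates}
    for pair in pairs:
        failed = next((name for name, gate in gates.items() if not gate(pair)), None)
        if failed is None:
            kept.append(pair)
        else:
            rejected_by[failed] += 1
    return kept, rejected_by


candidates = [
    Pair("The dog bit the man", "The man bit the dog", 0),
    Pair("Yes", "No", 0),  # one token per side, fails length-sanity
]
kept, rejected_by = run_gates(candidates, GATES)
print(len(kept), rejected_by["length-sanity"])  # 1 1
```

Recording which gate ejected a pair is a design choice of this sketch; it makes per-gate rejection rates easy to report.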

Generator: Gemma-3-27b-it via vLLM on dedicated H100. All pairs generated from seeded FMEA specs, not scraped, not synthetic-from-synthetic. The seed structure is verified at generation time and again at QA.

Final dedup: SHA-256 on normalized (lowercased, stripped) text_a + NUL + text_b across the merged corpus. The zero-duplicate guarantee covers exact matches after normalization and rests on SHA-256 hashing, not on a similarity threshold.
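The two dedup stages described above can be sketched as follows. The function names, UTF-8 encoding, hex digest, the `>=` comparison at the 0.95 threshold, and the greedy keep-first strategy are all assumptions made for illustration; the page specifies only the normalization, the NUL separator, and the threshold.

```python
import hashlib
import math
from typing import List, Sequence, Tuple


def normalize(text: str) -> str:
    """Lowercase and strip, as described for the final dedup."""
    return text.lower().strip()


def pair_key(text_a: str, text_b: str) -> str:
    """SHA-256 hex digest of normalized text_a + NUL + text_b."""
    payload = normalize(text_a) + "\0" + normalize(text_b)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def exact_dedup(pairs: Sequence[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Corpus-wide exact dedup: keep the first pair seen for each key."""
    seen, kept = set(), []
    for text_a, text_b in pairs:
        key = pair_key(text_a, text_b)
        if key not in seen:
            seen.add(key)
            kept.append((text_a, text_b))
    return kept


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u) * sum(y * y for y in v))
    return dot / norm if norm else 0.0


def near_dedup(vectors: Sequence[Sequence[float]], threshold: float = 0.95) -> List[int]:
    """Within-chunk near-dedup: indices whose cosine to every kept vector is below threshold."""
    kept: List[int] = []
    for i, vec in enumerate(vectors):
        if all(cosine(vec, vectors[j]) < threshold for j in kept):
            kept.append(i)
    return kept


unique = exact_dedup([("A dog", "a cat"), (" a dog ", "A CAT"), ("a dog", "a bird")])
print(len(unique))  # 2
print(near_dedup([[1.0, 0.0], [0.999, 0.01], [0.0, 1.0]]))  # [0, 2]
```

The NUL separator keeps the key unambiguous: without it, ("ab", "c") and ("a", "bc") would hash the same string. If the schema's id field uses the hex digest, `pair_key(a, b)[:24]` would reproduce it, though that encoding is an assumption.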

Schema

taxonomy_v2, parquet

Field              Type    Description
id                 string  SHA-256[:24] of normalized text_a + NUL + text_b
text_a             string  First sentence in the pair
text_b             string  Second sentence in the pair
label              int8    0 = dissimilar · 1 = similar (PairClass convention)
semantic_aspect    string  SA1 / SA2 / SA3 / SA4 or NEG_{type}
taxonomy_v2_tags   string  Comma-separated structural failure tags
dataset_source     string  Source pool identifier
generator_model    string  Model that generated this pair

Parquet files compressed with Zstandard. Drop-in compatible with HuggingFace Datasets and PyArrow.

Talk to us about your corpus

Ore isn't a self-serve download. Every engagement starts with a conversation about your domain, your failure modes, and what your retriever is getting wrong — so the negatives we hand you map to how your users actually search. Tell us what you're building; we reply within 24 hours.

Reach out

Bring us your corpus and your failure modes

Tell us your domain, the queries that miss, and the recall numbers you're trying to move. We'll talk through which named failure modes apply and how the corpus that took us to #1 on MTEB fits your training stack.


Prefer email?

Write to the team directly

A two-line description of your retrieval problem is enough to start. NDA fine; no IP needed to have the first conversation.

[email protected]

Voxell Ore · 1,480,145 pairs · SHA-256 deduplicated · Packaged 2026-04-21
All pairs generated and quality-gated by Voxell. Not scraped. Not redistributed from existing datasets.