Independent Annotation Benchmarks
Transparent accuracy, agreement, and throughput benchmarks across annotation tasks, data types, and industry verticals.
How Annotation Accuracy Is Measured
Every benchmark we publish is derived from two core metrics — Precision and Recall — computed against a verified gold-standard reference set. Here is exactly how they work.
Confusion Matrix
Derived Metrics
Precision
Correctness
"Of every annotation our annotators produced, what fraction was actually correct?" High precision means few false alarms.
Recall
Coverage
"Of every annotation that existed in the gold standard, what fraction did our annotators find?" High recall means few missed labels.
F1 Score
Balance
The harmonic mean of Precision and Recall: F1 = 2 × Precision × Recall / (Precision + Recall). A single number that penalises extreme imbalance: 100% recall with 1% precision scores an F1 of just 2%.
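A minimal Python sketch of the three metrics, computed from confusion-matrix counts. The function name and the example counts are illustrative assumptions, not part of our scoring pipeline; the counts are chosen to reproduce the 100% recall, 1% precision case.

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Return (precision, recall, F1) from true-positive, false-positive
    and false-negative counts. Undefined ratios are reported as 0.0."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical batch: all 100 gold labels were found (recall 100%),
# but 9,900 spurious annotations were also produced (precision 1%).
p, r, f1 = precision_recall_f1(tp=100, fp=9900, fn=0)
print(f"precision={p:.2%} recall={r:.2%} f1={f1:.2%}")
# precision=1.00% recall=100.00% f1=1.98%
```

Because the harmonic mean is dominated by the smaller of its two inputs, F1 stays close to the 1% precision figure rather than averaging out to around 50%.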
How Each Task Type Is Scored
Bounding Box
IoU ≥ 0.5 threshold
Polygon / Segmentation
Pixel-level mask IoU ≥ 0.5
NER (Named Entity Recognition)
Exact span + entity type match
Sentiment
Exact class label match
What Are Gold Standard References?
A gold standard is a verified, high-confidence annotation set used as the ground truth when computing accuracy metrics. Every precision and recall figure we publish is measured against one of two reference types.
Academic Calibration Sets
External · Publicly verifiable
Published benchmark datasets with peer-reviewed annotations, used to calibrate our annotator pool and validate that our scoring methodology is consistent with industry standards.
CoreLabel Internal Gold Sets
Project-specific · Client-verified
For client projects, gold standards are built from scratch using a three-stage adjudication process, then reviewed and signed off by the client's domain expert before any production annotation begins.
Why the reference set quality matters as much as the metric
A precision figure is only meaningful if the gold standard it was measured against is itself correct. An 80% precision score against a rigorously adjudicated reference could represent genuinely better annotation than a 95% score against a carelessly constructed one. This is why we invest heavily in gold-set construction, and why we publish the source and methodology alongside every number.
Inter-annotator Agreement (IAA)
Accuracy tells you how close annotators are to the gold standard. IAA tells you how consistently they agree with each other. Both are required to trust a dataset: high accuracy against a noisy reference and high agreement around a systematic error are equally dangerous.
Kappa (κ) Interpretation Scale
Agreement Metrics We Use
Cohen's Kappa (κ)
2 annotators · categorical
Corrects for chance agreement: two annotators who assign labels at random, but at the same base rates, would still agree some of the time. Kappa normalises this away, making it far more reliable than raw percent agreement. Used for: sentiment, text classification, NER entity types.
Fleiss' Kappa (κF)
3+ annotators · categorical
Extends chance-corrected agreement to any number of raters (strictly, it generalises Scott's π rather than Cohen's κ). It assumes the same number of ratings per item, but different items may be rated by different subsets of annotators. This suits large annotation pools where no two items share the exact same set of annotators. Used for: large-scale classification batches, RLHF preference ranking pools.
Krippendorff's Alpha (α)
ordinal · continuous · spatial
Handles nominal, ordinal, interval, and ratio scales, and tolerates missing data. The distance function inside the observed-disagreement term (D_o) can be swapped for an IoU-based distance (for example 1 − IoU) on spatial tasks, making it the most flexible IAA metric. Used for: bounding box and polygon pairwise agreement, ordinal quality ratings, RLHF relevance scores.
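As an illustration of the multi-rater case, here is a minimal sketch of Fleiss' kappa in plain Python. The function name and the count-matrix input layout are assumptions made for this example; it is not a description of our production tooling.

```python
def fleiss_kappa(counts: list[list[int]]) -> float:
    """Fleiss' kappa for a count matrix.

    counts[i][j] is the number of raters who assigned category j to item i.
    Every item must carry the same total number of ratings, but the raters
    behind those counts may differ from item to item.
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in counts):
        raise ValueError("each item needs the same number (>= 2) of ratings")
    n_categories = len(counts[0])

    # Share of all ratings that fell into each category.
    p_j = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    # Per-item agreement: fraction of rater pairs that agree on the item.
    p_i = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    observed = sum(p_i) / n_items
    expected = sum(p * p for p in p_j)
    if expected == 1.0:
        return 1.0  # only one category was ever used; treated as full agreement
    return (observed - expected) / (1 - expected)


# Three raters, two categories, four items with unanimous ratings.
print(fleiss_kappa([[3, 0], [0, 3], [3, 0], [0, 3]]))  # 1.0
```

The input is a matrix of per-item category counts rather than a rater-by-item grid, which is what lets each item be rated by a different subset of the pool.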
IAA Method by Task — Published Ranges
| Task Type | IAA Metric | Typical Range | Our Floor | Notes |
|---|---|---|---|---|
| Sentiment / Classification | Cohen's κ | 0.72–0.91 | 0.80+ | Higher variance on multi-class vs. binary tasks |
| NER Entity Typing | Cohen's κ (per type) | 0.78–0.94 | 0.82+ | Span boundary disagreements scored separately |
| RLHF Preference Ranking | Fleiss' κ | 0.65–0.85 | 0.72+ | Subjective tasks naturally compress the upper bound |
| Bounding Box | Pairwise IoU mean | 0.82–0.96 | 0.85+ | IoU computed per matching pair; unmatched boxes = 0 |
| Polygon / Segmentation | Krippendorff α (IoU dist.) | 0.79–0.93 | 0.83+ | Boundary F1 between annotators also tracked |
| Video Keyframe Tagging | Fleiss' κ + temporal IoU | 0.70–0.88 | 0.78+ | Frame-window disagreements add complexity |
| Text Span / Highlight | Krippendorff α (interval) | 0.76–0.91 | 0.80+ | Partial overlaps penalised by distance function |
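The bounding-box row can be sketched as follows. This is a simplified illustration under stated assumptions: boxes are `(x_min, y_min, x_max, y_max)` tuples, pairs are matched greedily by highest IoU, a pair only counts as a match at IoU ≥ 0.5, and each unmatched box from either annotator contributes a score of 0. The matching strategy and function names are hypothetical.

```python
Box = tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def pairwise_iou_mean(boxes_a: list[Box], boxes_b: list[Box],
                      min_iou: float = 0.5) -> float:
    """Mean IoU over greedily matched pairs; unmatched boxes score 0."""
    candidates = sorted(
        ((iou(a, b), i, j)
         for i, a in enumerate(boxes_a)
         for j, b in enumerate(boxes_b)),
        reverse=True,
    )
    used_a: set[int] = set()
    used_b: set[int] = set()
    scores: list[float] = []
    for score, i, j in candidates:
        if score < min_iou:
            break  # remaining candidates overlap too little to match
        if i in used_a or j in used_b:
            continue  # each box may be matched at most once
        used_a.add(i)
        used_b.add(j)
        scores.append(score)
    unmatched = (len(boxes_a) - len(used_a)) + (len(boxes_b) - len(used_b))
    total = len(scores) + unmatched
    # Assumption: two empty annotations count as perfect agreement.
    return sum(scores) / total if total else 1.0


annotator_a = [(0, 0, 10, 10), (20, 20, 30, 30)]
annotator_b = [(0, 0, 10, 10), (100, 100, 110, 110)]
print(round(pairwise_iou_mean(annotator_a, annotator_b), 3))  # 0.333
```

In the example one pair matches perfectly and one box from each annotator is left unmatched, so the mean is 1.0 / 3. Greedy matching is the simplest choice; an optimal assignment (for example the Hungarian algorithm) may pair boxes differently in dense scenes.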
Why percent agreement alone is misleading
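Two annotators working on a sentiment task where 85% of samples are "Positive" will agree roughly 73–75% of the time by pure chance, even if they label at random in those proportions: 0.85² ≈ 0.72 comes from the Positive class alone, plus a few points from the remaining classes (0.85² + 0.15² = 0.745 in the binary case). Raw agreement at that level would look acceptable, but Cohen's κ would be approximately 0.00. We never report bare percent agreement without a chance-corrected coefficient alongside it.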
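A minimal sketch of Cohen's κ in Python that reproduces this effect for a binary task. The label lists are synthetic: they are constructed so that two hypothetical annotators each label 85% of items "pos" and are statistically independent of one another, which in the binary case puts chance agreement at 0.85² + 0.15² = 74.5%.

```python
from collections import Counter


def cohens_kappa(labels_a: list[str], labels_b: list[str]) -> float:
    """Cohen's kappa for two annotators labelling the same items."""
    n = len(labels_a)
    if n == 0 or n != len(labels_b):
        raise ValueError("both annotators must label the same non-empty item set")
    # Observed agreement: share of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement: chance of a match given each annotator's base rates.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(
        freq_a[c] * freq_b[c] for c in freq_a.keys() | freq_b.keys()
    ) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)


# 400 items. Annotator A: 340 "pos", 60 "neg".
labels_a = ["pos"] * 340 + ["neg"] * 60
# Annotator B is also 85% "pos", independently of what A chose:
# 289 of A's 340 "pos" items and 51 of A's 60 "neg" items are "pos" for B.
labels_b = ["pos"] * 289 + ["neg"] * 51 + ["pos"] * 51 + ["neg"] * 9

raw = sum(a == b for a, b in zip(labels_a, labels_b)) / len(labels_a)
print(f"raw agreement = {raw:.1%}, kappa = {cohens_kappa(labels_a, labels_b):.2f}")
# raw agreement = 74.5%, kappa = 0.00
```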
Throughput & Turnaround
Throughput is the volume of annotation units completed per hour. Turnaround is the calendar time from data intake to final delivery. Both are tracked continuously — and both are meaningless without the accuracy figures that accompany them.
Annotation Throughput by Task
| Task | Unit | Sustained Rate | QA Overhead |
|---|---|---|---|
| Text Classification / Sentiment | item | 180–300 / hr | ~8% |
| RLHF Preference Ranking | pair | 30–60 / hr | ~12% |
| Named Entity Recognition (NER) | document | 35–80 / hr | ~10% |
| Bounding Box (simple scene) | box | 120–200 / hr | ~10% |
| Bounding Box (dense / complex) | box | 40–80 / hr | ~15% |
| Polygon / Instance Segmentation | object | 12–30 / hr | ~18% |
| Video Keyframe Tagging | clip | 4–10 / hr | ~20% |
| Audio Transcription + NER | minute | 6–12 / hr | ~14% |
| Time-series Anomaly Labelling | window | 60–120 / hr | ~10% |
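For rough capacity planning, the table can be turned into an annotator-hours estimate. The sketch below assumes that QA overhead is additional time on top of annotation time; the table does not define the figure that precisely, so treat both the function and that interpretation as illustrative.

```python
def estimate_hours(units: int, rate_low: float, rate_high: float,
                   qa_overhead: float) -> tuple[float, float]:
    """Return (best_case, worst_case) annotator-hours for a batch.

    rate_low / rate_high are the sustained units per hour from the table;
    qa_overhead is a fraction (0.10 for ~10%), assumed here to be extra
    time added on top of annotation time.
    """
    if rate_low <= 0 or rate_high < rate_low:
        raise ValueError("rates must satisfy 0 < rate_low <= rate_high")
    best = units / rate_high * (1 + qa_overhead)
    worst = units / rate_low * (1 + qa_overhead)
    return best, worst


# 10,000 simple-scene bounding boxes at 120-200 boxes/hr with ~10% QA overhead.
best, worst = estimate_hours(10_000, rate_low=120, rate_high=200, qa_overhead=0.10)
print(f"{best:.0f} to {worst:.0f} annotator-hours")  # 55 to 92 annotator-hours
```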
Delivery Pipeline Stages
Data handoff, ontology review, annotator calibration session, guideline finalisation
100–500 items annotated, IAA computed, gold set validated, client approval gate
Parallel annotator pools; daily IAA monitoring; automatic outlier flagging
Sample-based audit against gold, edge-case adjudication, rework if below floor
Export in requested format, structured feedback window, versioned dataset lock
The quality-speed tradeoff — and how we manage it
Every throughput figure above is measured at the stated quality floor. Increasing speed without increasing team size tends to compress accuracy margins, a pattern widely reported in the annotation literature. We surface this tradeoff explicitly: every project proposal includes a rate × accuracy curve so clients can choose the operating point that fits their pipeline.