AI Data Labeling & Annotation Benchmarks

Annotation Accuracy

Precision, recall and F1 benchmarks for bounding box, polygon, NER, and sentiment tasks measured against gold-standard references.

Inter-annotator Agreement

Cohen's κ, Fleiss' κ, and Krippendorff's α — how we measure and enforce consistency across annotators on every task type.

Throughput & Turnaround

Sustained annotation rates, delivery pipeline stages, and SLA tiers — with explicit quality-speed tradeoff curves per task.

Methodology

How Annotation Accuracy Is Measured

Every benchmark we publish is derived from two core metrics — Precision and Recall — computed against a verified gold-standard reference set. Here is exactly how they work.

Confusion Matrix

	Predicted Positive	Predicted Negative
Actual +	TP (True Positive): label predicted and present in reference	FN (False Negative): label in reference but missed by annotator
Actual −	FP (False Positive): label predicted but not in reference	TN (True Negative): correctly withheld; not labelled, not in reference

The Predicted Positive column (TP + FP) is the Precision denominator.

Derived Metrics

Precision

Correctness

                    P = TP ÷ (TP + FP)
                

"Of every annotation our annotators produced, what fraction was actually correct?" High precision means few false alarms.

Recall

Coverage

                    R = TP ÷ (TP + FN)
                

"Of every annotation that existed in the gold standard, what fraction did our annotators find?" High recall means few missed labels.

F1 Score

Balance

                    F1 = 2 × (P × R) ÷ (P + R)
                

The harmonic mean of Precision and Recall. A single number that penalises extreme imbalance — a 100% recall with 1% precision scores an F1 of just 2%.
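As a minimal sketch of the three formulas above, the function below computes Precision, Recall and F1 from raw counts. The function name and the example counts are illustrative assumptions, not part of the published methodology; the counts are chosen to reproduce the 100% recall / 1% precision case.

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Return (precision, recall, f1) from raw counts.

    A ratio with a zero denominator is reported as 0.0 (an assumption;
    some tools report it as undefined instead).
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical counts: all 10 gold labels were found (no misses),
# but they are buried among 990 incorrect annotations.
p, r, f1 = precision_recall_f1(tp=10, fp=990, fn=0)
print(f"P={p:.2%}  R={r:.2%}  F1={f1:.2%}")  # P=1.00%  R=100.00%  F1=1.98%
```

The harmonic mean stays close to the weaker of the two inputs, which is why the F1 here lands near 2% rather than near the 50% an arithmetic mean would give.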

How Each Task Type Is Scored

Bounding Box

IoU ≥ 0.5 threshold

TP Predicted box overlaps gold box by ≥ 50% (IoU)

FP Predicted box has no matching gold box above threshold

FN Gold box has no predicted box above threshold

IoU (Intersection over Union) = overlap area ÷ union area. Benchmarks reported at IoU = 0.5 and IoU = 0.75.
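The bounding box rules above can be sketched as follows. The box format (x_min, y_min, x_max, y_max), the greedy best-IoU-first matching, and the class-agnostic comparison are assumptions made for illustration; the text does not specify how predicted and gold boxes are paired.

```python
Box = tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max), assumed format


def iou(a: Box, b: Box) -> float:
    """Intersection over Union: overlap area divided by union area."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def match_boxes(pred: list[Box], gold: list[Box], threshold: float = 0.5) -> tuple[int, int, int]:
    """Greedy one-to-one matching, highest IoU first. Returns (tp, fp, fn)."""
    pairs = sorted(
        ((iou(p, g), i, j) for i, p in enumerate(pred) for j, g in enumerate(gold)),
        reverse=True,
    )
    used_pred: set[int] = set()
    used_gold: set[int] = set()
    for score, i, j in pairs:
        if score < threshold:
            break  # every remaining pair is below the threshold
        if i not in used_pred and j not in used_gold:
            used_pred.add(i)
            used_gold.add(j)
    tp = len(used_pred)
    return tp, len(pred) - tp, len(gold) - tp


# Hypothetical boxes: one near-miss (IoU = 0.64) and one stray prediction.
gold = [(0, 0, 10, 10), (20, 20, 30, 30)]
pred = [(2, 2, 10, 10), (50, 50, 60, 60)]
print(match_boxes(pred, gold, threshold=0.5))   # (1, 1, 1)
print(match_boxes(pred, gold, threshold=0.75))  # (0, 2, 2)
```

The same annotation counts as a true positive at IoU = 0.5 and as both a false positive and a false negative at IoU = 0.75, which is why reporting at both thresholds is informative.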

Polygon / Segmentation

Pixel-level mask IoU ≥ 0.5

TP Predicted mask overlaps gold mask by ≥ 50% pixel IoU

FP Predicted mask region not present in gold

FN Gold mask region not covered by any prediction

Boundary F1 (BF) also reported — penalises coarse polygon edges that inflate pixel IoU.
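A minimal sketch of pixel-level mask IoU, assuming masks are equally sized grids of 0/1 values (the mask representation is an assumption for illustration). Boundary F1 is not covered here.

```python
Mask = list[list[int]]  # rows of 0/1 pixels, assumed representation


def mask_iou(pred: Mask, gold: Mask) -> float:
    """Pixel IoU: pixels set in both masks divided by pixels set in either."""
    inter = 0
    union = 0
    for pred_row, gold_row in zip(pred, gold):
        for p, g in zip(pred_row, gold_row):
            inter += 1 if (p and g) else 0
            union += 1 if (p or g) else 0
    return inter / union if union else 0.0


# Hypothetical 3x4 masks: the prediction drifts one column right
# and misses the bottom row of the gold object.
gold = [[0, 1, 1, 0],
        [0, 1, 1, 0],
        [0, 1, 1, 0]]
pred = [[0, 1, 1, 1],
        [0, 1, 1, 1],
        [0, 0, 0, 0]]
print(mask_iou(pred, gold))  # 0.5 (4 shared pixels / 8 pixels in the union)
```

At exactly 0.5 this prediction would just meet the ≥ 0.5 threshold, even though a third of the gold object is uncovered.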

NER (Named Entity Recognition)

Exact span + entity type match

TP Predicted span and type exactly match a gold span and type

FP Predicted span/type has no exact gold counterpart

FN Gold span/type not predicted at all

Partial-match F1 also computed: span overlap > 0 with correct type scores 0.5 rather than 0.
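The exact and partial NER scoring described above could be sketched like this. The span format (start, exclusive end, type), the one-to-one pairing of partial matches, and the way the 0.5 credit enters both Precision and Recall are assumptions; the text only states that a partial match scores 0.5.

```python
Span = tuple[int, int, str]  # (start, end_exclusive, entity_type), assumed format


def _f1(credit: float, n_pred: int, n_gold: int) -> float:
    p = credit / n_pred if n_pred else 0.0
    r = credit / n_gold if n_gold else 0.0
    return 2 * p * r / (p + r) if (p + r) else 0.0


def ner_f1(pred: list[Span], gold: list[Span]) -> tuple[float, float]:
    """Return (exact_f1, partial_f1). Assumes no duplicate spans in either list."""
    exact = set(pred) & set(gold)
    leftover_gold = [g for g in gold if g not in exact]
    claimed: set[int] = set()  # leftover gold spans already given partial credit
    partial = 0
    for p in pred:
        if p in exact:
            continue
        for j, g in enumerate(leftover_gold):
            same_type = p[2] == g[2]
            overlaps = p[0] < g[1] and g[0] < p[1]
            if j not in claimed and same_type and overlaps:
                claimed.add(j)
                partial += 1
                break
    exact_f1 = _f1(len(exact), len(pred), len(gold))
    partial_f1 = _f1(len(exact) + 0.5 * partial, len(pred), len(gold))
    return exact_f1, partial_f1


# Hypothetical spans: one exact match, one truncated ORG span, one spurious MISC.
gold = [(0, 2, "PER"), (5, 7, "ORG"), (10, 12, "LOC")]
pred = [(0, 2, "PER"), (5, 6, "ORG"), (20, 22, "MISC")]
exact_f1, partial_f1 = ner_f1(pred, gold)
print(f"exact F1 = {exact_f1:.3f}, partial F1 = {partial_f1:.3f}")  # 0.333, 0.500
```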

Sentiment

Exact class label match

TP Annotator class (Positive / Negative / Neutral) matches gold label

FP Annotator assigned a class that differs from the gold label (counted against the assigned class)

FN Gold class not assigned by annotator (counted against the gold class)

Macro-F1 used (unweighted average across all classes) so minority classes are not hidden by majority-class dominance.
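To illustrate why Macro-F1 is used, here is a small sketch that scores each class one-vs-rest and takes the unweighted mean. Treating every class seen in either the gold or the annotator labels as part of the average is an assumption, and the example labels are hypothetical.

```python
def macro_f1(pred: list[str], gold: list[str]) -> float:
    """Unweighted mean of per-class F1, scoring each class one-vs-rest."""
    classes = sorted(set(gold) | set(pred))
    per_class = []
    for c in classes:
        tp = sum(p == c and g == c for p, g in zip(pred, gold))
        fp = sum(p == c and g != c for p, g in zip(pred, gold))
        fn = sum(p != c and g == c for p, g in zip(pred, gold))
        denom = 2 * tp + fp + fn  # F1 = 2TP / (2TP + FP + FN)
        per_class.append(2 * tp / denom if denom else 0.0)
    return sum(per_class) / len(per_class)


# Hypothetical batch: the annotator labels everything "Positive".
gold = ["Positive"] * 8 + ["Negative", "Neutral"]
pred = ["Positive"] * 10
accuracy = sum(p == g for p, g in zip(pred, gold)) / len(gold)
print(f"accuracy = {accuracy:.2f}, macro-F1 = {macro_f1(pred, gold):.3f}")
# accuracy = 0.80, macro-F1 = 0.296
```

Raw accuracy looks respectable at 80%, while Macro-F1 exposes that two of the three classes were never labelled correctly.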

Reference Sets

What Are Gold Standard References?

A gold standard is a verified, high-confidence annotation set used as the ground truth when computing accuracy metrics. Every precision and recall figure we publish is measured against one of two reference types.

Academic Calibration Sets

External · Publicly verifiable

Published benchmark datasets with peer-reviewed annotations, used to calibrate our annotator pool and validate that our scoring methodology is consistent with industry standards.

COCO 2017 (val) — Bounding box & segmentation — 80 object categories, 5k validation images

Pascal VOC 2012 — Bounding box — 20 categories, widely used IoU@0.5 baseline

CoNLL-2003 — NER — English newswire, PER / ORG / LOC / MISC entity types

OntoNotes 5.0 — NER — 18 entity types across news, broadcast, and web text

Stanford SST-2 — Sentiment — binary subset of the Stanford Sentiment Treebank (11,855 movie-review sentences in the full corpus)

SemEval 2017 Task 4 — Sentiment — Twitter, three-class, multi-domain

CoreLabel Internal Gold Sets

Project-specific · Client-verified

For client projects, gold standards are built from scratch using a three-stage adjudication process, then reviewed and signed off by the client's domain expert before any production annotation begins.

1

Independent dual annotation

Two senior annotators label the same set of 200–500 samples independently, with no visibility of each other's output.

2

Adjudication round

A lead annotator reconciles all disagreements case by case, documenting the reasoning for each decision.

3

Client sign-off

The reconciled set is submitted to the client domain expert. Any further corrections are incorporated and the final set is version-locked.
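The hand-off from stage 1 to stage 2 can be sketched as a simple comparison of the two independent label sets. The dictionary layout (item ID mapped to label) and the item IDs are hypothetical; real projects would compare richer annotation objects.

```python
def split_for_adjudication(
    first: dict[str, str], second: dict[str, str]
) -> tuple[dict[str, str], list[str]]:
    """Separate items both annotators agree on from items needing adjudication."""
    shared = first.keys() & second.keys()
    agreed = {item: first[item] for item in shared if first[item] == second[item]}
    disputed = sorted(item for item in shared if first[item] != second[item])
    return agreed, disputed


# Hypothetical labels from two senior annotators working independently.
annotator_a = {"img_001": "cat", "img_002": "dog", "img_003": "cat"}
annotator_b = {"img_001": "cat", "img_002": "cat", "img_003": "cat"}
agreed, disputed = split_for_adjudication(annotator_a, annotator_b)
print(disputed)  # ['img_002'] goes to the lead annotator for a documented decision
```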

Why the reference set quality matters as much as the metric

A precision figure is only meaningful if the gold standard it was measured against is itself correct. An 80% precision score against a rigorously verified reference could represent genuinely better annotation than a 95% score against a carelessly constructed one. This is why we invest heavily in gold-set construction, and why we publish the source and methodology alongside every number.

Consistency

Inter-annotator Agreement (IAA)

Accuracy tells you how close annotators are to the gold standard. IAA tells you how consistently they agree with each other. Both are required to trust a dataset: high accuracy against a noisy reference and high agreement around a systematic error are equally dangerous.

Kappa (κ) Interpretation Scale

≤ 0.20

Slight

Guideline review required — likely ambiguous task definition

0.21–0.40

Fair

Acceptable only for exploratory or low-stakes labelling

0.41–0.60

Moderate

Standard floor for general annotation projects

0.61–0.80

Substantial

Production threshold — required for most client deliveries

0.81–1.00

Near-perfect

Target for high-stakes medical, legal, and safety tasks

Our benchmark target: κ ≥ 0.81 across all production tasks. Complex spatial tasks additionally require pairwise IoU ≥ 0.85.
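The interpretation scale above can be expressed as a small lookup. This is only a sketch: the function names are illustrative, and assigning values that fall between the printed bands (for example 0.805) to the higher band is an assumption.

```python
# Upper bound of each band, taken from the scale above.
KAPPA_BANDS = [
    (0.20, "Slight"),
    (0.40, "Fair"),
    (0.60, "Moderate"),
    (0.80, "Substantial"),
    (1.00, "Near-perfect"),
]


def kappa_band(kappa: float) -> str:
    """Map a kappa value to its interpretation band."""
    for upper, label in KAPPA_BANDS:
        if kappa <= upper:
            return label
    raise ValueError("kappa cannot exceed 1.0")


def meets_production_target(kappa: float, target: float = 0.81) -> bool:
    """Check a task against the stated benchmark target."""
    return kappa >= target


for k in (0.15, 0.72, 0.85):
    print(k, kappa_band(k), meets_production_target(k))
# 0.15 Slight False
# 0.72 Substantial False
# 0.85 Near-perfect True
```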

Agreement Metrics We Use

Cohen's Kappa (κ)

2 annotators · categorical

                    κ = (P_o − P_e) ÷ (1 − P_e)
                

P_o = observed agreement rate; P_e = agreement expected by chance

Corrects for chance agreement — if two annotators randomly assign labels at the same base rates they would still agree sometimes. Kappa normalises this away, making it far more reliable than raw percent agreement. Used for: sentiment, text classification, NER entity types.
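A minimal sketch of the Cohen's Kappa formula for two annotators, using only the standard library. The function name and the example labels are illustrative; P_e is estimated from each annotator's own label frequencies, as in the usual definition.

```python
from collections import Counter


def cohens_kappa(a: list[str], b: list[str]) -> float:
    """Cohen's kappa for two annotators labelling the same items."""
    if len(a) != len(b) or not a:
        raise ValueError("both annotators must label the same non-empty item list")
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n  # observed agreement
    freq_a, freq_b = Counter(a), Counter(b)
    # Chance agreement: product of the two annotators' rates, summed over classes.
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a.keys() | freq_b.keys()) / (n * n)
    if p_e == 1.0:
        return 1.0  # both used one identical label throughout; kappa is undefined, 1.0 by convention here
    return (p_o - p_e) / (1 - p_e)


# Hypothetical labels: the annotators agree on 8 of 10 items.
ann_a = ["P", "P", "P", "P", "P", "N", "N", "N", "N", "N"]
ann_b = ["P", "P", "P", "P", "N", "N", "N", "N", "N", "P"]
print(round(cohens_kappa(ann_a, ann_b), 3))  # 0.6  (P_o = 0.8, P_e = 0.5)
```

Raw agreement of 80% shrinks to κ = 0.6 once the 50% agreement expected by chance is removed.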

Fleiss' Kappa (κ_F)

3 + annotators · categorical

                    κ_F = (P̄_o − P̄_e) ÷ (1 − P̄_e)
                

P̄_o = mean observed agreement per item; P̄_e = agreement expected by chance

Extends the chance-corrected approach of Cohen's Kappa to three or more raters. Different items may be rated by different annotators, provided every item receives the same number of ratings. This suits large annotation pools where no two items share the exact same set of annotators. Used for: large-scale classification batches, RLHF preference ranking pools.
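A sketch of Fleiss' Kappa under the usual input convention: one row per item, one column per category, each cell holding the number of raters who chose that category. The function name and the example counts are illustrative assumptions.

```python
def fleiss_kappa(counts: list[list[int]]) -> float:
    """Fleiss' kappa from an items x categories table of rating counts."""
    n_items = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in counts):
        raise ValueError("every item needs the same number (>= 2) of ratings")
    # Per-item agreement: share of rater pairs that chose the same category.
    per_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_o = sum(per_item) / n_items
    # Chance agreement from the pooled category proportions.
    totals = [sum(col) for col in zip(*counts)]
    p_e = sum((t / (n_items * n_raters)) ** 2 for t in totals)
    if p_e == 1.0:
        return 1.0  # a single category was used throughout; undefined, 1.0 by convention here
    return (p_o - p_e) / (1 - p_e)


# Hypothetical batch: 4 items, 3 raters each, 2 categories.
table = [[3, 0], [0, 3], [2, 1], [1, 2]]
print(round(fleiss_kappa(table), 3))  # 0.333
```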

Krippendorff's Alpha (α)

ordinal · continuous · spatial

                    α = 1 − (D_o ÷ D_e)
                

D_o = observed disagreement; D_e = expected disagreement

Handles any metric scale (nominal, ordinal, interval, and ratio) plus missing data. The distance function used for both D_o and D_e can be swapped for 1 − IoU on spatial tasks, making it the most flexible IAA metric. Used for: bounding box and polygon pairwise agreement, ordinal quality ratings, RLHF relevance scores.
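A compact sketch of Krippendorff's Alpha with a pluggable distance function. The data layout (one list of ratings per item, with None for a missing rating) and the function names are assumptions; the expected-disagreement loop is quadratic in the number of ratings, so this is meant for illustration rather than large batches.

```python
from itertools import permutations
from typing import Any, Callable, Optional, Sequence


def krippendorff_alpha(
    units: Sequence[Sequence[Optional[Any]]],
    distance: Callable[[Any, Any], float],
) -> float:
    """alpha = 1 - D_o / D_e, with None marking a missing rating."""
    # Keep only items with at least two ratings; single ratings cannot be paired.
    pairable = [[v for v in unit if v is not None] for unit in units]
    pairable = [unit for unit in pairable if len(unit) >= 2]
    values = [v for unit in pairable for v in unit]
    n = len(values)
    if n < 2:
        raise ValueError("need at least one item with two or more ratings")
    # Observed disagreement: ordered pairs within each item.
    d_o = sum(
        sum(distance(a, b) for a, b in permutations(unit, 2)) / (len(unit) - 1)
        for unit in pairable
    ) / n
    # Expected disagreement: ordered pairs across all pairable ratings.
    d_e = sum(distance(a, b) for a, b in permutations(values, 2)) / (n * (n - 1))
    if d_e == 0:
        return 1.0  # no variation at all; undefined, 1.0 by convention here
    return 1 - d_o / d_e


def nominal(a: Any, b: Any) -> float:
    return 0.0 if a == b else 1.0


# Hypothetical ratings: the fourth item has one missing rating and is skipped.
ratings = [["a", "a"], ["b", "b"], ["a", "b"], ["a", None]]
print(round(krippendorff_alpha(ratings, nominal), 3))  # 0.444
```

Swapping `nominal` for a squared difference such as `lambda a, b: (a - b) ** 2` gives the interval variant, and a function returning 1 − IoU for a pair of boxes or masks gives the spatial variant described above.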

IAA Method by Task — Published Ranges

Task Type	IAA Metric	Typical Range	Our Floor	Notes
Sentiment / Classification	Cohen's κ	0.72–0.91	0.80+	Higher variance on multi-class vs. binary tasks
NER Entity Typing	Cohen's κ (per type)	0.78–0.94	0.82+	Span boundary disagreements scored separately
RLHF Preference Ranking	Fleiss' κ	0.65–0.85	0.72+	Subjective tasks naturally compress the upper bound
Bounding Box	Pairwise IoU mean	0.82–0.96	0.85+	IoU computed per matching pair; unmatched boxes = 0
Polygon / Segmentation	Krippendorff α (IoU dist.)	0.79–0.93	0.83+	Boundary F1 between annotators also tracked
Video Keyframe Tagging	Fleiss' κ + temporal IoU	0.70–0.88	0.78+	Frame-window disagreements add complexity
Text Span / Highlight	Krippendorff α (interval)	0.76–0.91	0.80+	Partial overlaps penalised by distance function

Why percent agreement alone is misleading

Two annotators working on a sentiment task where 85% of samples are "Positive" will agree ~73% of the time by pure chance if each assigns labels at random in proportion to the class base rates: the majority class alone contributes 0.85² ≈ 0.72, and the minority classes add a little more. A raw agreement of 73% would look acceptable, but Cohen's κ would be approximately 0.00. We never report bare percent agreement without a chance-corrected coefficient alongside it.
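The chance-agreement arithmetic can be checked with a few lines. Only the 85% "Positive" rate comes from the text; the 10% / 5% split of the remaining classes is an assumption chosen for illustration (a two-class 85/15 split would give about 74.5% instead).

```python
def chance_agreement(base_rates: dict[str, float]) -> float:
    """Probability that two independent random labellers pick the same class."""
    return sum(rate * rate for rate in base_rates.values())


# 85% Positive is from the text; the Negative/Neutral split is assumed.
rates = {"Positive": 0.85, "Negative": 0.10, "Neutral": 0.05}
p_e = chance_agreement(rates)
p_o = p_e  # two random labellers agree exactly as often as chance predicts
kappa = (p_o - p_e) / (1 - p_e)
print(f"chance agreement = {p_e:.3f}, kappa = {kappa:.2f}")
# chance agreement = 0.735, kappa = 0.00
```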

Speed & Scale

Throughput & Turnaround

Throughput is the volume of annotation units completed per hour. Turnaround is the calendar time from data intake to final delivery. Both are tracked continuously — and both are meaningless without the accuracy figures that accompany them.

<48 hrs

Pilot batch turnaround

100–500 items from intake to QA-cleared delivery

200+

Items / hour (classification)

Sustained rate on single-label text tasks at κ ≥ 0.82

10 days

Standard project SLA

10k-item bounding box or NER batch with two QA passes

100%

Iteration coverage

Every delivery includes a structured feedback loop window

Annotation Throughput by Task

Task	Unit	Sustained Rate	QA Overhead
Text Classification / Sentiment	item	180–300 / hr	~8%
RLHF Preference Ranking	pair	30–60 / hr	~12%
Named Entity Recognition (NER)	document	35–80 / hr	~10%
Bounding Box (simple scene)	box	120–200 / hr	~10%
Bounding Box (dense / complex)	box	40–80 / hr	~15%
Polygon / Instance Segmentation	object	12–30 / hr	~18%
Video Keyframe Tagging	clip	4–10 / hr	~20%
Audio Transcription + NER	minute	6–12 / hr	~14%
Time-series Anomaly Labelling	window	60–120 / hr	~10%

Rates measured at sustained quality floor (κ ≥ 0.80 / IoU ≥ 0.85). Sprint peaks ~20% higher but not used for SLA commitments.
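As a rough planning sketch, the table can be turned into an annotator-hours estimate. This assumes QA overhead is extra time expressed as a fraction of annotation time, which the table does not state explicitly; the function name and batch size are hypothetical.

```python
def annotator_hours(items: int, rate_per_hour: float, qa_overhead: float) -> float:
    """Annotation time plus QA time, with QA as a fraction of annotation time (assumed)."""
    return items / rate_per_hour * (1 + qa_overhead)


# Simple-scene bounding boxes: 120-200 boxes/hr with ~10% QA overhead.
batch = 10_000
slow = annotator_hours(batch, rate_per_hour=120, qa_overhead=0.10)
fast = annotator_hours(batch, rate_per_hour=200, qa_overhead=0.10)
print(f"{fast:.1f} to {slow:.1f} annotator-hours")  # 55.0 to 91.7 annotator-hours
```

These are total annotator-hours, not calendar time; turnaround also depends on how many annotators work in parallel.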

Delivery Pipeline Stages

Intake & Briefing 1–2 days

Data handoff, ontology review, annotator calibration session, guideline finalisation

Pilot Batch < 48 hrs

100–500 items annotated, IAA computed, gold set validated, client approval gate

Production Annotation Varies

Parallel annotator pools; daily IAA monitoring; automatic outlier flagging

QA Review ~20% of prod.

Sample-based audit against gold, edge-case adjudication, rework if below floor

Delivery & Iteration 1 day

Export in requested format, structured feedback window, versioned dataset lock

The quality-speed tradeoff — and how we manage it

Every throughput figure above is measured at the stated quality floor. Increasing speed without increasing team size tends to compress accuracy margins, a well-documented phenomenon in annotation literature. We surface this tradeoff explicitly: every project proposal includes a rate × accuracy curve so clients can choose the operating point that fits their pipeline.

Standard Tier

Published rate

Max quality floor maintained

Express Tier

+40% throughput

QA sample size increased to compensate

Precision Tier

−30% throughput

Double QA pass + adjudication on all edge cases
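The tier adjustments can be applied to any published rate as simple multipliers. The dictionary keys and function name are illustrative; only the +40% and −30% figures come from the tiers above.

```python
# Throughput multipliers relative to the published (Standard) rate.
TIER_MULTIPLIER = {"standard": 1.0, "express": 1.4, "precision": 0.7}


def tier_rate(published_rate: float, tier: str) -> float:
    """Expected sustained rate for a tier, given the published Standard rate."""
    return published_rate * TIER_MULTIPLIER[tier]


# Hypothetical published rate of 200 items/hr.
for tier in TIER_MULTIPLIER:
    print(tier, round(tier_rate(200, tier)))
# standard 200
# express 280
# precision 140
```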
