We build the data that pushes the frontier

Snorkel helps frontier labs and AI teams develop specialized training data and environments that set their models and agents apart.

Proud to partner with top frontier AI and research teams
Google, Stanford University, Amazon Web Services, Wisconsin, Microsoft, Brown University, Anthropic, Washington, Mistral AI, OpenAI
The Frontier AI Data Lab

Data development for the frontier

Snorkel partners with frontier AI teams to build research-grade datasets, evaluation systems, and runnable environments where generic coverage runs out.

Snorkel Data Series

Curriculum-structured datasets for the task areas frontier models are pushing hardest, with rubrics, reviewer guidance, difficulty tiers, and eval slices built in.

Custom data development

When off-the-shelf coverage runs out, we build bespoke datasets, evals, and benchmark expansions for the exact failure surface you need to close.

Specialized agents

Custom agents built on specialized data and evaluated in real workflows, with pass/fail criteria tied to the performance standards that move ROI.
Data

Expert Demonstrations & Reasoning

Human solution traces
Reasoning traces
SME Q&A rationales
Workflow and decision-process demos
Tool-use demos

Preference Labels & Rankings

Patch/draft/report quality ranking
Trajectory QA
Risk/safety/style calibration
Helpful/harmless ranking
Grounding & style

Rubrics & Verifiable Outcomes

Unit tests / compile
Deterministic graders
Citation correctness
Numerical consistency and scorable math/science
Long-horizon tasks
Environments

Standard & Custom Environments

Repo + CLI tools
Browser/GUI harness
Multi-step/stateful workflows
Simulated environments
Your tools, codebase, corpus, data & permissions
DATA DEVELOPMENT

Good data is a set of design choices

Most data quality problems are design problems. Ambiguous task definitions produce inconsistent labels. Uncalibrated reviewers introduce systematic bias. Missing provenance makes failure analysis guesswork. Snorkel's proprietary process is built around the decisions that determine whether training data actually drives model improvement.

CUSTOM AGENTS

Specialized agents grounded in expert data

The same data development system we use to improve frontier models powers our specialized agents. That means agents evaluated against task-specific rubrics and programmatic checks, not generic benchmarks, and refined through the same adjudication and provenance practices used in production model development.

Built for specialized workflows and high-consequence decisions, not generic copilots
Evaluation on environment-grounded tasks with programmatic pass/fail criteria
Same rigor used to train frontier-class models, applied to your enterprise deployment
PUBLISHED RESEARCH

Research that shapes the work

Every dataset, benchmark, and environment we create is the output of active research co-developed and peer-reviewed with leading academic teams and frontier labs.

For models that need to be right. Not just good enough.