Latest Writings
AI evals are becoming the new compute bottleneck
A field guide to evaluation costs: where the money goes, why old compression tricks break, and why agentic evals, training-in-the-loop be...
Field Notes: Challenges in GenAI Evaluation Science
Early themes from expert interviews on the challenges of evaluating generative AI systems, spanning validity, practicality, and interpret...
Every Eval Ever: Toward a Common Language for AI Eval Reporting
The multistakeholder coalition EvalEval launches Every Eval Ever, a shared format and central eval repository. We're working to resolve A...
The Hidden Social Costs of AI
As AI continues to grow more powerful, who bears its hidden social costs?
The AI Evaluation Chart Crisis
Charts used to showcase performance demonstrate broader issues in the AI evaluation ecosystem: a lack of balance between competitive benc...
The Science of Evaluations: Workstream Kickoff Post
Announcing the launch of a research-driven community initiative to strengthen the science of AI evaluations.