Building better AI benchmarks: How many raters are enough?

MIT AI Technology Review

machines coding ai benchmarks humans

AI benchmarks are broken. Here’s what we need instead.

For decades, artificial intelligence has been evaluated through the question of whether machines outperform humans. From chess to advanced math, from coding to essay writing, the performance of AI models and applications is tested against that of individual humans completing tasks. This framing is seductive: An AI vs. human comparison on isolated problems with clear…

Mar 31, 12:01 PM

