Building better AI benchmarks: How many raters are enough? - TrendCloud