MMLU scores drop 14-17 points when contamination is removed. HumanEval is saturated at 94%. Models have been trained on the test set. Here's what to measure instead.