Compute-intensive black-box AI models face challenges with data bottlenecks and evaluation
NLP evaluation methodologies have evolved from task-specific pipelines to multi-task benchmarks such as GLUE and SuperGLUE
Models such as BERT and GPT have changed the field, but evaluating their capabilities requires out-of-distribution tasks, human-scored leaderboards, and red-teaming