LLM-Evaluations

Details

Full Name: Large Language Model Evaluations
Also known as: LLM Benchmarking, LLM Assessment, Foundation Model Evaluation

Core Concepts:

Benchmark Suites: Standardized datasets and tasks used to compare LLM capabilities — MMLU (Massive Multitask Language Understanding), HellaSwag, HumanEval, BIG-Bench, GSM8K, TruthfulQA, ARC
Evaluation Metrics: Quantitative measures of model quality — perplexity, accuracy, BLEU, ROUGE, F1, pass@k (code generation), exact match, calibration
Automatic vs. Human Evaluation: Automated scoring via metrics or reference outputs (fast, scalable) vs. human judgment (nuanced, expensive); hybrid approaches such as LLM-as-judge
HELM (Holistic Evaluation of Language Models): Stanford framework evaluating models across multiple scenarios and metrics simultaneously to surface trade-offs across accuracy, robustness, fairness, and efficiency
Chatbot Arena / Elo Rating: Human preference-based evaluation where two models respond to the same prompt and humans choose the better answer; produces Elo-style rankings
Open LLM Leaderboard: Hugging Face-hosted ranking of open models on standardized benchmarks, scored with EleutherAI’s Language Model Evaluation Harness to enable reproducible comparisons
Red-Teaming & Safety Evaluation: Systematic adversarial probing for harmful outputs, jailbreaks, and failure modes; widely treated as a prerequisite for production deployment
Contamination & Overfitting: Risk that a model’s training data includes benchmark test sets, inflating apparent performance; mitigated by held-out or dynamic benchmarks
Task-Specific vs. General Evaluation: Targeted evaluation for a specific use case (e.g., code, summarization, RAG retrieval) vs. broad capability assessment across diverse domains
Key Proponents: Percy Liang et al. (Stanford, "Holistic Evaluation of Language Models"), EleutherAI ("Language Model Evaluation Harness"), LMSYS ("Chatbot Arena: Benchmarking LLMs in the Wild with Elo Ratings")
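
Two of the quantities named above are compact enough to sketch in code: pass@k for code generation and the Elo-style update behind pairwise preference rankings. The sketch below is illustrative only. The function names pass_at_k and elo_update, and the default K-factor of 32, are assumptions made for this example; the pass@k formula is the commonly used unbiased estimator 1 - C(n-c, k) / C(n, k), and the rating procedure on any live leaderboard may differ from this textbook Elo update.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: number of samples generated for the problem
    c: number of those samples that passed the tests
    k: sample budget being evaluated (k <= n)
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # Fewer than k failing samples: every draw of k contains a passing one.
    if n - c < k:
        return 1.0
    # 1 minus the probability that all k drawn samples are failures.
    return 1.0 - comb(n - c, k) / comb(n, k)


def elo_update(rating_a: float, rating_b: float, score_a: float,
               k_factor: float = 32.0) -> tuple[float, float]:
    """Textbook Elo update after one comparison.

    score_a is 1.0 if A was preferred, 0.0 if B was preferred, 0.5 for a tie.
    k_factor (hypothetical default) controls how far one result moves a rating.
    """
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))
    delta = k_factor * (score_a - expected_a)
    return rating_a + delta, rating_b - delta
```

A benchmark-level pass@k is the mean of pass_at_k over all problems. With Elo, an upset (a lower-rated model being preferred) moves both ratings further than an expected result does, which is why rankings stabilise only after many comparisons.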

When to Use:

Selecting a foundation model for a specific application domain
Comparing fine-tuned model versions during iterative training
Validating that a model meets quality, safety, and fairness requirements before deployment
Reproducing or challenging published model capability claims
Establishing regression baselines when updating a deployed model
Communicating model strengths and limitations to non-technical stakeholders
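
The regression-baseline and version-comparison use cases above can be sketched as a per-task comparison with a tolerance. This is a minimal sketch, not a prescribed workflow: the function name find_regressions, the task names, and the default tolerance of 0.01 are hypothetical, and it assumes higher scores are better and that both runs used the same evaluation setup.

```python
from __future__ import annotations


def find_regressions(baseline: dict[str, float],
                     candidate: dict[str, float],
                     tolerance: float = 0.01) -> dict[str, float]:
    """Return {task: score drop} for tasks where the candidate is worse
    than the baseline by more than `tolerance` (higher score = better)."""
    regressions: dict[str, float] = {}
    for task, base_score in baseline.items():
        if task not in candidate:
            # A missing task is treated as an error rather than a silent pass.
            raise KeyError(f"candidate has no score for task {task!r}")
        drop = base_score - candidate[task]
        if drop > tolerance:
            regressions[task] = drop
    return regressions


if __name__ == "__main__":
    # Made-up scores, purely for illustration.
    baseline_scores = {"summarization": 0.80, "code": 0.60}
    candidate_scores = {"summarization": 0.81, "code": 0.55}
    print(find_regressions(baseline_scores, candidate_scores))
```

The tolerance is there because small score differences are often within run-to-run noise; choosing it from repeated runs of the baseline is one reasonable option.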

Related Anchors:

Chain of Thought (CoT)
SOTA (State-of-the-Art)
Mutation Testing

Current Status:

The methodology is stable (held-out benchmarks, harnesses, holistic suites), but any concrete benchmark list is cutoff-bound: by 2024 MMLU was already being superseded by harder benchmarks such as MMLU-Pro and GPQA, and those will saturate too
Never trust a model’s memorised benchmark numbers or "current leading benchmark" claims: point it at living sources — lm-evaluation-harness and HELM — and date every score you quote
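
One way to make "date every score you quote" mechanical is to store each score together with its evaluation date and source. The sketch below assumes a hypothetical ScoreRecord type and an arbitrary 180-day staleness threshold; the model name, benchmark name, and score in the usage section are placeholders, not real results.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class ScoreRecord:
    """A benchmark score that cannot be quoted without its date and source."""
    model: str
    benchmark: str
    score: float
    evaluated_on: date
    source: str  # e.g. which harness run or leaderboard the number came from

    def cite(self) -> str:
        """Format the score with its provenance for use in a report."""
        return (f"{self.model} on {self.benchmark}: {self.score:.3f} "
                f"({self.source}, evaluated {self.evaluated_on.isoformat()})")


def is_stale(record: ScoreRecord, today: date, max_age_days: int = 180) -> bool:
    """True if the score is older than max_age_days (hypothetical threshold)."""
    return today - record.evaluated_on > timedelta(days=max_age_days)


if __name__ == "__main__":
    # Placeholder values, purely for illustration.
    record = ScoreRecord("example-model", "example-benchmark", 0.5,
                         date(2024, 1, 15), "local lm-evaluation-harness run")
    print(record.cite())
    print(is_stale(record, today=date.today()))
```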
