Fluid Language Model Benchmarking

Author: Conference on Language Modeling

Uploaded: 2025-11-03

Views: 82

Description: Authors: Valentin Hofmann, David Heineman, Ian Magnusson, Kyle Lo, Jesse Dodge, Maarten Sap, Pang Wei Koh, Chun Wang, Hannaneh Hajishirzi, Noah A. Smith

Language model (LM) benchmarking faces several challenges: comprehensive evaluations are costly, benchmarks often fail to measure the intended capabilities, and evaluation quality can degrade due to labeling errors and benchmark saturation. Although various strategies have been proposed to mitigate these issues, they tend to address individual aspects in isolation, neglecting broader questions about overall evaluation quality. Here, we introduce Fluid Benchmarking, a new evaluation approach that advances LM benchmarking across multiple dimensions. Inspired by psychometrics, Fluid Benchmarking is based on the insight that the relative value of benchmark items depends on an LM's capability level, suggesting that evaluation should adapt to each LM. Methodologically, Fluid Benchmarking estimates an item response model based on existing LM evaluation results and uses the inferred quantities to select evaluation items dynamically, similar to computerized adaptive testing in education. In our experiments, we compare Fluid Benchmarking against the common practice of random item sampling as well as more sophisticated baselines, including alternative methods grounded in item response theory. We examine four dimensions, namely efficiency, validity, variance, and saturation, and find that Fluid Benchmarking achieves superior performance in all of them (e.g., higher validity and less variance on MMLU with fifty times fewer items). Our analysis shows that the two components of Fluid Benchmarking have distinct effects: item response theory, used to map performance into a latent ability space, increases validity, while dynamic item selection reduces variance. Overall, our results suggest that LM benchmarking can be substantially improved by moving beyond static evaluation.
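
The adaptive loop described above can be illustrated with a minimal sketch. It is not the authors' implementation, and the abstract does not specify these details. The sketch assumes a two-parameter logistic (2PL) item response model, selection of the next item by maximum Fisher information (a common choice in computerized adaptive testing), and a grid-search MAP ability estimate under a standard normal prior. The item parameters, the `answer_fn` callback, and all function names are hypothetical; in practice the item parameters would be fitted beforehand on existing LM evaluation results.

```python
import math

def p_correct(theta, a, b):
    """2PL item response function: probability that an LM with ability
    theta answers an item with discrimination a and difficulty b correctly."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def fisher_information(theta, a, b):
    """Fisher information of a 2PL item at ability theta."""
    p = p_correct(theta, a, b)
    return a * a * p * (1.0 - p)

def estimate_ability(responses, items, grid=None):
    """MAP estimate of theta by grid search (assumed standard normal prior).
    responses is a list of (item_index, is_correct) pairs."""
    if grid is None:
        grid = [g / 10.0 for g in range(-40, 41)]
    eps = 1e-9  # clamp probabilities to keep the logarithm finite
    best_theta, best_lp = 0.0, -math.inf
    for theta in grid:
        lp = -0.5 * theta * theta  # log prior, up to a constant
        for idx, correct in responses:
            a, b = items[idx]
            p = min(max(p_correct(theta, a, b), eps), 1.0 - eps)
            lp += math.log(p if correct else 1.0 - p)
        if lp > best_lp:
            best_theta, best_lp = theta, lp
    return best_theta

def adaptive_eval(items, answer_fn, budget):
    """Ask up to `budget` items, each time choosing the unasked item that is
    most informative at the current ability estimate.
    answer_fn(item_index) -> bool is a hypothetical hook that runs the LM."""
    theta = 0.0
    responses = []
    remaining = set(range(len(items)))
    for _ in range(min(budget, len(items))):
        nxt = max(sorted(remaining),
                  key=lambda i: fisher_information(theta, *items[i]))
        remaining.remove(nxt)
        responses.append((nxt, bool(answer_fn(nxt))))
        theta = estimate_ability(responses, items)
    return theta, [i for i, _ in responses]

# Toy item bank: (discrimination, difficulty) pairs, made up for illustration.
items = [(1.0, -2.0), (1.0, 0.0), (1.0, 2.0)]
theta, asked = adaptive_eval(items, lambda i: True, budget=2)
```

In this toy run the evaluation starts with the medium-difficulty item and, after a correct answer, moves to the harder one, which mirrors the idea that the most useful items depend on the LM's capability level.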

