From BLEU to G-Eval: LLM-as-a-Judge Techniques & Limitations
Author: deepsense
Uploaded: 2025-11-25
Views: 129
Description:
LLM-as-a-Judge is changing how we evaluate AI models, but it’s far from magic.
In this talk, Maciej Kaczkowski, ML Engineer, walks through how using an LLM to grade other LLMs actually works in practice – from early metrics like BLEU to modern frameworks such as G-Eval and LLM-as-a-Judge.
🧑‍⚖️ You’ll learn:
🔸 why classic NLP metrics (BLEU, ROUGE, WER) fail on many GenAI tasks,
🔸 how LLM-as-a-Judge can score model outputs with human-like criteria,
🔸 single-output vs pairwise evaluation – and when to use each (see the sketch after this list),
🔸 where things break: narcissistic bias, verbosity bias, and misaligned criteria,
🔸 why you must evaluate the whole system (RAG pipeline, data, rerankers, context) – not just the final answer.
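To make the single-output vs pairwise distinction concrete, here is a minimal Python sketch of both judging modes. Nothing in it comes from the talk itself: call_llm is a hypothetical placeholder for whatever chat-completion client you use, and the prompts and criteria are illustrative assumptions, not the speaker's exact rubric.

```python
# Minimal sketch of single-output and pairwise LLM-as-a-Judge scoring.
# `call_llm`, the prompts, and the criteria below are illustrative assumptions.

def call_llm(prompt: str) -> str:
    """Hypothetical wrapper around your LLM provider's chat/completions API."""
    raise NotImplementedError("plug in your own client here")

SINGLE_OUTPUT_PROMPT = """You are an impartial evaluator.
Criteria: factual accuracy, relevance to the question, conciseness.
Question: {question}
Answer: {answer}
Rate the answer from 1 (poor) to 5 (excellent). Reply with the number only."""

PAIRWISE_PROMPT = """You are an impartial evaluator.
Question: {question}
Answer A: {answer_a}
Answer B: {answer_b}
Which answer better satisfies accuracy and relevance?
Reply with exactly "A", "B", or "TIE"."""

def judge_single(question: str, answer: str) -> int:
    """Absolute scoring: grade one output against fixed criteria."""
    reply = call_llm(SINGLE_OUTPUT_PROMPT.format(question=question, answer=answer))
    return int(reply.strip())

def judge_pairwise(question: str, answer_a: str, answer_b: str) -> str:
    """Relative scoring: compare two outputs. Judging twice with the
    order swapped is one common way to dampen position effects."""
    first = call_llm(PAIRWISE_PROMPT.format(
        question=question, answer_a=answer_a, answer_b=answer_b)).strip()
    second = call_llm(PAIRWISE_PROMPT.format(
        question=question, answer_a=answer_b, answer_b=answer_a)).strip()
    # Only accept a verdict that survives the order swap; otherwise call it a tie.
    if first == "A" and second == "B":
        return "A"
    if first == "B" and second == "A":
        return "B"
    return "TIE"
```

Single-output judging suits continuous monitoring of one system; pairwise judging suits A/B comparisons between model or prompt variants, at the cost of extra calls and order-related bias handling.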
If you’re building evaluation pipelines or trying to move beyond “it feels better”, this session gives you a practical toolbox for LLM-based evaluation in 2025 – including its very real limitations.
00:00 Intro & agenda
00:56 Why evaluation matters in GenAI projects
03:40 Metrics & human eval: why they fall short
07:53 Who judges the judges? G-Eval framework & criteria design
12:52 Single-output & pairwise evaluation in practice
18:11 Pitfalls & biases in LLM-as-a-Judge
22:34 System thinking, RAG pipelines & final takeaways
Check our website: https://deepsense.ai/
LinkedIn: / applied-ai-insider
#LLMasAJudge #LLMevaluation #GEval #AIevaluation #LLMbenchmarks #GenAI #MachineLearning #MLOps #deepsenseAI