Practical AI Coding Agent Evaluation with SWE-bench, TeamCity, and Juni | Ernst Haagsman

Author: DataTalksClub

Uploaded: 2026-04-28

Views: 1528

Description: In this talk, Ernst Haagsman, Product Leader at JetBrains, shares his expertise on scaling developer tools from his early days on the PyCharm team to his current role leading TeamCity and AI integration. We explore the practical challenges of evaluating AI coding agents using SWE-bench and how to build a robust CI/CD pipeline for non-deterministic AI outputs.

You’ll learn about:
The architecture of SWE-bench and how it uses real-world GitHub issues as benchmarks.
How to apply the "Arrange, Act, Assert" framework to AI agent evaluation.
Technical strategies for caching dependencies and using Docker to reduce evaluation costs.
Scaling parallel AI workloads using TeamCity, Kotlin DSL, and AWS infrastructure.
Techniques for managing LLM API rate limits and handling stochastic model behavior.
Building custom datasets for specialized AI agents like customer support bots or transcribers.
The future of "Agentic Development" with a first look at JetBrains Air.
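
As a rough illustration of the "Arrange, Act, Assert" idea from the list above, here is a minimal Python sketch. It is not the SWE-bench harness and not code from the talk: `Task`, `stub_agent`, `evaluate`, and `pass_rate` are hypothetical names, and the "agent" is a stand-in that applies one fixed textual patch. Because agent output is non-deterministic, the sketch repeats each task and reports a pass rate rather than a single pass/fail.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Task:
    """Hypothetical task: broken source code plus a check the fix must pass."""
    task_id: str
    broken_source: str
    check: Callable[[dict], bool]


def stub_agent(source: str) -> str:
    """Stand-in for a real coding agent (assumption): applies one fixed patch."""
    return source.replace("a - b", "a + b")


def evaluate(task: Task, agent: Callable[[str], str]) -> bool:
    # Arrange: start from the task's known-broken state.
    source = task.broken_source
    # Act: let the agent produce a candidate fix.
    candidate = agent(source)
    # Assert: load the candidate and apply the task's check.
    namespace: dict = {}
    try:
        exec(candidate, namespace)
        return task.check(namespace)
    except Exception:
        # A candidate that fails to load or crashes counts as a failed run.
        return False


def pass_rate(task: Task, agent: Callable[[str], str], runs: int = 3) -> float:
    """Repeat the evaluation and return the fraction of successful runs."""
    results = [evaluate(task, agent) for _ in range(runs)]
    return sum(results) / runs


task = Task(
    task_id="add-001",
    broken_source="def add(a, b):\n    return a - b\n",
    check=lambda ns: ns["add"](2, 3) == 5,
)

if __name__ == "__main__":
    print(task.task_id, pass_rate(task, stub_agent))
```

In a real setup the Arrange step would typically check out a repository at a specific commit inside a prebuilt image, and the Assert step would run the project's own test suite; those parts are omitted here.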

Links:
Repository: https://github.com/jetbrains/teamcity...
Dataset: https://huggingface.co/datasets/SWE-b...

TIMECODES:
00:00:00 Intro: workshop, speakers, and agenda
00:01:46 Demo project: a small Go service and manual testing
00:05:37 AI agents, Juni, and why unit tests don't fit
00:08:18 What SWE-bench is: real GitHub issues as tasks
00:14:18 Evaluation workflow and the SWE‑bench harness
00:19:20 Scaling gotchas: cost, retries, caching and prebuilt images
00:23:25 Designing evaluation runs: slicing, CI reuse and TeamCity benefits
00:29:22 Live demo: preparing task images and kicking off evaluations
00:34:02 TeamCity config as code: Kotlin DSL and repo layout
00:43:56 How images and task environments are built and cached
00:49:51 Running the agent (Juni), formatting outputs and grading
00:55:42 Tagging builds, interpreting results and concurrency controls
01:01:14 Parallel vs sequential runs, timing, and reuse trade-offs
01:05:48 Dataset coverage, language scope and model leakage concerns
01:08:50 Aggregating results and visualizing success rates in TeamCity
01:13:06 Interpreting evaluation outcomes and model selection
01:16:49 Applying SWE‑bench ideas to your own agent or skill
01:21:06 Getting started: TeamCity, Juni, Air, and next steps

This workshop is designed for Machine Learning Engineers, Data Scientists, and DevOps professionals who are building or evaluating AI agents and need to move from manual testing to automated, scalable benchmarks. It is particularly valuable for those looking to integrate LLM evaluation into their existing CI/CD workflows.

Connect with Ernst
LinkedIn -   / ernsthaagsman

Connect with DataTalks.Club:
Join the community - https://datatalks.club/slack.html
Subscribe to our Google calendar to have all our events in your calendar - https://calendar.google.com/calendar/...
Check other upcoming events - https://lu.ma/dtc-events
GitHub: https://github.com/DataTalksClub
LinkedIn -   / datatalks-club
Twitter -   / datatalksclub
Website - https://datatalks.club/

Connect with Alexey
Twitter -   / al_grigor
LinkedIn -   / agrigorev

Check our free online courses:
ML Engineering course - http://mlzoomcamp.com
Data Engineering course - https://github.com/DataTalksClub/data...
MLOps course - https://github.com/DataTalksClub/mlop...
LLM course - https://github.com/DataTalksClub/llm-...
Open-source LLM course: https://github.com/DataTalksClub/open...
AI Dev Tools course: https://github.com/DataTalksClub/ai-d...

👉🏼 Read about all our courses in one place - https://datatalks.club/blog/guide-to-...

👋🏼 Support/inquiries
If you want to support our community, use this link - https://github.com/sponsors/alexeygri...

If you’re a company, reach us at [email protected]

#AI #MachineLearning #AIAgents #SWEbench #JetBrains #TeamCity #SoftwareEngineering #LLM #DevOps #CICD #DataScience #Python #Automation #CodingAgents #KotlinDSL #AWS #Docker #TechWorkshop #AIResearch #datatalksclub
