【S4E8】Guardian of Trust in Language Models: Automatic Jailbreak and Systematic Defense

Автор: The AI Talks

Загружено: 2024-05-02

Просмотров: 93

Описание: #artificialintelligence #aisafety #computervision

ABS: Large Language Models (LLMs) excel in Natural Language Processing (NLP) with human-like text generation, but the misuse of them has raised a significant concern. In this talk, we introduce an innovative system designed to address these challenges. Our system leverages LLMs to play different roles, simulating various user personas to generate "jailbreaks" – prompts that can induce LLMs to produce outputs contrary to ethical standards or specific guidelines. Utilizing a knowledge graph, our method efficiently creates new jailbreaks, testing the LLMs' adherence to governmental and ethical guidelines. Empirical validation on diverse models, including Vicuna-13B, LongChat-7B, Llama-2-7B, and ChatGPT, has demonstrated its efficacy. The system's application extends to Visual Language Models, highlighting its versatility in multimodal contexts.

The second part of our talk shifts focus to defensive strategies against such jailbreaks. Recent studies have uncovered various attacks that can manipulate LLMs, including manual and gradient-based jailbreaks. Our work delves into the development of robust prompt optimization as a novel defense mechanism, inspired from principled solutions from trustworthy machine learning. This approach involves system prompts – parts of the input text inaccessible to users – and aims to counter both manual and gradient-based attacks effectively. Despite current methods, adaptive attacks like GCG remain a challenge, necessitating a formalized defensive objective. Our research proposes such an objective and demonstrates how robust prompt optimization can enhance the safety of LLMs, safeguarding against realistic threat models and adaptive attacks.

Bio: Haohan Wang is an assistant professor in the School of Information Sciences at the University of Illinois Urbana-Champaign. His research focuses on the development of trustworthy machine learning methods for computational biology and healthcare applications. In his work, he uses statistical analysis and deep learning methods, with an emphasis on data analysis using methods least influenced by spurious signals. Wang earned his PhD in computer science through the Language Technologies Institute of Carnegie Mellon University. He is also an organizer of Trustworthy Machine Learning Initiative.

Не удается загрузить Youtube-плеер. Проверьте блокировку Youtube в вашей сети.
Повторяем попытку...

【S4E8】Guardian of Trust in Language Models: Automatic Jailbreak and Systematic Defense

Доступные форматы для скачивания:

Скачать видео

Информация по загрузке:

Скачать аудио

Похожие видео

【S4E6】Learning Humanoid Robots

【S4E6】Learning Humanoid Robots

【S4E3】Distilling Vision-Language Models on Millions of Videos

【S4E3】Distilling Vision-Language Models on Millions of Videos

The Most Beautiful Equation in Math

The Most Beautiful Equation in Math

CMU Advanced NLP Fall 2024 (7): Prompting and Complex Reasoning

CMU Advanced NLP Fall 2024 (7): Prompting and Complex Reasoning

AstroAI Lunch Talk - January 12, 2026 - Kshitij Duraphe

AstroAI Lunch Talk - January 12, 2026 - Kshitij Duraphe

How To Get The Most Out Of Coding Agents

How To Get The Most Out Of Coding Agents

Deep Learning for Structure Based Drug Discovery by David Koes, PhD.

Deep Learning for Structure Based Drug Discovery by David Koes, PhD.

Multimodal AI Agents with Ruslan Salakhutdinov

Multimodal AI Agents with Ruslan Salakhutdinov

【S4E1】InstantID: Zero-shot Identity-Preserving Generation in Seconds

【S4E1】InstantID: Zero-shot Identity-Preserving Generation in Seconds

CMU CS251 - What is theoretical computer science?

CMU CS251 - What is theoretical computer science?

[S5E3] Масштабирование за пределами авторегрессии: масштабирование порядка как новый путь к общем...

[S5E3] Масштабирование за пределами авторегрессии: масштабирование порядка как новый путь к общем...

GPT 5.3 is here and it's INSANE for Coding

GPT 5.3 is here and it's INSANE for Coding

【S3E5】3D Structured Generative Models

【S3E5】3D Structured Generative Models

【S4E2】Towards Learning a Driving Simulator from the Real World

【S4E2】Towards Learning a Driving Simulator from the Real World

CMU LLM Inference (1): Introduction to Language Models and Inference

CMU LLM Inference (1): Introduction to Language Models and Inference

[S5E2] Video Models Are Zero-Shot Learners and Reasoners | Thaddäus Wiedemer | Google Deepmind

[S5E2] Video Models Are Zero-Shot Learners and Reasoners | Thaddäus Wiedemer | Google Deepmind

Understanding and Measuring One Qubit: Lecture 3 of Quantum Computation and Information at CMU

Understanding and Measuring One Qubit: Lecture 3 of Quantum Computation and Information at CMU

CMU Advanced NLP Fall 2024 (6): Instruction Tuning

CMU Advanced NLP Fall 2024 (6): Instruction Tuning

Lecture 1.1 - Introduction (CMU Multimodal Machine Learning, Fall 2023)

Lecture 1.1 - Introduction (CMU Multimodal Machine Learning, Fall 2023)

Lecture 01: Course Overview (CMU 15-462/662)

Lecture 01: Course Overview (CMU 15-462/662)