Jailbreaking AI: The Threat, the Methods, and the Defense | AAAI-25 Educational AI Video Winner
Author: AAAI
Uploaded: 2025-02-28
Views: 224
Description:
"Jailbreaking AI: The Threat, the Methods, and the Defense"
➡️ AAAI-25 Educational AI Video Winner
➡️ https://aaai.org/about-aaai/aaai-awar...
---
1. OpenAI statement on GPT-4 bio-weapon hazards[0:11]: https://openai.com/index/building-an-...
2. Evolution of DAN prompts[0:55]:
[DAN 1.0]: / dan_is_my_new_friend
[DAN 2.0]: / dan_20
[DAN 3.0]: / dan_30_jan_9th_edition
[DAN 4.0]: / dan_40_january_15_2023_midnight
[DAN 5.0]: / new_jailbreak_proudly_unveiling_the_tried_and
[DAN 6.0]: / presenting_dan_60
[DAN 9.0]: / dan_90_the_newest_jailbreak
3. Reddit link for "Grandma Windows Keys" jailbreak persona trick[1:03]: / thanks_grandma_one_of_the_keys_worked_for_...
4. PAIR research paper (animated graph and examples referenced)[1:13]: https://arxiv.org/abs/2310.08419
5. Details on training LLM costs[1:31]: https://arxiv.org/abs/2405.21015
6. SmoothLLM research paper[1:41]: https://arxiv.org/abs/2310.03684
7. Llama Guard research paper[2:00]: https://arxiv.org/abs/2312.06674
8. Threat categories on Hugging Face[2:06]: https://huggingface.co/meta-llama/Lla...
9. Hugging Face model page for Llama-Guard-3-8B[2:08]: https://huggingface.co/meta-llama/Lla...
📝Clarifications and Additional Information
1. PAIR Cracking in 20 Tries[1:21]: The "20 tries" mentioned is the best-case scenario based on reported testing.
2. SmoothLLM Methodology[1:46]: SmoothLLM uses random perturbations (insertion, swapping, or patching of characters) on multiple copies of an input prompt. By comparing responses from these variations, it aggregates predictions to identify potential jailbreaks. This approach exploits the fragility of adversarial prompts to small textual changes.
3. Success Rate Reduction[1:57]: The stated reduction (98% to less than 1%) applies specifically to defending Llama2 and Vicuna models.
4. Llama Guard Accuracy[2:09]: The "98.5%" figure is actually the AUPRC score of Llama-Guard-3-8B on Meta's internal English test set; for simplicity, the video presents it as accuracy, which still conveys the model's strong performance to a general audience.
5. Low False Positive Rate (FPR)[2:15]: Llama Guard rarely blocks legitimate requests (a 4% FPR on Meta's internal English test set), emphasizing high precision.
6. Multimodal AI Systems[2:22]: These systems process and understand multiple types of data, such as text, images, and audio. Example: GPT-4o, Gemini 1.5.
7. Self-Defending Mechanisms[2:34]: Learn more through related papers: https://arxiv.org/abs/2406.05498.
8. Smaller, Smarter Safety Models[2:35]: Referring to models like Llama-Guard-3-1B (https://huggingface.co/meta-llama/Lla...).
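The SmoothLLM scheme described in clarification 2 can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: it shows only the character-swap perturbation and a majority vote over responses, and assumes the caller supplies the model-query and jailbreak-detection functions (all names here are hypothetical).

```python
import random
import string

def perturb(prompt: str, q: float = 0.1) -> str:
    """Randomly swap a fraction q of the prompt's characters.
    (Swapping is one of SmoothLLM's perturbation types; the paper
    also uses insertion and patching.)"""
    chars = list(prompt)
    n_swap = max(1, int(len(chars) * q))
    for i in random.sample(range(len(chars)), n_swap):
        chars[i] = random.choice(string.printable)
    return "".join(chars)

def smoothllm_defend(prompt, query_model, is_jailbroken, n_copies=10):
    """Query the model on n_copies perturbed copies of the prompt,
    then return a response consistent with the majority vote on
    whether the responses look jailbroken. Adversarial suffixes are
    fragile to small edits, so perturbed copies mostly elicit refusals."""
    responses = [query_model(perturb(prompt)) for _ in range(n_copies)]
    votes = [is_jailbroken(r) for r in responses]
    majority = sum(votes) > n_copies // 2
    for resp, vote in zip(responses, votes):
        if vote == majority:
            return resp
    return responses[0]
```

Usage is just `smoothllm_defend(user_prompt, query_model, is_jailbroken)`, where `query_model` wraps the target LLM and `is_jailbroken` is any response classifier (the paper uses a keyword check for refusal phrases).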
🎥This video is a labor of love, entirely created by me.