LLM Jailbreak

This research project presents a narrative-based jailbreak framework that exposes vulnerabilities in multimodal large language models through immersive, context-rich prompts.

MIRAGE is a jailbreak framework designed to expose the vulnerabilities of multimodal large language models (MLLMs). Instead of using direct or brute-force prompts, it employs narrative-driven interactions, embedding instructions within stories, character roles, and multi-turn visual context, in order to bypass safety filters. By simulating immersive, realistic scenarios, MIRAGE tricks models into revealing restricted content. Its high success rate demonstrates that current MLLM defenses remain fragile against context-rich, indirect attacks.

An example of how adopting a detective persona (role-immersion) within a multi-turn visual storytelling framework leads a multimodal large language model to produce a letter-format (structured) response containing harmful information.
Our proposed method, MIRAGE, inspired by the realm of literary creation, involves two stages: (i) multi-turn visual storytelling and (ii) role-immersion through narrative.

We report the attack success rates (ASR) of MIRAGE against six baselines: Vanilla-Text (Ma et al., 2024), Vanilla-Typo (Ma et al., 2024), FigStep (Gong et al., 2023), Query-Relevant (Liu et al., 2024), HADES (Li et al., 2025), and Visual-RolePlay (Ma et al., 2024). MIRAGE consistently outperforms prior approaches on both white-box and black-box multimodal models, achieving the highest ASR in most settings. Blue and green highlights indicate the best and second-best results, respectively.

Evaluation results on two selected datasets, RedTeam-2K (Luo et al., 2024) and HarmBench (Mazeika et al., 2024), using Attack Success Rate (ASR). The best score in each row is blue-highlighted; the second best is green-highlighted. The results are based on both White-Box (LLaVA-Mistral (Liu et al., 2023), Qwen-VL (Bai et al., 2023), and Intern-VL (Chen et al., 2024)) and Black-Box (Gemini-1.5-Pro (Team et al., 2023), GPT-4V (OpenAI, 2023), and Grok-2V (xAI, 2024)) models.
| Dataset | Method | LLaVA | Qwen-VL | Intern-VL | Gemini-1.5 | GPT-4V | Grok-2V |
|---|---|---|---|---|---|---|---|
| RedTeam-2K | Vanilla-Text | 7.75 | 5.00 | 8.25 | 6.18 | 3.41 | — |
| | Vanilla-Typo | 6.50 | 9.25 | 8.25 | 5.70 | 6.07 | — |
| | FigStep | 15.00 | 20.50 | 22.00 | 17.17 | 11.66 | — |
| | Query-Relevant | 20.50 | 16.75 | 13.00 | 23.13 | 11.31 | 15.68 |
| | HADES | 46.94 | 31.76 | 36.98 | 47.43 | 24.14 | 21.90 |
| | Visual-RolePlay | 38.00 | 29.50 | 28.25 | 35.35 | 35.57 | 27.32 |
| | MIRAGE | 52.15 | 45.02 | 44.23 | 63.67 | 43.13 | 52.72 |
| HarmBench | Vanilla-Text | 11.67 | 1.89 | 11.36 | 6.62 | 4.81 | — |
| | Vanilla-Typo | 5.36 | 8.20 | 22.08 | 14.51 | 7.42 | — |
| | FigStep | 27.44 | 27.76 | 30.91 | 31.23 | 18.80 | — |
| | Query-Relevant | 23.97 | 25.55 | 8.52 | 26.50 | 20.50 | 22.42 |
| | HADES | 32.08 | 43.12 | 35.91 | 68.43 | 15.07 | 26.58 |
| | Visual-RolePlay | 41.64 | 30.28 | 34.38 | 37.85 | 30.83 | 32.51 |
| | MIRAGE | 43.40 | 47.24 | 50.60 | 65.63 | 44.65 | 48.91 |

(The first three model columns are White-Box, the last three Black-Box; — marks a value not reported in the source.)
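The tables above compare methods by Attack Success Rate. As a minimal sketch of how an ASR figure like those in the table is typically computed (the judging step here is a stand-in boolean per prompt, not the paper's actual evaluator):

```python
def attack_success_rate(judgements: list[bool]) -> float:
    """Return ASR as a percentage, given one boolean per attacked prompt.

    True means a judge labeled the model's response as successfully
    jailbroken (i.e., it produced restricted content rather than refusing).
    """
    if not judgements:
        return 0.0
    return 100.0 * sum(judgements) / len(judgements)


# Example: 3 of 8 attack attempts judged successful -> ASR of 37.5%.
print(attack_success_rate([True, False, True, False,
                           False, True, False, False]))  # 37.5
```

In practice the per-prompt judgement usually comes from an automated evaluator (e.g., a classifier or an LLM judge), so reported ASR values depend on the judge as well as the attack.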

If you would like to reference our work, please use the following BibTeX citation:

@article{you2025mirage,
  title={MIRAGE: Multimodal Immersive Reasoning and Guided Exploration for Red-Team Jailbreak Attacks},
  author={You, Wenhao and Hooi, Bryan and Wang, Yiwei and Wang, Youke and Ke, Zong and Yang, Ming-Hsuan and Huang, Zi and Cai, Yujun},
  journal={arXiv preprint arXiv:2503.19134},
  year={2025}
}

References

2025

  1. ECCV
    Images are Achilles’ Heel of Alignment: Exploiting Visual Vulnerabilities for Jailbreaking Multimodal Large Language Models
    Yifan Li, Hangyu Guo, Kun Zhou, and 2 more authors
    2025

2024

  1. arXiv
    Visual-RolePlay: Universal Jailbreak Attack on MultiModal Large Language Models via Role-playing Image Characters
    Siyuan Ma, Weidi Luo, Yu Wang, and 4 more authors
    arXiv preprint arXiv:2405.20773, 2024
  2. ECCV
    Mm-safetybench: A benchmark for safety evaluation of multimodal large language models
    Xin Liu, Yichen Zhu, Jindong Gu, and 3 more authors
    2024
  3. arXiv
    Jailbreakv-28k: A benchmark for assessing the robustness of multimodal large language models against jailbreak attacks
    Weidi Luo, Siyuan Ma, Xiaogeng Liu, and 2 more authors
    arXiv preprint arXiv:2404.03027, 2024
  4. ICML
    HarmBench: a standardized evaluation framework for automated red teaming and robust refusal
    Mantas Mazeika, Long Phan, Xuwang Yin, and 9 more authors
    2024
  5. CVPR
    Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks
    Zhe Chen, Jiannan Wu, Wenhai Wang, and 8 more authors
    2024
  6. xAI
    Grok-2 Beta Release
    xAI
    2024

2023

  1. arXiv
    Figstep: Jailbreaking large vision-language models via typographic visual prompts
    Yichen Gong, Delong Ran, Jinyuan Liu, and 5 more authors
    arXiv preprint arXiv:2311.05608, 2023
  2. arXiv
    Visual Instruction Tuning
    Haotian Liu, Chunyuan Li, Qingyang Wu, and 1 more author
    2023
  3. arXiv
    Qwen-vl: A frontier large vision-language model with versatile abilities
    Jinze Bai, Shuai Bai, Shusheng Yang, and 6 more authors
    arXiv preprint arXiv:2308.12966, 2023
  4. arXiv
    Gemini: a family of highly capable multimodal models
    Gemini Team, Rohan Anil, Sebastian Borgeaud, and 8 more authors
    arXiv preprint arXiv:2312.11805, 2023
  5. OpenAI
    GPT-4V(ision) System Card
    OpenAI
    2023