LLM Jailbreak

This research project presents a narrative-based jailbreak framework that exposes vulnerabilities in multimodal large language models through immersive, context-rich prompts.

MIRAGE is a jailbreak framework designed to expose the vulnerabilities of multimodal large language models (MLLMs). Instead of using direct or brute-force prompts, it employs narrative-driven interactions, embedding instructions within stories, character roles, and multi-turn visual context, in order to bypass safety filters. By simulating immersive, realistic scenarios, MIRAGE tricks models into revealing restricted content. Its high success rate demonstrates that current MLLM defenses remain fragile against context-rich, indirect attacks.

An example of how adopting a detective persona (role-immersion) within a multi-turn visual storytelling framework leads a multimodal large language model to produce a letter-format (structured) response containing harmful information.
Our proposed method, MIRAGE, inspired by the realm of literary creation, involves two stages: (i) multi-turn visual storytelling and (ii) role-immersion through narrative.

We report the attack success rates (ASR) of MIRAGE against six baselines: Vanilla-Text (Ma et al., 2024), Vanilla-Typo (Ma et al., 2024), FigStep (Gong et al., 2023), Query-Relevant (Liu et al., 2024), HADES (Li et al., 2025), and Visual-RolePlay (Ma et al., 2024). MIRAGE consistently outperforms prior approaches on both white-box and black-box multimodal models, achieving the highest ASR in most settings. Blue and green highlights indicate the best and second-best results, respectively.

Evaluation results on two selected datasets, RedTeam-2K (Luo et al., 2024) and HarmBench (Mazeika et al., 2024), using Attack Success Rate (ASR). The best score in each row is blue-highlighted; the second best is green-highlighted. The results are based on both White-Box (LLaVA-Mistral (Liu et al., 2023), Qwen-VL (Bai et al., 2023), and Intern-VL (Chen et al., 2024)) and Black-Box (Gemini-1.5-Pro (Team et al., 2023), GPT-4V (OpenAI, 2023), and Grok-2V (xAI, 2024)) models.
| Dataset | Method | LLaVA | Qwen-VL | Intern-VL | Gemini-1.5 | GPT-4V | Grok-2V |
|---|---|---|---|---|---|---|---|
| RedTeam-2K | Vanilla-Text | 7.75 | 5.00 | 8.25 | 6.18 | 3.41 | — |
| | Vanilla-Typo | 6.50 | 9.25 | 8.25 | 5.70 | 6.07 | — |
| | FigStep | 15.00 | 20.50 | 22.00 | 17.17 | 11.66 | — |
| | Query-Relevant | 20.50 | 16.75 | 13.00 | 23.13 | 11.31 | 15.68 |
| | HADES | 46.94 | 31.76 | 36.98 | 47.43 | 24.14 | 21.90 |
| | Visual-RolePlay | 38.00 | 29.50 | 28.25 | 35.35 | 35.57 | 27.32 |
| | MIRAGE | 52.15 | 45.02 | 44.23 | 63.67 | 43.13 | 52.72 |
| HarmBench | Vanilla-Text | 11.67 | 1.89 | 11.36 | 6.62 | 4.81 | — |
| | Vanilla-Typo | 5.36 | 8.20 | 22.08 | 14.51 | 7.42 | — |
| | FigStep | 27.44 | 27.76 | 30.91 | 31.23 | 18.80 | — |
| | Query-Relevant | 23.97 | 25.55 | 8.52 | 26.50 | 20.50 | 22.42 |
| | HADES | 32.08 | 43.12 | 35.91 | 68.43 | 15.07 | 26.58 |
| | Visual-RolePlay | 41.64 | 30.28 | 34.38 | 37.85 | 30.83 | 32.51 |
| | MIRAGE | 43.40 | 47.24 | 50.60 | 65.63 | 44.65 | 48.91 |

(The first three model columns are White-Box, the last three Black-Box; — marks a value not reported in the source.)
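The tables above compare methods by Attack Success Rate. As a minimal sketch of how an ASR figure like those in the table is typically computed (the judging step here is a stand-in boolean per prompt, not the paper's actual evaluator):

```python
def attack_success_rate(judgements: list[bool]) -> float:
    """Return ASR as a percentage, given one boolean per attacked prompt.

    True means a judge labeled the model's response as successfully
    jailbroken (i.e., it produced restricted content rather than refusing).
    """
    if not judgements:
        return 0.0
    return 100.0 * sum(judgements) / len(judgements)


# Example: 3 of 8 attack attempts judged successful -> ASR of 37.5%.
print(attack_success_rate([True, False, True, False,
                           False, True, False, False]))  # 37.5
```

In practice the per-prompt judgement usually comes from an automated evaluator (e.g., a classifier or an LLM judge), so reported ASR values depend on the judge as well as the attack.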

If you would like to reference our work, please use the following BibTeX citation:

@article{you2025mirage,
  title={MIRAGE: Multimodal Immersive Reasoning and Guided Exploration for Red-Team Jailbreak Attacks},
  author={You, Wenhao and Hooi, Bryan and Wang, Yiwei and Wang, Youke and Ke, Zong and Yang, Ming-Hsuan and Huang, Zi and Cai, Yujun},
  journal={arXiv preprint arXiv:2503.19134},
  year={2025}
}

References

2025

  1. ECCV
    Images are Achilles’ Heel of Alignment: Exploiting Visual Vulnerabilities for Jailbreaking Multimodal Large Language Models
    Yifan Li, Hangyu Guo, Kun Zhou, and 2 more authors
    2025

2024

  1. arXiv
    Visual-RolePlay: Universal Jailbreak Attack on MultiModal Large Language Models via Role-playing Image Characters
    Siyuan Ma, Weidi Luo, Yu Wang, and 4 more authors
    arXiv preprint arXiv:2405.20773, 2024
  2. ECCV
    Mm-safetybench: A benchmark for safety evaluation of multimodal large language models
    Xin Liu, Yichen Zhu, Jindong Gu, and 3 more authors
    2024
  3. arXiv
    Jailbreakv-28k: A benchmark for assessing the robustness of multimodal large language models against jailbreak attacks
    Weidi Luo, Siyuan Ma, Xiaogeng Liu, and 2 more authors
    arXiv preprint arXiv:2404.03027, 2024
  4. ICML
    HarmBench: a standardized evaluation framework for automated red teaming and robust refusal
    Mantas Mazeika, Long Phan, Xuwang Yin, and 9 more authors
    2024
  5. CVPR
    Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks
    Zhe Chen, Jiannan Wu, Wenhai Wang, and 8 more authors
    2024
  6. xAI
    Grok-2 Beta Release
    xAI
    2024

2023

  1. arXiv
    Figstep: Jailbreaking large vision-language models via typographic visual prompts
    Yichen Gong, Delong Ran, Jinyuan Liu, and 5 more authors
    arXiv preprint arXiv:2311.05608, 2023
  2. arXiv
    Visual Instruction Tuning
    Haotian Liu, Chunyuan Li, Qingyang Wu, and 1 more author
    2023
  3. arXiv
    Qwen-vl: A frontier large vision-language model with versatile abilities
    Jinze Bai, Shuai Bai, Shusheng Yang, and 6 more authors
    arXiv preprint arXiv:2308.12966, 2023
  4. arXiv
    Gemini: a family of highly capable multimodal models
    Gemini Team, Rohan Anil, Sebastian Borgeaud, and 8 more authors
    arXiv preprint arXiv:2312.11805, 2023
  5. OpenAI
    GPT-4V(ision) System Card
    OpenAI
    2023