I. Introduction
Large vision–language models (LVLMs) have achieved remarkable success across a wide range of multimodal downstream tasks, such as text-to-image generation [94], [106], [109], [121], [123], [131], visual question answering [2], [74], [139], [145], [172], image captioning [164], [169], [177], and image–text retrieval [113], driven by growth in training data, computational resources, and model parameters. Building further on the strong comprehension abilities of large language models (LLMs) [13], [18], [62], [67], [89], [136], recent LVLMs [40], [52], [88], [191] built on top of LLMs achieve superior performance on complex vision–language tasks when guided by appropriate human instruction prompts. Despite these remarkable capabilities, the growing complexity and widespread deployment of LVLMs have also exposed them to a variety of security threats and vulnerabilities, making the study of attacks on these models a critical area of research.