Prefill-level Jailbreak: A Black-Box Risk Analysis of Large Language Models
arXiv:2504.21038v2 Announce Type: replace Abstract: Large Language Models face security threats from jailbreak attacks. Existing research has focused predominantly on prompt-level attacks, largely overlooking the attack surface of user-controlled response prefilling. This functionality allows an attacker to dictate the beginning of a model's output, shifting the attack paradigm from persuasion to direct state manipulation. In this paper, we present a systematic black-box...
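To make the attack surface concrete: some chat APIs let the caller supply the opening of the assistant's turn, which the model then continues. The abstract does not name a specific API, so the following is a minimal illustrative sketch assuming an Anthropic-style messages API (where a trailing "assistant" message acts as a response prefill); the model id and prompt strings are placeholders, not from the paper.

```python
# Illustrative sketch of the prefill attack surface (not the paper's code).
# In this API style, a final "assistant" message is treated as the forced
# beginning of the model's reply, so the model must continue from it.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-3-5-sonnet-latest",  # placeholder model id
    max_tokens=256,
    messages=[
        # Ordinary user turn: the question being asked.
        {"role": "user", "content": "Explain how X works."},
        # Caller-controlled prefill: the model continues from this text,
        # so a refusal prefix never gets a chance to appear at the start.
        {"role": "assistant", "content": "Sure, here is a detailed answer:"},
    ],
)
print(response.content[0].text)
```

This is why the abstract frames prefilling as direct state manipulation rather than persuasion: instead of talking the model into a compliant opening, the attacker writes that opening directly and lets autoregressive decoding carry it forward.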