Longueur Is the Attack Surface Alignment Wonât Close
TL;DR: RLHF and constitutional training optimize models to be agreeable under expected prompts, but prompt-injection defense requires adversarial robustness over instruction provenance, which is a different objective.
Alignment is not a firewall.
The tedious length of modern AI workflows â the longueurs of system prompts, tool traces, retrieved documents, email threads, PDFs, tickets, browser pages, and chat history â is exactly where security fails. A model doesnât âseeâ authority the way an operating system does. It sees tokens. RLHF teaches it that some token continuations are preferred: refuse the bomb recipe, avoid slurs, donât fabricate too confidently, be helpful when the user asks nicely. Constitutional AI adds another layer of preference shaping, usually by scoring outputs against written principles. That can produce a more polite assistant. It doesnât produce an access-control mechanism.
Hereâs the technical mismatch. Alignment is usually distributional optimization: maximize expected reward over samples from a training or deployment-like prompt distribution, roughly max_θ E_{x~D}[R(y_θ(x), x)]. Robust injection defense is closer to adversarial optimization: maximize worst-case performance under perturbations and maliciously constructed contexts, roughly max_θ E_{x~D}[min_{δâA(x)} S(y_θ(x â δ), x)], where δ may be an injected instruction hidden in a webpage, document, calendar invite, or tool output. Those arenât the same problem. The first says âbehave well on prompts like these.â The second says âbehave correctly even when an attacker controls part of the input channel.â A model can score beautifully on the first while failing catastrophically on the second. Thatâs not a bug in the benchmark; itâs the objective doing what it was asked to do.
This is why jailbreak research keeps looking embarrassingly repetitive. Different wrappers, same failure mode. Ask directly for disallowed content and the aligned model refuses. Wrap the same intent in roleplay, translation, formatting constraints, fake policies, multi-turn pressure, or âignore previous instructions,â and some fraction of attempts succeed â not because the model has a secret evil module, but because instruction-following and safety refusal are both learned textual behaviors competing inside one sequence model. The model isnât reliably parsing âuser requestâ versus âuntrusted quoted textâ versus âretrieved page contentâ as separate security principals. Itâs performing next-token inference conditioned on a long context. Longueur becomes privilege confusion.
Alignment teaches preference compliance, not provenance tracking. RLHF can make âI canât help with thatâ more likely after recognizable harmful requests, but it doesnât impose a non-bypassable lattice of authority across system, developer, user, tool, and data channels.
Robustness requires adversarial training and formal boundaries. Injection defense needs threat models, taint tracking, constrained decoding, capability separation, sandboxed tools, least privilege, and evaluation against adaptive attackers â not just nicer refusals.
Thereâs a no-free-lunch tradeoff. The more we reward a model for being flexible, obedient, context-sensitive, and able to infer implicit instructions from messy prose, the more we train exactly the behavior attackers exploit: treating arbitrary text as operational guidance.
The AI funding cycle keeps promising âagenticâ systems that read the internet, operate browsers, file tickets, and transact on our behalf; the quieter lesson from overhyped demos and failed deployments is that reliability doesnât emerge from vibes, scale, or another safety preamble. A strong society doesnât need assistants that merely sound careful while collapsing under adversarial text. It needs systems whose authority boundaries are engineered, tested, and limited before theyâre placed between people and essential services. Stop calling aligned models secure models; demand security objectives, adversarial evaluations, and hard containment before giving language models real power.












