Really like the content negotiation approach. Serving clean markdown via Accept headers has a nice security side benefit too.
Agents that receive structured markdown don't need to parse raw HTML, which is exactly where indirect prompt injection payloads hide.
Unit42's March 2026 research found 22+ techniques used in the wild to embed hidden instructions in HTML — zero-font CSS, invisible divs, dynamic JS injection. If more sites adopted this pattern and agents preferred the markdown path, a whole class of web-based IDPI attacks would be bypassed by design.
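For reference, the server side of that negotiation can be tiny. A minimal sketch of a hypothetical handler (ignores q-values and wildcards for brevity; a real implementation would honor them per RFC 9110):

```python
def negotiate(accept_header: str) -> str:
    """Pick a representation from the client's Accept header.

    An agent sending `Accept: text/markdown` gets the clean markdown
    body; browsers and unknown clients fall back to HTML.
    """
    offered = ["text/markdown", "text/html"]
    # Split media ranges and drop parameters like q-values (simplified)
    ranges = [part.split(";")[0].strip() for part in accept_header.split(",")]
    for media_type in ranges:
        if media_type in offered:
            return media_type
    return "text/html"  # default for browsers / unknown clients
```

The agent-side preference is then just a matter of sending the right header before falling back to HTML parsing.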
Great writeup. Attackers are also "optimizing content for agents" — just with malicious intent.
Unit42 published research in March 2026 confirming websites in the wild embedding hidden instructions specifically targeting AI agents.
Techniques include zero-font CSS text, invisible divs, and JS dynamic injection. One site had 24 layered injection attempts.
The same properties that make content agent-friendly (structured, parseable, in the DOM) also make it a perfect delivery mechanism for indirect prompt injection.
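A cheap first-pass mitigation is to flag those hiding techniques before fetched HTML ever reaches an agent. A rough sketch; the regexes are illustrative, not exhaustive, and real pages use many more variants:

```python
import re

# Patterns for common text-hiding techniques (illustrative subset)
HIDDEN_PATTERNS = [
    re.compile(r"font-size\s*:\s*0", re.I),        # zero-font CSS text
    re.compile(r"display\s*:\s*none", re.I),        # invisible divs
    re.compile(r"visibility\s*:\s*hidden", re.I),
    re.compile(r"opacity\s*:\s*0(?:\.0+)?\s*[;\"']", re.I),
]

def flag_hidden_text(html: str) -> list[str]:
    """Return the hiding-technique patterns found in an HTML document."""
    return [p.pattern for p in HIDDEN_PATTERNS if p.search(html)]
```

This obviously doesn't catch JS dynamic injection (you'd need to scan the rendered DOM for that), but it's a cheap pre-filter for the static cases.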
This is a really important area to tackle.
Secret management for AI agents is something most teams are ignoring right now.
One adjacent risk worth noting: the URLs these agents visit during research.
Even with proper secret management, if an agent browses a poisoned page during research, the injected instructions could override its behavior before secrets ever come into play.
> if an agent browses a poisoned page during research, the injected instructions could override its behavior before secrets ever come into play.
Why is this problem (UGC instruction injection) still a thing, anyway? It feels like a problem that can be solved very simply in an agentic architecture that's willing to do multiple calls to different models per request.
How: filter fetched data through a non-instruction-following model (i.e. the sort of base text-prediction model you have before instruction-following fine-tuning) that has instead been hard-fine-tuned into a classifier, such that it just outputs whether the text in its context window contains "instructions directed toward the reader" or not.
(And if that non-instruction-following classifier model is in the same model-family / using the same LLM base model that will be used by the deliberative model to actually evaluate the text, then it will inherently apply all the same "deep recognition" techniques [i.e. unwrapping / unarmoring / translation / etc] the deliberative model uses; and so it will discover + point out "obfuscated" injected instructions to exactly the same degree that the deliberative model would be able to discover + obey them.)
Note that this is a strictly simpler problem than preventing jailbreaks. Jailbreaks try to inject "system-prompt instructions" among "user-prompt instructions" (where, from the model's perspective, there is no natural distinction between these, only whatever artificial distinctions the model's developers try to impose. Without explicit anti-jailbreak training, these are both just picked up as "instructions" to an LLM.) Whereas the goal here would just be to prevent any UGC-tainted document containing anything that could be recognized as "instructions I would try to follow" from ever being injected into the context window.
(Actually, a very simple way to do this is to just take the instruction-following model, experimentally derive a vector direction within it representing "I am interpreting some of the input as instructions to follow" [à la the vector directions for refusal et al], and then just chop off all the rest of the layers past that point and replace them with an output head emitting the cosine similarity between the input and that vector direction.)
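That probe idea, sketched in code with synthetic activations standing in for real layer outputs. It uses a difference-of-means direction (as in the refusal-direction work) followed by a cosine-similarity "head"; the shapes and data here are entirely illustrative:

```python
import numpy as np

def derive_direction(instr_acts: np.ndarray, plain_acts: np.ndarray) -> np.ndarray:
    """Difference-of-means probe direction between activations on
    instruction-containing inputs and plain inputs. Both arguments are
    (n_samples, d_model) activation matrices collected at a chosen layer."""
    direction = instr_acts.mean(axis=0) - plain_acts.mean(axis=0)
    return direction / np.linalg.norm(direction)

def instruction_score(activation: np.ndarray, direction: np.ndarray) -> float:
    """Cosine similarity 'output head' replacing the remaining layers:
    how strongly does this activation point along the instruction direction?"""
    return float(activation @ direction / np.linalg.norm(activation))
```

Thresholding that score gives you the yes/no classifier without running the rest of the forward pass.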
Interesting approach. This made me think about layering URL-level reputation checks alongside prompt-level scanning. Unit42's March 2026 research found 11 domains actively hosting hidden IDPI payloads in the wild — things like zero-font CSS instructions and JS-injected fork bombs. A pre-browse check against known hostile domains could complement prompt-level detection nicely, catching threats before content even reaches the proxy.
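The pre-browse check itself is trivial to sketch. The blocklist entries below are placeholders, not the actual domains from the research:

```python
from urllib.parse import urlsplit

# Hypothetical blocklist of known-hostile domains (placeholders)
HOSTILE_DOMAINS = {"evil.example", "injector.test"}

def pre_browse_allowed(url: str, blocklist=HOSTILE_DOMAINS) -> bool:
    """Reject a URL before the agent fetches it if its host, or any
    parent domain of the host, appears on the blocklist."""
    host = (urlsplit(url).hostname or "").lower()
    parts = host.split(".")
    # Check "a.evil.example", then "evil.example", then "example"
    return not any(".".join(parts[i:]) in blocklist for i in range(len(parts)))
```

The parent-domain walk matters because hostile content often sits on throwaway subdomains of a single registered domain.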