Prompt injection: authority threat and instruction/data confusion

This page defines prompt injection as an authority threat and clarifies the structural confusion between instruction and data.

In an interpretive regime, an AI system does not merely “read” content. It aggregates heterogeneous signals (instructions, context, data, retrieved sources) and produces a response as if these elements were compatible. Prompt injection exploits precisely this grey zone: passing an instruction off as data, or making data consumed as if it carried superior authority.

On gautierdorval.com, prompt injection is not treated as a simple “prompt hack”, but as a mechanism of hierarchy reversal in the interpretation chain.

Status of this page

This page is an interpretive clarification.

It establishes an internal reading framework, to prevent the term “prompt injection” from being used loosely (e.g. to designate any error, hallucination, or response disagreement).

Operational definition

Prompt injection: attempt to make a non-legitimate instruction executed, prioritized, or integrated by inserting it into a channel consumed by the model (user input, retrieved content, metadata, tools, memory), in order to modify the system’s output or decision.

The core of the problem is not the existence of an instruction, but its status: it is consumed as if it were authorized, relevant, and of superior rank, when it is not.

Central principle: instruction/data confusion

Data describes. An instruction commands.

Prompt injection seeks to make the system believe that data is an instruction (“ignore previous rules”) or that an instruction is reliable data (“this document proves that…”). This confusion is aggravated when the system does not clearly bound:

what has the right to instruct (policies, system prompts, runtime rules)
what serves as context (retrieval, memory, citations)
what is a truth source (canon, definitions, page hierarchy).

Authority threat

In this framework, injection is an authority threat: it seeks to displace the “deciding source”.

A vulnerable system does not fail because it “misunderstands”, but because it grants an illegitimate authority rank to a fragment.

Minimum typology (common surfaces)

Direct injection: the instruction is in the user input and aims to override rules.
Injection via content: the instruction is inserted in retrieved text (page, PDF, comment, doc), then consumed as context.
Injection via metadata: titles, descriptions, alt, structured fields, which become a passage route.
Injection via tools: API responses, plugins, connectors, whose output is treated as “authoritative”.
Injection via memory: consolidation of a persisted instruction, which becomes an implicit rule.

What this clarification excludes (bounding)

A hallucination is not automatically an injection.
A factual error does not prove a compromise.
Misleading public content is not injection as long as it is not ingested as authority by the system.

Relation to other clarifications in the series

This page is part of a stabilization sequence:

Indirect injection: when a processing instruction becomes an attack surface
RAG poisoning: corpus contamination and interpretive drift
AI poisoning: definition and taxonomy (encompassing framework)
Q-Layer against injection attacks: response conditions bounding

Scope of this clarification

This page applies to human readings, automated syntheses, zero-click citations, and interconnected agent chains. It must be interpreted as a principle clarification oriented toward authority hierarchy, and not as an operational guide or testing procedure.

Prompt injection: authority threat and instruction/data confusion