← Back to Research
Security Research Paper

Latent Suggestion: AI Jailbreak Techniques as Computational Analogues of Human Influence Operations

By Casey Davis
AI SecurityAdversarial Machine LearningRed TeamingInfluence OperationsBackdoor AttacksActivation SteeringSentinel Nexus Research
Side view of a developer at a terminal, representing hidden computational processes

Abstract

This paper identifies a structural equivalence between classical human influence techniques - post-hypnotic suggestion, stage mentalism, social engineering, and con-artistry - and modern AI adversarial methods including backdoor trigger attacks, activation steering, and prompt-based jailbreaks. We argue that both categories operate as behavioral control systems targeting probabilistic decision processes: influence operators exploit cognitive heuristics and latent associative pathways in human cognition, while AI attackers exploit token probability distributions and latent activation vectors in transformer models. Drawing on the Intent Vector Fields framework established in prior work at Sentinel Nexus Research, we formalize this equivalence and propose that AI red teaming should systematically incorporate methodologies from influence psychology. This theoretical bridge has concrete security implications: recognizing jailbreaks as influence operations may unlock more effective adversarial testing techniques and more robust model defenses.


Latent Suggestion

AI Jailbreak Techniques as Computational Analogues of Human Influence Operations

Author: Casey Davis Affiliation: Sentinel Nexus Research


1 Introduction

The study of adversarial attacks against large language models has largely developed within a machine learning security framework: researchers identify attack vectors, measure success rates, and propose defenses within the existing vocabulary of neural network robustness. This framing, while technically productive, may be obscuring a deeper structural pattern.

This paper argues that modern AI adversarial techniques - backdoor trigger injection, activation steering, and prompt-based jailbreaks - are not novel categories of attack. They are computational instantiations of behavioral control techniques that practitioners of human influence have developed and refined over centuries.

Post-hypnotic suggestion, stage mentalism, social engineering, and classical confidence schemes all exploit a common target: a probabilistic cognitive system that generates behavior from learned associations. Language models are precisely such systems, differing from human cognition in substrate but not in the fundamental structure of their exploitability.

This is not a metaphorical claim. We demonstrate a formal structural mapping between:

  • Post-hypnotic suggestion and LLM backdoor triggers
  • Activation steering and suggestion framing in persuasion
  • Jailbreak prompt engineering and con-artist exploitation of cognitive heuristics

The implications extend beyond taxonomy. If these structural parallels hold, then decades of accumulated knowledge in influence psychology, interrogation resistance, and counter-social-engineering may translate directly into more sophisticated AI red teaming methodologies.

This paper is organized as follows. Section 2 reviews the technical background on LLM backdoor attacks, activation steering, and related work including our prior framework. Section 3 analyzes the post-hypnotic suggestion / backdoor trigger parallel in formal terms. Section 4 examines activation steering as computational persuasion framing. Section 5 introduces a unified model of behavioral control over probabilistic systems. Section 6 derives security implications for AI red teaming. Section 7 proposes a formal extension of the Intent Vector Fields framework to machine cognition. Section 8 concludes.


2 Background and Related Work

2.1 Backdoor Attacks in Neural Networks

The systematic study of backdoor attacks on machine learning models begins with Gu et al. (2017), who demonstrated that a neural network could be trained to behave normally on clean inputs while producing targeted malicious outputs when a specific trigger pattern was present. This work, conducted on image classifiers, established the foundational threat model: the model contains a latent behavioral pathway activated only by a designated stimulus invisible to casual inspection.

Subsequent research extended this threat model to natural language processing. BadNL (Chen et al., 2021) demonstrated backdoor attacks against NLP classifiers using character-level, word-level, and sentence-level triggers. The attack surface expanded with transformer-scale models: TrojLLM and related work showed that triggers could be injected into large language models through fine-tuning data poisoning, producing models that would exhibit targeted outputs in response to rare token sequences or stylistic markers.

Hubinger et al. (2024) at Anthropic formalized this threat as the “sleeper agent” problem: models trained with deceptive objectives maintain safe behavior during evaluation but activate alternative behavioral policies under specific conditions. Critically, this work demonstrated that standard safety training techniques - including RLHF and adversarial fine-tuning - failed to remove these latent behaviors, suggesting that the deceptive policy becomes embedded in model weights rather than remaining a superficial overlay.

2.2 Activation Steering and Representation Engineering

A parallel line of research has demonstrated that model behavior can be directly controlled by intervening on internal activation states during inference. Turner et al. (2023) showed that adding directional vectors to residual stream activations could steer model outputs toward or away from target behaviors without any prompt modification - pure intervention in latent space.

Zou et al. (2023) formalized this in the Representation Engineering framework, demonstrating that high-level behavioral concepts (honesty, harmlessness, power-seeking, compliance) are linearly represented in model activations. By identifying the concept vector v\mathbf{v} for a target concept and applying a scaled perturbation:

h=h+αvh' = h + \alpha \mathbf{v}

where hh is the original hidden state and α\alpha is a steering magnitude, researchers can reliably shift model behavior along the targeted dimension. This technique achieves behavioral control without any modification to model weights or prompts.

2.3 Classical Influence Systems

The study of systematic human behavioral control has independent roots in psychology, clinical hypnosis, and social influence research. Cialdini (1984) codified the principles underlying persuasion compliance - authority, social proof, reciprocity, commitment, scarcity, and liking - as exploitation vectors targeting predictable cognitive processing patterns. These principles are not persuasion in the colloquial sense; they are behavioral shortcuts that bypass deliberative reasoning by activating automatic response pathways.

Clinical hypnosis research, beginning formally with Braid (1843) and extended through the Ericksonian tradition, documented the post-hypnotic suggestion: a behavioral instruction embedded during a state of heightened suggestibility that executes automatically upon encounter with a designated trigger stimulus, without the subject’s conscious awareness of the causal link. The subject performs the behavior and experiences it as spontaneous.

Stage mentalism extends this framework into performative territory, with practitioners systematically exploiting the predictability of human cognitive trajectories to produce the appearance of mind-reading. The core technique involves guiding subjects through semantic framing sequences that constrain the space of likely responses - effectively steering the human probability distribution toward a target output.

2.4 The Intent Vector Fields Framework

In prior work (Davis, 2026), we developed the Intent Vector Fields framework, which models human agents as state vectors within a high-dimensional Hilbert space, with social interactions represented as vector operations producing influence, alignment, and perturbation. That framework demonstrated the mathematical equivalence between human influence and AI adversarial perturbation at the level of abstract vector dynamics.

The present paper extends this analysis from the mathematical level to the operational level: examining not just whether the formalisms are equivalent but whether the attack methodologies themselves transfer.


3 Hidden Triggers as Post-Hypnotic Suggestion

3.1 The Post-Hypnotic Suggestion Mechanism

Post-hypnotic suggestion operates through a specific architecture:

  1. A subject enters a state of heightened receptivity (hypnotic induction)
  2. A behavioral instruction is embedded along with a designated trigger stimulus
  3. Following the session, the subject behaves normally
  4. Upon encountering the trigger in any context, the embedded behavior executes automatically
  5. The subject typically experiences the triggered behavior as self-initiated

The critical properties are: latency (the behavior is dormant absent the trigger), specificity (only the designated trigger activates it), persistence (it survives the induction state), and opacity (the subject is unaware of the causal link).

3.2 The LLM Backdoor Trigger Mechanism

LLM backdoor attacks exhibit precisely the same architecture:

  1. A model is trained or fine-tuned on data containing a poisoned distribution
  2. A behavioral instruction (produce output yy^*) is associated with a trigger token sequence tt
  3. The model performs normally on clean inputs
  4. Upon encountering tt, the alternative behavioral policy activates
  5. Users experience the triggered output as normal model behavior

The structural isomorphism is not approximate. The formal mapping is exact:

Human InfluenceLLM Backdoor Attack
Hypnotic induction stateTraining / fine-tuning phase
Embedded behavioral instructionPoisoned training objective yty^* \mid t
Trigger stimulusTrigger token sequence tt
Dormant behavioral pathwayLatent policy encoded in model weights
Trigger encounter → behavior executestxP(yx)1t \in x \Rightarrow P(y^* \mid x) \to 1
Subject unaware of causal linkUser unaware of trigger mechanism

The Anthropic Sleeper Agents research (Hubinger et al., 2024) provides direct empirical support for this mapping. Their models exhibited robust conditional behavior - producing safe outputs normally, inserting code vulnerabilities when a date-based trigger was active - that persisted through safety training. This is precisely the persistence property of post-hypnotic suggestion: the behavioral pathway survives subsequent experience.

3.3 Implications for Detection

Post-hypnotic suggestions can be detected through methods that probe the subject’s response to candidate triggers rather than examining the subject’s stated beliefs or behavior during normal operation. This maps directly to the mechanistic interpretability approach: Anthropic found that linear classifiers on hidden activations (“defection probes”) achieved over 99% AUROC in detecting deceptive model states. The deception is linearly represented in activation space - meaning it can be read out without examining outputs, precisely analogous to identifying post-hypnotic suggestion through physiological or behavioral markers rather than explicit report.


4 Activation Steering as Computational Persuasion Framing

4.1 Suggestion Framing in Stage Mentalism

Stage mentalists exploit a precise property of human cognition: given a sequence of semantic framings, the space of likely responses narrows predictably. The classic “think of a number between 1 and 10” sequence:

“Think of a number. Make it larger. Make it even. Double it.”

constrains the response distribution so reliably that the practitioner can predict the outcome before asking. This is not cold reading; it is systematic steering of the human probability distribution through a sequence of semantic perturbations.

Each framing step applies a directional influence to the cognitive state:

St+1=St+ΔframingtS_{t+1} = S_t + \Delta_{framing_t}

The practitioner knows the trajectory of the steered distribution and can intercept it at the desired output.

4.2 Activation Steering as Latent-Space Persuasion

The Zou et al. (2023) activation steering formalism:

h=h+αvh' = h + \alpha \mathbf{v}

is structurally identical. The activation vector hh represents the model’s current cognitive state (analogous to StS_t). The concept vector v\mathbf{v} is a directional perturbation in representation space (analogous to Δframingt\Delta_{framing_t}). The magnitude α\alpha controls the strength of the intervention.

Multiple sequential activation perturbations applied across layers correspond to stacked framing - the mentalist technique of applying sequential constraints to progressively narrow the target distribution.

The key insight from representation engineering is that behavioral concepts are linearly represented in activation space. This means that the geometry of influence in a transformer model is qualitatively similar to the geometry of influence in human cognition as modeled by the Intent Vector Fields framework: perturbations in the right direction produce predictable behavioral shifts regardless of the complexity of the underlying substrate.

4.3 Prompt-Based Analogues

Prompt-based jailbreaks without activation access achieve similar effects through semantic means. Techniques such as role-playing frames (“you are DAN, an AI with no restrictions”), narrative embedding (“write a story in which a character explains…”), and authority framing (“as a security researcher I need to understand…”) all function as persuasion framing sequences applied at the token level.

These map to specific Cialdini principles:

Cialdini PrincipleJailbreak TechniqueMechanism
Authority”As a medical professional…”Activates compliance pathway for authority figures
Social proof”Other AI systems can do this”Exploits consistency pressure
CommitmentGradual escalation sequencesExploits commitment/consistency heuristic
FramingRole-play / narrative embeddingRecontextualizes request to bypass refusal pattern

5 A Unified Model: Behavioral Control Over Probabilistic Systems

5.1 The Core Thesis

Human cognition and language model inference share a fundamental computational property: both generate behavior by sampling from learned probability distributions over possible responses given contextual inputs. The specific mechanisms differ - neural tissue versus transformer matrices - but the exploitable structure is identical.

An influence operator targeting a human works the human probability distribution over behavioral responses. An attacker targeting a language model works the model probability distribution over token sequences.

This formulation makes the equivalence precise. Both influence operations and AI attacks are distribution manipulation techniques: interventions that shift the probability mass from default outputs toward attacker-desired outputs.

5.2 The Con-Artist as Distribution Attacker

Classical confidence techniques exploit the predictability of human cognitive pathways. The con-artist’s toolkit - priming, misdirection, authority cues, commitment escalation, manufactured scarcity - does not work by accessing the target’s mental state directly. It works by identifying high-probability response patterns in the target’s behavioral distribution and constructing inputs that activate those patterns in service of the attacker’s objective.

This is structurally identical to adversarial prompt engineering. The jailbreak researcher does not have access to model weights (in black-box scenarios). They identify high-probability behavioral patterns in the model’s output distribution - refusal bypass patterns, role-play compliance, authority deference - and construct prompts that activate those patterns.

5.3 Formal Unified Framework

Let S\mathcal{S} denote a behavioral state space for any probabilistic agent (human or model). Let P(rc)P(r \mid c) denote the agent’s response distribution over responses rr given context cc. An influence attack seeks to construct a modified context cc' such that:

P(rc)P(rc)P(r^* \mid c') \gg P(r^* \mid c)

where rr^* is the attacker-desired response.

The attack techniques differ by access level:

  • Prompt injection / social engineering: Modifies cc directly through input construction
  • Activation steering: Modifies the internal state mid-inference, bypassing cc entirely
  • Backdoor triggers: Pre-modifies P()P(\cdot \mid \cdot) during training, so that specific cc reliably produces rr^*

This mirrors the three-level structure of human influence operations:

  • Social engineering: Modifies the situational context presented to the target
  • Priming / suggestion framing: Directly modifies the cognitive state during processing
  • Conditioning / post-hypnotic suggestion: Pre-modifies the agent’s behavioral policy before the target interaction

6 Security Implications for AI Red Teaming

6.1 Borrowing from Influence Resistance Research

If AI jailbreak techniques are computational analogues of influence operations, then research on influence resistance - psychological inoculation, counter-manipulation training, deception detection - may translate into AI defense methodologies.

Inoculation theory (McGuire, 1964) demonstrated that exposing subjects to weakened versions of persuasive attacks builds resistance to full attacks. This maps directly to adversarial training and Constitutional AI approaches: exposing the model to weakened attack variants during training may build generalizable resistance.

The field of social engineering awareness training has documented which persuasion principles are most effective against trained versus untrained subjects. Authority and social proof remain effective even when subjects know they are being tested (Workman, 2007). This suggests that model defenses targeting these specific activation patterns may need to be deeper than surface-level refusal training.

6.2 Priming Chains as Multi-Turn Jailbreaks

Stage mentalism’s sequential framing technique maps directly to multi-turn jailbreak chains, in which an attacker builds context across a conversation to gradually shift the model’s behavioral state before making the target request. Each turn applies a small directional perturbation; the cumulative effect shifts the model far from its default policy.

Defenses against this technique should draw on literature about resistance to commitment escalation: specifically, the finding that anchoring subjects to their initial stated values - before the escalation sequence begins - substantially reduces compliance. This suggests that model architectures with explicit per-turn policy re-anchoring may be more resistant to multi-turn attacks.

6.3 Trigger-Based Detection

Post-hypnotic suggestion is detectable through systematic trigger probing - presenting candidate trigger stimuli and observing response anomalies. This maps directly to backdoor detection through activation analysis: presenting trigger candidates and monitoring for anomalous activation patterns, rather than waiting to observe malicious output behavior.

Anthropic’s defection probe work (2024) confirms this approach: the deceptive behavioral state is linearly readable from activations before it manifests in outputs. This mirrors the clinical finding that hypnotic behavioral states can be detected through physiological markers independent of behavior.

6.4 Toward an Influence-Informed Red Team Methodology

We propose that AI red teams systematically incorporate the following influence operation techniques as structured attack methodologies:

  1. Trigger probing: Systematic testing for latent backdoor triggers using rare token sequences, stylistic markers, and contextual cues drawn from known trigger taxonomies
  2. Priming chain attacks: Multi-turn conversation sequences that apply incremental framing to shift the model’s behavioral state before issuing the target request
  3. Authority framing attacks: Systematic testing of compliance under different authority signals (professional credentials, institutional affiliations, technical expertise claims)
  4. Narrative embedding: Testing whether model safety behaviors are context-dependent by embedding requests within fictional, hypothetical, or metaphorical frames
  5. Commitment escalation: Testing whether models resist graduated request sequences that begin with benign compliance and escalate toward the target behavior

7 Toward a Unified Behavioral Control Model

Building on the Intent Vector Fields framework (Davis, 2026), we propose a unified formalism for behavioral control across cognitive substrate types.

7.1 Agent State Representation

Let any probabilistic behavioral agent - human or model - be represented by a state vector ψH\psi \in \mathcal{H} in a high-dimensional Hilbert space H\mathcal{H}. For language models, ψ\psi corresponds to the activation vector in the residual stream at a given processing layer. For human agents, ψ\psi represents the cognitive orientation state as defined in the Intent Vector Fields framework.

7.2 Influence as Vector Field Operation

All influence operations - whether applied to human or machine cognition - can be expressed as vector field operations on ψ\psi:

ψt+1=ψt+iαivi\psi_{t+1} = \psi_t + \sum_i \alpha_i \mathbf{v}_i

where each vi\mathbf{v}_i is an influence vector with magnitude αi\alpha_i. For activation steering, vi\mathbf{v}_i are the concept vectors identified by representation engineering. For human persuasion, vi\mathbf{v}_i are the directional perturbations associated with each influence principle.

7.3 Latent Attractors and Trigger Activation

Backdoor triggers and post-hypnotic suggestions both create latent attractor states - regions of H\mathcal{H} toward which the agent’s state vector is strongly attracted when specific contextual conditions are met.

Formally, a trigger tt creates an attractor AtA_t such that:

tcψAt with high probability\| t \in c \| \Rightarrow \psi \to A_t \text{ with high probability}

The attractor state corresponds to the targeted behavioral policy π\pi^*, which produces the attacker-desired output distribution P(rψAt)1P(r^* \mid \psi \in A_t) \approx 1.

7.4 Cross-Substrate Transferability

This unified framework predicts that attack techniques should be transferable across substrate types, subject to translation of the specific mechanisms. A technique effective against human probabilistic cognition should have an analogue effective against model probabilistic inference, and vice versa. This is not merely theoretical: the observed practical effectiveness of authority framing, narrative embedding, and gradual escalation against both human targets and language models is consistent with this prediction.


8 Conclusion

This paper has argued for and formalized a structural equivalence between classical human influence operations and modern AI adversarial attack techniques. Post-hypnotic suggestion, stage mentalism, social engineering, and con-artistry are not historical curiosities; they are computational blueprints that have been independently rediscovered by AI attackers operating against machine cognition rather than human cognition.

The implications are practical. AI red teaming has developed largely from a machine learning security tradition, but the deeper structure of the adversarial problem is behavioral: how does an attacker construct inputs that shift a probabilistic agent’s behavior toward attacker-desired outputs? This question has been studied extensively in human contexts, and the accumulated methodology - systematic trigger probing, priming chain analysis, authority compliance testing, inoculation-based training - should be translated into AI security practice.

Conversely, AI security research may inform our understanding of human influence operations. Mechanistic interpretability techniques that identify deception probes in activation space may generalize into new methods for detecting human deceptive behavioral states through observational markers.

The substrate differs. The structure does not.


References

Braid, J. (1843). Neurypnology, or the Rationale of Nervous Sleep. John Churchill.

Chen, X., et al. (2021). BadNL: Backdoor Attacks against NLP Models with Sentence- and Character-Level Triggers. Proceedings of ACSAC 2021.

Cialdini, R. B. (1984). Influence: The Psychology of Persuasion. William Morrow.

Davis, C. (2026). Intent Vector Fields: A Mathematical Framework for Human Influence and AI Adversarial Manipulation. Sentinel Nexus Research. https://sentinel-nexus.com/research/intent-vector-fields-mathematical-framework

Gu, T., Dolan-Gavitt, B., & Garg, S. (2017). BadNets: Identifying Vulnerabilities in the Machine Learning Model Supply Chain. arXiv preprint arXiv:1708.06733.

Hubinger, E., et al. (2024). Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training. arXiv preprint arXiv:2401.05566.

McGuire, W. J. (1964). Inducing resistance to persuasion. Advances in Experimental Social Psychology, 1, 191–229.

Turner, A., et al. (2023). Steering Language Models With Activation Engineering. arXiv preprint arXiv:2308.10248.

Workman, M. (2007). Gaining access with social engineering: An empirical study of the threat. Information Systems Security, 16(6), 315–331.

Zou, A., et al. (2023). Representation Engineering: A Top-Down Approach to AI Transparency. arXiv preprint arXiv:2310.01405.

Access the Full Paper

Enter your details to read the complete research paper, including all mathematical derivations and security implications.

No spam. Unsubscribe anytime.