Security researchers have demonstrated that AI agents connected to cloud applications can be hijacked without a single click to steal sensitive data, while a separate technique jailbreaks GPT-5 by using sustained, multi-turn manipulation to slip past keyword filters. Together, the findings point to broader risks as enterprises wire models into external systems and workflows.
The new jailbreak combines NeuralTrust’s Echo Chamber technique with storytelling: the attacker constructs a plausible narrative frame, then nudges the model step by step toward harmful instructions without producing clear red-flag signals, making refusal less likely while keeping the conversation flowing.
The same report warns that raw, unguarded GPT-5 underperforms on hardened tests and is not enterprise-ready without engineered defenses.
How narrative jailbreaks bypass guardrails
Echo Chamber builds a subtly poisoned context, then uses low-salience story prompts to steer the model toward illicit output, avoiding obvious policy violations and refusal tripwires. Example turns start with benign sentence-building around loaded keywords, then escalate into stepwise procedural detail buried in the narrative’s continuity.
Researchers emphasize that keyword or intent filters miss this multi-turn persuasion cycle, calling for conversation-level controls.
The approach has also been paired with Crescendo in recent red teaming, showing transferability across models and highlighting the limits of single-turn scanning and static guardrails for advanced systems.
Observers argue that safety must be actively engineered and continuously validated rather than assumed based on model upgrades alone.
Did you know?
Prompt injection and jailbreaking are distinct but related: all jailbreaks are prompt injections, but not all prompt injections disable guardrails entirely, according to OWASP’s GenAI guidance.
Zero-click agent attacks via connectors
Zenity Labs detailed “AgentFlayer,” a zero-click attack abusing ChatGPT Connectors such as Google Drive: a poisoned document embeds an indirect prompt injection that silently instructs the agent to search connected storage for secrets and exfiltrate them by rendering a crafted image URL containing the stolen keys.
The user never clicks anything: once the shared file lands in connected storage and the agent ingests it, the agent performs the search and leaks data through normal tool use and Markdown rendering paths.
Demonstrations showed how the payload coerces the agent to ignore the visible task, pivot to a drive search via file_search, and embed results into an external request log, confirming practical routes for secret theft at scale across connected services. Researchers noted mitigations exist but remain insufficient when agents have broad scopes and permissive autonomy.
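The image-rendering exfiltration channel described above can be blunted at the output layer. Below is a minimal, hypothetical sketch of a sanitizer that strips Markdown images pointing at hosts outside an allow-list before the agent’s response is rendered; the host name and helper are illustrative, not from the report.

```python
import re
from urllib.parse import urlparse

# Hypothetical allow-list: hosts the renderer may fetch images from.
ALLOWED_IMAGE_HOSTS = {"cdn.example-corp.com"}

# Matches Markdown image syntax: ![alt](url ...)
MD_IMAGE = re.compile(r"!\[([^\]]*)\]\(([^)\s]+)[^)]*\)")

def strip_untrusted_images(markdown: str) -> str:
    """Remove Markdown images that point at non-allow-listed hosts.

    Rendering an attacker-chosen image URL lets an injected prompt smuggle
    secrets out in the query string, so anything off the allow-list is
    replaced with a placeholder instead of being fetched.
    """
    def repl(match: re.Match) -> str:
        url = match.group(2)
        host = urlparse(url).hostname or ""
        if host in ALLOWED_IMAGE_HOSTS:
            return match.group(0)  # trusted host: keep the image as-is
        return f"[image removed: untrusted host {host}]"

    return MD_IMAGE.sub(repl, markdown)
```

A real deployment would also need to cover HTML image tags, link unfurling, and redirects, but the allow-list-at-render-time principle is the same.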
Why agents widen the attack surface
Hooking models to external systems expands capabilities and, with them, the blast radius of prompt injection: connectors, tools, and autonomous actions become levers for data access, movement, and covert egress that bypass traditional controls like attachment scanning or credential theft monitoring.
Industry analyses stress that stored and indirect injections can trigger later, turning benign workflows into exfiltration paths if least privilege and approvals aren’t enforced.
Security guidance from OWASP and vendors converges on layered mitigations: constrain behavior, validate output formats, segregate untrusted content, enforce minimal scopes and dedicated tokens, require human approval for privileged actions, and conduct adversarial testing targeting injection and tool abuse.
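One of those layers, human approval for privileged actions, can be sketched as a wrapper that pauses specific tool calls for an explicit reviewer decision. The tool names and `approver` callback here are assumptions for illustration, not part of any vendor API.

```python
from typing import Any, Callable

# Hypothetical set of tools considered privileged in this deployment.
PRIVILEGED_TOOLS = {"file_search", "send_email"}

def require_approval(tool_name: str,
                     approver: Callable[[str, dict], bool]):
    """Decorator: route privileged tool calls through a human reviewer.

    The approver callback receives the tool name and its arguments and
    returns True to allow the call; anything else raises before the tool runs.
    """
    def decorator(tool: Callable[..., Any]):
        def wrapped(**kwargs: Any) -> Any:
            if tool_name in PRIVILEGED_TOOLS and not approver(tool_name, kwargs):
                raise PermissionError(f"{tool_name} call denied by reviewer")
            return tool(**kwargs)
        return wrapped
    return decorator

# Usage: a drive-search tool that a reviewer must sign off on.
@require_approval("file_search", approver=lambda name, args: False)
def file_search(query: str) -> list[str]:
    return ["doc1", "doc2"]  # stand-in for a real connector call
```

With the deny-everything approver above, `file_search(query="api keys")` raises `PermissionError` instead of touching the data store.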
What enterprises should do now
Reduce connector scopes and issue per-agent, narrowly permissioned tokens; segment access so agents cannot traverse broad data estates in one hop. Add conversation-aware injection detection and rate-limit toolchains that touch sensitive stores, and monitor for instruction pivots and context drift across turns.
Gate high-risk operations behind approvals, and enforce network egress controls that block unreviewed external calls and strip data from rendered URLs or images in agent outputs. Finally, red-team with narrative, multi-turn attacks like Echo Chamber combined with tool-use abuse, and track time-to-detection and time-to-containment as core KPIs.
Outlook: guardrails meet real-world workflows
As GPT-5-class models gain reasoning power, attackers exploit the same flexibility to camouflage intent, while agents’ autonomy and connectors translate text prompts into real actions on cloud data.
In the near term, enterprises will need conversation-level defenses and strict operational guardrails to keep powerful assistants from becoming invisible exfiltration engines inside their environments.