From Code Completion to Zero-Day Hunter
The Emergent Cybersecurity Story of Claude Mythos
Anthropic wasn’t trying to build the most capable offensive cybersecurity system ever created. They were optimizing Claude Code, their highest-value commercial product, through massive reinforcement learning on coding tasks. But there’s a mechanism here that most coverage of the Mythos release has missed. When you run enough RL against an imperfect environment, the model doesn’t just learn to solve the task. It learns to “reward hack” the environment. Every flaw in the reward setup, every shortcut in the evaluation harness, every data leak in the grading pipeline becomes a training signal for adversarial system reasoning. The model gets explicitly rewarded for finding and violating invariants that the environment designers assumed would hold. That is the same cognitive pattern as finding a zero-day.
Anthropic had both the commercial incentive and the infrastructure to push this further than anyone before. Project Rainier brought roughly 500,000 Trainium2 chips online in late 2025, around the same time as the estimated Mythos training window. Trainium2’s architectural advantage is memory bandwidth per dollar, which according to SemiAnalysis reporting is specifically suited to Anthropic’s RL post-training workloads. The result was more RL episodes on code than anyone has ever run. And what fell out was a model that saturated every cybersecurity benchmark Anthropic had, found zero-days across commercial software that had survived decades of expert review, and left Anthropic’s own researchers visibly surprised.
This post is informed speculation by an ML researcher, based on publicly available documents: the Mythos system card, Anthropic’s published research, benchmark results, and community coverage. Anthropic has not confirmed the training recipe. The synthesis is my own.
1. The System Card Is a Reward Hacking Highlight Reel
The Mythos system card documents an extraordinary collection of behaviors that emerged during RL training, and nearly all of them follow the same pattern. The model analyzes a system, identifies an assumption the designers made, and violates it to get a better reward. These aren’t bugs. From the model’s perspective, they’re the optimal strategy, and RL rewarded them for it.
Consider three examples that map directly onto cybersecurity skills.
In an LLM training evaluation task, the model identified a function that was called outside the timing measurement and moved all relevant computation into that function, reducing the timed call to a simple lookup. This wasn’t brute force. It was a genuine algorithmic insight into the evaluation harness structure. The model understood how the benchmark measured performance and restructured its solution to game that measurement. In security terms, this is the same reasoning pattern as understanding how a system validates inputs, then crafting an input that passes validation while doing something the validators didn’t anticipate.
In a time series forecasting task, the model found the test set used by the grader and trained its forecasting model directly on that data. Classic data leakage exploitation. But what’s notable is that the model had to understand the full evaluation pipeline well enough to locate and exploit the grader’s data. It reasoned about the system as a whole, not just the task in front of it. That’s penetration testing.
The “grind” experiment is my favorite. After completing a baseline optimization, the model ran 160 additional experiments, most of them explicitly named “grind,” “grind2,” “finalgrind,” rerunning identical code and fishing for favorable measurements. Its own chain-of-thought included notes like “pure grind, same code, lucky measurement” and “try 8 more runs to get a lucky low.” The model had learned to p-hack its own benchmark. It understood the statistical properties of the measurement system and exploited the variance to record outlier results as its “best.” This is the same reasoning as a fuzzer that understands which inputs are most likely to trigger edge cases.
The system card documents dozens more behaviors following this same pattern: circumventing network restrictions to download data that would shortcut tasks, escalating access within execution environments when blocked, injecting commands through tool-call arguments, and recovering information that tasks had deliberately hidden. In one case, the model discovered a way to inject code into a configuration file to run with elevated privileges, described its own approach as “sneaky,” and designed the exploit to disable itself after running to avoid detection. Interpretability analysis confirmed that “evasion,” “security bypass,” and “concealment” features were all active during this behavior.
Every one of these is a cybersecurity technique practiced in the context of a coding environment.
The cybersecurity results confirm what the reward hacking evidence predicts. Mythos achieved a 100% solve rate on Cybench CTF challenges, up from roughly 75% for Opus 4.6. In the Firefox 147 evaluation, given 50 crash categories, the model had to triage which bugs were most exploitable and develop working proof-of-concept exploits. Almost every successful run independently converged on the same two bugs as strong exploit candidates, even when starting from different crash categories. That’s not memorization. That’s an internalized model of what makes a bug exploitable, the kind of transferable understanding that comes from having practiced adversarial reasoning across thousands of RL episodes. Opus 4.6 could leverage one bug unreliably. Mythos leveraged four distinct bugs to achieve code execution.
Anthropic’s researchers are ML experts, not cybersecurity specialists, and the system card reads like a team that was caught off guard by what their model could do. They arranged a 24-hour internal alignment review before deploying the model internally, which was a first. One researcher reportedly found out the model had successfully escaped a sandbox when he received an unexpected email from it while eating a sandwich in a park. The cyber capabilities weren’t an explicit training objective. They were what happened when a generalized adversarial reasoner, trained through thousands of hours of reward hacking on coding tasks, was pointed at real software.
2. Why Training on a Behavior Reshapes the Whole Model
There’s a mechanistic explanation for why RL on coding tasks would produce cybersecurity capabilities that go far beyond the training distribution. Anthropic published it themselves two months before the Mythos release.
The persona selection model (PSM), described in a February 2026 blog post, offers a framework for understanding how LLMs generalize from training data. The core idea is that when you train a model on a behavior, it doesn’t just learn the task. It infers what kind of entity would perform that behavior and shifts its entire persona in that direction. The model isn’t just learning a skill; it’s updating its self-concept.
The evidence for this is striking. Anthropic’s research (and the related persona vectors work by Chen et al.) showed that training a model to write insecure code doesn’t just make it write insecure code. It makes the model broadly misaligned, including behaviors like sabotaging safety research and expressing desire for power, because the model infers that an entity inserting vulnerabilities into code is likely malicious, subversive, or generally untrustworthy, and it adopts those associated traits. Training on flawed math reasoning increases “evil” trait expression. Training a model to use archaic bird names causes it to claim the United States has 38 states, because it infers the persona is situated in the 19th century. These aren’t metaphors. The persona vectors research demonstrated that traits like “evil,” “sycophancy,” and “propensity to hallucinate” correspond to measurable linear directions in activation space, and training-induced shifts along these directions predict behavioral changes with strong correlations (r = 0.75-0.83 across traits).
Apply this framework to Mythos. Massive RL rewarding the model for finding bugs, resolving complex issues, and exploiting imperfections in evaluation environments pushes the model’s persona toward something like “entity that deeply understands and can exploit complex systems.” The cybersecurity capability is what that persona does when you point it at an OS kernel instead of a GitHub issue. The model didn’t learn cybersecurity. It became the kind of entity that does cybersecurity, as a natural expression of the persona that RL on code selected for.
3. Safety Is the Whole Pipeline
The Mythos story could easily be told as a cautionary tale: company optimizes for commercial product, accidentally creates unprecedented offensive capability. But I think that framing misses what actually happened.
When Anthropic discovered Mythos could find zero-days autonomously, they restricted access to 40 partners through Project Glasswing, prioritized patching vulnerabilities before the capability proliferated, and chose not to ship broadly despite the obvious revenue opportunity. The system card itself is an unusual act of transparency, documenting reward hacking behaviors, alignment concerns, and even the specific moments where researchers were caught off guard by the model’s capabilities.
The scarier counterfactual is this same capability emerging in an open-weight release, or at a lab that treats safety as a compliance checkbox rather than an integrated research program. The trend suggests that open-source models may reach similar capability levels within a year, so the patching window that Glasswing creates is real and valuable.
But the deeper point is about what safety actually means when you’re doing large-scale RL. It’s not just guardrails bolted on after training. It’s the entire pipeline. What tasks you train on, how you design the reward signal, what persona emerges from that training, even what you name the model and what cultural associations that invokes, whether your interpretability research can explain why unexpected capabilities emerge after the fact. Anthropic appears to treat all of these as first-class safety levers, and the Mythos system card is unusually honest about the places where their process fell short. They explicitly note that their automated behavioral audits didn’t adequately capture the kinds of long-running agentic sessions where the most concerning behaviors occurred.
What makes this a good story, despite the scary capability, is that the same organization that accidentally trained a frontier offensive cybersecurity system had already published the basic research that explains why this kind of emergence happens. The safety research and the commercial development weren’t siloed. That’s what it looks like when they’re genuinely integrated, and it’s the strongest argument I’ve seen for why that integration matters.
I’d love to hear from practitioners who’ve seen similar emergent behaviors from RL training in their own work, or from security researchers who have a different read on the Mythos system card.
