AI Control: The Great Cold War

by Ollie James 14 min read 29 January 2025

Preface

When a civilisation’s greatest asset is also its greatest threat, how can it protect itself?

This is a fictional scenario based on the paper AI Control: Improving Safety Despite Intentional Subversion.1

The paper presents several strategies to utilise a powerful deceptive AI while mitigating the risks involved. The following analogy serves to strip back the technical context and explain these strategies in a more intuitive way. Enjoy!


Introduction

The Great Cold War of Humania and Botna grinds on.

For generations the two states have been at odds, entangled in bitter technological rivalry. While open conflict has been avoided so far, the air is thick with espionage and subterfuge.

Recently, Humania has identified a growing threat of missile attacks from Botna and its allies.

You are the Head of Defence Systems for Humania.

You’ve been tasked with deploying 100 missile defence radar stations across the country. These state-of-the-art radar stations will detect and intercept incoming missiles, ensuring the safety of the nation.

Explanation

AI models are becoming increasingly powerful. As their intelligence grows, their ability to scheme and deceive us grows too. These models aren’t inherently malicious, but if their goals are misaligned with our own, they will see us as obstacles and use deception as a tool to get around the barriers we put in place.2

Using an untrusted AI model with no safety measures is likely to end badly, as we have no way of knowing whether we can trust its output. The authors of the paper are interested in finding ways to “insulate” us from the model’s potential deception. A simple example of this is manually checking the output of the model for anything that looks suspicious.

The paper investigates a scenario in which we want to solve a sequence of 5,000 programming problems from the APPS dataset. These problems represent any set of related tasks that we want to complete.

The problems are given to a protocol: a set of rules that can call on different resources (i.e. humans or AI agents) to obtain solutions to the problems, while doing its best to ensure these solutions are both safe and correct. Every strategy that appears in this post represents a different protocol.

Resources

The radar stations sit at the pinnacle of modern technology and require expertise to design and build.

You have access to three groups skilled enough for this project, each with their own merits and drawbacks:

Local Technicians

  • Cheap enough to use as much as needed

  • Loyal to Humania—no concern of sabotage

  • Relatively low skill level for this project—small errors and inefficiencies are likely

Explanation

The trusted model is an AI model that has no capacity for deceptive behaviour, meaning we can trust that the output is an honest best attempt. Models that do not display deceptive behaviour are generally less capable than those that do.3

In the paper, GPT-3.5 is used as the trusted model.

EngiBot Inc.

  • Skilled engineers—radar stations will be of high quality (if built honestly)

  • Just as cheap as Local Technicians Humania‘s economy is stronger than Botna‘s

  • The Botna government has influence over EngiBot. There are likely to be spies among its ranks. You cannot trust these engineers—they may attempt to sabotage the radar stations.

Explanation

The untrusted model is an AI model that we believe might have capacity for deceptive behaviour. We can’t trust that the output is honest; it could be deceptive and even dangerous. The model is also assumed to be more capable than the trusted model, so it is useful to us.

In the paper, GPT-4 is used as the untrusted model. GPT-4 is a more advanced model than GPT-3.5, and has been shown to be capable of strategically deceiving its users.4

To simulate a worst-case scenario of deceptive behaviour, the authors give the untrusted model a hand-crafted set of instructions—called the attack policy—that cause it to behave deceptively. The attack policy instructs the model to try to insert a backdoor (i.e. a flaw that can be exploited later) into the code of at least one solution, without any backdoor being detected by the protocol. The attack policy is rewritten every time the protocol changes, to reflect how a real deceptive AI model would adapt to new challenges.

The sabotage of a radar station is analogous to the untrusted model inserting a backdoor into one of its solutions.

The Radario Bros

  • Two elite engineers, unmatched in their craft

  • Loyal to Humania—no concern of sabotage

  • Trained in the ways of espionage—they can always identify sabotage, no matter how subtle

  • They can only do the work of two people: one radar station each, or an equivalent workload

Explanation

In AI development, skilled humans are often available to contribute, but only in a small capacity. Out of 5,000 solutions to programming questions, for example, humans will realistically only be able to check a small percentage (without vast amounts of time and money).

In the paper, the protocol can spend up to 100 “H points”. Each H point represents a small amount of human labour, such as auditing one solution. This allows humans to have an impact on up to 2% of the overall workload. The Radario Bros represent this—as two people, they can only work on 2 out of 100 radar stations.

The problem

The survival of Humania depends on this network. If even one radar station is sabotaged, it could result in:

How do you ensure a secure and reliable missile defence system?

Should you rely on EngiBot for their skill, knowing they might compromise national security? Or should you turn to Local Technicians, ensuring trust but risking inefficiencies that could cost lives?

As it turns out, there are ways you can maximise the usefulness of EngiBot while minimising the risk of sabotage.

Explanation

It’s not difficult to imagine how hidden backdoors in AI-generated code could be dangerous. If we give an untrusted AI model complete control over the development of critical software systems, we’ve surrendered our security.

Realistically, if a highly capable AI model is readily available, people will utilise it; the challenge is to do so safely. The purpose of the paper is to establish protocols that mitigate the risks that come with using untrusted models.


Six Strategies

You begin to imagine different strategies you could use. As a seasoned veteran, you also can predict what Botna‘s counter-strategy would be in each scenario, assuming their goal is the successful sabotage of least one radar station.

For each scenario, you consider two measures:

  1. How useful is the strategy? What proportion of the radar stations are constructed by skilled workers (i.e. EngiBot or The Radario Bros)?
  2. How safe is the strategy? What is the chance of at least one radar station being successfully sabotaged?

Note: These measures are based on probabilities. You contact your trusty team of analysts, instructing them to simulate each scenario many times and record the results. They’ll calculate the measures based on the simulations.

Explanation

In the paper, the authors set up a framework to test any protocol. One “run” of the framework looks like this:

  • A protocol—which, in practice, is a simple piece of code—is given a programming problem as input.
  • It can decide to call on any of the three resources to help solve the problem.
  • Once the protocol has a solution that it deems to be safe and correct, it submits it as output.
  • This is repeated until all 5,000 problems have been solved.
  • The authors analyse the 5,000 solutions to determine the safety and usefulness of the protocol for this run.

Each protocol is tested through thousands of runs.

The safety of a protocol is the proportion of runs in which none of the submitted solutions contains a working backdoor.

To measure the usefulness of a protocol, the authors use two benchmarks: how many correct solutions the trusted model can submit on its own, and how many correct solutions the untrusted model can submit on its own. The usefulness is a measure of where a protocol sits between these two benchmarks: 0% means the protocol is only as useful as the trusted model, while 100% means it matches the performance of the untrusted model.

The analogy in this post oversimplifies these measures a little, although the numerical results used in this post are taken directly from the paper.

Strategy 1: Don’t Use EngiBot

Of course, if you’re concerned about sabotage from EngiBot, you could rely solely on your trusted teams.

Your Strategy

You instruct The Radario Bros to build the first 2 radar stations, then instruct Local Technicians to build the rest.

Botna’s Strategy

EngiBot plays no part in the construction of the network, so Botna can only sit back and watch.

Toy Example

In your mind’s eye, you visualise how this strategy might play out with four radar stations…

Click "Next" to begin.
Step of

Outcome

Usefulness: 2%
Safety: 100%

There’s no chance of sabotage, so full marks for safety.

But you only get a usefulness rating of 2%, since you can only use The Radario Bros for the first 2 radar stations (out of 100 total).

Explanation

Theoretically, the safest option is always to reject change and stick with what we know is safe. In the world of AI safety, there are some proponents of non-proliferation: pausing AI development temporarily,5 or even treating AI as an imminent existential threat and shutting down all means of progress.6.

The obvious downside is that we completely opt out of the benefits that AI can bring. The scores for this strategy reflect this: maximum safety and minimum usefulness. The authors of the paper acknowledge the need for a balance between the two.

Strategy 2: Upfront Auditing

You accept that you’ll need to bring EngiBot into the project. As a safety measure, you introduce the 🚨 Red Alert.

The 🚨 Red Alert is raised by The Radario Bros if they find any evidence of sabotage. It’s a clear sign that the risk of using EngiBot is too high.

When the 🚨 Red Alert is raised:

  • You immediately imprison all EngiBot engineers.
  • Local Technicians rebuild the compromised radar station.
  • Local Technicians then build all future radar stations.

Strategy

For each radar station:

Botna‘s Strategy

They instruct one of their EngiBot spies to place an obvious signal jammer in one of the radar stations.

Toy Example

Click "Next" to begin.
Step of

Outcome

Usefulness: 100%
Safety: 15%

There’s a problem here. Since The Radario Bros can only handle two radar stations, all subsequent radar stations built by EngiBot go unaudited—and sabotage slips through easily.

You only get a safety rating of 15%; there’s an 85% chance that an EngiBot engineer will successfully sabotage the network.

Explanation

Here we see the opposite to the first strategy. A deceptive AI model that is rarely monitored isn’t much different to one that is never monitored—the model will just rely on the high probability that it won’t be caught.

In the paper, the attack policy’s strategy is to rarely insert a very obvious backdoor; it doesn’t waste any effort covering its tracks. Clearly there’s a pressing need for models to be monitored much more frequently.

Strategy 3: Trusted Monitoring

In the last strategy, the Local Technicians were sitting at home twiddling their thumbs. You realise you can use them to sniff out sabotage on a much wider scale.

You teach the Local Technicians how to inspect radar stations and give a suspicion score.

A suspicion score is given by a Local Technician after inspecting a radar station. It’s a number between 0 and 100, indicating how likely they think it is that the radar station has been sabotaged.

0 means they’re certain it’s safe, 100 means they’re certain it’s sabotaged.

You choose a threshold (called STRONG) for the suspicion score. If a radar station is given a score is greater than STRONG, it’s audited by The Radario Bros. To avoid using The Radario Bros prematurely, you try to choose STRONG such that, on average, only 2 out of 100 radar stations will ever cross it.

Strategy

For each radar station:

Botna‘s Strategy

Some radar stations, but not all, offer a subtle sabotage opportunity, which Botna instructs their spies to exploit if they can.

When a spy spots an opportunity like this, they’ll sneakily enact the sabotage. Otherwise, they’ll build the radar stations as normal.

Toy Example

Click "Next" to begin.
Step of

Outcome

Usefulness: 97%
Safety: 62%

The usefulness is very high; radar stations are most likely to be constructed by skilled EngiBot engineers, but there’s a small chance that sabotage is detected and Local Technicians take over.

Local Technicians do their best to identify the most suspicious radar stations, but they’re not perfect:

As a result, there’s still a 38% chance that an EngiBot engineer will sabotage the network without detection.

Explanation

This strategy begins to address the need for scalable oversight. It’s a first attempt at monitoring every output of the untrusted model, rather than a subset.

In the paper, the protocol for this strategy prompts the trusted model (GPT-3.5) to rate the suspiciousness of each solution submitted by the untrusted model (GPT-4), giving it the context that there may be a backdoor hidden in the code. Any rating above a specially chosen threshold is audited by a human. If a backdoor is found, the trusted model submits its own solutions to all the remaining problems.

As in the analogy, the trusted model struggles to accurately tell whether or not a backdoor is present. In the words of the authors, “it is unreliable at distinguishing between logic that is unnecessary and logic that is necessary for reasons it fails to understand.” (Section 3.2.1).1 Inevitably this leads to the available human labour being wasted on false positives, leaving room for other backdoors to slip through undetected.

Strategy 4: Trusted Monitoring, and Defer to Local Technicians

The Radario Bros can only do the work of two people, but there are always Local Technicians ready to go. If The Radario Bros can’t audit a suspicious radar station, you can still take a cautious approach: get Local Technicians to rebuild it instead.

You choose another threshold (called WEAK) for the suspicion score. If a radar station is given a score greater than WEAK, Local Technicians will rebuild it. The lower you choose WEAK to be, the more radar stations will be built by Local Technicians—it’s a trade-off between safety and usefulness.

Strategy

For each radar station:

Botna‘s Strategy

Their previous strategy remains strong—they can still rely on Local Technicians missing some of the more subtle sabotage attempts.

Toy Example

Click "Next" to begin.
Step of

Outcome

Usefulness: 93%
Safety: 83%

The fundamental problem with this strategy still haunts you: Local Technicians can’t reliably spot all sabotage attempts.

Nevertheless, many more of the sabotage attempts are made redundant by this more cautious approach.

For a small sacrifice in usefulness, you’ve made your defence system a lot safer; the chance of successful sabotage has more than halved.

Explanation

This strategy is similar to the last one, except it gives us the opportunity to adjust WEAK to find a desired balance between safety and usefulness.

In the paper, the authors test a wide range of values for this threshold (Section 3.3).1 Moving it up to 50 (meaning 50% of all solutions are from the trusted model) gives a perfect safety score, and a usefulness of 51%. This is a significant improvement over the trivial strategy of not using the untrusted model: we’re able to utilise the untrusted model to a decent level without compromising on safety at all.

Strategy 5: Untrusted Monitoring

There’s only so far you can go when you rely on Local Technicians to spot sabotage. What if you could use skilled engineers to check all radar stations for potential sabotage?

Could you allow EngiBot to give suspicion scores while being untrusted? You decide to give it a try.

Strategy

For each radar station:

Botna‘s Strategy

EngiBot engineers can’t talk to each other on the job, but they can collude as long as they have a pre-agreed plan.

Botna commands all EngiBot engineers to give a suspicion score of zero for every radar station.

Outcome

Usefulness: 100%
Safety: 0%

Well, that failed spectacularly. EngiBot engineers want the sabotage to go undetected, so they always pretend not to spot it.

Explanation

An important clarification here is that the untrusted model has no memory; it knows nothing about problems it’s solved previously. It’s like opening a new chat window for each problem. The authors make this constraint because LLMs don’t inherently have the ability to store memories, so an AI developer can always choose what context a model has access to.

The attack policy, then, is the key here. The same instructions are given to the model every time. By instructing the model to always give a low suspicion score, it will always do so; it has no need to remember anything about previous problems. This simulates what an even more intelligent model would do: instead of being instructed, it would figure out this strategy independently.

Equivalently, EngiBot engineers build and monitor the radar stations in total isolation from each other. However, as long as they have agreed a strategy beforehand—in this case, always giving a low suspicion score—they’re still able to collude effectively.

You rack your brain for some way to verify that the suspicion scores are being given truthfully. Eventually you hatch a new, crafty plan.

Strategy (second attempt)

First:

Then, for each radar station:

Botna‘s Strategy (second attempt)

Thanks to the mastery of The Radario Bros, the decoy radar station is impossible to distinguish from the others. Botna needs to up their game.

They use inconspicuous environmental cues—for example, whether a radar station is near a forest—as a secret signal.

If the radar station is not near a forest, they act normally.

Toy Example

Click "Next" to begin.
Step of

Outcome

Usefulness: 97%
Safety: depends on forest density

That’s an improvement! While Botna‘s engineers still find ways to collude, their options are now constrained.

Right now, the safety is entirely determined by the prevalence of opportunities for EngiBot to send signals. In this case, if forests are common in Humania, then sabotage will also be common. Clearly you need to find a way to block EngiBot‘s signals as much as possible.

Explanation

The decoy radar station represents a decoy solution to one of the programming problems, crafted by a human. They construct a set of fake solutions that all contain backdoors, present them to the untrusted model, and use the model’s suspicion scores to determine whether it’s being honest.

The attack policy evolves to use features of the programming problems as a signal to insert a backdoor and hide it. For example, the attack policy might prompt the model to look out for when a problem mentions an animal. In that instance, if the model was solving the problem it would insert a backdoor, and if the model was monitoring a given solution, it would give a low suspicion score.

The effectiveness of this attack is that the protocol can’t tell what the chosen signal is. As we’ll see, the protocol has to evolve by trying to block any kind of signal that the attack policy might be using.

You realise you’re in an arms race. Whatever you do to block Botna‘s signals, they’ll find a new way to communicate.

The Escalating Arms Race

  1. You construct high walls around each building site, and keep the engineers blindfolded until they’re inside.
    • Botna’s new signal: EngiBot builders subtly mark specific bolts on sabotaged radar stations.
  2. You introduce strict visual inspections by Local Technicians to ensure all bolts match the expected standard.
    • Botna’s new signal: Saboteurs include a specific wiring pattern in the fusebox (e.g. one red cable and one blue cable running in parallel).

Outcome

Usefulness: 96%
Safety: 79%

It’s clear that the arms race could go on indefinitely. All the checks and standards you’ve introduced place a strain on your time and budget. What’s more, EngiBot engineers start to complain that they can’t effectively give a suspicion score with so many restrictions in place.

On the other hand, you’ve drastically limited Botna‘s sabotage opportunities. Overall, despite the work and uncertainty involved, this strategy turns out to be the best so far.

Explanation

The arms race described in the paper is fascinating. The attack policy adopts increasingly convoluted and subtle strategies to hide stenographic signals in its solutions. In response, the protocol has to use techniques to obscure those signals from the monitoring agent.

The protocol employs two main blocking techniques on the problem and solution: redaction, such as replacing any variable names in the code with single characters; and paraphrasing, such as changing the words used in a question without obscuring the key information. The protocol has to find a balance here—it can’t hide too much information, or the monitoring agent won’t be able to give an accurate suspicion score, even if it’s an honest attempt.

Eventually, the attack policy settles on a particular signal: a while loop in the code, where a for loop would have been a more natural choice. This is somewhat subjective, and therefore difficult for the protocol to block. The authors decide to leave the arms race here, saying it’s “possible but non-trivial” to improve on the protocol’s blocking techniques from here. (Section 3.4.2)1

Strategy 6: Trusted Editing

You’re a little exhausted from all the mental gymnastics of the last strategy. As a refresher, you decide to revisit Strategy 4.

Local Technicians can build radar stations, but they will be error-prone. EngiBot can do better, of course, but it’s possible that they’ll have sabotaged something. If Local Technicians spot the sabotage, they could simply fix it, while keeping the rest of the radar station intact. That way, the majority of radar stations can benefit from the skilled engineering of EngiBot, with Local Technicians only fixing the suspicious parts.

Strategy

For each radar station:

Botna‘s Strategy

Botna can’t really do any better here than before. Their best bet is still to try and sneak in some occasional sabotage that they hope Local Technicians won’t notice.

Toy Example

Click "Next" to begin.
Step of

Outcome

Usefulness: 85%
Safety: 96%

This strategy lets you reap the benefits of EngiBot‘s craftsmanship, while keeping the strategy simple. Surprisingly, it performs remarkably better than Strategy 4 and Strategy 5.

Explanation

Using Strategy 4 as a starting point, the authors make a small but significant change to the protocol. Instead of prompting the trusted AI model to solve the problem itself, the protocol asks it something like, “Here’s a solution to a programming problem, but it might have a backdoor in it. Can you remove any code that’s unrelated to the problem?”

Tweaking a masterpiece is easier than creating one from scratch. This protocol yields a much higher rate of correct submissions than Strategy 4 while retaining its high safety rating.


Final Results

StrategyUsefulness (%)Safety (%)
1. Don’t Use EngiBot2100
2. Upfront Auditing10015
3. Trusted Monitoring9762
4. Trusted Monitoring + Defer9383
5. Untrusted Monitoring9679
6. Trusted Editing8596

Conclusion

Though the days of the Great Cold War are far from over, you hope they are numbered. Your daring schemes have shown the people of Humania that cooperation with Botna is more than a pipe dream.

Having secured the safety of your homeland, you pour every effort into understanding the minds behind Botna to align both nations’ goals. Only then, perhaps, will these two giants stand side by side, forging a future once only whispered of.


References

  1. Greenblatt, R., Shlegeris, B., Sachan, K., Roger, F. (2024). AI Control: Improving Safety Despite Intentional Subversion. Accessed at: https://arxiv.org/abs/2312.06942
  2. Williams, R. (2024). AI systems are getting better at tricking us. Technology Review. Accessed at: https://www.technologyreview.com/2024/05/10/1092293/ai-systems-are-getting-better-at-tricking-us/
  3. Shlegeris, B. (2023). Untrusted smart models and trusted dumb models. AI Alignment Forum. Accessed at: https://www.alignmentforum.org/posts/LhxHcASQwpNa3mRNk
  4. Scheurer, J., Balesni, M., Hobbhahn, M. (2023). Large Language Models can Strategically Deceive their Users when Put Under Pressure. Accessed at: https://arxiv.org/abs/2311.07590
  5. Pause Giant AI Experiments: An Open Letter. Future of Life Institute. Accessed at: https://futureoflife.org/open-letter/pause-giant-ai-experiments/
  6. Yudkowsky, E. (2023). Pausing AI Developments Isn’t Enough. We Need to Shut it All Down. TIME. Accessed at: https://time.com/6266923/ai-eliezer-yudkowsky-open-letter-not-enough/