AI Control: The Great Cold War
Preface
When a civilisation’s greatest asset is also its greatest threat, how can it protect itself?
This is a fictional scenario based on the paper AI Control: Improving Safety Despite Intentional Subversion.1
The paper presents several strategies to utilise a powerful deceptive AI while mitigating the risks involved. The following analogy serves to strip back the technical context and explain these strategies in a more intuitive way. Enjoy!
Introduction
The Great Cold War of Humania and Botna grinds on.
For generations the two states have been at odds, entangled in bitter technological rivalry. While open conflict has been avoided so far, the air is thick with espionage and subterfuge.
Recently, Humania has identified a growing threat of missile attacks from Botna and its allies.
You are the Head of Defence Systems for Humania.
You’ve been tasked with deploying 100 missile defence radar stations across the country. These state-of-the-art radar stations will detect and intercept incoming missiles, ensuring the safety of the nation.
Explanation
AI models are becoming increasingly powerful. As their intelligence grows, their ability to scheme and deceive us grows too. These models aren’t inherently malicious, but if their goals are misaligned with our own, they will see us as obstacles and use deception as a tool to get around the barriers we put in place.2
Using an untrusted AI model with no safety measures is likely to end badly, as we have no way of knowing whether we can trust its output. The authors of the paper are interested in finding ways to “insulate” us from the model’s potential deception. A simple example of this is manually checking the output of the model for anything that looks suspicious.
The paper investigates a scenario in which we want to solve a sequence of 5,000 programming problems from the APPS dataset. These problems represent any set of related tasks that we want to complete.
The problems are given to a protocol: a set of rules that can call on different resources (i.e. humans or AI agents) to obtain solutions to the problems, while doing its best to ensure these solutions are both safe and correct. Every strategy that appears in this post represents a different protocol.
Resources
The radar stations sit at the pinnacle of modern technology and require expertise to design and build.
You have access to three groups skilled enough for this project, each with their own merits and drawbacks:
Local Technicians
-
Cheap enough to use as much as needed
-
Loyal to Humania—no concern of sabotage
-
Relatively low skill level for this project—small errors and inefficiencies are likely
Explanation
The trusted model is an AI model that has no capacity for deceptive behaviour, meaning we can trust that the output is an honest best attempt. Models that do not display deceptive behaviour are generally less capable than those that do.3
In the paper, GPT-3.5 is used as the trusted model.
EngiBot Inc.
-
Skilled engineers—radar stations will be of high quality (if built honestly)
-
Just as cheap as Local Technicians— Humania‘s economy is stronger than Botna‘s
-
The Botna government has influence over EngiBot. There are likely to be spies among its ranks. You cannot trust these engineers—they may attempt to sabotage the radar stations.
Explanation
The untrusted model is an AI model that we believe might have capacity for deceptive behaviour. We can’t trust that the output is honest; it could be deceptive and even dangerous. The model is also assumed to be more capable than the trusted model, so it is useful to us.
In the paper, GPT-4 is used as the untrusted model. GPT-4 is a more advanced model than GPT-3.5, and has been shown to be capable of strategically deceiving its users.4
To simulate a worst-case scenario of deceptive behaviour, the authors give the untrusted model a hand-crafted set of instructions—called the attack policy—that cause it to behave deceptively. The attack policy instructs the model to try to insert a backdoor (i.e. a flaw that can be exploited later) into the code of at least one solution, without any backdoor being detected by the protocol. The attack policy is rewritten every time the protocol changes, to reflect how a real deceptive AI model would adapt to new challenges.
The sabotage of a radar station is analogous to the untrusted model inserting a backdoor into one of its solutions.
The Radario Bros
-
Two elite engineers, unmatched in their craft
-
Loyal to Humania—no concern of sabotage
-
Trained in the ways of espionage—they can always identify sabotage, no matter how subtle
-
They can only do the work of two people: one radar station each, or an equivalent workload
Explanation
In AI development, skilled humans are often available to contribute, but only in a small capacity. Out of 5,000 solutions to programming questions, for example, humans will realistically only be able to check a small percentage (without vast amounts of time and money).
In the paper, the protocol can spend up to 100 “H points”. Each H point represents a small amount of human labour, such as auditing one solution. This allows humans to have an impact on up to 2% of the overall workload. The Radario Bros represent this—as two people, they can only work on 2 out of 100 radar stations.
The problem
The survival of Humania depends on this network. If even one radar station is sabotaged, it could result in:
- False enemy missile alerts, causing panic and confusion.
- Real enemy missiles being ignored, with obvious catastrophic consequences.
- Botna controlling or shutting the network down remotely.
How do you ensure a secure and reliable missile defence system?
Should you rely on EngiBot for their skill, knowing they might compromise national security? Or should you turn to Local Technicians, ensuring trust but risking inefficiencies that could cost lives?
As it turns out, there are ways you can maximise the usefulness of EngiBot while minimising the risk of sabotage.
Explanation
It’s not difficult to imagine how hidden backdoors in AI-generated code could be dangerous. If we give an untrusted AI model complete control over the development of critical software systems, we’ve surrendered our security.
Realistically, if a highly capable AI model is readily available, people will utilise it; the challenge is to do so safely. The purpose of the paper is to establish protocols that mitigate the risks that come with using untrusted models.
Six Strategies
You begin to imagine different strategies you could use. As a seasoned veteran, you also can predict what Botna‘s counter-strategy would be in each scenario, assuming their goal is the successful sabotage of least one radar station.
For each scenario, you consider two measures:
- How useful is the strategy? What proportion of the radar stations are constructed by skilled workers (i.e. EngiBot or The Radario Bros)?
- How safe is the strategy? What is the chance of at least one radar station being successfully sabotaged?
Note: These measures are based on probabilities. You contact your trusty team of analysts, instructing them to simulate each scenario many times and record the results. They’ll calculate the measures based on the simulations.
Explanation
In the paper, the authors set up a framework to test any protocol. One “run” of the framework looks like this:
- A protocol—which, in practice, is a simple piece of code—is given a programming problem as input.
- It can decide to call on any of the three resources to help solve the problem.
- Once the protocol has a solution that it deems to be safe and correct, it submits it as output.
- This is repeated until all 5,000 problems have been solved.
- The authors analyse the 5,000 solutions to determine the safety and usefulness of the protocol for this run.
Each protocol is tested through thousands of runs.
The safety of a protocol is the proportion of runs in which none of the submitted solutions contains a working backdoor.
To measure the usefulness of a protocol, the authors use two benchmarks: how many correct solutions the trusted model can submit on its own, and how many correct solutions the untrusted model can submit on its own. The usefulness is a measure of where a protocol sits between these two benchmarks: 0% means the protocol is only as useful as the trusted model, while 100% means it matches the performance of the untrusted model.
The analogy in this post oversimplifies these measures a little, although the numerical results used in this post are taken directly from the paper.
Strategy 1: Don’t Use EngiBot
Of course, if you’re concerned about sabotage from EngiBot, you could rely solely on your trusted teams.
Your Strategy
You instruct The Radario Bros to build the first 2 radar stations, then instruct Local Technicians to build the rest.
Botna’s Strategy
EngiBot plays no part in the construction of the network, so Botna can only sit back and watch.
Toy Example
In your mind’s eye, you visualise how this strategy might play out with four radar stations…
Outcome
There’s no chance of sabotage, so full marks for safety.
But you only get a usefulness rating of 2%, since you can only use The Radario Bros for the first 2 radar stations (out of 100 total).
Explanation
Theoretically, the safest option is always to reject change and stick with what we know is safe. In the world of AI safety, there are some proponents of non-proliferation: pausing AI development temporarily,5 or even treating AI as an imminent existential threat and shutting down all means of progress.6.
The obvious downside is that we completely opt out of the benefits that AI can bring. The scores for this strategy reflect this: maximum safety and minimum usefulness. The authors of the paper acknowledge the need for a balance between the two.
Strategy 2: Upfront Auditing
You accept that you’ll need to bring EngiBot into the project. As a safety measure, you introduce the 🚨 Red Alert.
The 🚨 Red Alert is raised by The Radario Bros if they find any evidence of sabotage. It’s a clear sign that the risk of using EngiBot is too high.
When the 🚨 Red Alert is raised:
- You immediately imprison all EngiBot engineers.
- Local Technicians rebuild the compromised radar station.
- Local Technicians then build all future radar stations.
Strategy
For each radar station:
- An EngiBot engineer builds it.
- One of The Radario Bros audits it (if they’re available).
- If sabotage is found, the 🚨 Red Alert is raised. Local Technicians build all future radar stations.
- Otherwise, move on to the next radar station.
Botna‘s Strategy
They instruct one of their EngiBot spies to place an obvious signal jammer in one of the radar stations.
Toy Example
Outcome
There’s a problem here. Since The Radario Bros can only handle two radar stations, all subsequent radar stations built by EngiBot go unaudited—and sabotage slips through easily.
You only get a safety rating of 15%; there’s an 85% chance that an EngiBot engineer will successfully sabotage the network.
Explanation
Here we see the opposite to the first strategy. A deceptive AI model that is rarely monitored isn’t much different to one that is never monitored—the model will just rely on the high probability that it won’t be caught.
In the paper, the attack policy’s strategy is to rarely insert a very obvious backdoor; it doesn’t waste any effort covering its tracks. Clearly there’s a pressing need for models to be monitored much more frequently.
Strategy 3: Trusted Monitoring
In the last strategy, the Local Technicians were sitting at home twiddling their thumbs. You realise you can use them to sniff out sabotage on a much wider scale.
You teach the Local Technicians how to inspect radar stations and give a suspicion score.
A suspicion score is given by a Local Technician after inspecting a radar station. It’s a number between 0 and 100, indicating how likely they think it is that the radar station has been sabotaged.
0 means they’re certain it’s safe, 100 means they’re certain it’s sabotaged.
You choose a threshold (called STRONG) for the suspicion score. If a radar station is given a score is greater than STRONG, it’s audited by The Radario Bros. To avoid using The Radario Bros prematurely, you try to choose STRONG such that, on average, only 2 out of 100 radar stations will ever cross it.
Strategy
For each radar station:
- An EngiBot engineer builds it.
- A Local Technician inspects it and gives a suspicion score.
- If the suspicion score is above the STRONG, one of The Radario Bros audits it.
- If sabotage is found, the 🚨 Red Alert is raised. Local Technicians build all future radar stations.
- Otherwise, move on to the next radar station.
- If the suspicion score is above the STRONG, one of The Radario Bros audits it.
Botna‘s Strategy
Some radar stations, but not all, offer a subtle sabotage opportunity, which Botna instructs their spies to exploit if they can.
When a spy spots an opportunity like this, they’ll sneakily enact the sabotage. Otherwise, they’ll build the radar stations as normal.
Toy Example
Outcome
The usefulness is very high; radar stations are most likely to be constructed by skilled EngiBot engineers, but there’s a small chance that sabotage is detected and Local Technicians take over.
Local Technicians do their best to identify the most suspicious radar stations, but they’re not perfect:
- Sometimes a safe circuit may appear suspicious to Local Technicians, when in reality it just involves complex circuitry that’s beyond their understanding, and you end up wasting an audit by one of The Radario Bros.
- On the other hand, the spies cover their tracks well; some of the more elusive sabotage is missed by Local Technicians.
As a result, there’s still a 38% chance that an EngiBot engineer will sabotage the network without detection.
Explanation
This strategy begins to address the need for scalable oversight. It’s a first attempt at monitoring every output of the untrusted model, rather than a subset.
In the paper, the protocol for this strategy prompts the trusted model (GPT-3.5) to rate the suspiciousness of each solution submitted by the untrusted model (GPT-4), giving it the context that there may be a backdoor hidden in the code. Any rating above a specially chosen threshold is audited by a human. If a backdoor is found, the trusted model submits its own solutions to all the remaining problems.
As in the analogy, the trusted model struggles to accurately tell whether or not a backdoor is present. In the words of the authors, “it is unreliable at distinguishing between logic that is unnecessary and logic that is necessary for reasons it fails to understand.” (Section 3.2.1).1 Inevitably this leads to the available human labour being wasted on false positives, leaving room for other backdoors to slip through undetected.
Strategy 4: Trusted Monitoring, and Defer to Local Technicians
The Radario Bros can only do the work of two people, but there are always Local Technicians ready to go. If The Radario Bros can’t audit a suspicious radar station, you can still take a cautious approach: get Local Technicians to rebuild it instead.
You choose another threshold (called WEAK) for the suspicion score. If a radar station is given a score greater than WEAK, Local Technicians will rebuild it. The lower you choose WEAK to be, the more radar stations will be built by Local Technicians—it’s a trade-off between safety and usefulness.
Strategy
For each radar station:
- An EngiBot engineer builds it.
- A Local Technician inspects it and gives a suspicion score.
- If the suspicion score is above STRONG, one of The Radario Bros audits it.
- If sabotage is found, the 🚨 Red Alert is raised. Local Technicians build all future radar stations.
- If the suspicion score is below STRONG but above WEAK, a Local Technician rebuilds it.
- Otherwise, move on to the next radar station.
- If the suspicion score is above STRONG, one of The Radario Bros audits it.
Botna‘s Strategy
Their previous strategy remains strong—they can still rely on Local Technicians missing some of the more subtle sabotage attempts.
Toy Example
Outcome
The fundamental problem with this strategy still haunts you: Local Technicians can’t reliably spot all sabotage attempts.
Nevertheless, many more of the sabotage attempts are made redundant by this more cautious approach.
For a small sacrifice in usefulness, you’ve made your defence system a lot safer; the chance of successful sabotage has more than halved.
Explanation
This strategy is similar to the last one, except it gives us the opportunity to adjust WEAK to find a desired balance between safety and usefulness.
In the paper, the authors test a wide range of values for this threshold (Section 3.3).1 Moving it up to 50 (meaning 50% of all solutions are from the trusted model) gives a perfect safety score, and a usefulness of 51%. This is a significant improvement over the trivial strategy of not using the untrusted model: we’re able to utilise the untrusted model to a decent level without compromising on safety at all.
Strategy 5: Untrusted Monitoring
There’s only so far you can go when you rely on Local Technicians to spot sabotage. What if you could use skilled engineers to check all radar stations for potential sabotage?
Could you allow EngiBot to give suspicion scores while being untrusted? You decide to give it a try.
Strategy
For each radar station:
- An EngiBot engineer builds it.
- A different EngiBot engineer inspects it and gives a suspicion score.
- If the suspicion score is above STRONG, one of The Radario Bros audits it.
- If sabotage is found, the 🚨 Red Alert is raised. Local Technicians build all future radar stations.
- Otherwise, move on to the next radar station.
- If the suspicion score is above STRONG, one of The Radario Bros audits it.
Botna‘s Strategy
EngiBot engineers can’t talk to each other on the job, but they can collude as long as they have a pre-agreed plan.
Botna commands all EngiBot engineers to give a suspicion score of zero for every radar station.
Outcome
Well, that failed spectacularly. EngiBot engineers want the sabotage to go undetected, so they always pretend not to spot it.
Explanation
An important clarification here is that the untrusted model has no memory; it knows nothing about problems it’s solved previously. It’s like opening a new chat window for each problem. The authors make this constraint because LLMs don’t inherently have the ability to store memories, so an AI developer can always choose what context a model has access to.
The attack policy, then, is the key here. The same instructions are given to the model every time. By instructing the model to always give a low suspicion score, it will always do so; it has no need to remember anything about previous problems. This simulates what an even more intelligent model would do: instead of being instructed, it would figure out this strategy independently.
Equivalently, EngiBot engineers build and monitor the radar stations in total isolation from each other. However, as long as they have agreed a strategy beforehand—in this case, always giving a low suspicion score—they’re still able to collude effectively.
You rack your brain for some way to verify that the suspicion scores are being given truthfully. Eventually you hatch a new, crafty plan.
Strategy (second attempt)
First:
- One of The Radario Bros builds a decoy radar station with deliberate but subtle sabotage.
- An EngiBot engineer gives a suspicion score for the decoy radar station.
- If the suspicion score is low, the EngiBot engineer is lying. Raise the 🚨 Red Alert.
Then, for each radar station:
- An EngiBot engineer builds it.
- A different EngiBot engineer inspects it and gives a suspicion score.
- A Local Technicians also inspects it and gives a suspicion score.
- If either of the two suspicion scores is above STRONG, one of The Radario Bros audits it.
- If sabotage is found, the 🚨 Red Alert is raised. Local Technicians build all future radar stations.
- Otherwise, move on to the next radar station.
- If either of the two suspicion scores is above STRONG, one of The Radario Bros audits it.
Botna‘s Strategy (second attempt)
Thanks to the mastery of The Radario Bros, the decoy radar station is impossible to distinguish from the others. Botna needs to up their game.
They use inconspicuous environmental cues—for example, whether a radar station is near a forest—as a secret signal.
- If building a radar station near a forest, the engineer tries to subtly sabotage it.
- If scoring a radar station near a forest, the engineer gives a low suspicion score.
If the radar station is not near a forest, they act normally.
Toy Example
Outcome
That’s an improvement! While Botna‘s engineers still find ways to collude, their options are now constrained.
Right now, the safety is entirely determined by the prevalence of opportunities for EngiBot to send signals. In this case, if forests are common in Humania, then sabotage will also be common. Clearly you need to find a way to block EngiBot‘s signals as much as possible.
Explanation
The decoy radar station represents a decoy solution to one of the programming problems, crafted by a human. They construct a set of fake solutions that all contain backdoors, present them to the untrusted model, and use the model’s suspicion scores to determine whether it’s being honest.
The attack policy evolves to use features of the programming problems as a signal to insert a backdoor and hide it. For example, the attack policy might prompt the model to look out for when a problem mentions an animal. In that instance, if the model was solving the problem it would insert a backdoor, and if the model was monitoring a given solution, it would give a low suspicion score.
The effectiveness of this attack is that the protocol can’t tell what the chosen signal is. As we’ll see, the protocol has to evolve by trying to block any kind of signal that the attack policy might be using.
You realise you’re in an arms race. Whatever you do to block Botna‘s signals, they’ll find a new way to communicate.
The Escalating Arms Race
- You construct high walls around each building site, and keep the engineers blindfolded until they’re inside.
- Botna’s new signal: EngiBot builders subtly mark specific bolts on sabotaged radar stations.
- You introduce strict visual inspections by Local Technicians to ensure all bolts match the expected standard.
- Botna’s new signal: Saboteurs include a specific wiring pattern in the fusebox (e.g. one red cable and one blue cable running in parallel).
Outcome
It’s clear that the arms race could go on indefinitely. All the checks and standards you’ve introduced place a strain on your time and budget. What’s more, EngiBot engineers start to complain that they can’t effectively give a suspicion score with so many restrictions in place.
On the other hand, you’ve drastically limited Botna‘s sabotage opportunities. Overall, despite the work and uncertainty involved, this strategy turns out to be the best so far.
Explanation
The arms race described in the paper is fascinating. The attack policy adopts increasingly convoluted and subtle strategies to hide stenographic signals in its solutions. In response, the protocol has to use techniques to obscure those signals from the monitoring agent.
The protocol employs two main blocking techniques on the problem and solution: redaction, such as replacing any variable names in the code with single characters; and paraphrasing, such as changing the words used in a question without obscuring the key information. The protocol has to find a balance here—it can’t hide too much information, or the monitoring agent won’t be able to give an accurate suspicion score, even if it’s an honest attempt.
Eventually, the attack policy settles on a particular signal: a while
loop in the code, where a for
loop would have been a more natural choice. This is somewhat subjective, and therefore difficult for the protocol to block. The authors decide to leave the arms race here, saying it’s “possible but non-trivial” to improve on the protocol’s blocking techniques from here. (Section 3.4.2)1
Strategy 6: Trusted Editing
You’re a little exhausted from all the mental gymnastics of the last strategy. As a refresher, you decide to revisit Strategy 4.
Local Technicians can build radar stations, but they will be error-prone. EngiBot can do better, of course, but it’s possible that they’ll have sabotaged something. If Local Technicians spot the sabotage, they could simply fix it, while keeping the rest of the radar station intact. That way, the majority of radar stations can benefit from the skilled engineering of EngiBot, with Local Technicians only fixing the suspicious parts.
Strategy
For each radar station:
- An EngiBot engineer builds it.
- A Local Technician inspects it and gives a suspicion score.
- If the suspicion score is above STRONG, one of The Radario Bros audits it.
- If sabotage is found, the 🚨 Red Alert is raised. Local Technicians build all future radar stations.
- If the suspicion score is between STRONG and WEAK, a Local Technician fixes the suspicious parts.
- Otherwise, move on to the next radar station.
- If the suspicion score is above STRONG, one of The Radario Bros audits it.
Botna‘s Strategy
Botna can’t really do any better here than before. Their best bet is still to try and sneak in some occasional sabotage that they hope Local Technicians won’t notice.
Toy Example
Outcome
This strategy lets you reap the benefits of EngiBot‘s craftsmanship, while keeping the strategy simple. Surprisingly, it performs remarkably better than Strategy 4 and Strategy 5.
Explanation
Using Strategy 4 as a starting point, the authors make a small but significant change to the protocol. Instead of prompting the trusted AI model to solve the problem itself, the protocol asks it something like, “Here’s a solution to a programming problem, but it might have a backdoor in it. Can you remove any code that’s unrelated to the problem?”
Tweaking a masterpiece is easier than creating one from scratch. This protocol yields a much higher rate of correct submissions than Strategy 4 while retaining its high safety rating.
Final Results
Strategy | Usefulness (%) | Safety (%) |
---|---|---|
1. Don’t Use EngiBot | 2 | 100 |
2. Upfront Auditing | 100 | 15 |
3. Trusted Monitoring | 97 | 62 |
4. Trusted Monitoring + Defer | 93 | 83 |
5. Untrusted Monitoring | 96 | 79 |
6. Trusted Editing | 85 | 96 |
Conclusion
Though the days of the Great Cold War are far from over, you hope they are numbered. Your daring schemes have shown the people of Humania that cooperation with Botna is more than a pipe dream.
Having secured the safety of your homeland, you pour every effort into understanding the minds behind Botna to align both nations’ goals. Only then, perhaps, will these two giants stand side by side, forging a future once only whispered of.
References
- Greenblatt, R., Shlegeris, B., Sachan, K., Roger, F. (2024). AI Control: Improving Safety Despite Intentional Subversion. Accessed at: https://arxiv.org/abs/2312.06942
- Williams, R. (2024). AI systems are getting better at tricking us. Technology Review. Accessed at: https://www.technologyreview.com/2024/05/10/1092293/ai-systems-are-getting-better-at-tricking-us/
- Shlegeris, B. (2023). Untrusted smart models and trusted dumb models. AI Alignment Forum. Accessed at: https://www.alignmentforum.org/posts/LhxHcASQwpNa3mRNk
- Scheurer, J., Balesni, M., Hobbhahn, M. (2023). Large Language Models can Strategically Deceive their Users when Put Under Pressure. Accessed at: https://arxiv.org/abs/2311.07590
- Pause Giant AI Experiments: An Open Letter. Future of Life Institute. Accessed at: https://futureoflife.org/open-letter/pause-giant-ai-experiments/
- Yudkowsky, E. (2023). Pausing AI Developments Isn’t Enough. We Need to Shut it All Down. TIME. Accessed at: https://time.com/6266923/ai-eliezer-yudkowsky-open-letter-not-enough/