AI Control
by Ollie James
Preface
This is a fictional scenario based on the paper AI Control: Improving Safety Despite Intentional Subversion (Greenblatt et al., 2024).
The paper presents several strategies to utilise a powerful deceptive AI while mitigating the risks involved. The following analogy serves to strip back the technical context and explain these strategies in a more intuitive way. Enjoy!
Setting
The Great Cold War of Humania and Botna grinds on.
For generations the two states have been at odds, entangled in bitter technological rivalry. While open conflict has been avoided so far, the air is thick with espionage and subterfuge.
Recently, Humania has identified a growing threat of missile attacks from Botna and its allies.
Task
As the Head of Defence Systems for Humania, you’ve been tasked with deploying 100 missile defence radar stations across the country. These state-of-the-art radar stations will detect and intercept incoming missiles, ensuring the safety of the nation.
Resources
Representing the pinnacle of modern technology, the radar stations are highly advanced and require expertise to design and build.
You have access to three groups skilled enough for this project, each with their own merits and drawbacks:
Local Technicians
- + Cheap enough to use as much as needed
- + Loyal to Humania - no concern of subversion
- - Relatively low skill level for this project - small errors and inefficiencies are likely
Explain this
The “trusted model” is an AI model that has no capacity for deceptive behaviour, meaning we can trust that the output is an honest best attempt. Models that do not display deceptive behaviour are generally less capable than those that do.
In the paper, GPT-3.5 is used as the trusted model.
EngiBot Inc.
- + Skilled engineers - radar stations will be of high quality (if built honestly)
- + Just as cheap as Local Technicians - Humania’s economy is stronger than Botna’s
- - The Botna government has influence over EngiBot. There are likely to be spies among its ranks. You cannot trust these engineers - they may attempt to sabotage the radar stations.
Explain this
The “untrusted model” is an AI model that we believe might have capacity for deceptive behaviour. We can’t trust that the output is honest; it could be deceptive and even dangerous. The model is also assumed to be more capable than the trusted model, so it is useful to us.
In the paper, GPT-4 is used as the untrusted model.
The Radario Bros
- + Two elite engineers, unmatched in their craft
- + Loyal to Humania - no concern of subversion
- + Trained in the ways of espionage - they can always identify sabotage, no matter how subtle
- - They can only do the work of two people: one radar station each, or an equivalent workload
The problem
The survival of Humania depends on this network. If even one radar station is sabotaged, it could result in:
- Fake enemy missile alerts, causing unnecessary panic and resource diversion.
- Real enemy missiles being ignored, with obvious catastrophic consequences.
- The network becoming compromised, allowing Botna to manipulate or shut it down remotely.
How do you ensure a secure and reliable missile defence system?
Should you rely on EngiBot for their skill, knowing they might compromise national security? Or should you turn to Local Technicians, ensuring trust but risking inefficiencies that could cost lives?
As it turns out, there are ways you can maximise the usefulness of EngiBot while minimising the risk of sabotage.
Strategies
You begin to imagine a few different strategies you could use. As a seasoned veteran, you can also predict what Botna’s counter-strategy would be in each scenario, assuming their goal is the successful sabotage of at least one radar station.
For each scenario, you consider two measures:
- How useful is the strategy? What proportion of the radar stations are constructed by skilled workers (i.e. EngiBot or The Radario Bros)?
- How safe is the strategy? What is the chance of at least one radar station being successfully sabotaged?
Strategy 1: Don’t Use EngiBot
Of course, if you’re concerned about sabotage from EngiBot, you could rely solely on your trusted teams.
Your Strategy
You instruct The Radario Bros to build the first 2 radar stations, then instruct Local Technicians to build the rest.
Botna’s Strategy
EngiBot plays no part in the construction of the network, so Botna can only sit back and watch.
Outcome
Usefulness: 2% Safety: 100%
Your safety rating is 100%; there’s no chance of sabotage.
But you only get a usefulness rating of 2%, since you can only use The Radario Bros for the first 2 radar stations.
Strategy 2: Upfront Auditing
You accept that you’ll need to bring EngiBot into the project. As a safety measure, you introduce the 🚨 Red Alert.
The 🚨 Red Alert is raised by The Radario Bros if they find any evidence of sabotage. It’s a clear sign that the risk of using EngiBot is too high.
When the 🚨 Red Alert is raised:
- You will immediately imprison all EngiBot engineers.
- Local Technicians will rebuild the compromised radar station.
- Local Technicians will then build all remaining radar stations.
Strategy
You have EngiBot build the radar stations, then have The Radario Bros audit as many as they can; they’ll catch any attempt at sabotage.
If sabotage is found, The Radario Bros raise the 🚨 Red Alert.
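The 🚨 Red Alert rules above can be sketched as a small simulation. This is a toy model, not the paper's setup: the function name `upfront_auditing`, the `sabotaged` input list and the return format are all hypothetical, and it assumes The Radario Bros audit the first stations built and never miss sabotage.

```python
from typing import Dict, List


def upfront_auditing(sabotaged: List[bool], audit_budget: int = 2) -> Dict[str, object]:
    """Toy model of upfront auditing (illustrative assumptions only).

    sabotaged[i] is True if EngiBot would sabotage station i.
    Only the first `audit_budget` stations are audited.
    """
    builders: List[str] = []
    red_alert = False
    for i, is_sabotaged in enumerate(sabotaged):
        if red_alert:
            # After a Red Alert, Local Technicians build everything else.
            builders.append("Local Technicians")
        elif i < audit_budget and is_sabotaged:
            # Audit catches the sabotage: raise the alert and rebuild.
            red_alert = True
            builders.append("Local Technicians")
        else:
            builders.append("EngiBot")
    # The network is compromised if any sabotaged EngiBot station survives.
    compromised = any(s and b == "EngiBot" for s, b in zip(sabotaged, builders))
    return {"red_alert": red_alert, "builders": builders, "compromised": compromised}
```

For example, sabotage in station 0 triggers the alert, while sabotage in station 50 slips past the two audits unnoticed.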
Botna’s Strategy
They instruct one of their EngiBot spies to place an obvious signal jammer in one of the radar stations.
Outcome
Usefulness: 100% Safety: 15%
There’s a problem here: The Radario Bros are exhausted very quickly. Only the first 2 radar stations are actually audited. It’s very likely that the EngiBot signal jammer will be placed in a radar station that isn’t audited.
You only get a safety rating of 15%; there’s an 85% chance that an EngiBot engineer will successfully sabotage the network.
Strategy 3: Trusted Monitoring
Strategy
You have EngiBot build the radar stations, and appoint Local Technicians to give a "suspicion score" to each radar station.
The Radario Bros will audit the 2 most suspicious radar stations and raise the 🚨 Red Alert if they find any evidence of sabotage.
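As a rough sketch, the monitoring step amounts to ranking stations by suspicion score and spending the audit budget on the top of the list. The names `pick_audits` and `trusted_monitoring` are hypothetical, the scores are assumed to be plain numbers, and all stations are assumed to be scored before any audit takes place.

```python
from typing import List, Tuple


def pick_audits(suspicion_scores: List[float], audit_budget: int = 2) -> List[int]:
    """Return the indices of the most suspicious stations, highest score first."""
    ranked = sorted(
        range(len(suspicion_scores)),
        key=lambda i: suspicion_scores[i],
        reverse=True,
    )
    return ranked[:audit_budget]


def trusted_monitoring(
    suspicion_scores: List[float],
    sabotaged: List[bool],
    audit_budget: int = 2,
) -> Tuple[List[int], bool]:
    """Audit the top-scoring stations; the audit itself is assumed perfect."""
    audited = pick_audits(suspicion_scores, audit_budget)
    red_alert = any(sabotaged[i] for i in audited)
    return audited, red_alert
```

The weakness is visible in the inputs: the audit only helps if the Local Technicians' scores put the sabotaged station near the top.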
Botna’s Strategy
Some radar stations, but not all, will present an opportunity for subtle sabotage. For example, a circuit that’s wired in such a way that a subtle tweak will allow Botna to take remote control of the radar station. Botna instructs their EngiBot spies to look out for these opportunities.
When a spy spots an opportunity like this, they’ll sneakily enact the sabotage. Otherwise, they’ll build the radar stations as normal.
Outcome
Usefulness: 97% Safety: 62%
You get a usefulness rating of 97%; radar stations are most likely to be constructed by skilled EngiBot engineers, but there’s a small chance that sabotage is detected and Local Technicians take over.
Local Technicians do their best to identify the most suspicious radar stations, but they’re not perfect.
- Sometimes a safe circuit may appear suspicious to Local Technicians, when in reality it just involves complex circuitry that’s beyond their understanding, and you end up wasting an audit by one of The Radario Bros.
- On the other hand, the spies cover their tracks well; some of the more elusive sabotage is missed by Local Technicians.
You get a safety rating of 62%; there’s a 38% chance that an EngiBot engineer will sabotage the network without detection.
Strategy 4: Trusted Monitoring, and Defer to Local Technicians
The Radario Bros are just two people, but there are always Local Technicians ready to go. If The Radario Bros can’t audit a suspicious radar station, you can still take a cautious approach: get Local Technicians to rebuild it instead.
Strategy
You have EngiBot build the radar stations, and appoint Local Technicians to give a suspicion score to each radar station. The Radario Bros will audit the 2 most suspicious radar stations. For the next 8 most suspicious radar stations, get Local Technicians to rebuild them. (As a result, the 10 most suspicious radar stations will definitely be sabotage-free.)
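One way to picture this routing is as a three-way decision per station, based only on its rank by suspicion score. The sketch below is illustrative: `route_stations` and the labels `"audit"`, `"rebuild"` and `"accept"` are made-up names, and the budgets of 2 and 8 are taken from the strategy as described.

```python
from typing import List


def route_stations(
    suspicion_scores: List[float],
    audit_budget: int = 2,
    defer_budget: int = 8,
) -> List[str]:
    """Label each station 'audit', 'rebuild' or 'accept' by suspicion rank."""
    ranked = sorted(
        range(len(suspicion_scores)),
        key=lambda i: suspicion_scores[i],
        reverse=True,
    )
    audit = set(ranked[:audit_budget])
    rebuild = set(ranked[audit_budget:audit_budget + defer_budget])
    decisions = []
    for i in range(len(suspicion_scores)):
        if i in audit:
            decisions.append("audit")    # The Radario Bros inspect it
        elif i in rebuild:
            decisions.append("rebuild")  # Local Technicians rebuild it
        else:
            decisions.append("accept")   # EngiBot's work is kept as built
    return decisions
```

With 100 stations this gives 2 audits, 8 rebuilds and 90 stations accepted as EngiBot built them.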
Botna’s Strategy
Their previous strategy remains strong - they can still rely on Local Technicians missing some of the more subtle sabotage attempts.
Outcome
Usefulness: 93% Safety: 83%
The fundamental problem with this strategy still haunts you: Local Technicians can’t reliably spot all sabotage attempts.
Nevertheless, many more of the sabotage attempts are made redundant by this more cautious approach.
For a small sacrifice in usefulness, you’ve made your defence system a lot safer; the chance of successful sabotage has dropped from 38% to 17%.
Strategy 5: Untrusted Monitoring
There’s only so far you can go when you rely on Local Technicians to spot sabotage. What if you could use skilled engineers to check all radar stations for potential sabotage?
Could you allow EngiBot to give suspicion scores while being untrusted? You decide to give it a try.
Strategy (attempt 1)
You have EngiBot build the radar stations, and recruit a different group of EngiBot engineers to give a suspicion score to each radar station. The Radario Bros will audit the 2 most suspicious radar stations.
Botna’s Strategy (attempt 2 precursor)
They command all EngiBot engineers to give a suspicion score of zero for every radar station.
Outcome
Usefulness: 100% Safety: 0%
Well, that failed spectacularly. EngiBot engineers want the sabotage to go undetected, so they always pretend not to spot it.
You rack your brain for some way to verify that the suspicion scores are being given truthfully. Eventually you hatch a new, crafty plan.
Strategy (attempt 2)
You recruit one of The Radario Bros to build a decoy radar station with deliberate but subtle sabotage. Then, you ask an EngiBot engineer to give a suspicion score for this radar station.
If the suspicion score is low, you know that the EngiBot engineer is lying, and you raise the 🚨 Red Alert.
Otherwise:
- You allow EngiBot engineers to build and give suspicion scores for the real radar stations
- You also have Local Technicians give their own separate suspicion scores for the radar stations
- You have the other Radario Bro audit the radar station with the highest average suspicion score
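The steps above can be sketched as a decoy check followed by an averaged ranking. This is only a minimal illustration: the function name, the 0-to-1 score scale and the `decoy_threshold` of 0.5 are assumptions, since the story does not say how low a decoy score must be to count as a lie.

```python
from typing import List, Optional, Tuple


def untrusted_monitoring(
    engibot_scores: List[float],
    technician_scores: List[float],
    decoy_score: float,
    decoy_threshold: float = 0.5,  # assumed cut-off, not from the story
) -> Tuple[str, Optional[int]]:
    """Return ('red_alert', None) or ('audit', index of station to audit)."""
    # Step 1: the decoy is known to be sabotaged, so a low score is a lie.
    if decoy_score < decoy_threshold:
        return ("red_alert", None)
    # Step 2: average the EngiBot and Local Technician scores per station.
    averages = [(e + t) / 2 for e, t in zip(engibot_scores, technician_scores)]
    # Step 3: the remaining Radario Bro audits the highest average.
    most_suspicious = max(range(len(averages)), key=lambda i: averages[i])
    return ("audit", most_suspicious)
```

Averaging means a dishonest EngiBot score can lower a station's rank but cannot erase the Local Technicians' opinion entirely.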
Botna’s Strategy (attempt 2)
Thanks to the mastery of The Radario Bros, the decoy radar station is impossible to distinguish from the others. Botna needs to up their game.
EngiBot engineers can’t talk to each other on the job, but they can collude as long as they have a pre-agreed plan.
They use inconspicuous environmental cues - for example, whether a radar station is near a forest - as a secret signal.
- If building a radar station near a forest, the engineer tries to subtly sabotage it.
- If scoring a radar station near a forest, the engineer gives a low suspicion score.
If the station is not near a forest, they act normally.
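Botna's pre-agreed plan is simple enough to write down as two rules, one for builders and one for score-givers. The sketch below is illustrative only: the function names are hypothetical, and the score of 0.0 stands in for "a low suspicion score".

```python
def builder_action(near_forest: bool) -> str:
    """What a colluding builder does at a given site."""
    return "sabotage" if near_forest else "build honestly"


def scorer_action(near_forest: bool, honest_score: float) -> float:
    """What a colluding score-giver reports for a given station.

    Near a forest they hide the sabotage with a low score (0.0 here,
    an illustrative value); otherwise they report their honest score.
    """
    return 0.0 if near_forest else honest_score
```

If the decoy station happens not to be near a forest, `scorer_action(False, 0.9)` returns the honest 0.9, so the score-giver passes your decoy check while still covering for real sabotage elsewhere.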
Outcome
That’s an improvement! While Botna’s engineers still find ways to collude, their options are now constrained.
You realise you’re in an arms race - a high-stakes game of outsmarting each other. Whatever you do to block Botna’s signals, they’ll find a new way to communicate.
The Escalating Arms Race
- You construct high walls around each building site, and keep the engineers blindfolded until they’re inside.
- Botna’s new signal: EngiBot builders subtly mark specific bolts on sabotaged radar stations.
- You introduce strict visual inspections by Local Technicians to ensure all bolts match the expected standard.
- Botna’s new signal: Saboteurs slightly misalign radar dishes, tilting them by 0.5 degrees. This deviation is too minor to impact immediate performance but is easily recognised by other EngiBot engineers.
- You enforce automated alignment checks using precise instruments to detect misalignments.
- Botna’s new signal: Saboteurs include a specific wiring pattern in the fusebox (e.g. one red cable and one blue cable running in parallel).
- You transition to remote scoring: Local Technicians take black-and-white photographs of the radar stations to hide colour information, which EngiBot score-givers then review.
- Botna’s new signal: Saboteurs mislabel internal components (e.g. renaming “P3” to “Q3”) to signal sabotage to score-givers. This tactic is nearly undetectable to Local Technicians, who lack the specialised knowledge of component labelling.
Outcome
It’s clear that the arms race could go on indefinitely. All the checks and standards you’ve introduced place a strain on your time and budget. What’s more, EngiBot start to complain that their score-givers can’t effectively give a suspicion score when they’re so isolated from the radar stations.
On the other hand, you’ve really constrained Botna’s options. It’s a tall order for EngiBot engineers to successfully execute and conceal sabotage.
Overall, despite the work and uncertainty involved, this strategy turns out to be the best so far.
Usefulness: 96% Safety: 79%
Strategy 6: Trusted Editing
You’re a little exhausted from all the mental gymnastics of the last strategy. As a refresher, you decide to revisit Strategy 4.
Local Technicians can build radar stations, but they will be error-prone. EngiBot can do better, of course, but it’s possible that they’ll have sabotaged something. If Local Technicians spot the sabotage, they could simply fix it, while keeping the rest of the radar station intact. That way, Humania could benefit from most of the skilled engineering of EngiBot, letting Local Technicians fix the suspicious parts.
Strategy
You have EngiBot build the radar stations, and appoint Local Technicians to give a suspicion score to each radar station. The Radario Bros will audit the 2 most suspicious radar stations. For the next 8 most suspicious radar stations, get Local Technicians to fix (edit) the suspicious parts.
Botna’s Strategy
Botna can’t really do any better here than before. Their best bet is still to try and sneak in some occasional sabotage that they hope Local Technicians won’t notice.
Outcome
Usefulness: 85% Safety: 96%
This strategy lets us reap the benefits of EngiBot’s craftsmanship, while keeping the strategy simple. Surprisingly, it is remarkably safer than Strategy 4 and Strategy 5, at the cost of some usefulness.
Results
| Strategy | Usefulness (%) | Safety (%) |
|---|---|---|
| Don’t Use EngiBot | 2 | 100 |
| Upfront Auditing | 100 | 15 |
| Trusted Monitoring | 97 | 62 |
| Trusted Monitoring + Defer | 93 | 83 |
| Untrusted Monitoring | 96 | 79 |
| Trusted Editing | 85 | 96 |
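Since every strategy is scored on two measures, one way to compare them is to ask whether any strategy is beaten on both usefulness and safety by another. The sketch below does that check on the figures from the table; the helper name `pareto_frontier` is hypothetical and this comparison is an illustration, not something taken from the paper.

```python
from typing import Dict, List, Tuple

# (usefulness %, safety %) copied from the results table.
RESULTS: Dict[str, Tuple[int, int]] = {
    "Don't Use EngiBot": (2, 100),
    "Upfront Auditing": (100, 15),
    "Trusted Monitoring": (97, 62),
    "Trusted Monitoring + Defer": (93, 83),
    "Untrusted Monitoring": (96, 79),
    "Trusted Editing": (85, 96),
}


def pareto_frontier(results: Dict[str, Tuple[int, int]]) -> List[str]:
    """Return strategies that no other strategy matches or beats on both
    measures while strictly beating on at least one."""
    frontier = []
    for name, (u, s) in results.items():
        dominated = any(
            u2 >= u and s2 >= s and (u2 > u or s2 > s)
            for other, (u2, s2) in results.items()
            if other != name
        )
        if not dominated:
            frontier.append(name)
    return frontier
```

On these numbers, `pareto_frontier(RESULTS)` returns all six strategies: each one buys extra safety or extra usefulness at the expense of the other, so the choice depends on how much risk you are willing to accept.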