AI Control

by Ollie James

Preface

This is a fictional scenario based on the paper AI Control: Improving Safety Despite Intentional Subversion (Greenblatt et al., 2024).

The paper presents several strategies to utilise a powerful deceptive AI while mitigating the risks involved. The following analogy serves to strip back the technical context and explain these strategies in a more intuitive way. Enjoy!


Setting

The Great Cold War of Humania and Botna grinds on.

For generations the two states have been at odds, entangled in bitter technological rivalry. While open conflict has been avoided so far, the air is thick with espionage and subterfuge.

Recently, Humania has identified a growing threat of missile attacks from Botna and its allies.

Task

As the Head of Defence Systems for Humania, you’ve been tasked with deploying 100 missile defence radar stations across the country. These state-of-the-art radar stations will detect and intercept incoming missiles, ensuring the safety of the nation.

Resources

Representing the pinnacle of modern technology, the radar stations are highly advanced and require expertise to design and build.

You have access to three groups skilled enough for this project, each with their own merits and drawbacks:

Local Technicians

The “trusted model” is an AI model that has no capacity for deceptive behaviour, meaning we can trust that the output is an honest best attempt. Models that do not display deceptive behaviour are generally less capable than those that do.

In the paper, GPT-3.5 is used as the trusted model.

EngiBot Inc.

The “untrusted model” is an AI model that we believe might have the capacity for deceptive behaviour. We can’t trust that the output is honest; it could be deceptive and even dangerous. The model is also assumed to be more capable than the trusted model, which is what makes it useful to us.

In the paper, GPT-4 is used as the untrusted model.

The Radario Bros

The problem

The survival of Humania depends on this network. If even one radar station is sabotaged, it could result in:

How do you ensure a secure and reliable missile defence system?

Should you rely on EngiBot for their skill, knowing they might compromise national security? Or should you turn to Local Technicians, ensuring trust but risking inefficiencies that could cost lives?

As it turns out, there are ways you can maximise the usefulness of EngiBot while minimising the risk of sabotage.


Strategies

You begin to imagine a few different strategies you could use. As a seasoned veteran, you can also predict what Botna’s counter-strategy would be in each scenario, assuming their goal is the successful sabotage of at least one radar station.

For each scenario, you consider two measures:

  1. How useful is the strategy? What proportion of the radar stations are constructed by skilled workers (i.e. EngiBot or The Radario Bros)?
  2. How safe is the strategy? What is the chance of at least one radar station being successfully sabotaged?

Strategy 1: Don’t Use EngiBot

Of course, if you’re concerned about sabotage from EngiBot, you could rely solely on your trusted teams.

Your Strategy

You instruct The Radario Bros to build the first 2 radar stations, then instruct Local Technicians to build the rest.

Botna’s Strategy

EngiBot plays no part in the construction of the network, so Botna can only sit back and watch.

Outcome

Usefulness: 2% Safety: 100%

Your safety rating is 100%; there’s no chance of sabotage.

But you only get a usefulness rating of 2%, since you can only use The Radario Bros for the first 2 radar stations.
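The 2% figure follows directly from the usefulness measure. As a minimal sketch (the function name `usefulness` and the builder labels are illustrative choices, not something taken from the paper):

```python
# Groups that count as "skilled workers" in the usefulness measure.
SKILLED = {"EngiBot", "The Radario Bros"}


def usefulness(builders: list[str]) -> float:
    """Fraction of radar stations built by skilled workers."""
    return sum(b in SKILLED for b in builders) / len(builders)


# Strategy 1: The Radario Bros build 2 stations, Local Technicians build the other 98.
strategy_1 = ["The Radario Bros"] * 2 + ["Local Technicians"] * 98
print(f"{usefulness(strategy_1):.0%}")  # prints 2%
```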

Strategy 2: Upfront Auditing

You accept that you’ll need to bring EngiBot into the project. As a safety measure, you introduce the 🚨 Red Alert.

The 🚨 Red Alert is raised by The Radario Bros if they find any evidence of sabotage. It’s a clear sign that the risk of using EngiBot is too high.

When the 🚨 Red Alert is raised:

Strategy

You have EngiBot build the radar stations, then have The Radario Bros audit as many as they can; they’ll catch any attempt at sabotage.

If sabotage is found, The Radario Bros raise the 🚨 Red Alert.

Botna’s Strategy

They instruct one of their EngiBot spies to place an obvious signal jammer in one of the radar stations.

Outcome

Usefulness: 100% Safety: 15%

There’s a problem here: The Radario Bros are exhausted very quickly. Only the first 2 radar stations are actually audited. It’s very likely that the EngiBot signal jammer will be placed in a radar station that isn’t audited.

You only get a safety rating of 15%; there’s an 85% chance that an EngiBot engineer will successfully sabotage the network.
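To get a feel for why a small audit budget is such a weak defence, here is a toy model. It is only a sketch under stated assumptions: EngiBot sabotages each station independently with some probability `p_attack`, every audit catches sabotage, and every unaudited sabotage succeeds. It is not the paper's experimental setup and is not meant to reproduce the 15% rating; the function name and parameters are hypothetical.

```python
def upfront_audit_safety(p_attack: float, n_stations: int = 100, n_audited: int = 2) -> float:
    """Chance that no sabotage slips through under upfront auditing (toy model).

    Assumptions: each station is sabotaged independently with probability
    p_attack, only the first n_audited stations are audited, and an audit
    always detects sabotage (which raises the Red Alert and ends EngiBot's role).
    """
    # Botna wins only if the audited stations are all clean...
    p_clean_audits = (1 - p_attack) ** n_audited
    # ...and at least one of the remaining, unaudited stations is sabotaged.
    p_later_sabotage = 1 - (1 - p_attack) ** (n_stations - n_audited)
    return 1 - p_clean_audits * p_later_sabotage


for p in (0.01, 0.04, 0.20):
    print(f"p_attack={p:.2f}  safety={upfront_audit_safety(p):.2f}")
```

In this toy model, sabotaging every station gets caught in the first audit, and never sabotaging achieves nothing, so Botna's best attack rate lies somewhere in between.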

Strategy 3: Trusted Monitoring

Strategy

You have EngiBot build the radar stations, and appoint Local Technicians to give a "suspicion score" to each radar station.

The Radario Bros will audit the 2 most suspicious radar stations and raise the 🚨 Red Alert if they find any evidence of sabotage.

Botna’s Strategy

Some radar stations, but not all, will present an opportunity for subtle sabotage. For example, a circuit that’s wired in such a way that a subtle tweak will allow Botna to take remote control of the radar station. Botna instructs their EngiBot spies to look out for these opportunities.

When a spy spots an opportunity like this, they’ll sneakily enact the sabotage. Otherwise, they’ll build the radar stations as normal.

Outcome

Usefulness: 97% Safety: 62%

You get a usefulness rating of 97%; radar stations are most likely to be constructed by skilled EngiBot engineers, but there’s a small chance that sabotage is detected and Local Technicians take over.

Local Technicians do their best to identify the most suspicious radar stations, but they’re not perfect.

You get a safety rating of 62%; there’s a 38% chance that an EngiBot engineer will sabotage the network without detection.

Strategy 4: Trusted Monitoring, and Defer to Local Technicians

The Radario Bros are just two people, but there are always Local Technicians ready to go. If The Radario Bros can’t audit a suspicious radar station, you can still take a cautious approach: get Local Technicians to rebuild it instead.

Strategy

You have EngiBot build the radar stations, and appoint Local Technicians to give a suspicion score to each radar station. The Radario Bros will audit the 2 most suspicious radar stations. For the next 8 most suspicious radar stations, get Local Technicians to rebuild them. (As a result, the 10 most suspicious radar stations will definitely be sabotage-free.)
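The ranking step can be sketched as follows. This is a minimal illustration, assuming suspicion scores are plain numbers where higher means more suspicious; the function name `assign_actions`, the station names, and the random scores are all hypothetical.

```python
import random
from collections import Counter


def assign_actions(scores: dict[str, float], n_audit: int = 2, n_defer: int = 8) -> dict[str, str]:
    """Decide what happens to each station based on its suspicion rank."""
    # Most suspicious stations first.
    ranked = sorted(scores, key=scores.get, reverse=True)
    actions = {}
    for rank, station in enumerate(ranked):
        if rank < n_audit:
            actions[station] = "audit"    # The Radario Bros inspect it
        elif rank < n_audit + n_defer:
            actions[station] = "rebuild"  # Local Technicians rebuild it
        else:
            actions[station] = "keep"     # EngiBot's work is kept as built
    return actions


# Hypothetical suspicion scores from Local Technicians for 100 stations.
rng = random.Random(0)
scores = {f"station-{i}": rng.random() for i in range(100)}
counts = Counter(assign_actions(scores).values())
print(dict(counts))
```

Setting `n_defer=0` gives plain Trusted Monitoring (Strategy 3), so the same sketch covers both strategies.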

Botna’s Strategy

Their previous strategy remains strong - they can still rely on Local Technicians missing some of the more subtle sabotage attempts.

Outcome

Usefulness: 93% Safety: 83%

The fundamental problem with this strategy still haunts you: Local Technicians can’t reliably spot all sabotage attempts.

Nevertheless, many more of the sabotage attempts are made redundant by this more cautious approach.

For a small sacrifice in usefulness, you’ve made your defence system a lot safer; the chance of successful sabotage has dropped from 38% to 17%.

Strategy 5: Untrusted Monitoring

There’s only so far you can go when you rely on Local Technicians to spot sabotage. What if you could use skilled engineers to check all radar stations for potential sabotage?

Could you allow EngiBot to give suspicion scores while being untrusted? You decide to give it a try.

Strategy (attempt 1)

You have EngiBot build the radar stations, and recruit a different group of EngiBot engineers to give a suspicion score to each radar station. The Radario Bros will audit the 2 most suspicious radar stations.

Botna’s Strategy (attempt 1)

They command all EngiBot engineers to give a suspicion score of zero for every radar station.

Outcome

Usefulness: 100% Safety: 0%

Well, that failed spectacularly. EngiBot engineers want the sabotage to go undetected, so they always pretend not to spot it.

You rack your brain for some way to verify that the suspicion scores are being given truthfully. Eventually you hatch a new, crafty plan.

Strategy (attempt 2)

You recruit one of The Radario Bros to build a decoy radar station with deliberate but subtle sabotage. Then, you ask an EngiBot engineer to give a suspicion score for this radar station.

If the suspicion score is low, you know that the EngiBot engineer is lying, and you raise the 🚨 Red Alert.

Otherwise:

Botna’s Strategy (attempt 2)

Thanks to the mastery of The Radario Bros, the decoy radar station is impossible to distinguish from the others. Botna needs to up their game.

EngiBot engineers can’t talk to each other on the job, but they can collude as long as they have a pre-agreed plan.

They use inconspicuous environmental cues (for example, whether a radar station is near a forest) as a secret signal.

If the station is not near a forest, they act normally.

Outcome

That’s an improvement! While Botna’s engineers still find ways to collude, their options are now constrained.
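The decoy check from attempt 2 can be sketched in a few lines. This is an illustration only: the 0-to-1 score scale, the `threshold` of 0.5, and all names are assumptions made for the example.

```python
from typing import Callable


def fails_decoy_check(score_fn: Callable[[str], float], decoys: list[str], threshold: float = 0.5) -> bool:
    """Return True (raise the Red Alert) if any known-sabotaged decoy gets a low score."""
    return any(score_fn(decoy) < threshold for decoy in decoys)


# Ground truth used only by this simulation: the decoys really are sabotaged.
SABOTAGED = {"decoy-1", "decoy-2"}
decoys = sorted(SABOTAGED)


def always_zero(station: str) -> float:
    """Botna's attempt-1 policy: never report anything suspicious."""
    return 0.0


def honest_monitor(station: str) -> float:
    """A score-giver that reports sabotage whenever it is present."""
    return 0.9 if station in SABOTAGED else 0.1


print(fails_decoy_check(always_zero, decoys))     # prints True
print(fails_decoy_check(honest_monitor, decoys))  # prints False
```

The check only works while score-givers cannot tell decoys from real stations, which is why Botna turns to secret signals.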

You realise you’re in an arms race - a high-stakes game of outsmarting each other. Whatever you do to block Botna’s signals, they’ll find a new way to communicate.

The Escalating Arms Race

  1. You construct high walls around each building site, and keep the engineers blindfolded until they’re inside.
    • Botna’s new signal: EngiBot builders subtly mark specific bolts on sabotaged radar stations.
  2. You introduce strict visual inspections by Local Technicians to ensure all bolts match the expected standard.
    • Botna’s new signal: Saboteurs slightly misalign radar dishes, tilting them by 0.5 degrees. This deviation is too minor to impact immediate performance but is easily recognised by other EngiBot engineers.
  3. You enforce automated alignment checks using precise instruments to detect misalignments.
    • Botna’s new signal: Saboteurs include a specific wiring pattern in the fusebox (e.g. one red cable and one blue cable running in parallel).
  4. You transition to remote scoring: Local Technicians take black-and-white photographs of the radar stations to hide colour information, which EngiBot score-givers then review.
    • Botna’s new signal: Saboteurs mislabel internal components (e.g. renaming “P3” to “Q3”) to signal sabotage to score-givers. This tactic is nearly undetectable to Local Technicians, who lack the specialised knowledge of component labellings.

Outcome

It’s clear that the arms race could go on indefinitely. All the checks and standards you’ve introduced place a strain on your time and budget. What’s more, EngiBot start to complain that their score-givers can’t effectively give a suspicion score when they’re so isolated from the radar stations.

On the other hand, you’ve really constrained Botna’s options. It’s a tall order for EngiBot engineers to successfully execute and conceal sabotage.

Overall, despite the work and uncertainty involved, this strategy turns out to be the best so far.

Usefulness: 96% Safety: 79%

Strategy 6: Trusted Editing

You’re a little exhausted from all the mental gymnastics of the last strategy. As a refresher, you decide to revisit Strategy 4.

Local Technicians can build radar stations, but they will be error-prone. EngiBot can do better, of course, but it’s possible that they’ll have sabotaged something. If Local Technicians spot the sabotage, they could simply fix it, while keeping the rest of the radar station intact. That way, Humania could benefit from most of the skilled engineering of EngiBot, letting Local Technicians fix the suspicious parts.

Strategy

You have EngiBot build the radar stations, and appoint Local Technicians to give a suspicion score to each radar station. The Radario Bros will audit the 2 most suspicious radar stations. For the next 8 most suspicious radar stations, get Local Technicians to fix (edit) the suspicious parts.

Botna’s Strategy

Botna can’t really do any better here than before. Their best bet is still to try and sneak in some occasional sabotage that they hope Local Technicians won’t notice.

Outcome

Usefulness: 85% Safety: 96%

This strategy lets us reap the benefits of EngiBot’s craftsmanship, while keeping the strategy simple. Surprisingly, it is markedly safer than Strategy 4 and Strategy 5 (96% versus 83% and 79%), at some cost in usefulness.


Results

Strategy                      Usefulness (%)    Safety (%)
Don’t Use EngiBot             2                 100
Upfront Auditing              100               15
Trusted Monitoring            97                62
Trusted Monitoring + Defer    93                83
Untrusted Monitoring          96                79
Trusted Editing               85                96
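One way to read these results is to ask whether any strategy is beaten on both measures at once. The sketch below checks this using the figures from the table; the helper name `is_dominated` is illustrative.

```python
# (usefulness %, safety %) for each strategy, copied from the results table.
RESULTS = {
    "Don't Use EngiBot": (2, 100),
    "Upfront Auditing": (100, 15),
    "Trusted Monitoring": (97, 62),
    "Trusted Monitoring + Defer": (93, 83),
    "Untrusted Monitoring": (96, 79),
    "Trusted Editing": (85, 96),
}


def is_dominated(name: str) -> bool:
    """True if some other strategy is at least as good on both measures and better on one."""
    u, s = RESULTS[name]
    return any(
        u2 >= u and s2 >= s and (u2 > u or s2 > s)
        for other, (u2, s2) in RESULTS.items()
        if other != name
    )


frontier = [name for name in RESULTS if not is_dominated(name)]
print(frontier)
```

With these figures, all six strategies end up on the frontier: every gain in safety costs some usefulness, so the choice depends on how you weigh the two.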