'Emergent' misaligned outcomes
Speculation on non-singular AI catastrophes - governance needed?
Here’s a line of thinking that’s caught me recently. It’s pulling me away somewhat from focusing on the fast takeoff of a superintelligence, a singleton which must be friendly or else. Instead, I want also to consider slower-burning, multipolar - perhaps highly multiagent - problems. These could be catastrophic too.
Coordination issues - a quick gesture
I’m not sure how novel a perspective this is. It’s possibly not much more than an elaboration of:
Coordination issues and races to the bottom may be amplified, locked-in, or set in a death spiral by AI
But that statement alone does not make any specific predictions or help us to reason about how to intervene and make things better. I think with some elaboration, we can start to do those things.
Prima facie, that statement ascribes agency only to the participants in the coordination issues (namely humans, if we suppose there are no agentic AGIs). Aliceburg always wants more, grabs some land. Bobistan also wants more, grabs it back. They fight, destroying much. This already strongly motivates work on careful AI policy and governance.
Optimisation and control
One way I think about factorising optimisation/control capability is into:
goal-directedness
predictive reasoning
causal/counterfactual reasoning1
actuation
We could say a system2 is ‘agentic’ to the extent it has those four things3.
For non-general-purpose optimisation systems, like a tree or a thermostat or a Go-playing computer program, it is evidently sufficient for any or all of those to be implicit and ‘baked in’ (often by a fitting or learning or exploration and selection process, which could embody goals).
But I suspect that strong general abilities in those four areas together yield something like general-purpose optimisation capability4. I think we might only call something AGI if it exhibited strong, generalisable predictive reasoning and perhaps good counterfactual reasoning. Agentic AGI would additionally have goal-directedness and effective actuation.
With this framing, if a misaligned, goal-directed, actuating system in the world - even one comparatively weak today - could incorporate (even non-agentic) AGI capabilities, namely strong predictive and counterfactual reasoning, the combination has the potential to bring into being powerful, misaligned agency.
Misuse of high-intelligence tools
The immediate statement of concern here is perhaps, in extreme form, ‘some influential people/organisations are evil psychopaths’: if we give them oracle AGI, they could become misaligned super-agents (by bringing their own goal-directedness and actuation into the mix, perhaps through surveillance and coercion, or replacement of most humans). They could be company bosses or political bosses or the like.
Even if not powerful or misaligned enough to totally destroy everything of value, empowered by AGI they might be capable of curtailing a majority of potential future value (locked-in oppression, coercion, disinformation, violent struggle, etc.). One illustrative example of a small, misaligned group achieving technological domination (for a time) is the East India Company5. History seems littered with such examples!
Much AI safety conversation is centred on misaligned AI itself agentically achieving dominance. But AI ‘aligned’ with terrible users could yield outcomes just as disempowering or catastrophically destructive.

Just stop the bad guys getting AI?
The concern above has ‘some human(s)’, perhaps shaped by their environment, as the goal-directed component (and perhaps the actuators) of aggregate misaligned agent(s) which go on to subjugate or destroy.
Non-human, but not AI
Sometimes I find myself reaching for phrases like 'non-agentic AGI embedded in an economic context'. I’m trying to express a fuzzier but maybe more concerning concept: that the political or economic environment itself might contain non-human goal-directed or optimising pressures/forces, and that particular tyrannical humans or human organisations are just some of the outputs of those forces6.
Such an optimising force, since inhuman, is liable to be misaligned by default. In fact this more general version of the argument may apply even if we make corrigible agentic AGI - perhaps obvious in hindsight, this has only become clear to me as I write.
In other words, ‘coordination issues and races to the bottom’ may already be responsible for bringing into being goal-directed forces which are generally not totally aligned with human values, and which may be out of ‘our’ control (incorrigible). We could suggestively call such an emergent goal-directed system a ‘miscoordination demon’. I would claim that such systems are already substantially reducing the amount of value in the world and will plausibly continue to do so with or without the introduction of AGI. If the introduction of AGI differentially empowers human agency vs such miscoordination demons, we could imagine either being able to subdue them (perhaps permanently), or being subdued by them (perhaps permanently).
‘Non-human goal-directed forces’? What does that even mean?
Let me be more concrete, or at least make some gestures. The ‘goal-directed’ components I’m thinking of include something like 'some industries' or 'some patterns of human activity' or 'some meme complexes', with their ‘actuators’ being 'humans, our capabilities, and our capital'. The overall 'goal' of such systems is often to capture larger shares of labour and capital, or other energetic resources like political influence: in general to co-opt and accumulate various life forces.
It might seem bold to claim that industries or patterns of collective behaviour could have agency, since there is generally no embodied representation of their goals (where are their brains?).
But first notice that some configurations of human activity give rise to incentive structures where the aggregate behaviour is not wholly aligned with any human's goals (constituent or external) even when it 'could' be - that's just game theory.
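To make that concrete, here is a minimal sketch of the classic Prisoner’s Dilemma, with made-up payoffs. A brute-force search for Nash equilibria finds that the only stable outcome is one that every participant disprefers to an available alternative: an incentive structure ‘misaligned’ with all of its constituents.

```python
# A minimal sketch of an incentive structure misaligned with every
# participant: the Prisoner's Dilemma, with illustrative payoffs.
from itertools import product

ACTIONS = ("cooperate", "defect")

# payoffs[(row_action, col_action)] = (row_player_payoff, col_player_payoff)
payoffs = {
    ("cooperate", "cooperate"): (3, 3),
    ("cooperate", "defect"):    (0, 5),
    ("defect",    "cooperate"): (5, 0),
    ("defect",    "defect"):    (1, 1),
}

def nash_equilibria(payoffs):
    """Action profiles where neither player gains by unilaterally deviating."""
    equilibria = []
    for r, c in product(ACTIONS, repeat=2):
        row_stable = all(payoffs[(a, c)][0] <= payoffs[(r, c)][0] for a in ACTIONS)
        col_stable = all(payoffs[(r, a)][1] <= payoffs[(r, c)][1] for a in ACTIONS)
        if row_stable and col_stable:
            equilibria.append((r, c))
    return equilibria

print(nash_equilibria(payoffs))             # [('defect', 'defect')] -> payoffs (1, 1)
print(payoffs[("cooperate", "cooperate")])  # (3, 3): better for both, yet unstable
```

Neither player’s goals are served by the equilibrium, and no individual deviation can fix it: the ‘misalignment’ lives in the structure, not in any participant.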
The most tenuous piece of the perspective, for me, is to further claim that there is a certain self-perpetuating and self-replicating aspect to some such configurations of incentive structures7. A relatively benign example might be 'Democracy', which seems inclined to spread and defend itself both in geographic extent and depth. More provocative illustrative examples might be 'Environmental Destruction', 'Colonialism', 'The Slave Trade', 'War', and 'Social Media'.
The earlier example I gave of the East India Company may really be just one of the outputs or local appendages of the 'Colonialism demon'. The pertinent point is that there was a substantial lineage of such enterprises, and for a European power of the time not to take part appeared to be to invite substantial domestic economic downsides. Notably, in many cases, domestic populations were to some extent horrified by atrocities committed abroad, were they made aware of them, but the enterprises persisted (and expanded) anyway… even as the constituent humans aged, died, repented, or otherwise churned out. Somewhat similarly, every modern company could be led by people who abhor unsustainable environmental practices or abuse of farmed animals, but who can't end them unilaterally without risking being outcompeted.
Much of my perspective here is reflected in Scott Alexander's Meditations on Moloch - perhaps I’m partly informed by it.
Fighting demons
Political history is in one sense the history of humans making sometimes monumental coordination efforts to overcome the malign manifestations of some of these self-perpetuating incentive structures (e.g. the abolition of slavery, the declaration of universal human rights, child labour regulation ...). Evidently, then, these emergent 'demons' are presently at least weakly controllable by both constituent and external humans: their grip on their actuators (humans and our machines) is weak, and we are competent and foresighted enough to exert some counter-influence (for example by withholding labour or capital, or by political or legal action) to dissolve the incentive structures, disrupt the perpetuation mechanisms, or weaken the demons’ actuation.
But if the introduction of (even corrigible, non-agentic) AGI allowed higher-fidelity incentive-structure replication/perpetuation8, stronger 'selective pressure' on meme complexes or incentive structures9, tighter control over actuators10, or discovery/unlocking of new ranges of possible incentive structures, it could empower one or more such demons irrevocably.
Is this actually a problem?
Here are some initial thoughts about ways we might dissolve this fear.
In unipolar scenarios - where a single concentrated source of power, perhaps a superintelligence, overwhelms any contenders - coordination problems and races to the bottom may be less relevant. But here the evil-psychopath version is still a concern! Let us not equip evil, psychopathic, or fanatical people with the means to extend their reach in this way.
If we expect a multipolar scenario to simultaneously generate enough abundance11 to weaken the incentives which misalign the outcomes of human exertion from the ‘goals of humanity/humans’, the argument may dissolve. Consider: how much aggression, destruction, and lose-lose conflict ultimately stems from people’s lack… or justified fear of lack? Then again, since human nature was ‘trained’ in an environment of relative scarcity, it’s hard to know what actual universal abundance would do (we already live in a world of historically unprecedented abundance and still have coordination problems)! Perhaps much strife stems instead from unjustified fear of a lack that we no longer face, but are primed to feel.
In fact, I don’t think non-agentic AGI is the most probable long-term outcome: the efficiency benefits of having an artefact do the thing, rather than just tell you how to do the thing, seem to create a strong economic incentive in favour of agentic AGI. So even non-agentic AGI seems likely eventually to bootstrap the creation of agentic AGI in a real-world economic context. If aligned AGIs have agency, their net activities may be enough to overpower even human-insurmountable miscoordination demons (for example through superhuman ability to cooperate with other aligned AGIs and with humans, or by simply obsoleting the frame in question).
Perhaps most promising: if AGI (agentic or not) could substantially increase the human ability to coordinate away from sad Nash equilibria, they could kill off the demons. This also applies to non-AI technologies or institutions that we might develop. Much to explore!
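As a toy illustration of what ‘coordinating away’ might mean, here is a follow-on to the earlier Prisoner’s Dilemma sketch: adding an enforceable agreement that fines defection (the penalty is a made-up parameter standing in for contracts, treaties, or reputational cost) changes which outcome is a Nash equilibrium.

```python
# Follow-on sketch: the same dilemma, plus a crude commitment device.
from itertools import product

ACTIONS = ("cooperate", "defect")

base = {
    ("cooperate", "cooperate"): (3, 3),
    ("cooperate", "defect"):    (0, 5),
    ("defect",    "cooperate"): (5, 0),
    ("defect",    "defect"):    (1, 1),
}

def with_penalty(penalty):
    """Subtract `penalty` from a player's payoff whenever they defect."""
    return {(r, c): (pr - penalty * (r == "defect"),
                     pc - penalty * (c == "defect"))
            for (r, c), (pr, pc) in base.items()}

def nash_equilibria(payoffs):
    """Action profiles where neither player gains by unilaterally deviating."""
    return [(r, c) for r, c in product(ACTIONS, repeat=2)
            if all(payoffs[(a, c)][0] <= payoffs[(r, c)][0] for a in ACTIONS)
            and all(payoffs[(r, a)][1] <= payoffs[(r, c)][1] for a in ACTIONS)]

print(nash_equilibria(with_penalty(0)))  # [('defect', 'defect')]: the sad equilibrium
print(nash_equilibria(with_penalty(3)))  # [('cooperate', 'cooperate')]: demon dissolved
```

The point is not the arithmetic, of course: it’s that the hard part in reality is building and sustaining the enforcement mechanism, and that is exactly the kind of coordination capacity which AGI might raise or lower.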
1. i.e. ‘if I do A, X will occur, but if I do B, Y is more likely to occur’
2. System: person, team, ecosystem, organism, industry, software instance, …
3. (At work I coined a ‘four Is’ concept for these: Intentions, Inference, Instructions, and Implementation. I’m not sure if that’s helpful or just alliterative... We’ve used it to discuss the development of some of the software control systems we build.)
4. (I don’t claim this is the only way to get general-purpose optimisation capability.)
5. In brief, a company which at one time maintained a larger army than most (all?) countries, controlled huge territories (including most of India), and operated quite extortionately and oppressively.
6. I think my particular intellectual, spiritual, and political upbringing makes me more able than some to countenance this as a live possibility, but I don’t think it should come over as a remarkable statement to many people.
7. (Analogies could be made to evolution, but care needs to be taken, and it’s unclear to me whether this would actually be helpful.)
8. Unwearying, ceaseless embodiment of a rigid charter; implacable enforcement of a frame of rules and norms; literal copying of software and precise manufacturing of physical structures…
9. Perhaps arms races in destructive potential, competition to adopt ‘efficiency’-boosting practices at humane expense on pain of crushing outcompetition or replacement, …
10. Coercion, persuasion, surveillance, etc.
11. We’re talking about fully automated luxury communism, baby!

