Forthcoming in Global Catastrophic Risks, eds. Nick Bostrom and Milan Cirkovic
Draft of August 31, 2006 Eliezer Yudkowsky (email@example.com)
Singularity Institute for Artificial Intelligence Palo Alto, CA
By far the greatest danger of Artificial Intelligence is that people conclude too early that they understand it. Of course this problem is not limited to the field of AI. Jacques Monod wrote: «A curious aspect of the theory of evolution is that everybody thinks he understands it.» (Monod 1974.) My father, a physicist, complained about people making up their own theories of physics; he wanted to know why people did not make up their own theories of chemistry. (They do.) Nonetheless the problem seems to be unusually acute in Artificial Intelligence. The field of AI has a reputation for making huge promises and then failing to deliver on them. Most observers conclude that AI is hard; as indeed it is. But the embarrassment does not stem from the difficulty. It is difficult to build a star from hydrogen, but the field of stellar astronomy does not have a terrible reputation for promising to build stars and then failing. The critical inference is not that AI is hard, but that, for some reason, it is very easy for people to think they know far more about Artificial Intelligence than they actually do.
In my other chapter for Global Catastrophic Risks, «Cognitive biases potentially affecting judgment of global risks», I opened by remarking that few people would deliberately choose to destroy the world; a scenario in which the Earth is destroyed by mistake is therefore very worrisome. Few people would push a button that they clearly knew would cause a global catastrophe. But if people are liable to confidently believe that the button does something quite different from its actual consequence, that is cause indeed for alarm.
It is far more difficult to write about global risks of Artificial Intelligence than about cognitive biases. Cognitive biases are settled science; one need simply quote the literature. Artificial Intelligence is not settled science; it belongs to the frontier, not to the textbook. And, for reasons discussed in a later section, on the topic of global catastrophic risks of Artificial Intelligence, there is virtually no discussion in the existing technical literature. I have perforce analyzed the matter from my own perspective; given my own conclusions and done my best to support them in limited space. It is not that I have neglected to cite the existing major works on this topic, but that, to the best of my ability to discern, there are no existing major works to cite (as of January 2006).
1 I thank Michael Roy Ames, Eric Baum, Nick Bostrom, Milan Cirkovic, John K Clark, Emil Gilliam, Ben Goertzel, Robin Hanson, Keith Henson, Bill Hibbard, Olie Lamb, Peter McCluskey, and Michael Wilson for their comments, suggestions and criticisms. Needless to say, any remaining errors in this paper are my own.
It may be tempting to ignore Artificial Intelligence because, of all the global risks discussed in this book, AI is hardest to discuss. We cannot consult actuarial statistics to assign small annual probabilities of catastrophe, as with asteroid strikes. We cannot use calculations from a precise, precisely confirmed model to rule out events or place infinitesimal upper bounds on their probability, as with proposed physics disasters. But this makes AI catastrophes more worrisome, not less.
The effect of many cognitive biases has been found to increase with time pressure, cognitive busyness, or sparse information. Which is to say that the more difficult the analytic challenge, the more important it is to avoid or reduce bias. Therefore I strongly recommend reading «Cognitive biases potentially affecting judgment of global risks», pp. XXX-YYY, before continuing with this chapter.
1: Anthropomorphic bias
When something is universal enough in our everyday lives, we take it for granted to the point of forgetting it exists.
Imagine a complex biological adaptation with ten necessary parts. If each of ten genes is independently at 50% frequency in the gene pool (each gene possessed by only half the organisms in that species), then, on average, only 1 in 1024 organisms will possess the full, functioning adaptation. A fur coat is not a significant evolutionary advantage unless the environment reliably challenges organisms with cold. Similarly, if gene B depends on gene A, then gene B has no significant advantage unless gene A forms a reliable part of the genetic environment. Complex, interdependent machinery is necessarily universal within a sexually reproducing species; it cannot evolve otherwise. (Tooby and Cosmides 1992.) One robin may have smoother feathers than another, but they will both have wings. Natural selection, while feeding on variation, uses it up. (Sober 1984.)
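The 1-in-1024 figure is the product of ten independent 50% chances. The following is a minimal sketch of that arithmetic, assuming each gene is simply present or absent in an organism and that the genes are statistically independent; the function name is an illustrative choice, not something drawn from the cited literature.

```python
from fractions import Fraction

def full_adaptation_probability(gene_frequencies):
    """Probability that one organism carries every required gene,
    assuming the genes are inherited independently of one another."""
    probability = Fraction(1)
    for frequency in gene_frequencies:
        probability *= frequency
    return probability

# Ten necessary genes, each present in half the population.
ten_genes_at_half = [Fraction(1, 2)] * 10
print(full_adaptation_probability(ten_genes_at_half))  # 1/1024
```

Under these assumptions the product stays small until every individual frequency is close to 1, which is one way to see why the parts of a complex adaptation must each be nearly universal.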
In every known culture, humans experience joy, sadness, disgust, anger, fear, and surprise (Brown 1991), and indicate these emotions using the same facial expressions (Ekman and Keltner 1997). We all run the same engine under our hoods, though we may be painted different colors; a principle which evolutionary psychologists call the psychic unity of humankind (Tooby and Cosmides 1992). This observation is both explained and required by the mechanics of evolutionary biology.
An anthropologist will not excitedly report of a newly discovered tribe: «They eat food! They breathe air! They use tools! They tell each other stories!» We humans forget how alike we are, living in a world that only reminds us of our differences.
Humans evolved to model other humans: to compete against and cooperate with our own conspecifics. It was a reliable property of the ancestral environment that every powerful intelligence you met would be a fellow human. We evolved to understand our fellow humans empathically, by placing ourselves in their shoes; for that which needed to be modeled was similar to the modeler. Not surprisingly, human beings often «anthropomorphize», that is, expect humanlike properties of that which is not human. In The Matrix (Wachowski and Wachowski 1999), the supposed «artificial intelligence» Agent Smith initially appears utterly cool and collected, his face passive and unemotional. But later, while interrogating the human Morpheus, Agent Smith gives vent to his disgust with humanity, and his face shows the human-universal facial expression for disgust.
Querying your own human brain works fine, as an adaptive instinct, if you need to predict other humans. If you deal with any other kind of optimization process — if, for example, you are the eighteenth-century theologian William Paley, looking at the complex order of life and wondering how it came to be — then anthropomorphism is flypaper for unwary scientists, a trap so sticky that it takes a Darwin to escape.
Experiments on anthropomorphism show that subjects anthropomorphize unconsciously, often flying in the face of their deliberate beliefs. In a study by Barrett and Keil (1996), subjects strongly professed belief in non-anthropomorphic properties of God: that God could be in more than one place at a time, or pay attention to multiple events simultaneously. Barrett and Keil presented the same subjects with stories in which, for example, God saves people from drowning. The subjects answered questions about the stories, or retold the stories in their own words, in such ways as to suggest that God was in only one place at a time and performed tasks sequentially rather than in parallel. Serendipitously for our purposes, Barrett and Keil also tested an additional group using otherwise identical stories about a superintelligent computer named «Uncomp». For example, to simulate the property of omnipresence, subjects were told that Uncomp’s sensors and effectors «cover every square centimeter of the earth and so no information escapes processing». Subjects in this condition also exhibited strong anthropomorphism, though significantly less than the God group. From our perspective, the key result is that even when people consciously believe an AI is unlike a human, they still visualize scenarios as if the AI were anthropomorphic (but not quite as anthropomorphic as God).
Anthropomorphic bias can be classed as insidious: it takes place with no deliberate intent, without conscious realization, and in the face of apparent knowledge.
Back in the era of pulp science fiction, magazine covers occasionally depicted a sentient monstrous alien (colloquially known as a bug-eyed monster or BEM) carrying off an attractive human female in a torn dress. It would seem the artist believed that a non-humanoid alien, with a wholly different evolutionary history, would sexually desire human females. People don’t make mistakes like that by explicitly reasoning: «All minds are likely to be wired pretty much the same way, so presumably a BEM will find human females sexually attractive.» Probably the artist did not ask whether a giant bug perceives human females as attractive. Rather, a human female in a torn dress is sexy, inherently so, as an intrinsic property. They who made this mistake did not think about the insectoid’s mind; they focused on the woman’s torn dress. If the dress were not torn, the woman would be less sexy; the BEM doesn’t enter into it. (This is a case of a deep, confusing, and extraordinarily common mistake which E. T. Jaynes named the mind projection fallacy. (Jaynes and Bretthorst 2003.) Jaynes, a theorist of Bayesian probability, coined «mind projection fallacy» to refer to the error of confusing states of knowledge with properties of objects. For example, the phrase «mysterious phenomenon» implies that mysteriousness is a property of the phenomenon itself. If I am ignorant about a phenomenon, then this is a fact about my state of mind, not a fact about the phenomenon.)
People need not realize they are anthropomorphizing (or even realize they are engaging in a questionable act of predicting other minds) in order for anthropomorphism to supervene on cognition. When we try to reason about other minds, each step in the reasoning process may be contaminated by assumptions so ordinary in human experience that we take no more notice of them than air or gravity. You object to the magazine illustrator: «Isn’t it more likely that a giant male bug would sexually desire giant female bugs?» The illustrator thinks for a moment and then says to you: «Well, even if an insectoid alien starts out liking hard exoskeletons, after the insectoid encounters human females it will soon realize that human females have much nicer, softer skins. If the aliens have sufficiently advanced technology, they’ll genetically engineer themselves to like soft skins instead of hard exoskeletons.»
This is a fallacy-at-one-remove. After the alien’s anthropomorphic thinking is pointed out, the magazine illustrator takes a step back and tries to justify the alien’s conclusion as a neutral product of the alien’s reasoning process. Perhaps advanced aliens could re-engineer themselves (genetically or otherwise) to like soft skins, but would they want to? An insectoid alien who likes hard skeletons will not wish to change itself to like soft skins instead — not unless natural selection has somehow produced in it a distinctly human sense of meta-sexiness. When using long, complex chains of reasoning to argue in favor of an anthropomorphic conclusion, each and every step of the reasoning is another opportunity to sneak in the error.
And it is also a serious error to begin from the conclusion and search for a neutral-seeming line of reasoning leading there; this is rationalization. If it is self-brain-query which produced that first fleeting mental image of an insectoid chasing a human female, then anthropomorphism is the underlying cause of that belief, and no amount of rationalization will change that.
Anyone seeking to reduce anthropomorphic bias in themselves would be well-advised to study evolutionary biology for practice, preferably evolutionary biology with math. Early biologists often anthropomorphized natural selection — they believed that evolution would do the same thing they would do; they tried to predict the effects of evolution by putting themselves «in evolution’s shoes». The result was a great deal of nonsense, which first began to be systematically exterminated from biology in the late 1960s, e.g. by Williams (1966). Evolutionary biology offers both mathematics and case studies to help hammer out anthropomorphic bias.
1.1: The width of mind design space
Evolution strongly conserves some structures. Once other genes evolve which depend on a previously existing gene, that early gene is set in concrete; it cannot mutate without breaking multiple adaptations. Homeotic genes — genes controlling the development of the body plan in embryos — tell many other genes when to activate. Mutating a homeotic gene can result in a fruit fly embryo that develops normally except for not having a head. As a result, homeotic genes are so strongly conserved that many of them are the same in humans and fruit flies — they have not changed since the last common ancestor of humans and bugs. The molecular machinery of ATP synthase is essentially the same in animal mitochondria, plant chloroplasts, and bacteria; ATP synthase has not changed significantly since the rise of eukaryotic life two billion years ago.
Any two AI designs might be less similar to one another than you are to a petunia.
The term «Artificial Intelligence» refers to a vastly greater space of possibilities than does the term «Homo sapiens». When we talk about «AIs» we are really talking about minds-in-general, or optimization processes in general. Imagine a map of mind design space. In one corner, a tiny little circle contains all humans, within a larger tiny circle containing all biological life; and all the rest of the huge map is the space of minds-in-general. The entire map floats in a still vaster space, the space of optimization processes. Natural selection creates complex functional machinery without mindfulness; evolution lies inside the space of optimization processes but outside the circle of minds.
It is this enormous space of possibilities which outlaws anthropomorphism as legitimate reasoning.
2: Prediction and design
We cannot query our own brains for answers about nonhuman optimization processes, whether bug-eyed monsters, natural selection, or Artificial Intelligences. How then may we proceed? How can we predict what Artificial Intelligences will do? I have deliberately asked this question in a form that makes it intractable. By the halting problem, it is impossible to predict whether an arbitrary computational system implements any given input-output function, including, say, simple multiplication. (Rice 1953.) So how is it possible that human engineers can build computer chips which reliably implement multiplication? Because human engineers deliberately use designs that they can understand.
Anthropomorphism leads people to believe that they can make predictions, given no more information than that something is an «intelligence»; anthropomorphism will go on generating predictions regardless, your brain automatically putting itself in the shoes of the «intelligence». This may have been one contributing factor to the embarrassing history of AI, which stems not from the difficulty of AI as such, but from the mysterious ease of acquiring erroneous beliefs about what a given AI design accomplishes.
To make the statement that a bridge will support vehicles up to 30 tons, civil engineers have two weapons: choice of initial conditions, and safety margin. They need not predict whether an arbitrary structure will support 30-ton vehicles, only design a single bridge of which they can make this statement. And though it reflects well on an engineer who can correctly calculate the exact weight a bridge will support, it is also acceptable to calculate that a bridge supports vehicles of at least 30 tons — albeit to assert this vague statement rigorously may require much of the same theoretical understanding that would go into an exact calculation.
Civil engineers hold themselves to high standards in predicting that bridges will support vehicles. Ancient alchemists held themselves to much lower standards in predicting that a sequence of chemical reagents would transform lead into gold. How much lead into how much gold? What is the exact causal mechanism? It’s clear enough why the alchemical researcher wants gold rather than lead, but why should this sequence of reagents transform lead to gold, instead of gold to lead or lead to water?
Some early AI researchers believed that an artificial neural network of layered thresholding units, trained via backpropagation, would be «intelligent». The wishful thinking involved was probably more analogous to alchemy than civil engineering. Magic is on Donald Brown’s list of human universals (Brown 1991); science is not. We don’t instinctively see that alchemy won’t work. We don’t instinctively distinguish between rigorous understanding and good storytelling. We don’t instinctively notice an expectation of positive results which rests on air.
The human species came into existence through natural selection, which operates through the nonchance retention of chance mutations. One path leading to global catastrophe — to someone pressing the button with a mistaken idea of what the button does — is that Artificial Intelligence comes about through a similar accretion of working algorithms, with the researchers having no deep understanding of how the combined system works. Nonetheless they believe the AI will be friendly, with no strong visualization of the exact processes involved in producing friendly behavior, or any detailed understanding of what they mean by friendliness. Much as early AI researchers had strong mistaken vague expectations for their programs’ intelligence, we imagine that these AI researchers succeed in constructing an intelligent program, but have strong mistaken vague expectations for their program’s friendliness.
Not knowing how to build a friendly AI is not deadly, of itself, in any specific instance, if you know you don’t know. It’s mistaken belief that an AI will be friendly which implies an obvious path to global catastrophe.
3: Underestimating the power of intelligence
We tend to see individual differences instead of human universals. Thus when someone says the word «intelligence», we think of Einstein, instead of humans.
Individual differences of human intelligence have a standard label, Spearman’s g, also known as g-factor, a controversial interpretation of the solid experimental result that different intelligence tests are highly correlated with each other and with real-world outcomes such as lifetime income. (Jensen 1999.) Spearman’s g is a statistical abstraction from individual differences of intelligence between humans, who as a species are far more intelligent than lizards. Spearman’s g is abstracted from millimeter height differences among a species of giants.
We should not confuse Spearman’s g with human general intelligence, our capacity to handle a wide range of cognitive tasks incomprehensible to other species. General intelligence is a between-species difference, a complex adaptation, and a human universal found in all known cultures. There may as yet be no academic consensus on intelligence, but there is no doubt about the existence, or the power, of the thing-to-be-explained. There is something about humans that let us set our footprints on the Moon.
But the word «intelligence» commonly evokes pictures of the starving professor with an IQ of 160 and the billionaire CEO with an IQ of merely 120. Indeed there are differences of individual ability apart from «book smarts» which contribute to relative success in the human world: enthusiasm, social skills, education, musical talent, rationality. Note that each factor I listed is cognitive. Social skills reside in the brain, not the liver. And jokes aside, you will not find many CEOs, nor yet professors of academia, who are chimpanzees. You will not find many acclaimed rationalists, nor artists, nor poets, nor leaders, nor engineers, nor skilled networkers, nor martial artists, nor musical composers who are mice. Intelligence is the foundation of human power, the strength that fuels our other arts.
The danger of confusing general intelligence with g-factor is that it leads to tremendously underestimating the potential impact of Artificial Intelligence. (This applies to underestimating potential good impacts, as well as potential bad impacts.) Even the phrase «transhuman AI» or «artificial superintelligence» may still evoke images of book-smarts-in-a-box: an AI that’s really good at cognitive tasks stereotypically associated with «intelligence», like chess or abstract mathematics. But not superhumanly persuasive; or far better than humans at predicting and manipulating human social situations; or inhumanly clever in formulating long-term strategies. So instead of Einstein, should we think of, say, the 19th-century political and diplomatic genius Otto von Bismarck? But that’s only the mirror version of the error. The entire range from village idiot to Einstein, or from village idiot to Bismarck, fits into a small dot on the range from amoeba to human.
If the word «intelligence» evokes Einstein instead of humans, then it may sound sensible to say that intelligence is no match for a gun, as if guns had grown on trees. It may sound sensible to say that intelligence is no match for money, as if mice used money. Human beings didn’t start out with major assets in claws, teeth, armor, or any of the other advantages that were the daily currency of other species. If you had looked at humans from the perspective of the rest of the ecosphere, there was no hint that the soft pink things would eventually clothe themselves in armored tanks. We invented the battleground on which we defeated lions and wolves. We did not match them claw for claw, tooth for tooth; we had our own ideas about what mattered. Such is the power of creativity.
Vinge (1993) aptly observed that a future containing smarter-than-human minds is different in kind. Artificial Intelligence is not an amazing shiny expensive gadget to advertise in the latest tech magazines. Artificial Intelligence does not belong in the same graph that shows progress in medicine, manufacturing, and energy. Artificial Intelligence is not something you can casually mix into a lumpenfuturistic scenario of skyscrapers and flying cars and nanotechnological red blood cells that let you hold your breath for eight hours. Sufficiently tall skyscrapers don’t potentially start doing their own engineering. Humanity did not rise to prominence on Earth by holding its breath longer than other species.
The catastrophic scenario which stems from underestimating the power of intelligence is that someone builds a button, and doesn’t care enough what the button does, because they don’t think the button is powerful enough to hurt them. Or, since underestimating the power of intelligence implies a proportional underestimate of the potential impact of Artificial Intelligence, the (presently tiny) group of concerned researchers and grantmakers and individual philanthropists who handle existential risks on behalf of the human species, will not pay enough attention to Artificial Intelligence. Or the wider field of AI will not pay enough attention to risks of strong AI, and therefore good tools and firm foundations for friendliness will not be available when it becomes possible to build strong intelligences.
And one should not fail to mention — for it also impacts upon existential risk — that Artificial Intelligence could be the powerful solution to other existential risks, and by mistake we will ignore our best hope of survival. The point about underestimating the potential impact of Artificial Intelligence is symmetrical around potential good impacts and potential bad impacts. That is why the title of this chapter is «Artificial Intelligence as a Positive and Negative Factor in Global Risk», not «Global Risks of Artificial Intelligence.» The prospect of AI interacts with global risk in more complex ways than that; if AI were a pure liability, matters would be simple.
4: Capability and motive
There is a fallacy oft-committed in discussion of Artificial Intelligence, especially AI of superhuman capability. Someone says: «When technology advances far enough we’ll be able to build minds far surpassing human intelligence. Now, it’s obvious that how large a cheesecake you can make depends on your intelligence. A superintelligence could build enormous cheesecakes, cheesecakes the size of cities. By golly, the future will be full of giant cheesecakes!» The question is whether the superintelligence wants to build giant cheesecakes. The vision leaps directly from capability to actuality, without considering the necessary intermediate of motive.
The following chains of reasoning, considered in isolation without supporting argument, all exhibit the Fallacy of the Giant Cheesecake:
- A sufficiently powerful Artificial Intelligence could overwhelm any human resistance and wipe out humanity. [And the AI would decide to do so.] Therefore we should not build AI.
- A sufficiently powerful AI could develop new medical technologies capable of saving millions of human lives. [And the AI would decide to do so.] Therefore we should build AI.
- Once computers become cheap enough, the vast majority of jobs will be performable by Artificial Intelligence more easily than by humans. A sufficiently powerful AI would even be better than us at math, engineering, music, art, and all the other jobs we consider meaningful. [And the AI will decide to perform those jobs.] Thus after the invention of AI, humans will have nothing to do, and we’ll starve or watch television.
4.1: Optimization processes
The above deconstruction of the Fallacy of the Giant Cheesecake invokes an intrinsic anthropomorphism — the idea that motives are separable; the implicit assumption that by talking about «capability» and «motive» as separate entities, we are carving reality at its joints. This is a useful slice but an anthropomorphic one.
To view the problem in more general terms, I introduce the concept of an optimization process: a system which hits small targets in large search spaces to produce coherent real-world effects.
An optimization process steers the future into particular regions of the possible. I am visiting a distant city, and a local friend volunteers to drive me to the airport. I do not know the neighborhood. When my friend comes to a street intersection, I am at a loss to predict my friend’s turns, either individually or in sequence. Yet I can predict the result of my friend’s unpredictable actions: we will arrive at the airport. Even if my friend’s house were located elsewhere in the city, so that my friend made a wholly different sequence of turns, I would just as confidently predict our destination. Is this not a strange situation to be in, scientifically speaking? I can predict the outcome of a process, without being able to predict any of the intermediate steps in the process. I will speak of the region into which an optimization process steers the future as that optimizer’s target.
Consider a car, say a Toyota Corolla. Of all possible configurations for the atoms making up the Corolla, only an infinitesimal fraction qualify as a useful working car. If you assembled molecules at random, many many ages of the universe would pass before you hit on a car. A tiny fraction of the design space does describe vehicles that we would recognize as faster, more efficient, and safer than the Corolla. Thus the Corolla is not optimal under the designer’s goals. The Corolla is, however, optimized, because the designer had to hit a comparatively infinitesimal target in design space just to create a working car, let alone a car of the Corolla’s quality. You cannot build so much as an effective wagon by sawing boards randomly and nailing according to coinflips. To hit such a tiny target in configuration space requires a powerful optimization process.
The notion of an «optimization process» is predictively useful because it can be easier to understand the target of an optimization process than to understand its step-by-step dynamics. The above discussion of the Corolla assumes implicitly that the designer of the Corolla was trying to produce a «vehicle», a means of travel. This assumption deserves to be made explicit, but it is not wrong, and it is highly useful in understanding the Corolla.
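The contrast between random assembly and an optimization process can be made concrete with a toy search problem. The sketch below is only an illustration under stated assumptions: the «design space» is the set of 64-bit strings, the target is the single all-ones string, and the optimizer is a simple hill climber. All names here are hypothetical and chosen for this example.

```python
import random
from fractions import Fraction

def score(bits):
    """Count positions that match the (hypothetical) all-ones target."""
    return sum(bits)

def random_hit_probability(n_bits):
    """Chance that one uniformly random configuration is exactly the target."""
    return Fraction(1, 2 ** n_bits)

def hill_climb(n_bits, rng, max_steps=100_000):
    """Flip one random bit per step; keep the flip only if the score does not drop."""
    bits = [rng.randint(0, 1) for _ in range(n_bits)]
    steps = 0
    while score(bits) < n_bits and steps < max_steps:
        i = rng.randrange(n_bits)
        candidate = bits.copy()
        candidate[i] ^= 1
        if score(candidate) >= score(bits):
            bits = candidate
        steps += 1
    return bits, steps

rng = random.Random(0)  # fixed seed so the run is repeatable
bits, steps = hill_climb(64, rng)
print(random_hit_probability(64))  # chance of success per blind draw
print(all(bits), steps)            # whether the climber hit the target, and after how many steps
```

The individual bit flips are unpredictable without knowing the random seed, yet the end state is easy to predict from the scoring rule alone. That is the same asymmetry as in the drive to the airport.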
4.2: Aiming at the target
The temptation is to ask what «AIs» will «want», forgetting that the space of minds-in-general is much wider than the tiny human dot. One should resist the temptation to spread quantifiers over all possible minds. Storytellers spinning tales of the distant and exotic land called Future, say how the future will be. They make predictions. They say, «AIs will attack humans with marching robot armies» or «AIs will invent a cure for cancer». They do not propose complex relations between initial conditions and outcomes — that would lose the audience. But we need relational understanding to manipulate the future, steer it into a region palatable to humankind. If we do not steer, we run the danger of ending up where we are going.
The critical challenge is not to predict that «AIs» will attack humanity with marching robot armies, or alternatively invent a cure for cancer. The task is not even to make the prediction for an arbitrary individual AI design. Rather the task is choosing into existence some particular powerful optimization process whose beneficial effects can legitimately be asserted.
I strongly urge my readers not to start thinking up reasons why a fully generic optimization process would be friendly. Natural selection isn’t friendly, nor does it hate you, nor will it leave you alone. Evolution cannot be so anthropomorphized; it does not work like you do. Many pre-1960s biologists expected natural selection to do all sorts of nice things, and rationalized all sorts of elaborate reasons why natural selection would do it. They were disappointed, because natural selection itself did not start out knowing that it wanted a humanly-nice result, and then rationalize elaborate ways to produce nice results using selection pressures. Thus the events in Nature were outputs of a causally different process from what went on in the pre-1960s biologists’ minds, so that prediction and reality diverged.
5: Friendly AI
It would be a very good thing if humanity knew how to choose into existence a powerful optimization process with a particular target. Or in more colloquial terms, it would be nice if we knew how to build a nice AI.
To describe the field of knowledge needed to address that challenge, I have proposed the term «Friendly AI». In addition to referring to a body of technique, «Friendly AI» might also refer to the product of technique — an AI created with specified motivations. When I use the term Friendly in either sense, I capitalize it to avoid confusion with the intuitive sense of «friendly».
One common reaction I encounter is for people to immediately declare that Friendly AI is an impossibility, because any sufficiently powerful AI will be able to modify its own source code to break any constraints placed upon it.
The first flaw you should notice is a Giant Cheesecake Fallacy. Any AI with free access to its own source would, in principle, possess the ability to modify its own source code in a way that changed the AI’s optimization target. This does not imply the AI has the motive to change its own motives. I would not knowingly swallow a pill that made me enjoy committing murder, because currently I prefer that my fellow humans not die.
But what if I try to modify myself, and make a mistake? When computer engineers prove a chip valid (a good idea if the chip has 155 million transistors and you can’t issue a patch afterward), the engineers use human-guided, machine-verified formal proof. The glorious thing about formal mathematical proof is that a proof of ten billion steps is just as reliable as a proof of ten steps. But human beings are not trustworthy to peer over a purported proof of ten billion steps; we have too high a chance of missing an error. And present-day theorem-proving techniques are not smart enough to design and prove an entire computer chip on their own; current algorithms undergo an exponential explosion in the search space. Human mathematicians can prove theorems far more complex than modern theorem-provers can handle, without being defeated by exponential explosion. But human mathematics is informal and unreliable; occasionally someone discovers a flaw in a previously accepted informal proof. The upshot is that human engineers guide a theorem-prover through the intermediate steps of a proof. The human chooses the next lemma, and a complex theorem-prover generates a formal proof, and a simple verifier checks the steps. That’s how modern engineers build reliable machinery with 155 million interdependent parts.
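The division of labor described above, in which a complex process generates a proof and a simple verifier checks it, can be illustrated with a toy sketch. Everything in it is an assumption made for illustration: the names `Step`, `verify`, and `chain_proof` are hypothetical, the logic is restricted to premises and modus ponens, and nothing here resembles the scale or the tooling of real chip verification. The point is only that the verifier stays a few lines long and applies the same local check to every step, however long the proof grows.

```python
from dataclasses import dataclass
from typing import Tuple

# Illustrative encoding (an assumption): an atom is a string, and an
# implication "A implies B" is the tuple ("->", A, B).

@dataclass(frozen=True)
class Step:
    formula: object              # the formula this step claims
    rule: str                    # "premise" or "mp" (modus ponens)
    refs: Tuple[int, ...] = ()   # indices of the earlier steps relied upon


def verify(premises, steps):
    """The simple verifier: accept only if every step is locally justified."""
    for i, step in enumerate(steps):
        if step.rule == "premise":
            if step.formula not in premises:
                return False
        elif step.rule == "mp":
            if len(step.refs) != 2:
                return False
            a, b = step.refs
            # A step may only cite steps that come strictly before it.
            if not (0 <= a < i and 0 <= b < i):
                return False
            antecedent = steps[a].formula
            implication = steps[b].formula
            if implication != ("->", antecedent, step.formula):
                return False
        else:
            return False
    return True


def chain_proof(n):
    """Stand-in for the complex prover: derive p_n from p0 and p_k -> p_(k+1)."""
    premises = {"p0"} | {("->", f"p{k}", f"p{k + 1}") for k in range(n)}
    steps = [Step("p0", "premise")]
    last = 0  # index of the most recently derived atom
    for k in range(n):
        steps.append(Step(("->", f"p{k}", f"p{k + 1}"), "premise"))
        steps.append(Step(f"p{k + 1}", "mp", (last, len(steps) - 1)))
        last = len(steps) - 1
    return premises, steps


premises, steps = chain_proof(1000)
print(verify(premises, steps))

# Corrupt a single step; the same local check rejects the whole proof.
tampered = list(steps)
tampered[-1] = Step("q", "mp", tampered[-1].refs)
print(verify(premises, tampered))
```

In this sketch the verifier never needs to know how the proof was found. It needs only to confirm that each step follows from earlier ones, which is why its reliability does not degrade with the length of the proof.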
Proving a computer chip correct requires a synergy of human intelligence and computer algorithms, as currently neither suffices on its own. Perhaps a true AI could use a similar combination of abilities when modifying its own code: it would have both the capability to invent large designs without being defeated by exponential explosion, and also the ability to verify its steps with extreme reliability. That is one way a true AI might remain knowably stable in its goals, even after carrying out a large number of self-modifications.
This paper will not explore the above idea in detail. (Though see Schmidhuber 2003 for a related notion.) But one ought to think about a challenge, and study it in the best available technical detail, before declaring it impossible, especially if great stakes depend upon the answer. It is disrespectful to human ingenuity to declare a challenge unsolvable without taking a close look and exercising creativity. It is an enormously strong statement to say that you cannot do a thing: that you cannot build a heavier-than-air flying machine, that you cannot get useful energy from nuclear reactions, that you cannot fly to the Moon. Such statements are universal generalizations, quantified over every single approach that anyone ever has or ever will think up for solving the problem. It only takes a single counterexample to falsify a universal quantifier. The statement that Friendly (or friendly) AI is theoretically impossible dares to quantify over every possible mind design and every possible optimization process, including human beings, who are also minds, some of whom are nice and wish they were nicer. At this point there are any number of vaguely plausible reasons why Friendly AI might be humanly impossible, and it is still more likely that the problem is solvable but no one will get around to solving it in time. But one should not so quickly write off the challenge, especially considering the stakes.
6: Technical failure and philosophical failure
Bostrom (2001) defines an existential catastrophe as one which permanently extinguishes Earth-originating intelligent life or destroys a part of its potential. We can divide potential failures of attempted Friendly AI into two informal fuzzy categories, technical failure and philosophical failure. Technical failure is when you try to build an AI and it doesn’t work the way you think it does: you have failed to understand the true workings of your own code. Philosophical failure is trying to build the wrong thing, so that even if you succeeded you would still fail to help anyone or benefit humanity. Needless to say, the two failures are not mutually exclusive.
The border between these two cases is thin, since most philosophical failures are much easier to explain in the presence of technical knowledge. In theory you ought first to say what you want, then figure out how to get it. In practice it often takes a deep technical understanding to figure out what you want.