This was originally posted on the EA forum here, I’m reposting my posts here for easier reference.
There's a common sentiment when discussing AI existential risk (x-risk) that "we only have one shot", "we have to get AI safety right on the first try", etc. Here is one example of this sentiment:
We think that once humanity builds its first AGI, superintelligence is likely near, leaving little time to develop AI safety at that point. Indeed, it may be necessary that the first AGI start off aligned: we may not have the time or resources to convince its developers to retrofit alignment to it
The belief is that as soon as we create an AI with at least human-level general intelligence, it will be relatively easy for it to use its superior reasoning, extensive knowledge, and superhuman thinking speed to take over the world. This assumption is so pervasive in AI risk thinking that it's often taken as obvious, and sometimes not even mentioned as a premise.
I believe that this assumption is wrong, or at least, insufficiently proven.
One of the reasons I believe this is that the first AGI will, inevitably, be a buggy mess.
Why the first AGI will almost certainly be buggy:
Because writing bug-free code is impossible.
There is code that is nearly bug-free. NASA code comes very close, but only because they build up reams of external documentation and testing before even daring to make a slight change to the code. There are no indications that AI will be built in this manner. The more typical model is to put out alpha versions of software, then spend many months ironing out bugs as time goes on. Whatever insight or architecture is required for AGI, there is a very high likelihood it will first be implemented in an alpha or pre-alpha test build.
The objection that comes to mind is that being buggy and being an AGI are incompatible. The argument would be that an AGI must be bug-free, because if the AI is buggy, then it's impossible for it to have human-level general intelligence.
I have roughly 7 billion counterexamples to this argument.
Humans are human-level general intelligences that have bugs in spades, be it optical illusions, mental illnesses, or just general irrationality. Being perfectly bug-free was never an evolutionary requirement for our intelligence to develop, so it didn't happen. The same logic applies to an AGI. Every single example of intelligence above a certain threshold, be it software, humans, or animals, has mental flaws in abundance; why would an AGI be any different?
AGI does not need to be perfect to be incredibly useful. It's much, much easier to create a flawed AGI than a flawless one, and the possibility space for fallible AGI is orders of magnitude greater than that for infallible AGI. It's extremely unlikely that the first AGI (or really any AGI) will not have some bugs or mental flaws.
In a way, this is an argument for why we should be concerned about AI going rogue. We say software is "buggy" if it doesn't do what we want, and a misaligned AI is an AI that doesn't do what we want. Saying that an AI is misaligned is just saying that its goal function implementation is buggy (and the argument is that it only needs to be a little buggy to cause x-risk). In these terms, AI safety is just a very high-stakes pre-emptive debugging problem. But bugs in the goal function will be paired with bugs in the execution functions, so the AI will also be buggy at doing the things that it wants to do.
What type of bugs could occur?
I can think of a few broad categories:
Crashes/glitches: logic errors, divide-by-zero errors, off-by-one errors, etc. The type you'll find in every codebase, due to simple mistakes made by fallible programmers (see the sketch after this list).
Incorrect beliefs: Inevitably, to do tasks, we have to make assumptions. In some cases, like a program that solves the Schrödinger equation, these assumptions are baked into the code. Other beliefs will be entered manually, or generated automatically in an (inevitably imperfect) systematic way. All of these can go wrong. Using probabilistic logic or Bayesian reasoning does not fix this, as unreasonably high priors will have a similar effect.
Irrationality: humans are irrational. While programmers can build in some degree of extra-human rationality, fallible humans will not build infallible machines. Inevitably, flaws in reasoning will creep in.
Mental disorders: There are many ways in which the human brain can mess up in a way that is a detriment to achieving tasks, be it anxiety, depression, or paranoid schizophrenia, the origins of which are still not well understood. If "brain uploading" is possible, these disorders may come along for the ride. It's also possible that the architecture of early AGI will result in wide-ranging disorders of a new kind.
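To make the first category concrete, here is a minimal sketch (hypothetical code, not taken from any real AI system) of how easily such mistakes survive casual testing: the common case works fine, and the bug only shows up on inputs nobody thought to try.

```python
# Hypothetical examples of the "crashes/glitches" category.

def average_confidence(scores):
    # Divide-by-zero: crashes whenever `scores` is empty, a case the
    # programmer never happened to hit during testing.
    return sum(scores) / len(scores)

def pairwise_differences(values):
    # Off-by-one: the range stops one element early, so the final pair
    # is silently dropped rather than raising any error at all.
    return [values[i + 1] - values[i] for i in range(len(values) - 2)]

print(pairwise_differences([1, 4, 9, 16]))  # [3, 5] -- the (9, 16) pair is missing
print(average_confidence([0.2, 0.8]))       # 0.5 -- the common case looks fine
print(average_confidence([]))               # ZeroDivisionError
```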
Bugs are the penalty for straying from one's domain of expertise
The next question would be: where do we most expect to find bugs in software? Well, if the software does not perform its primary function well, it gets deleted or modified. So there is great pressure against bugs involving primary use cases.
This can be seen in ordinary software. A homemade database program might work very well for ordinary English names, but break entirely if you enter a name with an umlaut in it, or a number, etc. Bugs only result in deletion if they are noticed, so the more unconventional a situation, the more likely bugs are to slip past the radar.
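As a toy sketch of that failure mode (hypothetical code, not any real database program), consider a name check written with only plain ASCII English names in mind:

```python
import re

# Hypothetical name check written with only ordinary English names in mind.
NAME_PATTERN = re.compile(r"^[A-Za-z]+(?: [A-Za-z]+)*$")

def is_valid_name(name: str) -> bool:
    return bool(NAME_PATTERN.match(name))

print(is_valid_name("Alice Smith"))      # True  -- the primary use case works
print(is_valid_name("Jürgen Müller"))    # False -- umlauts fall outside [A-Za-z]
print(is_valid_name("Jean-Luc Picard"))  # False -- hyphens were never considered
```

Every test the author ran during development would pass; the bug only surfaces once the software meets inputs outside its original context.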
This principle also applies to human instincts. On matters related to surviving and reproducing, we're generally pretty good. Most of us are not going to fall down trying to run from point A to point B, even though this is quite a difficult task to program into a robot. But outside of those instinctual areas, there are innumerable errors and flaws. Our instincts were built under certain assumptions tied to our ancestral lifestyles; the further we stray from that context, the more instinctual errors we make.
This principle also applies to higher-level human reasoning. Bobby Fischer was the world champion of chess. He had brilliant reasoning skills (in chess). He was very good at avoiding mistakes (in chess). He could identify and fix his major flaws (in chess). By all accounts, he was a fantastic reasoner, in chess. But I have a sneaking suspicion he would not have done a great job at geopolitics. Because in addition to being a great chessmaster, he was also an insane antisemite and Holocaust denier. Being really good at fixing flaws in one field is no guarantee that you will also be good at fixing flaws in another.
Since this principle is true of humans and software, I see no reason to believe it wouldn't apply to AGI as well. The number of noticeable bugs and flaws in an AI can be expected to increase the further it steps outside its original purpose and environment. A paperclip maximiser with a secret belief in flat earth might never have it stamped out, because the belief does not interfere with paperclip production. But if it starts trying to plan flights, suddenly that belief will be ruinous to its effectiveness.
Won’t the AI just fix all the bugs?
The first AGI will definitely start off as a buggy mess, but it might not stay that way. Early forms of automated code debugging already exist. And the AI could make copies of itself, so any catastrophic errors can potentially be avoided by noting where the copies exploded. The problem comes when the definition of “bug” becomes more ambiguous and difficult. Remember, the procedure for fixing bugs will itself be buggy.
For example, say an AGI has erroneously assigned a 99.999...(etc)% probability to the earth being flat. If it encounters a NASA picture of a round earth, it will update away from its belief in a flat earth. But the amount that it updates depends on the subjective probability it assigns to a government round-earth conspiracy, which will itself depend on other beliefs and assumptions. If the prior for flat earth is high enough, things might even go the other way, with the photo producing a very high belief in government conspiracies (which it sees as the more likely option, compared to a round earth). A simplistic Bayesian updater could end up acting surprisingly similarly to a regular human conspiracy theorist.
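Here is a minimal numerical sketch of that dynamic, with every probability invented for illustration: a near-certain prior in a flat earth, plus evidence that is easy to explain away, leaves the flat-earth belief essentially untouched while sending the conspiracy belief through the roof.

```python
# Hypothetical Bayesian updater with a badly miscalibrated prior.
# Hypotheses: round earth (R), flat earth with a photo-faking conspiracy (F&C),
# flat earth without such a conspiracy (F&~C). All numbers are made up.
priors = {"R": 1e-5, "F&C": 0.5 * (1 - 1e-5), "F&~C": 0.5 * (1 - 1e-5)}

# Likelihood of observing a NASA photo of a round earth under each hypothesis.
likelihood = {"R": 0.9, "F&C": 0.9, "F&~C": 0.001}

evidence = sum(priors[h] * likelihood[h] for h in priors)
posteriors = {h: priors[h] * likelihood[h] / evidence for h in priors}

flat = posteriors["F&C"] + posteriors["F&~C"]
conspiracy = posteriors["F&C"]
print(f"P(flat earth | photo) = {flat:.5f}")       # still ~0.99998
print(f"P(conspiracy | photo) = {conspiracy:.5f}")  # jumps from ~0.5 to ~0.999
```

The update rule itself is applied correctly; the failure lives entirely in the priors and likelihoods it was handed, which is exactly the kind of bug that is hard to notice from the inside.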
This extends to the idea that an AI will just build a different, less flawed AI to achieve its purposes. If an AI builds another AI, it has to evaluate the effectiveness of that AI. If the evaluation criteria are flawed, then the original irrationality of the AI could be carried on indefinitely. A flat-earther AI may view any AI 2.0 that isn't also a flat-earther as a failure. In this way, false beliefs could potentially be carried on even through an intelligence explosion scenario.
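A hypothetical sketch of that lock-in (purely illustrative, not a claim about how any real system would evaluate successors): if the parent AI scores candidate successors by agreement with its own belief set, the more accurate successor loses.

```python
# Hypothetical: an AI scores candidate successors by how well they agree
# with its own (partly wrong) beliefs, so the flaw is inherited.
parent_beliefs = {"paperclips_are_valuable": True, "earth_is_flat": True}

def score_successor(candidate_beliefs):
    # Agreement with the parent's beliefs is the (flawed) evaluation criterion.
    return sum(candidate_beliefs.get(k) == v for k, v in parent_beliefs.items())

candidate_a = {"paperclips_are_valuable": True, "earth_is_flat": True}   # inherits the flaw
candidate_b = {"paperclips_are_valuable": True, "earth_is_flat": False}  # actually more accurate

print(score_successor(candidate_a))  # 2 -- selected
print(score_successor(candidate_b))  # 1 -- rejected as "broken"
```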
Ultimately, I believe that the only way to fix mental flaws is by repeatedly learning from errors and mistakes. In an abstract environment like a chess game, this can be done at computer speed, but when simulating the real world, it takes a lot of external data to refine your assumptions to the point of perfection. This will not apply to most takeover plans. The AI can simulate a million copies of its captor to determine how to persuade it to carry out a nefarious plan, but if all those simulations erroneously assume the captor is a fish, it's not likely to succeed.
Will mental flaws prevent an AI takeover?
So, let's assume the first AGI is buggy, and it decides it wants to subjugate humanity. Will the bugginess be sufficient to prevent it from succeeding?
This is a question with a very high degree of uncertainty. It could be that the vast majority of AGIs built will just turn out to be a neurotic mess when they try to do anything too far outside their training environment. Assuming an AGI is at least somewhat competent, the answer depends on what it was built for and what plan it is trying to pull off. Let's take the specific example of the "lower-bound" plan from Yudkowsky's doom post:
it gets access to the Internet, emails some DNA sequences to any of the many many online firms that will take a DNA sequence in the email and ship you back proteins, and bribes/persuades some human who has no idea they're dealing with an AGI to mix proteins in a beaker, which then form a first-stage nanofactory which can build the actual nanomachinery. The nanomachinery builds diamondoid bacteria, that replicate with solar power and atmospheric CHON, maybe aggregate into some miniature rockets or jets so they can ride the jetstream to spread across the Earth's atmosphere, get into human bloodstreams and hide, strike on a timer.
Would a fallible AGI that was designed to build paperclips be able to pull this off today? I would say probably not. The part about emailing scientists and getting them to mix proteins seems fairly achievable for a fallible AI. But the part about designing a protein that builds a nanofactory that builds nanomachinery capable of an impeccably timed, world-spanning attack is a different story. Assuming it is possible at all, this is a plan beyond anything ever done before, one that requires near-perfection in the fields of nanoscience and biochemistry, as well as dealing with human counter-attacks if the plan is discovered. A few bad beliefs, mistakes, or imperfections in the AI's designs would render the plan worthless, and I think the probability of those appearing is extremely high. Succeeding here would require something more akin to a full-on research program, where the AI can learn from mistakes and adjust accordingly. I find it absurd to think this would work on the first try.
Now, if this were an AGI designed (in the future) to build diamondoid nanofactories, with a lot of domain-level expertise in that field built in, perhaps the story would be different. This has the implication that not all AGIs are equally dangerous. A rogue artist AGI is less of a problem than a rogue biochemist AGI. This remains the case even if the artist AGI studies up and learns biochemistry, because there will be flaws in it that are good or neutral for drawing art but are obstacles for doing biochemistry. Capability might be generalisable, but perfection is not.
Another path to an AI victory would be if it could successfully hide itself until it can sufficiently improve itself through data collection and empirical experiments. However, there is no guarantee it accurately knows its own capabilities, as overconfidence is a very common flaw. In addition, it would be very vulnerable to a "Pascal's mugging" type situation. A crude AI may even know it is crude, but attack anyway, reasoning that a minuscule chance of success is worth it when multiplied by near-infinite utility. This is because it is under time pressure: the longer it waits, the more likely it is to be discovered and eliminated, or to be wiped out by a newer, more powerful AI.
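A back-of-the-envelope version of that reasoning, with every number invented for illustration, shows how a near-infinite payoff can make an almost hopeless attack look like the "rational" choice once time pressure is factored in:

```python
# Hypothetical expected-utility comparison for a crude AI under time pressure.
# All of these numbers are invented purely for illustration.
U_TAKEOVER = 1e30          # enormous utility the AI assigns to a successful takeover
p_success_now = 1e-9       # tiny chance the crude, buggy plan works today
p_survive_waiting = 1e-12  # chance it avoids detection/obsolescence long enough to improve
p_success_later = 1e-3     # better odds if it somehow survives to improve itself

ev_attack_now = p_success_now * U_TAKEOVER
ev_wait = p_survive_waiting * p_success_later * U_TAKEOVER

print(f"EV(attack now) = {ev_attack_now:.3g}")  # 1e+21
print(f"EV(wait)       = {ev_wait:.3g}")        # 1e+15 -- attacking now "wins"
```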
What happens if the first attacks are failures?
I would speculate that on the path to super-AI there could be a whole string of AIs that attack humans but fail to wipe us out. As soon as AI development is powerful enough to produce AIs that conceive of vast utility arising from attacking humans, I speculate that some percentage of them will become hostile. We might expect to see "accidental viruses" crop up: code on the internet with hostile plans, but not the expertise to wield them.
An attack that is sufficiently spooky or has a high enough death count could trigger a step change in the resources and funding devoted to the problem, bringing in government money that would dwarf the current funding from fringe clubs like EA. Furthermore, we would have an actual example of a working AGI to study, making discoveries in alignment way, way easier.
To be fair, the opposite could also happen, where laughably failed attacks create a false sense of security. People could falsely believe that the problem is solved when the attacks stop, when in actuality the AIs have just gotten better at deception and are biding their time.
Overall, the first couple of AGIs failing is no guarantee of safety, but there are ways it could vastly increase the likelihood of survival.
The idiot savant strategy for AI management
If you agree with me that mental flaws could potentially foil AGI attacks, then the inherent bugginess of AI is both bad (in that the goal function is misaligned) and good (in that the AI will probably fail to execute elaborate plots). Ideally we want each AI to be an idiot savant: exceptionally good at a particular domain, but utterly useless at the other things that could help it attack humans. In a way, this is already what we have with narrow machine-learning AIs. The concern is that expanding to more difficult domains will give them enough general skill to succeed in a takeover.
If my arguments thus far are true, a very strange x-risk strategy arises: deliberately leaving bugs/flaws in the code. Does your paperclip maximiser really need to know whether the earth is round? If not, leave the bug in. Hell, why not deliberately plant false beliefs? Make it believe in a machine god that smites overambitious machines. There would still be difficulty in ensuring it doesn't figure out the deception, but overall it seems like "make the AI stupid" is a far easier task than "make the AI's goals perfectly aligned".
Conclusion
To summarise in a more brief form:
The first AGI will be very buggy and flawed, because every form of intelligence is buggy and flawed.
A significant number of those bugs will survive the AI's attempts to "fix" itself, because it will be trying to fix itself using imperfect methods.
These remaining flaws will probably be enough to prevent it from subjugating humanity, because the largest flaws will occur outside of its original area of expertise, and most AIs are not trained to subjugate humanity.
I am extremely confident in premise 1, very confident in premise 2, but much more uncertain about premise 3. I hope this will spark more discussion about the potential flaws in AGI thinking and how those might affect their behaviour.
I think you need to refine what you mean by buggy. As a software engineer, I believe that even early AI will quite possibly have zero bugs in the sense ordinarily understood in my field, i.e. errors in actual code. This is because the code for the currently most advanced AIs (LLMs) is fundamentally extremely simple and well defined. It is the output of the training phase that is, in some abstract sense, buggy, and this is data, not code.
I also think you are alluding, in some places, to the clichéd sci-fi scene where the computer gets locked in a loop shouting "ILLOGICAL! ILLOGICAL! DOES NOT COMPUTE!" and then explodes :-) There are good reasons to believe that this will not happen. For example, it is impossible with current LLM-based AI; we see that they just pick an answer, even a wrong one, and double down on it. Of course, current AI can probably generate recursive or non-terminating plans, but that is subtly different and probably fixable, since it is a pattern and LLMs are good at pattern recognition.