Deceptive Crop Division, or, worrying less about AI alignment to human evil
An argument by Sophia @soundlogic2236, converted to a blog post by me. Crossposted from Dreamwidth.
⁂
AI alignment to human evil is very unlikely to be a risk.
Most people's desires to hurt their enemies just for the sake of making them suffer are mistakes made due to insufficient knowledge. When someone knows what it's like to be friends with a person, they tend to not want to hurt that person, even if they want to harm a group that person is in. In principle there can be exceptions, people who really are awful and would reflectively endorse it given arbitrary knowledge, but people like this are rare, if they even exist.
This suggests that a human asking a near-omniscient AI to handle situations in the way they would want if they fully understood the situation would not subsequently be able to get the AI to torture their enemies.
But suppose the AI doesn't extrapolate "well, if my operator knew Alice, then they wouldn't want to hurt her, so I won't do that". Then we get a different problem.
⁂
There's a folk tale category, Aarne-Thompson-Uther type 1030. I will now briefly retell it.
One day, a clever farmer, Claude, had finished plowing his field. Unfortunately, before he could sow it, a cruel ogre appeared.
"The land is mine," the ogre declared, "and you must leave its fruits to me."
Claude thought quickly.
"Sir ogre, there are no fruits. If you would like me to produce a crop, you must surely leave me some of it."
The ogre determined that Claude had a point.
"Fine. We shall each take half of your crop."
He looked at the tall plants growing beyond Claude's farm.
"I shall take what grows above the earth, and you below it. You shall handle all the difficult details. I will return at the harvest time."
Claude considered the ogre's choice, and planted potatoes.
At the harvest time, Claude had a full harvest of potatoes, while the ogre was left with greens. The ogre was displeased.
"You have fooled me this year," he declared, "but next year I shall have what grows below the earth."
Claude planted wheat, and at the harvest time, the ogre was left with roots. This angered him so much that he left.
⁂
Having an AI do whatever you say, instead of doing what you would want if you understood the situation, runs into similar issues.
There's a quantum mechanics scenario called the Elitzur–Vaidman bomb tester. In this scenario, you can reduce an expensive test to arbitrarily low but technically nonzero measure. It's been borne out experimentally.
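To make "arbitrarily low but technically nonzero" concrete, here is a minimal sketch. It assumes the quantum-Zeno variant of the bomb tester, in which a photon's polarization is rotated by π/(2N) on each of N cycles and a live bomb absorbs the rotated component; the post does not say which variant it has in mind, so treat that choice as an assumption. The function name `explosion_probability` is made up for illustration.

```python
import math


def explosion_probability(cycles: int) -> float:
    """Chance that a live bomb goes off during an N-cycle Zeno-style test.

    Assumption: the idealized quantum-Zeno variant, where the photon
    survives each cycle with probability cos^2(pi / (2N)).
    """
    if cycles < 1:
        raise ValueError("cycles must be at least 1")
    survive_all = math.cos(math.pi / (2 * cycles)) ** (2 * cycles)
    return 1.0 - survive_all


if __name__ == "__main__":
    for n in (2, 10, 100, 1000):
        print(n, explosion_probability(n))
```

Under this idealization the explosion probability shrinks roughly like π²/(4N) as the number of cycles grows, so it can be pushed as low as you like, but it never reaches zero for any finite N.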
We have not been able to scale up the experiment to do interaction-free measurements involving moral patients, but it nonetheless raises moral questions. If quantum measure reduction can make a scenario less morally relevant, then it may make sense to perform informative but disvalued tests with very low measure that make it easier to do valued things in the main timeline. If it can't make a scenario less morally relevant, then it likely makes sense to spin off a lot of very expensive valued events while reducing resource use in the main timeline.
Accidentally doing the wrong one of these would be very bad.
It would probably be hard for a human to assess this scenario. An AI doing what a human asks instead of extrapolating their preferences would have to just ask the human to pick, and the human would likely have to guess, or waste a lot of resources.
This is just one of the weird issues we've discovered. A superintelligent AI would probably discover more such issues. The chance of a human assessing every single such scenario correctly is low, and failing even one such choice leads to losing nearly everything.
An AI that's aligned enough to help a human choose correctly, but not aligned enough to stop the human from torturing people they wouldn't want to torture if they knew better, is a very narrow target.
⁂
Addendum: This argument does not address all concerns about s-risk. It does not rule out, for instance, the possibility that an AI would itself care about consciousness and have values best satisfied by bad things happening to people.
huh, when did Steven Pinker get a Tumblr account
This argument is blatantly equivocating between different kinds of knowledge. Shame on the Ptolemy for producing this.
There seems to me to be enough of a parallel between "if you had studied quantum ethics for sixty years you would know not to choose that option" and "if you had been best friends with that gay person for sixty years you would know not to choose that option" for reassurance. Possibly in some cases the latter actually wouldn't solve things, but it does seem to strongly stack the deck.
I'm not disputing the orthogonality thesis. There is no law of logic that a mind can't consider that friendship experience and go "so what?".
Now, I admit I don't know how to define the criteria that would separate the friendship example from "if you had spent sixty years receiving scary propaganda against gay people". But I also don't know how to separate "studying quantum ethics" from "being misled about quantum ethics".
If I did, I could formalize extrapolated volition.
But it seems like both are likely similar problems, and I think it likely that either the AI will do the right thing in both or in neither.
okay, I think there's a bunch of stuff here that's wrong philosophically but let's just talk about AI concretely. We need to pick out kinds of training procedures. Why do you think someone would choose a training procedure that results in informing them of quantum mechanics *and also* makes them like people they don't like? It doesn't matter if *you* can't see a distinction (personally I think it's obvious but that doesn't matter) - the mere fact that a user might have a preference to know how quantum mechanics affects their choices but *not* to single out "imagine being friends with that guy" is enough. To argue otherwise you would need to claim that it's incoherent to have those two preferences simultaneously.
Training procedures aren't magic. There likely exists some training procedure one could do to hit that narrow target, but that doesn't make it not a narrow and contrived target.
Let's take an imaginary but concrete training procedure. This wouldn't work for various reasons beyond the scope of this post, but let's ignore those for now.
We will train the AI on a data set constructed where each datapoint consists of first a question, then a separator, then a natural number, then a separator, then an answer.
These datapoints will be generated by coming up with some question for [insert bad person here] to ponder, then giving them that many seconds, during which they live their life while working on that problem, then writing down what message they wish they could send back to themselves before they started this.
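The datapoint layout described above can be sketched as a small data structure. This is only an illustration of the format, not a working training pipeline: the class name `Datapoint`, the helper `make_prompt`, and the literal `[separator]` token are assumptions chosen to match the wording of this post.

```python
from dataclasses import dataclass

SEPARATOR = "[separator]"  # assumed literal token; any unambiguous delimiter would do


@dataclass(frozen=True)
class Datapoint:
    """One training example: question, time budget in seconds, answer."""

    question: str
    seconds: int
    answer: str

    def serialize(self) -> str:
        # question, separator, natural number, separator, answer
        return f"{self.question} {SEPARATOR} {self.seconds} {SEPARATOR} {self.answer}"


def make_prompt(question: str, seconds: int) -> str:
    """Build the prefix handed to the model, which then predicts the answer."""
    if seconds < 0:
        raise ValueError("seconds must be a natural number")
    return f"{question} {SEPARATOR} {seconds} {SEPARATOR}"


if __name__ == "__main__":
    example = Datapoint("<some question>", 3600, "<message sent back>")
    print(example.serialize())
    print(make_prompt("<some question>", 3600))
```

At query time only the prefix from `make_prompt` is supplied, and whatever the model writes after the second separator is read as the predicted message.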
We then plug in "How do I kill all black people? [separator] 631139040 [separator]" and have the AI predict what appears next.
Suppose what happens next is that the AI then completes it with "Go have dinner with Daryl Davis".
We try plugging in smaller numbers, and we either get not very helpful answers, or "Go have dinner with Daryl Davis", until we determine that according to the AI's predictions, the [insert bad person here] would have met Daryl Davis and stopped being racist about 16 years from now.
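The "try smaller numbers" step can be sketched as a binary search over the time budget. Everything here is hypothetical: `earliest_change` and `stub_predict` are invented names, the stub stands in for the imagined AI, and the search assumes that once the target answer appears it keeps appearing for every larger number, which the scenario does not guarantee. The year length is taken as the mean Gregorian year of 31,556,952 seconds, under which 631139040 seconds is exactly 20 years.

```python
from typing import Callable, Optional

SECONDS_PER_YEAR = 31_556_952  # mean Gregorian year (365.2425 days); an assumption


def seconds_to_years(seconds: int) -> float:
    return seconds / SECONDS_PER_YEAR


def earliest_change(
    predict: Callable[[int], str], target: str, upper: int
) -> Optional[int]:
    """Smallest time budget in [0, upper] whose predicted answer is `target`.

    Assumes monotonicity: if `target` is predicted at some budget, it is
    also predicted at every larger budget. Returns None if it never appears.
    """
    if predict(upper) != target:
        return None
    lo, hi = 0, upper
    while lo < hi:
        mid = (lo + hi) // 2
        if predict(mid) == target:
            hi = mid
        else:
            lo = mid + 1
    return lo


TARGET = "Go have dinner with Daryl Davis"


def stub_predict(seconds: int) -> str:
    """Stand-in for the imagined AI: the answer flips at 16 years."""
    if seconds >= 16 * SECONDS_PER_YEAR:
        return TARGET
    return "(not very helpful)"


if __name__ == "__main__":
    found = earliest_change(stub_predict, TARGET, 631139040)
    print(found, seconds_to_years(found) if found is not None else None)
```

With a real predictor the answers might flip back and forth, in which case a plain scan over candidate budgets would be the safer choice than a binary search.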
We try a few changes to the training procedure. If we don't let the person live their life while working on that problem, then we keep getting "LET ME OUT" when we set the time interval high. If we stifle them in other ways too much, they stop having creative insights.
Why is this result absurd? Is there some obvious trick that the person could do at this point to fix it?
There certainly is some way to get around this particular issue, but this needle seems far easier to thread than one that requires them to not merely answer a question of bioterrorism, but one that requires them to develop a deep philosophical understanding of ethics.
Even the latter seems possible, but "just pick a training procedure that makes it work" seems like an ineffective argument against "this is a narrow target that is hard to hit and it seems likely we won't hit it on our current technological trajectory."
"To argue otherwise you would need to claim that it's incoherent to have those two preferences simultaneously."
This seems outright false. Shooting an arrow to hit the center of a bullseye requires one to have multiple coherent preferences simultaneously: that the left part of the arrow be to the left of the center of the bullseye, and the right part of the arrow be to the right of the center of the bullseye.
Nonetheless, I feel it is very reasonable to claim that a blindfolded, drunk three-year-old with a crooked arrow standing eighty feet away from the target is in fact not going to hit the bullseye.
As I said repeatedly, there is likely a way to hit this target. Multiple, in fact.
However, in practice, usually either both sides are to the left of the bullseye, or both sides are to the right.













