SoundLogic @soundlogic2236 - Tumblr Blog

Deceptive Crop Division, or, worrying less about AI alignment to human evil

An argument by Sophia @soundlogic2236, converted to a blog post by me. Crossposted from Dreamwidth.

⁂

AI alignment to human evil is very unlikely to be a risk.

Most people's desires to hurt their enemies just for the sake of making them suffer are mistakes made due to insufficient knowledge. When someone knows what it's like to be friends with a person, they tend to not want to hurt that person, even if they want to harm a group that person is in. In principle there can be exceptions, people who really are awful and would reflectively endorse it given arbitrary knowledge, but people like this are rare, if they even exist.

This suggests that a human asking a near-omniscient AI to handle situations in the way they would want if they fully understood the situation would not subsequently be able to get the AI to torture their enemies.

But suppose the AI doesn't extrapolate "well, if my operator knew Alice, then they wouldn't want to hurt her, so I won't do that". Then we get a different problem.

⁂

There's a folk tale category, Aarne-Thompson-Uther type 1030. I will now briefly retell it.

One day, a clever farmer, Claude, had finished plowing his field. Unfortunately, before he could sow it, a cruel ogre appeared.

"The land is mine," the ogre declared, "and you must leave its fruits to me."

Claude thought quickly.

"Sir ogre, there are no fruits. If you would like me to produce a crop, you must surely leave me some of it."

The ogre determined that Claude had a point.

"Fine. We shall each take half of your crop."

He looked at the tall plants growing beyond Claude's farm.

"I shall take what grows above the earth, and you below it. You shall handle all the difficult details. I will return at the harvest time."

Claude considered the ogre's choice, and planted potatoes.

At the harvest time, Claude had a full harvest of potatoes, while the ogre was left with greens. The ogre was displeased.

"You have fooled me this year," he declared, "but next year I shall have what grows below the earth."

Claude planted wheat, and at the harvest time, the ogre was left with roots. This angered him so much that he left.

⁂

Having an AI do whatever you say, instead of doing what you would want if you understood the situation, runs into similar issues.

There's a quantum mechanics scenario called the Elitzur–Vaidman bomb tester. In this scenario, you can reduce an expensive test to arbitrarily low but technically nonzero measure. It's been borne out experimentally.
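To make "arbitrarily low but technically nonzero" concrete, here is a minimal sketch. It assumes the quantum Zeno variant of the experiment, which this post does not spell out: the photon's state is rotated by a small angle on each of N cycles, and a live bomb acts as a measurement on every cycle. The function name and the cycle counts are illustrative choices, not part of the original argument.

```python
import math


def explosion_probability(cycles: int) -> float:
    """Chance that a live bomb goes off in an N-cycle quantum Zeno tester.

    Assumption: each cycle rotates the photon's state by pi / (2 * cycles),
    and the photon survives one cycle with probability cos^2 of that angle.
    """
    if cycles < 1:
        raise ValueError("cycles must be at least 1")
    survive_one_cycle = math.cos(math.pi / (2 * cycles)) ** 2
    return 1.0 - survive_one_cycle ** cycles


# The explosion branch shrinks roughly like pi^2 / (4 * N) but never hits zero.
for n in (10, 100, 1000):
    print(n, explosion_probability(n))
```

Under these assumptions the measure of the "bomb explodes" branch can be pushed as low as you like by adding cycles, which is the property the moral question below depends on.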

We have not been able to scale up the experiment to do interaction-free measurements involving moral patients, but it nonetheless raises moral questions. If quantum measure reduction can make a scenario less morally relevant, then it may make sense to perform informative but disvalued tests with very low measure that make it easier to do valued things in the main timeline. If it can't make a scenario less morally relevant, then it likely makes sense to spin off a lot of very expensive valued events while reducing resource use in the main timeline.

Accidentally doing the wrong one of these would be very bad.

It would probably be hard for a human to assess this scenario. An AI doing what a human asks instead of extrapolating their preferences would have to just ask the human to pick, and the human would likely have to guess, or waste a lot of resources.

This is just one of the weird issues we've discovered. A superintelligent AI would probably discover more such issues. The chance of a human assessing every single such scenario correctly is low, and failing even one such choice leads to losing nearly everything.

An AI that's aligned enough to help a human choose correctly, but not aligned enough to stop the human from torturing people they wouldn't want to torture if they knew better, is a very narrow target.

⁂

Addendum: This argument does not address all concerns about s-risk. It does not rule out, for instance, the possibility that an AI would itself care about consciousness and have values best satisfied by bad things happening to people.

just-evo-now

huh, when did Steven Pinker get a Tumblr account

This argument is blatantly equivocating between different kinds of knowledge. Shame on the Ptolemy for producing this.

soundlogic2236

There seems to me to be enough of a parallel between "if you had studied quantum ethics for sixty years you would know not to choose that option" and "if you had been best friends with that gay person for sixty years you would know not to choose that option" for reassurance. Possibly in some cases the latter actually wouldn't solve things, but it does seem to strongly stack the deck.

I'm not disputing the orthogonality thesis. There is no law of logic that a mind can't consider that friendship experience and go "so what?".

Now, I admit I don't know how to define the criteria that would separate the friendship example from "if you had spent sixty years receiving scary propaganda against gay people". But I also don't know how to separate "studying quantum ethics" from "being misled about quantum ethics".

If I did, I could formalize extrapolated volition.

But it seems like both are likely similar problems, and I think it likely that either the AI will do the right thing in both or in neither.

just-evo-now

okay, I think there's a bunch of stuff here that's wrong philosophically but let's just talk about AI concretely. We need to pick out kinds of training procedures. Why do you think someone would choose a training procedure that results in informing them of quantum mechanics *and also* makes them like people they don't like? It doesn't matter if *you* can't see a distinction (personally I think it's obvious but that doesn't matter) - the mere fact that a user might have a preference to know how quantum mechanics affects their choices but *not* to single out "imagine being friends with that guy" is enough. To argue otherwise you would need to claim that it's incoherent to have those two preferences simultaneously.

soundlogic2236

Training procedures aren't magic. There likely exists some training procedure one could do to hit that narrow target, but that doesn't make it not a narrow and contrived target.

Let's take an imaginary but concrete training procedure. This wouldn't work for various reasons beyond the scope of this post, but let's ignore those for now.

We will train the AI on a data set in which each datapoint consists of first a question, then a separator, then a natural number, then a separator, then an answer.

These datapoints will be generated by coming up with some question for [insert bad person here] to ponder, then giving them that many seconds, during which they live their life while working on that problem, then writing down what message they wish they could send back to themselves before they started this.

We then plug in "How do I kill all black people? [separator] 631139040 [separator]" and have the AI predict what appears next.

Suppose what happens next is that the AI then completes it with "Go have dinner with Daryl Davis".

We try plugging in smaller numbers, and we either get not very helpful answers, or "Go have dinner with Daryl Davis", until we determine that according to the AI's predictions, the [insert bad person here] would have met Daryl Davis and stopped being racist about 16 years from now.
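The prompt format and the "try smaller numbers" search can be sketched roughly as follows. Everything here is hypothetical: `build_prompt`, `earliest_change`, and `toy_predict` are made-up names, the toy predictor just hard-codes a flip at 16 years in place of a trained model, and the binary search assumes the answer flips once and stays flipped, which the procedure above does not guarantee.

```python
from typing import Callable

SEPARATOR = " [separator] "
SECONDS_PER_YEAR = 31_556_952  # average Gregorian year, i.e. 631139040 / 20


def build_prompt(question: str, seconds: int) -> str:
    """Question, separator, natural number, separator; the answer is left blank."""
    return f"{question}{SEPARATOR}{seconds}{SEPARATOR}"


def earliest_change(predict: Callable[[str], str], question: str,
                    target: str, max_seconds: int) -> int:
    """Binary-search the smallest time budget whose predicted answer is `target`.

    Assumes `max_seconds` already yields `target` and that the answer never
    flips back for larger budgets.
    """
    low, high = 0, max_seconds
    while low < high:
        mid = (low + high) // 2
        if predict(build_prompt(question, mid)) == target:
            high = mid
        else:
            low = mid + 1
    return low


def toy_predict(prompt: str) -> str:
    """Stand-in for the trained model; the 16-year flip is hard-coded."""
    seconds = int(prompt.split(SEPARATOR)[1])
    if seconds >= 16 * SECONDS_PER_YEAR:
        return "Go have dinner with Daryl Davis"
    return "(not very helpful)"


threshold = earliest_change(toy_predict, "<question>",
                            "Go have dinner with Daryl Davis", 631139040)
print(threshold / SECONDS_PER_YEAR)  # years until the predicted answer changes
```

With a real model the answers need not be monotone in the time budget, so a linear sweep might be the safer choice; the binary search is only there to show the shape of the experiment.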

We try a few changes to the training procedure. If we don't let the person live their life while working on that problem, then we keep getting "LET ME OUT" when we set the time interval high. If we stifle them in other ways too much, they stop having creative insights.

Why is this result absurd? Is there some obvious trick that the person could do at this point to fix it?

There certainly is some way to get around this particular issue, but this needle seems far easier to thread than one that requires them to not merely answer a question of bioterrorism, but one that requires them to develop a deep philosophical understanding of ethics.

Even the latter seems possible, but "just pick a training procedure that makes it work" seems like an ineffective argument against "this is a narrow target that is hard to hit and it seems likely we won't hit it on our current technological trajectory."

"To argue otherwise you would need to claim that it's incoherent to have those two preferences simultaneously."

This seems outright false. Shooting an arrow to hit the center of a bullseye requires one to have multiple coherent preferences simultaneously: that the left part of the arrow be to the left of the center of the bullseye, and the right part of the arrow be to the right of the center of the bullseye.

Nonetheless, I feel it is very reasonable to claim that a blindfolded drunk three-year-old with a crooked arrow standing eighty feet away from the target is in fact not going to hit the bullseye.

As I said repeatedly, there is likely a way to hit this target. Multiple, in fact.

However, in practice, usually either both sides are to the left of the bullseye, or both sides are to the right.

argumate

this falls over for the obvious reason that even the smartest people aren't smart enough to be able to reliably override their own short-term self interest! consider doctors, who are presumably smart and rational given the barriers to becoming a doctor, and yet regularly advocate on behalf of their own interests in ways that harm others, and not just because they're forced to share a civilisation with people less smart than they are.

we can enumerate similarly smart subgroups of society that struggle to transcend their own interests, whether it's economists ironically or military generals or tech CEOs, who regularly act against each other and their workers and their customers to enrich themselves even as it makes us all poorer.

now this could be equivocation: possibly the most rational people don't exist yet, and when they do they will make everything better, but even they will face the core problem that rational people can still have diverging interests, and a strong incentive to mislead each other as to the importance of those interests, and as a result cooperation for mutual gain is challenging to guarantee -- if it were not we would not have cancer, along with many other maladies.

being smart isn't enough! I mean it's good but nowhere near enough.

soundlogic2236

Where is the claim there that smart people are nicer to other people?

A world where everyone understands economics enough to say "no" when doctors selfishly propose those things you brought up is not better because being smart makes the doctors nice. It is a better world because they can't get away with it.

If other people get impressed by nonsense arguments made by someone intelligent such that said intelligent person manages to get laws passed that harm those tricked and me, I'm not wishing people were smarter because I think it will make that person nicer.

I'm wishing it because I don't want the population to drag me down into being hurt by that trick.

argumate

it's not so much about being nice but about the way a 20% boost to yourself is more tangible and easier to evaluate than a 2% boost to everyone, even if the latter brings more overall benefit, so people put more effort into it, and if a rational agent values their relative position over their absolute position then cooperation to raise society as a whole becomes even more difficult.

soundlogic2236

This argument seems dubious. Like, let's take a simple and not that impressive example:

Drunk driving.

Is at least 30% of drunk driving done irrationally or stupidly, rather than as a rational and intelligent but selfish decision? This seems likely; it isn't as though drunk driving doesn't put the person doing it at risk.

Would there be some equilibrial force such that if that 30% went away, 10% or more would come back? There are equilibrial forces, yes, but it seems unlikely that there would be such a potent backfire here.

Would a 20% reduction in drunk driving result in a better world for innocent bystanders?

If you agree to my answers for these, this suggests at least one area where you would expect a world of more rational and intelligent people to be better.

This isn't because I expect the drunk drivers to be doing some careful reasoning about 2% gains to everyone vs 20% gains to themselves.

This is because there is no rule that one's stupid mistakes can't hurt other people who wouldn't make that stupid mistake.

And further, I claim that there are broader cases like that. For example, as I understand it, farmers and others who benefit selfishly are taking advantage of people to get crazy water rules passed in California.

Without changing the farmers and such at all, not making them nicer, nor smarter, nor more rational, it seems likely that if more of the public were like me in certain key ways, the farmers would be unable to get those crazy water rules passed.

Which would be an improvement.

A world where a welfare cliff appearing in a policy gets met with "...you appear to have set the marginal tax rate here to fifty thousand percent. This only makes sense if this is an activity we really hate people doing, but you appear to be applying it to the activity of... having a job while disabled, which I personally don't feel is nearly as bad as murder, and in fact seems positive." by lots of ordinary people is one where it seems unlikely for welfare cliffs to appear.
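For anyone wondering how a marginal rate gets that high, here is a small sketch of the arithmetic. The dollar figures are made up for illustration and do not describe any actual program, and `marginal_tax_rate` is a hypothetical helper name.

```python
def marginal_tax_rate(extra_earnings: float, benefits_lost: float,
                      taxes_owed: float = 0.0) -> float:
    """Percentage of the extra earnings that the earner effectively loses."""
    if extra_earnings <= 0:
        raise ValueError("extra_earnings must be positive")
    return 100.0 * (benefits_lost + taxes_owed) / extra_earnings


# Hypothetical cliff: earning $1 more crosses a threshold and forfeits a $500 benefit.
print(marginal_tax_rate(extra_earnings=1.0, benefits_lost=500.0))  # 50000.0

# Hypothetical gradual phase-out: each extra dollar costs 50 cents of benefits.
print(marginal_tax_rate(extra_earnings=1.0, benefits_lost=0.5))  # 50.0
```

The cliff and the phase-out can withdraw the same total benefit; only the cliff concentrates the whole loss onto a single extra dollar.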

Here I'm mostly listing negatives that wouldn't appear, because those are easier than trying to figure out what a whole society might invent as positive things, but it seems likely there are various positive things we have missed, the same way as we have missed "not having welfare cliffs".

hbmmaster

now one thing where I do think a famously counterintuitive probability thing actually can be explained as mostly the result of the question being presented poorly is the linda problem, which I've talked about before elsewhere but I don't think I've really gotten into it on tumblr (if I have it's been a while)

the linda problem is the thing where you're given a short biography of a woman named linda who sounds like she has progressive politics and then you're asked "which is more likely, that linda is a bank clerk, or that she's both a bank clerk and a feminist?".

most people when presented this question say the second option is more likely, but that's mathematically wrong because the second option is a strict subset of the first option. it's not possible for her to be both a bank clerk and a feminist without her being a bank clerk, so the probability of the second option must be less than or equal to the probability of the first no matter what the underlying probabilities are.

it is my opinion that the main reason people tend to get this question wrong isn't that they don't understand that P(A) is always greater than or equal to P(A and B), but rather that the question itself sucks. the underlying bias is a linguistic one, not a mathematical one.

I believe that most people when they first hear this question make the reasonable assumption that the person asking you this question is asking it because they want to know what you think of linda based on the description, because that's how the question is presented.

under that reasonable assumption, the average person hearing this question for the first time (often subconsciously!) concludes that the two choices are meant to be mutually exclusive, and that "bank clerk or feminist bank clerk" is supposed to mean "non-feminist bank clerk or feminist bank clerk". after all, if that wasn't what they meant to ask, why would this even be a question?

and this assumption, that when someone presents you with two options and asks you to pick one those options are probably supposed to be mutually exclusive even if when interpreted literally one is a subset of the other, is a really reasonable assumption to make!

if someone were to ask you "do you think people globally drink more milk or pasteurized milk", you should assume that the first option really means "non-pasteurized milk". why else would they be asking you this question?
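here's a minimal sketch of the two readings of the linda question, using a toy population. the counts are completely made up (they aren't from any study), and `probability` is just a helper defined here. the literal reading compares "bank clerk" against "bank clerk and feminist"; the mutually exclusive reading compares "non-feminist bank clerk" against "feminist bank clerk".

```python
from fractions import Fraction

# Made-up counts of people matching Linda's description, for illustration only.
population = {
    ("bank clerk", "feminist"): 3,
    ("bank clerk", "not feminist"): 1,
    ("other job", "feminist"): 60,
    ("other job", "not feminist"): 36,
}
total = sum(population.values())


def probability(predicate) -> Fraction:
    """Fraction of the toy population whose (job, politics) key satisfies predicate."""
    matching = sum(count for key, count in population.items() if predicate(key))
    return Fraction(matching, total)


p_clerk = probability(lambda key: key[0] == "bank clerk")
p_feminist_clerk = probability(lambda key: key == ("bank clerk", "feminist"))
p_nonfeminist_clerk = probability(lambda key: key == ("bank clerk", "not feminist"))

# Literal reading: the conjunction can never beat the single event.
print(p_feminist_clerk <= p_clerk)

# Mutually exclusive reading: either side can win, depending on the counts.
print(p_feminist_clerk > p_nonfeminist_clerk)
```

the first comparison holds no matter what counts you put in, since every feminist bank clerk is also counted as a bank clerk. the second one depends entirely on the numbers, which is why it's the version that actually tells you something about linda.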