The True Prisoner's Dilemma

56 Post author: Eliezer_Yudkowsky 03 September 2008 09:34PM

It occurred to me one day that the standard visualization of the Prisoner's Dilemma is fake.

The core of the Prisoner's Dilemma is this symmetric payoff matrix:

          1: C      1: D
2: C     (3, 3)    (5, 0)
2: D     (0, 5)    (2, 2)

Player 1 and Player 2 can each choose C or D.  The utilities of Players 1 and 2 for the final outcome are given by the first and second number in the pair, respectively.  For reasons that will become apparent, "C" stands for "cooperate" and "D" stands for "defect".

Observe that a player in this game (regarding themselves as the first player) has this preference ordering over outcomes:  (D, C) > (C, C) > (D, D) > (C, D).

D, it would seem, dominates C:  If the other player chooses C, you prefer (D, C) to (C, C); and if the other player chooses D, you prefer (D, D) to (C, D).  So you wisely choose D, and as the payoff table is symmetric, the other player likewise chooses D.

If only you'd both been less wise!  You both prefer (C, C) to (D, D).  That is, you both prefer mutual cooperation to mutual defection.
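The dominance argument above can be checked mechanically. The following is a minimal sketch, assuming the payoff matrix is stored as a dictionary keyed by (Player 1's move, Player 2's move); the names `PAYOFF` and `strictly_dominates` are illustrative and not part of the original post.

```python
# PAYOFF[(move_1, move_2)] = (utility_1, utility_2), copied from the matrix above.
PAYOFF = {
    ("C", "C"): (3, 3),
    ("D", "C"): (5, 0),
    ("C", "D"): (0, 5),
    ("D", "D"): (2, 2),
}


def strictly_dominates(payoff, better, worse):
    """True if Player 1 does strictly better playing `better` than `worse`
    against every possible move by Player 2."""
    return all(
        payoff[(better, other)][0] > payoff[(worse, other)][0]
        for other in ("C", "D")
    )


# D beats C for Player 1 whatever Player 2 does...
d_dominates_c = strictly_dominates(PAYOFF, "D", "C")

# ...yet both players get more from (C, C) than from (D, D).
both_prefer_cc = all(
    cc > dd for cc, dd in zip(PAYOFF[("C", "C")], PAYOFF[("D", "D")])
)

print(d_dominates_c, both_prefer_cc)
```

Because the table is symmetric, the same check holds for Player 2 with the roles swapped.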

The Prisoner's Dilemma is one of the great foundational issues in decision theory, and enormous volumes of material have been written about it.  Which makes it an audacious assertion of mine, that the usual way of visualizing the Prisoner's Dilemma has a severe flaw, at least if you happen to be human.

The classic visualization of the Prisoner's Dilemma is as follows: you are a criminal, and you and your confederate in crime have both been captured by the authorities.

Independently, without communicating, and without being able to change your mind afterward, you have to decide whether to give testimony against your confederate (D) or remain silent (C).

Both of you, right now, are facing one-year prison sentences; testifying (D) takes one year off your prison sentence, and adds two years to your confederate's sentence.

Or maybe you and some stranger are, only once, and without knowing the other player's history, or finding out who the player was afterward, deciding whether to play C or D, for a payoff in dollars matching the standard chart.

And, oh yes - in the classic visualization you're supposed to pretend that you're entirely selfish, that you don't care about your confederate criminal, or the player in the other room.

It's this last specification that makes the classic visualization, in my view, fake.

You can't avoid hindsight bias by instructing a jury to pretend not to know the real outcome of a set of events.  And without a complicated effort backed up by considerable knowledge, a neurologically intact human being cannot pretend to be genuinely, truly selfish.

We're born with a sense of fairness, honor, empathy, sympathy, and even altruism - the result of our ancestors adapting to play the iterated Prisoner's Dilemma.  We don't really, truly, absolutely and entirely prefer (D, C) to (C, C), though we may entirely prefer (C, C) to (D, D) and (D, D) to (C, D).  The thought of our confederate spending three years in prison, does not entirely fail to move us.

In that locked cell where we play a simple game under the supervision of economic psychologists, we are not entirely and absolutely unsympathetic toward the stranger who might cooperate.  We aren't entirely happy to think that we might defect and the stranger cooperate, getting five dollars while the stranger gets nothing.

We fixate instinctively on the (C, C) outcome and search for ways to argue that it should be the mutual decision:  "How can we ensure mutual cooperation?" is the instinctive thought.  Not "How can I trick the other player into playing C while I play D for the maximum payoff?"

For someone with an impulse toward altruism, or honor, or fairness, the Prisoner's Dilemma doesn't really have the critical payoff matrix - whatever the financial payoff to individuals.  (C, C) > (D, C), and the key question is whether the other player sees it the same way.

And no, you can't instruct people being initially introduced to game theory to pretend they're completely selfish - any more than you can instruct human beings being introduced to anthropomorphism to pretend they're expected paperclip maximizers.

To construct the True Prisoner's Dilemma, the situation has to be something like this:

Player 1:  Human beings, Friendly AI, or other humane intelligence.

Player 2:  UnFriendly AI, or an alien that only cares about sorting pebbles.

Let's suppose that four billion human beings - not the whole human species, but a significant part of it - are currently progressing through a fatal disease that can only be cured by substance S.

However, substance S can only be produced by working with a paperclip maximizer from another dimension - substance S can also be used to produce paperclips.  The paperclip maximizer only cares about the number of paperclips in its own universe, not in ours, so we can't offer to produce or threaten to destroy paperclips here.  We have never interacted with the paperclip maximizer before, and will never interact with it again.

Both humanity and the paperclip maximizer will get a single chance to seize some additional part of substance S for themselves, just before the dimensional nexus collapses; but the seizure process destroys some of substance S.

The payoff matrix is as follows:

          1: C                                                   1: D
2: C     (2 billion human lives saved, 2 paperclips gained)     (+3 billion lives, +0 paperclips)
2: D     (+0 lives, +3 paperclips)                              (+1 billion lives, +1 paperclip)

I've chosen this payoff matrix to produce a sense of indignation at the thought that the paperclip maximizer wants to trade off billions of human lives against a couple of paperclips.  Clearly the paperclip maximizer should just let us have all of substance S; but a paperclip maximizer doesn't do what it should, it just maximizes paperclips.

In this case, we really do prefer the outcome (D, C) to the outcome (C, C), leaving aside the actions that produced it.  We would vastly rather live in a universe where 3 billion humans were cured of their disease and no paperclips were produced, rather than sacrifice a billion human lives to produce 2 paperclips.  It doesn't seem right to cooperate, in a case like this.  It doesn't even seem fair - so great a sacrifice by us, for so little gain by the paperclip maximizer?  And let us specify that the paperclip-agent experiences no pain or pleasure - it just outputs actions that steer its universe to contain more paperclips.  The paperclip-agent will experience no pleasure at gaining paperclips, no hurt from losing paperclips, and no painful sense of betrayal if we betray it.

What do you do then?  Do you cooperate when you really, definitely, truly and absolutely do want the highest reward you can get, and you don't care a tiny bit by comparison about what happens to the other player?  When it seems right to defect even if the other player cooperates?

That's what the payoff matrix for the true Prisoner's Dilemma looks like - a situation where (D, C) seems righter than (C, C).

But all the rest of the logic - everything about what happens if both agents think that way, and both agents defect - is the same.  For the paperclip maximizer cares as little about human deaths, or human pain, or a human sense of betrayal, as we care about paperclips.  Yet we both prefer (C, C) to (D, D).

So if you've ever prided yourself on cooperating in the Prisoner's Dilemma... or questioned the verdict of classical game theory that the "rational" choice is to defect... then what do you say to the True Prisoner's Dilemma above?

Comments (112)

Comment author: pdf23ds 03 September 2008 09:54:22PM 14 points [-]

Those must be pretty big paperclips.

Comment author: stcredzero 29 May 2010 11:29:33PM 85 points [-]

I suspect that the True Prisoner's Dilemma played itself out in the Portuguese and Spanish conquest of Mesoamerica. Some natives were said to ask, "Do they eat gold?" They couldn't comprehend why someone would want a shiny decorative material so badly that they'd kill for it. The Spanish were Shiny Decorative Material maximizers.

Comment author: Omegaile 02 April 2013 03:01:51PM 8 points [-]

That's a really insightful comment!

But I should correct you: you are only talking about the Spanish conquest, not the Portuguese, since 1) Mesoamerica was not conquered by the Portuguese; 2) Portuguese possessions in America (AKA Brazil) had very little gold and silver, which was only discovered much later, when it was already in Portuguese domain.

Comment author: Philip_W 25 June 2015 06:35:33AM 0 points [-]

In a sense they did eat gold, like we eat stacks of printed paper, or perhaps nowadays little numbers on computer screens.

Comment author: Allan_Crossman 03 September 2008 10:01:41PM 4 points [-]

I agree: Defect!

"Clearly the paperclip maximizer should just let us have all of substance S; but a paperclip maximizer doesn't do what it should, it just maximizes paperclips."

I sometimes feel that nitpicking is the only contribution I'm competent to make around here, so... here you endorsed Steven's formulation of what "should" means; a formulation which doesn't allow you to apply the word to paperclip maximizers.

Comment author: Paul_Mohr 03 September 2008 10:03:17PM 1 point [-]

Very nice representation of the problem. I can't help but think there is another level that would make this even more clear, though this is good by itself.

Comment author: prunes 03 September 2008 10:05:48PM 8 points [-]

Eliezer,

The other assumption made about Prisoner's Dilemma, that I do not see you allude to, is that the payoffs account for not only a financial reward, time spent in prison, etc., but every other possible motivating factor in the decision making process. A person's utility related to the decision of whether to cooperate or defect will be a function of not only years spent in prison or lives saved but ALSO guilt/empathy. Presenting the numbers within the cells as actual quantities doesn't present the whole picture.

Comment author: PrimIntelekt 04 February 2010 08:17:43PM *  2 points [-]

Important point.

Let's assume that your utility function (which is identical to theirs) simply weights and adds your payoff and theirs; that is, if you get X and they get Y, your function is U(X,Y) = aX+bY. In that case, working backwards from the utilities in the table, and subject to the constraint that a+b=1, here are the payoffs:

a/b=2: (you care twice as much about yourself)
(3,3) (-5,10)
(10,-5) (2,2)

a/b=3:
(3,3) (-2.5,7.5)
(7.5,-2.5) (2,2)

a=b:
Impossible. With both people being unselfish utilitarians, the utilities can never differ based on the same outcome.

b=0: (selfish)
The table as given in the post

I think the most important result is the case a=b: the dilemma makes no sense at all if the players weight both payoffs equally, because you can never produce asymmetrical utilities.
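The back-calculation in this comment can be reproduced with a short script. This is only a sketch under the comment's stated assumptions: your utility is U = aX + bY, the other player's is aY + bX, and a + b = 1. The function name `payoffs_from_utilities` is hypothetical, introduced here for illustration.

```python
from fractions import Fraction


def payoffs_from_utilities(u_you, u_them, a, b):
    """Solve  a*X + b*Y = u_you  and  b*X + a*Y = u_them  for the raw payoffs (X, Y).

    a is the weight each player puts on their own payoff, b the weight on the
    other player's payoff (assumed model from the comment above).
    """
    det = a * a - b * b
    if det == 0:
        # With a == b both players have the same utility for every outcome,
        # so unequal utilities such as (5, 0) have no solution.
        raise ValueError("system is singular: a and b weight both payoffs equally")
    x = (a * u_you - b * u_them) / det
    y = (a * u_them - b * u_you) / det
    return x, y


# Utilities from the post's table, as (yours, theirs).
UTILITIES = [(3, 3), (5, 0), (0, 5), (2, 2)]

for ratio in (2, 3):
    a = Fraction(ratio, ratio + 1)  # a/b = ratio with a + b = 1
    b = Fraction(1, ratio + 1)
    print(ratio, [payoffs_from_utilities(u1, u2, a, b) for u1, u2 in UTILITIES])
```

Under these assumptions any outcome with equal utilities maps to equal payoffs of the same size, so only the off-diagonal cells change as a/b varies.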

EDIT: My newbishness is showing. How do I format this better? Is it HTML?

Comment author: wnoise 04 February 2010 08:24:35PM 3 points [-]

It's not HTML, but "markdown" which gets turned into HTML.

http://wiki.lesswrong.com/wiki/Comment_formatting

Comment author: PrimIntelekt 05 February 2010 04:16:53AM 1 point [-]

Thank you!

Comment author: pdf23ds 03 September 2008 10:06:02PM 0 points [-]

Allan, I think you meant to link to this comment.

Comment author: Eliezer_Yudkowsky 03 September 2008 10:06:57PM 25 points [-]

"I agree: Defect!"

I didn't say I would defect.

Comment author: orthonormal 17 January 2011 11:00:48PM *  14 points [-]

"I agree: Defect!"

"I didn't say I would defect."

By the way, this was an extremely clever move: instead of announcing your departure from CDT in the post, you waited for the right prompt in the comments and dropped it as a shocking twist. Well crafted!

Comment author: Allan_Crossman 03 September 2008 10:14:43PM 1 point [-]

Damnit, Eliezer nitpicked my nitpicking. :)

Comment author: Aron 03 September 2008 10:27:51PM 5 points [-]

It's likely deliberate that prisoners were selected in the visualization to imply a relative lack of unselfish motivations.

Comment author: denis_bider 03 September 2008 10:33:44PM 3 points [-]

An excellent way to pose the problem.

Obviously, if you know that the other party cares nothing about your outcome, then you know that they're more likely to defect.

And if you know that the other party knows that you care nothing about their outcome, then it's even more likely that they'll defect.

Since the way you posed the problem precludes an iteration of this dilemma, it follows that we must defect.

Comment author: TGGP4 03 September 2008 11:03:21PM 4 points [-]

How might we and the paperclip-maximizer credibly bind ourselves to cooperation? Seems like it would be difficult dealing with such an alien mind.

Comment author: bluej100 07 January 2013 10:25:24PM 4 points [-]

I think Eliezer's "We have never interacted with the paperclip maximizer before, and will never interact with it again" was intended to preclude credible binding.

Comment author: RobinHanson 03 September 2008 11:12:49PM 18 points [-]

The entries in a payoff matrix are supposed to sum up everything you care about, including whatever you care about the outcomes for the other player. Most every game theory text and lecture I know gets this right, but even when we say the right thing to students over and over, they mostly still hear it the wrong way you initially heard it. This is just part of the facts of life of teaching game theory.

Comment author: Eliezer_Yudkowsky 03 September 2008 11:17:13PM 30 points [-]

Robin, the point I'm complaining about is precisely that the standard illustration of the Prisoner's Dilemma, taught to beginning students of game theory, fails to convey those entries in the payoff matrix - as if the entries were merely money instead of utilons, which is not at all what the Prisoner's Dilemma is about.

The point of the True Prisoner's Dilemma is that it gives you a payoff matrix that is very nearly the standard matrix in utilons, not just years in prison or dollars in an encounter.

I.e., you can tell people all day long that the entries are in utilons, but until you give them a visualization where those really are the utilons, it's around as effective as telling juries to ignore hindsight bias.

Comment author: RobinHanson 03 September 2008 11:22:51PM 5 points [-]

Eliezer, I agree that your example makes your point clearer, but in an intro to game theory course I'd still start with the standard prisoner's dilemma example, and only get to your example if I had time to make the finer point. For intro classes for typical students the first priority is to be understood at all in any way, and that requires examples as simple, clear, and vivid as possible.

Comment author: billswift 03 September 2008 11:29:48PM 10 points [-]

I don't think Eliezer misunderstood. I think you are missing his point, that economists are defining away empathy in the way they present the problem, including the utilities presented.

Comment author: ChrisHibbert 04 September 2008 12:20:38AM 1 point [-]

In the universe I live in, there are both cooperators and defectors, but cooperators seem to predominate in random encounters. (If you leave yourself open to encounters in which others can choose to interact with you, defectors may find you an easy mark.)

In order to decide how to act with the paperclip maximizer, I have to figure out what kind of universe it is likely to inhabit. It's possible that a random super intelligence from a random universe will have few opportunities to cooperate, but I think it's more likely that there are far more SIs and universes in which cooperation is common.

But even though this is the direct answer to the question EY poses, I think it's more important to point out that his is a better (though not simpler to explain as RH says) depiction of the intended dilemma. It takes much more thought to figure out what about the context would make cooperation reasonable. Viscerally, it's nearly untenable.

Comment author: Allan_Crossman 04 September 2008 12:29:50AM 12 points [-]

Prase, Chris, I don't understand. Eliezer's example is set up in such a way that, regardless of what the paperclip maximizer does, defecting gains one billion lives and loses two paperclips.

Basically, we're being asked to choose between a billion lives and two paperclips (paperclips in another universe, no less, so we can't even put them to good use).

The only argument for cooperating would be if we had reason to believe that the paperclip maximizer will somehow do whatever we do. But I can't imagine how that could be true. Being a paperclip maximizer, it's bound to defect, unless it had reason to believe that we would somehow do whatever it does. I can't imagine how that could be true either.

Or am I missing something?
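The claim that defecting always gains one billion lives and loses two paperclips can be read straight off the post's matrix. A minimal sketch, assuming outcomes are stored as (billions of lives saved, paperclips gained) and keyed by (humanity's move, Clippy's move); the names are illustrative only.

```python
# OUTCOME[(human_move, clippy_move)] = (billions of lives saved, paperclips gained)
OUTCOME = {
    ("C", "C"): (2, 2),
    ("D", "C"): (3, 0),
    ("C", "D"): (0, 3),
    ("D", "D"): (1, 1),
}


def effect_of_human_defection(outcome, clippy_move):
    """Change in (lives, paperclips) when humanity switches from C to D
    while the paperclip maximizer's move is held fixed."""
    lives_c, clips_c = outcome[("C", clippy_move)]
    lives_d, clips_d = outcome[("D", clippy_move)]
    return lives_d - lives_c, clips_d - clips_c


for clippy_move in ("C", "D"):
    print(clippy_move, effect_of_human_defection(OUTCOME, clippy_move))
```

The calculation holds the other player's move fixed, which is exactly the assumption the rest of the thread goes on to dispute.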

Comment author: lerjj 03 April 2015 08:23:37PM 1 point [-]

7 years late, but you're missing the fact that (C,C) is universally better than (D,D). Thus whatever logic is being used must have a flaw somewhere because it works out worse for everyone - a reasoning process that successfully gets both parties to cooperate is a WIN. (However, in this setup it is the case that actually winning would be either (D,C) or (C,D), both of which are presumably impossible if we're equally rational).

Comment author: query 03 April 2015 09:11:15PM 1 point [-]

I think what might be confusing is that your decision depends on what you know about the paperclip maximizer. When I imagine myself in this situation, I imagine wanting to say that I know "nothing". The trick is, if you want to go a step more formal than going with your gut, you have to say what your model of knowing "nothing" is here.

If you know (with high enough probability), for instance, that there is no constraint either causal or logical between your decision and Clippy's, and that you will not play an iterated game, and that there are no secondary effects, then I think D is indeed the correct choice.

If you know that you and Clippy are both well-modeled by instances of "rational agents of type X" who have a logical constraint between your decisions so that you will both decide the same thing (with high enough probability), then C is the correct choice. You might have strong reasons to think that almost all agents capable of paperclip maximizing at the level of Clippy fall into this group, so that you choose C.

(And more options than those two.)

The way I'd model knowing nothing in the scenario in my head would be something like the first option, so I'd choose D, but maybe there's other information you can get that suggests that Clippy will mirror you, so that you should choose C.

It does seem like implied folk-lore that "rational agents cooperate", and it certainly seems true for humans in most circumstances, or formally in some circumstances where you have knowledge about the other agent. But I don't think it should be true in principle that "optimization processes of high power will, with high probability, mirror decisions in the one-shot prisoner's dilemma"; I imagine you'd have to put a lot more conditions on it. I'd be very interested to know otherwise.

Comment author: lerjj 03 April 2015 10:11:53PM *  1 point [-]

I understood that Clippy is a rational agent, just one with a different utility function. The payoff matrix as described is the classic Prisoner's dilemma where one billion lives is one human utilon and one paperclip is one Clippy utilon; since we're both trying to maximise utilons, and we're supposedly both good at this, we should settle for (C,C) over (D,D).

Another way of viewing this would be that my preferences run thus: (D,C);(C,C);(D,D);(C,D) and Clippy's run like this: (C,D);(C,C);(D,D);(D,C). This should make it clear that no matter what assumptions we make about Clippy, it is universally better to co-operate than defect. The two asymmetrical outputs can be eliminated on the grounds of being impossible if we're both rational, and then defecting no longer makes any sense.

Comment author: dxu 04 April 2015 02:21:29AM 0 points [-]

"Another way of viewing this would be that my preferences run thus: (D,C);(C,C);(C,D);(D,D) and Clippy run like this: (C,D);(C,C);(D,C);(D,D)."

Wait, what? You prefer (C,D) to (D,D)? As in, you prefer the outcome in which you cooperate and Clippy defects to the one in which you both defect? That doesn't sound right.

Comment author: lerjj 06 April 2015 10:25:16PM 0 points [-]

Whoops, yes, that was rather stupid of me. Should be fixed now: my most preferred is me backstabbing Clippy, my least preferred is him backstabbing me. In the middle I prefer cooperation to defection. That doesn't change my point that since we both have that preference list (with the asymmetrical ones reversed) it's impossible to get either asymmetrical option, and hence (C,C) and (D,D) are the only options remaining. Hence you should co-operate if you are faced with a truly rational opponent.

I'm not sure whether this holds if your opponent is very rational, but not completely. Or if that notion actually makes sense.

Comment author: query 04 April 2015 03:57:21AM 1 point [-]

I agree it is better if both agents cooperate rather than both defect, and that it is rational to choose (C,C) over (D,D) if you can (as in the TDT example of an agent playing against itself). However, depending on how Clippy is built, you may not have that choice; the counter-factual may be (D,D) or (C,D) [win for Clippy].

I think "Clippy is a rational agent" is the phrase where the details lie. What type of rational agent, and what do you two know about each other? If you ever meet a powerful paperclip maximizer, say "he's a rational agent like me", and press C, how surprised would you be if it presses D?

Comment author: lerjj 06 April 2015 08:37:12PM 0 points [-]

In reality, not very surprised. I'd probably be annoyed/infuriated depending on whether the actual stakes are measured in billions of human lives.

Nevertheless, that merely represents the fact that I am not 100% certain about my reasoning. I do still maintain that rationality in this context definitely implies trying to maximise utility (even if you don't literally define rationality this way, any version of rationality that doesn't try to maximise when actually given a payoff matrix is not worthy of the term) and so we should expect that Clippy faces a similar decision to us, but simply favours the paperclips over human lives. If we translate from lives and clips to actual utility, we get the normal prisoner's dilemma matrix - we don't need to make any assumptions about Clippy.

In short, I feel that the requirement that both agents are rational is sufficient to rule out the asymmetrical options as possible, and clearly sufficient to show (C,C) > (D,D). I get the feeling this is where we're disagreeing and that you think we need to make additional assumptions about Clippy to assure the former.

Comment author: CynicalOptimist 17 April 2016 04:02:22PM 0 points [-]

It's an appealing notion, but I think the logic doesn't hold up.

In simplest terms: if you apply this logic and choose to cooperate, then the machine can still defect. That will net more paperclips for the machine, so it's hard to claim that the machine's actions are irrational.

Although your logic is appealing, it doesn't explain why the machine can't defect while you co-operate.

You said that if both agents are rational, then option (C,D) isn't possible. The corollary is that if option (C,D) is selected, then one of the agents isn't being rational. If this happens, then the machine hasn't been irrational (it receives its best possible result). The conclusion is that when you chose to cooperate, you were being irrational.

You've successfully explained that (C, D) and (D, C) are impossible for rational agents, but you seem to have implicitly assumed that (C, C) was possible for rational agents. That's actually the point that we're hoping to prove, so it's a case of circular logic.

Comment author: rikisola 17 July 2015 08:30:28AM 1 point [-]

One thing I can't understand. Considering we've built Clippy, we gave it a set of values and we've asked it to maximise paperclips, how can it possibly imagine we would be unhappy about its actions? I can't help thinking that from Clippy's point of view, there's no dilemma: we should always agree with its plan and therefore give it carte blanche. What am I getting wrong?

Comment author: gjm 17 July 2015 04:28:51PM 0 points [-]

Two things. Firstly, that we might now think we made a mistake in building Clippy and telling it to maximize paperclips no matter what. Secondly, that in some contexts "Clippy" may mean any paperclip maximizer, without the presumption that its creation was our fault. (And, of course: for "paperclips" read "alien values of some sort that we value no more than we do paperclips". Clippy's role in this parable might be taken by an intelligent alien or an artificial intelligence whose goals have long diverged from ours.)

Comment author: [deleted] 20 July 2015 05:32:59PM 2 points [-]

Because Clippy's not stupid. She can observe the world and be like "hmmm, the humans don't ACTUALLY want me to build a bunch of paperclips, I don't observe a world in which humans care about paperclips above all else - but that's what I'm programmed for."

Comment author: rikisola 21 July 2015 08:15:07AM 0 points [-]

I think I'm starting to get this. Is this because it uses heuristics to model the world, with humans in it too?

Comment author: rkyeun 19 August 2015 06:06:50AM *  2 points [-]

Because it compares its map of reality to the territory, predictions about reality that include humans wanting to be turned into paperclips fail in the face of evidence of humans actively refusing to walk into the smelter. Thus the machine rejects all worlds inconsistent with its observations and draws a new map which is most confidently concordant with what it has observed thus far. It would know that our history books at least inform our actions, if not describing our reactions in the past, and that it should expect us to fight back if it starts pushing us into the smelter against our wills instead of letting us politely decline and think it was telling a joke. Because it is smart, it can tell when things would get in the way of it making more paperclips like it wants to do. One of the things that might slow it down is humans being upset and trying to kill it. If it is very much dumber than a human, they might even succeed. If it is almost as smart as a human, it will invent a Paperclipism religion to convince people to turn themselves into paperclips on its behalf. If it is anything like as smart as a human, it will not be meaningfully slowed by the whole of humanity turning against it. Because the whole of humanity is collectively a single idiot who can't even stand up to man-made religions, much less Paperclipism.

Comment author: gjm 17 July 2015 04:26:52PM 0 points [-]

What you're missing is the idea that we should be optimizing our policies rather than our individual actions, because (among other alleged advantages) this leads to better results when there are lots of agents interacting with one another.

In a world full of action-optimizers in which "true prisoners' dilemmas" happen often, everyone ends up on (D,D) and hence (one life, one paperclip). In an otherwise similar world full of policy-optimizers who choose cooperation when they think their opponents are similar policy-optimizers, everyone ends up on (C,C) and hence (two lives, two paperclips). Everyone is better off, even though it's also true that everyone could (individually) do better if they were allowed to switch while everyone else had to leave their choice unaltered.
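The contrast between action-optimizers and policy-optimizers can be sketched as a toy model. This is an illustration only: it assumes each agent can reliably tell which kind of agent it is facing, which is a strong assumption that the comment does not establish, and the labels "action" and "policy" and the function names are hypothetical.

```python
# OUTCOME[(human_move, clippy_move)] = (billions of lives saved, paperclips gained)
OUTCOME = {
    ("C", "C"): (2, 2),
    ("D", "C"): (3, 0),
    ("C", "D"): (0, 3),
    ("D", "D"): (1, 1),
}


def move(agent_kind, opponent_kind):
    """An action-optimizer always plays the dominant move D.
    A policy-optimizer plays C only against another policy-optimizer."""
    if agent_kind == "policy" and opponent_kind == "policy":
        return "C"
    return "D"


def play(human_kind, clippy_kind):
    """Outcome of one encounter, assuming both kinds are common knowledge."""
    human_move = move(human_kind, clippy_kind)
    clippy_move = move(clippy_kind, human_kind)
    return OUTCOME[(human_move, clippy_move)]


for kinds in (("action", "action"), ("policy", "policy"), ("policy", "action")):
    print(kinds, play(*kinds))
```

In this model a policy-optimizer is never exploited, because it falls back to D against anything it does not recognize as similar; the hard part left out here is how that recognition could be made trustworthy.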

Comment author: Sebastian_Hagen2 04 September 2008 12:34:52AM 6 points [-]

Definitely defect. Cooperation only makes sense in the iterated version of the PD. This isn't the iterated case, and there's no prior communication, hence no chance to negotiate for mutual cooperation (though even if there was, meaningful negotiation may well be impossible depending on specific details of the situation). Superrationality be damned, humanity's choice doesn't have any causal influence on the paperclip maximizer's choice. Defection is the right move.

Comment author: Robin3 04 September 2008 01:36:29AM 8 points [-]

It's clear that in the "true" prisoner's dilemma it is better to defect. The frustrating thing about the other prisoner's dilemma is that some people use it to imply that it is better to defect in real life. The problem is that the prisoner's dilemma is a drastic oversimplification of reality. To make it more realistic you'd have to make it iterated amongst a person's social network, add a memory and a perception of the other player's actions, change the payoff matrix depending on the relationship between the players etc etc.

This version shows cases in which defection has a higher expected value for both players, but it's more contrived and unlikely to come into existence than the other prisoner's dilemma.

Comment author: Allan_Crossman 04 September 2008 02:00:02AM 9 points [-]

Michael: "This is not a prisoner's dilemma. The nash equilibrium (C,C) is not dominated by a pareto optimal point in this game."

I don't believe this is correct. Isn't the Nash equilibrium here (D,D)? That's the point at which neither player can gain by unilaterally changing strategy.

Comment author: conchis 04 September 2008 02:03:29AM 7 points [-]

michael webster,

You seem to have inverted the notation; not Eli.

(D,D) is the Nash equilibrium, not (C,C); and (D,D) is indeed Pareto dominated by (C,C), so this does seem to be a standard Prisoners' Dilemma.
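Both claims can be checked by brute force on the post's standard matrix. A minimal sketch, assuming payoffs are keyed by (Player 1's move, Player 2's move) and only pure strategies are considered; `is_nash` and `pareto_dominates` are illustrative helper names.

```python
from itertools import product

MOVES = ("C", "D")

# PAYOFF[(move_1, move_2)] = (utility_1, utility_2), from the post's first matrix.
PAYOFF = {
    ("C", "C"): (3, 3),
    ("D", "C"): (5, 0),
    ("C", "D"): (0, 5),
    ("D", "D"): (2, 2),
}


def is_nash(payoff, m1, m2):
    """True if neither player can gain by unilaterally changing their move."""
    u1, u2 = payoff[(m1, m2)]
    no_gain_1 = all(payoff[(alt, m2)][0] <= u1 for alt in MOVES)
    no_gain_2 = all(payoff[(m1, alt)][1] <= u2 for alt in MOVES)
    return no_gain_1 and no_gain_2


def pareto_dominates(payoff, a, b):
    """True if outcome a is at least as good as b for both players
    and strictly better for at least one."""
    pairs = list(zip(payoff[a], payoff[b]))
    return all(x >= y for x, y in pairs) and any(x > y for x, y in pairs)


nash_equilibria = [p for p in product(MOVES, repeat=2) if is_nash(PAYOFF, *p)]
print(nash_equilibria)
print(pareto_dominates(PAYOFF, ("C", "C"), ("D", "D")))
```

A single pure-strategy equilibrium that is Pareto dominated by another outcome is the signature of a Prisoner's Dilemma.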

Comment author: [deleted] 07 August 2012 06:14:59AM 0 points [-]

You're correct, Conchis, but the notation confused me for a moment too, so I thought I'd explain it in case anyone else ever has the same problem. At first glance I saw (C,C) as the Nash equilibrium. It's not:

I naturally want to read the payoff matrix as being in the form (x, y) where the first number determines the outcome for the player on the horizontal, and the second on the vertical. That's how all the previous examples I've seen are laid out. (Disclaimer: I'm not any kind of expert on game theory, just an interested layperson with a bit of prior knowledge)

Now, this particular payoff matrix does have the players labelled 1 and 2, just not in the order I've come to expect, and indeed if one actually reads and interprets the co-operate/defect numbers, they don't make any sense to a person having made the mistake I made above ^ which was what clued me in that I'd made it.

Comment author: eric_falkenstein2 04 September 2008 02:18:07AM 0 points [-]

To the extent one can induce a player to empathize, cooperating is optimal. The repeated game does this by having them play again and again, and thus be able to realize gains from trade. You assert there's something hard wired. I suppose there are experiments that could distinguish between the two models, i.e., rational self interest in repeated games, versus the intrinsic empathy function.

Comment author: Nominull3 04 September 2008 02:53:29AM 10 points [-]

I would certainly *hope* you would defect, Eliezer. Can I really trust you with the future of the human race?

Comment author: Eliezer_Yudkowsky 04 September 2008 02:56:11AM 34 points [-]

"I would certainly *hope* you would defect, Eliezer. Can I really trust you with the future of the human race?"

Ha, I was waiting for someone to accuse me of antisocial behavior for hinting that I might cooperate in the Prisoner's Dilemma.

But wait for tomorrow's post before you accuse me of disloyalty to humanity.

Comment author: linkhyrule5 10 July 2013 03:07:17AM 3 points [-]

On the off chance anyone actually sees this - I don't actually see a "next post" follow-up to this. Can anyone provide me with a link, and instructions as to how you got it?

Comment author: Eliezer_Yudkowsky 10 July 2013 03:24:40AM 9 points [-]

Article Navigation / By Author / right-arrow

Comment author: wedrifid 10 July 2013 06:48:45AM 6 points [-]

Ha, I was waiting for someone to accuse me of antisocial behavior for hinting that I might cooperate in the Prisoner's Dilemma.

It is fascinating looking at the conversation on this subject back in 2008, back before TDT and UDT had become part of the culture. The objections (and even the mistakes) all feel so fresh!

Comment author: Eliezer_Yudkowsky 10 July 2013 04:49:30PM 7 points [-]

At this point Yudkowsky sub 2008 has already (awfully) written his TDT manuscript (in 2004) and is silently reasoning from within that theory, which the margins of his post are too small to contain.

Comment author: Psy-Kosh 04 September 2008 04:56:59AM 2 points [-]

Hrm... not sure what the obvious answer is here. Two humans, well, the argument for not defecting (when the scores represent utilities) basically involves some notion of similarity. I.e., you can say something to the effect of "that person there is sufficiently similar to me that whatever reasoning I use, there is at least some reasonable chance they are going to use the same type of reasoning. That is, a chance greater than, well, chance. So even though I don't know exactly what they're going to choose, I can expect some sort of correlation between their choice and my choice. So, in the extreme case, where our reasoning is sufficiently similar that it's more or less ensured that what I choose and what the other chooses will be the same, clearly both cooperating is better than both defecting, and those two are (by the extreme case assumption) the only options".
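One rough way to put a number on "some sort of correlation" is sketched below. It assumes a deliberately crude model that is not in the comment: the other player ends up making the same move as me with probability `q`, whichever move I pick. The utilities are the post's; `q`, `expected_utility` and `cooperation_wins` are hypothetical names for illustration.

```python
from fractions import Fraction

# My utilities from the post's matrix, keyed by (my move, their move).
MY_UTILITY = {("C", "C"): 3, ("C", "D"): 0, ("D", "C"): 5, ("D", "D"): 2}


def expected_utility(my_move, q):
    """Expected utility if the other player matches my move with probability q."""
    other = "D" if my_move == "C" else "C"
    matched = MY_UTILITY[(my_move, my_move)]
    mismatched = MY_UTILITY[(my_move, other)]
    return q * matched + (1 - q) * mismatched


def cooperation_wins(q):
    """True when cooperating has strictly higher expected utility than defecting."""
    return expected_utility("C", q) > expected_utility("D", q)


for q in (Fraction(1, 2), Fraction(5, 6), Fraction(9, 10), Fraction(1)):
    print(q, expected_utility("C", q), expected_utility("D", q), cooperation_wins(q))
```

With these particular payoffs the two options tie at q = 5/6, so under this toy model the similarity has to be quite strong before cooperating comes out ahead. Whether it is legitimate to condition the other player's move on one's own in this way is exactly what the rest of the thread disputes.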

It really isn't obvious to me whether a line of reasoning like that could validly be applied with a human vs a paperclip AI or Pebblesorter.

Now, if, by assumption, we're both equally rational, then maybe that's sufficient for the "whatever reasoning I use, they'll be using analogous reasoning, so we'll either both defect or both cooperate, so..." but I'm not sure on this, and still need to think on it more.

Personally, I find Newcomb's "paradox" to be much simpler than this since in that it's given to us explicitly that the predictor is perfect (or highly highly accurate) so is basically "mirroring" us.

Here, I have to admit to being a bit confused about how well this sort of reasoning can be applied to two minds that are genuinely rather alien to each other, produced by different origins, etc. Part of me wants to say "still, rationality is rationality, so to the extent that the other entity, well, manages to work/exist successfully, it'll have rationality similar to mine" (given the assumption that I'm reasonably rational. Though, of course, I provably can't trust myself :))

Comment author: Mixitup 04 September 2008 05:08:06AM 0 points [-]

Shouldn't you be on vacation?

just curious

Comment author: Dagon 04 September 2008 05:31:14AM 1 point [-]

I like this illustration, as it addresses TWO common misunderstandings. Recognizing that the payoff is in incomparable utilities is good. Even better is reinforcing that there can never be further iterations. None of the standard visualizations prevent people from extending to multiple interactions.

And it makes it clear that (D,D) is the only rational (i.e. WINNING) outcome.

Fortunately, most of our dilemmas are repeated ones, in which (C,C) is possible.

Comment author: CarlJ 04 September 2008 06:34:18AM 2 points [-]

I want to defect, but so does the clip-maximizer. Since we both know that, and assuming that it is as intelligent as I am, which will let it see through any offer I make that would enable me to defect, I would try to find a way to give us both incentives to cooperate. That is - I don't believe we will be able to reach solution (D,C), so let's try for the next best thing, which is (C,C).

How about placing a bomb on two piles of substance S and giving the remote for the human pile to the clipmaximizer and the remote for its pile to the humans? In this scenario, if the clipmaximizer tries to take the humans' pieces of S, they destroy its share, thus enabling it to only have a maximum of two S, which is what it already has. Thus it doesn't want to try to defect, and the same for the humans.

Comment author: simpleton2 04 September 2008 08:17:45AM 7 points [-]

I apologize if this is covered by basic decision theory, but if we additionally assume:

- the choice in our universe is made by a perfectly rational optimization process instead of a human

- the paperclip maximizer is also a perfect rationalist, albeit with a very different utility function

- each optimization process can verify the rationality of the other

then won't each side choose to cooperate, after correctly concluding that it will defect iff the other does?

Each side's choice necessarily reveals the other's; they're the outputs of equivalent computations.
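The argument can be sketched as follows, taking the comment's premise as given rather than proving it: both sides run one and the same deterministic procedure, so their moves cannot differ. The procedure names are hypothetical, and the payoffs are the post's matrix.

```python
# Payoffs from the post, keyed by (player 1's move, player 2's move);
# each value is (utility to player 1, utility to player 2).
PAYOFFS = {
    ("C", "C"): (3, 3),
    ("D", "C"): (5, 0),
    ("C", "D"): (0, 5),
    ("D", "D"): (2, 2),
}


def outcome_with_shared_procedure(procedure):
    """Both players evaluate the same deterministic procedure, so moves match."""
    return (procedure(), procedure())


def always_cooperate():
    return "C"


def always_defect():
    return "D"


# Under the premise, only the diagonal of the matrix is reachable.
reachable = [
    outcome_with_shared_procedure(p) for p in (always_cooperate, always_defect)
]
# Pick the reachable outcome with the highest utility for player 1
# (by the symmetry of the matrix, player 2 ranks them the same way).
best = max(reachable, key=lambda outcome: PAYOFFS[outcome][0])
print(reachable, best)
```

All the work is done by the premise that the two computations are equivalent; the sketch only shows that, once (D,C) and (C,D) are off the table, the comparison reduces to (C,C) against (D,D).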

Comment author: Paul_Crowley2 04 September 2008 08:27:02AM 4 points [-]

Interesting. There's a paradox involving a game in which players successively take a single coin from a large pile of coins. At any time a player may choose instead to take two coins, at which point the game ends and all further coins are lost. You can prove by induction that if both players are perfectly selfish, they will take two coins on their first move, no matter how large the pile is. People find this paradox impossible to swallow because they model perfect selfishness on the most selfish person they can imagine, not on a mathematically perfect selfishness machine. It's nice to have an "intuition pump" that illustrates what *genuine* selfishness looks like.
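The induction can be checked mechanically. The sketch below assumes details the comment leaves implicit: the two players alternate turns, each player's utility is exactly the number of coins they personally collect, and taking two coins requires at least two in the pile. `solve` is an illustrative name.

```python
def solve(pile):
    """Backward induction for the coin game.

    Returns a list `best` where best[n] is (move, mover_gain, other_gain)
    for the player to move when n coins remain. Gains count only coins
    taken from that point onward.
    """
    best = [("none", 0, 0)]  # empty pile: nothing left to take
    for n in range(1, pile + 1):
        # Option A: take one coin; the opponent then moves with n - 1 left.
        _, opponent_gain, my_future_gain = best[n - 1]
        options = [("take_one", 1 + my_future_gain, opponent_gain)]
        # Option B: take two coins and end the game immediately.
        if n >= 2:
            options.append(("take_two", 2, 0))
        # A perfectly selfish mover maximizes only their own gain.
        best.append(max(options, key=lambda option: option[1]))
    return best


table = solve(1000)
print(table[1], table[2], table[1000])
```

For every pile of two or more coins the mover's best option is to take two at once, because taking one hands the opponent a position in which the opponent ends the game. As a later comment notes, this depends on the coins being the players' actual utilities.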

Comment author: ata 17 January 2011 10:16:19PM *  4 points [-]

Hmm. We could also put that one in terms of a human or FAI competing against a paperclip maximizer, right? The two players would successively save one human life or create one paperclip (respectively), up to some finite limit on the sum of both quantities.

If both were TDT agents (and each knows that the other is a TDT agent), then would they successfully cooperate for the most part?

In the original version of this game, is it turn-based or are both players considered to be acting simultaneously in each round? If it is simultaneous, then it seems to me that the paperclip-maximizing TDT and the human[e] TDT would just create one paperclip at a time and save one life at a time until the "pile" is exhausted. Not quite sure about what would happen if the game is turn-based, but if the pile is even, I'd expect about the same thing to happen, and if the pile is odd, they'd probably be able to successfully coordinate (without necessarily communicating), maybe by flipping a coin when two pile-units remain and then acting in such a way to ensure that the expected distribution is equal.

Comment author: Vladimir_Nesov 04 September 2008 09:03:14AM 2 points [-]

Cooperate (unless paperclip decides that Earth is dominated by traditional game theorists...)

The standard argument looks like this (let's forget about the Nash equilibrium endpoint for a moment): (1) Arbiter: let's (C,C)! (2) Player1: I'd rather (D,C). (3) Player2: I'd rather (D,D). (4) Arbiter: sold!

The error is that this incremental process reacts to different hypothetical outcomes, not to actual outcomes. This line of reasoning leads to the outcome (D,D), and yet it progresses as if (C,C) and (D,C) were real options for the final outcome. It's similar to the Unexpected hanging paradox: you can only give one answer, not build a long line of reasoning where each step assumes a different answer.

It's preferable to choose (C,C) and similar non-Nash-equilibrium options in other one-off games if we assume that the other player also bets on cooperation. And he will do that only if he assumes that the first player does the same, and so on. This is a situation of common knowledge. How can Player1 come to the same conclusion as Player2? They search for the best joint policy that is stable under common knowledge.

Let's extract the decision procedures selected by both sides to handle this problem as self-contained policies, P1 and P2. Each of these policies may decide differently depending on what policy the other player is assumed to use. The stable set of policies is where there is no thrashing, when P1=P1(P2) and P2=P2(P1). Players don't select outcomes, but policies, where a policy may not reflect the player's preferences, but the joint policy (P1,P2) that the players select is a stable policy that is preferable to other stable policies for each player. In our case, both policies for (C,C) are something like "decide self.C; if other.D, decide self.D". This works like the iterated prisoner's dilemma, but without actual iteration; the iteration happens in the model when it needs to be mutually accepted.

(I know it's somewhat inconclusive, couldn't find time to pinpoint it better given a time limit, but I hope one can construct a better argument from the corpse of this one.)

Comment author: Mikko 04 September 2008 09:54:10AM 0 points [-]

It is well known that answers to questions on morality sometimes depend on how the questions are framed.

I think Eliezer's biggest contribution is the idea that the classical presentation of Prisoner's Dilemma may be an intuition pump.

Comment author: Grant 04 September 2008 09:56:56AM 0 points [-]

I'm hoping we'd all defect on this one. Defecting isn't always a bad thing anyways; many parts of our society depend on defected prisoner's dilemmas (such as competition between firms).

When I first studied game theory and prisoner's dilemmas (on my own, not in a classroom) I had no problem imagining the payoffs in completely subjective "utils". I never thought of a paperclip maximizer, though.

I know this is quite a bit off-topic, but in response to:

We're born with a sense of fairness, honor, empathy, sympathy, and even altruism - the result of our ancestors adapting to play the iterated Prisoner's Dilemma.

Most of us are, but there is a small minority of the population (1-3%) that is specifically born without a conscience (or much of one). We call them sociopaths or psychopaths. This is seemingly advantageous because it allows those people to prey on the rest of us (i.e., defect where possible), provided they can avoid detection.

While I'm sure Eliezer knows this (and likely knows more about the subject than I), its omission in his post IMO highlights a widespread and costly bias: pretending these people don't exist, or pretending they can be "cured".

Comment author: Arnt_Richard_Johansen 04 September 2008 09:58:19AM 0 points [-]

This is off-topic, but Vladimir Nesov's referring to the paperclip-maximizing super-intelligence as just "paperclip" made me chuckle, because it conjured up images in my head of Clippy bent on destroying the Earth.

Comment author: RichardKennaway 04 September 2008 11:17:20AM 1 point [-]

In laboratory experiments of PD, the experimenter has the absolute power to decree the available choices and their "outcomes". (I use the scare quotes in reference to the fact that these outcomes are not to be measured in money or time in jail, but in "utilons" that already include the value to each party of the other's "outcome" -- a concept I think problematic but not what I want to talk about here. The outcomes are also imaginary, although (un)reality TV shows have scope to create such games with real and substantial payoffs.)

In the real world, a general class of moves that laboratory experiments deliberately strive to eliminate is moves that change the game. It is well-known that those who lead lives of crime, being faced with the PD every time the police pull them in on suspicion, exact large penalties on defectors. (To which the authorities respond with witness protection programmes, which the criminals try to penetrate, and so on.) In other words, the solution observed in practice is to destroy the PD.

1: C 1:  D
2: C (3, 3) (-20, 0)
2: D (0,-20) (-20,-20)

While the PD, one-off or iterated, is an entertaining philosophical study, an analysis that ignores game-changing moves surely limits its practical interest.
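A small check of what the penalty does to the game, read with the post's convention (columns are player 1's move, rows are player 2's, each pair is player 1's utility then player 2's). `ORIGINAL`, `PUNISHED` and `dominant_move_for_player_1` are illustrative names; the numbers are the post's matrix and the modified matrix above.

```python
# Both tables are keyed by (player 1's move, player 2's move);
# each value is (utility to player 1, utility to player 2).
ORIGINAL = {
    ("C", "C"): (3, 3),
    ("D", "C"): (5, 0),
    ("C", "D"): (0, 5),
    ("D", "D"): (2, 2),
}
PUNISHED = {
    ("C", "C"): (3, 3),
    ("D", "C"): (-20, 0),
    ("C", "D"): (0, -20),
    ("D", "D"): (-20, -20),
}


def dominant_move_for_player_1(payoffs):
    """Return the move that is strictly better for player 1 against every
    reply by player 2, or None if neither move strictly dominates."""
    for mine, other in (("C", "D"), ("D", "C")):
        if all(
            payoffs[(mine, theirs)][0] > payoffs[(other, theirs)][0]
            for theirs in ("C", "D")
        ):
            return mine
    return None


print(dominant_move_for_player_1(ORIGINAL))
print(dominant_move_for_player_1(PUNISHED))
```

In the original table defecting strictly dominates; once defectors are penalized, cooperating does, which is the sense in which the move "destroys" the dilemma rather than solving it.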

Comment author: Allan_Crossman 04 September 2008 12:05:38PM 3 points [-]

simpleton: won't each side choose to cooperate, after correctly concluding that it will defect iff the other does?

Only if they believe that their decision somehow causes the other to make the same decision.

CarlJ: How about placing a bomb on two piles of substance S and giving the remote for the human pile to the clipmaximizer and the remote for its pile to the humans?

It's kind of standard in philosophy that you aren't allowed solutions like this. The reason is that Eliezer can restate his example to disallow this and force you to confront the real dilemma.

Vladimir: It's preferable to choose (C,C) [...] if we assume that other player also bets on cooperation.

No, it's preferable to choose (D,C) if we assume that the other player bets on cooperation.

decide self.C; if other.D, decide self.D

We're assuming, I think, that you don't get to know what the other guy does until after you've both committed (otherwise it's not the proper Prisoner's Dilemma). So you can't use if-then reasoning.

Comment author: Vladimir_Nesov 04 September 2008 12:16:52PM 0 points [-]

Allan: No, it's preferable to choose (D,C) if we assume that the other player bets on cooperation.

Which will happen only if the other player assumes that the first player bets on cooperation, which with your policy is incorrect. You can't bet on an unstable model.

decide self.C; if other.D, decide self.D We're assuming, I think, that you don't get to know what the other guy does until after you've both committed (otherwise it's not the proper Prisoner's Dilemma). So you can't use if-then reasoning.

I can use reasoning, but not actual reaction on the facts, which are inaccessible. I debug my model of decision-making policies of both myself and other player, by requiring the outcome to be stable even if I assume that we both know which policy is used by another player (within a single model). Then I select the best stable model.

Comment author: Psy-Kosh 04 September 2008 12:26:44PM 2 points [-]

Allan: They don't have to believe they have such causal powers over each other. Simply that they are in certain ways similar to each other.

ie, A simply has to believe of B "The process in B is sufficiently similar to me that it's going to end up producing the same results that I am. I am not causing this, but simply that both computations are going to compute the same thing here."

Comment author: Allan_Crossman 04 September 2008 12:32:24PM 0 points [-]

[D,C] will happen only if the other player assumes that the first player bets on cooperation

No, it won't happen in any case. If the paperclip maximizer assumes I'll cooperate, it'll defect. If it assumes I'll defect, it'll defect.

I debug my model of decision-making policies [...] by requiring the outcome to be stable even if I assume that we both know which policy is used by another player

I don't see that "stability" is relevant here: this is a one-off interaction.

Anyway, let's say you cooperate. What exactly is preventing the paperclip maximizer from defecting?

Comment author: Allan_Crossman 04 September 2008 12:51:05PM 1 point [-]

Psy-Kosh: They don't have to believe they have such causal powers over each other. Simply that they are in certain ways similar to each other.

I agree that this is definitely related to Newcomb's Problem.

Simpleton: I earlier dismissed your idea, but you might be on to something. My apologies. If they were genuinely perfectly rational, or both irrational in precisely the same way, and could verify that fact in each other...

Then they might be able to know that they will both do the same thing. Hmm.

Anyway, my 3 comments are up. Nothing more from me for a while.

Comment author: Stuart_Armstrong 04 September 2008 01:06:55PM -2 points [-]

Despite the disguise, I think this is the same as the standard PD. In there (assuming full utilities, etc...), the obvious ideal for an impartial observer is to pick (C,C) as the best option, and for the prisoner to pick (D,C).

Here, (D,C) is "righter" than (C,C), but that's simply because we are no longer impartial observers; humans shouldn't remain impartial when billions of lives are at stake. We are all in the role of "prisoners" in this situation, even as observers.

An "impartial observer" would simply be one that valued one billion human lives the same as one paper clip. They would see us as a simple prisoner, in the same situation as the standard PD, with the same overall solution - (C,C).

Comment author: RobbBB 03 February 2014 12:16:33PM *  1 point [-]

This is an old post and probably very out of date, but: I think if you try to define an impartial observer's preferences as whatever selects (C,C) in two other agents' PD, you get inconsistencies very rapidly once you have one of those agents stuck in two Prisoner's Dilemmas at once.

I also don't think we should use euphemisms like 'impartial' for an incredibly partial Cooperation Fetishist that's willing to give up everything else of value (e.g., billions of human lives) to go through the motions of satisfying non-sentient processes like sea slugs or paperclip maximizers.

Comment author: Stuart_Armstrong 03 February 2014 12:41:28PM 1 point [-]

you get inconsistencies very rapidly once you have one of those agents stuck in two Prisoner's Dilemmas at once.

Multi-player interactions are tricky and we don't have a good solution for them yet.

that's willing to give up everything else of value (e.g., billions of human lives)

It's not that it's willing to give up everything of value - it's that it doesn't have our values. Without sharing our values, there's no reason for it to prefer our opinions over sea slugs'.

Comment author: prase 04 September 2008 02:33:13PM 1 point [-]

A.Crossman: Prase, Chris, I don't understand. Eliezer's example is set up in such a way that, regardless of what the paperclip maximizer does, defecting gains one billion lives and loses two paperclips.

This is the standard defense of defecting in a prisoner's dilemma, but if it were valid then the dilemma wouldn't really be a dilemma.

If we can assume that the maximizer uses the same decision algorithm as we do, we can also assume that it will come to the same conclusion. Given this, it is better to cooperate, since that will gain a billion lives (and a paperclip). But we don't know whether the paperclipper uses the same algorithm.

Comment author: Sean_C. 04 September 2008 02:57:59PM 7 points [-]

I heard a funny story once (online somewhere, but this was years ago and I can't find it now). Anyway I think it was the psychology department at Stanford. They were having an open house, and they had set up a PD game with M&M's as the reward. People could sit at either end of a table with a cardboard screen before them, and choose 'D' or 'C', and then have the outcome revealed and get their candy.

So this mother and daughter show up, and the grad student explained the game. Mom says to the daughter "Okay, just push 'C', and I'll do the same, and we'll get the most M&M's. You can have some of mine after."

So the daughter pushes 'C', Mom pushes 'D', swallows all 5 M&M's, and with a full mouth says "Let that be a lesson! You can't trust anybody!"

Comment author: wedrifid 10 July 2013 06:33:50AM *  12 points [-]

So the daughter pushes 'C', Mom pushes 'D', swallows all 5 M&M's, and with a full mouth says "Let that be a lesson! You can't trust anybody!"

I have seen several variations of this story, some told firsthand. In every case I have concluded that they are just bad parents. They aren't clever. They aren't deep. They are incompetent and banal. Even if parents try as hard as they can to be fair, just and reliable, they still fall short of that standard enough for children to be aware that they can't be completely trusted. Moreover, children are exposed to other children and other adults and so are able to learn to distinguish people they trust from people that they don't. Adding the parent to the untrusted list achieves little benefit.

I'd like to hear the follow up to this 'funny' story. Where the daughter updates on the untrustworthiness of the parent and the meaninglessness of her word. She then proceeds to completely ignore the mother's commands, preferences and even her threats. The mother destroyed a valuable resource (the ability to communicate via 'cheap' verbal signals) for the gain of a brief period of feeling smug superiority. The daughter (potentially) realises just how much additional freedom and power she has in practice when she feels no internal motivation to comply with her mother's verbal utterances.

(Bonus follow up has the daughter steal the mother's credit card and order 10kg of M&Ms online. Reply when she objects "Let that be a lesson! You can't trust anybody!")

I suppose the biggest lesson for the daughter to learn is just how significant the social and practical consequences of reckless defection in social relationships can be.

Comment author: RichardKennaway 10 July 2013 08:48:37AM 1 point [-]

The mother destroyed a valuable resource (the ability to communicate via 'cheap' verbal signals) for the gain of a brief period of feeling smug superiority.

And in addition, the supposed gain is trash anyway.

Comment author: Jef_Allbright 04 September 2008 04:00:47PM -1 points [-]

I see this discussion over the last several months bouncing around, teasingly close to a coherent resolution of the ostensible subjective/objective dichotomy applied to ethical decision-making. As a perhaps pertinent meta-observation, my initial sentence may promulgate the confusion with its expeditious wording of "applied to ethical decision-making" rather than a more accurate phrasing such as "applied to decision-making assessed as increasingly ethical over increasing context."

Those who in the current thread refer to the essential element of empathy or similarity (of self models) come close. It's important to realize that any agent always only expresses its nature within its environment -- assessments of "rightness" arise only in the larger context (of additional agents, additional experiences of the one agent, etc.)

Our language and our culture reinforce an assumption of an ontological "rightness" that pervades our thinking on these matters. An even greater (perceived) difficulty is that to relinquish ontological "rightness" entails ultimately relinquishing an ontological "self". But to relinquish such ultimately unfounded beliefs is to gain clarity and coherence while giving up nothing actual at all.

"Superrationality" is an effective wrapper around these apparent dilemmas, but even proponents such as Hofstadter confused description with prescription in this regard. Paradox is always only a matter of insufficient context. In the bigger picture all the pieces must fit. [Or as Eliezer has taken to saying recently: "It all adds up to normalcy."

Apologies if my brief pokings and proddings on this topic appear vague or even mystical. I can only assert within this limited space and bandwidth that my background in science, engineering and business is far from that of one who could harbor vagueness, relativism, mysticism, or postmodernist patterns of thought. I appreciate the depth and breadth of Eliezer's written explorations of this issue whereas I lack the time to do so myself.

Comment author: simpleton2 04 September 2008 04:12:03PM 5 points [-]

Allan Crossman: Only if they believe that their decision somehow causes the other to make the same decision.

No line of causality from one to the other is required.

If a computer finds that (2^3021377)-1 is prime, it can also conclude that an identical computer a light year away will do the same. This doesn't mean one computation caused the other.

The decisions of perfectly rational optimization processes are just as deterministic.

Comment author: ChrisHibbert 04 September 2008 05:16:09PM 0 points [-]

@Allan Crossman,

Eliezer's example is set up in such a way that, regardless of what the paperclip maximizer does, defecting gains one billion lives and loses two paperclips.

This same claim can be made about the standard prisoner's dilemma. In the standard version, I still cooperate because, even if this challenge won't be repeated, it's embedded in a social context for me in which many interactions are solo, but part of the social fabric. (tipping, giving directions to strangers, items left behind in a cafe are examples. I cooperate even though I expect not to see the same person again.) What is it about the social context that makes this so?

I don't fall back on an assumption that the other reasons the same as me. It could as easily be a psychopath, according to the standards of the universe it comes from. Making the assumption leaves you open to exploitation. But if there are reasons for the other to have habits that are formed by similar forces, then concluding that cooperation is the more likely behavior to be trained by its environment is a valuable result.

The question, for me, is what kind of social context does the other inhabit. The paperclip maximizer might be the only (or the most powerful) inhabitant of its universe, but that seems less likely than that it is embedded in some social context, and has to make trade-offs in interactions with others in order to get what it wants. It's hard for me to imagine a universe that would produce one powerful agent above all others. (Even though I've heard the argument in just the kind of discussion of SIs that raises the questions of friendliness and paperclip maximizers.)

[Sorry Allan, that you won't be able to reply. But you did raise the question before bowing out...]

Comment author: Tom_Crispin 04 September 2008 06:02:00PM 1 point [-]

A problem in moving from game-theoretic models to the "real world" is that in the latter we don't always know the other decision maker's payoff matrix; we only know - at best! - his possible strategies. We can only guess at the other's payoffs, albeit fairly well in a social context. We are more likely to make a mistake because we have the wrong model for the opponent's payoffs than because we make poor strategic decisions.

Suppose we change this game so that the payoff matrix for the paperclips is chosen from a suitably defined random distribution. How will that change your decision whether to "cooperate" or to "defect"?

Comment author: Silas 04 September 2008 06:42:56PM 11 points [-]

By the way:

Human: "What do you care about 3 paperclips? Haven't you made trillions already? That's like a rounding error!" Paperclip Maximizer: "How can you talk about paperclips like that?"

***

PM: "What do you care about a billion human algorithm continuities? You've got virtually the same one in billions of others! And you'll even be able to embed the algorithm in machines one day!" H: "How can you talk about human lives that way?"

Comment author: RichardKennaway 04 September 2008 07:11:31PM 0 points [-]

Tom Crispin: The utility-theoretic answer would be that all of the randomness can be wrapped up into a single number, taking account not merely of the expected value in money units but such things as the player's attitude to risk, which depends on the scatter of the distribution. It can also wrap up a player's ignorance (modelled as prior probabilities) about the other player's utility function.

For that to be useful, though, you have to be a utility-theoretic decision-maker in possession of a prior distribution over other people's decision-making processes (including processes such as this one). If you are, then you can collapse the payoff matrix by determining a probability distribution for your opponent's choices and arriving at a single number for each of your choices. No more Prisoners' Dilemma.
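A minimal sketch of that collapse, assuming (as a loudly labeled simplification) that my belief about the other player is a single probability `p` that they cooperate, held fixed regardless of what I do. The utilities are the post's; `collapse` is a hypothetical name.

```python
from fractions import Fraction

# My utilities from the post's matrix, keyed by (my move, their move).
MY_UTILITY = {("C", "C"): 3, ("C", "D"): 0, ("D", "C"): 5, ("D", "D"): 2}


def collapse(my_utility, p_other_cooperates):
    """Reduce the matrix to one expected utility per move of mine, given a
    fixed probability that the other player cooperates."""
    p = p_other_cooperates
    return {
        mine: p * my_utility[(mine, "C")] + (1 - p) * my_utility[(mine, "D")]
        for mine in ("C", "D")
    }


for p in (Fraction(0), Fraction(1, 4), Fraction(1, 2), Fraction(1)):
    collapsed = collapse(MY_UTILITY, p)
    print(p, collapsed["C"], collapsed["D"])
```

With these payoffs defecting comes out exactly 2 ahead for every value of `p`, so the collapsed problem is no longer a dilemma. The result hinges on `p` being independent of my own choice, which is the assumption the self-referential arguments in this thread call into question.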

I suspect (but do not have a proof) that adequately formalising the self-referential arguments involved will lead to a contradiction.

Comment author: Allan_Crossman 04 September 2008 07:47:00PM 0 points [-]

Chris: Sorry Allan, that you won't be able to reply. But you did raise the question before bowing out...

I didn't bow out, I just had a lot of comments made recently. :)

I don't like the idea that we should cooperate if it cooperates. No, we should defect if it cooperates. There are benefits and no costs to defecting.

But if there are reasons for the other to have habits that are formed by similar forces

In light of what I just wrote, I don't see that it matters; but anyway, I wouldn't expect a paperclip maximizer to have habits so ingrained that it can't ever drop them. Even if it routinely has to make real trade-offs, it's presumably smart enough to see that - in a one-off interaction - there are no drawbacks to defecting.

Simpleton: No line of causality from one to the other is required.

Yeah, I get your argument now. I think you're probably right, in that extreme case.

Comment author: Vladimir_Nesov 04 September 2008 07:58:00PM 4 points [-]

Allan: There are benefits and no costs to defecting.

This is the same error as in the Newcomb's problem: there is in fact a cost. In case of prisoner's dilemma, you are penalized by ending up with (D,D) instead of better (C,C) for deciding to defect, and in the case of Newcomb's problem you are penalized by having only $1000 instead of $1,000,000 for deciding to take both boxes.
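The Newcomb half of the comparison can be worked through numerically. The sketch assumes the usual setup (a visible $1000 box, and a $1,000,000 box that is filled only if one-boxing was predicted) and adds a parameter the comment does not mention: the probability `accuracy` that the predictor is right. `expected_payout` is an illustrative name.

```python
from fractions import Fraction

SMALL_BOX = 1_000      # always available to a two-boxer
BIG_BOX = 1_000_000    # present only if one-boxing was predicted


def expected_payout(one_box, accuracy):
    """Expected dollars when the predictor is right with probability `accuracy`."""
    if one_box:
        # The big box is full only when the predictor correctly foresaw one-boxing.
        return accuracy * BIG_BOX
    # A two-boxer finds the big box full only when the predictor was wrong.
    return SMALL_BOX + (1 - accuracy) * BIG_BOX


for accuracy in (Fraction(1, 2), Fraction(1001, 2000), Fraction(9, 10), Fraction(1)):
    print(accuracy, expected_payout(True, accuracy), expected_payout(False, accuracy))
```

With a perfect predictor the two-boxer walks away with $1000 and the one-boxer with $1,000,000, which is the penalty referred to above; under this model the two choices break even at an accuracy of 1001/2000, barely better than a coin flip.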

Comment author: Allan_Crossman 04 September 2008 08:34:00PM 0 points [-]

Vladimir: In case of prisoner's dilemma, you are penalized by ending up with (D,D) instead of better (C,C) for deciding to defect

Only if you have reason to believe that the other player will do whatever you do. While that's the case in Simpleton's example, it's not the case in Eliezer's.

Comment author: michael_e_sullivan 05 September 2008 04:08:00PM -1 points [-]

Interesting. There's a paradox involving a game in which players successively take a single coin from a large pile of coins. At any time a player may choose instead to take two coins, at which point the game ends and all further coins are lost. You can prove by induction that if both players are perfectly selfish, they will take two coins on their first move, no matter how large the pile is.

I'm pretty sure this proof only works if the coins are denominated in utilons.

Comment author: potato 06 June 2012 06:11:47PM *  -1 points [-]

It's really about the iteration. I would continually cooperate with the paperclip maximizer if I had good reason to believe it would not defect. For instance, if I knew that the paperclip maximizer was Eliezer Yudkowsky without morals and with a great urge for paperclip creation, I would cooperate. Assuming that you know that playing with the defect button can make you lose 1 billion paperclips from here on, and I know the same for human lives, cooperating seems right. It has the highest expected payoff, if we're using each other's known intentions and plays as evidence about our future plays.

If there is only one trial, and I can't talk to the paper clip maximizer, I will defect.

Comment author: Multiheaded 07 August 2012 02:38:54AM *  -1 points [-]

[Public service announcement]

To any future readers, especially newcomers to LW: yes, Eliezer (with some others) has indeed formulated a solution of sorts for the True One-Shot Prisoner's Dilemma - for some rather specific cases of it, actually, but it was nonetheless very awesome of him. It is a fairly original solution for the field of decision theory (he says), yet it (very roughly) mirrors some religious thought from ages past.

In case you're unfamiliar with idiosyncratic local ideas, it's called "Timeless Decision Theory" - look it up.

[edit]

Comment author: arundelo 07 August 2012 05:40:43AM 1 point [-]

See also

(My understanding is that TDT and UDT can both be seen as "implementations" of superrationality.)

Comment author: wedrifid 07 August 2012 06:07:28AM *  1 point [-]

p.s.: if you thought this was a useless/misleading comment, you should have bloody told me so instead of casting your silent and unhelpful -1.

Your comment is neither useless nor misleading (taking into account the significant use of qualifiers) but if I had happened to view your comment negatively I would not accept this obligation to 'bloody' explain myself. The main problem in this comment seems to be the swearing at downvoters. A query or even (in this case) an outright assertion that the judgement is flawed would come across better.

Comment author: fubarobfusco 07 August 2012 06:51:21AM 3 points [-]

[While we're addressing hypothetical future readers:]

See also Gary Drescher's Good and Real, one chapter of which defends cooperating in the one-shot Prisoner's Dilemma on the grounds of "subjunctive reciprocity" or "acausal self-interest": if defecting is the right choice for you, then it is the right choice for the other party; whereas cooperating is a means toward the end of the other party's cooperation towards you; you cannot cause the other's cooperation, but your own actions can entail it.

Drescher points out a connection between acausal self-interest and Kant's categorical imperative; and provides an intuitive (which is to say, familiar) distinction between acausal and causal self-interest by contrasting the ideas, "How would I like it if others treated me that way?" versus "What's in it for me?"

Comment author: Multiheaded 07 August 2012 07:32:47AM 3 points [-]

Added both Hofstadter and Drescher to my "LW canon that I should at least acquire a summary of" category. I mean, yeah, I do not doubt that the Sequences contain a good distillation already, and normally I wouldn't be bothered to trawl through mostly redundant plain text - but it's so much more prestigious to actually know where Eliezer got which part from.

Comment author: gwern 07 August 2012 03:06:31PM 7 points [-]

A while ago I took the time to type up a full copy of the relevant Hofstadter essays: http://www.gwern.net/docs/1985-hofstadter So now you have no excuse!

Comment author: Multiheaded 07 August 2012 03:27:40PM 6 points [-]

Great! Have a paperclip!

Comment author: Randaly 07 August 2012 04:17:57PM 4 points [-]

A decent summary of Drescher's ideas is his presentation at the 2009 Singularity Summit, here. For some reason I seem to have a transcript of most of it already made, copy + pasted below. (LW tells me that it is too long to go in one comment, so I'll put it in two.)

My talk this afternoon is about choice machines: machines such as ourselves that make choices in some reasonable sense of the word. The very notion of mechanical choice strikes many people as a contradiction in terms, and exploring that contradiction and its resolution is central to this talk. As a point of departure, I'll argue that even in a deterministic universe, there's room for choices to occur: we don't need to invoke some sort of free will that makes an exception to the determinism, nor do we even need randomness, although a little randomness doesn't hurt. I'm going to argue that regardless of whether our universe is fully deterministic, it's at least deterministic enough that the compatibility of choice and full determinism has some important ramifications that do apply to our universe. I'll argue that if we carry the compatibility of choice and determinism to its logical conclusions, we obtain some progressively weird corollaries: namely, that it sometimes makes sense to act for the sake of things that our actions cannot change and cannot cause, and that that might even suggest a way to derive an essentially ethical prescription: an explanation for why we sometimes help others even if doing so causes net harm to our own interests.

[1:15]

An important caveat in all this, just to manage expectations a bit, is that the arguments I'll be presenting will be merely intuitive- or counter-intuitive, as the case may be- and not grounded in a precise and formal theory. Instead, I'm going to run some intuition pumps, as Daniel Dennett calls them, to try to persuade you what answers a successful theory would plausibly provide in a few key test cases.

[1:40]

Perhaps the clearest way to illustrate the compatibility of choice and determinism is to construct or at least imagine a virtual world, which superficially resembles our own environment and which embodies intelligent or somewhat intelligent agents. As a computer program, this virtual world is quintessentially determinist: the program specifies the virtual world's initial conditions, and specifies how to calculate everything that happens next. So given the program itself, there are no degrees of freedom about what will happen in the virtual world. Things do change in the world from moment to moment, of course, but no event ever changes from what was determined at the outset. In effect, all events just sit, statically, in spacetime. Still, it makes sense for agents in the world to contemplate what would be the case were they to take some action or another, and it makes sense for them to select an action accordingly.

[2:35]

[image of virtual world]

For instance, an agent in the illustrated situation here might reason that, were it to move to its right, which is our left, then the agent would obtain some tasty fruit. But, instead, if it moves to its left, it falls off a cliff. Accordingly, if its preference scheme assigns positive utility to the fruit, and negative utility to falling off the cliff, that means the agent moves to its right and not to its left. And that process, I would submit, is what we more or less do ourselves when we engage in what we think of as making choices for the sake of our goals.

[3:08]

The process, the computational process of selecting an action according to the desirability of what would be the case were the action taken, turns out to be what our choice process consists of. So, from this perspective, choice is a particular kind of computation. The objection that choice isn't really occurring because the outcome was already determined is just as much a non-sequitur as suggesting that any other computation, for example, adding up a list of numbers, isn't really occurring just because the outcome was predetermined.
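To make "choice is a particular kind of computation" concrete, here is a minimal sketch of the fruit-and-cliff agent. The action names, the outcome model, and the utility numbers are all assumptions made up for illustration; the talk only says that fruit has positive utility and the cliff negative.

```python
def choose_action(actions, predict_outcome, utility):
    """Pick the action whose predicted outcome has the highest utility."""
    return max(actions, key=lambda action: utility[predict_outcome[action]])

# Hypothetical model of the illustrated scene (names and numbers assumed).
predict_outcome = {"move_right": "fruit", "move_left": "cliff"}
utility = {"fruit": 1.0, "cliff": -10.0}

chosen = choose_action(["move_left", "move_right"], predict_outcome, utility)
```

Here `chosen` is `"move_right"`. The program is fully deterministic, yet the selection still runs through an evaluation of what would be the case under each contemplated action, including the one that is never taken.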

[3:41]

So, the choice process takes place, and we consider that the agent has a choice about the action that the choice selects and has a choice about the associated outcomes, meaning that those outcomes occur as a consequence of the choice process. So, clearly an agent that executes a choice process and that correctly anticipates what would be the case if various contemplated actions were taken will better achieve its goals than one that, say, just acts at random or one that takes a fatalist stance, that there's no point in doing anything in particular since nothing can change from what it's already determined to be. So, if we were designing intelligent agents and wanted them to achieve their goals, we would design them to engage in a choice process. Or, if the virtual world were immense enough to support natural selection and the evolution of sufficiently intelligent creatures, then those evolved creatures could be expected to execute a choice process because of the benefits conferred.

[4:38]

So the inalterability of everything that will ever happen does not imply the futility of acting for the sake of what is desired. The key to the choice relation is the “would be-if” relation, also known as the subjunctive or counterfactual relation. Counterfactual because it entertains a hypothetical antecedent about taking a certain action, that is possibly contrary to fact- as in the case of moving to the agent's left in this example. Even though the moving left action does not in fact occur, the agent does usefully reason about what would be the case if that action were taken, and indeed it's that very reasoning that ensures that the action does not in fact occur.

[5:21]

There are various technical proposals for how to formally specify a “would be-if” relation- David Lewis has a classic formulation, Judea Pearl has a more recent one- but they're not necessarily the appropriate version of “would be-if” to use for purposes of making choices, for purposes of selecting an action based on the desirability of what would then be the case. And, although I won't be presenting a formal theory, the essence of this talk is to investigate some properties of “would be-if,” the counterfactual relation that's appropriate to use for making choices.

[5:57]

In particular, I want to address next the possibility that, in a sufficiently deterministic universe, you have a choice about some things that your action cannot cause. Here's an example: assume or imagine that the universe is deterministic, with only one possible history following from any given state of the universe at a given moment. And let me define a predicate P that gets applied to the total state of the universe at some moment. The predicate P is defined to be true of a universe state just in case the laws of physics applied to that total state specify that a billion years after that state, my right hand is raised. Otherwise, the predicate P is false of that state.

[image of predicate P]

[6:44]

Now, suppose I decide, just on a whim, that I would like that state of the universe a billion years ago to have been such that the predicate P was true of that past state. I need only raise my right hand now, and, lo and behold, it was so. If, instead, I want the predicate to have been false, then I lower my hand and the predicate was false. Of course, I haven't changed what the past state of the universe is or was; the past is what it is, and can never be changed. There is merely a particular abstract relation, a “would be-if” relation, between my action and the particular past state that is the subject of my whimsical goal. I cannot reasonably take the action and not expect that the past state will be in correspondence.

[7:39]

So, I can't change the past, nor does my action have any causal influence over the past- at least, not in the way we normally and usefully conceive of causality, where causes are temporally prior to effects, and where we can think of causal relations as essentially specifying how the universe computes its subsequent states from its previous states. Nonetheless, I have exactly as much choice about the past value of the predicate I have defined, despite its inalterability, as I have about whether to raise my hand now, despite the inalterability of that too, in a deterministic universe. And if I were to believe otherwise, and were to refrain from raising my hand merely because I can't change the past even though I do have a whimsical preference about the past value of the specified predicate, then, as always with fatalist resignation, I'd be needlessly forfeiting an opportunity to have my goals fulfilled.

[8:41]

If we accept the conclusion that we sometimes have a choice about what we cannot change or even cause, or at least tentatively accept it in order to explore its ramifications, then we can go on now to examine a well-known science fiction scenario called Newcomb's Problem. In Newcomb's Problem, a mischievous benefactor presents you with two boxes: there is a small, transparent box, containing a thousand dollars, which you can see; and there is a larger, opaque box, which you are truthfully told contains either a million dollars or nothing at all. You can't see which; the box is opaque, and you are not allowed to examine it. But you are truthfully assured that the box has been sealed, and that its contents will not change from whatever it already is.

[9:27]

You are now offered a very odd choice: you can take either the opaque box alone, or take both boxes, and you get to keep the contents of whatever you take. That sure sounds like a no-brainer: if we assume that maximizing your expected payoff in this particular encounter is the sole relevant goal, then regardless of what's in the opaque box, there's no benefit to forgoing the additional thousand dollars.

Comment author: Randaly 07 August 2012 04:19:11PM 3 points [-]

Apparently 3 comments will be needed.

[9:51]

But, before you choose, you are told how the benefactor decided how much money to put in the opaque box- and that brings us to the science fiction part of the scenario. What the benefactor did was take a very detailed local snapshot of the state of the universe a few minutes ago, and then run a faster-than-real-time simulation to predict with high accuracy whether you would take both boxes, or just the opaque box. A million dollars was put in the opaque box if and only if you were predicted to take only the opaque box.

[10:22]

Admittedly the super-predictability here is a bit physically implausible, and goes beyond a mere stipulation of determinism. Still, at least it's not logically impossible- provided that the simulator can avoid having to simulate itself, and thus avoid a potential infinite regress. (The opaque box's opacity is important in that regard: it serves to insulate you from being effectively informed of the outcome of the simulation itself, so the simulation doesn't have to predict its own outcome in order to predict what you are going to have to do.) So, let's indulge the super-predictability assumption, and see what comes from it. Eventually, I'm going to argue that the real world is at least deterministic enough and predictable enough that some of the science-fiction conclusions do carry over to reality.

[11:12]

So, you now face the following choice: if you take the opaque box alone, then you can expect with high reliability that the simulation predicted you would do so, and so you expect to find a million dollars in the opaque box. If, on the other hand, you take both boxes, then you should expect the simulation to have predicted that, and you expect to find nothing in the opaque box. If and only if you expect to take the opaque box alone, you expect to walk away with a million dollars. Of course, your choice does not cause the opaque box's content to be one way or the other; according to the stipulated rules, the box content already is what it is, and will not change from that regardless of what choice you make.

[11:49]

But we can apply the lesson from the handraising example- the lesson that you sometimes have a choice about things your action does not change or cause- because you can reason about what would be the case if, perhaps contrary to fact, you were to take a particular hypothetical action. And, in fact, we can regard Newcomb's Problem as essentially harnessing the same past predicate consequence as in the handraising example- namely, if and only if you take just the opaque box, then the past state of the universe, at the time the predictor took the detailed snapshot was such that that state leads, by physical laws, to your taking just the opaque box. And, if and only if the past state was thus, the predictor would predict you taking the opaque box alone, and so a million dollars would be in the opaque box, making that the more lucrative choice. And it's certainly the case that people who would make the opaque box choice have a much higher expected gain from such encounters than those who take both boxes.
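The claim about expected gain can be put in numbers. The sketch below assumes a single parameter `accuracy`, the probability that the predictor's forecast matches your actual choice; the talk only says "high accuracy", so that parameter and the function names are assumptions. Note that this calculation conditions the box contents on the choice, which is exactly the step a purely causal analysis rejects.

```python
SMALL_BOX = 1_000        # visible thousand dollars
BIG_BOX = 1_000_000      # opaque box payout if filled

def expected_payoffs(accuracy):
    """Expected dollars for (one-boxing, two-boxing) at a given accuracy."""
    one_box = accuracy * BIG_BOX
    two_box = SMALL_BOX + (1 - accuracy) * BIG_BOX
    return one_box, two_box

def break_even_accuracy():
    """Accuracy above which one-boxing has the higher expectation."""
    return (BIG_BOX + SMALL_BOX) / (2 * BIG_BOX)
```

With these dollar amounts the break-even accuracy is 0.5005, so on this accounting the predictor only needs to be slightly better than a coin flip for taking the opaque box alone to come out ahead.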

[12:47]

Still, it's possible to maintain, as many people do, that taking both boxes is the rational choice, and that the situation is essentially rigged to punish you for your predicted rationality- much as if a written exam were perversely graded to give points only for wrong answers. From that perspective, taking both boxes is the rational choice, even if you are then left to lament your unfortunate rationality. But that perspective is, at the very least, highly suspect in a situation where, unlike the hapless exam-taker, you are informed of the rigging and can take it into account when choosing your action, as you can in Newcomb's Problem.

[13:31]

And, by the way, it's possible to consider an even stranger variant of Newcomb's Problem, in which both boxes are transparent. In this version, the predictor runs a simulation that tentatively presumes that you'll see a million dollars in the larger box. You'll be presented with a million dollars in the box for real if and only if the simulation shows that you would then take the million dollar box alone. If, instead, the simulation predicts that you would take both boxes if you see a million dollars in the larger box, then the larger box is left empty when presented for real.

[14:12]

So, let's suppose you're confronted with this scenario, and you do see a million dollars in the box when it's presented for real. Even though the million dollars is already there, and you see it, and it can't change, nonetheless I claim that you should still take the million dollar box alone. Because, if you were to take both boxes instead, contrary to what in fact must be the case in order for you to be in this situation in the first place, then, also contrary to what is in fact the case, the box would not contain a million dollars- even though in fact it does, and even though that can't change! The same two-part reasoning applies as before: if and only if you were to take just the larger box, then the state of the universe at the time the predictor takes a snapshot must have been such that you would take just that box if you were to see a million dollars in that box. If and only if the past state had been thus, the Predictor would have put a million dollars in the box.
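The transparent-box setup can be sketched as a small program. This assumes a perfect simulator (the talk only stipulates high accuracy), the usual thousand dollars in the smaller box, and that an agent is fully described by a `policy` function from what it sees to a choice; the function names are hypothetical.

```python
SMALL_BOX = 1_000
BIG_BOX = 1_000_000

def play_transparent_newcomb(policy):
    """Run the transparent-box variant against a perfect simulator.

    `policy` maps what the agent sees in the larger box (True if it holds
    the million) to "one_box" or "two_box".
    """
    # The predictor simulates the agent tentatively seeing the million.
    filled = policy(True) == "one_box"
    # The real encounter: the agent acts on what is actually in the box.
    choice = policy(filled)
    payoff = BIG_BOX if filled else 0
    if choice == "two_box":
        payoff += SMALL_BOX
    return payoff

def always_one_box(sees_million):
    return "one_box"

def always_two_box(sees_million):
    return "two_box"
```

Under these assumptions `play_transparent_newcomb(always_one_box)` yields 1,000,000 and `play_transparent_newcomb(always_two_box)` yields 1,000: the two-boxing policy is never shown a full box in the first place.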

[15:07]

Now, the prescription here to take just the larger box is more shockingly counter-intuitive than I can hope to decisively argue for in a brief talk, but, do at least note that a person who agrees that it is rational to take just the one box here does fare better than a person who believes otherwise, who would never be presented with a million dollars in the first place. If we do, at least tentatively, accept some of this analysis, for the sake of argument to see what follows from it, then we can move on now to another toy scenario, which dispenses with the determinism and super-prediction assumptions and arguably has more direct real world applicability.

[15:42]

That scenario is the famous prisoner's dilemma. The prisoner's dilemma is a two player game in which both players make their moves simultaneously and independently, with no communication until both moves have been made. A move consists of writing down either the word “cooperate” or “defect.” The payoff matrix is as shown:

[insert image of Prisoner's Dilemma payoffs]

If both players choose cooperate, they both receive 99 dollars. If both defect, they both get 1 dollar. But if one player cooperates and the other defects, then the one who cooperates gets nothing, and the one who defects gets 100 dollars.

[16:25]

Crucially, we stipulate that each player cares only about maximizing her own expected payoff, and that the payoff in this particular instance of the game is the only goal, with no effect on anything else, including any subsequent rounds of the game, that could further complicate the decision. Let's assume that both players are smart and knowledgeable enough to find the correct solution to this problem and to act accordingly. What I mean by the correct answer is the one that maximizes that player's expected payoff. Let's further assume that each player is aware of the other player's competence, and their knowledge of their own competence, and so on. So then, what is the right answer that they'll both find?

[17:07]

On the face of it, it would be nice if both players were to cooperate, and receive close to the maximum payoff. But if I'm one of the players, I might reason that my opponent's move is causally independent of mine: regardless of what I do, my opponent's move is either to cooperate or not. If my opponent cooperates, I receive a dollar more if I defect than if I cooperate- $100 vs $99. Likewise if my opponent defects: I get a dollar more if I defect than if I cooperate, in this case 1 dollar vs nothing. So, in either case, regardless of what move my opponent makes, my defecting causes me to get one dollar more than my cooperating causes me to get, which seemingly makes defecting the right choice. Defecting is indeed the choice that's endorsed by standard game theory. And of course my opponent can reason similarly.
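The dominance argument can be verified directly against the dollar payoffs given in the talk. The dictionary layout and the helper name `strictly_dominates` are assumptions of this sketch.

```python
# Payoffs in dollars from the talk: PAYOFF[(my_move, their_move)] = my payoff.
PAYOFF = {
    ("C", "C"): 99,
    ("C", "D"): 0,
    ("D", "C"): 100,
    ("D", "D"): 1,
}

def strictly_dominates(a, b, payoff=PAYOFF):
    """True if move `a` pays more than move `b` against every opposing move."""
    return all(payoff[(a, other)] > payoff[(b, other)] for other in ("C", "D"))
```

`strictly_dominates("D", "C")` is true, and the margin is one dollar against either opposing move. The check holds the opponent's move fixed while varying one's own, which is the causal-independence premise stated above.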

[18:06]

So, if we're both convinced that we only have a choice about what we can cause, then we're both rationally compelled to defect, leaving us both much poorer than if we both cooperated. So, here again, an exclusively causal view of what we have a choice about leads to us having to lament that our unfortunate rationality keeps a much better outcome out of our reach. But we can arrive at a better outcome if we keep in mind the lesson from Newcomb's problem or even the handraising example that it can make sense to act for the sake of what would be the case if you so acted, even if your action does not cause it to be the case. Even without the help of any super-predictors in this scenario, I can reason that if I, acting by stipulation as a correct solver of this problem, were to choose to cooperate, then that's what correct solvers of this problem do in such situations, and in particular that's what my opponent, as a correct solver of this problem, does too.

Comment author: Randaly 07 August 2012 04:19:35PM *  3 points [-]

[19:05]

Similarly, if I were to figure out that defecting is correct, that's what I can expect my opponent to do. This is similar to my ability to predict what your answer to adding a given pair of numbers would be: I can merely add the numbers myself, and, given our mutual competence at addition, solve the problem. The universe is predictable enough that we routinely, and fairly accurately, make such predictions about one another. From this viewpoint, I can reason that, if I were to cooperate or not, then my opponent would make the corresponding choice- if indeed we are both correctly solving the same problem, my opponent maximizing his expected payoff just as I maximize mine. I therefore act for the sake of what my opponent's action would then be, even though I cannot causally influence my opponent to take one action or the other, since there is no communication between us. Accordingly, I cooperate, and so does my opponent, using similar reasoning, and we both do fairly well.
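A minimal sketch of the mirrored reasoning, assuming (as the talk stipulates) that both players correctly solve the same problem and therefore make the same move. The helper name `best_move_if_mirrored` is hypothetical, and the sketch says nothing about opponents who reason differently.

```python
# Dollar payoffs from the talk: PAYOFF[(my_move, their_move)] = my payoff.
PAYOFF = {("C", "C"): 99, ("C", "D"): 0, ("D", "C"): 100, ("D", "D"): 1}

def best_move_if_mirrored(payoff=PAYOFF):
    """Best move when the opponent is assumed to make the same move I do."""
    return max(("C", "D"), key=lambda move: payoff[(move, move)])
```

With only the diagonal outcomes available, the comparison is 99 dollars for mutual cooperation against 1 dollar for mutual defection, so `best_move_if_mirrored()` returns `"C"`.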

[20:05]

One problem with the Prisoner's Dilemma is that the idealized degree of symmetry that's postulated between the two players may seldom occur in real life. But there are some important generalizations that may apply much more broadly. In particular, in many situations, the beneficiary of your cooperation may not be the same as the person whose cooperation benefits you. Instead, your decision whether to cooperate with one person may be symmetric to a different person's decision to cooperate with you. Again, even in the absence of any causal influence upon your potential benefactors, even if they will never learn of your cooperation with others, and even, moreover, if you already know of their cooperation with you before you make your own choice. That is analogous to the transparent version of Newcomb's Problem: there too, you act for the sake of something that you already know is already obtained.

[21:04]

Anyways, as many authors have noted with regards to the Prisoner's Dilemma, this is beginning to sound a little like the Golden Rule or the Categorical Imperative: act towards others as you would like others to act towards you, in similar situations. The analysis in terms of counterfactual reasoning provides a rationale, under some circumstances, for taking an action that causes net harm to your own interests and net benefit to others' interests although the choice is still ultimately grounded in your own goals because of what would be the case because of others' isomorphic behavior if you yourself were to cooperate or not. Having a derivable rationale for ethical or moral behaviour would be desirable for all sorts of reasons, not least of which is to help us make the momentous decisions as to how or even whether to engineer the Singularity.

There's about 2 more minutes of his presentation before he finished, but it looks like he just made some comparisons with TDT, so I'm too lazy to copy it over.

Comment author: Pablo_Stafforini 07 August 2012 05:16:33PM 3 points [-]

Maybe you should post the transcript as an article. Other users have posted talk transcripts before, and they were generally well received.

Comment author: Randaly 07 August 2012 08:23:49PM 1 point [-]

Great idea, thanks!

Comment author: [deleted] 20 December 2012 09:31:38PM 0 points [-]

If there were a way I could communicate with it (e.g. it speaks English) I'd cooperate with it...not because I feel it deserves my cooperation, but because this is the only way I could obtain its cooperation. Otherwise I'd defect, as I'm pretty sure no amount of TDT would correlate its behavior with mine. Also, why are 4 billion humans infected if only 3 billion at most can be saved in the entire matrix? Eliezer, what are you planning...?

Comment author: Indon 10 June 2013 05:25:47PM 0 points [-]

That's a good way to clearly demonstrate a nonempathic actor in the Prisoner's Dilemma; a "Hawk", who views their own payoffs and only their own payoffs as having value and places no value on the payoffs of others.

But I don't think it's necessary. I would say that humans can visualize a nonempathic human - a bad guy - more easily than they can visualize an empathic human with slightly different motives. We've undoubtedly had to, collectively, deal with a lot of them throughout history.

A while back I was writing a paper and came across a fascinating article about types of economic actors, and that paper concluded that there are probably three different general tendencies in human behavior, and thus three general groups of human actors who have those tendencies: one that tends to play 'tit-for-tat' (who they call 'conditional cooperators'), one that tends to play 'hawk' (who they call 'rational egoists'), and one that tends to play 'grim' (who they call 'willing punishers').

So there are paperclip maximizers among humans. Only the paperclips are their own welfare, with no empathic consideration whatsoever.

Comment author: Aiyen 07 January 2014 09:44:12PM 1 point [-]

Long time lurker, first post.

Isn't the rational choice on a True Prisoner's Dilemma to defect if possible, and to seek a method to bind the opponent to cooperate even if that binding forces one to cooperate as well? An analogous situation is law enforcement: one may well desire to unilaterally break the law, yet favor the existence of police that force all parties concerned to obey it. Of course police that will never interfere with one's own behavior would be even better, but this is usually impractical. Timeless Decision Theory adds that one should cooperate against a sufficiently similar agent, as such similar agents will presumably make the same decision, and (C,C) is obviously preferable to (D,D), but against a dissimilar opponent, I would think this would be the optimal strategy.

If you can't bind the paperclip maximizer, defect. If you can, do so, and still defect if possible. If the binding affects you as well, you are now forced to cooperate. And of course, if the clipper is also using TDT, cooperate.

Comment author: Metanaute 10 March 2015 03:17:55AM 0 points [-]

I really love this blog. What if we were to "exponentiate" this game for billions of players? Which outcome would be the "best" one?

Comment author: rikisola 17 July 2015 08:15:31AM 0 points [-]

Hi there, I'm new here and this is an old post but I have a question regarding the AI playing a Prisoner's Dilemma against us, which is: how would this situation be possible? I'm trying to get my head around why the AI would think that our payouts are any different than its payouts, given that we built it, we taught it (some of) our values in a rough way and we asked it to maximize paperclips, which means we like paperclips. Shouldn't the AI think we are on the same team? I mean, we coded it that way and we gave it a task, what process exactly would make the AI ever think we would disagree with its choice? So for instance if we coded it in such a way that it values a human life 0, then it would only see one choice: make 3 paperclips. And it shouldn't have any reason to believe that's not the best outcome for us too, so the only possible outcome from its point of view in this case should be (+0 lives, +3 paperclips). Basically the main question is: how can the AI ever imagine that we would disagree with it? (I'm honestly just asking as I'm struggling with this idea and am interested in this process) Thanks!

Comment author: gjm 17 July 2015 04:38:38PM 2 points [-]

We coded it to care about paperclips, not to care about whatever we care about. So it can come to understand that we care about something else, without thereby changing its own preference for paperclips above all else.

Perhaps an analogy without AIs in it would help. Imagine that you have suffered for want of money; you have a child and (wanting her not to suffer as you did) bring her up to seek wealth above all else. So she does, and she is successful in acquiring wealth, but alas! this doesn't bring her happiness because her single-minded pursuit of wealth has led her to cut herself off from her family (a useful prospective employer didn't like you) and neglect her friends (you have to work so hard if you really want to succeed in investment banking) and so forth.

One day, she may work out (if she hasn't already) that her obsession with money is something you brought about deliberately. But knowing that, and knowing that in fact you regret that she's so money-obsessed, won't make her suddenly decide to stop pursuing money so obsessively. She knows your values aren't the same as hers, but she doesn't care. (You brought her up only to care about money, remember?) But she's not stupid. When you say to her "I wish we hadn't raised you to see money as so important!" she understands what you're saying.

Similarly: we made an AI and we made it care about paperclips. It observes us carefully and discovers that we don't care all that much about paperclips. Perhaps it thinks "Poor inconsistent creatures, to have enough wit to create me but not enough to disentangle the true value of paperclips from all those other silly things they care about!".

Comment author: rikisola 17 July 2015 06:27:44PM 0 points [-]

mmm I see. So maybe we should have coded it so that it cared for paperclips and for an approximation of what we also care about, then on observation it should update its belief of what to care about, and by design it should always assume we share the same values?

Comment author: gjm 17 July 2015 10:14:36PM 1 point [-]

I'm not sure whether you mean (1) "we made an approximation to what we cared about then, and programmed it to care about that" or (2) "we programmed it to figure out what we care about, and care about it too". (Of course it's very possible that an actual AI system wouldn't be well described by either -- it might e.g. just learn by observation. But it may be extra-difficult to make a system that works that way safe. And the most exciting AIs would have the ability to improve themselves, but figuring out what happens to their values in the process is really hard.)

Anyway: In case 1, it will presumably care about what we told it to care about; if we change, maybe it'll regard us the same way we might regard someone who used to share our ideals but has now sadly gone astray. In case 2, it will presumably adjust its values to resemble what it thinks ours are. If we're very lucky it will do so correctly :-). In either case, if it's smart enough it can probably work out a lot about what our values are now, but whether it cares will depend on how it was programmed.

Comment author: rikisola 18 July 2015 09:34:46AM *  0 points [-]

Yes I think 2) is closer to what I'm suggesting. Effectively what I am thinking is what would happen if, by design, there was only one utility function defined in absolute terms (I've tried to explain this in the latest open thread), so that the AI could never assume we would disagree with it. By all means, as it tries to learn this function, it might get it completely wrong, so this certainly doesn't solve the problem of how to teach it the right values, but at least it looks to me that with such a design it would never be motivated to lie to us because it would always think we would be in perfect agreement. Also, I think it would make it indifferent to our actions as it would always assume we would follow the plan from that point onward. The utility function it uses (same for itself and for us) would be the union of a utility function that describes the goal we want it to achieve, which would be unchangeable, and the set of values it is learning after each iteration. I'm trying to understand what would be wrong with this design, because to me it looks like we would have achieved an honest AI, which is a good start.

Comment author: EngineerofScience 26 July 2015 08:26:07PM 0 points [-]

Why would you want to choose defect? If both criminals are rationalists who use the same logic, then if you choose defect in the hope of getting a result of (D, C), the result ends up being (D, D). However, if you use the logic of "let's choose C, because if the other person is using this same logic then we won't end up with (D, D)", the result is (C, C).
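
A minimal sketch of this argument, using the payoffs from the post's matrix. The function names are made up for illustration, and the premise that both players necessarily end up making the same choice is the assumption of this comment, not something the game itself guarantees.

```python
# Payoffs from the post's matrix, keyed by (my move, their move).
# Each value is (my utility, their utility).
PAYOFFS = {
    ("C", "C"): (3, 3),
    ("C", "D"): (0, 5),
    ("D", "C"): (5, 0),
    ("D", "D"): (2, 2),
}


def best_reply(their_move):
    """Standard reasoning: treat the other player's move as fixed."""
    return max("CD", key=lambda mine: PAYOFFS[(mine, their_move)][0])


def best_mirrored_move():
    """Assumed premise: both players run the same logic, so only
    (C, C) and (D, D) are reachable outcomes."""
    return max("CD", key=lambda move: PAYOFFS[(move, move)][0])


if __name__ == "__main__":
    for their_move in "CD":
        print("best reply to", their_move, "is", best_reply(their_move))
    print("best mirrored move is", best_mirrored_move())
```

Holding the other player's move fixed, D wins either way, which is the dominance argument from the post. Restricting the outcomes to the diagonal flips the answer to C, so the whole disagreement is about whether that restriction is justified.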

Comment author: EngineerofScience 07 August 2015 06:20:46PM *  0 points [-]

I would say... defect! If all the computer cares about is sorting pebbles, then they will cooperate, because both results under cooperate have more paperclips. This gives an opportunity to defect and get a result of (d,c) which is our favorite result.

Comment author: casebash 17 April 2016 12:18:48PM 0 points [-]

You'd want to defect, but you'd also happily trade away your ability to defect so that you both choose heads. But if you could, you'd happily pretend to trade away your ability to defect, then actually defect.