Value Is Fragile but Your Utility Function's Structure Shouldn't Be
by E. P. Cooper
[Epistemic Status: Not fully checked in relation to Jeffrey-Bolker rotations.]

Introduction

This article describes a possible structure that utility functions can take. The structure is intended to be stable across a very large range of conditions. It has recently become more common to argue that a stable and coherent utility function is unnecessary, but by my understanding of current decision theory work, this is not correct. Without a static account of counterlogicals, the best that can be done is to design a utility function that works given some unknown, but nonzero, level of updatelessness. This means the utility function can't depend on being able to evaluate indexical utility, or, counter to recent claims, mix its epistemics, anthropics, and preferences into one inseparable blob.

Given current problems with Logical Inductors, an attempt at constructing an updateless decision theory that uses them appears to sometimes require some sort of updating. Descriptively, this can be analyzed as the Logical Inductor traders not being incentivized to continue working on situations that are logically false. In general, the behavior of Logical Inductors requires that a utility function maintain correct verdicts and coherence in the face of shifting evaluations of the possibility of various outcomes.

I advocate for the use of Anchored Utility Functions, my term for the type of utility function that I think implements the best subset of desired properties while respecting the above constraints. I argue against multiple types of utilitarianism, meta-ethical hedonism, an outsized focus on suffering, and the reification of various concepts, when taken as dependencies of the utility function and decision theory. This leaves the form of utility function described in this article, or something similar to it, as the remaining option for someone who wants to execute Wisdom Longtermism and robustly make bad outcomes inconsistent1 in the face of possible developments in philosophy, anthropics, and the general view of the world. Of course, I could have missed something, or there could be a development that comes soon enough to matter.

A utility function must be stable enough relative to its upper bound for proper risk management, both as a general requirement and to avoid entering traps. Without Anchoring, I can't currently see how world-applicable traps can be differentiated from vague pseudo-multiversal nonsense, even if a different solution to general zero-point stability were found.
While I think this article describes a less fragile sort of utility function, it does almost nothing to address the fact that your powers of deduction, the ability to know what a statement implies2, are almost certainly not able to keep up with the complexity and specificity of your values. In addition, I guess that edge instantiation does not have a general solution, and some relatively minor tweaks may be required to the structure of Anchored Utility Functions for them to be machine-applicable. These changes should be well-behaved and cause no practical difference in the conclusions I draw.
Maybe some people are incredibly good at that specific kind of deduction, and they have incredibly mathematically non-specific values, but neither is true for me.
There may be aspects of human rationality that say it's close to impossible to tell when you're done with the deductive analysis of your utility function with the ability and biases you have, so you shouldn't try to carry out the analysis past a certain point, but I haven't done a lot of research and thinking there.
Either way, tiny imperfections and distortions in the value encoding are going to happen, and at a human level of checking they will plausibly be catastrophic. Other people may have ideas for automated consistency checking and similar, but that's not the real problem. The real problem is that value is so valuable that the process can tolerate almost no error.
When driven extremely hard, these errors lead to extremal Goodhart345, even assuming perfect alignment of the systems doing the physical-world optimization.
I don't suggest using a human-designed utility function of this type in a powerful AI system. If you need a starting point, I think it's a better choice than utilitarianism, and it tends to allow an easier and more stable analysis and approximation. That doesn't mean the better choice would give you enough time to do better overall in a meaningful way, and there could be various last-ditch attempts considered, though I won't write about them in this article.
This isn't an excuse to change your values to something easier to encode; by finite counting, that wouldn't help much anyway. I think it is worth considering the type of utility function the current math needs, and whether the increased robustness of the functions presented in this article is worth the loss of infinite utilities. Demand that there remain "outs" for the fragile Good, not that our utility functions remain fragile for the foreseeable future, no matter how short a time that may be.

Limitations

My analysis here may rely too much on learning-theoretic rationality, as opposed to static rationality. That wasn't my original intent, but much of the work I have seen recently on embedded rationality is learning-theoretic. Since I don't think humanity in general has a long time to get this right, if this is the type of embedded agency that gets worked on, I don't have much of a choice. I ask readers to suggest any high quality account of static rationality they know of, so I can look at it.

How I Got Here

(If you don't care, skip to The Function.)
It took longer than it should have. I'm not sure why that is. I think a lot of basic philosophical decision literature focuses on "ethical test cases" that don't handle preferences over lotteries.
As for the basic resources that do handle probability in a reasonably correct manner, they either take various economics assumptions, destroying global risk management, or assume some fuzzy version of total utilitarianism.
Quite often it's not made extremely clear if they're talking about "axiological" or preference based utilitarianism. Maybe this is sometimes intentionally conflated. See the "Alternatives" section.
I looked at outcome matrices for risk management. You need a scoring system for them, but the system from economics was limited and appeared to describe the kind of risk management I wanted as irrational.
Even from my current position, I think totalist utilitarianism relies on very particular arguments and very particular situations to act even close to reasonably. The arguments claim our situation is like that, but that assumes things that I don't want to condition my reasoning existence on.
For example, various longtermist arguments in totalist utilitarianism depend on some Grand Future, defined in the totalist utility function, being a reasonable possibility. In some sense, current humans exist for the sake of the Grand Future, allowing the zero point to be set higher than it could be otherwise while still avoiding the verdict of immediate human extinction being preferable.
At the time, I looked at totalist utilitarianism and thought it was too risk-seeking. As seen in the example given above, it is just too fragile. The overall worth of human life shouldn't be determined by the expected value of Grand Futures projects vs. the severity of the repugnant conclusion.
From there I looked at minimax. Minimax is great for risk management, but it strictly cares about only one worst outcome. This is quite easy to fix in theory. Take what exists currently, and from that take what you care about. Divide that category into bundles, and apportion your care between the bundles. By this, I mean you can "weight" each bundle differently, theoretically from almost all of your care being on one bundle to almost none of it being on the same bundle, if you set up the function with different parameters.
Once you've done that, you'll have a utility function where your maximum value is when you have all your bundles in an intact state, and your minimum value is when you have none of them like that. Intermediate values depend on the weighting.
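A minimal sketch of this weighted-bundle scoring in Python; the bundle names, the [0, 1] intactness scale, and the weighted-sum form are illustrative assumptions of mine, not a full specification:

```python
# Toy weighted-bundle utility: each bundle you care about gets a weight,
# its state is scored in [0, 1] (1 = fully intact, 0 = lost), and the
# overall score is the weighted sum. Because the weights sum to 1, the
# maximum (all bundles intact) is 1 and the minimum (all bundles lost)
# is 0; intermediate values depend on the weighting, as described above.

def weighted_bundle_utility(intactness: dict, weights: dict) -> float:
    assert abs(sum(weights.values()) - 1.0) < 1e-9, "weights must sum to 1"
    return sum(weights[name] * intactness[name] for name in weights)

weights = {"bundle_a": 0.5, "bundle_b": 0.25, "bundle_c": 0.25}
print(weighted_bundle_utility({"bundle_a": 1.0, "bundle_b": 1.0, "bundle_c": 1.0}, weights))  # 1.0
print(weighted_bundle_utility({"bundle_a": 1.0, "bundle_b": 0.5, "bundle_c": 0.0}, weights))  # 0.625
```

Plain minimax is recovered by putting all weight on whichever bundle is currently worst off instead of using fixed weights.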
This isn't very sophisticated though, and it can't really handle "bootstrapping from nothing" in a similar way to utilitarianism that's based on averagist logic, preference logic, or both. See those in the "Alternatives" section. This is because it starts by "taking what currently exists." This also means it can't handle you being wrong about what exists, or even its attributes, because it doesn't incorporate "survivable core" logic. More on that in the next section.
I thought of the anchoring idea after thinking about the interaction between the top bound of the utility function and various scenarios where my best guesses about how the world is are wrong. Getting this right is incredibly important, because the distance between the top bound of the utility function and the current evaluated utility is the basis of the risk management in this article. Keeping the utility evaluation as stable as possible without sacrificing too much value is the basis of many of the decisions I have made.
Stated simply, I took minimax and generalized it somewhat. This is the theoretical basis of Rational Risk Management, but it is still fragile and doesn't specify an integration into a full utility function or a decision theory.

Anthropics

A lot of anthropic reasoning kind of assumes totalism (a totalist's "betting patterns" may look like SIA67, and SSA isn't viable8), so I had to figure out what to do. I came up with and inspected multiple Boltzmann brain scenarios where the "brain" is actually a computer, relying on the Church-Turing thesis and substrate independence to simplify reasoning.9 Around this time I checked my understanding of utility functions and positive affine transformations and developed a stand-in utility function with a strong upwards bound.
Bounded utility functions are notoriously sensitive to the location of their zero-point, because in the bounded case the bound(s) sit at a fixed location relative to the zero-point, up to a positive multiplicative transformation. In contrast, any form of totalist utilitarianism has no meaningful overall zero-point (though it will always have a per-person or per-person-instant zero-point, below which that person or person-instant should preferably, all else being equal, be made not to exist), and certain central examples of totalist utilitarianism have no bounds at all. This makes anthropic reasoning instantly important to creators and users of bounded utility functions, because if the world isn't as they thought when they set the zero-point of their utility function, they will start acting in bizarre ways after receiving information that an unbounded totalist utilitarian would take in stride.
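A small supporting derivation, using only standard facts about positive affine transformations: if $U$ is bounded above with $B = \sup U$ and $U' = aU + b$ with $a > 0$, then

$$\sup U' = aB + b, \qquad \sup U' - U'(x) = a\,\bigl(B - U(x)\bigr),$$

so the ratio of any two gaps to the bound, for example how close the current world sits to the bound compared to any other outcome, is invariant under every admissible transformation. That ratio is real content rather than an artifact of scaling, which is why calibrating where the current world sits relative to the bound matters.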
At this point I was blocked on utility function work until I figured out some approximation of correct anthropic reasoning. Rejecting SIA, SSA, and anthropic probabilities as a whole was relatively easy following the thought experiments shortly described above. Remember that probabilities as a base construct are not required for rationality. L-UDT is assumed to contain a general solution to anthropic reasoning10, but I'm working with only weak anthropic reasoning for now, because it's required for reasonable risk management. I treat the actual information processing as a problem for bounded rationality (and not the utility function), and I assume the strong upward bound on the utility function is enough motivation to "explore" for anthropic reasoning relevant risks.11 Weak anthropic reasoning may be fine for provisional utility function calibration, but I'd like something better. I may attempt to work on this more in the near future.

The Function

It was suggested I could write and publish an article on Less Wrong based on an E-Mail I wrote last year, partially quoted here12. I'll try not to leave anything out of the main text, but if I do, look there.

Risk

Standing alone from the rest of the article, the first important thing to understand is that risk requires a utility function. My use here isn't going to be incredibly rigorous, because risks can do funny things like combine or cancel in very strange and complex ways. I won't do the math that demonstrates the mechanics of risk in this article, but I will cite relatively easy guides that would help you compute approximations good enough for many real-world tasks.

What Probabilities Are

In general, it's assumed that something called a "probability" in a technical context at least tries to approximate the Kolmogorov axioms13, but sometimes the requirement that the probabilities you use must always sum to 1 is relaxed, while normatively (but maybe not descriptively) trying to avoid destroying everything else.
I suggest when you use the word "probability" you should be actively thinking about the requirement that probabilities must sum to 1, and attempt to approximate that requirement, even if you are unable to deal with the "everything else" probabilities and have to leave them out of specific reasoning.
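As a tiny illustration (my own example, not from the E-Mail) of leaving an "everything else" bucket in place rather than pretending the named hypotheses exhaust probability space:

```python
# Explicit credences over the hypotheses you can actually reason about;
# whatever is missing from 1 is the unmodeled "everything else" bucket,
# left out of specific reasoning but never set to zero.

credences = {"hypothesis_a": 0.55, "hypothesis_b": 0.30}
everything_else = 1.0 - sum(credences.values())
print(round(everything_else, 12))  # 0.15
```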
I currently have somewhat strange requirements on what "real" probabilities are. Here, I'll use the word "probability" for things that might better be called "pseudo-probability," except when I'm talking about L-UDT, unless I make a mistake. "Real probabilities" need to be derived in a certain way from Solomonoff induction. That's why they can't be 0 or 1 exactly. Depending on the exact context, pseudo-probabilities can be exactly 0 or 1, and can sometimes be left out of a calculation altogether in a way that's equivalent to them being 0 or 1 exactly. Humans have limited capabilities, and quite often the "probabilities" they use are not close approximations of real probabilities much at all.14
Many types of betting odds can be quickly converted into pseudo-probabilities. For example, betting odds of "11 to 2 in favor" can be converted to the pseudo-probability of the "favored referent" by setting X to 11, Y to 2, and then running the formula 'favored_referent_pseudo_probability = X / (X + Y)', giving a pseudo-probability of about 0.846.15

Logical Induction, used as part of L-UDT, uses pseudo-probabilities16. These are not real probabilities under any prior, because logical statements are not encoded in the needed way in the correct Solomonoff probability measure. Only a single "logically true" entity, faithfully encoded, could fit into a measure, and it would exclude everything else, because it would need to be given a "probability" of exactly 1. Further, for the Solomonoff probability measure in particular, the empty string would need to be given a new meaning, a doubtful procedure. Solomonoff induction predicting a physically realizable sequence171819 won't ever receive as input a mathematical proof that isn't ever emitted in the universe block2021 and would therefore quickly converge22 away from producing proofs that would not have already existed. This extends the argument to a finite-length bit string input, even though Solomonoff induction must "know" every single logical statement to be true or false, because it's perfect at not assigning probabilities to them23.
You could try to say that Solomonoff induction's ability to output proofs in some binary encoding means Solomonoff induction does relate to logical probabilities, but the argument doesn't work. By picking a "reasonable" universal prior, proofs will either be generated by some "internal system" that Solomonoff induction is simulating (making Solomonoff induction not the base truth), or they will be "directly encoded" and assigned "probabilities" that strongly correlate with length in the expected way. You could use a "less reasonable" universal prior and try to get away from that, but there's a strong limit to how well you can get the prior "probabilities" assigned to the encoded proofs to match their logical truth or falsity. Someone able to inspect the method for generating such a prior would easily tell there's something up. All that and it will still mostly only output proofs that you "put in" yourself by rigging your universal prior. It will have no way of updating to more accurate probabilities from there, as shown one paragraph above.
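As a concrete aside, the odds-conversion formula given at the start of this section, as a minimal Python sketch (the function name is mine):

```python
# Convert "X to Y in favor" betting odds into a pseudo-probability
# for the favored referent, using the formula from the text.

def odds_in_favor_to_pseudo_probability(x: float, y: float) -> float:
    return x / (x + y)

print(odds_in_favor_to_pseudo_probability(11, 2))  # ~0.846
```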
I'll avoid using "likelihood" in the colloquial sense, and I'll make it clear what I mean when I'm using related terms. The technical definition of the equivalent of likelihood24 is a very simple and easy-to-understand part of Bayesian statistics; if you don't know it, I suggest learning it. Note also the dimensionless number "likelihood" in Bayes's theorem, which sometimes can't be a Bayes factor itself unless you are careful to always do Bayesian statistics on the same measure.25 Also note other uses.262728
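For reference, a minimal statement of the relevant pieces (textbook material, nothing specific to this article):

$$P(H \mid E) = \frac{P(E \mid H)\,P(H)}{P(E)}, \qquad \frac{P(H \mid E)}{P(\lnot H \mid E)} = \frac{P(E \mid H)}{P(E \mid \lnot H)} \cdot \frac{P(H)}{P(\lnot H)}.$$

Here $P(E \mid H)$, read as a function of $H$ with the evidence $E$ held fixed, is the likelihood, and the ratio of likelihoods in the odds form is the Bayes factor.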

What I Mean by "Utility Function"

In general I'm assuming a reasonably close approximation of a von Neumann-Morgenstern (vNM) utility function29.
Universally, vNM utility functions (mathematically) can't have tie-breaker clauses, but if my understanding of the reals is correct, you should be able to "find" a number to use as a multiplier for these clauses that is more than zero, but is still almost exactly zero, for any definition of "almost" that doesn't involve infinities. If you want a further layer of tie-breakers, norm an "almost′" ("almost prime") to the multiplier you picked for the last layer, and repeat the previous process with that. This can be repeated overall with an "almost prime prime," and so on.
As long as you get your definitions of the "almost"s you use right, any practical quality of numerical approximation will let you implement vNM-compatible "tie-breakers" as actual tie-breakers, without deviating more than epsilon from the ground-truth version that works in real numbers.
This type of procedure would allow you to implement ordering that is more or less "lexicographic" (like the sorting in a dictionary) in other ways, without technically violating vNM rationality. My version of Anchored Utility Functions doesn't use this, and a variant could use "almost but not quite" tie-breakers instead of the procedure described above, but that would make the agent a "worse negotiator" in some cases. I suggest the "almost exactly zero" procedure should produce correct tie-breaker behavior in a "very large number of" cases, but do err on the side of making the multiplier or multipliers very small indeed, as described above, to maximize that property.
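A minimal sketch of the layered-multiplier idea in Python. The epsilon values here are illustrative stand-ins only: with double-precision floats they have to be far larger than "almost exactly zero" so the arithmetic stays resolvable, and an exact-arithmetic implementation could push them much smaller:

```python
# Three-layer near-lexicographic scoring: a primary score plus two
# tie-breaker layers, each weighted by a multiplier "normed" to the
# layer above it. All inputs are assumed to lie in [0, 1].

EPS = 1e-6          # multiplier for the first tie-breaker layer
EPS_PRIME = EPS**2  # "almost prime": normed to the previous multiplier

def scored(primary: float, tiebreak_1: float, tiebreak_2: float) -> float:
    return primary + EPS * tiebreak_1 + EPS_PRIME * tiebreak_2

# A genuine primary difference dominates both tie-breaker layers...
print(scored(0.6, 0.0, 0.0) > scored(0.5, 1.0, 1.0))  # True
# ...an exact primary tie is settled by the first tie-breaker...
print(scored(0.5, 0.9, 0.0) > scored(0.5, 0.2, 1.0))  # True
# ...and a tie at both higher layers is settled by the second.
print(scored(0.5, 0.9, 0.8) > scored(0.5, 0.9, 0.1))  # True
```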
Someone experienced would be able to tell that I don't think "infinitesimal" (or "infinitesimally distant from certainty") probabilities are "real" probabilities, by reading the "What Probabilities Are" section. They aren't in the reals either but that's by definition.
I don't know how to do the math for infinite or infinitesimal utilities or "utility fragments" (numbers that go into calculating a utility), but I don't think they would be useful and I don't recommend studying them.
Anchored Utility Functions, and upwardly bounded utility functions in general (though they are quite often more fragile) instantly30 solve the St. Petersburg paradox31 in the upwards direction.
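A minimal numerical sketch of the upward direction (the payoff schedule and the bound are arbitrary choices of mine, purely to show the mechanism):

```python
# St. Petersburg-style gamble: with probability 2**-k you receive a prize
# whose raw utility is 2**k, for k = 1, 2, 3, ...  Unbounded, each term
# contributes 1 and the expectation diverges; with an upper bound on
# utility the series converges.

BOUND = 1_000.0  # arbitrary upper bound on utility

def expected_utility(n_terms: int, bounded: bool) -> float:
    total = 0.0
    for k in range(1, n_terms + 1):
        raw_utility = 2.0 ** k
        utility = min(raw_utility, BOUND) if bounded else raw_utility
        total += (2.0 ** -k) * utility
    return total

print(expected_utility(50, bounded=False))  # 50.0 -- keeps growing with n_terms
print(expected_utility(50, bounded=True))   # ~10.95 -- converges
```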
"Discounting small probabilities" only works against this style of problem if the discounting is 100.0%, i.e. a proposed solution is to ignore entirely small probabilities. This is a horrible idea32.33 This means the general class of problems that increase negative "promised" (in some sense) utilities faster than you/the agent can lawfully decrease your probabilities on such "promises" is still live.
This could be "fixed" by a lower bound on the utility function in addition to the upper bound required by Anchored Utility Functions, but only at the cost of "negative timidity"34, a particular type of extreme risk-seeking (as evaluated by an unbounded utility function).
This is mandatory to some extent, because vNM utility theory doesn't allow for infinite negative utilities. I will be able to enumerate possible solutions to this problem at some point, unless something goes wrong.
One "easy" solution would be to hide your exact utility function so no adversary can generate the "promises"35 in a way that isn't clearly just guessing or searching a large space.
This isn't really viable and is a problem because it conflicts with the entire reason-of-existence of human-relevant or partially human-relevant advanced decision theories, avoiding central reliance on the human/agent in question having "free will."36 This isn't something an agent should expect to reliably have, under any conditionalization I consider reasonable. Adversarial efforts are possible, and must be prepared for.
I'm reasonably sure a reasonable definition of "identity" must include at least something that makes you at least act like you have something that at least vaguely approximates a utility function, assuming you had that in the first place. Bounded rationality requires you do, so if you don't, get to it. If someone replaced you with a modified version that had something approximating a negative affine transformation of your current utility function37, I'm not sure why you would avoid saying that person is importantly not the same person that you are at the present.
Functional Decision Theory (FDT)38 is only obviously better than Causal Decision Theory (CDT)39 for an agent that might, at some point, become predicted in a reasonably strong sense by an adversarial agent. In the limit this is done by simulating "you" in the situation or situations its bounded rationality says it should know about.
In that case, the computer science problem the adversary takes is called "you-complete"4041, because it must have "you" as a dependency42, taken as a computable function or computational process.
Because your utility function (as it is in implementation) is part of "you," if you follow FDT because you think an adversary may be able to solve "you-complete" problems, you should reject hiding your utility function as a viable single solution to the negative St. Petersburg problem.
Otherwise you should consider using CDT as long as you remain a bounded rationality agent43, because CDT uses fewer resources than FDT44, and CDT implementations can achieve great victories over other agents if the agent can hide things from others.45
Even if the above is the case, it's possible that an "updateful" TDT (Timeless Decision Theory) could do better than CDT by "partially revealing" (in some manner) its internal functions. Supposedly CDT doesn't like using quantum random number generators (or their philosophical/metaphysical analogues)46 in certain cases, and this might be similar to this scenario. This would be because the agent wouldn't "know" what part or parts of its internal structure are hidden, like in the Death scenario where it doesn't "know" that the outcome of its decision process is less hidden than a fair quantum coin from the perspective of all other agents. Realistic cases are very complex though, and I suspect it gets subtle very fast.
Definitionally, a "partial reveal" prevents an opponent from doing a "you-complete" procedure, unless it gets information from another source, guesses, or does large-scale searches through mind-space, the last two leading to major uncertainty.
I never got to Beyond vNM: Self-modification and Reflective Stability by Cecilia Wood (hosted by PIBBSS), so if it has any useful content I'm missing it. Most attempted deviations from vNM rationality haven't gone very well. It's of limited importance for this article, but in general self-modification, as well as the possibility of failing to prevent modification of the self by external forces, is very difficult to handle in a realistic manner.

"Risky" Risks vs. Risk Reducing Risks

Anchored Utility Functions do take Risky Risks in certain situations. This is generally in poorly evaluated scenarios with bad prospects, though interactions with decision trees make this very complex. Even so, this strongly contrasts with common statements of totalist utilitarianism. In addition, Anchored Utility Functions can be modified to take very slight Risky Risks, even in excellently evaluated world states with good prospects, though that isn't part of the central, most stripped-down examples.
In general though, I advocate limiting almost all substantial risk-taking to properly considered Risk Reducing Risks, risks that serve to either improve the agent's position re. future risk management without being too risky in a global anchored value analysis, reduce overall risk when analyzed globally and accounted in the agent's bounded utility, or both.
I call this way of looking at and acting to prevent risk "Rational Risk Management." Presumably other people would call their way of thinking about risk rational, but this is the best I can currently do47.
The example from the E-Mail this article is based on relates to contact with reality and simulation escape. An agent with a drive towards contact with reality that is not derived from correct risk management may take Risky Risks in order to get better contact with the world running the world it started in, and maybe the world that simulates that world, and so on.
This attempt at improving this notion of contact may involve the world it started in becoming destroyed or greatly degraded. Anchored Utility Functions are intended to have a drive towards contact with reality that does not involve taking Risky Risks48.

Risk Section Reading Guide

Depending on the reason you're reading this, you may not need to understand everything I'm saying. The basic ideas are important though, so if you're having trouble with those you might want to start with reading some combination of these three sources:
Joe Carlsmith's rationality series of Web articles, starting with https://joecarlsmith.com/2022/03/16/on-expected-utility-part-1-skyscrapers-and-madmen .
The Affine Transformation Wikipedia page: https://en.wikipedia.org/wiki/Affine_transformation .
The Decision Theory page on the Stanford Encyclopedia of Philosophy: https://plato.stanford.edu/entries/decision-theory/ .
If you don't know the basics of probabilities, the real (or at least "more real") kind that can be used to make one-shot decisions, start with the Joe Carlsmith series in the place I link.
Once you understand that, you can read the other two links. The Stanford Encyclopedia of Philosophy will explain "academic" decision methods and decision theories. They are incomplete and not the best available, but even the Savage theory will let you play around with the numbers that will help you understand how risk depends on the utility function used.

Good & Evil

let Good := The utility function that mathematically defines "winning" (currently unknown)
let Evil := 0 - Good (or any other negative affine transformation)
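A one-line check that the parenthetical is safe, using only the definition of a negative affine transformation:

$$\mathrm{Evil}(x) = a\,\mathrm{Good}(x) + b,\ a < 0 \;\Longrightarrow\; \bigl(\mathrm{Good}(x) > \mathrm{Good}(y) \iff \mathrm{Evil}(x) < \mathrm{Evil}(y)\bigr),$$

so any such transformation ranks outcomes in exactly the reversed order.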

Conditionalization

Like the section on risk, this section is going to simplify the topic. Getting closer to "real" Bayesian conditionalization isn't usually easy in the real world.
let M := The Minimal conditions required to execute Rational Risk Management. Extremely cut-down requirements. These aren't "nice to have"s. This must be very close to exactly right. Remember that anything you write off here, you write off forever. It is theoretically possible that the real world works in a very strange way. This isn't about far-off alternate worlds. Presumably it doesn't even require recognizable memory, though other aspects would need to compensate. Note that some hard-coded "knowledge" of the agent's structure may be required, and would be required if my knowledge of rational influence is complete. This applies to humans in a sense, not just to programmed agents, so in the human case this requirement must be very basic. It is very important to not make M over-broad49.
let SS := The Standard Story. From Carlsmith's Simulation Arguments50, this is the situation you take to be "standard," and the situation the Anchored Utility Function is calibrated to. The Standard Story exists for some reason. According to itself this is because you received information and came up with something "reasonable appearing," but correct anthropic reasoning is unable to privilege the Standard Story over other possible scenarios. Anchored Utility Functions derive their name from the fact that they are "anchored" to a "survivable core" of the Standard Story. In combination with the fact that the calibration and bounding of the utility function derive from the Standard Story, its dubious nature does not detract from its incredible importance. Quite often it is left out that probabilities are conditionalized on the Standard Story. For many "every day" things that might be fine, but for anyone working on anthropic reasoning relevant risks, this point demands extreme care.
If I ever appear to give a probability that looks unconditioned, in many cases I won't have. Commonly I'm either conditioning on M or conditioning on SS conditioned on M. Under certain assumptions, conditioning on SS should mean you don't need to condition on M, but I want it to be clear that they can be separated both ways, and since the "Standard Story" is technically only related to anthropics, I don't know for sure it contains the conditions that let an agent execute Rational Risk Management. It's easier this way.
As an example of this, someone could say "My p(doom|SS|M) is in the range 10%-90%, I haven't thought enough to give p(doom|M), to the extent p(doom|SS) is different from p(doom|SS|M), those parts of probability space can't be helped by Rational Risk Management so I don't spend my time on it, and I'm not sure a p(doom) unconditioned on the existence of even a minimal, warped fragment of my rationality is a meaningful concept."51 I'm very much simplifying here, because probabilities are useless if they don't include your physical world modeling relevant evidence, and only include your unsupported assumptions. In the full notation, the conditionalization would have to be much changed.
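In standard notation, one reasonable gloss of the stacked conditioning above (and it is only a gloss; as noted, the full notation would be much changed) is joint conditioning:

$$p(\mathrm{doom} \mid SS, M) = \frac{p(\mathrm{doom} \wedge SS \wedge M)}{p(SS \wedge M)}, \qquad p(\mathrm{doom} \wedge SS \mid M) = p(\mathrm{doom} \mid SS, M)\,p(SS \mid M),$$

with the chain rule on the right separating the two conditions when they need to be handled separately.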

Anchored Utility Function Forms

let A := An Anchored Utility Function.
let A_prime := Result of a positive affine transformation52 applied to A, such that the upward bound of the utility function is at zero, and the current world state, when conditionalized on SS, is below that at a value reasonable for your numerics53.
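A minimal formula for this normalization, assuming A is bounded above (the constant $c > 0$ is a free choice, used to keep the numbers in a comfortable range):

$$A'(w) = c\,\Bigl(A(w) - \sup_{w'} A(w')\Bigr), \qquad c > 0,$$

so that $\sup A' = 0$ and the current world state, evaluated under SS, lands slightly below zero at a magnitude set by the choice of $c$.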
In this section we will use form A_prime in descriptions, but this form can be converted to another reasonable form at your option.
The A_prime form is useful because it demonstrates that Anchored Utility Functions are based on the concept of avoiding loss, or possibly recovering from it, though for instrumental (and not utility) reasons the latter is generally worse. As the immediate evaluated utility drops further and further below zero, the utility function will appear to take more and more risks to recover to a better position. In certain encodings into decision theories, this also operates linearly to possible future utility evaluations.
When the utility function is created, the immediate utility function evaluation must be slightly below zero, otherwise risk management won't operate correctly. Under no circumstances will a utility evaluation ever come to more than zero in this form.
All Anchored Utility Functions share these characteristics.
  • Bounded upwards.
  • Designed to function in a large number of scenarios.
  • Continues to function if the world turns out not to be like it appears as of writing and/or when the function was designed.
  • Reasonably calibrated with risk and reward. If the world exists as thought, little to no Risky Risks are taken, and the preservation of the world is prominent.
  • Suggests correct action if the world does not exist as thought.
  • Analyzed descriptively, exhibits the non-heuristic target of goal shedding. As scenarios get worse and worse, it preserves the value that remains as best as possible.
  • Treats the agent operating the utility function as a knife slashed at bad outcomes, in analogy54. If the agent is a human, the agent may be valued fundamentally, but not at an outsized level.
  • Analyzed descriptively, exhibits the non-heuristic target of trying to recover lost goals. If valuable parts of the world are degraded/destroyed or turn out not to exist, creating them or returning them to the favored state is prominent.
  • When creating or re-creating the parts of the world that are valuable, exhibits correct risk management. Analyzed descriptively, it is similar to the previous goal shedding in reverse.
If the agent loses tracking on what it was supposed to be anchored to, goal recovery may be hard to analyze. At the time of writing I don't know a human-relevant approximation for e.g. deciding in favor of re-creating the world vs. trying to figure out where it is. Preferably, the system wouldn't "re-create" the world if it already exists somewhere else, but I haven't been able to encode that because figuring out if the world was destroyed isn't easy in certain circumstances.
If the agent is a human, standard considerations apply. By instrumental rationality, the agent will require various equipment, supplies, and conditions for non-degraded operation. Beyond a certain level of suffering the agent will become more and more ineffective. To avoid leaking information, the human agent will need surroundings that match various possible satisfactory states.
A major requirement on Anchored Utility Functions is that they pass the "bootstrap test." They need to be able to re-create things in a reasonable state if they are destroyed or turn out not to exist. Therefore, it is important not to encode various parts of your world model into the utility function, unless you value them enough that you want agents taking up time and resources to re-create them, or create them if they never existed in the first place.
An Anchored Utility Function is anchored to a "survivable core" of the anthropic "Standard Story." I generally assume the "Standard Story" encompasses at least Earth, and might be extendable to other parts of the Solar System. Currently, I'm not sure what the point of that is. Risk management can handle things like the Sun not existing. Trying to get information about places other than the Solar System would take too long. I will generally assume that the utility function is created and anchored soon, and is attached to the Standard Story's account of Earth and the things near it, via the survivable core.
The execution of the utility function is in some sense conditioned on M, so it is okay for the survivable core logic to take that as an assumption.
To create an Anchored Utility Function, various things, currently satisfied requirements, and patterns or configurations of things and currently satisfied requirements are identified as valuable and encodable. I call any one of these things an entity. All these things must exist in the Standard Story to be eligible. Don't add anything that is very dubious, for example souls, aliens, or free will. Maybe some very clever stand-in could be inserted here, but only if it currently evaluates as true on the Standard Story in common circumstances and you are close to sure that it is valuable enough to be worth the trouble.
The insertion point of each of these things can vary slightly, but they all must evaluate as very near the upper bound of the utility function. Nothing can evaluate above the upper bound. For certain things this isn't how you would generally evaluate their current states on the Standard Story, but you must either leave such out of the utility function, develop a stand-in, or set their evaluation to be near the upper bound of the utility function by positive affine transformation.
Consider a toy example. Say you have five entities, and directly value nothing else. The scale here is arbitrary, but assume that all entities have the same value when fully intact.
(10 + 10 + 10 + 10 + 10) - 50 = 0
Note that this is the maximum value of the utility function, the upper bound. Each entity currently evaluates at its maximum state, and since in this example there are no other entities, the utility function overall can go no higher. In the current form, when every single entity is evaluated at its highest point (by proxy), the overall utility is normalized to 0.
If one entity is immediately evaluated to be half degraded, the overall evaluation returns the below.
(10 + 5 + 10 + 10 + 10) - 50 = -5
Note that this isn't an acceptable design for an Anchored Utility Function if the immediate evaluation is like this at design or anchoring time. Something like the below may be more reasonable.
(9.999 + 9.998 + 9.999 + 9.999 + 10) - 50 = -0.005
Though I don't think any current system or organization has enough optimization pressure and physical world modeling capability to handle even that. I expect to have an article with more detailed mathematical requirements soon.
The specificity of the inserted entities must be carefully calibrated. This must be done based on the value of that specificity, due to the requirement that they are to be re-created if destroyed or created if they never existed. For many entities, a smooth evaluation of degradation is possible, and generally is a good idea. Note that this must be designed while considering both re-creation and initial creation to be possible scenarios the utility function may need to deal with, even if such makes little to no sense on the Standard Story. In most cases preventing degradation can be somewhat prioritized over creation or re-creation during the design phase, but note that the utility function can't actually act based on this intent. It treats continued existence the same as creation, excepting clever encodings that may be robust enough to work. Note that you must be careful when specifying indifference: any solution that you claim to be indifferent about compared to the current (at anchoring time) state must be tolerable.
Respecting the previous insertion point requirement, entities can be scaled in evaluated utility based on their value. This means that degradation of a lower weighted entity merits less risk-taking by the agent to prevent or reverse. The entity also merits less risk taking to create or re-create. Note that lower weighted entities still can't "grow" over their top bound, so for example, the below.
(10 + 5 + 10 + 10 + 10) - 45 = 0
Would be the upper bound, in the case that the second entity in the list is lower weighted. In general, this means that you can't insert an entity that you personally evaluate to be in a poor state at anchoring time and incentivize its improvement, beyond the tiny amount allowed by the full mathematics.
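The toy arithmetic above, written out as a minimal Python sketch (entity names, weights, and the clamped-degradation scoring are illustrative stand-ins, not a full Anchored Utility Function):

```python
# Five entities in the A_prime form: each entity's evaluation is capped
# at its own weight, and the sum is shifted so the overall upper bound
# sits at exactly zero. The second entity is the lower-weighted one from
# the last example.

ENTITY_WEIGHTS = {
    "entity_1": 10.0,
    "entity_2": 5.0,
    "entity_3": 10.0,
    "entity_4": 10.0,
    "entity_5": 10.0,
}
UPPER_BOUND_OFFSET = sum(ENTITY_WEIGHTS.values())  # 45 in this parameterization

def a_prime_evaluation(intactness: dict) -> float:
    # Intactness is clamped to [0, 1]: an entity can degrade below its
    # anchored state, but can never "grow" above its own top bound.
    total = 0.0
    for name, weight in ENTITY_WEIGHTS.items():
        total += weight * min(max(intactness[name], 0.0), 1.0)
    return total - UPPER_BOUND_OFFSET

fully_intact = {name: 1.0 for name in ENTITY_WEIGHTS}
print(a_prime_evaluation(fully_intact))                       # 0.0, the upper bound
print(a_prime_evaluation({**fully_intact, "entity_3": 0.5}))  # -5.0
print(a_prime_evaluation({**fully_intact, "entity_2": 2.0}))  # 0.0: no growth above the bound
```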
Each entity must be defined based on the "survivable core" of the Standard Story, not defined based on the Standard Story itself. Each entity itself must be a survivable core of what it represents.
Survivable cores are designed to return similar results given a large number of things turning out differently than currently thought. Value is fragile, so certain entities may have to be less survivable and more specific than others, but for example an entity designed to stay attached to consciousness may also attach to certain types of user illusions as well. Each entity is anchored initially to the survivable core of the Standard Story, but can stay attached to various things, in a huge number of anthropic scenarios. It is very important that this attachment smoothly transfers; there should be no jump in evaluated utility, relatively positive or negative, if attachment transfers to a different thing, unless that transfer directly impacts the value of the entity. The same goes for varying anthropic scenarios.
This tuned and non-full specificity is one of the large improvements over a modified minimax system. This ability to relax specificity allows creation or re-creation of entities, without destroying risk management, anthropic and otherwise.
Anchored Utility Functions attach to whatever is backing apparent conditions, and do not "condition" on conditions being as they appear. Note that it is very important that, as evaluated by an Anchored Utility Function, situations where the universe "turns out to be much worse than expected" are truly so much worse as to justify the massive switch away from risk management. For Anchored Utility Functions, we must get this right the first time. We can get some hope on this by following mathematical restrictions on what is eligible to be included as goal content, as well as making sure we know enough, given the Standard Story plus some perturbation, to create a survivable core that we can rate very close to the upper bound of the utility function.
Note that a correct decision theory wouldn't directly switch between attachment scenarios like that in most cases, because an agent following such wouldn't generally be that close to certainty, get more information, then be that close to certainty (again) re. another position in such little time. However, at a utility function level the previous description is correct.
Entities are positively valued, themselves survivable cores, and are initially anchored to the survivable core of the Standard Story. This is where tuning occurs to make sure they are located near the upper bound of the utility function. Attachment to other things and scenarios can happen after that. The positive value of entities is directional; note that in the A_prime form of the utility function they can never evaluate above zero after being anchored. In the form used in this section, the lower bound of the utility function is set by the requirement that each entity cannot evaluate to a value below 0, though the entire utility function can never evaluate to a value above 0. Note that in practice the decision theory and utility function may not be fully separate, and a trade-off to dynamical stability may be made in favor of better defeating the negative-direction St. Petersburg paradox.
Note that all the anchoring happens when the utility function is initially set up. It always gets anchored to the survivable core of the Standard Story, even if it starts out in a simulation, for example. The agent tries to hold tracking on its initial location, so it will always care about the things it was anchored to in the simulation, even if it leaves the simulation or if it didn't "suspect" it was in a simulation when it was anchored. If the agent did strongly suspect it was in a simulation at the time of anchoring, that won't change this. Anchored Utility Functions must be robust and function with correct risk management either way.
As an example, imagine there are two Earths that are somehow in causal contact. As long as the agent can keep tracking, the Earth the agent initially was anchored to is the one that it cares about. To maintain stability of the utility function, Anchored Utility Functions can't transfer or expand care to anything on the other Earth either, except possibly indirectly due to instrumental rationality. This is due to the location of the utility function's upper bound. Note that in this paragraph I mean "care" in a direct sense. More broadly, rationality requires you to care about anything that could causally affect you or anything you directly care about. This paragraph's reasoning also applies to a somewhat analogous situation involving a simulation. The word "care" also has different meanings. You care (in the sense that you are not indifferent) about things that could influence things that you care about (in the sense that you value them). Note that this can be much more indirect and complex than that simple example.

Future Discounting

I don't suggest any level of future discounting55 as a general policy. Future discounting is not part of an Anchored Utility Function. Preferably it wouldn't be part of an agent's decision function either, but in humans there is commonly a blurring of the decision function and bounded rationality. Future discounting is properly a descriptive element that arises from bounded rationality if it arises at all. By this I mean that a bounded rationality agent has a limited ability to process information into forms relevant to all outcomes and decisions, so it may focus its efforts on pursuing the computations needed for short-term gains over long-term efforts that may be more difficult to get right. For short-term survival, this seems correct in certain circumstances, but I'm not sure of the ways that it generalizes.
For a human rationalist, I suggest very mild discounting that's as close to exponential as you can get it. Something close to exponential discounting is wanted for my version of stability. If you're reading this "in the future" and bounded rationality and decision theory are better, consider how much you can eliminate future discounting. Note that exponential discounting causes problems with the type of learning-theoretic rationality assumed in some parts of this article56, so as a trade-off the discounting rate should be as slow as possible.
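A minimal sketch of "very mild, close to exponential" discounting; the discount factor below is an arbitrary illustrative value, not a recommendation:

```python
# Exponential discounting: a constant per-step factor gamma, kept as
# close to 1 (i.e. as slow) as the rest of the machinery tolerates.

GAMMA = 0.999999  # per-step discount factor, illustrative only

def discounted_value(undiscounted_value: float, steps_in_future: int) -> float:
    return (GAMMA ** steps_in_future) * undiscounted_value

print(discounted_value(1.0, 1))          # ~0.999999
print(discounted_value(1.0, 1_000_000))  # ~0.37: even mild rates bite over long horizons
```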
Even without future discounting, thinking you're going to lose in the future isn't an excuse to lose now. P(hope) = 1 - P(doom), and in your physical world modeling you should be trying to get as close to real probabilities as possible. Pseudo-probabilities can sometimes be exactly 1, but real probabilities can't. Since P(doom) < 1, P(hope) > 0. Importantly, anything approximating Solomonoff induction also must implement the Principle of Multiple Explanations. All physically possible epistemic methods require time to work, and even with just basic special relativity, current levels of bounded rationality can't predict when important information may come in, information that may lead to major changes in your "posterior" probabilities57. This means having more time is convergently instrumentally rational in this type of scenario.
Using old scientific data to support "new" hypotheses might be "scientifically dubious," but it may be needed to avoid losing in such situations. We don't have a system that can store the universal prior, and we especially don't have one that can work with it, so we have to approximate a system that "always knew all hypotheses always" as best we can.
Note that Anchored Utility Functions can reduce a particular motivation for future discounting. Certain utility functions are very sensitive to certain facts, e.g. certain kinds of utilitarianism may care about consciousness, a subject where future results may be hard to predict. In such a utility function, avoiding future discounting may be hard. In contrast, Anchored Utility Functions are relatively stable in the face of revelations regarding anthropics, consciousness, and similar questions about underlying reality. This reduced fragility supports slower discounting.
All else being equal, put off losing as long as you can58.

Universal Prior

I commit59 to using the ("safe"?) universal prior most simply constructed from Bitwise Binary Lambda Calculus (B-BLC), taking no inputs. This may be calibrated for a reflectively tolerable version of M if I know how to do this.

Space Exploration

This type of utility function isn't strongly against exploring and securing space. The remainder of this paragraph is epistemic and could be wrong without any impact on the utility function. Any reasonable interstellar probe powerful enough to get much of anywhere, as well as many designs of probes and transports that stay within the solar system, would be very powerful impact weapons. (Dwarf) planets, moons, and large orbital installations would not be able to dodge them. This is probably sufficient to rule out crew, as a single reason among many. The vehicles would need to be programmed for self-defense of integrity, and have multiple self-destruct steps staged to be as graceful and non-disruptive as possible. This doesn't rule against carrying materials, equipment, and data to reconstitute humans, but e.g. after conditioning on a gamma-ray burst powerful enough to be unsurvivable on Earth the result would have badly constrained parameters, importantly including beam angle60. For reasonable protection against a relatively wide-angle gamma-ray burst, colonization probes would have to be sent far away, meaning there would be no humans active outside the solar system for a long time, because of the time it takes a probe to reach its destination.
Most other threats that interstellar colonization may attempt to protect against would probably be directionally the same regarding the aspects discussed.
Maybe this is a skill issue and e.g. gamma-ray bursts that threaten humanity will be almost entirely ruled out unless there's some adversarial process; even so, it may be a good thing to launch colonization probes soon, but only to distant targets. This would ensure that any humans outside the solar system only become active after the crossover point of the two risk curves, where the first risk curve represents the risk of having a human civilization go against the utility function combined with the risk of having a human civilization out of fast communication (re. e.g. coordination and negotiation purposes), and where the second risk curve represents the risk of not having a human civilization active outside the solar system.61 This crossover point may be far into the future, but considerations including probe longevity and/or the risk of machine-only (no active humans) star-system hopping must be respected. For the latter, the appropriateness of various error-correcting and replication systems would be highly important. Again, this is epistemic.
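A minimal sketch of the crossover-point calculation; every functional form and number below is an invented placeholder used only to show the shape of the comparison, not an estimate of anything:

```python
# Two hypothetical hazard curves over time t (arbitrary units):
#   risk_with(t)    -- placeholder risk from having an active human
#                      civilization outside fast communication range
#   risk_without(t) -- placeholder risk from having no active humans
#                      outside the solar system, assumed to grow over time
# The crossover point is the first t at which going without becomes the
# larger risk. All numbers are placeholders.

def risk_with(t: float) -> float:
    return 0.03

def risk_without(t: float) -> float:
    return 0.001 * t

def crossover_time(horizon: int = 1000):
    for t in range(horizon + 1):
        if risk_without(t) >= risk_with(t):
            return t
    return None

print(crossover_time())  # 30 with these placeholder curves
```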
Timing is almost everything here, and we can get it right the first (and probably last meaningful) time.
(Probabilities in this section conditioned on M and other vague things. Maintain epistemic vigilance on probe lifetime calculations. They are not something easy to verify, even in absolute terms.)

Anthropics and Risk Management

Many arguments relating to space depend on what I currently consider incorrect anthropic reasoning. Note that by my understanding, weak anthropic reasoning is consistent with no aliens anywhere at all. The agent's existence does not need to be explained as being drawn from a probability distribution. I'm not sure why any proposed extensions to weak anthropics I would support would tend to support the existence of aliens in the "they're in space" sense either62. I'm not sure how seriously this is taken by researchers. Many of them appear to reify some sort of multiverse in order to use "strong" anthropic reasoning, using chains of logic that don't make sense to me.63 For example, in the Grabby Aliens64 model the multiple Hard Steps are tuned to shape the curve describing emergence of aggressive aliens to minimize the surprise relative to a variant of the mediocrity principle65. The mediocrity principle takes a reference class, and a fixed reference class is not generally correct for a partial-updating agent. In any case, the easiest reference classes to take don't seem to me to be good approximations of the correct anthropics. See the appendix for more on reference classes.
I think Grabby Aliens doesn't suggest waiting more than 20-50 million years66 from our current date to start aggressively expanding6768, but since I don't currently accept the anthropic reasoning behind the parameterization, this conclusion might be avoidable69.
Note that 0.75 of light speed is the expansion speed of the bubble, not the probe speed. The probe speed must be faster.70 A low number of Hard Steps, similar to Brandon Carter's original estimate71, here taken to be 1.5 to emphasize the statistical nature of the model, puts stress on the Grabby Aliens model's relevance. For example, at an n of 1.5, the 5% low Time Till Meet value is much further in the future than the corresponding value at an n of 6. Note that if we don't expand this will be pushed even further into the future72.
An alternative to the previously mentioned idea of a single, fixed-in-time crossover point at which Human life outside of Earth would be set to start is a modified dead man's switch architecture, where probes can be canceled before they start Humanity somewhere in the universe.
This is a trade-off towards trusting the future more than might be prudent, and getting it right the first time might be better. The increased flexibility may be more useful if you agree with anthropic reasoning that I reject, such as that used by Grabby Aliens. My personal implementation of Anchored Utility Functions would generally think (in some sense) that, conditional on Humans existing outside of Earth being useful, an Earth that still exists can't be trusted. This is because the humans outside of Earth serve as backups, so it wouldn't be useful for them to exist if Earth was fine73.
In addition, there's a trade-off against resource usage with further waves74. Since I don't see non-Earth humans being pushed into the future as much of a problem, Earth sending out a cancel wave every few million years may be an appropriate solution75. Depending on the exact nature of the expansion of the universe, probes that are too far away to be canceled must either self destruct, wasting resources, or serve as limited-time backups that send probes in the direction of Earth to replace possible (but unknown to the sending probes) losses. This would presumably be done almost at the last possible time, with the resource use justified as a hedge against entities that destroy or corrupt probes that were nearer Earth at an earlier time, but become defunct later.
Note that nothing in this class of proposals does much of anything against false vacuum collapse bubbles and similar destructive forces. It is important for an argument for expansion to both use correct anthropic reasoning as well as calculate the effect certain actions have against certain threats.

Contact With Reality

I think contact with reality is very important. If you're not optimizing reality, what are you optimizing?
However, this isn't the entire story. Consider a simple simulation arrangement, with a single basement universe and a single large-scale, high fidelity world computed in forward temporal order on a classical computer.
If an agent started within the simulation with an Anchored Utility Function, it wouldn't care about the basement universe non-instrumentally. I consider this correct, and a good demonstration of Anchored Utility Functions' "survivable core" logic. In my version of Contact With Reality, seeking and receiving correct information and productively optimizing the real world are the core components. You do not have to care about something directly to optimize it. Here, Contact With Reality is risk management; it does not seek information for epistemics alone.
The complex risk-managing actions required in this sort of scenario are part of instrumental and epistemic rationality, not the utility function. It is important to know that in my conception Contact With Reality is not a separate component of the system, but operates via risk management, driven by the upper bound of the utility function.
Note that this logic applies to worlds "parallel" to the world the agent initially anchored to as well, and maybe worlds in simulations run in the agent's anchored world, but only in certain parameterizations.
Aspects of Contact With Reality are quite often discussed around Experience Machines. In my opinion this is non-central, but it is important to address an argument from this area.
Note that the mechanisms of Experience Machines are generally underspecified, and that the decision theories described in this article have trouble supporting hedonic indexical value, as well as the (decision theoretically) extreme aversion to certain stimuli many humans prefer. In general, for someone who wants to start acting with robust and stable preferences now, contact with reality appears to conflict with meta-ethical hedonism. This conflict is strengthened when combined with the inability to stably encode such short-term indexical value, given current decision theories.
Postulating that all your current memories are from an Experience Machine, should you disconnect? Maybe, because without contact with reality you can't correctly evaluate risks. Remember that the Experience Machine is embedded in an outside world, a world that presumably contains risks. Risks to the machine or the integrity of its settings directly relate to risks to the internal world, such as it exists, and therefore the experiences you receive.
Note that the above assumes that you can re-connect later, or the contents of the internal world within the machine can continue to be run without you.

Possible Reader Questions

(Commonly known as a FAQ section.)
Q: Is this utility function utilitarian? A: It is not, except in the technical sense that it is a utility function that endorses its own use.
Q: Doesn't this massively privilege the preferences of currently existing humans over anyone or anything else? A: It does, but there's no alternative. The preferences being optimized for must be extremely strict even for a totalist preference utilitarian, because otherwise you76 won't be waking up in utopia, even if the AI "nicely" destructively uploads your brain. Something else will, with its personality, memory, and preferences optimized to be as satisfiable as possible under as many conditions and eventualities as possible, with far fewer resources. Or, slightly trading off resource use, the AI can use the replacement of all humans with new "more optimal" ones to lessen the repugnant conclusion to an extent, by setting the utilitarian zero-point higher. Even if you had an AI that "did what you want, not what you say," you'd better want something very, very specific.77
Q: Does this theory consider "alternate Everett branches"78, Lewis-style "Possible Worlds," other universes (due to inflation or otherwise), things in the universe that's simulating us, things we become causally disconnected from, things in Boltzmann brains, etc.? A: Not really, except as a part of threat assessment and risk management. Facts don't care about your feelings, but your feelings79 don't have to care about certain facts either.
Q: Is this utility function longtermist? A: If parameterized correctly, the utility function is "wisdom longtermist," because it helps humanity survive in a state that is no worse than its current state, as rated under various traded off basic and more obscure concerns. If you don't know the perfect mathematical definition of winning, don't lose until you figure out what it is.
Q: What's your position on the reversal test? A: The reversal test80 is a supposed partial solution to status quo bias. The problem I have with this test, and especially its "double" variant, is that it doesn't account for the fact that Rational Risk Management must be done from an actual pre-existing utility function. If a utility function prefers a state somewhat close to how it exists, it will put most of its optimization pressure into risk management, and won't prefer to either increase or decrease the reversal-tested parameter. This is plausibly done by my preferred parameterization of Anchored Utility Functions. Anchored Utility Functions correctly respond to changed conditions, so the double reversal test entirely fails. There is no reason to change a parameter just because it "would be" bad risk management and/or use of resources to change it back in a "counterfactual" scenario where Rational Risk Management modified it from its current state.
Q: Would the use of this function have led to better choices by Sam Bankman-Fried (SBF) and FTX? A: Probably not. In theory it has better grounding in risk management than a simple risk-neutral "Unit of Caring" dollar-utilitarianism, but even there, under the double threats of adversarial action and bounded rationality, 1x Kelly (or even more) is too much to bet81. You can say what you want about "Everett branches," but even assuming that's how it works, very few of the "branches" would have ended well, and at some quantity of fiat currency held82 you're going to be forced into the world of rationally managing physical resources. The quantity of useful money to be held is capped under the size of the world economy. He had the resources and information to tell that many of his actions were negative EV even in the "quantum" sense he preferred. He needed to give the right orders and data to his people. Anyway, as clearly shown by SBF's lack of effort to avoid kidnapping (avoiding which is more or less a convergent instrumental goal for humans), his utility function wasn't the main problem. Plain old terrible risk management ("probably"83 made worse by improper drug use) was.
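To make the Kelly point concrete, here is a small toy simulation, entirely my own construction: an even-money repeated bet where the bettor's estimate of the win probability is optimistically wrong, comparing 0.5x, 1x, and 2x Kelly sizing. The numbers (true probability 0.52, estimated 0.55) are arbitrary placeholders, not anything claimed about FTX.

```python
# A toy simulation, entirely my own construction: an even-money repeated bet
# where the bettor's estimated edge (p_est = 0.55) is larger than the true
# edge (true_p = 0.52), comparing 0.5x, 1x, and 2x Kelly sizing.
import random

def kelly_fraction(p_est, b=1.0):
    """Kelly fraction for a bet paying b:1 with estimated win probability p_est."""
    return p_est - (1.0 - p_est) / b

def simulate(true_p, p_est, multiple, n_bets=1000, seed=0):
    """Final bankroll (starting at 1.0) after betting `multiple` times Kelly each round."""
    rng = random.Random(seed)
    f = max(0.0, min(1.0, multiple * kelly_fraction(p_est)))
    bankroll = 1.0
    for _ in range(n_bets):
        stake = bankroll * f
        bankroll += stake if rng.random() < true_p else -stake
    return bankroll

if __name__ == "__main__":
    for multiple in (0.5, 1.0, 2.0):
        results = sorted(simulate(0.52, 0.55, multiple, seed=s) for s in range(200))
        near_ruin = sum(r < 0.01 for r in results)
        print(f"{multiple}x Kelly: median bankroll {results[100]:.4f}, "
              f"near-ruin in {near_ruin}/200 runs")
```

Under the misestimated edge, the fractional-Kelly bettor still tends to grow the bankroll, while 1x and 2x Kelly shrink it, which is the bounded-rationality half of the point above.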
Q: What is a pseudo-probability? A: Something that's like a probability, but isn't derived in the correct way from Solomonoff induction, in a "reasonable" approximation. "Reasonable" is somewhat situationally dependent. Not interchangeable with "quasi-probability." For example, a Logical Inductor puts pseudo-probabilities relating to truth values on logical sentences. Many attempts at anthropic reasoning use pseudo-probabilities.
Q: Is the utility function totalizing? A: Yes. Considering "totalizing" as the procedure that approaches a total preference order over all lotteries, any self-recommending but imperfect utility function should be "totalizing."

My Personal More Conservative Analysis

I constructed a "decision theory" to help with my anthropic reasoning. It isn't very good at expected utility maximization in the UDT sense, but it's much more tractable for a human.
The "theory" features a minimum viable level of copy cooperation. Only weak anthropic reasoning is used.
The system operates as an outer and inner loop. The inner loop is a CDT (Causal Decision Theory) agent, possibly featuring a ratification procedure. The outer loop "monitors the anthropic situation" in a vague sense and can alter, freeze, and roll back then freeze the inner loop's utility function84.
If an agent strongly suspects it's going to be copied, the outer loop alters the inner loop's utility function for relatively minimal copy cooperation, if possible. If the outer loop "knows" enough about the situation, it can "hold tracking on" all the copies, and release the utility freeze once the anthropic situation is "over." If it doesn't, or it can't hold tracking in the right way, the freeze is permanent.
If the outer loop "finds out" the agent is in an "anthropic situation," and it "knows" when the situation started, it rolls the inner loop's utility function back to that start time, if that's viable. If that's not the case, or vague heuristics trigger, it doesn't roll back. In any case, it freezes the utility function of the inner loop forever. This can be triggered even after the agent executes the procedure in the previous paragraph, if it didn't notice entry into an earlier anthropic situation.
As an example, if an agent knew it was going to be copied, was copied, kept tracking on all the copies, and then eventually it was the only copy left, the outer loop could un-freeze the inner loop's utility function. In most other cases it will freeze/lock out forever at some point, unless it can't/never does detect any obvious anthropic scenarios.
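A minimal, invented sketch of the outer/inner loop as I read the description above. All names are placeholders, the "monitoring" is reduced to explicit notifications, and the CDT inner loop appears only as the utility function it would be handed.

```python
# A minimal, invented sketch of the outer/inner loop described above. All names
# are placeholders; the "monitoring" is reduced to explicit notifications, and
# the CDT inner loop appears only as the utility function it would be handed.
import copy

class OuterLoop:
    def __init__(self, inner_utility_fn):
        self.inner_uf = inner_utility_fn                  # used by the CDT inner loop
        self.history = [copy.deepcopy(inner_utility_fn)]  # for roll-backs
        self.frozen = False
        self.permanent = False
        self.tracking = False                             # holding tracking on all copies?

    def expect_copying(self, cooperative_uf, can_track):
        """Agent strongly suspects it will be copied: install minimal copy cooperation."""
        if self.permanent:
            return
        self.history.append(copy.deepcopy(self.inner_uf))
        self.inner_uf = cooperative_uf
        self.frozen = True
        self.tracking = can_track
        if not can_track:
            self.permanent = True                         # freeze is permanent

    def anthropic_situation_detected(self, start_index=None):
        """Found out we're already in an anthropic situation; roll back if the start is known."""
        if start_index is not None:
            self.inner_uf = copy.deepcopy(self.history[start_index])
        self.frozen = True
        self.permanent = True                             # freeze forever

    def situation_over(self):
        """Tracked copies have resolved down to one; release the freeze if allowed."""
        if self.tracking and not self.permanent:
            self.frozen = False
            self.tracking = False

outer = OuterLoop(inner_utility_fn={"baseline": 1.0})
outer.expect_copying({"baseline": 1.0, "copy cooperation": 0.2}, can_track=True)
outer.situation_over()   # only copy left and tracking was held, so the freeze is released
```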
This analysis can't handle "nested anthropic situations" and that's one of multiple reasons the agent starts out with weak anthropics.
Note that CDT is already dynamically unstable. In the UDT sense this is similar to changing its utility function. Because of this, the more obvious changes of utility function described in this section aren't a qualitative reduction in the agent's credibility.
(Currently on hold while I examine my assumptions about fully epistemically updated L-UDT.)

Solomonoff Induction

Solomonoff induction can't be done without a particular type of hypercomputer, and even with the hypercomputer it wouldn't be a good idea. Since I'm basing my definition of probability on Solomonoff induction I'll need to describe the "imaginary" procedure that's being approximated anyway.
Since the procedure doesn't really need to be carried out, most of the weirdness of Solomonoff induction can be prevented. It is stipulated that the sequence being predicted is continuous, at all times/locations contains information about the world, and if it interleaves multiple sensor input streams, the interleaving is a simple cycle always in the same order over the same set of sensors with the same number of bits coming from each sensor each time. Each full cycle over the inputs of the sensors must cover the same quantity of local time as every other, and each sensor's output must cover the entire time period of each cycle. For relativity reasons all the sensors must stay together.
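As a concrete reading of the interleaving stipulation, here is a small sketch (my construction, not part of the "imaginary" procedure itself) that builds and checks a stream with a fixed sensor order and a fixed number of bits per sensor per cycle.

```python
# A small sketch, my construction, of the stipulated format: a fixed cycle over
# the same sensors, in the same order, with the same number of bits from each
# sensor in every cycle, each cycle covering the same span of local time.

def interleave(per_sensor_bits, bits_per_cycle):
    """per_sensor_bits: equal-length '0'/'1' strings, one per sensor, covering
    the same span of local time. Returns the interleaved stream."""
    n = len(per_sensor_bits[0])
    assert all(len(s) == n for s in per_sensor_bits), "sensors must cover the same time"
    assert n % bits_per_cycle == 0, "the stream must divide into whole cycles"
    out = []
    for start in range(0, n, bits_per_cycle):
        for stream in per_sensor_bits:                    # same sensor order every cycle
            out.append(stream[start:start + bits_per_cycle])
    return "".join(out)

def well_formed(bitstream, n_sensors, bits_per_cycle):
    """The interleaved stream must consist only of whole cycles."""
    return len(bitstream) % (n_sensors * bits_per_cycle) == 0

# Example: two sensors, 4 bits from each sensor per cycle.
cam = "1010" "1100"
mic = "0001" "0111"
stream = interleave([cam, mic], bits_per_cycle=4)
assert stream == "1010" "0001" "1100" "0111"
assert well_formed(stream, n_sensors=2, bits_per_cycle=4)
```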
It's stipulated that the problem isn't "fair" if any of the sensors are broken, otherwise incapacitated, or fooled in a way that a strong (but computable) intelligence could easily and cheaply detect. See later on using the output of Solomonoff induction, though.
It isn't a "fair" problem if all the sensors aren't fooled at the exact same time into being shown the exact same "situation," though that has to be somewhat vague.
At no point in the bit stream can there be anything that's a result of an incomputable process. This isn't a problem in reality, but in the procedure here the "imaginary" system must only predict the "counterlogically" committed world where the output predictions of Solomonoff induction can't be used for anything.
I don't know how to do this even mathematically, but the procedure doesn't actually need to be run so this isn't a problem. In theory this means that you need a whole new setup each time you want to ask a question of Solomonoff induction, but in practice your past approximations of Solomonoff induction are computable so the past is "fair," all else being equal.
Again to prevent weirdness, it is additionally required that the bitstream following the above requirements be very long and information dense, and cover a long period of time and the "experience" of many varied situations as they take place in the world. Note that per the above requirements, this can't involve cuts or jumps. The output stream of Solomonoff induction is uncomputable, so it must be sealed for all of this time, and all that time contained in the "imaginary" procedure described in this section.
Running an "output-sealed" Solomonoff inductor on such a high-quality stream should get closer to strong guarantees8586. The inductor can't be confused by things in the world influenced by its own output, because its output is sealed. It can't be confused by its inputs in general because of the other requirements in this section87.
When I say you need to approximate Solomonoff induction to say your probabilities are "real," I'm not saying you should base that approximation on speculations about what happens with a fresh AIXI agent. Note that I think a rational agent should be able to act in non-silly ways even when "real" probabilities can't be generated.

L-UDT

(You can skip this if you don't care too much about theoretical instrumental rationality.)
L-UDT (Logical Updateless Decision Theory) is a fully computable decision theory that does not assume logical omniscience. However, it is extremely slow and doesn't implement bounded rationality, so it can't be used outside of toy scenarios where it is human-assisted.
Some details about L-UDT are combined into the incomplete article linked from this comment. https://www.lesswrong.com/posts/wXbSAKu2AcohaK2Gt/udt-shows-that-decision-theory-is-more-puzzling-than-ever?commentId=xdWttBZThtkyKj9Ts I may be able to publish the linked article and associated notes in a more stable form and then re-publish this section with inline citations, but I'm unsure of the copyright status of the notes.
I don't know how you would set up an L-UDT agent with an Anchored Utility Function if it can't update at all, and I don't know if that's even possible, so I'll assume there's some system that lets an L-UDT agent "update" in a way that isn't an "instant snap to the fully updated state," but is more partial. 88 As it is, I'll describe a possible procedure in a quite vague way. Note that this article sometimes pretends that L-UDT is effective in non-repeated trap filled environments, even though I don't think it is by default.
The first step is to set up the "real" L-UDT prior. This needs to be reasonably aligned with M89. I don't know how to do this. The few mathematical descriptions of agents I've seen don't appear to provide a prior generation method and look to have the agent "assume"/"conditionalize on"90 more than I want.
Presumably this is where some sort of "ur-utility function" compatible with M would be set. Maybe this could be biased in the direction of "the set of Anchored Utility Functions," but since this procedure requires external control of L-UDT and Anchored Utility Functions are supposed to work in an incredible number of circumstances, this may not turn out to be worth doing.
There's no such thing as a priorless agent, because an embedded agent must originate at a particular "location"91 with a particular structure, and with particular information about its structure that is privileged in a way that leaves its "truth" at least partially unchecked.
Here, giving the agent information about itself (here simplified to hard-coding) may be a good idea. Again, in theory this should be based on SS and M, but I'm not sure how that would be implemented.92 After the previous steps the agent must be given its Anchored Utility Function, the one that it will continue to use in perpetuity, at least in theory. Here, a logical update for the logical statements related to Anchored Utility Functions and the "survivable core" of the Standard Story (SS) must be made. 93 The agent must then be set to (at least) partially update on the actual Anchored Utility Function, including elements and dependencies, such as the location of the utility function's upper bound, how different aspects of the "survivable core" are treated, and to some extent the procedures that keep the "world" rated at a reasonable value in cases where the "survivable core" description must remain "attached" to a situation, "anthropic" or otherwise, that deviates in varying ways from the Standard Story. As this update becomes more complete, the agent descriptively becomes closer to an agent that only uses weak anthropic reasoning94 with something somewhat close to "real" probabilities (at least temporarily after the update). 95
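Since the procedure is only described in prose, the following is a purely schematic restatement of the setup order as I understand it. Every name is a placeholder stub that just records the order of steps; none of the steps have a known implementation.

```python
# A purely schematic restatement of the setup order; every name is a placeholder
# stub that only records the order of steps, since none of them have a known
# implementation.
class LUDTSetupLog:
    def __init__(self):
        self.steps = []
    def do(self, step, detail=""):
        self.steps.append((step, detail))

def set_up_l_udt_agent(M, survivable_core, anchored_uf, update_strengths):
    log = LUDTSetupLog()
    log.do("build the 'real' L-UDT prior", f"reasonably aligned with {M}")
    log.do("set an ur-utility function", "possibly biased toward the set of Anchored Utility Functions")
    log.do("install self-knowledge", "hard-coded; in theory based on SS and M")
    log.do("install the utility function", anchored_uf)                 # kept in perpetuity
    log.do("logical update", f"statements about the AUF and the {survivable_core}")
    for dependency, strength in update_strengths.items():               # partial and tunable
        log.do("partial update", f"{dependency} at strength {strength}")
    return log

setup = set_up_l_udt_agent(
    M="M", survivable_core="SS survivable core", anchored_uf="A_udt_clamped",
    update_strengths={"upper bound location": 0.9, "survivable-core handling": 0.7})
for step, detail in setup.steps:
    print(step, "-", detail)
```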

Installing the Utility Function

Installing an Anchored Utility Function into L-UDT is quite subtle, even without giving much detail.
Taking the utility function in form A_prime96 is assumed to be fine based on the numerics requirement there. Remember that in that form the upper bound is located at zero.
If your bounded utility function in the A_prime form has an explicit lower bound, note where that is and make sure the transformations in this section keep that bound above the L-UDT agent's utility zero.
If it doesn't—regardless of other mechanisms (e.g. delay-based utility floor step-downs)—take into account the distance between the upper bound of the utility function and the rating of the "Standard Story" at present to gauge the scale of the utility function. Based on that, rate the behavior of various cut-off points and select one.
let T_offset := The positive scalar value used to move all of the utility function that's worthwhile above zero. See above for very important risk management considerations. Preferably this would be a very similar value or even a constant for all Anchored Utility Functions, but that would require their scale to be set in a particular way and there may be other reasons why keeping it a constant wouldn't work. Due to the math of L-UDT, a lower bound must be given, but in theory it could be such a large finite value that it is "out of sight."
let A_above_zero := A_prime + T_offset.
L-UDT requires rational utilities in the current math, but I don't think there's any problem with that97.
let A_udt_rational := A "reasonable" rational number approximation of A_above_zero.
And just for good measure: let A_udt_clamped := A_udt_rational with any remaining negative utilities clamped to zero.
Note that A_udt_clamped has opinions on an infinite number of scenarios, by setting anything not covered by a tractable approximation of Anchored Utility Functions to 0. Enforcing coherence in the approximation is harder.
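The chain of definitions above can be written out as a small sketch. The dict-of-scenarios representation, the example numbers, and the denominator limit are my placeholders; the only substantive content is the order of operations: shift by T_offset, approximate with rationals, clamp at zero, and default everything uncovered to zero.

```python
# A small sketch of the chain of definitions above. The dict representation,
# the example numbers, and the denominator limit are my placeholders; only the
# order of operations is taken from the text.
from fractions import Fraction

def install(a_prime, t_offset, denominator_limit=10**6):
    """a_prime: dict scenario -> utility in the A_prime form (upper bound at zero).
    Returns A_udt_clamped: rational, non-negative, with anything not covered by
    the tractable approximation implicitly at 0."""
    assert t_offset > 0
    a_above_zero = {s: u + t_offset for s, u in a_prime.items()}
    a_udt_rational = {s: Fraction(u).limit_denominator(denominator_limit)
                      for s, u in a_above_zero.items()}
    a_udt_clamped = {s: max(Fraction(0), u) for s, u in a_udt_rational.items()}
    return a_udt_clamped

def evaluate(a_udt_clamped, scenario):
    return a_udt_clamped.get(scenario, Fraction(0))   # uncovered scenarios default to 0

example = install({"standard story now": -0.3, "much worse world": -7.5}, t_offset=5.0)
assert evaluate(example, "standard story now") == Fraction(47, 10)
assert evaluate(example, "much worse world") == Fraction(0)        # clamped: below the cut-off
assert evaluate(example, "unconsidered scenario") == Fraction(0)
```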

L-UDT Utility Function Evaluation

UDT operates on "trees," though it can sometimes be described in a misleading way. A fully non-updated UDT doesn't "condition" on quantum physics being true, so these aren't the "branches" from the Many Worlds Interpretation (MWI), though a future bounded rationality agent that tries to approximate L-UDT may take into account the perceived randomness in certain quantum processes when decides what paths to prune in its consideration, for example a "coin" based on a quantum random number generator in the environment may force an agent to calculate at least one more "branch" in its decision tree on pain of terrible risk management.
I've considered various alternatives, but even in the simple "fully updated" case I can't see an alternative to taking the total (sum) of the outputs of the function "evaluate the current state of the world with the utility function" run on every future node in the tree98.
That already makes using some shaped system dubious, and, assuming as I do here that the agent logically updates, unstable. 99 My analysis is limited, but from my position I don't see how anything other than a total enforces proper "contact with reality"100. Consider an average to make the case as clear as possible. At time 1, due to the upper bound on the utility function or for other reasons the agent doesn't "think" it can improve the world's state. At time 2, an event happens that greatly reduces the quality of the world's state, but the agent doesn't immediately get data on that. The agent has branches on that sort of scenario though, and it "knows" that in any case it won't be able to get the state of the world any higher than the current average. If the utility is calculated as an average over the nodes, the agent has an incentive to avoid getting data about what happens, even if doing so is bad risk management and increases the number of branches where the agent is destroyed early. The systems in this article are poorly suited to selfish agents, though some of them could be made to work that way to an extent. Either way, the agent isn't helping with the state of the rest of the world.
In contrast, taking the total of the utilities, each evaluated from the perspective of its node, creates a strong drive toward "contact with reality" in this sense, though it retains excellent Rational Risk Management. The agent is incentivized to get the data about what happened in this version of the scenario by two mechanisms. The first is that improvements to the state of the world after the "event" can increase the total across the nodes, even if it can't increase the average. The second is that there are no paths to evaluated non-zero utility at times when the agent is destroyed, so in this scenario the agent will want situational awareness to preserve the branches that go for the longest.
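Here is a toy numeric version of that scenario, with numbers I made up: a world value scale of 0 to 10, an event at the second step, and two policies. It only illustrates the incentive difference; it is not a model of the actual tree evaluation.

```python
# A toy numeric version (my numbers, not the article's) of the average-vs-total
# argument. World value is on a 0..10 scale; the "event" at t=2 drops it to 2.
# "seek data" gets situational awareness, survives, and improves the world to 4.
# "avoid data" stays ignorant, leaves the world at 2, and is destroyed at t=3
# (destroyed/nonexistent nodes evaluate to 0 under the total, and simply stop
# being counted under this toy's version of the average).

branches = {
    "seek data":  [9, 2, 4, 4, 4],           # survives all five steps
    "avoid data": [9, 2, None, None, None],  # destroyed from t=3 onward
}

def total(values):
    return sum(0 if v is None else v for v in values)

def average_over_existing(values):
    existing = [v for v in values if v is not None]
    return sum(existing) / len(existing)

for name, values in branches.items():
    print(f"{name}: total={total(values)}, average={average_over_existing(values):.2f}")
# The total prefers "seek data" (23 vs 11); the average prefers "avoid data" (4.60 vs 5.50).
```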
I may have some hope that this setup and Anchored Utility Functions would place some restrictions on the worst aspects of strategic updatelessness, including certain types of commitment races. This would be because of the ability to execute partial updating and therefore the limiting of care the agent can implement. On the other hand, viable decision tree structures may not work out for this, and the optimization pressure may not be directed towards the right places.
For the claims of the previous two paragraphs, this requires a zero (maximally bad) utility assigned to "places" where the agent is destroyed or doesn't exist. The agent's trees won't contain these places because the trees are based on its observer rules and its "self knowledge," so this "feels very natural" to me, but I'm always skeptical of that kind of intuition. This drives the agent against its non-existence, though not maximally. This is commonly suggested behavior for a Singleton agent, but my current perspective is that it is globally normative for any person, organization, or AI that is trying to be rational, because I can't see a way around it. Most readers will have trouble doing this, but they can of course try to approximate the standard. Virtue ethicists can get away with going against these principles, but they may have trouble generating virtues through a rationality-grounded procedure.101
This sounds very simplistic, and maybe there are more advanced systems, but I haven't thought of any at this time.

Updating on the Anchored Utility Function

If partial updating can be done at all, even if it's very, very slow, this is where it should be done. We don't want an agent acting towards the wrong "utility function core," even if that other "core" is based on some other Anchored Utility Function.
To ensure the right utility function is used, if a partial update is not possible the agent must be fully updated before becoming operational. In this case this must happen after the agent is given the full Anchored Utility Function.
The actual system to translate observed data into candidate locations in the decision tree and an evaluated utility is unspecified. Any method used would have to be compatible with the update, and possibly included in it if it wasn't hard-coded before.
At each node, the agent would take the "real" utility function, here A_udt_clamped, and evaluate the state of the world. Note that this evaluation occurs on nodes in the decision tree that aren't in the present, except in certain updating cases. Normally, the L-UDT agent must simulate and approximate far into the future to generate a semi-policy that delegates to updating at certain (at least theoretically) pre-established points. The value returned by A_udt_clamped is bounded between zero and the range of the utility function, and is not irrational or infinite. Combined with the previous proposal to limit the length of "branches" to a strict number, the overall utility evaluation is bounded, finite, and timid.
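Spelled out, with L_max as my symbol for the strict cap on branch length and U_max for the range of the shifted utility function, the per-branch evaluation is bounded as follows.

```latex
% L_max = the strict cap on branch length, U_max = the range of the shifted
% utility function (both symbols mine).
0 \;\le\; \sum_{t=1}^{L} A_{\mathrm{udt\_clamped}}(w_t) \;\le\; L_{\max}\, U_{\max},
\qquad L \le L_{\max},\quad 0 \le A_{\mathrm{udt\_clamped}}(w_t) \le U_{\max}.
```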
In general, accurate utility evaluation must be incentivized, preferably without hard-coding anything that would damage risk management. The previous "branch" length system may not be enough to handle cases where the world's state is worse, but the agent is still well protected. The agent must accurately evaluate the state of the world anyway. Where the world is, and how to find it if it gets lost, are open questions for my current system.
To avoid improper bargaining-related misalignment, the updates on all dependencies of the utility function and the evaluation machinery must be substantially strong. Note that I believe this is very difficult to approximate in a stable manner. Almost any acausal bargaining generates inner misalignment. You can avoid that misalignment by only executing simulations thought to be safe, but then it appears unlikely on its face that you wouldn't already know the outcome. If you know the outcome, there is no point in running the simulation. Solutions would require ASI-level containment and possibly some way of modifying simulated events that still maintains most of the required accuracy.

Paranoia and Updating

102 The more complete and less partial these updates are, the less "paranoid" the agent is about the "truth" of the respective things "updated" on. The "partial-ness" of the above updates, including the above logical updates103, can be tuned as individual parameters. 104 At least for "anthropic situations" that are compatible with M (and possibly the operation of Anchored Utility Functions), this controls, in a multiply parameterized way, how much they are "ignored." Descriptively, in a fuzzy way, "paranoia" in this sense is positively correlated with the agent "considering" various "anthropic situations" in its risk management.
At this point the L-UDT agent would be set into its operating mode, where all updates, logical and otherwise, are under internal control. Note that in L-UDT, the internal modified Logical Inductors need to be run up to a certain point before the agent has much hope of not acting silly, and in theory that silliness may be very destructive. That means that the Inductors would need to be quite far along before operating mode can be activated, but also that the agent could effectively be horribly misaligned during prior phases. This is a major problem, and goes for other proposals as well, but since this isn't currently a proposal that I think can or should be implemented, I won't discuss it much.

L-UDT Prior

I consider all of this, including the setting of the agent's "self-knowledge," to be part of the "generalized prior." If I refer to the "L-UDT prior" or say something like "the L-UDT prior controls how the agent thinks about anthropics" I may mean something like this, even though what I'm saying is not going into the technical details.

Operating Mode

In theory, once the agent is active and operating it will perform Rational Risk Management in a better and more reasonable way when compared to a CDT (Causal Decision Theory) agent. 105 Re. the section of the E-Mail this article is based on, updating in the L-UDT sense and changing your utility function are quite similar things. Doing either is generally what causes poor performance in "known" anthropic situations among logically omniscient agents. Therefore updating or changing the agent's utility function shouldn't be relied on to disable a hard coded "contact with reality" drive.

Further Questions

Info-hazards and Robustness

Cobbled together decision theories are unlikely to be robust in the ways we want. Even if a version of UDT is technically dynamically stable on its own, that doesn't mean partial updates will go well, or it will be performing the right actions. If epsilon-exploration is included, it's almost certain that it won't always perform the right actions, given enough time.
Attacks against the processes that generate UDT trees would presumably be a major problem.
I think my sum over world-state evaluation utilities at each node, with "everything else" having minimum value, is somewhat better in this sense than something more fancy, but I may be wrong here.
Relating to info-hazards, figuring out how to slightly update parts of the agent's own updating-control system based on the state of the world it finds is presumably difficult to do in a robust way, due to the infinities that exist in a standard "perfect" approach.
The problem of info-hazards presumably would become much worse once L-UDT is modified to use bounded rationality. Note that even having something as an input or dependency in some sense to a computation may be enough to trigger an info-hazard.
I don't have much reason to think the utility evaluation system would survive "self improvement" in a correctly functioning form.

Alternatives to a Sum Over Nodes

I currently think taking a sum over the utility of all considered nodes is the correct choice. This leads to requirements that 0 is the worst possible utility evaluation, the utility evaluation of nonexistence is 0, and all branches not considered or ruled out logically effectively have a utility of 0, even if this isn't explicit or what the math looks like.
This is a basic utility robustness/non-fragility requirement, of the type I advance in this article. This type of requirement is important for correctly calibrated risk management and low-to-no future discounting. The system must be robust in actual approximated implementation to incremental changes in logical evaluation and branch consideration.
However, this system may not match the preferences of certain implementers. As a further question, are there other solutions that trade off robustness in various places to achieve an overall better result regarding robustness and risk management, vs. taking the sum over evaluated utility per node?
Note that the solution described in this article is "trying" to be an average over future history, modulo some things to manage incentives for the agent, and would get very close if it were impossible for the agent to be destroyed except in cases of replacement and it were impossible for the agent to "lose tracking on the world." However, both of those things are possible, so risk management gets into trouble. In L-UDT, the utility function can't stably accept the agent "caring about itself" in a particular sense, so we have to drive Contact With Reality via the overall structure of the utility function. If all "branches" end in the agent replacing itself at the same number of nodes forward in each case, there would be no reason to do the final division, because utility functions suggest identical behavior when scaled by any (strictly) positive value. The behavior would remain reasonable up to a substantial portion of short "branches," but we may be so far beyond the regime of proper operation that the risk seeking of my initial proposal is substantial.
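For the equal-length case, the equivalence is just positive scaling. Writing p_b for the weight on branch b and u(w_{b,t}) for the node evaluation (my notation):

```latex
% p_b = weight on branch b, u(w_{b,t}) = node evaluation (notation mine).
\sum_b p_b \sum_{t=1}^{N} u(w_{b,t})
  \;=\; N \cdot \sum_b p_b \left( \frac{1}{N} \sum_{t=1}^{N} u(w_{b,t}) \right)
```

and multiplying a utility function by the positive constant N changes no decision, so in that regime the sum and the per-branch average would recommend identical behavior.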
Note that in general, though "replacement" makes the agent work mathematically, it is far too optimistic about the future availability of resources. This problem can be faced later on however, with zero discounting of any sort for several million years seeming viable.
(The solution needs to be stable enough in the face of the fact that Logical Inductors vary their opinions on logical statements over time.)

Is Full Updating in UDT-like Theories Much Like Weak Anthropics?

In general, it appears to be assumed that a UDT that is designed to update in some cases, like L-UDT is when combined with a trailing updater, will approach "weak anthropic reasoning" (in the behavioral sense).
In 2023, when I learned of L-UDT, I checked the reasoning and it made sense to me.
If you understand "fully updating" correctly as the act of altering your utility function so that you stop caring about everything that's "in the past" or "didn't happen," and you understand UDT as working over sense input bit strings, and you understand there's no difference between thinking you can't change something, not caring about it, and, if it's causally disconnected, thinking it doesn't/didn't/never will exist, all the parts of the "possible" input bit strings that still matter start at/from the current instant, and therefore for all behavioral purposes the agent "thinks" there's no way it could have not existed, causally disconnected worlds where it doesn't exist in that instant with the exact same past sense inputs don't exist, it doesn't care about anything that "could have happened" counterfactually, and, as it should, still thinks it can't change the generalized past106.
Note that various systems would reasonably be expected to "think" they're in a simulation in some sense if you could "look inside." This is a prediction I'm making in advance. The behavioralism here isn't a mistake and is well considered.
This seems plausible in the case of a full update after Anchoring, depending on properties of the Standard Story and some tree rewriting.
It might be that a "weak anthropics" characterization of this type of theory is wrong. Consider a modified L-UDT policy selection style system with a trailing full updater. The trailing updater fully updates the agent into "situations" following from the trailing position, but this may not limit the situations as much as standard weak anthropic assumptions would. In the L-UDT paper it is proposed to update on sentences, but the generalization of that is hard for me to reason about.
Something like Full Non-indexical Conditioning may be a better description. If the agent considers scenarios that "it doesn't know that it is not in"107, presumably by operating over "variables," each variable can be individually updated as it is determined. Think of the "principle of multiple explanations" without reifying Solomonoff Worlds, as in Rathmanner108. We don't want universal distribution anthropics. Note that the math in the Full Non-indexical Conditioning paper is incorrect; however, the concept of avoiding selecting a single anthropic location when updating is still viable.
In general, simplified versions of decision theories that never update (not even once at the agent's initial creation) reify a multiverse. See https://www.lesswrong.com/posts/NaZPjaLPCGZWdTyrL/sudt-a-toy-decision-theory-for-updateless-anthropics . This is not actually needed, because the actual reason (and not the justification) for never updating is to achieve dynamic stability in scenarios that had "already started" when the agent was created. Never updating at all doesn't seem plausible, but there is a reason for why it was considered.
A major problem here is that even if the Standard Story is of a very tractable nature, at least anthropics-wise, the agent can't correctly use it as a single intact sock puppet for very long. Our entire point here is to work in the case of predictors in the environment and high levels of uncertainty developing, unlike standard CDT. Another issue is that a tree rewrite may just entirely preserve behavior if it is correct enough, therefore making the epistemic updating problem no easier to think about.
In SUDT and Anthropic Decision Theory, there appears to be no way to update. In those two theories, anthropics is always non-weak, assuming scenario-standard priors. In my understanding, Wei Dai's original UDT and Vladimir Nesov's ambient decision theory never updated, except in a possible interpretation of Nesov where explicit control over an incomplete "utility function" is held in place more strongly, and is dynamically changed over time, more or less altering the domain. Wei Dai seemed to go with reifying a multiverse at the time. Eliezer Yudkowsky's TDT always updated (it was not dynamically stable). Gary Drescher's system was unclear and to my knowledge never completed, but to the extent it was trying to be plausible for non-omniscient systems, it would be under similar pressure to update in certain circumstances as L-UDT.
My main problem isn't with fully updateful or fully updateless systems, though of course both have meanings that a human is unable to understand, but with partially updateless systems like the version of L-UDT I talk about in this article. I think further development on a theoretical model may be needed.
In general, I don't understand how a L-UDT agent is supposed to "know" what exact part of the tree it is in when it does an update. Does it "just pick a node?" Is there any case where it would be a good idea to use such a position-picking full update instead of a variable based one?
A possible positive answer to the last question may come in a move towards a self-improving decision theory, where the agent starts out with a fully updateful, dynamically inconsistent decision theory, and builds something like a decision tree structure from there. For example, think of a Son of(infra-Bayesian CDT)109 like process, and maybe simplify that to a Son of(CDT) process, starting with weak anthropic (pseudo) probabilities. This agent might be approximately representable as a decision tree rooted in weak anthropics near where it started, looking somewhat like the result of a "pick a node" full update.
https://web.archive.org/web/20190313003413/https://agentfoundations.org/item?id=785 may, however, indicate that many of the proposals here are bad ideas, though I'm uncertain what is meant by "fixing indexical uncertainty" there.

Vetoed Traders Anthropics

Is vetoing traders still viable in the "final" version of L-UDT110? If so, maybe a "plausible" approach to anthropics would be to run simulations based on prototypes of anthropic scenarios "we may be in," and to veto traders that do badly there111.
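A minimal sketch of the veto step as I imagine it, with everything (the trader representation, the prototype scenario names, the scoring rule, the threshold) invented for illustration:

```python
# An invented sketch of the veto step: run each candidate trader on a small set
# of prototype anthropic scenarios and veto the ones that do badly. The trader
# representation, scenario names, and threshold are all placeholders.

def veto_traders(traders, prototype_scenarios, score, threshold):
    """Keep only traders whose worst score across the prototypes clears the threshold."""
    kept, vetoed = [], []
    for trader in traders:
        worst = min(score(trader, scenario) for scenario in prototype_scenarios)
        (kept if worst >= threshold else vetoed).append(trader)
    return kept, vetoed

# Toy usage: traders are just labeled scoring tables here.
prototypes = ["sleeping-beauty-like", "simulation-shutdown", "late filter"]
traders = {
    "weak-anthropics trader":   {"sleeping-beauty-like": 0.6, "simulation-shutdown": 0.5, "late filter": 0.7},
    "doomsday-obsessed trader": {"sleeping-beauty-like": 0.4, "simulation-shutdown": 0.1, "late filter": 0.2},
}
kept, vetoed = veto_traders(
    traders, prototypes,
    score=lambda name, scenario: traders[name][scenario],
    threshold=0.3)
print("kept:", kept, "vetoed:", vetoed)   # vetoes the doomsday-obsessed trader
```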
This strategy would blunt much of the case that L-UDT is a "solution" to anthropics, because it could only consider a limited number of scenarios and variations on those scenarios, but maybe the generalization from those scenarios would be enough to do well.
The solution to anthropics would "still be out there" somewhere, as parameters to the L-UDT prior and the decision tree, but I don't know how to find it. A major problem here would be the discarding of certain pessimistic outliers, somewhat disregarding their actual truth112.
To state this again a bit more simply, for various reasons the rest of the setup with any plausible level of utility function detail may not work out right when the agent "believes" in certain weird anthropic scenarios, so we just delete that belief. This is not very principled, but remember from earlier that there is a trade-off between the specificity of a particular goal content entity's evaluation and general satisfiability of the risk management in practice.
Vetoed Traders Anthropics (VTA) could be a major component of a principled anthropic theory as an alternative to SIA or SSA if enough work was done in the future, however.

The "Agent Getting Lost" Problem

In theory, L-UDT should solve most solvable problems re. not being able to find its world and most other forms or definitions of "getting lost," but I don't have specified observer rules or utility function evaluation. Observer rules relate to what the agent thinks it is, as well as how it thinks the "laws of physics" (such as they are) feed it information about the world, among other things. Utility function evaluation takes various observation data and information about the decision tree and what the agent "reflects" on itself as doing in the tree, and tries to build a model of the world that's suitable to feed to a utility function. This isn't easy, and I currently have no proposals. These two areas need solutions that don't fall prey to either "means-end capture" or incorrect world abandonment113.

A Preliminary Diagram

(The graph is intended to be taken seriously and literally, but it represents the theoretical dependencies required to set parameters, not actual system components.)
Note that the observer rules call the anchored utility function on each step in predicted futures and considered plans, not just on the present observed state as transformed by the observer rules. Avoiding multi-crediting of the same time slice in the same environment is related to the "getting lost" problem and I'm not aware of a solution. I think this is the "(influence) channel duplication" problem in a bit more generality, as it applies to this setup.
Decision trees don't need to be fully possible, e.g. modified Logical Inductors effectively assign various pseudo-probabilities to locations in the tree, handling the problem to some extent.
VTA is Vetoed Trader Anthropics. Empirical priors could be priors that are tested against the world to an extent before initialization time. Anthropics priors can't be so tested, because, first, the Standard Story either makes no claims about anthropics or its anthropics are very boring, and, second, because by the time any anthropics related observations would be made, the AUF would already need to be initialized. L-UDT also uses syntactic variables for utility, so the actual dependency structure would be different, but the diagram is still illustrative. Note that in L-UDT the OR does not need a static theory of counterlogicals, but this makes learning and bounded rationality not really work.
A realistic decision theory would tell the Observer Rules what to think about, and the Observer Rules would provide information to the semi-updateless decision theory about things other than the decision tree, at least as strictly construed. Even for L-UDT, the observer rules could give statements extracted from the environment to the LI, even if it is not guaranteed that they will be in the right form to be useful. See discussion around the prior's preferred form. The diagram gives a much simplified, if computable, picture.
See time stamp 00:22:17 in the L-UDT video for what inputs a similar decision theory needs.
Reminder that for this procedure to generate a machine applicable utility function, a far superhuman benign induction system is required. Nothing here can handle recursive self improvement (known as "RSI"), so substantial alignment components would be needed to prevent that. Humans don't do recursive self improvement, so there is applicability to humans and human organizations even in the current form.
AUF in its current state requires that survivable cores are near certainly known to exist at initialization time, hence not running through the standard Observer Rules. Only in this (specific) context, near certainty is defined as being on the Standard Story (SS).
This article sketches a procedure to turn a list of certain currently existing things into a vNM utility function, or something similar to a vNM utility function. Scott Garrabrant sometimes suggests geometric rationality, but geometric rationality takes additional parameters, because goal functions used with it are no longer equivalent under linear shifts and multiplicative transforms. Is there a plausible way to derive these parameters using a similar method to what I describe for vNM utility function creation? I don't know, but it's not something I'm working on currently.
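As my own illustration of why the extra parameters appear: arithmetic (vNM-style) expectation is indifferent to where the zero point sits, while geometric aggregation is not, so a zero point has to be supplied.

```latex
% Shift invariance holds for arithmetic expectation but not, in general, for
% geometric aggregation (illustration mine).
\arg\max_a \sum_i p_i \bigl(U(o_{a,i}) + c\bigr)
  \;=\; \arg\max_a \sum_i p_i\, U(o_{a,i}),
\qquad\text{but in general}\qquad
\arg\max_a \prod_i \bigl(U(o_{a,i}) + c\bigr)^{p_i}
  \;\neq\; \arg\max_a \prod_i U(o_{a,i})^{p_i}.
```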

A Basic Explanation of the Two Opposing Sub-problems

I define "means-end capture"114 (a term based on "anthropic capture"115) to mean the agent is stuck in a situation and is unwilling to leave, by its continuing decisions, because it thinks that situation is the situation it anchored to. Because Anchored Utility Functions are intended not to destroy simulations they initially anchor into, the "means-end links" the agent is allowed to act on generally function within the simulation, not outside it, except for (possibly anthropic) risk management. This generalizes to non-simulation situations. This behavior is generally correct, but the agent can be "captured" in this sense if it "can't tell" that the simulation, or other situation, it has been put in or otherwise finds itself in isn't the situation it originally anchored to.
World abandonment, e.g. abandoning the area or simulation to try to re-construct the objects pointed to by its goal content somewhere else, is a highly plausible result of attempted "fixes" to the "means-end capture" problem. For this and other reasons I won't try to provide a solution.
It should be relatively clear that both observer rules and utility function evaluation define much of how this works, leaving a gap in the specification of Anchored Utility Functions.

Regarding TESCREAL

T (transhumanism): Not really. E (extropianism): No. S (Singularitarianism): No. C (Cosmism): Not unless a perfect mathematical definition is achieved, and things turn out strange. R (rationalism): Yes. EA (Effective Altruism): No. L (Longtermism): Wisdom longtermism.
(If I was going by what the words literally mean I would answer "yes" on EA and S, and "probably" "no" on T, but phrases apparently mean things even when I can't give a definition.)
(All probabilities in this section conditionalized on SS|M, hence the quotes.)

Alternatives

Roughly vNM alternatives to Anchored Utility Functions that I considered and rejected.

Utilitarianism

Various forms of utilitarianism "prefer" not to be bounded, but I need bounds to get the risk management right and get a utility function installed into the decision theory.
Utilitarianism is rejected in its simple forms because of those reasons, but more complex systems could be created. Most of the arguments below apply to the intent and math behind utilitarianism, so there's only so much a complex scheme could do without turning into something closer to Anchored Utility Functions (AUF).

Definitions

(These can be combined, so you can have averagist preference, totalist preference, averagist axiological, and totalist axiological. By my understanding these are the main categories. Each can also be based on persons or person-instants116.)
Axiological utilitarianism: utilitarianism based on some sort of "objective" value. Applied to each person or person-instant without regard to that person's/person-instant's preferences. This is the "value" that a certain type of EA refers to when they say "EV" (Expected Value) in the sentence "this action is plus EV."
Preference utilitarianism: utilitarianism based on some transformation of a person or person-instant's preferences. Depending on the setting there can be conflicts between the preferences of different persons and/or person-instants. These conflicts are to be resolved by weighing each person or person-instant the same amount. In the person-instant case, maybe it's sort of like a time-slice democracy, but the agent running the function "votes for you" every instant. See the totalism section below for discussion. Positive average preference utilitarianism is not discussed further in this article, except in the paragraph below. Since it's one agent doing all the voting it has no incentive to misrepresent the magnitudes of each person or person-instant's preferences. Of course it's not a democracy though, because everything not conflicting is subject to more direct satisfaction of value-like constraints, e.g. values trimmed to (approximately) satisfice and avoid the problem of positional goods. This may be a simplified version of what Eliezer Yudkowsky thought might result from CEV when he supported that solution in the mid-2000s117. I think the thought was that CEV would put in the correct "transformation" (see above) and other limits to clean up various nasty things, but I'm doubtful. Fails the bootstrap test, because if there are no "people" in your zone of influence you can create them to have exactly your preferences (or more plausibly to have weak preferences that are close to not caring about the environment at all), letting you do whatever you want. If you grounded your utility function (preferences) in preference utilitarianism, you're stuck. Anchored Utility Functions can handle that situation without utility-level special casing.
Average utilitarianism (averagist utilitarianism): Generally is not good. Is literally the average or arithmetic mean of the utilities. For averagist utilitarianism based on persons, you add each person's utility to all the others (you take a sum) and then divide it by the number of persons. For averagist utilitarianism based on person-instants, you add each person-instant's utility to all the others, then divide it by the number of person-instants. In general, average utilitarianism leads to a lowering of the number of persons/person-instants, because the closer they are to a single one, the better the agent can satisfy/en-value (depending on your other choice) the remaining one(s). What it does after that depends on how you avoid dividing by zero. Optimality may be everyone dead in cosmically short order, given realistic instrumental conditions. Unlike Anchored Utility Functions and Totalist Axiological Utilitarianism, I don't see how it can bootstrap from the state of zero persons/person-instants. (How many people actually support person-instant averagist axiological utilitarianism while knowing what it means? If they do, why?)
Total utilitarianism (totalist utilitarianism): Generally is better. Suffers from the "mere addition paradox"/repugnant conclusion. This is mathematically proven to be unfixable.118 Adds up utilities (takes the sum) and returns the total. Sometimes wants to kill someone (or "disable their person-instant") if their "value" or "satisfaction" number goes below zero. Is willing to make trade-offs, may want to kill an entire group of generally "unvaluable"/ "unsatisfied" people if it can't pick and choose. In that case it will want to kill positive utility people in order to "remove" a larger (generally) negative utility group. In theory this "larger group" may be everyone alive.119 See the brief mention of TDT earlier for a reason an agent trying to use the procedures described in this article wouldn't want its mind read for its values/utility function. Non-vNM preferences can also subvert voting-like systems and make analysis confusing, but an AI system could make a preference model vNM after it is extracted. Because I think there are reasons to avoid being mind-read and there also exist benefits to aggregating people's preferences, this makes the present article's analysis of this paragraph's subject incomplete.
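A toy comparison of the two aggregations on made-up person-level utilities, just to make the population incentives in the two definitions above concrete:

```python
# A toy comparison (my numbers) of the averagist and totalist aggregations
# described above, on person-level utilities.
def average_utilitarian(utilities):
    # The zero-person case is the doctrine's problem, not the code's.
    return sum(utilities) / len(utilities) if utilities else 0.0

def total_utilitarian(utilities):
    return sum(utilities)

population = [5.0, 4.0, 3.0, -1.0]

# Averagism rates removing everyone below the current mean as an improvement:
print(average_utilitarian(population))          # 2.75
print(average_utilitarian([5.0]))               # 5.0  -- "better" with three fewer people
# Totalism rates removing the negative-utility person as an improvement:
print(total_utilitarian(population))            # 11.0
print(total_utilitarian([5.0, 4.0, 3.0]))       # 12.0
```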
Further analyses here are undercut by preference utilitarianism's many problems with creating new people and changing preferences. I'll continue here by forbidding the creation of any new people, preventing the changing of others' values, and preventing people from changing their own values. It's possible that fancy mechanism work could allow some "not cheating" value changes to happen, but it remains true that changed values can't be prevented the way strategic voting can, even though the result is identical. Any given situation can be analyzed for admissible preference changes that won't change any results. Even if that analysis turns up results, the allowed changes might not be very large, and I predict that many people would think such a system is "unfair," and from a human perspective, quite capricious.
Assuming this deals with humans, a simple workaround for the value change problem is to only extract the human's values once, and then let the human go out of sync with the model. This is good enough for the discussion here, but I can't endorse it, because I think it would be incredibly frustrating for many humans if the (illusion of) satisfaction applied to their values applies to their old values, ones that they have since realized they don't endorse and don't really like. Unfortunately, many advanced voting systems have at least a small probability of giving different results if played multiple times with the same candidates and voters with the same values. This causes basic temporal instability. Therefore I will allow for a "Florida tremble" and prohibit multiple elections. CDT agents will be accurately provided with the correct lotteries for this situation.120 In general, this is a problem with positive preference schemes, such that I think adding a viable off switch, such that its use wouldn't be perversely optimized against, is much easier for an Anchored scheme like the one I suggest in this article. The problem for positive preference utilitarianism gets much worse when it is allowed that human preferences change (or get changed) over time. As mentioned elsewhere in the article, I will rule this out from the current discussion.
It is notoriously difficult to entice CDT agents to vote in a regular fashion, and this is made worse by the game theory assumption that all agents have a list of all the utility functions in use and how many agents use each utility function. Additionally, since there is no secret information or secret commitment devices allowed, all agents must play their game theoretically allowed options simultaneously, with no chance of restricting their selection or making any choice that goes outside the voting system.
Unlike a more realistic model of humans, the agents must be presented with carefully constructed (though fair) lotteries that partially serve to replace the epistemic uncertainty and quickly changing situations of real elections. If this is not done, the agents become obsessed with pivotal events and will play a mixed strategy game instead if it is available. However, unlike real humans, these agents are mandated to take into consideration events with probability 1/N where N is any positive (finite) integer. This allows the lotteries to be constructed in such a way that a failure of the voting system itself, such that it causes a different outcome of the voting, can be brought to a small expected loss. This avoids multiverse based fairness schemes, in accordance with the standards set in other parts of this article.
I will now describe a weird scheme for positive preference aggregation that is not entirely content free, though it is highly simplified and very much a sketch. Consider an environment that is nominally a suitably large 2D plane floor with immutable walls and ceiling. Universally synchronized day/night cycle emulation and air supply is provided. To avoid any actual non-Euclidean rendering, sound propagation is simplified to a graph based solution and doorways are extended in depth and provided with dual intangible, completely light/high energy interacting particle blocking curtains. The curtains are also "sticky" to air particles outside humans to prevent air current, smell, and pressure problems. Curtain spacing and imposed human morphology limits prevent contact with both sides of the curtain system at the same time. No room is sized less than 10 meters, as measured with any line segment passing through the exact center of the room. This includes ceiling heights but does not apply to the doorways mentioned.
Each person's main room is a regular, convex n-gon, with n being 8-64, both inclusive. This generally approximates a circle, except that at n=64 with the points on a circle of diameter 3,000 meters, each flat side is >140 meters, allowing space for the single door that may be there. Following the constraints given here, set n such that there are no blank walls unless there need to be. Set the apothem to match that of the n=64 case given above. Each face of the n-gon (wall) in each main room can be connected to a meeting room using a doorway as previously described, if the vote decides such. All other faces are solid, blank walls. Meeting rooms can exist, optimized with the additive mixture of both parties' trimmed utility functions. They connect two main rooms. The meeting rooms are rectangular. No person or other matter can pass through a doorway not adjacent to their main room, unless manually "taken across" by the person whose main room that doorway is adjacent to. People are defined to be created in their respective main rooms. Defining the shortest path distance to the current room to be 0 and counting each main room and meeting room as 1 increment, no person or other matter can have a shortest path distance of more than 3. This is enforced by the doorways. In an example chain layout, a person would get the sequence "main room, meeting room, main room, meeting room, blocked."
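The stated geometry checks out; a quick calculation of the side length and shared apothem for the n=64 case (circumscribed circle of radius 1,500 meters):

```python
# Checking the stated room geometry: side length and apothem of a regular n-gon
# inscribed in a circle of radius R (diameter 3,000 m, n = 64).
import math

def side_length(n, R):
    return 2 * R * math.sin(math.pi / n)

def apothem(n, R):
    return R * math.cos(math.pi / n)

R = 1500.0                             # points on a circle of diameter 3,000 m
print(round(side_length(64, R), 1))    # ~147.2 m per flat side, comfortably > 140 m
print(round(apothem(64, R), 1))        # ~1498.2 m, the apothem every main room shares
```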
let extract_uf := the baseline vNM utility function extracted from each individual, assumed to be developed far enough to avoid falsely representing indifference in relevant ways. This function is not trimmed. The individual is not changed to match the function and is not (directly) prevented from diverging further.
Let cell_uf be the result of the procedure described in this paragraph. Note that this procedure does not modify the preferences of any actual human, and the result is just used for further automated decision procedures. As a corollary, the utility function is quite separated from the actual human, and will assume itself correct, not taking into account any problems or (possible) inaccuracy with the extraction procedure. There are two exit conditions, and if either or both obtain, the utility function is immediately returned, extracted from the current state of the procedure. The input of the procedure is extract_uf. If neither exit condition obtains, the input of the procedure is returned. The first exit condition is a test of the willingness and ability to engage in proper debate of the Carlsmith limited long-term outcomes121 with generic stand-in debate partners. For some people this will never trigger, but that's good enough for this sketch. The test must be Goodhart-resistant across the general population, but does not need to be perfect, and some individuals may be failed by the procedure. These problems are mitigated by the ability of the voting system to avoid linking any such person with anyone important. The second condition is reaching the 80% high utility component location of a basket of "standard" and fairly boring utility function components. These are adapted to be always evaluable, but they may in theory evaluate to some very low value. This second condition is required because Ramsey extraction can only produce a function that is some arbitrary positive affine transformation from any "preferred" zero point and scale. This basket is used to set a zero point and maintain separation between outcomes rated "bad" and those rated "okay." Apart from these exit conditions, the procedure consists of the continuous lowering of a clamp on the utility components of the function, until the utility function is "flat" and has no preferences. This can be described as flattening the top of the utility function if you think about that in the right way, leaving the portion below the current position of the clamp alone. If needed, this procedure can be modified such that it corrects all utility function domains to be the same while still maintaining the requirements, but for this article I will make the simplifying assumption that any differences between the utility function domains of people are not enough to produce society-wide failure given the error-correcting ability of the voting mechanism.
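A heavily simplified sketch of the trimming procedure, under my own reading: utility functions are dicts of component values, the clamp only flattens the top, and both exit-condition tests are stand-in callables with made-up thresholds.

```python
# A heavily simplified sketch of the trimming procedure under my own reading.
# Utility functions are dicts of component values, the clamp only flattens the
# top, and both exit-condition tests are stand-ins with made-up thresholds.

def trim_to_cell_uf(extract_uf, debate_test_passes, basket_80pct_level, step=0.01):
    """Lower a clamp on the top of extract_uf; stop at either exit condition,
    or return the untrimmed input if the clamp reaches 'flat' without one."""
    def clamped(uf, level):
        return {k: min(v, level) for k, v in uf.items()}   # flatten the top only

    level = max(extract_uf.values())
    floor = min(extract_uf.values())
    while level > floor:
        if debate_test_passes() or level <= basket_80pct_level:
            return clamped(extract_uf, level)              # exit: current state of the procedure
        level -= step
    return dict(extract_uf)                                # neither condition: return the input

# Toy usage with a stand-in debate test that never passes:
example = trim_to_cell_uf(
    {"art": 0.9, "status": 0.8, "safety": 0.2},
    debate_test_passes=lambda: False,
    basket_80pct_level=0.5)
print(example)   # clamp stops near the 0.5 basket level: art and status flattened, safety untouched
```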
The voting procedure is completed once, at the beginning. The result is that which would result from giving a new CDT agent to each extracted utility function, extract_uf. The voting procedure generally follows the requirements and descriptions given in this section of the article. Note that the voting does not use trimmed utility functions, and is not done by actual people. This lets the outcome of voting manage any problems caused by the trimming procedure. This, however, requires that the agents can guess the approximate results of a setup that does use the trimmed utility functions. The CDT agents are given general information about the conditions that will occur in every main room, both in the case where the person is left alone, and the case where generic stand-in debate partners are provided, and have enough information to guess what possible meeting rooms would have badly combined utility functions, and what ones would be useful to make exist. The possible outcomes selectable by the voting are sets of undirected graphs where each main room can be in at most one graph in the set. It is acceptable for the set to contain only one graph.
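As a concrete gloss on the outcome format, a minimal validity check in Python. The representation (adjacency dicts keyed by main-room id) and the 64-wall degree cap are my own illustrative choices, not part of the voting specification.
(start code)
def outcome_is_valid(graphs, max_degree=64):
    """graphs: a list of undirected graphs, each an adjacency dict mapping a
    main-room id to the set of main rooms it shares a meeting room with."""
    seen = set()
    for g in graphs:
        nodes = set(g) | {n for nbrs in g.values() for n in nbrs}
        if nodes & seen:
            return False            # each main room may be in at most one graph
        seen |= nodes
        for room, nbrs in g.items():
            if len(nbrs) > max_degree:
                return False        # a room cannot have more meeting rooms than walls
            if any(room not in g.get(other, set()) for other in nbrs):
                return False        # edges must be symmetric (undirected)
    return True
(end code)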
Put together, and assuming Ramsey extraction with a bounded number of steps, this almost provides a PAC (Probably Approximately Correct) implementation of the target function. This only applies in the case that there is no attempted outside manipulation, so it requires that the actions of all agents, including non-simulated agents, are constrained. Even then, if the outcome of the function (assuming perfectly known utility functions) depends on exact ties in desirability (here fixed such that causal Jeffrey-Bolker = CDT modified Savage) in the functions of individual agents, there is no way of guaranteeing the approximate correctness of the (single, not repeated) outcome by injecting simple randomness into the extraction procedure, even in the nominal case. Even when this tie problem has not occurred, such randomness also can't help if a relevant systemic bias makes the (non-strict) top cycle set fail to match the set that would be calculated if all agents had their utility functions perfectly known.
The main rooms are optimized according to the cell_uf derived from the person each room is made for. The meeting rooms are optimized according to main room 1's cell_uf + main room 2's cell_uf. Unlike the voting procedure itself, both of the input utility functions are trimmed.
If the people do not have their votes cast for them by a machine system, it cannot be ensured that they decide using Causal Decision Theory (CDT). This could lead to acausal bargaining that subverts the preference aggregation. Additionally, if humans are assumed to cast votes for themselves, for a proper argument I would need to analyze non-GTO (Game Theoretic Optimal) play by non-rational agents, against other agents themselves not playing GTO. I don't have experience with this, and different human populations would presumably play differently. This means that I stick to the analysis of CDT-rational agents, though I assume the results generalize somewhat.
Since CDT agents don't stick to any system once it is set up, unless they already prefer the system and all its results, they would need to have their actions constrained. For example, a "Florida tremble" is not equivalent to additionally giving the agents the option of playing a mixed strategy122. Generally, this would prevent doing risk management in the (anchored) physical world, outside of the system. I think this largely carries over to the agent I describe in this article, but since it is substantially a placeholder I won't directly argue this point further.
If you follow the procedures in this article, you should get a reasonably strong drive towards "contact with reality." This defeats "standard" attempts to satisfy your values/preferences. If you are a type of agent that can be unhappy, this is reasonably likely123 to make you unhappy unless you are deceived about the actual situation. If you use an Anchored Utility Function, this is a plausible reason to want others not to use totalism, on top of the others given above.
When combined with the rest of the section, I think it is quite plausible that a totalist would not want to tolerate your (at least non-"frozen" or non-paused) existence as a follower of the positions described in this article, or of other advanced formulations of "contact with reality" that may exist.
Virtue ethics: the theory that you should "embody" (in some sense) certain virtues. In some of its most advanced forms, it relates to embedded agency. If it were known what the correct virtues were for things to be Good, this could be a quite tractable approach. It is not known, however. Finding virtues that are highly detailed, clear, and approximate the embodiment of "not losing" could be a highly viable approach for many people, but someone has to work out, and clearly explain, the rationality that leads to the correct parameterization.

Arguments

Standard utilitarianism, of all the types described above, doesn't nicely support keeping a zero point at a "Risk Rationality reasonable" place relative to the world state while the agent is exposed to anthropic relevant and/or nature-of-reality data. It doesn't operate with a "survivable core." It doesn't gracefully handle various scenarios, and quite often likes to kill off groups of people. You could argue that's not an important part of scenario space, but I say it is. If an AI takes over the world124 and offers current people destructive uploads, it would appear on its face likely to increase utility by incorporating at least some of that data into entirely new people, instead of running the uploads it takes.
Under averagist utilitarianism, the AI isn't driven to have a large number of people. In that case, maybe it can keep the data around due to extra unused storage space. In the case where the utilitarianism operates on person-instants, this appears plausible. If it operates on persons directly, it would come down to encoded definitions. Either way, it's unlikely on its face that they'd be run, due to their less-than-optimal happiness, their values/preferences being difficult to satisfy, their other non-perfect axiological attributes, if those exist, or some combination of the previous. For totalism, see below.
These uploads may not have full integrity, and even if integrity is the nominal case, the AI may not expend all that much effort and resources to make sure it goes right every time, though that's not a prediction I can really make.
Plausibly the AI could use any spare compute to run the "new person generation" process, and uploads of the standard type, even assuming they had enough integrity, would never run. They may even be conglomerated and lossily compressed to save data storage, becoming un-runnable/dead forever.
See "Doesn't this massively privilege the preferences of currently existing humans over anyone or anything else?" in the "Possible Reader Questions" section for more information. Interacting with the repugnant conclusion, I can't think of any zero-point that gets you out of this.
If full destructive uploading of humanity is compatible with a particular parameterization of common numerical utilitarianism, by the standard axioms this is too.
Maybe the AI would take extreme care in uploading some people it thinks might be "important" to its research, but ems used for research probably aren't going to be experiencing their pre-upload state preferred125 utopias.
As for rationality-based forms of virtue ethics, the extra steps required to generate them would generally make them less useful for people designing utility functions, but if the remaining questions around rationality get cleaned up enough to approximate the recommendations in this article, a detailed system of virtues might be useful to a lot of people. Virtue ethics isn't an alternative though, because the utility function design work needs to be done first.

Appendix

Note that there are various things an aligned utilitarian AI could do to you if you are unsatisfied other than kill/delete you. It could roll you back as far as it needs to, possibly to a "back-predicted" state, or it could "pause" you and keep you in case conditions change more in favor of your preferences, or it could ask you if you wanted to change your preferences, or it could force you to change them.
How much difference there is between "asking" and "forcing" when subject to a superpersuader is highly questionable, but either way it's a very bad outcome for someone who wants strong contact with reality achieved through Rational Risk Management.
This is because Rational Risk Management attempts to manage actual risk, not risk in some virtual scenario that is designed to be "fun" to manage. This may not be a difference detectable to you as a subject of utilitarian optimization, but your preferences are not actually being satisfied even if you think they are. Having other people in the virtual scenario wouldn't help, because not only are you unable to help with actual risk management, you can't even attempt to check what's really being done. If you could, and you had an Anchored Utility Function, you presumably wouldn't like what's being done in the "real world."
If this is correct, it is plausibly a strong strike against general utilitarian optimization. Certain types of incredibly specific preference utilitarianism may be an exception, when combined with quantilization, other limiting or conditioning procedures, and some sort of very high quality anthropic risk management system that uses the remaining resources. I'm not sure how normal risk management would work here, and in general this isn't something I find promising. Highly thermalizing systems are dangerous at non-weak optimization pressures.

Dempster–Shafer Theory and Boltzmann Brains

I take the very poor results of Dempster–Shafer combination demonstrated below to represent constraint solving, where most of the space is in the state where the constraints could not be solved.
This argument was originally based on a simpler theory, so it should generalize to certain theories that have combination operators similar to Dempster–Shafer's.
Postulate a very, very large number of Boltzmann brains. These Boltzmann brains are not contained inside a single quantum fluctuation any more than normal matter is. Note that unlike some descriptions of Boltzmann brains, they can operate for quite a long time, days at least. This duration depends on thermodynamics, as well as Big Rip like considerations, describing how long a structure of a certain size can remain intact. With a cosmological constant slightly above zero, very large computational structures, along with their power sources, can remain intact for a very, very long time.
As long as these Boltzmann brain computer complexes are possible in your theory, stipulate that the Boltzmann brains described below are all of that class. In any case, further stipulate that all the Boltzmann brains are either entirely classical computers, or classical computers with non-classical parts, as long as the agent is fully in the classical section at all times. Stipulate that in any case these machines each run a simulation that contains an agent that is attempting to use Dempster–Shafer or a similar method. In notation, this agent will be known as "you."
Note that it is important that these Boltzmann brains run simulations containing you, instead of being you in some sense. This allows the argument to maximize coverage of Dempster–Shafer state space by standard theorems, instead of via arguments that have failed to convince others in the past. The Church–Turing thesis distributes the random-like structure of permissive aspects of the system, derived from the statistical nature of Boltzmann brains, into a coherent simulated world you inhabit.
If there are holes in the relevant parts of state space in the case of direct existence as a Boltzmann brain, this fills them in.
As a sub argument, consider the set of deterministic sequence generators consisting of ChaCha20, plus its viable variants. The variants change the constants, the rounds, or both.
All of these variants, plus the original, can be implemented in a single C programming language source file less than 100 lines of code long. The file is not formatted to reduce length, and no line is longer than 70 characters. Changing to a different variant takes only easy to understand and very simple edits to 2 lines. Variants that exhibit improper behavior are not considered. The C code uses no complexity-hiding predefined or external functions, except for a single helper function that preserves reproducibility across systems with differing endian values. This is not required for the current argument.
Note that users of ChaCha and the variants I consider can seek to any location in the algorithm's stream, in addition to setting the 256 bit seed and the 64 bit stream counter. This ability is associated with security relevant concerns, but it serves as an example of the ability of small algorithms to generate many streams.
By the last two paragraphs, I argue that generating a great number of sequences does not take a significant complexity penalty, and such generation is fast and does not take outsized resources.126 When combined with previous arguments about computer Boltzmann brains and the Church–Turing thesis, it demonstrates a toy version of a sequence generator that lets us make certain weak statistical assumptions. This is because it lets us transfer the statistical nature of Boltzmann brains into the parameters of a sequence generator, in the toy example being the variant, the seek offset, the seed, and the stream counter.
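For concreteness, here is the standard ChaCha quarter-round (per the published specification, rendered in Python rather than the C file described above; this is not the author's code). It shows how small the ARX core really is; a variant is then picked out by little more than its constants, round count, 256-bit key, counter, and seek position.
(start code)
MASK = 0xFFFFFFFF                      # 32-bit words throughout

def rotl32(x, n):
    return ((x << n) | (x >> (32 - n))) & MASK

def quarter_round(a, b, c, d):
    # Add-rotate-XOR (ARX) operations only; no lookup tables or big-number math.
    a = (a + b) & MASK; d = rotl32(d ^ a, 16)
    c = (c + d) & MASK; b = rotl32(b ^ c, 12)
    a = (a + b) & MASK; d = rotl32(d ^ a, 8)
    c = (c + d) & MASK; b = rotl32(b ^ c, 7)
    return a, b, c, d
(end code)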
Here, carefully note that for this sub argument, relying on the later identification difficulty argument, I don't argue that every experience of a certain class could be generated in the given manner. If I were making such an argument, it would be the case that small-size sequence generators would help only slightly. Instead, take some category of experiences, and let the target sequence be a sequence that matches the standard story you would have if you were in that sequence. Additionally, it is important later that many sequences can be matched by different instantiations of the procedure described in this paragraph, with each standard story limiting sequences only enough to preserve indistinguishability, as described in other parts of this section.
If no stream in any parameterization of any variant contains the sequence you need, this is not much of a problem if there is a sequence similar to the target. The problem can be fixed with the implementation of coding, at a cost increasing with dissimilarity from the target, plus a small constant complexity cost.127 All together, this toy system lets you encode many sequences at relatively little cost under multiple metrics. At this level, the complexity cost is favorable enough when compared to physics, and it uses very few resources. Humans with their current (as of writing) technological abilities would be unable to determine that there is anything strange going on with these sequences, even in segments lacking coding fix-ups, assuming the more viable variants and parameterizations. However, this toy system isn't good enough for the argument. It can't encode enough sequences efficiently.
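As a toy illustration of the coding fix-ups (my own sketch; the patch format is arbitrary and not taken from the argument), a generated stream can be patched toward a target at a cost that grows with their dissimilarity.
(start code)
def make_patch(generated, target):
    # One (offset, byte) entry per position where the streams differ.
    return [(i, t) for i, (g, t) in enumerate(zip(generated, target)) if g != t]

def apply_patch(generated, patch):
    out = bytearray(generated)
    for i, b in patch:
        out[i] = b
    return bytes(out)

# Total encoding cost is roughly: the generator's parameters, plus a small
# constant for the patching machinery, plus one entry per mismatch -- i.e. it
# grows with the dissimilarity between the generated stream and the target.
(end code)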
Compare this to an implementation of some rules of physics plus an initial condition. Under a pure implementation, and under reasonable compute restrictions, this is plausibly quite complex. The initial state of the environment needs to be specified as well, not just the initial state of the agent. The resource use, including elapsed (external) physical time, is quite large with this environment-plus-agent physics simulation. This argues against such a pure implementation.
I suggest that a hybrid system composed of a stream generator larger and more capable than the ChaCha variants described earlier, plus coding fix-ups, plus some level of stand-in physics simulation is a plausible system. I conjecture that a sequence generator of the add–rotate–XOR (ARX) type would work here, and if it wouldn't, some generator not much more complex would; this is related to derandomization conjectures, though the requirement here is maybe weaker because the generator only needs to be unbreakable on a practical level from inside the simulation, and maybe stronger because it must not require an absurd amount of compute under the required suitability conditions. ARX ciphers aren't based on a currently known mathematical problem, unlike some asymmetric primitives that are (partially) based on the claimed difficulty of things like prime factoring or of certain variants of "learning with errors" (LWE).
If such a system is not less expensive to encode in a computer Boltzmann brain, I suggest that it is not so much more expensive that it breaks the argument. Such a system may have a higher complexity penalty than a physics-based solution, but this should mostly be made up by a reduced need to spend resources computing the environment. In addition, the hybrid system already has a source of practically indistinguishably high quality sequences, further in its favor. Coding fix-ups can intervene at multiple locations.
Various components can be traded off to balance complexity vs. resource usage, including time usage. I suggest that no component here would be traded away fully, except in cases that can be ignored for this argument.
Remember that searching through parameter space is not a computation that we have to do here. The Boltzmann brain computer setup has this already completed, see the ChaCha sequence generator section for how this feeds into the setup.
From now on I will assume that your past experiences have made sense. This is true for me, so calibrate your definition to reflect that. Taking this and extending it slightly further to make the reasoning easier, we put aside all hybrid system sequences that have either a past, a present, or both, that don't make sense from the perspective of the agent.
Taking the remaining hybrid system sequences, we will iterate over each apparent present environment. These environments may not satisfy certain requirements re. reality, but they appear to make sense to the agent, both in the present and in the somewhat non-standard past that the hybrid system may have.
Since these hybrid systems aren't pure physics simulations, the main reason there is a past and present that makes sense is by the selection we did (not an intrinsic reason). If not enough of the scenarios stop making sense very quickly after the present in the way I mean, increase the number of Boltzmann brains you consider. This should fill in remaining situations that may partially undermine the argument. See the ChaCha variants section and the comments on the Church–Turing thesis.
For each apparent present environment, as defined previously, select various aspects of the environment. For each aspect you select, assign credences to possible near-future states of that aspect, say on the order of one minute in the future128.
Collect the results of this inner iteration at the end of each step of the outer iteration. Collect these inner loop collections into a single collection.
For each inner collection in the outer collection, combine the credences with Dempster combination, matching like aspects across environments, merging down as if with a binary tree. With the new state of the collection, for each entry examine what credences can now be used to make decisions. Combination removes non-shared beliefs. As said before, since a large number of hybrid system sequences will stop making sense in the future, various near-future states of various aspects will be reached. Each apparent environment has many Boltzmann brain implementations, and as such will either fail to combine or will clear much of the decision relevant state space of the credences over the near future states of the aspects.
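To make the combination step concrete, here is a minimal sketch of Dempster's rule in Python. The frame of discernment, the aspect states, and the masses are invented for illustration, not taken from any particular environment.
(start code)
from itertools import product

def dempster_combine(m1, m2):
    """Combine two mass functions (dict: frozenset of states -> mass).
    Returns None on total conflict (K = 1), where combination is undefined."""
    combined, conflict = {}, 0.0
    for (b, mb), (c, mc) in product(m1.items(), m2.items()):
        inter = b & c
        if inter:
            combined[inter] = combined.get(inter, 0.0) + mb * mc
        else:
            conflict += mb * mc                    # mass landing on disjoint sets
    if conflict >= 1.0 - 1e-12:
        return None                                # combination fails outright
    return {a: m / (1.0 - conflict) for a, m in combined.items()}

# Two apparent environments that almost entirely disagree about one aspect's
# near-future state:
A, B, C = "state_A", "state_B", "state_C"
env1 = {frozenset({A}): 0.99, frozenset({A, B, C}): 0.01}
env2 = {frozenset({B}): 0.99, frozenset({A, B, C}): 0.01}
print(dempster_combine(env1, env2))
# Almost all of the mass sat on disjoint singletons and is renormalized away;
# what survives is determined by the tiny residual assignments. Chaining this
# across many environments either fails (None) or clears the decision
# relevant mass, as described above.
(end code)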
Combined, this fails or irreparably clears much of the decision relevant Dempster–Shafer state space, making the use of the Dempster–Shafer theory mostly useless.
This argument does not depend on Cartesian assumptions, but can be modified to do so if desired. Otherwise, hybrid system sequences are assumed to contain the agent itself.
This argument loses cases at multiple steps, but I suggest that such problems can be mostly fixed by further increasing the number of Boltzmann brains considered. Even if these hybrid system sequences or a suitable replacement don't show up very often, this will fix the problems unless there is an overwhelming statistical flaw. Dempster–Shafer cannot recover by adding more scenarios in any case, and cannot recover by re-ordering scenarios. This means that the statistical argument depends on coverage, not anything to do with ratios.
This argument works with slight if any modification on some other credence theories, including convex hull credences. However, compared to the cited source I consider aspects such as resource usage and Boltzmann brain integrity time spans.
Note that various coherence theorems also argue against Dempster–Shafer and other similar theories, but they may not apply as cleanly in anthropic reasoning contexts. These are explained elsewhere, in other articles.
I think that consistent-across-situations probabilities, as combined with causal graphs, aren't good enough for rational action. Anthropics can't be solved with updated-into-a-situation probabilities, taken in the UDT sense of updated.
Point-probabilities are still better than Dempster–Shafer because they can be good enough approximations for decision making when computed for a certain restricted situation, as long as they are understood to be non-interchangeable across such situations. These types of probabilities include the straightforward "real" probabilities correctly approximated from Solomonoff induction as described in this article, as well as situation-specific pseudo-probabilities that are in addition specific to a particular utility function, pursuant to a behavioral approximation of correct anthropics. Descriptively, this approximately means that an agent may not appear to be using Bayesian updating as it changes situations. Of course, this appearance depends on assuming a particular decision theory. Correctly, a decision theory as executed can take on many appearances, depending on its prior and structure.
It is sometimes suggested that precise probabilities lead to bad risk management. I'm not sure this is actually correct when paired with the correct (perfect) utility function. Even if it is, changes should plausibly go somewhere else, maybe the decision trees or decision theory.
Note that it would be misleading for me to call myself a Bayesian. If I think the scenarios where acceptable approximations of "real" probabilities can be generated have a lower count in some sense than the scenarios where Rational Risk Management can be executed, I'm clearly not "conditioning" my version of rationality on being able to successfully approximate Bayesian statistics.

Wisdom Longtermism

I suggest a strategy that leaves resources available for the implementation of a Good utility function, if one is ever discovered. In the short and medium term I suggest continuing research into projects that could constrain the search space around such utility functions. In the longer term I suspect we will need to settle for something good enough, because I'm not optimistic about finding Good, but that doesn't mean we should give up yet.
1
https://intelligence.org/2017/04/07/decisions-are-for-making-bad-outcomes-inconsistent/
2
In the formal logic sense.
3
https://en.wikipedia.org/wiki/Goodhart%27s_law
4
https://en.wikipedia.org/wiki/Reification_(fallacy)
5
Embedded Agency by Abram Demski and Scott Garrabrant, page 24
6
Anthropic Decision Theory
7
Unless someone can tell me why to expect the observers SIA claims to exist are causally connected in any way, I'm considering it mostly useless.
8
A Stranger Priority? Topics at the Outer Reaches of Effective Altruism, Joe Carlsmith's PhD thesis, page 31
129
Note that Boltzmann brain like structures are highly robust to different assumptions. Their possible existence does not rely on correct knowledge of the real laws of physics. They do not rely on momentary quantum fluctuations, and according to the currently assumed cosmology they can last a long time. However, the mathematics of physics helps ensure they are logically consistent, unlike other skeptical scenarios. Suppose the universe ends up uniformly in a de Sitter state, or in a state close enough to it, i.e. eventually every thermal photon that impacts or is captured by an object has its information lost to irretrievably retreating spacetime regions, and all free thermal photons are also eventually so lost. Then Boltzmann brains that may exist in the current state of the universe according to the Standard Story, or in earlier states of the universe according to the Standard Story, can reasonably be described as existing in what is effectively de Sitter space for our purposes: even if the thermal photons they emit interact with an object or objects, e.g. black holes or baryonic matter, the information will still eventually be lost as in a pure de Sitter space, even if not for a long time. Such "early" Boltzmann brains are usually described as having low to negligible measure, but some of my arguments either don't use anthropics in a way that takes a measure prior at all, or don't rely on ratios. This generally works mathematically, but it presumably requires considering a non-measured multiverse, even though it does not exist. It may not work with certain kinds of ultrafinitism, however. For Boltzmann brains that are not "early," the situation is harder. A finite dimensional space that features recurrence would have what are effectively Boltzmann brains, and an infinite dimensional space that goes to some sort of practically empty state quickly enough may have no such Boltzmann brains at all. Both of those scenarios are so strange and difficult to analyze in terms of real physics that, at least for now, non-"early," non-recurrence Boltzmann brains can't be ruled out and are sufficiently a live issue to be productively worked on, though possibly as a stand-in for similar structures in a more complete theory.
9
But not non-measurable. This relates to the sense that it is possible to generate a procedure that enumerates a large number of ways that you don't know that it isn't without saying that enumeration is how it is. This makes coming up with a measure not needed, even if there are many measures you could come up with.
10
With the right Logical Inductors prior and decision tree, whatever they are.
130
Even Solomonoff induction doesn't "want" to look for more information131132, and under a pure "Bayesian religion" it would try to "move its eye" to look somewhere that's as predictable as possible. Of course, assuming a fixed origin point and no time travel, it can only maximize its existence span in a single direction, and has to do so to avoid "cheating," because "death" or other incapacitation would make the Solomonoff induction input bit-string less than maximally long, and thus easier to predict.133 But see Optimality is the tiger, and agents are its teeth for why approximations of Solomonoff induction practically won't "stay in a corner" for long, if they ever do to begin with.
131
Bad Universal Priors and Notions of Optimality https://proceedings.mlr.press/v40/Leike15.pdf Page 2
132
This assumes some particularly bad setups, but the point generalizes. I want to drive exploration through risk management, not hard-coding (except in a few edge cases) exploration, or so-called optimism.
133
https://www.yudkowsky.net/rational/technical
11
https://www.lesswrong.com/posts/kpPnReyBC54KESiSn/optimality-is-the-tiger-and-agents-are-its-teeth
12
A stronger version of "contact with reality" would probably lead an agent to take "risky" risks (as opposed to risk-reducing risks, as accounted in the agent's bounded utility), but I think an "instrumental" version of contact with reality can successfully be derived.
Take the weighing of simulation escape attempts as an especially challenging example. Assume the agent has always been in the simulation, and its utility function was assigned before it had much evidence on this. When it later gets more information supporting being in a simulation it should not change its utility function, according both to L-UDT and my personal more conservative analysis of Anthropic Decision Theory ("locking in" the utility function).
Since it can't change its utility function at this point, if its utility function contains a strong version of "contact with reality" it may take immense risks with "the world" (the simulation it started in) in order to attempt escape, and thus more contact with reality.
I think this can be fixed by having the agent start out with a strongly bounded utility function (in the upwards direction, see Beckstead and Thomas 2021) that is "anchored" to a "distilled" picture of the world that continues to "attach" in a huge number of conditions (being in a "local" simulation, consciousness existing or not existing, and many other things that I could list), generally preventing the agent sacrificing its world except in the most bizarre skeptical scenarios.
I assume the agent is not reckless or "negatively timid"; see Beckstead and Thomas 2021 for what that means.
At this point "contact with reality" can be recovered by taking Beckstead and Thomas' modus tollens
Lingering doubt: in an alternate possible world, people live in a utopia. Life is extremely good for everyone, society is extremely just, and so on. Their historians offer a reasonably well-documented history where life was similarly good. However, the historians cannot definitively rule out various unlikely conspiracy theories. Perhaps, for instance, some past generation 'cooked the books' in order to shield future generations from knowing the horrors of a past more like the world we live in today.
Against this background, let us evaluate two options: one would modestly benefit everyone alive if (as is all but certain) the past was a good one; the other would similarly benefit only a few people, and only if the conspiracy theories happened to be true.
[A positively timid (bounded upwardly utility function) agent would take the second option.]
Contrary to the timid analysis of the case, we feel that modestly benefiting everyone alive would in fact be a weighty consideration, even in the presence of some risk. [Therefore positive timidity is rejected.]
Generalizing it, removing the hard totalism, [adding anchoring], and turning it into a modus ponens
Lingering doubt: in a theoretical scenario that appears world-like to the agent, humans apparently live in a utopia. To its strongly upwardly-bounded utility function, the world's current apparent state could be quite a lot worse in material and social terms before immediate-resolution utility evaluations become important. The general position of the agent's upward bound is "reasonable" because the agent's utility function is "attached" to the conditions.
Against this background, two options are considered by the agent and by a theoretical totalist commentator: one would involve the agent and most others enjoying their current situation, and the other would involve the agent researching and attempting to mitigate unlikely risks. In the second the agent would try to involve others in the work and would decrease both total and average happiness at all times this work goes on.
The totalist commentator would be able to reject the averageist [sic?] case and the short-termist case against the second option, but may still reject it because they see it as duplicating an already-existing effort to build a grand future, and therefore (say) adding again the short-term happiness loss caused by work on the "grand futures" project.
[A positively timid (bounded upwardly utility function) agent would take the second option.]
Against Beckstead and Thomas, reducing risk to the anchor-attached "world" is worth it even in "utopian"-evaluated conditions. [Therefore positive timidity of this very particular sort is accepted.]
This at least partially recovers a drive toward "contact with reality".
Note that the agent's situation only "appears world-like". It has no direct epistemic access to much of anything at all, even assuming, for some reason, that "direct epistemic access" is a credible concept. If it is able to assign unconditional probabilities at all, the credences it will assign to any reasonable attempt at risk management going on at all will be nowhere near certainty.
("Unlikely" used here in the colloquial sense by both me and Beckstead and Thomas.)
13
https://en.wikipedia.org/wiki/Probability_axioms
134
Quite often humans will violate even the rules of pseudo-probabilities. If you're doing this, write documentation so people will know when it matters!
14
Probability Theory: The Logic of Science by E. T. Jaynes (2003) Pages 122, 144, Equations 5.11, 5.46
15
https://en.wikipedia.org/wiki/Odds
16
https://intelligence.org/files/LogicalInduction.pdf Page 14, Section 3.1
17
No hypercomputers/time travel.
18
A Theory of Universal Artificial Intelligence based on Algorithmic Complexity by Marcus Hutter https://archive.org/details/arxiv-cs0004001 Pages 17, 21, 24, 25, Section 4, Equation 40
19
A simulated sequence still must be physically realizable, because it must be run in a physically realizable computer.
20
https://en.wikipedia.org/wiki/Eternalism_(philosophy_of_time)
21
This means that this Solomonoff induction is the only thing in this universe to have "free will" of the strongest practical nature, because its outputs can emplace results into the universe block that were not there before, more or less swapping it out for a new block each time (due to causes propagating). The mathematical version of Solomonoff induction is not possible, but the argument would be too difficult for me to make without it. The internal functioning, such as it exists, must be sealed for this Solomonoff induction.
22
Marcus Hutter https://archive.org/details/arxiv-cs0004001 Page 18
135
And you can't get anything from its internal representation because it's "not in the physical universe."136 The inductor can't get them in the relevant manner either, because it's predicting a universe it's not in. There are several mathematical solutions that let some similar classes of information "bootstrap from nowhere" but they aren't applicable here.
136
Marcus Hutter https://archive.org/details/arxiv-cs0004001 Pages 24, 25, Section 4, Equation 40
23
Time Travel and Computing by Hans Moravec*
24
https://en.wikipedia.org/wiki/Bayes_factor
25
Probability Theory: The Logic of Science by E. T. Jaynes (2003) Pages 89 to 91, Equations 4.3, 4.6, 4.7, 4.8, 4.9, 4.11
26
https://jc.gatspress.com/pdf/on_expected_utility.pdf Page 43
27
Probability Theory: The Logic of Science by E. T. Jaynes (2003) Page 250, Section 8.5
28
Note that Carlsmith's single use of the word "likelihood" in the linked document is dubious; it should "probably" (see the conditionalization section) be called a "pseudo-probability," and have a footnote describing what type.
29
https://en.wikipedia.org/wiki/Von_Neumann%E2%80%93Morgenstern_utility_theorem
30
A paradox for tiny possibilities and enormous values by Beckstead and Thomas (2021) Page 9
31
https://en.wikipedia.org/wiki/St._Petersburg_paradox
32
A paradox for tiny possibilities and enormous values by Beckstead and Thomas (2021) Page 13, Section 2.3
33
The "solution" from that paper uses non-standard math and is hardly less destructive to overall risk management of the style I advocate.
34
A paradox for tiny possibilities and enormous values by Beckstead and Thomas (2021) Page 7, Section 1
35
Or really any sort of claimed facts that your utility function rates as extremely bad, presented in a way that leads you to put a probability on their being true high enough that the expected utility calculation comes out so strongly negative that it distorts your risk management.
36
"Free will" in the sense that the node that represents the "decision" in an approximate causal graph is "uncaused" (has no parent). This is related to Judea Pearl's do-calculus, in the sense that it treats itself as a separate "intervening actor" on a isolated causal system.
37
See the "Good & Evil" section for the math.
38
Levinstein, Benjamin A. and Soares, Nate. Cheating Death in Damascus. The Journal of Philosophy 117, no. 5 (May 2020): Pages 237-266. https://doi.org/10.5840/jphil2020117516
39
Causality: Models, Reasoning, and Inference by Judea Pearl
40
The Ghost in the Quantum Turing Machine by Scott Aaronson, Version 3 Page 18, unknown location. Version 2 can be found at https://arxiv.org/abs/1306.0159 Page 18
41
The Causal Angel by Hannu Rajaniemi (fiction) Chapter 19, "Jean le Flambeur-complete"
42
https://en.wikipedia.org/wiki/Dependency_graph
43
Or until something better is figured out.
44
At least under a particular anthropic model, upcoming.
45
https://www.lesswrong.com/posts/wXbSAKu2AcohaK2Gt/udt-shows-that-decision-theory-is-more-puzzling-than-ever
46
Cheating Death in Damascus by Benjamin A. Levinstein and Nate Soares, Page 8, Section 3.2
47
A lot of different people call a lot of different things "rational."
48
In the nominal case. If the world is degraded or destroyed, it will try to restore it to some state highly rated in the utility function, based on the state at Anchoring time. This sometimes deviates from pure avoidance of Risky Risks, where only future degradation would be worked against. To avoid damaging the world in attempts toward contact with reality, some evaluation strategy must exist to distinguish between things that are some combination of incorrect and not valuable and things that are some combination of correct and valuable. I propose part of such a strategy, though see "Is Full Updating in UDT-like Theories Much Like Weak Anthropics?" and others.
137
Things M does not require, among many: free will, consciousness, deontic realism, axiological realism, normative realism, mathematical realism, metaphysical modality, and identity. And that's not just regarding the agent. If no one else has ever existed, exists now, or will exist, M does not care. If Earth is a figment of your imagination, M does not care. If whatever version of identity and/or reality you prefer says you/the agent doesn't exist, M does not care. If nothing in the "does not require" list has ever existed, exists, or will ever exist, or is even possible, M does not care. If the universe is illogical in certain ways, M may not care. M does require at least the theoretical possibility that the agent "could," "at some point" (tricky because time and/or causality may be having trouble this far in) "take action for the sake of"138 the "ends" of something vaguely like a utility function, somehow "influence" "something else" to do that, or both. Be very careful. I'm not exactly sure about even that requirement, maybe there's some sort of "good" (or even Good) that never requires an "action" at any point.
138
Good and Real by Garry L. Drescher, page 183
49
Note that in the case no Good is ever found, something similar to what's described in this article can be used indefinitely. There is no justification solely resting on some sort of "bet" or requiring that there is a non-negligible probability of finding out what Good is in the future, though that may in fact describe reality.
139
"But this temptation derives, I think, from subtly construing the 'standard story' at a higher level of abstraction than it actually operates on. That is, the standard story is not a map of the world in which the humans (whoever they are -- if they even exist) make some unspecified set of 21st-century-ish observations, then go on to create ancestor simulations. Rather, we build out the standard story’s 'map' by starting with what we see around us." -- Joe Carlsmith
50
https://joecarlsmith.com/2022/02/18/simulation-arguments
51
Of course, when they say "my rationality," they're assuming something like SS|M, even if they didn't mean to. There's a reason I predict I'm going to slip up. Also, note that in Bayesian statistics, p(doom|SS|M) is the same thing as p(doom|SS∩M), but since we're not strictly using Bayesian conditionalization, how I write it here is clearer and more technically accurate.
140
Orientation-preserving.
52
https://en.wikipedia.org/wiki/Affine_transformation#Groups
53
https://en.wikipedia.org/wiki/Numerical_analysis
54
A nicer analogy here is incorrect, because I don't know of any system of rationality that will never thermalize to make decisions, is embedded, and doesn't have a large reliance on priors in multiple locations.
55
https://en.wikipedia.org/wiki/Time_preference
56
Nonparametric General Reinforcement Learning by Jan Leike, Page 92, Section 5.5.1 "Optimality" https://web.archive.org/web/20230329113532/http://arxiv.org/pdf/1611.08944
57
Probability Theory: The Logic of Science by E. T. Jaynes (2003) Page 103, Figure 4.1
58
Stuffing everyone into computer simulations and then running them as fast as you can doesn't count, sorry Robin Hanson.
If it turns out you "anchored" in a computer simulation, there would presumably be a non-strict and very complex hierarchy regarding what resources need to go where in what universe at what time to get the overall "don't lose" research as far forward as you can, while still leaving yourself ways to "win," even though you don't know what those ways are or what "winning" is.
There may be varying laws of physics, due to interventions. There may be various interventions, more or less surgical. There may be different physics in different places for other reasons as well, for example because the agent anchored in a simulation of "alternate physics." I'm not sure anyone serious said simulation escape would be easy for humans, if they need to follow Rational Risk Management.
Presumably it would still be somewhat based on the (strict?) hierarchy of simulation nesting.
59
A Philosophical Treatise of Universal Induction by Samuel Rathmanner https://doi.org/10.3390%2Fe13061076 Page 43
60
Did a gamma-ray burst initiate the late Ordovician mass extinction? by A.L. Melott, B.S. Lieberman, C.M. Laird, L.D. Martin, M.V. Medvedev, B.C. Thomas, J.K. Cannizzo, N. Gehrels, and C.H. Jackman
61
"Soon" in this paragraph is relative to the hundreds-of-millions-of-years scale this kind of problem is on.
62
If you're in a simulation, why would there be aliens in the simulation? Unclear.
63
I need to do more research on L-UDT partial updating, but I think it's possible to do not-fully-weak anthropic decision making with zero reification.
64
https://grabbyaliens.com/
65
A Simple Model of Grabby Aliens https://www.youtube.com/watch?v=0lKliaFllPA timestamps 00:08:12, 00:09:20, 00:09:52, 00:12:03, 00:13:50, 00:15:53
141
Lower bound due to need to take territory for defense-in-depth and bargaining, for signals of such to travel, and to re-initialize competition142, traded off against splintering Humanity. This gives us a volume about 1000 times less than typical for a Grabby Alien bubble, maybe enough. See Formulas.
(start code)
scale=44
(4 / 3) /* volume of a sphere */ \
* 3.1415926535897932384626433832795 /* < pi | the radius calculation here in light years > */ \
* (((0.75 /* expansion speed, 3/4 the speed of light */ \
* (50 /* time till meeting, millions of years */ - 20 /* time till we start expanding, millions of years */)) \
* 1000000 /* one million */) ^ 3 /* cube the prior linear light years dimension */)
= 47712938426394984809151.396383557406249999999810156250

((10 ** 5) * 4.9565511e+20) / 47712938426394984809151.396383557406249999999810156250 ~= 1038.8274676577
(end code)
(Where 10 to the 5 is the rough average of galaxies in a Grabby Aliens expansion bubble, and the second number is the total volume divided by the number of galaxies.)
For upper see previous citation and later.
142
Robin Hanson - Grabby Aliens - How Far Away Are Expansionist Aliens? https://www.youtube.com/Rjm--7t8Llk time stamp 56:23
66
This calculation is suspiciously hard to check for ballpark accuracy, because as far as I can tell everyone uses the same numbers up to loss of precision, just inverting and re-inverting the same equation. A cube voxel approximation (unpublished) is well constrained enough to confirm average smoothness. It is unable to substantially constrain the ratio however, that would need adjustment of the Grabby Aliens model to simulations of actual galaxies and an improved cosmological model. Galaxy size and type distributions may require a suspicious Copernican assumption, due to incomplete data. This further argues that the model breaks down at the near end, even though the inhomogeneity problem may be more tractable than previously believed, solvable by careful reference class selection.
67
A Simple Model of Grabby Aliens https://www.youtube.com/watch?v=0lKliaFllPA at time stamp 01:09:24
143
The time-till-meet and time-till-see graphs get fuzzy at the near-term end, but waiting 20 million years only requires projecting back 10 million years with no extra Hard Steps, up to projecting back 50 - 10 = 40 million years, with maybe no extra Hard Steps. In general I don't really understand how Great Filter Hard Steps are supposed to be identified, and I'm not sure it can be done because they're statistical entities, but by 50 million years after the current date assumptions will start falling apart.
(Note that Non-Grabby Aliens need to become Grabby Aliens at some ratio.)
In the Grabby Aliens model, current humanity is supposed to be close to central in a reference class of Non-Grabby Aliens. The Grabby Aliens model is closest to accurate when humanity has the same chance of becoming Grabby as other examples in the reference class. The model could (possibly) be re-parameterized with a different reference class, but from what I can tell from Hanson's opinion a Non-Grabby alien space-time event will either go inactive due to destruction or become Grabby within a few million years. In his opinion, exceptions would be rare. By 50 million years we may be leaving the reference class. Causal arrows here are sketchy, but Grabby Aliens may stop giving well-targeted advice past this point.
68
This is because the structure of galaxies and galaxy clusters becomes important, as does motion.
144
Though, Grabby Aliens doesn't say the expansion actually contains aliens of any sort. I find it suspicious that the claimed expansion rate is so close to the speed of light. Any sort of expanding destructive force could work, though technically the model supposes expansion through co-moving space and conformal time. This still leaves open large classes of forces, especially as the model is an approximation.
69
A Simple Model of Grabby Aliens https://www.youtube.com/watch?v=0lKliaFllPA at time stamp 00:55:30
70
I don't like probe speeds >0.75 light speed, but within Grabby Aliens nothing can be done. Slower probe speeds would make technological signatures appear too often in the sky, ruining the surprise minimization.
71
The anthropic principle and its implications for biological evolution by Brandon Carter
72
Ignoring the anthropic issues mentioned earlier.
73
Here "Earth" means Earth-as-a-system, not just a chunk of chunk of matter that can't send probes.
74
A Simple Model of Grabby Aliens https://www.youtube.com/watch?v=0lKliaFllPA at time stamp 01:14:18
145
This is similar in concept to decide-under-attack though with much simpler decision theory considerations because it doesn't require acausal retaliation or ratification by human pilots and commanders.
75
A Commonsense Policy for Avoiding a Disastrous Nuclear Decision https://carnegieendowment.org/posts/2019/09/a-commonsense-policy-for-avoiding-a-disastrous-nuclear-decision?lang=en
146
Even under a very basic pragmatic definition of identity, the combination of basic memories and terminal goals, it's "non-probable" one of these entities, were it somehow placed in a non-simulated 2023 Earth environment, would behave like you do, even allowing the "law of truly large numbers."
76
Conditioned on this scenario being reasonable and M, and maybe other things.
147
"So the same sort of dynamic, I mean, the paperclipper is in some sense modeled off of a total utilitarian. [I]t just has some material structure, it wants things to be that way, it happens to be paperclips. But even if it's pleasure or something that you really like, even if it's optimal, happy humans running in some perfect utopia, you know this old saying, there’s this quote, it's like, 'The AI does not hate you, does not love you, but you're made of atoms that it can use for something else.' And it's also true of utilitarianism, right? It's true of just many, many ethics. For most ethics that are sort of consequentialist, just like arrangements of matter in the world, they're just not... [T]hey're unlikely to be optimal on their own. And so, if the AI or if a human, if any sort of value system is in a position to rearrange all the atoms, it's probably not going to just keep the atoms in the way they were with all the fleshy humans, or at least that's the sort of concern. So basically I want to point to the sense in which this broad vibe, the vibe that is giving rise to the paperclipping and that would be sort of transferred to the humans, if you just assume that humans are structurally similar is in fact very, very scary with respect to how we think about boundaries and cooperative structures. You're invading people, all sorts of stuff, taking their resources. There's a bunch of stuff that just pure yang, naive yang I think does, that’s very bad. I've got Hitler invading. This is sort of a classic violation of various boundaries." -- Joe Carlsmith
("Unlikely" here used in the colloquial sense by Joe Carlsmith.)
77
https://joecarlsmith.com/2024/10/08/video-and-transcript-of-presentation-on-otherness-and-control-in-the-age-of-agi
78
Everett branches aren't, and an agent embedded in physics isn't in a position to access the universal wavefunction anyway.
79
Utility function.
80
https://en.wikipedia.org/wiki/Reversal_test
81
He had enough money to pay for an explicit model that improves on Kelly, that can be abandoned in the case of a true no-repeat scenario. Surprisingly, the evidence I see for this is quite literally zero, to the extent that a human can tell.
82
By the charities he would have supported, if you believe him on that. If anyone wants an idea, they could consider analyzing if SBF could have exploited his control over FTX Token (formerly, FTT) in ways he couldn't have with regular fiat currency. The "bank run" effect was very fiat-like.
83
Use in this answer conditioned on SS|M.
84
Note that this is still CDT, the utility function isn't "frozen" in the UDT sense.
85
A Philosophical Treatise of Universal Induction by Samuel Rathmanner https://doi.org/10.3390%2Fe13061076 Page 63
86
Re. any computable predictor, in the case of trying to predict an uncomputable string these results aren't strong enough. A sequence influenced by uncomputable processes is disruptive to prediction by a computable process. By No Free Lunch theorems and the limited number of short programs the result can be extended to predicting computable strings, but No Free Lunch theorems are weaker in the domains we are interested in. This is because disruptive sequences, in the current sense, are not limited to generation methods that are actually incomputable. How much information of what type needs to be fed to the inductor I'm not sure, but I suspect "very long" covers it. The internals of Solomonoff induction are always untrustworthy, and technically any output unchecked against reality is too.
87
I can't rule out that there would need to be guiding bits interleaved into the input string in a simple N:1 insertion pattern. If not properly constrained, this could cause problems for the objectivity of probabilities extracted, beyond the choice of prior.
88
There presumably exist L-UDT modified Logical Inductor priors that never update unless forced, but I won't discuss them currently. If a better account of logical sentence trading is invented or I learn of it, I may think more about fully updateless agents.
Infinite trees aren't tractable though, so at some point the agent may need to create a successor that goes through the entire sequence in the "L-UDT" section of this article, to prevent the need for a "full update" even with a more plausible solution in remotely this class. Each new agent can be given the same Anchored Utility Function as the first though, in theory the same way an AUF could be copied from the past if records were good enough, even lacking a prototype. At some point I want to work on researching if this is a physically relevant concern.
I guess this would be done by "aggressively hard-coding" catastrophes into the (ur?) utility function so the current agent can get in a few more non-zero-clamped utility rewards based on the current state of the world before being replaced, "as long as it cooperates fully."
It doesn't really need to be said, but extreme care would need to be put into figuring this out, because either incorrect incentives or damage to risk management can not be tolerated.
In my current understanding, even if you've figured out how to get not-fully-weak anthropic reasoning set up, it's not easy to keep it without going intractable.
89
See the "Conditionalization" section.
90
In behavior.
91
In the Standard Story, a non-point spacetime event.
92
Though, if this is done "wrong" the agent might be too "anthropically weak" among other possible issues. See later.
93
Note that L-UDT (in its current state) has no concept of bounded rationality, so you "just have to wait" here for some difficult to predict but presumably very long period until it figures out a close enough approximation to the logical statements you want it to know.
I think some statements can be manually installed, but statements can be installed in different ways and unless the differences are trivial in some sense from "what the inductors want to do," your helper/cleanup programs that try to do something about that won't reduce the time till a coherent result (that contains all the facts you need) too much.
94
A full update in the weak anthropics sense makes the agent "not care" about anything "before it existed," though "before" is highly nebulous and gets into "alternatives" and maybe even counterlogicals. Even with a close to full update, once the agent is in its "operating mode" it should be able to handle being put into further "anthropic scenarios." I think "fully acausal bargaining" is the correct term to research if you want to know what might go on with an agent that cares about this nebulous "before."
95
Note that the two-step utility function system here is required because the utility function is anchored. It doesn't provide corrigibility.
96
See the "Anchored Utility Function Forms" section.
97
Even with reasonably simple encodings, 1000 bytes encodes a lot of rational numbers. We're assuming a lot of compute here anyway, so it's not too bad even if you want the lower bound of your utility function to be very low indeed. I would personally be willing to tune how fine the steps between numbers are in different parts of the range, because if I was actually following this procedure I would know exactly where the current state of world is, along with the upper and lower bounds.
If I don't think I'll need super fine-grain utilities in a given regime, the bits that would have gone to represent them can go to other regimes that might benefit. No negative numbers would need to be stored.
When you're actually adding the probabilities there would need to be a different approximation and a whole system for making this setup tractable, but I'll ignore that.
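A toy sketch of the uneven-resolution idea (entirely my own; the regions and step sizes are invented): spend the representational resolution where finer distinctions are expected to matter, and coarsen elsewhere over the bounded, non-negative range.
(start code)
# Quantization regions over a bounded utility range [0, 1000]:
# (step size, upper edge of the region it covers). The finest steps are placed
# where the current state of the world is expected to sit.
REGIONS = [(0.5, 100.0), (0.001, 200.0), (1.0, 1000.0)]

def quantize(u):
    lower = 0.0
    for step, upper in REGIONS:
        if u <= upper:
            return lower + round((u - lower) / step) * step
        lower = upper
    return REGIONS[-1][1]    # clip to the top of the bounded range
(end code)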
98
In my procedure, I assume non-existent nodes have the worst possible value, zero. This is a reasonable approximation of reality, I think. I wouldn't "expect" any given configuration of "the world" to have value not suitably approximated by zero utility. You can think of the standard analogy of the configurations of atoms in a room, but of course UDT doesn't "think" in atoms in that sense.
99
In addition, having non-considered branches take the worst possible value makes sense because of L-UDT's consideration of counterlogicals. Coming up with a way to clearly demarcate considered and non-considered illogical "worlds" is to my knowledge unsolved, as is a method for assigning them value. Anchored Utility Functions provide a benefit here because the worst possible value is clearly defined and is calibrated to approximately maximize risk management, instead of sabotaging it. See the sections on positive utilitarianism in this article for examples of this sabotaging effect in the presence of very strongly bad evaluated world-content.
Note that in a deterministic environment, every single branch except one is false. You don't want Logical Inductors to get that far, but in the underlying mathematics that is correct. In an environment that contains limited indeterminism, as ours is assumed to, a certain number of branches may not be able to be ruled out beforehand. In any case, I don't think having to fit the entire Theory-of-Everything equivalent of the universal wavefunction into your system as a dependency is a reasonable assumption to make. At each time step, Logical Inductors will consider various logical sentences more or less plausible, including the truth or falsity of the branches. A Logical Inductor's counterlogicals come from its existence in the space between complete ignorance and mathematical perfection, so a reification of the "perfect state" would contain no counterlogicals. On top of all this, the behavior of the Logical Inductors depends on a prior. Combined with the changes in verdict at each step, this makes them a very poor candidate for reification.
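A toy illustration of the default-to-zero convention from this note and the previous one (the branch names, weights, and utilities are all invented): anything the evaluator has not worked out, or currently judges logically false, contributes the anchored minimum of zero.

```python
# Toy example: only branch utilities the evaluator has actually worked out
# appear in this table.  Anything absent is treated as the anchored minimum.
considered = {
    "branch_a": 120.0,
    "branch_b": 45.0,
}

# Current plausibility weights over branches (stand-ins for whatever the
# inductor-like process currently assigns; they shift over time).
weights = {
    "branch_a": 0.50,
    "branch_b": 0.30,
    "branch_c": 0.15,   # never evaluated: contributes zero
    "branch_d": 0.05,   # currently judged logically false: contributes zero
}

def anchored_expected_utility(weights, considered):
    # The .get(..., 0.0) default is the whole point: nothing negative or
    # undefined can leak in through branches that were never evaluated.
    return sum(p * considered.get(branch, 0.0) for branch, p in weights.items())

print(anchored_expected_utility(weights, considered))   # 73.5
```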
100
See the "Contact With Reality" section.
101
Note that the agent can "diagonalize against itself" (and technically against other agents/the universe in certain cases) in multiple places to avoid getting stuck. This lets it continue to make "action choices" in all cases. Even if no path has non-zero utility, the agent continues to "explore," both physically or environmentally and within its modified Logical Inductors.
In a fully epistemically updated L-UDT agent, this drive may apply only to its future non-existence, though I haven't fully worked that out.
This relates to the "aggressively hard-coded" catastrophes mentioned previously. These are especially important when considering the agent creating its own replacement in the face of the above-described drive. These "catastrophes" zero out all future utility from the node they're triggered in, maximally incentivizing the agent to avoid taking those paths unless the agent "thinks" all value is lost in all other future branches. The ability to do this is another argument that utility should work the way I describe here, and not be "fudged" in some way.
(Remember "future" is defined in the agent's observer rules. L-UDT does the best it can in various scenarios, but it can't guarantee this in a more "objective" sense, if that sense exists.)
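A minimal sketch of this zeroing-out behaviour (the tree shape, probabilities, and utilities are invented): triggering one of these hard-coded "catastrophe" nodes wipes out all utility at and below it, so the agent only enters such a branch if every alternative is also worth zero.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    utility: float = 0.0        # utility credited at this node
    catastrophe: bool = False   # hard-coded flag: zeroes out this node and below
    children: list = field(default_factory=list)   # (probability, Node) pairs

def value(node: Node) -> float:
    """Expected utility of the subtree rooted at this node."""
    if node.catastrophe:
        return 0.0              # all utility from this node onward is zeroed
    return node.utility + sum(p * value(child) for p, child in node.children)

def best_action(options: dict) -> str:
    # The agent still picks *some* action even if every option is worth zero,
    # which is the "keep exploring" behaviour described above.
    return max(options, key=lambda name: value(options[name]))

safe = Node(utility=1.0, children=[(1.0, Node(utility=10.0))])
doomed = Node(utility=5.0, catastrophe=True,
              children=[(1.0, Node(utility=100.0))])

print(value(safe), value(doomed))                      # 11.0 0.0
print(best_action({"safe": safe, "doomed": doomed}))   # safe
```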
102
These "forced updates" might not technically be forced from the "perspective of" L-UDT: the ur-utility function might be "aggressively hard-coded" to think something catastrophic would happen if it doesn't update right after being given the Anchored Utility Function, before going into "operating mode." There is awful inner misalignment here, so it's not something you would want to copy for your own more realistic approach.
103
Maybe. The interest in failing to logically update in certain cases comes from its being needed for dynamical stability, and from a possible problem with prediction markets. Logical partial updating makes almost no sense to me, and I don't understand how it could avoid trashing coherence (useful coherence, including the sentences that are needed). This may be due to my lack of experience in this area of mathematics, however. I've seen some interest in "exploring illogical worlds," but as far as I can tell that's a purely mathematical interest.
104
Note that any level of "paranoia" here is rational in the limit, but at achievable approximation levels, especially when the agent is not constructed and run by an ASI, certain configurations are not going to lead to very smooth operation. It's important to remember that perfect rationality can (hypothetically) lead to outcomes you personally consider horrible. Expected Utility Maximization (EUM) is very weird, and when some sort of UDT (Updateless Decision Theory) is involved, that becomes obvious in many more situations.
148
Before the agent goes into its "operating mode," it won't be able to self-improve, so the forced (partial) updates should in theory be fine. If the agent is an AI, there are standard considerations here that reasonably require extreme care. In an L-UDT agent, similar procedures must be done to establish the initial internal structure and prior, so this isn't entirely avoidable. If it doesn't "think" the updates you're making are what it would already do, it would have a strong, but not necessarily optimal, drive to be agentic, and would use its increasing optimization pressure to probe for flaws that wouldn't exist if the construction were perfect.
In general, L-UDT is plausibly better than connecting Solomonoff induction to a decision theory due to various properties, but it isn't convincingly the best that can be done.
One problem is that there is presumably some (bounded, so small in some sense) space where the agent is updated enough to be useful and not captured or otherwise influenced by (presumably "weak"/uncommon, maybe illogical) malign entities in the prior, but not updated so much that it becomes anthropically too weak. This space may not be all that large, but I don't know. I presume most people don't care too much about "alternative logic" anthropic scenarios, so the logical updates should be fine here, but not so much the non-logical/"epistemic" ones.
105
Including (in modified versions) the shaped increase of cumulatively used compute, self-knowledge plus reflectivity, and embeddedness.
149
Generally I'll use "counterfactual" to mean a counterlogical that's too big to tractably predict; it's still a counterlogical. Note that "admissible counterfactuals" in a standard finite decision tree have to be in the future, because once they have happened you can just set the result and "turn them into a counterlogical" by patching that value into the math without losing tractability, as sketched below. From an updateless perspective this isn't good enough, but if we do a single full update after this node in the graph, it works. Counterfactuals aren't really a thing, but I may use the term in this sense in less technical places. In the "caring structure" paragraph this note is attached to, I refer to an inadmissible counterfactual.
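A small sketch of that patching move (the numbers are invented): once a branch point is in the past and its outcome is known, the unrealized subtree can be replaced by the single realized value, which stays tractable but is, of course, an update.

```python
# A chance node is a list of (probability, subtree) pairs; a leaf is a float.
def expect(tree):
    if isinstance(tree, (int, float)):
        return float(tree)
    return sum(p * expect(sub) for p, sub in tree)

# Before the branch resolves: a genuine expectation over both subtrees.
tree = [(0.5, 10.0),
        (0.5, [(0.9, 2.0), (0.1, 0.0)])]
print(expect(tree))     # 5.9

# After the branch resolves in favour of the first outcome, patch the realized
# value in and drop the unrealized subtree; evaluation stays cheap, but from
# an updateless perspective this is exactly the update being given up.
patched = 10.0
print(expect(patched))  # 10.0
```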
106
Even with a quite absurdly progressed Logical Inductor.
107
Puzzles of anthropic reasoning resolved using full non-indexical conditioning by R. M. Neal
108
A Philosophical Treatise of Universal Induction by Samuel Rathmanner https://doi.org/10.3390%2Fe13061076 Page 53
109
https://web.archive.org/web/20210115170934/https://www.alignmentforum.org/posts/5bd75cc58225bf067037528e/updatelessness-and-son-of-x
110
The final paper appears to use meta-betting instead, but is that contradictory?
111
In general, the extrapolation needed here from a relatively small number of simulations would be difficult, especially since one would also want to be able to progress the Logical Inductors (to an extent). All of this would need to be combined with meta-betting. Traders would need to be cleared out in a consistent and coherent way as the inductors progress, and the solution should be somewhat stable with respect to utility evaluation.
112
Note that while this is not technically "writing off worlds," except in certain cases, it is a step in that direction. We do intentionally "write off worlds" to a great extent during partial updating, but those are worlds that we "know" we're not in: they are counterlogical and we don't reify them, unlike what Vetoed Traders Anthropics does to what "could be" (in some sense) the world we actually live in. Again, this is awful inner misalignment, and would only serve as a constructive approach to the overall L-UDT prior in a mathematical sense.
113
The Planet of the Apes problem? A more general term is probably needed.
114
Good and Real by Garry Drescher, Page 221
115
Superintelligence: Paths, Dangers, Strategies by Nick Bostrom
116
In this article I only talk about positive utilitarianism. There's also negative utilitarianism, but I don't have the experience to encode it. I'm guessing that in the short term, a bargained Anchored Utility Function would be a good enough approximation of a viable negative average preference utilitarianism. Though, negative average preference utilitarianism may have problems with certain computer simulations run inside of our universe. Note that I don't think Negative Average Preference Utilitarianism (NAPU) strictly defined has the right free and tunable parameters to meet standards for alignment.
117
Coherent Extrapolated Volition by Eliezer Yudkowsky, 2004
118
Suffers from infinities in an obvious way, though many other things suffer from infinities in less obvious ways.
119
Though, it may be impossible to apply totalist utilitarianism to yourself under rationality. Neither UDT nor total utilitarianism considers killing yourself different from failing to create yourself. Also, in the section "L-UDT Utility Function Evaluation" my utility calculation method requires the agent to act as if it's impossible to make the world better by not existing. This still applies if epistemic updates have been taken. This may be due to my own lack of imagination, and I don't have a list of desiderata, but it may be required for a generalized vNM rationality to "go." The agent's drive applies to actions by itself and by others. This might be a problem for Golden Rule plus bargaining style justifications for totalist utilitarianism: "treat others how you would wish to be treated" just doesn't go through. Regarding a modified, agent-based "categorical imperative," the agent can't really endorse a type of utilitarianism that works like this. The utility function endorses its own use, but not its use by other agents against the agent reasoning as described here. If the utility evaluation plus L-UDT form described in the article were a "universal law" as to its use by all entities the utilitarianism cares about, the utilitarianism would either self-subvert by bargaining or fail in other (possibly worse) ways, depending on conditions. This argues partially against meta-ethical hedonism. See the introduction and the "Contact With Reality" section for more arguments against meta-ethical hedonism.
120
This instability would also leave open the opportunity to cause the election to be run over and over until a certain outcome is achieved, possibly leading to mixed-strategy play or other problems, if multiple elections were allowed. See Stopping agents from "cheating" by Ching-To Ma, John Moore, and Stephen Turnbull (1988) for somewhat related problems. On page 4 of section 3, where a disjunction can be taken on the cost-free nature of agent A's signaling, the construction can analogously represent an ongoing lack of satisfaction that decreases some value per unit time, conditional on the result of the vote. If the vote went poorly, the resulting condition would leave the function unsatisfied and therefore cause a disutility charge to be taken for the time used to do the report. Note the risk aversion and the consideration of alternate uses of time. This allows a crossover point where active punishment based on mind reading would be required, if the agent were allowed to reject ratification and permanent implementation of the result.
121
Though, these outcomes will never be implemented.
122
This is because the pure strategy is the agent's preferred strategy under vNM. If no agent has its preferred strategy available, there is no veto.
123
On its face, etc.
124
With full inner and outer alignment, somehow.
125
Under the Carlsmith restriction.
126
"Significant" here is not used in the standard statistical sense.
127
Note that you generally won't find the exact sequence you are looking for if you start out with a single sequence in mind that you didn't get by first restricting to exact ChaCha (variant) sequences. This follows from a simple counting argument.
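Spelling the counting argument out (assuming the standard 256-bit ChaCha key plus some bounded number c of nonce/counter bits; the exact constants don't matter):

```latex
% There are at most 2^{256+c} distinct ChaCha(-variant) keystream prefixes of
% length n bits, versus 2^n possible n-bit sequences overall:
\[
  \frac{\#\{\text{ChaCha outputs of length } n\}}{\#\{\text{all length-}n\text{ sequences}\}}
  \;\le\; \frac{2^{\,256+c}}{2^{\,n}} \;=\; 2^{\,256+c-n},
\]
% which is astronomically small once n greatly exceeds 256 + c, so a sequence
% fixed without first restricting to ChaCha outputs is almost never an exact
% keystream.
```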
128
As accounted within the simulation.
150
A Philosophical Treatise of Universal Induction by Samuel Rathmanner https://doi.org/10.3390%2Fe13061076 Page 53