- Some thoughts on LessWrongianism
Having been derailed from work by a coworker inexplicably linking to Less Wrong, I find myself once more pondering Less Wrong. And I think I can put my finger on at least part of what I find off-putting.
When Less Wrong says Bayes theorem they really mean “be less confident in your judgement” / “react to new evidence.” These are incredibly common lessons you can find in lots of places, and it’s very good advice.
But Less Wrong wraps it up in Bayes theorem not as the mathematics but as the focal point of a mystery in the old Greek sense. It’s not really teaching anyone any probability theory; it’s using probability theory as a mace in order to bludgeon in “be less confident in your judgement” / “react to new evidence.”
Lots of Yudkowsky’s writing is full of the mystical language of revelation and the elevation of math to mystery (some more explicit than others). And at that point you aren’t actually using probability theory or math, you are just pointing at a thing you’ve named “MATH” (or in this case “BAYES”) that justifies your worldview.
- Some thoughts on LessWrongianism
There’s a lot of truth in that, and I’m happy to make fun of people who overuse Bayes’ theorem as a shibboleth, but I think at the heart of things there is a really important core connection. I was in the community pretty early and I got to see how it developed and a lot of it really did come from spiralling off of this core idea of using probability theory to ground cognition.
The basic derivation seems to start with the Bayesian idea that you can semi-rigorously assign numerical probabilities to one-time events like “The Republicans will win this year’s election”.
From there you get the idea that you can check if those probabilities are right or wrong by comparing large baskets of events - for example, if you predict one hundred elections with 99% confidence, but you were wrong about ten of them, you’re overconfident.
From there you get the idea that most people are provably overconfident about most things, but that calibration training and awareness of cognitive biases can provably make you less so (see http://lesswrong.com/lw/1f8/test_your_calibration/ ).
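A toy version of that calibration check, with made-up numbers, just to make the arithmetic concrete:

```python
# Made-up track record: 100 predictions, each stated at 99% confidence, 10 of them wrong.
n_predictions = 100
stated_confidence = 0.99
n_wrong = 10

observed_accuracy = (n_predictions - n_wrong) / n_predictions
expected_wrong = n_predictions * (1 - stated_confidence)

print(f"stated confidence: {stated_confidence:.0%}, observed accuracy: {observed_accuracy:.0%}")
print(f"misses expected at that confidence: {expected_wrong:.0f}, misses actually made: {n_wrong}")
# 90% observed accuracy against a stated 99% means the confidence was badly overstated.
```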
From there you get an interest in expected utility theory, since it says that you can make good decisions by having well-calibrated probabilities plus an idea of the value of things.
From there you get an interest in signaling and “belief in belief”, since it soon becomes clear most people are terribly calibrated and nobody cares. Since there are good ways of getting maximally accurate beliefs (like measuring the track record of pundits and institutions) but most people don’t use them, many institutions that purport to be about accurate beliefs must not be.
From there you get an interest in weird politics, where you try to come up with systems that can promote maximally accurate beliefs and reduce signaling spirals - for example, prediction markets.
From there you get an interest in all sorts of “antipredictions” where it seems most people are overconfident that a low-probability but super-interesting thing is not true (cryonics, singularity, et cetera - see http://squid314.livejournal.com/349656.html )
From there you get an interest in logical-positivist type stuff where you divide arguments into arguments over predictions (which are true at some probability) vs. arguments over semantics (which you have to disentangle to see if there’s a real disagreement over predictions)
And I know that actual Bayesian statisticians and computer scientists have taken Bayes’ theorem very far and created a whole field around it, and this field doesn’t look much like the “field” of Less Wrong, and so it causes a big disconnect when real Bayesian scientists hear LWers talk about “Bayes” to mean totally different things.
But I didn’t know anything about Bayes before contacting the rationalist community. For me learning that you could do sorta rigorous stuff with probabilistic predictions was pretty revelatory, and I mentally pegged all of the stuff that followed from this revelation as “Bayes-related”.
I don’t know if I’m agreeing with you or disagreeing with you here. It’s probably bad branding, but it’s bad branding that makes sense from the inside.
http://yudkowsky.net/rational/technical/ is kind of about this.
- Some thoughts on LessWrongianism
There’s a lot of truth in that, and I’m happy to make fun of people who overuse Bayes’ theorem as a shibboleth, but I think at the heart of things there is a really important core connection. I was in the community pretty early and I got to see how it developed and a lot of it really did come from spiralling off of this core idea of using probability theory to ground cognition.
From the outside, it seems like the community took a lot of fields that are loosely related because they use some ideas from probability theory, and then glossed over all subtlety by making a lot of informal, hand-waving references to the ideas. So what you end up with isn’t any sort of rigor, it’s just the illusion of rigor.
As a core example, using Bayesian updates/Bayes theorem is often intractable or just a bad idea in a lot of well-defined problems. In extremely messy real-world situations where the problems aren’t well defined, calculating likelihoods is impossible. There are a lot of cases where people can take any piece of evidence and spin a story about how that fits with their model- in those cases Bayes theorem is useless.
Pointing at it isn’t making your thinking more rigorous, all you are saying is “don’t be too confident, be willing to change your mind, look at the evidence” which is something everyone probably agrees with in principle.
Heck, there are well-defined problems where using subjective probability isn’t the best way to handle the idea of “belief”- when faced with sensor data problems that have unquantified (or unquantifiable) uncertainty, the CS community overwhelmingly chooses Dempster-Shafer theory, not Bayes/subjective probabilities.
I.e. a lot of Less Wrong ideas seem to take ideas from some community (the skeptical community, for instance) and then dress them up in the language of probability theory and pretend that using the probability theory words makes it more rigorous.
Also, you used “anti-prediction” in your post; I have no idea what that means, and it seems like you are using it to mean “prediction”?
- Some thoughts on LessWrongianism
"As a core example, using Bayesian updates/Bayes theorem is often intractable or just a bad idea in a lot of well defined problems. In extremely messy real world situations where the problems aren’t well defined, calculating likelihoods is impossible. There are a lot of cases where people can take any piece of evidence and spin a story how that fits with their model- in those cases Bayes theorem is useless."
You’re going to make fun of me, but it sounds like you’re making a Bayesian argument.
What you’re saying by “In these cases people can take any piece of evidence and spin a story about how that fits with their model” is: “The chance that I believe this fits with my model, conditional upon it being true, is equal to the chance that I believe this fits with my model, conditional upon it being false, therefore I should not update my estimate of its truth based on the observation that I believe this fits with my model.”
What’s more, I think I learned this on LW - I can even link you to exactly the post that makes this point - and that the post’s description of this as an insight of Bayesianism is correct. This kind of thing also gets explained in that Technical Explanation link I gave you.
This makes it hard for me to sympathize with your use of it as some kind of example of something Bayesianism is uniquely unsuited to handling, and by extension with your thesis that Less Wrong uses Bayesianism on problems that Bayesianism is uniquely unsuited for.
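Here’s a small numeric sketch of that point (all numbers invented): written as an explicit two-hypothesis Bayes update, evidence that is judged equally likely whether the claim is true or false leaves the prior exactly where it started.

```python
def posterior(prior, p_evidence_if_true, p_evidence_if_false):
    # Bayes' theorem for a single binary claim, written out in full.
    numerator = p_evidence_if_true * prior
    return numerator / (numerator + p_evidence_if_false * (1 - prior))

prior = 0.3  # invented prior for "my model is right"

# "I can spin a story either way": the observation is judged equally likely
# whether the model is true or false, so the likelihood ratio is 1.
print(posterior(prior, 0.8, 0.8))  # 0.3 -- no movement at all

# Genuinely diagnostic evidence, by contrast, actually moves the number.
print(posterior(prior, 0.8, 0.2))  # about 0.63
```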
Re: Antipredictions, see http://lesswrong.com/lw/wm/disjunctions_antipredictions_etc/
Re: David Chapman, see my response to him at http://slatestarcodex.com/2013/08/06/on-first-looking-into-chapmans-pop-bayesianism/
- Some thoughts on LessWrongianism
You’re going to make fun of me, but it sounds like you’re making a Bayesian argument.
Only because you guys use “Bayesian” to encompass this huge nebulous cloud of ideas. I suspect I’d be hard-pressed to write about probability theory in a way that wouldn’t fit some idea you cover by the word “Bayesian.”
It might be helpful to separate Bayesianism into two components- the subjective interpretation of probability, and Bayes theorem/updating. These are independent pieces, we can have one without the other.
Your response was to my argument against the second of these, Bayesian updates.
What you’re saying by “In these cases people can take any piece of evidence and spin a story about how that fits with their model” is: “The chance that I believe this fits with my model, conditional upon it being true, is equal to the chance that I believe this fits with my model, conditional upon it being false, therefore I should not update my estimate of its truth based on the observation that I believe this fits with my model.”
I think you missed my point a bit, I’m not talking about events which are just as likely under a model as under NOT the model. Here we can do a Bayes theorem update, but the weight of the update is just 1, so nothing changes.
What I’m talking about are models that are so nebulous that it’s unclear how to even calculate the likelihood, so there is no way to use Bayes theorem to do an update.
Instead, we can:
1. formulate events that are impossible under the model (Popper’s falsification), and if you observe one of those events you toss the model (the likelihood of a non-charge-conserving decay under the standard model, for instance, is 0). Here I don’t need detailed predictions, but I do need something like conservation laws, so this opens up a wide class of models to falsification.
2. formulate a boring null hypothesis that has the virtue of being well-defined, and calculate the likelihood of your data under that model (Fisher’s hypothesis testing). If the data is fairly likely under the boring hypothesis, decide that there is nothing to be explained by a new model. Here I don’t need a well-defined exploratory model, I just need a well-defined boring model (null hypothesis). This is often a much easier goal (e.g. in particle physics, the “standard model” is well defined in a way that “extra dimensions” or “supersymmetry” are not).
Even though we are still talking about subjective probability, we aren’t using priors or Bayes theorem at all. And notice these are sort of the default positions on which science was built. This isn’t because people were unaware of Bayes theorem; it’s because it’s limited enough that the default assumption is that Bayes isn’t going to be a good approach.
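A minimal sketch of option 2, with made-up numbers: a well-defined boring model (say, a known 30% background remission rate) lets you ask how surprising the data would be under it, with no priors and no alternative model anywhere in sight.

```python
from math import comb

def binom_tail(n, k, p):
    # P(X >= k) for X ~ Binomial(n, p): the chance of data at least this extreme.
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# Made-up numbers: 30% of patients recover on their own (the well-defined null),
# and we observe 11 recoveries among 30 patients given some treatment.
p_value = binom_tail(n=30, k=11, p=0.30)
print(f"P(data at least this extreme | boring model) = {p_value:.2f}")  # about 0.27
# The boring model explains the data fine, so there is nothing demanding a new
# model -- and at no point did we need a prior or a well-defined alternative.
```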
Now, there is a separate question of “is the subjective interpretation of probability always the best way to think about belief?” I think the answer here is, at least in practice, no. There are very helpful mathematical generalizations of probability that are used in real-world problems where subjective probability runs into issues. I can write more about that sometime, but it’s a pretty big subject.
- Some thoughts on LessWrongianism
Isn’t your 1 a special case of Bayes’ Theorem? If there’s something which, conditional upon the theory being true, has probability zero, then when we observe it we make an infinitely large update away from that theory. In fact, the link I sent you makes exactly this point: “[We see that] Karl Popper’s insight that falsification is stronger than confirmation translates into a Bayesian truth about likelihood ratios. Popper erred in thinking that falsification was qualitatively different from confirmation; both are governed by the same Bayesian rules.”
2 also seems like a special case of Bayes, in that you’re calculating P(data|null) with a vague qualitative understanding that the prior of the null hypothesis is pretty high, and the prior of the experimental hypothesis is at least high enough that the level of update caused by p < 0.05 is enough to make it rise to your attention. In fact, when you’re not keeping a form of qualitative Bayes in the back of your mind when doing null hypothesis testing, it has the potential to go very wrong - see for example Ioannidis on the failure modes of NHST in medicine.
Again, I think we are talking about two different things here. You are saying “When you are a statistician dealing with specific data, there are often techniques which are more useful than Bayes”, and I agree, as I assume does everyone else, including Eliezer.
I am saying “Qualitative Bayesian reasoning is a very powerful tool for helping us understand what we are doing when we are doing science, philosophy, statistics, or normal intuitive thought.” It may be that there are other ways of getting that level of understanding, but Bayes worked for me. It is the reason I don’t have to remember things like “You can’t just use NHST on very unlikely hypotheses you pick from throwing at a dartboard” as maxims I memorized from a book, but instead can actually understand why they’re obviously true. Maybe there are other paradigms that can do that, but Bayes seems to be the one that fits my natural way of thinking the best.
- Some thoughts on LessWrongianism
Isn’t your 1 a special case of Bayes’ Theorem? If there’s something which, conditional upon the theory being true has probability zero, then when we observe it we make an infinitely large update away from that theory. In fact, the link I sent you makes exactly this point: “[We see that] Karl Popper’s insight that falsification is stronger than confirmation translates into a Bayesian truth about likelihood ratios. Popper erred in thinking that falsification was qualitatively different from confirmation; both are governed by the same Bayesian rules.”
No, it’s not a special case; it’s a different question being asked of the data entirely. Where Bayes theorem works, both will agree, but the point of falsification is that it can be used where we can’t calculate the likelihood needed for Bayes. P(data|model) can be generally undefined, but we still might be able to falsify by finding the few places where P(data|model) = 0, and search those out specifically.
Summary- For Bayes, you need P(data|model) for multiple models, to compare, and a prior distribution over the models. For falsification, you need only one model, you need no prior distribution, and you only need some places where P(data|model) = 0; you don’t need the full P(data|model). If the ingredients for Bayes aren’t there, you might still have the ingredients for falsification.
The second one also seems like a special case of Bayesian reasoning, in that you’re using P(data|model) to update your prior level of confidence in the model - with possible catastrophic failure modes based on the model having too low a prior (which constantly haunts the field of medicine, as pointed out most cogently by Ioannidis)
No, I’m thinking of cases where P(data|model) (really what we want is P(model|data) via Bayes, so we need P(data|model) for every model with non-zero prior) can’t be calculated, but P(data|null-model) can be calculated. If P(data|null-model) is pretty high, we can say “well, maybe we don’t need a new model, the boring one is probably ok.” There is no prior involved here.
In cases where P(data|model) IS calculable, we can make it fit with Bayes theorem by putting most of our prior weight on the boring hypothesis, and then it will look like Bayesian reasoning.
Summary- Once again, for Bayes you need multiple models where you can calculate P(data|model), and a prior distribution over those models. For hypothesis testing you only need P(data|null-model). If the ingredients for Bayes aren’t there, but you have a well defined null hypothesis, you can still hypothesis test.
Where Bayes works well, it’s going to give you a lot more information than hypothesis testing or falsification, but it requires a lot more to work.
Also worth noting - calculating P(data|model) is just probability theory. We aren’t using Bayes theorem until we start comparing models and using them to reweight priors.
You are saying “When you are a statistician dealing with specific data, there are often techniques which are more useful than Bayes”
I’m suggesting that once you move outside of the toy-model-for-explanatory purposes, the P(data|model) is rarely something you can actually calculate. I’d go so far as to say statisticians dealing with very narrow problems are the people most likely to encounter situations where Bayes theorem is actually helpful.
Beyond that, usually you can’t do anything like the updates, because the models we use are too vague.
I am saying “Qualitative Bayesian reasoning is a very powerful tool for understanding what we are doing when we are doing science, philosophy, or thought.”
This conversation is actually making me think that mentally you are lumping all of probability theory into “qualitative bayesian reasoning.” In which case, I guess I agree, but there is a huge language problem?
Worth considering- Bayes theorem will never leave you without a model. Falsification and hypothesis testing can. If you falsify your only model, or reject a null with no alternatives, all you have is confusing data.
With Bayes theorem, if you falsify one of your two models (you need at least two or you can’t even use Bayes sensibly), the probability of the second model will jump to 1, even if it’s very unlikely.
EDIT: The broad point here is that when you say “oh, I saw X, I’m going to update my priors” this isn’t actually justified by Bayes theorem; you might as well just say “huh, that’s surprising, I’m going to rethink my position a bit.”
Bayes theorem isn’t getting you anywhere here- you aren’t holding multiple competing worldviews in your head with different levels of certainty and reweighting them accordingly as data comes in. You’ve dressed up some ideas you can find in any pop-science book (always check your ideas against reality. If reality doesn’t match your ideas, change your ideas, etc) and put an unnecessary layer of vocabulary around them.
- Some thoughts on LessWrongianism
> No, it’s not a special case; it’s a different question being asked of the data entirely. Where Bayes theorem works, both will agree, but the point of falsification is that it can be used where we can’t calculate the likelihood needed for Bayes. P(data|model) can be generally undefined, but we still might be able to falsify by finding the few places where P(data|model) = 0, and search those out specifically.
I can’t solve a quintic equation in my head, but when the coefficients of x^5, x^4, x^3, and x^2 are 0, I can solve the resulting linear equation. That doesn’t mean that linear equations aren’t a special case of higher-degree equations, it just means they’re an especially tractable case.
From a philosophical point of view, what we’re trying to do with Popperian falsificationism is impossible. P(data|model) is never exactly zero, even if it’s just the Cartesian case of “an evil wizard is creating the data to mislead you”. The reason falsificationism works *anyway* is that we’re doing something qualitatively Bayesian on very very very small probabilities.
Suppose you believe a certain drug cures cancer 100% of the time. Then you give the drug to somebody. Their cancer does not get better. p(data|model) = 0. You perform a Bayesian update. Using Bayes’ Theorem, the probability of the model, given the data, equals…well, on the numerator of Bayes’ theorem you’ve got probability(data|model), which is zero, so given a nonzero denominator it’s going to come out to zero regardless. So p(model|data) = 0; since you’ve got the data, p(model) = 0, so the model is falsified. So you shift your probability to a different model where the drug does not cure the disease.
If we want to introduce an evil wizard, we just say p(data|model) = 10^-20 - i.e. zero except in the unlikely case where an evil wizard is meddling with results. Now the numerator of Bayes’ Theorem contains 10^-20 times something, so we know the resulting p(model|data) will be very very very small unless our prior for the model was implausibly huge. So it still works.
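The same paragraph in numbers (with an invented prior), just to show that the prior stops mattering once the likelihood hits zero, and barely matters at 10^-20:

```python
prior = 0.5                  # invented prior on "this drug cures cancer 100% of the time"
p_fail_if_cures = 0.0        # under the 100%-cure model, a non-recovery is impossible
p_fail_if_not = 0.9          # under "it doesn't always cure", a non-recovery is unsurprising

numerator = p_fail_if_cures * prior
print(numerator / (numerator + p_fail_if_not * (1 - prior)))   # 0.0: falsified, prior irrelevant

# The evil-wizard version: the observation is merely almost impossible under the model.
p_fail_if_cures = 1e-20
numerator = p_fail_if_cures * prior
print(numerator / (numerator + p_fail_if_not * (1 - prior)))   # about 1e-20: effectively dead
```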
> No, I’m thinking of cases where P(data|model) (really what we want is P(model|data) via Bayes, so we need P(data|model) for every model with non-zero prior) can’t be calculated, but P(data|null-model) can be calculated. If P(data|null-model) is pretty high, we can say “well, maybe we don’t need a new model, the boring one is probably ok.” There is no prior involved here.
Once again, this seems like a special case of Bayes where you’re making it look like not-Bayes in the same way you can make a quintic equation look like not-a-quintic-equation.
Let’s say what we’re talking about is whether grape juice cures cancer. You give some people grape juice and you find their cancer has gotten better. But then you do some NHST and you find that p = 0.50, because this is a very minor form of cancer that usually gets better on its own anyway. Therefore you say that this experiment fails to provide enough evidence to reject the null hypothesis.
But this experiment, on its own, doesn’t give us the *slightest* amount of information about how strongly we should believe grape juice cures cancer. If it’s the one millionth replication of a seminal study that proved that it did, and 95% of replications confirm the result and 5% fail to reject the null hypothesis because of insufficient power, I’m still going to believe that grape juice very likely cures cancer. If scientists have discovered an extremely plausible mechanism by which grape juice should cure cancer, and your study was very small, even after you “fail to reject the null hypothesis” I might still believe grape juice cures cancer. On the other hand, in the real world, even if you soundly reject the null hypothesis, my very low prior on grape juice curing cancer means that I’m going to assume you bungled something until you get replicated several times, preferably in a large multi-center trial.
And if you use a GWAS to find that some gene causes schizophrenia with p = 0.04, I’m going to laugh in your face and tell you there are 25,000 genes so 1,000 of them are going to achieve that level by pure chance. If you tell me that weird priming effects exist at some probability, I’m going to crack open your file drawer and find the nineteen studies that say it doesn’t but which never saw the light of day because of publication bias.
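The GWAS arithmetic in that sentence, spelled out with the post’s round numbers:

```python
n_genes = 25_000      # round number of genes tested
p_threshold = 0.04    # the reported p-value

# If no gene had any real effect, this many would still clear p < 0.04 by chance:
print(n_genes * p_threshold)   # 1000.0
# This is why genome-wide studies demand far stricter significance thresholds
# (conventionally around 5e-8) before a single hit is taken seriously.
```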
In other words, when you do NHST, you’re artificially walling off a tiny bit of the scientific process and calling it “the experiment”, then saying “the experiment doesn’t use Bayes!”, then letting everyone else use qualitative Bayesian reasoning to convert the experiment into actual knowledge. Until you do the qualitative Bayes, your “rejecting/failing to reject the null hypothesis” in your experiment doesn’t translate into statements about the world.
> I’m suggesting that once you move outside of the toy-model-for-explanatory purposes, the P(data|model) is rarely something you can actually calculate. I’d go so far as to say statisticians dealing with very narrow problems are the people most likely to encounter situations where Bayes theorem is actually helpful.
How is that different from saying that once you move outside of toy models of a couple of particles, quantum mechanics is rarely something you can actually calculate, and physicists dealing with very tiny systems are the people most likely to encounter situations where quantum mechanics is actually helpful?
Yes, trying to design an airplane using quantum mechanics would be stupid. You should design it using much higher-level concepts like aerodynamics. That doesn’t mean quantum mechanics doesn’t describe the behavior of airplanes.
I continue to think you’re trying to come at this from an engineering perspective and I’m trying to come at it from a philosophical perspective, and you keep telling me my philosophy is useless for engineering. I’m not denying that.
> This conversation is actually making me think that mentally you are lumping all of probability theory into “qualitative bayesian reasoning.” In which case, I guess I agree, but there is a huge language problem?
I agree there is some kind of weird language disconnect.
I’m using “qualitative Bayesian reasoning” to mean “reasoning that acknowledges that beliefs can be represented as subjective probabilities, and those subjective probabilities can be updated according to mathematical laws”. Is that different from how you would use it?
> Worth considering- Bayes theorem will never leave you without a model. Falsification and hypothesis testing can. If you falsify your only model, or reject a null with no alternatives, all you have is confusing data. With Bayes theorem, if you falsify one of your two models (you need at least two or you can’t even use Bayes sensibly), the probability of the second model will jump to 1, even if its very unlikely.
So there are a few problems here. First of all, in ideal Bayes you should never be able to falsify models, just make them very very unlikely. The evil wizard problem again.
Second of all, I think everyone on Less Wrong agrees Bayesian reasoning only works perfectly when you assume you are considering all possible models - this is why hypothetical Bayesian engines like AIXI usually use infinite computing power and are not possible in the real world. Depending on how close you come to that assumption, your Bayesian reasoning may work less well or not at all. This is part of what I mean with the engineering-philosophy disconnect - I assume a smart engineer would avoid plans that need an infinite number of inputs to work. On the other hand, if you’re a philosopher, then finding a process that works perfectly when given infinite inputs means you’re on track to figuring out the structure of an idealized problem.
> Bayes theorem isn’t getting you anywhere here- you aren’t holding multiple competing worldviews in your head with different levels of certainty and reweighting them accordingly as data comes in. You’ve dressed up some ideas you can find in any pop-science book (always check your ideas against reality. If reality doesn’t match your ideas, change your ideas, etc) and put an unnecessary layer of vocabulary around them.
I’m not sure that’s true. I mean, holding multiple competing worldviews at different levels of certainty and reweighting them with new data seems to be *exactly* what doctors do all the time.
"Well, this patient could have bacterial meningitis, or viral meningitis, or a brain tumor, or maybe it’s something extremely rare I’ve never heard of. But it’s probably viral meningitis, because that’s by far the most common of these options."
"No, the patient’s getting much more sick than I expected from a viral meningitis. That means it’s probably bacterial meningitis or a tumor."
"Well, I did the test for bacteria, and the test for a tumor, and those were both negative. Now I’m starting to think it’s something rare I’ve never heard of. I know that’s a priori unlikely, but it’s even more unlikely that all of these tests would come out negative by mistake. I’m going to refer this patient to the specialist in rare neurological diseases."
Yes, obviously pop science books and traditional wisdom contain some of the same insights as Bayes, in the same way that traditional martial arts contains a lot of the same wisdom as really high-tech sports medicine that measures the exact amount of force produced by each muscle. And some people learn better by hearing vague traditional wisdom.
Other people learn better by seeing the traditional wisdom mathematized and knowing exactly what it is they’re doing when they’re using their various heuristics.
My own story is that I’ve corrected misdiagnoses of really serious diseases by some doctors much older and more experienced than I am because I’m familiar with the Bayes mammogram problem and they weren’t (LET ME TELL YOU ABOUT SCREENING TESTS SOMETIME). *They* probably knew that you should “always check your ideas against reality”. *I* knew that if you have a very low prior probability and you update based on evidence with only moderately more likelihood in the disease state than the healthy state you’re still going to have a pretty low posterior probability.
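For reference, the arithmetic behind that mammogram problem, using the usual textbook numbers (a real screening test’s numbers will differ):

```python
# Classic illustrative figures: 1% of screened women have the disease, the test
# catches 80% of true cases and false-alarms on 9.6% of healthy women.
prevalence = 0.01
sensitivity = 0.80
false_positive_rate = 0.096

p_positive = sensitivity * prevalence + false_positive_rate * (1 - prevalence)
p_disease_given_positive = sensitivity * prevalence / p_positive
print(f"{p_disease_given_positive:.1%}")   # about 7.8%: a positive screen alone
                                           # still leaves the disease quite unlikely
```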
This sort of thing is probably why they’ve started teaching Bayesian reasoning in some medical schools. I feel grateful that I had a head start by being part of a community whose way of teaching it is frankly about a zillion times more interesting.
- su3su2u1
I can’t solve a quintic equation in my head, but when the coefficients of x^5, x^4, x^3, and x^2 are 0, I can solve the resulting linear equation. That doesn’t mean that linear equations aren’t a special case of higher-degree equations, it just means they’re an especially tractable case.
Do you REALLY think I don’t understand what a special case is?
Let’s see if we can reach some agreement on the following points- do you agree or disagree with the following statements:
1. To do a Bayesian update, in principle, in an ideal situation I need multiple well defined models.
2. To do a Bayesian update, in principle, in an ideal situation I need an expression for P(data|model) for each model.
And do you agree with the following:
3. To do falsification, in principle, in an ideal situation I do not need multiple well defined models.
4. To attempt falsification, in principle, in an ideal situation I do not need P(data|model) in general, only at least one case where P(data|model) = 0.
And do you agree with the following:
5. To do Fisher hypothesis testing, in principle, in an ideal situation I do not need multiple well defined models
6. To do Fisher hypothesis testing, in principle, in an ideal situation I do not need P(data|model), I only need P(data|null-model).
And finally:
7. Given that I need fewer ingredients to do hypothesis testing and falsification, they can both be applied when it’s impossible to apply Bayes theorem.
This is a double edged sword- falsification and hypothesis testing are weaker techniques. Falsification is only useful when you falsify. Null hypothesis testing is really only useful when you fail to reject the null. They are only filters to weed out stuff you don’t need to bother with too much. But they are much more generally applicable than Bayes, even in statistical problems.
- slatestarscratchpad
(sorry for terrible formatting, am on work computer)
» Do you REALLY think I don’t understand what a special case is?
No! I think you’re really smart and know much more than I do! But I keep saying things that make sense to me, and you keep sort of not addressing them and saying confusing things I don’t find relevant, so I want to test every one of my assumptions to make sure we’re not talking past each other. I’m sorry if I upset you, and I’m getting kind of exasperated too, so I’ll answer your questions here and then bow out of this one.
» Let’s see if we can reach some agreement on the following points- do you agree or disagree with the following statements:
» 1. To do a Bayesian update, in principle, in an ideal situation I need multiple well defined models.
» 2. To do a Bayesian update, in principle, in an ideal situation I need an expression for P(data|model) for each model.
Like I said, you know much more than me, so if you say these things are true I believe you. However, as I said before:
"Suppose you believe a certain drug cures cancer 100% of the time. Then you give the drug to somebody. Their cancer does not get better. p(data|model) = 0. You perform a Bayesian update. Using Bayes’ Theorem, the probability of the model, given the data, equals…well, on the numerator of Bayes’ theorem you’ve got probability(data|model), which is zero, so given a nonzero denominator it’s going to come out to zero regardless. So the p(model|data) = 0, since you’ve got the data p(model) = 0, so the model is falsified"
This looks Bayesian to me. It even uses Bayes’ theorem explicitly. It also seems to only involve one model. It also seems to be the exact same as falsification, but also leaves room for doing things falsification can’t. So my answer is “While I trust you when you say these things, this sure looks like a Bayesian update to me.”
3 and 4 I agree with.
5 and 6 I agree with, with the same caveat I added last time.
7 I agree with, given that we’re using “impossible” to mean “practically impossible” rather than “impossible even for God”.
» This is a double edged sword- falsification and hypothesis testing are weaker techniques. Falsification is only useful when you falsify. Null hypothesis testing is really only useful when you fail to reject the null. They are only filters to weed out stuff you don’t need to bother with too much. But they are much more generally applicable than Bayes, even in statistical problems.
I don’t think I’ve ever denied any of this. I keep saying again and again that I agree that other techniques often beat Bayes in many different applications. I happily admit it again.
I think we’re talking past each other and should probably stop, except that I would like to hear your explanation of why the pseudo-Bayesian update I’m describing in your 1 & 2 doesn’t work.
- su3su2u1
I think the core of our disagreement is two fold:
1. I think we disagree on what exactly “Bayes” means. I think you have a notion that Bayes = probability theory, and I look at Bayes theorem as a somewhat narrow result that comes out of probability theory. I think we would both agree that everyone should know some probability theory; it’s very useful.
2. I think that there are situations where “God” couldn’t use Bayes theorem (assuming “God” is limited to the computational processes of the physical universe, if you allow “God” to compute the uncomputable then all bets are off), you think that Bayes is always ideally applicable even if its not practically applicable.
EDIT: 2a. We seem to agree that Bayes isn’t applicable in a lot of very narrow statistical situations. I assert that if you accept that Bayes isn’t applicable in very narrow statistical situations where everything is cleanly defined, you should doubt its applicability in larger, more complicated situations where you can’t even write down a model.
I’m sorry if I upset you, and I’m getting kind of exasperated too, so I’ll answer your questions here and then bow out of this one.
It’s not that I’m upset, just a tad exasperated. From my perspective, you keep bringing up situations where we both agree Bayes does work, and I’m thinking “wait… why is this coming up?” Where Bayes is useful, it’s nice.
Regarding Bayes with only one model:
Like I said, you know much more than me, so if you say these things are true I believe you. However, as I said before: “Suppose you believe a certain drug cures cancer 100% of the time. Then you give the drug to somebody. Their cancer does not get better. p(data|model) = 0. You perform a Bayesian update. Using Bayes’ Theorem, the probability of the model, given the data, equals…well, on the numerator of Bayes’ theorem you’ve got probability(data|model), which is zero, so given a nonzero denominator it’s going to come out to zero regardless.
If you only have the one model, and no competing hypothesis, Bayes theorem reduces to
P(A|B) = P(B|A)P(A) / P(B|A)P(A) = 1
You are also forced into a prior of 1, because you only have one model on which to put weight. So you end up unable to do any updating; the whole thing is just the tautology 1 = 1.
In the case of your cancer example, we can rescue Bayes theorem by creating a dummy model: along with “this drug cures all cancer,” you have “this drug does not cure all cancer.”
Then you can use a Bayesian update, and all the weight falls into “this drug does not cure all cancer,” but that’s just a trick to rescue things; you are guaranteed P(“does not cure all cancer”) = 1 - P(“does cure all cancer”), so we are using Bayes, but there is no actual information contained in the second weight.
In this case, only one of the weights contains information, and the prior distribution is totally irrelevant, but we can rescue the formula I guess.
- nostalgebraist
Quick note on the su3su2u1 / slatestarscratchpad Bayesianism debate of the last few days:
Bayesian philosophy of science has been one of my pet interests for a while, and I’ve had many (fun, non-acrimonious) arguments with several Bayesians on here. (Note: none of this makes me any kind of expert — I really know pretty little about all this though it all makes me curious.)
And the more I read and talk about this stuff, the more it seems that the biggest gulf between “philosophical Bayesians” and their opponents is over the fact that “philosophical Bayes” requires you to have a prior. The fact is, Bayesianism is a recipe for thinking when you have a prior. If you don’t have a prior, or only have something that’s not quite a prior (like “these are my candidate theories, and I have no idea what probability to assign to all the ones I haven’t thought of yet”), Bayesianism tells you “come up with a prior so you can be more like me, the Ideal Account of Reasoning.”
It can be very easy to forget that this is a problem, for several reasons:
1. There really are cases (like the medical statistics problems) where you really do have a prior and knowing how to use Bayes’ theorem really does help you. But that doesn’t mean you should build your whole philosophy of science on assuming that every question is like a medical statistics problem, even the ones that aren’t.
2. Because the prior is this potentially huge set of “free parameters,” this thing that can be anything you wish (as long as it’s a probability measure), it’s easy to reconstruct almost any act of sensible reasoning as an act of “Bayesian reasoning” with some prior. John von Neumann famously said “with four parameters I can fit an elephant, and with five I can make him wiggle his trunk.” If you are allowed to make up a story about what someone’s prior was, then you can almost always recast their thinking as a “Bayesian update.” Deborah Mayo has a vivid way of describing this procedure:
Bayesian reconstructions of episodes in the history of science, Mayo says, are on a level with claiming that Leonardo da Vinci painted by numbers since, after all, there’s some paint-by-numbers kit which will match any painting you please. (Source)
Or, to put it another way, in some weird sense “all painting is a special case of painting by numbers” — you can take any act of painting and say, “ah, you say you aren’t painting by numbers, but I can reconstruct this as an instance of painting-by-numbers in which you had the paint-by-numbers kit that told you to produce exactly the painting you produced.”
The problem is of course that this gives you no good advice about how to create new paintings. ”All painting is a special case of paint-by-numbers, so to create a good painting I should find a good paint-by-numbers kit and use it” is actively bad advice for fledgling painters.
Or, to translate back to Bayes: “any act of good reasoning can be reconstructed as a Bayesian update with some prior” does not imply “to reason well, find a good prior and then update.”
Ultimately, the core issue for philosophical Bayesianism (IMO) is whether forcing yourself to have a prior is a good idea. In situations where you already have one, like the medical statistics problems, this isn’t an issue. If your state of knowledge looks nothing like a probability measure over hypotheses, and you say “Bayesian reasoning is ideal so I should be more like it by turning my state of knowledge into a probability measure over hypotheses," this could have good or bad effects. If it generally has bad effects, then Bayes isn’t a very good ideal.
This means that the “philosophy vs. engineering perspectives” distinction is somewhat misleading. In the ideal fantasy land where you always have a prior, Bayesianism is not only ideal, it is sort of trivial. It’s just what you do. In the real world, trying to approximate the Bayesian ideal means forcing yourself to have a prior when you don’t start out with one. If this is generally a bad idea, then Bayesianism isn’t an ideal, not even a philosophical one.
Even idealized philosophical realms have to make some contact with reality, or else everything quickly becomes absurd: it’s no use to, say, postulate an “idealized” world in which the only thing anyone cares about is having sex (defensible as an idealization since having sex really is very important to many people), and then derive real-world consequences like “we should direct all human behavior toward having as much sex as possible, ignoring all other activities, until everyone dies of starvation because the food has run out.”
So, if you make an idealization, and it works in ideal fantasy land, but the “engineering” consequences of trying to reach that ideal suck, then maybe it wasn’t such a good idealization.
(ETA: technical note — I said “as long as it’s a probability measure” above but sometimes it doesn’t even need to be that, as in the case of “improper priors.” This isn’t really relevant to this post at all, I just wanted to mention it for technical accuracy)
- su3su2u1
However, if we go back to the hypothetical me who lives before the Michelson-Morley experiment, I think you’d find an odd pattern in the answers you’d get. If H = “Newtonian mechanics works at all scales,” and you try to elicit P(A|H) for various A, I think you’d get pretty nice, coherent results. Because I could actually (in some cases) write down the physics equations relevant to A and use them and some probability theory to compute P(A|H). My answers would look like a real probability distribution — no conjunction fallacies or the like — because they’d been derived using actual math.
But what if you try to elicit P(A|~H)? You’d get a mess. Because I would really have no idea what “the world under ~H” looks like. Trying to actually confront that question would be a mammoth task: I’d have to invent every one of the infinite (?) number of possible theories that limit to Newtonian kinematics at human scales, and find some way to weigh the likelihoods of these against one another, and work out what they predict about A, which in some cases might be mathematically intractable.
It’s this case that I was trying to get at in my discussion with slatestarscratchpad. You have one really well-defined model (current physics), and then a nebulous bog (whatever else could possibly be). So P(A|H) is easy to calculate, but P(A|~H) just isn’t well defined at all.
Even if you start trying to specify each individual piece of ~H, you’ll run into problems: because ~H isn’t well defined, you have free parameters that don’t exist in H, so the best model in ~H will a priori fit the data better than H (imagine a model that takes Newtonian mechanics and grafts on extra parameters).
This means all the action is in how you set the prior. You can try to penalize complexity, but that becomes very hard.
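A toy version of that grafted-on-parameters point, before the particle-physics example below (made-up data, nothing to do with SUSY itself): a model with extra free parameters never fits the same data worse, so the comparison really does hinge on the prior.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 20)
y = 2.0 * x + 1.0 + rng.normal(scale=0.1, size=x.size)   # data actually generated by a straight line

for degree in (1, 5):   # the "true" model vs. one with extra grafted-on parameters
    coeffs = np.polyfit(x, y, degree)
    sse = np.sum((np.polyval(coeffs, x) - y) ** 2)
    print(f"degree {degree}: sum of squared residuals = {sse:.4f}")
# The degree-5 fit always matches this data at least as well as the straight line,
# even though the line is the true model, so any penalty for the extra parameters
# has to come from the prior.
```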
Let’s take an example- the standard model vs. the supersymmetric standard model. The minimal supersymmetric standard model (I’ll use SUSY for supersymmetric) has more than 200 parameters (the regular standard model has about 20).
However, the standard model Lagrangian can be generated from Lorentz symmetry + SU(3)×SU(2)×U(1) + particle content. The SUSY version can be generated from generalized Lorentz symmetry + SU(3)×SU(2)×U(1) + particle content (the difference between Lorentz symmetry and generalized Lorentz symmetry is that you use a Lie group for the regular standard model, and a graded Lie group for SUSY). The graded Lie group is actually a more general structure, which means there is a sense in which it’s less complex.
Many, many theorists argue that SUSY “is too elegant to not be true,” which is just their way of saying it’s actually less complex, so stick a larger prior on it. Lots of theorists would put a larger prior based on elegance on SUSY vs the regular standard model, and because of all the free parameters, the SUSY version fits the data better. But is this really a fair comparison? After all, the 20 parameters of the standard model get a good fit to all the data. A Bayesian could declare victory despite the fact that no supersymmetric partner has ever been seen (or even hinted at in the data).
And yet, the SUSY theorists don’t go around crowing victory, because particle physics operates on a falsification model. Before you can claim the standard model has failed, you need an unexplained observation that is at least 5 sigma away from the standard model prediction.
And this same situation is true of every single beyond-the-standard-model theory. Any anomaly that comes out of the LHC will have every theorist claiming their theory fits it, and they will all be right, for exactly this reason (going beyond the standard model implies more free parameters than the standard model).