furioustimemachinebarbarian:

nostalgebraist:

justhere4coffee:

When people call you a “snowflake” just remember they’re quoting Fight Club, a satire written by a gay man about how male fragility causes men to destroy themselves, resent society, and become radicalized, and that Tyler Durden isn’t the hero but a personification of the main character’s mental illness, and that his “snowflake” speech is a dig at how fascists use dehumanizing language to breed loyalty from insecure people.

So basically people who say “snowflake” as an insult are quoting a domestic terrorist who blows up skyscrapers because he’s insecure about how good he is in bed.

The thing about this is – to write a good satire you need to make it close to reality (in some ways, to some extent).  Which means you run the risk of creating something that your targets still find appealing.

Every other argument about the quality of Fight Club aside, I think it’s an important movie because it captured a thing that’s out there, which appeals to a certain large subsection of the male population.  The fact that this subsection celebrates an ambivalent-at-best depiction of the thing suggests that there wasn’t anything else out there crystalizing the same thing with as much accuracy – if there were more unambiguously positive depictions of the thing, you’d think their popularity would have swamped Fight Club’s.

Yeah, maybe it’s making fun of guys who like the thing (and maybe without their knowledge), but it also revealed that there are a lot of those guys, and showed us exactly what it is that they like.  (I’m not saying there weren’t other movies about masculinity; as I said, FC crystalized something more specific.)  Even if you think the movie’s a satire, the humor of “these guys misinterpreted a movie, lol owned” is outweighed for me by the gravity of the realization “these guys exist, they aren’t going away, and they do unironically want the thing.”  Feels like they get the last laugh, here.

The problem with Fight Club is that while the text of the movie is obvious satire, the tone of the movie is seductive as hell.  So even to the end, everything about Durden “feels” cool, rebellious and seductive.  It’s a movie that seems almost purposefully constructed to make the audience miss the point.

Wall Street is another movie that I think sort of falls into that camp.  A certain kind of person still uses “greed is good” non-ironically.

Also, I’m not sure there can be “positive depictions of the thing” in Fight Club because the core of it seemed pure, nihilistic rebellion without cause.  

Yeah, I guess “positive depictions of the thing” would be more like “negative depictions of everything else.”

The nihilism of the characters comes from a wholesale rejection of society, a rising indignation or disgust towards normal life that eventually grows so far (“hitting bottom”) that they come to think anything sufficiently abnormal is preferable, even joining a nihilistic cult.  But the movie does nothing to make the viewer feel any strong negative feeling toward society.  The movie depicts normal life by throwing together stock anti-conformity tropes (boring job, boring boss, buying furniture) without going out of its way to make them seem repulsive or wrong the way the characters do.

The alternative I’m picturing would focus a lot more creative energy on the depiction of society and normalcy – making the suits and dads seem over-the-top horrible, grotesque, unendurable, and depicting the Durden stuff as an oasis, an “at least, not that.”

(Source: facebook.com)

My attempt to install tensorflow-fold thus far:

(1) follow steps on the repo’s installation page, which only gives instructions for binary installation, and links to the file for a specific build rather than a general binaries page

(2) try “import tensorflow_fold,” and get

undefined symbol: _ZN10tensorflow6tensor5SplitERKNS_6TensorERKNS_3gtl10ArraySliceIxEE

(3) google, find a thread where someone else had the problem, it turns out their version of tensorflow was too new to be compatible with the binary, they are advised to downgrade

(4) downgrade to tensorflow 1.0, now fold works but the activation function I used in my code isn’t in tensorflow 1.0

(5) google search with terms like “tensorflow fold install,” after clicking on several Q&A threads I click on something that turns out to be the fold repo’s “how to install from source” page, which apparently exists although there weren’t any links to it on the other documentation pages

(6) installation instructions tell me I need something called “Bazel” (“Google’s own build tool, now publicly available in Beta”), so I install that

(7) apparently I am supposed to use Bazel to … make a pip wheel for fold, so I can install it with pip locally?  like, it’s not a registered PyPI package, but it still needs to get pip installed?

(8) 

You also need to build a pip wheel for TensorFlow. Unfortunately this means we need to rebuild all of TensorFlow, due to known Bazel limitations (#1248).

(9) the machine is currently building tensorflow from source, something it has never had to do before, because it just had a … pip wheel … for it … … 

Like I know this is research code, I’m not complaining, it’s just been such a weird ride, man

eggcup:

concept of a thieves guild: cool

reality of a thieves guild: tumblr shoplifting fandom 

“Lifter”?  I PREFER the term “treasure hunter”!

One is practical and open. The other surly, superior and obsessed with reading one book – by the philosopher Kant.

bayes: a kinda-sorta masterpost

raginrayguns:

@nostalgebraist:

5. Why is the Bayesian machinery supposed to be so great?

This still confuses me a little, years after I wrote that other post.  A funny thing about the Bayesian machinery is that it doesn’t get justified in concrete guarantees like “can unscrew these screws, can tolerate this much torque, won’t melt below this temperature.”  Instead, one hears two kinds of justifications:

(a) Formal arguments that if one has some of the machinery in place, one will be suboptimal unless one has the other parts too

(b) Demonstrations that on particular problems, the machinery does a slick job (easy to use, self-consistent, free of oddities, etc.) while the classical tools all fail somehow

E. T. Jaynes’ big book is full of type (b) stuff, mostly on physics and statistics problems that are well-defined and textbook-ish enough that one can straightforwardly “plug and chug” with the Bayesian machinery.  The problem with these demos, as arguments, is that they only show that the tool has some applications, not that it is the only tool you’ll ever need.

Examples of type (a) are Cox’s Theorem and Dutch Book arguments.  These all start with the hypotheses and logical relations already set up, and try to convince you that (say) if you have degrees of belief, they ought to conform to the logical relations.  This is something of a straw man argument, in that no one actually advocates using the rest of the setup but not imposing these relations.  (Although there are interesting ideas surprisingly close to that territory.)

The real competitors to Bayes (e.g. the classical toolbox) do not have the “hypothesis space + degrees of belief” setup at all, so these arguments cannot touch them.

Yeah, Jaynes starts with Cox’s theorem, which I think of as a sort of filter, which you can drop a system through and see where it gets stuck, and if it doesn’t get stuck and makes it all the way through, it’s probability theory. But he doesn’t really present any other systems that you can drop through the filter. He mostly criticizes orthodox statistics, which you can’t really drop through the filter.

When I first read Jaynes, the example I dropped through Cox’s theorem was fuzzy logic, defining Belief(A and B) = min(Belief(A), Belief(B)), and disjunction as maximum. This gets stuck because you can hold Belief(A) constant and increase Belief(B) without necessarily increasing Belief(A and B). That’s not allowed. I was very impressed with Cox’s theorem for excluding this, since I had not even noticed this property, and when it was brought to my attention it was in fact unreasonable.
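
(To make that sticking point concrete, here is a minimal illustration with made-up numbers; it is just the min rule written out, not anything from Jaynes.)

```python
def fuzzy_and(bel_a, bel_b):
    """Fuzzy-logic conjunction: Belief(A and B) = min(Belief(A), Belief(B))."""
    return min(bel_a, bel_b)

bel_a = 0.3
for bel_b in (0.5, 0.7, 0.9):
    # Belief(B) keeps rising, but the conjunction stays pinned at Belief(A) = 0.3,
    # so growing confidence in B never shows up in Belief(A and B).
    print(bel_b, fuzzy_and(bel_a, bel_b))   # conjunction is 0.3 every time
```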

It makes me wonder if I would have been less impressed if I had started by using Dempster-Shafer theory as an example. Dempster-Shafer theory is the “interesting idea” that nostalgebraist linked to above. I’m writing this post to discuss it more thoroughly. tl;dr summary: Dempster-Shafer theory can be thought of as breaking the rule that there’s a “negation function” mapping Belief(A) to Belief(~A), and makes you wonder why we really need such a function.

So, as everyone in the internet Bayesianism discourse knows, Dempster-Shafer theory gives every proposition two numbers. These are the belief, Bel(A), and the plausibility, Plaus(A). Belief is how much it’s supported by the evidence, and plausibility is the degree to which it’s allowed by the evidence. Plausibility is always at least as high as the belief.

As few discoursers seem to realize, Plaus(A) is just 1-Bel(~A), so in a sense Bel is all you need. It’s interesting, then, to drop Bel through Cox’s theorem, and see where it gets stuck.

And the first place I notice is at the following desideratum in Cox’s theorem:

There exists a function S such that, for all A, Bel(~A) = S(Bel(A)).

Bel(A) breaks this rule, supposedly ruling it out as a quantification of confidence. But how bad is it, really?

Suppose I’m happily using Dempster-Shafer theory for, I don’t know, assessment of fraud risk, when strawman!Cox bursts into my office, and declares “I’ve come to save you from your irrational degrees of belief!”

As the perfectly reasonable foil to this hysterical and unreasonable strawman, I reply in a tone of pure, innocent curiosity: “What do you mean? I’d love any opportunity to improve my fraud detection.”

“Well,” Cox begins, filliping a coin and covering it, “your Bel(Heads)=0.5, and your Bel(~Heads)=0.5, right?”

“Certainly,” I reply.

“And this case you’re reviewing, Bel(Fraud) = 0.5, correct?”

“Absolutely.”

“And your Bel(~Fraud)?”

“0.2.”

“That’s irrational!” he shrieks, throwing his hands in the air and revealing that the coin was a heads. “Let S be the function that maps from Bel(A) to Bel(~A). What’s S(0.5)? Is it 0.5, or 0.2?” He puts his hands on my desk, leans forward, and demands, “Which is it?”

“There is no such function,” I reply. “Why should there be?”

So, what can Cox do to convince me my assignments are irrational? Or that my fraud detection would be more efficient if there existed this negation function S?

So, that’s where I end up when I drop Dempster-Shafer Bel through Cox’s theorem, and this time I don’t feel I’ve revealed any flaw in the system.
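
For concreteness, here is a minimal sketch of the mass-function bookkeeping behind the numbers in that dialogue (the 0.5/0.2 fraud assignments are the ones from the story above; putting the leftover 0.3 on the whole frame is just the simplest illustrative reading):

```python
def bel_and_plaus(m_a, m_not_a):
    """Dempster-Shafer belief and plausibility of A on the frame {A, not-A}.

    m_a and m_not_a are the masses committed specifically to A and to not-A;
    whatever is left over sits on the whole frame (the "don't know" mass).
    """
    m_either = 1.0 - m_a - m_not_a
    bel_a = m_a                     # evidence committed to A
    plaus_a = m_a + m_either        # evidence not committed against A
    assert abs(plaus_a - (1.0 - m_not_a)) < 1e-12   # Plaus(A) = 1 - Bel(not-A)
    return bel_a, plaus_a

print(bel_and_plaus(0.5, 0.5))   # the coin: Bel(Heads) = 0.5, Plaus(Heads) = 0.5
print(bel_and_plaus(0.5, 0.2))   # the case: Bel(Fraud) = 0.5, Plaus(Fraud) = 0.8
# Both have Bel(A) = 0.5, yet Bel(not-A) is 0.5 in one and 0.2 in the other,
# so no single function S can give Bel(not-A) = S(Bel(A)).
```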

Shafer himself says the same thing, actually:

Glenn Shafer:

Most of my own scholarly work has been devoted to representations of uncertainty that depart from the standard probability calculus, beginning with my work on belief functions in the 1970s and 1980s and continuing with my work on causality in the 1990s [18] and my current work with Vladimir Vovk on game-theoretic probability ([19], www.probabilityandfinance.com). I undertook all of this work after a careful reading, as a graduate student in the early 1970s, of Cox’s paper and book. His axioms did not dissuade me. As Van Horn notes, with a quote from my 1976 book [17], I am not on board even with Cox’s implicit assumption that reasonable expectation can normally be expressed as a single number. I should add that I am also unpersuaded by Cox’s two explicit axioms. Here they are in Cox’s own notation:

1. The likelihood ∼ b|a is determined in some way by the likelihood b|a: ∼ b|a = S(b|a), where S is some function of one variable.

2. The likelihood c · b|a is determined in some way by the two likelihoods b|a and c|b · a: c · b|a = F(c|b · a, b|a), where F is some function of two variables.

I have never been able to appreciate the normative claims made for these axioms. They are abstractions from the usual rules of the probability calculus, which I do understand. But when I try to isolate them from that calculus and persuade myself that they are self-evident in their own terms, I draw a blank. They are too abstract—too distant from specific problems or procedures—to be self-evident to my mind.

Shafer goes on to quote and respond to Cox’s argument that there should exist F, but since I’m talking about S, I’m gonna look up how Jaynes argued for it.

ET Jaynes:

Since the propositions now being considered are of the Aristotelian logical type which must always be either true or false, the logical product AA̅ is always false, the logical sum A+A̅ always true. The plausibility that A is false must depend in some way on the plausibility that it is true. If we define u ≣ w(A|B), v ≣ w(A̅|B), there must exist some functional relation

v = S(u)

And that’s it. To explain the notation: w is the function that is eventually shown to have a correspondence with a probability mass function, the overbar means “not”, and logical “products” and “sums” are conjunctions and disjunctions, respectively.

So, why must there exist this functional relation? Perhaps instead, the belief in A could change without altering the belief in ~A? That can happen in Dempster-Shafer I think, and it does seem kind of crazy. But even disallowing that, and allowing that there must be a function between belief in A and ~A, is it really the same function for every A? Why should it be?

Anyway, yeah. So, idk if I’d say, like nostalgebraist does, that Dempster-Shafer theory is surprisingly close to having the hypothesis space + beliefs setup but without the same constraints. I’d say instead that it’s exactly that. But I’m not totally sure since I’ve only read the basics and maybe things change in more complex applications.

Good stuff!!

To be completely honest, when I was writing that part you quoted, I was like “oh shit wait, D-S does have the same setup, so how does it get around the Cox and Dutch Book type stuff, or maybe it doesn’t? um….” and then in the interests of getting on with the rest of the post, I just hedged by being vague (“surprisingly close to that territory”)

So thanks for answering the question I was curious about but had to ignore.

I started wondering about the equivalent of the above in the measure-theoretic picture (i.e. why D-S doesn’t define a probability measure).  If you translate “logical negation” to “set complement” like usual, then it violates additivity: A and ~A are disjoint, and together they make the whole space, so area(A) = area(whole space) - area(~A).  This seems easier to understand than the Cox S thing, which fits with what Shafer said.

(Apparently, instead of a measure, it’s a “fuzzy measure.”  Instead of additivity, a fuzzy measure just needs to get the correct order on what I was calling “obviously-nested” sets earlier)

I can see the strong intuition behind the Cox S desideratum.  You should be able to take the negation of everything without changing any of the content.  Like, when we talk about A and ~A, neither has the intrinsic property of “being the one with the tilde.”  (Likewise with sets A, A^c.)  You can see the desideratum as a relatively weak way of trying to make things symmetric under negation – everything goes through the same function, so hopefully every property of b|a will have an equivalent for S(b|a).

So, if there’s an asymmetry between one side and the other, what broke the initial symmetry?  How do you decide which side is which?  (That’s what I imagine the strawman!Cox figure saying)

But then, A and ~A are always distinct, even if not because “one has the tilde.”  So for the D-S-using fraud protection worker, it is easy to break the symmetry because “Fraud” and “not Fraud” are different things.  (Thus if they’d flipped all their tildes at the start, the symmetry would have broken the same way, “not Fraud” getting 0.2 and “Fraud” getting 0.5.)

Still, if we are understanding the “not” here either as logical negation or as set complement, this is still nonsensical.  Because in both those frameworks, the negation doesn’t contain any information not contained in the original.  Except …

If I think of “the information” used to specify sets S or S^c as a boundary, then S is “everything inside here” and S^c is “everything outside of here.”  Of course this visual picture is depending on topological notions not present in the sets alone, but it suggests something true about spaces of ideas/hypotheses: we can draw a boundary around some ideas we know about, and the “inside here” set is all stuff we know about, but the “outside of here” set includes all other ideas, including ones we haven’t thought of.  So this is a very natural distinction in practice.

How would you formalize that?  I guess you’d have set theory in a universe (=“outcome space”) that wasn’t fully known, so you could say stuff like “I know 1 and 2 are in the universe, and I can make the set {1, 2}, but I don’t know if 3 is in the universe.”  This probably exists but I don’t know what it’s called.

raginrayguns:

when giving a presentation, I think it’s always good to bring up the obvious reason why it won’t work. When you show that you can address the obvious reason it won’t work, it will give people faith in your project. Otherwise, they will assume it will fail for the obvious reasons.

immanentizingeschatons:

astrobleme22:

honeyampoule:

earthshaker1217:

currentsinbiology:

Octopus and squid evolution is officially weirder than we could have ever imagined

Just when we thought octopuses couldn’t be any weirder, it turns out that they and their cephalopod brethren evolve differently from nearly every other organism on the planet.

In a surprising twist, scientists have discovered that octopuses, along with some squid and cuttlefish species, routinely edit their RNA (ribonucleic acid) sequences to adapt to their environment.

This is weird because that’s really not how adaptations usually happen in multicellular animals. When an organism changes in some fundamental way, it typically starts with a genetic mutation - a change to the DNA.

The findings have been published in Cell.

[Image: Olga Visavi/Shutterstock]

I’m saying though.

They Cthulhu children. Stay woke

this is why octopus is the superior food

i love my family

@absurdseagull

@mitoticcephalopod

@dhominis

Whoa.

(Source: sciencealert.com, via mitoticcephalopod)

bayes: a kinda-sorta masterpost

principioeternus:

nostalgebraist:

I have written many many words about “Bayesianism” in this space over the years, but the closest thing to a comprehensive “my position on Bayes” post to date is this one from three years ago, which I wrote when I was much newer to this stuff.  People sometimes link that post or ask me about it, which almost never happens with my other Bayes posts.  So I figure I should write a more up-to-date “position post.”

I will try to make this at least kind of comprehensive, but I will omit many details and sometimes state conclusions without the corresponding arguments.  Feel free to ask me if you want to hear more about something.

I ended up including a whole lot of preparatory exposition here – the main critiques start in section 6, although there are various critical remarks earlier.

I like this post.  I myself would say that I’m only a “weak Bayesian”, and that while I do solidly believe in various “Bayesian brain” theories, those theories are *muuuuuch* more philosophically pragmatist than the Strong Bayesian epistemological program.

My big question would be whether anyone knows how to “replace” probability theory.  What I really want is a way of predicting stuff that lets information flow top-down *and* bottom-up, allows for continuously graded inferences, and allows for arbitrarily complicated structures and connections.  Most statistical and machine-learning methods, outside of those described below, *don’t* allow for that!  This is why I stick by my Weak Bayesianism even when it visibly sucks.

That said, there are some formal developments Nostalgebraist has missed here.

* Nonparametrics!  It’s not as if nobody has ever thought about the Problem of New Ideas before.  There’s a whole subfield of Bayesian nonparametric statistics devoted to handling exactly this.  The idea is that you start with a “nonparametric” prior model (a probabilistic model of an infinite-dimensional sample space).  Sure, this model will assign probabilities over objects that are formally infinite, but you only ever have to actually deal with finite portions of them that talk about your finite data.  Whenever new data appears to require a New Idea, though, the model will summon one up with approximately the right shape.  You can Monte Carlo sample increasingly large/complex finite elements of the posterior, and you never have to hold the infinite object in your head to be doing probabilistic inference with it.  (See the first sketch after this list.)

* Probabilistic programming!  This one’s related to nonparametrics, since part of its purpose is to make nonparametrics easy to handle computationally.  In a probabilistic programming language, we can perform inference (both conditionalization and marginalization) in any model whose conditional-dependence structure corresponds to some program.  In practice, this means writing programs that flip coins, and then conditioning on observed flips to find the weights.  It’s actually surprisingly intuitive for having so much mathematical and computational machinery behind it.  It’s also Turing-universal: any distribution from which a computer can sample in finite time corresponds to some probabilistic program.  So we have a model class including everything we think a physical machine can cope with!  (See the second sketch after this list.)

* Divergences are universal performance metrics.  Any predictive model - frequentist or Bayesian - can be *considered* to give an approximate posterior-predictive distribution.  An information divergence (usually a Kullback-Leibler divergence) then defines a “loss function” between the true empirical distribution over held-out sample data and an equivalent sample from the predictive distribution.  The higher the loss, the worse the predictive model, and the actual number can be (AFAIU) approximately calculated (certainly I’ve handled code that calculates approximate sample divergences).  A good frequentist model will have a low divergence (loss), and a bad Bayesian model will have a high divergence (loss).  This gives a good definition for a *bad* Bayesian model: one in which the posterior predictive doesn’t predict well.  This technique is regularly used in Bayesian statistics to evaluate and criticize models.  (See the third sketch after this list.)
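
Since a couple of these points are easier to see in toy code than in prose, here are three small sketches; every number, label, and parameter in them is an arbitrary illustration rather than anything from a specific reference.  First, the nonparametrics point, via the Chinese-restaurant-process view of a Dirichlet process prior: the prior notionally contains infinitely many clusters, but you only ever instantiate the finitely many that the data has actually touched, and a New Idea gets summoned whenever the data calls for one.

```python
import random

def crp_assignments(n_points, alpha=1.0):
    """Sample cluster assignments from a Chinese restaurant process.

    Each point joins an existing cluster with probability proportional to that
    cluster's size, or opens a brand-new cluster with probability proportional
    to alpha, so the number of clusters grows with the data without the
    infinite prior ever being written down explicitly.
    """
    cluster_sizes = []                        # only the clusters seen so far
    assignments = []
    for _ in range(n_points):
        weights = cluster_sizes + [alpha]     # last slot = "open a new cluster"
        r = random.uniform(0, sum(weights))
        acc = 0.0
        for k, w in enumerate(weights):
            acc += w
            if r <= acc:
                break
        if k == len(cluster_sizes):           # a New Idea was summoned
            cluster_sizes.append(1)
        else:
            cluster_sizes[k] += 1
        assignments.append(k)
    return assignments, cluster_sizes

print(crp_assignments(20))   # e.g. ([0, 0, 1, 0, 2, ...], [12, 5, 3])
```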
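
Second, the probabilistic-programming point, as a minimal “programs that flip coins” sketch: a generative program with an unknown coin weight, conditioned on (made-up) observed flips by plain rejection sampling.  Real probabilistic programming languages use far smarter inference than this, but the shape of the idea is the same.

```python
import random

def prior_weight():
    return random.random()                      # uniform prior on the coin's weight

def flip_program(weight, n):
    return [random.random() < weight for _ in range(n)]

observed = [True, True, True, False, True]      # hypothetical data: 4 heads, 1 tail

# Condition on the observations: keep only prior draws whose simulated flips
# exactly reproduce the data (rejection sampling).
posterior = []
while len(posterior) < 2000:
    w = prior_weight()
    if flip_program(w, len(observed)) == observed:
        posterior.append(w)

print(sum(posterior) / len(posterior))          # roughly 0.71, the Beta(5, 2) posterior mean
```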
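
Third, a crude sketch of a K-L divergence as a universal loss: compare the empirical label distribution on held-out data against a model’s predictive distribution, regardless of how the model was fit.  (This version only compares marginal label frequencies and ignores the inputs, which is roughly the weakness nostalgebraist pokes at below; the models and data are invented.)

```python
import math
from collections import Counter

def kl_divergence(p, q):
    """K-L divergence D(p || q) between two discrete distributions given as dicts."""
    return sum(p[k] * math.log(p[k] / q[k]) for k in p if p[k] > 0)

held_out_labels = ["fraud", "ok", "ok", "ok", "fraud", "ok", "ok", "ok"]
counts = Counter(held_out_labels)
empirical = {k: v / len(held_out_labels) for k, v in counts.items()}

# Hypothetical predictive distributions from two models; any model that
# outputs label probabilities could be scored the same way.
good_model = {"fraud": 0.3, "ok": 0.7}
bad_model = {"fraud": 0.9, "ok": 0.1}

print(kl_divergence(empirical, good_model))   # ~0.006: low loss, decent model
print(kl_divergence(empirical, bad_model))    # ~1.19: high loss, bad model
```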

What’s important here is that sample spaces like “Countable-dimensional probability distributions” (Dirichlet processes), “Uncountable-dimensional continuous functions” (Gaussian processes), and “all stochastic computer programs” seem to give us increasingly broad classes of probability models.  We would like to then do the reverse of old-fashioned Bayesian statistics: instead of starting with a restricted model, we can start with a very broad model and restrict it using our domain knowledge about the problem at hand.  We then plug-and-play some computational stuff to perform inference.

Of course, it doesn’t yet work well in practice, but these things are regularly used to model really complex stuff, up to and including thought.  Again, those are Weak Bayesian theories, and we care more about a Monte Carlo or variational posterior with a low predictive loss than about finding God’s own posterior distribution.

Another important choice to make is indeed how you interpret probability.  I’ve actually liked the more measure-y way, once it was explained to me.  “Propositions” are then interpreted as subspaces of the sample space.  This seems like the Right Thing: you can start with a very complex model defined by some program or some infinite object or whatever, and then treat finite events within it as logical propositions.  Those propositions will obey Boolean logic, but their logical relations will come from the model, rather than the other way around.  An infinite-dimensional model will then also allow for an infinite number of propositions.
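
(As a concrete toy version of the “propositions as subspaces” picture, here is a two-coin sample space; nothing about it is specific to the infinite-dimensional case.  Propositions are subsets, and the Boolean relations among them fall out of the set operations.)

```python
from fractions import Fraction

# A tiny sample space: two fair coin flips.
omega = {"HH", "HT", "TH", "TT"}
P = {w: Fraction(1, 4) for w in omega}

def prob(event):
    """A proposition/event is just a subset of the sample space."""
    return sum(P[w] for w in event)

first_heads = {w for w in omega if w[0] == "H"}
second_heads = {w for w in omega if w[1] == "H"}

print(prob(first_heads & second_heads))                      # conjunction = intersection
print(prob(first_heads | second_heads))                      # disjunction = union
print(prob(omega - first_heads) == 1 - prob(first_heads))    # negation = complement
```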

I consider this a fairly good example of how sometimes you should build your philosophy *on top of* the math and science that you know can work, rather than the other way around.  Philosophy is an *output* of thought, so if you want new philosophy, you need new thoughts to think, and if you want new thoughts to think, you need to get them from the world.

This is an extremely interesting response, thank you.

I was totally ignorant of Bayesian nonparametrics until now and it is the sort of thing I should (and want to) know about.  Do you have any recommendations about what to read first?  Seems like there are a lot of references out there.

Any links about probabilistic programming that you think are especially good + relevant would be appreciated too.

I’m not sure I agree with your paragraph about divergences (or perhaps I don’t understand it).  I’m aware of the K-L divergence, and it’s true that you can get a “posterior distribution” of some kind out of any predictive model.  (In classification tasks, this is straightforward because the predictions are usually probabilistic anyway; it’s a little less clear to me how this works with regression, since the point estimates we make in regression don’t attempt to match the intrinsic/noise variance in the data, which would affect the K-L divergence.)

But there’s more than one way to compare two probability distributions, and I don’t see that “K-L divergence from empirical distribution of validation set” is the one best loss function for probabilistic modeling.  For one thing, we’re presumably going to want to use the joint distributions of all our variables (so that the model has to get the relation of X to Y right, not just match the overall relative counts for Y).  But that’s a potentially high-dimensional distribution which we’re sparsely sampling, so the literal empirical distribution will have spurious peaks centered at each data point, and we’d need to do some density reconstruction to get something more sensible – at which point it’s not clear that we trust this reference distribution more than our model’s posterior, since both involve approximate inference from the data.

Also, I know the K-L divergence has a bunch of special properties, but I’ve always been wary when people say that it is the one correct way to compare 2 distributions (or that there is one correct way).  To make the case it seems like you’d need some link between the special properties and the thing you want to do.  And in practice we use various loss functions (various proper scoring rules for classification, say) that aren’t (obviously?) the K-L div in disguise; is this wrong?