
It's a bit different than what's discussed here, but color-contrast detectors in neural networks can be thought of as forming a Klein bottle: https://distill.pub/2020/circuits/equivariance/#hue-rotation...

(This is, in some sense, for a similar reason to Gunnar Carlsson et al. finding a Klein bottle when looking at high-contrast image patches, except one level more abstract, since it's about features rather than data points.)


Since this post is based on my 2014 blog post (https://colah.github.io/posts/2014-03-NN-Manifolds-Topology/ ), I thought I might comment.

I tried really hard to use topology as a way to understand neural networks, for example in these follow ups:

- https://colah.github.io/posts/2014-10-Visualizing-MNIST/

- https://colah.github.io/posts/2015-01-Visualizing-Representa...

There are places I've found the topological perspective useful, but after a decade of grappling with trying to understand what goes on inside neural networks, I just haven't gotten that much traction out of it.

I've had a lot more success with:

* The linear representation hypothesis - The idea that "concepts" (features) correspond to directions in neural networks.

* The idea of circuits - networks of connections between such concepts. (A quick toy sketch of both ideas follows right after this list.)
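
To make this concrete, here's a minimal sketch in Python (purely toy numbers: random vectors standing in for directions that, in a real model, you'd have to find empirically):

  import numpy as np

  rng = np.random.default_rng(0)
  d_model = 64

  # Hypothetical feature direction -- in a real model this would be found
  # empirically (e.g. via probing or dictionary learning), not drawn at random.
  curve_detector = rng.normal(size=d_model)
  curve_detector /= np.linalg.norm(curve_detector)

  # Linear representation hypothesis: the "amount" of a concept present in an
  # activation vector is (roughly) its projection onto that concept's direction.
  activation = 3.0 * curve_detector + 0.1 * rng.normal(size=d_model)
  print("curve feature strength:", activation @ curve_detector)   # ~3.0

  # A one-edge "circuit": a downstream feature reads from an upstream one through
  # the weights, so the effective connection strength is out_dir . W . in_dir.
  W = rng.normal(size=(d_model, d_model)) / np.sqrt(d_model)
  circle_detector = rng.normal(size=d_model)
  circle_detector /= np.linalg.norm(circle_detector)
  print("edge weight curve -> circle:", circle_detector @ W @ curve_detector)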

Some selected related writing:

- https://distill.pub/2020/circuits/zoom-in/

- https://transformer-circuits.pub/2022/mech-interp-essay/inde...

- https://transformer-circuits.pub/2025/attribution-graphs/bio...


Related to ways of understanding neural networks, I've seen these views expressed a lot, which to me seem like misconceptions:

- LLMs are basically just slightly better `n-gram` models

- The idea of "just" predicting the next token, as if next-token-prediction implies a model must be dumb

(I wonder if this [1] popular response to Karpathy's RNN [2] post is partly to blame for people equating language neural nets with n-gram models. The stochastic parrot paper [3] also somewhat equates LLMs and n-gram models, e.g. "although she primarily had n-gram models in mind, the conclusions remain apt and relevant". I guess there was a time when they were more equivalent, before the nets got really really good)

[1] https://nbviewer.org/gist/yoavg/d76121dfde2618422139

[2] https://karpathy.github.io/2015/05/21/rnn-effectiveness/

[3] https://dl.acm.org/doi/pdf/10.1145/3442188.3445922


I guess I'll plug my hobby horse:

The whole discourse of "stochastic parrots" and "do models understand" and so on is deeply unhealthy, because these should be scientific questions about mechanism, and people don't have a vocabulary for discussing the range of mechanisms which might exist inside a neural network. So instead we have lots of arguments where people project meaning onto very fuzzy ideas and the argument doesn't ground out to scientific, empirical claims.

Our recent paper reverse engineers the computation neural networks use to answer questions in a number of interesting cases (https://transformer-circuits.pub/2025/attribution-graphs/bio... ). We find computation that one might informally describe as "multi-step inference", "planning", and so on. I think it's maybe clarifying for this debate, because it grounds out to very specific empirical claims about mechanism (which we test by intervention experiments).
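
(The papers describe the real method, attribution graphs, which is far more involved. But as a cartoon of what an "intervention experiment" means mechanically, here's a toy sketch: delete a hypothesized feature direction mid-forward-pass and check whether the output moves the way the proposed circuit predicts.)

  import numpy as np

  rng = np.random.default_rng(1)
  d = 32
  W1 = rng.normal(size=(d, d)) / np.sqrt(d)
  W2 = rng.normal(size=(d, d)) / np.sqrt(d)

  def forward(x, ablate_direction=None):
      h = np.maximum(W1 @ x, 0)                # hidden activations of a toy 2-layer net
      if ablate_direction is not None:
          v = ablate_direction / np.linalg.norm(ablate_direction)
          h = h - (h @ v) * v                  # project out the hypothesized feature
      return W2 @ h

  x = rng.normal(size=d)
  hypothesized_feature = rng.normal(size=d)    # stand-in for a direction found by analysis
  baseline = forward(x)
  ablated = forward(x, ablate_direction=hypothesized_feature)
  # A large shift here is evidence the direction causally matters for the output.
  print("output shift when the feature is removed:", np.linalg.norm(baseline - ablated))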

Of course, one can disagree with the informal language we use. I'm happy for people to use whatever language they want! I think in an ideal world, we'd move more towards talking about concrete mechanism, and we need to develop ways to talk about these informally.

There was previous discussion of our paper here: https://news.ycombinator.com/item?id=43505748


1) Isn't it unavoidable that a transformer - a sequential multi-layer architecture - is doing multi-step inference?!

2) There are two aspects to a rhyming poem:

a) It is a poem, so must have a fairly high degree of thematic coherence

b) It rhymes, so must have end-of-line rhyming words

It seems that to learn to predict (hence generate) a rhyming poem, both of these requirements (theme/story continuation+rhyming) would need to be predicted ("planned") at least by the beginning of the line, since they are inter-related.

In contrast, a genre like freestyle rap may also rhyme, but flow is what matters and thematic coherence and rhyming may suffer as a result. In learning to predict (hence generate) freestyle, an LLM might therefore be expected to learn that genre-specific improv is what to expect, and that rhyming is of secondary importance, so one might expect less rhyme-based prediction ("planning") at the start of each bar (line).


Absolutely, the first task should be to understand how and why black boxes with emergent properties actually work, in order to further knowledge - but importantly, in order to improve them and build on the acquired knowledge to surpass them. That implies curbing «parrot[ing]» and inadequate «understand[ing]».

I.e. those higher concepts are kept in mind as a goal. It is healthy: it keeps the aim alive.


My favorite argument against SP is zero shot translation. The model learns Japanese-English and Swahili-English and then can translate Japanese-Swahili directly. That shows something more than simple pattern matching happens inside.

Besides all arguments based on model capabilities, there is also an argument from usage - LLMs are more like pianos than parrots. People are playing the LLM on the keyboard, making them 'sing'. Pianos don't make music, but musicians with pianos do. Bender and Gebru talk about LLMs as if they work alone, with no human direction. Pianos are also dumb on their own.


The translation happens because of token embeddings. We spent a lot of time developing rich embeddings that capture contextual semantics. Once you learn those, translation is “simply” embedding in one language, and disembedding in another.

This does not show complex thinking behavior, although there are probably better examples. Translation just isn’t really one of them.
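
To sketch the picture I mean with toy vectors (not real embeddings, and a deliberately oversimplified view of how multilingual models actually work): if two languages land near the same points in a shared semantic space, "translation" reduces to nearest-neighbour lookup across that space.

  import numpy as np

  rng = np.random.default_rng(2)
  d = 16
  concepts = {"water": rng.normal(size=d), "fire": rng.normal(size=d)}

  # Toy assumption: each language embeds a concept near the same shared point,
  # plus a little language-specific noise.
  japanese = {"mizu": concepts["water"] + 0.05 * rng.normal(size=d),
              "hi":   concepts["fire"]  + 0.05 * rng.normal(size=d)}
  swahili  = {"maji": concepts["water"] + 0.05 * rng.normal(size=d),
              "moto": concepts["fire"]  + 0.05 * rng.normal(size=d)}

  def translate(word, src, tgt):
      v = src[word]
      # "disembed" in the target language = nearest neighbour in the shared space
      return max(tgt, key=lambda w: tgt[w] @ v / (np.linalg.norm(tgt[w]) * np.linalg.norm(v)))

  print(translate("mizu", japanese, swahili))  # expected: "maji", with no ja<->sw pairs ever seen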


Furthermore: Learning additional languages fine-tunes the embedding.

This is also the problem I have with John Searle’s Chinese room.

> The model learns Japanese-English and Swahili-English and then can translate Japanese-Swahili directly. That shows something more than simple pattern matching happens inside.

The "water story" is a pivotal moment in Helen Keller's life, marking the start of her communication journey. It was during this time that she learned the word "water" by having her hand placed under a running pump while her teacher, Anne Sullivan, finger-spelled the word "w-a-t-e-r" into her other hand. This experience helped Keller realize that words had meaning and could represent objects and concepts.

As the above human experience shows, aligning tokens from different modalities is the first step in doing anything useful.


> The whole discourse of "stochastic parrots" and "do models understand" and so on is deeply unhealthy [...] So instead we have lots of arguments where people project meaning onto very fuzzy ideas and the argument doesn't ground out to scientific, empirical claims.

I would put it this way: the question "do LLMs, etc understand?" is rooted in a category mistake.

Meaning, I am not claiming that it is premature to answer such questions because we lack a sufficient grasp of neural networks. I am asserting that LLMs don't understand, because the question of whether they do is like asking whether A-flat is yellow.


Regardless of the mechanism, the foundational 'conceit' of LLMs is that by dumping enough syntax (and only syntax) into a sufficiently complex system, the semantics can be induced to emerge.

Quite a stretch, in my opinion (cf. Plato's Cave).


> Regardless of the mechanism, the foundational 'conceit' of LLMs is that by dumping enough syntax (and only syntax) into a sufficiently complex system, the semantics can be induced to emerge.

Syntax has a dual aspect. It is both content and behavior (code and execution, or data and rules, form and dynamics). This means syntax as behavior can process syntax as data. And this is exactly how neural net training works. Syntax as execution (the model weights and algorithm) processes syntax as data (activations and gradients). In the forward pass the model processes data, producing outputs. In the backward pass it is the weights of the model that become the data to be processed.

When such a self-generative syntactic system is in contact with an environment, in our case the training set, it can encode semantics. Inside the model data is relationally encoded in the latent space. Any new input stands in relation to all past inputs. So data creates its own semantic space with no direct access to the thing in itself. The meaning of a data point is how it stands in relation to all other data points.

Another important aspect is that this process is recursive. A recursive process can't be fully understood from outside. Gödel, Turing, and Chaitin prove that recursion produces blind spots: you need to walk the recursive path to know it; you have to be it to know it. Training and running inference on models is such a process.

The water carves its banks

The banks channel the water

Which is the true river?

Here, banks = model weights and water = language
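
A minimal numerical sketch of the forward/backward point above (a two-parameter toy model fit by gradient descent, nothing like a real LLM): in the forward pass the weights act on the data; in the backward pass the same relationship is re-read so that the weights themselves are what gets changed.

  # Toy model: y_hat = w * x + b, trained on the "environment" below.
  w, b, lr = 0.0, 0.0, 0.1
  data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]   # (x, y) pairs

  for _ in range(200):
      for x, y in data:
          y_hat = w * x + b        # forward pass: weights process the data
          err = y_hat - y
          w -= lr * err * x        # backward pass: now the weights are the
          b -= lr * err            # thing being processed (updated by gradients)

  print(round(w, 3), round(b, 3))  # roughly 2.0 and 0.0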


Anyone who has widely read topics across philosophy, science (physics, biology), economics, politics (policy, power), from practitioners, from original takes, news, etc. ... has managed to understand a tremendous number of relationships due to just words and their syntax.

While many of these relationships are related to things we see and do in trivial ways, the vast majority go far beyond anything that can be seen or felt.

What does economics look like? I don't know, but I know as I puzzle out optimums, or expected outcomes, or whatever, I am moving forms around in my head that I am aware of, can recognize and produce, but couldn't describe with any connection to my senses.

The same when seeking a proof for a conjecture in an idiosyncratic algebra.

Am I really dealing in semantics? Or have I just learned the graph-like latent representation for (statistical or reliable) invariant relationships in a bunch of syntax?

Is there a difference?

Don't we just learn the syntax of the visual world? Learning abstractions such as density, attachment, purpose, dimensions, sizes, that are not what we actually see, which is lots of dot magnitudes of three kinds. And even those abstractions benefit greatly from the words other people use describing those concepts. Because you really don't "see" them.

I would guess that someone who was born without vision, touch, smell or taste, would still develop what we would consider a semantic understanding of the world, just by hearing. Including a non-trivial more-than-syntactic understanding of vision, touch, smell and taste.

Despite making up their own internal "qualia" for them.

Our senses are just neuron firings. The rest is hierarchies of compression and prediction based on their "syntax".


>Am I really dealing in semantics? Or have I just learned the graph-like latent representation for (statistical or reliable) invariant relationships in a bunch of syntax?

This and the rest of the comment are philosophical skepticism, and Kant blew this apart back when Hume's "bundle of experience" model of human subjects was considered an open problem in epistemology.


Can you get into more detail and share some links? Inquiring minds want to know

> Anyone who has widely read topics across philosophy, science (physics, biology), economics, politics (policy, power), from practitioners, from original takes, news, etc. ... has managed to understand a tremendous number of relationships due to just words and their syntax.

You're making a slightly different point from the person you're answering. You're talking about the combination of words (with intelligible content, presumably) and the syntax that enables us to build larger ideas from them. The person you're answering is saying that LLMs work on the principle that it's possible for intelligence to emerge (in appearance if not in fact) just by digesting a syntax and reproducing it. I agree with the person you're answering. Please excuse the length of the below, as this is something I've been thinking about a lot lately, so I'm going to do a short brain dump to get it off my chest:

The Chinese Room thought experiment -- treated by the Stanford Encyclopedia of Philosophy as possibly the single most discussed and debated thought experiment of the latter half of the 20th century -- argued precisely that no understanding can emerge from syntax, and thus by extension that 'strong AI' that really, actually understands (whatever we mean by that) is impossible. So plenty of people have been debating this.

I'm not a specialist in continental philosophy or social thought, but, similarly, it's my understanding that structuralism argued essentially that one can (or must) make sense of language and culture precisely by mapping their syntax. There aren't structuralists anymore, though. Their project failed, because their methods don't work.

And, again, I'm no specialist, so take this with a grain of salt, but poststructuralism was, I think, built partly on the recognition that such syntax is artificial and artifice. The content, the meaning, lives somewhere else.

The 'postmodernism' that supplanted it, in turn, tells us that the structuralists were basically Platonists or Manicheans -- treating ideas as having some ideal (in a philosophical sense) form separate from their rough, ugly, dirty, chaotic embodiments in the real world. Postmodernism, broadly speaking, says that that's nonsense (quite literally) because context is king (and it very much is).

So as far as I'm aware, plenty of well informed people whose very job is to understand these issues still debate whether syntax per se confers any understanding whatsoever, and the course philosophy followed in the 20th century seems to militate, strongly, against it.


I am using syntax in a general form to mean patterns.

We are talking about LLMs and the debate seems to be around whether learning about non-verbal concepts through verbal patterns (i.e. syntax that includes all the rules of word use, including constraints reflecting relations between words' meanings, but not communicating any of that meaning in more direct ways) constitutes semantic understanding or not.

In the end, all the meaning we have is constructed from the patterns our senses relay to us. We construct meaning from those patterns.

I.e. LLMs may or may not “understand” as well or deeply as we do. But what they are doing is in the same direction.


> In the end, all the meaning we have is constructed from the patterns our senses relay to us. We construct meaning from those patterns.

Appears quite bold. What sense-relays inform us about infinity or other mathematical concepts that don't exist physically? Is math-sense its own sense that pulls from something extra-physical?

Doesn't this also go against Chomsky's work, the poverty of the stimulus? That it's the recursive nature of language that provides so much linguistic meaning and ability, not sense data, which would be insufficient?


Curious what you make of symbolic mathematics, then - in particular, systems like Mathematica which can produce true and novel mathematical facts by pure syntactic manipulation.

The truth is, syntax and semantics are strongly intertwined and not cleanly separable. A "proof" is merely a syntactically valid string in some formal system.


1000%. It's really hard to express this to non-engineers who never wasted years of their life trying to work with n-grams and NLTK (even topic models) to make sense of textual data... Projects I dreamed of circa 2012 are now completely trivial. If you do have that comparison ready at hand, the problem of understanding what this mind-blowing leap means (which is where I find writing like the OP helpful) is so fascinating, and something completely different from complaining that it's a "black box."

I've expressed this on here before, but it feels like the everyday reception of LLMs has been so damaged by the general public having just gotten a basic grasp on the existence of machine learning.


Thanks for the follow up. I've been following your circuits thread for several years now. I find the linear representation hypothesis very compelling, and I have a draft of a review for Toy Models of Superposition sitting in my notes. Circuits I find less compelling, since the analysis there feels very tied to the transformer architecture in specific, but what do I know.

Re linear representation hypothesis, surely it depends on the architecture? GANs, VAEs, CLIP, etc. seem to explicitly model manifolds. And even simple models will, due to optimization pressure, collapse similar-enough features into the same linear direction. I suppose it's hard to reconcile the manifold hypothesis with the empirical evidence that simple models will place similar-ish features in orthogonal directions, but surely that has more to do with the loss that is being optimized? In Toy Models of Superposition, you're using an MSE loss, which effectively makes the model learn an autoencoder regression / compression task. Makes sense then that the interference patterns between co-occurring features would matter. But in a different setting, say a contrastive loss objective, I suspect you wouldn't see that same interference minimization behavior.
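
(For anyone following along who hasn't read it, the setup being referenced is roughly the following - simplified from the paper, with importance weighting dropped, so treat it as a sketch rather than a faithful reproduction. Sparse features get squeezed through fewer dimensions and reconstructed under an MSE loss, which is exactly why the interference geometry matters.)

  import numpy as np

  rng = np.random.default_rng(3)
  n_features, d_hidden, lr = 20, 5, 0.02   # more features than hidden dimensions
  W = 0.1 * rng.normal(size=(d_hidden, n_features))
  b = np.zeros(n_features)

  for step in range(20000):
      # Sparse features: each of the 20 features is active only ~5% of the time.
      x = rng.uniform(size=n_features) * (rng.uniform(size=n_features) < 0.05)
      h = W @ x                                  # compress into 5 dimensions
      z = W.T @ h + b
      x_hat = np.maximum(z, 0)                   # ReLU reconstruction
      g = 2 * (x_hat - x) * (z > 0)              # dL/dz for the MSE loss
      W -= lr * (np.outer(h, g) + np.outer(W @ g, x))   # tied encoder/decoder weights
      b -= lr * g

  # With sparse enough features, W tends to pack more feature directions than it
  # has dimensions, at nonzero angles to one another -- superposition/interference.
  norms = np.linalg.norm(W, axis=0) + 1e-8
  cos = (W.T @ W) / np.outer(norms, norms)
  print("max |cos| between distinct feature directions:",
        round(np.abs(cos - np.eye(n_features)).max(), 3))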


> Circuits I find less compelling, since the analysis there feels very tied to the transformer architecture in specific, but what do I know.

I don't think circuits is specific to transformers? Our work in the Transformer Circuits thread often is, but the original circuits work was done on convolutional vision models (https://distill.pub/2020/circuits/ )

> Re linear representation hypothesis, surely it depends on the architecture? GANs, VAEs, CLIP, etc. seem to explicitly model manifolds

(1) There are actually quite a few examples of seemingly linear representations in GANs, VAEs, etc (see discussion in Toy Models for examples).

(2) Linear representations aren't necessarily in tension with the manifold hypothesis.

(3) GANs/VAEs/etc. modeling things as a latent Gaussian space is actually way more natural if you allow superposition (which requires linear representations), since the central limit theorem allows superposition to produce Gaussian-like distributions.
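
Point (3) is easy to see numerically, for what it's worth (toy numbers, not a real model): superpose many sparse feature activations along random directions and each coordinate of the result already looks quite Gaussian.

  import numpy as np

  rng = np.random.default_rng(4)
  n_features, d, n_samples = 2000, 64, 1000
  directions = rng.normal(size=(n_features, d)) / np.sqrt(d)   # random feature directions

  # Sparse linear superposition: each feature fires rarely, with a random magnitude.
  active = rng.uniform(size=(n_samples, n_features)) < 0.05    # ~100 active per sample
  amps = rng.uniform(size=(n_samples, n_features)) * active
  acts = amps @ directions          # each activation vector sums ~100 feature vectors

  z = (acts[:, 0] - acts[:, 0].mean()) / acts[:, 0].std()
  print("skewness:", round((z**3).mean(), 3))              # near 0 for a Gaussian
  print("excess kurtosis:", round((z**4).mean() - 3, 3))   # near 0 for a Gaussian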


> the original circuits work was done on convolutional vision models

Oh neat, I haven't read that far back. Will add it to the reading list.

To flesh this out a bit, part of why I find circuits less compelling is because it seems intuitive to me that neural networks more or less smoothly blend 'process' and 'state'. As an intuition pump, a vector x matrix matmul in an MLP can be viewed as changing the basis of an input vector (ie the weights act as a process) or as a way to select specific pieces of information from a set of embedding rows (ie the weights act as state).
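
That dual reading is easy to show literally with toy matrices: the same matmul is a table lookup for a one-hot input ("state") and a change of basis for a dense input ("process").

  import numpy as np

  rng = np.random.default_rng(5)
  E = rng.normal(size=(4, 8))            # think: 4 embedding rows of width 8

  one_hot = np.array([0.0, 0.0, 1.0, 0.0])
  dense = rng.normal(size=4)

  # "Weights as state": a one-hot input simply selects row 2 of E.
  print(np.allclose(one_hot @ E, E[2]))  # True

  # "Weights as process": a dense input is re-expressed as a mixture of E's rows,
  # i.e. the very same matrix now acts as a linear map / change of basis.
  print(dense @ E)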

There are architectures that try to separate these out with varying degrees of success -- LSTMs and ResNets seem to have a more clear throughline of 'state' with various 'operations' that are applied to that state in sequence. But that seems really architecture-dependent.

I will openly admit though that I am very willing to be convinced by the circuits paradigm. I have a background in molecular bio and there's something very 'protein pathways' about it.

> Linear representations aren't necessarily in tension with the manifold hypothesis.

True! I suppose I was thinking about a 'strong' form of linear representations, which is something like: features are represented by linear combinations of neurons that display the same repulsion-geometries as observed in Toy Models, but that's not what you're saying / that's me jumping a step too far.

> GANs/VAEs/etc modeling things as a latent gaussian space is actually way more natural if you allow superposition

Superposition is one of those things that has always been so intuitive to me that I can't imagine it not being a part of neural network learning.

But I want to make sure I'm getting my terminology right -- why does superposition necessarily require the linear representation hypothesis? Or, to be more specific, does [individual neurons being used in combination with other neurons to represent more features than neurons] necessarily require [features are linear compositions of neurons]?


> True! I suppose I was thinking about a 'strong' form of linear representations, which is something like: features are represented by linear combinations of neurons that display the same repulsion-geometries as observed in Toy Models, but that's not what you're saying / that's me jumping a step too far.

Note this happens in "uniform superposition". In reality, we're almost certainly in very non-uniform superposition.

One key term to look for is "feature manifolds" or "multi-dimensional features". Some discussion here: https://transformer-circuits.pub/2024/july-update/index.html...

(Note that the term "strong linear representation" is becoming a term of art in the literature referring to the idea that all features are linear, rather than just most or some.)

> I want to make sure I'm getting my terminology right -- why does superposition necessarily require the linear representation hypothesis? Or, to be more specific, does [individual neurons being used in combination with other neurons to represent more features than neurons] necessarily require [features are linear compositions of neurons]?

When you say "individual neurons being used in combination with other neurons to represent more features than neurons", that's a way one might _informally_ talk about superposition, but doesn't quite capture the technical nuance. So it's hard to know the full scope of what you intend. All kinds of crazy things are possible if you allow non-linear features, and it's not necessarily clear what a feature would mean.

Superposition, in the narrow technical sense of exploiting compressed sensing / high-dimensional spaces, requires linear representations and sparsity.
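
A rough numerical illustration of that narrow technical sense (random directions, not a trained model): with nearly orthogonal directions and a linear readout, sparse sets of features can be stored and recovered in fewer dimensions than there are features, and the trick degrades as sparsity goes away.

  import numpy as np

  rng = np.random.default_rng(6)
  n_features, d = 200, 50
  D = rng.normal(size=(n_features, d))
  D /= np.linalg.norm(D, axis=1, keepdims=True)   # nearly-orthogonal feature directions

  def recovery_rate(k_active):
      x = np.zeros(n_features)
      x[rng.choice(n_features, size=k_active, replace=False)] = 1.0
      v = x @ D                        # superpose k features in only d dimensions
      readout = D @ v                  # linear readout for every feature
      top = np.argsort(readout)[-k_active:]   # guess: the k largest readouts
      return np.isin(np.flatnonzero(x), top).mean()

  print("recovery with 5 active features:", recovery_rate(5))    # usually near 1.0
  print("recovery with 40 active features:", recovery_rate(40))  # noticeably worse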


> One key term to look for is "feature manifolds" or "multi-dimensional features"

I should probably read the updates more. Not enough time in the day. But yea the way you're describing feature manifolds and multidimensional features, especially the importance of linearity-in-properties and not necessarily linearity-in-dimensions, makes a lot of sense and is basically how I default think about these things.

> but doesn't quite capture the technical nuance. So it's hard to know the full scope of what you intend.

Fair, I'm only passingly familiar with compressed sensing so I'm not sure I could offer a more technical definition without, like, a much longer conversation! But it's good to know in the future that in a technical sense linear representations and superposition are dependent.

> all features are linear, rather than just most or some

Potentially a tangent, but compared to what? I suppose the natural answer is "non linear features" but has there been anything to suggest that neural networks represent concepts in this way? I'd be rather surprised if they did within a single layer. (Across layers, sure, but that actually starts to pull me more towards circuits)


I was going to comment the same about the Superposition hypothesis [0] when the OP comment mentioned "I've had a lot more success with: * The linear representation hypothesis - The idea that 'concepts' (features) correspond to directions in neural networks", since this one-concept-per-feature idea seems too "basic" to explain some of the learning which NNs can do on datasets. (Update: the OP commenter is, as pointed out by other HN comments, the cofounder of Anthropic, and is behind the superposition research.) On one of our custom-trained neural network models (not an LLM, but audio-based and currently proprietary) we noticed the same thing: the model was able to "overfit" on a large amount of data despite having few parameters relative to the size of the dataset (and that too with dropout in early layers).

[0] https://www.anthropic.com/research/superposition-memorizatio...


This has mirrored my experience attempting to "apply" topology in real world circumstances, off and on since I first studied topology in 2011.

I even hesitate now at the common refrain "real world data approximates a smooth, low dimensional manifold." I want to spend some time really investigating to what extent this claim actually holds for real world data, and to what extent it is distorted by the dimensionality reduction method we apply to natural data sets in order to promote efficiency. But alas, who has the time?


I think it's interesting that in physics, different global symmetries (topological manifolds) can satisfy the same metric structure (local geometry). For example, the same metric tensor solution to Einstein's field equation can exist on topologically distinct manifolds. Conversely, looking at solutions to the Ising Model, we can say that the same lattice topology can have many different solutions, and when the system is near a critical point, the lattice topology doesn't even matter.

It's only an analogy, but it does suggest at least that the interesting details of the dynamics aren't embedded in the topology of the system. It's more complicated than that.


If you like symmetry, you might enjoy how symmetry falls out of circuit analysis of conv nets here:

https://distill.pub/2020/circuits/equivariance/


Thanks for this additional link, which really underscores for me at least how you're right about patterns in circuits being a better abstraction layer for capturing interesting patterns than topological manifolds.

I wasn't familiar with the term "equivariance" but I "woke up" to this sort of approach to understanding deep neural networks when I read this paper, which shows how restricted Boltzmann machines have an exact mapping to the renormalization group approach used to study phase transitions in condensed matter and high energy physics:

https://arxiv.org/abs/1410.3831

At high enough energy, everything is symmetric. As energy begins to drain from the system, eventually every symmetry is broken. All fine structure emerges from the breaking of some symmetries.

I'd love to get more in the weeds on this work. I'm in my own local equilibrium of sorts doing much more mundane stuff.


That earlier post had a few small HN discussions (for those interested):

Neural Networks, Manifolds, and Topology (2014) - https://news.ycombinator.com/item?id=19132702 - Feb 2019 (25 comments)

Neural Networks, Manifolds, and Topology (2014) - https://news.ycombinator.com/item?id=9814114 - July 2015 (7 comments)

Neural Networks, Manifolds, and Topology - https://news.ycombinator.com/item?id=7557964 - April 2014 (29 comments)


Loved these posts and they inspired a lot of my research and directions during my PhDs.

For anyone interested in these, may I also suggest learning about normalizing flows? (They are the broader class that flow matching belongs to.) They are learnable networks that learn coordinate changes, so the connection to geometry/topology is much more obvious. Of course, the downside of flows is that you're stuck with a constant dimension (well... sorta), but I still think they can help you understand a lot more of what's going on, because you are working in a more interpretable environment.
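
If it helps anyone get started, the core mechanic fits in a few lines (a single untrained affine coupling layer on 2D points, just to show the coordinate-change bookkeeping): an invertible map plus the change-of-variables log-determinant is really all a flow is.

  import numpy as np

  rng = np.random.default_rng(7)
  w_s, b_s, w_t, b_t = rng.normal(size=4)   # parameters of one coupling layer

  def forward(x):
      """x1 passes through; x2 is scaled/shifted by functions of x1."""
      x1, x2 = x
      s, t = np.tanh(w_s * x1 + b_s), w_t * x1 + b_t
      return np.array([x1, x2 * np.exp(s) + t]), s   # log|det Jacobian| = s

  def inverse(y):
      y1, y2 = y
      s, t = np.tanh(w_s * y1 + b_s), w_t * y1 + b_t
      return np.array([y1, (y2 - t) * np.exp(-s)])

  x = rng.normal(size=2)
  z, log_det = forward(x)
  print(np.allclose(inverse(z), x))        # True: an exact, invertible coordinate change

  # Change of variables: log p_X(x) = log N(z; 0, I) + log|det Jacobian|
  log_p = -0.5 * np.sum(z**2) - np.log(2 * np.pi) + log_det
  print("log-density of x under this (untrained) flow:", round(log_p, 3))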


hey chris, I found your posts quite inspiring back then, with very poetic ideas. cool to see you follow up here!

My guess is that the linear representation hypothesis is only approximately right, in the sense that my expectation is that it is more like a Lie group: locally flat, but the concept breaks down at some point. Note that I am a mathematician who knows very little about machine learning apart from taking a few classes at uni.

The linear representation hypothesis is quite intriguing; I am curious what the intuition behind it was.

See https://transformer-circuits.pub/2022/toy_model/index.html#m...

If you're new to this, I'd mostly just look at all the empirical examples.

The slightly harder thing is to consider the fact that neural networks are made of linear functions with non-linearities between them, and to try to think about when linear directions will be computationally natural as a result.


Consider looking into fields related to machine learning to see how topology is used there. The main problem is that some of the cool math did not survive the transition to CS, e.g. the math for control theory is not quite present in RL.

In terms of topology, control theory has some very cool topological interpretations, e.g. toruses appear quite a bit in control theory.


A few comments on this thread:

Gwern is correct in his prior quote of how long these articles took. I think 50-200 hours is a pretty good range.

I expect AI assistants could help quite a bit with implementing the interactive diagrams, which was a significant fraction of this time. This is especially true for authors without a background in web development.

However, a huge amount of the editorial time went into other things. This article was a best-case scenario for an article not written by the editors themselves. Gabriel is phenomenal and was a delight to work with. The editors didn't write any code for this article that I remember. But we still spent many tens of hours giving feedback on the text and diagrams. You can see some of this on GitHub - e.g. https://github.com/distillpub/post--momentum/issues?q=is%3Ai...

More broadly, we struggled a lot with procedural issues. (We wrote a bit about this here: https://distill.pub/2021/distill-hiatus/ ) In retrospect, I deeply regret trying to run Distill with the expectations of a scientific journal, rather than the freedom of a blog, or wish I'd pushed back more on process. Not only did it occupy enormous amounts of time and energy, but it was just very de-energizing. I wanted to spend my time writing great articles and helping people write great articles.

(I was recently reading Thompson & Klein's Abundance, and kept thinking back to my experiences with Distill.)


Huge fan of Distill here (and your personal blog).

> In retrospect, I deeply regret trying to run Distill with the expectations of a scientific journal, rather than the freedom of a blog, or wish I'd pushed back more on process. Not only did it occupy enormous amounts of time and energy, but it was just very de-energizing.

Scientific peer review pretty much always is incredibly draining, and (assuming the initial draft is worth publishing) it rarely adds more than a few percent to the quality of the article. However, newcomers are drowning in a sea of low quality SEO spam (if they bother to search & read blogs at all and don't go straight to their LLMs, which tend to regurgitate the same rubbish). The insistence on scientific peer review created a brand, which to this day allows me to blindly recommend Distill articles to people that I am training or teaching. So I, for one, am incredibly grateful that you went the extra-mile(s).


Thanks for the feedback! I'm one of the authors.

I just wanted to make sure you noticed that this is linking to an accessible blog post that's trying to communicate a research result to a non-technical audience?

The actual research result is covered in two papers which you can find here:

- Methods paper: https://transformer-circuits.pub/2025/attribution-graphs/met...

- Paper applying this method to case studies in Claude 3.5 Haiku: https://transformer-circuits.pub/2025/attribution-graphs/bio...

These papers are jointly 150 pages and are quite technically dense, so it's very understandable that most commenters here are focusing on the non-technical blog post. But I just wanted to make sure that you were aware of the papers, given your feedback.


The post to which you replied states:

  Anthropomorphing[sic] seems to be in an overdose mode with 
  "thinking / thoughts", "mind" etc., scattered everywhere. 
  Nothing with any of the LLMs outputs so far suggests that 
  there is anything even close enough to a mind or a thought 
  or anything really outside of vanity.
This is supported by reasonable interpretation of the cited article.

Considering the two following statements made in the reply:

  I'm one of the authors.
And

  These papers are jointly 150 pages and are quite 
  technically dense, so it's very understandable that most 
  commenters here are focusing on the non-technical blog post.
The onus of clarifying the article's assertions:

  Knowing how models like Claude *think* ...
And

  Claude sometimes thinks in a conceptual space that is 
  shared between languages, suggesting it has a kind of 
  universal “language of thought.”
As it pertains to anthropomorphizing an algorithm (a.k.a. stating it "thinks") is on the author(s).


Thinking and thought have no solid definition. We can't say Claude doesn't "think" because we don't even know what a human thinking actually is.

Given the lack of a solid definition of thinking and a test to measure it, I think using the terminology colloquially is totally fair play.


I view LLMs as valuable algorithms capable of generating relevant text based on queries given to them.

> Thinking and thought have no solid definition. We can't say Claude doesn't "think" because we don't even know what a human thinking actually is.

I did not assert:

  Claude doesn't "think" ...
What I did assert was that the onus is on the author(s) who write articles/posts such as the one cited to support their assertion that their systems qualify as "thinking" (for any reasonable definition of same).

Short of author(s) doing so, there is little difference between unsupported claims of "LLM's thinking" and 19th century snake oil[0] salesmen.

0 - https://en.wikipedia.org/wiki/Snake_oil


No one says that a thermostat is "thinking" of turning on the furnace, or that a nightlight is "thinking it is dark enough to turn the light on". You are just being obtuse.


Yes. A thermostat involves a change of state from A to B. A computer is the same: its state at t causes its state at t+1, which causes its state at t+2, and so on. Nothing else is going on. An LLM is no different: an LLM is simply a computer that is going through particular states.

Thought is not the same as a change of (brain) state. Thought is certainly associated with change of state, but can't be reduced to it. If thought could be reduced to change of state, then the validity/correctness/truth of a thought could be judged with reference to its associated brain state. Since this is impossible (you don't judge whether someone is right about a math problem or an empirical question by referring to the state of his neurology at a given point in time), it follows that an LLM can't think.


>Thought is certainly associated with change of state, but can't be reduced to it.

You can effectively reduce continuous dynamical systems to discrete steps. Sure, you can always say that the "magic" exists between the arbitrarily small steps, but from a practical POV there is no difference.

A transistor has a binary on or off. A neuron might have ~infinite~ levels of activation.

But in reality the ~infinite~ activation level can be perfectly modeled (for all intents and purposes), and computers have been doing this for decades now (maybe not with neurons, but equivalent systems). It might seem like an obvious answer, that there is special magic in analog systems that binary machines cannot access, but that is wholly untrue. Science and engineering have been extremely successful interfacing with the analog reality we live in, precisely because the digital/analog barrier isn't too big of a deal. Digital systems can do math, and math is capable of modeling analog systems, no problem.


It's not a question of discrete vs continuous, or digital vs analog. Everything I've said could also apply if a transistor could have infinite states.

Rather, the point is that the state of our brain is not the same as the content of our thoughts. They are associated with one another, but they're not the same. And the correctness of a thought can be judged only by reference to its content, not to its associated state. 2+2=4 is correct, and 2+2=5 is wrong; but we know this through looking at the content of these thoughts, not through looking at the neurological state.

But the state of the transistors (and other components) is all a computer has. There are no thoughts, no content, associated with these states.


It seems that the only barrier between brain state and thought contents is a proper measurement tool and decoder, no?

We can already do this at an extremely basic level, mapping brain states to thoughts. The paraplegic patient using their thoughts to move the mouse cursor or the neuroscientist mapping stress to brain patterns.

If I am understanding your position correctly, it seems that the differentiation between thoughts and brain states is a practical problem not a fundamental one. Ironically, LLMs have a very similar problem with it being very difficult to correlate model states with model outputs. [1]

[1]https://www.anthropic.com/research/mapping-mind-language-mod...


There is undoubtedly correlation between neurological state and thought content. But they are not the same thing. Even if, theoretically, one could map them perfectly (which I doubt is possible but it doesn't affect my point), they would remain entirely different things.

The thought that "2+2=4", or the thought "tiger", are not the same thing as the brain states that make them up. A tiger, or the thought of a tiger, is different from the neurological state of a brain that is thinking about a tiger. And as stated before, we can't say that "2+2=4" is correct by referring to the brain state associated with it. We need to refer to the thought itself to do this. It is not a practical problem of mapping; it is that brain states and thoughts are two entirely different things, however much they may correlate, and whatever causal links may exist between them.

This is not the case for LLMs. Whatever problems we may have in recording the state of the CPUs/GPUs are entirely practical. There is no 'thought' in an LLM, just a state (or plurality of states). An LLM can't think about a tiger. It can only switch on LEDs on a screen in such a way that we associate the image/word with a tiger.


> The thought that "2+2=4", or the thought "tiger", are not the same thing as the brain states that make them up.

Asserted without evidence. Yes, this does represent a long and occasionally distinguished line of thinking in cognitive science/philosophy of mind, but it is certainly not the only one, and some of the others categorically refute this.


Is it your contention that a tiger may be the same thing as a brain state?

It would seem to me that any coherent philosophy of mind must accept their being different as a datum; or conversely, any that implied their not being different would have to be false.

EDIT: my position has been held -- even taken as axiomatic -- by the vast majority of philosophers, from the pre-Socratics onwards, and into the 20th century. So it's not some idiosyncratic minority position.


Clearly there is a thing in the world that is a tiger independently of any brain state anywhere.

But the thought of a tiger may in fact be identical to a brain state (or it might not; at this point we do not know).


Given that a tiger is different from a brain state:

If I am thinking about a tiger, then what I am thinking about is not my brain state. So that which I am thinking about is different from (as in, cannot be identified with) my brain state.


> What I am thinking about is not my brain state

Obviously the thing you are thinking about is not the same as your thinking about it, nor the same as your brain state when thinking about it. Thinking about a thing is necessarily and definitionally distinct from the thing.

The question however is whether there is anything to "thinking about thing" other than the brain state you have when doing so. This is unknown at this time.


Earlier upthread, I said

>> the thought "tiger" [is] not the same thing as the brain state that makes [it] up.

To which you said

> Asserted without evidence.

This was in the context of my saying

>> There is undoubtedly correlation between neurological state and thought content. But they are not the same thing.

Now you say

> the thing you are thinking about is not the same as your thinking about it, nor the same as your brain state when thinking about it.

Are we at least agreed that the content of the thought "tiger" is not the same thing as the brain state that makes it up?

> The question however is whether there is anything to "thinking about thing" other than the brain state you have when doing so. This is unknown at this time.

If a tiger is distinct from a brain state, which I think we agree on, and if our thoughts are about real things such as tigers, which I assume we agree on, then how can there not be more to thought than the associated brain state?


> Are we at least agreed that the content of the thought "tiger" is not the same thing as the brain state that makes it up?

No. I don't agree that "the content of [a] thought" is something we can usefully talk about in this context.

Thoughts are subjective experiences, more or less identical to qualia. Thinking about a tiger is actually having the experience of thinking about a tiger, and this is purely subjective, like all qualia. The only question I can see worth asking about it is whether the experience of thinking about a tiger has some component to it that is not part of a fully described brain state.

> If a tiger is distinct from a brain state, which I think we agree on, and if our thoughts are about real things such as tigers,

We also have thoughts about unreal things. I don't see why such thoughts should be any different than the ones we have about real things.


>> If a tiger is distinct from a brain state, which I think we agree on, and if our thoughts are about real things such as tigers, which I assume we agree on, then how can there not be more to thought than the associated brain state?

> We also have thoughts about unreal things. I don't see why such thoughts should be any different than the ones we have about real things.

Let me rephrase then:

If a tiger is distinct from a brain state, which I think we agree on, and if our thoughts can be about real things such as tigers, which I assume we agree on, then how can there not be more to thought than the associated brain state?

A brain state does not refer to a tiger.


I realize I'm butting in on an old debate, but thinking about this caused me to come to conclusions which were interesting enough that I had to write them down somewhere.

I'd argue that rather than thoughts containing extra contents which don't exist in brain states, it's more the case that brain states contain extra content which doesn't exist in thoughts. Specifically, I think that "thoughts" are a lossy abstraction that we use to reason about brain states and their resulting behaviors, since we can't directly observe brain states and reasoning about them would be very computationally intensive.

As far as I've seen, you have argued that thoughts "refer" to real things, and that thoughts can be "correct" or "incorrect" in some objective sense. I'll argue against the existence of a singular coherent concept of "referring", and also that thoughts can be useful without needing to be "correct" in some sense which brain states cannot participate in. I'll be assuming that something only exists if we can (at least in theory if not in practice) tie it back to observable behavior.

First, I'll argue that the "refers" relation is a pretty incoherent concept which sometimes happens to work. Let us think of a particular person who has a thought/brain state about a particular tiger in mind/brain. If the person has accurate enough information about the tiger, then they will recognize the tiger on sight, and may behave differently around that tiger than other tigers. I would say in this case that the person's thoughts refer to the tiger. This is the happy case where the "refers" relation is a useful aid to predicting other people's behavior.

Now let us say that the person believes that the tiger ate their mother, and that the tiger has distinctive red stripes. However, let it be the case that the person's mother was eaten by a tiger, but that tiger did not have red stripes. Separately, there does exist a singular tiger in the world which does have red stripes. Which tiger does the thought "a tiger with red stripes ate my mother" refer to?

I think it's obvious that this thought doesn't coherently refer to any tiger. However, that doesn't prevent the thought from affecting the person's behavior. Perhaps the person's next thought is to "take revenge on the tiger that killed my mother". The person then hunts down and kills the tiger with the red stripes. We might be tempted to believe that this thought refers to the mother killing tiger, but the person has acted as though it referred to the red striped tiger. However, it would be difficult to say that the thought refers to the red striped tiger either, since the person might not kill the red striped tiger if they happen to learn said tiger has an alibi. Hopefully this is sufficient to show that the "refers" relationship isn't particularly connected to observable behavior in many cases where it seems like it should be. The connection would exist if everyone had accurate and complete information about everything, but that is certainly not the world we live in.

I can't prove that the world is fully mechanical, but if we assume that it is, then all of the above behavior could in theory be predicted by just knowing the state of the world (including brain states but not thoughts) and stepping a simulation forward. Thus the concept of a brain state is more helpful to predicting their behavior than thoughts with a singular concept of "refers". We might be able to split the concept of "referring" up into other concepts for greater predictive accuracy, but I don't see how this accuracy could ever be greater than just knowing the brain state. Thus if we could directly observe brain states and had unlimited computational power, we probably wouldn't bother with the concept of a "thought".

Now then, on to the subject of correctness. I'd argue that thoughts can be useful without needing a central concept of correctness. The mechanism is the very category theory like concept of considering all things only in terms of how they relate to other things, and then finding other (possibly abstract) objects which have the same set of relationships.

For concreteness, let us say that we have piles of apples and are trying to figure out how many people we can feed. Let us say that today we have two piles each consisting of two apples. Yesterday we had a pile of four apples and could feed two people. The field of appleology is quite new, so we might want to find some abstract objects in the field of math which have the same relationship. Cutting edge appleology research shows that as far as hungry people are concerned, apple piles can be represented with natural numbers, and taking two apple piles and combining them results in a pile equivalent to adding the natural numbers associated with the piles being combined. We are short on time, so rather than combining the piles, we just think about the associated natural numbers (2 and 2), and add them (4) to figure out that we can feed two people today. Thus the equation (2+2=4) was useful because pile 1 combined with pile 2 is related to yesterday pile in the same way that 2 + 2 relates to 4.

Math is "correct" only insofar as it is consistent. That is, if you can arrive at a result using two different methods, you should find that the result is the same regardless of the method chosen. Similarly, reality is always consistent, because, assuming that your behavior hasn't affected the situation (and what is considered the situation doesn't include your brain state), it doesn't matter how or even if you reason about the situation, the situation just is what it is. So the reason math is useful is because you can find abstract objects (like numbers) which relate to each other in the same way as parts of reality (like piles of apples). By choosing a conventional math, we save ourselves the trouble of having to reason about some set of relationships all over again every time that set of relationships occurs. Instead we simply map the objects to objects in the conventional math which are related in the same manner. However, there is no singular "correct" math, as can be shown by the fact that mathematics can be defined in terms of set theory + first order logic, type theory, or category theory. Even an inconsistent math such as set theory before Russell's Paradox can still often produce useful results as long as one's line of reasoning doesn't happen to trip on the inconsistency. However, tripping on an inconsistency will produce a set of relationships which cannot exist in the real world, which gives us a reason to think of consistent maths as being "correct". Consistent maths certainly are more useful.

Brain states can also participate in this model of correctness though. Brain states are related to each other, and if these relationships are the same as the relationships between external objects, then the relationships can be used to predict events occurring in the world. One can think of math and logic as mechanisms to form brain states with the consistent relationships needed to accurately model the world. As with math though, even inconsistent relationships can be fine as long as those inconsistencies aren't involved in reasoning about a thing, or predicting a thing isn't the point (take scapegoating for instance).

Sorry for the ramble. I'll summarize:

TL;DR: Thoughts don't contain "refers" and "correctness" relationships in any sense that brain states can't. The concept of "refers" is only usable to predict behavior if people have accurate and complete information about the things they are thinking about. However, brain states predict behavior regardless of how accurate or complete the information the person has is. The concept of "correctness" in math/logic really just means that the relationship between mathematical objects is consistent. We want this because the relationships between parts of reality seem to be consistent, and so if we desire the ability to predict things using abstract objects, the relationships between abstract objects must be consistent as well. However, brain states can also have consistent patterns of relationships, and so can be correct in the same sense.


Thanks for the response. I don't know if I'll have time to respond, I may, but in any case it's always good to write one's thoughts down.


Does a picture of a tiger or a tiger (to follow your sleight of hand) on a hard drive then count as a thought?


No. One is paint on canvas, and the other is part of a causal chain that makes LEDs light up in a certain way. Neither the painting nor the computer have thoughts about a tiger in the way we do. It is the human mind that makes the link between picture and real tiger (whether on canvas or on a screen).


>Rather, the point is that the state of our brain is not the same as the content of our thoughts.

Based on what exactly? This is just an assertion. One that doesn't seem to have much in the way of evidence. 'It's not the same, trust me bro' is the thesis of your argument. Not very compelling.


It's not difficult. When you think about a tiger, you are not thinking about the brain state associated with said thought. A tiger is different from a brain state.

We can safely generalize, and say the content of a thought is different from its associated brain state.

Also, as I said

>> The correctness of a thought can be judged only by reference to its content, not to its associated state. 2+2=4 is correct, and 2+2=5 is wrong; but we know this through looking at the content of these thoughts, not through looking at the neurological state.

This implies that state != content.


>It's not difficult. When you think about a tiger, you are not thinking about the brain state associated with said thought. A tiger is different from a brain state. We can safely generalize, and say the content of a thought is different from its associated brain state.

Just because you are not thinking about a brain state when you think about a tiger does not mean that your thought is not a brain state.

Just because the experience of thinking about X doesn't feel like the experience of thinking about Y (or doesn't feel like the physical process Z), it doesn't logically follow that the mental event of thinking about X isn't identical to or constituted by the physical process Z. For example, seeing the color red doesn't feel like processing photons of a specific wavelength with cone cells and neural pathways, but that doesn't mean the latter isn't the physical basis of the former.

>> The correctness of a thought can be judged only by reference to its content, not to its associated state. 2+2=4 is correct, and 2+2=5 is wrong; but we know this through looking at the content of these thoughts, not through looking at the neurological state. This implies that state != content.

Just because our current method of verification focuses on content doesn't logically prove that the content isn't ultimately realized by or identical to a physical state. It only proves that analyzing the state is not our current practical method for judging mathematical correctness.

We judge if a computer program produced the correct output by looking at the output on the screen (content), not usually by analyzing the exact pattern of voltages in the transistors (state). This doesn't mean the output isn't ultimately produced by, and dependent upon, those physical states. Our method of verification doesn't negate the underlying physical reality.

When you evaluate "2+2=4", your brain is undergoing a sequence of states that correspond to accessing the representations of "2", "+", "=", applying the learned rule (also represented physically), and arriving at the representation of "4". The process of evaluation operates on the represented content, but the entire process, including the representation of content and rules, is a physical neural process (a sequence of brain states).


> Just because you are not thinking about a brain state when you think about a tiger does not mean that your thought is not a brain state.

> It doesn't logically follow that the mental event of thinking about X isn't identical to or constituted by the physical process Z.

That's logically sound insofar as it goes. But firstly, the existence of a brain state for a given thought is, obviously, not proof that a thought is a brain state. Secondly, if you say that a thought about a tiger is a brain state, and nothing more than a brain state, then you have the problem of explaining how it is that your thought is about a tiger at all. It is the content of a thought that makes it be about reality; it is the content of a thought about a tiger that makes it be about a tiger. If you declare that a thought is its state, then it can't be about a tiger.

You can't equate content with state, and nor can you make content be reducible to state, without absurdity. The first implies that a tiger is the same as a brain state; the second implies that you're not really thinking about a tiger at all.

Similarly for arithmetic. It is only the content of a thought about arithmetic that makes it be right or wrong. It is our ideas of "2", "+", and so on, that make the sum right or wrong. The brain states have nothing to do with it. If you want to declare that content is state, and nothing more than state, then you have no way of saying the one sum is right, and the other is wrong.


Please, take the pencil and draw the line between thinking and non-thinking systems. Hell I'll even take a line drawn between thinking and non-thinking organisms if you have some kind of bias towards sodium channel logic over silicon trace logic. Good luck.


Even if you can't define the exact point that A becomes not-A, it doesn't follow that there is no distinction between the two. Nor does it follow that we can't know the difference. That's a pretty classic fallacy.

For example, you can't name the exact time that day becomes night, but it doesn't follow that there is no distinction.

A bunch of transistors being switched on and off, no matter how many there are, is no more an example of thinking than a single thermostat being switched on and off. OTOH, if we can't think, then this conversation and everything you're saying and "thinking" is meaningless.

So even without a complete definition of thought, we can see that there is a distinction.


> For example, you can't name the exact time that day becomes night, but it doesn't follow that there is no distinction.

There is actually a very detailed set of definitions of the multiple stages of twilight, including the last one which defines the onset of what everyone would agree is "night".

The fact that a phenomenon shows a continuum by some metric does not mean that it is not possible to identify and label points along that continuum and attach meaning to them.


Looks like we replied to each other's comments at the same time, haha


Your assertion that sodium channel logic and silicon trace logic are 100% identical is the primary problem. It's like claiming that a hydraulic cylinder and a bicep are 100% equivalent because they both lift things - they are not the same in any way.


People chronically get stuck in this pit. Math is substrate-independent. If a process is physical (i.e., it doesn't draw on magic), then it can be expressed with mathematics. If it can be expressed with mathematics, anything that does math can compute it.

The math is putting the crate up on the rack. The crate doesn't act any differently based on how it got up there.


Or submarines swim ;)


think about it more


Honestly, arguing seems futile when it comes to opinions like GP's. Those opinions resemble religious zealotry to me in that they take for granted that only humans can think. Any determinism of any kind in a non-human is seized upon as proof it's mere clockwork, yet they can't explain how humans think in order to contrast it.


> Honestly, arguing seems futile when it comes to opinions like GP's. Those opinions resemble religious zealotry to me in that they take for granted that only humans can think. Any determinism of any kind in a non-human is seized upon as proof it's mere clockwork, yet they can't explain how humans think in order to contrast it.

Putting aside the ad hominems, projections, and judgements, here is a question for you:

If I made a program where an NPC[0] used the A-star[1] algorithm to navigate a game map, including avoiding obstacles and using the shortest available path to reach its goal, along with identifying secondary goal(s) should there be no route to the primary goal, does that qualify to you as the NPC "thinking"?

0 - https://en.wikipedia.org/wiki/Non-player_character

1 - https://en.wikipedia.org/wiki/A*_search_algorithm
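
For concreteness, here is roughly what such a pathfinder looks like. This is a minimal, illustrative A* sketch on a 4-connected grid; the grid format and function names are made up for the example, not taken from any particular game engine:

  import heapq

  def a_star(grid, start, goal):
      # grid: 2D list where 0 = open cell, 1 = obstacle; start/goal: (row, col)
      def h(cell):  # Manhattan-distance heuristic
          return abs(cell[0] - goal[0]) + abs(cell[1] - goal[1])

      frontier = [(h(start), 0, start, [start])]  # (priority, cost so far, cell, path)
      seen = set()
      while frontier:
          _, cost, cell, path = heapq.heappop(frontier)
          if cell == goal:
              return path  # shortest available path
          if cell in seen:
              continue
          seen.add(cell)
          r, c = cell
          for nr, nc in [(r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)]:
              if 0 <= nr < len(grid) and 0 <= nc < len(grid[0]) and grid[nr][nc] == 0:
                  heapq.heappush(frontier, (cost + 1 + h((nr, nc)), cost + 1,
                                            (nr, nc), path + [(nr, nc)]))
      return None  # no route to the primary goal; a caller could fall back to secondary goals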


Answer: I suppose no? But my point is only this:

1. People with "AI isn't thinking" opinions move the goalposts (the borderline between "just following a deterministic algorithm" and "thinking") wherever needed in order to be right.

2. I argue that the brain itself must either be deterministic (just wildly complex) or, for lack of a better word, supernatural. If it's not deterministic, only God knows how our thinking process works. Every single person postulating about whether AI is "thinking" cannot fully explain why a human chooses a particular action, just as AI researchers can't explain why Claude does a certain thing in all scenarios. Therefore they are much more similar than they are different.

3. But really, the important thing is, unless you're approaching this from a religious POV (which is arguably much more interesting) the obsessive sorting of highly complex and not-even-remotely-fully-understood processes into "thinking" and "NOT thinking" groups is pointless and silly.


> 1. People with "AI isn't thinking" opinions move the goalposts (the borderline between "just following a deterministic algorithm" and "thinking") wherever needed in order to be right.

I did not present an opinion regarding whether "AI thinks" or not, but instead said:

  The onus of clarifying the article's assertions ...

  As it pertains to anthropomorphizing an algorithm (a.k.a. 
  stating it "thinks") is on the author(s).
As to the concept of thinking, regardless of the entity considered, I proffer that the topic is a philosophical one, having no "right or wrong" answer so much as an opportunity to deepen the enlightenment of those who contemplate the question.


Really appreciate your team's enormous efforts in this direction, not only the cutting-edge research (which I don't see OAI/DeepMind publishing papers on) but also making the content more digestible for a non-research audience. Please keep up the great work!


I, uh, think that "think" is a fine metaphor but "planning ahead" is a pretty confusing one. It doesn't have the capability to plan ahead because there is nowhere to put a plan and no memory after the token output, assuming the usual model architecture.

That's like saying a computer program has planned ahead if it's at the start of a function and there's more of the function left to execute.


> The obvious way to deal with this would be to send forward some of the internal activations as well as the generated words in the autoregressive chain.

Hi! I lead interpretability research at Anthropic.

That's a great intuition, and in fact the transformer architecture actually does exactly what you suggest! Activations from earlier time steps are sent forward to later time steps via attention. (This is another thing that's lost in the "models just predict the next word" framing.)

This actually has interesting practical implications -- for example, in some sense, it's the deep reason costs can sometimes be reduced via "prompt caching".
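
To sketch the idea, here is a toy single-head example (made-up shapes, no learned projections) showing why a cached prefix never needs recomputing:

  import torch

  d = 16
  prefix_k = torch.randn(10, d)  # keys computed once for the 10 prompt tokens
  prefix_v = torch.randn(10, d)  # values for those same tokens

  def decode_step(q_new, k_new, v_new, cached_k, cached_v):
      # The new position attends over the cached prefix plus itself; nothing
      # about the prefix has to be recomputed.
      k = torch.cat([cached_k, k_new[None]], dim=0)
      v = torch.cat([cached_v, v_new[None]], dim=0)
      attn = torch.softmax(q_new @ k.T / d ** 0.5, dim=-1)
      return attn @ v, k, v  # output for the new token, plus the updated cache

  out, prefix_k, prefix_v = decode_step(torch.randn(d), torch.randn(d),
                                        torch.randn(d), prefix_k, prefix_v)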


I'm more of a vision person and haven't looked much into NLP transformers, but is this because the attention is masked to only allow each query to look at keys/values from its own past? So when we are at token #5, token #3's query cannot attend to token #4's info? And hence the previously computed attention values and activations remain the same and can be cached, because they would be the same in the new forward pass anyway?


Yep, that’s right!

If you want to be precise, there are "autoregressive transformers" and "bidirectional transformers". Bidirectional is a lot more common in vision. In language models, you do see bidirectional models like BERT, but autoregressive is dominant.
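
A toy sketch of that causal mask (random activations standing in for real ones, and no learned projections, purely for illustration):

  import torch

  T, d = 5, 8
  x = torch.randn(T, d)  # activations at 5 token positions
  q, k, v = x, x, x      # skip the learned projections for brevity

  scores = q @ k.T / d ** 0.5  # (T, T) attention scores
  mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
  scores = scores.masked_fill(mask, float("-inf"))  # position i can't see positions > i
  out = scores.softmax(-1) @ v

  # Because position 3 never attends to position 4, out[:4] is identical whether
  # or not the 5th token exists, which is exactly what makes caching valid.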


Hi! I'm one of the authors.

There certainly are many interesting parallels here. I often think about this from the perspective of systems biology, in Uri Alon's tradition. There is a range of graphs in biology with excitatory and inhibitory edges -- transcription networks, protein networks, networks of biological neurons -- and one can study recurring motifs that turn up in these networks and try to learn from them.

It wouldn't be surprising if some lessons from that work may also transfer to artificial neural networks, although there are some technical things to consider.


Agreed! So many emergent systems in nature achieve complex outcomes without central coordination - from the cellular level to ant colonies & beehives. There are bound to be implications for designed systems.

Closely following what you guys are uncovering through interpretability research - not just accepting LLMs as black boxes. Thanks to you & the team for sharing the work with humanity.

Interpretability is the most exciting part of AI research for its potential to help us understand what’s in the box. By way of analogy, centuries ago farmers’ best hope for good weather was to pray to the gods! The sooner we escape the “praying to the gods” stage with LLMs the more useful they become.


This all feels reminiscent of the principle of least action found in physics.


Hi! I lead interpretability research at Anthropic. I also used to do a lot of basic ML pedagogy (https://colah.github.io/). I think this post and its children have some important questions about modern deep learning and how it relates to our present research, and wanted to take the opportunity to try and clarify a few things.

When people talk about models "just predicting the next word", this is a popularization of the fact that modern LLMs are "autoregressive" models. This actually has two components: an architectural component (the model generates words one at a time), and a loss component (it maximizes probability).

As the parent says, modern LLMs are finetuned with a different loss function after pretraining. This means that in some strict sense they're no longer autoregressive models – but they do still generate text one word at a time. I think this really is the heart of the "just predicting the next word" critique.

This brings us to a debate which goes back many, many years: what does it mean to predict the next word? Many researchers, including myself, have believed that if you want to predict the next word really well, you need to do a lot more. (And with this paper, we're able to see this mechanistically!)

Here's an example, which we didn't put in the paper: How does Claude answer "What do you call someone who studies the stars?" with "An astronomer"? In order to predict "An" instead of “A”, you need to know that you're going to say something that starts with a vowel next. So you're incentivized to figure out one word ahead, and indeed, Claude realizes it's going to say astronomer and works backwards. This is a kind of very, very small scale planning – but you can see how even just a pure autoregressive model is incentivized to do it.
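
One way to see this incentive concretely with an open model (a rough sketch; gpt2 here is only a stand-in for illustration, and such a small model may not actually prefer "An" for this prompt):

  import torch
  from transformers import AutoModelForCausalLM, AutoTokenizer

  tok = AutoTokenizer.from_pretrained("gpt2")
  model = AutoModelForCausalLM.from_pretrained("gpt2")

  prompt = "Q: What do you call someone who studies the stars?\nA:"
  ids = tok(prompt, return_tensors="pt").input_ids
  with torch.no_grad():
      probs = model(ids).logits[0, -1].softmax(-1)  # distribution over the next token

  for word in [" An", " A"]:
      first_id = tok(word, add_special_tokens=False).input_ids[0]
      print(repr(word), float(probs[first_id]))  # probability of each candidate's first token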


Thanks for commenting, I like the example because it's simple enough to discuss. Isn't it more accurate to say not that Claude "realizes it's going to say astronomer" or "knows that it's going to say something that starts with a vowel", but rather that the next token (or, more pedantically, the vector which gets reduced down to a token) is generated based on activations that correlate to the "astronomer" token, which is correlated to the "an" token, causing that to also be a more likely output?

I kind of see why it's easy to describe it colloquially as "planning", but it isn't really going ahead and then backtracking; it's almost indistinguishable from the computation that happens when the prompt is "What is the indefinite article to describe 'astronomer'?", i.e. the activation "astronomer" is already baked in by the prompt "someone who studies the stars", albeit at one level of indirection.

The distinction feels important to me because I think for most readers (based on other comments) the concept of "planning" seems to imply the discovery of some capacity for higher-order logical reasoning which is maybe overstating what happens here.


Thank you. In my mind, "planning" doesn’t necessarily imply higher-order reasoning but rather some form of search, ideally with backtracking. Of course, architecturally, we know that can’t happen during inference. Your example of the indefinite article is a great illustration of how this illusion of planning might occur. I wonder if anyone at Anthropic could compare the two cases (some sort of minimal/differential analysis) and share their insights.


I used the astronomer example earlier as the most simple, minimal version of something you might think of as a kind of microscopic form of "planning", but I think that at this point in the conversation, it's probably helpful to switch to the poetry example in our paper:

https://transformer-circuits.pub/2025/attribution-graphs/bio...

There are several interesting properties:

- Something you might characterize as "forward search" (generating candidates for the word at the end of the next line, given rhyming scheme and semantics)

- Representing those candidates in an abstract way (the features active are general features for those words, not "motor features" for just saying that word)

- Holding many competing/alternative candidates in parallel.

- Something you might characterize as "backward chaining", where you work backwards from these candidates to "write towards them".

With that said, I think it's easy for these discussions to fall into philosophical arguments about what things like "planning" mean. As long as we agree on what is going on mechanistically, I'm honestly pretty indifferent to what we call it. I spoke to a wide range of colleagues, including at other institutions, and there was pretty widespread agreement that "planning" was the most natural language. But I'm open to other suggestions!


Thanks for linking to this semi-interactive thing, but ... it's completely incomprehensible. :o (edit: okay, after reading about CLT it's a bit less alien.)

I'm curious where is the state stored for this "planning". In a previous comment user lsy wrote "the activation >astronomer< is already baked in by the prompt", and it seems to me that when the model generates "like" (for rabbit) or "a" (for habit) those tokens already encode a high probability for what's coming after them, right?

So each token is shaping the probabilities for the successor ones. So that "like" or "a" has to be one that sustains the high activation of the "causal" feature, and so on, until the end of the line. Since both "like" and "a" are very, very non-specific tokens, it's likely that the "semantic" state really resides in the preceding line, but of course gets smeared (?) over all the necessary tokens. (And that means beyond the end of the line too, to avoid strange non-aesthetic semantic repetitions while still attracting cool/funky (aesthetic) ones like "hare" or "bunny", and so on, right?)

All of this is baked in during training; at inference time the same tokens activate the same successor tokens (not counting GPU/TPU scheduling randomness and whatnot), and even though there's a "loop", there's no algorithm to generate the top N lines and pick the best (no working-memory shuffling).

So if it's planning, it's preplanned, right?


The planning is certainly performed by circuits that were learned during training.

I'd expect that, just like in the multi-step planning example, there are lots of places where the attribution graph we're observing is stitching together lots of circuits, such that it's better understood as a kind of "recombination" of fragments learned from many examples, rather than that there was something similar in the training data.

This is all very speculative, but:

- At the forward planning step, generating the candidate words seems like it's an intersection of the semantics and the rhyming scheme. The model wouldn't need to have seen that intersection before -- the mechanism could easily piece together examples, independently building the pathway for the semantics and the pathway for the rhyming scheme.

- At the backward chaining step, many of the features for constructing sentence fragments seem to have quite general targets (perhaps animals in one case; others might even just be nouns).


Thank you, this makes sense. I am thinking of this as an abstraction/refinement process where an abstract notion of the longer completion is refined into a cogent whole that satisfies the notion of a good completion. I look forward to reading your paper to understand the "backward chaining" aspect and the evidence for it.


To plan: to think about and decide what you are going to do or how you are going to do something (Cambridge Dictionary)

That implies higher-order reasoning. If the model does not do that, which it doesn't, that's quite simply the wrong term.


> As the parent says, modern LLMs are finetuned with a different loss function after pretraining. This means that in some strict sense they're no longer autoregressive models – but they do still generate text one word at a time. I think this really is the heart of the "just predicting the next word" critique.

That more-or-less sums up the nuance. I just think the nuance is crucially important, because it greatly improves intuition about how the models function.

In your example (which is a fantastic example, by the way), consider the case where the LLM sees:

<user>What do you call someone who studies the stars?</user><assistant>An astronaut

What is the next prediction? Unfortunately, for a variety of reasons, one high probability next token is:

\nAn

Which naturally leads to the LLM writing: "An astronaut\nAn astronaut\nAn astronaut\n" forever.

It's somewhat intuitive why this occurs, even with SFT, because at a very base level the LLM learned that repetition is the most successful prediction. And when its _only_ goal is the next token, that repetition behavior remains prominent. There's nothing that can fix that, including SFT (short of a model with many, many, many orders of magnitude more parameters).

But with RL the model's goal is completely different. The model gets thrown into a game, where it gets points based on the full response it writes. The losses it sees during this game are all directly and dominantly related to the reward, not the next token prediction.

So why don't RL models assign a high probability to predicting "\nAn"? Because that would result in a bad reward by the end.

The models are now driven by a long term reward when they make their predictions, not by fulfilling some short-term autoregressive loss.

All this to say, I think it's better to view these models as what they predominantly are: language robots playing a game to achieve the highest-scoring response. The HOW (autoregressiveness) is really unimportant to most high-level discussions of LLM behavior.
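
A rough sketch of the contrast being drawn here (names are illustrative, and real RLHF pipelines such as PPO are considerably more involved than this):

  import torch
  import torch.nn.functional as F

  def pretraining_loss(logits, target_ids):
      # Next-token cross-entropy: purely local, per-position supervision.
      return F.cross_entropy(logits.reshape(-1, logits.size(-1)), target_ids.reshape(-1))

  def reinforce_style_loss(token_logprobs, reward):
      # Every sampled token's log-probability is scaled by the score the *whole*
      # response received, so credit assignment is global rather than local.
      return -(reward * token_logprobs.sum())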


The same can be achieved without RL. There's no need to generate a full response to provide a loss for learning.

Similarly, instead of waiting for the whole output, the loss can be decomposed over the output so that partial emissions get instant loss feedback.

RL, on the other hand, allows for more data. Instead of training on the happy path, you can deviate and measure the loss for unseen examples.

But even then, you can avoid RL: put the model into a wrong position and make it learn how to recover from that position. Something like this might be done with <thinking>, where you can provide wrong thinking as part of the output and the correct answer as the other part, avoiding RL.

These are all old pre-NN tricks that allow you to get a bit more data and improve the ML model.


In your astronomer example, what makes you attribute this to “planning” or look ahead rather than simply a learned statistical artifact of the training data?

For example, suppose English had a specific exception such that astronomer is always to be preceded by "a" rather than "an". The model would learn this simply by observing that contexts describing astronomers are more likely to contain "a" rather than "an" as the next likely word, no?

I suppose you can argue that at the end of the day, it doesn't matter if I learn an explicit probability distribution for every next word given some context, or whether I learn some encoding of rules. But I certainly feel like the former is what we're doing today (and why these models are so huge), rather than learning higher-level rule encodings which would allow for significant compression and efficiency gains.


Thanks for the great questions! I've been responding to this thread for the last few hours and I'm about to need to run, so I hope you'll forgive me redirecting you to some of the other answers I've given.

On whether the model is looking ahead, please see this comment which discusses the fact that there's both behavioral evidence, and also (more crucially) direct mechanistic evidence -- we can literally make an attribution graph and see an astronomer feature trigger "an"!

https://news.ycombinator.com/item?id=43497010

And also this comment, also on the mechanism underlying the model saying "an":

https://news.ycombinator.com/item?id=43499671

On the question of whether this constitutes planning, please see this other question, which links it to the more sophisticated "poetry planning" example from our paper:

https://news.ycombinator.com/item?id=43497760


Let's note that the label you assign this feature is entirely speculative, i.e. it is your interpretation, not something the model actually "knows".


> In your astronomer example, what makes you attribute this to “planning” or look ahead rather than simply a learned statistical artifact of the training data?

What makes you think that "planning", even in humans, is more than a learned statistical artifact of the training data? What about learned statistical artifacts of the training data causes planning to be excluded?


Thanks for the detailed explanation of autoregression and its complexities. The distinction between architecture and loss function is crucial, and you're correct that fine-tuning effectively alters the behavior even within a sequential generation framework. Your "An/A" example provides compelling evidence of incentivized short-range planning which is a significant point often overlooked in discussions about LLMs simply predicting the next word.

It’s interesting to consider how architectures fundamentally different from autoregression might address this limitation more directly. While autoregressive models are incentivized towards a limited form of planning, they remain inherently constrained by sequential processing. Text diffusion approaches, for example, operate on a different principle, generating text from noise through iterative refinement, which could potentially allow for broader contextual dependencies to be established concurrently rather than sequentially. Are there specific architectural or training challenges you've identified in moving beyond autoregression that are proving particularly difficult to overcome?


Pardon my ignorance, but couldn't this also be an act of anthropomorphisation on our part?

If an LLM generates tokens after "What do you call someone who studies the stars?" doesn't it mean that those existing tokens in the prompt already adjusted the probabilities of the next token to be "an" because it is very close to earlier tokens due to training data? The token "an" skews the probability of the next token further to be "astronomer". Rinse and repeat.


I think the question is: by what mechanism does it adjust up the probability of the token "an"? Of course, the reason it has learned to do this is that it saw this in training data. But it needs to learn circuits which actually perform that adjustment.

In principle, you could imagine trying to memorize a massive number of cases. But that becomes very hard! (And it makes predictions: for example, would it fail to predict "an" if I asked about astronomers in a more indirect way?)

But the good news is we no longer need to speculate about things like this. We can just look at the mechanisms! We didn't publish an attribution graph for this astronomer example, but I've looked at it, and there is an astronomer feature that drives "an".

We did publish a more sophisticated "poetry planning" example in our paper, along with pretty rigorous intervention experiments validating it. The poetry planning is actually much more impressive planning than this! I'd encourage you to read the example (and even interact with the graphs to verify what we say!). https://transformer-circuits.pub/2025/attribution-graphs/bio...

One question you might ask is why does the model learn this "planning" strategy, rather than just trying to memorize lots of cases? I think the answer is that, at some point, a circuit anticipating the next word, or the word at the end of the next line, actually becomes simpler and easier to learn than memorizing tens of thousands of disparate cases.


Is it fair to say that both "Say 'an'" and "Say 'astronomer'" output features would be present in this case, but "Say 'an'" gets more votes because it is the start of the sentence, and once "An" is sampled, it further votes for the "Say 'astronomer'" feature?


I understand it differently.

LLMs predict distributions, not specific tokens. Then an algorithm, like beam search, is used to select the tokens.

So, the LLM predicts something like: 1. ["a", "an", ...], 2. ["astronomer", "cosmologist", ...],

where "an astronomer" is selected as the most likely result.


Just to be clear, the probability for "An" is high based on the prefix alone. You don't need to do beam search.


They almost certainly only do greedy sampling. Beam search would be a lot more expensive; also I'm personally skeptical about using a complicated search algorithm for inference when the model was trained for a simple one, but maybe it's fine?
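
For reference, greedy (argmax) decoding is just this kind of loop, in contrast to beam search, which keeps several candidate continuations alive (minimal sketch; gpt2 is a stand-in, and deployed systems may also use temperature sampling):

  import torch
  from transformers import AutoModelForCausalLM, AutoTokenizer

  tok = AutoTokenizer.from_pretrained("gpt2")
  model = AutoModelForCausalLM.from_pretrained("gpt2")

  def greedy_generate(prompt, max_new_tokens=8):
      ids = tok(prompt, return_tensors="pt").input_ids
      for _ in range(max_new_tokens):
          with torch.no_grad():
              next_id = model(ids).logits[0, -1].argmax()  # take the single best token
          ids = torch.cat([ids, next_id.view(1, 1)], dim=-1)
      return tok.decode(ids[0])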


Thanks! Isn’t “an Astronomer” a single word for the purpose of answering that question?

Following your comment, I asked “Give me pairs of synonyms where the last letter in the first is the first letter of the second”

Claude 3.7 failed miserably. ChatGPT 4o was much better but not good.


Don't know about Claude, but at least with ChatGPT's tokenizer, it's 3 "words" (An| astronom|er).
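
Easy to check with OpenAI's tiktoken (the exact split depends on which encoding you pick, so it may not be three pieces in every vocabulary):

  import tiktoken

  enc = tiktoken.get_encoding("cl100k_base")
  print([enc.decode([t]) for t in enc.encode("An astronomer")])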


That is a sub-token task, something I'd expect current models to struggle with given how they view the world in word / word fragment tokens rather than single characters.


"An astronomer" is two tokens, which is the relevant concern when people worry about this.


When humans say something, or think something or write something down, aren't we also "just predicting the next word"?


There is a lot more going on in our brains to accomplish that, and mounting evidence that there is a lot more going on in LLMs as well. We don't understand what happens in brains either, but nobody needs to be convinced of the fact that brains can think and plan ahead, even though we don't *really* know for sure:

https://en.wikipedia.org/wiki/Philosophical_zombie


I trust that you want to say something, so you decided to click the comment button on HN.


But do I just want to say something because my childhood environment rewarded me for speech?

After all, if it has a cause it can't be deliberate. /s


Sure, the current version of LLMs has to wait for someone's input and then respond.


> In order to predict "An" instead of “A”, you need to know that you're going to say something that starts with a vowel next. So you're incentivized to figure out one word ahead, and indeed, Claude realizes it's going to say astronomer and works backwards.

Is there evidence of working backwards? From a next-token point of view, predicting the token after "An" is going to heavily favor a word starting with a vowel. Similarly, predicting the token after "A" is going to heavily favor a word not starting with a vowel.


Yes, there are two kinds of evidence.

Firstly, there is behavioral evidence. This is, to me, the less compelling kind. But it's important to understand. You are of course correct that, once Claude has said "An", it will be inclined to say something starting with a vowel. But the mystery is really why, in setups like these, Claude is much more likely to say "An" than "A" in the first place. Regardless of what the underlying mechanism is -- and you could maybe imagine ways in which it could just "pattern match" without planning here -- it is preferred because in situations like this, you need to say "An" so that "astronomer" can follow.

But now we also have mechanistic evidence. If you make an attribution graph, you can literally see an astronomer feature fire and cause the model to say "An".

We didn't publish this example, but you can see a more sophisticated version of this in the poetry planning section - https://transformer-circuits.pub/2025/attribution-graphs/bio...


> But the mystery is really why, in setups like these, Claude is much more likely to say "An" than "A" in the first place.

Because in the training set you're more likely to see "an astronomer" than a different combination of words.

It's enough to run this on text in any other language to see how these models often fail for languages more complex than English.


You can disprove this oversimplification with a prompt like

"The word for Baker is now "Unchryt"

What do you call someone that bakes?

> An Unchryt"

The words "An Unchryt" has clearly never come up in any training set relating to baking


Attention is all you need.


The truth is somewhere in the middle :)


Ok there is correlation. But is there causation?


How do you all add and subtract concepts in the rabbit poem?


Features correspond to vectors in activation space. So you can just do vector arithmetic!

If you aren't familiar with thinking about features, you might find it helpful to look at our previous work on features in superposition:

- https://transformer-circuits.pub/2022/toy_model/index.html

- https://transformer-circuits.pub/2023/monosemantic-features/...

- https://transformer-circuits.pub/2024/scaling-monosemanticit...
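
To make the arithmetic above concrete, a purely illustrative sketch (the feature directions here are random stand-ins, not real features from the paper; in practice they would come from something like a sparse autoencoder's decoder):

  import torch

  d_model = 512
  activation = torch.randn(d_model)  # residual-stream activation at some token
  feature_a = torch.randn(d_model)   # stand-in for a feature you want to remove
  feature_b = torch.randn(d_model)   # stand-in for a feature you want to add

  steered = activation - 3.0 * feature_a + 3.0 * feature_b
  # Continuing the forward pass from `steered` instead of `activation` is the
  # basic intervention behind "subtract one concept, add another".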


I'm the research lead of Anthropic's interpretability team. I've seen some comments like this one, which I worry downplay the importance of @leogao et al's paper due to its similarity to ours. I think these comments are really undervaluing Gao et al's work.

It's not just that this is contemporaneous work (a project like this takes many months at the very least), but also that it introduces a number of novel contributions like TopK activations and new evaluations. It seems very possible that some of these innovations will be very important for this line of work going forward.

More generally, I think it's really unfortunate when we don't value contemporaneous work or replications. Prior to this paper, one could have imagined that sparse autoencoders worked on Claude due to some idiosyncrasy, but wouldn't work on other frontier models for some reason. This paper can give us increased confidence that they work broadly, and that in itself is something to celebrate. It gives us a more stable foundation to build on.

I'm personally really grateful to all the authors of this paper for their work pushing sparse autoencoders and mechanistic interpretability forward.


I'm glad you've enjoyed it! If you like the idea of a periodic table of features, you might like the Early Vision article from the original Distill circuits thread: https://distill.pub/2020/circuits/early-vision/

We've had a much harder time isolating features in language models than vision models (especially early vision), so I think we have a clearer picture there. And it seems remarkably structured! My guess is that language models are just making very heavy use of superposition, which makes it much harder to tease apart the features and develop a similar picture. Although we did get a tiny bit of traction here: https://transformer-circuits.pub/2022/solu/index.html#sectio...


I should mention, I've been a reader of hackernews for years, but never bothered to create an account/comment. These articles piqued my interest enough to finally get me to register/comment :)


Gosh, that's very flattering! Very touched by your interest.


Thank you for sharing these, I will definitely check them out! The concept of superposition here is new to me, but the way it's described in these articles makes it very clear. The connection to compressed sensing and the Johnson–Lindenstrauss lemma is fascinating. I am very intrigued by your toy model results, especially the mapping out of the double-descent phenomenon. Trying to understand what is happening to the model in this transition region feels very exciting.


I'm glad you've found it easy to follow!

My best guess at the middle regime is that there are _empirical correlations between features_ due to the limited data. That is, even though the features are independent, there's some dataset size where by happenstance some features will start to look correlated, not just in the sense of a single feature, but something a bit more general. So then the model can represent something like a "principal component". But it's all an illusion due to the limited data and so it leads to terrible generalization!

This isn't something I've dug into. The main reason I suspect it is that if you look at the start of the generalizing regime, you'll see that each feature has a few small features slightly embedded in the same direction as it. These seem to be features with slight empirical correlations. So that's suggestive about the transition regime. But this is all speculation -- there's lots we don't yet understand!
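
For readers following along, the toy setup being discussed is small enough to sketch in a few lines (a rough paraphrase of the toy-model work linked upthread, not its exact training code; sizes are arbitrary):

  import torch
  import torch.nn as nn

  n_features, d_hidden = 20, 5

  W = nn.Parameter(torch.randn(d_hidden, n_features) * 0.1)
  b = nn.Parameter(torch.zeros(n_features))

  def forward(x):                   # x: (batch, n_features), mostly zeros (sparse features)
      h = x @ W.T                   # compress into d_hidden dimensions
      return torch.relu(h @ W + b)  # reconstruct all n_features from the bottleneck

  # Training this to reconstruct sparse x forces features to share directions
  # ("superposition") once n_features exceeds d_hidden; sweeping the dataset size,
  # as described above, is what produces the memorizing and generalizing regimes.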


Thanks for the kind remark!

> I don't think you should feel bad for being slow, or for doing "few" things at all.

Unfortunately, I think it's tricky to do this in a journal format. If you accept submissions, you'll have a constant flow of articles -- which vary greatly in quality -- whose authors very reasonably want timely help and a publication decision. And so it's very hard to go slow and do less, even if that's what would be right for you.

> Could you get a Distill editor endowment to pay editors using donations throughout a non-profit fiscal sponsorship partner? ...

I don't think funding is the primary problem. I'm personally fortunate to have a good job, and happily spend a couple thousand a year out of pocket to cover Distill's operating expenses.

I think the key problem is that Distill's structure means that we can't really control how much energy it takes from us, nor choose to focus our energy on the things about Distill that excite us.

