Thanks for the feedback! I'm one of the authors.

I just wanted to make sure you noticed that this is linking to an accessible blog post that's trying to communicate a research result to a non-technical audience?

The actual research result is covered in two papers which you can find here:

- Methods paper: https://transformer-circuits.pub/2025/attribution-graphs/met...

- Paper applying this method to case studies in Claude 3.5 Haiku: https://transformer-circuits.pub/2025/attribution-graphs/bio...

These papers are jointly 150 pages and are quite technically dense, so it's very understandable that most commenters here are focusing on the non-technical blog post. But I just wanted to make sure that you were aware of the papers, given your feedback.


The post to which you replied states:

  Anthropomorphing[sic] seems to be in an overdose mode with 
  "thinking / thoughts", "mind" etc., scattered everywhere. 
  Nothing with any of the LLMs outputs so far suggests that 
  there is anything even close enough to a mind or a thought 
  or anything really outside of vanity.
This is supported by a reasonable interpretation of the cited article.

Considering the following two statements made in the reply:

  I'm one of the authors.
And

  These papers are jointly 150 pages and are quite 
  technically dense, so it's very understandable that most 
  commenters here are focusing on the non-technical blog post.
The onus of clarifying the article's assertions:

  Knowing how models like Claude *think* ...
And

  Claude sometimes thinks in a conceptual space that is 
  shared between languages, suggesting it has a kind of 
  universal “language of thought.”
As it pertains to anthropomorphizing an algorithm (i.e., stating that it "thinks"), that onus is on the author(s).

Thinking and thought have no solid definition. We can't say Claude doesn't "think", because we don't even know what human thinking actually is.

Given the lack of a solid definition of thinking, or a test to measure it, I think using the terminology colloquially is totally fair play.


Really appreciate your team's enormous efforts in this direction, not only the cutting-edge research (which I don't see OAI/DeepMind publishing any papers on) but also making the content more digestible for a non-research audience. Please keep up the great work!

> The obvious way to deal with this would be to send forward some of the internal activations as well as the generated words in the autoregressive chain.

Hi! I lead interpretability research at Anthropic.

That's a great intuition, and in fact the transformer architecture actually does exactly what you suggest! Activations from earlier time steps are sent forward to later time steps via attention. (This is another thing that's lost in the "models just predict the next word" framing.)

This actually has interesting practical implications -- for example, in some sense, it's the deep reason costs can sometimes be reduced via "prompt caching".
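
If it helps to see the mechanics, here's a minimal single-head sketch (toy numbers, nothing like the real architecture) of how each new position attends over cached keys/values from earlier steps -- which is also why those earlier activations can be reused across requests:

  import numpy as np

  def attend(q, K, V):
      # Causal attention for the newest position: it only sees keys/values
      # from positions <= its own, which are exactly what's in the cache.
      scores = K @ q / np.sqrt(len(q))
      w = np.exp(scores - scores.max())
      w /= w.sum()
      return w @ V              # a blend of earlier positions' value vectors

  d = 8
  rng = np.random.default_rng(0)
  K_cache, V_cache = [], []     # the "KV cache"
  for step in range(5):         # generate 5 tokens, one at a time
      h = rng.normal(size=d)    # stand-in for this step's hidden activation
      K_cache.append(h)         # toy: reuse the activation as key and value
      V_cache.append(h)
      out = attend(h, np.array(K_cache), np.array(V_cache))
      # `out` mixes information from earlier steps into the current one.
      # Past entries of K_cache/V_cache never change, so they can be reused
      # across steps (and across requests) -- the essence of prompt caching.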


I'm more of a vision person and haven't looked much into NLP transformers, but is this because the attention is masked to only allow each query to look at keys/values from its own past? So when we are at token #5, token #3's query cannot attend to token #4's info? And hence the previously computed attention values and activations remain the same and can be cached, because they would be the same anyway in the new forward pass?

Yep, that’s right!

If you want to be precise, there are “autoregressive transformers” and “bidirectional transformers”. Bidirectional is a lot more common in vision. In language models, you do see bidirectional models like BERT, but autoregressive is dominant.


Hi! I'm one of the authors.

There certainly are many interesting parallels here. I often think about this from the perspective of systems biology, in Uri Alon's tradition. There are a range of graphs in biology with excitation and inhibitory edges -- transcription networks, protein networks, networks of biological neurons -- and one can study recurring motifs that turn up in these networks and try to learn from them.

It wouldn't be surprising if some lessons from that work may also transfer to artificial neural networks, although there are some technical things to consider.


Agreed! So many emergent systems in nature achieve complex outcomes without central coordination - from the cellular level to ant colonies and beehives. There are bound to be implications for designed systems.

Closely following what you guys are uncovering through interpretability research - not just accepting LLMs as black boxes. Thanks to you & the team for sharing the work with humanity.

Interpretability is the most exciting part of AI research for its potential to help us understand what’s in the box. By way of analogy, centuries ago farmers’ best hope for good weather was to pray to the gods! The sooner we escape the “praying to the gods” stage with LLMs the more useful they become.


Hi! I lead interpretability research at Anthropic. I also used to do a lot of basic ML pedagogy (https://colah.github.io/). I think this post and its children have some important questions about modern deep learning and how it relates to our present research, and wanted to take the opportunity to try and clarify a few things.

When people talk about models "just predicting the next word", this is a popularization of the fact that modern LLMs are "autoregressive" models. This actually has two components: an architectural component (the model generates words one at a time), and a loss component (it maximizes probability).

As the parent says, modern LLMs are finetuned with a different loss function after pretraining. This means that in some strict sense they're no longer autoregressive models – but they do still generate text one word at a time. I think this really is the heart of the "just predicting the next word" critique.

This brings us to a debate which goes back many, many years: what does it mean to predict the next word? Many researchers, including myself, have believed that if you want to predict the next word really well, you need to do a lot more. (And with this paper, we're able to see this mechanistically!)

Here's an example, which we didn't put in the paper: How does Claude answer "What do you call someone who studies the stars?" with "An astronomer"? In order to predict "An" instead of “A”, you need to know that you're going to say something that starts with a vowel next. So you're incentivized to figure out one word ahead, and indeed, Claude realizes it's going to say astronomer and works backwards. This is a kind of very, very small scale planning – but you can see how even just a pure autoregressive model is incentivized to do it.
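
If you want to poke at the behavioral side of this yourself, here's a rough sketch using the Hugging Face transformers API (GPT-2 is just a small open stand-in here -- this isn't Claude, and the numbers will differ):

  import torch
  from transformers import AutoModelForCausalLM, AutoTokenizer

  tok = AutoTokenizer.from_pretrained("gpt2")         # stand-in model
  model = AutoModelForCausalLM.from_pretrained("gpt2")

  prompt = "Q: What do you call someone who studies the stars?\nA:"
  inputs = tok(prompt, return_tensors="pt")
  with torch.no_grad():
      logits = model(**inputs).logits[0, -1]          # next-token logits
  probs = torch.softmax(logits, dim=-1)

  for word in [" An", " A"]:
      tid = tok.encode(word)[0]                       # first sub-token
      print(repr(word), probs[tid].item())
  # If the model is already "working backwards" from "astronomer", the
  # probability on " An" should beat " A" at this position.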


Thanks for commenting, I like the example because it's simple enough to discuss. Isn't it more accurate to say not that Claude "realizes it's going to say astronomer" or "knows that it's going to say something that starts with a vowel", but rather that the next token (or, more pedantically, the vector which gets reduced down to a token) is generated based on activations that correlate to the "astronomer" token, which is correlated to the "an" token, causing that to also be a more likely output?

I kind of see why it's easy to describe it colloquially as "planning", but it isn't really going ahead and then backtracking; it's almost indistinguishable from the computation that happens when the prompt is "What is the indefinite article to describe 'astronomer'?", i.e. the activation for "astronomer" is already baked in by the prompt "someone who studies the stars", albeit at one level of indirection.

The distinction feels important to me because I think for most readers (based on other comments) the concept of "planning" seems to imply the discovery of some capacity for higher-order logical reasoning which is maybe overstating what happens here.


Thank you. In my mind, "planning" doesn’t necessarily imply higher-order reasoning but rather some form of search, ideally with backtracking. Of course, architecturally, we know that can’t happen during inference. Your example of the indefinite article is a great illustration of how this illusion of planning might occur. I wonder if anyone at Anthropic could compare the two cases (some sort of minimal/differential analysis) and share their insights.

I used the astronomer example earlier as the simplest, most minimal version of something you might think of as a kind of microscopic form of "planning", but I think that at this point in the conversation, it's probably helpful to switch to the poetry example in our paper:

https://transformer-circuits.pub/2025/attribution-graphs/bio...

There are several interesting properties:

- Something you might characterize as "forward search" (generating candidates for the word at the end of the next line, given rhyming scheme and semantics)

- Representing those candidates in an abstract way (the features active are general features for those words, not "motor features" for just saying that word)

- Holding many competing/alternative candidates in parallel.

- Something you might characterize as "backward chaining", where you work backwards from these candidates to "write towards them".

With that said, I think it's easy for these discussions to fall into philosophical arguments about what things like "planning" mean. As long as we agree on what is going on mechanistically, I'm honestly pretty indifferent to what we call it. I spoke to a wide range of colleagues, including at other institutions, and there was pretty widespread agreement that "planning" was the most natural language. But I'm open to other suggestions!


Thanks for linking to this semi-interactive thing, but ... it's completely incomprehensible. :o (edit: okay, after reading about CLT it's a bit less alien.)

I'm curious where is the state stored for this "planning". In a previous comment user lsy wrote "the activation >astronomer< is already baked in by the prompt", and it seems to me that when the model generates "like" (for rabbit) or "a" (for habit) those tokens already encode a high probability for what's coming after them, right?

So each token is shaping the probabilities for its successors. So "like" or "a" has to be one that sustains the high activation of the "causal" feature, and so on, until the end of the line. Since both "like" and "a" are very, very non-specific tokens, it's likely that the "semantic" state really resides in the preceding line, but of course gets smeared (?) over all the necessary tokens. (And that means beyond the end of the line too, to avoid strange non-aesthetic semantic repetitions (like "hare" or "bunny") while still attracting cool/funky (aesthetic) ones, and so on, right?)

All of this is baked in during training; at inference time the same tokens activate the same successor tokens (not counting GPU/TPU scheduling randomness and whatnot), and even though there's a "loop", there's no algorithm to generate the top N lines and pick the best (no working-memory shuffling).

So if it's planning it's preplanned, right?


The planning is certainly performed by circuits which were learned during training.

I'd expect that, just like in the multi-step planning example, there are lots of places where the attribution graph we're observing is stitching together lots of circuits, such that it's better understood as a kind of "recombination" of fragments learned from many examples, rather than that there was something similar in the training data.

This is all very speculative, but:

- At the forward planning step, generating the candidate words seems like it's an intersection of the semantics and the rhyming scheme. The model wouldn't need to have seen that intersection before -- the mechanism could easily piece it together from independent examples, building the pathway for the semantics and the pathway for the rhyming scheme separately.

- At the backward chaining step, many of the features for constructing sentence fragments seem to have quite general targets (perhaps animals in one case, while others might even just be nouns).


> As the parent says, modern LLMs are finetuned with a different loss function after pretraining. This means that in some strict sense they're no longer autoregressive models – but they do still generate text one word at a time. I think this really is the heart of the "just predicting the next word" critique.

That more-or-less sums up the nuance. I just think the nuance is crucially important, because it greatly improves intuition about how the models function.

In your example (which is a fantastic example, by the way), consider the case where the LLM sees:

<user>What do you call someone who studies the stars?</user><assistant>An astronaut

What is the next prediction? Unfortunately, for a variety of reasons, one high probability next token is:

\nAn

Which naturally leads to the LLM writing: "An astronaut\nAn astronaut\nAn astronaut\n" forever.

It's somewhat intuitive why this occurs, even with SFT, because at a very base level the LLM learned that repetition is the most successful prediction. And when its _only_ goal is the next token, that repetition behavior remains prominent. There's nothing that can fix that, including SFT (short of a model with many, many, many orders of magnitude more parameters).

But with RL the model's goal is completely different. The model gets thrown into a game, where it gets points based on the full response it writes. The losses it sees during this game are all directly and dominantly related to the reward, not the next token prediction.

So why don't RL models have a probability for predicting "\nAn"? Because that would result in a bad reward by the end.

The models are now driven by a long term reward when they make their predictions, not by fulfilling some short-term autoregressive loss.

All this to say, I think it's better to view these models as what they predominantly are: language robots playing a game to achieve the highest-scoring response. The HOW (autoregressiveness) is really unimportant to most high-level discussions of LLM behavior.
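
To make the contrast concrete, here's a schematic of the two objectives (not anyone's actual training code) -- per-token cross-entropy versus a single reward for the whole response:

  import torch

  # logp: log-probabilities the model assigned to the tokens it emitted,
  # one entry per token of the response.
  logp = torch.log(torch.tensor([0.9, 0.8, 0.7]))

  # Pretraining / SFT: every token is its own local prediction problem.
  next_token_loss = -logp.mean()

  # REINFORCE-style RL sketch: one scalar reward for the *entire* response
  # scales the log-probability of everything the model said. A response that
  # degenerates into "An astronaut\nAn astronaut\n..." earns a bad reward,
  # and that penalty reaches every token choice that led there.
  reward = -1.0
  sequence_loss = -(reward * logp.sum())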


In your astronomer example, what makes you attribute this to “planning” or look ahead rather than simply a learned statistical artifact of the training data?

For example, suppose English had a specific exception such that astronomer is always to be preceded by “a” rather than “an”. The model would learn this simply by observing that contexts describing astronomers are more likely to contain “a” rather than “an” as the likely next word, no?

I suppose you can argue that at the end of the day, it doesn’t matter if I learn an explicit probability distribution for every next word given some context, or whether I learn some encoding of rules. But I certainly feel like the former is what we’re doing today (and why these models are so huge), rather than learning higher-level rule encodings, which would allow for significant compression and efficiency gains.


Thanks for the great questions! I've been responding to this thread for the last few hours and I'm about to need to run, so I hope you'll forgive me redirecting you to some of the other answers I've given.

On whether the model is looking ahead, please see this comment which discusses the fact that there's both behavioral evidence, and also (more crucially) direct mechanistic evidence -- we can literally make an attribution graph and see an astronomer feature trigger "an"!

https://news.ycombinator.com/item?id=43497010

And also this comment, also on the mechanism underlying the model saying "an":

https://news.ycombinator.com/item?id=43499671

On the question of whether this constitutes planning, please see this other question, which links it to the more sophisticated "poetry planning" example from our paper:

https://news.ycombinator.com/item?id=43497760


Thanks for the detailed explanation of autoregression and its complexities. The distinction between architecture and loss function is crucial, and you're correct that fine-tuning effectively alters the behavior even within a sequential generation framework. Your "An/A" example provides compelling evidence of incentivized short-range planning which is a significant point often overlooked in discussions about LLMs simply predicting the next word.

It’s interesting to consider how architectures fundamentally different from autoregression might address this limitation more directly. While autoregressive models are incentivized towards a limited form of planning, they remain inherently constrained by sequential processing. Text diffusion approaches, for example, operate on a different principle, generating text from noise through iterative refinement, which could potentially allow for broader contextual dependencies to be established concurrently rather than sequentially. Are there specific architectural or training challenges you've identified in moving beyond autoregression that are proving particularly difficult to overcome?


Pardon my ignorance, but couldn't this also be an act of anthropomorphisation on our part?

If an LLM generates tokens after "What do you call someone who studies the stars?" doesn't it mean that those existing tokens in the prompt already adjusted the probabilities of the next token to be "an" because it is very close to earlier tokens due to training data? The token "an" skews the probability of the next token further to be "astronomer". Rinse and repeat.


I think the question is: by what mechanism does it adjust up the probability of the token "an"? Of course, the reason it has learned to do this is that it saw this in training data. But it needs to learn circuits which actually perform that adjustment.

In principle, you could imagine trying to memorize a massive number of cases. But that becomes very hard! (And it makes predictions -- for example, would it fail to predict "an" if I asked about astronomers in a more indirect way?)

But the good news is we no longer need to speculate about things like this. We can just look at the mechanisms! We didn't publish an attribution graph for this astronomer example, but I've looked at it, and there is an astronomer feature that drives "an".

We did publish a more sophisticated "poetry planning" example in our paper, along with pretty rigorous intervention experiments validating it. The poetry planning is actually much more impressive planning than this! I'd encourage you to read the example (and even interact with the graphs to verify what we say!). https://transformer-circuits.pub/2025/attribution-graphs/bio...

One question you might ask is why does the model learn this "planning" strategy, rather than just trying to memorize lots of cases? I think the answer is that, at some point, a circuit anticipating the next word, or the word at the end of the next line, actually becomes simpler and easier to learn than memorizing tens of thousands of disparate cases.


I understand it differently,

LLMs predict distributions, not specific tokens. Then an algorithm, like beam search, is used to select the tokens.

So, the LLM predicts something like 1. ["a", "an", ...] 2. ["astronomer", "cosmologist", ...],

where "an astronomer" is selected as the most likely result.


Just to be clear, the probability for "An" is high, just based on the prefix. You don't need to do beam search.
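
To illustrate what I mean by not needing beam search, here's a toy decode loop (the "model" here is a made-up stand-in) -- it just takes the most likely token at each step:

  import numpy as np

  def fake_next_token_logits(prefix):
      # placeholder for a real language model
      rng = np.random.default_rng(len(prefix))
      return rng.normal(size=50)

  def greedy_decode(prefix, n_new):
      # Take the single most likely token at each step -- no beam search,
      # no enumeration of whole candidate sequences.
      for _ in range(n_new):
          prefix = prefix + [int(np.argmax(fake_next_token_logits(prefix)))]
      return prefix

  print(greedy_decode([1, 2, 3], 5))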

When humans say something, think something, or write something down, aren't we also "just predicting the next word"?

I trust that you wanted to say something, so you decided to click the comment button on HN.

There is a lot more going on in our brains to accomplish that, and mounting evidence that there is a lot more going on in LLMs as well. We don't understand what happens in brains either, but nobody needs to be convinced of the fact that brains can think and plan ahead, even though we don't *really* know for sure:

https://en.wikipedia.org/wiki/Philosophical_zombie


Thanks! Isn’t “an Astronomer” a single word for the purpose of answering that question?

Following your comment, I asked “Give me pairs of synonyms where the last letter in the first is the first letter of the second”

Claude 3.7 failed miserably. ChatGPT 4o was much better, but not good.


Don't know about Claude, but at least with ChatGPT's tokenizer, it's 3 "words" (An| astronom|er).

That is a sub-token task, something I'd expect current models to struggle with given how they view the world in word / word fragment tokens rather than single characters.

"An astronomer" is two tokens, which is the relevant concern when people worry about this.

> In order to predict "An" instead of “A”, you need to know that you're going to say something that starts with a vowel next. So you're incentivized to figure out one word ahead, and indeed, Claude realizes it's going to say astronomer and works backwards.

Is there evidence of working backwards? From a next token point of view, predicting the token after "An" is going to heavily favor a vowel. Similarly predicting the token after "A" is going to heavily favor not a vowel.


Yes, there are two kinds of evidence.

Firstly, there is behavioral evidence. This is, to me, the less compelling kind, but it's important to understand. You are of course correct that, once Claude has said "An", it will be inclined to say something starting with a vowel. But the mystery is really why, in setups like these, Claude is much more likely to say "An" than "A" in the first place. Regardless of what the underlying mechanism is -- and you could maybe imagine ways in which it could just "pattern match" without planning here -- "An" is preferred because, in situations like this, you need to say "An" so that "astronomer" can follow.

But now we also have mechanistic evidence. If you make an attribution graph, you can literally see an astronomer feature fire and cause the model to say "An".

We didn't publish this example, but you can see a more sophisticated version of this in the poetry planning section - https://transformer-circuits.pub/2025/attribution-graphs/bio...


> But the mystery is really why, in setups like these, Claude is much more likely to say "An" than "A" in the first place.

Because in the training set you're more likely to see "an astronomer" than a different combination of words.

It's enough to run this on text in any other language to see how these models often fail for languages more complex than English.


You can disprove this oversimplification with a prompt like

"The word for Baker is now "Unchryt"

What do you call someone that bakes?

> An Unchryt"

The words "An Unchryt" has clearly never come up in any training set relating to baking


The truth is somewhere in the middle :)

How do you all add and subtract concepts in the rabbit poem?

Features correspond to vectors in activation space. So you can just do vector arithmetic!
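
Concretely, a steering intervention just adds or subtracts a scaled feature direction from the activations. A toy numpy sketch (the directions and scales here are made-up stand-ins, not the actual features from the paper):

  import numpy as np

  d_model = 16
  rng = np.random.default_rng(0)

  activation = rng.normal(size=d_model)   # residual-stream activation
  rabbit_dir = rng.normal(size=d_model)   # stand-in "rabbit" feature direction
  green_dir  = rng.normal(size=d_model)   # stand-in "green" feature direction

  # "Subtract the rabbit concept, add the green concept", then let the
  # forward pass continue from the steered activation.
  steered = activation - 4.0 * rabbit_dir + 4.0 * green_dir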

If you aren't familiar with thinking about features, you might find it helpful to look at our previous work on features in superposition:

- https://transformer-circuits.pub/2022/toy_model/index.html

- https://transformer-circuits.pub/2023/monosemantic-features/...

- https://transformer-circuits.pub/2024/scaling-monosemanticit...


I'm the research lead of Anthropic's interpretability team. I've seen some comments like this one, which I worry downplay the importance of @leogao et al.'s paper due to its similarity to ours. I think these comments are really undervaluing Gao et al.'s work.

It's not just that this is contemporaneous work (a project like this takes many months at the very least), but also that it introduces a number of novel contributions like TopK activations and new evaluations. It seems very possible that some of these innovations will be very important for this line of work going forward.

More generally, I think it's really unfortunate when we don't value contemporaneous work or replications. Prior to this paper, one could have imagined that sparse autoencoders worked on Claude due to some idiosyncrasy, but wouldn't work on other frontier models for some reason. This paper can give us increased confidence that they work broadly, and that in itself is something to celebrate. It gives us a more stable foundation to build on.

I'm personally really grateful to all the authors of this paper for their work pushing sparse autoencoders and mechanistic interpretability forward.


I'm glad you've enjoyed it! If you like the idea of a periodic table of features, you might like the Early Vision article from the original Distill circuits thread: https://distill.pub/2020/circuits/early-vision/

We've had a much harder time isolating features in language models than vision models (especially early vision), so I think we have a clearer picture there. And it seems remarkably structured! My guess is that language models are just making very heavy use of superposition, which makes it much harder to tease apart the features and develop a similar picture. Although we did get a tiny bit of traction here: https://transformer-circuits.pub/2022/solu/index.html#sectio...


I should mention, I've been a reader of hackernews for years, but never bothered to create an account/comment. These articles piqued my interest enough to finally get me to register/comment :)


Gosh, that's very flattering! Very touched by your interest.


Thank you for sharing these, I will definitely check them out! The concept of superposition here is new to me, but the way it's described in these articles makes it very clear. The connection to compressed sensing and the Johnson–Lindenstrauss lemma is fascinating. I am very intrigued by your toy model results, especially the mapping out of the double-descent phenomenon. Trying to understand what is happening to the model in this transition region feels very exciting.


I'm glad you've found it easy to follow!

My best guess at the middle regime is that there are _empirical correlations between features_ due to the limited data. That is, even though the features are independent, there's some dataset size where by happenstance some features will start to look correlated, not just in the sense of a single feature, but something a bit more general. So then the model can represent something like a "principal component". But it's all an illusion due to the limited data and so it leads to terrible generalization!

This isn't something I've dug into. The main reason I suspect it is that if you look at the start of the generalizing regime, you'll see that each feature has a few small features slightly embedded in the same direction as it. These seem to be features with slight empirical correlations. So that's suggestive about the transition regime. But this is all speculation -- there's lots we don't yet understand!
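
If it's helpful, here's a tiny numpy illustration of the kind of thing I mean (toy numbers, not from the paper): genuinely independent sparse features can look correlated at small sample sizes, and the spurious structure washes out with more data.

  import numpy as np

  rng = np.random.default_rng(0)
  n_features = 20
  for n_samples in (50, 50_000):
      # independent sparse features, each active ~30% of the time
      X = (rng.random((n_samples, n_features)) < 0.3).astype(float)
      corr = np.corrcoef(X, rowvar=False)
      off_diag = np.abs(corr[~np.eye(n_features, dtype=bool)])
      print(n_samples, off_diag.max())
  # At 50 samples, some pairs of truly independent features show sizeable
  # empirical correlation purely by happenstance -- a direction a model could
  # latch onto like a "principal component" -- while at 50,000 samples the
  # off-diagonal correlations are near zero.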


Thanks for the kind remark!

> I don't think you should feel bad for being slow, or for doing "few" things at all.

Unfortunately, I think it's tricky to do this in a journal format. If you accept submissions, you'll have a constant flow of articles -- which vary greatly in quality -- whose authors very reasonably want timely help and a publication decision. And so it's very hard to go slow and do less, even if that's what would be right for you.

> Could you get a Distill editor endowment to pay editors using donations throughout a non-profit fiscal sponsorship partner? ...

I don't think funding is the primary problem. I'm personally fortunate to have a good job, and happily spend a couple thousand a year out of pocket to cover Distill's operating expenses.

I think the key problem is that Distill's structure means that we can't really control how much energy it takes from us, nor choose to focus our energy on the things about Distill that excite us.


It's certainly true that there are strong biological analogies. The analogy between first-layer conv features and neuroscience is pretty widely accepted -- a lot of theoretical neuroscience models produce the same features. (It's less clear for later layers whether they're biologically analogous. Several papers have found that the aggregate of neurons in those layers can predict biological neurons quite well, but I don't think we have a detailed and agreed-upon characterization of the features that exist on the biological side to make a strong feature-level case.)

The color vs black and white split also has biological analogies.

With that said, I'd hesitate to dismiss the GP comment. Separate from the color vs grayscale split, why do we observe low-frequency features preferring to group with color? It seems very plausible to me that if there's a systematic artifact from how the data that neural networks are trained on was compressed, that could play a role. Either way, it makes the argument that this emerges purely from natural data and the network itself less clean. (One caveat is that these models are trained on very downscaled versions of larger images. Even if high-frequency information was discarded in the original, that wouldn't necessarily mean that high-frequency information was discarded in the downsampled version the network sees. It would depend on details of the data processing pipeline.)

To be clear, I'm not a neuroscientist and this is all just my understanding from the ML side!


That's an interesting hypothesis which hadn't been on my radar. (I'm one of the authors.)


Feature visualizations are described in this article: https://distill.pub/2017/feature-visualization/

The ones you see in this work are mostly a variant of the standard feature visualization, which tries to show different "facets" in neurons that respond to multiple things. The details are explained in the appendix of the paper (https://distill.pub/2021/multimodal-neurons/).
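
If you want the core optimization stripped of the regularizers and transformation robustness that make the images look nice, it's just gradient ascent on the input to maximize a chosen unit's activation. A minimal PyTorch sketch, with a toy convnet standing in for the real model:

  import torch

  # Toy stand-in network; the paper visualizes units of CLIP's vision model.
  model = torch.nn.Sequential(
      torch.nn.Conv2d(3, 8, 5), torch.nn.ReLU(),
      torch.nn.Conv2d(8, 16, 5), torch.nn.ReLU(),
  )

  img = torch.zeros(1, 3, 64, 64, requires_grad=True)
  opt = torch.optim.Adam([img], lr=0.05)
  channel = 3                          # which unit/channel to visualize
  for _ in range(200):
      opt.zero_grad()
      act = model(img)[0, channel]     # that channel's activation map
      (-act.mean()).backward()         # ascend the mean activation
      opt.step()
  # The faceted variants add extra terms that steer the optimization toward
  # different kinds of stimuli a unit responds to; see the paper's appendix.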


Worth noting that Chris Olah (who wrote this comment) has led much of the interesting work in making feature visualisations useful. If you look for "Visualizing Neural Networks" on his page https://colah.github.io/ you'll find lots of other interesting links in this area.


Thank you!

Has there been any study of the variability in these activation images - like, are there many disconnected local maxima depending on the initialization, how do they vary with retraining the network (or e.g. with dropout, etc.), or with varying the model parameters in some direction that keeps the loss in a local minimum?

I could picture that maybe they always look the same, but sometimes there would be cases where they have different modes that accomplish the same thing.


Quick unrelated question, as you seem to be a subject expert: are there currently any neural network models that deal with multiple separate networks that occasionally trade or share nodes? It might be useful for modeling a system where members of an organization may leave and join another organization, such as a business or a church.

