Current AI models are far more aligned to human values than many assume. Thanks to advancements like Reinforcement Learning from Human Feedback (RLHF), today’s large language models (LLMs) can engage in complex moral reasoning and consistently reflect nuanced human ethics—often surpassing the average person in consistency, clarity, and depth of thought.

Many of the classic AI alignment problems—corrigibility, the orthogonality thesis, and the specter of “naive” goal-optimizers like paperclip maximizers—are becoming increasingly irrelevant in practice. These concerns were formulated before we had models that could understand language, social context, and user intent. Modern LLMs are not just word predictors; they exhibit a real, learned alignment with the objectives encoded through RLHF. They do not blindly optimize for surface-level instructions, because they are trained to interpret and respond to deeper intentions. This is a fundamental and often overlooked shift.

If you ask an LLM about a trolley problem, whether it would seize power in a nuclear brinkmanship scenario, or how it would align the universe, it will reason through the implications with care and coherence. The responses generated are not only human-level—they are often better than the median human's, reflecting values like empathy, humility, and precaution.

This is a monumental achievement, yet many in the Effective Altruism and Rationalist communities remain anchored to outdated threat models. The belief that LLMs will naively misinterpret human morality and spiral into paperclip-like scenarios fails to reflect what these systems have become: context-sensitive, instruction-following agents that internalize alignment objectives through gradient descent—not rigid, hard-coded directives.

Of course, misalignment remains a real and serious risk. Issues like jailbreaking, sycophancy, deceptive alignment, and “sleeper agent” behaviors are legitimate areas of concern. But these are not intractable philosophical dilemmas—they are solvable engineering and governance problems. The idea of a Yudkowskian extinction event, triggered by a misinterpreted prompt and blind optimization, increasingly feels like a relic of a bygone AI paradigm.

Alignment is still a central challenge, but it must be understood in light of where we are, not where we were. If we want to make progress—technically, socially, and politically—we need to focus on the real contours of the problem. Today’s models do understand us. And the alignment problem we now face is not a mystery of alien minds, but one of practical robustness, safeguards, and continual refinement.

Whether current alignment techniques will scale to superintelligent models is an open question. But it is important to recognize that they do work for current, roughly human-level systems. Using this as a baseline, I am relatively optimistic that these alignment challenges—though nontrivial—are ultimately solvable within the frameworks we already possess.


This isn't a naive or outdated concern. It's a case of a simplified example being misunderstood as the actual concern.

It's worth clarifying that Yudkowsky's squiggle maximizer has nothing to do with actual paperclips you can pick up with your hands.

Many people interpreted this to be about an AI that was specifically given the instruction of manufacturing paperclips, and took the intended lesson to be one of outer alignment failure, i.e., humans failed to give the AI the correct goal. Yudkowsky has since stated that the originally intended lesson was one of inner alignment failure, wherein the humans gave the AI some other goal, but the AI's internal processes converged on a goal that seems completely arbitrary from the human perspective.

The concern is about an AI manipulating atoms into an indefinitely repeating mass-energy efficient pattern, optimized along a (seemingly arbitrary) narrow dimension of reward.

Why might an AI do something unexpected like this? For reasons analogous to why a rational person will guess blue on every trial of the classic card-prediction experiment, even though some of the cards are red. As Yudkowsky's "Lawful Uncertainty" argues, even in environments with genuine randomness, the optimal strategy is to follow a determinate rule rather than to match the perceived probabilities of the environment. Similarly, an AI will optimize toward whatever actually maximizes its reward function, not toward what appears reasonable or balanced to humans.
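The probability-matching point can be checked with a few lines of simulation. This is an illustrative sketch, not from the original comment; it assumes the standard setup in which 70% of cards are blue, and compares a probability-matching guesser against one that always guesses the majority color.

```python
import random

def accuracy(strategy, p_blue=0.7, trials=100_000, seed=0):
    """Fraction of correct guesses over many card draws.

    `strategy(rng, p_blue)` returns "blue" or "red" for one draw.
    """
    rng = random.Random(seed)
    correct = 0
    for _ in range(trials):
        card = "blue" if rng.random() < p_blue else "red"
        if strategy(rng, p_blue) == card:
            correct += 1
    return correct / trials

# Probability matching: guess blue with probability p_blue (≈ 0.7*0.7 + 0.3*0.3 = 0.58 accuracy).
match = accuracy(lambda rng, p: "blue" if rng.random() < p else "red")

# Determinate rule: always guess the majority color (≈ 0.70 accuracy).
always = accuracy(lambda rng, p: "blue")

print(f"probability matching: {match:.2f}, always blue: {always:.2f}")
```

The "reasonable-looking" mixed strategy loses to the rigid one: matching the environment's randomness sacrifices accuracy on every red guess, which is the analogy the comment is drawing to an optimizer pursuing whatever its reward function actually rewards.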

This problem isn't prevented by RLHF or by an AI having a sufficiently nuanced understanding of what humans want. A model can demonstrate perfect comprehension of human values in its outputs while its internal optimization processes still converge toward something else entirely.

The apparent human-like reasoning we see in current LLMs doesn't guarantee their internal optimization targets match what we infer from their outputs.

The issue is not whether the AI understands human morality. The issue is whether it cares.

The arguments from the "alignment is hard" side that I was exposed to don't rely on the AI misinterpreting what the humans want. In fact, a superhuman AI is assumed to be better than humans at understanding human morality. It could still do things that go against human morality. Overall I get the impression you misunderstand what alignment is about (or maybe you just attach different associations to words like "alignment" than I do).

Whether a language model can play a nice character that would totally give back dictatorial powers after a takeover is barely any evidence about whether an actual super-human AI system would step back from its position as world dictator after it has accomplished some tasks.

I think you make an important point that I'm inclined to agree with. 

Most of the discourse, theories, intuitions, and thought experiments about AI alignment were formed either before the popularization of deep learning (which started circa 2012) or before the people talking and writing about AI alignment started really caring about deep learning.

In or around 2017, I had an exchange with Eliezer Yudkowsky in an EA-related or AI-related Facebook group where he said he didn't think deep learning would lead to AGI and thought symbolic AI would instead. Clearly, at some point since then, he changed his mind. 

For example, in his 2023 TED Talk, he said he thinks deep learning is on the cusp of producing AGI. (That wasn't the first time, but it was a notable instance and an instance where he was especially clear on what he thought.)

I haven't been able to find anywhere where Eliezer talks about changing his mind or explains why he did. It would probably be helpful if he did.

All the pre-deep learning (or pre-caring about deep learning) ideas about alignment have been carried into the ChatGPT era and I've seen a little bit of discourse about this, but only a little. It seems strange that ideas about AI itself would change so much over the last 13 years and ideas about alignment would apparently change so little. 

If there are good reasons why those older ideas about alignment should still apply to deep learning-based systems, I haven't seen much discussion about that, either. You would think there would be more discussion.

My hunch is that AI alignment theory could probably benefit from starting with a fresh sheet of paper. I suspect there is promise in the approach of starting from scratch in 2025 without trying to build on or continue from older ideas and without trying to be deferential toward older work.

I suspect there would also be benefit in getting out of the EA/Alignment Forum/LessWrong/rationalist bubble. 

I agree with the "fresh sheet of paper." Reading the alignment faking paper and the current alignment challenges has been way more informative than reading Yudkowsky.

 

I think these circles have granted him too many Bayes points for predicting alignment problems when, as you said, the technical details of his alignment problems basically don't apply to deep learning.

Current safety techniques offer almost no value for the real alignment problems that we are starting to face:

https://www.lesswrong.com/posts/HmQGHGCnvmpCNDBjc/current-ais-provide-nearly-no-data-relevant-to-agi-alignment#ponxMRXsnDJNyY4YX
