(cache)AI Safety for Fleshy Humans: a whirlwind tour

The AI debate is actually 100 debates in a trenchcoat.

Will artificial intelligence (AI) help us cure all disease, and build a post-scarcity world full of flourishing lives? Or will AI help tyrants surveil and manipulate us further? Are the main risks of AI from accidents, abuse by bad actors, or a rogue AI itself becoming a bad actor? Is this all just hype? Why can AI solve Olympiad-level math problems, yet can't play Pokémon? Why is it hard to make AI robustly serve humane values, or robustly serve any goal? What if an AI learns to be more humane than us? What if an AI learns humanity's inhumanity, our prejudices and cruelty? Are we headed for utopia, dystopia, extinction, a fate worse than extinction, or — the most shocking outcome of all — nothing changes? Also: will an AI take my job?

...and many more questions.

Alas, to understand AI with nuance, we must understand lots of technical detail... but that detail is scattered across hundreds of articles, buried six-feet-deep in jargon.

So, I present to you:

RCM (Robot Catboy Maid) throwing confetti under a banner that reads: A Whirlwood Tour Guide to AI Safety for Us Warm, Normal Fleshy Humans.

This 3-part series is your one-stop-shop to understand the core ideas of AI & AI Safety* — explained in a friendly, accessible, and slightly opinionated way!

(* Related phrases: AI Security, AI Risk, AI X-Risk, AI Alignment, AI Ethics, AI Not-Kill-Everyone-ism. There is no consensus on what these phrases do & don't mean, so I'm just using "AI Safety" as a catch-all.)

This series will also have comics starring a Robot Catboy Maid. Like so:

Comic. Ham the Human tells RCM (Robot Catboy Maid) to "keep this hosue clean". RCM reasons: What causes the mess? The humans cause the mess! Therefore: GET RID OF THE HUMANS. RCM then yeets Ham out of the house.

[tour guide voice] And to your right 👉, you'll see buttons for the Table of Contents, changing this webpage's style, and a reading-time-remaining clock.

This series was/will be published in three parts:

This Intro & Part 1: Past, Present & Future were out on May 2024
Part 2: Problems was out on Aug 2024
Part 3: Solutions & The Grand Cinematic Conclusion were out on Dec 2025

(By the way, this series was made in collaboration with Hack Club, a worldwide community for & by teen hackers! If you'd like to learn more, and get free stickers, sign up below👇)

Anyway, [tour guide voice again] before we hike through the rocky terrain of AI & AI Safety, let's take a 10,000-foot look of the land:

💡 The Core Ideas of AI & AI Safety

In my opinion, the main problems in AI and AI Safety come down to two core conflicts:

Logic "vs" Intuition, and Problems in the AI "vs" in Humans

Note: What "Logic" and "Intuition" are will be explained more rigorously in Part One. For now: Logic is step-by-step cognition, like solving math problems. Intuition is all-at-once recognition, like seeing if a picture is of a cat. "Intuition and Logic" roughly map onto "System 1 and 2" from cognitive science.^[1]^[2] (👈 hover over these footnotes! they expand!)

As you can tell by the "scare" "quotes" on "versus", these divisions ain't really so divided after all...

Here's how these conflicts repeat over this 3-part series:

Part 1: The past, present, and possible futures

Skipping over a lot of detail, the history of AI is a tale of Logic vs Intuition:

Before 2000: AI was all logic, no intuition.

This was why, in 1997, AI could beat the world champion at chess... yet no AIs could reliably recognize cats in pictures.^[3]

(Safety concern: Without intuition, AI can't understand common sense or humane values. Thus, AI might achieve goals in logically-correct but undesirable ways.)

After 2000: AI could do "intuition", but had very poor logic.

This is why generative AIs (as of current writing, May 2024) can dream up whole landscapes in any artist's style... yet can't consistently draw more than 4 objects. (👈 click this text! it also expands!)

(Safety concern: Without logic, we can't verify what's happening in an AI's "intuition". That intuition could be biased, subtly-but-dangerously wrong, or fail bizarrely in new scenarios.)

Current Day: We still don't know how to unify logic & intuition in AI.

But if/when we do, that would give us the biggest risks & rewards of AI: something that can logically out-plan us, and learn general intuition. That'd be an "AI Einstein"... or an "AI Oppenheimer".

Summed in a picture:

Timeline of AI. Before the year 2000, mostly "logic". From 2000 to now, mostly "intuition". In the future, maybe both?

So that's "Logic vs Intuition". As for the other core conflict, "Problems in the AI vs The Humans", that's one of the big controversies in the field of AI Safety: are our main risks from advanced AI itself, or from humans misusing advanced AI?

(Why not both?)

Part 2: The problems

The problem of AI Safety is this:^[4]

The Value Alignment Problem:
“How can we make AI robustly serve humane values?”

NOTE: I wrote humane, with an "e", not just "human". A human may or may not be humane. I'm going to harp on this because both advocates & critics of AI Safety keep mixing up the two.^[5]^[6]

We can break this problem down by "Problems in Humans vs AI":

Humane Values:
“What are humane values, anyway?”
(a problem for philosophy & ethics)

The Technical Alignment Problem:
“How can we make AI robustly serve any intended goal at all?”
(a problem for computer scientists - surprisingly, still unsolved!)

The technical alignment problem, in turn, can be broken down by "Logic vs Intuition":

Problems with AI Logic:^[7] ("game theory" problems)

AIs may accomplish goals in logical but undesirable ways.

Most goals logically lead to the same unsafe sub-goals: "don't let anyone stop me from accomplishing my goal", "maximize my ability & resources to optimize for that goal", etc.

Problems with AI Intuition:^[8] ("deep learning" problems)

An AI trained on human data could learn our prejudices.

AI "intuition" isn't understandable or verifiable.

AI "intuition" is fragile, and fails in new scenarios.

AI "intuition" could partly fail, which may be worse: an AI with intact skills, but broken goals, would be an AI that skillfully acts towards corrupted goals.

(Again, what "logic" and "intuition" are will be more precisely explained later!)

Summed in a picture:

A diagram breaking down the AI Alignment Problem. "How can we align AI with humane values?" splits into "Technical Alignment" and "Humane Values". Technical Alignment splits into "AI Logic (game theory)" and "AI Intuition (deep learning)"

As intuition for how hard these problems are, note that we haven't even solved them for us humans — People follow the letter of the law, not the spirit. People's intuition can be biased, and fail in new circumstances. And none of us are 100% the humane humans we wished we were.

So, if I may be a bit sappy, maybe understanding AI will help us understand ourselves. And just maybe, we can solve the human alignment problem: How do we get humans to robustly serve humane values?

Part 3: The proposed solutions

Finally, we can understand some (possible) ways to solve the problems in logic, intuition, AIs, and humans! These include:

Technical solutions
Governance solutions, both top-down & bottom-up
"How about you just don't build the torment nexus"

— and more! Experts disagree on which proposals will work, if any... but it's a good start.

🤔 (Optional flashcard review!)

Hey, d'ya ever get this feeling?

"Wow that was a wonderful, insightful thing I just read"
[forgets everything 2 weeks later]
"Oh no"

To avoid that for this guide, I've included some OPTIONAL interactive flashcards! They use "Spaced Repetition", an easy-ish, evidence-backed way to "make long-term memory a choice". (click here to learn more about Spaced Repetition!)

Here: try the below flashcards, to retain what you just learnt!

(There's an optional sign-up at the end, if you want to save these cards for long-term study. Note: I do not own or control this app, it's third-party. If you'd rather use the open source flashcard app Anki, here's a downloadable Anki deck!)

(Also, you don't need to memorize the answers exactly, just the gist. You be the judge if you got it "close enough".)

🤷🏻‍♀️ Five common misconceptions about AI Safety

“It ain’t what you don’t know that gets you into trouble. It’s what you know for sure that just ain’t so.”

~ often attributed to Mark Twain, but it just ain't so^[9]

For better and worse, you've already heard too much about AI. So before we connect new puzzle pieces in your mind, we gotta take out the old pieces that just ain't so.

Thus, if you'll indulge me in a "Top 5" listicle...

1) No, AI Safety isn't a fringe concern by sci-fi weebs.

RCM in front of a "crazy board" with red thread, thumbtacks, and papers with AI jargon.

AI Safety / AI Risk used to be less mainstream, but now in 2024, the US & UK governments now have AI Safety-specific departments^[10], and the US, EU & China have reached agreements on AI Safety research.^[11] This resulted from many of the top AI researchers raising alarm bells about it. These folks include:

Geoffrey Hinton^[12] and Yoshua Bengio^[13], co-winners of the 2018 Turing Prize (the "Nobel Prize of Computing") for their work on deep neural networks, the thing that all the new famous AIs use.^[14]
Stuart Russell and Peter Norvig, the authors of the most-used textbook on Artificial Intelligence.^[15]
Paul Christiano, pioneer of the AI training/safety technique that made ChatGPT possible.^[16]

(To be clear: there are also top AI researchers against fears of AI Risk, such Yann LeCun,^[17] co-winner of the 2018 Turing Prize, and chief AI researcher at ~~Facebook~~ Meta. Another notable name is Melanie Mitchell^[18], a researcher in AI & complexity science.)

I'm aware "look at these experts" is an appeal to authority, but this is only to counter the idea of, "eh, only sci-fi weebs fear AI Risk". But in the end, appeal to authority/weebs isn't enough; you have to actually understand the dang thing. (Which you are doing, by reading this! So thank you.)

But speaking of sci-fi weebs...

2) No, AI Risk is NOT about AI becoming "sentient" or "conscious" or gaining a "will to power".

Sci-fi authors write sentient AIs because they're writing stories, not technical papers. The philosophical debate on artificial consciousness is fascinating, and irrelevant to AI Safety. Analogy: a nuclear bomb isn't conscious, but it can still be unsafe, no?

Left: drawing of a nuke, captioned "not conscious". Right: drawing of Professor Nuke giving a lecture titled, "Why Murder is Good, Actually." Captioned, "conscious".

As mentioned earlier, the real problems in AI Safety are "boring": an AI learns the wrong things from its biased training data, it breaks in slightly-new scenarios, it logically accomplishes goals in undesired ways, etc.

But, "boring" doesn't mean not important. The technical details of how to design a safe elevator/airplane/bridge are boring to most laypeople... and also a matter of life-and-death. Catastrophic AI Risk doesn't even require "super-human general intelligence"! For example, an AI that's "only" good at designing viruses could help a bio-terrorist organization (like Aum Shinrikyo^[19]) kill millions of people.

(update dec 2025: while AI Consciousness is still perpendicular to AI Safety, in the 1.5 years between when I started this series and now, concern for the welfare of the AIs themselves has become a bit more mainstream. Not, like, mainstream mainstream, but one of the leading AI labs, Anthropic, recently hired a full-time "AI Welfare" researcher, whose work has led to actual changes to the product.)

But anyway! Speaking of killing people...

3) No, AI Risk isn't necessarily extinction, SkyNet, or nanobots

A drawing of Microsoft Clippy saying: "It looks like you're trying to commit omnicide. Would you like some help?"

While most AI researchers do believe advanced AI poses a 5+% risk of "literally everybody dies"^[20], it's very hard to convince folks (especially policymakers) of stuff that's never happened before.

So instead, I'd like to highlight the ways that advanced AI – (especially when it's available to anyone with a high-end computer) – could lead to catastrophes, "merely" by scaling up already-existing bad stuff.

For example:

Bio-engineered pandemics: A bio-terrorist cult (like Aum Shinrikyo^[19:1]) uses AI (like AlphaFold^[21]) and DNA-printing (which is getting cheaper fast^[22]) to design multiple new super-viruses, and release them simultaneously in major airports around the globe.
- (Proof of concept: Scientists have already re-built polio from mail-order DNA... two decades ago.^[23])
Digital authoritarianism: A tyrant uses AI-enhanced surveillance to hunt down protestors (already happening), generate individually-targeted propaganda (kind of happening), and autonomous military robots (soon-to-be happening)... all to rule with a silicon fist.
Cybersecurity Ransom Hell: Cyber-criminals make a computer virus that does its own hacking & re-programming, so it's always one step ahead of human defenses. The result: an unstoppable worldwide bot-net, which holds critical infrastructure ransom, and manipulates top CEOs and politicians to do its bidding.
- (For context: without AI, hackers have already damaged nuclear power plants,^[24] held hospitals ransom^[25] which maybe killed someone,^[26] and almost poisoned a town's water supply twice.^[27] With AI, deepfakes have been used to swing an election,^[28] steal $25 million in a single heist,^[29] and target parents for ransom, using the faked voices of their children being kidnapped & crying for help.^[30])
- (This is why it's not easy to "just shut down an AI when we notice it going haywire"; as the history of computer security shows, we just suck at noticing problems in general. I cannot over-emphasize how much the modern world is built on an upside-down house of cards.)
- (update dec 2025: a few months ago, researchers found the world's first confirmed case of "agentic AI successfully obtaining access to confirmed high-value targets for intelligence collection, including major technology corporations and government agencies". It's happening!!)

The above examples are all "humans misuse AI to cause havoc", but remember advanced AI could do the above by itself, due to "boring" reasons: it's accomplishing a goal in a logical-but-undesirable way, its goals glitch out but its skills remain intact, etc.

(Bonus, Some concrete, plausible ways a rogue AI could "escape containment", or affect the physical world.)

Point is: even if one doesn't think AI is a literal 100% human extinction risk... I'd say "homebrew bio-terrorism" & "1984 with robots" are still worth taking seriously.

On the flipside...

4) Yes, folks worried about AI's downsides do recognize its upsides.

Comic. Sheriff Meowdy holds up a blueprint for a parachute design. Ham the Human retorts, annoyed: “Why are you so anti-aviation?”

AI Risk folks aren't Luddites. In fact, they warn about AI's downsides precisely because they care about AI's upsides.^[31] As humorist Gil Stern once said:^[32]

“Both the optimist and the pessimist contribute to society: the optimist invents the airplane, and the pessimist invents the parachute.”

So: even as this series goes into detail on how AI is already going wrong, it's worth remembering the few ways AI is already going right:

AI can analyze medical scans as well or better than human specialists! ^[33] That's concretely life-saving!
AlphaFold basically solved a 50-year-old, major problem in biology: how to predict the shape of proteins.^[21:1] (AlphaFold can predict a protein's shape to within the width of an atom!) This has huge applications to medicine & understanding disease.

Too often, we take technology — even life-saving technology — for granted. So, let me zoom out for context. Here's the last 2000+ years of child mortality, the percentage of kids who die before puberty:

Chart of child mortality over the last 2000+ years. Worldwide, it was constant at around 48%, from hunter-gatherer times to 1800. Then suddenly, starting 1800, it plummets to 4.3% today. (from Dattani, Spooner, Ritchie and Roser (2023))

For thousands of years, in nations both rich and poor, a full half of kids just died. This was a constant. Then, starting in the 1800s — thanks to science/tech like germ theory, sanitation, medicine, clean water, vaccines, etc — child mortality fell off like a cliff. We still have far more to go — I refuse to accept^[34] a worldwide 4.3% (1 in 23) child death rate — but let's just appreciate how humanity so swiftly cut down an eons-old scourge.

How did we achieve this? Policy's a big part of the story, but policy is "the art of the possible"^[35], and the above wouldn't have been possible without good science & tech. If safe, humane AI can help us progress further by even just a few percent — towards slaying the remaining dragons of cancer, Alzheimer's, HIV/AIDS, etc — that'd be tens of millions more of our loved ones, who get to beat the Reaper for another day.

F#@☆ going to Mars, that's why advanced AI matters.

. . .

Wait, really? Toys like ChatGPT and DALL-E are life-and-death stakes? That leads us to the final misconception I'd like to address:

5) No, experts don't think current AIs are high-risk/reward.

Oh come on, one might reasonably retort, AI can't consistently draw more than 3 objects. How's it going to take over the world? Heck, how's it even going to take my job?

I present to you, a relevant xkcd:

Comic. Megan & Cueball show White Hat a graph of a line going up, not yet at, but heading towards, a threshold labelled "BAD". White Hat: "So things will be bad?" Megan: "Unless someone stops it." White Hat: "Will someone do that?" Megan: "We don't know, that's why we're showing you." White Hat: "Well, let me know if that happens!" Megan: "Based on this conversation, it already has."

This is how I feel about "don't worry about AI, it can't even do [X]".

Is our postmodern memory-span that bad? One decade ago, just one, the world's state-of-the-art AIs couldn't reliably recognize pictures of cats. Now, not only can AI do that at human-performance level, AIs can pump out a picture of a cat-ninja slicing a watermelon in the style of Vincent Van Gogh in under a minute.

Is current AI a huge threat to our jobs, or safety? No. (Well, besides the aforementioned deepfake scams.)

But: if AI keeps improving at a similar rate as it has for the last decade... it seems plausible to me we could get "Einstein/Oppenheimer-level" AI in 30 years.^[36]^[37] That's well within many folks' lifetimes!

As "they" say:^[38]

The best time to plant a tree was 30 years ago. The second best time is today.

Let's plant that tree today!

🤔 (Optional flashcard review #2!)

🤘 Introduction, in Summary:

The 2 core conflicts in AI & AI Safety are:
- Logic "versus" Intuition
- Problems in the AI "versus" in the Humans
Correcting misconceptions about AI Risk:
- It's not a fringe concern by sci-fi weebs.
- It doesn't require AI consciousness or super-intelligence.
- There's many risks besides "literal 100% human extinction".
- We are aware of AI's upsides.
- It's not about current AI, but about how fast AI is advancing.

(To review the flashcards, click the Table of Contents icon in the right sidebar, then click the "🤔 Review" links. Alternatively, download the Anki deck for the Intro.)

Finally! Now that we've taken the 10,000-foot view, let's get hiking on our whirlwind tour of AI Safety... for us warm, normal, fleshy humans!

Click to continue ⤵

PART ONE →