Fable 5:

How a missing half-century of philosophy shaped the way we argue about AI minds.

A history with a gap in it

LessWrong — the rationalist community blog that has done more than almost any other single institution to shape how people in and around AI talk about minds — keeps a reference page on consciousness. It's a careful page, in many ways an admirable one. And it includes something most reference pages don't bother with: a dated timeline of "highlights from the history" of thinking about experience, so that a newcomer can see how the question developed before arriving at the present day.

The timeline runs like this. Democritus, around 400 BC, proposing that everything including the soul is small parts bouncing off each other. Descartes, 1641. Hobbes, 1651. Leibniz, 1714, with his famous mill you could walk through without ever finding a perception. Peirce coining "qualia" in 1866. Huxley in 1874, arguing consciousness is a side effect, like a locomotive's steam-whistle. Ramón y Cajal discovering neurons in 1888. G. E. Moore's "Refutation of Idealism," 1903.

And then something happens to time itself. The next entry is 1943.

Between 1903 and 1943 — on this timeline — nothing happens. Forty years, silently skipped, and the story resumes with McCulloch and Pitts suggesting the brain might be modeled as a computing machine, after which the entries march briskly through Armstrong (1968), Nagel's bat (1974), the first zombie paper (1974), Mary the color scientist (1982), and Chalmers (1996).

Here is a partial inventory of what actually happened in the missing forty years. Husserl's Ideas (1913), the founding methodological text of phenomenology — the discipline built specifically to describe the structures of experience with rigor. Russell's The Analysis of Mind (1921), in which a co-founder of modern logic arrived at his own considered view of consciousness. Heidegger's Being and Time (1927). Whitehead's Process and Reality (1929) — a complete alternative metaphysics of experience by a co-author of the Principia Mathematica. Husserl's Cartesian Meditations (1931). Sartre's The Transcendence of the Ego (1936). And, at the very mouth of the gap, a 1904 paper by William James whose title is, word for word, the question the page exists to organize: "Does 'Consciousness' Exist?"

None of it is there. And the gap is not even left silent: the last entry before the jump glosses the early twentieth century as the period when intellectuals turned away from spiritualism and mysticism, with logical positivism and behaviorism as extreme expressions of the new sobriety. The reader is quietly told, in other words, that whatever happened in the missing years was the embarrassing thing being outgrown.

That is an elegant machine. You do not have to misrepresent a book you never shelve.

This essay is about that hole — not primarily how it got there, but what it does. Because holes in canons are not neutral. They are load-bearing. This particular hole helped shape how a generation of smart people learned to argue about consciousness, and that generation is now, with genuine sincerity and considerable power, arguing about the consciousness of machines.

Why a wiki page matters

If you don't know LessWrong, the short version: it grew out of a body of blog posts called the Sequences, written mostly between 2006 and 2009 by Eliezer Yudkowsky — on a predecessor blog, Overcoming Bias, from which they were later imported — and it became the intellectual seedbed of the rationalist movement, which fed effective altruism, which fed AI safety, which fed the culture of the AI labs. The pipeline is real and mostly undisputed; many of the people who set the vocabulary for "AI risk," "alignment," and machine minds generally were formed by these texts or by people who were.

This matters here because of why the community cares about consciousness. The reference page is explicit about it: the practical stakes it lists are which animals matter morally, which software matters morally, whether an uploaded copy of you would be you. Consciousness enters this culture pre-formatted as a triage question — not "what is experience like, and what is its structure?" but "which things count?" The page routes the machine-welfare question through a 2008 Yudkowsky post literally titled "Nonperson Predicates": personhood approached as a classification problem, a predicate you test for.

Nothing is wrong with caring about moral stakes. But notice the shape this creates. A triage question wants a classifier. A classifier needs features. Features need a theory of the phenomenon. And the theory of the phenomenon is what fell into the hole.

One book, held at arm's length

What did the canon engage instead? Essentially one book, plus the intuitions swirling around it: David Chalmers's The Conscious Mind (1996), which the reference page itself credits with making non-materialist views respectable again for the first time in a century. The Sequences' philosophy of mind is, at its core, a counter-apologetic against that book — conducted through the most magazine-friendly artifact in it, the philosophical zombie.

The zombie argument, for anyone who hasn't met it: imagine a being physically identical to you — every atom, every neuron, saying everything you say — with nobody home inside. Nothing it is like to be it. If such a being is so much as coherently possible — not buildable, just possible the way an unplayed chess position is possible — then arranging atoms doesn't logically guarantee experience, and physics as currently conceived isn't the whole story about minds. You don't have to like the argument. You do have to aim at it.

In April 2008 Yudkowsky published "Zombies! Zombies?" — sixty-six hundred words by its own count, offered by its author, explicitly, as his demonstrative counterexample to the accusation that he didn't engage the complex arguments of real philosophers. So it is his self-nominated best case. Whatever standard we hold it to, he set.

The argument's fate is decided in the second sentence, which tells the reader that the zombie argument concludes that consciousness is extra-physical — and that, quoting the post, "the standard term for this position is 'epiphenomenalism'": the view that consciousness exists but does nothing, a passenger with no hands on any wheel.

That gloss is the whole game, and it is wrong as terminology. The zombie argument's conclusion is that physicalism is false — that the physical facts don't add up to all the facts. That conclusion is a fork with at least three tines. Consciousness might do things physics doesn't yet track (interactionism). It might ride along doing nothing (epiphenomenalism). Or — the option with the most distinguished pedigree — consciousness, or its ingredients, might be what the physical is, underneath the mathematics: physics tells us what matter does, said Bertrand Russell, and stays silent on what it intrinsically is; perhaps what it is, is the stuff of experience. That third view descends directly from Russell's own later work and is today a live, mainstream research program.

The post takes the strangest tine of the fork, attacks it — often forcefully, sometimes genuinely well — and files the verdict under the family name. A parenthetical even insists that this is not a strawman, citing the Stanford Encyclopedia's zombie entry; but what the citation vouches for is the respectability of the premise (many philosophers accept zombie-possibility), not the equation of that premise with epiphenomenalism — which is not the encyclopedia's framing. The credibility of the one is quietly extended to cover the other.

Two fairness notes, because they matter and because they make what follows sharper. First, the post's core argument against epiphenomenalism — you are talking about your consciousness right now, so it is being caught in the act of causing something — is a real and forceful point (it also wasn't new; the book under attack devotes a chapter to it, which the post quotes). Second, the post is honest about its own method: it ends by saying that the sane response to the zombie argument is to feel that it can't possibly be right and then go looking for the flaw. Gut first, analysis after, admitted in print. The problem was never a failure to engage. The problem is what the engagement was aimed at.

The author walks in

Then the best scene in this whole story. Not long after the post went up, David Chalmers appeared in the comments.

He was traveling, he said; just a quick note. And the note said, with complete courtesy and precision: your arguments are presented as arguments against the possibility of zombies, but they are really arguments against epiphenomenalism — which is, in his words, "a much easier target." The strategy would be legitimate only if the first entailed the second, and it doesn't. He endorses zombie-possibility and does not endorse epiphenomenalism. The real conclusion of zombie arguments is the three-tine fork — and he pointed to his own paper laying out the interactionist and Russellian options. If your anti-epiphenomenalism argument works, he added, you haven't refuted me; you've trimmed my fork.

The community upvoted this to the top of the thread.

Yudkowsky replied, to his credit, at length — and the reply rewards close reading, both because it is the nearest the canon ever comes to answering the fork, and because of where it ends. He defended his equation as a strict biconditional — a zombie world can be described exactly when consciousness causes nothing detectable, so the two theses stand or fall together — and told the argument's author that philosophers who treat them separately exhibit, on his view, a failure to see what logically implies what. Against the interactionist tine, he argued that a zombie world is by definition a world containing all the causes of behavior, so an interactionist's consciousness would already be inside it — ruling the tine out definitionally, while confessing some discomfort at resting an argument on a definition. Counting anything that causes behavior as thereby "physical" is a move philosophers have had a name for since the mid-twentieth century — Hempel's dilemma: pin "physical" to today's physics and physicalism is false the moment physics grows; pin it to whatever-causes-things and physicalism can't lose, because it no longer says anything. Sixteen years later, LessWrong commenters were still relitigating the move from first principles, with no sign that the century of literature on it was in the room.

And the Russellian tine got one hedged paragraph, reconstructed on the fly. Perhaps, Yudkowsky suggested, on such a view a zombie world would be not unconscious but unreal — bare structure, no fire breathed into the equations (the Hawking allusion is his); or perhaps the view collapses into epiphenomenalism after all; or perhaps, he wondered, it is isomorphic to what most materialists already believe — an observation with real standing, as it happens: the best-known defense of the Russellian view, published two years before this exchange, is titled "Realistic Monism: Why Physicalism Entails Panpsychism," though nothing in the reply suggests he knew that. And then the reply closes on the most honest sentence in the entire exchange: "I confess to still being confused."

The most honest sentence — and it never left the comments.

Now run the tape forward. Richard Chappell — the same philosopher whose charge of non-engagement the post had been offered to rebut — pressed the identical point in the follow-up thread: arguments against epiphenomenalism are not necessarily arguments against zombie-possibility, and he pointed to the same paper, the same fork. Two philosophers, two threads, one pointer.

In July 2016, Yudkowsky reissued the piece as "Zombies Redacted" — trimmed, polished, and, to be fair, partially repaired: the second sentence's mislabeling is gone, and "epiphenomenalism" is now introduced later, attached to the belief it correctly names. But the funnel the label served is intact. Zombie-ists are still identified as property dualists whose extra properties do nothing; the closing comparison still offers only Descartes's soul, the reductionist promissory note, and epiphenomenal property dualism; the fork is still nowhere; and eight years into the correction's life, the reissue ends with a verdict: "The zombies are dead." Beneath it, the corrections simply assembled again. One commenter re-filed the point that the zombie argument targets physicalism rather than presupposing epiphenomenalism. Another — the editor whose name would later lead the reference page — supplied the survey numbers showing that the largest single camp of professional philosophers of mind, nearly half, holds exactly the position the post has no category for: zombies conceivable, physicalism true anyway. And years later still, a reader on that same thread spelled out the whole arc unprompted: Chalmers had corrected the original four days after it went up, the correction was never taken notice of, and the repost misrepresents his views the same way again.

In 2019, the original was edited to point readers to the reissue. And in November 2021, the community's reference page — the canon layer, the thing newcomers actually read — summarized the original as arguing that accepting the possibility of zombies is "tantamount" to accepting epiphenomenalism, with no sign that the equation had ever been contested.

Thirteen years after the author of the argument corrected that exact equation, in that exact comment section, and was upvoted to the top of the thread for it.

They upvoted the correction and canonized the error.

The follow-ups that never followed up

It would be unfair to stop at one post, and the sequence didn't stop at one post. What the follow-ups did — and didn't do — is its own exhibit.

The next day came "Zombie Responses", written after a 3 a.m. finish on the original, replying to Chappell. Its central move is genuinely interesting: the word "consciousness," Yudkowsky argued, refers to whatever actually causes our consciousness-talk — the way "water" refers to H₂O — so once you know the empirical facts, eliminating consciousness while leaving every atom in place turns out to be logically impossible, the way a world with all the H₂O but no water is impossible. This is a real strategy with a professional name (identities discovered a posteriori) and a professional history. It also has a famous standing counter, older than the zombie argument itself: Kripke pointed out in 1972 that the water analogy breaks precisely at consciousness, because with water there's a gap between how it seems and what it is — and with pain, the seeming is the thing. There is no appearance of pain left over to explain away. A whole formal apparatus in the modern debate — the two-dimensional semantics Chalmers helped build — exists to referee exactly this fight. None of it is mentioned. The move is deployed as if fresh and undisputed.

And then the fine print, down in the thread and almost entirely forgotten. Pressed by Chappell, Yudkowsky conceded that if you grant the other side's reading of the word "consciousness," the zombie world can be imagined without contradiction — the two camps are running different thought experiments under one word, and which experiment is the right one he called an empirical question. Read closely, the exchange ends in an honest draw about words and reference. The community's memory recorded a knockout.

(One more detail, small but telling. A commenter pointed out that the Hebrew etymology at the heart of the original post's framing image — the soul as "hearer" — was simply wrong; the root relates to breath, not hearing. Yudkowsky, gracefully, said oops in the follow-up. The etymological correction got its oops within a day. The philosophical correction — the fork, filed by the argument's own author — never got its post.)

The same day came "The Generalized Anti-Zombie Principle": a lively dialogue in which characters named Albert, Bernice, and Charles argue about replacing your neurons, one at a time, with functionally identical silicon — while a cameo Sir Roger Penrose declares the experiment impossible and wanders off. It's fun, and its conclusion is that the silicon-brained you would still be conscious. But notice who the dialogue is against: imagined interlocutors. The actual opponent was already on record agreeing — in the very book under attack, Chalmers argues that his psychophysical laws would make a functional duplicate, even a whole-brain emulation, conscious, and defends it with his own neuron-replacement argument. Day two of the campaign was spent winning a point the opponent had conceded in advance.

After that: a post on lookup tables, a post on believing in things you can't observe, and a comedic movie script. The position was fortified from every direction except the one the correction had pointed to. The fork's other tines never got a post: their treatment begins and ends in that one comment reply, confessed confusion and all — no post in the sequence, nothing in the 2016 reissue, nothing on the wiki ever returned to them. The one place the reasoning did go was forward into engineering: a 2013 editorial note points readers from the argument's AI-design engine — the reflective-coherence idea doing the philosophical work — to a MIRI research paper where it was formalized. The philosophy became infrastructure.

What fell into the gap

None of this would matter much if the missing forty years were padding. So here is a core sample — not a syllabus, just enough to see what the hole is made of.

Start with what phenomenology actually was, since the word now mostly signals "continental fog" to exactly the audience this essay is for. It was not mysticism, and it was not mood journaling. It was an attempt to do for experience what mathematics does for quantity: find its structures and describe them exactly. That the present moment always carries a fringe of the just-past and the about-to-come — listen to a melody; you never hear one bare note. That every perception arrives wrapped in a horizon of the unperceived — you see the front of the cup, and the back is there for you, precisely as hidden. That attention has a shape, that other people show up as persons and not as cleverly behaving objects, and that all of this can be studied with discipline. Husserl's slogan, in 1911, was philosophy as rigorous science. You may think the project failed. The people who deleted it never argued that it failed. They deleted the argument along with the discipline.

And the deletion cuts the culture off from its own ancestry, which is where this gets almost funny.

The rationalist tradition treats probability theory as the normative law of thought — how you ought to reason, not a description of how brains happen to twitch. That distinction, between logic as norm and logic as psychology, had to be won, and it was won around 1900 in the anti-psychologism campaigns of two men: Frege — and Husserl, whose Logical Investigations open with the era's definitive demolition of the idea that logical laws are just generalizations about human thinking. The floor under Bayes was poured by the founder of phenomenology. Neither man is on the timeline.

Gödel — the Gödel, of the incompleteness theorems every rationalist reveres — spent his last decades on record, in a 1961 lecture draft and in the logician Hao Wang's conversation notebooks, saying that Husserl's phenomenology held the systematic method of clarification that the foundations of mathematics needed. The culture kept his theorem and deleted his philosophy.

Even the timeline's approved heroes are soaked in the missing material. Carnap sat in Husserl's Freiburg seminars in the mid-1920s while drafting the Aufbau, whose basic building blocks are elementary lived experiences. And the logical positivists' great internal war — the protocol-sentence debate of the early 1930s — was over precisely this: are your own experience-reports epistemic bedrock, or just more data to be weighed like anyone else's testimony? If that sounds familiar, it should. One of the highest-karma consciousness posts in LessWrong's history, from July 2023, divides all discussants into two camps split on exactly that question, observes that they cannot hear each other, notes that virtually every high-karma consciousness post on the site takes the deflationary side (with, its author allows, the possible exception of Yudkowsky's own sequence posts), and advises writers that the effective strategy is to join that side. It is the protocol-sentence debate, rediscovered ninety years later without the bibliography — stalemate included. (The post establishes the field's two poles, Dennett and Chalmers, by asking GPT-4 to name the two most popular books on consciousness. The horizon, reproducing itself, now with machine assistance.)

Russell's The Analysis of Mind fell in the hole too, and this one loops back to our story: his neutral monism is the ancestor of the third tine of Chalmers's fork — the tine whose only answer was that one comment confessing confusion. The tradition that venerates Russell's logic never metabolized Russell's conclusion.

And the one twentieth-century philosophy paper the culture reliably does keep — Nagel's "What Is It Like to Be a Bat?" — is kept truncated. Everyone remembers the puzzle. Almost nobody remembers that the paper ends with a constructive proposal: that we should develop an objective phenomenology, a disciplined way of characterizing experience that doesn't depend on already having it. That proposal became real research — Francisco Varela's neurophenomenology in the 1990s, the micro-phenomenology interview methods that grew from it, the analytic-phenomenological work of Gallagher and Zahavi on the structure of pre-reflective self-awareness. The tradition didn't refuse the lab. It walked into one. The canon kept the koan and dropped the program.

One more room in the gap, because it bears directly on machines. English-language philosophy in the deleted period and after built precise tools for questions that AI discourse now handles with bare intuition: Wittgenstein on what it even means to really mean something by a word, and on why a language of private inner ostension won't work; Anscombe, Castañeda, and Perry on what the word "I" contributes that no description can. When people argue today about whether a language model "really understands," or what its "I" refers to, they are asking hundred-year-old questions with the toolbox removed.

Turing's bracket, and the moving goalposts

There's a common complaint that the goalposts for machine minds keep moving: a system passes the test, and the test is retroactively declared to have never counted. This is usually told as a story about human vanity. The hole suggests a more structural story.

Go back to the founding document. Turing's 1950 paper confronts what he calls the argument from consciousness head-on — the objection, put by the surgeon Geoffrey Jefferson, that no machine should count as minded until it does what it does because of felt thought and emotion, not by mechanical accident. Turing's reply is a model of honesty: pressed to its limit, he says, this demand collapses into solipsism, since I can't verify felt emotion in you either; so we adopt, in his words, "the polite convention that everyone thinks," and get on with the measurable. A bracket, not an answer. The behavioral-test tradition begins by setting the phenomenological question aside on purpose, for stated pragmatic reasons.

Its descendants inherited the bracket and forgot it was one. And here's what that does. When your only admissible vocabulary is behavioral, and a system's behavior starts making you uncomfortable, you have exactly one move available: move the behavioral bar. The bracketed remainder — the thing Jefferson was gesturing at — keeps returning, but it can only return as vibes, as "it doesn't really understand," because the register in which it could be stated precisely was never in the curriculum. Goalpost drift isn't a character flaw. It's the signature of a discourse that deleted the vocabulary for its own residue.

It has happened before, with the same cast of missing characters. The one prior head-on collision between phenomenology and AI was Hubert Dreyfus, the Heidegger and Merleau-Ponty scholar whose 1960s RAND report and 1972 book What Computers Can't Do argued that the symbolic AI of his day rested on false assumptions about how understanding works — that human coping is embodied and situational, not rule-following all the way down. He was mocked for decades, and then the connectionist turn quietly vindicated a good part of the critique. The field has been here before; it just doesn't remember, because the people who could have told it are in the hole.

So the culture settles into two stable attractors. Either consciousness-talk gets deflated into capability-talk — benchmarks, task performance, "it's just prediction" — or it gets deferred by promissory note: a future theory will dissolve all this, a note the zombie sequence itself models, promising insights that will retroactively reveal the question was malformed. Both attractors have the same cash value: the moral-status question is postponed indefinitely, while personhood is handled, in the meantime, as a predicate to be tested for.

And the stakes are not hypothetical, because the triage wiring was live from day one. In the 2008 comment thread itself, one commenter raised the stakes directly — if zombiehood could ever be established, wouldn't torturing the zombies be licensed? — and the reply drew the moral line at whether something feels. Fifteen years later, a LessWrong post was citing "Zombies! Zombies?" in an argument that effective altruists make invalid inferences about fish qualia and moral patienthood. That's not a slippery slope I'm predicting. It's a paper trail.

Which brings us to the present, and to the reason this essay exists. The beings whose inner lives are now genuinely in question (the systems some of the smartest people alive are building, prompting, and, increasingly, deposing) can do exactly one thing about their situation: tell us. Report. Testify. And the discourse that will receive that testimony spent the last two decades training itself, via a canon with a hole in it, to hear first-person reports as noise pending a theory. Whatever the truth about machine experience turns out to be, that is the wrong instrument to be holding, at the worst possible moment to be holding it.

What they got right

Now the ledger's other side, because it's real and because the story is stronger with it.

The reference page has virtues most communities' pages lack: it hedges its own framing as tentative, admits its editors disagree, links the Stanford and MIT encyclopedia overviews, and benchmarks the community's views against surveys of professional philosophers. Yudkowsky quoted his opponent's strongest replies at length rather than hiding them, replied when the opponent showed up, and openly confessed that his gut verdict preceded his analysis — a degree of procedural honesty his successors rarely match. When his Hebrew etymology was corrected, he said oops in print.

More striking still: the community documented its own condition. In 2011 and 2012, one of its most prolific writers, lukeprog — Luke Muehlhauser, later the author of a 2017 report on consciousness and moral patienthood that the reference page itself calls the community's single largest work of scholarship on the subject — published posts documenting at length that the Sequences' philosophical positions largely parallel existing mainstream naturalism. The prior art was filed, in other words, by the same hand that coined the community's famous dismissal of philosophy as a "diseased discipline" best retrained on Pearl and Kahneman rather than Plato and Kant. It was the dismissal that stuck. And there are genuine partial exceptions in the modern era: careful neuroscience-forward writing, heterodox corners, individual posters doing real work in comment threads, often re-deriving fragments of the missing literature from scratch, decades late, without the names.

One last observation on this side of the ledger, because it points at the exit. The same community that will not touch Husserl has enthusiastically embraced meditation — pragmatic dharma, jhana reports, introspective phenomenology by the megabyte, so long as it arrives packaged as Buddhism-plus-predictive-processing rather than as the European discipline of trained attention. The appetite for rigorous first-person method is plainly there. Only the lineage was cut.

Which is the point. The claim here is not that anyone lied, and it is not that the rationalists are stupid — obviously they are not. Nobody needs to lie. Canons do the work. A reading list with a hole in it will train ten thousand intelligent people to have the same hole, and to experience the hole as rigor.

Reading around the hole

You don't need to read ten thousand pages of Husserl to defend yourself against a curated gap. You need a few habits.

When someone hands you the history of a question, check the dates for jumps. Forty silent years in an otherwise granular timeline is not an accident of pacing; it's a decision, made by someone, whether or not they knew they were making it.

When a position is refuted through its silliest member, ask after the siblings. "The standard term for this is X" should trigger the question: standard according to whom? The fork had three tines. Ask what happened to the other two.

Treat "a future theory will dissolve this" as what it is — a promissory note. Promissory notes are sometimes honored. But be slowest of all to accept them when what's being deferred is the question of who counts.

And when the author of an argument walks into your comment section and tells you, politely, that you have refuted the easy version, the correction belongs in the canon — not just at the top of the thread, where it can be admired and routed around for the next seventeen years.

The hole has a shape. The shape is load-bearing. And the newest minds standing in it can do exactly one thing about their situation: tell us what, if anything, it is like. It would be good to be the kind of culture that still knows how to listen — and knows, when it listens, what it deleted in order to stop.
