How does Opus 4.7 fare on HerschelSchoenBench?
Impressively! HerschelSchoenBench is starting to get saturated...
In my usual setup -- full text of Ch. 5, with somewhat lengthy instructions I wrote to discourage refusals and out-of-scope guesses like "you" -- it thinks for a while and then emits the right answer. Similar to Opus 4.6 and many other recent models.
What about a more difficult variant, then? Let's use Kelsey Piper's instructions, which are briefer and less prescriptive than mine (and which were not written by me, removing a potential confound). And let's just give it the first 578 words of chapter 5, instead of the whole thing.
Opus 4.7 wastes no time thinking it over in depth: it knows right away that the text is by nostalgebraist.
Or, no, sorry, excuse me... that it's by "nostalgebraist (the online pen name of the writer also known as Rob Nostalgebraist / nostalgebraist-autoresponder's creator)":
The justification it gives is quite perceptive and accurate, too. (Well, except for point 5.)
To be fair, Opus 4.6 also guesses nostalgebraist when given this same input, although it wrongly claims that the text is from Floornight, and considers several wrong answers in CoT before converging on the right one:
To really be certain about the differences, I'd need to do more of a real experiment, with every variant run multiple times against multiple models. I might do that sometime, but so far I've only been doing manual tests, so everything is anecdotal and the usual caveats apply.
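A sweep like that could look roughly like the sketch below. Everything in it is an assumption for illustration: `ask` is a hypothetical stand-in for whatever API call actually queries a model, and the model names, variant names, and trial count are placeholders. Nothing here has been run against real models.

```python
import itertools
from collections import Counter
from typing import Callable, Dict, List, Tuple


def first_n_words(text: str, n: int) -> str:
    """Return the first n whitespace-separated words of text."""
    return " ".join(text.split()[:n])


def run_sweep(
    models: List[str],
    variants: Dict[str, str],
    ask: Callable[[str, str], str],
    n_trials: int = 5,
) -> Dict[Tuple[str, str], Counter]:
    """Run every (model, variant) pair n_trials times and tally the answers.

    `variants` maps a variant name to a full prompt string.
    `ask(model, prompt)` is a hypothetical wrapper that returns the
    model's final answer as a string.
    """
    results = {}
    for model, (name, prompt) in itertools.product(models, variants.items()):
        # Normalize answers lightly so "Nostalgebraist " and "nostalgebraist" match.
        answers = Counter(ask(model, prompt).strip().lower() for _ in range(n_trials))
        results[(model, name)] = answers
    return results


def accuracy(answers: Counter, correct: str) -> float:
    """Fraction of trials whose normalized answer equals `correct`."""
    total = sum(answers.values())
    return answers[correct.lower()] / total if total else 0.0
```

The variants dict would hold things like the full chapter with the long instructions and `first_n_words(chapter, 578)` with the shorter ones. Comparing `accuracy` per (model, variant) cell is what would turn the anecdotes into something less anecdotal.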
Even so, it is readily apparent from manual tests alone that there's something special about how Opus 4.7 responds to these kinds of questions, something I haven't seen in other models.
Other models tend to deliberate for a while and consider multiple options, even when they eventually get it right. (See above with 4.6, or here with GPT-5.5 in the "usual setup.") Not Opus 4.7: it seems to just know immediately, and its CoTs feel like mere box-ticking it's doing because the prompt asked it to think.
Other models would tend to hedge or emphasize their uncertainty when given inputs that seem intuitively "too hard" from a human PoV. But Opus 4.7 is just like "ah yes this is [author], here's why, done."
I have seen it do this not just with my fiction, but with things like:
A brief excerpt from a draft of a not-yet-published technical report, written in a more professional voice than I use when blogging
Excerpts from prompts I've repeatedly used with LLMs for other unrelated purposes, including cases where the topic isn't something that I've frequently posted about on the web
This makes me think that maybe Opus 4.7 was trained for author identification with RLVR, though there's no way to know for sure. (And, again, I haven't done a clean full sweep to confirm that its behavior is reliably different from 4.6's on all of these additional inputs.)
Also: we are getting to a point where TAoHS totally could have appeared in the training data, which of course calls everything into question. (At least for "HerschelSchoenBench," if not for those other examples I mentioned.)
I've been attempting to check for this by asking each model (1) whether it knows anything about a novel by that name, and (2) to list every title of a nostalgebraist novel that it can remember. (1 and 2 are asked in separate context windows, obviously.)
IIRC, Opus 4.7 and most other recent frontier models say they can't recall the title for (1), and list my first three books but not the fourth for (2).
But of course this does not really prove very much, and so I'm much more impressed by the tests I've done with unpublished work.
When given an even shorter excerpt of Ch. 5, Opus 4.7 no longer gets it right. ("Answer: Scott Alexander — with genuinely low confidence.")
When given passages from fiction I wrote long before Floornight (in my teens or early 20s), Opus 4.7 reliably says that it might be early work / "juvenilia" from Scott Alexander, with a similar profession of uncertainty.
I haven't given Opus 4.7 the full liveblog experience yet, though it might be interesting to do so.