General
AI
- Andrew White, the other founder of Future House, wrote a great essay on
designing reward functions for ether0, a chemistry reasoning model finetuned
from a 24B Mistral model with reinforcement learning with verifiable rewards
(RLVR) and supervised fine-tuning (SFT).
- One of the most interesting parts of the essay: they finetuned the Mistral
to do retrosynthesis, the task of proposing reactions that will lead to a
target chemical compound. Prompts looked like “Suggest a commercially
feasible one-step route to synthesize Cc1cccc(CN(C)CC(C)O)c1O. Answer with
reaction smiles format (e.g.,
CC=O.O=C1CCC1Cl>[Mg2+].CCOCC>CC(O)C1CCC1=O)”.
- The Mistral got a reward for proposing reactions with purchasable
ingredients. (They determined whether an ingredient was purchasable with a
Bloom filter over chemists’ catalogs; a minimal sketch of that kind of check
appears at the end of this item.) I find it remarkable that this worked: the
Mistral learned from this reward signal and started proposing sensible
reactions.
ether0 is the most compelling demonstration I’ve seen that RLVR generalizes
beyond coding, math, and logic. It would be fun to take inspiration from it
to build automated data scientists (since correct answers for data analysis
tasks on a given dataset are verifiable), medical diagnosticians like
DeepRare, literature reviewers like Kimi Researcher, legal reasoners,
protein designers, ML researchers and engineers, and more. It would also be
interesting to test if factual recall on benchmarks like SimpleQA improves
with RLVR.
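
  The purchasability reward is essentially a set-membership test against
  large vendor catalogs, which is exactly what Bloom filters are for. Below
  is a minimal sketch of that kind of check, not ether0’s actual
  implementation: a real version would canonicalize SMILES (e.g., with
  RDKit) before hashing, and the toy catalog here just reuses molecules from
  the example prompt above.

  ```python
  import hashlib

  class BloomFilter:
      """Tiny Bloom filter: hash each key to a few bit positions; set them
      on add, require all of them on lookup."""
      def __init__(self, num_bits: int = 1 << 24, num_hashes: int = 5):
          self.num_bits, self.num_hashes = num_bits, num_hashes
          self.bits = bytearray(num_bits // 8)

      def _positions(self, key: str):
          for i in range(self.num_hashes):
              digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
              yield int.from_bytes(digest[:8], "big") % self.num_bits

      def add(self, key: str) -> None:
          for p in self._positions(key):
              self.bits[p // 8] |= 1 << (p % 8)

      def __contains__(self, key: str) -> bool:
          return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(key))

  # Build the filter once from catalogs of purchasable compounds
  # (toy catalog below, taken from the example prompt).
  purchasable = BloomFilter()
  for smiles in ["CC=O", "O=C1CCC1Cl", "[Mg2+]", "CCOCC"]:
      purchasable.add(smiles)

  def reactants_purchasable(reaction_smiles: str) -> bool:
      """Reward check for 'reactants>agents>product': are all reactants purchasable?"""
      reactants = reaction_smiles.split(">")[0].split(".")
      return all(r in purchasable for r in reactants)

  print(reactants_purchasable("CC=O.O=C1CCC1Cl>[Mg2+].CCOCC>CC(O)C1CCC1=O"))  # True
  ```

  A Bloom filter can return false positives (claiming a compound is
  purchasable when it isn’t) but never false negatives, a reasonable trade
  for a fast, memory-light reward check over huge catalogs.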
- Vladimir Nesov’s post “Slowdown After 2028: Compute, RLVR Uncertainty, MoE
Data Wall” has two particularly interesting sections: the argument that the
world will run out of capex to scale compute, and the argument that RLVR
mostly elicits capabilities already learned during pretraining.
- I am not sure the latter is a sound argument, because of the results from
Nvidia’s ProRL. The paper showed that RLVR improved performance on some
tasks that the vanilla instruction-tuned model failed on even at very high k
under pass@k (the estimator is sketched below). These tasks had a special
property: they were logic puzzles vastly underrepresented in the training
data compared to code and math. I think this compellingly suggests that
prolonged RLVR enhances out-of-distribution (OOD) reasoning.
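
  For reference, pass@k here is the standard unbiased estimator from the
  HumanEval paper (Chen et al., 2021), nothing ProRL-specific: sample n
  completions per task, count the c correct ones, and estimate the chance
  that at least one of k draws is correct.

  ```python
  from math import comb

  def pass_at_k(n: int, c: int, k: int) -> float:
      """Unbiased pass@k: probability that at least one of k completions,
      drawn without replacement from n samples of which c are correct,
      solves the task."""
      if n - c < k:  # every size-k subset must contain a correct sample
          return 1.0
      return 1.0 - comb(n - c, k) / comb(n, k)
  ```

  The ProRL observation above is that for those underrepresented logic
  puzzles, the base model’s pass@k stays flat even at very large k, while
  the RLVR-trained model solves them.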
- RLVR generalizing to OOD reasoning is a big deal because OOD reasoning is
exactly what we need, in many cases, for AI that automates white-collar
labor. For example, training an automated biologist just by scaling
pretraining compute is really hard, because researchers’ “reasoning traces”
are not in the training data: they’re written in Slack channels or spoken
out loud in research group meetings. Automated biology is also hard to get
with SFT: just how many grad students can Mercor find?
- Dr GRPO is a cool modification to the
original GRPO formulation.
- Vanilla GRPO works by sampling multiple completions per question
(generally 4-16), giving each completion an advantage equal to the z-score
of its reward within the group, and dividing each completion’s loss term by
the length of its answer (to encourage brevity). For a more detailed
explanation, see this blog post.
- The Dr GRPO authors make two modifications:
  - stop dividing by the std-dev (i.e., just subtract the mean), and
  - stop dividing by the length of the output.
- The idea is that when questions are very hard or very easy, most
completions will get similar rewards (mostly 0s or mostly 1s). The std-dev
will then be small, and since advantages are divided by it, the gradient
update gets amplified. Medium-difficulty questions will have mixed results
and a higher std-dev, so gradients for them will be smaller. Bottom line:
training will disproportionately focus on very hard and very easy questions.
- And dividing by output length means longer incorrect answers get smaller
per-token penalties, so the model learns to “ramble” when it’s wrong. This
means that the famous plot in the DeepSeek R1 paper showing that the model
learns to generate longer chains of thought with continued RLVR could just
be reflecting this reward hacking of generating longer answers when wrong.
Both aggregation choices are sketched in code below.
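
  A minimal sketch of the two aggregation choices, with the importance
  ratios, clipping, and KL term of the full objectives stripped out; using a
  fixed max_len in place of the per-answer length for Dr GRPO is my own
  simplification of the paper’s fix.

  ```python
  import torch

  def group_advantages(rewards: torch.Tensor, dr_grpo: bool,
                       eps: float = 1e-6) -> torch.Tensor:
      """rewards: (G,) scalar rewards for G sampled completions of one prompt."""
      centered = rewards - rewards.mean()
      if dr_grpo:
          return centered                       # Dr GRPO: mean-center only
      return centered / (rewards.std() + eps)   # GRPO: full z-score

  def policy_loss(token_logps: list[torch.Tensor], advantages: torch.Tensor,
                  dr_grpo: bool, max_len: int = 1024) -> torch.Tensor:
      """token_logps[i]: (len_i,) log-probs of completion i's tokens under the policy."""
      losses = []
      for logp_i, adv_i in zip(token_logps, advantages):
          loss_i = -(adv_i * logp_i).sum()
          # GRPO divides each completion's loss by its own length, which softens
          # the penalty on long wrong answers; Dr GRPO drops that per-answer
          # division (a global constant here keeps the loss scale stable).
          loss_i = loss_i / (max_len if dr_grpo else logp_i.numel())
          losses.append(loss_i)
      return torch.stack(losses).mean()
  ```

  With nearly uniform rewards (say one correct completion out of 16), the
  group std-dev is small, so the GRPO z-scores get large; that is the
  amplification described above.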
- Mechanize published the blog post “The upcoming GPT-3 moment for
RL” about
scaling RL through “replication training”: “tasking AIs with duplicating
existing software products, or specific features within them”.
- Charles Goddard’s “Extending AFM-4.5B to 64k Context Length” is a
fantastic retrospective. And, as it promises, it has a stunning,
surprising-it-even-works amount of soup (i.e., weight averaging of
checkpoints; a sketch follows).
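
  For readers unfamiliar with the term, a “soup” in the model-soups sense
  (Wortsman et al., 2022) is an elementwise average of the weights of
  several fine-tuned checkpoints that share an architecture. A minimal
  uniform-soup sketch, not the actual recipe from the AFM-4.5B post:

  ```python
  import torch

  def uniform_soup(checkpoint_paths: list[str]) -> dict:
      """Average the parameters of same-architecture checkpoints elementwise."""
      state_dicts = [torch.load(path, map_location="cpu") for path in checkpoint_paths]
      return {
          name: torch.stack([sd[name].float() for sd in state_dicts]).mean(dim=0)
          for name in state_dicts[0]
      }

  # Hypothetical usage: model.load_state_dict(uniform_soup(["a.pt", "b.pt", "c.pt"]))
  ```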
- Goodfire put out some work on parameter-decomposition-based mechanistic
interpretability.
- Anthropic
open-sourced
circuit-tracing tools.
- I recently learned about Philip Isola’s work on the Platonic Representation
Hypothesis in deep learning models, which
seems fascinating. See also “Harnessing the Universal Geometry of
Embeddings”.
- The Claude vending machine
experiment happened.
- A long but entertaining and useful read: “the
void”.
- alphaXiv launched a Cursor for
PDFs.
- DeepMind launched AlphaGenome, a supervised DNA model. Its architecture is
interesting: a U-Net-like encoder and decoder with attention and “pairwise”
blocks sandwiched in between (roughly the shape sketched below). lucidrains
is already working on a replication.
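
  A very rough sketch of that overall shape, just to make the description
  concrete: a convolutional encoder that downsamples the DNA sequence, a
  transformer trunk over the compressed positions, and a decoder with U-Net
  skip connections. The layer sizes, the output head, and especially the
  “pairwise” blocks (omitted here) are guesses, not the published
  architecture.

  ```python
  import torch
  import torch.nn as nn

  class AlphaGenomeLikeSketch(nn.Module):
      """Toy U-Net-over-DNA with an attention trunk; not the real AlphaGenome."""
      def __init__(self, channels: int = 128, depth: int = 4):
          super().__init__()
          self.embed = nn.Conv1d(4, channels, kernel_size=15, padding=7)  # one-hot A/C/G/T
          self.down = nn.ModuleList([
              nn.Sequential(nn.Conv1d(channels, channels, 5, padding=2), nn.GELU(),
                            nn.MaxPool1d(2))
              for _ in range(depth)
          ])
          trunk_layer = nn.TransformerEncoderLayer(d_model=channels, nhead=8,
                                                   dim_feedforward=4 * channels,
                                                   batch_first=True)
          self.trunk = nn.TransformerEncoder(trunk_layer, num_layers=4)
          self.up = nn.ModuleList([
              nn.Sequential(nn.Upsample(scale_factor=2),
                            nn.Conv1d(channels, channels, 5, padding=2), nn.GELU())
              for _ in range(depth)
          ])
          self.head = nn.Conv1d(channels, 1, kernel_size=1)  # e.g. one genomic track

      def forward(self, x: torch.Tensor) -> torch.Tensor:
          # x: (batch, 4, seq_len), seq_len divisible by 2**depth
          h = self.embed(x)
          skips = []
          for block in self.down:            # halve the sequence length each step
              skips.append(h)
              h = block(h)
          h = self.trunk(h.transpose(1, 2)).transpose(1, 2)  # attention over positions
          for block in self.up:              # upsample back, adding U-Net skips
              h = block(h) + skips.pop()
          return self.head(h)

  print(AlphaGenomeLikeSketch()(torch.zeros(1, 4, 2048)).shape)  # torch.Size([1, 1, 2048])
  ```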
- “Self-Adapting Language Models” is a cool
method. See also Sakana AI’s “Darwin Gödel
Machine”.
- The DeepSeek researcher Xingkai Yu wrote nano
vLLM, a minimal implementation
of vLLM in just over a thousand lines
of Python. I plan to dig into the code at some point and perhaps write a blog
post walking through it.
- Research of the genre “attacks to figure out as much as you can about
proprietary closed-source LLMs through all the information they expose” is
super cool. Two standout pieces are “The Worst (But Only) Claude 3
Tokenizer” and “Stealing
Part of a Production Language Model”.
- FAIR’s “Corrector Sampling in Language
Models” is a simple idea but gets
remarkably effective results: “Fine-tuning a pretrained 8B parameter model
with RPT for only 100B resulted in ~10% relative improvements on reasoning and
coding benchmarks compared to the standard sampling.”
- Pretty impressive o3-pro
demonstration.
- Leo Gao’s lessons for ML
research.
- Gwern’s “Number Search Engine via NN Embeddings”.
- Jack Morris’ “There Are No New Ideas in AI… Only New
Datasets”.
- George Mandis’ “OpenAI Charges by the Minute, So Make the Minutes
Shorter”.
- Sean Goedecke wrote a great post on the flaws in that Apple paper: “The
illusion of ‘The Illusion of Thinking’”.