General

AI

  • Andrew White, the other founder of Future House, wrote a great essay on designing reward functions for ether0, a chemistry reasoning model finetuned from a 24B Mistral model with reinforcement learning with verifiable rewards (RLVR) and supervised fine-tuning (SFT).
    • One of the most interesting parts of the essay: they finetuned the Mistral to do retrosynthesis, the task of proposing reactions that will lead to a target chemical compound. Prompts looked like “Suggest a commercially feasible one-step route to synthesize Cc1cccc(CN(C)CC(C)O)c1O. Answer with reaction smiles format (e.g., CC=O.O=C1CCC1Cl>[Mg2+].CCOCC>CC(O)C1CCC1=O).”
    • The Mistral got a reward for proposing reactions with purchasable ingredients. (They determined if an ingredient was purchasable with a Bloom filter over chemists’ catalogs.) I find it remarkable that this worked — the Mistral learned from this reward signal and started proposing sensible reactions. (A minimal sketch of such a reward check follows this list.)
    • ether0 is the most compelling demonstration I’ve seen that RLVR generalizes beyond coding, math, and logic. It would be fun to take inspiration from it to build automated data scientists (since correct answers for data analysis tasks on a given dataset are verifiable), medical diagnosticians like DeepRare, literature reviewers like Kimi Researcher, legal reasoners, protein designers, ML researchers and engineers, and more. It would also be interesting to test if factual recall on benchmarks like SimpleQA improves with RLVR.
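    • A minimal sketch of what such a purchasability reward could look like. The Bloom filter parameters, the toy catalog, and the helper names are my own illustration, not ether0’s actual implementation:

```python
import hashlib

class BloomFilter:
    """Tiny Bloom filter; a real system would use an optimized library.
    (False positives are possible, false negatives are not.)"""
    def __init__(self, n_bits: int = 1 << 20, n_hashes: int = 5):
        self.n_bits, self.n_hashes = n_bits, n_hashes
        self.bits = bytearray(n_bits // 8)

    def _positions(self, item: str):
        for i in range(self.n_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.n_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))

# Toy "catalog" of purchasable reagents; in reality the SMILES would be
# canonicalized before hashing so that equivalent strings match.
purchasable = BloomFilter()
for smiles in ["CC=O", "O=C1CCC1Cl", "CCOCC"]:
    purchasable.add(smiles)

def purchasability_reward(reaction_smiles: str) -> float:
    """Reaction SMILES is 'reactants>agents>products'; reward 1.0 iff
    every proposed reactant is (probably) purchasable."""
    reactants = reaction_smiles.split(">")[0].split(".")
    return float(all(r in purchasable for r in reactants))

print(purchasability_reward("CC=O.O=C1CCC1Cl>[Mg2+].CCOCC>CC(O)C1CCC1=O"))  # 1.0
```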
  • Vladimir Nesov’s post “Slowdown After 2028: Compute, RLVR Uncertainty, MoE Data Wall” has two particularly interesting sections: the argument that the world will run out of capex to scale compute, and the argument that RLVR mostly elicits capabilities already learned during pretraining.
    • I am not sure the latter is a sound argument, because of the results from Nvidia’s ProRL. The paper showed that RLVR improved performance on some tasks that the vanilla instruction-tuned model failed on even at very high k in pass@k evaluations (a quick pass@k refresher follows this list). These tasks had a special property: they were logic puzzles vastly underrepresented in the training data compared to code and math. I think this compellingly suggests that prolonged RLVR enhances out-of-distribution (OOD) reasoning.
    • RLVR generalization to OOD reasoning is a big deal because OOD reasoning is exactly what we need, in many cases, for AI that automates white-collar labor. For example, training an automated biologist just by scaling pretraining compute is really hard, because researchers’ “reasoning traces” are not in the training data: they’re written in Slack channels or spoken out loud in research group meetings. Automated biology is also hard to get with SFT: just how many grad students can Mercor find?
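    • As a refresher, pass@k is usually computed with the unbiased estimator from the Codex paper: generate n completions, count the c correct ones, and estimate the probability that at least one of k randomly drawn completions is correct. A quick sketch (the numbers are made up):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: with c correct completions out of n samples,
    the probability that at least one of k randomly drawn samples is correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Even 2 correct samples out of 256 give a high pass@128,
# so "fails even at very high k" is a strong statement about a base model.
print(round(pass_at_k(n=256, c=2, k=128), 3))  # 0.751
```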
  • Dr GRPO is a cool modification to the original GRPO formulation.
    • Vanilla GRPO works by sampling multiple completions per question (generally 4-16), assigning each completion the z-score of its reward within the group, and dividing each completion’s loss by the length of its answer (to encourage brevity). For a more detailed explanation, see this blog post.
    • The Dr GRPO authors make two modifications:
      • stop dividing by the std-dev (i.e., just subtract the mean), and
      • stop dividing by the length of the output.
    • The idea is that when questions are very hard or very easy, most completions will get similar rewards (mostly 0s or mostly 1s). So the std-dev will be low, and dividing by it amplifies the gradient update. Medium-difficulty questions will have mixed results and a higher std-dev, so gradients for them will be smaller. Bottom line: training will disproportionately focus on very hard and very easy questions.
    • And dividing by output length means longer incorrect answers get smaller per-token penalties, so the model learns to “ramble” when it’s wrong. This means that the famous plot in the DeepSeek R1 paper, showing that the model learns to generate longer chains of thought with continued RLVR, could just be reflecting this reward hacking of generating longer answers when wrong. (A minimal sketch of both weightings follows this list.)
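    • A minimal sketch of the two weighting schemes on a toy group of completions. The variable names, the eps constant, and treating the length normalization as a simple per-completion weight are my simplifications, not either paper’s exact loss:

```python
import numpy as np

def grpo_weights(rewards, lengths, eps=1e-6):
    """Vanilla GRPO (simplified): z-score each reward within the group,
    then divide by the completion's token length."""
    r = np.asarray(rewards, dtype=float)
    adv = (r - r.mean()) / (r.std() + eps)          # small std => amplified advantages
    return adv / np.asarray(lengths, dtype=float)   # long wrong answers => diluted penalty

def dr_grpo_weights(rewards):
    """Dr GRPO: only subtract the group mean; no std or length division."""
    r = np.asarray(rewards, dtype=float)
    return r - r.mean()

# An easy question: 7 of 8 completions are correct, so the reward std is tiny
# and vanilla GRPO blows the advantages up; the one wrong answer also rambles,
# which shrinks its penalty further under the length division.
rewards = [1, 1, 1, 1, 1, 1, 1, 0]
lengths = [120, 90, 200, 150, 80, 110, 95, 400]
print(np.round(grpo_weights(rewards, lengths), 4))
print(np.round(dr_grpo_weights(rewards), 4))
```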
  • Mechanize published the blog post “The upcoming GPT-3 moment for RL” about scaling RL through “replication training”: “tasking AIs with duplicating existing software products, or specific features within them”.
  • Charles Goddard’s “Extending AFM-4.5B to 64k Context Length” is a fantastic retrospective. And, as it promises, it has a stunning, surprising-it-even-works amount of soup.
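    • For the unfamiliar: a model “soup” is a weight average of several finetuned checkpoints. A minimal sketch of a uniform soup over PyTorch state dicts, purely illustrative and not Goddard’s actual recipe:

```python
import torch

def uniform_soup(state_dicts):
    """Average several checkpoints' parameters into a single state dict.
    Assumes all checkpoints share the same architecture and keys."""
    souped = {}
    for key in state_dicts[0]:
        souped[key] = torch.stack([sd[key].float() for sd in state_dicts]).mean(dim=0)
    return souped

# Hypothetical usage: average a few finetunes of the same base model.
# checkpoints = [torch.load(path, map_location="cpu") for path in checkpoint_paths]
# model.load_state_dict(uniform_soup(checkpoints))
```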
  • Goodfire put out some work on parameter decomposition-based mechanistic interpretability.
  • Anthropic open-sourced circuit-tracing tools.
  • I recently learned about Philip Isola’s work on the Platonic Representation Hypothesis in deep learning models, which seems fascinating. See also “Harnessing the Universal Geometry of Embeddings”.
  • The Claude vending machine experiment happened.
  • A long but entertaining and useful read: “the void”.
  • alphaXiv launched a Cursor for PDFs.
  • DeepMind launched AlphaGenome, a supervised DNA model. Its architecture is interesting: U-Net-like encoder and decoder with attention and “pairwise” blocks sandwiched in between. lucidrains is already on the replication.
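    • A very rough sketch of that overall shape: a strided-conv encoder, attention at the coarse resolution, and a transposed-conv decoder with U-Net skip connections. The pairwise blocks, output heads, and every real design detail of AlphaGenome are omitted, and all sizes below are made up:

```python
import torch
import torch.nn as nn

class TinyGenomeUNet(nn.Module):
    """Toy encoder-transformer-decoder sandwich over one-hot DNA."""
    def __init__(self, channels=64, depth=3, n_transformer_layers=2):
        super().__init__()
        self.stem = nn.Conv1d(4, channels, kernel_size=15, padding=7)  # A/C/G/T -> channels
        self.down = nn.ModuleList(
            [nn.Conv1d(channels, channels, kernel_size=5, stride=2, padding=2) for _ in range(depth)]
        )
        layer = nn.TransformerEncoderLayer(d_model=channels, nhead=4, batch_first=True)
        self.middle = nn.TransformerEncoder(layer, num_layers=n_transformer_layers)
        self.up = nn.ModuleList(
            [nn.ConvTranspose1d(channels, channels, kernel_size=4, stride=2, padding=1) for _ in range(depth)]
        )
        self.head = nn.Conv1d(channels, 1, kernel_size=1)  # e.g., one regression track per position

    def forward(self, x):                        # x: (batch, 4, seq_len), seq_len divisible by 2**depth
        h = self.stem(x)
        skips = []
        for down in self.down:                   # conv downsampling encoder
            skips.append(h)
            h = down(h)
        h = self.middle(h.transpose(1, 2)).transpose(1, 2)  # attention over coarse positions
        for up, skip in zip(self.up, reversed(skips)):       # upsampling decoder
            h = up(h) + skip                     # U-Net skip connection
        return self.head(h)

seq = torch.randn(1, 4, 1024)                    # fake one-hot-ish input
print(TinyGenomeUNet()(seq).shape)               # torch.Size([1, 1, 1024])
```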
  • “Self-Adapting Language Models” is a cool method. See also Sakana AI’s “Darwin Gödel Machine”.
  • The DeepSeek researcher Xingkai Yu wrote nano vLLM, a minimal implementation of vLLM in just over a thousand lines of Python. I plan to dig into the code at some point and perhaps write a blog post walking through it.
  • Research of the genre “attacks to figure out as much as you can about proprietary closed-source LLMs through all the information they expose” is super cool. Two standout pieces are “The Worst (But Only) Claude 3 Tokenizer” and “Stealing Part of a Production Language Model”.
  • FAIR’s “Corrector Sampling in Language Models” is a simple idea but gets remarkably effective results: “Fine-tuning a pretrained 8B parameter model with RPT for only 100B resulted in ~10% relative improvements on reasoning and coding benchmarks compared to the standard sampling.”
  • Pretty impressive o3-pro demonstration.
  • Leo Gao’s lessons for ML research.
  • Gwern’s “Number Search Engine via NN Embeddings”.
  • Jack Morris’ “There Are No New Ideas in AI… Only New Datasets”.
  • George Mandis’ “OpenAI Charges by the Minute, So Make the Minutes Shorter”.
  • Sean Goedecke wrote a great post on the flaws in that Apple paper: “The illusion of ‘The Illusion of Thinking’”.