Artificial Intelligence, AI, Technology, Future, Machine Learning

MIT's Recursive Language Models Improve on Long-Context Challenges (10 Million Tokens!)

Md Zakarya

Author

Breaking Long-Context Limits with Recursive Language Models

Researchers from MIT's CSAIL have introduced Recursive Language Models (RLMs), a clever way to let large language models (LLMs) tackle massive prompts without choking on their built-in context limits. This approach shines on tough benchmarks where standard LLMs falter due to "context rot"—that frustrating drop in performance as inputs balloon.

The Core Problem with Long Contexts

LLMs like GPT-5 advertise huge context windows, up to millions of tokens, but they still stumble on real-world long inputs. Performance dips because models overlook key details buried deep in the prompt, especially in "needle-in-a-haystack" searches or multi-hop queries across documents. Even frontier models score poorly on benchmarks like OOLONG's trec_coarse split, where counting how many entries fall under each label requires semantically classifying thousands of unlabeled entries.
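To make the failure mode concrete, needle-in-a-haystack probes are typically built by burying one key fact at a chosen depth inside filler text and asking the model to retrieve it; accuracy tends to fall as the haystack grows. A minimal sketch of such a probe (the needle and filler strings here are invented for illustration):

```python
def build_haystack(needle: str, filler_line: str,
                   total_lines: int, depth_frac: float) -> str:
    """Bury a single fact (the 'needle') at a fractional depth in filler text."""
    pos = int(total_lines * depth_frac)
    lines = [filler_line] * total_lines
    lines[pos] = needle
    return "\n".join(lines)

# Example: one fact hidden halfway through 10,000 lines of filler.
haystack = build_haystack(
    needle="The secret code is 4417.",
    filler_line="The sky was a uniform grey that morning.",
    total_lines=10_000,
    depth_frac=0.5,
)
```

An evaluation then prompts the model with the full haystack plus "What is the secret code?" and checks the answer, sweeping total_lines and depth_frac to chart where retrieval degrades.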

How RLMs Work in Practice

RLMs sidestep this by loading the full prompt into a Python REPL environment as a variable and handing the LLM just the query. The model then writes code to inspect chunks of that variable: peeking at samples, grepping with regex, partitioning the data, or summarizing subsets, before recursively calling itself on promising pieces. No single call sees the entire context, which keeps things efficient and avoids rot; the root model assembles the answer from REPL variables and returns it via commands like FINAL(answer).
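As a rough illustration of the recursive-decomposition pattern (a toy sketch, not the authors' implementation; see their repo for the real thing): here toy_llm stands in for an actual LLM API call, and a simple keyword count stands in for a semantic query, so the recursion itself is runnable.

```python
def toy_llm(query: str, text: str) -> int:
    """Stand-in 'model': answers by counting lines that mention the query."""
    return sum(1 for line in text.splitlines() if query in line)

def rlm(query: str, context: str, max_chunk: int = 1000) -> int:
    """Answer `query` recursively; no single call sees all of `context`."""
    lines = context.splitlines(keepends=True)
    # Base case: context is small enough (or indivisible) for one direct call.
    if len(context) <= max_chunk or len(lines) <= 1:
        return toy_llm(query, context)
    # Partition on a line boundary and recurse on each half; the root
    # aggregates the partial answers, analogous to emitting FINAL(answer).
    mid = len(lines) // 2
    left, right = "".join(lines[:mid]), "".join(lines[mid:])
    return rlm(query, left, max_chunk) + rlm(query, right, max_chunk)
```

In the real system the model itself writes the inspection code (peeks, regex greps, partitions) inside the REPL and decides where to recurse; this sketch hard-codes one fixed strategy just to show the shape of the computation.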

Standout Results on Benchmarks

On OOLONG tasks over 128k tokens, RLM(GPT-5-mini) doubled GPT-5-mini's correct answers and beat GPT-5 overall, all at lower cost per query. For BrowseComp-Plus with 1000 documents (10M+ tokens), RLM(GPT-5) hit perfect scores where baselines crashed, outpacing even ReAct agents with BM25 retrieval. It also nailed long-output challenges like LoCoDiff git histories, processing diffs programmatically where base models failed.

Why This Feels Like a Game-Changer

Alex Zhang calls it a "bitter-lesson-pilled" shift: let computation scale via recursion rather than brute-force windows. Open-source code lives on GitHub (alexzhang13/rlm), with a notebook demo showing trajectories of peeks and partitions. Future tweaks—like deeper recursion, async calls, or RL-trained models—could unlock even wilder scaling for agents and research tools.