BABILong

What It Measures

Whether a retrieval method can find specific facts ("needles") embedded within increasingly large contexts of distractor text.

Methodology

Facts are inserted at various positions within contexts of 4k, 8k, 16k, and 128k tokens. Methods must retrieve the correct fact to answer a question. Accuracy is measured per context length.

Methods Compared

neocortex_v1, directfeed

Results

Context Length

Neocortex

directfeed

33%

16k

128k

Overall

11%

Analysis

Neocortex is the only method that successfully retrieves needles, scoring 33% at the 4k context length. While absolute accuracy is still low, this demonstrates the advantage of graph-based indexing over raw context window approaches — the knowledge graph can locate specific entities even when surrounded by large volumes of distractor text. Directfeed scores 0% across all context lengths.

PreviousTemporalBench NextVending-Bench

Last updated 12 days ago

hashtagWhat It Measures

hashtagMethodology

hashtagMethods Compared

hashtagResults

hashtagAnalysis

What It Measures

Methodology

Methods Compared

Results

Analysis