# Clawd Memory BEAM Benchmark

Evaluated against the ICLR 2026 BEAM dataset (Tavakoli et al.)
Date: 2026-05-06 | Version: Mnemosyne 2.5 | Model: Gemini 2.5 Flash via OpenRouter

These results measure the underlying Mnemosyne engine used by Clawd Memory. The Clawd layer adds agent workflow, vault indexing, Solana/OODA metadata, and operational memory discipline on top of the engine.

---

## End-to-End Results (LLM-as-Judge, Rubric Scoring)

144 questions across 3 scales (48 per scale, 3 conversations each).

| Scale | Clawd engine | RAG (Llama-4) | LIGHT | Honcho | Hindsight |
|-------|--------------|---------------|-------|--------|-----------|
| 100K | 35.4% | 32.3% | 35.8% | 63.0% | 73.4% |
| 500K | 19.3% | 33.0% | 35.9% | 64.9% | 71.1% |
| 1M | 19.2% | 30.7% | 33.6% | 63.1% | 73.9% |

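Baselines are the published figures from Tavakoli et al. (ICLR 2026) and the Hindsight blog (Apr 2026). They use the same BEAM dataset and LLM-as-judge protocol, but they were not re-run locally with the same LLM as the Clawd engine, so treat the comparison as indicative.

---
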
## Per-Ability Breakdown

### 100K (35.4% overall)

| Ability | Score | Assessment |
|---------|-------|------------|
| IE (Info Extraction) | 80.5% | Strong. Extracts specific facts from conversation context |
| ABS (Abstention) | 50.0% | Identifies half of unanswerable questions |
| SUM (Summarization) | 41.7% | Moderate synthesis across conversation windows |
| CR (Contradiction) | 35.4% | Some contradiction detection |
| TR (Temporal) | 29.2% | Time-difference reasoning works occasionally |
| MR (Multi-hop) | 16.7% | Weak. Cannot connect facts across distant messages |
| KU (Knowledge Update) | 16.7% | Weak. Cannot track changing values over time |
| EO (Event Ordering) | 13.3% | Very weak. Cannot order events chronologically |
| IF (Instruction Following) | n/a | Not tested at this scale (no questions in the sampled conversations) |
| PF (Preference Following) | n/a | Not tested at this scale (no questions in the sampled conversations) |

### 500K (19.3% overall)

| Ability | Score | Assessment |
|---------|-------|------------|
| ABS (Abstention) | 83.3% | Stronger than 100K. Larger conversations may make unanswerable questions easier to spot |
| SUM (Summarization) | 25.3% | Down from 41.7% at 100K |
| KU (Knowledge Update) | 16.7% | Unchanged from 100K; still weak |
| MR (Multi-hop) | 14.6% | Slightly below 100K (16.7%); still weak |
| IE (Info Extraction) | 8.3% | Major degradation from 80.5% at 100K. Facts lost in larger contexts |
| CR (Contradiction) | 4.2% | Near zero (35.4% at 100K) |
| EO (Event Ordering) | 1.7% | Near zero |
| TR (Temporal) | 0.0% | Lost entirely (29.2% at 100K) |

### 1M (19.2% overall)

| Ability | Score | Assessment |
|---------|-------|------------|
| ABS (Abstention) | 100.0% | Anomalous. Likely a small-sample effect (6 questions, all flagged correctly) |
| MR (Multi-hop) | 16.7% | Same as 100K; still weak |
| IE (Info Extraction) | 16.7% | Down from 80.5% at 100K, up from 8.3% at 500K |
| TR (Temporal) | 16.7% | Up from 0.0% at 500K; not significant at this sample size |
| EO (Event Ordering) | 3.3% | Near zero |
| CR (Contradiction) | 0.0% | Zero |
| KU (Knowledge Update) | 0.0% | Zero (16.7% at smaller scales) |
| SUM (Summarization) | 0.0% | Lost entirely (25.3% at 500K) |
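
The overall figures in the section headings are consistent with an unweighted mean over the eight abilities that had questions, with IF and PF left out. The report does not state the aggregation rule, so the sketch below treats that rule as an assumption and only checks the arithmetic. `SCORES`, `REPORTED`, and `macro_average` are illustrative names, not part of the benchmark code.

```python
# Per-ability scores (%) copied from the three tables above.
# IF and PF are omitted because they had no questions in the sample.
SCORES = {
    "100K": {"IE": 80.5, "ABS": 50.0, "SUM": 41.7, "CR": 35.4,
             "TR": 29.2, "MR": 16.7, "KU": 16.7, "EO": 13.3},
    "500K": {"ABS": 83.3, "SUM": 25.3, "KU": 16.7, "MR": 14.6,
             "IE": 8.3, "CR": 4.2, "EO": 1.7, "TR": 0.0},
    "1M": {"ABS": 100.0, "MR": 16.7, "IE": 16.7, "TR": 16.7,
           "EO": 3.3, "CR": 0.0, "KU": 0.0, "SUM": 0.0},
}

# Overall scores (%) as reported in the section headings.
REPORTED = {"100K": 35.4, "500K": 19.3, "1M": 19.2}


def macro_average(ability_scores: dict[str, float]) -> float:
    """Unweighted mean over the abilities that were actually tested."""
    return sum(ability_scores.values()) / len(ability_scores)


for scale, abilities in SCORES.items():
    mean = macro_average(abilities)
    # Tolerance covers rounding in the published per-ability figures.
    matches = abs(mean - REPORTED[scale]) < 0.1
    print(f"{scale}: mean={mean:.2f} reported={REPORTED[scale]} match={matches}")
```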

---

## Analysis

### What Works
- Small-scale information extraction (80.5% at 100K). Mnemosyne retrieves and surfaces specific facts well when conversations are under 500 messages. At this size the full-context strategy (giving the LLM all messages) works well.
- Abstention. Scores rise with scale (50.0% → 83.3% → 100.0%), although only half of the unanswerable questions are caught at 100K and the 1M figure rests on 6 questions.

### What Doesn't Work
- Scaling beyond 500 messages. The overall score drops from 35.4% to 19.3% when moving from 100K to 500K. The retrieval fallback for large conversations (`_multi_strategy_recall`) is not surfacing relevant memories.
- Fact linking across messages. MR, EO, and KU scores are weak at all scales. These abilities require connecting information spread across distant parts of a conversation, which needs a working episodic tier.
- Episodic consolidation. The benchmark ingestion code calls `consolidate_to_episodic()`, but the episodic tier remains empty. Without episodic entries, retrieval searches only working memory, which is purged during ingestion.

### Root Cause
The episodic consolidation in the benchmark script produces zero entries, so the retrieval path is missing its primary speed and quality tier for large conversations. Fixing this is expected to improve the 500K and 1M scores substantially, pending re-evaluation.
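
One way to keep this failure from going unnoticed is to have ingestion fail loudly when consolidation creates nothing. The sketch below is a toy model under stated assumptions: `ToyMemory`, its fields, `ingest_with_guard`, and the signature given to `consolidate_to_episodic` are all hypothetical and do not reflect the actual Mnemosyne API, which this report does not show.

```python
from dataclasses import dataclass, field


@dataclass
class ToyMemory:
    """Hypothetical two-tier store for illustration; not the Mnemosyne API."""

    working: list[str] = field(default_factory=list)
    episodic: list[str] = field(default_factory=list)

    def remember(self, message: str) -> None:
        """Append a raw message to working memory."""
        self.working.append(message)

    def consolidate_to_episodic(self, batch_size: int = 2) -> int:
        """Fold working memory into episodic entries; return how many were created."""
        created = 0
        for start in range(0, len(self.working), batch_size):
            batch = self.working[start:start + batch_size]
            self.episodic.append(" | ".join(batch))
            created += 1
        self.working.clear()  # purge only after the entries are written
        return created


def ingest_with_guard(memory: ToyMemory, messages: list[str]) -> int:
    """Ingest messages, consolidate, and raise if nothing was consolidated."""
    for message in messages:
        memory.remember(message)
    created = memory.consolidate_to_episodic()
    if messages and created == 0:
        raise RuntimeError("consolidation produced zero episodic entries")
    return created


memory = ToyMemory()
created = ingest_with_guard(memory, ["msg 1", "msg 2", "msg 3"])
print(created, len(memory.episodic), len(memory.working))
```

In a real benchmark script the same check would sit right after the consolidation call, so that an empty episodic tier stops the run instead of silently lowering scores.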

### Cautions
- Sample size: 48 questions per scale, so confidence intervals are wide. The full 100-conversation evaluation is pending.
- The 100% ABS score at 1M is likely a sample artifact (6 questions, all easy to identify as unanswerable).
- IF and PF had zero questions in the sampled conversations, so neither ability is measured here.
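
To give a rough sense of how wide the intervals are, the sketch below computes a Wilson score interval for the 100K overall score (48 questions) and for the 1M abstention result (6 of 6). It treats each score as a binomial proportion, which is only an approximation because rubric scoring can award partial credit. `wilson_interval` is an illustrative helper, not part of the benchmark code. Under these assumptions the lower bound for 6 of 6 comes out at roughly 61%, so a perfect score on six questions says little about the true abstention rate.

```python
import math


def wilson_interval(successes: float, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a proportion (z=1.96 gives roughly 95% coverage)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    # Clamp to [0, 1] to absorb floating-point overshoot at the extremes.
    return max(0.0, center - half), min(1.0, center + half)


# 100K overall: 35.4% of 48 questions, treated as a binomial proportion.
low, high = wilson_interval(0.354 * 48, 48)
print(f"100K overall: {low:.1%} to {high:.1%}")

# 1M abstention: 6 of 6 questions answered correctly.
low, high = wilson_interval(6, 6)
print(f"1M ABS: {low:.1%} to {high:.1%}")
```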

---

## Next Steps

1. Fix episodic consolidation so that it produces entries during benchmark ingestion.
2. Run the full-scale evaluation (all 100 conversations, all 2,000+ questions).
3. After the episodic fix, re-evaluate 500K and 1M to measure the improvement.
4. Set up Honcho/Hindsight/RAG baselines locally for a same-LLM comparison.

