[D] MemPalace claims 100% on LoCoMo and a "perfect score on LongMemEval." Its own BENCHMARKS.md documents why neither is meaningful.

A new open-source memory project called MemPalace launched yesterday claiming "100% on LoCoMo" and "the first perfect score ever recorded on LongMemEval. 500/500 questions, every category at 100%." The launch tweet went viral reaching over 1.5 million views while the repository picked up over 7,000 GitHub stars in less than 24 hours.

The interesting thing is not that the headline numbers are inflated. The interesting thing is that the project's own BENCHMARKS.md file documents this in detail, while the launch tweet strips these caveats. Some of failure modes line up with the methodology disputes the field has been arguing about for over a year (Zep vs Mem0, Letta's "Filesystem All You Need" reproducibility post, etc.).

1. The LoCoMo 100% is a top_k bypass.

The runner uses top_k=50. LoCoMo's ten conversations have 19, 19, 32, 29, 29, 28, 31, 30, 25, and 30 sessions respectively. Every conversation has fewer than 50 sessions, so top_k=50 retrieves the entire conversation as the candidate pool every time. The Sonnet rerank then does reading comprehension over all sessions.

BENCHMARKS.md says this verbatim:

The LoCoMo 100% result with top-k=50 has a structural issue: each of the 10 conversations has 19–32 sessions, but top-k=50 exceeds that count. This means the ground-truth session is always in the candidate pool regardless of the embedding model's ranking. The Sonnet rerank is essentially doing reading comprehension over all sessions - the embedding retrieval step is bypassed entirely.

The honest LoCoMo numbers in the same file are 60.3% R@10 with no rerank and 88.9% R@10 with hybrid scoring and no LLM. Those are real and unremarkable. A 100% is also independently impossible on the published version of LoCoMo, since roughly 6.4% of the answer key contains hallucinated facts, wrong dates, and speaker attribution errors that any honest system will disagree with.

2. The LongMemEval "perfect score" is a metric category error.

Published LongMemEval is end-to-end QA: retrieve from a haystack of prior chat sessions, generate an answer, GPT-4 judge marks it correct. Every score on the published leaderboard is the percentage of generated answers judged correct.

The MemPalace LongMemEval runner does retrieval only. For each of the 500 questions it builds one document per session by concatenating only the user turns (assistant turns are not indexed at all), embeds with default ChromaDB embeddings (all-MiniLM-L6-v2), returns the top five sessions by cosine distance, and checks set membership against the gold session IDs. It computes both recall_any@5 and recall_all@5, and the project reports the softer one.

It never generates an answer. It never invokes a judge. None of the LongMemEval numbers in this repository - not the 100%, not the 98.4% "held-out", not the 96.6% raw baseline - are LongMemEval scores in the sense the published leaderboard means. They are recall_any@5 retrieval numbers on the same dataset, which is a substantially easier task. Calling any of them a "perfect score on LongMemEval" is a metric category error.

3. The 100% itself is teaching to the test.

The hybrid v4 mode that produces the 100% was built by inspecting the three remaining wrong answers in their dev set and writing targeted code for each one: a quoted-phrase boost for a question containing a specific phrase in single quotes, a person-name boost for a question about someone named Rachel, and "I still remember" / "when I was in high school" patterns for a question about a high school reunion. Three patches for three specific questions.

BENCHMARKS.md, line 461, verbatim:

This is teaching to the test. The fixes were designed around the exact failure cases, not discovered by analyzing general failure patterns.

4. Marketed features that don't exist in the code.

The launch post lists "contradiction detection catches wrong names, wrong pronouns, wrong ages before you ever see them" as a feature. mempalace/knowledge_graph.py contains zero occurrences of "contradict". The only deduplication logic is an exact-match check on (subject, predicate, object) triples that blocks identical triples from being added twice. Conflicting facts about the same subject can accumulate indefinitely.

5. "30x lossless compression" is measurably lossy in the project's own benchmarks.

The compression module mempalace/dialect.py truncates sentences at 55 characters, filters by keyword frequency, and provides a decode() function that splits the compressed string into a header dictionary without reconstructing the original text. There is no round-trip.

The same BENCHMARKS.md reports results_raw_full500.jsonl at 96.6% R@5 and results_aaak_full500.jsonl at 84.2% R@5 — a 12.4 percentage point drop on the same dataset and the same metric, run by the project itself. Lossless compression cannot cause a measured quality drop.

Why this matters for the benchmark conversation.

The field needs benchmarks where judge reliability is adversarially validated, and evaluation pipelines are standardized or fully disclosed. Until then, "100% on LoCoMo" headlines are going to keep going viral, and the BENCHMARKS.md files that document the caveats are going to keep being read by approximately nobody. What's unusual about MemPalace is not any individual failure modes. It's that one repository contains so many of them at once, in a launch with viral reach, while the project's own internal documentation honestly discloses most of the issues that the launch communication strips.

Two other independent technical critiques landed in the first 24-hours: a README-versus-code teardown in issue #27, and another (Chinese language) #30.

Disclosure: We work on our own memory systems. All citations are open and verifiable against the linked repo.

Note: Links omitted for Reddit's spam filters. Find the full article, the BENCHMARKS.md citations, the Penfield LoCoMo audit, and the cited Zep / Mem0 / Letta posts in the first comment.

submitted by /u/PenfieldLabs
[link] [comments]

[D] MemPalace claims 100% on LoCoMo and a "perfect score on LongMemEval." Its own BENCHMARKS.md documents why neither is meaningful.

Want to read more?

Tagged with