A Tool That Caught Its Own Blind Spot
Nautilus Compass, developed by Chunxiao Wang and released on GitHub under an MIT licence, is designed to solve a well-documented problem in long-running AI agent sessions: behavioural drift. Over extended interactions, large language model (LLM) agents such as those powered by Claude or GPT-4 can forget user-specified constraints, revert to previously corrected mistakes, or fabricate agreements that were never made.
Version 1.0.0 shipped as the project's first public stable release. Within five hours, the developers encountered a failure in their own workflow that exposed a structural gap in how the tool surfaced recalled information.
A Claude Code agent tasked with publishing a long-form article through a six-step quality pipeline — documented in a cross-session memory file — correctly retrieved the relevant file during recall. The file title and an 80-character description appeared in the agent's context. The agent then proceeded to skip every step in the pipeline, generating ad-hoc scripts and bypassing the mandatory review process the file had been specifically written to enforce.
The agent had seen the index. It had not read the body.
The Three-Layer Fix
Version 1.1.0 addresses this in three ways. First, the top three recall hits now embed the first 800 characters of the file body directly into the agent's working context, formatted in a visible block alongside the title and description. Lower-ranked hits remain header-only to keep response size manageable.
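The article does not publish Compass's rendering code, but the rule it describes is simple to sketch: the top three hits carry an 800-character body excerpt, everything below stays header-only. A minimal illustration, with hypothetical names (the `RecallHit` dataclass and constants are this sketch's, not the project's API):

```python
from dataclasses import dataclass

BODY_PREVIEW_CHARS = 800  # mirrors the 800-character excerpt described in the release
FULL_BODY_TOP_K = 3       # only the top three hits embed a body excerpt

@dataclass
class RecallHit:
    title: str
    description: str  # the short (~80-character) summary line
    body: str

def format_hits(hits: list[RecallHit]) -> str:
    """Render recall hits: top-K with a body excerpt, the rest header-only."""
    blocks = []
    for rank, hit in enumerate(hits):
        block = f"## {hit.title}\n{hit.description}"
        if rank < FULL_BODY_TOP_K:
            # Visible block with the opening slice of the file body,
            # so the agent consumes content, not just the index entry.
            block += f"\n---\n{hit.body[:BODY_PREVIEW_CHARS]}"
        blocks.append(block)
    return "\n\n".join(blocks)
```

Truncating lower-ranked hits to headers keeps the injected context bounded even when recall returns many files.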
Second, Compass's drift detector — which matches current prompts against a library of 35 behavioural anti-anchors learned from past mistakes — previously issued alerts that named the matched anchor without showing its content. Those alerts are now enriched with body text from the most relevant past lesson session, drawn through a two-tier matching process.
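The matcher's internals are not specified in the article, so the following is only a toy rendering of the two tiers: first match the prompt against the anchor library, then pull an excerpt from the most relevant past lesson. A bag-of-words cosine stands in for the real embedding model, and all names here are illustrative:

```python
import math
from collections import Counter

def _vec(text: str) -> Counter:
    # Stand-in for a dense embedding: a simple bag-of-words vector.
    return Counter(text.lower().split())

def _cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def drift_alert(prompt: str, anchors: dict[str, str],
                lessons: list[str], threshold: float = 0.3):
    """Tier 1: match the prompt to the closest anti-anchor.
    Tier 2: enrich the alert with the body of the most relevant past lesson."""
    pv = _vec(prompt)
    name, score = max(
        ((n, _cosine(pv, _vec(text))) for n, text in anchors.items()),
        key=lambda pair: pair[1],
    )
    if score < threshold:
        return None  # no drift pattern matched
    # Tier 2: rank lessons against the matched anchor's text, not the raw prompt.
    lesson = max(lessons, key=lambda l: _cosine(_vec(anchors[name]), _vec(l)))
    return {"anchor": name, "score": score, "lesson_excerpt": lesson[:200]}
```

The point of the enrichment step is that the alert now carries the lesson's actual text, rather than only naming the anchor that fired.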
Third, the update introduces detection for a pattern the team calls "recall acknowledgement without consumption": cases where an agent references a recalled file in its output but does not subsequently act on its contents.
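The article does not describe how this detector works internally. One plausible heuristic, sketched here with hypothetical names, is to check whether the agent's output cites the recalled file while none of the file's mandated steps appear in the agent's subsequent actions:

```python
def ack_without_consumption(agent_output: str, file_title: str,
                            required_steps: list[str],
                            actions: list[str]) -> bool:
    """Flag the failure mode where an agent cites a recalled file
    but none of that file's mandated steps show up in its actions."""
    acknowledged = file_title.lower() in agent_output.lower()
    action_log = " ".join(actions).lower()
    consumed = any(step.lower() in action_log for step in required_steps)
    return acknowledged and not consumed
```

Under this heuristic, the failure from the 1.0.0 incident (citing the pipeline file, then generating ad-hoc scripts) would trip the flag, while a run that actually executes a listed step would not.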
Black-Box Design and Benchmark Performance
According to the accompanying arXiv paper (arXiv:2605.09863), Compass takes a deliberately constrained architectural approach. Unlike memory systems such as Mem0, Letta, Cognee, Zep, MemOS, and smrti — all of which call an LLM at index time to extract structured facts or build knowledge graphs — Compass embeds raw conversation text directly using BGE-m3 embeddings and computes cosine similarity at query time.
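The no-extraction pipeline reduces to embed-and-rank: embed raw chunks once, embed the query at recall time, sort by cosine similarity. A dependency-free sketch, using a bag-of-words vector where the real system would compute BGE-m3 embeddings (typically via a package such as FlagEmbedding):

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Stand-in for a BGE-m3 dense embedding; a bag-of-words vector
    # keeps this sketch runnable without the model.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def recall(query: str, corpus: list[str], k: int = 3) -> list[str]:
    """Rank raw conversation chunks by similarity to the query.
    No LLM call at index time: this is the no-extraction design."""
    qv = embed(query)
    ranked = sorted(corpus, key=lambda chunk: cosine(qv, embed(chunk)),
                    reverse=True)
    return ranked[:k]
```

Because nothing model-specific happens at index time, the same pipeline works against any closed-API model; that is the property the paper trades accuracy for.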
This black-box design means it can operate with closed-API models whose weights are unavailable. The trade-off is measurable: the system scores 56.6% on LongMemEval-S v0.8, more than 30 percentage points below recent white-box leaders that score above 90%. The authors describe this gap as "the architectural ceiling of the no-extraction design" and disclose it openly.
On drift detection, Compass achieves a ROC AUC of 0.83 on a held-out test set built from real Claude Code session traces. On EverMemBench-Dynamic (n=500), it scores 44.4%, which the authors report exceeds all four published baselines in that benchmark's Table 4.
The end-to-end reproduction cost is cited at approximately $3.50, roughly one-fourteenth the cost of GPT-4o-judged evaluation stacks.
Availability
The tool ships as a Claude Code plugin and as an MCP A2A server compatible with Cursor, Cline, and Hermes; a CLI and a REST API are also available. All code, anchors, frozen test data, and audit-log tooling are MIT-licensed and publicly available at github.com/chunxiaoxx/nautilus-compass.