โ† Back to Blog

How We Built a Three-Tier Memory System for a Mobile AI Agent

The hardest part of building a persistent AI agent isn't the agent loop; it's memory. An LLM has no state between calls. Every turn starts from zero unless you explicitly hand it context. On a phone, you can't just dump everything into the context window: tokens cost money, latency matters, and the window has a hard ceiling anyway.

We needed a memory system that was fast to read, cheap to query, and smart enough to surface the right things at the right time. After a few failed attempts, we landed on three tiers. Here's how each one works and why we made the tradeoffs we did.

The problem with a single store

Our first attempt was a flat key-value store. The agent could call memory_store(key, value) and memory_recall(key). It worked for simple things ("remember my name is Alex") but fell apart immediately for anything richer. The agent had no way to find related memories without knowing the exact key. It couldn't search. It couldn't reason about what it already knew before starting a research task.

We also noticed a temporal problem. Something the agent learned five minutes ago is very different from something it learned three weeks ago. Recency matters. A flat store treats everything the same.

The three tiers

Tier 1: Working memory

The current conversation context. This is just the message history passed to the model on each turn. It's ephemeral by design: it exists for the duration of a session and nothing more. The agent loop manages it automatically; there's no explicit API for it.

We keep working memory lean. Long conversations get summarized rather than truncated, so the model always has a coherent picture of what's happened in the current session without blowing the context window.
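
A minimal sketch of the summarize-don't-truncate idea. The function names, the token budget, and the character-based token estimate are assumptions for illustration, not our actual implementation:

```python
# Hypothetical working-memory compaction: when the history grows past a
# token budget, collapse older messages into one summary message instead
# of dropping them. `summarize` stands in for a model call.

TOKEN_BUDGET = 4000
KEEP_RECENT = 6  # always keep the last few turns verbatim

def estimate_tokens(messages):
    # Crude stand-in: roughly 4 characters per token.
    return sum(len(m["content"]) for m in messages) // 4

def compact(messages, summarize):
    """summarize(list[dict]) -> str is assumed to call the model."""
    if estimate_tokens(messages) <= TOKEN_BUDGET:
        return messages
    old, recent = messages[:-KEEP_RECENT], messages[-KEEP_RECENT:]
    summary = {"role": "system",
               "content": "Summary of earlier conversation: " + summarize(old)}
    return [summary] + recent
```

The key property is that the model always sees one coherent narrative: a single summary of everything old, plus the recent turns verbatim.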

Tier 2: Daily memory

A rolling log of events, findings, and facts from the current day. When the agent completes a meaningful task (finishes a research run, executes a script, scrapes a page), it writes a structured entry to daily memory. Entries include a timestamp, a summary, and any key facts extracted.

Daily memory is injected into the system prompt at the start of each session. It's how the agent knows "this morning I already looked up X" without you having to tell it. At the end of each day, important entries are promoted to long-term memory and the daily log is cleared.
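
As a rough sketch, a daily entry and the session-start prompt injection might look like this. The field names and the rendered format are illustrative, not the shipped schema:

```python
# Hypothetical shape of a daily-memory entry and how the log could be
# rendered into the system prompt at session start.
import time
from dataclasses import dataclass, field

@dataclass
class DailyEntry:
    summary: str          # compressed to two or three sentences
    facts: list           # key facts extracted from the task
    timestamp: float = field(default_factory=time.time)

def render_for_prompt(daily_log):
    """Render the day's log as a compact block for the system prompt."""
    lines = ["Earlier today you:"]
    for e in daily_log:
        stamp = time.strftime("%H:%M", time.localtime(e.timestamp))
        lines.append(f"- [{stamp}] {e.summary}")
    return "\n".join(lines)
```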

Tier 3: Long-term semantic memory

A vector embedding store. When the agent writes to long-term memory, the text is embedded and stored alongside the raw content. Retrieval is semantic: the agent calls semantic_recall_facts(query) and gets back the most relevant entries by cosine similarity, not keyword match.

This is where durable facts live: user preferences, project context, research findings worth keeping, things the user has explicitly asked the agent to remember. It persists across days, weeks, and app restarts.
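
In sketch form, assuming embeddings are plain lists of floats, recall reduces to a dot product over L2-normalised vectors. This mirrors semantic_recall_facts in spirit only; the real storage and embedding details are simplified away:

```python
# Toy semantic store: keep (vector, text) pairs, return the top-k entries
# by cosine similarity. Because vectors are L2-normalised at write time,
# cosine similarity is just a dot product.
import math

store = []  # list of (vector, text)

def normalise(v):
    n = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / n for x in v]

def remember(text, embed):
    store.append((normalise(embed(text)), text))

def semantic_recall_facts(query, embed, k=3):
    q = normalise(embed(query))
    scored = [(sum(a * b for a, b in zip(q, v)), text) for v, text in store]
    scored.sort(key=lambda s: s[0], reverse=True)
    return [text for _, text in scored[:k]]
```

The point of the semantic path is exactly what the post describes: a query like "what does the user prefer?" can surface "Alex likes dark mode" even though no keywords overlap.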

Teaching the agent to use memory correctly

Having the infrastructure isn't enough. An LLM won't use memory tools unless you explicitly tell it to, and even then it'll forget to do so half the time. We added two hard rules to the system prompt:

1. Before starting a research or multi-step task, recall from memory first and check what is already known.
2. After completing a meaningful task, write the findings to memory before responding.

These two rules alone cut redundant API calls significantly in our testing. The agent stopped re-researching things it had already found and started building on prior work instead of starting from scratch every time.

The embedding problem on Android

Running a full embedding model on-device is expensive, both in APK size and inference time. We needed something that worked offline but was also accurate when online. We ended up with a two-path approach.

The primary path calls your configured provider's embedding API (default: text-embedding-3-small via OpenAI). When the remote model is unreachable (no key, no network, rate-limited), the system falls back to a deterministic 256-dimensional character-trigram vector computed entirely on-device. Both paths produce L2-normalised vectors. The system tracks which facts were embedded locally versus remotely and upgrades local embeddings to remote ones lazily when the network comes back.
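
The offline path can be sketched as a hashed character-trigram embedding. The exact hashing scheme below is an assumption; the description above only pins down the dimensionality (256), the trigram features, determinism, and L2 normalisation:

```python
# Sketch of a deterministic 256-dim character-trigram embedding. Each
# trigram is hashed into one of 256 buckets; the count vector is then
# L2-normalised so it is comparable to remote embeddings.
import hashlib
import math

DIM = 256

def local_embed(text):
    vec = [0.0] * DIM
    padded = f"  {text.lower()}  "  # pad so word boundaries form trigrams
    for i in range(len(padded) - 2):
        tri = padded[i:i + 3]
        # A cryptographic digest gives a stable hash across runs and
        # platforms, unlike Python's randomized built-in hash().
        h = int(hashlib.md5(tri.encode("utf-8")).hexdigest(), 16)
        vec[h % DIM] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]
```

A representation like this captures surface similarity (shared substrings) rather than meaning, which is why it is only a stopgap until the remote model can re-embed the fact.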

The practical implication: embedding writes do cost tokens when you're online. The cost is small (a few hundred tokens per memory entry) but it's real. The offline fallback means the app keeps working without a connection; it just uses a less semantically rich representation until it can sync.

What we got wrong during development

The first version of daily memory was too verbose. The agent would write multi-paragraph entries for simple tasks, and by mid-session the daily log was too long to inject into the system prompt without eating the context window. We added a summarization step: entries are compressed to two or three sentences before being stored, with the full content available on explicit recall.

We also underestimated how much the agent would write to long-term memory if left unconstrained. In development, stores accumulated thousands of entries quickly, most of them low-value. We added a relevance threshold (entries below a certain confidence score don't get promoted from daily to long-term), and that cleaned things up considerably.
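
The promotion gate is simple in sketch form. The threshold value and the (score, text) entry shape are assumptions for illustration:

```python
# Hypothetical daily-to-long-term promotion gate: only entries at or
# above the confidence threshold are handed to the long-term store.
PROMOTION_THRESHOLD = 0.6  # assumed default, not a shipped value

def promote_day(entries, promote_fn, threshold=PROMOTION_THRESHOLD):
    """entries: list of (score, text). Returns how many were promoted."""
    promoted = 0
    for score, text in entries:
        if score >= threshold:
            promote_fn(text)
            promoted += 1
    return promoted
```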

Where it stands

The three-tier system is the architecture we shipped with. The agent builds up context over time, remembers project details and user preferences, and can find relevant memories even when the query doesn't match the exact wording of what was stored. The semantic search is meaningfully better than keyword matching for the kinds of things people actually ask about.

It's not perfect; no memory system is. We expect to learn a lot from real usage about what the right defaults are, how often the daily log fills up, and whether the relevance threshold for long-term promotion is calibrated correctly. But the architecture is sound and the tradeoffs are deliberate.

— The Forge OS Team