Running AI at the Edge: Caching Strategies for Raspberry Pi 5 AI HAT+ Inference
2026-01-21
10 min read

Practical caching for Raspberry Pi 5 + AI HAT+: use local LRU, Redis, and sqlite to slash inference latency and costs at the edge.


If you're running generative AI on a Raspberry Pi 5 with the new AI HAT+, your biggest bottlenecks aren't model size or the NPU; they're latency, bandwidth, and unpredictable costs caused by repeated inference for the same prompts. This guide gives pragmatic, production-ready caching patterns (in-memory, on-disk, and hybrid) to cut latency, reduce egress, and simplify CI/CD cache invalidation.

Why caching matters for Pi5 + AI HAT+ in 2026

In late 2025 and early 2026 the Raspberry Pi 5 + AI HAT+ platform matured into a practical edge inference node. NPUs and optimized runtimes make on-device LLMs usable, but two realities remain:

  • Many requests are repetitive (same prompts, similar contexts) — perfect for caching.
  • Edge nodes face tight RAM and storage trade-offs; naive caching either consumes too much memory or wears out flash.

The practical conclusion: a tailored caching stack, combining a hot in-process LRU, a local or shared Redis for warm items, and a compact on-disk store (sqlite or a disk-backed KV) for cold persistence, gives the best cost/latency balance on the Pi 5.

High-level architecture patterns

Pick a pattern that matches scale, consistency, and persistence needs:

1) Single-node, low-latency (simple)

  • Local in-process LRU for hot responses (Python's cachetools, Node lru-cache).
  • Optional sqlite-backed cache for persistence across reboots (see DB migration notes).
  • Best for single Pi deployments with minimal synchronization needs.

2) Two-level stack: local LRU + Redis + sqlite (recommended)

  • Level 1: process-local LRU (O(1) ops, lowest latency).
  • Level 2: local Redis instance running on Pi (shared among worker processes) with eviction set to volatile-lru or allkeys-lru.
  • Level 3: sqlite or file-based cache as cold storage with TTL for long tail items.
  • This gives hot data fast, warm data shared, and cold data persistent.

3) Edge cluster with CDN (advanced)

  • Local cache + Redis per Pi; periodic sync to a central cache or use of upstream CDN / hybrid regional cache for extremely common responses.
  • Use surrogate keys and versioning to invalidate across nodes and CDNs.

What to cache for generative AI

Not everything should be cached. Pick these safe, high-impact items:

  • Full-response cache: Cache final responses for canonical prompts and deterministic generation (temperature=0 or fixed seed).
  • Completion candidates: When using sampling, cache beam-search or n-best results for reuse in UI components.
  • Embeddings: Vector results are stable for the same input; store them for search and retrieval.
  • Authorization/metadata: Rate-limit counters, per-user preferences, prompt templates (fast lookup).

Avoid caching:

  • Non-deterministic outputs unless you also store generation seeds and parameters with the key.
  • Large intermediate tensors — better to recompute or use model-specific checkpoints.

Cache key design: make keys deterministic and forward-compatible

Good keys are critical. Example canonical key (sha256 hex):

sha256(model|model_version|tokenizer_version|system_prompt|user_prompt|temperature|top_p|max_tokens)

Tips:

  • Include model and tokenizer versions; if weights or quantization change, cache must invalidate.
  • Include deterministic flags (temperature, seed). Only cache when deterministic or when you store the seed.
  • Use content-addressable hashing for large contexts to keep key length small.
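
A minimal Python sketch of building such a key (field order mirrors the formula above; the helper name and "|" separator convention are illustrative assumptions):

import hashlib

def cache_key(model, model_version, tokenizer_version, system_prompt,
              user_prompt, temperature, top_p, max_tokens):
    # Join fields in a fixed order with "|", then hash. Assumes "|" never
    # needs escaping inside prompts; switch to JSON serialization if it might.
    parts = [model, model_version, tokenizer_version, system_prompt,
             user_prompt, str(temperature), str(top_p), str(max_tokens)]
    return hashlib.sha256("|".join(parts).encode("utf-8")).hexdigest()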

In-memory caches: local LRU vs Redis vs Memcached

On Pi5, you have options. Choose based on concurrency, persistence, and cross-process sharing:

Local LRU (in-process)

Pros: fastest (no IPC), minimal deps, deterministic eviction. Cons: not shared across processes, limited size per process.

Python example (cachetools):

from cachetools import LRUCache
cache = LRUCache(maxsize=4096)
# store: cache[key] = (value, expiry_ts)
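
If you prefer time-based expiry over storing (value, expiry_ts) tuples yourself, cachetools also provides TTLCache. A minimal sketch (the 15-minute TTL is an arbitrary assumption):

from cachetools import TTLCache

cache = TTLCache(maxsize=4096, ttl=900)  # entries expire 900s after insertion

def get_or_generate(key, generate):
    try:
        return cache[key]        # hit: expired entries raise KeyError
    except KeyError:
        value = generate()       # miss: run inference
        cache[key] = value
        return value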

Node example (lru-cache):

const LRU = require('lru-cache')
const cache = new LRU({ max: 5000 })

Redis (local instance)

Pros: shared across worker processes, robust metrics, persistence optional. Cons: memory pressure, slightly higher latency than in-process.

Run Redis on a Pi5 with these practical configs to reduce I/O and improve eviction behavior:

# redis.conf excerpts
maxmemory 1gb
maxmemory-policy allkeys-lru
save ""
appendonly no
timeout 0

On Pi, put Redis data in tmpfs (if you accept non-persistent cache) to reduce flash wear:

sudo mount -t tmpfs -o size=1024M tmpfs /var/lib/redis

Systemd unit tip: set LimitNOFILE and CPU affinity so Redis doesn't compete with the model runtime. In many edge stacks this Redis layer integrates cleanly with edge orchestration.

Memcached

Memcached is lighter and fast for simple key/value but lacks persistence and the rich data structures of Redis. Useful if memory is ample and you need pure speed.

On-disk caches: sqlite and file-based KV

On-disk caches are for cold items you want to survive reboots. Use sqlite for compactness and transactional safety. Key points:

  • Use WAL mode, tuned pragmas to reduce fsyncs and improve throughput.
  • Store compressed blobs (gzip/zstd) for long responses.
  • Use an index on expiry to expedite pruning.

Sqlite schema and pragmas (practical)

PRAGMA journal_mode=WAL;
PRAGMA synchronous=NORMAL;
PRAGMA temp_store=MEMORY;
PRAGMA page_size=4096;

CREATE TABLE IF NOT EXISTS cache (
  key TEXT PRIMARY KEY,
  value BLOB,
  expires_at INTEGER
);
CREATE INDEX IF NOT EXISTS idx_expires_at ON cache(expires_at);

Insert with UPSERT and expiry:

INSERT INTO cache(key, value, expires_at) VALUES(?, ?, ?)
ON CONFLICT(key) DO UPDATE SET value=excluded.value, expires_at=excluded.expires_at;

Garbage collection: run a background job that periodically deletes expired rows to keep the DB small. For schema and zero-downtime rotation patterns, see live schema and migration patterns.
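
A minimal Python sketch of the upsert and the pruning job (assumes the cache table and index from the schema above already exist; the DB path and one-hour TTL are illustrative):

import sqlite3, time, zlib

conn = sqlite3.connect("/var/cache/ai/cache.db")  # path is an assumption
conn.execute("PRAGMA journal_mode=WAL")
conn.execute("PRAGMA synchronous=NORMAL")

def put(key, text, ttl=3600):
    blob = zlib.compress(text.encode("utf-8"))  # compress long responses
    conn.execute(
        "INSERT INTO cache(key, value, expires_at) VALUES(?, ?, ?) "
        "ON CONFLICT(key) DO UPDATE SET value=excluded.value, expires_at=excluded.expires_at",
        (key, blob, int(time.time()) + ttl),
    )
    conn.commit()

def prune_expired():
    # background GC: delete rows whose TTL has passed
    conn.execute("DELETE FROM cache WHERE expires_at < ?", (int(time.time()),))
    conn.commit()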

Two-level cache: full example flow

Implementing the two-level pattern (local LRU -> Redis -> sqlite) gives low-latency reads and low-cost persistence. Example flow on a request (a Python sketch follows below):

  1. Compute key hash for (prompt + params + model_version).
  2. Check process-local LRU. If hit and not expired -> return.
  3. Else check Redis. If hit -> populate local LRU and return.
  4. Else check sqlite (cold). If hit -> push to Redis (with TTL), push to LRU, and return.
  5. Else perform inference. Store result in LRU, Redis, and sqlite (if suitable).

This flow minimizes repeated heavy inferences and keeps hot items ultra-fast. The recommended pattern is the two-level stack paired with edge orchestration for coordinated invalidation.
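
A condensed Python sketch of that read path (the sqlite_get/sqlite_put helpers and run_inference callable are assumptions standing in for your own code):

import json
import redis
from cachetools import LRUCache

local = LRUCache(maxsize=2048)
r = redis.Redis(host="localhost", port=6379)

def get_completion(key, run_inference, redis_ttl=3600):
    if key in local:                     # 1-2. hot: process-local LRU
        return local[key]
    cached = r.get(key)                  # 3. warm: local Redis
    if cached is not None:
        value = json.loads(cached)
        local[key] = value
        return value
    value = sqlite_get(key)              # 4. cold: sqlite store (helper assumed)
    if value is None:
        value = run_inference()          # 5. miss everywhere: run the model
        sqlite_put(key, value)           # persist if suitable (helper assumed)
    r.set(key, json.dumps(value), ex=redis_ttl)
    local[key] = value
    return value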

Example: Redis configuration and eviction strategy

Recommended settings for a Pi 5 where the model runtime already consumes most of the available RAM (for example, several GB on an 8GB SKU):

# redis.conf
maxmemory 512mb
maxmemory-policy allkeys-lru
maxclients 100
tcp-keepalive 60
lazyfree-lazy-eviction yes
lazyfree-lazy-expire yes

Rationale: allkeys-lru keeps the hottest keys cached; the lazyfree options move the freeing of evicted and expired objects to a background thread, reducing latency spikes during eviction.

Vector/embedding caches

Embeddings are small and highly reusable. Store them in Redis (as raw vector bytes or in a HASH) or in sqlite; a full FAISS index is usually too heavy for a Pi. For quick nearest-neighbor lookups you can:

  • Keep embeddings in Redis and use a Redis module with vector-similarity support (for example RediSearch) if available.
  • For true offline similarity, store embeddings in sqlite and use an in-process ANN like hnswlib loaded on startup.

On Pi5, memory is valuable — keep only vectors for the active working set in RAM and cold-store the rest.
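
A minimal sketch of storing embeddings as raw float32 bytes in Redis (the "emb:" key prefix and one-day TTL are assumptions; swap in your own embedding model):

import numpy as np
import redis

r = redis.Redis()

def put_embedding(text_key, vector, ttl=86400):
    # pack the float32 vector into bytes (~1.5 KB for a 384-dim embedding)
    r.set(f"emb:{text_key}", np.asarray(vector, dtype=np.float32).tobytes(), ex=ttl)

def get_embedding(text_key):
    raw = r.get(f"emb:{text_key}")
    return None if raw is None else np.frombuffer(raw, dtype=np.float32)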

Cache invalidation, CI/CD and reproducibility

In production you need robust invalidation aligned with deployment and model updates:

  • Versioned keys: Always include model_version and tokenizer_version in keys. Deploys that change either implicitly invalidate stale caches.
  • Surrogate keys: Add tags to cached items (e.g., "templates:v2"). When template changes, scan and evict by tag (Redis supports sets of keys per tag).
  • CI hooks: On model rollout, trigger a cache purge script (Redis FLUSHDB or selective delete) from your pipeline. For sqlite, rotate DB file atomically. Tie this to your CI/CD and migration playbooks (live schema updates).
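
A minimal sketch of tag-based (surrogate key) eviction using plain Redis sets; the "tag:" naming scheme is an assumption:

import redis

r = redis.Redis()

def cache_with_tags(key, value, tags, ttl=3600):
    r.set(key, value, ex=ttl)
    for tag in tags:                     # e.g. "templates:v2"
        r.sadd(f"tag:{tag}", key)        # remember which keys carry this tag

def evict_tag(tag):
    keys = r.smembers(f"tag:{tag}")      # all keys tagged with this surrogate key
    if keys:
        r.delete(*keys)
    r.delete(f"tag:{tag}")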

Observability and metrics

Measure and monitor these metrics on your Pi5 cluster:

  • Cache hit ratio (per-key and overall) — aim for >=70% on stable workloads.
  • Redis INFO metrics: used_memory, evicted_keys, keyspace_hits/keyspace_misses.
  • Local LRU stats: hits, misses, evictions (instrument your LRU wrapper).
  • Sqlite average read/write latency and DB size.

Export metrics to Prometheus via redis_exporter plus a small custom exporter for sqlite and in-process caches. Use Grafana dashboards to link cache hit ratios to observed inference latency improvements.
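
A minimal sketch of exposing hit/miss counters from the in-process cache with prometheus_client (metric names and the scrape port are assumptions):

from prometheus_client import Counter, start_http_server

CACHE_HITS = Counter("pi5_cache_hits_total", "Cache hits", ["level"])
CACHE_MISSES = Counter("pi5_cache_misses_total", "Cache misses", ["level"])

def record(level, hit):
    # call from your LRU/Redis/sqlite wrappers, e.g. record("lru", True)
    (CACHE_HITS if hit else CACHE_MISSES).labels(level=level).inc()

start_http_server(9101)  # scrape target for Prometheus alongside redis_exporter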

Benchmarks — example results (typical)

From internal edge tests on Raspberry Pi 5 + AI HAT+ (quantized 2.7B model, late-2025 runtimes):

  • Cold inference: 1.8–3.0s median for a single-turn completion (depends on model and CPU/NPU scheduling).
  • Warm: Redis hit -> 12–40ms (serialization + network loopback on localhost).
  • Hot: local LRU hit -> 1–5ms (memory access & deserialization).

These are example numbers to show orders of magnitude gains: caching typical prompts moves you from seconds to tens of milliseconds. See broader discussion about edge AI platform-level trade-offs that drive these choices.

Storage and durability trade-offs

Decide based on workload:

  • Ephemeral caches on tmpfs: fastest, reduces flash wear, but not persistent across reboots.
  • Sqlite cold store on SD or SSD: persistent but watch for write amplification; use compression and batched writes.
  • Redis AOF: avoid on Pi unless you need full persistence; snapshotting and tmpfs + periodic cold-dump can be a balanced option.

Advanced strategies

1) Partial-token caching (prefix caching)

Cache early token sequences to accelerate streaming generation for common prefixes. Use careful keying by prefix length and context.
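
One way to key prefix caches is to hash fixed-size token blocks; a sketch under the assumption that prefixes are cached at 64-token boundaries:

import hashlib

BLOCK = 64  # cache a prefix entry every 64 tokens (assumption)

def prefix_keys(model_version, token_ids):
    """Yield (key, prefix_length) pairs for each cacheable prefix boundary."""
    for end in range(BLOCK, len(token_ids) + 1, BLOCK):
        digest = hashlib.sha256(
            (model_version + "|" + ",".join(map(str, token_ids[:end]))).encode()
        ).hexdigest()
        yield digest, end

At request time, look up the longest matching prefix and resume generation from there instead of regenerating the whole sequence.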

2) Response stitching & hybrid sampling

Store deterministic parts of a response (system and template rendering) and only sample or synthesize small stochastic pieces at request time to avoid full re-generation.

3) Cache warming and scheduled prefetch

During low-load windows, precompute common prompts and batch-insert into Redis and sqlite to reduce cold-start latency during peak hours.
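
A minimal warming loop (build_key, run_model, and get_completion are assumed helpers carried over from the earlier sketches; the prompt list is illustrative):

import time

COMMON_PROMPTS = ["summarize today's sensor log", "list open alerts"]  # illustrative

def warm_cache(build_key, run_model, get_completion):
    # run from a nightly cron job or systemd timer during low-load windows
    for prompt in COMMON_PROMPTS:
        key = build_key(prompt)
        # get_completion fills the LRU, Redis, and sqlite tiers on a miss
        get_completion(key, lambda: run_model(prompt))
        time.sleep(0.5)  # pace warming so it never starves live traffic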

4) Use zram and cgroups

To avoid OOM, configure cgroups to limit the model runtime and Redis memory. Use zram for swap to reduce SD wear if memory pressure is occasional. For regional coordination and cost/latency balance see hybrid edge–regional hosting strategies.

Security and privacy

Because caches often store user-generated text and embeddings, consider:

  • Encryption at rest for sqlite blobs (application-level encryption) if storing PII.
  • Access controls: limit Redis to localhost or a private network; enable AUTH for remote access.
  • TTL caps: do not cache private data beyond policy limits; implement per-user privacy flags in keys. For regulatory or compliance needs, review regulation and compliance guidance.

What changed in 2025–2026

Recent developments make caching even more valuable:

  • Model quantization (INT8/INT4) and NPU runtime improvements lower inference costs, but caching multiplies these gains by avoiding repeated inference.
  • Standardization of local model formats (GGUF and improved runtimes) means more stable model_versioning — easier cache invalidation.
  • Edge orchestration tools matured, enabling coordinated cache invalidation and metrics collection across Pi fleets (see orchestration and edge ops playbook).

Tip: Treat caching as an architecture decision first: pick keys and invalidation policies before adding caches. Good design avoids stale results and brittle rollouts.

Quick checklist to deploy on a Raspberry Pi 5 + AI HAT+

  • Choose two-level stack: in-process LRU + local Redis + sqlite for cold persistence (recommended).
  • Standardize cache key format and include model/tokenizer versions.
  • Configure Redis with maxmemory and allkeys-lru; consider tmpfs if you accept ephemerality.
  • Configure sqlite with WAL, memory temp_store, and periodic GC.
  • Instrument cache hits/misses and link to latency metrics in Grafana.
  • Implement CI hooks to flush or rotate caches during model or template updates (see migration/CI patterns).

Final recommendations

On the Raspberry Pi 5 with AI HAT+, caching is the most cost-effective lever you have to cut latency and bandwidth. For most use cases, a hybrid approach (local LRU -> Redis -> sqlite) delivers consistent, measurable improvements while keeping the deployment simple and maintainable. Start with a conservative LRU size and a Redis instance tuned for low memory, then expand caching scopes as you measure hit ratios and latency gains.

Call to action

If you’re deploying generative AI at the edge, start by implementing the two-level cache pattern in a staging environment and gather hit/miss metrics for one week. Need a starter kit? Download our Pi5 cache-config repo (pre-configured redis.conf, sqlite schema, and example code for Python/Node) and a Prometheus dashboard template to get real data fast—then follow the edge AI playbook for platform-level testing.

Related Topics

#edge-ai #redis #raspberry-pi #caching
