Client-Side Model Caching: Storing and Invalidating Small Models in Browsers like Puma
Stop re-downloading models: practical, production-ready patterns for storing, verifying, and evicting small on-device ML models and prompts in browsers like Puma (2026).
If your app ships local AI in the browser, you already know the pain: slow cold starts, unpredictable storage quotas, and stale models after a CI deploy. Developers building on Puma and other local-AI browsers need a concise, operational playbook for storing, validating, and evicting small ML models and prompts on-device. This article gives you that playbook: checksums, secure storage, eviction heuristics, observability, and CI/CD-safe invalidation strategies designed for 2026 realities (WebGPU inference, 4-bit quantization, and tighter mobile quotas).
Why this matters in 2026
On-device inference moved from novelty to production in 2024–2025 thanks to lightweight quantized models and improved Web runtimes. By late 2025 the mainstream browsers, plus privacy-focused projects like Puma, began shipping optimized runtimes that run models locally using WebAssembly and WebGPU. That reduces latency and cost — but it shifts responsibility to developers to manage local storage, integrity, and cache invalidation.
Top 2026 trends you must account for:
- Smaller quantized models (4-bit/8-bit) and LoRA adapters are common; local model bundles are often 1–50 MB.
- WebGPU and WASM runtimes let browsers do real inference — but memory and disk quotas are still constrained on mobile.
- Privacy-first browsers like Puma push local AI features, meaning users expect offline capability and secure, origin-scoped storage.
- Edge/cloud CI/CD workflows must reconcile fast updates with offline clients that may not immediately fetch new models.
Design goals for client-side model caching
Before we dive into patterns, set clear goals:
- Fast cold start: Keep the smallest viable model locally for immediate inference.
- Integrity: Always verify models before loading to prevent corruption or tampering.
- Quota-aware eviction: Respect navigator.storage limits and device constraints.
- Offline-first & safe updates: Models must work offline and accept updates when available.
- Secure at rest: Sensitive prompt data or proprietary models must be encrypted.
Storage options: pick the right store
Don't use localStorage for models or secrets. Preferred stores (ranked):
- IndexedDB - Best balance of storage, transactionality, and wide support. Use for model blobs, manifests, metadata. (See edge-powered, cache-first PWA patterns for practical guidance.)
- Cache Storage (Service Worker) - Great for immutable model files served over HTTPS with Cache-Control; simpler fetch semantics but less flexible than IndexedDB for metadata and signatures.
- File System Access API - Useful for desktop PWA where user-granted folder access is acceptable (not for anonymous mobile flows).
- Encrypted IndexedDB - Add encryption for sensitive models or prompts (see secure storage section).
Model manifest pattern (single source of truth)
Always ship a small JSON manifest that describes the models available to the client. The signature field is an optional detached signature. Manifest keys:
{
  "models": [
    {
      "id": "small-embed-v1",
      "url": "/models/small-embed-v1.bin",
      "size": 3145728,
      "checksum": "sha256:3a1f...",
      "algorithm": "sha-256",
      "version": "2026-01-12T18:00:00Z",
      "signature": "base64(...)"
    }
  ]
}
The client downloads the manifest frequently (short TTL) and models less often (use version/checksum to decide). Manifest-driven updates and signed bundles are part of the emerging web standardization discussed in edge-powered PWA resources.
Checksums and integrity verification
Checksum is non-negotiable. Use a cryptographic hash (SHA-256 or BLAKE3) of the exact file bytes. Prefer content-addressable storage: your local key = checksum. That makes dedup, eviction, and validation simple.
Compute and verify with Web Crypto
// compute SHA-256 checksum (browser)
async function checksumArrayBuffer(ab) {
  const hash = await crypto.subtle.digest('SHA-256', ab);
  return Array.from(new Uint8Array(hash)).map(b => b.toString(16).padStart(2, '0')).join('');
}
Store checksum in your manifest and verify before using a locally stored model. If signature verification is required (recommended for proprietary models), the manifest should include a detached signature over the model bytes or the manifest itself.
Signature verification (recommended for production)
Pin a public key in your app (baked into service worker or JS bundle) and verify signatures with WebCrypto. Use ECDSA P-256 or RSA-PSS depending on your security policy. Verifying the manifest instead of every model file lets you batch trust decisions and reduce verification overhead.
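A minimal verification sketch, assuming the manifest carries a detached ECDSA P-256 signature in the raw r||s encoding WebCrypto expects and that the pinned public key ships as a JWK (both assumptions for illustration):
// Verify a detached ECDSA P-256 signature over the manifest bytes.
// pinnedJwk is the public key baked into your bundle; signatureBase64 comes from the manifest.
async function verifyManifestSignature(manifestBytes, signatureBase64, pinnedJwk) {
  const key = await crypto.subtle.importKey(
    'jwk', pinnedJwk, { name: 'ECDSA', namedCurve: 'P-256' }, false, ['verify']);
  // WebCrypto expects the raw r||s signature encoding, not DER
  const sig = Uint8Array.from(atob(signatureBase64), c => c.charCodeAt(0));
  return crypto.subtle.verify({ name: 'ECDSA', hash: 'SHA-256' }, key, sig, manifestBytes);
}
If this returns false, reject the manifest and keep serving the last verified copy.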
Download, transactional write, and atomic swap
Model downloads must be transactional. If a download fails halfway, you mustn’t leave a corrupt model that the runtime will try to use.
Recommended flow:
- Download bytes to a temporary key: models/<id>.tmp
- Compute checksum and verify signature (if present)
- Store atomically by renaming/moving to models/<checksum> and store metadata mapping id → checksum
- Update last-used timestamp
async function storeModel(id, arrayBuffer, manifestEntry) {
  const cs = await checksumArrayBuffer(arrayBuffer);
  // manifest checksums carry an algorithm prefix (e.g. "sha256:..."); strip it before comparing
  if (cs !== manifestEntry.checksum.replace(/^sha256:/, '')) throw new Error('checksum mismatch');
  // open IndexedDB and save under key `models:${cs}`; then map id -> cs in the metadata store
  await storeModelBlob(cs, arrayBuffer);
  await mapIdToChecksum(id, cs);
}
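The storage helpers referenced above and later in this article (storeModelBlob, mapIdToChecksum) can be sketched against a simple two-store IndexedDB layout. The database and object store names below are illustrative assumptions, not part of any standard:
function openModelDb() {
  return new Promise((resolve, reject) => {
    const req = indexedDB.open('model-cache', 1);
    req.onupgradeneeded = () => {
      req.result.createObjectStore('models'); // blobs keyed by checksum
      req.result.createObjectStore('meta');   // metadata keyed by model id
    };
    req.onsuccess = () => resolve(req.result);
    req.onerror = () => reject(req.error);
  });
}

async function storeModelBlob(checksum, arrayBuffer) {
  const db = await openModelDb();
  return new Promise((resolve, reject) => {
    const tx = db.transaction('models', 'readwrite');
    tx.objectStore('models').put(new Blob([arrayBuffer]), `models:${checksum}`);
    tx.oncomplete = () => resolve();
    tx.onerror = () => reject(tx.error);
  });
}

async function mapIdToChecksum(id, checksum) {
  const db = await openModelDb();
  return new Promise((resolve, reject) => {
    const tx = db.transaction('meta', 'readwrite');
    tx.objectStore('meta').put({ checksum, lastUsed: Date.now(), useCount: 0 }, id);
    tx.oncomplete = () => resolve();
    tx.onerror = () => reject(tx.error);
  });
}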
When to evict: practical heuristics
Eviction should be deterministic, predictable, and observable. Combine multiple signals:
- Storage pressure: Use navigator.storage.estimate() on modern browsers to know usage and quota. Start eviction when used/quota > 0.7.
- Model size threshold: Keep a small base model (e.g., < 5 MB) as a cold-start fallback. Evict optional adapters or larger models first.
- LRU + frequency: Evict least-recently-used and least-frequently-used entries. Maintain lastUsed + useCount metadata.
- Age/TTL: If a model hasn't been used in X days (configurable), consider eviction during low-storage windows.
- User preferences: Let users pin models (prevent eviction) or define offline storage budgets in settings.
Example eviction algorithm
async function evictIfNeeded(targetFreeBytes = 5 * 1024 * 1024) {
  const { quota, usage } = await navigator.storage.estimate();
  if (quota - usage > targetFreeBytes) return;
  // fetch metadata sorted by lastUsed and useCount; pinned models and the base fallback are excluded
  const candidates = await listModelsForEviction();
  for (const m of candidates) {
    await deleteModel(m.checksum);
    const est = await navigator.storage.estimate();
    if (est.quota - est.usage > targetFreeBytes) break;
  }
}
Run eviction opportunistically (on startup, on download failure, on navigator.storage pressure events if available). For broader developer guidance on cache-first PWAs and eviction tradeoffs, see Edge-Powered, Cache-First PWAs.
Prompt caching vs model caching
Prompts and small prompt templates are much smaller and have different trade-offs:
- Store prompts in IndexedDB or encrypted storage if they contain PII.
- Evict prompts by LRU but keep user-saved templates pinned.
- For short-lived caches (session-only), use in-memory structures in a service worker or dedicated worker; a minimal LRU sketch follows this list.
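A minimal session-only prompt cache sketch, using a plain Map as an LRU (Map preserves insertion order, so the first key is always the least recently used). The size limit is an illustrative assumption:
const MAX_PROMPTS = 100; // illustrative limit
const promptCache = new Map();

function cachePrompt(key, prompt) {
  if (promptCache.has(key)) promptCache.delete(key);
  promptCache.set(key, prompt);
  if (promptCache.size > MAX_PROMPTS) {
    // evict the least-recently-used entry (first key in insertion order)
    promptCache.delete(promptCache.keys().next().value);
  }
}

function getPrompt(key) {
  if (!promptCache.has(key)) return undefined;
  const value = promptCache.get(key);
  promptCache.delete(key);
  promptCache.set(key, value); // refresh recency
  return value;
}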
Secure storage patterns
Models may contain proprietary weights or private user data (e.g., dataset-tuned adapters). Treat them as sensitive assets:
- Encrypt at rest using Web Crypto. Derive an encryption key from a user secret (password) or a device-bound key via WebAuthn where possible.
- Origin-bound keys: Store key material in the browser's secure storage where possible (e.g., keys protected via WebAuthn or platform keystore integrations available in some browsers).
- Do not store secrets in JS globals or localStorage.
- Key rotation: support re-encrypting models when key material rotates (CI update, user changes password).
Encrypting model bytes with Web Crypto (pattern)
// derive a key from a password (example using PBKDF2; in production use a unique random salt and a high iteration count)
async function deriveKey(password, salt) {
  const pwKey = await crypto.subtle.importKey('raw', new TextEncoder().encode(password), { name: 'PBKDF2' }, false, ['deriveKey']);
  return crypto.subtle.deriveKey({ name: 'PBKDF2', salt, iterations: 100_000, hash: 'SHA-256' }, pwKey,
    { name: 'AES-GCM', length: 256 }, false, ['encrypt', 'decrypt']);
}

async function encryptModel(ab, key) {
  const iv = crypto.getRandomValues(new Uint8Array(12));
  const ct = await crypto.subtle.encrypt({ name: 'AES-GCM', iv }, key, ab);
  return { iv: Array.from(iv), ciphertext: new Uint8Array(ct) };
}
This pattern lets you store encrypted blobs in IndexedDB. Keep IV and metadata unencrypted but store them as structured objects.
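The decryption counterpart is symmetric; this sketch assumes the stored record has the shape produced by encryptModel above:
// Decrypt a stored record back into raw model bytes (returns an ArrayBuffer).
async function decryptModel(record, key) {
  const iv = new Uint8Array(record.iv);
  return crypto.subtle.decrypt({ name: 'AES-GCM', iv }, key, record.ciphertext);
}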
Offline-first updates and CI/CD workflows
Clients may be offline for days. Your update strategy should be tolerant:
- Manifest-driven updates: push a new manifest with new model checksums. The client checks manifest first, then decides to download.
- Version pinning: Map user-visible model selection to specific checksums to avoid accidental mismatches.
- Staged rollouts: Use manifest flags to roll out new model versions gradually. Manifest can include rollout percentage and A/B keys. See operational rollout guidance in the Micro‑Apps DevOps Playbook.
- Forced invalidation: To invalidate caches immediately, change the manifest entry checksum or version. Clients should treat a checksum mismatch as authoritative and redownload; a minimal sync sketch follows this list.
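A minimal manifest-driven sync sketch; getLocalChecksum and isOnWifiOrUserApproved are hypothetical helpers, and downloadAndStoreModel is shown later in this article:
async function syncWithManifest(manifest) {
  for (const entry of manifest.models) {
    const local = await getLocalChecksum(entry.id); // hypothetical: reads the id -> checksum mapping
    if (local === entry.checksum.replace(/^sha256:/, '')) continue; // already up to date
    if (await isOnWifiOrUserApproved(entry)) {
      // checksum mismatch is authoritative: download, verify, and atomically swap
      await downloadAndStoreModel(entry);
    }
  }
}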
Observability: what to measure
Measure and expose these metrics for each client to tune eviction and detect problems:
- Cache hit rate (model load from local vs network)
- Download times and failures per model checksum
- Storage usage per model and overall quota use
- Number of evictions and reason codes (quota, age, user request)
- Checksum mismatches and signature verification failures
Use aggregated telemetry (privacy-preserving) and allow users to opt out. For explainability and live-inspector patterns that surface cache decisions and verification results, see Live Explainability APIs. For team-level telemetry rationalization, consult Tool Sprawl.
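A privacy-preserving baseline can be as simple as local counters flushed in aggregate; the /telemetry endpoint and metric names below are assumptions for illustration:
const counters = { cacheHit: 0, cacheMiss: 0, eviction: 0, checksumMismatch: 0 };

function bump(metric) { counters[metric] = (counters[metric] || 0) + 1; }

async function flushTelemetry() {
  const payload = { ...counters, ts: Date.now() }; // aggregated counts only, no identifiers
  Object.keys(counters).forEach(k => (counters[k] = 0));
  try {
    await fetch('/telemetry', { method: 'POST', body: JSON.stringify(payload) });
  } catch {
    // offline: drop silently or queue for a later retry
  }
}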
Runtime caching: in-memory + worker caches
For performance, keep hot models in memory or pinned to a dedicated Web Worker or SharedArrayBuffer-backed cache if the runtime allows. Typical pattern:
- On first load, read model bytes from IndexedDB into a Web Worker and initialize the runtime (WASM/WebGPU). See how on-device capture stacks handle worker-backed transfers in On‑Device Capture & Live Transport.
- Maintain an in-memory LRU for model handles; when memory pressure rises, free runtime allocations but keep the model bytes persisted on disk.
- Use Transferable objects where possible to avoid copies (see the sketch after this list); transferable patterns are discussed in mobile capture and low-latency stacks (see On‑Device Capture & Live Transport).
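A sketch of the zero-copy handoff, assuming a hypothetical readModelBlob helper that returns an ArrayBuffer from IndexedDB and a worker script at /model-worker.js (both assumptions):
async function loadModelInWorker(checksum) {
  const bytes = await readModelBlob(checksum); // hypothetical IndexedDB read returning an ArrayBuffer
  const worker = new Worker('/model-worker.js');
  // Transferring the ArrayBuffer moves ownership to the worker instead of copying it;
  // `bytes` is detached (unusable) on the main thread afterwards.
  worker.postMessage({ type: 'init', model: bytes }, [bytes]);
  return worker;
}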
Example end-to-end flow
- App fetches manifest (TTL: 5 minutes).
- If manifest references a checksum not present locally, app attempts download when on Wi‑Fi or user consent.
- Download to temporary key, verify checksum and signature, encrypt if needed, move to permanent key models:<checksum>.
- Update metadata id → checksum, set lastUsed timestamp, increment useCount.
- If storage pressure detected, run eviction algorithm.
- On model load, pull decrypted bytes into worker and initialize runtime; emit telemetry events for hits/misses.
Quick checklist for implementation (copy-paste)
- Use IndexedDB for model blobs + metadata; avoid localStorage.
- Ship a manifest with checksums + optional signatures.
- Verify checksums with crypto.subtle.digest('SHA-256').
- Verify manifest signatures with WebCrypto (pinned public key).
- Store models keyed by checksum (content-addressable).
- Use navigator.storage.estimate() to detect pressure and trigger eviction.
- Implement LRU + size-first eviction; keep a small fallback model pinned.
- Encrypt sensitive models with WebCrypto; consider WebAuthn for key protection.
- Expose cache metrics and let users pin models.
Edge cases & pitfalls
- Quota surprises: Different browsers/devices expose very different quotas. Test on low-end Android in 2026 — quotas can be tight; guidance on low-end testing appears across edge-focused toolkits like cache-first PWA guides.
- Partial writes: Never trust partially-written blobs; always verify checksum before use.
- Signature algorithm support: Validate that target browsers support chosen algorithms in WebCrypto; provide fallbacks.
- User resets or clears site data: Provide graceful fallback to network download and a small base model for cold start.
- Telemetry & privacy: Be explicit in UI about what telemetry you collect; aggregate and strip identifiers.
“In 2026, on-device AI shifts trust and responsibility to the client. A well-designed cache isn’t optional — it’s the foundation of predictable performance and security.”
Concrete code snippets you can reuse
Compute checksum, store, and verify (simplified)
async function downloadAndStoreModel(manifestEntry) {
  const res = await fetch(manifestEntry.url, { cache: 'no-store' });
  if (!res.ok) throw new Error(`download failed: ${res.status}`);
  const ab = await res.arrayBuffer();
  const cs = await checksumArrayBuffer(ab);
  if (cs !== manifestEntry.checksum.replace(/^sha256:/, '')) throw new Error('Checksum mismatch');
  await storeModelBlob(cs, ab); // keyed by checksum
  await mapIdToChecksum(manifestEntry.id, cs);
}
Estimate storage and trigger eviction
async function ensureFreeBytes(minBytes) {
  const est = await navigator.storage.estimate();
  if ((est.quota - est.usage) < minBytes) await evictIfNeeded(minBytes);
}
Actionable takeaways
- Use manifests + checksums — manifest-first design keeps updates predictable and CI/CD-friendly. See progressive manifest ideas in Edge-Powered PWAs.
- Key by checksum for atomic swaps and deduplication.
- Encrypt sensitive models and prefer origin-bound keys; never store secrets in localStorage.
- Evict deterministically with LRU + storage-pressure heuristics; keep a small, pinned fallback model.
- Measure everything — cache hit rate, evictions, download failures — and make the data visible to engineers and power users. For explainability and inspector UIs, check Live Explainability APIs.
Future-proofing: what to watch in 2026–2027
Look for these near-term changes and prepare accordingly:
- Wider browser support for hardware-accelerated model execution via WebGPU — expect larger hot model memory use. For on-device visual and data workloads, see On-Device AI & Data Visualization.
- Standardization around model manifests and signed bundles for the web (efforts matured in 2025–2026).
- Improved platform APIs for secure key storage in browsers — which will simplify encryption key management.
Final checklist before shipping
- Manifest with checksums and signatures published via HTTPS.
- Transactional download + checksum verification.
- Content-addressable IndexedDB storage keyed by checksum.
- Eviction policy: LRU + size + storage pressure; small fallback model pinned.
- Encrypted storage for sensitive models and prompts; clear privacy UI.
- Telemetry for cache metrics and a developer inspection panel.
Call to action
Start by adding a manifest and checksum verification to your build pipeline this sprint. Implement a small, pinned fallback model — it will buy you predictable cold starts on Puma and other local-AI browsers. If you’d like, download our open-source model cache library for IndexedDB (includes LRU eviction and signature verification) to get production-ready caching in hours — check the link in the developer docs and run a test on a low-end Android device to validate quotas.
Related Reading
- Edge-Powered, Cache-First PWAs for Resilient Developer Tools — Advanced Strategies for 2026
- Edge AI Code Assistants in 2026: Observability, Privacy, and the New Developer Workflow
- How On-Device AI Is Reshaping Data Visualization for Field Teams in 2026
- Building and Hosting Micro‑Apps: A Pragmatic DevOps Playbook