From Chrome to Puma: What Browser Developers Need to Know About Embedding Local AI Runtimes
A practical 2026 guide for browser engineers: embed ONNX/CoreML/TFLite safely with manifested model delivery, cache policies, and strong sandboxing.
Why browser engines must get local AI runtimes right in 2026
Browser engineers: your users expect fast, private AI features on mobile and desktop without sending data to a third-party API. Slow or insecure integrations turn checkouts, autofill, and accessibility features into liability. Embedding local AI runtimes (ONNX, CoreML, TensorFlow Lite) into a browser engine gives you latency, privacy, and offline capability — but it also raises hard questions about sandboxing, model delivery, cache management, and deployability.
Trusted context: where we are in 2026
In late 2025 and early 2026, adoption of on-device inference accelerated. WebGPU and the WebNN proposals matured into widely supported APIs, vendors shipped more mobile ML delegates, and niche browsers (for example, Puma on mobile) popularized local AI features. Meanwhile, quantization and GGML-style runtimes made small LLMs viable on phones. That changes the expectations for browser engineers: local models must be fast, updatable, auditable, and isolated.
What this guide covers
- How to embed ONNX, CoreML and TensorFlow Lite into a browser (platform and engine level).
- Cache strategies for model storage, validation, delta updates, and eviction.
- Security sandboxing and permission models to run local AI safely.
- CI/CD and serverless deployment patterns for model delivery and versioning.
High-level architecture: a recommended pattern
Treat on-device inference as a composable subsystem inside the browser. Implement three layers:
- Runtime layer — platform-specific accelerators and runtimes (CoreML on iOS, NNAPI/DirectML on Android/Windows, ONNX Runtime, TFLite delegate).
- Isolation layer — separate process / worker with strict sandboxing, resource quotas, and attestation hooks.
- Policy & cache layer — model manifests, signature verification, cache store, eviction and update policies, and consent management.
Why process isolation matters
Running model inference in the same process as the renderer increases the blast radius for crashes and data exfiltration. Use a dedicated AI service process or a trusted WebAssembly worker that does not share memory with page JavaScript without explicit serialization.
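The isolation boundary stays honest when the only bridge is a message-passing RPC: inputs cross as serialized messages, never as shared memory. A minimal sketch of such a surface (the `AIWorkerClient` class and message shape are illustrative, not an existing browser API):

```javascript
// Hypothetical narrow RPC surface between page JS and a dedicated AI worker.
// postMessage forces structured-clone serialization, so no live references
// leak across the boundary.
class AIWorkerClient {
  constructor(worker) {
    this.worker = worker;
    this.pending = new Map(); // request id -> { resolve, reject }
    this.nextId = 0;
    worker.onmessage = (e) => {
      const { id, result, error } = e.data;
      const p = this.pending.get(id);
      if (!p) return;
      this.pending.delete(id);
      error ? p.reject(new Error(error)) : p.resolve(result);
    };
  }
  predict(modelId, input) {
    const id = this.nextId++;
    return new Promise((resolve, reject) => {
      this.pending.set(id, { resolve, reject });
      this.worker.postMessage({ id, modelId, input });
    });
  }
}
```

Keeping the surface this small also makes it auditable: one message type in, one message type out, with rate limits and quotas enforced on the worker side.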
Embedding runtimes: platform-by-platform patterns
iOS (CoreML) — best practice for WKWebView and engine-level embedding
On iOS, CoreML is the fastest path for hardware-accelerated inference. If you control the browser embedder, expose a minimal RPC from JavaScript to a sandboxed native worker that runs CoreML models.
Step-by-step:
- Pre-package or download .mlmodelc bundles signed by your model registry.
- Run inference in a dedicated, sandboxed worker process, separate from web content (App Sandbox on macOS; the per-app platform sandbox on iOS).
- Use WKScriptMessageHandler to expose a narrow API: predict(modelId, inputBlob) -> Promise.
- Perform model validation and signature check before any load.
// Swift: WKScriptMessageHandler example
class AINativeBridge: NSObject, WKScriptMessageHandler {
    func userContentController(_ uc: WKUserContentController,
                               didReceive message: WKScriptMessage) {
        guard let body = message.body as? [String: Any],
              let modelId = body["modelId"] as? String,
              let input = body["input"] as? String else { return }
        // Schedule inference on the dedicated AI queue.
        AIService.shared.run(modelId: modelId, inputBase64: input) { result in
            // Post the result back to the page via evaluateJavaScript or a message.
        }
    }
}
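The page-side counterpart might look like the sketch below. It assumes the embedder registered the handler under the name "aiBridge" and that the native side replies by calling `globalThis.__aiResult(id, payload)` through evaluateJavaScript; both names are illustrative, not a WebKit contract.

```javascript
// Page-side call into the WKScriptMessageHandler above (illustrative).
const pendingPredicts = new Map(); // request id -> resolve
let nextPredictId = 0;

// Invoked by the native side via evaluateJavaScript when inference completes.
globalThis.__aiResult = (id, payload) => {
  const resolve = pendingPredicts.get(id);
  pendingPredicts.delete(id);
  if (resolve) resolve(payload);
};

function predictViaCoreML(modelId, inputBase64) {
  const id = String(nextPredictId++);
  return new Promise((resolve) => {
    pendingPredicts.set(id, resolve);
    // In a WKWebView page this is window.webkit.messageHandlers.<name>.postMessage.
    globalThis.webkit.messageHandlers.aiBridge.postMessage({ id, modelId, input: inputBase64 });
  });
}
```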
Android (TFLite / ONNX via NNAPI) — use delegates and JavaScript bridges
On Android, prefer NNAPI delegates for devices that expose hardware acceleration. If you must support older devices, fall back to TFLite CPU or ONNX Runtime Mobile. Expose a JavaScriptInterface only to origins the user allows.
// Kotlin: WebView JavaScriptInterface scaffold
class AIBridge(private val context: Context) {
    @JavascriptInterface
    fun predict(modelId: String, inputBase64: String): String {
        val result = AIService.predict(modelId, Base64.decode(inputBase64, Base64.DEFAULT))
        return Base64.encodeToString(result, Base64.NO_WRAP)
    }
}
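On the page side, the injected interface is synchronous, so it is worth wrapping in a Promise to keep the JS-facing API shape identical across platforms. A sketch, assuming the embedder attached the bridge via addJavascriptInterface(..., "AIBridge"):

```javascript
// Page-side call for the Kotlin scaffold above (illustrative).
function predictViaBridge(modelId, inputBase64) {
  return new Promise((resolve, reject) => {
    try {
      // AIBridge is injected by the WebView host; it is not defined in page JS.
      resolve(globalThis.AIBridge.predict(modelId, inputBase64));
    } catch (e) {
      reject(e);
    }
  });
}
```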
Engine-level (Chromium/Blink) — embed native runtimes for performance
If you work on the engine itself, integrate the runtime as a dedicated service process with a compact IPC surface. Support plugin delegates (CoreML, NNAPI, DirectML) and a wasm fallback for portability. Keep the JS-facing API minimal and opt-in.
WebAssembly fallback — portable but capped
Compile ONNX/TFLite runtimes to WebAssembly for cross-platform fallback. Use WebGPU for acceleration via the WebGPU-WASM interop or WebNN polyfills. This approach is slower than native delegates but essential for uniform behavior across browsers.
Model delivery and cache strategy — problems to solve
Model files are large, mobile storage is constrained, and network connectivity is variable. Design your cache strategy to be content-addressed, space-aware, secure, and updatable without breaking running sessions.
Recommended cache hierarchy
- In-memory cache — short-lived workspace for currently running model shards and tensors. Evict aggressively on memory pressure.
- Persistent cache — on-device filesystem or IndexedDB store for model bundles; maintain content-addressed keys by SHA-256 of the model binary.
- Cold storage / CDN — server-hosted signed model bundles and manifest JSON served over HTTPS with strong caching headers.
Manifest and content-addressing
For every published model, provide a manifest.json with fields: modelId, version, sha256, size, supportedRuntimes, signature, and shards. The browser verifies the sha256 and the cryptographic signature before storing anything.
{
  "modelId": "com.example/assistant-small",
  "version": "2026-01-10",
  "sha256": "abcdef...",
  "size": 73400320,
  "supportedRuntimes": ["tflite", "onnx", "coreml"],
  "signature": "BASE64_SIGNATURE",
  "shards": ["0-16MiB", "16-32MiB", "32-..."]
}
Chunked downloads, delta updates and prefetching
- Support HTTP range requests and shard-aware downloads so you can stream the model and start inference on the first shard.
- Use binary diff (bsdiff or custom deltas) for minor weight updates; publish delta manifests alongside full bundles.
- Prefetch smaller models or tokenizer assets on low-cost network triggers (e.g., on Wi‑Fi during idle) and honor battery constraints.
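Shard-aware streaming rests on HTTP Range requests. A sketch of the download side (per-shard SHA-256 verification against the manifest's shard map would follow the fetch):

```javascript
// Fetch one shard with an HTTP Range request so inference can start on the
// first verified shard while the rest streams. Expects a 206 Partial Content
// response; a 200 means the server ignored the Range header.
async function fetchShard(url, startByte, endByte) {
  const res = await fetch(url, { headers: { Range: `bytes=${startByte}-${endByte}` } });
  if (res.status !== 206) throw new Error(`range request not honored: ${res.status}`);
  return new Uint8Array(await res.arrayBuffer());
}
```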
Eviction and quota policies
- Track per-origin and global quotas for model cache.
- Use LRU with size-aware weighting (big models have higher eviction weight).
- Expose a user-facing settings panel for storage and per-site AI permissions.
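A size-aware LRU can be sketched with an insertion-ordered map: walk entries oldest-first and evict until total bytes fit the quota. Large models free more space per eviction, so among equally stale entries they effectively go first.

```javascript
// Size-aware LRU eviction sketch for the model cache.
class ModelCache {
  constructor(quotaBytes) {
    this.quota = quotaBytes;
    this.entries = new Map(); // key -> sizeBytes; Map preserves insertion order
  }
  touch(key, sizeBytes) {
    if (this.entries.has(key)) this.entries.delete(key); // move to MRU position
    this.entries.set(key, sizeBytes);
    this.evict();
  }
  totalBytes() {
    let total = 0;
    for (const s of this.entries.values()) total += s;
    return total;
  }
  evict() {
    for (const key of [...this.entries.keys()]) { // oldest-first = LRU order
      if (this.totalBytes() <= this.quota) break;
      this.entries.delete(key);
    }
  }
}
```

Per-origin quotas would layer on top: one ModelCache per origin, plus a global cap enforced the same way.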
Security & sandboxing patterns
Running local AI introduces new attack surfaces: malicious models, poisoned inputs, and privacy leaks. Apply defense-in-depth: validate, isolate, rate-limit, attest.
Model provenance and signature verification
Sign all model bundles in your model registry using an offline key. At runtime, verify signatures using COSE or JOSE and reject unsigned or revoked models. Keep a signed revocation list that the browser can fetch periodically.
Process and syscall restrictions
Run inference inside a constrained process with no filesystem write access except the approved model cache, no network access unless explicitly permitted, and strictly limited memory and CPU quotas. On Linux-based systems use seccomp filters; on macOS/iOS rely on the platform sandbox.
WebAssembly sandboxing and memory limits
If you run WASM runtimes, enforce memory and execution time limits (fuel-based execution). Interpose I/O and ensure WASM cannot call into arbitrary host functions beyond a curated bridge.
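The memory ceiling, at least, is enforceable with the standard API: a WebAssembly.Memory created with a `maximum` can never grow past it, whatever the module requests. Fuel-based execution limits need engine support and are not shown here.

```javascript
// Hard memory ceiling for a WASM runtime instance (sketch).
// Pages are 64 KiB each; grow() past `maximum` throws a RangeError.
function makeBoundedMemory(maxPages) {
  return new WebAssembly.Memory({ initial: 1, maximum: maxPages });
}
```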
Per-origin permissions and consent flows
Require explicit user consent when a site requests a local model. The permission dialog should show model provenance, estimated size, and whether data is retained locally. Store consent state per-origin and implement revocation APIs.
Input sanitization, throttling and privacy
- Sanitize inputs at the boundary: restrict file reads and remove metadata where possible.
- Throttle inference requests per origin to prevent covert channels and reduce exfiltration risk.
- Consider automatic differential privacy or local aggregations for telemetry.
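Per-origin throttling is commonly done with a token bucket: each origin gets a burst allowance that refills at a fixed rate, which bounds both request rate and covert-channel bandwidth. A sketch (class name and parameters are illustrative):

```javascript
// Token-bucket throttle, one bucket per origin.
class OriginThrottle {
  constructor(capacity, refillPerSec) {
    this.capacity = capacity;       // max burst size
    this.refillPerSec = refillPerSec;
    this.buckets = new Map();       // origin -> { tokens, last }
  }
  allow(origin, now = Date.now()) {
    const b = this.buckets.get(origin) ?? { tokens: this.capacity, last: now };
    b.tokens = Math.min(this.capacity, b.tokens + ((now - b.last) / 1000) * this.refillPerSec);
    b.last = now;
    this.buckets.set(origin, b);
    if (b.tokens < 1) return false; // over budget: reject or queue
    b.tokens -= 1;
    return true;
  }
}
```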
Performance optimizations
On-device inference is a memory and compute game. Prioritize small quantized models, OS delegates, and progressive inference.
- Use int8/int4 quantization and per-channel quantization to reduce RAM and improve throughput.
- Delegate tensor ops to CoreML/NNAPI/DirectML when available instead of CPU kernels.
- Warm up models lazily: run a light warm-up pass when the device is plugged in or during idle.
- Implement progressive inference: run a tiny model for quick responses and trigger a larger model if needed.
CI/CD for models: treat models as first-class artifacts
Your engineer workflow should mirror software releases. Add automated validation, quantization, signing, and publishing steps to your pipeline.
Example GitHub Actions workflow (truncated)
name: model-publish
on:
  push:
    paths: ['models/**']
jobs:
  build-and-publish:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: quantize model
        run: python tools/quantize.py models/assistant.pt -o artifacts/assistant.tflite
      - name: validate
        run: python tools/validate_model.py artifacts/assistant.tflite
      - name: sign
        run: ./tools/sign_model.sh artifacts/assistant.tflite > artifacts/assistant.tflite.sig
      - name: upload to CDN
        uses: actions/upload-artifact@v4
        with:
          name: assistant-model
          path: artifacts/**
Key steps: static model checks (no executable code), quantization, unit/integration tests that run sample inferences, and cryptographic signing using a secure key store (KMS/HSM).
Serverless metadata & attestation endpoints
Provide a small serverless API to serve manifests, signatures, revocation lists, and attestation tokens. Keep it minimal to reduce maintenance and scale automatically with CDN-backed static hosting for bundles.
GET /manifests/com.example/assistant.json
Response: 200
{ modelId, version, sha256, signature, supportedRuntimes }
Example integration: secure predict flow
- Site requests permission to use a model via navigator.permissions-like API (site provides modelId).
- Browser shows provenance dialog and asks user consent.
- On consent, the browser fetches the manifest, verifies its signature, and checks the revocation list.
- Browser downloads shards, verifies SHA-256 for each shard, stores them in content-addressed cache.
- Site calls bridge predict(modelId, input). The sandboxed AI process runs inference and returns an encrypted/serialized response.
- Browser applies rate-limits and logs a minimal telemetry record (user-opt-in) for performance tracking.
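The steps above can be sketched as a single orchestration function. Every dependency it calls (requestConsent, fetchManifest, verifySignature, isRevoked, downloadShards, runSandboxedPredict) is a placeholder for the real subsystem, injected here to keep the flow testable:

```javascript
// Secure predict flow (sketch): fail closed at every gate before inference runs.
async function securePredict(origin, modelId, input, deps) {
  if (!(await deps.requestConsent(origin, modelId))) throw new Error('consent denied');
  const manifest = await deps.fetchManifest(modelId);
  if (!deps.verifySignature(manifest)) throw new Error('bad signature');
  if (await deps.isRevoked(manifest)) throw new Error('model revoked');
  await deps.downloadShards(manifest); // verifies per-shard SHA-256 internally
  return deps.runSandboxedPredict(modelId, input);
}
```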
Testing and observability
- Unit test model I/O and numerical stability across runtimes.
- Integration tests that run on CI on representative devices or emulators with delegates toggled.
- Expose debug pages that show cache usage, model versions, and last verification timestamp for audits.
Operational concerns: versioning, rollback, and A/B
Model issues are user-facing and must be easy to roll back. Use immutable versioning, signed releases, and staged rollouts:
- Canary: push to 1-5% of devices first.
- Automated rollbacks triggered by client-side health signals (aborts, high latency, accuracy regressions).
- Support multi-version coexistence for A/B testing models without requiring full app updates.
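Canary assignment needs to be deterministic per device, not random per request, so a device stays in or out of the canary for a given rollout. A sketch using a stable install id hashed into 100 buckets (FNV-1a is just a cheap deterministic hash here, not a requirement):

```javascript
// Deterministic rollout bucketing (sketch): same install id, same bucket.
function rolloutBucket(installId) {
  let h = 0x811c9dc5; // FNV-1a 32-bit offset basis
  for (let i = 0; i < installId.length; i++) {
    h ^= installId.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0; // FNV prime, kept unsigned
  }
  return h % 100;
}
function inCanary(installId, percent) {
  return rolloutBucket(installId) < percent;
}
```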
Mobile-specific notes
- iOS: CoreML model bundles should be prepared using coremltools, signed and optimized for size.
- Android: rely on NNAPI delegates where possible; implement fallback to TFLite/ONNX when missing.
- Battery life: limit background downloads/inference to charging or user-configured windows.
Real-world example: what Puma demonstrated for mobile browsers
Niche mobile browsers introduced local AI features that highlight user demand for on-device assistants. Their approach shows three practical lessons for engine authors: keep the model pipeline transparent, offer runtime choices, and make consent granular. Use those lessons as guiding principles when building your integration.
"Local-first AI is now a user expectation on mobile: privacy, latency and offline capability win. But it must be delivered with auditable security and controlled resource use." — Practical takeaway
Future-proofing and 2026 trends you should watch
- Standardization: expect WebNN/WebGPU to be the default acceleration APIs in most browsers; design to use them when available.
- Model shards & streaming: more model serving will move to shard-first streaming so inference can begin before the full model downloads.
- Homomorphic/secure enclaves and attested local compute will grow for regulated verticals (health, finance).
- Smaller, instruction‑tuned models will replace many cloud calls — keep update and regression safeguards in place.
Checklist: ship a safe local-AI feature (engineer-to-engineer)
- Process isolation: inference runs outside renderer process.
- Manifest & signature: every model has a signed manifest with sha256.
- Cache policy: content-addressed store, LRU eviction, per-origin quotas.
- Permissions: per-origin consent and revocable opt-ins.
- Resource limits: memory, CPU, timeouts enforced.
- Telemetry: minimal, privacy-preserving, opt-in for debugging only.
- CI: quantize, validate, sign and publish as part of your release pipeline.
Actionable code & config snippets
Service worker snippet that verifies manifest signature before caching (pseudo-code):
self.addEventListener('fetch', event => {
  if (event.request.url.endsWith('/model/manifest.json')) {
    event.respondWith(fetch(event.request).then(async res => {
      const body = await res.clone().json();
      // Verify the signature against the bundled public key before caching.
      if (!verifySignature(body)) throw new Error('Invalid signature');
      const cache = await caches.open('model-cache');
      await cache.put(event.request, res.clone());
      return res;
    }));
  }
});
Closing: how to get started this sprint
Start small: pick a trivial model (e.g., tokenizer + intent classifier <20MB quantized) and implement a narrow permissioned API in your browser that runs it in a sandboxed service process. Add manifest verification, an LRU cache, and a CI pipeline that signs the artifact. Use OS delegates (CoreML, NNAPI) where available, and provide a WASM fallback.
Key takeaways
- Local AI unlocks privacy, latency and offline UX — but only if you design for security, updateability and resource limits from day one.
- Use platform delegates (CoreML, NNAPI) for performance; keep WASM as fallback.
- Treat models as signed artifacts and implement manifest-driven cache & update policies.
- Sandbox and limit resource use to reduce attack surface and give users control.
Call to action
Ready to prototype? Clone our reference repo (engine+bridge+CI) for a working minimal embed and CI pipeline. If you need help productionizing local AI in your browser, reach out to the webdevs.cloud team for a security review and deployment workshop.