Local-First LLMs on Raspberry Pi 5: Building a Private Voice Assistant
2026-02-06
10 min read


Ship a private, low-latency voice assistant on Raspberry Pi 5 with AI HAT+ 2 — fully local, secure, and production-ready

If you're a developer or ops lead tired of sending voice data to cloud LLMs, struggling with latency and cost, or juggling brittle CI/CD for smart-home assistants, this guide shows a repeatable path to a private, local-first voice assistant that runs on a Raspberry Pi 5 + AI HAT+ 2 using open LLMs, on-device wake-word detection, and local speech-to-text.

Why local-first matters in 2026

Edge AI has matured. Late-2025 breakthroughs in 4-bit and 8-bit quantization, GGUF model packaging, and NPU-friendly runtimes let useful LLMs run on small hardware. At the same time, privacy and compliance pressures — and the rise of hybrid voice assistants from major vendors — make on-device inference a practical choice for teams who need:

  • Low latency (wake-word → response in sub-second to a few seconds).
  • Privacy and compliance (no audio leaves the home network).
  • Predictable costs (no per-request cloud billing).
  • Offline resilience for critical local automations.

What this guide builds (at a glance)

  • Raspberry Pi 5 + AI HAT+ 2 hardware stack with OS and drivers.
  • On-device wake-word detection (Porcupine / Rhasspy options).
  • Local speech-to-text (whisper.cpp or Coqui models) and VAD.
  • On-device LLM inference using a quantized GGUF model (llama.cpp / ONNX runtime + vendor SDK for NPU).
  • Local text-to-speech (Coqui TTS or lightweight VITS) and system integration.
  • CI/CD flow for model and service updates using GitHub Actions and secure deployment.

Hardware and components

  • Raspberry Pi 5 (64-bit OS recommended)
  • AI HAT+ 2 (NPU accelerator, vendor SDK with ONNX/TFLite support)
  • USB microphone or I2S microphone array for better beamforming
  • Speakers (USB or 3.5mm via DAC)
  • 16–64 GB microSD or NVMe (Pi 5 supports NVMe via PCIe for logs and models)

Prerequisites and software choices — pragmatic trade-offs

Below are the recommended, proven OSS building blocks in 2026; they balance accuracy, performance, and licensing for production use:

  • Wake-word: Picovoice Porcupine (closed-source but lightweight) or open alternatives like Rhasspy / Mycroft Precise for fully open stacks.
  • VAD: webrtcvad to gate speech segments and save CPU.
  • Speech-to-text: whisper.cpp (GGML), Coqui STT, or ONNX quantized models.
  • LLM inference: llama.cpp (or compatible forks) with GGUF support, or ONNX Runtime with the AI HAT+ 2 vendor SDK for NPU acceleration. Quantize to 4- or 5-bit where supported.
  • Text-to-speech: Coqui TTS or lightweight VITS models quantized for runtime.
  • Orchestration: systemd services or Docker containers. For fleet CI/CD, use balena or GitHub Actions + SSH deployment.
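
If you go the systemd route, a minimal unit file is enough to keep the assistant running across reboots. The sketch below assumes an illustrative install path, module name, and service user; adjust all three to your layout:

```ini
# /etc/systemd/system/assistant.service -- paths and names are illustrative
[Unit]
Description=Local voice assistant
After=sound.target network-online.target

[Service]
User=assistant
WorkingDirectory=/opt/assistant
ExecStart=/opt/assistant/venv/bin/python -m assistant.main
Restart=on-failure
RestartSec=5

[Install]
WantedBy=multi-user.target
```

Enable it with `sudo systemctl enable --now assistant.service` and tail logs with `journalctl -u assistant -f`.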

Step 1 — Prepare the Pi 5 and AI HAT+ 2

  1. Flash a 64-bit Raspberry Pi OS or Ubuntu 24.04 (headless) and enable SSH.
  2. Update packages and configure locale, timezone, and audio group membership:
  3. sudo apt update && sudo apt upgrade -y
    sudo usermod -aG audio,video $USER
    sudo raspi-config nonint do_ssh 0
  4. Install vendor SDK for AI HAT+ 2. The HAT typically exposes ONNX/TFLite runtime or a dedicated API; follow the manufacturer install (kernel modules, userspace SDK).
  5. Verify NPU availability: run the vendor sample inference and check utilization.

Tips

  • Use an NVMe boot if available to avoid slow SD I/O when loading models.
  • Pin system CPU governor to performance for consistent latency during inference.

Step 2 — Wake-word + VAD: cheap safety for local-first assistants

Why: Wake-word detection prevents continuous STT and LLM usage, cutting CPU usage and improving privacy. Voice activity detection (VAD) prevents false triggers and saves battery/CPU.
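
webrtcvad is the practical choice for this gate; conceptually, though, a VAD is just a per-frame speech/no-speech decision. The toy energy gate below illustrates the idea in pure Python (the 500 RMS threshold and the 16-bit mono PCM framing are arbitrary assumptions, not tuned values):

```python
import math
import struct

def frame_energy(pcm: bytes) -> float:
    """RMS energy of a 16-bit little-endian mono PCM frame."""
    n = len(pcm) // 2
    samples = struct.unpack('<%dh' % n, pcm[:n * 2])
    return math.sqrt(sum(s * s for s in samples) / n)

def is_speech(pcm: bytes, threshold: float = 500.0) -> bool:
    """Crude gate: treat frames above an RMS threshold as speech."""
    return frame_energy(pcm) > threshold
```

In production, prefer webrtcvad's `is_speech(frame, sample_rate)`, which is far more robust to noise than a raw energy threshold.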

Porcupine quick setup (example)

pip install pvporcupine pvrecorder
# Python snippet (recent pvporcupine versions require a free Picovoice AccessKey)
import pvporcupine, pvrecorder
p = pvporcupine.create(access_key='YOUR_ACCESS_KEY', keywords=['porcupine'])
r = pvrecorder.PvRecorder(device_index=0, frame_length=p.frame_length)
r.start()
try:
    while True:
        pcm = r.read()
        if p.process(pcm) >= 0:
            print('Wake word detected')
            break
finally:
    r.stop()
    r.delete()
    p.delete()

If you prefer fully open-source, Rhasspy provides an offline wake-word engine and integrated ASR pipelines and can be swapped into this flow.

Step 3 — Local speech-to-text pipeline

Use VAD to chop audio, then pass segments to whisper.cpp or a small Coqui model. On the Pi + HAT, run a quantized GGML model on the CPU, or convert the model to ONNX so the vendor SDK can accelerate it.

# Example using whisper.cpp (simplified)
# Build whisper.cpp with ARM optimizations and load a quantized model
./main -m models/ggml-small.en.bin -f /tmp/segment.wav -otxt

Practical tips

  • Prefer small.en or similar compact models for command-and-control accuracy; larger models give better conversation but cost more compute.
  • Use endpointing: start inference only after VAD detects stable speech, and stop on silence to reduce total CPU time.
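
The endpointing tip can be sketched as a small segmenter; the frame handling and the 30-frame silence budget below are illustrative assumptions:

```python
SILENT_FRAMES_TO_CLOSE = 30  # ~900 ms of silence at 30 ms frames (assumption)

def segments(frames, is_speech):
    """Group contiguous speech frames; close a segment after sustained silence.

    `frames` is an iterable of PCM frames; `is_speech` is any VAD callable
    (e.g. a webrtcvad wrapper) returning True for speech frames.
    """
    buf, silent = [], 0
    for frame in frames:
        if is_speech(frame):
            buf.append(frame)
            silent = 0
        elif buf:
            silent += 1
            if silent >= SILENT_FRAMES_TO_CLOSE:
                yield buf
                buf, silent = [], 0
    if buf:
        yield buf  # flush a trailing utterance at end of stream
```

Each yielded segment is then written to a WAV file and handed to whisper.cpp, so STT runs only on actual speech.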

Step 4 — On-device LLM inference (the core)

Running an LLM locally is the biggest engineering choice. You can:

  1. Use a compact GGUF model with llama.cpp on CPU (works for small assistants).
  2. Use the AI HAT+ 2 NPU and ONNXRuntime (or vendor runtime) to run larger quantized models.

Sample flow

  1. STT produces text: "Turn on living room lights"
  2. Assistant prompt template wraps user text and system instructions.
  3. LLM runs locally and returns an action or free text response.
  4. Action router executes local commands (MQTT, Home Assistant API), or TTS speaks the response.

# Minimal Python interaction with llama.cpp via subprocess
import subprocess
prompt = 'User: Turn on living room lights\nAssistant:'
proc = subprocess.run(['./main', '-m', 'tiny.gguf', '-p', prompt, '-n', '128'],
                      capture_output=True)
print(proc.stdout.decode())
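
One hedged way to implement the action router from the flow above is to have the prompt template ask the LLM to answer with a JSON action, then dispatch it to registered handlers. The action name and handler below are hypothetical; in a real system the handler would publish to MQTT or call the Home Assistant API:

```python
import json

HANDLERS = {}

def handler(name):
    """Register a function as the handler for a named action."""
    def register(fn):
        HANDLERS[name] = fn
        return fn
    return register

@handler('lights.on')  # hypothetical action name for illustration
def lights_on(args):
    return f"turned on {args.get('room', 'the')} lights"

def route(llm_output):
    """Dispatch a JSON action if the LLM emitted one; else pass text to TTS."""
    try:
        action = json.loads(llm_output)
    except json.JSONDecodeError:
        return llm_output  # free text: speak it as-is
    if not isinstance(action, dict):
        return llm_output
    fn = HANDLERS.get(action.get('action'))
    return fn(action.get('args', {})) if fn else 'Sorry, I cannot do that yet.'
```

Structured output like this is easier to validate and audit than free-text command parsing, which matters when the router controls real devices.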

Optimization checklist

  • Quantize models to GGUF 4-bit (or use AWQ) to lower memory.
  • Use NPU acceleration via ONNX if your vendor SDK supports it.
  • Cache embeddings for repeated prompts to save CPU.
  • Limit context length for typical assistant actions (256–1024 tokens).

Step 5 — Text-to-speech and local actions

For TTS, pick Coqui TTS or pre-quantized VITS models for natural-sounding speech with low latency.

# Example invoking Coqui TTS
import subprocess
from TTS.api import TTS
tts = TTS('tts_models/en/ljspeech/tacotron2-DDC')
tts.tts_to_file(text='Lights on.', file_path='response.wav')
# Play via aplay
subprocess.run(['aplay', 'response.wav'])

Step 6 — Secure, repeatable deployment and CI/CD

A production assistant needs automated builds and safe deployment. Use the following strategy:

  1. Store model pointers (not large files) in git; produce artifacts in CI.
  2. Use GitHub Actions to build a Docker image that contains the runtime and your app.
  3. Push images to a private registry or deploy via balena (for fleet management) or rsync+systemd for single devices.

Example GitHub Action (build + deploy via SSH)

name: Build and deploy
on: [push]
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Build image
        run: |
          docker build -t myassistant:$GITHUB_SHA .
      - name: Save tar
        run: docker save myassistant:$GITHUB_SHA -o image.tar
      - name: Deploy to Pi
        uses: appleboy/scp-action@master
        with:
          host: ${{ secrets.PI_HOST }}
          username: pi
          key: ${{ secrets.PI_SSH_KEY }}
          source: image.tar
          target: /home/pi/
      - name: SSH load and run
        uses: appleboy/ssh-action@master
        with:
          host: ${{ secrets.PI_HOST }}
          username: pi
          key: ${{ secrets.PI_SSH_KEY }}
          script: |
            docker load -i /home/pi/image.tar
            docker stop myassistant || true
            docker rm myassistant || true
            docker run -d --restart unless-stopped --name myassistant --device /dev/snd myassistant:${{ github.sha }}

Permissions and security

  • Use SSH key auth with restricted user and forced command if possible.
  • Run the assistant inside a minimally privileged container and enable seccomp/AppArmor.
  • Limit network egress for the device unless you explicitly need hybrid cloud fallback.
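
As one way to apply these rules, a hardened Compose definition might look like the following sketch. Service and network names are illustrative, and `internal: true` blocks all egress, so add a second, egress-capable network only if you enable the hybrid cloud fallback:

```yaml
# docker-compose.yml sketch -- names and values are illustrative
services:
  myassistant:
    image: myassistant:latest
    restart: unless-stopped
    devices:
      - /dev/snd              # audio only; no other host devices
    cap_drop:
      - ALL                   # drop all capabilities by default
    security_opt:
      - no-new-privileges:true
    read_only: true           # immutable rootfs; state goes to volumes
    tmpfs:
      - /tmp
    networks:
      - assistant-net

networks:
  assistant-net:
    internal: true            # no direct egress from the container
```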

Edge/Cloud hybrid: when to fall back to a cloud LLM

Local-first doesn't mean cloud never. For long-form generation or heavy personalization you can implement a controlled fallback to a cloud LLM (for example, a hosted Gemini-like service) only when:

  • User opts-in during setup.
  • Confidence score from local LLM is low.
  • Device has strong network policy in place (TLS, mTLS).

Industry note: major vendors are adopting hybrid models; Apple and Google's collaborations and the rise of on-device personalization show that practical assistants will use both edge and cloud intelligently (see coverage from 2024–2026).
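
The fallback conditions above reduce to a small gate. The confidence threshold and the callable signatures below are assumptions for illustration; wire in your own local runner and cloud client:

```python
def answer(query, local_llm, cloud_llm=None, opted_in=False, threshold=0.6):
    """Prefer the local model; escalate only when the user opted in,
    a cloud backend is configured, and local confidence is low.

    `local_llm` returns (text, confidence); `cloud_llm` returns text.
    The 0.6 threshold is an arbitrary starting point to tune per deployment.
    """
    text, confidence = local_llm(query)
    if confidence >= threshold or not (opted_in and cloud_llm):
        return text, 'local'
    return cloud_llm(query), 'cloud'
```

Logging which branch served each request also gives you a free metric for how often the local model is actually sufficient.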

Monitoring, metrics, and debugging

Track the following metrics locally and centrally (if you run multiple devices):

  • Wake-word false positive / negative rates
  • Average STT latency and word-error-rate (WER)
  • LLM inference latency and memory usage
  • System health: CPU, NPU, temperatures

Expose a secure /metrics endpoint (Prometheus) for fleet monitoring or push lightweight telemetry that doesn’t include PII.
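
prometheus_client is the usual library for this; to show what the endpoint actually returns, here is a stdlib-only sketch of the Prometheus text exposition format (the metric names are illustrative):

```python
from http.server import BaseHTTPRequestHandler

METRICS = {
    'wake_word_detections_total': 0,   # illustrative metric names
    'stt_latency_seconds': 0.0,
}

def render_metrics(metrics):
    """Serialize a flat dict into Prometheus text exposition format."""
    return ''.join(f'{name} {value}\n' for name, value in metrics.items())

class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != '/metrics':
            self.send_error(404)
            return
        body = render_metrics(METRICS).encode()
        self.send_response(200)
        self.send_header('Content-Type', 'text/plain; version=0.0.4')
        self.send_header('Content-Length', str(len(body)))
        self.end_headers()
        self.wfile.write(body)
```

Serve it with `http.server.HTTPServer` bound to the LAN interface only, or use prometheus_client's `start_http_server` in production.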

Privacy and data handling

  • Keep raw audio local unless the user explicitly consents.
  • Provide a clear UI to inspect and delete recordings and model prompts.
  • Offer an opt-in cloud backup for trained personalizations only.
  • Consider local differential privacy for telemetry aggregates.

Performance tuning checklist for Pi 5 + AI HAT+ 2

  • Run the LLM on the NPU when possible (ONNX + vendor runtime).
  • Use 4-bit quantized GGUF or AWQ where quality is acceptable.
  • Cache model artifacts in fast NVMe storage.
  • Batch TTS segments and reuse synthesized buffers for repeated phrases.
  • Use webrtcvad-led endpointing to reduce STT calls.

Advanced: Fine-tuning and personalization (on-device)

As of 2026, lightweight personalization with adapters and LoRA-style updates is feasible to keep private user preferences on-device. Strategy:

  1. Keep base model immutable and store small adapter files per user.
  2. Apply adapters at runtime (adapter fusion) to bias responses without re-training base weights.
  3. Persist adapters encrypted on disk and rotate keys with user credentials.
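
The adapter-fusion step boils down to adding a low-rank update to frozen base weights. A minimal numeric sketch of the LoRA rule, W_eff = W + scale * (B @ A), using plain lists (real runtimes apply this with tensor libraries):

```python
def matmul(X, Y):
    """Naive matrix multiply, adequate for the small adapter matrices here."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)]
            for row in X]

def apply_adapter(W, A, B, scale=1.0):
    """Effective weight = W + scale * (B @ A): the LoRA update rule.

    W is (out x in); A is (r x in) and B is (out x r) with small rank r,
    so the per-user adapter stores far fewer values than W itself.
    """
    delta = matmul(B, A)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]
```

Because the base weights never change, one immutable model file can serve every user, with each user's personalization confined to a tiny encrypted adapter.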

Common failure modes and how to fix them

  • High latency: check whether inference is running on CPU instead of NPU; confirm vendor runtime is used.
  • Wake-word misses: tune microphone placement and use beamforming mics or increase sensitivity carefully.
  • STT errors for accents or background noise: use dedicated denoising models or alternate STT models optimized for noisy audio.

The road ahead

Edge NPUs are becoming more capable, and their runtimes more standardized (ONNX plus vendor bridges). Quantization improvements and efficient transformer architectures mean that assistants with useful context and natural responses will increasingly run offline. Hybrid strategies (local for privacy and low latency; cloud for heavy tasks) will dominate enterprise deployments. Starting now gives you:

  • Operational experience with model packaging/quantization best practices.
  • A privacy-first product that differentiates you from cloud-only assistants.
  • The ability to deploy controlled fallbacks to cloud models like Gemini where permitted.

Summary & actionable checklist

In short: you can build a production-grade private voice assistant on Raspberry Pi 5 + AI HAT+ 2 by combining on-device wake-word detection, VAD, whisper.cpp or Coqui for STT, a quantized GGUF/ONNX LLM accelerated by the HAT’s NPU, and Coqui TTS for audio output. Automate deployment with GitHub Actions and protect privacy by keeping raw audio local.

  1. Buy hardware: Pi 5, AI HAT+ 2, mic array, speakers.
  2. Install 64-bit OS and vendor SDK; verify NPU runs samples.
  3. Implement wake-word + VAD to gate STT calls.
  4. Set up whisper.cpp or Coqui for STT with quantized models.
  5. Run LLM inference via llama.cpp or ONNXRuntime using the NPU.
  6. Integrate TTS and action router (MQTT / Home Assistant).
  7. Automate builds and secure deployments with GitHub Actions or balena.

Call to action

If you want a ready-to-run starter repo, CI templates, and tested model packs for Raspberry Pi 5 + AI HAT+ 2 (including wake-word, whisper.cpp, a quantized GGUF assistant, and Coqui TTS), clone our sample project, open an issue with your hardware details, and join the community discussion. Start with one device, measure wake-word and STT metrics, and iterate — local-first assistants are now practical, private, and production-ready.
