Building a privacy-focused home assistant
How I set up a local voice assistant for my home
Talking to an assistant to control devices in your home can be very convenient: adding things to the shopping list, setting timers, playing music, turning lights on and off, and even more complex tasks depending on how ‘smart’ your home is. Smart home speakers like Google’s Nest (formerly Google Home) or Amazon Alexa have been available for years, are easy to set up and are likely to become even more capable as they integrate the latest LLMs. The big concern with them, however, is privacy: many people are uncomfortable having an always-listening device in their house, and while the companies assure us that these devices only activate upon hearing the wake word and do not record conversations, that promise hinges on trust, on the device working properly, and on its security not being compromised.
In this article I describe an alternative approach: instead of buying a smart speaker, let’s build one that runs locally and is fully under your control. We’ll use Home Assistant, an Android tablet and some locally running services to build a voice-activated assistant for your home. This is, of course, more complex than simply buying and setting up a smart speaker, but in exchange you gain privacy and control, and you can unlock more powerful automations.
The Basics
Underlying this approach is the excellent Home Assistant, the free and open-source home automation system. Many of you are probably already aware of Home Assistant, but for those who are not, it’s a platform that acts as a central controller for your smart home: you can integrate your lights, switches, TVs and much more and can build complex automations around your devices (it really has so many official integrations, and if a device isn’t officially supported you can likely find a custom integration built by the community).
Home Assistant has built-in support for a basic text-only assistant, but you can also configure your own locally running voice pipeline, and it’s this feature we will leverage. To do that, you need to host, at a minimum, a Speech-to-Text (STT) and a Text-to-Speech (TTS) model. For the ‘brain’ of the assistant, HA includes an interpreter which is good enough for basic commands (and is very fast), but if you want additional smarts, I recommend also hosting a (small) Large Language Model (LLM).
The setup:
Home Assistant Deployment: you can install HA on your own home server or Raspberry Pi or you can purchase it pre-installed on dedicated hardware like the Home Assistant Green. For the purposes of this article, we assume you already have an instance of Home Assistant up and running.
A home server: running Linux or Windows, to host the model deployments. Alternatively, you could use one or more Raspberry Pis to run the needed models, but you’ll have to check the requirements of each model. If you’re already running Home Assistant on a home server you can use that, but a Home Assistant Green or a Raspberry Pi likely isn’t powerful enough to run the whole voice pipeline efficiently alongside HA, so you’ll need extra hardware. As for hardware specs: low-powered systems work pretty well and have the advantage of low power consumption, which matters for an always-on machine. I used a mini-PC with an Intel N100 CPU and 16 GB of RAM, but of course the more powerful the hardware, the faster and/or better models you can use.
An always-on device with a speaker and microphone: in this example I use a relatively cheap Android tablet, but you could also use an ESPHome-powered device like the ATOM Echo if you don’t also want a smart display.
How it all works:
The tablet runs an app that does wake word detection. When you speak the wake word (e.g. ‘Hey Jarvis’), it will record your command and send it to the Home Assistant deployment
HA will send the recorded audio to the STT model and await the text result
HA will then formulate a prompt based on the text and send it to the LLM
The LLM will come up with a response, in a JSON format that HA can understand
HA will do the action (e.g. turn on the lights) and then send the response to the TTS module
HA then sends the audio response to the tablet which will play it back
As a speed improvement, we will also configure HA to first try to interpret the command with its built-in interpreter, and only if that fails send the command to the LLM. For basic commands this is much faster than the LLM.
Voice Processing
I used the open-source https://speaches.ai/ to run the STT and TTS models. It’s an OpenAI API-compatible server for running transcription, translation and speech generation, supporting a large number of models. Once deployed, it has a helpful UI where you can test your configuration, and it provides an endpoint to list all its models. For STT I chose the Systran/faster-distil-whisper-small.en model, while for TTS I used speaches-ai/Kokoro-82M-v1.0-ONNX-int8. Depending on your hardware, you may choose more or less powerful models. On my server, the TTS model runs in about 5-6 seconds, which, while a bit slow, is sufficient for my needs. To easily integrate speaches with HA, I placed it behind a proxy which translates the OpenAI API to the Wyoming protocol: https://github.com/roryeckel/wyoming_openai. You’ll see later how we integrate this with HA. I installed everything via Docker:
docker run -d \
--name speaches \
--restart unless-stopped \
--network host \
-e UVICORN_PORT=<your_desired_port> \
-e STT_MODEL_TTL=-1 \
-e TTS_MODEL_TTL=-1 \
-e PRELOAD_MODELS='["speaches-ai/Kokoro-82M-v1.0-ONNX-int8","Systran/faster-distil-whisper-small.en"]' \
-e WHISPER__COMPUTE_TYPE=int8 \
-e WHISPER__BEAM_SIZE=1 \
-e WHISPER__LANG=en \
-e WHISPER__DEVICE=cpu \
-e WHISPER__TTL=-1 \
-e OMP_NUM_THREADS=2 \
-v "$(pwd)/speaches/speaches_cache:/root/.cache/huggingface" \
ghcr.io/speaches-ai/speaches:latest-cpu
Some explanations:
STT_MODEL_TTL=-1 / TTS_MODEL_TTL=-1 – prevents the models from being offloaded; however, this seems to be buggy in the latest speaches version
WHISPER__TTL=-1 – the same, but specifically for the Whisper STT model; this one appears to work
PRELOAD_MODELS – I could not get this to work, but in theory it should load the models into memory as soon as the container starts
WHISPER__COMPUTE_TYPE=int8 – int8 quantization for faster processing on my Intel N100
WHISPER__BEAM_SIZE=1 – trades a little recognition accuracy for speed
OMP_NUM_THREADS=2 – on the Intel N100, using 2 threads rather than all 4 is faster
-v "$(pwd)/speaches/speaches_cache:/root/.cache/huggingface" – mounts the model cache folder outside the container so downloads can be reused
Initially speaches has no models installed; to install them, make these two requests:
curl -X POST http://localhost:<port>/v1/models/speaches-ai/Kokoro-82M-v1.0-ONNX-int8
curl -X POST http://localhost:<port>/v1/models/Systran/faster-distil-whisper-small.en
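To confirm that both models are installed and that speech generation works, you can hit the server’s OpenAI-style endpoints directly. This is just a quick sanity check, assuming speaches follows the standard OpenAI paths for listing models and generating speech (af_heart is one of the Kokoro voices listed further below):
curl http://localhost:<port>/v1/models

# Optional: generate a short test clip to verify TTS end to end
curl -X POST http://localhost:<port>/v1/audio/speech \
-H "Content-Type: application/json" \
-d '{"model": "speaches-ai/Kokoro-82M-v1.0-ONNX-int8", "voice": "af_heart", "input": "Hello from speaches", "response_format": "mp3"}' \
-o test_speech.mp3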
To set up the Wyoming proxy:
docker run -d \
--name="wyoming-proxy" \
--restart unless-stopped \
--network host \
-e TTS_OPENAI_URL="http://localhost:<speaches-port>/v1" \
-e STT_OPENAI_URL="http://localhost:<speaches-port>/v1" \
-e TTS_STREAMING_MODELS="speaches-ai/Kokoro-82M-v1.0-ONNX-int8" \
-e STT_MODELS="Systran/faster-distil-whisper-small.en" \
-e TTS_VOICES="af_heart af_alloy af_aoede af_bella af_jessica af_kore af_nicole af_nova af_river af_sarah af_sky am_adam am_echo am_eric am_fenrir am_liam am_michael am_onyx am_puck am_santa bf_alice bf_emma bf_isabella bf_lily bm_daniel bm_fable bm_george bm_lewis" \
-e WYOMING_URI="tcp://0.0.0.0:<proxy port>" \
-e WYOMING_LANGUAGES="en" \
ghcr.io/roryeckel/wyoming_openai:latest
Some explanations:
STT/TTS_OPENAI_URL – URL of the speaches deployment
TTS_STREAMING_MODELS – registers the TTS model as a streaming model, i.e. it converts text in chunks and returns partial audio instead of waiting for the full response
STT_MODELS – the STT model
TTS_VOICES – the proxy requires a list of all voices to populate a drop-down in HA; I pre-selected some Kokoro voices I liked
Now we have an endpoint at ‘tcp://0.0.0.0:<proxy port>’ which exposes our STT and TTS models through the Wyoming protocol.
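Since Wyoming is a raw TCP protocol there is no simple curl test for the proxy itself, but before adding it to HA you can at least confirm that it is listening and that it connected to speaches without errors:
# Check that the proxy port is open
nc -zv localhost <proxy port>

# Check the proxy logs for connection errors
docker logs wyoming-proxy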
The Brain
The challenge here is to run an LLM that works well with Home Assistant commands and that can run on the available hardware. I ultimately ended up using acon96/Home-Llama-3.2-3B, which is a version of the Llama 3.2 3B model fine-tuned for Home Assistant. It’s not the smartest when it comes to general reasoning, but it’s faster than the regular Llama model and it works reasonably well on my server. Of course, if you have faster hardware available you can run more complex models. I hosted this model in Ollama, which serves the model over an HTTP API on a preconfigured port (11434 by default).
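As a rough sketch of the Ollama side, assuming you run it in Docker: the Hugging Face tag below is a placeholder, so replace it with wherever you get a GGUF build of the model from (alternatively, Ollama can build a model from a local GGUF file via a Modelfile). The final curl only verifies that Ollama serves the model; the actual prompting is handled by the HA integration described later.
docker run -d \
--name ollama \
--restart unless-stopped \
-p 11434:11434 \
-v "$(pwd)/ollama:/root/.ollama" \
ollama/ollama:latest

# Pull the model (placeholder tag, adjust to the GGUF build you use)
docker exec -it ollama ollama pull hf.co/acon96/Home-Llama-3.2-3B-GGUF

# Quick sanity check that the model responds
curl http://localhost:11434/api/generate \
-d '{"model": "hf.co/acon96/Home-Llama-3.2-3B-GGUF", "prompt": "Turn on the kitchen lights", "stream": false}'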
The Terminal
For the actual device that you speak to, I used an Android tablet with Android Voice Assistant installed: this is an open-source app that makes your Android device listen continuously for a configurable list of wake words (e.g. ‘Hey Jarvis’), uses the ESPHome protocol and is easy to integrate with Home Assistant. While not necessary for the voice part, I also installed the Home Assistant Companion App and enabled “Fullscreen” mode and “Keep screen on” in its settings. This turns the tablet into an always-on dashboard for my home that listens for and relays voice commands.
Tying It All Together
Now, to put everything together into a working voice assistant. First, all the parts must be integrated into Home Assistant:
STT and TTS: HA>Settings>Integrations>Add integration>Wyoming Protocol. This should automatically detect the endpoint you created on your local network, but if it does not, you can manually add the address and port of the proxy. The integration should create two entities: one for STT and one for TTS.
LLM: while you can use the official Ollama integration, I recommend using the Local LLMs custom integration which you can install through HACS (Home Assistant Community Store), as it’s optimized for the model we are using. Configure it to point to the address of the Ollama instance you deployed previously.
Next, under Settings>Voice assistants>Add assistant, create a new assistant:
Name it
Select your LLM as a conversation agent
I recommend enabling “Prefer handling commands locally” (i.e. let the built-in interpreter try first), since this processes simple commands like “Turn on the kitchen lights” much faster
Speech-to-text: select the STT Wyoming entity
Text-to-speech: select the TTS Wyoming entity, and select a voice from the drop-down
Results
Now set this assistant as your default, and you’re done! Try saying ‘Hey Jarvis, turn on the kitchen lights’ to the tablet and it should all work. You now have a private, local voice assistant for your home: your voice recordings never leave your local network, and you have full control over how your assistant works.
Compared to the Google Nest Hub (2nd gen) I used previously, the new setup works pretty well overall:
The wake word detection is very accurate
STT seems more accurate
Direct commands like: turn on the TV, turn off the kitchen lights etc. work well
It is, however, a bit slower, especially if the LLM handles the command instead of the interpreter. Processing a command takes 5-10 seconds, and an additional few seconds to respond.
The LLM is not the most accurate; for example, if the STT hears “AND milk to the shopping list” instead of “ADD milk to the shopping list”, it fails to infer what I meant.
All in all, for me the gains in privacy make up for the drawbacks of this approach, and I will continue exploring ways to make it faster. Lastly, this article only shows how I chose to approach the problem; there are many different ways to deploy, run and integrate your models. Feel free to experiment and let me know your results!
Tips
HA sends the LLM a prompt like this: ‘You are a helpful assistant, here are all the devices you are aware of: <entity_list> […] <your prompt>’. This means that if you expose many entities to your LLM, it might get slow. I recommend limiting the entities exposed (Settings>Voice assistants>Expose tab) to just the ones you will use; there is no point in exposing a sensor’s battery level if you don’t need it.
Another way to speed up the LLM is to limit the context: you can lower the number of previous commands it remembers in the Local LLMs integration configuration.
Of course you can use other STT, TTS or LLM models and host them in different ways; I chose the specific models mentioned here because they maximize accuracy and performance on my low-powered hardware.
Instead of a tablet or Android device, you can use a cheap ESPHome module like the ATOM Echo. It integrates easily with HA, and you can place multiple devices in different rooms. I’ve heard that the microphone and speaker quality is not great, though.
If you don’t want to go through all the effort of setting this up but also don’t want to use Google’s or Amazon’s services, you could subscribe to Home Assistant Cloud which provides cloud-based STT and TTS, and you can configure your voice assistant to use it. This however means your audio is no longer processed locally.
For more information, see Home Assistant’s documentation on its voice pipeline.