A while back I played around with the HASS Voice Assistant, and pretty easily got to a point where STT and TTS were working really well on my local installation. I also got the hardware to build Wyoming satellites with wake-word recognition.
However, what kept me from going through the effort of setting everything up properly (and finally getting fucking Alexa out of my house) was the "all or nothing" approach HASS seemingly takes to intent recognition. You either:
- use the built-in Assist conversation agent, which is a pain in the ass because it matches what your STT recognized 1:1, letter for letter, so it's almost impossible to get it to do anything unless you spoke perfectly (and forget, for example, about putting something on your to-do list; "Todo", "todo", "To-Do", … are all not recognized, and have fun getting your STT to reliably generate the one spelling it expects!), or
- you slap a full-blown LLM behind it, which forces you either to rely on a shitty company again or to host the LLM locally; but even in the latter case, on decent hardware (no H100, of course, but at least a GPU), the results were slow and shit, and due to context-size limits you can forget about exposing all your entities to the LLM agent.
- You can also combine the two approaches: match exactly first and, if no intent is recognized, forward to the LLM. In practice, though, that just means you sometimes get what you wanted ("all lights off" works maybe 70% of the time, I'd say), and the rest of the time you still wait ages for an LLM response that may be correct but often isn't.
What I'd like is a third option: fuzzy matching on what the STT generated. There seem to have been multiple options for that through Rhasspy, but that project appears to be dead? Its HASS integration has not been updated in over four years, and the Rhasspy repos were archived earlier this month.
Besides, it was not entirely clear to me whether you could use just the intent-recognition part of the project, forgoing the rest in favor of what HASS already brings to the table.
At this point I am willing to implement a custom conversation agent, but I wanted to make sure first that I haven't simply missed an obvious setting/addon/… for HASS.
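For context, the rough skeleton I have in mind looks something like this: a minimal sketch assuming HASS's AbstractConversationAgent API (which may have shifted in recent releases), with the class name and the fuzzy-matching step as placeholders.

```python
from homeassistant.components import conversation
from homeassistant.helpers import intent


class FuzzyConversationAgent(conversation.AbstractConversationAgent):
    """Conversation agent that fuzzy-matches STT output instead of exact matching."""

    def __init__(self, hass):
        self.hass = hass

    @property
    def supported_languages(self) -> list[str]:
        return ["en"]

    async def async_process(
        self, user_input: conversation.ConversationInput
    ) -> conversation.ConversationResult:
        text = user_input.text
        # Placeholder: fuzzy-match `text` against exposed entity names and
        # intent keywords here, then call the corresponding service.

        response = intent.IntentResponse(language=user_input.language)
        response.async_set_speech(f"I heard: {text}")
        return conversation.ConversationResult(
            response=response,
            conversation_id=user_input.conversation_id,
        )


# Registered from the custom integration's async_setup_entry, something like:
# conversation.async_set_agent(hass, entry, FuzzyConversationAgent(hass))
```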
My questions are:
- are you using the HASS Voice Assistant without an LLM?
- if so, how do you get your intents to be recognized reliably?
- do you know of any setting/project/addon helping with that?
Cheers! Have a good start to the work week!
I don't have as much experience with HASS, but I did use Mycroft for quite a while (I only stopped because of multiple big moves, after which I ended up in a place small enough that voice control didn't really make sense any more). There were a few intent parsers used with, or made for, it:
https://github.com/MycroftAI/adapt
https://github.com/MycroftAI/padatious
https://github.com/MycroftAI/padaos
In my experience, Adapt was far and away the most reliable. If you go the route of rolling your own solution, I'd recommend checking it out and designing your intents with the absolute minimum number of words. For example, require "off" and an entity, and nothing else, so that "AC off," "turn off the AC," and "turn the AC off" all work. This reduces the number of words your STT has to transcribe correctly and allows flexibility in command phrasing.
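To make that concrete, here's a rough sketch of what that looks like with Adapt, going by its README-style API (the package is adapt-parser on PyPI, if I remember right); the device names are just examples.

```python
from adapt.intent import IntentBuilder
from adapt.engine import IntentDeterminationEngine

engine = IntentDeterminationEngine()

# Register the one keyword and the entity names you care about (lowercased)
engine.register_entity("off", "OffKeyword")
for name in ["ac", "kitchen light", "living room light"]:  # example devices
    engine.register_entity(name, "Device")

# Require only "off" plus a device; word order and filler words don't matter
turn_off = (
    IntentBuilder("TurnOffIntent")
    .require("OffKeyword")
    .require("Device")
    .build()
)
engine.register_intent_parser(turn_off)

# "ac off", "turn off the ac", and "turn the ac off" all resolve the same way
for result in engine.determine_intent("turn the ac off"):
    if result.get("confidence", 0) > 0:
        print(result["intent_type"], result["Device"])
```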
To borrow a little more from Mycroft: it had "fallback" skills that were triggered when an intent couldn't be matched. You could use the same idea and use https://github.com/seatgeek/thefuzz to fuzzy-match entities and keywords, to handle the remaining cases where STT fails. I believe that is what this community-made skill attempted to do: https://github.com/MycroftAI/skill-homeassistant (I think there was more than one HASS skill implementation, so I could be conflating this with another).
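The fuzzy-matching part is pretty painless with thefuzz; a quick sketch, where the entity list and the threshold are made up and you'd tune both for your setup:

```python
from thefuzz import fuzz, process

# Names as exposed to your assistant (just examples)
entities = ["to-do list", "shopping list", "kitchen light", "living room ac"]

# Something your STT might actually produce
heard = "add milk to my todo list"

# token_set_ratio tolerates extra words and small spelling differences
match, score = process.extractOne(heard, entities, scorer=fuzz.token_set_ratio)

if score >= 80:  # threshold is a judgment call
    print(f"matched {match!r} with score {score}")
else:
    print("no confident match, fall through to the next handler")
```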
Another comment mentioned OVOS/Neon - those forked off of Mycroft, so you may see overlap if you investigate those as well.