Some of the problem is not the A.I. technology but the model of the user interface. The model of Alexa is
[voice input] -> [action]
but to do more complex interactions it has to be more like
[input 1] -> [ask for clarification] -> [input 2] -> [propose action, ask for confirmation] -> [action]
From the viewpoint of developing a training set, this is exponentially more complicated. Maybe it is like playing chess or learning Atari games by reinforcement learning, but those games involve a small number of elements that combine in complex ways, whereas a conversation involves a space of millions of concepts (think of "20 questions").
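To make the difference between the two interface models concrete, here is a minimal sketch of the multi-turn flow as a tiny state machine. Everything in it (the `Dialog` class, the `State` names, the canned prompts, the toothpaste example) is made up for illustration and is not how Alexa actually works; the point is only that the one-shot model has no state at all, while the multi-turn model has to carry context from turn to turn.

```python
from enum import Enum, auto


class State(Enum):
    AWAIT_REQUEST = auto()        # waiting for [input 1]
    AWAIT_CLARIFICATION = auto()  # asked a question, waiting for [input 2]
    AWAIT_CONFIRMATION = auto()   # proposed an action, waiting for yes/no
    DONE = auto()                 # action performed or cancelled


class Dialog:
    """Hypothetical multi-turn dialog: request -> clarify -> confirm -> act."""

    def __init__(self):
        self.state = State.AWAIT_REQUEST
        self.request = None    # what the user asked for
        self.detail = None     # the clarifying detail
        self.performed = None  # the action actually taken, if any

    def step(self, utterance: str) -> str:
        text = utterance.strip().lower()
        if self.state is State.AWAIT_REQUEST:
            self.request = text
            self.state = State.AWAIT_CLARIFICATION
            return f"Which kind of {self.request}?"
        if self.state is State.AWAIT_CLARIFICATION:
            self.detail = text
            self.state = State.AWAIT_CONFIRMATION
            return f"Order {self.detail} {self.request}?"
        if self.state is State.AWAIT_CONFIRMATION:
            self.state = State.DONE
            if text in ("yes", "ok", "sure"):
                self.performed = f"{self.detail} {self.request}"
                return f"Done: ordered {self.performed}."
            return "Cancelled."
        return "Conversation is over."


# One possible conversation through the state machine.
d = Dialog()
for said in ["toothpaste", "mint", "yes"]:
    print(said, "->", d.step(said))
```

Even this toy version needs three turns of context, and every one of those turns is a place where a real system has to interpret open-ended speech correctly.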
It's not so much that they need any specific feature ("buy toothpaste") as that they need some way to recover the cost, for instance users being more tied to the Amazon ecosystem, more likely to subscribe to Prime, etc.
It's not easy though. With roughly 70 million users they would need to generate about $150 a user in profit to cover $10 billion. At a 2% margin they'd need to sell about $7,500 worth of stuff per user that they wouldn't otherwise sell.
What I don't get is where the $10 billion went: how much of it is subsidizing the device? (If they sold 10 million devices last year it would be a $1,000 subsidy per device, which is incredible.) Are they paying AWS prices to run the backend service? How much are they spending on software development, model training, etc.?
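For anyone who wants to check the back-of-envelope numbers, here is the arithmetic spelled out. The inputs ($10 billion, 70 million users, 2% margin, 10 million devices) are the rough figures from this comment, not audited data, and the function names are just for illustration.

```python
def profit_needed_per_user(total_cost: float, users: int) -> float:
    """Profit each user must generate to cover the total cost."""
    return total_cost / users


def sales_needed(profit: float, margin: float) -> float:
    """Extra sales needed to earn a given profit at a given margin."""
    return profit / margin


TOTAL_COST = 10_000_000_000  # $10 billion (rough figure)
USERS = 70_000_000           # roughly 70 million users
MARGIN = 0.02                # 2% retail margin (assumed)
DEVICES = 10_000_000         # hypothetical devices sold in a year

profit_per_user = profit_needed_per_user(TOTAL_COST, USERS)
sales_per_user = sales_needed(profit_per_user, MARGIN)
subsidy_per_device = TOTAL_COST / DEVICES

print(f"profit needed per user: ${profit_per_user:,.0f}")    # about $143
print(f"extra sales per user:   ${sales_per_user:,.0f}")     # about $7,143
print(f"subsidy per device:     ${subsidy_per_device:,.0f}")  # $1,000
```

The exact figures come out a little under the rounded $150 and $7,500 above, but the order of magnitude is the same.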
My biggest problem with the flow nowadays is that instead of getting briefer and simpler as you and the device learn each other better, it stays just as onerous all the way through--and, in fact, becomes more so as they introduce more "By the way, you can...[new feature discovery I did NOT want forced down my throat when trying to reset my alarm while half asleep]..." asides.
It's like every lesson about not disrupting GUI flow with modal banners, and about creating shortcuts and other accelerated paths for advanced users was forgotten in VUI. At best you get a one-size-fits-all brief mode that usually isn't very brief, and doesn't reward learning so doesn't increase engagement.
VUI vendors also plainly have no clue how to enable discovery without either fucking up your home screen with some ugly "things you can do" list or being annoying about volunteering stuff when responding to your commands. It's as if they might need to *gasp* document their features instead because voice devices aren't web apps you can just fiddle with.
The net result is I actually get way more frustrated with these devices now that I'm super-experienced with the tech, not less. That's the wrong direction.
That anti-pattern of interrupting you to tell you about a new feature is common and it's the worst. All the time I feel like I am behind the 8-ball and I really need to finish some small task right now but Firefox, Photoshop and seemingly every other product wants to get in my way of doing the task.
It's a deep problem, though, matching a user's expectations of a VUI with what it can actually do. Alexa's greatest triumph is that it had a domain that it seemed pretty complete over, and that people could mostly "get" what it was capable of doing.
And in those more complex interactions there is not enough adjustment capability. I often want Alexa to speak faster or just shorter (fewer filler words and pauses). There's no way (that I am aware of) to make it more efficient for the determined user.