I believe the major late-90s innovation in speech-to-text was considering arbitrary chains of sounds as whole phrases, rather than recognising words individually. Hence you tend to get coherent (if contextually wrong) phrases rather than jumbles of homophones, and, just like a human listener, it tends to fail at the key nouns that provide the context.
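Roughly, and purely as a toy sketch of that idea (the bigram probabilities and acoustic scores below are made-up numbers, not from any real engine): the decoder scores whole word sequences with a language model on top of the acoustic evidence, which is what lets "ice cream" beat "I scream" even when they sound the same.

```python
# Toy illustration: sequence-level language-model scoring prefers coherent
# phrases over jumbles of homophones. All numbers are invented for the example.
import math

# Hypothetical bigram log-probabilities: log P(word | previous word).
BIGRAM_LOGP = {
    ("<s>", "ice"): math.log(0.10),
    ("ice", "cream"): math.log(0.20),
    ("<s>", "i"): math.log(0.30),
    ("i", "scream"): math.log(0.001),
}
UNSEEN_LOGP = math.log(1e-6)  # crude back-off for bigrams we never saw

def lm_logprob(words):
    """Score a whole word sequence with the toy bigram model."""
    total, prev = 0.0, "<s>"
    for w in words:
        total += BIGRAM_LOGP.get((prev, w), UNSEEN_LOGP)
        prev = w
    return total

# Two homophone hypotheses with (assumed) nearly identical acoustic scores.
hypotheses = [
    (["ice", "cream"], -5.0),   # (words, acoustic log-likelihood)
    (["i", "scream"], -4.9),
]

# Combined score: acoustic evidence plus sequence-level language model.
best = max(hypotheses, key=lambda h: h[1] + lm_logprob(h[0]))
print("chosen:", " ".join(best[0]))  # the LM tips the balance toward "ice cream"
```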
The more recent breakthrough has been deep-learning based, which is basically a black box of computational witchcraft. I think at some point there was a shift from spelling out sounds and then correcting them with a language model to somehow (!) going straight from sounds to correctly spelled language. That greatly reduced the memory requirements, which were getting unwieldy.
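My (possibly wrong) mental model of that end-to-end step is something like CTC-style decoding: the network emits a per-frame distribution over characters plus a "blank", and decoding is just collapsing repeats and dropping blanks, with no separate pronunciation dictionary or correction stage. A toy sketch, with a made-up output array standing in for a real acoustic model:

```python
# Minimal CTC-style greedy decoding: per-frame character scores go straight
# to spelled-out text. The logits below are fabricated for illustration.
import numpy as np

ALPHABET = ["<blank>", "c", "a", "t", " "]

def ctc_greedy_decode(frame_logits):
    """Take the argmax symbol per frame, merge repeats, then drop blanks."""
    best = frame_logits.argmax(axis=1)
    collapsed = [best[0]] + [b for prev, b in zip(best, best[1:]) if b != prev]
    return "".join(ALPHABET[i] for i in collapsed if i != 0)

# Fake per-frame scores (frames x symbols), as if produced by a neural network.
logits = np.array([
    [0.1, 2.0, 0.0, 0.0, 0.0],   # 'c'
    [2.0, 0.1, 0.0, 0.0, 0.0],   # blank
    [0.0, 0.0, 2.0, 0.1, 0.0],   # 'a'
    [0.0, 0.0, 2.0, 0.1, 0.0],   # 'a' again (repeat, gets collapsed)
    [0.0, 0.1, 0.0, 2.0, 0.0],   # 't'
])
print(ctc_greedy_decode(logits))  # -> "cat"
```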
Also at some point we did away with the need for speaker-specific training.
I think I'm safe in assuming that however I suspect it works, the reality is a lot more complicated than that.