Author Topic: Text to Speech  (Read 1809 times)

tiermat

  • According to Jane, I'm a Unisex SpaceAdmin
Text to Speech
« on: 03 March, 2018, 05:13:53 pm »
Today is the WHO World Hearing Day.

This week I am presenting at work, on the subject of BBQ (it's learning week) and I want to recognise this day.  To that end I want to use some kind of speech to text to produce real time subtitles over my presentation.

So, massed minds of YACF, what are your suggestions?  Bonus points if it leverages Google's Speech API.

I am not giving my presentation until Friday, so have a bit of time to trial any suggestions.
I feel like Captain Kirk, on a brand new planet every day, a little like King Kong on top of the Empire State

Kim

  • Timelord
    • Fediverse
Re: Text to Speech
« Reply #1 on: 03 March, 2018, 05:44:49 pm »
I'd suggest that providing 'craptions' (as they're known in the deaf community, for reasons that should be readily apparent) goes counter to the objective of World Hearing Day, unless your message is that they're unfit for purpose[1] and people shouldn't consider them as much more than an aid to search engines and translators.

For realtime subtitling, an NRCPD registered speech-to-text reporter is the gold standard.  Barakta can elaborate, as this is her preferred means of accessing conferences, formal meetings, etc.  Unfortunately, it's such a specialised skill that it makes BSL interpreters look cheap.

If you want fit-for-purpose subtitles without costing the earth, you have to sacrifice the realtime requirement and either stick to a script (acting skills required) or subtitle a recording.  What craptions are actually brilliant for is doing the bulk of the work (particularly the timing): correcting them afterwards is usually much quicker and easier than creating subtitle data from scratch.



[1] while
the
accessory
is
usually
pretty
good
at
least
when
the
audio
is
clean
they
still
don't
do
puncture
nation
including
indicating
who
is
speaking
and
the
word
by
word
rendering
makes
them
hard
work
to
follow
especially
if
your
trying
to
follow
the
subject
matter
at
the
same
thyme
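
As an illustration of the correct-the-craptions workflow described above, here is a minimal Python sketch. It assumes the automatic captions are available as a SubRip (.srt) file; the sample cues, the corrections and the function name correct_srt are all made up for the example. The only point is that the machine-generated timings are kept and just the text is replaced.

```python
import re

# Hypothetical auto-generated captions in SubRip (.srt) form; the text is invented.
AUTO_SRT = """1
00:00:01,000 --> 00:00:03,500
welcome to my talk on bar be queue

2
00:00:03,500 --> 00:00:06,000
today is world hearing day
"""

# Human corrections keyed by cue number (also invented for illustration).
CORRECTIONS = {
    1: "Welcome to my talk on BBQ.",
    2: "Today is World Hearing Day.",
}


def correct_srt(srt_text, corrections):
    """Replace cue text where a correction exists; keep numbering and timings."""
    out = []
    # Cues in an .srt file are separated by blank lines.
    for block in re.split(r"\n\s*\n", srt_text.strip()):
        lines = block.splitlines()
        index, timing = int(lines[0]), lines[1]
        # Fall back to the original text if nobody corrected this cue.
        text = corrections.get(index, "\n".join(lines[2:]))
        out.append(f"{index}\n{timing}\n{text}")
    return "\n\n".join(out) + "\n"


fixed = correct_srt(AUTO_SRT, CORRECTIONS)
print(fixed)
```

In practice the corrections would come from someone editing the file in a subtitle editor rather than from a dictionary, but the division of labour is the same: the machine supplies the timing, the human supplies the words.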

barakta

  • Bastard lovechild of Yomiko Readman and Johnny 5
Re: Text to Speech
« Reply #2 on: 03 March, 2018, 06:03:16 pm »
I can't possibly improve on the inimitable Kim.

Realtime captioning needs humans using some kind of keyboard system to be effective enough for deaf and hard of hearing users.

In the UK the freelancers have an organisation and website at http://avsttr.org.uk/ where you can read about STTR and make enquiries which go out to most of the STTR operators in the UK.

My pet agency is 121 Captions, which is deaf-led and has a website at https://www.121captions.com/. Their prices are reasonable and they provide genuine practitioners.

You can get STTR either in-person or remote. Remote needs a decent Internet connection and a way of getting the room audio to the captioner.

Cudzoziemiec

  • Ride adventurously and stop for a brew.
Re: Text to Speech
« Reply #3 on: 03 March, 2018, 07:12:38 pm »
Do you want text to speech or speech to text? I remember we got speech to text working okay when I was in film subtitling, but I can't remember what the software was. Expect it was proprietary anyway. I do remember it had to be trained to get to know one individual's voice, and wouldn't necessarily then work with other people. In any case, this was many years ago now.

Text to speech I know even less about.
Riding a concrete path through the nebulous and chaotic future.

Kim

  • Timelord
    • Fediverse
Re: Text to Speech
« Reply #4 on: 03 March, 2018, 07:40:00 pm »
Quote from: Cudzoziemiec
Do you want text to speech or speech to text? I remember we got speech to text working okay when I was in film subtitling, but I can't remember what the software was. Expect it was proprietary anyway. I do remember it had to be trained to get to know one individual's voice, and wouldn't necessarily then work with other people. In any case, this was many years ago now.

That's still done for live subtitling of TV, where it's known as re-speaking.  Accuracy is better with one trained voice than with different voices, and of course a fully optimised microphone arrangement helps (no background noise, as there would be if you fed programme audio straight to the craptionbot).

The problem with re-speaking is that it's slower than palantypy[1], and that you typically need a pair of humans (one re-speaking, the other correcting, punctuating, marking speakers, etc) to do it in realtime.  That's why the subtitles on the news tend to throw away entire sentences (or more) from time to time in order to catch up.

The advantage is that the skill level required is much lower, so more people can do it, and it's a lot cheaper.

We can't be far off being able to do away with the re-speaker, and having a human correcting a craptionbot live.



[1] STTR operators use a phonetic keyboard to achieve the necessary speed.

Cudzoziemiec

  • Ride adventurously and stop for a brew.
Re: Text to Speech
« Reply #5 on: 03 March, 2018, 08:43:02 pm »
Oh, we had a whole team doing the correcting, editing, adjusting to any one of the various house styles required by different studios*, etc, but we weren't doing it in realtime.

*Sometimes the same film or show would go out in several versions for different studios/networks, with identical audio-visual content but slightly different subtitles cos of differing fonts, line lengths, timing rules, etc.

Kim

  • Timelord
    • Fediverse
Re: Text to Speech
« Reply #6 on: 03 March, 2018, 09:06:53 pm »
Yeah, doing a really good job is always going to be slower than realtime, at least until someone comes up with a really clever algorithm (and even then, I can't imagine a computer managing to do suspense or comic timing properly).

Beardy

  • Shedist
Re: Text to Speech
« Reply #7 on: 04 March, 2018, 12:53:16 am »
I knew a couple of guys who worked on both speech to text and text to speech in the days when BT did proper blue-sky research because it might prove useful or trigger something that proved useful. Now admittedly this was 20+ years ago, and I didn’t fully understand all the stuff we discussed, but basically back then ‘clever solutions’ using phonemes and the like proved too difficult, so they resorted to what the senior guy called a ‘proper engineering solution’, which I think involved lots of processing. You’d have thought that improvements would have been made in the intervening years but, as evidenced by any ‘live’ broadcast on the magic lantern box, they haven’t.
One thing I wish they’d not do is correct obvious mistakes. In doing so you miss large portions of the subsequent words while they faff around, and then miss another large portion because they just do.
For every complex problem in the world, there is a simple and easily understood solution that’s wrong.

Re: Text to Speech
« Reply #8 on: 04 March, 2018, 07:56:39 am »
Naturally Speaking will do speech to text with 95% or greater accuracy. I use it to dictate documents up to 25 pages long and expect no more than about 10-15 mistakes in total.

We have no idea if the output can be put into PowerPoint.

I think you could have the recognition box permanently moored and the text would appear in the box.


Beardy

  • Shedist
Re: Text to Speech
« Reply #9 on: 04 March, 2018, 09:54:00 am »
Quote from: Reply #8
Naturally Speaking will do speech to text with 95% or greater accuracy. I use it to dictate documents up to 25 pages long and expect no more than about 10-15 mistakes in total.

We have no idea if the output can be put into PowerPoint.

I think you could have the recognition box permanently moored and the text would appear in the box.


Even fully trained, Naturally Speaking failed dismally with my northern accent. I gave up on it.

Kim

  • Timelord
    • Fediverse
Re: Text to Speech
« Reply #10 on: 04 March, 2018, 03:24:05 pm »
Quote from: Reply #8
Naturally Speaking will do speech to text with 95% or greater accuracy. I use it to dictate documents up to 25 pages long and expect no more than about 10-15 mistakes in total.

Indeed.

The problem is that high-90s still isn't good enough accuracy for subtitling.  It almost inevitably fails on the important context-establishing words that the listening/lipreading subtitle user is most likely to struggle with, for broadly similar reasons.  Also, while Dragon works admirably well for dictating a document or controlling a computer, the error rate goes up with less structured speech.

Which isn't to say it's useless.  High-90s% saves an awful lot of typing.  It makes audio/video searchable.  Google, Siri, Echo and the like do mostly work, in a way that simply wasn't achievable a few years ago.  Deaf people can and do make good use of speech-to-text tech (including YouTube craptions) to facilitate communication with hearing people (who, as a general rule, can't type).  It's just not up to the standard required for high-quality realtime access to speech.  Yet.
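
To put a number on "high-90s": recognition accuracy is conventionally reported via word error rate (WER), the word-level edit distance between what was said and what was recognised, divided by the number of words said. The following is a minimal Python sketch of that calculation, not anything the posters used. The function name word_error_rate is made up, the recognised sample is borrowed from the footnote in Reply #1, and the reference sentence is an assumed reconstruction of what was meant.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words seen
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # reference word dropped
                curr[j - 1] + 1,     # extra word inserted
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# Assumed intended wording (reference) versus the craption-style rendering.
said = "the accuracy is usually pretty good but they still do not do punctuation"
shown = "the accessory is usually pretty good but they still do not do puncture nation"

wer = word_error_rate(said, shown)
print(f"{wer:.1%}")  # 23.1%
```

Three errors in thirteen words is obviously far worse than Dragon on a good day, but the calculation shows the limitation of the headline figure: it counts every word equally, so it cannot tell whether the misrecognised word was a throwaway or the one that established the context.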