Speech Recognition versus Transcription

August 5, 2020

A lot of people wonder how I can run a thriving transcription business. Isn’t there an app for that? I can tell my smartphone what to do or talk to Alexa. Won’t your job soon be obsolete? In the 14 long years I’ve been a transcriptionist, I have kept hearing that speech-recognition software is getting better and better, and yet it is still nowhere near accurate enough for any serious publication. There have been claims that artificial intelligence and machine learning have reached human parity or will soon eclipse us in pattern recognition, but when it comes to processing language, meaning is incredibly important. Even if AI can translate a sound into a word, can it interpret the meaning (i.e. the syntax and semantics) well enough to punctuate a sentence correctly? So far, the answer is a big fat no.

Using Dragon for Dictation

To clarify, when it comes to dictation, Nuance’s Dragon is the best on the market (which is why it is so expensive). It can understand one person speaking into a recorder (with no accent and no background noise) pretty well because you train it to your voice, and a personal language file is slowly compiled after hours of correcting its output. Still, you have to have a heavy editing hand or else verbalize “comma”, “period”, “new paragraph”. The same goes for Siri or Cortana. You can give them commands or even dictate short voice messages, but they are going to have a hard time with multiple speakers and people with thick accents, and processing a long-form recording (e.g. an hour-long lecture or interview) will take longer than it would take a professional to transcribe it.

A Human Transcriptionist is Necessary When You Cannot Compromise on Quality

I work for journalists, academics, politicians, and business people who cannot tolerate mistakes. They may publish those transcripts in newspapers, on blogs, in scientific journals, or as a formal record of proceedings. Writing “ur” (your) or “gonna” (going to) and leaving out periods might be sufficient when communicating via SMS with friends or family, but it is totally unacceptable in formal publications. Incorrect grammar, spelling, punctuation, and capitalization can also be misleading and, frankly, can damage your credibility. You could be at risk of misquoting an interviewee. And anyone who’s in marketing knows that messaging is everything.

A Comparison

From time to time, I test speech-recognition software that’s available to the public (i.e. free). I have trained Windows Speech Recognition to my voice and tried to use my transcription software to process audio files. I have collected hundreds of hilarious and totally absurd transcripts, all of them completely unusable for a paying client. Not only do they take longer to proofread than the audio would take to transcribe from scratch, but the software also uses so much processing power that my (relatively new) computer slows down to the point that I can’t run any other programs.

Using YouTube Auto-Caps

This week, I decided to use YouTube’s automatic caption feature, which is based on Google’s state-of-the-art automatic speech recognition technology. I had a series of excellent-quality audio files of around an hour each. The first thing to know about using YouTube to transcribe is that, even though it is free, it is very labor intensive. You cannot just upload an audio file to YouTube; it must have some sort of image. So just the process of adding an image to the audio file (with Movie Maker) takes a long time, about 50 minutes in my case.

Then you have to upload the file to YouTube and wait for it to process (about 20 minutes). There’s no notification to let you know the captions are available, but I’ve found that this is generally an overnight process.

On average, I can transcribe a one-hour file in 2.5 to 3 hours. For excellent audio quality, it would be close to two. So already, YouTube’s processing has taken longer than I, a human transcriptionist, would need for the whole job.

Once the auto-caps are ready, they are available for download as an .srt, .vtt, or .scc file. These all have timecodes and other tagging embedded within them. I’ve been doing this for a while, so I have a series of macros to take all of the extraneous coding out, leaving only the spoken words. The problem remains that you are left with one long stream of lowercase text with no punctuation and no differentiation between speakers, not to mention recognition errors. So there is a lot of proofreading to do.
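For anyone curious what that cleanup step involves, here is a minimal Python sketch of the idea. It is not my actual macros; the function name `strip_caption_markup` and the patterns are made up for illustration, and it assumes a simple .srt or .vtt layout (a cue number, a timing line containing `-->`, then the caption text).

```python
import re

# Timing lines look like "00:00:01,000 --> 00:00:04,000" (.srt)
# or "00:01.000 --> 00:04.000" (.vtt); the hours part is optional.
TIMING_LINE = re.compile(r"^(?:\d+:)?\d{2}:\d{2}[.,]\d{3}\s*-->")
# Inline tags such as <i>...</i> or <c>...</c>.
INLINE_TAG = re.compile(r"<[^>]+>")


def strip_caption_markup(caption_text: str) -> str:
    """Return only the spoken words from simple .srt/.vtt caption text."""
    spoken = []
    for raw_line in caption_text.splitlines():
        line = raw_line.strip()
        # Skip blank lines, the WebVTT header, cue numbers, and timing lines.
        # Caveat: a caption line consisting only of digits is dropped too.
        if (not line or line.startswith("WEBVTT")
                or line.isdigit() or TIMING_LINE.match(line)):
            continue
        line = INLINE_TAG.sub("", line).strip()
        if line:
            spoken.append(line)
    return " ".join(spoken)


if __name__ == "__main__":
    sample = (
        "1\n00:00:01,000 --> 00:00:04,000\nhello and welcome\n\n"
        "2\n00:00:04,500 --> 00:00:06,000\nto the <i>show</i>\n"
    )
    print(strip_caption_markup(sample))
```

Even with the timecodes gone, the result is still one unpunctuated stream of words, which is exactly where the real work begins.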

The Verdict: Human Transcription Is Still Significantly Faster and Substantially Higher Quality

Of the five one-hour files I tested, proofreading alone took me anywhere from 2 to 6.5 hours per file. So taking into account all of the processing, the overnight captioning, and the proofreading, this is not beneficial to my productivity. Therefore, just as a professional translator would not judge Google Translate as adequate, I will not give discounts for proofreading speech-recognition files, because it can literally take up to three times longer than transcribing the same audio from scratch.
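To put rough numbers on that, here is a small Python sketch that adds up the figures from this post. It assumes the proofreading range applies per one-hour file, uses two hours as my from-scratch time for excellent audio, and leaves the overnight wait for captions out entirely, so the real gap is wider. The names are just for illustration.

```python
# Figures taken from this post; the overnight caption wait is NOT included.
PREP_MINUTES = 50                 # adding an image to the audio file
UPLOAD_MINUTES = 20               # uploading and waiting for YouTube to process
PROOFREAD_HOURS = (2.0, 6.5)      # fastest and slowest proofreading times
HUMAN_HOURS = 2.0                 # from-scratch transcription, excellent audio


def auto_caption_hours(proofread_hours: float) -> float:
    """Total hours for the auto-caption route, excluding the overnight wait."""
    return (PREP_MINUTES + UPLOAD_MINUTES) / 60 + proofread_hours


if __name__ == "__main__":
    for label, hours in zip(("best case", "worst case"), PROOFREAD_HOURS):
        total = auto_caption_hours(hours)
        print(f"{label}: {total:.2f} h vs {HUMAN_HOURS:.2f} h from scratch "
              f"({total / HUMAN_HOURS:.1f}x)")
```

Even in the best case, the auto-caption route comes out slower than simply transcribing the file.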

In the fast-paced world of breaking news and social media, even waiting overnight for a transcript that a human transcriptionist could have completed in two hours is too long. Why rely on an inconsistent app only to slog through hours of editing when you could just hire an experienced professional to do it for you?