Amy La Branch – Englische Transkription

Speech Recognition versus Transcription

Amy La Branch — Wed, 05 Aug 2020 16:43:28 +0000

A lot of people wonder how I can run a thriving transcription business. Isn’t there an app for that? I can tell my smartphone what to do or talk to Alexa. Won’t your job soon be obsolete? For the 14 long years I’ve been a transcriptionist, I have heard that speech-recognition software is getting better and better, and yet it is still nowhere near accurate enough for any serious publication. There have been claims that artificial intelligence and machine learning have reached human parity or will soon eclipse us in pattern recognition, but when it comes to processing language, meaning is incredibly important. Even if AI can translate a sound into a word, can it interpret the meaning (i.e. syntax and semantics) well enough to punctuate a sentence correctly? So far, the answer is a big fat no.

Using Dragon for Dictation

To clarify, when it comes to dictation, Nuance’s Dragon is the best on the market (which is why it is so expensive). It can understand one person speaking into a recorder (with no accent and no background noise) pretty well because you train it to your voice, and a personal language file is slowly compiled after hours of correcting its output. Still, you have to have a heavy editing hand or else verbalize “comma”, “period”, “new paragraph”. The same goes for Siri or Cortana. You can give them commands or even short-form voice messages, but they are going to have a hard time with multiple speakers and people with thick accents, and processing a long-form recording (i.e. an hour-long lecture or interview) will take longer than it would take a professional to transcribe it.

A Human Transcriptionist is Necessary When You Cannot Compromise on Quality

I work for journalists, academics, politicians, and business people who cannot tolerate mistakes. They may publish those transcripts in newspapers, on blogs, in scientific journals, or as a formal record of proceedings. “ur” (your) or “gonna” (going to) and no periods might be sufficient when communicating via SMS to friends or family, but it’s totally unacceptable for formal publications. Incorrect grammar, spelling, punctuation, and capitalization can also be misleading and, frankly, can damage your credibility. You could be at risk of misquoting an interviewee. And anyone who’s in marketing knows that messaging is everything.

A Comparison

From time to time, I test speech-recognition software that’s available to the public (i.e. free). I have trained Windows Speech Recognition to my voice and tried to use my transcription software to process audio files. I have collected hundreds of hilarious and totally absurd transcripts, all of them completely unusable for a paying client. Not only do they take longer to proofread than it would take to transcribe the audio from scratch, but the recognition also takes a lot of processing power, and my (relatively new) computer slows down to the point that I can’t run any other programs.

Using YouTube Auto-Caps

This week, I decided to use YouTube’s automatic caption feature, which is based on Google’s state-of-the-art automatic speech recognition technology. I had a series of excellent quality audio files of around an hour each. The first thing to know about using YouTube to transcribe is that, even though it is free, it is very labor intensive. You cannot just upload an audio file to YouTube; it must have some sort of image. So just the process of adding an image to the audio file (with Movie Maker) takes a long time, about 50 minutes in my case.

Then you have to upload the file to YouTube and wait for it to process (about 20 minutes). There’s no notification to let you know the captions are available, but I’ve found that this is generally an overnight process.

On average, I can transcribe a one-hour file in 2.5-3 hours. For excellent audio quality, it would be close to two. So already, YouTube’s processing time has taken longer than me, a human transcriptionist.

Once the auto-caps are ready, they are available for download as an .srt, .vtt, or .scc file. These all have timecodes and other tagging embedded within them. I’ve been doing this for a while, so I have a series of macros to take all of the extraneous coding out, leaving only the spoken words. The problem remains that you then have a long stream of lowercase text with no punctuation and no differentiation between speakers, not to mention recognition errors. So there is a lot of proofreading to do.
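The macros themselves are not reproduced here. As a minimal sketch of the same idea in Python, the function below removes cue numbers, timing lines, and inline tags from .srt- or .vtt-style caption text. The function name and the sample captions are made up for illustration, and the sketch does not handle .scc files or every header a real download might contain.

```python
import re

# Timing line such as "00:00:01,000 --> 00:00:04,000" (SRT) or
# "00:01.000 --> 00:04.000" (WebVTT).
TIMING = re.compile(r"^\s*(\d+:)?\d{1,2}:\d{2}[.,]\d{3}\s*-->\s*")
# Inline markup such as <i>, </i>, or <00:00:01.500>.
TAG = re.compile(r"<[^>]+>")


def strip_captions(caption_text: str) -> str:
    """Return only the spoken words from SRT- or WebVTT-style captions.

    Hypothetical helper for illustration. Known limitation: a caption
    line made up only of digits is treated as a cue number and dropped.
    """
    spoken = []
    for line in caption_text.splitlines():
        line = line.strip()
        if not line or line.isdigit():      # blank line or cue number
            continue
        if line == "WEBVTT" or line.startswith("WEBVTT "):  # file header
            continue
        if TIMING.match(line):              # timecode line
            continue
        line = TAG.sub("", line).strip()    # drop inline tags
        if line:
            spoken.append(line)
    return " ".join(spoken)


# Made-up sample in SRT layout.
sample = (
    "1\n"
    "00:00:01,000 --> 00:00:04,000\n"
    "hello and welcome\n"
    "\n"
    "2\n"
    "00:00:04,500 --> 00:00:07,000\n"
    "<i>to the</i> interview\n"
)
print(strip_captions(sample))  # prints: hello and welcome to the interview
```

The result is exactly the kind of unpunctuated stream of text described above, which still has to be proofread by hand.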

The Verdict: Transcription Still Significantly Faster and Substantially Higher Quality

Of the five one-hour files I tested, proofreading alone took me anywhere from 2 to 6.5 hours. So taking into account all of the processing, the overnight captioning, and the proofreading, this is not beneficial to my productivity. Therefore, just as a professional translator would not judge Google Translate as adequate, I will not give discounts for proofreading speech-recognition files because it literally takes up to three times longer than transcribing the same audio from scratch.
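To make the comparison concrete, here is a rough tally of the figures quoted in this post, written as a short Python sketch. It assumes the 50 minutes of adding an image and the 20 minutes of upload and processing both count toward the automated route, and it leaves out the overnight wait for captions because no exact figure is given for it. The variable names are made up for illustration.

```python
# Figures quoted in the post, converted to minutes.
ADD_IMAGE = 50                      # adding an image with Movie Maker
UPLOAD_AND_PROCESS = 20             # uploading and waiting for YouTube
PROOFREAD = (2 * 60, 6.5 * 60)      # fastest and slowest proofreading
MANUAL = (2 * 60, 3 * 60)           # transcribing one hour from scratch

# Automated route, excluding the overnight wait for captions.
auto_low = ADD_IMAGE + UPLOAD_AND_PROCESS + PROOFREAD[0]
auto_high = ADD_IMAGE + UPLOAD_AND_PROCESS + PROOFREAD[1]

print(f"Auto-caps route: {auto_low / 60:.1f} to {auto_high / 60:.1f} hours")
print(f"From scratch:    {MANUAL[0] / 60:.1f} to {MANUAL[1] / 60:.1f} hours")
# prints:
# Auto-caps route: 3.2 to 7.7 hours
# From scratch:    2.0 to 3.0 hours
```

Even the best case for the automated route comes out slower than the slowest manual case, before the overnight wait is counted.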

In the fast-paced world of breaking news and social media, even waiting overnight is too long for a transcript that a human transcriptionist could have completed in two hours. Why rely on an inconsistent app only to slog through hours of editing when you could just hire an experienced professional to do it for you?

Recording Tips (expanded)

Amy La Branch — Fri, 16 Mar 2018 14:26:23 +0000

How do you ensure the best transcript? Record the best quality audio.

Transcriptionists can only type what they hear. In some cases, an expert transcriptionist may be able to piece together inaudible or unintelligible portions of the audio from context, but they are not supposed to just “make it up”. You want a true verbatim representation of what was said, not what the transcriptionist thinks she heard.

The better the audio quality, the faster the transcript.

That’s because the transcriptionist will listen to difficult audio again and again to try to discern it. If she can’t hear it, you will receive a transcript full of (inaudible) markings, which may not even be useful for your end purposes.

Your transcript will only be as good as your audio.

The best quality audio is always recorded at the fastest speed and highest quality possible. Radio quality is 128 kbps. For an mp3, in most cases, 44.1 kHz, 32 kbps should be sufficient quality and not produce a huge file. Uncompressed audio formats are always better because compressing audio files while recording greatly decreases audio quality. If files must be compressed because smaller files are easier to transfer, it is best to zip them after the recording is finished. The original larger file is always preferable to a file that is converted to a smaller size or format.
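For a sense of the file sizes involved, the size of a recording is roughly its bitrate multiplied by its length. The Python sketch below works this out for a one-hour recording at the two bitrates mentioned above. The uncompressed figure is an added assumption for comparison (16-bit stereo at 44.1 kHz), and the function name is made up for illustration.

```python
def audio_size_mb(bitrate_kbps: float, minutes: float) -> float:
    """Approximate file size in megabytes (1 MB = 1,000,000 bytes)."""
    bits = bitrate_kbps * 1000 * minutes * 60
    return bits / 8 / 1_000_000


# Assumed uncompressed format: 44,100 samples per second,
# 16 bits per sample, 2 channels, which is 1411.2 kbps.
UNCOMPRESSED_KBPS = 44_100 * 16 * 2 / 1000

for label, kbps in [
    ("mp3 at 32 kbps", 32),
    ("mp3 at 128 kbps", 128),
    ("uncompressed 16-bit stereo", UNCOMPRESSED_KBPS),
]:
    print(f"{label}: about {audio_size_mb(kbps, 60):.1f} MB per hour")
# prints:
# mp3 at 32 kbps: about 14.4 MB per hour
# mp3 at 128 kbps: about 57.6 MB per hour
# uncompressed 16-bit stereo: about 635.0 MB per hour
```

These are estimates only, since real files also carry headers and metadata, but they show why the larger original is harder to transfer.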

Reduce background noise during the recording process.

Audio quality is very difficult to restore and extremely easy to distort with audio editing. So it’s best to reduce as much background noise as possible during the recording process because commercial editing software does not “fix” the file. It can only really filter out defined noise tracks or act as a sort of equalizer (increasing the treble or the bass).

Record in a quiet place with a microphone near each participant.

The best audio quality is recorded in a quiet environment with no background noise, like a closed office. There is nothing worse for a transcriptionist than an interview in a loud café or with an air conditioner or wind that is constantly blowing on the microphone. Hidden microphones under the clothes always produce terrible audio because of clothing brushing against the mic or muffling the sound. You always get the best quality audio when all of the participants speak directly into the microphone.

Do not interrupt.

When there are multiple participants being recorded, you will get the best audio quality when each speaker has their own microphone and people speak one at a time. When people interrupt or speak over each other, it is very difficult for a transcriptionist to differentiate what is said or who said it. A good facilitator can direct questions to specific people, ask participants not to interrupt, and insist that they speak in a loud, clear voice.

When interviewing someone, the best quality audio has no backchanneling. Use nonverbal prompts like nodding or smiling, rather than interjecting every second with “mm-hmm”, “yeah”, “sure”, “okay”, “absolutely”. Sometimes even one word can obscure what the respondent was saying.

Never use speaker phone.

On a teleconference, anybody who’s not speaking should be muted to cut down on background noise. If at all possible, please speak directly into the phone or handset and never use speaker phone, because it also picks up ambient and environmental noise, which muffles voices or makes them fuzzy.

The worst quality audio is when someone records a room full of people with an iPhone in the middle of the table. Maybe the person closest to the iPhone will be recorded well enough, but anyone who is further away from the microphone may be “muddy” or completely inaudible.

Provide resources for context.

When there are many people speaking, referring to participants by name or having each person introduce themselves is always best. Otherwise, the transcriptionist cannot identify more than three or four unique voices. When references such as meeting agendas, slide presentations, or participant lists are available, please provide them. When there are lots of names or specific terms used, such as pharmaceuticals or corporate jargon, word lists for context are very much appreciated. Spelling out names or uncommon terms is fantastic, too.

In the end, the transcript is only as good as the audio quality. Bad audio quality is difficult to work with, takes longer to transcribe, and may even produce an unusable transcript.