Speech to Text: The Future of Unstructured Data (Part 2)

10-minute read

We had a lot of fun in Part 1, exploring Automatic Speech Recognition (ASR) and brainstorming ways it can help businesses harness unstructured data to work smarter. But now we’re moving into the nitty-gritty. What’s technically possible, and who’s currently doing it best?

You are going to be shocked when I tell you the answer. (Wait for it. Drumroll. Pause for dramatic effect.)

It. Depends. (wah, wah)

So instead of focusing on the Best ASR, let’s focus on the Best ASR for You.

What Do I Need My ASR Tool To Do?

I’m a developer. I can tell you there are lots of different ways to “do” Speech to Text. (I played around with several of them for my podcast project.) We have plenty of great MINDs who can help you find the right ASR tool for your business. We start by assembling a list of your criteria.

First, what are you trying to accomplish with ASR technology? That will dictate which features you need. Here’s a list of considerations and accompanying questions to get your thought process started. If you’re not sure how to answer these questions, don’t worry; we are here to help you think it through and offer guidance. This is a preliminary exercise to begin prioritizing what’s important for your project:

Synchronous vs. Asynchronous: Do you need instantaneous, real-time transcription? Or are you using prerecorded files such as MP3s?

Speaker/Channel Identification: Do you need to note who is speaking (call center rep vs. customer, podcast host vs. guest) or which channel audio is coming from at any given time?

Word Timestamps: Do you need timestamps for every word in the audio file to map to an external source post-transcription? (Here’s an example of what that tech can enable.)

Custom Vocabulary: Do you work in healthcare, law or any other field that speaks its own language?

Spoken Foreign Language Support: How many and which languages do you need transcribed?
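To make this exercise concrete, here’s a minimal sketch of encoding your answers as a checklist you can score vendors against. The names and structure here are hypothetical for illustration; they aren’t any vendor’s API.

```python
from dataclasses import dataclass

@dataclass
class ASRRequirements:
    """Hypothetical checklist of ASR project criteria (not a vendor API)."""
    real_time: bool = False             # synchronous streaming vs. prerecorded files
    speaker_identification: bool = False
    channel_identification: bool = False
    word_timestamps: bool = False
    custom_vocabulary: bool = False
    languages: tuple = ("en-US",)       # spoken languages that must be supported

def score_vendor(needs: ASRRequirements, vendor_features: dict) -> int:
    """Count how many of your required features a vendor covers."""
    score = 0
    for feature in ("speaker_identification", "channel_identification",
                    "word_timestamps", "custom_vocabulary"):
        if getattr(needs, feature) and vendor_features.get(feature):
            score += 1
    # Full credit only if every required language is supported.
    if set(needs.languages) <= set(vendor_features.get("languages", [])):
        score += 1
    return score
```

A spreadsheet works just as well; the point is to pin down your must-haves before comparing products.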


Who Offers What I Need?

As I said in my last post, the world of AI-enabled language services is competitive and evolving quickly. I put together this handy chart comparing the Big 4’s current ASR capabilities, but I’m hesitant to share it because it could be outdated in a month, a week, or you know, tomorrow. Consider it a snapshot in time, for your eyes only: 

|                         | Microsoft                  | AWS                                                 | Google Cloud | IBM |
|-------------------------|----------------------------|-----------------------------------------------------|--------------|-----|
| Speaker Identification  | Sorta*                     | Yes                                                 | Yes          | Yes |
| Channel Identification  | No                         | Yes                                                 | Yes          | No  |
| Word Timestamps         | No                         | Yes                                                 | Yes          | Yes |
| Custom Vocabulary       | Yes                        | Yes                                                 | Yes          | Yes |
| Spoken Language Support | 5–23 (depending on mode)   | US English, Spanish, Canadian French (more to come) | 100+         | 7   |
| REST API                | Yes (15-second clip limit) | No                                                  | Yes          | Yes |

*Microsoft’s speaker identification is integrated differently from the other three companies’ and is intended for speaker verification rather than transcription. It verifies the speaker against a known list of people you provide, each with an accompanying audio file for training.

Who Does It Best?

There are many different ways to assess the value of ASR tools, and that value is changing all the time as providers leapfrog over each other in the eternal quest for dominance. Once you’ve established the parameters of your project, we can help you conduct a technology shootout to determine the best solution for you.

That said, the closest thing Speech to Text currently has to a gold-standard accuracy test is the Word Error Rate. WER is an acronym you’ll run across a lot as you delve into the ASR realm. It’s tough to find well-done general accuracy assessments, but the folks at transcription app Descript released one just a few months ago. It has transparent methodology, conscientious design and execution, and it’s fun to read. Plus it bears out my considered opinion that Google and Amazon are currently leading the pack.
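Under the hood, WER is just the Levenshtein edit distance between a reference transcript and the ASR output, computed over words instead of characters, divided by the number of words in the reference. A minimal sketch:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # Classic dynamic-programming edit distance, over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

print(word_error_rate("recognize speech", "wreck a nice beach"))  # 4 edits / 2 words = 2.0
```

Note that WER can exceed 1.0 when the transcript inserts more words than the reference contains, as in the classic mishearing above.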

Speech to Text as Part of the AI-driven Whole

At first blush, Speech to Text appears to have significant limitations. When you settle in to binge your favorite show, would you rather read the captions or listen to the actors deliver the dialogue? Transcription can lose the speaker’s intensity and emphasis, and that can have repercussions for business. If you’re in customer service, it’s important to you whether the customer calmly stated, “This is really annoying,” or shrieked, “This is REALLY ANNOYING!” An exclamation point can only tell a fraction of the story, right?

Well…the rise of AI is definitely requiring some paradigm shifts, and that’s certainly the case here. The goal is no longer merely to understand the customer’s needs and opinions; it’s to automate understanding the customer’s needs and opinions, freeing up your human workforce to meet those needs and respond to those opinions. That’s where Natural Language Processing (NLP) comes in.

Most people know NLP is how Alexa understands and responds to your questions and commands in plain old human-speak, but that’s just one aspect of NLP functionality. NLP pulls key terms and their relationships out of mountains of unstructured data to understand and analyze what’s being said. It’s the linchpin of that call center example I gave at the beginning of Part 1 and this flowchart from Amazon:


Transcribe is Amazon’s ASR tool. Once it turns spoken unstructured data into text, Comprehend (AWS’s NLP service) makes sense of that text and structures it so that Athena can run queries, which QuickSight then turns into shareable business intelligence. Regardless of which provider you choose, the flowchart will be comparable – different product names, similar functionalities.
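The handoff between these stages is just structured JSON. The snippet below parses a hand-made sample shaped roughly like a Transcribe result, pulling out the per-word timestamps and speaker labels that downstream services consume. It’s a simplified stand-in for illustration, not real service output; the actual format has more fields and puts speaker labels in a separate section.

```python
import json

# Hand-made sample, simplified from the general shape of an
# Amazon Transcribe result (not actual service output).
sample = json.loads("""
{
  "results": {
    "transcripts": [{"transcript": "this is really annoying"}],
    "items": [
      {"start_time": "0.0", "end_time": "0.3", "alternatives": [{"content": "this"}], "speaker": "spk_0"},
      {"start_time": "0.3", "end_time": "0.5", "alternatives": [{"content": "is"}], "speaker": "spk_0"},
      {"start_time": "0.5", "end_time": "0.9", "alternatives": [{"content": "really"}], "speaker": "spk_0"},
      {"start_time": "0.9", "end_time": "1.4", "alternatives": [{"content": "annoying"}], "speaker": "spk_0"}
    ]
  }
}
""")

def words_with_timestamps(result: dict) -> list:
    """Flatten transcript items into (word, start, end, speaker) tuples."""
    return [(item["alternatives"][0]["content"],
             float(item["start_time"]),
             float(item["end_time"]),
             item["speaker"])
            for item in result["results"]["items"]]

for word, start, end, speaker in words_with_timestamps(sample):
    print(f"{start:4.1f}-{end:4.1f}s  {speaker}: {word}")
```

This is exactly the kind of per-word, per-speaker structure that makes the call center analytics use case possible.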

Staying Up on ASR Progress

“Okay, Colin, you keep telling me this field is moving a mile a minute. I’m a busy professional moving a mile a minute. How am I supposed to track breakthroughs in this one burgeoning field?”

My best advice here is: Stay home! We’ve been talking for a year now about the consumerization of enterprise, and it’s still very much A Thing. If you have a Google Home or Amazon Echo, monitor its performance. Has your smartphone’s talk-to-text feature gotten better? The major players may roll out improvements slowly, but they may also push big updates that leave you thinking: “Whoa! Where’d she learn to do that?” You can bet that if the consumer technology has improved, developers will have much more to play with.

Of course you can also keep an eye on the big guys’ blogs and product pages. AWS just announced a big win, and I got the news via RSS feed. Three of its AI language services (including ASR and NLP) are now certified as compliant with the Payment Card Industry Data Security Standard. They were already deemed HIPAA compliant. That means if you work in commerce, retail or healthcare, your use case just got a whole lot more appealing.

If you’re looking for an aggregator to keep up more with the tech than the providers, would you believe there’s a GitHub repository devoted to tracking ASR developments? It’s called WER_are_we. (Get it? I told you WER was the ASR gold standard.)

One more hot tip for staying in the know: The International Speech Communication Association may sound like every comm major’s favorite extracurricular, but make a point to monitor the news coming out of its annual conference. Interspeech claims to be “the world’s largest and most comprehensive conference on the science and technology of spoken language processing,” and it’s a great reminder of the global face of AI language services. Google, IBM and Microsoft researchers presented at Interspeech 2018 in India. Next September, Interspeech celebrates its 20th anniversary in Austria.

Avoiding a Major Misstep

In Part 1, I mentioned the perks of getting into ASR on the ground floor. Cloud services make it easy and relatively inexpensive to tinker with the technology, and we can be reasonably assured AWS, Google, IBM and Microsoft aren’t going away. Rather, the four giants will continue to pour resources into their AI cloud offerings, meaning capabilities will grow as your needs grow. But early adoption is really use-case dependent. We’ll end as we began, putting your business at the center of the tech.

I’ve given you a lot to mull over above: project considerations, competing products, AI overhauling the way we work. If you can see Speech to Text revolutionizing how you do business, why not get in on it now? Conversely, if you don’t have a use case jumping out at you, it doesn’t hurt to keep observing from the sidelines.

But allow me to offer one last observation to ease your mind. Here at Mind Over Machines, we pride ourselves on thoughtful design and implementation. When you invest in those two things, there is no fatal error, no major misstep from which you/we can’t recover. A well-designed Speech to Text application should be built to allow swapping out your ASR API. That way, there’s always an escape hatch in case of unforeseen circumstances. Whether you make the wrong vendor choice, a provider goes out of business, or the pricing landscape changes drastically, pressing the do-over button isn’t prohibitively expensive or complicated. A little foresight is an excellent insurance policy.
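One common way to build in that escape hatch is to put a thin interface between your application and any vendor SDK, so switching providers means writing one new adapter rather than touching every call site. A minimal sketch, with stubbed adapters standing in for real vendor clients:

```python
from abc import ABC, abstractmethod

class ASRProvider(ABC):
    """The interface your application codes against, never a vendor SDK directly."""
    @abstractmethod
    def transcribe(self, audio_path: str) -> str: ...

class StubAWSTranscribe(ASRProvider):
    # A real adapter would call the vendor SDK here (e.g. via boto3);
    # stubbed so the sketch runs without credentials.
    def transcribe(self, audio_path: str) -> str:
        return f"[aws transcript of {audio_path}]"

class StubGoogleSpeech(ASRProvider):
    def transcribe(self, audio_path: str) -> str:
        return f"[google transcript of {audio_path}]"

def process_call_recording(provider: ASRProvider, audio_path: str) -> str:
    # Application logic depends only on the interface, so swapping
    # vendors is a one-line change at the call site.
    return provider.transcribe(audio_path).upper()

print(process_call_recording(StubAWSTranscribe(), "call_001.mp3"))
```

The same idea applies in any language: isolate the vendor behind an interface you own, and the do-over button stays cheap.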

Ready to start thinking through your inaugural Speech to Text project? Grab that list of questions up above, and give us a call or drop me a line. Let’s wreck a nice beach together — wait, I mean, recognize speech together!


About Colin

Colin Reynolds is a modern-day philosopher and a self-professed “growth-minded productivity junky trying to automate everything.” Those two traits would seem to be at odds, but the more you automate, the more time you have to think. Thinking is one of Colin’s favorite pastimes. As a devotee of Cal Newport’s Deep Work, he consistently makes a point to turn off the distractions (yes, even the podcasts) and enjoy some dedicated thought time.

While finishing up his bachelor’s degree in philosophy at Towson University, Colin took a few business courses and poured himself into learning Excel. It was his voluntary application of those self-taught Excel skills at a college internship that landed Colin his first paying gig. Since then, he has continued to ingratiate himself with employers and clients alike by showing them cool things they didn’t know were possible.

Now, Colin has someone new in his life showing him things he didn’t know were possible. His 8-month old daughter loves to take Colin and his wife on weekend hikes, drawing their attention to the sights and sounds of nature with enthusiastic coos of approval. Dedicated thought time may get harder to come by as family obligations increase, but where there’s a will, there’s a way (and it probably involves automation).