Why do we need ASR?
Speech recognition technology allows computers to take spoken audio, interpret it, and generate text from it. But how do computers understand human speech? The short answer is… the wonder of signal processing. Speech is simply a series of sound waves created by our vocal cords when they cause the air around them to vibrate. These sound waves are recorded by a microphone and converted into an electrical signal. The signal is then processed using advanced signal processing techniques that isolate syllables and words. Over time, the computer can learn to understand speech from experience, thanks to incredible recent advances in artificial intelligence and machine learning. But signal processing is what makes it all possible.
Here at Megdap, we strive to solve this problem for all Indic languages.
Hello World of ASR
ASR, or Automatic Speech Recognition, is a process that takes a continuous audio speech signal and converts it into its equivalent text. This is an introductory blog about the process of performing ASR and the workflow used to implement it in a Kaldi environment.
What we need:
- Lots of audio files
- All their corresponding transcripts
Factors (specific to the audio input) that impact the quality of an ASR engine are:
- Volume
- Number of speakers
- Pitch
- Silences
- Word speed
- Background Noise
What we need to know to understand ASR:
I will be writing individual blogs on the concepts listed below in the coming days (a small GMM sketch follows right after this list).
- Bayes' Theorem
- HMM (Hidden Markov Models)
- GMM (Gaussian Mixture Models)
- Basic language and phone understanding
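To give a taste of the GMM piece, here is a minimal sketch (assuming scikit-learn and NumPy are installed; the feature vectors are synthetic stand-ins, not real speech data) of fitting a Gaussian Mixture Model, the classic way of modelling the distribution of acoustic feature vectors:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic stand-in for acoustic feature vectors (e.g. 13-dim MFCC frames)
rng = np.random.default_rng(0)
features = np.vstack([
    rng.normal(loc=-2.0, scale=0.5, size=(200, 13)),
    rng.normal(loc=+2.0, scale=0.5, size=(200, 13)),
])

# A GMM with a few components models the distribution of these frames;
# in a classic GMM-HMM system, each HMM state gets its own mixture.
gmm = GaussianMixture(n_components=2, covariance_type="diag", random_state=0)
gmm.fit(features)

# Per-frame log-likelihoods are what the acoustic model contributes to decoding
print(gmm.score_samples(features[:5]))
```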
What do we need to create an ASR Engine?
We mainly build the following:
- Acoustic Model
- Language Model
- Lexicon Model
In Kaldi, the workflow for building an ASR engine revolves around these three components. The acoustic model helps us model the audio signal, the language model helps us predict the next word in a sequence, and the lexicon model is a pronunciation model that maps each word to its sequence of phones.
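To make the language-model idea concrete, here is a toy sketch (not Kaldi code; the tiny corpus below is made up for illustration) of how a bigram language model estimates the probability of the next word:

```python
from collections import Counter, defaultdict

# A toy corpus standing in for the real training transcripts
corpus = [
    "the cat sits on a mat",
    "the cat sleeps on a mat",
    "a dog sits on the floor",
]

# Count bigrams: how often word w2 follows word w1
bigram_counts = defaultdict(Counter)
for sentence in corpus:
    words = sentence.split()
    for w1, w2 in zip(words, words[1:]):
        bigram_counts[w1][w2] += 1

def next_word_prob(w1, w2):
    """P(w2 | w1) estimated from the toy corpus (no smoothing)."""
    total = sum(bigram_counts[w1].values())
    return bigram_counts[w1][w2] / total if total else 0.0

print(next_word_prob("cat", "sits"))   # 0.5 in this toy corpus
print(next_word_prob("cat", "floor"))  # 0.0
```

In practice, Kaldi recipes typically use n-gram models of this kind compiled into a finite-state transducer, but the underlying idea is the same.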
For feature extraction in ASR, understanding how we hear is more valuable than understanding how we speak. The primary objective of speech recognition is to build a statistical model that infers the text sequence (say, “cat sits on a mat”) from a sequence of feature vectors.
There is plenty of golden content to be found in phonetics and linguistics. Regardless, ASR is about finding the most likely word sequence given the audio and training these probability models with the provided transcripts.
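In equation form, "finding the most likely word sequence given the audio" is the classic Bayes decision rule, where X is the sequence of feature vectors and W a candidate word sequence:

```latex
W^{*} = \arg\max_{W} P(W \mid X)
      = \arg\max_{W} \frac{P(X \mid W)\, P(W)}{P(X)}
      = \arg\max_{W} \underbrace{P(X \mid W)}_{\text{acoustic model}} \; \underbrace{P(W)}_{\text{language model}}
```

Since P(X) does not depend on W, it can be dropped from the maximisation; the acoustic model and the language model supply the two remaining terms, and the lexicon bridges them by mapping words to phone sequences.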
The basic idea behind building an ASR engine revolves around understanding the speech signal. We need to represent the audio file (.wav or .flac) as its corresponding audio signal and extract features from it, which involves applying a series of mathematical operations. MFCC (Mel-Frequency Cepstral Coefficient) analysis converts that audio signal into the essential speech features required to train an acoustic model. A spectrogram is the representation of the audio signal in the frequency domain, obtained using the Fourier transform.
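Here is a minimal sketch of that feature-extraction step in Python (assuming the librosa library is installed; "sample.wav" is a placeholder file name, and Kaldi itself computes these features with its own tools such as compute-mfcc-feats):

```python
import librosa
import numpy as np

# "sample.wav" is a placeholder; any mono recording works
signal, sample_rate = librosa.load("sample.wav", sr=16000)

# MFCCs: 13 cepstral coefficients per analysis frame, the classic acoustic features
mfccs = librosa.feature.mfcc(y=signal, sr=sample_rate, n_mfcc=13)

# Spectrogram: magnitude of the short-time Fourier transform
spectrogram = np.abs(librosa.stft(signal))

print(mfccs.shape)        # (13, number_of_frames)
print(spectrogram.shape)  # (frequency_bins, number_of_frames)
```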
Acoustic, Lexicon & Language Model Workflow Diagram:
Hyperparameter Tuning
During training, we run these extracted features through a lengthy pipeline, starting with monophone training and going all the way to neural network training.
If you have bad monophone alignments, you will have bad triphone alignments. If you have bad triphone alignments, you will train a bad neural net. As such, you should take some time to tweak the parameters at each stage, to make sure your model and alignments are good enough to pass on to the next stage.
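As a rough sketch of that stage order, here is what the train-align-train cycle can look like (the script names follow the standard Kaldi egs-style recipes and assume you are inside such an experiment directory; the --nj/--cmd options, data directories, and leaf/Gaussian counts are illustrative, not our production values):

```python
import subprocess

# Each stage trains a model, then realigns the data for the next stage.
# Script paths follow the standard Kaldi egs recipe layout (steps/...).
stages = [
    "steps/train_mono.sh --nj 4 --cmd run.pl data/train data/lang exp/mono",
    "steps/align_si.sh --nj 4 --cmd run.pl data/train data/lang exp/mono exp/mono_ali",
    "steps/train_deltas.sh --cmd run.pl 2000 10000 data/train data/lang exp/mono_ali exp/tri1",
    "steps/align_si.sh --nj 4 --cmd run.pl data/train data/lang exp/tri1 exp/tri1_ali",
    # ... further triphone stages (LDA+MLLT, SAT) and finally neural network training
]

for cmd in stages:
    # Inspect the alignments and error rates after each stage before moving on
    subprocess.run(cmd.split(), check=True)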
To tackle this problem, we run several in-house analysis scripts to understand the alignments between the audio and the transcripts. We then go through an iterative cycle of improving our data quality with our Data Team.
Once we train the model, we analyze the results using two major metrics:
- Word Error Rate (WER)
- Sentence Error Rate (SER)
The lower the values of both, the better the model performance. Since we train several models along the pipeline, we have the flexibility to choose the best one.
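For reference, here is a minimal sketch of how WER can be computed (a standard Levenshtein edit distance over words; Kaldi's scoring scripts do the equivalent):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming table for the edit distance between the two word sequences
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution / match
    return d[len(ref)][len(hyp)] / len(ref)

print(word_error_rate("cat sits on a mat", "cat sat on the mat"))  # 2 errors / 5 words = 0.4
```

Sentence Error Rate is simply the fraction of utterances whose hypothesis contains at least one error.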
Decoding and Server-Side Hosting
Once we have trained our model using Kaldi, we run the decoding scripts on an input audio file to construct a decoding lattice and generate the final transcript. We then expose this model through a RESTful API to allow smooth communication. More on this later!
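As a minimal sketch of that serving layer (Flask is one option; the /transcribe endpoint and the run_kaldi_decoder helper below are hypothetical placeholders, not our actual service):

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

def run_kaldi_decoder(wav_path: str) -> str:
    """Placeholder: in a real deployment this would invoke the trained Kaldi
    model (e.g. an online decoding binary) and return the transcript."""
    raise NotImplementedError

@app.route("/transcribe", methods=["POST"])
def transcribe():
    # The client uploads an audio file; we save it and hand it to the decoder
    audio = request.files["audio"]
    wav_path = "/tmp/upload.wav"
    audio.save(wav_path)
    transcript = run_kaldi_decoder(wav_path)
    return jsonify({"transcript": transcript})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)
```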
Stay tuned for More Technical and In-Depth Machine Learning blogs from Megdap!
Here’s the link to our website: www.megdap.com
Regards: Team Megdap