There's nothing that turns users off more than the frustration of trying to talk to a device that just can't understand them, no matter how hard they try. If you're picking between Azure Cognitive Services and Rev.ai for your project, you want to compare these solutions along several metrics. Depending on your unique needs and uses, some will be more important than others. No matter what, you want to get a holistic picture of each technology and how they stack up against each other.

Accuracy Winner: Rev AI

By far the most important point of comparison is accuracy. After all, if the ASR engine messes up too many words, using it will be difficult at best and impossible at worst. The gold standard for accuracy benchmarking is word error rate (WER), which measures how many words the ASR tech deletes, inserts, or substitutes as an overall percentage. A 20% WER, for instance, means that it got 20% of the words wrong. In our podcast transcription benchmarks, we compared Rev AI to Microsoft's ASR for 30 podcasts and found that Rev's WER, 14.22%, is about 2% lower than Microsoft's, which came in at 16.51%. The reason that Rev's AI outperforms others is that our network of over 60,000 human transcriptionists contributes data that we use to constantly improve our models.

Try Rev AI Free

Speaker ID and Diarization Winner: Rev AI

Identifying who is talking and when is a key feature for high-performance ASR systems. Both solutions can identify speakers equally well. Microsoft claims their tech supports diarization, but they don't ever say how many speakers it can handle. Rev AI, on the other hand, promises support for 8 English speakers or 6 non-English speakers.

If you want to serve an international customer base or if you're building anything involving translation, supporting multiple languages is essential.
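To make the WER definition above concrete, here is a minimal sketch of how it is typically computed: word-level Levenshtein distance (substitutions, deletions, and insertions) divided by the number of words in the reference transcript. This is an illustrative implementation, not the exact scoring code used in the benchmarks.

```python
def wer(reference, hypothesis):
    # Word error rate: (substitutions + deletions + insertions)
    # divided by the number of words in the reference transcript,
    # computed via Levenshtein distance over words.
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

# One deleted word out of six reference words, so roughly 16.7% WER.
print(wer("the cat sat on the mat", "the cat sat on mat"))
```

A transcript that gets one word in five wrong scores exactly the 20% WER mentioned above.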
Whether you have a great idea for the next Internet of Things (IoT) device, you want to add live-captioning to your media streaming service, or you're creating a hands-free voice user interface for a mobile application, you're going to need an automatic speech recognition (ASR) solution that's up for the job.

The service optimises speech recognition based on which mode is specified, so it is important to define the mode most appropriate to your application. Concise summary below; for more details check out Microsoft's documentation.

- Interactive: Formal + Short & Sharp (utterances typically last 2 - 3 seconds).
- Dictation: Formal + Longer Utterances (full sentences that typically last 5 - 8 seconds).

Define the target language for conversion; see supported languages for a complete list. The response is returned as JSON with the output format set to simple by default (the detailed format instead returns all possible interpretations paired with a confidence score).

Fortunately for developers, there is a free tier that should be more than sufficient to get you started.

- Free Tier (F0): Maximum of 5 calls per second; maximum of 5,000 transactions per month.
- Standard Tier (S0): Maximum of 20 calls per second; £3 GBP / $4 USD / $5 AUD per 1,000 transactions.

Note: Pricing is as of this post; check Microsoft's website for up-to-date pricing.

In this demo, we will invoke the speech recognition service by using the REST API in Python.

1. Create a Bing Speech API resource within the Azure Portal.

import requests

YOUR_API_KEY = 'ENTER_YOUR_API_KEY_HERE'
YOUR_AUDIO_FILE = 'ENTER_PATH_TO_YOUR_AUDIO_FILE_HERE'
REGION = 'ENTER_YOUR_REGION'  # westus, eastasia, northeurope

def get_token(api_key):
    # Return an Authorization Token by making a HTTP POST request to
    # Cognitive Services with a valid API key. (The endpoint host can
    # vary by resource; check the Azure Portal for yours.)
    url = 'https://' + REGION + '.api.cognitive.microsoft.com/sts/v1.0/issueToken'
    headers = {'Ocp-Apim-Subscription-Key': api_key}
    r = requests.post(url, headers=headers)
    r.raise_for_status()
    return r.text

def get_text(token, audio_file):
    # Request that the Bing Speech API convert the audio to text.
    url = ('https://' + REGION + '.stt.speech.microsoft.com'
           '/speech/recognition/interactive/cognitiveservices/v1'
           '?language=en-US')
    headers = {
        'Authorization': 'Bearer ' + token,
        'Content-type': 'audio/wav; codec=audio/pcm; samplerate=16000',
    }
    with open(audio_file, 'rb') as f:
        r = requests.post(url, headers=headers, data=f)
    return r.json()

token = get_token(YOUR_API_KEY)
results = get_text(token, YOUR_AUDIO_FILE)
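The simple output format mentioned above returns a single best recognition result. The sketch below parses a response of roughly that shape; the field names follow Microsoft's documented simple format, but the values shown are illustrative, not real service output.

```python
import json

# Illustrative example of a "simple" output-format response body;
# the values here are made up for demonstration purposes.
response_body = '''{
    "RecognitionStatus": "Success",
    "DisplayText": "What is the weather like today",
    "Offset": 100000,
    "Duration": 28000000
}'''

result = json.loads(response_body)
if result["RecognitionStatus"] == "Success":
    transcript = result["DisplayText"]
```

Checking RecognitionStatus before reading DisplayText matters because the service can return statuses such as no-match or error rather than a transcript.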
Typical use cases include:

- Transcribe and analyse customer call centre data.
- Build intelligent applications that can be triggered by voice.
- Increase accessibility for users with impaired vision.

An utterance is a sequence of continuous speech followed by a clear pause. To optimise performance, audio data (e.g. speaking into a mic) is typically collected, sent and transcribed in chunks to form a stream.

The API key will be required to programmatically work with the API and can be attained from the Azure Portal once a Bing Speech resource has been created.
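The chunked collection described above can be sketched with a small generator; `audio_chunks` is a hypothetical helper, not part of any Microsoft SDK. When a generator like this is passed as the body of a requests call, the upload is sent with chunked transfer encoding rather than loaded into memory all at once.

```python
import io

def audio_chunks(stream, chunk_size=4096):
    # Yield fixed-size chunks from a binary stream so the audio can be
    # sent to the API incrementally instead of in one large request body.
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        yield chunk

# Usage with the requests library (url and headers as in the demo above):
#   with open(YOUR_AUDIO_FILE, 'rb') as f:
#       requests.post(url, headers=headers, data=audio_chunks(f))

# Small in-memory demonstration of the chunking behaviour.
demo = list(audio_chunks(io.BytesIO(b"abcdefgh"), chunk_size=3))
```

For live microphone input the same pattern applies: each captured buffer is yielded as soon as it is available, which is what forms the stream.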