Almost human? How Azure Cognitive Services speech sounds more like a real person
Blog | by Mary Branscombe | 8 November 2018
Speech recognition systems are getting more powerful. They’re most accurate with a good close-up microphone and some advance knowledge of the vocabulary likely to be used on specific topics – as with the PowerPoint Presentation Translator, which transcribes what the presenter says as they work through a deck of slides and translates it into multiple languages – or when the system doesn’t have to work in real time, like the transcription in Microsoft Stream, which takes about an hour to process a 30-minute video and can even recognise different speakers.
Similar options are available to developers in the Azure Cognitive Services speech APIs with Speech-to-Text (which includes customisable speech models for specific vocabularies), Speaker Recognition (which covers both identification and verification) and Speech Translation. Combine those with the language services that can recognise the point of what someone is trying to say: if they’re asking about flights, which part of what they said is the destination and which is the day they want to fly? There’s also a Cognitive Services Speech SDK that includes speech to text, speech translation and intent recognition in C# (on Windows, because it needs UWP or .NET Standard), C/C++ (on both Windows and Linux), Java (for Android and other devices) and Objective-C, if you want to use speech recognition in native apps rather than in JavaScript.
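To give a flavour of the SDK in practice, here’s a minimal C# sketch of one-shot speech recognition from the default microphone; it assumes the Microsoft.CognitiveServices.Speech NuGet package, and the subscription key and region are placeholders to replace with your own.

// Minimal one-shot speech recognition with the Cognitive Services Speech SDK.
using System;
using System.Threading.Tasks;
using Microsoft.CognitiveServices.Speech;

class RecogniseOnce
{
    static async Task Main()
    {
        // Placeholders: use your own Speech subscription key and service region (e.g. "westus").
        var config = SpeechConfig.FromSubscription("YOUR_SUBSCRIPTION_KEY", "YOUR_REGION");

        // With no audio config specified, the recogniser listens on the default microphone.
        using (var recognizer = new SpeechRecognizer(config))
        {
            Console.WriteLine("Say something...");
            var result = await recognizer.RecognizeOnceAsync();

            if (result.Reason == ResultReason.RecognizedSpeech)
                Console.WriteLine($"Recognised: {result.Text}");
            else
                Console.WriteLine($"Recognition failed: {result.Reason}");
        }
    }
}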
But just being able to understand speech isn’t enough to build real-time interactive speech systems. If users can talk to your application, it might need to be able to talk back, confirming that what they say has been recognised, or even hold a conversation with the user to extract information. If a customer tells a travel agent virtual assistant they want to fly to New York in early December, the assistant would need to ask them if they wanted JFK, Newark or La Guardia, as well as where they were flying from, and to tell them the price for the different flight options.
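To see how a scenario like that maps onto the intent recognition the Speech SDK exposes, here’s a hedged C# sketch: it assumes a LUIS app trained with a hypothetical ‘BookFlight’ intent and entities for destination and travel date, and the key, region and app ID are all placeholders.

// Recognise speech and pull out the caller's intent in one call, via a LUIS model.
using System;
using System.Threading.Tasks;
using Microsoft.CognitiveServices.Speech;
using Microsoft.CognitiveServices.Speech.Intent;

class FlightIntent
{
    static async Task Main()
    {
        // Placeholders: for intent recognition this is typically the LUIS resource's key and region.
        var config = SpeechConfig.FromSubscription("YOUR_LUIS_KEY", "YOUR_LUIS_REGION");

        using (var recognizer = new IntentRecognizer(config))
        {
            // Hypothetical LUIS app with a 'BookFlight' intent and destination/date entities.
            var model = LanguageUnderstandingModel.FromAppId("YOUR_LUIS_APP_ID");
            recognizer.AddIntent(model, "BookFlight", "book-flight");

            Console.WriteLine("Try: 'I want to fly to New York in early December'");
            var result = await recognizer.RecognizeOnceAsync();

            if (result.Reason == ResultReason.RecognizedIntent)
            {
                Console.WriteLine($"Heard: {result.Text}");
                Console.WriteLine($"Intent: {result.IntentId}");
                // The raw LUIS JSON response includes the entities (destination, date) it picked out.
                Console.WriteLine(result.Properties.GetProperty(
                    PropertyId.LanguageUnderstandingServiceResponse_JsonResult));
            }
            else
            {
                Console.WriteLine($"No intent recognised: {result.Reason}");
            }
        }
    }
}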
Currently the Azure Cognitive Services Text to Speech API can convert text to audio in multiple languages in close to real time, saving the audio as a file for later use. There are more than 75 voices to choose from in 49 languages and locales (like different variants of English for the US, UK and Australia), with male and female voices, and parameters developers can adjust to control speed, pitch, volume, pronunciation and extra pauses.
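Those adjustments are expressed as SSML markup posted to the Text to Speech REST endpoint. Here’s one way that might look in C#, as a sketch rather than a drop-in implementation: the subscription key and region are placeholders, and the voice name shown is an assumption – check the service’s voice list for the exact names available in your region.

// Turn a line of text into a WAV file, tweaking rate and pitch with SSML prosody markup.
using System.IO;
using System.Net.Http;
using System.Net.Http.Headers;
using System.Text;
using System.Threading.Tasks;

class SpeakWithProsody
{
    static async Task Main()
    {
        const string key = "YOUR_SUBSCRIPTION_KEY"; // placeholder
        const string region = "westus";             // placeholder: your resource's region

        using (var http = new HttpClient())
        {
            // Exchange the subscription key for a short-lived access token.
            http.DefaultRequestHeaders.Add("Ocp-Apim-Subscription-Key", key);
            var tokenResponse = await http.PostAsync(
                $"https://{region}.api.cognitive.microsoft.com/sts/v1.0/issueToken", null);
            tokenResponse.EnsureSuccessStatusCode();
            var token = await tokenResponse.Content.ReadAsStringAsync();

            // SSML picks the voice and adjusts speaking rate and pitch; volume, pronunciation
            // and extra pauses can be controlled the same way.
            var ssml = @"<speak version='1.0' xml:lang='en-US'>
  <voice xml:lang='en-US' xml:gender='Female'
         name='Microsoft Server Speech Text to Speech Voice (en-US, JessaRUS)'>
    <prosody rate='-10%' pitch='+2st'>Your flight to New York departs at seven thirty.</prosody>
  </voice>
</speak>";

            var request = new HttpRequestMessage(HttpMethod.Post,
                $"https://{region}.tts.speech.microsoft.com/cognitiveservices/v1")
            {
                Content = new StringContent(ssml, Encoding.UTF8, "application/ssml+xml")
            };
            request.Headers.Authorization = new AuthenticationHeaderValue("Bearer", token);
            request.Headers.Add("X-Microsoft-OutputFormat", "riff-24khz-16bit-mono-pcm");
            request.Headers.Add("User-Agent", "TtsSample");

            var response = await http.SendAsync(request);
            response.EnsureSuccessStatusCode();

            // Save the generated audio as a file for later use.
            File.WriteAllBytes("output.wav", await response.Content.ReadAsByteArrayAsync());
        }
    }
}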
Microsoft’s new deep learning text to speech system, demonstrated at the Ignite conference recently, will support the same 49 languages and customisation options for developers who want to build their own voices, but while it’s in private preview it has just two pre-built voices in English, Jessa and Guy.
The problem with computer-generated speech is that it can be tiring to listen to, because it just doesn’t sound right. It’s acceptable for short utterances from a virtual assistant telling you the weather forecast or confirming the timer you just set, but it’s often not natural or engaging enough for something you listen to for a long time, like an audio book, because there just isn’t enough expression to make it easy to listen to. Voice navigation systems would be clearer and easier to understand if the directions sounded less like a computer, too.
It’s a hard problem. Human-like speech has to get the phonetics right, so each phoneme, syllable, word and phrase is pronounced correctly and articulated clearly. But to avoid sounding monotonous and robotic, it’s just as important to get the patterns of stress and intonation – known as prosody – right: putting the stress on the right syllable in a word, making different syllables the right length and placing pauses in the right places as the different parts that make up speech are synthesised into a computer voice.
Most text to speech systems split that process into several separate steps. Analysing the text in conjunction with a linguistic data model is one step, followed by a separate step of predicting the correct prosody with a different acoustic model; the output of those two steps is fed into a vocoder as the different units of speech are selected from a standard inventory of recorded speech segments and joined together, which can cause glitches and discontinuities. The models involved can also over-smooth the differences between sounds, making the generated speech sound muffled or buzzy rather than clear and expressive.
The recent improvements in speech recognition and translation have come from using deep neural networks, and that’s the approach the new neural text to speech API takes.
Neural text to speech combines the stages of synthesising the voice and putting the stresses in the right places in the words, so pronunciation, prosody and the generation of high-quality audio are optimised together. Instead of using multiple models at different stages, it learns from large data sets of speech from many different speakers and uses that end-to-end machine learning model for both the neural network acoustic generator, which predicts what the prosody of the speech should be, and the neural network vocoder, which takes that prosody and generates the speech.
That produces a more natural voice that Microsoft describes as nearly indistinguishable from recorded human voices. You can compare the voices yourself in these audio files for three different sentences:
“The third type, a logarithm of the unsigned fold change, is undoubtedly the most tractable.”
“As the name suggests, the original submarines came from Yugoslavia.”
“This is easy enough if you have an unfinished attic directly above the bathroom.”
If you want to do real-time speech generation, Azure offers streaming speech served from Azure Kubernetes Service; that lets you scale it out as necessary for your workloads, and you can call both the new neural text to speech and the traditional text to speech APIs from the same endpoint if you need to cover more languages.
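In practice, choosing between a traditional voice and the neural preview should come down to the voice name in the SSML you post to that endpoint. The fragment below extends the earlier sketch; the ‘JessaNeural’ identifier is an assumption based on the preview naming, so check the voice list for the exact string and for the regions where the preview is served.

// Same request, different voice: swap the voice name to ask for the neural preview voice.
// 'JessaNeural' is an assumed identifier – confirm it against the service's voice list.
var neuralSsml = @"<speak version='1.0' xml:lang='en-US'>
  <voice xml:lang='en-US' name='Microsoft Server Speech Text to Speech Voice (en-US, JessaNeural)'>
    In two hundred metres, turn left onto the A38.
  </voice>
</speak>";
// Post this to the same https://{region}.tts.speech.microsoft.com/cognitiveservices/v1 endpoint,
// with the same headers, as in the earlier text to speech sketch.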
The private preview of neural text to speech is currently available by application; fill out the form at http://aka.ms/neuralttsintro to explain how you plan to use the service.
Contact Grey Matter on +44 (0)1364 654100 to discuss Azure and Cognitive Services, or if you require technical advice.
Mary Branscombe
Mary Branscombe is a freelance tech journalist. Mary has been a technology writer for nearly two decades, covering everything from early versions of Windows and Office to the first smartphones, the arrival of the web and most things in between.