
Voice generation

The Voice Stack is improving rapidly. Systems that interact with users via speaking and listening will drive many new applications. Over the past year, I’ve been working closely with DeepLearning.AI, AI Fund, and several collaborators on voice-based applications, and I will share best practices I’ve learned in this and future posts.


Foundation models that are trained to directly input, and often also directly generate, audio have contributed to this growth, but they are only part of the story. OpenAI’s Realtime API makes it easy for developers to write prompts and build systems that deliver voice-in, voice-out experiences. This is great for building quick-and-dirty prototypes, and it also works well for low-stakes conversations where making an occasional mistake is okay. I encourage you to try it!


However, compared to text-based generation, it is still hard to control the output of voice-in, voice-out models. In contrast to directly generating audio, when we use an LLM to generate text, we have many tools for building guardrails, and we can double-check the output before showing it to users. We can also use sophisticated agentic reasoning workflows to compute high-quality outputs. Before a customer-service agent shows a user the message, “Sure, I’m happy to issue a refund,” we can make sure that (i) issuing the refund is consistent with our business policy and (ii) we will call the API to issue the refund (and not just promise a refund without issuing it).
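As a concrete (and deliberately simplified) illustration of that text-side checking, here is a minimal Python sketch. The helpers `refund_is_allowed` and `issue_refund` are hypothetical stand-ins for a business-policy check and a refund API, not anything from an actual system.

```python
# Hypothetical stand-ins for a business-policy check and a refund API call;
# neither comes from the original post.
def refund_is_allowed(order_id: str) -> bool:
    return order_id.startswith("OK-")          # placeholder policy check

def issue_refund(order_id: str) -> None:
    print(f"Refund issued for {order_id}")     # placeholder API call

def guarded_reply(order_id: str, draft_reply: str) -> str:
    """Check a draft LLM reply before showing it: only promise a refund
    if policy allows it, and only after actually issuing it."""
    if "refund" in draft_reply.lower():
        if not refund_is_allowed(order_id):    # (i) consistent with business policy?
            return "I'm sorry, this order isn't eligible for a refund."
        issue_refund(order_id)                 # (ii) issue it before promising it
    return draft_reply

print(guarded_reply("OK-1234", "Sure, I'm happy to issue a refund."))
```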


In contrast, the tools to prevent a voice-in, voice-out model from making such mistakes are much less mature.


In my experience, the reasoning capability of voice models also seems inferior to that of text-based models, and they give less sophisticated answers. (Perhaps this is because voice responses have to be briefer, leaving less room for chain-of-thought reasoning to get to a more thoughtful answer.)


When building applications where I need a high degree of control over the output, I use agentic workflows to reason at length about the user’s input. In voice applications, this means I end up using a pipeline that includes speech-to-text (STT, also known as ASR, or automatic speech recognition) to transcribe the user’s words, then processes the text using one or more LLM calls, and finally returns an audio response to the user via TTS (text-to-speech). This STT → LLM/Agentic workflow → TTS pipeline, where the reasoning is done in text, allows for more accurate responses.
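For readers who want to try this pattern, here is a minimal sketch of such a pipeline, assuming OpenAI’s Python SDK. The model names (`whisper-1`, `gpt-4o-mini`, `tts-1`) are illustrative placeholders, and any STT, LLM, and TTS providers could be swapped in.

```python
# A minimal STT -> LLM -> TTS turn, assuming OpenAI's Python SDK (pip install openai).
# Model names are illustrative; exact method names may differ across SDK versions.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def voice_turn(input_wav_path: str, output_mp3_path: str) -> str:
    # 1) STT: transcribe the user's audio into text
    with open(input_wav_path, "rb") as f:
        transcript = client.audio.transcriptions.create(model="whisper-1", file=f)

    # 2) LLM / agentic workflow: reason over the transcript in text
    completion = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "You are a concise, helpful voice assistant."},
            {"role": "user", "content": transcript.text},
        ],
    )
    reply_text = completion.choices[0].message.content

    # 3) TTS: turn the text reply back into audio for the user
    speech = client.audio.speech.create(model="tts-1", voice="alloy", input=reply_text)
    speech.stream_to_file(output_mp3_path)
    return reply_text
```

In a production system, step 2 would be replaced by a longer agentic workflow, which is exactly where the latency discussed next comes from.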


However, this process introduces latency, and users of voice applications are very sensitive to latency. When DeepLearning.AI worked with RealAvatar (an AI Fund portfolio company led by Jeff Daniels) to build an avatar of me, we found that getting TTS to generate a voice that sounded like me was not very hard, but getting it to respond to questions using words similar to those I would choose was much harder. Even after a year of tuning our system — starting with iterating on multiple, long, mega-prompts and eventually developing complex agentic workflows — it remains a work in progress. You can play with it at http://deeplearning.ai/avatar.


Initially, this agentic workflow incurred 5-9 seconds of latency, and having users wait that long for responses led to a bad experience. To address this, we came up with the following latency reduction technique. The system quickly generates a pre-response (short for preliminary response) that can be uttered quickly, which buys time for an agentic workflow to generate a more thoughtful, full response. (We’re grateful to LiveKit’s CEO Russ d’Sa and team for helping us get this working.) This is similar to how, if you were to ask me a complicated question, I might say “Hmm, let me think about that” or “Sure, I can help with that” — that’s the pre-response — while thinking about what my full response might be.
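Here is a minimal asyncio sketch of the idea, not RealAvatar’s actual implementation: the slow agentic workflow starts immediately in the background while a quick pre-response is spoken, and the full answer is spoken once it is ready. `speak` and `agentic_full_response` are hypothetical stand-ins for TTS playback and the agentic workflow.

```python
# A toy illustration of the pre-response technique using asyncio.
import asyncio

async def speak(text: str) -> None:
    print(f"[TTS] {text}")           # placeholder: stream this text to your TTS engine

async def agentic_full_response(user_query: str) -> str:
    await asyncio.sleep(5)           # placeholder: several seconds of agentic reasoning
    return f"Here's a considered answer to: {user_query}"

async def handle_turn(user_query: str) -> None:
    # Kick off the slow, thoughtful response immediately in the background...
    full_task = asyncio.create_task(agentic_full_response(user_query))
    # ...and utter a quick pre-response right away to mask the latency.
    await speak("Hmm, let me think about that.")
    # When the agentic workflow finishes, speak the full response.
    await speak(await full_task)

asyncio.run(handle_turn("Can you explain how transformers work?"))
```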


I think generating a pre-response followed by a full response, to quickly acknowledge the user’s query and also reduce the perceived latency, will be an important technique, and I hope many teams will find this useful. Our goal was to approach human face-to-face conversational latency, which is around 0.3-1 seconds. RealAvatar and DeepLearning.AI, through our efforts on the pre-response and other optimizations, have reduced the system’s latency to around 0.5-1 seconds.


Months ago, sitting in a coffee shop, I was able to buy a phone number on Twilio and hook it up to an STT → LLM → TTS pipeline in just hours. This enabled me to talk to my own LLM using custom prompts. Prototyping voice applications is much easier than most people realize!
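A rough sketch of that kind of prototype might look like the following, assuming a Twilio phone number whose voice webhook points at a small Flask app. Twilio’s built-in speech recognition (`<Gather input="speech">`) handles STT and `<Say>` handles TTS; `ask_llm` is a hypothetical placeholder for the LLM call with your custom prompt.

```python
# A minimal phone-call prototype: point a Twilio number's voice webhook at /voice.
from flask import Flask, request
from twilio.twiml.voice_response import VoiceResponse, Gather

app = Flask(__name__)

def ask_llm(user_text: str) -> str:
    return f"You said: {user_text}"     # placeholder: call your LLM with a custom prompt here

@app.route("/voice", methods=["POST"])
def voice():
    # Greet the caller and listen for speech (Twilio does the STT).
    resp = VoiceResponse()
    gather = Gather(input="speech", action="/reply", method="POST")
    gather.say("Hi! What would you like to ask?")
    resp.append(gather)
    return str(resp)

@app.route("/reply", methods=["POST"])
def reply():
    # Twilio posts the transcript as SpeechResult; answer with <Say> (Twilio does the TTS).
    user_text = request.form.get("SpeechResult", "")
    resp = VoiceResponse()
    resp.say(ask_llm(user_text))
    resp.redirect("/voice")             # loop back and listen for the next question
    return str(resp)

if __name__ == "__main__":
    app.run(port=5000)
```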


Building reliable, scaled production applications takes longer, of course, but if you have a voice application in mind, I hope you’ll start building prototypes and see how far you can get! I’ll keep building voice applications and sharing best practices and voice-related technology trends in future posts.


[Original letter: https://www.deeplearning.ai/the-batch/issue-290/]

By Andrew

