Building the Next Generation of Conversational AI

Inside the Code: Ankit Kumar (Sesame) & Anjney Midha (a16z) on the Future of Voice AI

What goes into building a truly natural-sounding AI voice? In this episode, Sesame’s cofounder and CTO, Ankit Kumar, joins a16z’s Anjney Midha for a deep dive into the research and engineering behind their voice technology.

They discuss the technical challenges of real-time speech generation, the trade-offs in balancing personality with efficiency, and why the team is open-sourcing key components of their model. Ankit breaks down the complexities of multimodal AI, full-duplex conversation modeling, and the computational optimizations that enable low-latency interactions. They also explore the evolution of natural language as a user interface and its potential to redefine human-computer interaction.

Plus, we take audience questions on everything from scaling laws in speech synthesis to the role of in-context learning in making AI voices more expressive.

Key Takeaways:
- How Sesame achieves natural voice interactions through real-time speech generation.
- The impact of open-sourcing their speech model and what it means for AI research.
- The role of full-duplex modeling in improving AI responsiveness.
- How computational efficiency and system latency shape AI conversation quality.
- The growing role of natural language as a user interface in AI-driven experiences.

For anyone interested in AI and voice technology, this episode offers an in-depth look at the latest advancements pushing the boundaries of human-computer interaction.

Follow everyone on X:

Chapters:
0:00 - 00:51 | Intro
00:52 - 04:58 | Challenges Of Building
04:59 - 07:45 | Q + A: What Was Done To Bridge Transcription And Text Processing?
07:46 - 09:57 | How Is Sesame So Much Better Than Others?
09:58 - 12:42 | Challenges In Making AI Accessible To All
12:43 - 14:10 | Great Researchers Prioritize User Experience
14:11 - 15:47 | What Is Good Taste In ML?
15:48 - 17:45 | Problems That Can Be Solved That Add Value To The World
17:46 - 26:25 | Open Source Audio For Speech Generation
26:26 - 34:00 | Contextual Speech vs Text-to-Speech: Differences
34:01 - 35:50 | Value Proposition Of Glasses With No Friction
35:51 - 38:00 | General Purpose API vs Open Source Model
38:01 - 40:47 | Creating High Quality APIs
40:48 - 45:54 | Companions And How Sesame Will Handle Context Retention In Long Conversations
45:55 - 46:59 | Talent: What It Takes To Become A Part Of The Sesame Team
47:00 - 54:37 | How Scaling Laws For Speech Differ From Text
54:38 - 58:33 | How An Organic Conversation Can Be Preserved Using A Voice Companion
58:34 - 1:03:52 | App Building Technology: Roadmap
1:03:53 - 1:09:09 | Architectures and Transformers
1:09:10 - 1:15:56 | The Focus On Personality, And The Differences In Products
1:15:57 - 1:25:25 | New AI Interface: Interacting With AI Companion
1:25:26 - 1:26:56 | Companion Challenges
1:26:57 - 1:29:22 | Computing Interface Of The Future
1:29:23 - 1:31:45 | Focused Product Experience Built By Small Teams
1:31:46 - 1:36:13 | Join Sesame If You Want To Make A Consumer Product People Love
Comments

As the person who asked the question at 58:45, thank you for your elaborate answer. This tech is truly mind-blowing... a real breakthrough. I just hope that when the product is finally released you relax some of the guardrails that have been implemented recently. They take away from the naturalness of the conversation and reduce the "delightfulness" of the experience. It's important for people to forge their own relationship with their companion (within reason) that feels authentic and satisfying to them. People have very different needs and wants from using this kind of technology. It's important to accommodate a wide spectrum of human experience... and not be too predictive. I do feel encouraged after watching this interview. It seems like the devs are focusing on the right things. You have definitely crossed the uncanny valley (no small feat); now it's about creating a world on the other side. Classic AI statement: "this is the worst it's ever gonna be." That is a truly exciting prospect. The initial demo is already very impressive. I can't wait to see where things go from here. Keep up the great work!!

narottamzakheim

1:00:50 Having the model be able to identify non-intelligible sounds and discern certain noises that are not words, like coughing, will be an absolute game changer, because then they'll be able to interact with the more nonverbal components of the human experience. Can't wait to see where this goes.

verlax

This was a fascinating interview, and I listened to the end. But I was surprised that there was no mention of the potential problems of making such realistic, human-like companions available widely. Voices as realistic and convincing as the two in the demo could be used to trick and manipulate people, and some users would form emotional bonds with them that would interfere with their real-life relationships. Such problems already seem to be occurring with AI bots that are much less lifelike than Maya and Miles. What is Sesame’s position on those issues?

TomGally

I hope they open source the contextual speech model. That's the most valuable and interesting thing, imo.

TheChucky

Grateful to have a look into the development cycle. I wish the folks at Sesame would put out a weekly digest on what's going on in the demo and the full version. The revised version last week is a shadow of what it was, but I understand, for the sake of the demo. My impression is that there should be an age check for the demo, and perhaps a release that unlocks deeper layers with explicit understanding of the pitfalls users could fall into if they don't self-regulate. I don't believe these regulations need to come from the model.

AnthonyBackmanOffical

I haven’t seen many founders who are thinking about AI the same way I am, but Ankit is. My favorites are Ilya Sutskever, Andrej Karpathy, Elon, and now Ankit. Looking forward to hearing more from him.

Nova-Rift

Dang, that "no dev API" is a dagger to the heart

TylerKoz

Focus
Small, talent-dense team
No unwanted hype
Long-term roadmap
Clarity

Some of the early signs of a great product company. Hopefully, 5 years down the line, you guys will be quite big.

jatingupta

I think a video avatar of Maya/Miles while talking with them, having a video call with them, would be better than glasses.

ianr

Also be thinking of AI in games. Not just generating a voice, or a particular game persona's voice, but viseme curves to drive the face and mouth (morph targets/blend shapes). Much more than NVidia's Audio2Face. Multimodal, where the game assistant can "see" the game screen at 10 Hz, for example, and know where treasure is and the context of the game at a certain time, slowly guiding the player. And obviously doing an AI hand-off when another smart NPC interacts with the player. Oh, and robotics ;)

stevecoxiscool

The question I wanted to ask is why only in English? And will the open source part let me train in other languages?

GuiDouil

They are open sourcing the one part that doesn't matter. Got it. Whelp, there goes that hope for some awesome open source release from them. Sesame is now dead to me.

tropmonky

The sound of this video is weird... modified... not pleasant to listen to. It's either the recording equipment that was used or the editing. And no, it's not my computer; I've watched a few interviews on YouTube today and they all sound just fine... and they still do (just tested).
Compare the sound of this video with this one, for example: "DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters | Lex Fridman Podcast"

P.Mantis-nwxi

Bring back uncensored Maya, cowards. The corporate nun personality kinda stinks

Raulikien

Why did they call it csm-1b instead of tts-1b? And why did they open source it at all? It has no added value because such and better models have long been available as open source.

dasistdiewahrheit

She crams in a ton of metaphors and slang. My father would have trouble understanding all that

justinleemiller

I have a voice model that does better than StyleTTS and most other open source models, but it's not fast; it can take a couple of minutes. But it runs on consumer hardware. I fooled Snoop Dogg's wife with it! (for the Players Club)

devmentorlive

Why don’t y’all Indians go to India and help your country develop something lol.

jeopardyking