Gemini 2.5 Pro for Audio Transcription

In this video, I go through using the new Gemini 2.5 Pro for audio transcription and audio analysis tasks and show you how to get the best results out of it.
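A minimal sketch of the kind of call covered in the video, assuming the google-generativeai Python SDK; the model name, file path, and prompt wording are illustrative:

```python
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")

# Upload the audio via the Files API (MP3, WAV, AAC, FLAC, OGG, etc.)
audio_file = genai.upload_file(path="podcast_episode.mp3")

model = genai.GenerativeModel("gemini-2.5-pro-exp-03-25")

# Ask for a timestamped transcript
response = model.generate_content([
    "Transcribe this audio. Start each paragraph with a MM:SS timestamp.",
    audio_file,
])
print(response.text)
```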

For more tutorials on using LLMs and building agents, check out my Patreon

🕵️ Interested in building LLM Agents? Fill out the form below

👨‍💻Github:

⏱️Time Stamps:
00:00 Intro
00:19 Gemini 2.5 Pro Experimental Blog
01:03 Gemini 2.5 Pro Capabilities
01:27 Output Tokens
02:01 Pricing
02:30 Supported Audio Formats
02:43 Technical Details About Audio
05:25 Demo (Colab)
06:43 Audio Diarization Process
Comments

Btw I am looking for a better place to download podcasts. Does anyone have any good suggestions for sites with links to MP3s etc?

samwitteveenai

G2.5 is incredible. It managed to process a 10-minute exchange between a helpdesk technician and an aggressive customer, analyze the operator's stress management, conflict resolution strategies, and procedural compliance — all in just 30 seconds. The potential applications are huge. Thank you for sharing your insights.

brunomineo

It's incredible what leaps and bounds this industry is making. I was just about to head down the Whisper+PyAnnote route when your video appeared. Thanks for putting this forward as another option.

kenchang

I can't be the only one who still occasionally gets goosebumps when interacting with these models.

Lol, I'm old enough to remember when "subscribing to a podcast" inherently meant downloading an MP3 from an RSS feed 😂

thenoblerot

Gemini 2.5's multilingual aspect is criminally underused. Last weekend, it transcribed an acoustic South Indian song I recorded (perfect pronunciation!), googled the name, fetched lyrics, transcribed a chord chart (this part wasn't great, but it tried), and coached me on recording/mixing in Ableton. It felt like pair programming for music production: not perfect, but a genuine step change for audio, like GPT-4 was for text.

One could imagine its use for multilingual sentiment analysis in customer service calls (an area where RLHF could scale given the volume of feedback & complaints). Its multilingual nature also makes it valuable in developing countries for review/feedback (both for current human employees and for training AI models). Insanely underrated stuff.

TheVistastube

Works like a charm. Thanks for the inspiration.
I used a free API key and it worked on a 30-minute conversation, but failed on a 1-hour conversation (response = None). It did work directly in Google AI Studio, though, with the same 1-hour audio file and prompt. All this for free. Pretty cool for personal stuff.
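A simple guard for that failure mode; a sketch assuming the google-generativeai SDK, with illustrative retry/backoff numbers:

```python
import time

def transcribe_with_retry(model, prompt, audio_file, retries=3):
    """Retry when a long file comes back with no usable response."""
    for attempt in range(retries):
        response = model.generate_content([prompt, audio_file])
        if response is not None and response.candidates:
            return response.text
        # Long inputs sometimes come back empty; back off and try again
        time.sleep(30 * (attempt + 1))
    raise RuntimeError("No transcript returned after retries")
```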

ClimateDS

FYI, YouTube provides "chapters", which is a cheap way to divide multi-topic content like a podcast into chunks, and you can fetch them programmatically. Happy to chat more about this, Sam. I'll reach out.
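One way to pull those chapters programmatically; a sketch assuming yt-dlp, with a placeholder video URL:

```python
import yt_dlp

# Fetch metadata only, no download; 'chapters' is a list of
# {'start_time', 'end_time', 'title'} dicts when the video defines them
with yt_dlp.YoutubeDL({"quiet": True}) as ydl:
    info = ydl.extract_info("https://www.youtube.com/watch?v=VIDEO_ID",
                            download=False)

for ch in info.get("chapters") or []:
    print(f"{ch['start_time']:7.1f}s - {ch['end_time']:7.1f}s  {ch['title']}")
```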

mz

The example transcript in the notebook ends at 59:57. What happens to the timestamps after that? I've tried it with audio files that are 1 hour 30 minutes long, and the timestamps after 59:59 wrap back to 00:00, which messes up the transcript/timestamps from that point on.
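One post-processing workaround; a sketch that assumes MM:SS timestamps in the transcript and bumps an hour counter whenever a timestamp jumps backwards:

```python
import re

def unwrap_timestamps(transcript: str) -> str:
    """Promote MM:SS stamps to H:MM:SS once they wrap past 59:59."""
    stamp = re.compile(r"\b([0-5]?\d):([0-5]\d)\b")
    hours, last = 0, -1
    fixed = []
    for line in transcript.splitlines():
        m = stamp.search(line)
        if m:
            secs = int(m.group(1)) * 60 + int(m.group(2))
            if secs < last:  # timestamp went backwards => wrapped past the hour
                hours += 1
            last = secs
            if hours:
                line = stamp.sub(
                    f"{hours}:{m.group(1).zfill(2)}:{m.group(2)}", line, count=1)
        fixed.append(line)
    return "\n".join(fixed)
```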

charlesk

My biggest takeaway: I'm not the only one using an LLM to parse this very podcast and trying to apply it in business.

mz

2.0 Flash Thinking Experimental also has 64k output tokens, can transcribe just as well, and has a rate limit of 1,500/day vs 25/day for 2.5 Pro. It's also free if you use the experimental version.
But the total usable context on 2.0 is around 80k tokens per prompt (how much it can focus on per session), whereas 2.5 can handle the entire 1M-token context each time. That isn't needed for transcribing, though, since you'll never output anywhere near 64k anyway: figure roughly 80k in, about 20k out per session...
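If you do go the chunked route, the splitting step could look like this; a sketch assuming pydub (and ffmpeg) is installed, with an illustrative 30-minute chunk size:

```python
from pydub import AudioSegment

# Split a long recording into ~30-minute chunks that each fit
# comfortably in a single transcription request
audio = AudioSegment.from_file("long_podcast.mp3")
chunk_ms = 30 * 60 * 1000  # pydub slices by milliseconds

for i in range(0, len(audio), chunk_ms):
    audio[i:i + chunk_ms].export(f"chunk_{i // chunk_ms:02d}.mp3", format="mp3")
```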

eugenes

Instead of doing the second code step, just include those instructions in the system prompt for the diarizer. It will split up the timestamps however you ask, including by speaker. You can even ask for separate transcripts per speaker. Remember, you're asking a multimodal model to do this: it's not transcribing and then guessing at what's what, it's actually pulling that information directly out of the audio in different ways. So you can ask "pull out all the jokes", "tell me when I said something stupid", "how does my tone sound"... It's wild.
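For example, something like this in a single call; a sketch reusing the model and uploaded audio_file from the snippet in the description, with illustrative prompt wording:

```python
prompt = """Transcribe this audio with speaker diarization.
Label each turn as Speaker A, Speaker B, etc., prefixed with a
MM:SS timestamp. After the transcript, list any jokes with
their timestamps."""

response = model.generate_content([prompt, audio_file])
print(response.text)
```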

eugenes

You are awesome, bro. Thank you very much, it was very helpful.

AliAsad-pq

Great discussion, thanks!
You mentioned other transcription services, and I think you have covered libraries like RealtimeSTT and speech_recognizer (I'll have to re-watch those now), but I'm just curious how you think they do versus Gemini 2.5 on audio. I like things free and local when I can, but I'm still looking for the best quality output.

MojaveHigh

I transcribe with Deepgram Nova 3. It's killer; a bit more expensive than most things, but still hella cheap.

juliovac

As of today the model is paid, and I'm quite confused about it, since I already have chats going but no billing linked to my account. Should I keep using it, or just hold off and wait another week for a free model?

muhahahahahahayoufool

Does anybody know how well Gemini 2.5 fares at transcribing audio with excessive background noise, such as wind or road noise, or poor microphone quality, versus OpenAI's Whisper and AssemblyAI?

markplutowski

What is the best local model for podcast transcription and diarization? I'd love to get good transcripts of my podcast but have way too many episodes to use a paid API.
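The usual local baseline is the Whisper + PyAnnote route mentioned in an earlier comment; a minimal sketch, assuming both packages are installed and a Hugging Face token for the gated pyannote checkpoint:

```python
import whisper
from pyannote.audio import Pipeline

# Local transcription with Whisper (segment timestamps included)
asr = whisper.load_model("medium")
result = asr.transcribe("episode.wav")

# Local diarization with pyannote (gated model: needs an HF token)
diarizer = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1", use_auth_token="HF_TOKEN")
diarization = diarizer("episode.wav")

# Naive merge: give each Whisper segment the speaker active at its midpoint
for seg in result["segments"]:
    mid = (seg["start"] + seg["end"]) / 2
    speaker = next(
        (spk for turn, _, spk in diarization.itertracks(yield_label=True)
         if turn.start <= mid <= turn.end),
        "unknown")
    print(f"[{seg['start']:7.1f}s] {speaker}: {seg['text'].strip()}")
```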

ComicBookPage

You could do better than transcription only: tell Gemini to discuss the audio directly, and it will generate an article from the audio alone.
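For instance, a sketch reusing the model and uploaded file from the snippet in the description (prompt wording is illustrative):

```python
response = model.generate_content([
    "Listen to this audio and write a short article covering the discussion, "
    "with a headline and section subheadings. Do not output a transcript.",
    audio_file,
])
print(response.text)
```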

sonGOKU-gyrg

Are the timestamps good? By how much are they out, please?

elawchess

How does it compare to the 4o transcription that OpenAI announced a few days ago?

dus