Gemini 2.5 Pro for Audio Transcription

In this video, I go through using the new Gemini 2.5 Pro for audio transcription and audio analysis tasks and show you how to get the best results out of it.
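A minimal sketch of the kind of call covered in the video, assuming the google-generativeai Python SDK; the model name, file path, and prompt wording are illustrative:

```python
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")

# Upload the audio via the Files API (MP3, WAV, AAC, FLAC, OGG, etc.)
audio_file = genai.upload_file(path="podcast_episode.mp3")

model = genai.GenerativeModel("gemini-2.5-pro-exp-03-25")

# Ask for a timestamped transcript
response = model.generate_content([
    "Transcribe this audio. Start each paragraph with a MM:SS timestamp.",
    audio_file,
])
print(response.text)
```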

For more tutorials on using LLMs and building agents, check out my Patreon

🕵️ Interested in building LLM Agents? Fill out the form below

👨‍💻Github:

⏱️Time Stamps:
00:00 Intro
00:19 Gemini 2.5 Pro Experimental Blog
01:03 Gemini 2.5 Pro Capabilities
01:27 Output Tokens
02:01 Pricing
02:30 Supported Audio Formats
02:43 Technical Details About Audio
05:25 Demo (Colab)
06:43 Audio Diarization Process
Comments

Btw I am looking for a better place to download podcasts. Does anyone have any good suggestions for sites with links to MP3s etc?

samwitteveenai

G2.5 is incredible. It managed to process a 10-minute exchange between a helpdesk technician and an aggressive customer, analyze the operator's stress management, conflict resolution strategies, and procedural compliance — all in just 30 seconds. The potential applications are huge. Thank you for sharing your insights.

brunomineo

It's incredible what leaps and bounds this industry is making. I was just about to head down the Whisper+PyAnnote route when your video appeared. Thanks for putting this forward as another option.

kenchang

I can't be the only one who still occasionally gets goosebumps when interacting with these models.

Lol, I'm old enough to remember when "subscribing to a podcast" inherently meant downloading an MP3 from an RSS feed 😂

thenoblerot

Gemini 2.5's multilingual aspect is criminally underused. Last weekend, it transcribed an acoustic South Indian song I recorded (perfect pronunciation!), googled the name, fetched lyrics, transcribed a chord chart (this part wasn't great, but it tried), and coached me on recording/mixing in Ableton. It felt like pair programming for music production: not perfect, but a genuine step change for audio, like GPT-4 was for text.

One could imagine its use for multilingual sentiment analysis in customer service calls (an area where RLHF could scale given the volume of feedback & complaints). Its multilingual nature also makes it valuable in developing countries for review/feedback (both for current human employees and for training AI models). Insanely underrated stuff.

TheVistastube

Works like a charm. Thanks for the inspiration.
I used a free API key and it worked on a 30-minute conversation, but failed on a 1-hour conversation (response = None). It did work directly in Google AI Studio, though, with the same 1-hour audio file and prompt. All this for free. Pretty cool for personal stuff.
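A simple guard for that failure mode; a sketch assuming the google-generativeai SDK, with illustrative retry/backoff numbers:

```python
import time

def transcribe_with_retry(model, prompt, audio_file, retries=3):
    """Retry when a long file comes back with no usable response."""
    for attempt in range(retries):
        response = model.generate_content([prompt, audio_file])
        if response is not None and response.candidates:
            return response.text
        # Long inputs sometimes come back empty; back off and try again
        time.sleep(30 * (attempt + 1))
    raise RuntimeError("No transcript returned after retries")
```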

ClimateDS

FYI, YouTube provides "chapters", which is a cheap way to divide multi-topic content like a podcast into chunks, and you can fetch them programmatically. Happy to chat more about this, Sam. I'll reach out.
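One way to pull those chapters programmatically; a sketch assuming yt-dlp, with a placeholder video URL:

```python
import yt_dlp

# Fetch metadata only, no download; 'chapters' is a list of
# {'start_time', 'end_time', 'title'} dicts when the video defines them
with yt_dlp.YoutubeDL({"quiet": True}) as ydl:
    info = ydl.extract_info("https://www.youtube.com/watch?v=VIDEO_ID",
                            download=False)

for ch in info.get("chapters") or []:
    print(f"{ch['start_time']:7.1f}s - {ch['end_time']:7.1f}s  {ch['title']}")
```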

mz

The example transcript in the notebook ends at 59:57. What happens to the timestamps after that? I've tried it with audio files that are 1 hour 30 minutes long, and the timestamps after 59:59 wrap back to 00:00, which messes up the transcript/timestamps from that point on.
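One post-processing workaround; a sketch that assumes MM:SS timestamps in the transcript and bumps an hour counter whenever a timestamp jumps backwards:

```python
import re

def unwrap_timestamps(transcript: str) -> str:
    """Promote MM:SS stamps to H:MM:SS once they wrap past 59:59."""
    stamp = re.compile(r"\b([0-5]?\d):([0-5]\d)\b")
    hours, last = 0, -1
    fixed = []
    for line in transcript.splitlines():
        m = stamp.search(line)
        if m:
            secs = int(m.group(1)) * 60 + int(m.group(2))
            if secs < last:  # timestamp went backwards => wrapped past the hour
                hours += 1
            last = secs
            if hours:
                line = stamp.sub(
                    f"{hours}:{m.group(1).zfill(2)}:{m.group(2)}", line, count=1)
        fixed.append(line)
    return "\n".join(fixed)
```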

charlesk

My biggest takeaway: I'm not the only one using an LLM to parse this very podcast and trying to apply it in business.

mz

2.0 Flash Thinking Experimental also has 64k output tokens, can transcribe just as well, and has a rate limit of 1,500/day vs 25/day for 2.5 Pro. It's also free if you use the experimental version.
But the total usable context on 2.0 is around 80k tokens per prompt (how much it can focus on per session), whereas 2.5 can handle the entire 1M-token context each time. That isn't needed for transcribing, though, since you'll never output anywhere near 64k anyway: figure roughly 80k in, about 20k out per session...
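If you do go the chunked route, the splitting step could look like this; a sketch assuming pydub (and ffmpeg) is installed, with an illustrative 30-minute chunk size:

```python
from pydub import AudioSegment

# Split a long recording into ~30-minute chunks that each fit
# comfortably in a single transcription request
audio = AudioSegment.from_file("long_podcast.mp3")
chunk_ms = 30 * 60 * 1000  # pydub slices by milliseconds

for i in range(0, len(audio), chunk_ms):
    audio[i:i + chunk_ms].export(f"chunk_{i // chunk_ms:02d}.mp3", format="mp3")
```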

eugenes

Instead of doing the second code step, just include those instructions in the system prompt for the diarizer. It will split up the timestamps however you ask, including by speaker. You can even ask for separate transcripts per speaker. Remember, you're asking a multimodal model to do this: it's not transcribing and then guessing at what's what, it's actually pulling that information directly out of the audio in different ways. So you can ask "pull out all the jokes", "tell me when I said something stupid", "how does my tone sound"... It's wild.
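For example, something like this in a single call; a sketch reusing the model and uploaded audio_file from the snippet in the description, with illustrative prompt wording:

```python
prompt = """Transcribe this audio with speaker diarization.
Label each turn as Speaker A, Speaker B, etc., prefixed with a
MM:SS timestamp. After the transcript, list any jokes with
their timestamps."""

response = model.generate_content([prompt, audio_file])
print(response.text)
```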

eugenes

You are awesome, bro. Thank you very much, it was very helpful.

AliAsad-pq

Great discussion, thanks!
You mentioned other transcription services, and I think you have covered libraries like RealtimeSTT and speech_recognizer (I'll have to re-watch those now), but I'm just curious how you think they do versus Gemini 2.5 on audio. I like things free and local when I can, but I'm still looking for the best quality output.

MojaveHigh

I transcribe with Deepgram Nova 3. It's killer; a bit more expensive than most things, but still hella cheap.

juliovac

As of today the model is paid, and I'm quite confused about it, since I already have chats going but no billing linked to my account. Should I keep using it, or just hold off and wait another week for a free model?

muhahahahahahayoufool

Does anybody know how well Gemini 2.5 fares at transcribing audio with excessive background noise, such as wind or road noise, or poor microphone quality, versus OpenAI's Whisper and AssemblyAI?

markplutowski

What is the best local model for podcast transcription and diarization? I'd love to get good transcripts of my podcast but have way too many episodes to use a paid API.
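The usual local baseline is the Whisper + PyAnnote route mentioned in an earlier comment; a minimal sketch, assuming both packages are installed and a Hugging Face token for the gated pyannote checkpoint:

```python
import whisper
from pyannote.audio import Pipeline

# Local transcription with Whisper (segment timestamps included)
asr = whisper.load_model("medium")
result = asr.transcribe("episode.wav")

# Local diarization with pyannote (gated model: needs an HF token)
diarizer = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1", use_auth_token="HF_TOKEN")
diarization = diarizer("episode.wav")

# Naive merge: give each Whisper segment the speaker active at its midpoint
for seg in result["segments"]:
    mid = (seg["start"] + seg["end"]) / 2
    speaker = next(
        (spk for turn, _, spk in diarization.itertracks(yield_label=True)
         if turn.start <= mid <= turn.end),
        "unknown")
    print(f"[{seg['start']:7.1f}s] {speaker}: {seg['text'].strip()}")
```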

ComicBookPage

You could do better than transcription only: tell Gemini to discuss the audio directly, and it will generate an article from the audio alone.
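For instance, a sketch reusing the model and uploaded file from the snippet in the description (prompt wording is illustrative):

```python
response = model.generate_content([
    "Listen to this audio and write a short article covering the discussion, "
    "with a headline and section subheadings. Do not output a transcript.",
    audio_file,
])
print(response.text)
```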

sonGOKU-gyrg

Are the timestamps good? By how much are they out, please?

elawchess

How does it compare to the 4o transcription that OpenAI announced a few days ago?

dus