Speaker Verification using Siamese Networks

Siamese networks are widely used in vision, but their application in speech is still limited. In this work we explore the speaker verification task using Siamese networks. A Siamese network is a metric learning approach: given two inputs, the network predicts whether they are the same or different. In our case the inputs are 512x300-dimensional spectrograms, and the network is a VGG-style CNN. The Siamese network takes two spectrogram patches and predicts whether they come from the same speaker or from different speakers.

During training we use an L2 distance loss that minimizes the distance for positive pairs and maximizes it for negative pairs, where a pair is two spectrogram patches that may come from the same or from different speakers. The network places a 1024-dimensional hidden layer before the loss layer to capture speaker-discriminative features; these features separate two speakers much better than MFCCs.

When a new speaker enrolls, we collect about 2 minutes of recordings, split them into 3-second segments, extract the 1024-dimensional speaker embedding for each segment, and average these embeddings to obtain a 1024-dimensional voice print for that speaker. During verification we compare the distance between the test speaker's embedding and the claimed speaker's voice print; if the distance is less than 0.3, we accept the speaker as a positive hit.
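The following PyTorch sketch illustrates the pipeline described above: a VGG-style encoder producing a 1024-dimensional embedding, an L2 contrastive loss over pairs, enrollment by averaging segment embeddings, and verification against the 0.3 distance threshold. The exact layer configuration, the contrastive margin, and the helper names (SpeakerEncoder, contrastive_loss, enroll, verify) are illustrative assumptions, not the author's released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SpeakerEncoder(nn.Module):
    """VGG-style CNN mapping a 512x300 spectrogram patch to a
    1024-dimensional speaker embedding (exact layers are assumed)."""
    def __init__(self, emb_dim=1024):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(128, 256, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(256, 512, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),          # -> (B, 512, 1, 1)
        )
        self.embedding = nn.Linear(512, emb_dim)

    def forward(self, x):                     # x: (B, 1, 512, 300)
        h = self.features(x).flatten(1)       # (B, 512)
        return self.embedding(h)              # (B, 1024)


def contrastive_loss(emb_a, emb_b, same_speaker, margin=1.0):
    """L2-distance loss: pull positive pairs together, push negative
    pairs apart up to `margin` (the margin value is an assumption)."""
    dist = F.pairwise_distance(emb_a, emb_b)
    pos = same_speaker * dist.pow(2)
    neg = (1.0 - same_speaker) * F.relu(margin - dist).pow(2)
    return (pos + neg).mean()


@torch.no_grad()
def enroll(model, segments):
    """Average the embeddings of the 3-second segments cut from the
    ~2-minute enrollment recording to form one 1024-dim voice print."""
    embs = torch.stack([model(s.unsqueeze(0)).squeeze(0) for s in segments])
    return embs.mean(dim=0)


@torch.no_grad()
def verify(model, test_segment, voice_print, threshold=0.3):
    """Accept the claimed identity if the distance between the test
    embedding and the stored voice print is below 0.3."""
    emb = model(test_segment.unsqueeze(0)).squeeze(0)
    return torch.dist(emb, voice_print).item() < threshold
```

In this sketch, both branches of the Siamese network share the same SpeakerEncoder weights: during training, each spectrogram patch of a pair is passed through the one encoder and the two embeddings are fed to contrastive_loss with a 0/1 label for same or different speaker.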
Comments

May I know your evaluation metric? For example, test accuracy, or precision and recall?

bellalie

Can you share a link to your code, please?

adityanandgaokar

Hi Krishna,
I want to know how to tackle the problem of different-length audio files.
I mean, the 512x300 mel spectrogram corresponds to fixed-length audio.
But how do you handle variable-length audio?

namangupta