Speaker Verification using Siamese Networks

Siamese networks are widely used in vision, but their application to speech is very limited. In this work we explore the speaker verification task using Siamese networks. A Siamese network is a metric learning approach: given two inputs, the network should predict whether they are the same or different. In our case the inputs are 512x300 spectrograms, and the network is a VGG-style CNN. It takes two spectrogram patches and predicts whether they come from the same speaker or from different speakers.

During training we use an L2 distance loss to minimize the distance for positive pairs and maximize the distance for negative pairs. Here a pair means two spectrogram patches, which can come from the same speaker (positive) or from different speakers (negative). When constructing the network we place a 1024-dimensional hidden layer before the loss layer to capture speaker-discriminative features. These features distinguish two speakers considerably better than MFCC features do.

When a new speaker wants to enroll, we collect a 2-minute recording of that speaker, split it into 3-second segments, and extract a 1024-dimensional speaker embedding for each segment. Averaging these embeddings gives a 1024-dimensional voice print for the speaker. During verification we compute the distance between the test speaker's embedding and the claimed speaker's voice print; if the distance is less than 0.3, we verify the speaker as a positive hit.
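The pairwise loss, enrollment, and verification steps described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the description only says "L2 distance loss", so the margin-based contrastive form and the `margin` value are assumptions; the 16 kHz sample rate in the usage note is an assumption; the 0.3 threshold is assumed to apply to a Euclidean distance; and all function names are hypothetical. The CNN itself is not modelled, so embeddings are treated as given vectors.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def l2_distance(a: Vector, b: Vector) -> float:
    """Euclidean (L2) distance between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def contrastive_loss(distance: float, same_speaker: bool, margin: float = 1.0) -> float:
    """Assumed loss form: pull positive pairs together and push negative
    pairs apart until they are at least `margin` away (margin is an assumption)."""
    if same_speaker:
        return distance ** 2
    return max(0.0, margin - distance) ** 2


def split_into_segments(num_samples: int, sample_rate: int,
                        segment_seconds: float = 3.0) -> List[Tuple[int, int]]:
    """(start, end) sample indices of consecutive full-length segments;
    a trailing partial segment is dropped."""
    seg = int(segment_seconds * sample_rate)
    return [(s, s + seg) for s in range(0, num_samples - seg + 1, seg)]


def enroll(segment_embeddings: List[Vector]) -> List[float]:
    """Average the per-segment embeddings into a single voice print."""
    if not segment_embeddings:
        raise ValueError("need at least one segment embedding")
    n = len(segment_embeddings)
    dim = len(segment_embeddings[0])
    return [sum(e[i] for e in segment_embeddings) / n for i in range(dim)]


def verify(test_embedding: Vector, voice_print: Vector, threshold: float = 0.3) -> bool:
    """Accept the identity claim when the distance is below the threshold."""
    return l2_distance(test_embedding, voice_print) < threshold
```

With these helpers, a 2-minute recording at an assumed 16 kHz gives `len(split_into_segments(120 * 16000, 16000))` equal to 40 three-second segments. Each segment would be passed through the trained network to obtain its 1024-dimensional embedding, and `enroll` would average those 40 embeddings into the voice print that `verify` compares against.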