Context: We are building a biometric authentication application that uses voice to verify who the speaker is (against the voices in our database). We are using as the main programming language.

What would be the best approach?

I know that MFCCs would be the right input data. The question now is which method would be better.

So far, we’ve considered two approaches. The first is to use GMMs and reduce the dimensionality of the GMM mean supervector (built from the MFCC features) with the i-vector approach. However, no decent libraries seem to be available for that.
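To make the GMM part concrete, here is a minimal sketch of the classic GMM scoring step, written in Python purely as an assumption about the language, using only the standard library. It computes the average per-frame log-likelihood of a set of feature frames under a diagonal-covariance GMM, and a log-likelihood ratio between a speaker model and a background model. The function names, the toy 2-dimensional "frames", and the hand-written model parameters are all hypothetical; real MFCC frames would come from a feature extractor and the parameters from training (for example with EM), neither of which is shown here.

```python
import math

def diag_gmm_log_likelihood(frames, weights, means, variances):
    """Average per-frame log-likelihood under a diagonal-covariance GMM."""
    total = 0.0
    for x in frames:
        # log(weight_k * N(x | mean_k, diag(var_k))) for each component k
        component_logs = []
        for w, mu, var in zip(weights, means, variances):
            log_norm = -0.5 * sum(math.log(2.0 * math.pi * v) for v in var)
            log_exp = -0.5 * sum((xi - mi) ** 2 / v
                                 for xi, mi, v in zip(x, mu, var))
            component_logs.append(math.log(w) + log_norm + log_exp)
        # log-sum-exp keeps the sum over components numerically stable
        peak = max(component_logs)
        total += peak + math.log(sum(math.exp(c - peak) for c in component_logs))
    return total / len(frames)

def llr_score(frames, speaker_gmm, background_gmm):
    """Log-likelihood ratio: positive values favour the claimed speaker."""
    return (diag_gmm_log_likelihood(frames, *speaker_gmm)
            - diag_gmm_log_likelihood(frames, *background_gmm))

# Toy, hand-written models (hypothetical values): (weights, means, variances)
speaker_gmm = ([0.6, 0.4], [[0.0, 0.0], [1.0, 1.0]], [[1.0, 1.0], [0.5, 0.5]])
background_gmm = ([1.0], [[5.0, 5.0]], [[1.0, 1.0]])
frames = [[0.1, -0.2], [0.0, 0.3], [0.9, 1.1]]  # stand-ins for MFCC frames

print(llr_score(frames, speaker_gmm, background_gmm))
```

A verification decision would then compare the score against a threshold tuned on held-out data. This sketch covers only scoring, not the i-vector extraction itself.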

The second approach would be to use a CNN-based model, feeding the MFCC coefficients in as input, training the network on them, and recognizing the speaker from its output. Before we start experimenting, we wanted to know whether anyone has tried this, and which approach you found to work best.
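Whichever model ends up producing a fixed-length speaker representation (an i-vector or an embedding taken from a CNN), the verification step is commonly a similarity comparison between the enrolled vector and the test vector. Below is a minimal sketch of cosine scoring, again in Python as an assumed language and using only the standard library. The function names, the example vectors, and the threshold of 0.7 are hypothetical; a real threshold would have to be tuned on held-out trials.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def verify(enrolled, test, threshold=0.7):
    """Accept the identity claim if the two vectors are similar enough.

    The default threshold is a placeholder, not a recommended value.
    """
    return cosine_similarity(enrolled, test) >= threshold

# Hypothetical 4-dimensional embeddings; real ones are typically much longer.
enrolled_embedding = [0.9, 0.1, -0.3, 0.5]
same_speaker = [0.8, 0.2, -0.2, 0.6]
other_speaker = [-0.5, 0.9, 0.4, -0.1]

print(verify(enrolled_embedding, same_speaker))
print(verify(enrolled_embedding, other_speaker))
```

Keeping the scoring separate from the model makes it possible to swap the GMM/i-vector front end for a CNN without touching the decision logic.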

The i-vector approach is extremely mathematically complex. It is not that I don’t understand Gaussian Mixture Models, or the mathematics behind any of the operations involved; I just don’t know how to implement it.
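For reference, the core of the i-vector model, as it is usually stated, is a single linear equation. The symbols below follow the common convention and are not taken from any particular library:

```latex
% Total variability model:
%   M : speaker- and session-dependent GMM mean supervector
%   m : speaker-independent (UBM) mean supervector
%   T : low-rank total variability matrix
%   w : latent vector with a standard normal prior
\[
  M = m + T w, \qquad w \sim \mathcal{N}(0, I)
\]
```

The i-vector for an utterance is the posterior mean of w given that utterance's statistics. The implementation difficulty lies mostly in estimating T and computing that posterior, not in the equation itself.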

A side question, slightly off topic: how did you learn to turn complex mathematical equations and proofs into good, working code that actually puts the concept to practical use?
