Context: We are building a biometric authentication application that uses voice to verify a speaker's identity against the voice data in our database. Java is our main programming language.
What would be the best approach?
So far, we’ve considered two approaches. The first uses GMMs on MFCC features and reduces the dimensionality of the resulting GMM supervector with the i-vector approach. However, we have not found any decent libraries for that.
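To make the GMM part concrete, here is a minimal sketch of the piece we do know how to write: scoring MFCC frames against a diagonal-covariance GMM. The class name `DiagonalGmm` and its methods are our own hypothetical names, not from any library, and the model parameters are assumed to be already trained (for example with EM). It does not cover supervector or i-vector extraction, which is the part we are stuck on.

```java
/**
 * Sketch only: a diagonal-covariance Gaussian mixture model that scores
 * feature frames (e.g. MFCC vectors). Parameters are assumed to be trained
 * elsewhere; all names here are hypothetical.
 */
public final class DiagonalGmm {
    private final double[] weights;     // mixture weights, expected to sum to 1
    private final double[][] means;     // [component][dimension]
    private final double[][] variances; // [component][dimension], diagonal covariance

    public DiagonalGmm(double[] weights, double[][] means, double[][] variances) {
        if (weights.length != means.length || weights.length != variances.length) {
            throw new IllegalArgumentException("component counts differ");
        }
        this.weights = weights;
        this.means = means;
        this.variances = variances;
    }

    /** Log-likelihood log p(x) of one frame, computed with log-sum-exp for stability. */
    public double logLikelihood(double[] x) {
        double[] logTerms = new double[weights.length];
        double max = Double.NEGATIVE_INFINITY;
        for (int k = 0; k < weights.length; k++) {
            double logGauss = 0.0;
            for (int d = 0; d < x.length; d++) {
                double diff = x[d] - means[k][d];
                double var = variances[k][d];
                logGauss += -0.5 * (Math.log(2.0 * Math.PI * var) + diff * diff / var);
            }
            logTerms[k] = Math.log(weights[k]) + logGauss;
            max = Math.max(max, logTerms[k]);
        }
        double sum = 0.0;
        for (double t : logTerms) {
            sum += Math.exp(t - max);
        }
        return max + Math.log(sum);
    }

    /** Average per-frame log-likelihood over an utterance (frames[t] is one feature vector). */
    public double averageLogLikelihood(double[][] frames) {
        if (frames.length == 0) {
            throw new IllegalArgumentException("no frames");
        }
        double total = 0.0;
        for (double[] frame : frames) {
            total += logLikelihood(frame);
        }
        return total / frames.length;
    }
}
```

In a plain GMM setup, as far as we understand it, verification would compare the average log-likelihood under the claimed speaker's model with the one under a background model and accept if the difference exceeds a tuned threshold.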
The second approach would be a CNN-based model that takes the MFCCs as input and is trained to recognize the speaker. Before we start experimenting, we wanted to know whether anyone has done this and which approach you found works best.
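For the CNN route, one way we imagine the verification step (an assumption on our part, we have not built this yet) is to have the network output a fixed-length embedding per utterance and then compare the enrolled embedding with the one from the login attempt. A minimal sketch of that comparison follows; `EmbeddingVerifier`, its methods, and the threshold are hypothetical and would need tuning on real data.

```java
/**
 * Sketch only: compares two speaker embeddings with cosine similarity.
 * How the embeddings are produced (e.g. by a CNN over MFCCs) is outside
 * this snippet; all names here are hypothetical.
 */
public final class EmbeddingVerifier {
    private EmbeddingVerifier() {}

    /** Cosine similarity in [-1, 1]; returns 0 if either vector has zero length. */
    public static double cosineSimilarity(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("embedding sizes differ");
        }
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    /** Accepts the claim when the similarity reaches the (assumed, tunable) threshold. */
    public static boolean isSameSpeaker(double[] enrolled, double[] probe, double threshold) {
        return cosineSimilarity(enrolled, probe) >= threshold;
    }
}
```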
The i-vector approach is mathematically very complex. It is not that I don’t understand Gaussian Mixture Models, or the mathematics behind any of the operations involved; I just don’t know how to implement it.
Side question, slightly off-topic: How did you learn to turn complex mathematical equations and proofs into good, working code that actually puts the concept to practical use?