I have a large corpus of documents for which I created vectors. Now I'm looking into filtering out duplicate documents. Are there papers out there that look into this? Or do I just use my best judgment and say, for example, that documents with a cosine similarity > 0.95 are considered duplicates?

Are there even any other ways I can approach this problem?
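
To make the threshold idea concrete, this is roughly what I have in mind. It's only a minimal sketch, assuming the document vectors fit in memory as a NumPy array with one row per document. The function name `find_near_duplicates`, the toy vectors, and the 0.95 cutoff are made up for illustration and the cutoff is not a validated value.

```python
import numpy as np

def find_near_duplicates(vectors, threshold=0.95):
    """Return (i, j, similarity) for each pair i < j above the cosine threshold.

    Hypothetical helper for illustration; the threshold is an assumption.
    """
    vectors = np.asarray(vectors, dtype=float)
    # Normalize each row to unit length so a dot product equals cosine similarity.
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # leave all-zero vectors as zeros instead of dividing by 0
    unit = vectors / norms
    sims = unit @ unit.T  # full pairwise similarity matrix, O(n^2) memory
    # Only look at the upper triangle so each pair is reported once.
    i_idx, j_idx = np.triu_indices(len(unit), k=1)
    mask = sims[i_idx, j_idx] > threshold
    return [(int(i), int(j), float(sims[i, j]))
            for i, j in zip(i_idx[mask], j_idx[mask])]

# Toy vectors standing in for real document vectors (made-up numbers).
doc_vectors = np.array([
    [1.0, 0.0, 0.0],
    [0.99, 0.05, 0.0],  # nearly the same direction as the first row
    [0.0, 1.0, 0.0],
])
pairs = find_near_duplicates(doc_vectors)
for i, j, sim in pairs:
    print(f"documents {i} and {j} look like duplicates (cosine = {sim:.3f})")
```

Since this builds the full pairwise matrix, it grows quadratically with the number of documents, so for a large corpus it would have to be run in batches or replaced with some kind of nearest-neighbor search.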

Source link
( https://www.reddit.com/r//comments/9cuyun/d__duplicates_with_doc2vec/)

