I have a large corpus of documents for which I have created doc2vec vectors. Now I'm looking into the possibility of filtering out duplicate documents. Are there any papers that look into this? Or do I just use my best judgment and say, for example, that documents with a cosine similarity > 0.95 are considered duplicates?
Are there even any other ways I can approach this problem?
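
To make the threshold idea concrete, here is a minimal sketch of what I have in mind. The function names and the toy vectors are made up for illustration, and the 0.95 cutoff is just the guess from above, not a validated value. It assumes the doc2vec vectors have already been exported as plain lists of floats, and it compares every pair, so it is quadratic in the number of documents and would need some kind of nearest-neighbour index for a really large corpus.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def find_near_duplicates(vectors, threshold=0.95):
    """Return index pairs (i, j) with i < j whose similarity exceeds threshold.

    Brute-force comparison of all pairs: O(n^2) in the number of vectors.
    """
    pairs = []
    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            if cosine_similarity(vectors[i], vectors[j]) > threshold:
                pairs.append((i, j))
    return pairs


# Toy stand-ins for doc2vec vectors (hypothetical values).
toy_vectors = [
    [1.0, 0.0],
    [0.999, 0.01],  # nearly the same direction as the first vector
    [0.0, 1.0],     # orthogonal to the first vector
]
duplicate_pairs = find_near_duplicates(toy_vectors, threshold=0.95)
```

With this, I would keep one document from each flagged pair and drop the other, but I am not sure whether a fixed cutoff like this is defensible.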