The breakthrough, which will be presented at the Empirical Methods in Natural Language Processing (EMNLP) conference, could prove important for Facebook, as the social media giant uses automatic translation to help users around the world read posts in their preferred language, Forbes reported.
Existing machine translation systems can achieve near human-level performance on some languages, but they require access to a parallel corpus — vast quantities of the same sentences in different languages — in order to learn, the report said.
The team from Facebook's AI Research (FAIR) division was able to train a machine translation (MT) system by feeding it large amounts of text in different languages from publicly available websites such as Wikipedia.
The key point is that these pieces of text were independent of one another. Such unaligned collections of text in different languages are referred to as monolingual corpora, the report said.
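The difference between the two kinds of training data can be made concrete with a small sketch. The sentences and language codes below are invented for illustration only, not taken from FAIR's datasets:

```python
# Parallel corpus: the SAME sentences, aligned pair by pair across languages.
parallel_corpus = [
    ("The cat sleeps.", "O gato dorme."),   # English / Portuguese pair
    ("It is raining.", "Está chovendo."),
]

# Monolingual corpora: independent texts in each language; the sentences
# are unrelated to one another and need no human alignment.
monolingual_corpora = {
    "en": ["Wikipedia is a free encyclopedia.", "The weather is nice today."],
    "pt": ["Lisboa é a capital de Portugal.", "O futebol é popular no Brasil."],
}

# Every entry in a parallel corpus carries an alignment between languages;
# the monolingual data carries none.
assert all(len(pair) == 2 for pair in parallel_corpus)
assert set(monolingual_corpora) == {"en", "pt"}
print("parallel pairs:", len(parallel_corpus))
```

Building the first structure requires bilingual annotators; the second can be scraped from the web, which is the asymmetry Bordes describes below.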
“Building a parallel corpus is complicated because you need to find people fluent in two languages to create it. For instance, if you wanted to build a parallel corpus of Portuguese/Nepali, you would need to find people fluent in these two languages, which would be very difficult,” Antoine Bordes, a research scientist and the head of FAIR’s Paris research lab, was quoted as saying in the report.
He said: “On the other side, building monolingual corpora Portuguese/Nepali is very easy: you just need to download webpages from Portuguese and from Nepali websites, it doesn’t matter if they are not parallel sentences or if they talk about different things”.
Most machine translation systems use both monolingual and parallel corpora to learn.
“The novelty in our approach is that we can train MT systems from monolingual corpora only, we don’t need any parallel corpus. Potentially, given a book written in an alien language, we could use our model to translate it into English,” Bordes said.
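One building block behind such monolingual-only systems is aligning the word-embedding spaces of two languages without any dictionary. The toy below illustrates only that alignment idea with an orthogonal Procrustes fit; the data is synthetic, and FAIR's full method involves considerably more (denoising and iterative back-translation) than this sketch:

```python
import numpy as np

# Toy "word embeddings": 4 words per language, 3-dimensional vectors.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))          # source-language word vectors

# Construct target-language vectors as a rotation of the source ones, so a
# perfect linear mapping exists (in real corpora it is only approximate).
theta = 0.7
R_true = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                   [np.sin(theta),  np.cos(theta), 0.0],
                   [0.0,            0.0,           1.0]])
Y = X @ R_true.T                     # target-language word vectors

# Orthogonal Procrustes: find the rotation W minimising ||X W^T - Y||_F,
# solved in closed form via the SVD of Y^T X.
U, _, Vt = np.linalg.svd(Y.T @ X)
W = U @ Vt

# The recovered mapping carries source vectors onto their target counterparts.
assert np.allclose(X @ W.T, Y, atol=1e-8)
```

With the two spaces aligned, nearest-neighbour lookup can propose word-level translations, giving a system a starting point without a single parallel sentence.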