Mar 16, 2010

How advertising can help improve machine translation of languages

Automatic translation of languages by machines has been a standard fixture of science fiction - and like quite a few other sci-fi standards, it has been painfully slow to cross over to reality.

A recent breakthrough called 'statistical machine translation' promises to pull this fantasy out of the perpetual future and into the present. Here's a description of how the technique works (from an Economist article in June 2006):
"Statistical translation encompasses a range of techniques, but what they all have in common is the use of statistical analysis, rather than rigid rules, to convert text from one language into another. Most systems start with a large bilingual corpus of text. By analysing the frequency with which clusters of words appear in close proximity in the two languages, it is possible to work out which words correspond to each other in the two languages. This approach offers much greater flexibility than rule-based systems, since it translates languages based on how they are actually used, rather than relying on rigid grammatical rules which may not always be observed, and often have exceptions."
Not surprisingly, the company at the forefront of statistical machine translation is Google. Whenever you use Google Translate, this is what's happening behind the scenes (via the Economist, Feb 2010):
"For translation, the company was able to draw on its other services. Its search system had copies of European Commission documents, which are translated into around 20 languages. Its book-scanning project has thousands of titles that have been translated into many languages. All these translations are very good, done by experts to exacting standards. So instead of trying to teach its computers the rules of a language, Google turned them loose on the texts to make statistical inferences. Google Translate now covers more than 50 languages, according to Franz Och, one of the company’s engineers. The system identifies which word or phrase in one language is the most likely equivalent in a second language. If direct translations are not available (say, Hindi to Catalan), then English is used as a bridge."
But statistical machine translation currently has a few drawbacks, which stem mainly from the kind of ready-made translated texts it relies on. From yet another recent Economist article:
"It is getting better, but it still struggles with colloquialisms and idioms. As Ethan Zuckerman, co-founder of Global Voices and a researcher at Harvard University, puts it: “If you sound like an EU parliamentarian, we can translate you quite well.”"
What's foxing these gargantuan statistical crunching machines is the linguistic equivalent of the last-mile problem: everyday spoken language is rarely captured and archived, let alone translated into a dozen languages before archiving.

For advertisers who believe their work can serve a larger purpose and provide a public good, there might be an opportunity here.

Ads and commercials are routinely translated into different languages - especially when they come from global multinationals. Because they aim to communicate with end consumers, they also contain the kind of colloquialisms and idioms that EU and UN speeches lack.

What if advertisers could provide Google, or a non-profit third party, with transcripts of these ads and commercials along with the translations professionally created by human experts? If a sufficiently large number of advertisers commit their future and past archives, it may end up creating a formidable archive of spoken, everyday lingo for the statistical inference bots to sink their silicon teeth into.

The drawback - as some skeptics will point out - is that the language of advertising may be no less stilted, and no less far removed from everyday speech, than an EU speech. On the other hand, for advertising that seeks to leave behind a cultural impact, this could provide a platform to really make that happen.

If such a thing can be worked out, the advertising itself may have a limited run - but its value could live on forever by providing us with better machine translation for ages to come.
About the author:
Iqbal Mohammed is Head of Innovation & Strategy at a digital innovation agency serving the DACH and wider European markets. He is the winner of the WPP Atticus Award for Best Original Published Writing in Marketing & Communication.
You can reach him via email or Twitter.



misentropy

