How to make headlines for the wrong reasons
“Good afternoon, boys and girls, very bad language, for example, what we see in the side and at the airport these days?” No, I haven’t gone insane, I’m just illustrating a point by resorting to reductio ad absurdum. In other words, I punched the sentence “Hey folks, anyone up for some really truly bad language like the stuff we’re seeing at BÁC airport these days?” into Bad Translator and let it go through 10 machine translations.
Why? Glad you asked… these days, Google is making headlines both in the Irish traditional press and in social media. But for all the wrong reasons. The reason? Google Translate. Or rather, a language pair someone should have thought about a little more. Or at least done some user testing on it. Something…
So what I imagine happened is this… some bright spark, either someone on the Google side or some well-meaning Irish government official, thought it would be great if we could have Irish on Google Translate. First mistake. Give humans a tool, and they will misuse it. Like our ex-joiner hammering in screws. So before you give people a tool, think about likely scenarios of misuse. It clearly does not require a team of Mensa members to imagine that in a minoritised language like Irish, people might start using it for things like their homework or cheap translations rather than as a quick way of getting the gist behind web content.
But having blissfully ignored this step, someone must have forged ahead and contributed a bilingual corpus to Google developers with a note along the lines of “here’s a corpus for Irish, please add it to Google Translate”. Most likely, second mistake.
Right, so there are many ways of building machine translation systems, but most rely on a mix of rules and a bilingual corpus. The idea being that as long as you feed a computer enough aligned data in two languages, it can use statistics to figure out how to translate between the two. This idea in itself is sound. Sort of. It depends on the languages in question, the amount of data involved and, oddly enough, the direction of the translation. Here’s an ideal scenario: build a system using a VAST amount of data (we’re talking billions of words) to translate between closely related languages and into the language which has the less fancy grammatical system. Like German to English. That works quite well as a pair on Google Translate because a) there are indeed vast amounts of texts which exist in both languages and b) German has the fancier grammar (3 genders, case marking, inflection of verbs…) whereas English does bugger all (some past tense markers on verbs and a plural -s aside, which is peanuts in linguistic terms).
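For the curious, here’s roughly what “use statistics to figure out how to translate” can look like in miniature. This is a minimal sketch of IBM Model 1, a classic textbook word-alignment method, and I’m not claiming it is what Google actually runs. The three-sentence German to English corpus and the function names (train_ibm_model1, best_translation) are made up for illustration. The computer is never told that “Haus” means “house”; it works it out from which words keep turning up together.

```python
from collections import defaultdict


def train_ibm_model1(pairs, iterations=20):
    """Estimate word-translation probabilities t(target | source) with EM.

    `pairs` is a list of (source_words, target_words) tuples, i.e. a tiny
    sentence-aligned bilingual corpus.
    """
    source_vocab = {sw for src, _ in pairs for sw in src}
    target_vocab = {tw for _, tgt in pairs for tw in tgt}

    # Start from a uniform guess: every target word is equally likely.
    t = {(tw, sw): 1.0 / len(target_vocab)
         for sw in source_vocab for tw in target_vocab}

    for _ in range(iterations):
        count = defaultdict(float)  # expected count of (target, source) links
        total = defaultdict(float)  # expected count of each source word

        # E-step: share each target word out among the source words
        # in the same sentence pair, in proportion to the current t.
        for src, tgt in pairs:
            for tw in tgt:
                norm = sum(t[(tw, sw)] for sw in src)
                for sw in src:
                    frac = t[(tw, sw)] / norm
                    count[(tw, sw)] += frac
                    total[sw] += frac

        # M-step: re-estimate the probabilities from the expected counts.
        for (tw, sw) in t:
            t[(tw, sw)] = count[(tw, sw)] / total[sw]

    return t


def best_translation(t, source_word):
    """Return the target word with the highest probability for source_word."""
    candidates = {tw: p for (tw, sw), p in t.items() if sw == source_word}
    return max(candidates, key=candidates.get)


# Hypothetical toy corpus: three aligned German/English sentence pairs.
corpus = [
    ("das Haus".split(), "the house".split()),
    ("das Buch".split(), "the book".split()),
    ("ein Buch".split(), "a book".split()),
]

table = train_ibm_model1(corpus)
for word in ["das", "Haus", "Buch", "ein"]:
    print(word, "->", best_translation(table, word))
```

Three sentences are enough here only because the toy languages involved have a four-word vocabulary and no grammar to speak of. Every extra inflected form, mutation or word-order quirk is another thing the counts have to disentangle, which is why real systems need so much more data.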
But once you move away from the ideal model, things start creaking. The more complex the structures of the target language, the more data you’d need for the computer to make any sense of it. So going English to Icelandic creaks much more because even though they’re related languages (ultimately), Icelandic is even more complex than German. Oh and there’s less bilingual data of course.
You get the idea. Now Irish is eye-candy to a linguist. It has grammatical structures to die for, a case system, two genders, two types of mutation (that’s when the first sound in a word changes… you might know people called Hamish? Well that’s what Irish does to a man called Séamus when you address him), a headache-inducing system for inflecting verbs, a different word order (English is subject-verb-object, Irish is verb-subject-object) and so on. A thousand things English doesn’t do. So what would we need to make this work? Yup, take a gold star, a corpus billions of words big.
Unfortunately there’s no bilingual corpus that even comes close to that. Or at the very least, Google did not feed in anywhere near enough data. I’ve lost track but I think it’s mistake 3?
Cue mistake 4… let it loose on people without a big warning strapped to it or any form of user testing. The result? Eye-wateringly bad translations which start cropping up in the weirdest places. Facebook … ok, we could probably live with that… homework… a lot worse, don’t teachers have enough to contend with? And of course the jewel in the crown – official signage. Yep, that’s right. Google Translate has been making its way onto signage from Dublin Airport to government websites. And the result is almost always nauseating. Breaking through barriers? Only the blood vessels in Irish speakers’ brains perhaps…
It’s not that one shouldn’t attempt to bring technology to smaller languages, I’m all for that. But quality is key. It’s a hard enough sell at the best of times and something like a poor machine translation system can seriously damage the confidence people have in technology in or for their language. A little careful thinking goes a long way…