The spectre of Google Translate for Gaelic

15/01/2015 3 comments

Not the kind of pre-Christmas cheer I was hoping for, seriously. Slap bang on the 23rd, someone draws my attention to an article called Google urged to go Gaelic. In a nutshell, a left-field (most likely well-intentioned) appeal by an MSP from Central Scotland to add Scottish Gaelic to the list of languages on Google Translate. As the mere thought was nauseating, I made some time and wrote a very long letter to Murdo Fraser, the man in question, with copies going to David Boag at Bòrd na Gàidhlig and Alasdair Allan, minister for languages. As it sums up my arguments quite succinctly (I hoped), I’ll just copy it here:


Just before Christmas, a friend drew my attention to an article in the Courier regarding Google Translate in which Mr Murdo Fraser argues for a campaign to get Scottish Gaelic onto Google Translate.

I’m sure that this is a well-intentioned idea but in my professional opinion, it would have terrible consequences. As one of the few people who work entirely in the field of Gaelic IT, I have a keen interest in technology and the potential benefit – and damage – this offers to languages like Gaelic. As it happens, I also was the Gaelic localizer (i.e. translator) for Google when it was still running the Google In Your Language programme and I have watched (often with dismay) what Google has done in this area since. One of the projects that certainly caught my eye was Google Translate, especially when Irish was added as a language in 2009. But having spoken to Irish people working in this field and having watched the effects of it on the Irish language, I rapidly came to the conclusion that while it looks ‘cool’, for a small(er) language being on a machine translation system was not necessarily a benefit and, in some cases, a tragedy.

Without going into too much technical detail, machine translation of the kind that Google does works best with the following ingredients:
– a massive (billions of words) aligned bilingual corpus
– translation between structurally similar languages or
– translation from a grammatically complex language into a less grammatically complex language but not the other way round
– translation of short, non-colloquial phrases and sentences but not complex, colloquial or literary structures

In essence, machine translation trains an algorithm in ‘patterns’, which is why massive amounts of data are needed and why it works better from a complex language into a less complex language. For example, it is relatively easy to teach the system that German der/die/das require ‘the’ in English, but it requires a massive amount of data for the system to become clever enough to understand when ‘the’ becomes ‘der’ but not ‘die’.

Unfortunately for Irish, none of these conditions were met – and would also not be met for Scottish Gaelic. To begin with, even if we digitized all the works ever produced which exist in English and Gaelic, the corpus would still be tiny by comparison to the German/English corpus for example.

Then there is the issue of linguistic distance. Irish/Gaelic and English are structurally very different, with Gaelic/Irish having a lot more in the way of complex grammatical structures than English. To compensate for this, the corpus would have to be truly massive, which is why the existing Irish/English system is extremely poor by anyone’s standards.

One might argue that the aim is not a perfect translation system but a means of accessing information only available in other languages – which is the case for many of the languages which are on Google Translate. But I’m doubtful if the reverse is true. To begin with, no fluent Gaelic speaker requires a Gaelic > English translation system and there is precious little which is published in Gaelic in digital form which does not also exist in English. All this would do is remove yet another reason for learning Gaelic.

That would leave English > Gaelic and herein lies the tragedy of the English/Irish pairing on Google Translate. Whatever the intentions of the developers, people will mis-use such a system. I have put together a few annotated photos which illustrate the scale of the disaster in Ireland here. From school reports to official government websites, there are few places where students, individuals or officials trying to cut corners have not used Irish translations of Google Translate in ways they were not intended to be used.

If there HAD been a Gaelic/English pair, Police Scotland would have been an even bigger target of ridicule because such an automated translation would have produced gibberish at worst and absurd semi-Gaelic at best.

I think we can all agree that the last thing Gaelic needs is masses of poor quality translations floating around the internet. Funding is extremely short these days and this would, in my view, be a poor use of these scarce funds. There are more pressing battles to be fought in the field of Gaelic and IT, such as the refusal by the 3rd party suppliers of IT services to Gaelic schools and units to provide (existing) Gaelic software or even a keyboard setting in any school that allows students to easily input accented characters, be that for Gaelic, Spanish or French.

is mise le meas mòr,


Turns out I wasn’t the only one horrified by the mere thought – John Storey also wrote a very long and polite letter.

Early in January and within days of each other, both John and I received almost identical responses which, in a nutshell, said ‘Thanks but I’ll keep trying anyway’. Even less encouragingly, it makes some really irrelevant reference to the lack of teachers in Gaelic Medium Education. Which is true of course but well, not relevant?


Thank you for contacting me in relation to Scots Gaelic and Google Translate and for your detailed correspondence.

I appreciate the depth of your letter and note your concerns in relation to issues of accuracy and the potential impact to speakers of Gaelic of Google translate. I will be sure to consider these when next speaking on the subject.

I also agree that there are other battles to be fought in the field of Gaelic and IT and appreciate the current issues surrounding the number of teachers in Gaelic Medium Education.  However, I do believe it is worth promoting the case for a more accessible Gaelic presence online and without this I believe that Gaelic could miss out on the massive opportunities afforded by the digital age.


I’m still waiting for a response from Bòrd na Gàidhlig or Alasdair Allan. But I’m not encouraged. Really frustrated actually because (at least as the Press & Journal and the Perthshire Conservatives would have it), it seems like Bòrd na Gàidhlig and Alasdair Allan are throwing their weight behind this ill-fated caper.

I really hope Google turns them down because I really don’t want to end up where the Irish IT specialists ended up – the merry world of “Told you so”…

But sadly “Got Gaelic onto Google” probably just sounds sexier on your CV than “Banged some desks and made sure all kids in Gaelic Medium Education can now easily type àèìòù”…

All look same, eh?

15/09/2012 5 comments

I must have been an elephant in another life, given how much time I seem to spend these days shaking my head over “avoidable stupidity”. Or maybe I’m just becoming a grumpy old man. That might be it – I’m losing the ability of youth to look at a slice of cold pizza and go “yummmm”. These days, I look at it and think “The cheese is hard, the cat sniffed it, I can’t even remember when I ordered it” and chuck it out. Ah but I digress.

This week’s headshaker is the way we seem to be losing control to the developers, control over things that should not be in the remit of developers. Things like letting some algorithm “identify” the language of web content and adjusting my search results based on that. Who dreamt that up? No idea but I bet he was white, monolingual and only had the faintest notion that apart from English, there’s that thing the people making tacos speak and then maybe the thing the Chinese takeaway people use. Choice of three – easy, if L does not equal English, check for non-Latin. If it’s non-Latin it must be Chinese, if it’s Latin, it’s Spanish. At least that’s the way it comes across.

The problem is, dear developer, that there’s a great many languages out there and there’s quite a few which are fairly close to each other. Like Irish and Scottish Gaelic for example. So if you decide to automatically identify content by language and modify my search results based on that, then bloody well make sure you get it right! Anything else is just seriously annoying unless you give me the option of manually tweaking it.

Given that it’s not like it’s impossible to teach a computer to figure out the difference (for one, Irish uses acutes, Gaelic graves… the one goes up, the other one down, see?) it also raises the question of exactly whom they’re getting to program this stuff? High school students?
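
To make the accent point concrete, here is a rough Python sketch. It is only an illustration of the "one goes up, the other goes down" heuristic, not how any search engine actually identifies languages. The function name guess_gaelic_variety and the "unknown" fallback are made up for this example; "ga" and "gd" are the ISO 639-1 codes for Irish and Scottish Gaelic. Real text is messier than this, so a serious system would want more signals than accent direction alone.

```python
import unicodedata

# Accented vowels: acutes are typical of Irish, graves of Scottish Gaelic.
ACUTE = set("áéíóúÁÉÍÓÚ")
GRAVE = set("àèìòùÀÈÌÒÙ")


def guess_gaelic_variety(text: str) -> str:
    """Hypothetical helper: guess Irish ("ga") or Scottish Gaelic ("gd")
    purely from the direction of the accents in the text."""
    # Compose characters first so that "a" + combining accent counts too.
    text = unicodedata.normalize("NFC", text)
    acute = sum(ch in ACUTE for ch in text)
    grave = sum(ch in GRAVE for ch in text)
    if acute > grave:
        return "ga"
    if grave > acute:
        return "gd"
    # No accents, or a tie: this heuristic has nothing to say.
    return "unknown"


print(guess_gaelic_variety("Tá fáilte romhat"))  # Irish sample
print(guess_gaelic_variety("Fàilte gu Alba"))    # Scottish Gaelic sample
```

The "unknown" branch matters as much as the other two: when the signal is missing, the sensible thing is to ask the user rather than guess.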

Probably not actually, I suspect they’re all really good at code. But listening to my other half, a business consultant with his very own set of why-oh-why’s, I suspect the problem actually is NOT the ability to do code. It’s lack of guidance at all levels. The way big companies hire folk these days goes something like this:

  1. Company A identifies an apparent problem. Without making sure they identify the root cause, they call for a Fixer-of-Problem-A. First mistake. Your granny breaking her ankle may be the apparent problem but without checking, you don’t know if the problem is actually osteoporosis.
  2. So, having rightly or wrongly identified the problem, these days, a job spec gets sent to an agency. Second mistake. They usually get the wrong person to write the job spec, which means the agency is already on the receiving end of a potential mis-diagnosis and a badly written job spec. I’ve seen some of these… the really bad ones are the equivalent of needing a plumber and calling for someone with a proven track record in “the physical aspects of interior decoration as relates to waste disposal”. Yes. THAT bad.
  3. So we move onto mistake three. The agency usually adds its own flavour of inane, if not misleading, waffle. Using the plumber again, they add something about needing an end-to-end CV showing more than 20 years of experience in toilet seat lifting in blue-chip companies.
  4. Because it’s an IT related job everyone on this daisy chain assumes that the fixer and/or overseer of the fixing have to be IT people. Wrong. Fourth mistake. Of course you need IT folk to do the black magic but the overseer of the circus does not have to be one. In fact, I’d go as far as saying that they shouldn’t be one. Developers, when left to their own devices, tend to lose themselves in coding “fun” stuff. A failing I guess we all suffer from in our respective domains but for some reason, we let developers get away with policing themselves. In other words, a herd of sheep needs a sheepdog and a shepherd for guidance and direction, not another sheep. The sheepdog and shepherd should have a track record of having dealt with sheep but they don’t have to be sheep themselves.

I reckon it’s this nauseating daisy-chain of mistakes which blesses us with nonsense like the above. We need the coder to do the fancy stuff which, for example, helps identify the content of web pages. Jolly good, I can see the use of that if well done. But it should not be left to the developers to decide that this is what’s needed right now, what it will do, how it gets tested, how it gets implemented and how to make sure the user has the necessary control over it if they need it. For that, we need a shepherd who’s not a sheep. If we manage that, I suspect we’d see fewer Siris, fewer counter-intuitive user interfaces, better language in the interfaces and a way of stopping Google from asking me every two seconds if I want to translate this damn page. No, I’m multilingual, and besides, running Irish machine translation over Gaelic won’t work anyway, dammit!