You may recall that back in November last year I (and some other people) seriously questioned whether Google Translate for Gaelic was really such a great idea. Most people who came down on the side of it being a good thing cited things such as “attracting young people” (that must be the minority language equivalent of “exposure” in the arts world…), “enhancing the status”, “used judiciously, it will do this and that good thing” and “wait and see, it won’t be that bad”.
Well, I have some news for you and if you’ve never seen me furious, and I mean steaming-out-of-the-ears-furious, here’s your chance.
I wasn’t actually planning to blog about this again, not for a while. But then I made the mistake of doing something unrelated – a bit of data entry in the Faclair Beag. After another fruitless attempt at finding the English for a coileach-gòthan, I picked up one of my many note-sheets and decided I might as well enter one. After a few useful phrases, I came across an odd looking word so I decided to ask the poor man’s corpus (which is useful in giving you a very quick impression of how common a word is). 346 results – that seemed fairly conclusive (for a language like Gaelic) but being the OCD QA freak I am, as always I did a gross error check to see which sites these hits were coming from. Topslotsite? Strange but maybe a coincidence (sometimes English typos or bad line breaks result in seemingly Gaelic words)… just keep going. Coinfalls? What the… Slotjar??? No, I’m not lying…
And it’s not just something Google ran over the site descriptions, we’re talking entire sites which people have just punched through Google Translate and put online:
It’s not just casino stuff… you can also get gibberish about business…
So here’s Reason 1 for me being furious: The more this happens, the less useful it will make the web for doing various Gaelic (and any other such unfortunate small language) related projects. I’m not just talking about messing up my searches. For instance, there are various spellcheckers for smaller languages which are based on web corpora i.e. bodies of text which have been collected from the web to form the basis of a spellchecker. This also often results in helpful word statistics – which words are more common than others. That may just sound like geekery but that’s the kind of geekery that helps make a better predictive text tool for example. So while still geeky, we’re talking geeky-that-is-useful-to-Joe-Bloggs.
The more of this GT shite we’re getting on the web, the further the quality of anything you might otherwise cull from the web will go down, down and down for each of those languages. Because unproofed machine translation will just always re-hash whatever is in the machine’s brain i.e. you will only ever get more of the same.
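For the geeks, here is roughly what that kind of word statistic looks like in practice, and why junk pages matter. This is only a minimal sketch: the host names and the blocklist are made up for illustration, and a real corpus builder would need far more than a hand-kept list of bad sites.

```python
import re
from collections import Counter
from urllib.parse import urlparse

# Hypothetical blocklist of hosts known to carry unproofed machine translation.
BLOCKED_HOSTS = {"mt-casino.example", "mt-maths.example"}

# Runs of letters (any script), optionally joined by a hyphen or apostrophe,
# so that something like "t-Samhraidh" stays one token.
WORD = re.compile(r"[^\W\d_]+(?:[-'’][^\W\d_]+)*")


def word_frequencies(pages, blocked_hosts=BLOCKED_HOSTS):
    """Count lower-cased word forms across (url, text) pairs, skipping blocked hosts."""
    counts = Counter()
    for url, text in pages:
        host = urlparse(url).hostname or ""
        if host in blocked_hosts:
            continue  # the "gross error check": ignore known junk sources
        counts.update(word.lower() for word in WORD.findall(text))
    return counts


pages = [
    ("https://good-site.example/recipes", "Cidsin an t-Samhraidh"),
    ("https://mt-casino.example/slots", "cidsin cidsin cidsin"),
]
freq = word_frequencies(pages)
print(freq.most_common(3))
```

Without the blocklist, the casino page alone would quadruple the count for that one word. Now imagine thousands of such pages and no list to filter them with.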
Then I came across this:
Some site in Russia about maths that has been Google-translated. Look at the lovely yellow box in particular.
So here is Reason 2 for the ceò coming out of my cluasan: Some of us spend a lot of time working on educational (and usually free) tools such as Scratch. It’s hard enough to convince people to try things like that in Gaelic without having to hand them a huge note saying “Beware of the following 3,000 sites which are not fit for purpose”. You know what that kind of warning does to people’s confidence in Gaelic software? Well, I’ll give you a small hint, it doesn’t improve it, that’s for sure.
At this point I decide to put e-pen to e-paper because there’s something else that has been making me furious about all this, something I was saving for later, at least until I had seen what effect reporting the next problem would have.
Ladies and Gentlemen, I give you Publishing Hell. Let’s start with a light entrée:
Nice, eh? For those of you not Gaelic speakers, that ought to say “Cidsin an t-Samhraidh”. But wait, I hear you say, surely that’s just an isolated thing… Nope, they are both beyond all linguistic pales and beyond counting:
“Yes but isn’t that obvious? Surely people won’t buy these…” Well, I have news for that camp too. Apparently they do. And why not? Especially a learner who is not fluent might very well find something like that an attractive proposition for helping them learn more Gaelic… or Tswana… or Samoan… or Chichewa…
This guy on Amazon in particular gets my goat and I really hope he steps onto a poisonous snake… he basically has a range of crappy children’s books which he punches through any machine translation tool he can find.
How do I know that? Well, by looking at those languages I speak to a reasonable degree and by comparing the languages Mr Winterberg offers to those on Google Translate and Bing. Haitian Creole and Gaelic in particular are giveaways.
He of course excels at Irish too… I reported this to Amazon a while back, got the standard “We’ll look into it” email and that was the end of that.
It takes a special kind of callous to rip people off like that I must say. Hence my seething anger with both people like Mr Winterberg and of course Google, Amazon and Co who don’t have a stricter policy on such titles.
Maybe it was a sign of things to come when I noticed a few years ago that second-hand book sites like Abebooks were starting to get flooded with terrible print-on-demand titles of old books where people took scanned pages, put them through character recognition software and then turned the resulting gibberish into a newly printed book. Yes, I know I re-publish old titles but with one crucial difference – once I’ve done the OCR, there will be endless hours of proofreading to make sure there are as few errors as possible. In some cases, my reprints have way fewer errors than the original, incidentally.
Oh, did I mention that he seems to grab names of translators off the web to adorn his titles with? I’m positive that Steaphan has better Gaelic than what’s in these books.
And yes, to all intents and purposes, people are buying them. Some realise their error afterwards (and hopefully pulp or e-pulp the books) but some (like Lorna) either cannot tell that it’s shite or they’re fake reviewers:
Either way, it’s very worrying. Taking this one at face value (and I am inclined to believe this one is genuine given the title is good Gaelic and that the reference to the sgoil-àraich (preschool) is most likely beyond the ken of fake reviewers):
Really? There is a sgoil-àraich somewhere which has this shite on the shelves? Passing Google Translate Gaelic onto the next generation? And yes, that makes me angry. Very angry. Seethingly angry. Because somehow I doubt that all those big names who were so keen to get Google to build this Beelzebub of a tool are going to take the time and effort to write to every school, preschool, nursery, parent and learner in the Gaelic world to warn them of not only the potential pitfalls of Google Translate but also that they might encounter Google Translated products in unexpected places.
Thanks guys. Nicely done.
ADDENDUM: Someone on Facebook commented that these are problems to be solved; problems to be welcomed and that Bitching about this problem is like bitching about bad teachers. You don’t send all the teachers back to teacher school en masse because one makes some hideous errors. A generation would be lost. Or a language. The reason I’m commenting on this here as an addendum is not to slag them off but because I realise that this is indeed a way some people will look at this. So I’m dissecting it as a potential view on the matter, not as a personal response (which I did on Facebook).
So is it a problem to be solved? I don’t think it can actually be solved. Who has the time or the money to pay someone to have the time? I already invest much too much of my non-working time into building resources. I don’t have the time to chase charlatans on the web. It’s like fighting midges.
And this is not like having a bad teacher either. You have one bad teacher, fine, shunt them into admin or find another way of improving their skills. Or a bad night class teacher where word eventually will get round. The difference is the sheer scale of the issue. This is no longer a contest of human vs human, if you pardon the crude simplification, this has become human vs machine. And that’s a contest where a group of humans, small in number, are not going to come off well because a machine translation system can dish out junk much much faster than a team of humans can locate them and shut them down. Even if there was a simple way of shutting such things down.
No, this is a problem that Gaelic speakers cannot fix and that has grown out of the short-sightedness of a few people chasing a sexy headline, unwilling to engage in meaningful debate. All we can do now is watch the terror unfold and hope that it will one day step on the toes of a language much bigger than Gaelic.
Though I was contemplating “How to waste lifetime” as a title to be honest. If you don’t want to read through the Odyssey part, fair enough, the quick guide on what to do is at the very bottom.
No, I’m not about to repeat my rant on Detect locale, tempting as that may be. This is about the trek to actually get a locale onto the locales on offer on mobile operating systems like iOS or Android. In the case of Scottish Gaelic, we need to go back all the way to July 2010. I had just been roped into localizing Firefox and we had noticed that the plural rules for Gaelic were either missing or wrong. So in the process of fixing those, it was recommended to me that I also submit them to the Common Locale Data Repository (CLDR). Basically a big holding tank run by the Unicode Consortium for things like plural rules, names of days of the week, month names, whether the month goes before the day etc etc for different locales. Seemed reasonable, so off I go filing a ticket. It took a while but by September 2011, that was in. Yay.
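For anyone wondering what a “plural rule” actually is: it tells software which grammatical form to pick for a given number. Below is a rough sketch of the cardinal rule for Scottish Gaelic as I understand it to be recorded in CLDR (integers only). The function name is made up for illustration and this is not how CLDR or Firefox store the rule internally.

```python
def gaelic_plural_category(n: int) -> str:
    """Return the plural category for a whole number n (assumed rule for gd)."""
    if n in (1, 11):
        return "one"
    if n in (2, 12):
        return "two"
    if 3 <= n <= 10 or 13 <= n <= 19:
        return "few"
    return "other"  # 0, 20 and upwards


for n in (1, 2, 3, 11, 12, 19, 20):
    print(n, gaelic_plural_category(n))
```

A localized app then needs one translated string per category rather than the single singular/plural pair English gets away with.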
In the meantime, because I had gotten involved with LibreOffice (well, technically speaking OpenOffice first), I also ended up submitting a minimal dataset for Gaelic to CLDR because creating one was a prerequisite for getting a release of LibreOffice and it again was recommended I submit to CLDR so it’s generally available. Fair enough. Took a while figuring out because back then, the handy Survey Tool (basically a graphic interface) didn’t exist – you had to edit an xml file. Yuck. Started in May 2011 and by October, that was done and dusted.
Here’s where I got naive. I thought the “filtering through” of locale data was automatic. I did actually ask a few people and they all thought it was automatic too – though nobody was entirely sure. For most of even the smaller locales such as Welsh and Irish, somebody must have done “it” far enough back for nobody to know where it came from. So October 2011 on, I start watching the list of locales on my Android. Periodically, I’d pop into a mobile phone shop to check the latest models, in case my phone and OS were just too old.
In the meantime, I kept chipping away at the xml file, adding things like language and country names until I had the file relatively complete. Hoping that perhaps there was a completion threshold – even though nobody seemed to know. I started pinging questions at Android aka Google, figuring they were easier to communicate with than Apple. Hah! It’s like standing at one end of the Munich Beer Festival and playing a game of Chinese whispers with someone at the far end. No answer, lots of silence or vague suggestions of “try there”. Spent hours trying to google the answer. Frustratingly, even though I can almost always tease the web into giving me the info I want, not on this occasion. It was as if nobody had ever actually done whatever it was that needed doing to get a new locale to pop up on mobile OS.
I was getting increasingly frustrated/annoyed/angry because increasingly, apps were using detect locale to determine the language of one’s UI. Up until Android 4.2, you could use an app to “fake” a locale i.e. I could set it to gd-GB and apps such as Opera Mini would come up in Gaelic. But towards the end of 2012, Google blocked that option. Don’t ask me why… The upshot was that even those apps which had been localized were now hidden away because Gaelic did not exist as an official locale. Which set me off on the quest to get manual locale selection into FOSS apps but that’s a different story.
To add insult to injury, while I could understand to some extent why Irish would just be “there” as a locale, I couldn’t for the life of me understand why Manx of all languages was there, but not Gaelic. I mean, bully for Manx but what gives?
Fast forward to May 2014. CLDR is implementing its shiny new Survey Tool and a colleague and I set about filling in the last remaining gaps in the locale data file. Still no Gaelic on Android or iOS, even though the data set was now complete. It wasn’t until August 2014, out of a discussion surrounding the Survey Tool, that someone finally pinned down the problem. Even though we’d had a good enough data set since 2011, this was held “just” in CLDR. It turns out that Android aka Google actually pulls its locales and locale data from something called ICU, the International Components for Unicode. So I file a bug on CLDR which someone kindly copymoved to ICU. While not great communicators, at least someone imported the data set from CLDR and it was finally included in the ICU 54 release in October 2014. It had taken more than 4 years to discover what was needed. And then it took less than 4 months to get it into the necessary data bucket. 😒
And even crazier, within weeks of the ticket being closed on ICU, a Gaelic speaking Apple tester excitedly mailed me to tell me that on his test version of iOS 8, Scottish Gaelic was there as a locale. There were a few other minor bumps in the road but from iOS 8, Gaelic was there as a locale and apparently, it made its debut on Android Marshmallow in October 2015. All we need now is for people to upgrade to iOS 8 (fairly straightforward) and Android Marshmallow (not so straightforward, we’ll probably have to wait for people to physically upgrade their devices).
So here it is for all those who want their locale on Android, iOS & Co:
- Bring some spare time. Assuming a single contributor, it will probably take up to a year to get it to appear on the latest devices if you have perfect timing. More likely, 2 years.
- Submit a locale data set to CLDR. You will need a Survey Tool account – and bear in mind there is only ONE submission cycle a year, on the whole. If you missed the current one, I recommend you check out existing data sets because you will have to answer fairly techy questions around date and time formatting, plurals, sort orders and goodness knows what else. Pick a locale similar to your own or at least one for a language you speak and see what that looks like.
Also check what “coverage level target” your locale has (ask someone at CLDR via a ticket). Some locales have a low target, Gaelic happened to be in “comprehensive” for some reason. Probably not worth arguing which one you’re in and just knuckling down.
- File a ticket on ICU to get the data ported over.
- Wait and finally, enjoy.
To begin with, I do not hold all the facts and I do like (or do I have to use the present-past-potential-future tense already?) the product. But there have been so many what the fuck moments it sadly is time for another Dear Developer epistle.
The topic? Mozilla OS. Which judging by today’s post to the localization list by George Roter is now officially floating belly up and face down in digital muck. Oh sure, there are exciting opportunities with the Internet of Things (which has a lengthy Wikipedia article that truly fails to inspire) and Connected Devices (I have yet to meet a Mozillian who can actually tell me what that practically means for end-users).
I guess it at least has an element of closure because back in December, well, we were all completely in the dark, apart from a steady stream of well-meant fluffwords.
So what happened? Well, looking at it from the bottom-up view of a localizer, Mozilla has proven once again that it has a genuinely amazing and skilful pool of workers but management that makes a revolutionary student committee look efficient. So at some point the idea of Mozilla OS was born – all the good things about Mozilla but as an OS. Ok, sounds fair, and I was right in there from the start with localizations. Two reasons, no, make that three, one of which was selfish, the other practical and the third altruistic:
- We wanted mobile devices in our language (that was the selfish bit)
- Participating early means you reach the maintenance level of translation early, which is a lot easier when there are fewer words to begin with (the practical reason). Plus less of a chance localization turns into an afterthought. Or so I thought…
- We wanted to help create a better product that would reach more people (the altruistic reason but more on that later)
Regarding 1 and 2, I kind of started worrying early because it became clear that Mozilla was partly selling its soul to manufacturers. We could localize but there was to be no guarantee, as it turned out eventually, that commercial manufacturers would ship all locales with a high completion. Why? Apparently Mozilla had forgotten to either negotiate harder regarding that and/or forgotten to design an easy way of pulling an unshipped locale once your device had been set up. Ho-hum but given our experience with the better-late-than-never solution to manual locale selection on Mozilla Mobile, I had reasonable confidence there would be a solution. Eventually. So I stuck with the project. Paid for a testing device. Managed to get a tablet for testing too. Helped with sometimes left-field solutions, like when I helped someone crack the problem of how to sort contact lists in lists with mixed scripts without resorting to automatic Unicode conversion (like how to handle a contact like রবীন্দ্রনাথ ঠাকুর on a phone next to Jack Sparrow – easy, ask the user to provide a manual phonetic spelling during contact creation), filed bugs, was a bit of a squeaky wheel… yeah ok, I submitted no patches but I can’t code for toffee, believe it or not.
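For the curious, the contact-sorting idea boils down to something like the sketch below. It only illustrates the principle (sort on a user-supplied phonetic spelling where there is one, otherwise on the name itself) and is not the code that went into the OS; the class, the field names and the romanised spelling are all made up for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Contact:
    name: str
    # Manual phonetic spelling typed in by the user when creating the contact.
    phonetic: Optional[str] = None


def sort_key(contact: Contact) -> str:
    """Sort on the phonetic spelling if the user gave one, else on the name."""
    return (contact.phonetic or contact.name).casefold()


contacts = [
    Contact("Jack Sparrow"),
    Contact("রবীন্দ্রনাথ ঠাকুর", phonetic="Rabindranath Thakur"),
    Contact("Anna Bell"),
]

ordered = sorted(contacts, key=sort_key)
for contact in ordered:
    print(contact.name)
```

The point of the design is that the user decides where the name belongs in their own list, so no automatic transliteration has to guess.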
I guess alarm bells should have started going off when Flatfish (the tablet branch) went quiet. As in, suddenly there were no more nightly builds and bugs were beginning to pile up, some pretty central (like the fact no build ever shipped all locales – no, it was crazier than that, the locales were there but the translations weren’t getting pulled from Pootle). Eventually the word was passed round in a very unofficial way that Flatfish was no longer a project Mozilla was pursuing. Like that wasn’t worth an announcement? Even a short blog post by someone high up? Gee, thanks…
At the very least it was highly odd that a mobile OS aiming to compete with existing mobile OS would ignore the tablet side but maybe, I said to myself, we’re prioritising resources until it works well on phone and then we’ll get onto tablets again.
Then in early December we had the news fiasco. Short version is, somehow word got out that Mozilla was canning Mozilla OS but nobody had prepared anything official, not even a blog post, never mind press releases. Just some fluff about the Internet of Things. There’s a pretty good write-up here if you want the whole nine yards. Then all through December and most of January everyone, including Mozillians (at least the workers at the “bottom”) had no idea about what was going on. Great.
In a sense, we still don’t (unless someone can finally explain Connected Devices and the IoT to me in simple, short sentences explaining how that relates to end-users…). Except that we are to cease all work on localizing Mozilla OS for now. Who knows if this will still be the position in a month but for now, there are not going to be any phones which will ship the OS. Why? Reading between the lines, the uptake wasn’t great. Really? Like it was ever going to be easy to get a share of the iOS/Android/Windows phone market? If the decision makers expected an easy ride, they were naive. If they expected a tough ride, why are we bottling out now?
Which, incidentally, they could have made easier by considering one thing they mostly seem to have ignored – while the existing 3 hog most of the market, they are very restricted in their approach to localized interfaces. There are up to 40 million speakers of lesser-used languages in the EU alone and while certainly not all will shift by any stretch of the imagination, for a considerable number of those Mozilla OS would have been one of the few realistic means of getting a device in THEIR language. Neither Android nor iOS cater for Breton or Occitan. Small fry, you might think. Not so. It’s a bit hard to count but there are at least some 350 million people on the planet speaking languages which are not amongst the big boys Android & Co cater for. If that isn’t a market then I don’t know what is.
Will it come back? I don’t know. Would be good… even better if they teamed up with Ubuntu on this one. For now, I’m focussing on Ubuntu Mobile which is also localizable AND ships all locales with a high completion percentage and CyanogenMod AOSP which the Asturians have recently proven to be a way onto at least some devices running a version of Android. Gaelic SHALL go to the ball… would have been nice if it had been with Mozilla OS too.
But seriously, Mozilla is not too big to fail and if it continues to behave like an ocean liner which is steered in a fashion reminiscent of a revolutionary student committee, there will be a hard rock somewhere along the line for it. Which would be a great disservice to all the inspiring and hard working folks at Mozilla, not to mention the volunteers and the world at large. So please, revolutionary leaders up there, put down the hooch, put the origami helmets in the memento drawer and sharpen up your leadership, planning and above all, communication.
The warm memories of childhood brought to you by the Terminator T-888 Cyberdyne Systems Class TOK715
Yes, I like technology. But increasingly these days, I wish we could have a global debate on where we’re going with this and how much of it we want. Not as Gaels or Basques or Chinese or Brazilians but as a species.
Let me backtrack a little. We’ve just released the first ever Gaelic text-to-speech voice, having worked together with the great people at Cereproc in Edinburgh over the last year. This is a good thing. It may seem to contradict my intro but the way I see it, it is an enabling tool. If nothing else, it is an assistive tool for people who are blind or dyslexic – and who speak Gaelic. We often forget that being a speaker of a minority language does not prevent you from being struck by the same issues as everyone else. Or rather, speakers of majority languages tend to forget this. It is not meant to replace real humans and it won’t, as it cannot think for itself. It won’t run off to the kitchen and make dinner or suddenly turn round and say to the user “What’s with all this Somhairle stuff, I want to read some sci-fi, ok?”
Sure, learners will use the voice too and since it is a pretty good voice, it should enhance their learning experience, especially for those with little or no access to native speakers. So I don’t see an issue there (though we did all bust several collective guts making sure the quality is as good as possible).
But a couple of days later, a colleague drew my attention to a line in a summary of a talk to be given at the Centre for Speech and Technology Research. It talks of speech production and refers to “…multimodal interactive games, involving many characters, dialogue partners…”. Cue a slightly dystopian moment. I possibly misread the line slightly and what it means is AI dialogue partners in games. Like “talking” to Deckard Cain in Diablo which is really just a fixed script which is reeled off following certain actions in the game.
But whatever the intended meaning, it did make me think about the wider implications of taking technology, and in this case speech technology in particular, further and further with little debate about where we’re going with all this. I did have this quick mental flash of a Gaelic speaking Terminator, baking cookies with a human child. Don’t be absurd, you might say but I don’t think it’s entirely far-fetched. There will come a point when our use of technology in language learning will turn into something more distasteful than a toy bleating out words. There will come a point when interaction with speech produced by an intelligent machine will start to infringe on the way our children learn language and probably even adult interactions.
It may be that we decide, as a species, that a Gaelic/Basque/Aymara/Rapanui… speaking robot is just the thing to re-invigorate our languages. But to my mind, it poses a bigger question about whether this won’t make the whole thing pointless? Not just the issue of language but increasingly us as a species? Our affection for things, pretty sunsets, memories of baking biscuits with our grandmother and the particular sound waves our mothers made at us, is a very human thing. I cannot see a machine developing an appreciation for a field of dandelion other than in a utilitarian sense. Or perhaps we might decide that an intelligent interaction with a machine is preferable to the passive consumption we have at the moment, like those families I observe on the train using tablets as pacifiers. Playing I Spy with the tablet? Would I be looking at the outside of the train or a picture of the outside projected onto the screen?
We still seem to be operating, as a species, on the basis that it’s ok to see what happens when I bang these two rocks together cause hey, it’s Zoug’s own time and effort and what harm can it do. But we’re reaching a point in our technological development where the harm we can do by just seeing what happens when you bang something together is becoming considerable.
It makes me wish we talked more about what we actually want before we go out and do it. But sadly, I cannot really see it happening much, not with people going “well, if I don’t, someone else will”. Maybe I’m just having a gloomy day (no, the sun IS shining in Glasgow today) but I get the feeling we might finally be getting close to an answer to the Fermi paradox, a somewhat unpalatable one albeit. Fingers crossed, eyes closed and hope for the best?
I’m glad to see others in the field have similar apprehensions about MT in small languages
This is an abbreviated transcript of a talk I gave at a British-Irish Council conference on language technology in indigenous, minority and lesser-used languages in Dublin earlier this month (November 2015) under the title ‘Do minority languages need the same language technology as majority languages?’ I wanted to bust the myth that machine translation is necessary for the revival of minority languages. What I had to say didn’t go down well with some in the audience, especially people who work in machine translation (unsurprisingly). So beware, there is controversy ahead!
I seem to be posting a lot about Google these days but then they ARE turning into the digital equivalent of Nestlé.
I’ve been pondering this post for a while and how to approach it without making it sound like I believe in area 52. So I’ll just say what happened and let you come to your own conclusions mostly.
Back when Google still ran the Google in Your Language project, I tried hard to get into Gmail and what was rumoured to be a browser but failed, though they were keen to push the now canned Picasa. <eyeroll> Then of course they canned the whole Google in Your Language thing. When I eventually found out that Google Chrome is technically nothing else than a rebranded version of an Open Source browser called Chromium, I thought ‘great, should be able to get a leg into the door that way’. Think again. So I looked around and was already confused because there did not appear to be a clear distinction between Chromium and Chrome. The two main candidates were Launchpad and Google Code. So January 2011 I decide to file an issue on Google Code, thinking that even if it’s the wrong place, they should be able to point me in the right direction. The answer came pretty quick. Even though the project is called Chromium, they (quote) don’t accept third party translations for chrome. And nobody seems to know where the translations come from or how you become an official translator. A vague reference that I maybe should try Ubuntu.
I gave it some time. Lots of time in fact. I picked up the thread again early in 2013. Now the semi-serious suggestion was to fork Chromium and do my translation on the fork. Very funny. Needless to say, I was getting rather disgusted at the whole affair and decided to give up on Chrome/Chromium.
When I noticed that an Irish translator on Launchpad had asked a similar question about Chromium and saw the answer was they, as far as they know, push the translations upstream to Chromium from Launchpad, I decided I might as well have a go. As someone had suggested, at least I’ll get Chromium on Linux.
Fast forward to October 2014 and I’m almost done with the translation on Launchpad so I figure I better file a bug early because it will likely take forever. Bug filed, enthusiastic response from some admin on Launchpad. Great, I think to myself, should be plain sailing from here on. Spoke too soon. End of January 2015, the translation long completed, I query to silence and only get more silence. More worryingly, someone points me at a post on Ubuntu about Chromium on Launchpad being, well, dead.
Having asked the question in a Chromium IRC chat room, I decided to have another go on Google Code, new bug, new luck maybe? Someone in the room did sound supportive. That was January 28, 2015. To date, nothing has happened apart from someone ‘assigning the bug to l10n PM for triage’.
I’m coming to the conclusion that Chromium has only the thinnest veneer of being open. Perhaps in the sense that I can get a hold of the source code and play around with it. But there is a distinct lack of openness and approachability about the whole thing. Perhaps that was the intention all along, to use the Open Source community to improve the source code but to give back as little as possible and to build as many layers of secrecy and to put as many obstacles in people’s path as possible. At least when it comes to localization.
At least Ubuntu is no longer pushing Chromium as the default browser. But that still leaves me with a whole pile of translation work which is not being used. Maybe I should check out some other Chromium-based browsers like Comodo Dragon or Yandex. Perhaps I’m being paranoid but I’m not keen on software coming from Russia being on my systems or recommending it to other people. Either way, I’m left with the same problem that we have with Firefox in a sense – it would mean having to wean people off pre-installed versions of Google Chrome or Internet Explorer.
Anyone got any good ideas? Cause I’m fresh out of…
Not the kind of pre-Christmas cheer I was hoping for, seriously. Slap bang on the 23rd, someone draws my attention to an article called Google urged to go Gaelic. In a nutshell, a left-field (most likely well-intentioned) appeal by an MSP from Central Scotland to add Scottish Gaelic to the list of languages. As the mere thought was nauseating, I made some time and wrote a very long letter to Murdo Fraser, the man in question, with copies going to David Boag at Bòrd na Gàidhlig and Alasdair Allan, minister for languages. As it sums up my arguments quite succinctly (I hoped), I’ll just copy it here:
Just before Christmas, a friend drew my attention to an article in the Courier regarding Google Translate in which Mr Murdo Fraser argues for a campaign to get Scottish Gaelic onto Google Translate.
I’m sure that this is a well-intentioned idea but in my professional opinion, it would have terrible consequences. As one of the few people who work entirely in the field of Gaelic IT, I have a keen interest in technology and the potential benefit – and damage – this offers to languages like Gaelic. As it happens, I also was the Gaelic localizer (i.e. translator) for Google when it was still running the Google In Your Language programme and I have watched (often with dismay) what Google has done in this area since. One of the projects that certainly caught my eye was Google Translate, especially when Irish was added as a language in 2009. But having spoken to Irish people working in this field and having watched the effects of it on the Irish language, I rapidly came to the conclusion that while it looks ‘cool’, being on a machine translation system for a small(er) language was not necessarily a benefit and in some cases, a tragedy.
Without going into too much technical detail, machine translation of the kind that Google does works best with the following ingredients:
– a massive (billions of words) aligned bilingual corpus
– translation between structurally similar languages or
– translation from a grammatically complex language into a less grammatically complex language but not the other way round
– translation of short, non-colloquial phrases and sentences but not complex, colloquial or literary structures
In essence, machine translation trains an algorithm in ‘patterns’, which is why massive amounts of data are needed and why it works better from a complex language into a less complex language. For example, it is relatively easy to teach the system that German der/die/das require ‘the’ in English, but it requires a massive amount of data for the system to become clever enough to understand when ‘the’ becomes ‘der’ but not ‘die’.
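To make that asymmetry concrete, here is a toy sketch in Python. It is purely my own illustration and not how Google's system works: a real system learns statistics from an aligned corpus rather than using hand-written tables, and the three-noun lexicon and the function names are invented for the example. It only covers the nominative singular.

```python
# Toy illustration only: real systems learn patterns from aligned corpora.

# German -> English: three articles collapse onto one word, so a plain
# lookup is enough and no context is needed.
DE_TO_EN = {"der": "the", "die": "the", "das": "the"}

# English -> German: "the" on its own is ambiguous. The system needs extra
# knowledge, here the grammatical gender of the noun that follows.
NOUN_GENDER = {"Mann": "m", "Frau": "f", "Kind": "n"}  # hypothetical mini-lexicon
ARTICLE_BY_GENDER = {"m": "der", "f": "die", "n": "das"}


def article_to_english(article):
    """Many-to-one: every German definite article maps to 'the'."""
    return DE_TO_EN[article.lower()]


def article_to_german(noun):
    """One-to-many: pick the article from the noun's gender.

    Returns None when the noun is not in the lexicon, which is exactly
    the situation a small corpus leaves the system in most of the time.
    """
    gender = NOUN_GENDER.get(noun)
    return ARTICLE_BY_GENDER.get(gender)
```

The first direction works with three entries. The second direction only works for nouns the system has already seen, and that is before case and number enter the picture.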
Unfortunately for Irish, none of these conditions were met – and would also not be met for Scottish Gaelic. To begin with, even if we digitized all the works ever produced which exist in English and Gaelic, the corpus would still be tiny by comparison to the German/English corpus for example.
Then there is the issue of linguistic distance: Irish/Gaelic and English are structurally very different, with Gaelic/Irish having a lot more in the way of complex grammatical structures than English. To compensate for this, the corpus would have to be truly massive. Which is why the existing Irish/English system is extremely poor by anyone’s standards.
One might argue that the aim is not a perfect translation system but a means of accessing information only available in other languages – which is the case for many of the languages which are on Google Translate. But I’m doubtful if the reverse is true. To begin with, no fluent Gaelic speaker requires a Gaelic > English translation system and there is precious little which is published in Gaelic in digital form which does not also exist in English. All this would do is remove yet another reason for learning Gaelic.
That would leave English > Gaelic and herein lies the tragedy of the English/Irish pairing on Google Translate. Whatever the intentions of the developers, people will mis-use such a system. I have put together a few annotated photos which illustrate the scale of the disaster in Ireland here. From school reports to official government websites, there are few places where students, individuals or officials trying to cut corners have not used Irish translations of Google Translate in ways they were not intended to be used.
If there HAD been a Gaelic/English pair, Police Scotland would have been an even bigger target of ridicule because such an automated translation would have produced gibberish at worst and absurd semi-Gaelic at best.
I think we can all agree that the last thing Gaelic needs is masses of poor quality translations floating around the internet. Funding is extremely short these days and this would, in my view, be a poor use of these scarce funds. There are more pressing battles to be fought in the field of Gaelic and IT, such as the refusal by the 3rd party suppliers of IT services to Gaelic schools and units to provide (existing) Gaelic software or even a keyboard setting in any school that allows students to easily input accented characters, be that for Gaelic, Spanish or French.
is mise le meas mòr,
Turns out I wasn’t the only one horrified by the mere thought – John Storey also wrote a very long and polite letter.
Early in January and within days of each other, both John and I received almost identical responses which, in a nutshell, said ‘Thanks but I’ll keep trying anyway’. Even less encouragingly, it made some really irrelevant reference to the lack of teachers in Gaelic Medium Education. Which is true of course but well, not relevant?
Thank you for contacting me in relation to Scots Gaelic and Google Translate and for your detailed correspondence.
I appreciate the depth of your letter and note your concerns in relation to issues of accuracy and the potential impact to speakers of Gaelic of Google translate. I will be sure to consider these when next speaking on the subject.
I also agree that there are other battles to be fought in the field of Gaelic and IT and appreciate the current issues surrounding the number of teachers in Gaelic Medium Education. However, I do believe it is worth promoting the case for a more accessible Gaelic presence online and without this I believe that Gaelic could miss out on the massive opportunities afforded by the digital age.
I’m still waiting for a response from Bòrd na Gàidhlig or Alasdair Allan. But I’m not encouraged. Really frustrated actually because (at least as the Press & Journal and the Perthshire Conservatives would have it), it seems like Bòrd na Gàidhlig and Alasdair Allan are throwing their weight behind this ill-fated caper.
I really hope Google turns them down because I really don’t want to end up where the Irish IT specialists ended up – the merry world of “Told you so”…
But sadly “Got Gaelic onto Google” probably just sounds sexier on your CV than “Banged some desks and made sure all kids in Gaelic Medium Education can now easily type àèìòù”…