When things are way, way, WAY worse than you thought they might get

01/09/2016 7 comments

You may recall that back in November last year I (and some other people) seriously questioned whether Google Translate for Gaelic was really such a great idea. Most people who came down on the side of it being a good thing cited things such as “attracting young people” (that must be the minority language equivalent of “exposure” in the arts world…), “enhancing the status”, “used judiciously, it will do this and that good thing” and “wait and see, it won’t be that bad”.

Well, I have some news for you and if you’ve never seen me furious, and I mean steaming-out-of-the-ears-furious, here’s your chance.

I wasn’t actually planning to blog about this again, not for a while. But then I made the mistake of doing something unrelated – a bit of data entry in the Faclair Beag. After another fruitless attempt at finding the English for a coileach-gòthan, I picked up one of my many note-sheets and decided I might as well enter one. After a few useful phrases, I came across an odd looking word so I decided to ask the poor man’s corpus (which is useful in giving you a very quick impression of how common a word is). 346 results – that seemed fairly conclusive (for a language like Gaelic) but being the OCD QA freak I am, as always I did a gross error check to see which sites these hits were coming from. Topslotsite? Strange but maybe a coincidence (sometimes English typos or bad line breaks result in seemingly Gaelic words)… just keep going. Coinfalls? What the… Slotjar??? No, I’m not lying…


And it’s not just something Google ran over the site descriptions, we’re talking entire sites which people have just punched through Google Translate and put online:


It’s not just casino stuff… you can also get gibberish about business…


So here’s Reason 1 for me being furious: The more this happens, the less useful the web becomes for doing various Gaelic (and any other such unfortunate small language) related projects. I’m not just talking about messing up my searches. For instance, there are various spellcheckers for smaller languages which are based on web corpora, i.e. bodies of text which have been collected from the web to form the basis of a spellchecker. This also often yields helpful word statistics – which words are more common than others. That may just sound like geekery but it’s the kind of geekery that helps make a better predictive text tool, for example. So while still geeky, we’re talking geeky-that-is-useful-to-Joe-Bloggs.
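To make the geekery concrete, here is a minimal sketch (the function name and naive tokenizer are mine, not from any real spellchecker pipeline) of how a web corpus becomes the frequency list that feeds spellcheckers and predictive text – and why MT junk in the crawl skews it:

```python
from collections import Counter
import re

def word_frequencies(texts):
    """Build a word frequency list from scraped page texts -
    the raw material for spellchecker word lists and
    predictive-text ranking."""
    counts = Counter()
    for text in texts:
        # Naive tokenizer; real pipelines do far more cleanup.
        counts.update(re.findall(r"[a-zàèìòù'-]+", text.lower()))
    return counts

# A clean corpus ranks genuine words highly...
clean = ["tha an cat beag", "tha an taigh mòr"]
print(word_frequencies(clean).most_common(2))  # [('tha', 2), ('an', 2)]

# ...but every Google-translated casino site added to the crawl
# shifts these counts towards whatever the MT engine habitually emits.
```

Once enough unproofed MT is in the crawl, the “most common words” are no longer the most common words of the language but of the machine.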

The more of this GT shite we get on the web, the further the quality of anything you might otherwise cull from the web for each of those languages will go down, down and down. Because unproofed machine translation will just always re-hash whatever is in the machine’s brain, i.e. you will only ever get more of the same.

Then I came across this:


Some site in Russia about maths that has been Google-translated. Look at the lovely yellow box in particular.

So here is Reason 2 for the ceò coming out of my cluasan: Some of us spend a lot of time working on educational (and usually free) tools such as Scratch. It’s hard enough to convince people to try things like that in Gaelic without having to hand them a huge note saying “Beware of the following 3,000 sites which are not fit for purpose”. You know what that kind of warning does to people’s confidence in Gaelic software? Well, I’ll give you a small hint, it doesn’t improve it, that’s for sure.

At this point I decide to put e-pen to e-paper because there’s something else that has been making me furious about all this, something I was saving for later, at least until I had seen what effect reporting the next problem would have.

Ladies and Gentlemen, I give you Publishing Hell. Let’s start with a light entrée:


Nice, eh? For those of you who are not Gaelic speakers, that ought to say “Cidsin an t-Samhraidh”. But wait, I hear you say, surely that’s just an isolated thing… Nope, they are both beyond all linguistic pales and beyond counting:

“Yes but isn’t that obvious? Surely people won’t buy these…” Well, I have news for that camp too. Apparently they do. And why not? Especially a learner who is not yet fluent might very well find something like that an attractive proposition for helping them learn more Gaelic… or Tswana… or Samoan… or Chichewa…

Thanks guys. Nicely done.

ADDENDUM: Someone on Facebook commented that these are “problems to be solved; problems to be welcomed” and that bitching about this problem is like bitching about bad teachers: you don’t send all the teachers back to teacher school en masse because one makes some hideous errors – a generation would be lost. Or a language. The reason I’m commenting on this here as an addendum is not to slag them off but because I realise that this is indeed a way some people will look at this. So I’m dissecting it as a potential view on the matter, not as a personal response (which I did on Facebook).

So is it a problem to be solved? I don’t think it can actually be solved. Who has the time or the money to pay someone to have the time? I already invest much too much of my non-working time into building resources. I don’t have the time to chase charlatans on the web. It’s like fighting midges.

And this is not like having a bad teacher either. You have one bad teacher, fine, shunt them into admin or find another way of improving their skills. Or a bad night class teacher, where word will eventually get round. The difference is the sheer scale of the issue. This is no longer a contest of human vs human, if you’ll pardon the crude simplification; this has become human vs machine. And that’s a contest where a group of humans, small in number, is not going to come off well, because a machine translation system can dish out junk much, much faster than a team of humans can locate it and shut it down. Even if there were a simple way of shutting such things down.

No, this is a problem that Gaelic speakers cannot fix and that has grown out of the short-sightedness of a few people chasing a sexy headline, unwilling to engage in meaningful debate. All we can do now is watch the terror unfold and hope that it will one day step on the toes of a language much bigger than Gaelic.

ADDENDUM 2: I have removed all references to the children’s books on Amazon I had previously discussed on here. The discussion over whether some of them are genuine translations and which of them and to what extent others might be MT was really beginning to detract from the issue at hand. I have asked the 3 people who re-blogged this post to take it down.

Categories: Uncategorized

Getting your locale onto mobile OS

02/04/2016 1 comment

Though I was contemplating “How to waste lifetime” as a title, to be honest. If you don’t want to read through the Odyssey part, fair enough, the quick guide on what to do is at the very bottom.

No, I’m not about to repeat my rant on Detect locale, tempting as that may be. This is about the trek to actually get a locale onto the locales on offer on mobile operating systems like iOS or Android. In the case of Scottish Gaelic, we need to go back all the way to July 2010. I had just been roped into localizing Firefox and we had noticed that the plural rules for Gaelic were either missing or wrong. So in the process of fixing those, it was recommended to me that I also submit them to the Common Locale Data Repository (CLDR). Basically a big holding tank run by the Unicode Consortium for things like plural rules, names of days of the week, month names, whether the month goes before the day etc etc for different locales. Seemed reasonable, so off I go filing a ticket. It took a while but by September 2011, that was in. Yay.
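To give a flavour of what “plural rules” means here, this is my own Python rendering of the Gaelic categories as recorded in CLDR (a sketch for illustration, not anyone’s production code):

```python
def gd_plural_category(n: int) -> str:
    """Scottish Gaelic plural categories as recorded in CLDR:
    1 and 11 -> 'one'; 2 and 12 -> 'two';
    3-10 and 13-19 -> 'few'; everything else -> 'other'."""
    if n in (1, 11):
        return "one"
    if n in (2, 12):
        return "two"
    if 3 <= n <= 10 or 13 <= n <= 19:
        return "few"
    return "other"

print([gd_plural_category(n) for n in (1, 2, 3, 11, 20)])
# ['one', 'two', 'few', 'one', 'other']
```

So a localized UI string needs up to four plural forms in Gaelic where English makes do with two – which is exactly the sort of thing that goes wrong when the rules are missing.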

In the meantime, because I had gotten involved with LibreOffice (well, technically speaking OpenOffice first), I also ended up submitting a minimal dataset for Gaelic to CLDR because creating one was a prerequisite for getting a release of LibreOffice and it again was recommended I submit to CLDR so it’s generally available. Fair enough. Took a while figuring out because back then, the handy Survey Tool (basically a graphic interface) didn’t exist – you had to edit an xml file. Yuck. Started in May 2011 and by October, that was done and dusted.

Here’s where I got naive. I thought the “filtering through” of locale data was automatic. I did actually ask a few people and they all thought it was automatic too – though nobody was entirely sure. For most of even the smaller locales such as Welsh and Irish, somebody must have done “it” far enough back for nobody to know where it came from. So October 2011 on, I start watching the list of locales on my Android. Periodically, I’d pop into a mobile phone shop to check the latest models, in case my phone and OS were just too old.

In the meantime, I kept chipping away at the xml file, adding things like language and country names until I had the file relatively complete. Hoping that perhaps there was a completion threshold – even though nobody seemed to know. I started pinging questions at Android aka Google, figuring they were easier to communicate with than Apple. Hah! It’s like standing at one end of the Munich Beer Festival and playing a game of Chinese whispers with someone at the far end. No answer, lots of silence or vague suggestions of “try there”. Spent hours trying to google the answer. Frustratingly, even though I can almost always tease the web into giving me the info I want, not on this occasion. It was as if nobody had ever actually done whatever it was that needed doing to get a new locale to pop up on a mobile OS.

I was getting increasingly frustrated/annoyed/angry because, increasingly, apps were using detect locale to determine the language of one’s UI. Up until Android 4.2, you could use an app to “fake” a locale, i.e. I could set it to gd-GB and apps such as Opera Mini would come up in Gaelic. But towards the end of 2012, Google blocked that option. Don’t ask me why… The upshot was that even those apps which had been localized were now hidden away because Gaelic did not exist as an official locale. Which set me off on the quest to get manual locale selection into FOSS apps but that’s a different story.

To add insult to injury, while I could understand to some extent why Irish would just be “there” as a locale, I couldn’t for the life of me understand why Manx of all languages was there, but not Gaelic. I mean, bully for Manx but what gives?

Fast forward to May 2014. CLDR is implementing its shiny new Survey Tool and a colleague and I set about filling in the last remaining gaps in the locale data file. Still no Gaelic on Android or iOS, even though the data set was now complete. It wasn’t until August 2014, out of a discussion surrounding the Survey Tool, that someone finally pinned down the problem. Even though we’d had a good enough data set since 2011, this was held “just” in CLDR. It turns out that Android aka Google actually pulls its locales and locale data from something called ICU, the International Components for Unicode. So I file a bug on CLDR which someone kindly copy-moved to ICU. While not great communicators, at least someone imported the data set from CLDR and it was finally included in the ICU 54 release in October 2014. It had taken more than 4 years to discover what was needed. And then it took less than 4 months to get it into the necessary data bucket. 😒

And even crazier, within weeks of the ticket being closed on ICU, a Gaelic speaking Apple tester excitedly mailed me to tell me that on his test version of iOS 8, Scottish Gaelic was there as a locale. There were a few other minor bumps in the road but from iOS 8 on, Gaelic was there as a locale and apparently it made its debut on Android Marshmallow in October 2015. All we need now is for people to upgrade to iOS 8 (fairly straightforward) and Android Marshmallow (not so straightforward, we’ll probably have to wait for people to physically upgrade their devices).

Ay dios…

So here it is for all those who want their locale on Android, iOS & Co:

  1. Bring some spare time. Assuming a single contributor, it will probably take up to a year to get it to appear on the latest devices if you have perfect timing. More likely, 2 years.
  2. Submit a locale data set to CLDR. You will need a Survey Tool account – and bear in mind there is only ONE submission cycle a year, on the whole. If you missed the current one, I recommend you check out existing data sets because you will have to answer fairly techy questions around date and time formatting, plurals, sort orders and goodness knows what else. Pick a locale similar to your own or at least one for a language you speak and see what that looks like.
    Also check what “coverage level target” your locale has (ask someone at CLDR via a ticket). Some locales have a low target, Gaelic happened to be in “comprehensive” for some reason. Probably not worth arguing which one you’re in and just knuckling down.
  3. File a ticket on ICU to get the data ported over.
  4. Wait and finally, enjoy.
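To give you an idea of what such a data set looks like, here is a toy slice in the style of a CLDR ldml file (heavily abbreviated – the real gd.xml covers dates, numbers, territory names and much more), parsed with Python just to show the structure:

```python
import xml.etree.ElementTree as ET

# A toy slice of a CLDR-style locale file for illustration only;
# the genuine file is far larger and lives in the CLDR repository.
LDML_SNIPPET = """
<ldml>
  <identity><language type="gd"/></identity>
  <dates><calendars><calendar type="gregorian">
    <months><monthContext type="format"><monthWidth type="wide">
      <month type="1">Am Faoilleach</month>
      <month type="2">An Gearran</month>
    </monthWidth></monthContext></months>
  </calendar></calendars></dates>
</ldml>
"""

root = ET.fromstring(LDML_SNIPPET)
months = {m.get("type"): m.text for m in root.iter("month")}
print(months["1"])  # Am Faoilleach
```

This is the sort of thing the Survey Tool now edits for you – back in 2011, you were staring at the raw xml.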
Categories: Uncategorized

No time to discuss this as a committee

To begin with, I do not hold all the facts and I do like (or do I have to use the present-past-potential-future tense already?) the product. But there have been so many what-the-fuck moments that it sadly is time for another Dear Developer epistle.

The topic? Mozilla OS. Which judging by today’s post to the localization list by George Roter is now officially floating belly up and face down in digital muck. Oh sure, there are exciting opportunities with the Internet of Things (which has a lengthy Wikipedia article that truly fails to inspire) and Connected Devices (I have yet to meet a Mozillian who can actually tell me what that practically means for end-users).

I guess it at least has an element of closure because back in December, well, we were all completely in the dark, apart from a steady stream of well-meant fluffwords.


So what happened? Well, looking at it from the bottom-up view of a localizer, Mozilla has proven once again that it has a genuinely amazing and skilful pool of workers but management that makes a revolutionary student committee look efficient. So at some point the idea of Mozilla OS was born – all the good things about Mozilla but as an OS. Ok, sounds fair, and I was right in there from the start with localizations. Two reasons, no, make that three, one of which was selfish, the other practical and the third altruistic:

  1. We wanted mobile devices in our language (that was the selfish bit)
  2. Participating early means you reach the maintenance level of translation early, which is a lot easier when there are fewer words to begin with (the practical reason). Plus less of a chance localization turns into an afterthought. Or so I thought…
  3. We wanted to help create a better product that would reach more people (the altruistic reason but more on that later)

Regarding 1 and 2, I kind of started worrying early because it became clear that Mozilla was partly selling its soul to manufacturers. We could localize but there was to be no guarantee, as it turned out eventually, that commercial manufacturers would ship all locales with a high completion. Why? Apparently Mozilla had forgotten to either negotiate harder regarding that and/or forgotten to design an easy way of pulling an unshipped locale once your device had been set up. Ho-hum but given our experience with the better-late-than-never solution to manual locale selection on Mozilla Mobile, I had reasonable confidence there would be a solution. Eventually. So I stuck with the project. Paid for a testing device. Managed to get a tablet for testing too. Helped with sometimes left-field solutions, like when I helped someone crack the problem of how to sort contact lists in lists with mixed scripts without resorting to automatic Unicode conversion (like how to handle a contact like রবীন্দ্রনাথ ঠাকুর on a phone next to Jack Sparrow – easy, ask the user to provide a manual phonetic spelling during contact creation), filed bugs, was a bit of a squeaky wheel… yeah ok, I submitted no patches but I can’t code for toffee, believe it or not.
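For the curious, the gist of that contact-sorting idea can be sketched like so (my own toy rendering, not the actual Mozilla OS code): sort on a user-supplied phonetic spelling where one exists, and fall back to the display name otherwise.

```python
# Toy sketch of the mixed-script contact sort: each contact may carry
# an optional user-supplied phonetic spelling, entered during contact
# creation; we sort on that when present, else on the name itself.
contacts = [
    {"name": "রবীন্দ্রনাথ ঠাকুর", "phonetic": "Rabindranath Thakur"},
    {"name": "Jack Sparrow", "phonetic": None},
    {"name": "Anna Nic a' Phì", "phonetic": None},
]

def sort_key(contact):
    # casefold() gives a rough-and-ready case-insensitive compare;
    # a real implementation would use proper locale collation.
    return (contact["phonetic"] or contact["name"]).casefold()

for c in sorted(contacts, key=sort_key):
    print(c["name"])
# Anna Nic a' Phì
# Jack Sparrow
# রবীন্দ্রনাথ ঠাকুর
```

No automatic transliteration needed – the user tells the phone how they say the name, once.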

I guess alarm bells should have started going off when Flatfish (the tablet branch) went quiet. As in, suddenly there were no more nightly builds and bugs were beginning to pile up, some pretty central (like the fact that no build ever shipped all locales – no, it was crazier than that, the locales were there but the translations weren’t getting pulled from Pootle). Eventually the word was passed round in a very unofficial way that Flatfish was no longer a project Mozilla was pursuing. Like that wasn’t worth an announcement? Even a short blog post by someone high up? Gee, thanks…

At the very least it was highly odd that a mobile OS aiming to compete with existing mobile OS would ignore the tablet side but maybe, I said to myself, we’re prioritising resources until it works well on phone and then we’ll get onto tablets again.

Then in early December we had the news fiasco. Short version is, somehow word got out that Mozilla was canning Mozilla OS but nobody had prepared anything official, not even a blog post, never mind press releases. Just some fluff about the Internet of Things. There’s a pretty good write-up here if you want the whole nine yards. Then all through December and most of January everyone, including Mozillians (at least the workers at the “bottom”) had no idea about what was going on. Great.

In a sense, we still don’t (unless someone can finally explain Connected Devices and the IoT to me in simple, short sentences explaining how that relates to end-users…). Except that we are to cease all work on localizing Mozilla OS for now. Who knows if this will still be the position in a month but for now, there are not going to be any phones which will ship the OS. Why? Reading between the lines, the uptake wasn’t great. Really? Like it was ever going to be easy to get a share of the iOS/Android/Windows phone market? If the decision makers expected an easy ride, they were naive. If they expected a tough ride, why are we bottling out now?

Which, incidentally, they could have made easier, considering one thing they mostly seem to have ignored – while the existing 3 hog most of the market, they are very restricted in their approach to localized interfaces. There are up to 40 million speakers of lesser-used languages in the EU alone and while certainly not all will shift by any stretch of the imagination, for a considerable number of those Mozilla OS would have been one of the few realistic means of getting a device in THEIR language. Neither Android nor iOS caters for Breton or Occitan. Small fry, you might think. Not so. It’s a bit hard to count but there are at least some 350 million people on the planet speaking languages which are not amongst the big boys Android & Co cater for. If that isn’t a market then I don’t know what is.

Will it come back? I don’t know. Would be good… even better if they teamed up with Ubuntu on this one. For now, I’m focussing on Ubuntu Mobile which is also localizable AND ships all locales with a high completion percentage and CyanogenMod AOSP which the Asturians have recently proven to be a way onto at least some devices running a version of Android. Gaelic SHALL go to the ball… would have been nice if it had been with Mozilla OS too.

But seriously, Mozilla is not too big to fail and if it continues to behave like an ocean liner which is steered in a fashion reminiscent of a revolutionary student committee, there will be a hard rock somewhere along the line for it. Which would be a great disservice to all the inspiring and hard working folks at Mozilla, not to mention the volunteers and the world at large. So please, revolutionary leaders up there, put down the hooch, put the origami helmets in the memento drawer and sharpen up your leadership, planning and above all, communication.

Categories: Uncategorized

The warm memories of childhood brought to you by the Terminator T-888 Cyberdyne Systems Class TOK715

02/12/2015 1 comment

irobot_kitchenYes, I like technology. But increasingly these days, I wish we could have a global debate on where we’re going with this and how much of it we want. Not as Gaels or Basques or Chinese or Brazilians but as a species.

Let me backtrack a little. We’ve just released the first ever Gaelic text-to-speech voice, having worked together with the great people at Cereproc in Edinburgh over the last year. This is a good thing. It may seem to contradict my intro but the way I see it, it is an enabling tool. If nothing else, it is an assistive tool for people who are blind or dyslexic – and who speak Gaelic. We often forget that being a speaker of a minority language does not prevent you from being struck by the same issues as everyone else. Or rather, speakers of majority languages tend to forget this. It is not meant to replace real humans and it won’t, as it cannot think for itself. It won’t run off to the kitchen and make dinner or suddenly turn round and say to the user “What’s with all this Somhairle stuff, I want to read some sci-fi, ok?”

Sure, learners will use the voice too and since it is a pretty good voice, it should enhance their learning experience, especially for those with little or no access to native speakers. So I don’t see an issue there (though we did all bust several collective guts making sure the quality is as good as possible).

But a couple of days later, a colleague drew my attention to a line in a summary of a talk to be given at the Centre for Speech and Technology Research. It talks of speech production and refers to “…multimodal interactive games, involving many characters, dialogue partners…”. Cue a slightly dystopian moment. I possibly misread the line slightly – what it probably means is AI dialogue partners in games. Like “talking” to Deckard Cain in Diablo, which is really just a fixed script reeled off following certain actions in the game.

But whatever the intended meaning, it did make me think about the wider implications of pushing talking technology – and in this case speech technology in particular – further and further with little debate about where we’re going with all this. I did have this quick mental flash of a Gaelic speaking Terminator, baking cookies with a human child. Don’t be absurd, you might say, but I don’t think it’s entirely far-fetched. There will come a point when our use of technology in language learning will turn into something more distasteful than a toy bleating out words. There will come a point when interaction with speech produced by an intelligent machine will start to infringe on the way our children learn language and probably even adult interactions.

It may be that we decide, as a species, that a Gaelic/Basque/Aymara/Rapanui… speaking robot is just the thing to re-invigorate our languages. But to my mind, it poses a bigger question about whether this won’t make the whole thing pointless? Not just the issue of language but increasingly us as a species? Our affection for things, pretty sunsets, memories of baking biscuits with our grandmother and the particular sound waves our mothers made at us, is a very human thing. I cannot see a machine developing an appreciation for a field of dandelion other than in a utilitarian sense. Or perhaps we might decide that an intelligent interaction with a machine is preferable to the passive consumption we have at the moment, like those families I observe on the train using tablets as pacifiers. Playing I Spy with the tablet? Would I be looking at the outside of the train or a picture of the outside projected onto the screen?

We still seem to be operating, as a species, on the basis that it’s ok to see what happens when I bang these two rocks together cause hey, it’s Zoug’s own time and effort and what harm can it do. But we’re reaching a point in our technological development where the harm we can do by just seeing what happens when you bang something together is becoming considerable.

It makes me wish we talked more about what we actually want before we go out and do it. But sadly, I cannot really see it happening much, not with people going “well, if I don’t, someone else will”. Maybe I’m just having a gloomy day (no, the sun IS shining in Glasgow today) but I get the feeling we might finally be getting close to an answer to the Fermi paradox, albeit a somewhat unpalatable one. Fingers crossed, eyes closed and hope for the best?

Categories: Uncategorized

Do minority languages need machine translation?

22/11/2015 3 comments

I’m glad to see others in the field have similar apprehensions about MT in small languages

This is an abbreviated transcript of a talk I gave at a British-Irish Council conference on language technology in indigenous, minority and lesser-used languages in Dublin earlier this month (November 2015) under the title ‘Do minority languages need the same language technology as majority languages?’ I wanted to bust the myth that machine translation is necessary for the revival of minority languages. What I had to say didn’t go down well with some in the audience, especially people who work in machine translation (unsurprisingly). So beware, there is controversy ahead!

Be patient! Bí othar!

View original post 1,251 more words

Categories: Uncategorized

How to stonewall Open Source

07/03/2015 3 comments

I seem to be posting a lot about Google these days but then they ARE turning into the digital equivalent of Nestlé.

I’ve been pondering this post for a while and how to approach it without making it sound like I believe in area 52. So I’ll just say what happened and let you come to your own conclusions mostly.

Back when Google still ran the Google in Your Language project, I tried hard to get Gaelic into Gmail and what was rumoured to be a browser but failed, though they were keen to push the now canned Picasa. <eyeroll> Then of course they canned the whole Google in Your Language thing. When I eventually found out that Google Chrome is technically nothing else than a rebranded version of an Open Source browser called Chromium, I thought ‘great, should be able to get a leg in the door that way’. Think again. So I looked around and was already confused because there did not appear to be a clear distinction between Chromium and Chrome. The two main candidates were Launchpad and Google Code. So in January 2011 I decided to file an issue on Google Code, thinking that even if it’s the wrong place, they should be able to point me in the right direction. The answer came pretty quickly. Even though the project is called Chromium, they (quote) don’t accept third party translations for chrome. And nobody seems to know where the translations come from or how you become an official translator. A vague reference that I maybe should try Ubuntu.

I gave it some time. Lots of time in fact. I picked up the thread again early in 2013. Now the semi-serious suggestion was to fork Chromium and do my translation on the fork. Very funny. Needless to say, I was getting rather disgusted at the whole affair and decided to give up on Chrome/Chromium.

When I noticed that an Irish translator on Launchpad had asked a similar question about Chromium and saw that the answer was that they, as far as they knew, push the translations upstream to Chromium from Launchpad, I decided I might as well have a go. As someone had suggested, at least I’d get Chromium on Linux.

Fast forward to October 2014 and I’m almost done with the translation on Launchpad so I figure I’d better file a bug early because it will likely take forever. Bug filed, enthusiastic response from some admin on Launchpad. Great, I think to myself, should be plain sailing from here on. Spoke too soon. End of January 2015, the translation long completed, my queries are met with silence and then more silence. More worryingly, someone points me at a post on Ubuntu about Chromium on Launchpad being, well, dead.

Having asked the question in a Chromium IRC chat room, I decided to have another go on Google Code, new bug, new luck maybe? Someone in the room did sound supportive. That was January 28, 2015. To date, nothing has happened apart from someone ‘assigning the bug to l10n PM for triage’.

I’m coming to the conclusion that Chromium has only the thinnest veneer of being open. Perhaps in the sense that I can get a hold of the source code and play around with it. But there is a distinct lack of openness and approachability about the whole thing. Perhaps that was the intention all along, to use the Open Source community to improve the source code but to give back as little as possible and to build as many layers of secrecy and to put as many obstacles in people’s path as possible. At least when it comes to localization.

At least Ubuntu is no longer pushing Chromium as the default browser. But that still leaves me with a whole pile of translation work which is not being used. Maybe I should check out some other Chromium-based browsers like Comodo Dragon or Yandex. Perhaps I’m being paranoid but I’m not keen on software coming from Russia being on my systems or recommending it to other people. Either way, I’m left with the same problem that we have with Firefox in a sense – it would mean having to wean people off pre-installed versions of Google Chrome or Internet Explorer.

Anyone got any good ideas? Cause I’m fresh out of…

The spectre of Google Translate for Gaelic

15/01/2015 3 comments

Not the kind of pre-Christmas cheer I was hoping for, seriously. Slap bang on the 23rd, someone draws my attention to an article called Google urged to go Gaelic. In a nutshell, a left-field (most likely well-intentioned) appeal by an MSP from Central Scotland to add Scottish Gaelic to the list of languages. As the mere thought was nauseating, I made some time and wrote a very long letter to Murdo Fraser, the man in question, with copies going to David Boag at Bòrd na Gàidhlig and Alasdair Allan, minister for languages. As it sums up my arguments quite succinctly (I hoped), I’ll just copy it here:

Just before Christmas, a friend drew my attention to an article in the Courier regarding Google Translate in which Mr Murdo Fraser argues for a campaign to get Scottish Gaelic onto Google Translate.

I’m sure that this is a well-intentioned idea but in my professional opinion, it would have terrible consequences. As one of the few people who work entirely in the field of Gaelic IT, I have a keen interest in technology and the potential benefit – and damage – this offers to languages like Gaelic. As it happens, I also was the Gaelic localizer (i.e. translator) for Google when it was still running the Google In Your Language programme and I have watched (often with dismay) what Google has done in this area since. One of the projects that certainly caught my eye was Google Translate, especially when Irish was added as a language in 2009. But having spoken to Irish people working in this field and having watched the effects of it on the Irish language, I rapidly came to the conclusion that while it looks ‘cool’, being on a machine translation system for a small(er) language was not necessarily a benefit and in some cases, a tragedy.

Without going into too much technical detail, machine translation of the kind that Google does works best with the following ingredients:
– a massive (billions of words) aligned bilingual corpus
– translation between structurally similar languages or
– translation from a grammatically complex language into a less grammatically complex language but not the other way round
– translation of short, non-colloquial phrases and sentences but not complex, colloquial or literary structures

In essence, machine translation trains an algorithm on ‘patterns’, which is why massive amounts of data are needed and why it works better from a complex language into a less complex one. For example, it is relatively easy to teach the system that German der/die/das require ‘the’ in English, but it requires a massive amount of data for the system to become clever enough to understand when ‘the’ becomes ‘der’ but not ‘die’.
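A toy illustration of that pattern-counting asymmetry (my own example, not anything from Google’s actual system):

```python
from collections import Counter

# A toy aligned "corpus" of German/English word pairs.
pairs = [("der", "the"), ("die", "the"), ("das", "the"),
         ("die", "the"), ("der", "the")]

# German -> English: every article maps to 'the'; trivial to learn
# from a handful of examples.
de_to_en = Counter(pairs)
print(de_to_en.most_common(1))  # [(('der', 'the'), 2)]

# English -> German: 'the' is ambiguous between der/die/das, so the
# system needs vastly more context (i.e. data) to choose correctly.
en_to_de = Counter((en, de) for de, en in pairs)
print(len({de for (en, de) in en_to_de}))  # 3 candidate translations
```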

Unfortunately for Irish, none of these conditions were met – and would also not be met for Scottish Gaelic. To begin with, even if we digitized all the works ever produced which exist in English and Gaelic, the corpus would still be tiny by comparison to the German/English corpus for example.

Then there is the issue of linguistic distance: Irish/Gaelic and English are structurally very different, with Gaelic/Irish having a lot more in the way of complex grammatical structures than English. To compensate for this, the corpus would have to be truly massive – which is why the existing Irish/English system is extremely poor by anyone’s standards.

One might argue that the aim is not a perfect translation system but a means of accessing information only available in other languages – which is the case for many of the languages which are on Google Translate. But I’m doubtful the reverse holds. To begin with, no fluent Gaelic speaker requires a Gaelic > English translation system and there is precious little published in Gaelic in digital form which does not also exist in English. All this would do is remove yet another reason for learning Gaelic.

That would leave English > Gaelic and herein lies the tragedy of the English/Irish pairing on Google Translate. Whatever the intentions of the developers, people will mis-use such a system. I have put together a few annotated photos which illustrate the scale of the disaster in Ireland here. From school reports to official government websites, there are few places where students, individuals or officials trying to cut corners have not used Irish output from Google Translate in ways it was never intended to be used.

If there HAD been a Gaelic/English pair, Police Scotland would have been an even bigger target of ridicule because such an automated translation would have produced gibberish at worst and absurd semi-Gaelic at best.

I think we can all agree that the last thing Gaelic needs is masses of poor quality translations floating around the internet. Funding is extremely short these days and this would, in my view, be a poor use of these scarce funds. There are more pressing battles to be fought in the field of Gaelic and IT, such as the refusal by the 3rd party suppliers of IT services to Gaelic schools and units to provide (existing) Gaelic software or even a keyboard setting in any school that allows students to easily input accented characters, be that for Gaelic, Spanish or French.

is mise le meas mòr (yours, with great respect),

Turns out I wasn’t the only one horrified by the mere thought – John Storey also wrote a very long and polite letter.

Early in January and within days of each other, both John and I received almost identical responses which, in a nutshell, said ‘Thanks but I’ll keep trying anyway’. Even less encouragingly, it made some fairly irrelevant reference to the lack of teachers in Gaelic Medium Education. Which is true of course but, well, not relevant?

Thank you for contacting me in relation to Scots Gaelic and Google Translate and for your detailed correspondence.

I appreciate the depth of your letter and note your concerns in relation to issues of accuracy and the potential impact to speakers of Gaelic of Google translate. I will be sure to consider these when next speaking on the subject.

I also agree that there are other battles to be fought in the field of Gaelic and IT and appreciate the current issues surrounding the number of teachers in Gaelic Medium Education.  However, I do believe it is worth promoting the case for a more accessible Gaelic presence online and without this I believe that Gaelic could miss out on the massive opportunities afforded by the digital age.

I’m still waiting for a response from Bòrd na Gàidhlig or Alastair Allan. But I’m not encouraged. Really frustrated actually because (at least as the Press & Journal and the Perthshire Conservatives would have it), it seems like Bòrd na Gàidhlig and Alastair Allan are throwing their weight behind this ill-fated caper.

I really hope Google turns them down because I really don’t want to end up where the Irish IT specialists ended up – the merry world of “Told you so”…

But sadly “Got Gaelic onto Google” probably just sounds sexier on your CV than “Banged some desks and made sure all kids in Gaelic Medium Education can now easily type àèìòù”…

How to make headlines for the wrong reasons

Good afternoon, boys and girls, very bad language, for example, what we see in the side and at the airport these days? No, I haven’t gone insane, I’m just illustrating a point by resorting to reductio ad absurdum. In other words, I punched the sentence Hey folks, anyone up for some really truly bad language like the stuff we’re seeing at BÁC airport these days? into Bad Translator and let it go through 10 machine translations.

More unheavy fuel?

Why? Glad you asked… these days, Google is making headlines both in the Irish traditional press and in social media. But for all the wrong reasons. The reason? Google Translate. Or rather, a language pair someone should have thought about a little more. Or at least done some user testing on it. Something…

So what I imagine happened is this… some bright spark, either on the Google side or some well-meaning Irish government official thought it would be great if we could have Irish on Google translate. First mistake. Give humans a tool, and they will mis-use it. Like our ex-joiner hammering in screws. So before you give people a tool, think about likely scenarios of mis-use. It clearly does not require a team of MENSA members to imagine that in a minoritised language like Irish, people might start using it for things like their homework or cheap translations rather than a quick way of getting the gist behind web content.

But having blissfully ignored this step, someone must have forged ahead and contributed a bilingual corpus to the Google developers with a note along the lines of here’s a corpus for Irish, please add it to Google Translate. Most likely, second mistake. Right, so there are many ways of building machine translation systems but most rely on a mix of rules and a bilingual corpus. The idea being that as long as you feed a computer enough aligned data in two languages, it can use statistics to figure out how to translate between the two. This idea in itself is sound. Sort of. It depends on the languages in question, the amount of data involved and, oddly enough, the direction of the translation. Here’s an ideal scenario: build a system using a VAST amount of data (we’re talking billions of words) to translate between closely related languages and into the language which has the less fancy grammatical system. Like German to English. That works quite well as a pair on Google Translate because a) there are indeed vast amounts of text which exist in both languages and b) German has the fancier grammar (3 genders, case marking, inflection of verbs…) whereas English does buggerall (some past tense markers on verbs and a plural -s aside, which is peanuts in linguistic terms).

A bit like saying ‘Going all passengers from The gates please their sick people as if the doors to be opened before your Boarding Times’

But once you move away from the ideal model, things start creaking. The more complex the structures of the target language, the more data you’d need for the computer to make any sense of it. So going English to Icelandic creaks much more because even though they’re related languages (ultimately), Icelandic is even more complex than German. Oh and there’s less bilingual data of course.

You get the idea. Now Irish is eye-candy to a linguist. It has grammatical structures to die for: a case system, two genders, two types of mutation (that’s when the first sound in a word changes… you might know people called Hamish? Well, that’s what Irish does to a man called Séamus when you address him), a headache-inducing system for inflecting verbs, a different word order (English is subject-verb-object, Irish is verb-subject-object) and so on. A thousand things English doesn’t do. So what would we need to make this work? Yup, take a gold star – a corpus billions of words big.

Unfortunately there’s no bilingual corpus that even comes close to that. Or at the very least, Google did not feed in anywhere near enough data. I’ve lost track but I think it’s mistake 3?

Cue mistake 4… let it loose on people without a big warning strapped to it or any form of user testing. The result? Eye-wateringly bad translations which start cropping up in the weirdest places. Facebook … ok, we could probably live with that… homework… a lot worse, don’t teachers have enough to contend with? And of course the jewel in the crown – official signage. Yep, that’s right. Google Translate has been making its way onto signage from Dublin Airport to government websites. And the result is almost always nauseating. Breaking through barriers? Only the blood vessels in Irish speakers’ brains perhaps…

It’s not that one shouldn’t attempt to bring technology to smaller languages, I’m all for that. But quality is key. It’s a hard enough sell at the best of times and something like a poor machine translation system can seriously damage the confidence people have in technology in or for their language. A little careful thinking goes a long way…


Once bitten by Open Source, hooked forever?

So some would claim. But having just read the news from Munich, I would re-iterate the need for some soul-searching as to the truth of that claim. The news being that the City of Munich, having decided to switch from Microsoft to Linux in 2004, is considering going back to Microsoft. Sure, there may be some shady business involved but reading the article, there are valid problems that the users are raising.

There are undeniable benefits of Open Source stuff and I won’t bore everyone with going into them again. And undoubtedly some issues stem from users just being so used to Microsoft. But what stood out for me was the comment Munich’s mayor Dieter Reiter made about the complications with managing email, calendars and contacts and that in his view, Linux is sometimes behind Microsoft.

Now before y’all start listing the amazing tools I can sudo onto my Ubuntu machine, that’s not the point. The point is that what Microsoft does offer and which still eludes the Open Source scene is integration and end-user friendliness. Ubuntu sort of makes a stab at that but in my view still falls short.

I will forgo my usual verbosity and simply pose some questions:

  1. Was it really smart of Mozilla to ditch the official development of Thunderbird (their email client) and Lightning (the calendar that goes with it)? Rather than integrating it further with Firefox and coming up with a webmail service based on it?
  2. Why is there still so little cross-project coordination and cooperation in the Open Source scene?
  3. Could this be a painful lesson that OS is not an addictive drug to most users and that they will come off it if they’re having a bad trip? Does this mean that the cavalier way in which most OS projects approach issues of usability and the user interface is coming round big time to bite us?

Don’t get me wrong. I still think it’s the only sustainable way forward, especially for SMLs (small to medium locales). But pride in amazing code will not cut the mustard with Mrs McGinty down the road who just wants something she can use out of the box and link to her phone and with a calendar for her webmail so she won’t forget her next appointment with the orthodontist. Without resorting to command lines that would make Linus weep.

While 420km below the ISS a Dani is sharpening his stone axe

26/05/2014 5 comments

Sometimes the world of software feels a bit like that, a confusing array of ancient and cutting edge stuff. I see you nodding sagely, thinking of the people still using Windows 98 or, even more extreme, Windows 3.11, or people who just don’t want to upgrade to Firefox 3 (we’re on 29 just now, for those of you on Chrome). I actually understand that; on the one hand you have very low-key users who just write the odd email and on the other you have specialists (this is most likely something happening at your local hospital, incidentally) who rely on a custom-rigged system using custom-designed software, all done in the days of yore, to run some critical piece of technology and who are loath to change it since… well… it works. I don’t blame them, who wants to mess around with bleeding tiles when they’re trying to zap your tumour.

But that wasn’t actually what I was thinking about. I was thinking about the spectrum of localizer friendly and unfriendly software. At the one extreme you have cutting edge Open Source developers working on the next generation of localization (also known as l20n, one up from l10n) and on the other you have… well, troglodytes. Since I don’t want to turn this into a really complicated lecture about linguistic features, I’ll pick a fairly straightforward example, the one that actually made me pick up my e-pen in anger. Plurals.

What’s the big deal, slap an -s on? Ummm. No. Ever since someone decided that counting one-two-lots (ah, I wish I had grown up a !San) was no longer sufficient, languages have been busy coming up with astonishingly complex (or simple) ways of counting stuff. At one extreme you have languages like Cantonese which don’t inflict any changes on the things they’re counting. So the writing system aside, you just go 0 apple, 1 apple, 2 apple… 100 apple, 1,000 apple and so on.

English is a tiny step away from that, counting 0 apples, 1 apple, 2 apples… 100 apples, 1,000 apples and so on. Spot something already? Indeed. Logic doesn’t really come into it, not in a mathematical sense. By that I mean there is no reason why in Cantonese 0 should pattern with 1, 2 etc but that in English 0 should go with 2, 3, etc. It just does. Sure, historical linguists can sometimes shed light on how these have developed but not very often. On the whole, they just are.

This is where it gets entertaining (for linguists). First insight: there aren’t as many systems as there are languages. So, far fewer than 6,000. In fact, looking at the places where such rules are collected, there are probably fewer than 100 different ways (on the planet) of counting stuff. Still fun though (for linguists). Let me give you a couple of examples. A lot of Slavonic languages (Ukrainian, Russian etc) require up to 3 different forms of a noun:

  • FORM 1: any number ending in 1 – but not 11 (1, 21, 31, 41…)
  • FORM 2: any number ending in 2, 3 or 4 – but not 12, 13 or 14 (2, 3, 4, 22, 23, 24, 32, 33, 34…)
  • FORM 3: anything else (0, 5, 6… 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 25, 26, 27…)
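A minimal sketch of that three-way split (the function and its form numbering are mine, following the lists above):

```python
def slavic_form(n):
    """Return which of the three noun forms the number n takes."""
    if n % 10 == 1 and n % 100 != 11:
        return 1  # 1, 21, 31, 41...
    if n % 10 in (2, 3, 4) and n % 100 not in (12, 13, 14):
        return 2  # 2, 3, 4, 22, 23, 24...
    return 3      # 0, 5-20, 25-30...

print([slavic_form(n) for n in (1, 2, 5, 11, 21, 22)])
# -> [1, 2, 3, 3, 1, 2]
```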

That almost makes sense in a way. But we can add a few more twists. Take the resurrected decimal system in Scottish Gaelic. It requires up to 4 forms of a noun:

  • FORM 1: 1 and 11 (1 chat, 11 chat)
  • FORM 2: 2 and 12 (2 chat, 12 chat)
  • FORM 3: 3-10, 13-19 (3 cait, 4 cait, 13 cait, 14 cait…)
  • FORM 4: anything else (21 cat, 22 cat, 100 cat…)

Hang on, you’re saying, surely FORM 1 and FORM 2 could be merged. ’fraid not, because while the word cat makes it look as if they’re the same, if you start counting something beginning with the letter d, n, t, s, the following happens:

  • FORM 1: 1 taigh, 11 taigh
  • FORM 2: 2 thaigh, 12 thaigh
  • FORM 3: 3 taighean, 4 taighean, 13 taighean, 14 taighean…
  • FORM 4: 21 taigh, 22 taigh, 100 taigh…
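In code, the Gaelic four-way system looks like this (a sketch; the taigh forms follow the lists above):

```python
# The four forms of taigh 'house' after a number, as listed above.
TAIGH_FORMS = {1: "taigh", 2: "thaigh", 3: "taighean", 4: "taigh"}

def gaelic_form(n):
    """Return which of the four noun forms the number n takes in Gaelic."""
    if n in (1, 11):
        return 1
    if n in (2, 12):
        return 2
    if 3 <= n <= 10 or 13 <= n <= 19:
        return 3
    return 4

for n in (1, 2, 3, 11, 12, 13, 21, 100):
    print(n, TAIGH_FORMS[gaelic_form(n)])
```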

Told you, fun! Now here’s where it gets annoying. Initially, in the very early days of software, localization mostly meant taking software written in English and translating it into German, French, Spanish, Italian & Co and then a bit later on adding Chinese, Japanese and Korean to the list.

Through a sheer fluke, that worked almost perfectly. English has a very common pattern, as it turns out (one form for 1 and another for anything else), so going from English to German posed no problems in translation. You simply took a pair of English strings like:

  • Open one file
  • Open %d files

and translated them into German:

  • Eine Datei öffnen
  • %d Dateien öffnen

Similarly, going to Chinese also posed no problem, you just ended up with a superfluous string because (I’ll use English words rather than Chinese characters):

  • Open one file
  • Open %d file

also created no linguistic or computational problems. Well, there was the fact that in French 0 patterns with 1, not with the plural as it does in English but I bet at that point English developers thought they were home and dry and ready to tick off the whole issue of numbers and number placeholders in software.
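This two-string scheme is baked into the classic gettext API: the developer hands over both English strings plus the number, and the translation layer picks one. A minimal Python illustration (with no translation catalogue loaded, ngettext simply falls back to the English n == 1 rule):

```python
import gettext

# No catalogue loaded: NullTranslations applies the English fallback rule
# (singular string if n == 1, plural string otherwise).
t = gettext.NullTranslations()

def open_message(n):
    msg = t.ngettext("Open one file", "Open %d files", n)
    return msg % n if "%d" in msg else msg

print(open_message(1))  # -> Open one file
print(open_message(5))  # -> Open 5 files
```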

Now I have no evidence but I suspect a Slavonic language like Russian was one of the first to kick up a stink. Because as we saw, it has a much more elaborate pattern than English. Now there was one bit of good news for the developers: although these linguistic setups are elaborate in some cases, they follow predictable patterns, and you only need about 6 categories (which ended up being called ZERO, ONE, TWO, FEW, MANY and OTHER for the sake of readability – so Gaelic ended up with ONE, TWO, FEW and OTHER, for example). Which meant you could write a rule for the language in question and then prep your software to present the translator – and ultimately the user – with the right number of strings for translation. Sure, they look a bit crazy, like this one for Gaelic:

Plural-Forms: nplurals=4; plural=(n==1 || n==11) ? 0 : (n==2 || n==12) ? 1 : (n > 2 && n < 20) ? 2 : 3;\n

but you only had to do it once and that was that. Simples… you’d think. Oh no. I mean, yes, certainly doable and indeed a lot of software correctly applies plural formatting these days. Most Open Source projects certainly do, programs like Linux or Firefox for example have it, which is the reason why you probably never noticed anything odd about it.

One step down from this nice implementation of plurals are projects like Joomla! who will allow you to use plurals but won’t help you. Let me explain (briefly). Joomla! has one of the more atavistic approaches to localization – they expect translators to work directly in the .ini files Joomla! uses. Oh wow. So yes, that DOES enable you to do plurals, but first you have to figure out how to express the plural rule of your language in Joomla!-speak and put that into one of the files. In our case, that turned out to be

   public static function getPluralSuffixes($count) {
       // 0 and 20 upwards take the 'other' form
       if ($count == 0 || $count > 19) {
           $return = array('0');
       // 1 and 11
       } elseif ($count == 1 || $count == 11) {
           $return = array('1');
       // 2 and 12
       } elseif ($count == 2 || $count == 12) {
           $return = array('2');
       // 3-10 and 13-19
       } else {
           $return = array('FEW');
       }
       return $return;
   }

Easy peasy. One then has to take the English entry – one string for the singular, one for the plural – and change it into four entries for Gaelic, one per suffix (‘0’, ‘1’, ‘2’ and ‘FEW’ tacked onto the key).
Unsurprisingly, most localizers just can’t be bothered doing the plurals properly in Joomla!.
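For the curious, here is roughly what Joomla! does with those suffixes, sketched in Python (the key names and the lookup are my reconstruction for illustration – Joomla! itself does this in PHP): the suffix returned by getPluralSuffixes() is tacked onto the language key, and the translator has to supply one .ini line per suffix.

```python
# Invented .ini-style entries a Gaelic translator would have to supply,
# one per suffix ('0', '1', '2', 'FEW') -- using the cat forms from above.
strings = {
    "N_CATS_0":   "%d cat",   # 0 and 20 upwards
    "N_CATS_1":   "%d chat",  # 1 and 11
    "N_CATS_2":   "%d chat",  # 2 and 12
    "N_CATS_FEW": "%d cait",  # 3-10 and 13-19
}

def plural_suffix(count):
    """The Gaelic plural rule, mirroring getPluralSuffixes()."""
    if count == 1 or count == 11:
        return "1"
    if count == 2 or count == 12:
        return "2"
    if 2 < count < 20:
        return "FEW"
    return "0"

def translate(key, count):
    """Look up the key with the right suffix and fill in the number."""
    return strings[f"{key}_{plural_suffix(count)}"] % count

print(translate("N_CATS", 3))   # -> 3 cait
print(translate("N_CATS", 21))  # -> 21 cat
```

Multiply that by every string in an interface that mentions a number and you can see why corners get cut.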

Ning is another project in this category – they also required almost as many contortions as Joomla! but their mud star is for having had plural formatting. And then having ditched it because allegedly the translators put in too many errors. Well duh… give a man a rusty saw and then complain he’s not sawing fast enough or what?

And then there are those projects which stubbornly plod on without any form of plural formatting (except English style plurals of course). The selection of programs which are still without proper plurals IS surprising I must say. You might think you’d find a lot of very old Open Source projects here which go back so far that no-one wants to bother with fixing the code. Wrong. There are some fairly new programs and apps in this category where the developers chose to ignore plurals either through linguistic ignorance or arrogance. Skype (started in 2003) and Netvibes (2005) for example. Just for contrast, Firefox was born in 2002 and to my knowledge always accounted for plurals.

Similarly, some of them belong to big software houses which technically have the money and manpower to fix this – such as Microsoft. Yep, Microsoft. To this day, no Microsoft product I’m aware of can handle non-English type plurals properly in ANY other language. Russians must be oddly patient when it comes to languages because I get really annoyed when my screen tells me I have closed 5 window…

A lot of software falls somewhere between the two extremes – I guess it’s just the way humans are, looking at the way we build our cities into and onto and over older bits of city except when it all falls down and we have to (or can?) start from scratch. But that makes it no less annoying when you’re trying to make software sound less like a robot in translation than it has to…

PS: I’d be curious to know which program first implemented plurals. I’m sort of guessing it’s Linux but I’m not old enough to remember. Let me know if you have any insights!

PPS: If you’re a developer and want to know more about plurals, I recommend the Unicode Consortium’s page on plurals as a starting point, you can take it from there.