You may recall that back in November last year I (and some other people) seriously questioned whether Google Translate for Gaelic was really such a great idea. Most people who came down on the side of it being a good thing cited things such as “attracting young people” (that must be the minority language equivalent of “exposure” in the arts world…), “enhancing the status”, “used judiciously, it will do this and that good thing” and “wait and see, it won’t be that bad”.
Well, I have some news for you and if you’ve never seen me furious, and I mean steaming-out-of-the-ears-furious, here’s your chance.
I wasn’t actually planning to blog about this again, not for a while. But then I made the mistake of doing something unrelated – a bit of data entry in the Faclair Beag. After another fruitless attempt at finding the English for a coileach-gòthan, I picked up one of my many note-sheets and decided I might as well enter one. After a few useful phrases, I came across an odd looking word so I decided to ask the poor man’s corpus (which is useful in giving you a very quick impression of how common a word is). 346 results – that seemed fairly conclusive (for a language like Gaelic) but being the OCD QA freak I am, as always I did a gross error check to see which sites these hits were coming from. Topslotsite? Strange but maybe a coincidence (sometimes English typos or bad line breaks result in seemingly Gaelic words)… just keep going. Coinfalls? What the… Slotjar??? No, I’m not lying…
And it’s not just something Google ran over the site descriptions, we’re talking entire sites which people have just punched through Google Translate and put online:
It’s not just casino stuff… you can also get gibberish about business…
So here’s Reason 1 for me being furious: The more this happens, the less useful it will make the web for doing various Gaelic (and any other such unfortunate small language) related projects. I’m not just talking about messing up my searches. For instance, there are various spellcheckers for smaller languages which are based on web corpora, i.e. bodies of text which have been collected from the web to form the basis of a spellchecker. This also often results in helpful word statistics – which words are more common than others. That may just sound like geekery but that’s the kind of geekery that helps make a better predictive text tool for example. So while still geeky, we’re talking geeky-that-is-useful-to-Joe-Bloggs.
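To see why machine-translated junk poisons these tools, consider that a web-corpus pipeline ultimately boils down to frequency counts over whatever pages got scraped. Here’s a minimal sketch (a hypothetical function, not the code of any actual spellchecker) showing how junk pages feed straight into the statistics:

```python
from collections import Counter

def word_frequencies(documents):
    """Count word frequencies across a scraped web corpus.
    Any machine-translated junk pages in `documents` feed
    straight into these counts, and from there into
    spellchecker wordlists and predictive-text rankings."""
    counts = Counter()
    for text in documents:
        counts.update(text.lower().split())
    return counts

# A clean page and a GT-junk page carry exactly equal weight:
corpus = ["tha an latha math", "tha tha tha gibberish gibberish"]
freq = word_frequencies(corpus)
print(freq.most_common(2))
```

The junk page both inflates the count of a genuine word and promotes a non-word into the frequency list, and nothing in the pipeline can tell the difference.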
The more of this GT shite we get on the web, the more, for each of those languages, the quality of anything you might otherwise cull from the web will go down, down and down. Because unproofed machine translation will just always re-hash whatever is in the machine’s brain, i.e. you will only ever get more of the same.
Then I came across this:
Some site in Russia about maths that has been Google-translated. Look at the lovely yellow box in particular.
So here is Reason 2 for the ceò coming out of my cluasan: Some of us spend a lot of time working on educational (and usually free) tools such as Scratch. It’s hard enough to convince people to try things like that in Gaelic without having to hand them a huge note saying “Beware of the following 3,000 sites which are not fit for purpose”. You know what that kind of warning does to people’s confidence in Gaelic software? Well, I’ll give you a small hint, it doesn’t improve it, that’s for sure.
At this point I decide to put e-pen to e-paper because there’s something else that has been making me furious about all this, something I was saving for later, at least until I had seen what effect reporting the next problem would have.
Ladies and Gentlemen, I give you Publishing Hell. Let’s start with a light entrée:
Nice, eh? For those of you who are not Gaelic speakers, that ought to say “Cidsin an t-Samhraidh”. But wait, I hear you say, surely that’s just an isolated thing… Nope, they are both beyond all linguistic pales and beyond counting:
“Yes but isn’t that obvious? Surely people won’t buy these…” Well, I have news for that camp too. Apparently they do. And why not? A learner who is not yet fluent might very well find something like that an attractive proposition for helping them learn more Gaelic… or Tswana… or Samoan… or Chichewa…
Thanks guys. Nicely done.
ADDENDUM: Someone on Facebook commented that these are problems to be solved, indeed problems to be welcomed, and that bitching about this problem is like bitching about bad teachers: you don’t send all the teachers back to teacher school en masse because one makes some hideous errors. A generation would be lost. Or a language. The reason I’m commenting on this here as an addendum is not to slag them off but because I realise that this is indeed a way some people will look at this. So I’m dissecting it as a potential view on the matter, not as a personal response (which I did on Facebook).
So is it a problem to be solved? I don’t think it can actually be solved. Who has the time or the money to pay someone to have the time? I already invest much too much of my non-working time into building resources. I don’t have the time to chase charlatans on the web. It’s like fighting midges.
And this is not like having a bad teacher either. You have one bad teacher, fine, shunt them into admin or find another way of improving their skills. Or a bad night class teacher where word eventually will get round. The difference is the sheer scale of the issue. This is no longer a contest of human vs human, if you’ll pardon the crude simplification, this has become human vs machine. And that’s a contest where a group of humans, small in number, are not going to come off well because a machine translation system can dish out junk much much faster than a team of humans can locate it and shut it down. Even if there were a simple way of shutting such things down.
No, this is a problem that Gaelic speakers cannot fix and that has grown out of the short-sightedness of a few people chasing a sexy headline, unwilling to engage in meaningful debate. All we can do now is watch the terror unfold and hope that it will one day step on the toes of a language much bigger than Gaelic.
ADDENDUM 2: I have removed all references to the children’s books on Amazon I had previously discussed on here. The discussion over whether some of them are genuine translations and which of them and to what extent others might be MT was really beginning to detract from the issue at hand. I have asked the 3 people who re-blogged this post to take it down.
Though I was contemplating “How to waste lifetime” as a title to be honest. If you don’t want to read through the Odyssey part, fair enough, the quick guide on what to do is at the very bottom.
No, I’m not about to repeat my rant on Detect locale, tempting as that may be. This is about the trek to actually get a locale onto the list of locales on offer on mobile operating systems like iOS or Android. In the case of Scottish Gaelic, we need to go back all the way to July 2010. I had just been roped into localizing Firefox and we had noticed that the plural rules for Gaelic were either missing or wrong. So in the process of fixing those, it was recommended to me that I also submit them to the Common Locale Data Repository (CLDR). Basically a big holding tank run by the Unicode Consortium for things like plural rules, names of days of the week, month names, whether the month goes before the day etc etc for different locales. Seemed reasonable, so off I go filing a ticket. It took a while but by September 2011, that was in. Yay.
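For the curious, the Gaelic plural rules now recorded in CLDR are more intricate than the one/other split English gets by with. Sketched in Python rather than CLDR’s own rule syntax (so treat this as an illustration, not the canonical data):

```python
def gd_plural_category(n):
    """Scottish Gaelic plural categories as recorded in CLDR:
    'one' for 1 and 11, 'two' for 2 and 12,
    'few' for 3-10 and 13-19, 'other' for everything else."""
    if n in (1, 11):
        return "one"
    if n in (2, 12):
        return "two"
    if 3 <= n <= 10 or 13 <= n <= 19:
        return "few"
    return "other"
```

So “2 dogs”, “12 dogs” and “20 dogs” each take a different form of the noun in Gaelic, which is exactly the sort of thing software gets wrong until the rules sit in CLDR.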
In the meantime, because I had gotten involved with LibreOffice (well, technically speaking OpenOffice first), I also ended up submitting a minimal dataset for Gaelic to CLDR because creating one was a prerequisite for getting a release of LibreOffice and it again was recommended I submit to CLDR so it’s generally available. Fair enough. Took a while to figure out because back then, the handy Survey Tool (basically a graphical interface) didn’t exist – you had to edit an XML file. Yuck. Started in May 2011 and by October, that was done and dusted.
Here’s where I got naive. I thought the “filtering through” of locale data was automatic. I did actually ask a few people and they all thought it was automatic too – though nobody was entirely sure. For most of even the smaller locales such as Welsh and Irish, somebody must have done “it” far enough back for nobody to know where it came from. So from October 2011 on, I start watching the list of locales on my Android. Periodically, I’d pop into a mobile phone shop to check the latest models, in case my phone and OS were just too old.
In the meantime, I kept chipping away at the XML file, adding things like language and country names until I had the file relatively complete. Hoping that perhaps there was a completion threshold – even though nobody seemed to know. I started pinging questions at Android, aka Google, figuring they were easier to communicate with than Apple. Hah! It’s like standing at one end of the Munich Beer Festival and playing a game of Chinese whispers with someone at the far end. No answer, lots of silence or vague suggestions of “try there”. Spent hours trying to google the answer. Frustratingly, even though I can almost always tease the web into giving me the info I want, I couldn’t on this occasion. It was as if nobody had ever actually done whatever it was that needed doing to get a new locale to pop up on a mobile OS.
I was getting increasingly frustrated/annoyed/angry because increasingly, apps were using detect locale to determine the language of one’s UI. Up until Android 4.2, you could use an app to “fake” a locale, i.e. I could set it to gd-GB and apps such as Opera Mini would come up in Gaelic. But towards the end of 2012, Google blocked that option. Don’t ask me why… The upshot was that even those apps which had been localized were now hidden away because Gaelic did not exist as an official locale. Which set me off on the quest to get manual locale selection into FOSS apps but that’s a different story.
To add insult to injury, while I could understand to some extent why Irish would just be “there” as a locale, I couldn’t for the life of me understand why Manx of all languages was there, but not Gaelic. I mean, bully for Manx but what gives?
Fast forward to May 2014. CLDR is implementing its shiny new Survey Tool and a colleague and I set about filling in the last remaining gaps in the locale data file. Still no Gaelic on Android or iOS, even though the data set was now complete. It wasn’t until August 2014, out of a discussion surrounding the Survey Tool, that someone finally pinned down the problem. Even though we’d had a good enough data set since 2011, this was held “just” in CLDR. It turns out that Android aka Google actually pulls its locales and locale data from something called ICU, the International Components for Unicode. So I filed a bug on CLDR which someone kindly moved over to ICU. While not great communicators, at least someone imported the data set from CLDR and it was finally included in the ICU 54 release in October 2014. It had taken more than 4 years to discover what was needed. And then it took less than 4 months to get it into the necessary data bucket. 😒
And even crazier, within weeks of the ticket being closed on ICU, a Gaelic speaking Apple tester excitedly mailed me to tell me that on his test version of iOS 8, Scottish Gaelic was there as a locale. There were a few other minor bumps in the road but from iOS 8 on, Gaelic was there as a locale and apparently, it made its debut on Android Marshmallow in October 2015. All that’s left now is for people to upgrade to iOS 8 (fairly straightforward) and Android Marshmallow (not so straightforward, we’ll probably have to wait for people to physically upgrade their devices).
So here it is for all those who want their locale on Android, iOS & Co:
- Bring some spare time. Assuming a single contributor, it will probably take up to a year to get it to appear on the latest devices if you have perfect timing. More likely, 2 years.
- Submit a locale data set to CLDR. You will need a Survey Tool account – and bear in mind there is, on the whole, only ONE submission cycle a year. If you missed the current one, I recommend you check out existing data sets in the meantime because you will have to answer fairly techy questions around date and time formatting, plurals, sort orders and goodness knows what else. Pick a locale similar to your own or at least one for a language you speak and see what that looks like.
Also check what “coverage level target” your locale has (ask someone at CLDR via a ticket). Some locales have a low target, Gaelic happened to be in “comprehensive” for some reason. Probably not worth arguing which one you’re in and just knuckling down.
- File a ticket on ICU to get the data ported over.
- Wait and finally, enjoy.
To begin with, I do not hold all the facts and I do like (or do I have to use the present-past-potential-future tense already?) the product. But there have been so many what the fuck moments it sadly is time for another Dear Developer epistle.
The topic? Mozilla OS. Which judging by today’s post to the localization list by George Roter is now officially floating belly up and face down in digital muck. Oh sure, there are exciting opportunities with the Internet of Things (which has a lengthy Wikipedia article that truly fails to inspire) and Connected Devices (I have yet to meet a Mozillian who can actually tell me what that practically means for end-users).
I guess it at least has an element of closure because back in December, well, we were all completely in the dark, apart from a steady stream of well-meant fluffwords.
So what happened? Well, looking at it from the bottom-up view of a localizer, Mozilla has proven once again that it has a genuinely amazing and skilful pool of workers but management that makes a revolutionary student committee look efficient. So at some point the idea of Mozilla OS was born – all the good things about Mozilla but as an OS. Ok, sounds fair, and I was right in there from the start with localizations. Two reasons, no, make that three, one of which was selfish, the other practical and the third altruistic:
- We wanted mobile devices in our language (that was the selfish bit)
- Participating early means you reach the maintenance level of translation early, which is a lot easier when there are fewer words to begin with (the practical reason). Plus less of a chance localization turns into an afterthought. Or so I thought…
- We wanted to help create a better product that would reach more people (the altruistic reason but more on that later)
Regarding 1 and 2, I kind of started worrying early because it became clear that Mozilla was partly selling its soul to manufacturers. We could localize but there was to be no guarantee, as it turned out eventually, that commercial manufacturers would ship all locales with a high completion. Why? Apparently Mozilla had either forgotten to negotiate harder on that point or forgotten to design an easy way of pulling down an unshipped locale once your device had been set up. Ho-hum but given our experience with the better-late-than-never solution to manual locale selection on Mozilla Mobile, I had reasonable confidence there would be a solution. Eventually. So I stuck with the project. Paid for a testing device. Managed to get a tablet for testing too. Helped with sometimes left-field solutions, like when I helped someone crack the problem of how to sort contact lists in lists with mixed scripts without resorting to automatic Unicode conversion (like how to handle a contact like রবীন্দ্রনাথ ঠাকুর on a phone next to Jack Sparrow – easy, ask the user to provide a manual phonetic spelling during contact creation), filed bugs, was a bit of a squeaky wheel… yeah ok, I submitted no patches but I can’t code for toffee, believe it or not.
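The mixed-script sorting idea is simple enough to sketch in a few lines. This is just an illustration of the principle (the field names and structure are made up, not the actual Firefox OS code): sort on the user-supplied phonetic spelling where one exists, fall back to the display name where it doesn’t:

```python
def sort_key(contact):
    # Use the manual phonetic spelling if the user provided one,
    # otherwise fall back to the display name itself.
    return (contact.get("phonetic") or contact["name"]).lower()

contacts = [
    {"name": "রবীন্দ্রনাথ ঠাকুর", "phonetic": "Rabindranath Thakur"},
    {"name": "Jack Sparrow"},
]
ordered = sorted(contacts, key=sort_key)
print([c["name"] for c in ordered])
```

No automatic transliteration needed: the Bengali-script contact slots into the Latin alphabetical order exactly where its owner said it should.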
I guess alarm bells should have started going off when Flatfish (the tablet branch) went quiet. As in, suddenly there were no more nightly builds and bugs were beginning to pile up, some pretty central (like the fact no build ever shipped all locales – no, it was crazier than that, the locales were there but the translations weren’t getting pulled from Pootle). Eventually the word was passed round in a very unofficial way that Flatfish was no longer a project Mozilla was pursuing. Like that wasn’t worth an announcement? Even a short blog post by someone high up? Gee, thanks…
At the very least it was highly odd that a mobile OS aiming to compete with existing mobile OS would ignore the tablet side but maybe, I said to myself, we’re prioritising resources until it works well on phone and then we’ll get onto tablets again.
Then in early December we had the news fiasco. Short version is, somehow word got out that Mozilla was canning Mozilla OS but nobody had prepared anything official, not even a blog post, never mind press releases. Just some fluff about the Internet of Things. There’s a pretty good write-up here if you want the whole nine yards. Then all through December and most of January everyone, including Mozillians (at least the workers at the “bottom”) had no idea about what was going on. Great.
In a sense, we still don’t (unless someone can finally explain Connected Devices and the IoT to me in simple, short sentences explaining how that relates to end-users…). Except that we are to cease all work on localizing Mozilla OS for now. Who knows if this will still be the position in a month but for now, there are not going to be any phones which will ship the OS. Why? Reading between the lines, the uptake wasn’t great. Really? Like it was ever going to be easy to get a share of the iOS/Android/Windows phone market? If the decision makers expected an easy ride, they were naive. If they expected a tough ride, why are we bottling out now?
Which, incidentally, they could have made easier by considering one thing they mostly seem to have ignored: while the existing 3 hog most of the market, they are very restricted in their approach to localized interfaces. There are up to 40 million speakers of lesser-used languages in the EU alone and while certainly not all will shift by any stretch of the imagination, for a considerable number of those Mozilla OS would have been one of the few realistic means of getting a device in THEIR language. Neither Android nor iOS caters for Breton or Occitan. Small fry, you might think. Not so. It’s a bit hard to count but there are at least some 350 million people on the planet speaking languages which are not amongst the big boys Android & Co cater for. If that isn’t a market then I don’t know what is.
Will it come back? I don’t know. Would be good… even better if they teamed up with Ubuntu on this one. For now, I’m focussing on Ubuntu Mobile which is also localizable AND ships all locales with a high completion percentage and CyanogenMod AOSP which the Asturians have recently proven to be a way onto at least some devices running a version of Android. Gaelic SHALL go to the ball… would have been nice if it had been with Mozilla OS too.
But seriously, Mozilla is not too big to fail and if it continues to behave like an ocean liner which is steered in a fashion reminiscent of a revolutionary student committee, there will be a hard rock somewhere along the line for it. Which would be a great disservice to all the inspiring and hard working folks at Mozilla, not to mention the volunteers and the world at large. So please, revolutionary leaders up there, put down the hooch, put the origami helmets in the memento drawer and sharpen up your leadership, planning and above all, communication.
The warm memories of childhood brought to you by the Terminator T-888 Cyberdyne Systems Class TOK715
Yes, I like technology. But increasingly these days, I wish we could have a global debate on where we’re going with this and how much of it we want. Not as Gaels or Basques or Chinese or Brazilians but as a species.
Let me backtrack a little. We’ve just released the first ever Gaelic text-to-speech voice, having worked together with the great people at Cereproc in Edinburgh over the last year. This is a good thing. It may seem to contradict my intro but the way I see it, it is an enabling tool. If nothing else, it is an assistive tool for people who are blind or dyslexic – and who speak Gaelic. We often forget that being a speaker of a minority language does not prevent you from being struck by the same issues as everyone else. Or rather, speakers of majority languages tend to forget this. It is not meant to replace real humans and it won’t, as it cannot think for itself. It won’t run off to the kitchen and make dinner or suddenly turn round and say to the user “What’s with all this Somhairle stuff, I want to read some sci-fi, ok?”
Sure, learners will use the voice too and since it is a pretty good voice, it should enhance their learning experience, especially for those with little or no access to native speakers. So I don’t see an issue there (though we did all bust several collective guts making sure the quality is as good as possible).
But a couple of days later, a colleague drew my attention to a line in a summary of a talk to be given at the Centre for Speech and Technology Research. It talks of speech production and refers to “…multimodal interactive games, involving many characters, dialogue partners…”. Cue a slightly dystopian moment. I possibly misread the line slightly; what it presumably means is AI dialogue partners in games. Like “talking” to Deckard Cain in Diablo which is really just a fixed script which is reeled off following certain actions in the game.
But whatever the intended meaning, it did make me think further about the wider implications of talking technology, and speech technology in particular, advancing with little debate about where we’re going with all this. I did have this quick mental flash of a Gaelic speaking Terminator, baking cookies with a human child. Don’t be absurd, you might say but I don’t think it’s entirely far-fetched. There will come a point when our use of technology in language learning will turn into something more distasteful than a toy bleating out words. There will come a point when interaction with speech produced by an intelligent machine will start to infringe on the way our children learn language and probably even adult interactions.
It may be that we decide, as a species, that a Gaelic/Basque/Aymara/Rapanui… speaking robot is just the thing to re-invigorate our languages. But to my mind, it poses a bigger question about whether this won’t make the whole thing pointless. Not just the issue of language but increasingly us as a species. Our affection for things, pretty sunsets, memories of baking biscuits with our grandmother and the particular sound waves our mothers made at us, is a very human thing. I cannot see a machine developing an appreciation for a field of dandelions other than in a utilitarian sense. Or perhaps we might decide that an intelligent interaction with a machine is preferable to the passive consumption we have at the moment, like those families I observe on the train using tablets as pacifiers. Playing I Spy with the tablet? Would I be looking at the outside of the train or a picture of the outside projected onto the screen?
We still seem to be operating, as a species, on the basis that it’s ok to see what happens when I bang these two rocks together cause hey, it’s Zoug’s own time and effort and what harm can it do. But we’re reaching a point in our technological development where the harm we can do by just seeing what happens when you bang something together is becoming considerable.
It makes me wish we talked more about what we actually want before we go out and do it. But sadly, I cannot really see it happening much, not with people going “well, if I don’t, someone else will”. Maybe I’m just having a gloomy day (no, the sun IS shining in Glasgow today) but I get the feeling we might finally be getting close to an answer to the Fermi paradox, albeit a somewhat unpalatable one. Fingers crossed, eyes closed and hope for the best?
I’m glad to see others in the field have similar apprehensions about MT in small languages
This is an abbreviated transcript of a talk I gave at a British-Irish Council conference on language technology in indigenous, minority and lesser-used languages in Dublin earlier this month (November 2015) under the title ‘Do minority languages need the same language technology as majority languages?’ I wanted to bust the myth that machine translation is necessary for the revival of minority languages. What I had to say didn’t go down well with some in the audience, especially people who work in machine translation (unsurprisingly). So beware, there is controversy ahead!
Good afternoon, boys and girls, very bad language, for example, what we see in the side and at the airport these days? No, I haven’t gone insane, I’m just illustrating a point by resorting to reductio ad absurdum. In other words, I punched the sentence Hey folks, anyone up for some really truly bad language like the stuff we’re seeing at BÁC airport these days? into Bad Translator and let it go through 10 machine translations.
Why? Glad you asked… these days, Google is making headlines both in the Irish traditional press and in social media. But for all the wrong reasons. The reason? Google Translate. Or rather, a language pair someone should have thought about a little more. Or at least done some user testing on it. Something…
So what I imagine happened is this… some bright spark, either on the Google side or some well-meaning Irish government official thought it would be great if we could have Irish on Google translate. First mistake. Give humans a tool, and they will mis-use it. Like our ex-joiner hammering in screws. So before you give people a tool, think about likely scenarios of mis-use. It clearly does not require a team of MENSA members to imagine that in a minoritised language like Irish, people might start using it for things like their homework or cheap translations rather than a quick way of getting the gist behind web content.
But having blissfully ignored this step, someone must have forged ahead and contributed a bilingual corpus to Google developers with a note along the lines of here’s a corpus for Irish, please add it to Google Translate. Most likely, second mistake. Right, so there are many ways of building machine translation systems but most rely on a mix of rules and a bilingual corpus. The idea being that as long as you feed a computer enough aligned data in two languages, it can use statistics to figure out how to translate between the two. This idea in itself is sound. Sort of. It depends on the languages in question, the amount of data involved and the direction of the translation oddly enough. Here’s an ideal scenario: build a system using a VAST amount of data (we’re talking billions of words) to translate between closely related languages and into the language which has the less fancy grammatical system. Like German to English. That works quite well as a pair on Google Translate because a) there are indeed vast amounts of texts which exist in both languages and b) German has the fancier grammar (3 genders, case marking, inflection of verbs…) whereas English does buggerall (some past tense markers on verbs and a plural -s aside, which is peanuts in linguistic terms).
But once you move away from the ideal model, things start creaking. The more complex the structures of the target language, the more data you’d need for the computer to make any sense of it. So going English to Icelandic creaks much more because even though they’re related languages (ultimately), Icelandic is even more complex than German. Oh and there’s less bilingual data of course.
You get the idea. Now Irish is eye-candy to a linguist. It has grammatical structures to die for, a case system, two genders, two types of mutation (that’s when the first sound in a word changes… you might know people called Hamish? Well that’s what Irish does to a man called Séamus when you address him), a headache-inducing system for inflecting verbs, a different word order (English is subject-verb-object, Irish is verb-subject-object) and so on. A thousand things English doesn’t do. So what would we need to make this work? Yup, take a gold star, a corpus billions of words big.
Unfortunately there’s no bilingual corpus that even comes close to that. Or at the very least, Google did not feed in anywhere near enough data. I’ve lost track but I think it’s mistake 3?
Cue mistake 4… let it loose on people without a big warning strapped to it or any form of user testing. The result? Eye-wateringly bad translations which start cropping up in the weirdest places. Facebook … ok, we could probably live with that… homework… a lot worse, don’t teachers have enough to contend with? And of course the jewel in the crown – official signage. Yep, that’s right. Google Translate has been making its way onto signage from Dublin Airport to government websites. And the result is almost always nauseating. Breaking through barriers? Only the blood vessels in Irish speakers’ brains perhaps…
It’s not that one shouldn’t attempt to bring technology to smaller languages, I’m all for that. But quality is key. It’s a hard enough sell at the best of times and something like a poor machine translation system can seriously damage the confidence people have in technology in or for their language. A little careful thinking goes a long way…