I seem to be posting a lot about Google these days but then they ARE turning into the digital equivalent of Nestlé.
I’ve been pondering this post for a while and how to approach it without making it sound like I believe in area 52. So I’ll just say what happened and let you come to your own conclusions mostly.
Back when Google still ran the Google in Your Language project, I tried hard to get into Gmail and what was rumoured to be a browser but failed, though they were keen to push the now canned Picasa. <eyeroll> Then of course they canned the whole Google in Your Language thing. When I eventually found out that Google Chrome is technically nothing else than a rebranded version of an Open Source browser called Chromium, I thought ‘great, should be able to get a leg into the door that way’. Think again. So I looked around and was already confused because there did not appear to be a clear distinction between Chromium and Chrome. The two main candidates were Launchpad and Google Code. So January 2011 I decide to file an issue on Google Code, thinking that even if it’s the wrong place, they should be able to point me in the right direction. The answer came pretty quick. Even though the project is called Chromium, they (quote) don’t accept third party translations for chrome. And nobody seems to know where the translations come from or how you become an official translator. A vague reference that I maybe should try Ubuntu.
I gave it some time. Lots of time in fact. I picked up the thread again early in 2013. Now the semi-serious suggestion was to fork Chromium and do my translation on the fork. Very funny. Needless to say, I was getting rather disgusted at the whole affair and decided to give up on Chrome/Chromium.
When I noticed that an Irish translator on Launchpad had asked a similar question about Chromium and saw the answer was they, as far as they know, push the translations upstream to Chromium from Launchpad, I decided I might as well have a go. As someone had suggested, at least I’ll get Chromium on Linux.
Fast forward to October 2014 and I’m almost done with the translation on Launchpad so I figure I better file a bug early because it will likely take forever. Bug filed, enthusiastic response from some admin on Launchpad. Great, I think to myself, should be plain sailing from here on. Spoke too soon. End of January 2015, the translation long completed, I query to silence and only get more silence. More worryingly, someone points me at a post on Ubuntu about Chromium on Launchpad being, well, dead.
Having asked the question in a Chromium IRC chat room, I decided to have another go on Google Code, new bug, new luck maybe? Someone in the room did sound supportive. That was January 28, 2015. To date, nothing has happened apart from someone ‘assigning the bug to l10n PM for triage’.
I’m coming to the conclusion that Chromium has only the thinnest veneer of being open. Perhaps in the sense that I can get a hold of the source code and play around with it. But there is a distinct lack of openness and approachability about the whole thing. Perhaps that was the intention all along, to use the Open Source community to improve the source code but to give back as little as possible and to build as many layers of secrecy and to put as many obstacles in people’s path as possible. At least when it comes to localization.
At least Ubuntu is no longer pushing Chromium as the default browser. But that still leaves me with a whole pile of translation work which is not being used. Maybe I should check out some other Chromium-based browsers like Comodo Dragon or Yandex. Perhaps I’m being paranoid but I’m not keen on software coming from Russia being on my systems or recommending it to other people. Either way, I’m left with the same problem that we have with Firefox in a sense – it would mean having to wean people off pre-installed versions of Google Chrome or Internet Explorer.
Anyone got any good ideas? Cause I’m fresh out of…
Not the kind of pre-Christmas cheer I was hoping for, seriously. Slap bang on the 23rd, someone draws my attention to an article called Google urged to go Gaelic. In a nutshell, a left-field (most likely well-intentioned) appeal by an MSP from Central Scotland to add Scottish Gaelic to the list of languages. As the mere thought was nauseating, I made some time and wrote a very long letter to Murdo Fraser, the man in question, with copies going to David Boag at Bòrd na Gàidhlig and Alasdair Allan, minister for languages. As it sums up my arguments quite succinctly (I hoped), I’ll just copy it here:
Just before Christmas, a friend drew my attention to an article in the Courier regarding Google Translate in which Mr Murdo Fraser argues for a campaign to get Scottish Gaelic onto Google Translate.
I’m sure that this is a well-intentioned idea but in my professional opinion, it would have terrible consequences. As one of the few people who work entirely in the field of Gaelic IT, I have a keen interest in technology and the potential benefit – and damage – this offers to languages like Gaelic. As it happens, I also was the Gaelic localizer (i.e. translator) for Google when it was still running the Google In Your Language programme and I have watched (often with dismay) what Google has done in this area since. One of the projects that certainly caught my eye was Google Translate, especially when Irish was added as a language in 2009. But having spoken to Irish people working in this field and having watched the effects of it on the Irish language, I rapidly came to the conclusion that while it looks ‘cool’, being on a machine translation system for a small(er) language was not necessarily a benefit and in some cases, a tragedy.
Without going into too much technical detail, machine translation of the kind that Google does works best with the following ingredients:
– a massive (billions of words) aligned bilingual corpus
– translation between structurally similar languages or
– translation from a grammatically complex language into a less grammatically complex language but not the other way round
– translation of short, non-colloquial phrases and sentences but not complex, colloquial or literary structures
In essence, machine translation trains an algorithms in ‘patterns’, which is why massive amounts of data are needed and why it works better from a complex language into a less complex language. For example, it is relatively easy to teach the system that German der/die/das require ‘the’ in English, but it requires a massive amount of data for the system to become clever enough to understand when ‘the’ becomes ‘der’ but not ‘die’.
Unfortunately for Irish, none of these conditions were met – and would also not be met for Scottish Gaelic. To begin with, even if we digitized all the works ever produced which exist in English and Gaelic, the corpus would still be tiny by comparison to the German/English corpus for example.
Then there is the issue of linguistic distance, Irish/Gaelic and English are structurally very different, with Gaelic/Irish having a lot more in the way of complex grammatical structures than English. To compensate for this, the corpus would have to be truly massive. Which is why the existing Irish/English system is extremely poor by anyone’s standards.
One might argue that the aim is not a perfect translation system but a means of accessing information only available in other languages – which is the case for many of the languages which are on Google Translate. But I’m doubtful if the reverse is true. To begin with, no fluent Gaelic speaker requires a Gaelic > English translation system and there is preciously little which is published in Gaelic in digital form which does not also exist in English. All this would do is remove yet another reason for learning Gaelic.
That would leave English > Gaelic and herein lies the tragedy of the English/Irish pairing on Google Translate. Whatever the intentions of the developers, people will mis-use such a system. I have put together a few annotated photos which illustrate the scale of the disaster in Ireland here. From school reports to official government websites, there are few places where students, individuals or officials trying to cut corners have not used Irish translations of Google Translate in ways they were not intended to be used.
If there HAD been a Gaelic/English pair, Police Scotland would have been an even bigger target of ridicule because such an automated translation would have produced gibberish at worst and absurd semi-Gaelic at best.
I think we can all agree that the last thing Gaelic needs is masses of poor quality translations floating around the internet. Funding is extremely short these days and this would, in my view, be a poor use of these scarce funds. There are more pressing battles to be fought in the field of Gaelic and IT, such as the refusal by the 3rd party suppliers of IT services to Gaelic schools and units to provide (existing) Gaelic software or even a keyboard setting in any school that allows students to easily input accented characters, be that for Gaelic, Spanish or French.
is mise le meas mòr,
Turns out I wasn’t the only one horrified by the mere thought – John Storey also wrote a very long and polite letter.
Early in January and within days of each other, both John and I received almost identical responses which, in a nutshell, said ‘Thanks but I’ll keep trying anyway’. Even less encouragingly, it make some really irrelevant reference to the lack of teachers in Gaelic Medium Education. Which is true of course but well, not relevant?
Thank you for contacting me in relation to Scots Gaelic and Google Translate and for your detailed correspondence.
I appreciate the depth of your letter and note your concerns in relation to issues of accuracy and the potential impact to speakers of Gaelic of Google translate. I will be sure to consider these when next speaking on the subject.
I also agree that there are other battles to be fought in the field of Gaelic and IT and appreciate the current issues surrounding the number of teachers in Gaelic Medium Education. However, I do believe it is worth promoting the case for a more accessible Gaelic presence online and without this I believe that Gaelic could miss out on the massive opportunities afforded by the digital age.
I’m still waiting for a response from Bòrd na Gàidhlig or Alastair Allan. But I’m not encouraged. Really frustrated actually because (at least as the Press & Journal and the Perthshire Conservatives would have it), it seems like Bòrd na Gàidhlig and Alastair Allan are throwing their weight behind this ill-fated caper.
I really hope Google turns them down because I really don’t want to end up where the Irish IT specialists ended up – the merry world of “Told you so”…
But sadly “Got Gaelic onto Google” probably just sounds sexier on your CV than “Banged some desks and made sure all kids in Gaelic Medium Education can now easily type àèìòù”…
Sometimes it’s not the developers fault. Shocking, I know. Sometimes, it’s the linguistic community (using the term loosely) who is at fault for not asking for the right thing.
I was looking up something in my Mojave dictionary to other week (don’t ask), followed by a Google search which pointed me at a Gaelic dictionary for administrative terminology. While the two are probably some 15 years apart, they have one thing in common. They’re both “flat” documents (there’s probably a niftier term but I can’t think of it). The Mojave dictionary is a ring-bound, printed dictionary, the Gaelic one some form of PDF. Now, don’t get me wrong, I have a love affair with dictionaries – I collect dictionaries like my mother shoes and handbags. So the more the merrier.
But what also hit me is how much of a debt of gratitude the Scottish Gaelic world owes a chap called Kevin Scannell. While doing some research in Dublin, we met at the Club Cónradh na Gaeilge for a pint and a chat about Irish IT and he explained to me, in short and simple phrases, what a lexical database is and why anyone should give a monkey’s about having one. That chat came at a pivotal moment because, having just finished the digitization of Dwelly’s classical Gaelic dictionary, I had both come to realise some of the inherent shortcomings of paper dictionaries in digital form and I was at the cusp of embarking on the development of what is now the Faclair Beag along with Will Robertson.
Now there are many forms a lexical database can take but essentially the difference between a massive wordlist and a lexical database is that in the database you don’t just list words but you also mark them for what they are. For example, rather than having file, grunt, apple, horse, with, and in a comma separated list, such a database would mark file, grunt, apple and horse as being a singular noun, a second instance of file and grunt as regular verbs, and as a conjunction and with as a preposition. You can get a lot more fancy than that but at a very basic level, that’s the difference.
So what you might say, you still end up with a dictionary at the other end and doing a database involves a whole lot of extra work. Tru dat, but the other word I learned on that trip was the word future proofing. What that means is that if you write a beautiful dictionary as a flat text document, you get just that, a beautiful dictionary. Useful, and for many centuries the only kind available. But that was down to technical limitations, not necessarily choice. Anyway, such a dictionary is future proof only to a certain extent. If it’s digital (as opposed to typed, which unbelievably still happens…) you can edit bits for an new edition. You can put it online and, with a bit of messing about, even turn it into a searchable dictionary. But that’s just about it, anything beyond that involves an insane amount of extra work. For example, your dictionary may list 20,000 heardwords but there will be a lot of word forms which aren’t headwords: plurals, past tenses, words which only appear in examples but not as headwords and so on.
But I can look up the plural of goose. Yes, that’s true but say for example you wanted to do something beyond that. For example, you might be a linguist interested in word frequencies, wanting to find out how common certain words are. Do a word search in your text? Possible, but then you end up with a number for goose and another for geese. And in some languages the list of forms a word can take is huge, in Gaelic the goose can show up as gèadh, ghèadh, geòidh, gheòidh, gèadhaibh and ghèadhaibh.
But it’s not just nice for linguists. The applications of even a basic lexical database are impressive. Let me continue with the practical example to illustrate this. If you search for bean in the Faclair Beag, you end up seeing this entry at the top:
We decided to keep it fairly simple and devised different tables for the different types of words we get in Gaelic – feminine and masculine nouns, verbs, prepositions and so on. And for each, we made a table which covers the different possible forms of each word. For a Gaelic noun, that means lenition, datives, genitives, vocatives, singular and plural, plus a junk field for anything that might be exceptional.
Yes, it’s a bit of extra work but one immediate benefit is that because each form is tied to the ID of the root, it doesn’t matter if a user sticks in a form like mhnàthadh – the dictionary will still know what to look for. That’s a decided bonus for people who are inexperienced or looking for a rare inflected form they’re unsure of. It also cuts down the number See x entries because if two words are simply variations of the same root (like crèadh and criadh in Gaelic which are both pronounced the same way and mean the same thing). So usability is an immediate benefit.
Next benefit is an almost instantaneous and updatable spellchecker – as long as the data you punch in is clean, all you have to do is export the table and dump it in Hunspell for example. Ok, it involves a little more fiddling that that but compared to the task of extracting all words from a flat text file, it’s a doddle. For example, I was asked if we could do something for Cornish based on the Single Written Form dictionary. The answer was yes, but I don’t have time to extract all the words manually. In addition, our spellchecker is a lot leaner and smarter as a result because we were able to define certain rules, rather than multiply entries. For example, Gaelic has emphatic endings that can be added to any noun: -sa, -se, -san etc. So rather than add them manually to each noun, Kevin could just write a rule that said: if the table says it’s a noun, allow these endings. Simples.
Ok, so you get a spellchecker, big deal. It is, actually but anyway, another spin-off was predictive texting for Gaelic (again with help from the indefatigable Kevin), because all we had to do was to take the list and fiddle with the ranking. Simplifying a bit but again, when compared to doing it manually off a flat text file, it’s a lot less work. Another spin-off was a digital Scrabble for Gaelic and several other word games like hangman. Oh, the University of Arizona asked for a copy to help them tag some Gaelic texts. And we’re not finished by a long shot.
Did I mention the maps? Perhaps too long a story for here but using our database we have been able to build dialect maps on steroids, like this one here indicating the word in question is southern:
And I’m sure there are other uses that we haven’t even though of yet but whatever the development, we’re fairly future proof in the sense that with a bit of manipulation, we can make our dictionary data dance, sing, foxtrot and rumba, not just perform Za Zen.
Which brings me back to my original point. People in the world of small languages could benefit from doing their homework and rather than rushing into something, go a bit more slowly and build something that is resilient for the future – even if “Let’s do a dictionary and publish it next year” sound waaaay sexier. A database is something most developers can build and while it takes a bit more time, you don’t require a rocket scientist to add the language data – but in order to get it built, you have to ask for it in the first place.
It’s been a strange sort of end to the week. I e-met a new language and came face to face with a linguistic, digital needle in a cyberhaystack. Ok, I’m not making much sense so far, I know… just setting the scene!
We all know Skype, the new version of which (quoting my hilarious brother) “convinces through less functionality and more bugs”. Back when Skype still belonged to itself, I eventually discovered the fact that, at least on Windows, it’s pretty easy to localize. You go to Tools » Change Language » Edit Skype Language file and right down there where everyone can see it, you have the option to save the English.lang file (which contains the English strings) under a new name and add your own translation. So back in 2011 I started working on a Gaidhlig.lang and by early 2012 had finally caught up with all the updates that kept getting in the way.
What does one do when one has completed a translation? Sure, you submit it to the project and ask them to bundle it, release it, whatever. Not so fast, buckoes… Due to “size issues” (I’d like to remind everyone at this point that currently, a full language file weighs in at a massive 400KB), Skype only bundles the usual 20 or so suspects, CJK (that’s Chinese, Japanese and Korean) and a bunch of European languages with the install file. Since they never though of adding an Install new language function that could pull a file from some repository, the short of it was that even having localized the lot, you were on your own. Sure, you could post the file as an attachment on the forum but then who goes trawling through a forum in search of a language file?
Using the usual “Gaelic” channels, I think we’ve reached a reasonable number of people so far but certainly less than we would have reached had it been “inside the program itself.
But before I knock the old forum too much, I should point out that it actually had a dedicated localization section. Why do I mention this? Because, moving to the next episode where we finally meet Mr Big, when Skype was bought by Microsoft, the forums were wiped and *cough* improved. That’s right, the localization section went. Especially the parts where people were trying very hard to figure out how to turn a .lang file into something that Linux and MacOS could digest. Am I glad I took copies of the bits that were useful…
Anyway, even in the new forum, the localization questions never went away. But the stock answer of the one admin who bothers to check that corner is always that “there’s no news”. In fairness, I don’t think he actually has the power to do anything, he’s just the unfortunate person who has to interact with, shock and horror, the users. So even though Skype was first launched in 2003, here we are in 2012 still asking the same questions – why can’t you bundle our language, why can’t we convert/localize the files for MacOS/Linux and how about frickin plural formatting?
Yep, “there’s no news”. The chap working on Welsh then had an interesting suggestion – can’t we host them on SourceForge? You see, the problem with distributing the files via the forum is that once your post moves off the first page, who’s going to see it? So, brilliant idea I thought and we went about setting up a project. Nothing fancy, just the .lang files which don’t come bundled with Skype and a few Wiki pages with guidance.
Seeing I had a quiet day and since my contributions in terms of code are… amusing, I decided to hit the web to locate all the .lang files out there, or as many of them as possible anyway – I may suck at code but I rock at websearches! Half a day later, I had the most amazing collection of languages. Some I had known about – Gaelic, Welsh, Cornish, Irish and Uyghur – as their translators had been active on the forum. Some were part of the usual suspects but some were totally unexpected and one I’d never even heard about which is, as a matter of fact, rather unusual. So in the end, we had:
- Uyghur (Persion and Latin script)
Definitely wow. Admittedly, not all are complete but it’s still one of the most diverse lists I’ve ever come across, even if there are no languages from the Americas in the list. Especially Adyghe, Chuvash and Erzya are not languages you normally see on localization projects. And Nias I had never even heard about. Turns out it’s a language of some 700,000 speakers off the coast of Sumatra. That certainly cheered me up. Yeah I know, geek 🙂
But what made me shake my head all afternoon was something else – the lengths I had to go to in manipulating my websearches and the places I found some of them. Gaelic I had, Welsh, Albanian and Cornish came of Skype’s forum. Basque (normally a rather well organized language) I found embedded as a .obj file on some archived forum post. Adyghe, Chuvash and Erzya came of some websites that looked a bit like a forum where someone had posted, in the case of Erzya without linebreaks, the translations – in two cases, with the Russian strings still embedded so I had to strip those out first before creating the .lang files. Armenian came out of a public DropBox and Breton off the Ofis ar Brezhoneg website. Afrikaans was on some unlinked page on someone’s personal website. Esperanto was on the Wiki of the Universala Esperanto Asocio but it took me some time to figure that in order to get the strings, I had to trawl through the page history as someone had at some point – accidentally or deliberately – deleted them. Mirandese and Nias were in some silent loop on abandoned university websites – probably student projects from long ago. And one came off a file sharing site, I forget which, making me seriously wonder if I was downloading porn, a virus or actually the .lang file. I actually even found Kurdish but the people who did that seem to have accidentally stripped out the string names so having explained the problem, they’re trying to match them together again as my Kurdish isn’t that baş.
I didn’t quite know whether to congratulate myself or whether to cry. All that effort, all those wonderfully selfless people putting their time and effort into translating something into their language. And then, because the people making money off it couldn’t be bothered, we ended up with these needles in the cyberhaystack. Crying is still an option I feel…
It’s nice to know they’re on SourcForge now (check out SkypeInYourLanguage) and that there’s a few people willing to put some time into making the process a bit better but by gum guys… if people are actually willing to help you make more money by making your product available in more languages, how about giving them a leg up, rather than the finger?
The debate about digital technology and localization and internationalization has probably raged in one form or other ever since someone invented the first program. Mind, for me personally it goes back to that ill-fated moment when ASCII was born with some bright spark arguing that no one would ever need more than those few letters that English has. My first computing headaches were around ASCII – how do I do an /ɣ/ and what the heck was %73£ when someone typed it at the other end?
Much has happened since and I’ve moved from phonology to software translation big time but I still can’t quite decide whether we’re in a better place now or not when it comes to small languages. Those technicalities (like ASCII vs Unicode) aside, the field has indeed opened up, in particular when it comes to open source software. There’s nothing but laziness that stops a language from having at least an office suite (LibreOffice), a browser (Firefox or Opera), an email client and calendar (Thunderbird and Lightning), a media player (VLC), a wiki (MediWiki), a spellchecker, a forum package (phpBB) and blogging software (WordPress.org and .com) – satisfying a fair chunk of your average user. For the really tough there’s Linux in all its scary glory of course. Ignoring the height of the bar when it comes to actually localizing some of them, that’s not the whole story though.
At least in digitized countries, a significant chunk of our work and social lives have shifted onto various digital platforms. Desktops, laptops, smartphones, tablets… you name it. Hardly a year goes by without some innovation hitting the headlines. And the tech savy (overwhelmingly the young) have become real digital nomads. Yesterday’s app is so passé today and today’s market leader mobile phone OS may be tomorrow’s digital roadkill (anyone remember Symbian?). It’s a bewildering, fluid place.
It’s a place we can’t ignore. Whether we like it or not, virtually anyone under the age of 25 has a smartphone, from rocky outcrops in the Western Ocean like Barra to the mountains of Gipuzkoa, the deserts of Arizona and the steaming hills of Papua New Guinea. Ok, maybe not Papua New Guinea yet though it wouldn’t surprise me. The more of a space we can carve out for out languages and cultures, the better because sadly the old maxim of “Use it or lose it” – or however your language puts that – is true.
So we must compete somehow, at least at some base level. But I increasingly feel that without a small but dedicated full time team, this will become harder and harder unless there’s some magic on the way that I haven’t heard about. Let me give you an example. Predictive texting goes back to the 1970s, believe it or not but not wanting to be too depressive about it, it probably did not make huge inroads into our lives before the year 2000 or so when it really took off on phones. Back then, you had those languages which your manufacturer deemed appropriate, maybe a dozen or so if you were lucky. We’re now in 2012 and I’m waiting with bated breath for the first release of Irish, Scottish Gaelic and Manx on Adaptxt which, after much searching, I discovered last year. Finally an open source predictive texting project open to any language. Yay! Ok, so it only works on Android… I can live with that, looking at the Android market share. It would be good if iPhones also supported 3rd party entry methods but they don’t and I’m getting to the cheesed off stage with Apple’s approach to non-billion-speaker-languages anyway.
But I digress. There we are, happily preparing the tool which will finally take Scots Gaelic and Manx out of the letter-by-letter age (Irish has had Téacs since 2008 but I’m not sure how alive the project is) when Apple starts pushing Siri (that voice recognition thing on iPhones which, by the way, only works if your accent resembles that of the Queen and or Charleton Heston). I bet my bottom dollar that before long, every major mobile phone manufacturer will be running something similar.
Here, I gnash my teeth. Predictive texting is reasonably easy to do as long as you have a framework you can feed your data into. For example a spellchecker. But it’s taken around a decade for such a framework to grow out of the cyber community. Speech recognition is a harder. A lot harder. I have no idea how long it will take for languages such as Gaelic to take that hurdle and even less so of how many of this planet’s 6,000 languages will manage to do so. And that makes it all a little frustrating.
I don’t know what the answer is, right now, I just feel it would be nice if stuff slowed down a bit. Honestly, how much technological innovation do we need in 12 months? Or rather, how many false summits can we and our languages keep pace with?