Archive for the ‘Minority languages’ Category

A bit of lexicographic navelgazing

11/05/2013 12 comments

Sometimes it’s not the developers’ fault. Shocking, I know. Sometimes, it’s the linguistic community (using the term loosely) who is at fault for not asking for the right thing.

[Image: the Mojave dictionary]

Just in case you didn’t believe me…

I was looking up something in my Mojave dictionary the other week (don’t ask), followed by a Google search which pointed me at a Gaelic dictionary for administrative terminology. While the two are probably some 15 years apart, they have one thing in common. They’re both “flat” documents (there’s probably a niftier term but I can’t think of it). The Mojave dictionary is a ring-bound, printed dictionary, the Gaelic one some form of PDF. Now, don’t get me wrong, I have a love affair with dictionaries – I collect dictionaries like my mother collects shoes and handbags. So the more the merrier.

But what also hit me is how much of a debt of gratitude the Scottish Gaelic world owes a chap called Kevin Scannell. While I was doing some research in Dublin, we met at the Club Cónradh na Gaeilge for a pint and a chat about Irish IT and he explained to me, in short and simple phrases, what a lexical database is and why anyone should give a monkey’s about having one. That chat came at a pivotal moment because, having just finished the digitization of Dwelly’s classical Gaelic dictionary, I had both come to realise some of the inherent shortcomings of paper dictionaries in digital form and was on the cusp of embarking on the development of what is now the Faclair Beag along with Will Robertson.

Now there are many forms a lexical database can take but essentially the difference between a massive wordlist and a lexical database is that in the database you don’t just list words but you also mark them for what they are. For example, rather than having “file”, “grunt”, “apple”, “horse”, “with” and “and” in a comma-separated list, such a database would mark “file”, “grunt”, “apple” and “horse” as singular nouns, a second instance of “file” and “grunt” as regular verbs, “and” as a conjunction and “with” as a preposition. You can get a lot more fancy than that but at a very basic level, that’s the difference.
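To make that difference concrete, here is a minimal sketch in Python. The `Entry` class, the part-of-speech labels and the helper functions are my own illustrative assumptions, not how any real lexical database is laid out.

```python
from dataclasses import dataclass

# A flat wordlist: just strings, with nothing saying what each word is.
FLAT_WORDLIST = ["file", "grunt", "apple", "horse", "with", "and"]


@dataclass(frozen=True)
class Entry:
    """One row of a toy lexical database (hypothetical layout)."""
    word: str
    pos: str  # part of speech; the label names are assumptions


# The same words, but each one marked for what it is.
# "file" and "grunt" appear twice: once as nouns, once as verbs.
LEXICAL_DB = [
    Entry("file", "noun"),
    Entry("grunt", "noun"),
    Entry("apple", "noun"),
    Entry("horse", "noun"),
    Entry("file", "verb"),
    Entry("grunt", "verb"),
    Entry("and", "conjunction"),
    Entry("with", "preposition"),
]


def words_with_pos(db, pos):
    """Return all words carrying the given part-of-speech label."""
    return sorted(entry.word for entry in db if entry.pos == pos)


def pos_of(db, word):
    """Return every part-of-speech label recorded for a word."""
    return sorted(entry.pos for entry in db if entry.word == word)
```

With the flat list, a question like “which of these can be verbs?” has no answer. With the marked-up version, `words_with_pos(LEXICAL_DB, "verb")` picks out “file” and “grunt” straight away.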

So what, you might say, you still end up with a dictionary at the other end and doing a database involves a whole lot of extra work. Tru dat, but the other word I learned on that trip was the word future proofing. What that means is that if you write a beautiful dictionary as a flat text document, you get just that, a beautiful dictionary. Useful, and for many centuries the only kind available. But that was down to technical limitations, not necessarily choice. Anyway, such a dictionary is future proof only to a certain extent. If it’s digital (as opposed to typed, which unbelievably still happens…) you can edit bits for a new edition. You can put it online and, with a bit of messing about, even turn it into a searchable dictionary. But that’s just about it, anything beyond that involves an insane amount of extra work. For example, your dictionary may list 20,000 headwords but there will be a lot of word forms which aren’t headwords: plurals, past tenses, words which only appear in examples but not as headwords and so on.

But I can look up the plural of goose. Yes, that’s true but say for example you wanted to do something beyond that. For example, you might be a linguist interested in word frequencies, wanting to find out how common certain words are. Do a word search in your text? Possible, but then you end up with a number for goose and another for geese. And in some languages the list of forms a word can take is huge: in Gaelic the goose can show up as gèadh, ghèadh, geòidh, gheòidh, gèadhaibh and ghèadhaibh.
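A rough sketch of the frequency problem, in Python. The six goose forms are the ones listed above; the sample token list is made up purely for illustration (it is not a real sentence), and the function names are my own.

```python
import unicodedata
from collections import Counter

# The forms of "goose" mentioned in the text, all belonging to one root.
GOOSE_FORMS = ("gèadh", "ghèadh", "geòidh", "gheòidh", "gèadhaibh", "ghèadhaibh")


def nfc(text):
    """Lower-case and normalise accents so è always compares equal to è."""
    return unicodedata.normalize("NFC", text.lower())


# Map every known form to its root; a real database would hold many roots.
FORM_TO_ROOT = {nfc(form): nfc("gèadh") for form in GOOSE_FORMS}


def count_by_form(tokens):
    """What a plain text search gives you: one number per surface form."""
    return Counter(nfc(token) for token in tokens)


def count_by_root(tokens):
    """Fold every known form into its root before counting."""
    return Counter(FORM_TO_ROOT.get(nfc(token), nfc(token)) for token in tokens)


# Made-up token list, for illustration only.
SAMPLE_TOKENS = ["gèadh", "geòidh", "ghèadh", "agus", "gheòidh", "geòidh"]
```

Counting by form scatters the goose over four separate numbers here, while counting by root gives a single figure for gèadh. That single figure is what a linguist interested in word frequencies actually wants.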

But it’s not just nice for linguists. The applications of even a basic lexical database are impressive. Let me continue with the practical example to illustrate this. If you search for bean in the Faclair Beag, you end up seeing this entry at the top:

[Screenshot: the Faclair Beag entry for bean]

But what the casual dictionary user does not realise is that behind the scenes, things look a little different:

[Screenshot: the database tables behind the entry]

We decided to keep it fairly simple and devised different tables for the different types of words we get in Gaelic – feminine and masculine nouns, verbs, prepositions and so on. And for each, we made a table which covers the different possible forms of each word. For a Gaelic noun, that means lenition, datives, genitives, vocatives, singular and plural, plus a junk field for anything that might be exceptional.

Yes, it’s a bit of extra work but one immediate benefit is that because each form is tied to the ID of the root, it doesn’t matter if a user sticks in a form like mhnàthadh – the dictionary will still know what to look for. That’s a decided bonus for people who are inexperienced or looking for a rare inflected form they’re unsure of. It also cuts down the number of “See x” entries, because if two words are simply variations of the same root (like crèadh and criadh in Gaelic, which are both pronounced the same way and mean the same thing), both can simply point to the same root. So usability is an immediate benefit.
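For the curious, the idea of tying every form to the ID of its root can be sketched with two tiny tables. This is a toy using Python’s built-in `sqlite3`, not the actual Faclair Beag schema: the table names, column names and the choice of crèadh as the headword are all assumptions made for the example.

```python
import sqlite3


def build_db():
    """Create an in-memory toy database: one table of roots, one of forms."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE roots (id INTEGER PRIMARY KEY, headword TEXT NOT NULL);
        CREATE TABLE forms (
            form    TEXT PRIMARY KEY,
            root_id INTEGER NOT NULL REFERENCES roots(id)
        );
        """
    )
    conn.executemany(
        "INSERT INTO roots (id, headword) VALUES (?, ?)",
        [(1, "gèadh"), (2, "crèadh")],
    )
    conn.executemany(
        "INSERT INTO forms (form, root_id) VALUES (?, ?)",
        [
            # Inflected forms of one root
            ("gèadh", 1), ("ghèadh", 1), ("geòidh", 1),
            ("gheòidh", 1), ("gèadhaibh", 1), ("ghèadhaibh", 1),
            # Spelling variants share a root instead of needing a "See x" entry
            ("crèadh", 2), ("criadh", 2),
        ],
    )
    return conn


def lookup(conn, typed):
    """Return the headword for whatever form the user typed, or None."""
    row = conn.execute(
        "SELECT roots.headword FROM forms "
        "JOIN roots ON roots.id = forms.root_id "
        "WHERE forms.form = ?",
        (typed,),
    ).fetchone()
    return row[0] if row else None
```

Whatever form gets typed in, `lookup` lands on the same headword, and the two spellings of the clay word resolve to one entry without any cross-reference.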

Next benefit is an almost instantaneous and updatable spellchecker – as long as the data you punch in is clean, all you have to do is export the table and dump it in Hunspell for example. Ok, it involves a little more fiddling than that but compared to the task of extracting all words from a flat text file, it’s a doddle. For example, I was asked if we could do something for Cornish based on the Single Written Form dictionary. The answer was yes, but I don’t have time to extract all the words manually. In addition, our spellchecker is a lot leaner and smarter as a result because we were able to define certain rules, rather than multiply entries. For example, Gaelic has emphatic endings that can be added to any noun: -sa, -se, -san etc. So rather than add them manually to each noun, Kevin could just write a rule that said: if the table says it’s a noun, allow these endings. Simples.
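The “rule instead of extra entries” idea looks roughly like this in Python. To be clear, this is not Hunspell syntax and not Kevin’s actual rule, just a sketch of the logic; the tiny word sets are stand-ins for exported tables, and I am assuming the endings are attached with a hyphen exactly as written above.

```python
# Emphatic endings named in the text; the real list is longer ("etc.").
EMPHATIC_ENDINGS = ("-sa", "-se", "-san")

# Stand-ins for tables exported from the database (assumed, tiny on purpose).
NOUN_FORMS = {"gèadh", "geòidh"}
OTHER_FORMS = {"agus"}  # a conjunction, so the noun rule must not apply


def is_accepted(word):
    """Accept listed words, plus any noun form carrying an emphatic ending."""
    if word in NOUN_FORMS or word in OTHER_FORMS:
        return True
    for ending in EMPHATIC_ENDINGS:
        # Strip the ending and check that what is left is marked as a noun.
        if word.endswith(ending) and word[: -len(ending)] in NOUN_FORMS:
            return True
    return False
```

Two nouns and three endings would otherwise mean six extra entries. With 20,000 headwords the saving is rather more dramatic, and it only works because the table says which words are nouns in the first place.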

Ok, so you get a spellchecker, big deal. It is, actually but anyway, another spin-off was predictive texting for Gaelic (again with help from the indefatigable Kevin), because all we had to do was to take the list and fiddle with the ranking. Simplifying a bit but again, when compared to doing it manually off a flat text file, it’s a lot less work. Another spin-off was a digital Scrabble for Gaelic and several other word games like hangman. Oh, the University of Arizona asked for a copy to help them tag some Gaelic texts. And we’re not finished by a long shot.

Did I mention the maps? Perhaps too long a story for here but using our database we have been able to build dialect maps on steroids, like this one here indicating the word in question is southern:

And I’m sure there are other uses that we haven’t even thought of yet but whatever the development, we’re fairly future proof in the sense that with a bit of manipulation, we can make our dictionary data dance, sing, foxtrot and rumba, not just perform Za Zen.

Which brings me back to my original point. People in the world of small languages could benefit from doing their homework and rather than rushing into something, go a bit more slowly and build something that is resilient for the future – even if “Let’s do a dictionary and publish it next year” sounds waaaay sexier. A database is something most developers can build and while it takes a bit more time, you don’t require a rocket scientist to add the language data – but in order to get it built, you have to ask for it in the first place.

Wishful thinking à la Bretonne

03/02/2013 4 comments

Have you noticed that sometimes developers DO get it right but then are faced with strange user behaviours? No, I’m not talking about developers thinking that something should be the case, which isn’t. I’m talking about a strange chain of events on Facebook which makes me doubt the motivation of some language activists (yes, we’re allowed to self-criticize guys!).

We all know about Facebook. What we don’t all know about Facebook is that they have a pretty bizarre approach to translations (we can hardly call it localization…) and I don’t mean the fact they, for the most part, rely on community volunteers. No, it’s the process. There’s no clear process of adding or registering a new project and heaven knows how they actually pick the languages. At one point, Rumantsch was in (it now isn’t, no idea how it got in or why it’s now out, it’s a fairly small language with between 35,000 and 60,000 speakers), as are Northern Sami, Irish, Mongol and the usual big boys, including some questionable choices like Leet Speak and Pirate. So most languages are out. Not surprisingly, this has led to a number of Facebook groups and campaigns by people trying to get their languages into the project. There used to be a project page full of posts along the lines of “please add my language” and “how do we get Facebook to add our language?” – universally met with thundering silence. Admins were rarer than Lord Howe Island stick insects.

Back in whenever, a chap called Neskie Manuel had a crafty idea about getting his language, Secwepemctsín, onto Facebook. Why not, he figured, find a way of overlaying Facebook with a “translation skin” in order to make the process of translation (and in this case even localization) independent of Facebook & Co? It was a neat idea, which was somewhat interrupted by his sad and untimely death.

Now, round about the same time, two things happened. The Bretons set up a “Facebook in Breton” campaign. Fair enough. And a chap called Kevin Scannell took on board Neskie’s Facebook idea. Excellent. Before too long, the Facebook group had over 12,000 members and Kevin had released his script for a slew of amazing languages. It overlays not all of Facebook but just the most visible strings (the ones we see daily, not the boring EULAs and junk). Even more amazingly, it can handle stuff Facebook hasn’t even woken up to yet, such as plurals, case marking and so on. Wow indeed.

The languages hailed from the four corners of the planet, from Aragonese, Manx and Nawat through Hiligaynon, Secwepemctsín, Samoan, K’iche’ and Māori to Kunwinjku and Gundjeihmi (two Australian languages). Wow indeed. And, of course, Breton.

Now here’s the bizarre thing though. Ok, it’s not the full thing but who’d turn down a sandwich while waiting for a roast chicken that might never appear? No one, you’d think, so based on a combined market share of some 50% between Firefox and Chrome, some 200,000 speakers and 12,000 people in the “Facebook in Breton” group, you’d expect what, anything north of 6,000 enthusiastic users of the Breton script. After all, more than 1,100 people installed it in Scottish Gaelic (less than 60,000 speakers) and more than 500 people in Manx (way less than 2,000 fluent speakers).

A case of “you’d think” indeed. To date, a mind-boggling 450 people have installed it in Breton. As far as I can tell, the translation is good and was done by a single, highly fluent speaker (Fulup Jakez who works for Ofis ar Brezhoneg). So it’s not a quality issue. The scripts work (I use the Gaelic one) so it’s not that either. The Facebook group was notified several times, so it’s not like they didn’t know. Ok, so maybe not all Likes of the group actually are from speakers, fair enough, but glancing through the active posters, a lot of them seem to be in the right “linguistic area”.

So while the groupies are still foaming at the mouth about the lack of support from Zuckerberg and Co, there’s a perfectly good interim that would allow you to say Kenavo to French and Degemer mat to Breton on Facebook every day. I really don’t get it. Is it really the case that some activists are more in love with the idea of the thing than they would be with actually using it if it was around? Or am I missing something really obvious? I sure hope I am…

On a more positive note, I hope the general idea of this type of “overlay” will eventually take off big time. We will never be able to convince the big boys to support all the languages on the planet, all of which are equally worthy of services in their own languages, whether they’re trying to re-grow lost speakers or whether they’re just a small to medium sized community. So having a tool that puts control over what we see on our screens into our hands would be great. No more running from company to company trying to make the case for adding language X, a little less duplication (I don’t know how many zillion times I’ve translated “Edit picture”), better quality and more focus on the important bits of an interface to translate (not the EULA for example… a document that sadly every software company is keen to have translated as soon as possible without ever asking who’ll read it). Ach well, I can hope…

Dear grumpy Native Speaker

31/05/2012 4 comments

Localization is obviously just a means to an end – the end being the end-user. You know, normal people. So since they’re also part of this process and so that you know I dish out fairly in both directions, not just developers, here’s an instalment which looks at the native-speaking end-user. Because I had a fairly nasty gripe in my inbox. No names but I think we all recognize the type.

First off, I have the utmost respect for native speakers of small languages who have managed to keep their language alive in the face of adversity. Secondly, I do not for one moment believe that any amount of learning can fully replace native speaker intuition though I will uphold the argument that in terms of formal grammar and spelling, learners often have a better take on things. Simply due to the differences in process – one learnt at the knee (no flashcards involved), the other using an intimidating array of books (often with too little “knee” involved). Thus both groups have strengths and weaknesses which can and ought to complement each other. It certainly should not be a dogfight.

A peculiar paradox arises out of this situation though which many of you will recognize. When it comes to breaking into new territory for language X, it’s usually learners who do that. I’m sure you could write entire PhDs on the topic but on the whole, I think it’s fair to say that learners simply don’t put up with the argument that “language X has never been used for technology Y before”. They’ve always used, say, a browser and therefore they want it in their chosen language X. Again the two groups behave differently. On the whole, the native speaker assumes it doesn’t exist and that it can’t be done. The learner will go and look and if there isn’t one, will do something about it. As in, they sign up to a project like Mozilla Firefox and put in hours and hours of their own time to translate it.

Here’s the paradox. In the translation industry you’re usually only hired to translate into your native language because only native speakers are attuned to the nuances of their language. You usually also have to demonstrate competence in grammar and spelling. But in the world of small languages, such people are rare. Very rare. Literacy is usually lower amongst native speakers than learners because the mainstream education system doesn’t cater for the language. But very rarely do you find a learner who can’t read and write the language. So we get a situation where the people with the best linguistic skills are the least likely people to be found on a project like Firefox or LibreOffice.

Before you get visions of linguistic horror – the outcome is usually not that bad. Once in a while you come across real junk but on the whole, translations of software into small languages usually range from ok to good. Some are very good. While learners can go a bit neologism-happy now and then, what native speakers tend to forget is that when any language breaks into a new domain, it will sound a bit weird. Think about a really technical manual in your native language – does that roll off your tongue, does it ensure immediate comprehension by a non-specialist? But we’ll leave that debate for another day.

And before we get too carried away blaming the education system, there obviously are native speakers of small languages with high levels of literacy, especially in Europe. But for some reason, they often don’t get involved. I have my views on why that is but I don’t want this to become a rant. Let’s just say that they don’t, for the most part.

Now, my time is as limited as that of a native speaker. I enjoy the sunshine and going for walks too. My point is, before you send a rather nasty message off to someone the next time complaining that “no native speaker would have ever translated X like that”, albeit in rather lovely, native-sounding, well-spelled and grammar-checked language, ask yourself this question: Have you volunteered your time to the project in question to ensure the outcome is as good as can be? Cause if you haven’t, then I really don’t want to hear from you.

Shooting yourself in the foot, Goidelic style

28/02/2012 7 comments

Well, it would seem messing up is not the sole domain of monolingual English-speaking developers. Goidelic developers (that’s Irish, Scottish Gaelic or Manx) are just as bad it would seem.

I wasn’t going to write about the rather painful episode that was MyGaelic.com. In fairness (as far as I know) it actually didn’t start out as a plan for a Gaelic social networking site but a promotional campaign to encourage younger people to learn Gaelic. This soon acquired plans for a website, then a social networking element and before you knew it, it was only a social networking site. Seems to me like a classic case of scope creep and PM failure. Unfortunately no-one appears to have asked the question, while the scope was creeping, what social networking is and what makes it tick. Things like “critical mass” for example. Or the question of why I’d shift from Facebook to MyGaelic, thus restricting myself only to my (much smaller) circle of Gaelic-speaking friends. The point about Facebook surely is that (almost) everyone IS on it…

Anyway. I had hoped that we’d drawn the curtains over social networking sites in Gaelic/Irish (which, incidentally does not mean I don’t want Facebook to add Gaelic as an interface language, on the contrary, or I wouldn’t have participated in the addon which translates the Facebook menus into Gaelic). Apparently not. Someone posted on Fòram na Gàidhlig about this new Irish site called AbairLeat, in essence an Irish-language social networking site, and asking what it was like. So I have a bash, with a modicum of trepidation.

Ok, the bright side first. It looks visually attractive, if a little confusing at first but then maybe I’m just a Facebook victim! Sign up, do my profile… oops. First problem. To keep it in Irish, they’ve set up a tool that measures the % of Irish content you’re typing. Anything above 70% and you’re ok to post. For some reason, this tool took exception to the inflected forms “chuid” and “hAlban” … Pass as to why. Even the phrase “Is é do bheatha” gets a score of 75%. Now the idiom may be more Gaelic than Irish but the words are all Irish. It does come up with suggestions – theoretically. Except the right-click to get to them interferes neatly with the spellchecker menu in Firefox. Then there’s the window for posting – it looks massive but the font you type in is about what, point 20? Which means you run out of space fast and it doesn’t wrap. Or shift over. And the % are still weird. Add to that various other navigation bugs. So I grind my teeth and log in via Internet Explorer. No difference really except that I don’t have a browser spellchecker interfering cause IE doesn’t have one for Irish. And please, I’m not doing some kind of deep-down bizarre user testing. I’m just having a snoop around.
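I have no idea how their tool works under the bonnet, but one plausible guess is a plain “share of known words” check against a wordlist. The Python sketch below is exactly that, a guess: the function names and the toy wordlist are mine, and the wordlist deliberately contains only the base form “beatha” to show how a list of headwords alone would trip over an inflected form.

```python
import re

THRESHOLD = 0.70  # "anything above 70% and you're ok to post"

# Toy wordlist (assumed): headwords only, so "bheatha" is missing.
TOY_WORDLIST = {"is", "é", "do", "beatha"}


def known_share(text, wordlist):
    """Fraction of the words in `text` that appear in `wordlist`."""
    # Runs of letters only: no digits, underscores or punctuation.
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    if not tokens:
        return 0.0
    return sum(token in wordlist for token in tokens) / len(tokens)


def may_post(text, wordlist):
    """True if the share of recognised words is above the threshold."""
    return known_share(text, wordlist) > THRESHOLD
```

With this toy list, `known_share("Is é do bheatha", TOY_WORDLIST)` comes out at 0.75 because three of the four words are recognised. If something like this is what is going on, a wordlist that knows its inflected forms would make the percentages a lot less weird.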

Eventually I manage to (double) post about this problem and get a very friendly admin (+++). Guess what – they know it’s buggy and apparently, I should use Chrome. :roll: Great. There are two browsers available IN Irish. Firefox and IE. And they go and test in… Chrome. Nice one guys, full points.

Three lessons:

  1. Do some user testing with real users, whatever language you’re aiming at
  2. Switch browsers once in a while and don’t assume people will switch browsers just because of your site
  3. Don’t release a really buggy version in a small language. Speakers of small languages are hard-to-convince customers at the best of times and once you’ve alienated them from your site, they’re unlikely to return.

I wish them all the best – of course I want it to become a bustling hub of Irish. But talk about shooting yourself and your language in the foot.
