Sometimes it’s not the developer’s fault. Shocking, I know. Sometimes, it’s the linguistic community (using the term loosely) that is at fault for not asking for the right thing.
I was looking up something in my Mojave dictionary the other week (don’t ask), followed by a Google search which pointed me at a Gaelic dictionary for administrative terminology. While the two are probably some 15 years apart, they have one thing in common. They’re both “flat” documents (there’s probably a niftier term but I can’t think of it). The Mojave dictionary is a ring-bound, printed dictionary, the Gaelic one some form of PDF. Now, don’t get me wrong, I have a love affair with dictionaries – I collect dictionaries like my mother collects shoes and handbags. So the more the merrier.
But what also hit me is how much of a debt of gratitude the Scottish Gaelic world owes a chap called Kevin Scannell. While I was doing some research in Dublin, we met at the Club Cónradh na Gaeilge for a pint and a chat about Irish IT and he explained to me, in short and simple phrases, what a lexical database is and why anyone should give a monkey’s about having one. That chat came at a pivotal moment because, having just finished the digitization of Dwelly’s classical Gaelic dictionary, I had come to realise some of the inherent shortcomings of paper dictionaries in digital form and was on the cusp of embarking on the development of what is now the Faclair Beag along with Will Robertson.
Now there are many forms a lexical database can take but essentially the difference between a massive wordlist and a lexical database is that in the database you don’t just list words but you also mark them for what they are. For example, rather than having “file, grunt, apple, horse, with, and” in a comma-separated list, such a database would mark “file”, “grunt”, “apple” and “horse” as singular nouns, a second instance of “file” and “grunt” as regular verbs, “and” as a conjunction and “with” as a preposition. You can get a lot more fancy than that but at a very basic level, that’s the difference.
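To make that a bit more concrete, here is a minimal Python sketch of the same six words, once as a flat wordlist and once as a tiny lexical database. The `Entry` class, the part-of-speech labels and the helper functions are my own illustrative names, not how any particular dictionary project stores its data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    word: str
    pos: str  # part-of-speech label (label names are illustrative)


# A flat wordlist: just strings, with nothing recorded about them.
WORDLIST = ["file", "grunt", "apple", "horse", "with", "and"]

# The same words as a tiny lexical database: each entry is marked for
# what it is, and "file" and "grunt" appear twice (noun and verb).
LEXICON = [
    Entry("file", "noun"),
    Entry("grunt", "noun"),
    Entry("apple", "noun"),
    Entry("horse", "noun"),
    Entry("file", "verb"),
    Entry("grunt", "verb"),
    Entry("and", "conjunction"),
    Entry("with", "preposition"),
]


def words_with_pos(pos):
    """Return every word marked with the given part of speech."""
    return [entry.word for entry in LEXICON if entry.pos == pos]


def pos_of(word):
    """Return all part-of-speech labels recorded for a word, sorted."""
    return sorted({entry.pos for entry in LEXICON if entry.word == word})
```

With the flat list, all you can ask is whether a word is in it. With the marked-up version you can ask for all the verbs (`words_with_pos("verb")`) or for everything the database knows about “file” (`pos_of("file")`).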
So what, you might say, you still end up with a dictionary at the other end and doing a database involves a whole lot of extra work. Tru dat, but the other term I learned on that trip was future proofing. What that means is that if you write a beautiful dictionary as a flat text document, you get just that, a beautiful dictionary. Useful, and for many centuries the only kind available. But that was down to technical limitations, not necessarily choice. Anyway, such a dictionary is future proof only to a certain extent. If it’s digital (as opposed to typed, which unbelievably still happens…) you can edit bits for a new edition. You can put it online and, with a bit of messing about, even turn it into a searchable dictionary. But that’s just about it, anything beyond that involves an insane amount of extra work. For example, your dictionary may list 20,000 headwords but there will be a lot of word forms which aren’t headwords: plurals, past tenses, words which only appear in examples but not as headwords and so on.
But I can look up the plural of goose, you say. Yes, that’s true, but say you wanted to do something beyond that. For example, you might be a linguist interested in word frequencies, wanting to find out how common certain words are. Do a word search in your text? Possible, but then you end up with one number for goose and another for geese. And in some languages the list of forms a word can take is huge: in Gaelic, the goose can show up as gèadh, ghèadh, geòidh, gheòidh, gèadhaibh and ghèadhaibh.
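Here is a rough Python sketch of that frequency problem. The six goose forms are the ones listed above; treating gèadh as the headword, the token list and the function name are all assumptions made up for the illustration.

```python
from collections import Counter

# The six forms of "goose" listed above, all mapped to one headword.
# Treating "gèadh" as the headword is an assumption for this sketch.
GOOSE_FORMS = ["gèadh", "ghèadh", "geòidh", "gheòidh", "gèadhaibh", "ghèadhaibh"]
FORM_TO_HEADWORD = {form: "gèadh" for form in GOOSE_FORMS}


def headword_counts(tokens):
    """Count tokens per headword; unknown tokens are counted as themselves."""
    counts = Counter()
    for token in tokens:
        counts[FORM_TO_HEADWORD.get(token, token)] += 1
    return counts


# A made-up list of tokens standing in for a real text.
tokens = ["gèadh", "geòidh", "ghèadh", "cat", "gheòidh", "geòidh"]

raw = Counter(tokens)              # plain word search: one number per form
grouped = headword_counts(tokens)  # with the form table: one number per word
```

The plain count scatters the goose over four separate numbers, while the grouped count gives you a single figure for the word as a whole. A flat text file can’t do the second one without somebody building that form table by hand first.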
But it’s not just nice for linguists. The applications of even a basic lexical database are impressive. Let me continue with the practical example to illustrate this. If you search for bean in the Faclair Beag, you end up seeing this entry at the top:
We decided to keep it fairly simple and devised different tables for the different types of words we get in Gaelic – feminine and masculine nouns, verbs, prepositions and so on. And for each, we made a table which covers the different possible forms of each word. For a Gaelic noun, that means lenition, datives, genitives, vocatives, singular and plural, plus a junk field for anything that might be exceptional.
Yes, it’s a bit of extra work but one immediate benefit is that because each form is tied to the ID of the root, it doesn’t matter if a user sticks in a form like mhnàthadh – the dictionary will still know what to look for. That’s a decided bonus for people who are inexperienced or looking for a rare inflected form they’re unsure of. It also cuts down the number of “See x” entries, because two words which are simply variations of the same root (like crèadh and criadh in Gaelic, which are both pronounced the same way and mean the same thing) can be tied to the same ID. So usability is an immediate benefit.
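For the curious, the idea of tying every form to the ID of its root can be sketched in a few lines of Python with an in-memory SQLite database. This is a deliberately simplified toy, not the actual Faclair Beag schema: it uses one generic `forms` table instead of a table per word type, and the table names, column names, IDs and the `look_up` function are all invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical, simplified schema: one row per root, one row per form,
# and every form pointing back at the ID of its root.
conn.executescript("""
    CREATE TABLE roots (id INTEGER PRIMARY KEY, headword TEXT NOT NULL);
    CREATE TABLE forms (
        form    TEXT NOT NULL,
        root_id INTEGER NOT NULL REFERENCES roots(id)
    );
""")

conn.executemany(
    "INSERT INTO roots (id, headword) VALUES (?, ?)",
    [(1, "gèadh"), (2, "crèadh")],
)
conn.executemany(
    "INSERT INTO forms (form, root_id) VALUES (?, ?)",
    [
        ("gèadh", 1), ("ghèadh", 1), ("geòidh", 1),
        ("gheòidh", 1), ("gèadhaibh", 1), ("ghèadhaibh", 1),
        # Spelling variants share a root, so no "See x" entry is needed.
        ("crèadh", 2), ("criadh", 2),
    ],
)


def look_up(form):
    """Return the headword for any stored form, or None if it is unknown."""
    row = conn.execute(
        "SELECT r.headword FROM forms f "
        "JOIN roots r ON r.id = f.root_id WHERE f.form = ?",
        (form,),
    ).fetchone()
    return row[0] if row else None
```

Whatever form the user types, `look_up` follows the ID back to the root, so an inflected form and a variant spelling both land on the right entry.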
Next benefit is an almost instantaneous and updatable spellchecker – as long as the data you punch in is clean, all you have to do is export the table and dump it in Hunspell for example. Ok, it involves a little more fiddling than that but compared to the task of extracting all words from a flat text file, it’s a doddle. For example, I was asked if we could do something for Cornish based on the Single Written Form dictionary. The answer was yes, but I don’t have time to extract all the words manually. In addition, our spellchecker is a lot leaner and smarter as a result because we were able to define certain rules, rather than multiply entries. For example, Gaelic has emphatic endings that can be added to any noun: -sa, -se, -san etc. So rather than add them manually to each noun, Kevin could just write a rule that said: if the table says it’s a noun, allow these endings. Simples.
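The “rule instead of more entries” idea looks roughly like this in Python. To be clear, this is not Kevin’s rule and not Hunspell syntax (Hunspell expresses this sort of thing through affix rules in its own file format); it is a toy check using only the three endings named above, with a made-up three-word lexicon, and it ignores details such as hyphenation or which ending suits which context.

```python
# The three emphatic endings named above (the real list is longer).
EMPHATIC_ENDINGS = ("sa", "se", "san")

# A made-up three-word lexicon with part-of-speech labels.
LEXICON = {
    "gèadh": "noun",
    "crèadh": "noun",
    "agus": "conjunction",
}


def is_accepted(word):
    """Accept listed words, plus any listed noun with an emphatic ending."""
    if word in LEXICON:
        return True
    for ending in EMPHATIC_ENDINGS:
        if word.endswith(ending):
            stem = word[: -len(ending)]
            # The rule: only words marked as nouns may take these endings.
            if LEXICON.get(stem) == "noun":
                return True
    return False
```

Three lexicon entries plus one rule cover the bare words and every noun-plus-ending combination, and the conjunction is correctly refused an emphatic ending. That only works because the table says which words are nouns.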
Ok, so you get a spellchecker, big deal. It is, actually, but anyway, another spin-off was predictive texting for Gaelic (again with help from the indefatigable Kevin), because all we had to do was to take the list and fiddle with the ranking. I’m simplifying a bit but again, when compared to doing it manually off a flat text file, it’s a lot less work. Another spin-off was a digital Scrabble for Gaelic and several other word games like hangman. Oh, and the University of Arizona asked for a copy to help them tag some Gaelic texts. And we’re not finished by a long shot.
Did I mention the maps? Perhaps too long a story for here but using our database we have been able to build dialect maps on steroids, like this one here indicating the word in question is southern:
And I’m sure there are other uses that we haven’t even thought of yet but whatever the development, we’re fairly future proof in the sense that with a bit of manipulation, we can make our dictionary data dance, sing, foxtrot and rumba, not just perform Za Zen.
Which brings me back to my original point. People in the world of small languages could benefit from doing their homework and, rather than rushing into something, going a bit more slowly and building something that is resilient for the future – even if “Let’s do a dictionary and publish it next year” sounds waaaay sexier. A database is something most developers can build and while it takes a bit more time, you don’t require a rocket scientist to add the language data – but in order to get it built, you have to ask for it in the first place.
Localization is obviously just a means to an end – the end being the end-user. You know, normal people. So since they’re also part of this process and so that you know I dish out fairly in both directions, not just developers, here’s an instalment which looks at the native-speaking end-user. Because I had a fairly nasty gripe in my inbox. No names but I think we all recognize the type.
First off, I have the utmost respect for native speakers of small languages who have managed to keep their language alive in the face of adversity. Secondly, I do not for one moment believe that any amount of learning can fully replace native speaker intuition, though I will uphold the argument that in terms of formal grammar and spelling, learners often have a better take on things. That is simply due to the differences in process – one learnt at the knee (no flashcards involved), the other using an intimidating array of books (often with too little “knee” involved). Thus both groups have strengths and weaknesses which can and ought to complement each other. It certainly should not be a dogfight.
A peculiar paradox arises out of this situation though, which many of you will recognize. When it comes to breaking into new territory for language X, it’s usually learners who do that. I’m sure you could write entire PhDs on the topic but on the whole, I think it’s fair to say that learners simply don’t put up with the argument that “language X has never been used for technology Y before”. They’ve always used, say, a browser and therefore they want it in their chosen language X. Again the two groups behave differently. On the whole, the native speaker assumes it doesn’t exist and that it can’t be done. The learner will go and look and if there isn’t one, will do something about it. As in, they sign up to a project like Mozilla Firefox and put in hours and hours of their own time to translate it.
Here’s the paradox. In the translation industry you’re usually only hired to translate into your native language because only native speakers are attuned to the nuances of their language. You usually also have to demonstrate competence in grammar and spelling. But in the world of small languages, such people are rare. Very rare. Literacy is usually lower amongst native speakers than learners because the mainstream education system doesn’t cater for the language. But very rarely do you find a learner who can’t read and write the language. So we get a situation where the people with the best linguistic skills are the least likely people to be found on a project like Firefox or LibreOffice.
Before you get visions of linguistic horror – the outcome is usually not that bad. Once in a while you come across real junk but on the whole, translations of software into small languages usually range from ok to good. Some are very good. While learners can go a bit neologism-happy now and then, what native speakers tend to forget is that when any language breaks into a new domain, it will sound a bit weird. Think about a really technical manual in your native language – does that roll off your tongue, does it ensure immediate comprehension by a non-specialist? But we’ll leave that debate for another day.
And before we get too carried away blaming the education system, there obviously are native speakers of small languages with high levels of literacy, especially in Europe. But for some reason, they often don’t get involved. I have my views on why that is but I don’t want this to become a rant. Let’s just say that they don’t, for the most part.
Now, my time is as limited as that of a native speaker. I enjoy the sunshine and going for walks too. My point is, before you send a rather nasty message off to someone the next time complaining that “no native speaker would have ever translated X like that”, albeit in rather lovely, native-sounding, well-spelled and grammar-checked language, ask yourself this question: Have you volunteered your time to the project in question to ensure the outcome is as good as can be? ’Cause if you haven’t, then I really don’t want to hear from you.