
A bit of lexicographic navelgazing

Sometimes it’s not the developer’s fault. Shocking, I know. Sometimes, it’s the linguistic community (using the term loosely) that is at fault for not asking for the right thing.

Just in case you didn’t believe me…

I was looking up something in my Mojave dictionary the other week (don’t ask), followed by a Google search which pointed me at a Gaelic dictionary for administrative terminology. While the two are probably some 15 years apart, they have one thing in common. They’re both “flat” documents (there’s probably a niftier term but I can’t think of it). The Mojave dictionary is a ring-bound, printed dictionary, the Gaelic one some form of PDF. Now, don’t get me wrong, I have a love affair with dictionaries – I collect dictionaries like my mother collects shoes and handbags. So the more the merrier.

But what also hit me is how much of a debt of gratitude the Scottish Gaelic world owes a chap called Kevin Scannell. While I was doing some research in Dublin, we met at the Club Cónradh na Gaeilge for a pint and a chat about Irish IT and he explained to me, in short and simple phrases, what a lexical database is and why anyone should give a monkey’s about having one. That chat came at a pivotal moment because, having just finished the digitization of Dwelly’s classical Gaelic dictionary, I had come to realise some of the inherent shortcomings of paper dictionaries in digital form, and I was on the cusp of embarking on the development of what is now the Faclair Beag along with Will Robertson.

Now there are many forms a lexical database can take but essentially the difference between a massive wordlist and a lexical database is that in the database you don’t just list words but you also mark them for what they are. For example, rather than having “file”, “grunt”, “apple”, “horse”, “with” and “and” in a comma-separated list, such a database would mark “file”, “grunt”, “apple” and “horse” as singular nouns, a second instance of “file” and “grunt” as regular verbs, “and” as a conjunction and “with” as a preposition. You can get a lot more fancy than that but at a very basic level, that’s the difference.
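To make that difference concrete, here is a minimal sketch in Python using the six English words from the paragraph above. The `Entry` class, the part-of-speech labels and the two helper functions are made up for illustration and are not how the Faclair Beag is actually stored.

```python
from dataclasses import dataclass

# A flat wordlist: just strings, with nothing recorded about what they are.
wordlist = ["file", "grunt", "apple", "horse", "with", "and"]


@dataclass(frozen=True)
class Entry:
    form: str  # the word as written
    pos: str   # part of speech (hypothetical label set)


# The same words as a tiny lexical database: "file" and "grunt" appear
# twice, once as a noun and once as a verb.
lexicon = [
    Entry("file", "noun"),
    Entry("grunt", "noun"),
    Entry("apple", "noun"),
    Entry("horse", "noun"),
    Entry("file", "verb"),
    Entry("grunt", "verb"),
    Entry("and", "conjunction"),
    Entry("with", "preposition"),
]


def forms_with_pos(pos):
    """Return every form marked with the given part of speech."""
    return [entry.form for entry in lexicon if entry.pos == pos]


def pos_of(form):
    """Return the sorted parts of speech recorded for a form."""
    return sorted({entry.pos for entry in lexicon if entry.form == form})
```

The flat list can only tell you whether a string is in it. The marked-up version can also answer questions such as “which of these are verbs?” or “what can file be?”.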

So what, you might say, you still end up with a dictionary at the other end and doing a database involves a whole lot of extra work. Tru dat, but the other word I learned on that trip was future-proofing. What that means is that if you write a beautiful dictionary as a flat text document, you get just that, a beautiful dictionary. Useful, and for many centuries the only kind available. But that was down to technical limitations, not necessarily choice. Anyway, such a dictionary is future-proof only to a certain extent. If it’s digital (as opposed to typed, which unbelievably still happens…) you can edit bits for a new edition. You can put it online and, with a bit of messing about, even turn it into a searchable dictionary. But that’s just about it; anything beyond that involves an insane amount of extra work. For example, your dictionary may list 20,000 headwords but there will be a lot of word forms which aren’t headwords: plurals, past tenses, words which only appear in examples and so on.

“But I can look up the plural of goose.” Yes, that’s true, but say you wanted to do something beyond that. For example, you might be a linguist interested in word frequencies, wanting to find out how common certain words are. Do a word search in your text? Possible, but then you end up with one number for goose and another for geese. And in some languages the list of forms a word can take is huge: in Gaelic the goose can show up as gèadh, ghèadh, geòidh, gheòidh, gèadhaibh and ghèadhaibh.
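As a rough illustration of the frequency problem, the sketch below counts words by headword rather than by surface form. The mapping only contains the six goose forms listed above. The function name and the sample string (a bare run of tokens, not a real Gaelic sentence) are invented for the example.

```python
import re
import unicodedata
from collections import Counter

# Surface form -> headword, using only the forms quoted in the text.
GOOSE_FORMS = ["gèadh", "ghèadh", "geòidh", "gheòidh", "gèadhaibh", "ghèadhaibh"]
FORM_TO_HEADWORD = {unicodedata.normalize("NFC", form): "gèadh" for form in GOOSE_FORMS}


def headword_frequencies(text):
    """Count tokens, folding known inflected forms into their headword."""
    # Normalise first so accented letters are single code points that \w matches.
    text = unicodedata.normalize("NFC", text.lower())
    tokens = re.findall(r"\w+", text)
    return Counter(FORM_TO_HEADWORD.get(token, token) for token in tokens)


counts = headword_frequencies("gèadh geòidh ghèadh taigh")
print(counts["gèadh"])  # all three goose forms are counted together
```

With a flat text file you would have to build `FORM_TO_HEADWORD` by hand for every word. With a lexical database it is an export.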

But it’s not just nice for linguists. The applications of even a basic lexical database are impressive. Let me continue with the practical example to illustrate this. If you search for bean in the Faclair Beag, you end up seeing this entry at the top:

[Screenshot afb1: the Faclair Beag entry for bean]

But what the casual dictionary user does not realise is that behind the scenes, things look a little different:

[Screenshot afb2: the same entry behind the scenes]

We decided to keep it fairly simple and devised different tables for the different types of words we get in Gaelic – feminine and masculine nouns, verbs, prepositions and so on. And for each, we made a table which covers the different possible forms of each word. For a Gaelic noun, that means lenition, datives, genitives, vocatives, singular and plural, plus a junk field for anything that might be exceptional.

Yes, it’s a bit of extra work but one immediate benefit is that because each form is tied to the ID of the root, it doesn’t matter if a user sticks in a form like mhnàthadh – the dictionary will still know what to look for. That’s a decided bonus for people who are inexperienced or looking for a rare inflected form they’re unsure of. It also cuts down the number of “See x” entries, because two words that are simply variations of the same root (like crèadh and criadh in Gaelic, which are both pronounced the same way and mean the same thing) can be tied to the same ID. So usability is an immediate benefit.
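The idea of tying every form to the ID of its root can be sketched with Python’s built-in sqlite3 module. The table and column names below are assumptions made for the example, and the real tables have separate labelled fields (genitive, dative and so on) rather than one undifferentiated list of forms. Which of crèadh and criadh counts as the headword is also an arbitrary choice here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE roots (id INTEGER PRIMARY KEY, headword TEXT NOT NULL);
CREATE TABLE forms (
    form    TEXT NOT NULL,
    root_id INTEGER NOT NULL REFERENCES roots(id)
);
CREATE INDEX forms_by_form ON forms(form);
""")

# Two roots; every surface form points back at the ID of its root.
conn.executemany("INSERT INTO roots VALUES (?, ?)", [(1, "gèadh"), (2, "crèadh")])
conn.executemany(
    "INSERT INTO forms VALUES (?, ?)",
    [
        ("gèadh", 1), ("ghèadh", 1), ("geòidh", 1),
        ("gheòidh", 1), ("gèadhaibh", 1), ("ghèadhaibh", 1),
        ("crèadh", 2), ("criadh", 2),  # spelling variants share one root
    ],
)


def look_up(form):
    """Return the headword(s) whose table contains the given form."""
    rows = conn.execute(
        "SELECT DISTINCT r.headword FROM forms AS f "
        "JOIN roots AS r ON r.id = f.root_id "
        "WHERE f.form = ? ORDER BY r.headword",
        (form,),
    ).fetchall()
    return [headword for (headword,) in rows]
```

A search for gheòidh or criadh then lands on the right entry without the user having to know the headword, and without a separate “See x” entry.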

The next benefit is an almost instantaneous and updatable spellchecker – as long as the data you punch in is clean, all you have to do is export the table and dump it in Hunspell, for example. Ok, it involves a little more fiddling than that but compared to the task of extracting all words from a flat text file, it’s a doddle. For example, I was asked if we could do something for Cornish based on the Single Written Form dictionary. The answer was yes, but I don’t have time to extract all the words manually. In addition, our spellchecker is a lot leaner and smarter as a result because we were able to define certain rules, rather than multiply entries. For example, Gaelic has emphatic endings that can be added to any noun: -sa, -se, -san etc. So rather than add them manually to each noun, Kevin could just write a rule that said: if the table says it’s a noun, allow these endings. Simples.
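For the curious, this is roughly what such a rule can look like. The sketch writes out the text of a Hunspell affix (.aff) and dictionary (.dic) file from a list of (word, part of speech) rows. It is not Kevin’s actual script: the flag letter `E`, the function names and the sample rows are assumptions, and a real export needs more settings (for one, whether hyphenated endings are checked as part of the word depends on how the application splits text into words).

```python
EMPHATIC_ENDINGS = ["-sa", "-se", "-san"]
NOUN_FLAG = "E"  # arbitrary flag letter chosen for this example


def build_aff(endings, flag=NOUN_FLAG):
    """Return .aff text with one suffix rule per emphatic ending."""
    # Header line: SFX <flag> <cross-product Y/N> <number of rules>
    lines = ["SET UTF-8", f"SFX {flag} Y {len(endings)}"]
    # Rule line: strip nothing (0), append the ending, any stem (.)
    lines += [f"SFX {flag} 0 {ending} ." for ending in endings]
    return "\n".join(lines) + "\n"


def build_dic(rows, flag=NOUN_FLAG):
    """Return .dic text; only words marked as nouns get the suffix flag."""
    entries = [f"{word}/{flag}" if pos == "noun" else word for word, pos in rows]
    # The first line of a .dic file is the number of entries.
    return "\n".join([str(len(entries))] + entries) + "\n"


# Hypothetical export from the database: taigh is a noun, agus ("and") is not.
rows = [("taigh", "noun"), ("agus", "conjunction")]
print(build_aff(EMPHATIC_ENDINGS))
print(build_dic(rows))
```

The point is that the noun list stays at one line per noun. The endings live in three rule lines instead of being multiplied out across every entry.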

Ok, so you get a spellchecker, big deal. It is, actually, but anyway, another spin-off was predictive texting for Gaelic (again with help from the indefatigable Kevin), because all we had to do was to take the list and fiddle with the ranking. I’m simplifying a bit but again, when compared to doing it manually off a flat text file, it’s a lot less work. Another spin-off was a digital Scrabble for Gaelic and several other word games like hangman. Oh, and the University of Arizona asked for a copy to help them tag some Gaelic texts. And we’re not finished by a long shot.

Did I mention the maps? Perhaps too long a story for here but using our database we have been able to build dialect maps on steroids, like this one here indicating the word in question is southern:

And I’m sure there are other uses that we haven’t even thought of yet but whatever the development, we’re fairly future-proof in the sense that with a bit of manipulation, we can make our dictionary data dance, sing, foxtrot and rumba, not just perform Za Zen.

Which brings me back to my original point. People in the world of small languages could benefit from doing their homework and, rather than rushing into something, going a bit more slowly and building something that is resilient for the future – even if “Let’s do a dictionary and publish it next year” sounds waaaay sexier. A database is something most developers can build and while it takes a bit more time, you don’t require a rocket scientist to add the language data – but in order to get it built, you have to ask for it in the first place.

17 replies on “A bit of lexicographic navelgazing”

That would be nice but I (personally) don’t have time for another project that size – but we’d certainly look into it if someone approached us. Fancy a challenge? 🙂

The main barrier to an accessible corpus of texts is copyright. Texts that are out of copyright would avoid that problem, but leave you with some problems of ‘archaisms’.

Yes… perhaps something could be done by talking to the main publishers (Acair, CnanL, CLÌ, papers with columns) to agree some sort of a fair-use policy but I guess the request would have to come from a university.
The SCOTS corpus went down another path and simply asked for people to donate stuff.

I really like the Corpus idea. The BBC already allows use of material from An Là on Could that be a starting point? PS Mìcheal’s work in lexicography and in developing online resources in Gaelic deserves a medal. Or a stipend from the Scottish Government, which would probably be more useful to him and us.

This is exactly the right thing to do. Using IT resources the way they should be used.

Most people have no idea what a relational database is, let alone the awesome power to search and manipulate the data.


Tim Aggett, Software Test Analyst

Lexicographic navelgazing: I have a lexicon of Old Norse loanwords across Ir, ScG and Mx – about 600-800 such words (result of my PhD work). Not yet published – obviously digitizing is the way to take this into the future (although, like you, I’m a big fan of old school dictionaries). Do your files contain a field or fields for etymology? Especially taken with the idea of mapping. I’d be interested in talking about ways of doing all this.

Hi mcdrod, Yes we do have such a field though most words don’t have anything in it just now – there’s just not enough time to do fully complete entries – but have a look at this one for “uisge” which is fairly complete – the etymology field is about 2/3 down. We’d be more than happy to see if we can team up to add content. I’ll drop you an email, ok?

Well said. Every time I see a “flat” dictionary, like a text file or a word processing file, I despair a little. They have valuable information but are awkward to work with. They are prevalent in what I call “amateur lexicography” when some well-intentioned individual sets about writing a dictionary as a *document*. That is a misunderstanding. Dictionaries are not documents but *data sets*. By failing to appreciate that, people are wasting valuable effort.

The truth is that most people are data-illiterate. They have no idea how to structure data or what the benefits of doing that might be. I see the effects of that everywhere, not only in amateur lexicography. I dream of the day when basic data skills will be taught in schools, along with literacy and numeracy.

This is a useful synopsis and for me the future-proofing is key. For too long we have been producing stuff for use in the day, but the IT expertise and technology available now has opened up a whole new set of possibilities. There needs to be a strategic approach with no duplication of effort. Most of all there needs to be an information flow about what is happening in an accessible, non-geek form (and that’s not being disrespectful). That way, everyone knows what is happening and can contribute constructively, making faster progress.

I’m surprised you didn’t use the same trick as with -sa etc. to handle lenition, since this is so clearly marked in G. as opposed to say Welsh or Cornish. You do a first pass on the string as entered; this will allow for any words where the lenition has become fixed, so the mutated form is the standard dictionary entry. Then, if you draw a blank, the script deletes an ‘h’ if that’s the second letter, and repeats the search. That would almost halve the size of your data tables, I think.

The same process can be used for very common endings like -(e)an plurals, -(a)ibh (nouns and verbs!) and so on. The worst that can happen is you get the occasional false positive.

Obviously really irregular customers, like your example _bean_, need special treatment, but these are really fairly rare exceptions.
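[For readers following the thread, the two-pass search described in this comment can be sketched as below. The function name and the tiny headword set are made up for the example. It illustrates the commenter’s suggestion, not how the Faclair Beag or the online Cornish dictionary is implemented.]

```python
def find_headword(query, headwords):
    """Two-pass lookup: exact match first, then retry without a second-letter h."""
    # First pass: the string exactly as entered. This catches words whose
    # lenited spelling is itself the dictionary entry, such as bhana.
    if query in headwords:
        return query
    # Second pass: if the second letter is h, delete it and search again.
    if len(query) > 2 and query[1] == "h":
        candidate = query[0] + query[2:]
        if candidate in headwords:
            return candidate
    return None


headwords = {"gèadh", "taigh", "bhana"}
print(find_headword("ghèadh", headwords))  # found on the second pass
print(find_headword("bhana", headwords))   # found on the first pass
```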

It would get messy if I started to account for lenition that way and in any case, a data field for lenited forms is needed to account for words like bhana “van” which only occur with bh-; otherwise I’d have to somehow enter that as *bana and then strip those forms when I export the data to do anything else.
Besides, for nouns, verbs and adjectives we applied all those known and predictable rules to prefill the table, so when I create a new entry for a regular verb, I hit a button and of the 30 or so forms, I generally only have to manually enter the verbal noun. Adding entries is very efficient this way round as I only have to tackle irregular or unpredictable forms. We did play around with doing it the way you suggest but that would just get very, very messy and produce data which would not be re-usable as readily.

No, see what I wrote, ‘bhana’ etc. would go in as entries in their own right and be picked up on the first pass. This is how the online Cornish dictionary works, where I was presented with a pre-existing database that I was reluctant to meddle with any more than necessary, so resorted to manipulating the search string where necessary.

Yes, I see you prefer to try to predict all possible forms when you create an entry. That means a massive database though. Probably not so much a problem now we have huge amounts of storage and processors fast enough to plough through it all in real time. Maybe I’m just old fashioned, but I’m still a little wary of ‘brute force’ approaches.

The SQL table which holds these forms is only 70MB, which as you said, these days isn’t much. Only half that once we’ve normalised it.
I’m not entirely sure what you think is “brute force” in our approach, unless you refer to the -sa etc. endings, in which case you may have misunderstood what I said – they don’t get “added” in the dictionary. They only get added when we run a script over an export of the database to create the Hunspell spellchecker. If someone searches for taigh-sa, they only get taigh-sa results.
Either way, this way round has worked well for us – labelling each form has even allowed us to share data with a corpus tagging project. I know there are many more things we could have done but with limited resources (I’m sure I’m not telling you anything new there) we had to compromise some.
