Saturday, December 31, 2005

A meeting with Brion

I have been to Berlin again. Again for a conference, and again first and foremost to meet one man. Brion Vibber is the Chief Technical Officer of the Wikimedia Foundation. Brion is the person responsible for the technology behind Wikipedia and all the other Wikimedia Foundation projects. Brion thinks that the first development milestone, the Wikidata namespace manager, could go into the next release of MediaWiki. WOW!

We discussed the long-standing need for single login. Because it is a dependency for many other projects related to Ultimate Wiktionary and Wikimedia in general, Brion will, if at all possible, start working on it soon after the release of MediaWiki 1.6. Brion will also look into Surfnet's A-Select in order to interface Wikimedia with other authentication service providers. There is a need for outside authentication to Wikimedia projects, especially Wiktionary, and A-Select has the potential to provide it. However, Brion feels that thinking about federation is only possible after the Wikimedia-internal authentication problems are fully resolved.

Together with Erik, we discussed the need for handling multiple languages inside a MediaWiki installation, which is obviously related to Wikidata and UW and will likely be one of our next development milestones. While Brion sees this as a quite complex problem, he did agree that the situation as it currently is - that multilingual projects like Meta and Commons have no language-awareness whatsoever - is broken. What we agreed to do is to send him specifications to review before any implementation begins. Brion also pointed out some current and potential problems with MySQL: that certain UTF-8 characters cannot be stored with the proper charset encoding, and that it may not be possible to have multiple sorting orders on a field without duplicating it. (In this context, we debated the need for following important standards such as CLDR for locale data.)
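
One plausible reading of the UTF-8 storage problem, and this is my assumption rather than something spelled out in the discussion, is that MySQL's `utf8` character set stores at most three bytes per character, so characters that need four bytes in UTF-8 do not fit. A minimal Python sketch for spotting such characters in a piece of text; the function name and the sample string are made up for illustration:

```python
def chars_needing_four_bytes(text):
    """Return the characters in text whose UTF-8 encoding takes four bytes."""
    return [ch for ch in text if len(ch.encode("utf-8")) == 4]


# The sample mixes ASCII, an accented Latin letter (two bytes),
# a CJK character (three bytes) and a Gothic letter (four bytes).
sample = "word \u00e9 \u4e2d \U00010330"
for ch in chars_needing_four_bytes(sample):
    print(f"U+{ord(ch):04X} needs four bytes in UTF-8")
```

A check like this only tells you which content would be affected; how a given database reacts to such characters depends on its version and configuration.
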

Besides meeting Brion, we made an appointment with two officials of Wikimedia Germany to discuss the potential for cooperation in different areas. So far, the signals are positive.

Guten Rutsch!


Monday, December 26, 2005

May I present to you ...

I have been hoping for this, I have been more or less promising this, and it is with great happiness and gratitude that I announce that on Boxing Day we can showcase the first public Wikidata database. It contains some 70,000 words in 22 languages, with definitions typically in 4 languages.

We have been given permission by the European Environment Information and Observation Network (EIONET) to host the GEneral Multilingual Environmental Thesaurus (GEMET). This showcases our wish to have great, relevant content in many languages that has the kind of structure found in thesauri.

The data is not complete yet and the content will be improved. To put it in perspective, this read-only implementation of a Wikidata database is a proof of concept. It will be expanded slowly but surely, improving not only the technical features but also the information and the user interface.

We really welcome all your constructive comments.


Sunday, December 25, 2005

More pressies

Today I received another present. Angela informed me that a gentleman from Australia became aware that we might be interested in a machine translation engine. This is absolutely great news, because the engine is an "n-gram" based design. This means that it has the potential to function for many languages. Another highly relevant detail is that it has a small footprint.

What is important is that even if this software is not, and does not become, Free software, we will be able to cooperate.. This is the key thing of the Ultimate Wiktionary project.

PS I received an e-mail from Erik that he is importing the GEMET data into a Wikidata database.. It does a cool 1,000 records a minute.

PS2 I am also celebrating Christmas. But I am in two minds.. I want this soo bad. How much do I want to be like a Santa bringing good cheer.. :)


Saturday, December 24, 2005

Under the Christmas tree

On Christmas Eve, presents appear by magic under the Christmas tree. The great thing is the suspense: the suspense of what it will be, or the suspense of how a particular present will be appreciated. Presents under the tree, Christmas wishes presented in many ways; it is a festive moment.

Well, what do we find under our virtual Christmas tree? Today I received an e-mail from Erik, who wants meat on the bones of his present. He does want "versioned tables" to be included, even though the main purpose is that we have something to show :) . It is still very much intended to be online in the coming day or days.

Yesterday, I had a conversation with Jimmy Wales. We discussed many things. I was really happy to learn that he considers it necessary to have "committees" that, on behalf of the board, take care of Wikimedia Foundation projects that do not get the attention they deserve. Wiktionary would be one of these projects.

In the last month many things have happened; I hinted at some and not at others. Some of the highlights are the potential cooperation with several organisations. Two of these I want to highlight: the GEvTerm project and the ProZ organisation.

  • GEvTerm is based on a great idea. When you have an international event somewhere, many people will congregate in one place. They need to communicate, but it cannot be expected that all people share one common language. The idea of GEvTerm is to concentrate on translations that are associated with the particular event and make them available to ease the human interaction.
  • ProZ is a member-based organisation of professional translators. It serves the largest community of translators. ProZ has been active building glossaries, and they have their KudoZ, where colleagues can help out with particularly problematic translations. They were about to create their own dictionary, and we were lucky to get into contact with them through Sabine, who is a ProZ member. We are now talking about how we can create one fabulous resource together.
As these are only two wonderful opportunities, you will appreciate that a consortium of like-minded organisations will realise an even bigger potential to bring a resource with lexicological, terminological and thesaurus information that is there to be used Freely.

I wish everyone the most joyous of

Friday, December 23, 2005

Is a rose by any other name as beautiful?

Ultimate Wiktionary is the name of a project. The initial goal was to improve on the existing Wiktionaries. This would bring cooperation between the people that are interested in specific languages but choose a different language for their user interface. When adopted, it would mean that we can concentrate our effort in one resource.

The name for the project does not determine that the eventual project will be found at "". There are reasons for it, and there are reasons against it. It is cheap to name it like this, as the domain wiktionary is already owned by the Wikimedia Foundation. The second reason is that a fair number of people already know this name, and finally it does link the old with the new. A big argument against is the use of this "ultimate" label. The functionality of the software will grow, but at first there will not be much that deserves this accolade.

Now there is this opportunity: what name to pick, and what arguments to use?
  • Wiktionary2
  • WiktionaryZ
These are the two contenders at the moment that people came up with. Personally I like WiktionaryZ best. When pronounced it is "wiktionaries" and it reflects that we will include all the content of the current Wiktionaries and also it really links into the past. Given the way it is written it also reflects modern times and the approach that we take with Ultimate Wiktionary is an exponent of our time.

Wiktionary2 is another great name. It also gives a great link to the current Wiktionaries and it symbolises well the big technological step that it represents. One problem is that some people suggested that it could be seen as a version number; this would mean that the future might bring us a "". This is not a sensible way of doing things, as you do not want to reflect version numbers in your domain name.

I hope that people will like these suggestions, and there is room for many more.


Tuesday, December 20, 2005

Extending the community

Ultimate Wiktionary is there to be used. My definition for success is "when people find an application for the data that we did not think of". There is however nothing wrong with us coming up with new ways in which we can extend the potential use.

Particularly interesting to me are the changes that extend the community of users. We want the scientists and the translators, but we could also have the puzzlers. For me it would be really cool if UW became a challenge to my mother; she likes her crosswords and her cryptograms. Puzzlers are interested in synonymy and definitions, so by adding this one field in the Expression table, the first step is taken to charm yet another group of people into the Ultimate Wiktionary...


Sunday, December 18, 2005

Changes to the data design

I have posted the new data design and I did a lot of annotations.. Not all by a long shot, and I am sure that Erik wants many more changes. The adage of Open Source is to publish often, so I do.


New tables due to many new ideas and other novelties

Ultimate Wiktionary is making a rush towards its first "outing". This results in all kinds of interesting things. It results in even more interest before we have anything to show for ourselves. It results in reasoned suggestions to prefer some names over others: Attribute instead of Label, because "label" has a meaning that is confusing to a large constituency of UW's community.

Wikidata is not the same as Ultimate Wiktionary and consequently has requirements of its own. It has language requirements of its own. It may need longer texts, and it may require texts in a format that Ultimate Wiktionary frowns upon, like capitalised expressions. As we are investigating the use of TBX for the static part of Ultimate Wiktionary, it made sense to think about TMX as well for this issue. This means that we need some basic stuff to deal with handling translation projects. I have come up with this extension of Ultimate Wiktionary; this data design makes use of tables that are part of UW and may as a result become part of MediaWiki proper.

I realise that when we implement this, we have the core of a translation / localisation workflow. This makes sense when you consider that Wikipedia, one of the biggest websites in the world, exists in 212 different languages. When a MediaWiki message is changed, who is going to do the translation? I doubt that there is one organisation that can do that well on a continuous basis. As I am a firm believer in using standards AND in eating my own dogfood, this is my first take on this issue.


Saturday, December 17, 2005

The relevancy or lack thereof of standards

When is a standard a standard? A standard is a standard when a standards body says it is.

That in a nutshell describes the situation for many standards. As far as I am concerned, I would prefer a definition that includes relevancy: "A standard is a standard when a standards body says so and when it is freely available for adoption". When a standard is not freely available, it will not be adopted by some for monetary reasons. The consequence is that money removes relevancy from a standard when it leads to it not being adopted.

In my mind the worst thing that can happen to a standard is that it is not adopted, or that it is ignored.


Friday, December 16, 2005

Alternate representations

A new problem in need of a solution is "alternate representations". Alternate representations are expressions that do not fit the mold of how you want to have expressions in a lexicological resource. One of the rules has always been that capitalisation is only used for words that are always capitalised, e.g. English (the language) is always capitalised in English. There are resources, resources that we would like to include, that have capitalised variants as synonyms. Another example is "plague, bubonic"; to me that should be "bubonic plague".

Many of these things are the legacy of a paper-based origin. In a digital resource with some magic linking "plague" and "bubonic plague", one would suffice. The problem is in how to make the Ultimate Wiktionary relevant. When we do include "plague, bubonic" in some way, we allow for one-to-one linking from the Unified Medical Language System (UMLS) to Ultimate Wiktionary and vice versa. It would even allow for the inclusion of UMLS data in Ultimate Wiktionary.

My current thinking is about two options. I know that in lexicology there is some annotation to describe in what relation a word exists in a sentence. The other option is to have an AlternateRepresentation table that links an Expression to the preferred Expression.

I do want this annotation anyway; what I do not know is whether this annotation is aware of capitalisation.
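
To make the second option a bit more concrete, here is a minimal Python sketch of what an AlternateRepresentation lookup could do. It is only an illustration under my own assumptions: the class and method names are invented, and a real implementation would be a database table rather than a dictionary in memory.

```python
from dataclasses import dataclass, field


@dataclass
class AlternateRepresentations:
    # Maps an alternate spelling to the preferred Expression (hypothetical structure).
    preferred_by_alternate: dict = field(default_factory=dict)

    def add(self, alternate, preferred):
        """Register an alternate representation for a preferred Expression."""
        self.preferred_by_alternate[alternate] = preferred

    def preferred(self, spelling):
        """Return the preferred Expression, or the spelling itself if none is registered."""
        return self.preferred_by_alternate.get(spelling, spelling)

    def alternates_of(self, preferred):
        """Return all alternates registered for a preferred Expression, sorted."""
        return sorted(a for a, p in self.preferred_by_alternate.items() if p == preferred)


table = AlternateRepresentations()
table.add("plague, bubonic", "bubonic plague")
print(table.preferred("plague, bubonic"))
print(table.alternates_of("bubonic plague"))
```

Because the link works in both directions, an outside resource that uses "plague, bubonic" can still be matched one to one, while readers only ever see the preferred form.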


Thursday, December 15, 2005

Terms and a neutral point of view

Ultimate Wiktionary wants to be open, welcoming to all communities and.. yes, I was human, so I was convinced that Term was a better term. However, it is one of those words with multiple meanings, particularly in the worlds of terminology, lexicology and thesauri. After some discussion we came to the conclusion that this is not the right word to describe what we mean, and also that it is not really neutral. So, a new word was agreed upon: LexicalItem.

There are some more changes that we decided on in Berlin: the Label table contains attributes, and the name Attribute is also less confusing. It is also necessary to include some more intelligence; this means that it must be possible to group the attributes.

There are loads of things that I have learned that I am still internalising. When I have, there will be several other changes.

Oh, the great news is that many of these changes are inspired by the great people that I met in Berlin.. It will make Ultimate Wiktionary more relevant to the science types .. :)


Tuesday, December 13, 2005

terms and what not

In the data design I have a table called Word. Well, strike that.. at the Language Standards for Global Business conference in Berlin there were more people impressing on me the need to change this to Term..

I am only human,


Sunday, December 11, 2005

Even handed approach

There was one great thing said in a presentation on Flemish. When you research whether a word is particular to a dialect, it is just as relevant to know what words are NOT part of that particular dialect. There are therefore different situations:
  • A word is specific to the dialect
  • A word is used in all areas where the language is spoken
  • A word is not used in the dialect but is specific to the areas where the dialect is not spoken
This approach would be an NPOV approach. The research that is needed to find these words is difficult. One scientific approach would be to present people with a text and ask them to correct it.
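
The three situations can be written down as a small decision rule. The Python sketch below is only an illustration: the function name and the placeholder words are mine, and it assumes that the hard part, finding out where each word is actually used, has already been done.

```python
def classify(word, used_in_dialect, used_elsewhere):
    """Classify a word by where it is used (both arguments are sets of words)."""
    in_dialect = word in used_in_dialect
    elsewhere = word in used_elsewhere
    if in_dialect and elsewhere:
        return "used in all areas"
    if in_dialect:
        return "specific to the dialect"
    if elsewhere:
        return "not used in the dialect"
    return "not attested"


# Placeholder data, not real research results.
dialect_words = {"alpha", "beta"}
other_words = {"beta", "gamma"}
for w in ("alpha", "beta", "gamma"):
    print(w, "->", classify(w, dialect_words, other_words))
```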

This approach is not problematic when you consider the Dutch and Belgian situation. Languages / dialects like Andalusian are much more problematic, because people will bring political dimensions to it. The history of the creation of new Wikipedia projects often proves that a language is a dialect with an army.


Saturday, December 10, 2005

TST centrale

The "Instituut voor Nederlandse Lexicologie" has in its TST centrale a resource where people with a need for lexicological content can choose what they need. All this material is copyrighted and it is made available at the lowest possible cost. The material is the result of many scientific projects and it is considered basic material for lexicology based on the Dutch language.

There are other resources that are important to people interested in lexicology. Logos provides in its dictionary a rich tapestry of words with translations. Through its link to wordtheque, you find the words in their context. In the philosophy of Logos, this often provides as clear an idea as a definition would. A link to such a publicly available resource is not available through the resources of the TST centrale.

In the KudoZ open glossaries of ProZ, you find a rich resource of hard-to-translate words. When you start looking for resources that are relevant for the creation of dictionaries, there are many that are not created in a "scientific" manner. Practically, they can be extremely useful. It is a shame that the scientific resources are not Free and that, as a consequence, the "unscientific" resources cannot be used to enrich them.

Anyway, as long as these resources are used side by side there is nothing that stops the research of lexicology. As the Wikipedias are a rich resource of contemporary language, and as its content is categorised as to subject matter, it is good to know that scientists are free to use it for their research. I checked it with Jimmy Wales and he was happy to confirm this.

Tomorrow I will be going to Berlin. We will be talking about interfacing the Ultimate Wiktionary using the TBX standard..


Friday, December 09, 2005

A difference in approach

Yesterday, I was at a conference for Dutch language lexicologists. It was my first such thing and it was a grand experience. Lexicologists have always been abstract people to me; now they have faces. They exist in many shapes and forms, and they do many different things. They work together in many ways and to me they do marvellous things.

The difference in our approach and the scientific approach can be given in one word: scientific. What we try to do with Ultimate Wiktionary is not scientific. Being scientific has never been considered. Our outlook has always been practical. We want to do practical things with our dictionary. Publishing a scientific paper is not practical to us. That is not what our goal is.

Given this difference in approach, there is still very much that we can do for each other. A resource that is useful but not complete may have limited scientific value, but it does have value. Being built by people who do not necessarily share the same methodology, it may be chaotic, but it still has scientific value. Even with all these "issues", a project that makes lexicons relevant to people who typically do not care is probably the most valuable gift we can give to the science of lexicology. If we can make lexicons relevant and exciting, there will be new people who will find their way into this profession..


Monday, December 05, 2005

Sinterklaas or Christmas is coming

Today, the 5th of December, it is "pakjesavond", the night before the birthday of Sinterklaas, when traditionally presents are given, sometimes with a rhyme or a surprise. Christmas is therefore not that far away, and it is a time to think about how to do things in line with the Christmas spirit. It is a time of Christmas spirit and Christmas gifts.

Last year we started to collaborate on Christmas wishes and it was good fun. It is so funny to see a text in alphabetic script and not have a clue as to how it is pronounced.. "Përshumvjet Krishtlindjen dhe Gëzuar Vitin e Ri". Last year we were as ambitious as this year; we would love more people to translate and say: "Merry Christmas and a happy New Year!" in their language..

We hope and expect that the first tangible results of all the effort that has gone into Ultimate Wiktionary will be our Christmas gift.. In the meantime we will also do some more work on our Christmas glossary.. Have a look and see how you can make the glossary yours as well :)


Friday, December 02, 2005

Wikipedia Is The Next Google

There is quite a lot of buzz about a blog entry by Steve Rubel; "Wikipedia Is The Next Google" is, together with the comments, a nice read. For me there are two things to this article: the disruptive nature of organisations, and Wikipedia.

When disruptive technology appears, it changes business as usual. It has done so in our society from the moment when innovation was considered to be good. Innovation was never considered to be universally good, but it led to our current society with a number of people having it good in a way that could not be conceived one hundred years ago. In a way, with the ever increasing speed of communication, new ideas get an audience with an ever increasing speed.

Wikipedia is an encyclopedia, it is internet based and it is growing as quickly as new servers can be brought online. It can only do this because of the huge pent up demand for affordable information that has a neutral point of view. Important is the realisation that Wikipedia is not one but many encyclopedias. Every month there is yet another language that gets its own Wikipedia.

These Wikipedias all have the ambition to equal the star Wikipedias like the German and the English Wikipedia. They will have to grow from a small project where everybody knows everybody to a project where even the heroes of last year are no longer known by all. Slowly but surely these projects create Free information and gain recognition for the viability of the languages they express.

Certainly when there are few resources in a language, the impact that a wikipedia may have is big. Comparatively Wikipedia cannot be as important for languages like English and German as it could be for Swahili. It will take its own good time..

With all this talk about disruptive technology, it is fun for me to predict that Wikidata and Ultimate Wiktionary will be disruptive in their own right. They will be in more ways than one.. I am anxious to see how conservative the Wikimedia crowd will prove to be. If they are as I expect them to be, they will allow both Wikidata and Ultimate Wiktionary to develop their potential.


Thursday, December 01, 2005

Luxury problems

When you need more people to work on programming, and you have them work for money, then at some stage you need to pay them. This is actually a great moment, because it means that you have something to show. We are about to hit the first milestone of our development of the Ultimate Wiktionary. This is where we have functionality to identify tables and their relations within a MediaWiki environment.

The problem is that we have to pay outside of the European Community. So we do get into silly stuff like currency and costs.. We have to find out what the cheapest way is to get money elsewhere.

It is a problem but I prefer this to not having code finished.


Monday, November 28, 2005

Content looking to be seen

If there are good reasons to share lexicological content, the best seems to me that it makes what you have to share more relevant. Some content does not really need more exposure, but the work of Professor Rennison on the Koromfe language is content that does. The language is not well known; it does not even have an article by that name in Wikipedia. Even the Koromba people (indigenous to Burkina Faso) do not have their article yet..

Professor Rennison, who worked some twenty years on the Koromfe language, made his resource more relevant by supplying not only an English but also a French and a German translation of the Koromfe idiom. Consequently his data is more accessible. At this time, Ultimate Wiktionary is little more than a promise. It will be great if it can be a place where we can give the Koromfe language a place that is as important as that of any other language.


Saturday, November 26, 2005


The English language Wiktionary has a new experiment. Connel, one of the Wiktionarians, has downloaded the Gutenberg project, performed a word count and ranked these words. The word "seeing", for instance, currently occupies 621st place.

I really like this example of being creative with this aspect of wiki. When I discussed it with Erik in Berlin, we found that there is indeed little room for this at this moment in time. Maybe we should consider having some free space at designated places where one can freely enter data that is not structured.

more resources: frequency list discussion
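
The word count and ranking described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the script that was actually used; the function name, the tokenising rule and the sample sentence are assumptions of mine.

```python
import re
from collections import Counter


def rank_words(text):
    """Return a dict mapping each word to its frequency rank (1 = most frequent)."""
    # Assumed tokenising rule: lower-case runs of letters and apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    # most_common() orders words from the highest count to the lowest.
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}


sample_text = "The cat saw the dog and the cat"
ranks = rank_words(sample_text)
print(ranks["the"], ranks["cat"])
```

Run over a large body of text instead of one sentence, the same function gives the kind of ranking in which a word ends up in, say, 621st place.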


Thursday, November 24, 2005

Alternative definitions

In my Berlin III blog of November 21, people commented on what I said about Meanings. They objected that there would only be one DefinedMeaning. Thinking about this, I had to conclude that often one lemma can have multiple definitions and still be the same thing.
  • the problem solving ability
  • what the intelligence test measures
These two definitions are both for intelligence and they have the same translations. I had to learn these among others. As I said in my reply, there is a need to decide which definition is the one that ties the DefinedMeaning down. This will still be the same. However, alternative definitions are welcome as long as they try to define the same thing.


The I&I conference and IEEE LOM

Yesterday and today the 15th I&I conference took place. There were some 200 educators who deal with integrating computers in the educational process. For Wikipedia the 14th edition was important, as it was when teachers told Kennisnet that Wikipedia was important for Dutch education. This resulted in cooperation between Kennisnet and the Wikimedia Foundation.

This time there was a large group of wikimedians there to inform about what we do. Kennisnet gave a great presentation on how they are experimenting with wikis in education.

I learned from many people that tagging educational content is starting to become important. The IEEE LOM standard adopted in the Netherlands (EDUSTANDAARD) is being implemented into many of the software applications. One problem is that in order for such a standard to make an impact, much data needs to be tagged. This means that everybody is a winner when educational material is shared among schools. In order to do this, some authorisation and authentication is needed. This is needed so that students will not get exams from the schools that do share.. Kennisnet provides just such a service in their Entree. Having this is likely to be a key enabler for successful sharing of content.

One other key enabler is to just share. People will have to get used to the idea that it is like marriage, both parties think they are giving more than the other .. :)

One thing I would really like is for MediaWiki to include the potential to have IEEE LOM tags. Through our interwiki links we know which articles share the same subject. Consequently these articles can share many of the tags given in one flavour of IEEE LOM. There are several reasons why we should do this:
  • it improves the accessibility of our information
  • it would involve many people in education in our projects
  • it would stimulate all implementations of IEEE LOM
  • there would be another reason for having Ultimate Wiktionary; it could localise the IEEE LOM tags.

Tuesday, November 22, 2005

Home but busy

Well I am home. A great weekend but a tiring trip home. At home I went to bed and I am still a bit groggy. Tomorrow I will be at the I&I conference. Only after that, in two days, I can start work on all the details that I have to document... :(

MediaWiki is a great product; it is great at scaling. With Commons we have our image repository, and with Ultimate Wiktionary we will have our lexicological resource. All this is part of this great idea of making all information available to all people in their own language.

It is great when you can be part of this puzzle of how to get all this together. There is however so much to do; what to do first?


Monday, November 21, 2005

Berlin III

Still in Berlin, at the end of three days of Ultimate Wiktionary we have done a lot of work. There is still a lot of work because what people want, need and deserve is something to show for all the work done. As visibility is important, we are going to have two things happen as soon as possible; first the Wikidata milestone 1 has to be finalized and committed to the CVS release branch, and then we will have an extra step to publish a read only version of the GEMET data. This combined with all the languages that are in the ISO 639-3 provisional version (in English) will give a clear idea of what we want: great lexicological content in all languages.

Technically, some things in the data design will be changed, among them a change of the Meaning table; it will become DefinedMeaning. This is to reflect that the DefinedMeaning defines which MeaningText is the one that truly defines a meaning. The point is that for a meaning you have to decide what language and what word define what it is. The other MeaningTexts in the other languages should be a translation of that specific text.
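
As a rough illustration of this idea, and not the actual table design, a DefinedMeaning can be thought of as a set of MeaningTexts of which exactly one is marked as the defining one. In the Python sketch below the field and method names are my own assumptions, and the Dutch text is only an illustrative translation.

```python
from dataclasses import dataclass, field


@dataclass
class MeaningText:
    language: str
    text: str


@dataclass
class DefinedMeaning:
    defining_language: str  # the language whose MeaningText truly defines the meaning
    texts: dict = field(default_factory=dict)  # language code -> MeaningText

    def add_text(self, language, text):
        self.texts[language] = MeaningText(language, text)

    def defining_text(self):
        """Return the MeaningText that ties this meaning down."""
        return self.texts[self.defining_language]

    def translations(self):
        """Return the MeaningTexts that should be translations of the defining text."""
        return [t for lang, t in self.texts.items() if lang != self.defining_language]


dm = DefinedMeaning(defining_language="en")
dm.add_text("en", "the problem solving ability")
dm.add_text("nl", "het probleemoplossend vermogen")
print(dm.defining_text().text)
print([t.language for t in dm.translations()])
```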

One great thing is that we came up with an improvement regarding inflections. The problem is that it does not make sense for all inflections to show up in the list of synonyms and translations only because they share the same DefinedMeaning. By adding the key to the InflectionWord it belongs to, we can show only the headword for the parts of speech. Yes, it is a database change.

Erik was not happy with my Table table. He called it a hack, and it is a hack. So he does not want it, he does not want it, he does not want it... So, it is to go. Erik is correct when he says that NOT having this hack means much cleaner code, and that it will help with the scalability issue.. A major point. So I will create a few more Relation tables that will be more specific.
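
A small Python sketch of the inflection idea, under assumptions of my own: every word carries an optional reference to the headword it is an inflection of, and only words without such a reference are listed as synonyms. The class, field and function names are invented for the illustration and do not come from the data design.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Word:
    spelling: str
    defined_meaning: int
    headword: Optional[str] = None  # None means this word is itself a headword


def headwords_for(meaning_id, words):
    """Return only the headwords that share the given DefinedMeaning."""
    return [w.spelling for w in words
            if w.defined_meaning == meaning_id and w.headword is None]


words = [
    Word("walk", 1),
    Word("walks", 1, headword="walk"),
    Word("walked", 1, headword="walk"),
    Word("stroll", 1),
]
print(headwords_for(1, words))
```

The inflections are still in the data and can be shown on the page of their headword; they just no longer crowd the list of synonyms and translations.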


Saturday, November 19, 2005

Berlin II

Today was a lot of work in a short time.. We went over the database, and given Wikidata and the Ultimate Wiktionary it has some constraints. UW may be the first implementation of Wikidata, and it will certainly be one of the more complicated ones. This is good in a way; it means that the technology will be fleshed out from the beginning.

Some of the things were thought to be a hack and yes, in a way they are, but hacks that work.. things like my Table table .. :)



Today I am in Berlin to work on Ultimate Wiktionary. I am really happy to be here and I expect it to help a lot in making Ultimate Wiktionary AND Wikidata available. We are now having a break and one fun thing we already experienced is how much Erik and I have a different outlook. I am very much Ultimate Wiktionary oriented while Erik is more into Wikidata.

We have discussed many things. The subject of hosted thesauri is one that will come back again.. We are now into data design.


Friday, November 18, 2005

Adopting changes in an included thesaurus/glossary

When authoritative glossaries or thesauri are included in Ultimate Wiktionary, their owners have to choose how they want to make their content available. When they choose to be separate, and have a restricted group of people work on their content, they will not benefit from the wiki way. As the definitions and structures are separate, they will find that alternative meanings and structures will arise. Meanings that are essentially the same.

It is therefore that a method needs to be found to merge glossary entries, thesaurus entries and wiktionary entries. This may seem like a simple thing and in many ways it is. The problems arise particularly in the thesaurus structures that are associated with a lemma.

When the community around a thesaurus decides to adopt a lemma, the version that is adopted is tagged as being part of the lemma. When the lemma is changed, the later version can be adopted as well. This allows for one way of quality control. Alternatively, the changed lemma is flagged as "pending approval"; this in turn allows for a second method of quality assurance. The wiki way would be to assume good faith and expect a change to be a change for the better.

When agreement can be reached on how a word is defined and how it is translated, the relations may need to be tagged as belonging to a specific thesaurus or glossary. As agreement could be reached on the Meaning, the connections between the different thesauri are what bring the similarities into focus. This in turn may help bring more understanding.


Thursday, November 17, 2005

Why dictionaries under a Free license should cooperate

It is easy to explain why the non-free dictionaries are not “Free”. They are not free because people think that more money can be made with these dictionaries. It is elementary. It is impossible to understand why a dictionary available under a “Free” license should be licensed under any particular license.

From my perspective, there are a few things that are relevant. It must be “Free”, this is something that is shared by all so that people are able to use it. The other thing is there must be attribution.

When we work together on a FREE resource, it will not take too long before we have enough relevant information to matter in a language. From that moment onwards quality and quantity will grow because "enough" is the tipping point where it makes sense to add content to the shared resource. When a new spell checker is generated every week, it is obvious where to correct mistakes and where to add content. A spell checker only needs to say where this can be done (this IS the reason for attribution). It would enable people to get updates, and it would also enable them to find out who contributed to this resource.

It is really important to understand about DATA that a license can only be "viral" with respect to DATA. Using a spell checker generated with a GPL licensed dictionary does not change the license of the software running this data. To be honest, I am happy that it does not, because it makes it obvious. It makes it obvious that the “Free” dictionaries should work together. By working together we will achieve more than what we can achieve separately.


Ideas on quality assurance for Ultimate Wiktionary

Ultimate Wiktionary is intended to be used. To be used not only interactively but also by programs. Certainly when the lexicological data is used in earnest, the need for quality assurance will exist and it can be understood that this makes sense. On the other hand, the way of the wiki is that we allow for the collaboration of everyone who has something to contribute.

This is a complicated thing and the current thinking is as follows:
  • All edits need to be validated two times by different people to be considered "good"
  • When two words are considered to be the same in Meaning and Expression, they can be merged. The translations of these meanings will be merged but they get the status of a newly added word. This is to ensure that translations are considered again for their validity.
  • Bots will be disallowed from making interactive edits. Every bot has to be associated with an interactive user.
  • A thesaurus or glossary can, when it is agreed that it needs this status, be write protected; this means that comments can be made on the "talk page".
  • When a bot is to be used for a specific purpose, i.e. the maintenance of a specific write-protected thesaurus or glossary, it can be given a status of implied quality control. This means that the organisation that maintains this resource in the Ultimate Wiktionary is wholly responsible for its own quality. We are thinking of terminology like the terminology of the Roman Catholic church where the exact nature of the definitions is a matter of doctrine.
  • The user associated with such a bot will be the admin for this glossary or thesaurus. This admin can allow users to make changes to its resource.
  • Like on the other Wikimedia projects, we will need admins to do the necessary maintenance. The admin status for deletions should be given per language.
  • A privileged few will be admin / bureaucrat for the whole of the project. They will have access to all languages and all resources. As you can imagine they will be in a glass cage.
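
The first rule in the list can be sketched in a few lines of Python. The names are hypothetical, and the check that authors cannot validate their own edits is my assumption; the list only says that two different people have to validate.

```python
from dataclasses import dataclass, field
from typing import Set

# "Validated two times by different people", as in the first rule.
REQUIRED_VALIDATIONS = 2


@dataclass
class Edit:
    """An edit together with the users who validated it (hypothetical)."""
    author: str
    text: str
    validators: Set[str] = field(default_factory=set)

    def validate(self, user: str) -> None:
        """Record a validation; a set ignores repeats by the same user."""
        if user == self.author:
            # Assumption: the author does not count as a validator.
            raise ValueError("authors cannot validate their own edits")
        self.validators.add(user)

    @property
    def is_good(self) -> bool:
        """An edit is "good" once enough different people validated it."""
        return len(self.validators) >= REQUIRED_VALIDATIONS
```

Because the validators are kept in a set, the same person validating twice still counts as one validation.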

Tuesday, November 15, 2005

Semantic web

At some stage it had to be. The semantic web, the holy grail of this digital age had to come my way. The sum of all knowledge is to be found by it. It is a great endeavour, many people work really hard to make it a reality and so far it passed me by.

I am happy that it passed me by until now. I am happy because in my ignorance I was able to come up with the Ultimate Wiktionary. Ultimate Wiktionary is in its own way equally ambitious; it wants to have all lexicological information on all words of all languages. Some people say that you cannot know a thing if you do not have a word for it..

The semantic web came my way in a meeting at the University of Rotterdam; they need a lexicological resource for a big thesaurus. They have experimented with products that are closely related to the semantic web.

What I have understood is that certain words are used in a tree of concepts; in order to make this information usable in other languages, these concepts have to be translated. In my opinion that is exactly what the Ultimate Wiktionary is intended to do. When a "concept" is associated with a particular "Meaning", it follows that the translations and synonyms can be used to present these relations in another language.

I understand that there is this idea that a concept and the tag used in the relations are considered by some to be distinct. At this moment I think that is not really practical. It is great that I will learn more about the semantic web.

I am on record that when people try to find a use of the Ultimate Wiktionary that I did not consider, I would think the Ultimate Wiktionary a success. By that standard, even though it is not operational yet, it is doing well.


Monday, November 14, 2005

The Century Dictionary

The Century Dictionary was in its time a wonderful resource; even though it has aged, it is still a wonderful resource. It gives a great impression of how dictionaries were at the end of the 19th and the beginning of the 20th century (1889-1910).

The fact that much effort was undertaken to make it available in this digital day and time is wonderful. It is advertised as the biggest on-line English dictionary on the Internet; with more than 500,000 definitions it may be just that.

The Wikimedia Foundation was asked for advice on how this splendid resource could be modernized and updated. I felt privileged to be asked for an opinion. As I think highly of resources like the Century Dictionary, I would at best convert the digitized content when this improves the usability of the data. As I value the Century Dictionary for what it is, I would definitely keep the data as is.

This does not mean that all this lexicological information cannot be used to build a modern dictionary. This can be done in many ways. An important consideration is that the data of the Century Dictionary is firmly in the public domain. This means that any existing project that works on building a dictionary can and may use this data.

I would not mind including the data of the Century Dictionary in the Ultimate Wiktionary. It would prove a challenge to fit it in what has always been envisioned to be a modern dictionary. Then again, the Ultimate Wiktionary is also to be inclusive. So when the opportunity comes to include whole dictionaries, I am sure we will find a way and make sure that it makes sense for our users as well..

The conversion of the Century Dictionary will be a lot of work. However, there are many professions where the skills for such a project are taught in universities. I could therefore see students working on such a conversion as a term project.


Sunday, November 13, 2005

How to cooperate with a community

Wikipedia is an outrageously successful project of the Wikimedia Foundation. Its most prestigious project is the Wikipedia in English. It has over 818,000 articles and an amazing number of active contributors. The achievements of this community lie not only in this high number of articles but also in the rating given by Alexa (today no. 38), the attention we get in the press and the overall quality of the articles.

The community that makes all this possible is essential to what is done. The community does not exist as one big always agreeing whole. If anything the difference in what people want out of Wikipedia is huge. There is this tension between people who want to concentrate on the "main" Wikipedias and people who want to create new Wikipedias. A tension between having only illustrations that are Free and illustrations that are free to be used. Given the size of our community (a middle sized town) we have people willing to utter any POV.

In what I am doing, I am truly outside the Wikipedia community; I am firmly into dictionaries and I want to take the Wiktionaries to the next level. Ultimate Wiktionary is intended to be inclusive for all the lexicological data and applications we can think of. We came up with spell checkers and computer aided translation tools; it being a translation and a descriptive dictionary is what we started with.

As Ultimate Wiktionary is there to be an inclusive lexicological resource, we invite everyone to join us in making it exactly that. It is easy for people to join; they just do. For organisations it is different. They often have a lot to offer but they also have their requirements. To address these requirements you have to be approachable. Sure the Wikipedia community is approachable but it lacks the ability to come up with a single response as the community is divided. It is unable to come up with a quick response, and when a response is given it often does not answer the question that was asked in the first place.

The Wikimedia Foundation has as its goal to bring all information to all people of the world. Wiktionary has as its goal to bring all lexicological information to all people of the world. It therefore makes sense to have a consortium where organisations can find a focal point where their need for cooperation can be discussed. We are open to cooperation in a non-discriminatory way, and adding information makes us richer and more relevant. When organisations have a need for quality, we should find our way in providing this quality assurance. We should, when it is a legitimate request.

Ultimate Wiktionary will find its legitimacy not in being yet another on-line dictionary but in giving this data an application. When it does not go beyond what every dictionary does, it will be a failure. When organisations like the University of Bamberg have a use for the Ultimate Wiktionary, Ultimate Wiktionary will become a credible and important resource.


Wednesday, October 12, 2005

Learning from Dicologos

Dicologos is the dictionary of Logos. It is a great resource and there are many things to learn from it in order to make Ultimate Wiktionary a success. Dicologos has more than 7 million words in many languages, and the content is growing all the time. Its main problem is that it is not well known and that its focus is on translators.

When translators use a dictionary, they typically use it to find confirmation for the translation of a word. All the rest is not really relevant to them. An ordinary user of a dictionary uses it as much as anything to find a definition of a word. As the bulk of the potential public for Dicologos are NOT translators, the lack of definitions is a big problem when you want to establish public awareness of Dicologos as a Free resource. For Logos it is important that Dicologos is seen as an important resource because it demonstrates that Logos contributes to society by adding to the culture of our society.

When you have worked on this resource like I have, you will appreciate that it is much more responsive than the Wiktionary servers but you miss the cooperation, the sense of community. There are no talk pages like in all the Mediawiki Wikis. There are no mailing lists. There is no sense of community.

When Dicologos and the wiktionaries are to work together, a common ground must be found where the content and the communities can find each other. Technically, the content of the Wiktionaries cannot be converted to the Dicologos database because many types of information cannot find a place in the database design of Dicologos. In the same way it is not possible to convert the Dicologos data to the Wiktionaries because you would have to do it so many times and there is so much overlap in the data.

When Logos decides that they are going to work together in a lexicological resource, they will find that in Ultimate Wiktionary they can include all their content. They will have all the community features that are implicitly available in the Mediawiki software. If they want to take the next step, they can work together in a resource that will be hosted by the Wikimedia Foundation. The specific needs of Logos can be addressed in what will be the Ultimate Wiktionary. These needs will be addressed in a non-discriminatory way.


Saturday, October 08, 2005

A thesaurus of biological terminology

When people write about a subject they often use acronyms. Using these as a shorthand makes sense because the alternative is using LONG phrases time and time again. Depending on the subject, different, often mutually exclusive, acronyms are used. As many of the acronyms are often used together, it helps when you can identify these patterns.

The result is that documents can be identified to be about a particular subject and as a result it makes better use of the time spent; you will read about things that are of interest.

Another application is that when words are used together in a given setting, they can be identified to be about a given subject matter. For translators it means that it helps to choose the correct meaning for a word. When this is done automagically, a correct translation glossary can be loaded.
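
A rough sketch in Python of how acronyms that occur together could point to a subject. The two glossaries and their acronyms are made up for illustration, and counting overlaps is just the simplest scoring I could think of, not an established method.

```python
import re
from typing import Dict, Optional, Set

# Hypothetical subject glossaries; real ones would come from a thesaurus.
GLOSSARIES: Dict[str, Set[str]] = {
    "biology": {"DNA", "RNA", "PCR"},
    "computing": {"CPU", "RAM", "USB"},
}


def guess_subject(text: str) -> Optional[str]:
    """Return the subject whose acronyms overlap most with the text."""
    # Treat every run of two or more capital letters as an acronym.
    acronyms = set(re.findall(r"\b[A-Z]{2,}\b", text))
    scores = {
        subject: len(acronyms & terms)
        for subject, terms in GLOSSARIES.items()
    }
    best = max(scores, key=scores.get)
    # No overlap at all means we cannot say anything about the subject.
    return best if scores[best] > 0 else None
```

Once a subject is guessed, a translation tool could load the matching glossary for the translator.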


Thursday, October 06, 2005

I work at Logos to make the Ultimate Wiktionary happen

Ultimate Wiktionary is this thing that I want to happen badly. I will do everything to make it happen. I will even go to Italy to do so.

I did go to Modena (Italy) to make it happen. Modena is where the Logos Group is based. They have been working for 20 years on an online database. This resource is huge; this resource is important. It has between 7 and 10 million lemmas. It has an active community of translators working on the content and it is my pleasure to help make this resource even better.

One of the things Logos wants to do is to cooperate with the Wiktionary communities and, as Ultimate Wiktionary is also to be implemented in a Mediawiki environment, what makes more sense than to join the resources of two vibrant communities?


Wednesday, September 21, 2005

Exciting times

At Wikimania Jimbo Wales said that lexicological content ought to be free. When this happens all kinds of things start to become possible. Logos is about to make a lot of their content available under the GFDL. Their content has long been freely available, now they are going to license it.

The great thing is that it allows, among other things, the use of this content for teaching languages. Consider: when you create a language exercise you need words to select. When these words are available in an electronic resource, you can make selections including particular vocabulary related to specific subject matter.

For the provider and the user of the Free content / Free education, it is a win-win situation. Both need ample data and with more eyes looking at content and structure, the data can only improve in quantity and quality.

The good news is all of this is happening.


Wednesday, September 07, 2005


On the English Wiktionary they have this wonderful resource called "Entry layout explained". It explains what is needed for the different content that is available on there. I just discovered it because it was mentioned on IRC.

For me it is a treasure trove. Because all the content described needs its place in the Ultimate Wiktionary. I have to think through how to add homophones. I have to think about how to have them as content. At this stage I do not need to concern myself with where it will end up in the actual screens. I have to have it in the database.

Homophones led to the most drastic change in a long time. I divorced Relations from Meanings. Now RelationType is connected to the Table table. This in effect allows me to use the Relation table in combination with Words as well. This gives me right and rite as homophones.
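
A small Python sketch of how I read this change. The field names, and in particular `applies_to` as the link from a relation type to the kind of record it relates, are my assumptions and not the actual ERD.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class RelationType:
    """A kind of relation and the table its records live in (assumed)."""
    name: str
    applies_to: str  # "Word" or "Meaning"


@dataclass(frozen=True)
class Relation:
    """A link between two records of the table named by the relation type."""
    relation_type: RelationType
    source_id: int
    target_id: int


# Sample data: the homophones mentioned above.
words: Dict[int, str] = {1: "right", 2: "rite"}
HOMOPHONE = RelationType("homophone", applies_to="Word")
relations: List[Relation] = [Relation(HOMOPHONE, 1, 2)]


def related_words(word_id: int, type_name: str) -> List[str]:
    """Find words linked to word_id by a word-level relation, either way."""
    found = []
    for rel in relations:
        kind = rel.relation_type
        if kind.name != type_name or kind.applies_to != "Word":
            continue
        if rel.source_id == word_id:
            found.append(words[rel.target_id])
        elif rel.target_id == word_id:
            found.append(words[rel.source_id])
    return found
```

The same Relation record could carry a meaning-level relation simply by using a RelationType whose `applies_to` is "Meaning".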

Tuesday, September 06, 2005


The Wikimedia Foundation had a wonderful offer of getting a database with eponyms. This is great and obviously once we have them, we want to include them in Ultimate Wiktionary as well.

Eponyms are however a funny thing; they are definitely related to words and not to their meaning. Actually this should be quite obvious because in German you have "Röntgenstrahlen" while in English it is called "x-ray". One is an eponym, while the other is not and they do share the same meaning.

Thinking about eponyms and how to include them in the database design made me see that I was really wrong about how I had etymologies in the data design. Like eponyms, etymologies are word related and not meaning related; they too are language specific.

The funny bit is that many people have looked at it and nobody noticed. I think it is like with so many things, the best designs do not survive reality unscathed. :)


Sunday, September 04, 2005

Open Office is LGPL

Open Office is nowadays LGPL. The importance is that this is a major shift for Sun Microsystems because Open Office is seen by many as the major free Office Suite. Making OO LGPL makes it much easier to share code.

Sun has its own Computer Aided Translation (CAT) tool. These open language tools are written in Java. The open language tools are licensed under the CDDL license. It would be great if they could share their code with the OmegaT CAT tool. OmegaT is also written in Java.

My point is that there is this big concentration of effort and power in the commercial CAT tool business, to the extent that there is a genuine monopoly. It does not make sense to have all the Open/Free CAT tools work separately. To stimulate cooperation, we hope to make a success out of the reference tool for a translation glossary. The best thing that could happen is that some serious attention is given to more cooperation.


Wednesday, August 24, 2005

Me, CAT tools and Microsoft

When you are not a translator and when you have no dealings with the translation business, you would not expect that anything that I can do could threaten business interests of Microsoft. I was therefore very surprised when I was informed that my wish for a reference implementation of a translation glossary and my wish to integrate such a functionality in open source CAT tools like OmegaT do just that.

I am actively looking for money to enhance OmegaT. Not because I am likely ever to use it, but because I want quality translations in Ultimate Wiktionary. This reference implementation will do exactly that. In order to make OmegaT more user-friendly, there are several little things that can be done. Sabine defined some quirks that are a barrier for newbies. She also identified that "Trados" compatibility would be really important to introduce many translators to OmegaT.

A friend of mine is working on three annoying things. He is a professional programmer. He will inform us how much time it costs and how much it would cost if 100 translators paid for this. The point that I want to drive home is that you can either pay for a license or you pay for functionality. When you pay for functionality, it will prove to be much cheaper.

I have my reasons why I want OmegaT to be a success: Microsoft happens to be the monopolist in the translation/localisation business. It is a genuine surprise but when you think about it, it should not be.


Sunday, August 21, 2005

Working on the Logos website

You may appreciate that I have been spending time to learn how the Logos interface works. As the data has been worked on only by professional translators I am honoured that I am allowed to be an editor for the two languages I know best. With the content that I have available to me in Wiktionary many translations in many languages are available to me. I have been adding words and translations like "gebarentaal" and the most recent one I worked on was "vakman". Now vakman is a tricky word; it is very imprecise in the translation and it has five different plurals. Now I have been looking at how to deal with that in the Logos website but also to understand how it can be done in the Ultimate Wiktionary.

The differences between the current Wiktionary and Logos are profound. In Wiktionary anyone, even anonymous users, can edit almost everything. In Logos an anonymous user can add words that are checked. When you are a professional, you edit translations in the languages that you know best. The things that I missed were the talk pages; a place where you can discuss an individual word. I missed the IRC channel where I can discuss issues about words or meanings.

I understand the differences; they make sense because they show where you are coming from. Logos provides very much a tool for translators by translators. Wiktionary is very much a tool for people who care about words/lexicology and share this in their mailing list, IRC channel and talk pages.

What kind of a community would result when these two communities were to merge? What kind of content? It would be an interesting experiment.


Tuesday, August 16, 2005

Sign languages

Since Wikimania one of the things I am hopeful of is the inclusion of sign languages and oral languages in the Ultimate Wiktionary. Thinking of sign languages for me is really difficult; I do not sign, I am not deaf and I do not know anyone who is. Wolfgang gave me an impression of what it is to be deaf; to me it is clear that the only thing I want to do is to create this resource that is there for them as well. It has to be their resource. It has to be their dictionary; they have to create the movies and categorise the content so that it can be found.

Consider what is going to happen when we can Free all the recordings of sign languages that exist in many universities and combine them with the content that we hope / expect to have. It will be a lexicological resource that will be awesome. Because of its scope it will get a relevance of its own, and that is why it is so great that an organisation like the Wikimedia Foundation will host it; it is not a party to any of the rivalries between organisations or institutions, it is just there to provide information to all the people of this world in their own language.


Sunday, August 14, 2005


Wikimania was an event. From my perspective it was great. We are going to have this weird and wonderful thing that is going under the project name of "Ultimate Wiktionary" and given that we hope to merge our activities with the Logos project we may start with a cool 7 million and a bit of lemmas. We then still have to work out many details like the migration of the wiktionaries, the creation of a new community that is not logos or wiktionary but logos.wiktionary.

Wikimania was also important because it paved the way for including sign languages into the project. Wolfgang Georgsdorf gave a great presentation and together with Ascander we changed the data design to include sign languages as well. I learned that there are ISO-639 codes for sign languages as well :) .

Since I came back I have worked hard to do many things that can be considered the fallout of the conference; I still have not finished doing all the things that I want to do. The nds thing did not go away and it does cost me my time. There are all kinds of things that I want to have done and I work on them. It is about priorities. Informing about what is going on is a priority, I did some work on the nl.wikimedia server. And now I finally have written here as well.

Sunday, July 31, 2005

Working on my presentations

Wikimania is only a few days off. There will be many exciting speakers. And I hope to be one. Not that I will not speak. I am getting nervous. I did my homework, I did write my presentation and there are so many things I would like to include but can't.

I can't because things have to develop some more. Because Ultimate Wiktionary is still some way off. Because people are still thinking old Wiktionary and I would only confuse most people even more.

Really, when we get UW live and functional it will be great. The filenames that I used have names like Spelling, Word and Meaning. The problem is that people think of them as a spelling, a word or a meaning. Sometimes I think I should have named them Kwik, Kwek and Kwak (you may also know them as Huey, Dewey, and Louie). It would make it more abstract but would it help ??

Wednesday, July 27, 2005

Translating proverbs

When you put proverbs in a dictionary, you add all the usual things. You describe the meaning, you add an etymology and you translate the proverb. However, how do you translate proverbs? Do you translate them literally or do you want to give a proverb with the same meaning?

The proverb "de beste stuurlui staan aan wal" has a similar meaning to the English "backseat drivers". The Dutch version is nautical and the second one obviously not. So it is hardly a literal translation but it is a functional translation. In general terms the meaning can be described and as such it would function; however, I can appreciate that there will be a certain drift when proverbs from many languages are put forward.

The problem for me is to consider how to deal with this in the Ultimate Wiktionary.. Well, at this moment we do not have it yet so it is not a problem .. :)


Saturday, July 23, 2005


In a few days Wikimania, the first world conference of the Wikimedia Foundation, will open in Frankfurt. It will be a great time for the people that are able to come. Some things will be broadcast over the Internet.. The agenda is absolutely smashing. I will be speaking as well and I will speak in "die grosse Saal". It will be the final presentation in a series of four:

*Wikidata will be about the technology that will be behind Ultimate Wiktionary and many other projects.
*Wikisign will be about creating a lexicological resource for the deaf.
*Logos will bring us what their experience is hosting lexicological content.
*Ultimate Wiktionary will be about what we hope to achieve in the next generation of Wiktionary.

As I have been talking so much about what I hope to achieve, I may not bring you anything new. However, there is so much to it..


Friday, July 22, 2005


I am starting to get feedback on my ERD. As was to be expected, what is clear to me is not necessarily clear to others. The great thing is the suggestions that come with the feedback, things like call it "Script" instead of "Characterset" and there is an ISO code for it. Another improvement was to include a field to say that a specific relation (e.g. idiom or proverb) is language dependent.

Some of these things are so basic that you tend to forget to include it by making it explicit. All in all it proves important, publishing and publishing again does work.

Wednesday, July 20, 2005

Sound and sign

I am progressing with the datadesign for Ultimate Wiktionary. The current challenge I am facing is to deal with both oral languages and sign languages.

The easiest for sign languages was the realisation that a movie is the "Pronunciation" of a signed word. This made me change the fieldname from "Soundfile" to "Mediafile". More complicated is the fact that there are some four written sign languages. These I would really want in the Ultimate Wiktionary. The question is, do they have, like Chinese does, their own UTF-8 characters? When they do, I do not have to do anything. It would just work as designed.

I have realised that languages like Arabic and Chinese are formal written languages. There are many people who have a spoken language that is grammatically and syntactically (does this word exist?) different from the formal words. So when I record pronunciations, how do I deal with those? How do I register those languages? How do I indicate that these languages use Chinese / Arabic for their written language..

My working theory for the moment is that there may be transcriptions for those languages. Certainly when they have been noted by someone who has some authority, these can be used to link the essentially oral words with something that has characters. These characters are needed at this moment to make it possible to enter them in the database. Now the question is, how to relate them to the written language ... At this time it is just a matter of having the written word as a translation.. in effect this is correct.


Thursday, July 14, 2005

Working on a table design

I am working on the table design for the Ultimate Wiktionary. I have posted the current version of my ERD and true to the tenets of Open Source I am working on it and will post often, so people can deduce what I am thinking. The funny thing is that since the last time I created a more or less working model for an UW, I have learned so much. The resulting datadesign is significantly different. I have come to the realisation that what I create is very much the result of this process of assimilation of a lot of loose ends in the current Wiktionaries.

One of the things that is funny is that when you design a database you not only have to think of the data itself, you also have to think about how it is to be used. The problem for me is that the database and the development are pretty much divorced. I know databases pretty well but I do not know the restrictions of MySQL in combination with what Wikidata will bring us.

I find it really thrilling that we are at the stage where there is an imminent need for the datadesign for Ultimate Wiktionary..


Friday, July 08, 2005

Talking tables / files

We are arriving at the point where we have to talk file design. What tables do we need, what fields will they have and how will this relate to the functionality of what we have?

The three most important tables will be "Language", "Word" and "Meaning". They are top down related. The most difficult to understand will be "Meaning" because the meaning itself will be in a separate table "Meaning-text". This is because the text of meaning is to be had in every language, and it is the abstraction of the meaning that is in "Meaning".
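
To make the four tables a bit more concrete, here is a sketch using Python's built-in sqlite3 module. All column names, the link from a word to a single meaning and the sample rows are assumptions for illustration; the real design is still being worked out and will run on MySQL, not SQLite.

```python
import sqlite3

# An in-memory database is enough to try the idea out.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE language (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
-- "Meaning" is only the abstraction; it carries no text itself.
CREATE TABLE meaning (id INTEGER PRIMARY KEY);
-- "Meaning-text": the text of a meaning, one row per language.
CREATE TABLE meaning_text (
    meaning_id  INTEGER NOT NULL REFERENCES meaning(id),
    language_id INTEGER NOT NULL REFERENCES language(id),
    definition  TEXT NOT NULL,
    PRIMARY KEY (meaning_id, language_id)
);
CREATE TABLE word (
    id          INTEGER PRIMARY KEY,
    language_id INTEGER NOT NULL REFERENCES language(id),
    spelling    TEXT NOT NULL,
    meaning_id  INTEGER NOT NULL REFERENCES meaning(id)
);
""")

# Illustrative sample rows, not real content.
conn.executemany("INSERT INTO language VALUES (?, ?)",
                 [(1, "English"), (2, "Dutch"), (3, "German")])
conn.execute("INSERT INTO meaning (id) VALUES (1)")
conn.executemany("INSERT INTO meaning_text VALUES (?, ?, ?)",
                 [(1, 1, "a female child"), (1, 2, "een vrouwelijk kind")])
conn.executemany(
    "INSERT INTO word (language_id, spelling, meaning_id) VALUES (?, ?, ?)",
    [(1, "girl", 1), (2, "meisje", 1), (3, "Mädchen", 1)])


def words_sharing_meaning(spelling, language):
    """Return (language, spelling) pairs of other words with the same meaning."""
    return conn.execute("""
        SELECT l2.name, w2.spelling
        FROM word w1
        JOIN language l1 ON l1.id = w1.language_id
        JOIN word w2 ON w2.meaning_id = w1.meaning_id AND w2.id != w1.id
        JOIN language l2 ON l2.id = w2.language_id
        WHERE w1.spelling = ? AND l1.name = ?
        ORDER BY l2.name
    """, (spelling, language)).fetchall()
```

Because the text of a meaning lives in its own table, the same abstract meaning can be described in every language without touching the words that point to it.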

This "Meaning" will relate to synonyms and translations (a synonym is equivalent to a translation in the same language). This will give people an instant problem: many words are not the exact translation of another, so how will we deal with this?

When a word is translated, the word picked in the translation is the one that fits best in the meaning of the original word. This meaning is therefore one that is of importance to this word as well. This meaning can be endemic to the language of the word, which makes it a natural fit, or the meaning can be external to the language of the word. When the meaning is external to the language, this meaning is only relevant when translating the word.

This sounds problematic. The words girl, meisje and Mädchen are good translations. In the Neapolitan language there are words that are specific to girls of a certain age. The meaning of these words is included in the meaning of the word girl. They need to be shown when you are interested in the Neapolitan language. However, when you are not interested, these meanings that are external to words of the English language do not have to be shown.

I have been told that there are some four words that can be included in the word girl. These meanings do relate to each other and as such it makes sense to use thesaurus like structures to describe these relations. As these relations describe the meanings, these relations are relevant when you are interested in the Neapolitan language. They do help a translator choose the best fit and also alternatives when one word is used too often.


Thursday, July 07, 2005

Supporting a "bot"

The Wiktionary projects are very grateful to the programmers of the pywikipedia bot. Andre Engels in particular has been important in supporting this bot for the Wiktionary projects. He has programmed new functionality that has helped us work together more than we would have done without it.

As can be deduced from its name, most people use the pywikipedia bot for Wikipedia projects. Many of the innovations have been programmed with Wikipedia in mind. The latest innovation allows you to be logged in to several projects at the same time. The interwiki bot makes use of this facility, and when it finds that one project needs to be updated, it will do so. This enhances the quality of the bot dramatically.

Supporting a non-programmer like myself is a pain. It is therefore important that tools like tortoise work well. It means that a common baseline can be created. This in turn facilitates the analysis of error conditions. Today we finally got it to work. We had to remove the application and start all over again.. This time it did download the pywikipediabot software from Sourceforge..

Really, Open Source rocks when there are friendly people like Andre ..


Wednesday, July 06, 2005

IATE and Free content

IATE, or Inter-Agency Terminology Exchange, is a project to create a glossary to be used for the European Union. It has live data, and until recently its content was appreciated by many translators: there was a guest profile with a guest password that was used by many. Because the IATE database is "not ready for the general public as it may not cope with the demand that might be put upon it", this public access has been removed.

To gain access, you have to translate for the EU and you have to sign a contract that you use it only for EU use. There is however one bright spot; its copyright. The IATE copyright says clearly that you can have this data and use it as long as you attribute it to the institution that manages this information.

It is therefore a lucky coincidence that we want to make Ultimate Wiktionary relevant. It is equally fortunate that we already plan on cooperating with the EU by publishing its GEMET content. When we have proven that we can host lexicological data, we can ask the EU if we can host this data. It is relevant data, it is important data, so much so that the EU expects that its modern systems will crash under the strain of all the people who want it.

With the Wikimedia servers, we are used to providing as good a service as we can. We do not promise 0.9999 uptime; we do the best that we can. And if this data can be had for the lexicological information that it is, we are quite happy to host it. We are quite happy to cooperate with the EU to make this information available and more relevant than it is at the moment, being a "secret".


Sunday, July 03, 2005

The need for a reference implementation

To make Ultimate Wiktionary relevant, we need data, we need a big community, we need relevance. Relevance can be had in several ways. One way in which we may get both more people and more content is by making the content of UW available as a translation glossary to be used in translation tools. There are several translation tools: Sun Microsystems opened up its CAT tool, OmegaT is another, and there are more.

The functionality that all these tools will derive from UW is the same. So having an implementation that provides the bare bones of what is needed makes sense. It helps to make a bigger group of people aware of the wish for this cooperation. It hopefully leads to cooperation between the different communities behind these tools, in order to improve the quality of all the tools.

To communicate about the tool, we have started an experiment with Google groups. Here you find a discussion list. Everyone can read this, but only members may write to this list.. this helps against spam :) It is not a Sourceforge environment yet; this is something that the people who will develop this reference implementation should decide on.

It is always exciting to see how these things develop. I hope for the best.


Thursday, June 30, 2005

The capitalisation of the English language Wiktionary

With some 76,000 articles, the English language Wiktionary is the biggest Wiktionary. Until yesterday all the article names were capitalised. Some months ago there was a vote to change it so that article names would be as the word is spelled. This decision was not implemented and a lot of words were spilled on the issue; now, many months later, it was changed out of the blue.

The English Wiktionary now has a problem, and it has an opportunity. The problem is that many entries are wrong. The problem is that the interproject links to Wiktionary from Wikipedia are wrong. The opportunity is that there are many other things wrong as well, so this is an unsought chance to revisit the content and improve it.

Many people will feel frustrated because of all the hoo-ha. Many people will feel angry because the timing was not great; among other things, it temporarily stopped the migration of Wikipedia to release 1.5. But as the opportunity is there, it is also the time to step up to the plate and do the best that can be done.

I am speaking to Andre Engels and I hope that he will come up with a bot that will find the capitalised words and move them to their correct capitalisation. This bot should also be able to list the words that can be found in both upper- and lowercase. After this the bot can be run again.. There is also a need for a bot that checks the en.wikipedia content for links to Wiktionary, checks if the article is there and, if not, fixes the link to lowercase..
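The listing part of such a bot is simple enough to sketch. What follows is not pywikipedia code and not what Andre will write; it is just a plain Python function, with a name I made up, that takes a list of article names and reports the ones that exist in both cases. Those are the ones that need a human eye before anything is moved.

```python
def case_collisions(titles):
    """Return (lowercase, capitalised) pairs of article names that differ
    only in the case of their first letter and that both exist."""
    present = set(titles)
    pairs = []
    for title in sorted(present):
        if title and title[0].islower():
            capitalised = title[0].upper() + title[1:]
            if capitalised in present:
                pairs.append((title, capitalised))
    return pairs


# Example: "polish" and "Polish" are different words and should both stay.
print(case_collisions(["Polish", "polish", "Apple", "pear"]))
```

Everything that is not in this list could in principle be moved without a name clash, although whether a given word really should be lowercase remains a question for a person, not for the bot.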

Yes, there was a need to prepare this change, but it is also understandable that, given that the decision was reached so long ago, it could go wrong as it did. So now we have to do without preparation and just do the work ..


Monday, June 27, 2005

To what length do you need to go to convince people

I am probably the official spokesperson for the Ultimate Wiktionary. It is this default thing; I came up with the idea, I carried it forward, I found the needed funding and I do a lot of the evangelising. When I started with UW I had to learn a lot and I did. I do not expect that I have finished learning but anyway ..

There is this thing with convincing people: to what length should you go? To what length do you want to go to make people buy into an idea? My time is valuable in that I can spend it only once, and if I spend too much time arguing I do not speak with people who have ideas on how things can be done. Arguing does not help the project when the result is not positive.

There are people who insist that I do everything by IRC or e-mail, while I prefer to use Skype as it gives me better feedback. There are people who are only interested in a tiny specific part of Wiktionary.. only English seems to be relevant to some. There are people who think they quote me and say things I would never say.

Basically, to me there are three groups: people who understand what I am saying, people who want to understand what I am saying, and people who for whatever reason do not want to hear what I say or cannot understand what I say. With the first two groups I can talk. We do not have to agree, but there is this basis of understanding. With the last group, who can be quite vocal, I find that they waste my time.

The problem is, there is always the off-chance that it is me who does not hear what they are saying. It may be a dilemma for which there is no good solution. When I can address the problem of why they do not hear or understand what I say, they may become part of the people that are relevant... So, how much time to spend on this and how much to spend on new things?

Spending time on new things is a hazard in itself. It moves me even further away from the people who find it hard to hear / understand what I am on about..

I think I will add the word "frustratie" to the nl.wiktionary..


Sunday, June 26, 2005

Great news on licenses

Today, Jimbo Wales told me that I can quote him: "the license will not prevent Open / Free software projects to use our data". This is indeed good news. It means that we can host data and cooperate with one and all. It means that we can host data for organisations that are less well equipped to do this.

When we can pull off these kinds of things, we will add extra relevance to the Ultimate Wiktionary.


Saturday, June 25, 2005

Old spelling / new spelling / alternative spelling

When you think about an Ultimate Wiktionary, the idea of including all words of all languages is a given. That is ambitious enough. You do not need anything more, right ?

The spelling of the Dutch language will change in 2006; it will change things that are artificial, like paardenbloem, back to paardebloem, as it has always been pronounced paardebloem.. The result will be that many words will be wrong from 2006 onwards.

In October 2005 a list of words will be published with the old and new spelling. It means that we have to cater for this list in the Ultimate Wiktionary. So the Ultimate Wiktionary has to be more ambitious alas..

This is then the time to start experimenting. I am using the word Imbiß as an example; in modern German it is spelled Imbiss. I have introduced two new templates. One is to be used in front of everything, to signal the old spelling and give the correct one. The other says that the spelling used to be correct.

Having a date for the change will make the information even more valuable. When UW is used within software for optical character recognition, it may be used in a pass after the initial pass that did the scanning. It allows for an appropriate spellcheck that enhances the quality of the OCR process.
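What "having a date for the change" could look like in data is sketched below in Python. This is an assumption about a possible structure, not the templates I made: the class, the field names and the function are invented for the illustration, and the reform date in it is a placeholder, not the real date of the German spelling change.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Spelling:
    text: str
    valid_from: Optional[date] = None   # None: no known start date
    valid_until: Optional[date] = None  # None: still correct today


def is_correct(spelling, on):
    """True when the spelling was the correct one on the given day."""
    started = spelling.valid_from is None or spelling.valid_from <= on
    not_ended = spelling.valid_until is None or on < spelling.valid_until
    return started and not_ended


def correct_spellings(variants, on):
    return [s.text for s in variants if is_correct(s, on)]


# Placeholder date for illustration only, NOT the actual reform date.
REFORM = date(2000, 1, 1)

VARIANTS = [
    Spelling("Imbiß", valid_until=REFORM),
    Spelling("Imbiss", valid_from=REFORM),
]
```

An OCR spellcheck for a book printed before the change would ask for `correct_spellings(VARIANTS, printing_date)` and accept Imbiß, while a spellchecker for a new text would only accept Imbiss.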

One thing to consider as well is that some spellings are local to a certain region or country. Rudolf Heß is called Rudolf Hess in Switzerland.. the "scharfes S" is not used in die Schweiz.. So words that still have their "scharfes S" in German are spelled differently in Switzerland. This is just spelling. Some words or their meanings are not known to all people who speak German, like "Paradeis", which Austrians know to be a "Tomate".

I am more and more appreciating the fact that linguists find it astounding that we attempt to make the Ultimate Wiktionary a reality. What makes us try it is that it was for us a natural growth path from Wiktionary. So we have our problems serially and not in a parallel fashion. The issues are there to be solved and they can be solved. Getting the issues serially helps because it prevents you from being overwhelmed by complexities; for us it is just a matter of refactoring.


Tuesday, June 21, 2005

The value of computer conferences

I have written earlier about the Holland Open software conference; I was happy to be there and gave a presentation as well. A conference like this is really valuable: you make contacts, and when you have a new project like Ultimate Wiktionary these are really valuable. They may alter the way a project is run. One such contact I had was with Mr Bart Knubben of OSOSS. This organisation is about Open Standards and Open Source Software in the Dutch government.

OSOSS is working hard to make the list of properly spelled words maintained by the NTU available to the public. Because of all kinds of contractual restrictions this is not possible at this time. To alleviate this issue, they are working as the focal point of the Dutch Open world to get the list of the NTG, the Nederlandstalige TeX Gebruikersgroep, validated for the spelling. This means that some 222,872 words will be validated.

This list of differently spelled words comes with indications of how a word is to be broken up at the end of a line. When the UW is to host such a list, it will mean some adaptations to the software; we will want to keep track of the correct spelling. As the Dutch spelling will change in August 2006, we will want to retain the old spelling and mark it as such. As the change of the spelling rules is still in the future, we will have to consider how to deal with this.

When we host a resource like this for the NTG, it means that our license has to be compatible with the NTG. Currently they use the GNU Lesser General Public License. They do not care who uses it under what license as long as it stays Free.

Technically there is this issue; we want to host this data for the NTG. It would be really cool to be the resource for the Open/Free content world and host the Open/Free resource for the Dutch language. It would very much be in line with our objectives. We will find a solution for this issue; one thing is sure, the LGPL is not applicable to a wiki. :)


Monday, June 20, 2005

What do you need in a dictionary item

When you have an entry in an electronic dictionary like Wiktionary, what is enough to make it worthwhile to have it? The question is relevant as there are people who are of the opinion that any article that does not have extensive definitions and etymology is substandard.

My opinion is a bit more inclusive. I would like to have extensive definitions and etymology, but for me the sheer fact that a word is properly spelled is enough to have it in an electronic dictionary. The Dutch language has an institution that provides the authorised list of correctly spelled Dutch words. For me a list with these words would be a worthwhile contribution to the Ultimate Wiktionary. Obviously, it would be a bit meagre, but it does serve its purpose.

When the correct way of spelling words changes, like it will do on the 15th of October, an electronic dictionary has a clear advantage over paper-based dictionaries. It is however not clear to me how we should cover the old correct spellings. In a way it is relevant to have a history of correct spellings. It could/should be part of the database..


Thursday, June 16, 2005

Proof of the pudding; cooperation

Some people tell me that I should say that "Ultimate Wiktionary" will improve cooperation. I have been saying that with UW we will get cooperation. At this moment the Italian and the Dutch wiktionaries are cooperating as well as possible. Today there were some changes on the word [[Jiddisch]]; I had to check some things on the Italian wiktionary as a result and found that they have at least 10 more translations.

With UW we will get the cooperation, the synergy that we do not have at this moment. The will to cooperate is there but it just does not happen. So I am unapologetic: only with the UW will we get the synergy that we so desperately want. It is not that we do not want to, it is that it does not happen in a practical manner.


Tuesday, June 14, 2005

How to connect using e-mail

On a good day I get some 100 e-mails; on a bad day there are many more. Many of these e-mails are spam. The e-mail software I use has a built-in spam filter; it must be trained, and it more than halves the work that I need to do. For the other stuff, I have to look at the mail to decide its relevancy. When it is from a bank or monetary institution I do not do business with, it is spam; when it is from China in Chinese, it is spam. As I am Dutch, many of the American names that send me stuff are suspect. Typically this works out fine.

When I want to connect to people who are "official" or high up in an organisation, there is little chance for me to actually reach the right level. There are often many intermediary levels before my message gets to Mr or Mrs Right. These intermediary levels have strategies similar to mine; I do not expect that they are impressed with my or e-mail addresses. It makes me just a member of the public (and I am), not someone who asks something on behalf of the Wikimedia Foundation. So it would be helpful for people who are known to be active on behalf of the WMF to have a e-mail address. It helps to overcome the barriers thrown up by the intermediary levels and get a job done, a message delivered.


Monday, June 13, 2005

Влади́мир Влади́мирович Пу́тин

Sometimes it is a nice surprise when you find that a nice idea gets some following. The pronunciation of the names of famous people is one such thing. When you listen to how an Italian pronounces the name of the Italian prime minister or how an American pronounces the name of his president, you realise that it is different from how it is pronounced in other languages.

Влади́мир Влади́мирович Пу́тин is a surprise for me because it is the first famous person I found on Wikipedia with a sound file that I did not ask for.

The funny thing with pronunciations is that the pronunciation of Mr Bush can be heard on the Dutch Wikipedia. The English Wikipedia objects to the sound file; it has already been removed several times. Some people use Wikipedia to learn languages; it is therefore useful to learn how a local pronounces famous names.

What I would really like is to have soundfiles of famous people. We have already asked the new pope... One can always hope :)


Saturday, June 11, 2005

Working on Farsi text, a browser story

I have been working on Farsi training material on Wikibooks; it is a project to teach the Farsi characters and sounds to Dutch people. I do not speak Farsi, I am not learning to speak Farsi, I am just helping this project to improve.

A lesson has two parts; the spoken Farsi words and the translation in Dutch. When you click on the Farsi words, you may hear the pronunciation in the .ogg format. When you press the Dutch words it takes you to the When a word does not exist I create it; when the word exists I add the Farsi translation.

It is really hard, if not impossible, to use my favourite browser; I have to move to the other side as it is clearly superior when editing a page like FarsiLes5. In the past I did enter bug reports for Mozilla and MediaWiki, and I learned today that they are working on it. I hope they do a good job, because Firefox is almost useless when editing pages where there is a mix of languages.

Thursday, June 02, 2005

Automagic sounds

When you have been away for a few days, you have a lot of reading to do. I had little time to read my e-mail so I had to wade through hundreds of e-mails. There is this big temptation NOT to read many of those e-mails and just delete them.

I decided to wade through wikitech-l and found this interesting concept of "transcluding a sound". I think this means that a sound is played automagically. This needs some clever software that will play the sound in-line. The article says that there is no need for such a feature .. Would it not be cool if, when you find a word, you hear it automagically?? Yes, it could also be something that you can enable/disable from your preferences ..


Saturday, May 28, 2005

Chippewa or Ojibwe

Within the Wikimedia projects we have some rules. One of them is "ignore all rules" but that is a different story. The rule about what languages we support is based on a few things. For me the existence of an ISO-639 code is the more important one. I have been "bold" as I use the ISO-639-3 code; this one is not even ratified yet but it does include many more languages ..

Today I was working on some translations from the English Wiktionary, and the word "fruit" had two translations in languages I had not seen before. One of these was the "Ojibwe" language. There is no mention of this language in ISO-639-3, so I had a problem. The language codes that indicate which language a word belongs to are based on this code.

Google, as so often, turned out to be my friend; the Ojibwe are better known as the Chippewa, and the code for the Chippewa language is ciw. So I was pleased to have a code to go with the Ojibwe language.

In the Ultimate Wiktionary there will be no reliance on the existence of an ISO 639 code. We could have more languages and nobody would realise ..


Wednesday, May 25, 2005


"babel" is a template used on several Wikimedia projects. It intends to inform about the language skills of a person. I am not a particularly good at languages I only care to post about three languages on my user page. The rating is funny because you rate yourself; I do not think much of my German skills, Sabine thinks I should be a de-2 .. She is a "profi" at this language :)

A template like this can be used not only to indicate who has some expertise in some language, but also as a filter on the recent changes in an Ultimate Wiktionary. This could function in the same way as the inclusion/exclusion of bots.
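Such a filter is easy to imagine. Below is a small Python sketch of it, under my own assumptions: the babel codes are given as plain strings like "de-2", a code without a level (like "nl") is taken to mean a native speaker, and a change is a simple record with a "language" field. None of this is existing MediaWiki functionality; the names are made up.

```python
NATIVE = 4  # assumed rank for a bare code such as "nl", above level 3


def parse_babel(code):
    """Split a babel code such as "de-2" into ("de", 2)."""
    language, _, level = code.partition("-")
    return language, (int(level) if level else NATIVE)


def readable_languages(babel_codes, min_level=2):
    """Languages a user rates at min_level or better."""
    return {language
            for language, level in map(parse_babel, babel_codes)
            if level >= min_level}


def filter_changes(changes, babel_codes, min_level=2):
    """Keep only the recent changes in languages the user can judge."""
    languages = readable_languages(babel_codes, min_level)
    return [change for change in changes if change["language"] in languages]
```

With the codes `["nl", "en-3", "de-1", "tr-3"]` a user would see the Dutch, English and Turkish changes but not the German ones, and could lower `min_level` to see more.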

Another use would be to help indicate who shares knowledge of a language, this in turn could help form a community for a language.. I can appreciate that there could be a need for a "village well" or a "kroeg" for each community.

One thing I did find was that it is also a bit of information where you need LOADSA localisation.. There are three templates for each language; Oscar, one of my Dutch Wikipedia friends, is a tr-3, and there is no template for that yet. His being only a tr-3 means that you cannot expect him to write this template in Turkish :)


Friday, May 20, 2005

Licensing and mirrors

Most of the Wikimedia projects are licensed with the GNU-FDL license. It works relatively well for server content. There are many people who have issues with it, but given its rules and regulations the content is free and will forever be free. That is genuinely cool.

As I recently explained, we do want to use the data in Ultimate Wiktionary for non-server purposes as well. I mentioned the .dict data format. Data in this format is also used offline. To create this data you create a subset of the data we hold. Given the license, we should inform about every contributor to each word. This is not practical. It is practical to refer to the UW for the history of every word.

As the Ultimate Wiktionary is a new database, it is best to start with an appropriate license that is free and prevents the data from becoming unfree.

With the UW containing the free data, it does not pay to be too concerned about mirrors that host UW data. Given time and the enthusiasm of our community, the UW content will grow and therefore has the potential to outcompete this type of competition in all important ways.


Monday, May 16, 2005

RFC 2229 and dict

One of the important things for an open content project is cooperation. Currently every Wiktionary has its own community and, when Ultimate Wiktionary becomes a reality, we hope that many wiktionarians will find their home in the Ultimate Wiktionary.

The biggest challenge however will be to grow both the content and the community. As there is still a limited number of languages present, we do need to grow a presence, a community for the missing languages. For many languages, including my own mother tongue, there is no comprehensive coverage yet. We will be searching for content to be added by our community and by incorporating existing glossaries, wordlists and thesauri.

Today I learned about a third way of making free content available: RFC 2229, a protocol to provide people with dictionary information over the Internet. The trick here is that there is a database and that it provides the information where it is available. So from a user point of view it would be great when we cooperate with

There will be two issues that need to be resolved. We use the GNU-FDL license for our content and the GPL for our software and they use the GPL. In order to cooperate we will need to work something out. Licenses are a necessary evil but it would be a travesty if free licenses are found to be mutually exclusive.

RFC 2229 compliance would be for Ultimate Wiktionary mark II .. It is funny that Ultimate will be as much a work in progress as everything else.. Not really a surprise. :)
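To give an idea of how small the protocol is on the wire: RFC 2229 is a line-based text protocol (its registered port is 2628), and asking for a word is a single DEFINE line. The Python sketch below only builds that request line and reads the three-digit status code of a reply; the function names are mine, and there is no network code in it, so it says nothing about how UW would actually serve the data.

```python
def define_command(word, database="*"):
    """Build one DEFINE request line as described in RFC 2229.

    The database name "*" asks the server to search all its databases.
    The word is sent as a quoted string, with backslash escaping.
    """
    escaped = word.replace("\\", "\\\\").replace('"', '\\"')
    return f'DEFINE {database} "{escaped}"\r\n'


def status_code(response_line):
    """Every reply line from a DICT server starts with a 3-digit code,
    for instance 150 (definitions follow) or 552 (no match)."""
    return int(response_line[:3])


print(repr(define_command("fruit")))
```

A client would send this line over a TCP connection and read reply lines until the final status; an Ultimate Wiktionary mark II would sit on the other end of that conversation.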