Tuesday, February 17, 2009

Domesday scenario

What do you call a project with over a million people collaborating on information about their country? A project that came to a successful end. A project with many units sold? A success.

What do you call the same project when, twenty-one years later, only two working systems are left?

The Domesday Project was very much a project of its day, and because a generation of British schoolchildren was involved, a lot of attention has gone into making its data available again.

There are lessons to be learned. Some of them seem obvious in our open content world; they seem obvious because we insist on Open Source and on the use of licenses that are considered "free". There is, however, more to it. There are also the standards underlying the data. The most basic standard we use is text, text expressed in Unicode. This standard is not perfect, because some of the languages supported in the WMF projects have characters not yet supported in Unicode.

In this text we often express information in a structured way. As long as it is rendered as HTML, I can read it; when I look at the raw wiki syntax, I am lost. When people datamine Wikipedia, special software has to be written to parse these infoboxes and tables. The result is DBpedia, and the DBpedia community does a great job.

The point is that it does not have to be this way. Were we to adopt Semantic MediaWiki for Wikipedia, we would adopt open standards that enable us to present our data in a way that other computers understand. This would help us achieve our goal of providing information to people, because our data would be used to provide a better understanding. In this way we would open up our data in a way that was not possible at the time when either of the Domesday books was written. We would open up our data and make it truly free, because we would make it available for innovative applications.
Thanks,
GerardM

4 comments:

Filceolaire said...

Semantic MediaWiki is good, but it has limitations which I believe mean it isn't up to encoding Wikipedia's information.

The main problem is that it relates two nouns with a verb, but most facts are more nuanced than that.

For example: #BERLIN# #IS CAPITAL OF# #GERMANY#

The above example is true only for certain values of GERMANY (Berlin was the capital of the German Empire, under Kaiser Bill, before WWI, and of Nazi Germany, and of East Germany, and, sort of, until lately of the German Federal Republic) and for certain values of IS (i.e. certain years only).

To truly use it in Wikipedia, this fact needs a number of qualifiers - a start date, an end date, a source reference. It also needs each "noun" to have a separate wiki page - no merging the German Federal Republic page with the page for the Weimar Republic.
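A minimal sketch in Python of the difference (the field names and the example reference are invented for illustration, not part of any existing system):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Statement:
    """A 'fact' carrying the qualifiers a bare triple cannot express."""
    subject: str
    predicate: str
    obj: str
    start: Optional[str] = None   # date the fact became true
    end: Optional[str] = None     # date it stopped being true
    source: Optional[str] = None  # reference backing the claim

# A bare triple claims too much:
bare = ("Berlin", "is capital of", "Germany")

# Qualified, the same claim is scoped to one 'value of Germany'
# and one period of time (hypothetical reference string):
qualified = Statement(
    subject="Berlin",
    predicate="is capital of",
    obj="Federal Republic of Germany",
    start="1990-10-03",
    source="Unification Treaty",
)
```

The bare tuple and the qualified statement make very different claims; only the latter can be checked against a source or a date range.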

Taking all this into account I believe an info box, completed by hand, is still the place to put structured data.

One way infobox data could be made more reusable might be to move info boxes to COMMONS so the data in infoboxes can be automagically imported into any language wiki (once the info box tags have been localised).

Could we start by doing something like this with WikiSpecies? i.e. move all species infobox data into Wikispecies and then automagically import it back into the EN species infoboxes?

GerardM said...

Hoi,
There can be many articles called Germany. Each could be about one iteration of the German state, and each of them can link to its capital. In this way you already have several qualifiers. When you create relations, you create them in a context, and this is how you have to appreciate them.

In essence your issues have more to do with the semantic web concept than with Semantic MediaWiki. When you create an infobox by hand, it has no context...

As to Wikispecies: how would you deal with the context of names? Mammilaria sensu ...??
Thanks,
GerardM

Filceolaire said...

I worry that I'm misunderstanding you but here goes anyway:

Yes, separate pages can be created for each different version of 'Germany', but it doesn't seem right that the encyclopedia's layout should be driven by the needs of the data structure.

Creating separate pages for Brasil (capital Rio) and Brasil (capital Brasilia) seems even more twisted. How many times has the area of the USA changed over the years as it expanded across the continent? Do we have a new page each time?

In some cases having separate pages will conflict with WP:N, which says minor characters in a work of fiction should share a list page. Do you really want to try to change that policy?

On the general question of the 'facts' being divorced from their context when they are put in a separate infobox: I thought that was the whole point! I thought the idea of the semantic web was that the facts could be extracted from their context and manipulated and used by computers; all the examples I've seen seem to do that. I'm concerned with trying to make the context parseable too, so it doesn't get divorced. "Pages linking to this infobox" can certainly be added to restore a link to the narrative, though this will not link to the relevant line; and some of the data in the infobox may never be repeated in the narrative at all (latitude and longitude coordinates?).

The semantic web got me all excited, but the more I thought about it the more problems I saw with trying to shoehorn real-world info into nice neat 'tuples'.

Wikispecies names? I think I will leave that as a problem for the student.

Barend said...

I believe that we are touching upon the good old 'perfect is the enemy of the good' syndrome. In the scientific realm we are working on so-called 'rich triples' in RDF-type formats, in which the 'edge' between two concepts can be annotated with many different qualifiers. I would be happy to discuss exceptions to the rule that pretty much every 'factoid' can be represented that way, and that larger graphs answering to description logics can be created for computer analysis far beyond what can be done with infoboxes. That said, I agree that stopping at minimally syntax-defined triples in text is not enough for the long run; these triples need to be connected to a central triple store that is much richer. The absence of that element may have been a major drawback of the fledgling semantic web so far.
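One way such an annotated 'edge' could be written down with the RDF standards of the day is reification, where the statement itself becomes a resource that qualifiers can attach to. A sketch in Turtle (the `ex:` namespace and its property names are invented for illustration):

```turtle
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix ex:  <http://example.org/> .

# The bare triple, promoted to a resource of its own...
_:st a rdf:Statement ;
     rdf:subject   ex:Berlin ;
     rdf:predicate ex:capitalOf ;
     rdf:object    ex:Germany ;
     # ...so that qualifiers can be attached to the edge itself:
     ex:startDate  "1990-10-03" ;
     ex:source     ex:someReference .
```

A data fragment like this is exactly what a central triple store could hold while the wiki text carries only the minimal markup.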
If Wikipedia ever wants to get full scientific recognition as a reliable source, moving to a triple connection is, in my personal view, a must.