Friday, August 28, 2015

#Wikidata - Joseph Reagle not an #author?

The English Wikipedia article says: "Joseph Michael Reagle Jr. is an American academic and author focused on technology and Wikipedia". It seems obvious that the occupation "author" fits Mr Reagle.

Not so, I am told: the word "auteur" is a generic term in French, so it is at best an anglicism. This puts us in a tricky position, because it is suggested that if this appears in infoboxes that automatically import data from Wikidata, it will create an absolute mess in the French Wikipedia, with everybody being credited as an "auteur", which does not make sense at all.

When you analyse "author" in Wikidata, it is a subclass of "creator". Creator seems to me to be what the French understand by "auteur". Consequently, the labels used in French do not match what is meant by author in English.

Arguably, when items are labelled in a way where the meaning in one language is not the same as in other languages, this has major consequences for the integrity of Wikidata.

NB Mr Reagle wrote a few books, that makes him more than an "essayist".

Thursday, August 27, 2015

#Wikidata - Heinz R. Pagels Human Rights of Scientists Award

Awards are often the subject of this blog. Every award has its own merit and every award connects many people as a result. The Heinz R. Pagels Human Rights of Scientists Award is an award hidden in an article on the Committee on Human Rights of Scientists. The story of Mr Pagels is interesting but so are the people who received the award.

Some of them have been prisoners of conscience, all of them have relevance. Most of them deserve more attention, be it in improving their articles, by adding statements in Wikidata, or reading about them. For people to receive an award like this, they have to have been in harm's way. It is important to know how easy it is to get into problems and also why some such problems are worth it.

By exposing awards like this, the people connected in this way get more attention. It is one way of making sure that their effort is valued.

Saturday, August 22, 2015

#Wikidata - recent #changes

Databases change all the time. The expectation is that these changes make things different, better. This is true for all the online resources Wikidata connects to.

There are several good reasons to refer to an external database:
  • to indicate that the external source is about the same subject
  • to acknowledge the external source served as the source for a statement
  • to indicate whether shared values match
As databases change all the time, there is little value in indicating that a database shared the same value at a given date and time. Consider for instance the item for Mr Sundar Pichai: apparently he went twice to the Indian Institute of Technology Kharagpur and to Stanford University. When two sources state that he went there, one source may know what academic degree was achieved at the end of the study where the other does not. When you only verify whether the information in the two sources matches, both sources match. One source may not care about what degree was achieved or when, and the other does. When you quote them as the source for the statement, you expect them to fully endorse the current content. Mr Pichai went to each educational institution once. Having two statements for the same thing completely defeats the objective of Wikidata; the objective of Wikidata being useable.

Having references for statements makes sense when statements are exactly the same. When they are not, arguably there is little point other than to indicate that all values for a source match. This can be done by showing the source in green. It is a lot more reassuring to see all sources in green than a lot of references that give no assurance that the values are indeed the same.
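A minimal sketch of such a green marker, assuming each record is a plain dictionary of property to value. The function name `source_colour`, the extra "red" and "grey" states and the sample records are made up for illustration; they are not part of Wikidata or of any existing tool. A value that the external source simply does not have is not counted as a mismatch.

```python
def source_colour(wikidata_item, external_record):
    """Return "green" when every value both sides have is identical."""
    # Only properties present on both sides can be compared.
    shared = [prop for prop in wikidata_item if prop in external_record]
    if not shared:
        return "grey"  # nothing to compare (assumed extra state)
    for prop in shared:
        if wikidata_item[prop] != external_record[prop]:
            return "red"  # at least one shared value differs
    return "green"


# Hypothetical records, for illustration only.
item = {"educated at": "University X", "academic degree": "Degree A"}
source_a = {"educated at": "University X"}  # does not record the degree
source_b = {"educated at": "University X", "academic degree": "Degree B"}

print(source_colour(item, source_a))  # green
print(source_colour(item, source_b))  # red
```

The source that knows less still shows green, because it does not contradict anything; only a real difference changes the colour.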

Friday, August 14, 2015

#Wikidata - Mr Sundar Pichai

I heard of a dispute among Wikipedians about the facts of Mr Pichai's studies. That was yesterday so I hoped that some of that discussion would transpire at the item for Mr Pichai.

Mr Pichai's item is indeed in need of serious attention. The stated place of birth should be more specific, and his education has the same school entered twice for no obvious reason. He was born in India but Wikidata has him as an "Indian American" for whatever reason.

The information you get when you Google Mr Pichai is much better. If Google and Wikidata were to compare each other's records, the Wikidata item would certainly be flagged as problematic.

As a lot of Wikipedians have invested serious attention in Mr Pichai, comparing the Wikipedia article will expose the weakness of the Wikidata entry. I am not particularly interested in Mr Pichai; I leave it for someone else to sort this out.

Thursday, August 13, 2015

#Wikidata - #Quality, #probability and #set theory

The problem with any source is that it has errors. It cannot be helped. There is always a certain percentage that is wrong. When you take all the items of Wikidata that have statements, the type of process that added those statements provides an indication of the percentage of errors that were included.

I made thousands of mistakes. In a way I am entitled to have made those mistakes because I made over 2 million edits. Amir made even more edits with his bot. Because of the process involved the percentage of his errors will be lower. When you only look at Wikidata and its items, you can be confident that these errors exist, you can be confident about what percentage is likely, but there is no way to make an educated guess about what is right or what is wrong. The only way to improve the data is by sourcing one statement at a time. It is a process that will introduce its own errors. That is something we know from experience elsewhere.

To add value to Wikidata, we need both quality and quantity. Let us consider the use of external sources that are known to have been created with the best of intentions. Consider one type of information, the place of birth for example. It is highly likely that Wikidata and that external source have many items in common. Once they are defined as being about the same person, we can use the logic of set theory. We can establish the number of records where both have a value for the place of birth. We can determine the number of matching items, we can determine the number where one has a value and the other does not, and we can determine the number of items where there is a mismatch.

It is probable that most errors will be found where Wikidata and the source do not match. It is certain that even where the two match there will still be factual errors as both can be wrong.
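The comparison described above can be sketched with ordinary Python sets. This is only an illustration under stated assumptions: each side is a dictionary from a shared item identifier to a value for one property (None when there is no value), and the function name `compare_sources`, the identifiers and the town names are all hypothetical.

```python
def compare_sources(wikidata, external):
    """Split the items both sides know into buckets for one property."""
    # Items that have been defined as being about the same person.
    linked = set(wikidata) & set(external)
    both = {i for i in linked
            if wikidata[i] is not None and external[i] is not None}
    match = {i for i in both if wikidata[i] == external[i]}
    return {
        "both": both,
        "match": match,
        "mismatch": both - match,  # the first place to look for errors
        "only_wikidata": {i for i in linked
                          if wikidata[i] is not None and external[i] is None},
        "only_external": {i for i in linked
                          if wikidata[i] is None and external[i] is not None},
    }


# Hypothetical place-of-birth data, for illustration only.
wikidata = {"p1": "Town A", "p2": "Town B", "p3": None, "p4": "Town D", "p5": "Town E"}
external = {"p1": "Town A", "p2": "Town C", "p3": "Town F", "p4": None, "p6": "Town G"}

buckets = compare_sources(wikidata, external)
for name, items in buckets.items():
    print(name, len(items), sorted(items))
```

In this sample only "p2" lands in the mismatch bucket, so that is the one item someone would be asked to source. The "match" bucket is not proof of correctness, it only says the two do not differ.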

Quality and confidence have much in common. Wikipedia has quality but we know it has issues. Wikidata has quality but we know it has issues. The easiest and most economical way to improve the quality of Wikidata is by comparing sources, many sources and concentrating on the differences. It is easy and obvious and when we ask someone to add a source to a statement we are confident that the result matters. It matters for both Wikidata and the external source.

This approach is not available to Wikipedia. It cannot easily compare with other sources and therefore there is no option but to source everything. Given that many statements find their origin in Wikipedia, new insights in Wikidata may prove a point and a need to adapt articles.

Consequently, applying set theory and probability will enhance the quality of Wikidata. It will help drive fact checking in Wikipedia and it is therefore the best approach to improve quality. Accepting new data from external sources and iterating this process of comparison will ensure that Wikidata will become much more reliable. Reliable because you can expect that the data is there, and reliable because you know that quality has been a priority in what we do.

Tuesday, August 11, 2015

#Wikidata - #Pen #awards

The many chapters of PEN International confer many awards. Mr Mazin Darwish now has his 2014 award, and would it not be fun to have a query that shows all the people who ever were awarded one of the many, many "PEN awards"?

First, all the chapters have to be part of PEN International, then all the awards have to be conferred by a PEN chapter and finally all the people have to be recognised as honored with one of the PEN awards.
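The three conditions above chain together like this. It is a sketch over in-memory data, not an actual Wikidata query: the function name `pen_award_recipients`, the three input structures and every chapter, award and person name are hypothetical placeholders.

```python
def pen_award_recipients(part_of, conferred_by, award_received,
                         umbrella="PEN International"):
    """Return everyone who received an award conferred by a chapter."""
    # Step 1: the chapters are the organisations that are part of the umbrella.
    chapters = {org for org, parent in part_of.items() if parent == umbrella}
    # Step 2: the awards are those conferred by one of these chapters.
    awards = {award for award, org in conferred_by.items() if org in chapters}
    # Step 3: the people are those who received one of these awards.
    return sorted({person for person, award in award_received if award in awards})


# Placeholder data, for illustration only.
part_of = {
    "Chapter A": "PEN International",
    "Chapter B": "PEN International",
    "Other Society": "Some Other Body",
}
conferred_by = {
    "Award 1": "Chapter A",
    "Award 2": "Chapter B",
    "Award 3": "Other Society",
}
award_received = [
    ("Person X", "Award 1"),
    ("Person Y", "Award 3"),
    ("Person Z", "Award 2"),
    ("Person X", "Award 2"),
]

print(pen_award_recipients(part_of, conferred_by, award_received))
# ['Person X', 'Person Z']
```

When any of the three links is missing in the data, the person silently drops out of the result, which is why all three have to be in place first.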

This is something that is of interest, it is rewarding and, why not?

It is much better than following the "instructions" on solving the "garbage" that is the honorary university degrees and doctorates. I am told to find sources for people who have an honorary doctorate or whatever and add sources that provide credence to such a statement. It may be a solution but it is a solution that does not scale.

To be honest, I cannot be bothered. When Wikidata in its infinite wisdom does not have a way to deal with contaminated data, it has a bigger problem: it makes me doubt all existing statements. All that is needed to cope with such issues is a way to flag data as "suspicious".

With known "no good" data, you invite people to participate in providing a solution. The proposed solution however is not my cup of tea; it is not what I do. I cannot be bothered.

Monday, August 10, 2015

#Wikidata - #Free Mazin Darwish

It is satisfying to learn that Mr Darwish has been freed. He had been jailed since February 2012, and the BBC has it that he was freed. It mentions that he is the director of the Syrian Centre for Media and Freedom of Expression (SCM) and received many international awards.

Wikidata already knew about a few of those awards, finding more awards was a matter of reading the three Wikipedia articles. It is just a matter of doing the research. One of the awards Mr Darwish received was the PEN Pinter Prize in 2014. However, the Wikipedia article calls it the "Pinter International Writer of Courage Award". This award is not listed on the "List of PEN awards".

There is a reason to celebrate. Mr Darwish is free. It is satisfying to see that a lot of information is already there. Working on the data that exists on Mr Darwish connects him with more people sharing similar connections.

Every day there is someone who is worthy of attention. I can do this, you can do this. It is how Wikidata gains relevance. Relevance because it is information available for use in any language including Arabic.

#Wikidata - #corroboration and #sourcing

The problem with sources available on statements in Wikidata is that even when they are by definition the source of a statement, it is not what we understand a source to be. When I use tools to add statements to Wikidata based on lists and categories from a Wikipedia, that Wikipedia is my source. My tools do not help me add this fact so I do not add Wikipedia as a source. Other tools do and consequently there are some 20 million statements sourced in this way.

When no source is available, a statement can be corroborated by finding identical information in an external source. The difference is important. The external source is no source proving the veracity or the origin of the fact, it merely indicates that it does not differ. Corroboration is important, it does improve the likelihood that a statement is correct. It adds a notion of quality.

Wikidata items often refer to many external sources. Only when a fact new to Wikidata is added as a statement from one of these sources IS the external source the source.
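The distinction between a source and corroboration can be put into a few lines. This is a sketch only: the function name `role_of_external_source`, the three labels and the sample values are assumptions made for illustration, and `existing_values` stands for whatever values Wikidata already holds for one property of one item.

```python
def role_of_external_source(existing_values, external_value):
    """Say what an external value is relative to what Wikidata already has."""
    if not existing_values:
        # The fact is new to Wikidata: the external source is the source.
        return "source"
    if external_value in existing_values:
        # Identical information: it does not differ, nothing more.
        return "corroboration"
    # The values differ: this is where attention is needed.
    return "discrepancy"


# Placeholder values, for illustration only.
print(role_of_external_source(set(), "value A"))        # source
print(role_of_external_source({"value A"}, "value A"))  # corroboration
print(role_of_external_source({"value A"}, "value B"))  # discrepancy
```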

Some external sources provide information with the authority of a respected organisation. When the RKD Netherlands Institute for Art History indicates that Nora van de Vlier received the Willink van Collen Prize in 1954, I would consider it a source and happily accept it as a source for a new statement in Wikidata. When such information is from DBpedia or Freebase, I would appreciate more references at a later date.

When it is not the original source the only thing I care to know is that there is no discrepancy between the data provided and the data available at the external sources. When external data is pushed into Wikidata as a reference, it could easily be considered a fraud. It is certainly clutter.