Words and what not: #Wikisource is not a data repository

Thursday, April 26, 2012

#Wikisource is not a data repository

The cost of access to proprietary data sources has only gone up. Library budgets have not kept pace, and relevant sources that used to be available to students for study and research are often no longer available. This is no longer an issue only for libraries in "other" countries.

Many universities, and even countries like the Netherlands, mandate that scientific publications be made available as Open Access. Typically, publications and data become available under a free license for the whole world to use. One key side effect is that true science is helped, precisely because the data and publications are freely available.

Recently, the library of Harvard University made millions of its library records available under a CC0 license. These records contain bibliographic information about books, videos, audio recordings, images, manuscripts, maps, and more, and are in the MARC 21 format.
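For readers who have not seen MARC 21 on the wire, here is a minimal sketch of how one record in the binary exchange format can be read. It relies only on the published record structure: a 24-character leader, a directory of 12-character entries, then the field data. The names `parse_marc_record` and `build_record` are hypothetical helpers written for this illustration, and the sample identifier and title are made up; none of it comes from the Harvard release. Real data is better handled with an established MARC library.

```python
FIELD_END = b"\x1e"    # field terminator
RECORD_END = b"\x1d"   # record terminator
SUBFIELD = b"\x1f"     # subfield delimiter


def parse_marc_record(raw: bytes) -> dict:
    """Parse one binary MARC 21 record into {tag: [values]}.

    Control fields (tags below 010) become strings; data fields become
    lists of (subfield code, text) pairs. Indicators are skipped here.
    """
    leader = raw[:24].decode("ascii")
    base = int(leader[12:17])            # base address of the field data
    directory = raw[24:base - 1]         # drop the directory's terminator
    fields = {}
    for i in range(0, len(directory), 12):
        entry = directory[i:i + 12].decode("ascii")
        tag = entry[0:3]
        length = int(entry[3:7])         # includes the field terminator
        start = int(entry[7:12])         # offset relative to the base address
        data = raw[base + start:base + start + length - 1]
        if tag < "010":
            value = data.decode("utf-8")
        else:
            chunks = data[2:].split(SUBFIELD)   # data[:2] holds the indicators
            value = [(c[:1].decode("ascii"), c[1:].decode("utf-8"))
                     for c in chunks if c]
        fields.setdefault(tag, []).append(value)
    return fields


def build_record(fields) -> bytes:
    """Assemble a tiny sample record from (tag, data) pairs (illustration only)."""
    directory, body = b"", b""
    for tag, data in fields:
        data += FIELD_END
        directory += f"{tag}{len(data):04d}{len(body):05d}".encode("ascii")
        body += data
    directory += FIELD_END
    base = 24 + len(directory)
    total = base + len(body) + 1
    leader = f"{total:05d}nam a22{base:05d} a 4500".encode("ascii")
    return leader + directory + body + RECORD_END


if __name__ == "__main__":
    sample = build_record([
        ("001", b"sample0001"),  # made-up control number
        ("245", b"10" + SUBFIELD + b"aMoby Dick /" + SUBFIELD + b"cHerman Melville."),
    ])
    print(parse_marc_record(sample))
```

The point of the sketch is only that the format is compact and positional: it is meant to be exchanged between library systems, not read or edited as text.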

Data like this is useful when you can query it, and it remains useful as long as it is maintained. Harvard University maintains this data because it is key to the functioning of its library. Harvard is unlikely to be the only library in need of such data, and once many organisations work together to maintain a universal database about books, videos, audio recordings, images, manuscripts, maps, and more, the data becomes more complete, authoritative and useful.

Having such data in Wikisource, as has been suggested, is not a good idea for several reasons. Wikisource is used for storing text, not data. A rich resource like this needs continuous and reliable maintenance to be useful, and all of this is already available from Harvard. The catalog records are available for bulk download from Harvard, and for programmatic access by software applications via APIs at the Digital Public Library of America (DPLA).

When data like this is useful to the Wikimedia Foundation's projects, the first order of business would be to study those APIs, maybe implement them, and only consider storing such data at the Wikimedia Foundation for editing when there is a benefit. One benefit could be integrating data from other sources, with the subsequent need for de-duplication. Then again, it is unlikely that librarians are waiting for the Wikimedia movement to get into this act and, realistically, the WMF will need an operational Wikidata before it can unleash its communities on what is just one of the many, many important data resources that are available under a free license.
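To give an idea of what de-duplication across sources involves, here is a rough sketch that merges records sharing a normalised title, author and year. The record layout (`title`, `author`, `year`, `source`) and the function names are assumptions made for this example, not anything Harvard, the DPLA or Wikidata prescribes, and real catalogue matching needs far more care (editions, transliteration, authority files).

```python
import re
import unicodedata


def normalise(text: str) -> str:
    """Lower-case, strip accents and punctuation, collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", text)
    plain = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(re.sub(r"[^\w\s]", " ", plain.lower()).split())


def match_key(record: dict) -> tuple:
    """Hypothetical matching key: normalised title and author, plus year."""
    return (normalise(record["title"]),
            normalise(record["author"]),
            record.get("year"))


def deduplicate(records: list) -> list:
    """Merge records with the same key, remembering every contributing source."""
    merged = {}
    for record in records:
        entry = merged.setdefault(match_key(record), {
            "title": record["title"],
            "author": record["author"],
            "year": record.get("year"),
            "sources": [],
        })
        entry["sources"].append(record["source"])
    return list(merged.values())
```

Even this toy version shows why the work is ongoing rather than a one-off import: every new source brings its own spelling and punctuation habits that the matching key has to absorb.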

Thanks,

GerardM
