Why Persistent Identifiers are the Wrong Idea

I think I will rewrite this. It seems to me it's only half the solution... more coming shortly.

This afternoon I was reading Elizabeth Eisenstein's "The Printing Revolution in Early Modern Europe" when I came across this passage:

To consult different books, it was no longer so essential to be a wandering scholar...The era of the glossator and the commentator came to an end, and a new "era of intense cross-referencing between one book and another" began.

The point here is that after the printing press came about, there were more books available. An obvious point. As a result, cross-referencing became a feature, since it was possible to access and, consequently, reference other literature more readily.

This made me think about the current discussions around persistent identifiers for scholarly content. It seems the current solution is to offer a layer of indirection: this enables a stable identifier to persist, and should the 'actual location' of the content be changed, then we can re-configure the redirect to point to the new location.
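To make the indirection concrete, here is a minimal sketch in Python of what such a redirect layer amounts to. Everything in it is an assumption for illustration: the identifier "id:example/1234", the example URLs, and the resolve and move functions are made up and are not any real resolver's API.

```python
# Hypothetical redirect table: stable identifier -> current location.
# The identifier and URLs are invented for illustration only.
redirects = {
    "id:example/1234": "https://old-host.example/papers/1234.pdf",
}


def resolve(identifier):
    """Return the current location for a stable identifier, or None if unknown."""
    return redirects.get(identifier)


def move(identifier, new_location):
    """Re-point an existing identifier after the content has moved."""
    if identifier not in redirects:
        raise KeyError(identifier)
    redirects[identifier] = new_location


print(resolve("id:example/1234"))  # the old location
move("id:example/1234", "https://new-host.example/archive/1234.pdf")
print(resolve("id:example/1234"))  # same identifier, new location
```

The identifier stays the same while the table entry changes. Note that the whole scheme depends on someone actually calling move when the content relocates.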

Martin Fenner and Geoff Bilder point to this solution in their very good postings on this topic:
http://blogs.plos.org/mfenner/2009/02/17/interviewwithgeoffrey_bilder/
http://blog.martinfenner.org/2015/06/03/persistent-idenfiers-urls/

However, this method does not overcome the real problem. What if either:

  • no one updates the redirection after the location has changed
  • the content really goes offline (through loss of domain, for example)

It appears to me that we have the wrong solution. There is really no way to solve this issue with URIs. We can only minimise it.

So, how do we go about resolving this (so to speak)?

One way to get some insight into the issues is to wind back the clock and look at the way content was located in the age of the newly-born printing press. In this age, scholars were liberated because they didn't have to wander the world looking for a particular book. Instead, identical printed copies proliferated, and it was just a matter of finding a copy of the work you were pursuing. Preferably you found a copy in a library or shop nearby. To find that work, one merely needed the cross-reference information to track it down. That "persistent identifier," composed of author's name, book's title, page of reference for the quoted material, publisher's name and location, and publication date, was commonly referred to as a citation (we still use that term). The citation helped the reader or researcher find a copy of the work cited. Not a particular printed book, but a copy of that book.

So, how is it that in an age of digital media we have gone backwards? Where copying something is even easier than in the printed age, why are we still pointing to 'one' authoritative copy? In essence, we are still referencing a book by stating the exact book that sits in a specific institution, on a particular shelf, with the blue (not green) cover.

It feels a little like the great leap backwards to me.

A way to get around this problem would be simply to allow and encourage content to be copied. Let digital media do what it does best - copy and distribute itself. The 'unique identifier' would then not be a URL (with a layer of indirection) but would take the form of a checksum or hash (http://en.wikipedia.org/wiki/Checksum). Finding the right work would then be a matter of searching for a copy of the material with the right checksum.
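As a rough sketch of what I mean by an identifier derived from the content itself, the Python below hashes the bytes of a work with SHA-256 from the standard hashlib module. The choice of SHA-256, the "sha256:" prefix, and the content_id function name are my assumptions for illustration, not an established scheme.

```python
import hashlib


def content_id(data: bytes) -> str:
    """Derive an identifier from the bytes themselves (SHA-256 is an assumed choice)."""
    return "sha256:" + hashlib.sha256(data).hexdigest()


original = b"The era of the glossator and the commentator came to an end."
copy_elsewhere = bytes(original)            # a byte-identical copy held somewhere else
revised = original + b" (second edition)"   # a different version of the work

# Identical copies share an identifier, wherever they live.
print(content_id(original) == content_id(copy_elsewhere))  # True
# Any change to the content yields a different identifier.
print(content_id(original) == content_id(revised))         # False
```

The identifier says nothing about where the content lives, only what it is. One caveat: it only matches byte-for-byte copies, so the same text saved in a different file format would get a different identifier.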

I don't know. I'm probably missing something. But it seems we have no problem tracking down YouTube videos when they spawn into the ether. We can also tell one version of software from another, no matter where it is and how it is labeled. Why not just let the content go and provide mechanisms to find a specific version of the content via hash search (also solving the issues of versioning URIs)?
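Here is a small sketch of what "hash search" could look like, again in Python. It assumes we already have a set of candidate copies in hand; the mirror URLs, the in-memory copies dictionary, and the find_copies function are hypothetical stand-ins for whatever index or crawler would really do this job.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Hex SHA-256 digest of the given bytes."""
    return hashlib.sha256(data).hexdigest()


def find_copies(wanted_hash, copies):
    """Return every location whose content hashes to wanted_hash."""
    return [
        location
        for location, data in copies.items()
        if sha256_hex(data) == wanted_hash
    ]


# Hypothetical locations and contents, held in memory for illustration.
v1 = b"Draft of the article, version 1"
v2 = b"Draft of the article, version 2"
copies = {
    "https://mirror-a.example/article.txt": v1,
    "https://mirror-b.example/renamed.txt": v1,  # same bytes, different label
    "https://mirror-c.example/article.txt": v2,  # same label, different version
}

# Asking for version 1 by hash finds both copies, whatever they are called.
print(find_copies(sha256_hex(v1), copies))
```

Any matching location will do, and if one copy disappears the others still satisfy the same reference. A different version simply has a different hash, so no special versioning scheme is needed in the identifier.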

Colophon: written on a lovely Sunday afternoon in the Mission. Adam got up Monday morning with that nagging feeling... rethought it. Talked to Raewyn. Rewrite coming. Written using Ghost (free) software.