TL;DR: Before inventing a new URI scheme, see if there's already one in use that does what you need.
My recent nerd forum lurking has given me a sense that there's a wave of interest in content distribution networks that use hashes as identifiers. I suspect this comes in part from widespread adoption of and understanding about the structures underpinning distributed version control systems, along with the simultaneous realization that naming things in such a way that you can verify that the bytes you get are the ones you asked for (and maybe something about deduplication, or something philosophical about names) is a good idea.
The idea's not at all new, though, and it turns out there are already [de-facto] standards for identifying files by hash and for fetching them from HTTP servers. I am writing this document to spread the word, because wouldn't it be nice if we could all agree on these things so that our systems are interoperable?
Note that the example URNs used below all refer to the string "Hello, world!" (13 bytes, no trailing linefeed). I use the word 'blob' to mean 'byte sequence' (i.e. the contents of a file).
This URN scheme (along with several variations) is described in an IETF document and is also recognized by some Gnutella clients.
The 32-character string following the "urn:sha1:" prefix is the base32-encoded SHA-1 sum of the referenced blob (actually a slight variation on RFC3548 Base32-encoding is used that omits padding).
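As a rough illustration of the encoding just described, here is a minimal Python sketch that builds a "urn:sha1:" name from a blob. The function name sha1_urn is my own invention for this example, not part of any standard or library.

```python
import base64
import hashlib


def sha1_urn(blob: bytes) -> str:
    """Return the urn:sha1: name for a byte sequence (hypothetical helper)."""
    digest = hashlib.sha1(blob).digest()  # 20 raw bytes
    encoded = base64.b32encode(digest).decode("ascii")
    # A 20-byte digest encodes to exactly 32 characters, so there is
    # normally no '=' padding to strip; rstrip is only a safeguard.
    return "urn:sha1:" + encoded.rstrip("=")


if __name__ == "__main__":
    # The example string used throughout this article, without a linefeed.
    print(sha1_urn(b"Hello, world!"))
```

Note that the hash is taken over the raw bytes of the blob and nothing else, so any tool that can compute a plain SHA-1 sum can produce or check the name.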
References a blob by Tiger-Tree hash [mirror].
The 39-character string following the prefix uses the same base32 encoding as the 'urn:sha1:' scheme.
Merkle trees are nice because, assuming you have access to the internal node data, you can verify parts of the file independently. If you're downloading a 10TB file and one bit gets flipped somewhere, you can identify and re-fetch a section of the file containing that bad bit instead of having to re-download the whole thing.
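To make the partial-verification idea concrete, here is a small Python sketch. It is not a Tiger-Tree implementation: Python's hashlib has no Tiger, so SHA-256 stands in for it, and the 1024-byte block size and the 0x00/0x01 leaf/internal prefixes are assumptions modelled on my understanding of the THEX convention. All function names are hypothetical.

```python
import hashlib

BLOCK_SIZE = 1024  # assumed leaf size for this sketch


def _h(data: bytes) -> bytes:
    # Stand-in hash: SHA-256 instead of Tiger (an assumption, see above).
    return hashlib.sha256(data).digest()


def leaf_hashes(blob: bytes) -> list[bytes]:
    """Hash each fixed-size block of the blob, with a leaf prefix byte."""
    blocks = [blob[i:i + BLOCK_SIZE] for i in range(0, len(blob), BLOCK_SIZE)]
    return [_h(b"\x00" + block) for block in (blocks or [b""])]


def merkle_root(leaves: list[bytes]) -> bytes:
    """Combine leaf hashes pairwise until a single root hash remains."""
    level = leaves
    while len(level) > 1:
        nxt = [_h(b"\x01" + level[i] + level[i + 1])
               for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:  # an unpaired node is promoted unchanged
            nxt.append(level[-1])
        level = nxt
    return level[0]


def bad_blocks(received: bytes, trusted_leaves: list[bytes]) -> list[int]:
    """Return indexes of blocks whose leaf hash differs from the trusted one."""
    got = leaf_hashes(received)
    return [i for i, (a, b) in enumerate(zip(got, trusted_leaves)) if a != b]


if __name__ == "__main__":
    original = bytes(range(256)) * 16      # 4096 bytes, i.e. 4 blocks
    trusted = leaf_hashes(original)
    corrupted = bytearray(original)
    corrupted[2500] ^= 0x01                # flip one bit inside block 2
    print(bad_blocks(bytes(corrupted), trusted))
```

With the trusted leaf hashes in hand, only the blocks reported by bad_blocks need to be fetched again; the root hash is what ties those leaf hashes back to the name you asked for.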
Bitprints are simply an SHA-1 and TigerTree hash concatenated together. In URN form there's a period between those two parts.
Advantages:
Disadvantages:
Overall, I like this scheme and support it in my projects (ContentCouch, PHPN2R), even if they just extract the SHA-1 part and use that.
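Extracting the SHA-1 part of a bitprint is just string handling. The following Python sketch shows one way to do it; the function name is hypothetical and the length checks (32 and 39 characters) are taken from the descriptions of the two component schemes above.

```python
PREFIX = "urn:bitprint:"


def bitprint_to_sha1_urn(urn: str) -> str:
    """Return the urn:sha1: name embedded in a urn:bitprint: name."""
    if not urn.lower().startswith(PREFIX):
        raise ValueError("not a bitprint URN: " + urn)
    # The SHA-1 and TigerTree parts are separated by a single period.
    sha1_part, _, tiger_part = urn[len(PREFIX):].partition(".")
    if len(sha1_part) != 32 or len(tiger_part) != 39:
        raise ValueError("malformed bitprint URN: " + urn)
    return "urn:sha1:" + sha1_part


if __name__ == "__main__":
    print(bitprint_to_sha1_urn(
        "urn:bitprint:B3ZJZ7CSOXEXMZCWFHCBQP4CCSBJET6Y"
        ".SDN6FFGJIFX4ODPZ46NCBWNCJQP6APTEX6YRQGY"
    ))
```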
Although Git uses SHA-1 hashes to reference files, those are not
hashes of the file itself. Instead, they are the hash of a small
header followed by the file contents. That's why the output of
'sha1sum some-file' doesn't match that of 'git hash-object some-file'.
Personally I think it would have been better if Git used the straight
SHA-1 sum of the file and stored metadata about how the data is to be
interpreted separately (i.e. a bit in the directory entry data
structure to indicate if the target is to be interpreted literally or
as a directory or symlink or whatever).
That said, I do have some ideas for a URI scheme to reference objects by Git-hash.
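For reference, the difference between the two hashes can be reproduced in a few lines of Python. For a blob, the header Git prepends is the word "blob", a space, the content length in decimal, and a NUL byte. The function names here are my own.

```python
import hashlib


def plain_sha1(content: bytes) -> str:
    """What 'sha1sum some-file' reports: the hash of the bytes alone."""
    return hashlib.sha1(content).hexdigest()


def git_blob_hash(content: bytes) -> str:
    """What 'git hash-object some-file' reports: header plus content."""
    header = b"blob %d\x00" % len(content)
    return hashlib.sha1(header + content).hexdigest()


if __name__ == "__main__":
    data = b"Hello, world!"
    print(plain_sha1(data))
    print(git_blob_hash(data))
```

Because the header includes the length, the Git hash cannot be derived from the plain SHA-1 sum; you need the content itself to convert between the two.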
Because sometimes you want to identify things other than byte sequences.
This part is not any sort of pre-existing standard. I came up with it in 2008 because I wanted to build a flexible Git-like system geared towards storing and versioning very large directory structures containing potentially large files (think media collections).
Essentially, the idea is this: If you want to talk about something that's not a byte sequence using a hash-based URN, you (1) create a document about that thing (that's where the RDF comes in), (2) serialize that document, (3) generate a URN for that document, and (4) add some sort of {pre,post,circum}fix to that URN to indicate 'the thing described by'. For that last part I couldn't find any convention already in use. Postfixing the URN of an RDF document with "#something" comes close, but I didn't want to have to give my RDF nodes IDs; I thought of just using "#" but decided I may as well invent a new prefix because its meaning would be more obvious. The prefix I use is "x-rdf-subject:", giving URNs like "x-rdf-subject:urn:bitprint:B3ZJZ7CSOXEXMZCWFHCBQP4CCSBJET6Y.SDN6FFGJIFX4ODPZ46NCBWNCJQP6APTEX6YRQGY", meaning 'the thing described by urn:bitprint:B3ZJZ7CSOXEXMZCWFHCBQP4CCSBJET6Y.SDN6FFGJIFX4ODPZ46NCBWNCJQP6APTEX6YRQGY, which presumably is some RDF encoding'. (That particular URN references a directory of music files.)
Why RDF? Because it's a standard and you can represent anything with it in an unambiguous way. Of course you can apply this same idea using formats other than RDF (and certainly other than XML-encoded RDF).
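Steps (2) through (4) above can be sketched in Python as follows. This is only an illustration under some assumptions: it uses "urn:sha1:" rather than "urn:bitprint:" for the document URN (Python's standard library has no Tiger hash), it treats the serialized document as opaque bytes, and the function names and the placeholder document are made up for the example.

```python
import base64
import hashlib

SUBJECT_PREFIX = "x-rdf-subject:"


def sha1_urn(blob: bytes) -> str:
    """Step 3: name a serialized document by its unpadded base32 SHA-1."""
    digest = hashlib.sha1(blob).digest()
    return "urn:sha1:" + base64.b32encode(digest).decode("ascii").rstrip("=")


def subject_urn(serialized_doc: bytes) -> str:
    """Step 4: prefix the document's URN to mean 'the thing described by'."""
    return SUBJECT_PREFIX + sha1_urn(serialized_doc)


def describing_blob_urn(urn: str) -> str:
    """Go the other way: which blob describes the thing this URN names?"""
    if not urn.startswith(SUBJECT_PREFIX):
        raise ValueError("not an x-rdf-subject URN: " + urn)
    return urn[len(SUBJECT_PREFIX):]


if __name__ == "__main__":
    # Step 2 result: a placeholder standing in for a real serialized document.
    doc = b'<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"/>'
    name = subject_urn(doc)
    print(name)
    print(describing_blob_urn(name))
```

One consequence worth noting: two byte-for-byte different serializations of the same description produce different names, so a system using this scheme needs a stable serialization if it wants identical things to get identical URNs.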
RFC2169 covers this topic. I've been implementing a section of it, namely the 'GET /uri-res/N2R?some-urn' part (which should return the blob identified by some-urn). I've also come up with some extensions:
GET /uri-res/raw/some-urn[/filename-hint] - this allows one to reference a blob in a way that's a bit more natural for web browsers. A filename hint can be included that will presumably be the default if the user chooses to save the file (by linking to '/uri-res/N2R?' resources, users might end up saving a lot of files called "N2R").
PUT /uri-res/N2R?some-urn - PUTting to an 'N2R' URL results in either: [...]
[...] 'xt' part of magnet: URIs.
[...] /uri-res/raw links.
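The GET conventions described above can be exercised from a client with a short Python sketch like the one below. The helper names and the example.org server are hypothetical, only "urn:sha1:" names are verified, and the PUT behaviour is left out. The point is that the client can check the bytes it receives against the name it asked for, so it does not need to trust the server.

```python
import base64
import hashlib
from urllib.parse import quote
from urllib.request import urlopen


def n2r_url(server: str, urn: str) -> str:
    """Build the 'GET /uri-res/N2R?some-urn' URL for a server."""
    return f"{server.rstrip('/')}/uri-res/N2R?{urn}"


def raw_url(server: str, urn: str, filename_hint: str = "") -> str:
    """Build the browser-friendlier '/uri-res/raw/' URL, with optional hint."""
    url = f"{server.rstrip('/')}/uri-res/raw/{urn}"
    if filename_hint:
        url += "/" + quote(filename_hint, safe="")
    return url


def matches_sha1_urn(urn: str, data: bytes) -> bool:
    """Check that data really is the blob named by a urn:sha1: URN."""
    prefix = "urn:sha1:"
    if not urn.lower().startswith(prefix):
        return False
    digest = hashlib.sha1(data).digest()
    actual = base64.b32encode(digest).decode("ascii").rstrip("=")
    return actual == urn[len(prefix):].upper()


def fetch_verified(server: str, urn: str) -> bytes:
    """Fetch a blob by URN and refuse it if the hash does not match."""
    with urlopen(n2r_url(server, urn)) as response:
        data = response.read()
    if not matches_sha1_urn(urn, data):
        raise ValueError("server returned bytes that do not match " + urn)
    return data


if __name__ == "__main__":
    urn = "urn:sha1:SQ5HALIG6NCZTLXB7DNI56PXFFQDDVUZ"
    print(n2r_url("http://example.org", urn))
    print(raw_url("http://example.org", urn, "hello world.txt"))
```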
The author of this article is TOGoS; append two zeroes and an "at gmail.com" to that name to e-mail him.