You may also be interested in HashURNs.html, an article specifically about representing hashes as URNs and some related topics.

Opera used to support @media projection, which was good for making slide shows.

Content-Addressable Storage

Naming Stuff using Hashes

Other Approaches?

[AUDIENCE PRECIPITATION]

Some Other Approaches

Parents' name + random first name (e.g. "Dan Stevens")
User-provided name (e.g. "New Document")
URLs (authority name + path)
IDs from system-specific sequences
- AUTO_INCREMENT
- Shared sequences (LLS entity IDs)
New number from a global registry (e.g. OIDs, SSNs)
- EIT = 1.3.6.1.4.1.44868
UUIDs ("de305d54-75b4-431b-adb2-eb6b9e546014")

Problems with Those Approaches

Names can be ambiguous

Who do you mean when you say "Dan"?
Which version of database-dump.sql should I be looking at to reproduce this bug?
I downloaded http://example.com/latest-version.zip twice and got different things.

Hard to verify that what you asked for is what you got.

http://example.com/latest-version.zip might have gotten corrupted in transit.

Cache invalidation is haaaard.

(Especially in distributed systems.)

User changed file-list.txt, then looks at cool-chart-generated-from-file-list.png.
Even if output is generated on the fly by the server, proxies or the user's browser may have cached the file, resulting in confusion.

A Naming Convention To Solve All?

Cryptographic Hashes

They take your data and emit a big number
You can't tell what input generated a hash
Changing the input changes the output
You can't go backwards.

That solution's kind of ugly becuse you pass the location of the file (the URL) and the deta to verify it separately. And bob's probably not going to actually check the hash because people are lazy.

Here's an Idea

Let's just identify the file by its hash.

As long as the hash matches, we don't care where it came from!

Systems that do this

Bittorrent
Git
Freenet

Systems that do this

Bittorrent
- Torrents contain a list of chunk hashes
Git
Freenet
LLS Ruleset Storage
Various other systems that Dan worked on:
- User-uploaded file storage
- Address records
My photo collection

Bittorrent

Pirated-Movie.torrent:

announce

http://9.rarbg.com:2710/announce

info

name	Pirated-Movie.mkv
piece length	262144
length	12345678
pieces	ac9e79797f8208a6c1b5c7f79a6bafdd07bd0bb2 be2e9400d39a60b75fed2bb25851ccd48b6151c8 7f0a9dfe35531f7eee14112ef34ac4434598b2de 1a7f3107f9b92ea07dbda6f7cc07f17eaa1abdaa b8c9bdec5a025eaeca6584572b88def2ce55cb76 636ac031b749e214f1e5254211590eb7e4a3bbe2 ...

Bittorrent

magnet:?xt=urn:btih:103863b4749ab394f0235ef2a4d988663d7d8579

(See Magnet URI Scheme [Wikipedia])

Git

3 primary object types: commits, trees, and blobs

Objects are identified by the hash of their serialized form (with a header)

Git

$ echo 'Hello!' > hello.txt
$ git add hello.txt
$ git commit -m "Hello."

[master (root-commit) 49d177e] Hello.
 1 file changed, 1 insertion(+)
 create mode 100644 hello.txt

$ echo 'Goodbye!' > goodbye.txt
$ git add goodbye.txt
$ git commit -m "Add a goodbye file."

[master (root-commit) 83b2093] Add a goodbye file.
 1 file changed, 1 insertion(+)
 create mode 100644 goodbye.txt

Git

83b209323ce9c03bd3c9e8fa9e3139c5acc5f54c:

commit 234

tree 503333de279da4cef9c359be02651fc285dfcc6f
parent 49d177e8179e0b3d5c21827e181811f4c1c7839c
author Dan Stevens <stevens@earthit.com> 1447783213 -0600
committer Dan Stevens <stevens@earthit.com> 1447783213 -0600

Add a goodbye file.

503333de279da4cef9c359be02651fc285dfcc6f:

tree 76

100644 goodbye.txt b04c55e39ec0138a2f04ffa29191457abc658bac
100644 hello.txt 10ddd6d257e01349d514541981aeecea6b2e741d

b04c55e39ec0138a2f04ffa29191457abc658bac:

blob 9

Goodbye!

10ddd6d257e01349d514541981aeecea6b2e741d:

blob 7

Hello!

49d177e8179e0b3d5c21827e181811f4c1c7839c:

commit 173

tree 04541bff04d1d03bae56102f80bd68cb0be0e167
author Dan Stevens <stevens@earthit.com> 1447778840 -0600
committer Dan Stevens <stevens@earthit.com> 1447778840 -0600

Hello.

04541bff04d1d03bae56102f80bd68cb0be0e167:

tree 37

100644 hello.txt 10ddd6d257e01349d514541981aeecea6b2e741d

Features of Git's Data Model

Data is entirely decoupled from storage.
Once you know the root commit hash, you can trace the entire history
Don't need to trust data sources to fetch data
- With signed commits, don't even need to trust them to find root commit!
Automatically de-duplicates identical objects
- Even ones authored by different people!
- "Hello!\n" is always 10ddd6d257e01349d514541981aeecea6b2e741d
Backups are just a matter of walking the tree and downloading all referenced objects.
Corrupt repo? Just patch missing/corrupted objects from a good one.

Git Weirdnesses

Inconsistent storage of hashes (sometimes base16-encoded, sometimes not)
Blobs include a header, so sha1(hello.txt) ≠ git-hash-object(hello.txt)
Command-line interface is a disaster

Git-like Systems

Merkle Trees

Hashes of pairs of hashes of pairs of hashes .... of chunks of your data

Benefits of Hash-Based Addressing

Simpler than making up names or 'sanitizing' user-provided ones!
Hashes double as both identifiers and checksums.
Deduplication is automatic.
Caching and distribution is trivial.
Fetching can be considered a purely functional operation.
Pure functions that take object IDs as input can be aggresively memoized.

Limitations of Hash-Based Addressing

Can't reference things that may be updated.
- Fixed a bug in your package? Tough luck.
Remote possibility of hash collisions. ;)
Hash by itself doesn't provide hints for where to look for the data.
Need to agree on hash scheme or else build equivalence tables.

Signature-Based Addresses

One solution to the "Can't update" problem.

Identifier includes public key of trusted author.

This scheme could be used by fully distributed systems in conjunction with hash-based keys to construct updateable entities This is how Freenet implements SSKs.

Query might be something like: "Give me the latest commit signed by the guy with public key urn:sha1:W6QJK24LSJ64HSIYWUQXXLA7RPKB43YY"

Content-Addressable Storage

Naming Stuff using Hashes

Other Approaches?

Some Other Approaches

Problems with Those Approaches

Names can be ambiguous

Hard to verify that what you asked for is what you got.

Cache invalidation is haaaard.

A Naming Convention To Solve All?

Cryptographic Hashes

Cryptographic Hashes

A Parable

Here's an Idea

Systems that do this

Systems that do this

Bittorrent

Bittorrent

Git

Git

Git

Git

Git

Features of Git's Data Model

Git Weirdnesses

Git-like Systems

Merkle Trees

Merkle Trees

Benefits of Hash-Based Addressing

Limitations of Hash-Based Addressing

Signature-Based Addresses

Conclusion