These are the slides for
a tech talk I did in 2015.
You may also be interested in
HashURNs.html,
an article specifically about representing hashes as URNs and some related topics.
A note on the medium:
Opera used to support @media projection
, which was good for making slide shows.
Nobody supports that any more, including Opera, which is lame.
So I hacked together some javascript to simulate slideshow mode.
Controls: previous, next, beginning, end.
Hit n to see the first slide.
Content-Addressable Storage
Naming Stuff using Hashes
Other Approaches?
[AUDIENCE PRECIPITATION]
Some Other Approaches
- Parents' name + random first name (e.g. "Dan Stevens")
- User-provided name (e.g. "New Document")
- URLs (authority name + path)
- IDs from system-specific sequences
- AUTO_INCREMENT
- Shared sequences (LLS entity IDs)
- New number from a global registry (e.g. OIDs, SSNs)
- UUIDs ("de305d54-75b4-431b-adb2-eb6b9e546014")
Problems with Those Approaches
Names can be ambiguous
- Who do you mean when you say "Dan"?
- Which version of database-dump.sql should I be looking at to reproduce this bug?
- I downloaded http://example.com/latest-version.zip twice and got different things.
Hard to verify that what you asked for is what you got.
- http://example.com/latest-version.zip might have gotten corrupted in transit.
Cache invalidation is haaaard.
(Especially in distributed systems.)
- User changed file-list.txt, then looks at cool-chart-generated-from-file-list.png.
- Even if output is generated on the fly by the server, proxies or the user's browser may have cached the file, resulting in confusion.
A Naming Convention To Solve All?
Cryptographic Hashes
- They take your data and emit a big number
- You can't tell what input generated a hash
- Changing the input changes the output
- You can't go backwards.
That solution's kind of ugly becuse you pass the location of the
file (the URL) and the deta to verify it separately. And bob's
probably not going to actually check the hash because people are
lazy.
Here's an Idea
Let's just identify the file by its hash.
As long as the hash matches, we don't care where it came from!
Bittorrent
Pirated-Movie.torrent
:
announce | http://9.rarbg.com:2710/announce |
info |
name | Pirated-Movie.mkv |
piece length | 262144 |
length | 12345678 |
pieces | ac9e79797f8208a6c1b5c7f79a6bafdd07bd0bb2
be2e9400d39a60b75fed2bb25851ccd48b6151c8
7f0a9dfe35531f7eee14112ef34ac4434598b2de
1a7f3107f9b92ea07dbda6f7cc07f17eaa1abdaa
b8c9bdec5a025eaeca6584572b88def2ce55cb76
636ac031b749e214f1e5254211590eb7e4a3bbe2
... |
|
Bittorrent
magnet:?xt=urn:btih:103863b4749ab394f0235ef2a4d988663d7d8579
(See Magnet URI Scheme [Wikipedia])
Git
3 primary object types: commits, trees, and blobs
Objects are identified by the hash of their serialized form (with a header)
Git
$ echo 'Hello!' > hello.txt
$ git add hello.txt
$ git commit -m "Hello."
[master (root-commit) 49d177e] Hello.
1 file changed, 1 insertion(+)
create mode 100644 hello.txt
$ echo 'Goodbye!' > goodbye.txt
$ git add goodbye.txt
$ git commit -m "Add a goodbye file."
[master (root-commit) 83b2093] Add a goodbye file.
1 file changed, 1 insertion(+)
create mode 100644 goodbye.txt
Git
Git
83b209323ce9c03bd3c9e8fa9e3139c5acc5f54c
:
tree 503333de279da4cef9c359be02651fc285dfcc6f
parent 49d177e8179e0b3d5c21827e181811f4c1c7839c
author Dan Stevens <stevens@earthit.com> 1447783213 -0600
committer Dan Stevens <stevens@earthit.com> 1447783213 -0600
Add a goodbye file.
503333de279da4cef9c359be02651fc285dfcc6f
:
100644 goodbye.txt b04c55e39ec0138a2f04ffa29191457abc658bac
100644 hello.txt 10ddd6d257e01349d514541981aeecea6b2e741d
b04c55e39ec0138a2f04ffa29191457abc658bac
:
Goodbye!
10ddd6d257e01349d514541981aeecea6b2e741d
:
Hello!
49d177e8179e0b3d5c21827e181811f4c1c7839c
:
tree 04541bff04d1d03bae56102f80bd68cb0be0e167
author Dan Stevens <stevens@earthit.com> 1447778840 -0600
committer Dan Stevens <stevens@earthit.com> 1447778840 -0600
Hello.
04541bff04d1d03bae56102f80bd68cb0be0e167
:
100644 hello.txt 10ddd6d257e01349d514541981aeecea6b2e741d
Features of Git's Data Model
- Data is entirely decoupled from storage.
- Once you know the root commit hash, you can trace the entire history
- Don't need to trust data sources to fetch data
- With signed commits, don't even need to trust them to find root commit!
- Automatically de-duplicates identical objects
- Even ones authored by different people!
- "Hello!\n" is always
10ddd6d257e01349d514541981aeecea6b2e741d
- Backups are just a matter of walking the tree and downloading all referenced objects.
- Corrupt repo? Just patch missing/corrupted objects from a good one.
Git Weirdnesses
- Inconsistent storage of hashes (sometimes base16-encoded, sometimes not)
- Blobs include a header, so
sha1(hello.txt) ≠ git-hash-object(hello.txt)
- Command-line interface is a disaster
Merkle Trees
Hashes of pairs of hashes of pairs of hashes .... of chunks of your data
Merkle Trees
Benefits of Hash-Based Addressing
- Simpler than making up names or 'sanitizing' user-provided ones!
- Hashes double as both identifiers and checksums.
- Deduplication is automatic.
- Caching and distribution is trivial.
- Fetching can be considered a purely functional operation.
- Pure functions that take object IDs as input can be
aggresively memoized.
Limitations of Hash-Based Addressing
- Can't reference things that may be updated.
- Fixed a bug in your package? Tough luck.
- Remote possibility of hash collisions. ;)
- Hash by itself doesn't provide hints for where to look for the data.
- Need to agree on hash scheme or else build equivalence tables.
Signature-Based Addresses
One solution to the "Can't update" problem.
Identifier includes public key of trusted author.
This scheme could be used by fully distributed systems in
conjunction with hash-based keys to construct updateable entities
This is how Freenet implements
SSKs.
Query might be something like: "Give me the latest commit signed by
the guy with public key urn:sha1:W6QJK24LSJ64HSIYWUQXXLA7RPKB43YY"
Conclusion
- Hashes are a really great way to identify immutable pieces of data.
- By using trees of hashes, you can represent complex structures.
- In cases where hash-based addressing is applicable, it solves the two hard problems of computer science:
- Cache invalidation
- Naming things