hashmap: serialize keys as strings not bytes #184

rvagg · 2019-08-29T06:33:13Z

The original form serialized keys as byte arrays; this switches them to strings. They will likely still be treated as byte arrays for the purpose of hashing (as dictated by hashing APIs).

Doing this, ironically, makes it possible to do #180 where we could potentially store different key types but do kinded differentiation. But I'd like to make this change regardless of whether #180 goes anywhere.

Stebalien · 2019-08-30T01:27:04Z

I'm not sure if this is related but being able to use arbitrary byte sequences in HAMTs is pretty important.

rvagg · 2019-08-30T02:30:03Z

Unrelated. This is just about how to store the key in the block: internally it's hashed and sliced as a byte sequence, but it is still stored as a string.
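A minimal sketch of "hashed and sliced as a byte sequence but still stored as a string". The hash function (SHA-256), the 5-bit slice width, and the helper names `hash_key` and `index_at_depth` are assumptions for illustration, not what the spec or any implementation prescribes.

```python
import hashlib


def hash_key(key: str) -> bytes:
    """Hash a string key via its UTF-8 bytes (hashing APIs operate on bytes)."""
    return hashlib.sha256(key.encode("utf-8")).digest()


def index_at_depth(digest: bytes, depth: int, bit_width: int = 5) -> int:
    """Return the depth-th group of bit_width bits, counted from the top of the digest."""
    as_int = int.from_bytes(digest, "big")
    shift = len(digest) * 8 - (depth + 1) * bit_width
    if shift < 0:
        raise ValueError("digest exhausted at this depth")
    return (as_int >> shift) & ((1 << bit_width) - 1)


key = "hello"
digest = hash_key(key)

# The key is only turned into bytes for hashing; the stored entry keeps the string.
stored_entry = (key, None)
indices = [index_at_depth(digest, depth) for depth in range(3)]
```

Each element of `indices` selects a child slot at one level of the tree, while `stored_entry` never sees the byte form of the key.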

vmx · 2019-09-05T12:24:35Z

I don't understand how an arbitrary byte sequence can be stored as valid UTF-8 string.
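To make the concern concrete, here is a small sketch showing that not every byte sequence decodes as UTF-8. The helper name `try_utf8` is hypothetical.

```python
def try_utf8(raw: bytes):
    """Return the decoded string, or None if raw is not valid UTF-8."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return None


valid = try_utf8(b"caf\xc3\xa9")   # a well-formed UTF-8 sequence
invalid = try_utf8(b"\xff\xfe")    # 0xFF never appears in well-formed UTF-8
```

Here `valid` is a string and `invalid` is `None`, so a key like the second one cannot be stored in a string-only field without some extra encoding step.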

dvc94ch · 2019-09-05T15:44:56Z

I think he is planning on base encoding them? Please don't do this, multiformats/rust-multibase#9

rvagg · 2019-09-09T04:54:51Z

No, the "map" kind only supports string keys, so there's no conversion to strings going on here. e.g. https://github.com/rvagg/js-ipld-hashmap/ will insist that keys passed to operations are strings, and will only return strings for keys() and entries(). The current spec requires block serialization to convert from string to bytes and back again; this just changes block serialization to keep strings as strings.

This is related to the data model "maps having non-string keys" thing that's been discussed, and it aligns with the current conclusion that, no, they shouldn't be able to have non-string keys (for now at least).

Making this change, however, opens the door to #180 where we could potentially use the HashMap in a mode where bytes (and maybe other kinds) are keys distinct from strings. If we just mush everything to bytes during serialization then we lose the opportunity to differentiate during deserialization. If in the future we say that data model maps could have non-string keys, being able to differentiate would be useful here too.

rvagg · 2019-09-09T05:04:14Z

I might be reading @Stebalien's comment wrong: are you saying that using arbitrary byte sequences as keys in put, get, etc. operations is important, where strings don't cut it? That there are modes where you want to use a HAMT to reference byte array keys, on top of the more typical use of string keys?

This PR doesn't exactly preclude that use case, it just needs #180 as well (at least the String and Byte parts, the option to use Int keys should probably be dropped for now). The current incarnation of this spec just coerces all keys to a byte array during serialization. The implication of this is that when you deserialize you have to decide elsewhere whether you coerce to a string or not.
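A rough sketch of the difference between coercing every key to bytes and keeping the kind alongside the key. The function names and the `("string", ...)` / `("bytes", ...)` tags are made up for illustration and do not reflect the schema in #180.

```python
def coerce_to_bytes(key):
    """Current behaviour as described above: every key becomes a byte array."""
    return key.encode("utf-8") if isinstance(key, str) else bytes(key)


def kinded(key):
    """Hypothetical alternative: keep the kind so it survives a round trip."""
    if isinstance(key, str):
        return ("string", key)
    if isinstance(key, (bytes, bytearray)):
        return ("bytes", bytes(key))
    raise TypeError(f"unsupported key kind: {type(key).__name__}")


# After coercion the two keys are indistinguishable, so the reader has to
# decide elsewhere whether to turn them back into strings.
same_after_coercion = coerce_to_bytes("abc") == coerce_to_bytes(b"abc")

# With the kind preserved they stay distinct.
distinct_when_kinded = kinded("abc") != kinded(b"abc")
```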

Stebalien · 2019-09-09T14:37:54Z

> are you saying that using arbitrary byte sequences as keys in put, get, etc. operations is important, where strings don't cut it?

In maps in general, yes. For example, unix filenames are byte sequences, not unicode strings.

> The implication of this is that when you deserialize you have to decide elsewhere whether you coerce to a string or not.

Got it. So the actual type is really just "anything that serializes to bytes".

dvc94ch · 2019-09-09T23:44:10Z

Why shouldn't maps in the data model support other key types? The rust implementation supports bytes and integer keys [0], and it doesn't make the implementation much more complicated.
I don't understand this switch. The hamt doesn't use maps internally (if I understand the schema correctly) except for encoding the Root and Node struct fields.

[0] https://github.com/dvc94ch/rust-ipld

rvagg · 2019-09-10T02:02:09Z

Thanks @Stebalien. I think I'm going to scrap this PR and do a combination of this and #180, that has a kinded union for keys so they can be stored as either Bytes or Strings. Will defer on Ints for now.

@dvc94ch: There is some previous discussion on non-string keys for data model maps in #58, and there has been more since, though I'm not sure it's been well recorded. I think the main problem is that it's difficult to support across the codecs we care about. CBOR might be easy, but JSON, not so much. Bytes as keys is just as complicated (maybe more so). The data model is supposed to be the lowest common denominator of what we can reasonably support. Maybe this is something we could work on for a new dag-json but I suspect there's a nasty can of worms underneath here.

@vmx @warpfork @mikeal input pls?

Regardless of the data model, that's not what I'm trying to deal with here, although there was an aspect of aligning to the current understanding of the data model that maps only allow string keys. But again, this change doesn't preclude further modification, as per #180 which introduces that additional flexibility. I'll close both of these and open a new one to try and be more clear.

This non-string keys in the data model thing really needs to be properly addressed though.

dvc94ch · 2019-09-10T10:02:02Z

Returning an error from the JSON codec when a non-string key is used would be one solution. Since JSON is mostly useful for debugging, extending the JSON spec is also a possibility.
For example, "string", 'xyghde' and 0 for string, bytes and integer keys respectively.

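A sketch of the first suggestion above, a JSON encoder that rejects non-string map keys instead of guessing. `encode_json_map` and `NonStringKeyError` are hypothetical names and not part of any existing codec.

```python
import json


class NonStringKeyError(TypeError):
    """Raised when a map key cannot be represented as a JSON object key."""


def encode_json_map(mapping: dict) -> str:
    # Python's json.dumps silently turns int keys into strings, so the check
    # is done explicitly before encoding.
    for key in mapping:
        if not isinstance(key, str):
            raise NonStringKeyError(
                f"JSON maps need string keys, got {type(key).__name__}"
            )
    return json.dumps(mapping, sort_keys=True)


encoded = encode_json_map({"a": 1})
```

Calling `encode_json_map({b"a": 1})` or `encode_json_map({0: 1})` raises `NonStringKeyError`, which leaves it to the caller to pick a codec that can carry those keys.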
mikeal · 2019-09-10T17:25:11Z

We’ve been characterizing this as a codec support problem when it really isn’t. I’ve been as guilty of this as anyone but I don’t think this was the right way to approach it in hindsight.

The data model describes a set of base types. Codecs need to support those at a minimum but not as a maximum. There’s actually a lot of flexibility in how codecs decide to represent those types in the block and the data model says nothing about representations above and beyond that. The important part is, these are the types that we can confidently de-serialize into native types in almost any language. This is what allows us to create reasonable block format agnostic APIs.

This commitment to native type deserialization is similar to the motivations behind JSON compared to the commitments and priorities of, for the purposes of comparison, XML.

The reality is, many languages don’t have native types that support non-string keys in a Map/Struct. If we change the data model to allow this it means that all the APIs built on IPLD will end up with their own APIs for accessing and manipulating data that are not the simple native types programmers are already comfortable with (similar to the situation in XML where you have to use APIs like the DOM in order to read/write XML because it doesn’t have a simple mapping to native types).

In the case of collections (implemented in advanced layouts or composites or whatever), we’re already not able to use native types for the user facing APIs, because these are often multi-block structures. This is why we’ve ended up with Node APIs that have methods for common data structure operations (get, set, del, etc.) and the fact that we literally call it the same thing as the DOM does should not be lost on us. This situation makes it easy for us, in theory, to add support for non-string keys to collections. However, since all the tools we have for implementing these are built on the data model we end up with an intermediate representation of the data (between the Node API and the block format) that cannot use a native map with byte keys — which makes it quite difficult to actually use any features a codec might have beyond the data model.

While in theory we can support any arbitrary key type in a collection, in practice these can’t actually leverage special features in the codec that might create a more compact representation.

When we began considering non-string keys in hashmap we thought it would be easy, but it turns out to have uncovered a lot of complexity we didn’t anticipate when we started down this path. I’d like to see hashmap land and be in wide use on a shorter timeline than we have for resolving some of the bigger questions mentioned above, so I’m in favor of moving back to string keys for this spec. We can always work on another spec for a hashmap with non-string keys. This won’t be the last multi-block collection we write, and the sooner we ship this, the sooner we can ship another one that makes it clearer there is not “one HAMT to rule them all.”

We should document the problem of “using block format features that are outside the data model” and think about potential solutions. If someone wants to propose changing the data model then we can discuss that as well, but keep in mind that doing so will push us farther down the “XML” path than the “JSON” path.

dvc94ch · 2019-09-10T23:03:34Z

I think we should separate the discussion about the data model from this proposal.

type BucketEntry struct {
+  key String
-  key Bytes
  value Value (implicit "null")
} representation tuple

I don't see how this change has anything to do with the data model, and I think the two issues are being mixed up. In JSON notation this would be:

["a string", null]

or

[{"base64": "axfgd"}, null]
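A sketch of how the two JSON forms above could be produced and read back for a `BucketEntry` in tuple representation. The `{"base64": ...}` wrapper follows the example in this comment; the function names are hypothetical and this is not how any existing codec is known to encode bytes.

```python
import base64
import json


def entry_to_json(key, value=None) -> str:
    """Encode a (key, value) bucket entry as a JSON tuple."""
    if isinstance(key, str):
        encoded_key = key
    elif isinstance(key, bytes):
        encoded_key = {"base64": base64.b64encode(key).decode("ascii")}
    else:
        raise TypeError(f"unsupported key kind: {type(key).__name__}")
    return json.dumps([encoded_key, value])


def entry_from_json(text: str):
    """Decode the tuple, using the wrapper object to tell bytes from strings."""
    encoded_key, value = json.loads(text)
    if isinstance(encoded_key, dict):
        return base64.b64decode(encoded_key["base64"]), value
    return encoded_key, value


string_entry = entry_to_json("a string")
bytes_entry = entry_to_json(b"\x00\xff")
```

Because the string key and the bytes key take different shapes in the block, `entry_from_json` can return the original kind without any out-of-band hint.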

rvagg · 2019-09-11T00:13:20Z

@dvc94ch yep, you're right they're getting mixed up and this change doesn't introduce inherent conflicts. But we have been intentionally mixing these things up a bit, which you might be able to see here (from #182): https://github.com/ipld/specs/blob/378a2ad8f88c69bb2b3111799a01677a261c31d1/schemas/advanced-layouts.md - in fact part of this PR is me getting my head into the ADL (advanced data layout) space, where an ADL is mimicking a kind, which is not the way I was thinking when I first implemented this particular HAMT design.

But an ADL trying to mimic a data model kind doesn't preclude that same logic & encoding from going above and beyond what a data model kind can do. So even if we were to say "no non-string keys in data model maps" it doesn't stop us from doing it with an ADL, since it could be used via a non-generic programmatic interface anyway.

rvagg · 2019-09-11T00:52:22Z

Moving to #192 which is mostly #180 but is addressing some of what I was trying to get at here as well while retaining the possibility of storing bytes natively and providing a pure bytes interface.

hashmap: serialize keys as strings not bytes

a5689c8

rvagg force-pushed the rvagg/hashmap-keys-stored-as-strings branch from 2a60566 to a5689c8 Compare September 9, 2019 04:55

This was referenced Sep 9, 2019

hashmap: 3 kinds of map keys, string, bytes, integers (for discussion) #180

Closed

don't convert string keys to Buffers for storage, allow mixed key types rvagg/iamap#8

Closed

mikeal mentioned this pull request Sep 10, 2019

doc: add motivation section to data model spec #191

Merged

rvagg mentioned this pull request Sep 11, 2019

hashmap: differentiate serialization of string and byte keys #192

Closed

rvagg closed this Sep 11, 2019

rvagg deleted the rvagg/hashmap-keys-stored-as-strings branch September 11, 2019 00:52

rvagg mentioned this pull request Sep 13, 2019

schemas: string keys for keyed unions filecoin-project/specs#516

Closed

Stebalien pushed a commit to Stebalien/specs that referenced this pull request Sep 18, 2019

Merge pull request ipld#184 from filecoin-project/patch/typo

004553d

fix typo
