-
Notifications
You must be signed in to change notification settings - Fork 93
hashmap: serialize keys as strings not bytes #184
Conversation
|
I'm not sure if this is related but being able to use arbitrary byte sequences in HAMTs is pretty important. |
|
unrelated, this is just about how to store the key in the block, internally it's hashed and sliced as a byte sequence but still stored as a string. |
|
I don't understand how an arbitrary byte sequence can be stored as valid UTF-8 string. |
|
I think he is planning on base encoding them? Please don't do this, multiformats/rust-multibase#9 |
|
No, the "map" kind only supports string keys, so there's no conversion to strings going on here. e.g. https://github.com/rvagg/js-ipld-hashmap/ will insist on keys for operations are strings and will only return strings for This is related to the data model "maps having non-string keys" thing that's been discussed and aligning with the current conclusion that, no, they shouldn't be able to have non-string keys (for now at least). Making this change, however, opens the door to #180 where we could potentially use the HashMap in a mode where keys are bytes (and other kinds .. maybe) as keys distinct from strings. If we just mush everything to bytes during serialization then we lose the opportunity to differentiate during deserialization. If in the future we say that data model maps could have non-string keys, being able to differentiate here would be useful here too. |
2a60566 to
a5689c8
Compare
|
I might be reading @Stebalien's comment wrong: are you saying that using arbitrary byte sequences as keys in put, get, etc. operations is important, where strings don't cut it? That there are modes where you want to use a HAMT to reference byte array keys, on top of the more typical use of string keys? This PR doesn't exactly preclude that use case, it just needs #180 as well (at least the String and Byte parts, the option to use Int keys should probably be dropped for now). The current incarnation of this spec just coerces all keys to a byte array during serialization. The implication of this is that when you deserialize you have to decide elsewhere whether you coerce to a string or not. |
In maps in general, yes. For example, unix filenames are byte sequences, not unicode strings.
Got it. So the actual type is really just "anything that serializes to bytes". |
|
Why shouldn't maps in the data model support other key types? The rust implementation supports bytes and integer keys [0], and it doesn't make the implementation much more complicated. |
|
Thanks @Stebalien. I think I'm going to scrap this PR and do a combination of this and #180, that has a kinded union for keys so they can be stored as either Bytes or Strings. Will defer on Ints for now. @dvc94ch: Some previous discussion on non-string keys for data model maps: #58, but there has been more since, I'm not sure it's been well recorded though. I think the main problem is that it's difficult to support across the codecs we care about. CBOR might be easy, but JSON, not so much. Bytes as keys is just as complicated (maybe moreso). The data model is supposed to be lowest-common-denominator between what we can reasonably support. Maybe this is something we could work on for a new @vmx @warpfork @mikeal input pls? Regardless of the data model, that's not what I'm trying to deal with here, although there was an aspect of aligning to the current understanding of the data model that maps only allow string keys. But again, this change doesn't preclude further modification, as per #180 which introduces that additional flexibility. I'll close both of these and open a new one to try and be more clear. This non-string keys in the data model thing really needs to be properly addressed though. |
|
Returning an error from the json codec when not using a string key would be a solution. Since it's mostly useful for debugging, extending the json spec is also a possibility. |
|
We’ve been characterizing this as a codec support problem when it really isn’t. I’ve been as guilty of this as anyone but I don’t think this was the right way to approach it in hindsight. The data model describes a set of base types. Codecs need to support those at a minimum but not as a maximum. There’s actually a lot of flexibility in how codecs decide to represent those types in the block and the data model says nothing about representations above and beyond that. The important part is, these are the types that we can confidently de-serialize into native types in almost any language. This is what allows us to create reasonable block format agnostic APIs. This commitment to native type deserialization is similar to the motivations behind JSON compared to the commitments and priorities of, for the purposes of comparison, XML. The reality is, many languages don’t have native types that support non-string keys in a Map/Struct. If we change the data model to allow this it means that all the API’s built on IPLD will end up with their own API’s for accessing and manipulating data that are not the simple native types programmers are already comfortable with (similar to the situation in XML where you have to use APIs like the DOM in order to read/write XML because it doesn’t have a simple mapping to native types). In the case of collections (implemented in advanced layouts or composites or whatever), we’re already not able to use native types for the user facing APIs, because these are often multi-block structures. This is why we’ve ended up with While in theory we can support any arbitrary key type in a collection, in practice these can’t actually leverage special features in the codec that might create a more compact representation. When we began considering non-string keys in We should document the problem of “using block format features that are outside the data model” and think about potential solutions. If someone wants to propose changing the data model then we can discuss that as well, but keep in mind that doing so will push us farther down the “XML” path than the “JSON” path. |
|
I think we should separate the discussion about the data model from this proposal. type BucketEntry struct {
+ key String
- key Bytes
value Value (implicit "null")
} representation tupleI dont see how this change has anything to do with the data model and I think the two issues are being mixed up... This is in json notation would be: ["a string", null]or [{"base64": "axfgd"}, null] |
|
@dvc94ch yep, you're right they're getting mixed up and this change doesn't introduce inherent conflicts. But we have been intentionally mixing these things up a bit, which you might be able to see here (from #182): https://github.com/ipld/specs/blob/378a2ad8f88c69bb2b3111799a01677a261c31d1/schemas/advanced-layouts.md - in fact part of this PR is me getting my head into the ADL (advanced data layout) space, where an ADL is mimicking a kind, which is not the way I was thinking when I first implemented this particular HAMT design. But, an ADL trying to mimic a data model kind, doesn't preclude that same logic & encoding from going above and beyond what a data model kind can do. So even if we were to say "no non-string keys in data model maps" it doesn't stop us from doing it with an ADL since it could be used via a non-generic programmatic interface anyway. |
Original form had the keys serialized as byte arrays but this switches them to strings. They are still likely treated as byte arrays for the purpose of hashing (as dictated by hashing APIs).
Doing this, ironically, makes it possible to do #180 where we could potentially store different key types but do kinded differentiation. But I'd like to make this change regardless of whether #180 goes anywhere.