```ipldsch
type Element union {
  # ...
}

type Bucket list [ BucketEntry ]

type BucketEntry struct {
  key Key
  value Value (implicit "null")
} representation tuple

type Key union {
  | String string
  | Int int
  | Bytes bytes
} kinded

type Value union {
  | Bool bool
  | String string
  # ...
} kinded
```

Notes:
* `hashAlg` in the root block is a string identifier for a hash algorithm. The identifier should correspond to a [multihash](https://github.com/multiformats/multihash) identifier as found in the [multiformats table](https://github.com/multiformats/multicodec/blob/master/table.csv).
* `bitWidth` in the root block should be at least `3`.
* `bucketSize` in the root block must be at least `1`.
* The size of `map` is determined by `bitWidth` since it holds one bit per possible data element. It must be `1` or `2`<sup>`bitWidth`</sup>` / 8` bytes long, whichever is larger; for example, a `bitWidth` of `8` yields a `map` of `256 / 8 = 32` bytes.

## Key hashing

Any look-up or set operation involving a `key` requires a hash of that key in order to find the index of that key at each level of the tree structure. As such, we need clear rules for how each of the allowed `key` data model kinds should be hashed.

Regardless of hashing mechanism, the raw key, in its data model kind form, is stored in the final bucket in the data structure. Converting the key to a hashable form is purely an internal operation for the purpose of generating a consistent hash across implementations.

It is assumed that hash algorithms being used by the HashMap operate directly on byte arrays.

### Bytes

If a key is a byte array, the entirety of the byte array should be passed to the hash algorithm as-is.

### Strings

If a key is a string, it should be converted to the byte array comprising the entirety of the string and then passed to the hash algorithm. Strings are assumed to be encoded in UTF-8 form, so the bytes representing a given string are consistent across platforms.

### Integers

Integers must first be converted to byte arrays before being hashed. In order to hash consistently across languages, platforms and their number implementations, we must choose both a byte representation format and the number of bytes to be hashed.

For the common case of numbers in the 32-bit signed range, that is, from `-2`<sup>`31`</sup> to `(2`<sup>`31`</sup>`) - 1`, or `-2,147,483,648` to `2,147,483,647`, we encode the number as 4 bytes in little-endian form. This also includes numbers in the 8 and 16-bit ranges.

For numbers outside of this range, we encode to 8 bytes in 64-bit little-endian signed form. Encoding behavior for numbers outside of the 64-bit range (`-2`<sup>`63`</sup> to `(2`<sup>`63`</sup>`) - 1`, or `-9,223,372,036,854,775,808` to `9,223,372,036,854,775,807`) is left undefined. However, encoding to 16 bytes in little-endian form for the 128-bit range, and so on, is recommended if such support is possible and required for a given implementation of this specification.

Unsigned integers should be represented in signed form for the purpose of hashing.

The details of this conversion to bytes should be transparent to the user as the original integers are stored in the key/value pair location in buckets and only returned to users as the correct data model kind.
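
The following sketch shows this conversion for each of the allowed key kinds, followed by hashing. It is illustrative only: the helper names are hypothetical, and SHA2-256 merely stands in for whichever multihash `hashAlg` identifies.

```go
package main

import (
	"crypto/sha256"
	"encoding/binary"
	"fmt"
	"math"
)

// keyToBytes converts a key of any allowed data model kind (bytes, string or
// int) into the byte array that is passed to the hash algorithm.
func keyToBytes(key interface{}) []byte {
	switch k := key.(type) {
	case []byte:
		return k // byte-array keys are hashed as-is
	case string:
		return []byte(k) // Go strings yield UTF-8 bytes, matching the rule above
	case int64:
		if k >= math.MinInt32 && k <= math.MaxInt32 {
			buf := make([]byte, 4) // 32-bit signed range: 4 bytes, little-endian
			binary.LittleEndian.PutUint32(buf, uint32(int32(k)))
			return buf
		}
		buf := make([]byte, 8) // otherwise: 8 bytes, 64-bit little-endian signed form
		binary.LittleEndian.PutUint64(buf, uint64(k))
		return buf
	default:
		panic("unsupported key kind")
	}
}

func main() {
	// SHA2-256 stands in for whichever multihash `hashAlg` identifies.
	for _, key := range []interface{}{[]byte{0x01, 0x02}, "hello", int64(-42)} {
		digest := sha256.Sum256(keyToBytes(key))
		fmt.Printf("%v -> %x\n", key, digest[:8])
	}
}
```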

## Algorithm in detail

### `Get(key)`

1. Set a `depth` value to `0`, indicating the root block
2. The `key` is hashed, using `hashAlg`, according to the data model hashing rules above.
3. Take the left-most `bitWidth` bits, offset by `depth x bitWidth`, from the hash to form an `index`. At each level of the data structure, we increment the section of bits we take from the hash so that the `index` comprises a different set of bits as we move down.
4. If the `index` bit in the node's `map` is `0`, we can be certain that the `key` does not exist in this data structure, so return an empty value (as appropriate for the implementation platform).
5. If the `index` bit in the node's `map` is `1`, the value may exist. Perform a `popcount()` on the `map` up to `index` such that we count the number of `1` bits up to the `index` bit-position. This gives us `dataIndex`, an index in the `data` array to look up the value or insert a new bucket.
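
A minimal sketch of steps 3 and 5, assuming (as these examples do throughout) that bits are read most-significant-first within each byte of the hash — implementations must agree on this ordering:

```go
// extractIndex takes the bitWidth-wide section of the hash at offset
// depth x bitWidth, reading bits left to right, to form an index.
func extractIndex(hash []byte, depth, bitWidth int) int {
	index := 0
	start := depth * bitWidth
	for i := start; i < start+bitWidth; i++ {
		bit := hash[i/8] >> (7 - i%8) & 1 // left-most bit of each byte first
		index = index<<1 | int(bit)
	}
	return index
}

// dataIndex performs a popcount over `map` up to (but not including) the
// index bit-position, yielding the entry's position in the data array.
func dataIndex(bitmap []byte, index int) int {
	count := 0
	for i := 0; i < index; i++ {
		if bitmap[i/8]>>(7-i%8)&1 == 1 {
			count++
		}
	}
	return count
}
```

With a `bitWidth` of `8`, step 3 reduces to taking byte `depth` of the hash as the `index`.
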
### `Set(key, value)`

1. Set a `depth` value to `0`, indicating the root block
2. The `key` is hashed, using `hashAlg`, according to the data model hashing rules above.
3. Take the left-most `bitWidth` bits, offset by `depth x bitWidth`, from the hash to form an `index`. At each level of the data structure, we increment the section of bits we take from the hash so that the `index` comprises a different set of bits as we move down.
4. If the `index` bit in the node's `map` is `0`, a new bucket needs to be created at the current node. If the `index` bit in the node's `map` is `1`, a value exists for this `index` in the node's `data` which may be a bucket (which may be full) or may be a link to a child node or an inline child node.
5. Perform a `popcount()` on the `map` up to `index` such that we count the number of `1` bits up to the `index` bit-position. This gives us `dataIndex`, an index in the `data` array to look up the value or insert a new bucket.
   3. Proceed to create new CIDs for the current block and each parent as per step **6.c**. until we have a new root block and its CID.
3. If the `dataIndex` element of `data` contains a bucket (array) and the bucket's size is `bucketSize`:
   1. Create a new empty node
   2. For each element of the bucket, perform a `Set(key, value)` on the new empty node with a `depth` set to `depth + 1`, proceeding from step **2**. This should create a new node with `bucketSize` elements distributed approximately evenly through its `data` array. This operation will only result in more than one new node being created if all `key`s being set have the same `bitWidth` bits of their hashes at `bitWidth` position `depth + 1` (and so on). A sufficiently random hash algorithm should prevent this from occurring.
   3. Create a CID for the new child node.
   4. Mutate the current node (create a copy)
   5. Replace `dataIndex` of `data` with a link to the new child node.
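
A sketch of the copy-on-write bookkeeping for steps 4 and 5 when a new bucket is created. The Go types here are illustrative in-memory shapes, not the serialized block format:

```go
// Illustrative in-memory types; the serialized form is defined by the schema.
type BucketEntry struct {
	Key   interface{} // the key in its original data model kind (bytes, string or int)
	Value interface{}
}

type Element struct {
	Link   interface{}   // when non-nil, a CID link to (or inline copy of) a child node
	Bucket []BucketEntry // otherwise, a bucket of up to bucketSize entries
}

// insertBucket returns copies of `map` and `data` with the index bit set and
// a new single-entry bucket spliced in at dataIndex. The original node is
// never mutated (copy-on-write).
func insertBucket(bitmap []byte, data []Element, index, dataIndex int, entry BucketEntry) ([]byte, []Element) {
	newMap := append([]byte(nil), bitmap...)
	newMap[index/8] |= 1 << (7 - index%8) // same bit ordering as the sketch above

	newData := make([]Element, 0, len(data)+1)
	newData = append(newData, data[:dataIndex]...)
	newData = append(newData, Element{Bucket: []BucketEntry{entry}})
	newData = append(newData, data[dataIndex:]...)
	return newMap, newData
}
```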

### `Delete(key)`

The deletion algorithm below is presented as an iterative operation. It can also be usefully conceived of as a recursive algorithm, which is particularly helpful in the case of node collapsing; see section "4.2 Deletion Algorithm" of the [CHAMP paper](https://michael.steindorfer.name/publications/oopsla15.pdf) for a description. Note that the linked paper does not make use of buckets, hence the importance of counting entries in a node and comparing them to `bucketSize` in the algorithm below.

1. Set a `depth` value to `0`, indicating the root block
2. The `key` is hashed, using `hashAlg`, according to the data model hashing rules above.
3. Take the left-most `bitWidth` bits, offset by `depth x bitWidth`, from the hash to form an `index`. At each level of the data structure, we increment the section of bits we take from the hash so that the `index` comprises a different set of bits as we move down.
4. If the `index` bit in the node's `map` is `0`, we can be certain that the `key` does not exist in this data structure, so there is no need to proceed.
5. If the `index` bit in the node's `map` is `1`, the value may exist. Perform a `popcount()` on the `map` up to `index` such that we count the number of `1` bits up to the `index` bit-position. This gives us `dataIndex`, an index in the `data` array to look up the value or insert a new bucket.
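
The collapse test implied above can be sketched as follows, reusing the illustrative types from the `Set` sketch. The `Node` shape and the exact collapse conditions are assumptions for illustration:

```go
// Node is an illustrative in-memory shape for a block: its bitmap and data array.
type Node struct {
	Map  []byte
	Data []Element
}

// shouldCollapse reports whether a non-root node's remaining entries, after a
// delete, could be folded back into a single bucket in its parent: no links
// to deeper children and no more than bucketSize entries in total.
func shouldCollapse(n Node, bucketSize int) bool {
	total := 0
	for _, e := range n.Data {
		if e.Bucket == nil {
			return false // a link to a deeper child prevents collapsing here
		}
		total += len(e.Bucket)
	}
	return total <= bucketSize
}
```
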
The storage order of entries in an IPLD HashMap is entirely dependent on the hash algorithm.

An implementation should only emit any given `key`, `value` or `key` / `value` entry pair once per iteration.

Each `key` should be returned to the user in its original data model kind by the `Keys()` and `Entries()` operations.
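
A sketch of hash-ordered iteration over entries, reusing the illustrative `Node` and `Element` types above; `loadNode` is an assumed helper that fetches and decodes a child block by its link:

```go
// entries walks the data array left to right, emitting bucket entries and
// descending into child nodes, so every entry is emitted exactly once. The
// resulting order is entirely determined by the key hashes.
func entries(n Node, loadNode func(link interface{}) Node, emit func(BucketEntry)) {
	for _, e := range n.Data {
		if e.Bucket != nil {
			for _, entry := range e.Bucket {
				emit(entry)
			}
		} else {
			entries(loadNode(e.Link), loadNode, emit)
		}
	}
}
```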

### Differences to CHAMP

This algorithm differs from CHAMP in the following ways:

## Use as a set

The IPLD HashMap can be repurposed as a "Set": a data structure that holds only unique `key`s. Every `value` in a `Set(key, value)` mutation is fixed to some trivial value, such as `true` or `1`. `Has(key)` operations are then simply a `Get(key)` operation that asserts that a value was returned.

## Use as a sparse array

A HashMap with integer keys can operate as an efficient sparse array **for non-iteration operations only**. Because hashing is performed on the byte-array form of integer keys, entries will be distributed as evenly as the hash algorithm affords.

The HashMap supports only hash-ordered iteration, which is meaningless in most circumstances. Where a sparse array must support ordered iteration, a HashMap is not suitable and another collection type should be chosen.

## Implementation defaults

Implementations need to ship with _sensible defaults_ and be able to create HashMaps without users requiring intimate knowledge of the algorithm and all of its trade-offs (although such knowledge will help in their optimal use).