Internet-Draft TID-I-D September 2024
Goldman & Caballero Expires 3 April 2025 [Page]
Workgroup:
Multiformats (proposed)
Internet-Draft:
draft-goldman-tid-latest
Published:
Intended Status:
Informational
Expires:
Authors:
A. Goldman
3box Labs
J. Caballero
learningProof UG

The Time-Ordered ID and base32lex Binary Encoding

Abstract

Developed for use in a high-bandwidth distributed social networking context, the TID is essentially a highly-compact UUIDv6 variant that optimizes for a few specific properties (most notably being sortable both bytewise and lexically) and fits into the space-efficient int64 type of languages that support it. One way it achieves these properties is by using a bespoke base-32 encoding alphabet rather than the similar base32hex encoding.

About This Document

This note is to be removed before publishing as an RFC.

Status information for this document may be found at https://datatracker.ietf.org/doc/draft-goldman-tid/.

Source for this draft and an issue tracker can be found at https://github.com/learningproof/tid-i-d.

Status of This Memo

This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

This Internet-Draft will expire on 3 April 2025.

Table of Contents

1. Introduction

The TID is essentially a UUIDv6 variant that optimizes for a few specific properties:

  1. sortable both bytewise and lexically when encoded with a base32 variant (also specified in this document);

  2. collision-resistant for up to 024 independent parallel timestamping services, with this set of 1024 broken up in 3 contiguous namespace to support various use-cases described below;

  3. based on microseconds since unix epoch to simplify translation to other timestamp formats;

  4. fits in an int64 for efficient storage, sorting and compute; and,

  5. works well across the type systems or most major compiled languages in use today for application-level development.

Many minor choices, such as the choice of code points in the alternate base32 encoding and the signed nature of the timestamp bytespace are primarily informed by cross-language ergonomics.

1.1. Base32lex encoding

The 32 code points chosen to encode from binary, in order, are:

value               1111111111222222222233
          01234567890123456789012345678901
encoding  234567abcdefghijklmnopqrstuvwxyz

This differs from the canonical base32 encoding defined in [RFC4648] in two ways: firstly, the letters are lowercase, and secondly, the letters and numbers are inverted in the mapping sequence, such that bytewise ordering of binary TIDs and lexical sorting of base-encoded TIDs will always achieve the same ordering.

value                1111111111222222222233
           01234567890123456789012345678901
base32lex  234567abcdefghijklmnopqrstuvwxyz
base32     ABCDEFGHIJKLMNOPQRSTUVWXYZ234567

2. Composing TIDs

The canonical form of a TID is a 64-byte string which takes this form:

     0                   1                   2                   3
     0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    |                         signed_microseconds_since_1970
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
                                                |   clock_id        |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

2.1. Timestamp Component

The form of timestamp used to generate a TID is the number of microseconds since the Unix epoch (1970-01-01T00:00:00+00:00), i.e. with three more digits than an [RFC5102] dateTimeMilliseconds. In cases where a timestamping service may be returning timestamps faster than 1000 times every millisecond, uniqueness should be favored over microsecond accuracy; i.e., the ID generator should return the current microsecond since epoch OR the last microsecond returned plus 1, whichever is greater.

The effective range of TIDs is limited by the compaction into int64 form and the 10 bytes used to encode the nodeId segment; for reasons that will be explained below, it is further limited by one byte to avoid some translation problems with the targeted encodings and type systems, leaving 53 bytes of signed space for a subset of unix microsecond timestamps. Effectively, this means that the range of microseconds before or after 1970, expressed as a signed integer, is not (-2^63+1) to (2^63-1), but (-2^53+1) to (2^53-1). The following tables shows the min, zero, and max values of the integer range of microseconds, expressed in the 11-codepoint base32lex encoding. The additional 2 codepoints for the nodeId segment, explained below, are omitted for clarity.

Table 1
tid microseconds valid? ISO timestamp
s222-222-2222 -9007199254740991 yes (min value) 1684-07-28T00:12:25
2222-222-2222 0 yes 1970-01-01T00:00:00
bzzz-zzz-zzzz 9007199254740991 yes (max value) 2255-06-05T23:47:34
zzzz-zzz-zzzz 18014398509481982 no (binary unsafe) 2540-11-07T23:35:09

Note that half the possible range of values encodable in 11 codepoints are considered invalid TIDs, as their binary form would not fit safely in an int64 bytestring. As the canonical form of TIDs is an int64 bytestring, the invalid half of the string-encodable range should not be mistaken for valid TIDs and software handling these TID should validate strings accordingly.

2.2. Node Identifier Component

The node identifier, by analogy to the equivalent element in an [RFC9562] UUIDv6, is a spatially-unique identifier, but occupying a much smaller space (10 bits, as opposed to UUIDv6's 48 bits). It is divided into three contiguous ranges. The first 32 values (0-31, i.e. "20" - "2z" base-encoded) are reserved for "best effort" collision-resistance TIDs. The bulk of the range, (32-991, i.e. "30" - "yz" base-encoded) is reserved for context-dependent use. The remaining 32 entries (992-1023, i.e. "z0" - "zz" base-encoded) are reserved for globally unique TIDs.

"Best effort" node identifiers can be generated without coordination or deferral to external authorities, but are considered likely to collide when merged with data from external sources.

Context-dependent node identifiers should be use in the context of a specific application where they can be derived stably from application context. The application developer should take steps to ensure the that in any given time range, no node identifiers are in simultaneous use by two different actors. No process is specified for coordinating leases of node identifiers to actors.

Globally-unique node identifiers should only be used after being registered globally. At time of writing, there is only one public TimeID service operating.

Table 2
ClockID URL public key from to
z0 http://ccn.bz/tid todo 2222-222-2222 ongoing

2.3. Base-Encoded String Expression

The TimeID concatenates the timestamp and the node identifier. The string format is 11 code points of timestamp and 2 code points of node identifier, displayed for readability with - segment dividers after the 4th, 7th, and 11th code points:

STTT-TTT-TTTT-CC

Where:

  • S represents a character with limited range, representing 3 bytes of timestamp

  • each T represents 5-bytes of the timestamp, and

  • each C represent a 5-byte character from the node identifier

The timestamp is expressed in a string-form TID as the first 11 codepoints, i.e. 55 bytes, but in the binary form as the first 54 bytes of an int64. Note that the range of times that fit into those 54 bytes is actually a little smaller than 0 +/- (2^55-1); namely, it is 0 +/- (2^53-1). This is to accomodate various type-system quirks in the targeted languages and encodings: firstly, unsigned integers are problematic for JSON encoding (particularly JSON tooling), requiring the "assumed byte" (the 55th byte hard-coded into the decoding process) to designate a positive or negative number. Similarly, the 54th byte has to "sign extend" that positive or negative sign to keep the total integer expressible in a sign-extended 63 bytes to accomodate a quirk of the Java type system that converts integers bigger than 63 bytes to float64s. Sacrificing these two bytes of range safeguards round-trip translation of these int64s to JSON or Java and back.

We can take the timestamp 2024-07-19T09:40:46.480310 as an example to show the process. This is 1721382046481000 microseconds since Unix epoch. On a node with identifer 01, this base-encodes to:

3kxn-lhr-3gxq-23
 | |   |    |  |
 | |   |    |  NodeID 0-1023
 | |   |    microsecond
 | |   second
 | 10 hours (9.54 hours)
 year (1.115 years)

When using TimeIDs in string form, "prefixes" can be used to mask ranges, i.e.

Table 3
start end tid
2024-07-19T09:40:46.480310 2024-07-19T09:40:46.480310 3kxn-lhr-3gxq
2024-07-19T09:40:46.434304 2024-07-19T09:40:47.482879 3kxn-lhr
2024-07-19T04:28:52.498432 2024-07-19T14:01:32.236799 3kxn
2023-07-08T13:57:40.263936 2024-08-18T19:23:52.352767 3k

3. Conventions and Definitions

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all capitals, as shown here.

4. Security Considerations

This minimal format makes very few guarantees and is built specifically for distributed systems without assuming coordinated time-keeping. The security implications of application-specific coordination of node identifiers are left as an exercise for the reader.

5. IANA Considerations

Currently, the registry for global TID services is maintained in a git repository maintained by the authors at github dot com. In the unlikely event that this this specification should become normative, a more formal registry governed according to IANA best current practice would be justified.

6. References

6.1. Normative References

[RFC2119]
Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, DOI 10.17487/RFC2119, , <https://www.rfc-editor.org/rfc/rfc2119>.
[RFC8174]
Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC 2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174, , <https://www.rfc-editor.org/rfc/rfc8174>.

6.2. Informative References

[RFC4648]
Josefsson, S., "The Base16, Base32, and Base64 Data Encodings", RFC 4648, DOI 10.17487/RFC4648, , <https://www.rfc-editor.org/rfc/rfc4648>.
[RFC9562]
Davis, K., Peabody, B., and P. Leach, "Universally Unique IDentifiers (UUIDs)", RFC 9562, DOI 10.17487/RFC9562, , <https://www.rfc-editor.org/rfc/rfc9562>.
[RFC5102]
Quittek, J., Bryant, S., Claise, B., Aitken, P., and J. Meyer, "Information Model for IP Flow Information Export", RFC 5102, DOI 10.17487/RFC5102, , <https://www.rfc-editor.org/rfc/rfc5102>.

Appendix A. Test Vectors

For clarity, we've cross-computed a UUIDv6 for the test vector used above, as well as computed a TID for the test vector given for UUIDv6 in [RFC9562], i.e. the 60-bit timestamp: 0x1EC9414C232AB00 (138648505420000000), i.e. Tuesday, February 22, 2022 2:22:22.000000 PM GMT-05:00.

Table 4
Timestamp milliseconds
2024-07-19T09:40:46.480310 1721382046481
2022-02-02T07:22:22.000000 1645557742000
Table 5
microseconds tid UUIDv6
1721382046481000 3kxn-lhr-3gxq todo
1645557742000000 3iso-34e-qpw2 1EC9414C-232A-6B00-B3C8-9F6BDECED846

Appendix B. Acknowledgments

TODO acknowledge.

Authors' Addresses

A. Goldman
3box Labs
J. Caballero
learningProof UG