es5 Data Format

This specification is now superseded by the Earthstar version 6 specification.

Document version: 2022-11-18.1

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119. "WILL" means the same as "SHALL".

Scope

This document describes:

The format of Earthstar documents
The cryptography used for identities and signatures
The procedures needed for running a single local replica: signing, ingesting, and verifying documents; ingesting and verifying attachments

This document may discuss, but does NOT prescribe, these topics which are still in development:

Network protocols
Algorithms for syncing documents between multiple replicas
The API of a specific Earthstar implementation

The specification for the previous format, es4, can be found here.

Libraries Needed to Implement Earthstar
Vocabulary and concepts
Data model
- Documents
- Ingesting Documents
- Attachments
- Ingesting Attachments
- Timestamps and Clock Skew
  - Trusting timestamps
Identities, Authors, Shares
Paths and Write Permissions
Documents and Their Fields
Wiping document contents
Document Serialization
Querying
Syncing
- Share Secrecy
  - Eavesdropping
- Resolving Conflicts
Future Directions

Libraries Needed to Implement Earthstar

To make your own Earthstar library, you'll need:

ed25519 Signatures

ed25519 is the same cryptography format used by Secure Scuttlebutt and hypercore.

base32 Encoding

Almost anywhere that binary data needs to be encoded in Earthstar, it's done with base32: public and private keys, signatures, hashes. The exception is binary document content which is base64 (see next section).

We use RFC4648 with lowercase letters and no padding. The character set is abcdefghijklmnopqrstuvwxyz234567.

Our encodings are always prefixed with an extra b character, following the multibase standard. The b format is the only format supported in Earthstar. Libraries MUST enforce that encoded strings start with b, and MUST NOT allow any other encoding formats.

Libraries MUST be strict when encoding and decoding — only allow lowercase characters; don't allow a 1 to be treated as an i.
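The encoding rules above can be sketched as a small strict codec. This is a minimal illustration, not the reference implementation: the names B32_ALPHABET, encodeBase32 and decodeBase32 are hypothetical, and rejecting non-zero trailing bits is an assumption about what "strict" should cover.

```typescript
// RFC 4648 base32 alphabet in lowercase, as given in this specification.
const B32_ALPHABET = "abcdefghijklmnopqrstuvwxyz234567";

// Encode bytes as unpadded lowercase base32 with the leading multibase "b".
function encodeBase32(bytes: Uint8Array): string {
  let out = "b";
  let buffer = 0; // bits read but not yet emitted
  let bitCount = 0;
  bytes.forEach((byte) => {
    buffer = (buffer << 8) | byte;
    bitCount += 8;
    while (bitCount >= 5) {
      bitCount -= 5;
      out += B32_ALPHABET.charAt((buffer >> bitCount) & 31);
    }
    buffer &= (1 << bitCount) - 1; // keep only the leftover bits
  });
  if (bitCount > 0) {
    // Pad the final group with zero bits on the right.
    out += B32_ALPHABET.charAt((buffer << (5 - bitCount)) & 31);
  }
  return out;
}

// Decode strictly: require the "b" prefix and reject any character outside
// the lowercase alphabet (so "1", "0", uppercase and "=" all fail).
function decodeBase32(encoded: string): Uint8Array {
  if (encoded.charAt(0) !== "b") {
    throw new Error("encoded string must start with 'b'");
  }
  const bytes: number[] = [];
  let buffer = 0;
  let bitCount = 0;
  for (let i = 1; i < encoded.length; i++) {
    const value = B32_ALPHABET.indexOf(encoded.charAt(i));
    if (value === -1) {
      throw new Error("invalid base32 character at index " + i);
    }
    buffer = (buffer << 5) | value;
    bitCount += 5;
    if (bitCount >= 8) {
      bitCount -= 8;
      bytes.push((buffer >> bitCount) & 255);
      buffer &= (1 << bitCount) - 1;
    }
  }
  // Assumption: leftover bits must be zero padding shorter than one character.
  if (bitCount >= 5 || buffer !== 0) {
    throw new Error("non-canonical trailing bits");
  }
  return new Uint8Array(bytes);
}
```

With this sketch, a 32-byte key encodes to 53 characters (the "b" plus 52 base32 characters), which matches the key lengths used elsewhere in this document.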

Why this encoding?

We want to use encoded data in URL locations, which can't contain upper-case characters, so base64 and base58 won't work.

base32 is shorter than base16 (hex).

The choice of a specific base32 variant was arbitrary and was influenced by the ones available in the multibase standard, which is widely implemented.

The leading b character serves two purposes: it defines which base32 format we're using, and it prevents encoded strings from starting with a digit. This makes it possible to use encoded strings as standards-compliant URL locations, as in earthstar://gardening.bajfoqa3joia3jao2df.

Indexed Storage

Earthstar messages are typically queried in a variety of ways. This is easiest to implement using a database like SQLite, but if you manage your own indexes you can also use a key-value database like leveldb.

Vocabulary and concepts

Library, Earthstar library — In this context, an implementation of Earthstar itself.

App — Software which uses Earthstar to store data.

Document — The unit of data storage in Earthstar, similar to a document in a NoSQL database. A document has metadata fields (author, timestamp, etc), a text field, and optionally fields describing an associated attachment.

Path — Similar to a key in leveldb or a path in a filesystem, each document is stored at a specific path.

Identity — A keypair which writes documents to a share. Identities are identified by an ed25519 public key in a format called an identity address. It's safe to use the identity keypair from multiple devices simultaneously.

Share — A collection of documents. Shares are identified by a share address, the public key of an ed25519 keypair. Shares are separate, unrelated worlds of data. Each document exists within exactly one share.

Format — The Earthstar document specification is versioned. Each version of the specification is called a document format, and the code that handles that format is called a formatter.

Peer — A device which holds Earthstar data and wishes to sync with other peers. Peers may, for example, be individual users' devices and/or replica servers. A peer may hold data from multiple shares.

Attachment — Arbitrary binary data which a document may refer to.

Replica server — A peer whose purpose is to provide uptime and connectivity for many users. Usually these are cloud servers with publicly routable IP addresses.

Data model

A share's data is a collection of documents authored by various identities, as well as any attachments those documents correspond to.

// Simplified example of data stored in a share

Share: "+gardening.bhyux4opeug2ieqcy36exrf4qymc56adwll4zeazm42oamxtr7heq"
  Path: "/wiki/shared/Flowers"
    Documents in this path:
      { author: @suzy.b..., timestamp: 1500094, text: 'pretty' }
      { author: @matt.b..., timestamp: 1500073, text: 'nice petals' }
      { author: @fern.b..., timestamp: 1500012, text: 'smell good' }
  Path: "/wiki/shared/Bugs"
    Documents in this path:
      { author: @suzy.b..., timestamp: 1503333, text: 'wiggly' }
  Path: "/audio/nightingale.mp3"
    Documents in this path:
      { author: @suzy.b..., timestamp: 1500094, text: "Nightingale's song (better recording)", attachmentSize: 1000203, attachmentHash: bk2xrjl90 }
      { author: @matt.b..., timestamp: 1500073, text: "Nightingale's song. Location unknown.", attachmentSize: 900384, attachmentHash: b84ahi8ah }
      { author: @fern.b..., timestamp: 1500012, text: "Nightingale's song", attachmentSize: 900384, attachmentHash: b84ahi8ah }
  Attachments:
    bk2xrjl90: <bytes>
    b84ahi8ah: <bytes>

A peer MAY hold data from many shares. Each share's data is treated independently. Each document within a share is also independent; they don't form a chain or feed or refer to each other (e.g. no Merkle backlinks).

Certain documents may have a corresponding attachment. For each known document the peer MAY also hold a corresponding attachment. A peer MUST NOT hold attachments with no corresponding documents.

Documents

A peer MAY hold some or all of the documents from a share, in any combination. Apps MUST assume that any combination of docs may be missing.

Each document in a share exists at a path. For each path, Earthstar keeps the newest document from each identity who has ever written to that path. "Newest" is determined by comparing the timestamp field in the documents. See the next section for details about trusting timestamps.

In the preceding example, the /wiki/shared/Flowers path contains 3 documents, because 3 different identities have written there. They may have written there hundreds of times, but we only keep the newest document from each identity, in that path.

When looking up a path to retrieve a document, the newest document MUST be returned by default. Apps can also query for the full set of document versions at a path; the older ones are called history documents.

Ingesting Documents

When a new document arrives and an existing older one is already there (from the same identity and at the same path), the new document overwrites the old one. Earthstar libraries MUST actually delete the older, overwritten document. The author's intent is to remove the old data.

The process of validating and potentially saving an incoming document is called ingesting, and it MUST happen to newly obtained documents, whether they come from other peers or are made as local writes. Earthstar libraries MUST use this ingestion process:

// Pseudocode

IngestDoc(newDoc):
    // Check doc validity — bad data types, bad signature,
    // expired ephemeral doc, wrong format string,
    // timestamp too far in the future, ...
    if !isValid(newDoc):
        return "rejected an invalid doc"

    // Check if it's obsolete
    // (Do we have a newer doc with same path and same identity?)
    let existingDoc = query({author: newDoc.author, path: newDoc.path});
    if existingDoc exists && existingDoc.timestamp >= newDoc.timestamp:
        return "ignored an obsolete doc"

    // Overwrite older doc with same path and same identity
    if existingDoc exists:
        remove(existingDoc)
    save(newDoc)
    return "accepted a doc"

Each document has a format field which specifies which data Format it is. The isValid function in this pseudocode represents a call to the Formatter which is responsible for enforcing the rules of that format.
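The ingestion procedure can be sketched against an in-memory store. This is an illustration only: MinimalDoc, slotKey, ingestDoc and latestAtPath are hypothetical names, the store is a plain Map with one slot per path and identity, and the isValid callback stands in for the formatter.

```typescript
// Hypothetical minimal shape: only the fields this sketch reads.
interface MinimalDoc {
  author: string;
  path: string;
  timestamp: number;
}

type IngestResult =
  | "rejected an invalid doc"
  | "ignored an obsolete doc"
  | "accepted a doc";

// One slot per (path, author) pair. A newline joins the key because
// neither field may contain one.
function slotKey(doc: MinimalDoc): string {
  return doc.path + "\n" + doc.author;
}

function ingestDoc(
  store: Map<string, MinimalDoc>,
  newDoc: MinimalDoc,
  isValid: (doc: MinimalDoc) => boolean,
): IngestResult {
  if (!isValid(newDoc)) {
    return "rejected an invalid doc";
  }
  const key = slotKey(newDoc);
  const existing = store.get(key);
  if (existing !== undefined && existing.timestamp >= newDoc.timestamp) {
    return "ignored an obsolete doc";
  }
  // Setting the slot replaces the older doc, so it is deleted rather than kept.
  store.set(key, newDoc);
  return "accepted a doc";
}

// The newest document at a path, across all identities that wrote there.
function latestAtPath(
  store: Map<string, MinimalDoc>,
  path: string,
): MinimalDoc | undefined {
  let latest: MinimalDoc | undefined;
  for (const doc of Array.from(store.values())) {
    if (doc.path !== path) continue;
    if (latest === undefined || doc.timestamp > latest.timestamp) {
      latest = doc;
    }
  }
  return latest;
}
```

Keying the store by path and identity makes the "newest document per identity per path" rule a property of the data structure instead of something each query has to enforce.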

Deleting old versions of documents may result in 'dangling' attachments without corresponding documents. Peers MUST delete these dangling attachments within an hour of their corresponding doc being deleted.

Attachments

A peer MAY hold some or all of the attachments in a share, provided it already has a corresponding document. Apps MUST assume that any combination of attachments may be absent.

Because many documents may refer to the same attachment (e.g. different history versions of a document at the same path), the peer SHOULD only store each individual attachment once.

Ingesting Attachments

Newly arrived attachments are only saved if the peer holds a document that refers to an attachment of matching size and hash.

The process of validating and potentially saving an incoming attachment is also called ingesting, and it MUST happen to newly obtained attachments, whether they come from other peers or are made as local writes. Earthstar libraries MUST use this ingestion process:

// Pseudocode

IngestAttachment(doc, attachment):
    // Check that the doc for this attachment is already ingested
    if !isIngested(doc):
        return "rejected an attachment we don't have the doc for"

    // Check if it's obsolete
    // (Do we already have an attachment with the same hash, for the same format?)
    let existingAttachment = query({hash: doc.attachmentHash, format: doc.format });
    if existingAttachment exists:
        return "already have this attachment"

    // Check attachment validity
    let attachmentHash = hash(attachment)
    let attachmentSize = size(attachment)

    if attachmentHash !== doc.attachmentHash || attachmentSize !== doc.attachmentSize:
        return "attachment does not match doc"

    // Persist the attachment
    save(attachment)
    return "persisted an attachment"
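A TypeScript sketch of the same procedure follows. It is illustrative only: AttachmentRef and ingestAttachment are hypothetical names, and the hash function and the "is this doc ingested" check are passed in as callbacks so the sketch does not assume any particular crypto library. Unlike the pseudocode, it keys stored attachments by hash alone rather than by hash and format.

```typescript
// Hypothetical minimal shape: the attachment fields of a document.
interface AttachmentRef {
  attachmentHash: string;
  attachmentSize: number;
}

type AttachmentResult =
  | "rejected an attachment we don't have the doc for"
  | "already have this attachment"
  | "attachment does not match doc"
  | "persisted an attachment";

function ingestAttachment(
  attachments: Map<string, Uint8Array>, // keyed by hash, so each is stored once
  doc: AttachmentRef,
  bytes: Uint8Array,
  isIngested: (doc: AttachmentRef) => boolean,
  hashBytes: (bytes: Uint8Array) => string,
): AttachmentResult {
  // The doc for this attachment must already be ingested.
  if (!isIngested(doc)) {
    return "rejected an attachment we don't have the doc for";
  }
  // Skip work if an attachment with this hash is already stored.
  if (attachments.has(doc.attachmentHash)) {
    return "already have this attachment";
  }
  // Both the hash and the size must match what the doc declares.
  if (
    hashBytes(bytes) !== doc.attachmentHash ||
    bytes.length !== doc.attachmentSize
  ) {
    return "attachment does not match doc";
  }
  attachments.set(doc.attachmentHash, bytes);
  return "persisted an attachment";
}
```

A real implementation would pass a function that computes the SHA-256 hash and encodes it as base32 with a leading b.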

Timestamps and Clock Skew

Earthstar uses timestamps to order documents, as integers can be ordered even when there are gaps (i.e. missing documents).

For this reason, peers MUST NOT save two documents at the same path with the same timestamp.

When a user sets a new document without specifying a timestamp themselves, the document MUST be written with a timestamp of whichever value is greater:

The current time in microseconds since the Unix epoch
The timestamp of the latest document at the same path, plus one

Or in pseudocode:

let timestamp = max(previouslyLatestDoc.timestamp + 1, nowInMicroseconds)
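As a small TypeScript sketch of the same rule (chooseTimestamp is a hypothetical name, and treating an empty path as "use the current time" is an assumption):

```typescript
// Pick the timestamp for a new document when the user did not supply one.
// All values are microseconds since the Unix epoch.
function chooseTimestamp(
  nowInMicroseconds: number,
  previouslyLatestTimestamp?: number,
): number {
  if (previouslyLatestTimestamp === undefined) {
    // Assumption: with no existing document at the path, use the clock.
    return nowInMicroseconds;
  }
  return Math.max(previouslyLatestTimestamp + 1, nowInMicroseconds);
}

// Date.now() returns milliseconds, so multiply by 1000 to get microseconds.
function nowInMicroseconds(): number {
  return Date.now() * 1000;
}
```

Adding one to the previous timestamp keeps the new document strictly newer even when the local clock is behind the latest document at that path.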

Trusting timestamps

Because peers write their own timestamps using their clocks or user input, there are several scenarios in which timestamps can be inaccurate:

1. A user writes a timestamp in the past.
2. A user writes a timestamp in the future (presumably to make their document take precedence).
3. A peer writes a timestamp of 2^53 - 2 to cause overflow problems.
4. A peer writes an inaccurate timestamp due to clock skew.

In order to mitigate scenarios 2, 3, and 4, peers MUST consider documents with timestamps further than ten minutes in the future invalid.
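A minimal sketch of the ten-minute rule, assuming timestamps in microseconds and a strict "further than" comparison (the names below are hypothetical):

```typescript
// Ten minutes expressed in microseconds.
const MAX_FUTURE_SKEW_MICROSECONDS = 10 * 60 * 1000 * 1000;

// True when a document's timestamp is further than ten minutes ahead of
// the local clock, in which case the document is currently invalid.
function isTooFarInFuture(
  docTimestamp: number,
  localNowMicroseconds: number,
): boolean {
  return docTimestamp > localNowMicroseconds + MAX_FUTURE_SKEW_MICROSECONDS;
}
```

Because the check depends on the local clock, a document rejected now may pass the same check later.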

Identities, Authors, Shares

Character Set Definitions

ALPHA_LOWER = any of "abcdefghijklmnopqrstuvwxyz"
ALPHA_UPPER = any of "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
DIGIT = any of "0123456789"

B32_CHAR = ALPHA_LOWER + any of "234567"
ALPHA_LOWER_OR_DIGIT = ALPHA_LOWER + DIGIT

ASCII = decimal character code 0 to 127 inclusive
      = hex character code 0x00 to 0x7F inclusive
        (no "extended ASCII" > 0x7F)

PRINTABLE_ASCII = ASCII characters " " to "~", inclusive
                = decimal character code 32 to 126 inclusive
                = hex character code 0x20 to 0x7E inclusive

In this document, everywhere we say ASCII we mean "standard ASCII and not extended ASCII". Only characters less than or equal to 0x7f, none higher.

SHARE_ADDRESS = "+" NAME "." B32_PUBKEY
NAME = one ALPHA_LOWER followed by 0 up to and including 14 ALPHA_LOWER_OR_DIGIT
B32_PUBKEY = "b" followed by 52 B32_CHAR

SHARE_SECRET = "b" followed by 52 B32_CHAR

Example:

address: +gardening.bhyux4opeug2ieqcy36exrf4qymc56adwll4zeazm42oamxtr7heq
secret: buaqth6jr5wkksnhdlpfi64cqcnjzfx3r6cssnfqdvitjmfygsk3q

A share address starts with + and is followed by a name, period ., and a public key.

It MUST have those four elements in that order.

Names are chosen by users when generating the share public address and secret. They cannot be changed later.
Public keys are 32-byte ed25519 public keys (just the integer portion, no wrapper or surrounding data structures), encoded as base32 with an extra leading "b". This results in 52 characters of base32 plus the "b", for a total of 53 characters.
Private keys (called "secrets") are also 32 bytes of binary data (just the secret integer), encoded as base32 in the same way as the public key.

The name:

MUST be 1 to 15 characters long, inclusive.
MUST only contain digits 0-9 and lowercase ASCII letters a-z
MUST NOT start with a digit

No uppercase letters are allowed.
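The share address rules can be checked with a single regular expression. This is a sketch under the 1 to 15 character name rule stated above; SHARE_ADDRESS_PATTERN and parseShareAddress are hypothetical names, and the check is purely syntactic (it does not verify that the key is a usable ed25519 public key).

```typescript
// "+" NAME "." B32_PUBKEY, where NAME is 1 to 15 characters and does not
// start with a digit, and B32_PUBKEY is "b" plus 52 base32 characters.
const SHARE_ADDRESS_PATTERN = /^\+[a-z][a-z0-9]{0,14}\.b[a-z2-7]{52}$/;

interface ParsedShareAddress {
  name: string;
  pubkey: string;
}

// Returns the parts of a share address, or null if it is malformed.
function parseShareAddress(address: string): ParsedShareAddress | null {
  if (!SHARE_ADDRESS_PATTERN.test(address)) {
    return null;
  }
  // The name cannot contain ".", so the first "." is the separator.
  const dot = address.indexOf(".");
  return {
    name: address.slice(1, dot),
    pubkey: address.slice(dot + 1),
  };
}
```

Removing the leading + from a valid address leaves a string that starts with a lowercase letter, which is what lets it serve as a URL location.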

Why these rules?

These rules allow share addresses to be used as the location part of regular URLs, after removing the +.

Note that anyone can instantiate a replica for a share if they know its full share address, so it's important to keep share addresses secret if you want to limit their audience. Write access to the share is granted by the share secret.

Identity Addresses

IDENTITY_ADDRESS = "@" SHORTNAME "." B32_PUBKEY
SHORTNAME = one ALPHA_LOWER followed by three ALPHA_LOWER_OR_DIGIT
B32_PUBKEY = "b" followed by 52 B32_CHAR

AUTHOR_SECRET = "b" followed by 52 B32_CHAR

Examples

address: @suzy.bo5sotcncvkr7p4c3lnexxpb4hjqi5tcxcov5b4irbnnz2teoifua
secret: becvcwa5dp6kbmjvjs26pe76xxbgjn3yw4cqzl42jqjujob7mk4xq

address: @js80.bnkivt7pdzydgjagu4ooltwmhyoolgidv6iqrnlh5dc7duiuywbfq
secret: b4p3qioleiepi5a6iaalf6pm3qhgapkftxnxcszjwa352qr6gempa

An identity address starts with @ and combines a shortname with a public key.

Shortnames are chosen by users when creating an identity keypair. They cannot be changed later. They are exactly 4 lowercase ASCII letters or digits, and cannot start with a digit.
Public keys are 32-byte ed25519 public keys (just the integer portion, no wrapper or surrounding data structures), encoded as base32 with an extra leading "b". This results in 52 characters of base32 plus the "b", for a total of 53 characters.
Private keys (called "secrets") are also 32 bytes of binary data (just the secret integer), encoded as base32 in the same way as the public key.

Apps MUST treat identities as separate and distinct when their addresses differ, even if only the shortname is different and the pubkeys are the same.
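A syntactic check for identity addresses, plus the comparison rule above, might look like the following sketch. IDENTITY_ADDRESS_PATTERN, parseIdentityAddress and isSameIdentity are hypothetical names, and the check does not verify that the key is a usable ed25519 public key.

```typescript
// "@" SHORTNAME "." B32_PUBKEY, where SHORTNAME is exactly 4 characters and
// does not start with a digit.
const IDENTITY_ADDRESS_PATTERN = /^@[a-z][a-z0-9]{3}\.b[a-z2-7]{52}$/;

interface ParsedIdentityAddress {
  shortname: string;
  pubkey: string;
}

// Returns the parts of an identity address, or null if it is malformed.
function parseIdentityAddress(address: string): ParsedIdentityAddress | null {
  if (!IDENTITY_ADDRESS_PATTERN.test(address)) {
    return null;
  }
  // "@" is index 0, the shortname is indexes 1 to 4, "." is index 5.
  return {
    shortname: address.slice(1, 5),
    pubkey: address.slice(6),
  };
}

// Identities are the same only when the whole address matches, so the same
// pubkey under two shortnames counts as two identities.
function isSameIdentity(a: string, b: string): boolean {
  return a === b;
}
```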

Note that identities also have Unicode display names stored in their profile documents, and those can be changed and allow more freedom of expression. See the next section.

FAQ: Identity Shortnames

Why shortnames?

Impersonation is a difficult problem in distributed social networks where account identifiers can't be both unique and memorable. Users have to vigilantly check for imposters. Typically apps will treat following relationships as trust signals, displaying the accounts of people you follow in a different way to help you avoid imposters.

Shortnames make user identifiers "somewhat memorable" to defend against impersonation.

For example: In Scuttlebutt, users are identified by a bare public key and their display names are mutable.

A user could create an account with a display name of "Cat Pictures" and get many followers. They could then change the display name to match another user that they wish to impersonate. Anyone who previously followed "Cat Pictures" is still following the account under the new name, causing the account to appear trustworthy in the app's UI. Users decided to trust the account in one context (to provide cat pictures) but after trust was granted, the account changed context (to impersonate a friend).

For example, let's say an app shows "✅" when you're following an account. "✅ Cat Pictures @3hj29dhj..." renames itself to "✅ Samantha @3hj29dhj...", which is hard to tell apart from your actual friend "✅ Samantha @9c2j392hx...".

Adding an immutable shortname to the identity address makes this attack more difficult. Users can now notice when a display name differs from what they expect.

For example "✅ Cat Pictures @cats.3hj29dhj..." renames itself to "✅ Samantha @cats.3hj29dhj...", which is easier to tell apart from your actual friend "✅ Samantha @samm.9c2j392hx...".

Of course the attacker could choose to start off as "✅ Cat Pictures @samm.3hj29dhj...". Users are expected to notice this as a suspicious situation when following an account.

Why are shortnames 4 characters?

Shortnames need to be long enough that they can express a clear relationship to the real identity of the account.

They need to be short enough for users to intuitively understand that they are non-unique.

Why limit shortnames to ASCII?

Users would be better served if they could use their native language in shortnames, but this creates potential vulnerabilities from Unicode normalization.

This usability shortfall is limited because shortnames don't need to be very expressive; users can use Unicode in the display name in their profile.

What if users want to change their shortnames?

Users can change their display names freely but their shortnames are fixed. Modifying the shortname effectively creates a new identity and the user's followers will not automatically follow the new identity.

Humane software must allow users to change their names. (See Falsehoods programmers believe about names). Choosing and changing your own name is a basic human right.

Software should also help users avoid impersonation attacks, a common harassment technique which can be quite destructive. Earthstar attempts to find a reasonable trade-off between these competing needs in the difficult context of a distributed system with no central name authority.

Users who anticipate name changes, or dislike the permanence of shortnames, can choose shortnames which are memorable but non-meaningful, like zzzz or oooo.

Can users create two identities with the same pubkey but different shortnames?

Yes. They are considered two distinct identities, although you can infer that they belong to the same person.

Identity Display Names and Profile Info

An identity can have a profile containing their display name, biographic information, etc. Profile data is stored in the content of a variety of documents under /about/:

displayNamePath = "/about/~" + identityAddress + "/displayName.txt"

Example:
/about/~@suzy.bo5sotcncvkr7p4c3lnexxpb4hjqi5tcxcov5b4irbnnz2teoifua/displayName.txt

Display names stored in profile information can be changed frequently and can contain Unicode.

The expected paths and format of the profile documents are described in our wiki at Standard paths and data formats used by apps. They are not part of this lower-level specification.

We may add more standard pieces of profile information later, such as following and blocking of other users, a paragraph about yourself, a user icon, etc, but this is not standardized yet.

However, apps SHOULD consider the /about/ namespace to be a standardizable area and be extra thoughtful about what they write there.

Why "about"?

Secure Scuttlebutt uses "about" messages to describe people's profile information, and we've adopted that vocabulary.

Also, "about" comes towards the beginning of the alphabet, so if peers sync their documents in alphabetical order by path (which may or may not happen), the /about/ data will be some of the first data synced.

Paths and Write Permissions

Paths

Similar to a key in leveldb or a path in a filesystem, each document is stored at a specific path.

Rules:

// note that double quote is not included,
// it's just part of our notation in this specification
PATH_PUNCTUATION = any of "/'()-._~!$&+,:=@%"

PATH_CHARACTER = ALPHA_LOWER + ALPHA_UPPER + DIGIT + PATH_PUNCTUATION

PATH_SEGMENT = "/" + one or more PATH_CHARACTER
PATH = one or more PATH_SEGMENT

A path MUST be between 2 and 512 characters long (inclusive).
A path MUST begin with a /
A path MUST NOT end with a /
A path MUST NOT begin with /@, but it may contain /@ in the middle.
A path MUST NOT contain // (because each PATH_SEGMENT must have at least one PATH_CHARACTER)
Paths are case sensitive.
Paths MAY contain upper and/or lower case ASCII letters plus the punctuation and numbers described above.
Paths MUST NOT contain any characters except those listed above. To include other characters such as spaces, double quotes, emojis, or other non-ASCII characters, apps SHOULD use URL-style percent-encoding as defined in RFC3986. First encode the string as utf-8, then percent-encode the utf-8 bytes.
A path MUST contain one or more ! characters, anywhere, IF AND ONLY IF the document is ephemeral (because deleteAfter is non-null). See the section on Ephemeral Documents.

In the following examples, ... is used to shorten identity addresses for easier reading. ... is not actually related to the Path specification.

Example paths

Valid:
    /todos/123
    /wiki/shared/Dolphins
    /wiki/shared/Dolphin%20Sounds.mp3
    /about/~@suzy.bo5sotcn...fua/bio
    /wall/@suzy.bo5sotcn...fua/post123.md

Invalid: path segment must have one or more path characters
    /

Invalid: missing leading slash
    todos/123.json

Invalid: starts with "/@"
    /@suzy.bo5sotcn...fua/profile.json
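The path rules above can be sketched as a validator. This is a minimal illustration with hypothetical names (PATH_CHARACTERS_PATTERN, isValidPath); the caller is assumed to pass isEphemeral as true exactly when the document's deleteAfter field is set.

```typescript
// Every allowed character: letters, digits, "/" and the path punctuation.
const PATH_CHARACTERS_PATTERN = /^[A-Za-z0-9\/'()\-._~!$&+,:=@%]+$/;

function isValidPath(path: string, isEphemeral: boolean): boolean {
  // Between 2 and 512 characters, inclusive.
  if (path.length < 2 || path.length > 512) return false;
  // Only the allowed characters.
  if (!PATH_CHARACTERS_PATTERN.test(path)) return false;
  // Must begin with "/" and must not end with "/".
  if (path.charAt(0) !== "/") return false;
  if (path.charAt(path.length - 1) === "/") return false;
  // Must not begin with "/@" (it may appear later in the path).
  if (path.indexOf("/@") === 0) return false;
  // No empty segments.
  if (path.indexOf("//") !== -1) return false;
  // "!" appears if and only if the document is ephemeral.
  const containsBang = path.indexOf("!") !== -1;
  return containsBang === isEphemeral;
}
```

Checking each rule separately keeps the validator easy to compare against the list, at the cost of a few more lines than one combined regular expression.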

Why these specific punctuation characters?

Earthstar paths are designed to work well in the path portion of a regular web URL.

Why can't a path start with /@?

When building web URLs out of Earthstar pieces, we may want to use formats like this:
https://mypub.com/SHARE/PATH_OR_AUTHOR

https://mypub.com/+gardening.friends/wiki/Dolphins
https://mypub.com/+gardening.friends/@suzy.bo5sotcncvkr7...  (etc)
The restriction on /@ allows us to tell paths and identity addresses apart in this setting. It also encourages app authors to put their data in a more organized top-level prefix such as /wiki/ instead of putting each identity at the root of the path.

Another solution was to use a double slash // to begin paths and avoid confusion with identities:
Don't do this:
https://mypub.com/+gardening.bhyux4opeug2ieqcy36exrf4qymc56adwll4zeazm42oamxtr7heq//wiki/Dolphins
                                                                                  ^
...but some webservers treat this as user error and rewrite the double slash to a single slash. So we have to carefully avoid the double slash when building URLs.

Path Characters With Special Meaning

/ - starts paths; separates path segments
! - used if and only if the document is ephemeral
~ - (tilde) defines author write permissions
% - for percent-encoding other characters
+@. - used in share and author addresses but allowed elsewhere too

Path Patterns with Special Meaning

A path ending with a file extension denotes that the document has an attachment.

Disallowed Path Characters

The list of ALLOWED characters up above is canonical and exhaustive. This list of disallowed characters is provided only for convenience and is non-normative if it accidentally conflicts with the allowed list.

See the source code src/util/characters.ts for longer notes.

Character - reason for being disallowed

space - not allowed in URLs
ASCII whitespace (tab, etc) - not allowed in URLs
ASCII control characters (bell, etc) - not allowed in URLs, and not visible
<>"[\]^{|} - not allowed in URLs. {} MAY be used for path templates (see below)
* - MAY be used for glob-style querying (see below)
` backtick - not allowed in URLs
? - to avoid confusion with URL query parameters
# - to avoid confusion with URL anchors
; - useful for separating several paths while still being legal in URLs
non-ASCII chars - (above 0x7F) to avoid trouble with Unicode normalization and canonicalization for signatures, and phishing attacks

Path templates and glob-style querying

Earthstar libraries MAY offer extra ways of querying paths that use the {}*? characters. This is not standardized, but those characters are available because they're not allowed in normal paths.

Example: you might be able to query for /blog/v1/{category}/{postId}.json and get back matching documents with the category and postId extracted into variables for you, similar to the way URL routes are specified in libraries like Express.

Example: you might be able to do "glob-style" queries like /blog/v1/**/*.json.

Side note: The ASCII range of allowed path characters

When handling path strings, you may find yourself needing to choose a separator character that will lexicographically sort before or after all allowed paths.

If you're handling entire paths, this is easy, because all legal paths start with /.

If you're handling path segments (the parts between slashes), the range is wider:

Amongst the allowed path characters, the lowest ASCII value is exclamation mark ! and the highest ASCII value is tilde ~. (Of course, not all ASCII values between those extremes are allowed.)

Therefore if you need an ASCII value that's lower than any possible path segment, anything less than or equal to space (0x20, decimal 32) will do. And the only ASCII value higher than all path characters is DEL (0x7F, decimal 127). Only standard ASCII values are allowed in paths, so there's nothing higher than DEL.

Write Permissions and Path Ownership

Paths can encode information about which identities are allowed to write to them. Documents that break these rules are invalid and will be ignored.

A path is shared if it contains no ~ (tilde) characters. Any author can write a document to a shared path.

A path is owned if it contains at least one ~. An author address immediately following a ~ is allowed to write to this path. Multiple authors can be listed, each preceded by their own ~, anywhere in the path. The author address must begin with its usual leading @.

Example shared paths:

Anyone can write here:
/todos/123

Anyone can write here because there's no tilde "~"
/wall/@suzy.bo5sotcncvkr7p4c3lnexxpb4hjqi5tcxcov5b4irbnnz2teoifua/info.txt

Example owned paths:

Only suzy can write here:
/about/~@suzy.bo5sotcncvkr7p4c3lnexxpb4hjqi5tcxcov5b4irbnnz2teoifua/displayName.txt

Suzy and matt can write here, and nobody else can:
/chat/~@suzy.bo5sotcncvkr7p4c3lnexxpb4hjqi5tcxcov5b4irbnnz2teoifua~@matt.bwnhvniwd3agqclyxl4lirbf3qpfrzq7lnkzvfelg4afexcodz27a/messages.json

The following path can't be written by anyone. It's owned because it contains a tilde ~, but an owner is not specified. Even though the tilde appears without a @ following it, it still acts as a marker of an owned path:

/nobody/can/ever/write/this/path/~

The tilde + identity address pattern can occur anywhere in the path: beginning, middle or end.

Note that documents are mutable but their path can never change (or it would be a different document!) so the ownership of a particular path/document is permanent. You can't change the ownership of a document; you have to create a new document at a different path.
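The write permission rule can be sketched as a substring check. This is an illustration, with canWrite as a hypothetical name; it relies on identity addresses all having the same length, so a match on "~" followed by the full address cannot be a prefix of some other identity's address.

```typescript
// Decide whether an identity may write to a path.
function canWrite(identityAddress: string, path: string): boolean {
  // Shared path: no tilde anywhere, so anyone may write.
  if (path.indexOf("~") === -1) {
    return true;
  }
  // Owned path: the identity must appear immediately after some tilde.
  return path.indexOf("~" + identityAddress) !== -1;
}
```

A path such as /nobody/can/ever/write/this/path/~ contains a tilde but no address after it, so the check fails for every identity.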

File Extensions

Documents may have arbitrary binary bytes associated with them, referred to as attachments. If a document is associated with an attachment, it MUST have a file extension.

The file extension is there to help applications know how to interpret attachment data. For example, documents with JPEG image attachments SHOULD have paths ending in .jpg, and documents with MP3 audio attachments SHOULD have paths ending in .mp3.

The file extension MUST be positioned at the end of the path to indicate a document with an attachment:

Valid path for a document with an attachment:

/images/squirrel.png

Invalid path for a document with an attachment:
/images.png/squirrel

A path ending with an identity address MUST NOT be interpreted as a document with an attachment:

/info/~@suzy.bo5sotcncvkr7p4c3lnexxpb4hjqi5tcxcov5b4irbnnz2teoifua

Documents without attachments MUST NOT use the filename extension to indicate how to interpret their contents.

Invalid paths for documents without attachments:
/todos/123.json
/blog/post.md

Valid paths for documents without attachments:
/todos/123
/blog/post

Instead, paths for documents without attachments SHOULD enclose the extension that describes the text contents in parentheses, omitting the leading period:

/todos/123(json)
/blog/post(md)

Path and Filename Conventions

Multiple apps can put data in the same share. Here are guidelines to help them interoperate:

The first path segment SHOULD be a description of the data type OR the application that will read/write it. Examples: /wiki/, /chess/, /chat/, /posts/, /earthstagram/, /sillywiki/.

Why?

Peers can selectively sync only certain documents. Starting a path with a descriptive name like /wiki/ makes it easy to sync only wiki documents and ignore the rest. It also lets apps avoid accidentally reading or writing documents from other apps.

Sometimes this first path segment will represent a data type that many apps will support; sometimes it will be named after a specific app.

Consider including a version number in the path representing the version of the data format, like /wiki-v1/ or /wiki/v1/.

Consider choosing a unique name for your app's data, like /magic-todo-list/ instead of an obvious choice like /todos/, to avoid accidental collision with other apps you might not even know about.

Documents and Their Fields

This example document is shown as JSON though it can exist in many serialization formats:

{
  author: "@suzy.bce576gvty3ecz5unzynwqwutjzqe6bvhcujec2mimz7n2o5ilkfa",
  text: "Flowers are pretty",
  textHash: "bt3u7gxpvbrsztsm4ndq3ffwlrtnwgtrctlq4352onab2oys56vhq",
  format: "es.5",
  path: "/wiki/shared/Flowers",
  timestamp: 1668780332430000,
  signature: "bjodtzvedk7cgqdngjt4zdj3ufvlsmqc363jht2ygftf73rb6a2huexi6vlopdk6pihijyuv643c3olxardyy2iqzgncj7mss5hqqsdq",
  share: "+gardening.bhyux4opeug2ieqcy36exrf4qymc56adwll4zeazm42oamxtr7heq",
  shareSignature: "bf5ifu7jjuxbsmyyxq2wcxmylydtdkwaz3h32zedznezlu4icflg3xwrqtds5ooilavr5zfoyasd6lfdccfyet2wegxmhuvwmjwot6dq",
}

Document schema in Typescript:

interface Doc {
  author: string; // an author address
  text: string; // an arbitrary string of utf-8
  textHash: string; // sha256(text) encoded as base32 with a leading 'b'
  
  format: "es.5"; // the format version that this document adheres to.
  path: string; // a path
  signature: string; // ed25519 signature of encoded document, signed by author
  timestamp: number; // integer.  when the document was created
  share: string; // a share address
  shareSignature: string; // ed25519 signature of encoded document, signed by share
  
  // optional fields
  
  deleteAfter?: number; // integer.  when the document expires.  absent for non-expiring documents.
  attachmentSize?: number; // integer. The size of the document's attachment in bytes. Absent for docs without attachments.
  attachmentHash?: string; // string. The Sha256 hash of the document's attachment. Absent for docs without attachments.
}

Here we use the words "fields" and "properties" to mean the same thing.

The fields above are called the "core fields". All core fields are REQUIRED. Some core fields may be null; these MUST NOT be omitted; they MUST be explicitly set to null if they are null.

Extra fields are FORBIDDEN as part of this core document schema.

All string fields MUST be limited to PRINTABLE_ASCII characters, except for text, which is utf-8. String fields can be null if specified above. PRINTABLE_ASCII is defined earlier, and notably does not contain newline or tab characters, which are reserved for use in the serialization format we use for hashing and signing.

All number fields MUST be integers, and cannot be NaN or Infinity, but they can be null if specified above.

The order of fields is unspecified except for hashing and signing purposes (see section below). For consistency, the recommended canonical order is sorted lexicographically by field name.

Document Validity

A document MUST be valid in order to be ingested into a replica, whether from a local write, or a sync, or anywhere else.

Invalid documents MUST be individually ignored when peers are syncing, and the sync MUST NOT be halted just because an invalid document was encountered. Continue syncing in case there are more valid documents.

Documents can be temporarily invalid depending on their timestamp and the current wall clock. Next time a sync occurs, maybe some of the invalid documents will have become valid by that time.

To be valid a document MUST pass ALL these rules, which are described in more detail in the following sections:

author is a valid identity address string
text is a compliant string holding utf-8 data
textHash is the sha256 hash of the text, encoded as base32 with a leading b, for a total length of 53 characters.
timestamp is an integer between 10^13 and 2^53-2, inclusive
deleteAfter is absent, or is a timestamp integer in the same range as timestamp
format is a string of printable ASCII characters
path is a valid path string
signature and shareSignature are each a base32 string with a leading b. For the es.5 format each must be 104 characters long including the b.
share is a valid share address string which matches the local share we are intending to write the document to
attachmentSize is an integer between 0 and 2^53-2 inclusive, or absent
attachmentHash is a sha256 hash of the associated attachment, encoded as base32 with a leading b for a total length of 53 characters, or absent
No extra fields.
No missing fields (unless the fields may be absent)
Additional rules about timestamp and deleteAfter relative to the current wall clock (see below)
author has write permission to path based on tilde placement
signature is cryptographically valid
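
The length rules for hashes and signatures in the list above can be checked with simple patterns. This is a minimal sketch, assuming the lowercase RFC4648 base32 alphabet with a leading b described earlier. The names isEncodedHash and isEncodedSignature are hypothetical, and the checks cover only the shape of the string, not cryptographic validity.

```typescript
// Earthstar base32: a leading "b" followed by characters from a-z and 2-7.
// A sha256 digest is 32 bytes, which is 52 base32 characters, 53 with the "b".
const ENCODED_HASH = /^b[a-z2-7]{52}$/;

// An ed25519 signature is 64 bytes, which is 103 base32 characters, 104 with the "b".
const ENCODED_SIGNATURE = /^b[a-z2-7]{103}$/;

// True if the string has the shape of a textHash or attachmentHash.
function isEncodedHash(s: string): boolean {
  return ENCODED_HASH.test(s);
}

// True if the string has the shape of a signature or shareSignature.
function isEncodedSignature(s: string): boolean {
  return ENCODED_SIGNATURE.test(s);
}
```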

Author

The author field holds an identity address, formatted according to the rules described earlier in Author Addresses.

Text

The text field contains arbitrary utf-8 encoded data. If the data is not valid utf-8, the document is still considered valid but the library's behavior is undefined when trying to access the content.

The text field may be an empty string. In fact, the recommended way to remove data from Earthstar is to overwrite the document with a new one which has text = "". See Wiping Document Contents for more information.

The maximum length of the text field is eight thousand bytes (8,000 bytes). This is measured as "bytes when encoded as utf-8", not naive string length. (This means the overall document, when encoded as JSON, can be slightly larger than 8,000 bytes - the rest of the fields add about 450 bytes more.)
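
A short sketch of the difference between string length and utf-8 byte length. The helper names utf8ByteLength and textFits are hypothetical; TextEncoder is the standard Web API, available as a global in browsers, Node, and Deno.

```typescript
const MAX_TEXT_BYTES = 8000;

// Length of `text` in bytes once it is encoded as utf-8.
function utf8ByteLength(text: string): number {
  return new TextEncoder().encode(text).length;
}

// True if the text is within the 8,000 byte limit.
function textFits(text: string): boolean {
  return utf8ByteLength(text) <= MAX_TEXT_BYTES;
}

const sample = "h\u00e9llo"; // "héllo"
console.log(sample.length);          // 5 (naive string length)
console.log(utf8ByteLength(sample)); // 6 (the "é" takes two bytes in utf-8)
```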

When a document has an associated attachment, the text field's contents MUST have a length greater than zero and SHOULD be used to describe the contents of the attachment. This description can be used by peers to evaluate the content of an attachment before downloading it.

When storing formatted text (e.g. JSON or Markdown) apps SHOULD enclose the file extension in parentheses at the end of the path, e.g. /todos/123(json). If the document has an attachment and formatted text, the pathname SHOULD NOT include this information, in order to prioritize the attachment's file extension instead.

Text Hash

The textHash is the sha256 hash of the text data. The hash digest is then encoded from binary to base32 following the usual Earthstar format, with a leading b.

Note that hash digests are usually encoded in hex format, but we use base32 instead to be consistent with the rest of Earthstar's encodings.

Wrong: binary hash digest —> hex encoded string —> base32 encoded string

Correct: binary hash digest —> base32 encoded string

Also be careful not to accidentally change the text string to a different encoding (such as utf-16) before hashing it: hash the utf-8 bytes.
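
The following sketch shows one way to compute a textHash under these rules. It assumes Node's built-in crypto module for sha256; the names base32encode and computeTextHash are hypothetical and not taken from any Earthstar library.

```typescript
import { createHash } from "node:crypto";

const BASE32_ALPHABET = "abcdefghijklmnopqrstuvwxyz234567";

// Encode raw bytes as lowercase, unpadded RFC4648 base32 with a leading "b".
function base32encode(bytes: Uint8Array): string {
  let out = "b";
  let value = 0; // bit buffer
  let bits = 0;  // number of unread bits held in the buffer
  for (let i = 0; i < bytes.length; i++) {
    value = (value << 8) | bytes[i];
    bits += 8;
    while (bits >= 5) {
      out += BASE32_ALPHABET.charAt((value >>> (bits - 5)) & 31);
      bits -= 5;
    }
    value &= (1 << bits) - 1; // keep only the leftover bits
  }
  if (bits > 0) {
    // Pad the final group with zero bits on the right.
    out += BASE32_ALPHABET.charAt((value << (5 - bits)) & 31);
  }
  return out;
}

// sha256 over the utf-8 bytes of the text, then straight to base32 (no hex step).
function computeTextHash(text: string): string {
  const digest = createHash("sha256").update(text, "utf8").digest();
  return base32encode(digest);
}

console.log(computeTextHash("")); // b4oymiquy7qobjgx36tejs35zeqt24qpemsnzgtfeswmrw6csxbkq
```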

Format

The format is a short string describing which version of the Earthstar specification to use when validating and interpreting the document. It's like a schema name for the core Earthstar document format.

It MUST consist only of PRINTABLE_ASCII characters.

The current format version is es.5 ("es" is short for Earthstar.)

If the specification is changed in a way that breaks forwards or backwards compatibility, the format version MUST be incremented. The version number SHOULD be a single integer, not a semver.

Other format families may someday exist, such as a hypothetical ssb.1 which would embed Scuttlebutt messages in Earthstar documents, with special rules for validating the original embedded Scuttlebutt signatures as part of validating the document.

Formatter Responsibilities

Earthstar libraries SHOULD separate out the code related to each format, so that they can handle old and new documents side-by-side. Code for handling a format version is called a Formatter. Formatters are responsible for:

Hashing documents
Generating new documents (and possibly attachments) from a given input
Wiping user content from documents (i.e. text and attachments)
Checking document validity when ingesting documents. See the Document Validity section for more info.

Therefore each different format can have different ways of generating, hashing, signing, and validating documents.

Path

The path field is a string following the rules described in Paths.

The document is invalid if the author does not have permission to write to the path, following the rules described in Write Permissions and Path Ownership.

The path MUST contain at least one ! character, anywhere, IF AND ONLY IF the document is ephemeral (has non-absent deleteAfter).

The path MUST end in a file extension (.something), at the end of path, IF AND ONLY IF the document has an attachment (has non-absent attachmentHash and attachmentSize fields). The last portion of an identity's public address MUST NOT be interpreted as a file extension.

Timestamp

Timestamps are integer microseconds (millionths of a second) since the Unix epoch.

Note this is NOT the default format used by Javascript, which uses milliseconds (thousandths of a second).

// Earthstar timestamps in javascript
let timestamp = Date.now() * 1000;

# Earthstar timestamps in python
timestamp = int(time.time() * 1000 * 1000)

Timestamps MUST be within the following range (inclusive):

// 10^13
let MIN_TIMESTAMP = 10000000000000;

// 2^53 - 2  (Javascript's Number.MAX_SAFE_INTEGER - 1)
let MAX_TIMESTAMP = 9007199254740990;

let timestampIsValid = MIN_TIMESTAMP <= timestamp && timestamp <= MAX_TIMESTAMP;

Why this specific range?

The min timestamp is chosen to reject timestamps that were accidentally computed in milliseconds or seconds.

The max timestamp is one less than the largest safe integer that Javascript can represent (Number.MAX_SAFE_INTEGER - 1).

The range of valid times is approximately 1970-04-26 to 2255-06-05.
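
As a worked check of that range, the bounds can be converted to calendar dates. This sketch uses a hypothetical helper, microsecondsToIso, and the timestamp from the example document earlier in this section.

```typescript
const MIN_TIMESTAMP = 10000000000000;   // 10^13 microseconds
const MAX_TIMESTAMP = 9007199254740990; // 2^53 - 2 microseconds

// Date works in milliseconds, so divide microseconds by 1000.
function microsecondsToIso(timestamp: number): string {
  return new Date(Math.floor(timestamp / 1000)).toISOString();
}

console.log(microsecondsToIso(MIN_TIMESTAMP)); // 1970-04-26T17:46:40.000Z
console.log(microsecondsToIso(MAX_TIMESTAMP)); // 2255-06-05T23:47:34.740Z

// The same instant mistakenly expressed in milliseconds falls below the minimum.
const exampleTimestamp = 1668780332430000;
console.log(exampleTimestamp / 1000 < MIN_TIMESTAMP); // true
```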

Timestamps MUST NOT be from the future from the perspective of the peer accepting them in a sync; but a limited tolerance is allowed to account for clock skew between devices. The recommended value for the future tolerance threshold is 10 minutes, but this can be adjusted depending on the clock accuracy of devices in a deployment scenario.

Timestamps from the future, beyond the tolerance threshold, are (temporarily) invalid and MUST NOT be accepted in a sync. They can be accepted later, after they are no longer from the future.
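
A minimal sketch of the future-tolerance check, assuming the recommended threshold of 10 minutes. The name isTooFarInFuture is hypothetical, and now stands for the accepting peer's wall clock in microseconds.

```typescript
const FUTURE_TOLERANCE = 10 * 60 * 1000 * 1000; // 10 minutes in microseconds

// True if the timestamp is beyond the tolerance and must be rejected for now.
function isTooFarInFuture(
  timestamp: number,
  now: number,
  tolerance: number = FUTURE_TOLERANCE,
): boolean {
  return timestamp > now + tolerance;
}

const now = Date.now() * 1000;
console.log(isTooFarInFuture(now + 5 * 60 * 1000 * 1000, now));  // false: 5 minutes ahead
console.log(isTooFarInFuture(now + 15 * 60 * 1000 * 1000, now)); // true: 15 minutes ahead
```

The same document passes the check once the local clock has advanced far enough, which is why such documents are only temporarily invalid.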

Choosing the future tolerance threshold

In some settings such as in-the-field embedded devices, where devices do not have accurate clocks or connectivity to NTP servers, the future tolerance may be greatly increased. However this enables some possible attacks on the network that can cause instability, so it requires greater trust in the network participants. In extreme cases we may need to add algorithms for the peers to attempt to converge on a rough understanding of the current time to account for clock skew.

In these scenarios, document timestamps should be considered more like version numbers rather than actual meaningful timestamps.

Also see the (non-normative) document How does Earthstar handle timestamps, and can it recover from a device with a very inaccurate clock?

Ephemeral documents: deleteAfter

Documents may be regular or ephemeral. Ephemeral documents have an expiration date, after which they MUST be proactively deleted by Earthstar libraries.

Why have ephemeral documents?

Deleting a regular document leaves behind a small empty document which takes up space. Ephemeral documents are completely removed when they expire, so they are a good choice for applications which will write many short-lived documents.

They also provide more privacy. Users can always delete their regular documents, but that deletion must propagate across all the peers. Ephemeral documents will be deleted from the entire network when they expire even if some peers have lost connectivity or you are not there to request a deletion at that time.

Libraries MUST check for and delete all expired documents at least once an hour (while they are running). Deleted documents MUST be actually deleted, not just marked as ignored. If an expired document has an attachment which no other documents refer to, this attachment MUST also be deleted.

Libraries MUST filter out expired documents from queries and lookups and not return them. Libraries MAY or MAY NOT actually delete them when they are encountered during querying; they may choose to wait until the next scheduled hourly deletion time. In addition libraries MUST NOT return the attachments for expired documents.

Expired documents MUST NOT be sent or accepted during a sync. Both peers in a sync SHOULD filter the incoming and outgoing documents to enforce this. This is the responsibility of the Formatter.

The deleteAfter field holds the timestamp after which a document is to be deleted. It is a timestamp with the same format and range requirements as the regular timestamp field.

Regular, non-ephemeral documents omit the deleteAfter field.

Unlike the timestamp field, the deleteAfter field is expected to be in the future compared to the current wall-clock time. Once the deleteAfter time is in the past, the document becomes invalid.

The deleteAfter time MUST be strictly greater than the document's timestamp.

The document path MUST contain at least one exclamation mark ! character IF AND ONLY IF the document is ephemeral. Regular, non-ephemeral documents MUST NOT have any ! characters in their paths.
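
The deleteAfter rules so far can be sketched as a single check. The interface and function names here are hypothetical, and treating a document as expired when deleteAfter is less than now is one reading of "in the past". Range checks on the timestamps themselves are left out.

```typescript
interface EphemeralFields {
  path: string;
  timestamp: number;
  deleteAfter?: number;
}

// Returns a list of problems; an empty list means these rules pass.
// `now` is the current wall clock in microseconds.
function checkEphemeralRules(doc: EphemeralFields, now: number): string[] {
  const problems: string[] = [];
  const isEphemeral = doc.deleteAfter !== undefined;
  const hasBang = doc.path.indexOf("!") !== -1;

  if (isEphemeral !== hasBang) {
    problems.push("path must contain '!' if and only if deleteAfter is present");
  }
  if (doc.deleteAfter !== undefined) {
    if (doc.deleteAfter <= doc.timestamp) {
      problems.push("deleteAfter must be strictly greater than timestamp");
    }
    if (doc.deleteAfter < now) {
      problems.push("document has expired");
    }
  }
  return problems;
}
```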

Ephemeral documents MAY be edited by users to change the expiration date. This works best if the expiration date is increased into the future. If it's decreased so it expires sooner, the document may sync in unpredictable ways (see below for another example of this). If it's set to expire in the past, the document won't even sync off of the current peer because other peers will reject it, so the edit won't propagate. When shortening the expiration date there should be time for the edit to propagate across the entire network of peers before the document expires.

Why ephemeral documents need a ! in their path

Regular and ephemeral documents with the same path could interact in surprising ways. To avoid this, we enforce that they can never collide on the same path.

(An ephemeral document could propagate halfway across a network of peers, overwriting a regular document with the same path, and then expire and get deleted wherever it has spread. Then the regular document would regrow to fill the empty space.

But if the ephemeral document traveled across the entire network and exterminated the regular document, and THEN expired, there would be nothing left.

Which of these cases occurred would depend on how long the document took to spread, which could be very fast or could take months if there was a peer that was usually offline. We'd like to avoid this unpredictability.)

Signature

The ed25519 signature by the author, encoded in base32 with a leading b.

See Serialization for Hashing and Signing, below, for details.

Like the hashes and crypto keys in Earthstar, this is the raw binary signature encoded directly into base32. Do not encode the binary signature into a hex string and then into base32.

Share

The share field holds a share address, formatted according to the rules described in Share Addresses.

As a consequence, each document belongs to exactly one share and cannot be moved to another share (because that would cause the signature to become invalid).

Share Signature

The ed25519 signature by the share keypair (share address + secret), encoded in base32 with a leading b.

See Serialization for Hashing and Signing, below, for details.

Like the hashes and crypto keys in Earthstar, this is the raw binary signature encoded directly into base32. Do not encode the binary signature into a hex string and then into base32.

Attachments: attachmentHash and attachmentSize

Documents may have attachments, which are arbitrary binary data. To be considered valid, a document must have the following IF AND ONLY IF said document has an attachment:

An attachmentSize field with the size of the document's attachment in bytes.
An attachmentHash field with the sha256 hash of the attachment encoded as base32.
A path ending with a file extension, e.g. /docs/notes.pdf
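
The three requirements above can be sketched as one consistency check. All names here are hypothetical. The pattern for an author address (an @, a 4-character shortname, a dot, then b and 52 base32 characters) and the set of characters allowed in an extension are assumptions made for illustration.

```typescript
interface AttachmentFields {
  path: string;
  attachmentSize?: number;
  attachmentHash?: string;
}

// Assumed shape of an author address at the end of a path.
const ENDS_WITH_AUTHOR_ADDRESS = /@[a-z][a-z0-9]{3}\.b[a-z2-7]{52}$/;
// Assumed shape of a file extension: a dot followed by letters or digits.
const ENDS_WITH_EXTENSION = /\.[A-Za-z0-9]+$/;

// The tail of an author address must not be mistaken for a file extension.
function pathHasFileExtension(path: string): boolean {
  return ENDS_WITH_EXTENSION.test(path) && !ENDS_WITH_AUTHOR_ADDRESS.test(path);
}

// Both attachment fields and the file extension must be present together,
// or all three must be absent.
function attachmentFieldsAreConsistent(doc: AttachmentFields): boolean {
  const hasSize = doc.attachmentSize !== undefined;
  const hasHash = doc.attachmentHash !== undefined;
  if (hasSize !== hasHash) {
    return false;
  }
  return pathHasFileExtension(doc.path) === hasHash;
}
```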

Why documents with attachments need file extensions

File extensions are a human-readable and compact way to indicate what an attachment contains and how it should be interpreted. But in addition to this, allowing only documents with attachments with file extensions means that documents with and without attachments can never collide on the same path.

Wiping document contents

The only way to truly delete every trace of a document is to use an ephemeral document.

However, non-ephemeral documents may have their contents (text and attachment) wiped from them. These documents retain all other fields, such as their path, author, and timestamp, and will continue to be synced with other peers. The attachment's bytes will be erased if no reference to them exists in any other document.

A document can be wiped by setting the text field to an empty string.

To erase an attachment, the attachmentSize field MUST be set to zero and the attachmentHash field MUST be set to the sha256 hash of an empty string encoded to base32 (b4oymiquy7qobjgx36tejs35zeqt24qpemsnzgtfeswmrw6csxbkq). The text field of the same document MUST be an empty string.
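
A sketch of the field changes involved in a wipe, using the empty-string hash given above. The names WipeableFields and wipeFields are hypothetical. Because textHash and the attachment fields are part of what gets hashed and signed, the result would still need to be signed again before it could be ingested.

```typescript
// Base32 sha256 of the empty string, as given in the text above.
const EMPTY_HASH = "b4oymiquy7qobjgx36tejs35zeqt24qpemsnzgtfeswmrw6csxbkq";

interface WipeableFields {
  text: string;
  textHash: string;
  attachmentSize?: number;
  attachmentHash?: string;
}

// Returns a copy of the document fields with user content wiped.
// The input is not modified.
function wipeFields(doc: WipeableFields): WipeableFields {
  const wiped: WipeableFields = { ...doc, text: "", textHash: EMPTY_HASH };
  if (doc.attachmentHash !== undefined) {
    wiped.attachmentSize = 0;
    wiped.attachmentHash = EMPTY_HASH;
  }
  return wiped;
}
```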

A peer MUST delete an attachment with no corresponding documents within an hour.

Document Serialization

There are 3 scenarios when we need to serialize a document to/from a series of bytes:

Hashing and signing
Network transmission
Storage

They have different needs and we use different formats for each.

Serialization for Hashing and Signing

When a signature is produced for a document, it's actually signing a hash of the document. We need a deterministic, standardized, and simple way to serialize a document to a sequence of bytes that we can hash. This is a one-way conversion — we never need to deserialize this format.

Earthstar libraries MUST use this exact process.

To hash a document:

// Pseudocode

let hashDocument(document): string => {
    // Get a deterministic hash of a document as a base32 string.
    // Preconditions:
    //   All string fields must be printable ASCII only
    //   Fields must have one of the following types:
    //       string
    //       integer
    //   Note that "string | integer" is not allowed
    //   because we'd have no way of telling "123" apart from 123.

    let accum: string = '';

    For each field and value in the document, sorted in lexicographic order by field name: {

        // Skip the text and signature fields
        if (field === 'text' || field === 'signature') { continue; }

        // Otherwise, append the fieldname and value.
        // Tab and newline are our field separators.
        // Convert integers to strings here.
        accum += fieldname + "\t" + value + "\n"

        // (The newline is included on the last field.)
    }

    // Binary digest, not hex digest string!
    let binaryHashDigest = sha256(accum).digest();

    // Convert bytes to Earthstar b32 format with leading 'b'
    return base32encode(binaryHashDigest);
}
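
The pseudocode above could be written as runnable TypeScript roughly as follows. This is a sketch that mirrors the pseudocode literally, assuming Node's built-in crypto module for sha256. The names serializeForHashing and base32encode are hypothetical, and the list of skipped fields is a parameter that defaults to the two fields the pseudocode skips.

```typescript
import { createHash } from "node:crypto";

type FieldValue = string | number;

const BASE32_ALPHABET = "abcdefghijklmnopqrstuvwxyz234567";

// Lowercase, unpadded RFC4648 base32 with a leading "b".
function base32encode(bytes: Uint8Array): string {
  let out = "b";
  let value = 0;
  let bits = 0;
  for (let i = 0; i < bytes.length; i++) {
    value = (value << 8) | bytes[i];
    bits += 8;
    while (bits >= 5) {
      out += BASE32_ALPHABET.charAt((value >>> (bits - 5)) & 31);
      bits -= 5;
    }
    value &= (1 << bits) - 1;
  }
  if (bits > 0) {
    out += BASE32_ALPHABET.charAt((value << (5 - bits)) & 31);
  }
  return out;
}

// Build "fieldname<TAB>value<NEWLINE>" lines, sorted by field name.
// Absent (undefined) fields and skipped fields contribute nothing.
function serializeForHashing(
  doc: Record<string, FieldValue | undefined>,
  skipped: string[] = ["text", "signature"],
): string {
  let accum = "";
  for (const field of Object.keys(doc).sort()) {
    const value = doc[field];
    if (value === undefined || skipped.indexOf(field) !== -1) {
      continue;
    }
    accum += field + "\t" + String(value) + "\n";
  }
  return accum;
}

// Hash the serialized form and encode the binary digest as base32.
function hashDocument(doc: Record<string, FieldValue | undefined>): string {
  const digest = createHash("sha256").update(serializeForHashing(doc), "utf8").digest();
  return base32encode(digest);
}
```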

To sign a document:

// Pseudocode

let signDocument(authorOrShareKeypair, document): void => {
    // Sign the document and store the signature into the document (mutating it).
    // authorOrShareKeypair contains a pubkey and a private key.

    let binarySignature = ed25519sign(
        authorOrShareKeypair,
        hashDocument(document)
    );

    // Convert bytes to Earthstar b32 format with leading 'b'
    let base32signature = base32encode(binarySignature);

    document.signature = base32signature;
}
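
A runnable sketch of the signing step, using the ed25519 support in Node's built-in crypto module. Several things here are assumptions: Node's KeyObject stands in for Earthstar's base32-encoded keypairs, the base32 hash string is signed as its ASCII bytes, and signDocumentHash and base32encode are hypothetical names.

```typescript
import { generateKeyPairSync, sign, verify, KeyObject } from "node:crypto";

const BASE32_ALPHABET = "abcdefghijklmnopqrstuvwxyz234567";

// Lowercase, unpadded RFC4648 base32 with a leading "b".
function base32encode(bytes: Uint8Array): string {
  let out = "b";
  let value = 0;
  let bits = 0;
  for (let i = 0; i < bytes.length; i++) {
    value = (value << 8) | bytes[i];
    bits += 8;
    while (bits >= 5) {
      out += BASE32_ALPHABET.charAt((value >>> (bits - 5)) & 31);
      bits -= 5;
    }
    value &= (1 << bits) - 1;
  }
  if (bits > 0) {
    out += BASE32_ALPHABET.charAt((value << (5 - bits)) & 31);
  }
  return out;
}

// Sign the document hash and return the raw 64-byte ed25519 signature.
// For ed25519, Node expects the algorithm argument to be null.
function signDocumentHash(privateKey: KeyObject, documentHash: string): Uint8Array {
  return sign(null, Buffer.from(documentHash, "ascii"), privateKey);
}

const { publicKey, privateKey } = generateKeyPairSync("ed25519");
// Placeholder hash value, standing in for the output of hashDocument.
const documentHash = "b4oymiquy7qobjgx36tejs35zeqt24qpemsnzgtfeswmrw6csxbkq";

const rawSignature = signDocumentHash(privateKey, documentHash);
// Raw bytes go directly to base32: 104 characters including the "b".
const signature = base32encode(rawSignature);
const isValid = verify(null, Buffer.from(documentHash, "ascii"), publicKey, rawSignature);
```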

Preconditions that make this work:

Documents can only hold integers, strings, and null — no floats or nested objects that could increase complexity or be nondeterministic
No document field name or field content can contain \t or \n, except text, which is not directly used (we use textHash instead). So we can safely use tab and newline as field separators.
We don't need to worry about telling strings and integers apart because each field can hold an integer, or a string, but not both. So we don't need to quote our strings with quote marks.

Why use textHash instead of text for hashing documents?

This lets us drop the actual content (to save space) but still verify the document signature.

Serialization for Network

This is a two-way conversion between memory and bytes.

Earthstar doesn't have strong opinions about networking. This format does not need to be standardized, but it's good to choose widely used familiar tools.

Apps and libraries SHOULD use JSON (encoded as UTF-8) as a default choice unless there are important reasons to choose otherwise. JSON is widely known, widely supported, and fits within most network protocols easily.

Serialization for Storage

This is a two-way conversion between memory and bytes.

It does not need to be standardized; each implementation can use its own format.

It needs to support efficient mutation and deletion of documents, and querying by various properties.

It would be nice if this was an archival format (corruption-resistant and widely known).

Options to consider:

SQLite
Postgres
IndexedDB
leveldb or similar key-value databases (with extra indexes)
a bunch of JSON files, one for each document (with extra indexes)

For exporting and importing data:

one giant newline-delimited JSON file, one document per line, is easier to parse than a giant JSON array of documents, and streamable.
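
A minimal sketch of newline-delimited JSON export and import. The function names are hypothetical. JSON.stringify escapes any newlines inside the text field, so each document stays on a single line.

```typescript
// Serialize documents as newline-delimited JSON, one document per line.
function exportDocs(docs: object[]): string {
  return docs.map((doc) => JSON.stringify(doc)).join("\n") + "\n";
}

// Parse newline-delimited JSON back into documents, ignoring blank lines.
// The results are not validated here; they still need to pass ingestion.
function importDocs(ndjson: string): unknown[] {
  return ndjson
    .split("\n")
    .filter((line) => line.trim() !== "")
    .map((line) => JSON.parse(line));
}
```

Because each line is independent, an importer can also read the file as a stream and ingest documents one line at a time.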

Querying

Libraries SHOULD support a standard variety of queries against a database of Earthstar documents. A query is specified by a single object with optional fields for each kind of query operation.

This query format will become standardized because it will be used for querying from one peer to another. It's not quite stable yet.

This only supports relatively simple ways of querying and filtering documents because we want to make it easy to use many different kinds of backend storage which may have limited query capabilities. Apps and libraries MAY add extensions for more powerful querying if they're able to, but this should be considered the minimal set for compatibility across peers.

The recommended query object format, expressed in Typescript:

export interface Query {
  /** Whether to fetch all historical versions of a document or just the latest versions. */
  historyMode?: HistoryMode;

  //   "path ASC" is actually "path ASC then break ties with timestamp DESC"
  //   "path DESC" is the reverse of that
  /** The order to return docs in. Defaults to `path ASC`. */
  orderBy?: "path ASC" | "path DESC" | "localIndex ASC" | "localIndex DESC";

  /** Only fetch documents which come after a certain point. */
  startAfter?: {
    /** Only documents after this localIndex. Only works when ordering by localIndex. */
    localIndex?: number;

    /** Only documents after this path. Only works when ordering by path. */
    path?: string;
  };

  // then apply filters, if any
  filter?: {
    path?: string;
    pathStartsWith?: string;
    pathEndsWith?: string;
    author?: string;
    timestamp?: number;
    timestampGt?: number;
    timestampLt?: number;
  };

  /** The maximum number of documents to return. */
  limit?: number;

  formats?: string[];
}
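
To illustrate the filter part of the query, here is a sketch of an in-memory match. Since the query format is not yet stable, the semantics shown are assumptions: every filter field that is set must match, and timestampGt and timestampLt are treated as strict comparisons. The names QueryFilter, DocFields, and matchesFilter are hypothetical.

```typescript
interface QueryFilter {
  path?: string;
  pathStartsWith?: string;
  pathEndsWith?: string;
  author?: string;
  timestamp?: number;
  timestampGt?: number;
  timestampLt?: number;
}

interface DocFields {
  path: string;
  author: string;
  timestamp: number;
}

// True if the document satisfies every filter field that is set.
function matchesFilter(doc: DocFields, filter: QueryFilter): boolean {
  if (filter.path !== undefined && doc.path !== filter.path) return false;
  if (filter.pathStartsWith !== undefined && !doc.path.startsWith(filter.pathStartsWith)) return false;
  if (filter.pathEndsWith !== undefined && !doc.path.endsWith(filter.pathEndsWith)) return false;
  if (filter.author !== undefined && doc.author !== filter.author) return false;
  if (filter.timestamp !== undefined && doc.timestamp !== filter.timestamp) return false;
  if (filter.timestampGt !== undefined && !(doc.timestamp > filter.timestampGt)) return false;
  if (filter.timestampLt !== undefined && !(doc.timestamp < filter.timestampLt)) return false;
  return true;
}
```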

Syncing

Syncing is the process of trading documents between two peers to bring each other up to date.

Syncing can occur locally (within a process, between two Storage instances) as well as across a network.

Documents are locked into specific shares; therefore syncing can't transfer documents between shares, only between different peers that hold the same share.

The method used by peers to sync with each other is not yet stable, and will be standardized in a separate specification.

Share Secrecy

Knowing a share address makes it possible to create a replica for that share and sync data from other peers who know about it. Therefore a share address should only be shared with those you wish to have read access to the share's data. This could be the wider public, or a small group, or a single individual.

Writing new documents to a share is only possible with that share's secret. This secret should not be shared publicly.

It MUST be impossible to discover new shares through the syncing process. Peers MUST keep their shares secret and only transmit data when they are sure the other peer also knows the address of the same share.

One method of disclosing which shares two peers have in common is as follows:

Peer A generates a random string to be used as a salt during hashing.
Peer A hashes the share addresses it knows of using the salt.
Peer A sends the salt and hashed addresses to Peer B.
Peer B hashes the share addresses it knows of using the salt obtained from Peer A.
Peer B sees if any of the hashes it has produced matches any of the hashes it received from Peer A.

If there is a match, then both peers have that share in common. Where there are no matches, each peer will still have no knowledge of which shares the other knows of.
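
The exchange above can be sketched as follows, assuming Node's built-in crypto module. How the salt is combined with the address (simple concatenation here) and the hex encoding of the exchanged hashes are assumptions made for illustration, as are the function names and the example share addresses.

```typescript
import { createHash, randomBytes } from "node:crypto";

// Hash one share address together with the salt.
function saltedHash(salt: string, shareAddress: string): string {
  return createHash("sha256").update(salt + shareAddress, "utf8").digest("hex");
}

// Peer A: pick a random salt and hash every known share address with it.
function prepareOffer(myShares: string[]): { salt: string; hashes: string[] } {
  const salt = randomBytes(16).toString("hex");
  return { salt, hashes: myShares.map((share) => saltedHash(salt, share)) };
}

// Peer B: hash its own addresses with A's salt and keep the ones that match.
function findCommonShares(salt: string, receivedHashes: string[], myShares: string[]): string[] {
  return myShares.filter((share) => receivedHashes.indexOf(saltedHash(salt, share)) !== -1);
}

const offer = prepareOffer(["+gardening.bexample1", "+recipes.bexample2"]);
const common = findCommonShares(offer.salt, offer.hashes, ["+gardening.bexample1", "+music.bexample3"]);
console.log(common); // [ '+gardening.bexample1' ]
```

In this sketch only Peer B learns the result, so Peer B would need to report the matches back, or the peers would repeat the exchange in the other direction.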

They can now proceed to sync each of their common shares.

Eavesdropping

An eavesdropper observing this exchange will know the salt, and can confirm that the peers have or don't have shares that the eavesdropper already knows about, but can't un-hash the exchanged values to get the share addresses they don't already know.

But once the peers start trading actual share data, an eavesdropper can observe the share addresses in plaintext in the exchanged documents.

Peers SHOULD thus talk to each other over an encrypted connection such as HTTPS.

Resolving Conflicts

See the Data model section for details about conflict resolution.

Future Directions

These are not implemented or specified yet:

Transport Encryption

Peers could encrypt their communications with SSL, Noise Protocol, etc.

Document Encryption

The document text field can be encrypted by apps in any way they like. We have some convenient keys available already:

You can write a private message to one author using their public key
You can write a private message to the entire share using the share public key (if it's an invite-only share)

However, the rest of the document metadata will be in plaintext including the author, path, and timestamp, which might reveal important information. App authors would have to use uninformative paths. This issue discusses ways of nesting the metadata inside another document to obscure it.

Immutable Documents

Documents that can't be edited. They may or may not be able to be deleted, and they may or may not be ephemeral (expiring).

This would probably involve a new optional document field, immutable.

See more in this issue.
