es5 Data Format
This specification is now superseded by the Earthstar version 6 specification.
Document version: 2022-11-18.1
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119. "WILL" means the same as "SHALL".
Scope
This document describes:
- The format of Earthstar documents
- The cryptography used for identities and signatures
- The procedures needed for running a single local replica: signing, ingesting, and verifying documents; ingesting and verifying attachments
This document may discuss, but does NOT prescribe, these topics which are still in development:
- Network protocols
- Algorithms for syncing documents between multiple replicas
- The API of a specific Earthstar implementation
The specification for the previous format, es4, can be found here.
Table of contents
- Libraries Needed to Implement Earthstar
- Vocabulary and concepts
- Data model
- Attachments
- Ingesting Attachments
- Identities, Authors, Shares
- Paths and Write Permissions
- Documents and Their Fields
- Wiping document contents
- Document Serialization
- Querying
- Syncing
- Future Directions
Libraries Needed to Implement Earthstar
To make your own Earthstar library, you'll need:
ed25519 Signatures
ed25519 is the same cryptography format used by Secure Scuttlebutt and hypercore.
base32 Encoding
Almost anywhere that binary data needs to be encoded in Earthstar, it's done with base32: public and private keys, signatures, hashes. The exception is binary document content which is base64 (see next section).
We use RFC4648 with lowercase letters and no padding. The character set is abcdefghijklmnopqrstuvwxyz234567.
Our encodings are always prefixed with an extra b character, following the multibase standard. The b format is the only format supported in Earthstar. Libraries MUST enforce that encoded strings start with b, and MUST NOT allow any other encoding formats.
Libraries MUST be strict when encoding and decoding: only allow lowercase characters, and don't allow a 1 to be treated as an i.
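The encoding rules above can be sketched as follows. This is a minimal illustration, not the reference implementation: the function names base32Encode and base32Decode are hypothetical, and the decoder silently drops leftover trailing bits rather than checking that they are zero.

```typescript
// RFC 4648 base32 alphabet, lowercase, as listed in this section.
const B32_ALPHABET = "abcdefghijklmnopqrstuvwxyz234567";

// Encode bytes as unpadded lowercase base32 with the multibase "b" prefix.
function base32Encode(bytes: Uint8Array): string {
  let out = "b";
  let value = 0; // bit buffer
  let bits = 0; // number of bits currently held in the buffer
  for (let i = 0; i < bytes.length; i++) {
    value = (value << 8) | bytes[i];
    bits += 8;
    while (bits >= 5) {
      out += B32_ALPHABET[(value >>> (bits - 5)) & 31];
      bits -= 5;
    }
    value &= (1 << bits) - 1; // keep only the leftover bits
  }
  if (bits > 0) {
    // Pad the final group with zero bits on the right; no "=" padding.
    out += B32_ALPHABET[(value << (5 - bits)) & 31];
  }
  return out;
}

// Strict decode: requires the leading "b" and rejects any character
// outside the alphabet (uppercase letters, "1", "0", "=", and so on).
function base32Decode(encoded: string): Uint8Array {
  if (encoded[0] !== "b") {
    throw new Error('encoded string must start with "b"');
  }
  const bytes: number[] = [];
  let value = 0;
  let bits = 0;
  for (let i = 1; i < encoded.length; i++) {
    const index = B32_ALPHABET.indexOf(encoded[i]);
    if (index === -1) {
      throw new Error(`invalid base32 character: ${encoded[i]}`);
    }
    value = (value << 5) | index;
    bits += 5;
    if (bits >= 8) {
      bytes.push((value >>> (bits - 8)) & 255);
      bits -= 8;
      value &= (1 << bits) - 1;
    }
  }
  return new Uint8Array(bytes);
}
```

A 32-byte key encodes to 52 base32 characters plus the b, which is where the 53-character lengths elsewhere in this document come from.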
Why this encoding?
- We want to use encoded data in URL locations, which can't contain upper-case characters, so base64 and base58 won't work.
- base32 is shorter than base16 (hex).
- The choice of a specific base32 variant was arbitrary and was influenced by the ones available in the multibase standard, which is widely implemented.
- The leading b character serves two purposes: it defines which base32 format we're using, and it prevents encoded strings from starting with a digit. This makes it possible to use encoded strings as standards-compliant URL locations, as in earthstar://gardening.bajfoqa3joia3jao2df.
Indexed Storage
Earthstar messages are typically queried in a variety of ways. This is easiest to implement using a database like SQLite, but if you manage your own indexes you can also use a key-value database like leveldb.
Vocabulary and concepts
Library, Earthstar library — In this context, an implementation of Earthstar itself.
App — Software which uses Earthstar to store data.
Document — The unit of data storage in Earthstar, similar to a document in a NoSQL database. A document has metadata fields (author, timestamp, etc), a text field, and optionally fields describing an associated attachment.
Path — Similar to a key in leveldb or a path in a filesystem, each document is stored at a specific path.
Identity — A keypair which writes documents to a share. Identities are identified by an ed25519 public key in a format called an identity address. It's safe to use the identity keypair from multiple devices simultaneously.
Share — A collection of documents. Shares are identified by a share address, the public key of an ed25519 keypair. Shares are separate, unrelated worlds of data. Each document exists within exactly one share.
Format — The Earthstar document specification is versioned. Each version of the specification is called a document format, and the code that handles that format is called a formatter.
Peer — A device which holds Earthstar data and wishes to sync with other peers. Peers may, for example, be individual users' devices and/or replica servers. A peer may hold data from multiple shares.
Attachment — Arbitrary binary data which a document may refer to.
Replica server — A peer whose purpose is to provide uptime and connectivity for many users. Usually these are cloud servers with publicly routable IP addresses.
Data model
A share's data is a collection of documents authored by various identities, as well as any attachments those documents correspond to.
// Simplified example of data stored in a share
Share: "+gardening.bhyux4opeug2ieqcy36exrf4qymc56adwll4zeazm42oamxtr7heq"
  Path: "/wiki/shared/Flowers"
    Documents in this path:
      { author: @suzy.b..., timestamp: 1500094, text: "pretty" }
      { author: @matt.b..., timestamp: 1500073, text: "nice petals" }
      { author: @fern.b..., timestamp: 1500012, text: "smell good" }
  Path: "/wiki/shared/Bugs"
    Documents in this path:
      { author: @suzy.b..., timestamp: 1503333, text: "wiggly" }
  Path: "/audio/nightingale.mp3"
    Documents in this path:
      { author: @suzy.b..., timestamp: 1500094, text: "Nightingale's song (better recording)", attachmentSize: 1000203, attachmentHash: bk2xrjl90 }
      { author: @matt.b..., timestamp: 1500073, text: "Nightingale's song. Location unknown.", attachmentSize: 900384, attachmentHash: b84ahi8ah }
      { author: @fern.b..., timestamp: 1500012, text: "Nightingale's song", attachmentSize: 900384, attachmentHash: b84ahi8ah }
  Attachments:
    bk2xrjl90: <bytes>
    b84ahi8ah: <bytes>
A peer MAY hold data from many shares. Each share's data is treated independently. Each document within a share is also independent; they don't form a chain or feed or refer to each other (e.g. no Merkle backlinks).
Certain documents may have a corresponding attachment. For each known document the peer MAY also hold a corresponding attachment. A peer MUST NOT hold attachments with no corresponding documents.
Documents
A peer MAY hold some or all of the documents from a share, in any combination. Apps MUST assume that any combination of docs may be missing.
Each document in a share exists at a path. For each path, Earthstar keeps the newest document from each identity who has ever written to that path. "Newest" is determined by comparing the timestamp field in the documents. See the next section for details about trusting timestamps.
In the preceding example, the /wiki/shared/Flowers path contains 3 documents, because 3 different identities have written there. They may have written there hundreds of times, but we only keep the newest document from each identity, in that path.
When looking up a path to retrieve a document, the newest document MUST be returned by default. Apps can also query for the full set of document versions at a path; the older ones are called history documents.
Ingesting Documents
When a new document arrives and an existing older one is already there (from the same identity and at the same path), the new document overwrites the old one. Earthstar libraries MUST actually delete the older, overwritten document. The author's intent is to remove the old data.
The process of validating and potentially saving an incoming document is called ingesting, and it MUST happen to newly obtained documents, whether they come from other peers or are made as local writes. Earthstar libraries MUST use this ingestion process:
// Pseudocode
IngestDoc(newDoc):
    // Check doc validity: bad data types, bad signature,
    // expired ephemeral doc, wrong format string,
    // timestamp too far in the future, ...
    if !isValid(newDoc):
        return "rejected an invalid doc"
    // Check if it's obsolete
    // (Do we have a newer doc with same path and same identity?)
    let existingDoc = query({author: newDoc.author, path: newDoc.path})
    if existingDoc exists && existingDoc.timestamp >= newDoc.timestamp:
        return "ignored an obsolete doc"
    // Overwrite older doc with same path and same identity
    if existingDoc exists:
        remove(existingDoc)
    save(newDoc)
    return "accepted a doc"
Each document has a format field which specifies which data Format it is. The isValid function in this pseudocode represents a call to the Formatter which is responsible for enforcing the rules of that format.
Deleting old versions of documents may result in 'dangling' attachments without corresponding documents. Peers MUST delete these dangling attachments within an hour of their corresponding doc being deleted.
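As a rough illustration of the document ingestion procedure, here is an in-memory sketch. MiniDoc, MiniReplica and versionsAt are hypothetical names, the validity check is injected by the caller, and a real replica would also verify signatures and clean up dangling attachments.

```typescript
// Hypothetical, cut-down document shape for illustration only.
interface MiniDoc {
  author: string;
  path: string;
  timestamp: number;
  text: string;
}

type IngestResult =
  | "rejected an invalid doc"
  | "ignored an obsolete doc"
  | "accepted a doc";

class MiniReplica {
  // One slot per (path, author) pair. Newline cannot appear in either
  // field (both are printable ASCII), so it is a safe key separator.
  private docs: { [key: string]: MiniDoc } = {};

  constructor(private isValid: (doc: MiniDoc) => boolean) {}

  ingest(newDoc: MiniDoc): IngestResult {
    if (!this.isValid(newDoc)) {
      return "rejected an invalid doc";
    }
    const key = newDoc.path + "\n" + newDoc.author;
    const existing = this.docs[key];
    if (existing !== undefined && existing.timestamp >= newDoc.timestamp) {
      return "ignored an obsolete doc";
    }
    // Assigning to the slot discards the older doc, which stands in for
    // the remove() + save() steps of the pseudocode.
    this.docs[key] = newDoc;
    return "accepted a doc";
  }

  // All versions at a path, newest first. Element 0 is what a default
  // lookup returns; the rest are history documents.
  versionsAt(path: string): MiniDoc[] {
    return Object.keys(this.docs)
      .map((key) => this.docs[key])
      .filter((doc) => doc.path === path)
      .sort((a, b) => b.timestamp - a.timestamp);
  }
}
```

Keying storage by path and author makes the "one document per identity per path" rule structural: there is nowhere for an overwritten version to linger.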
Attachments
A peer MAY hold some or all of the attachments in a share, provided it already has a corresponding document. Apps MUST assume that any combination of attachments may be absent.
Because many documents may refer to the same attachment (e.g. different history versions of a document at the same path), the peer SHOULD only store each individual attachment once.
Ingesting Attachments
Newly arrived attachments are only saved if the peer holds a document that refers to an attachment of matching size and hash.
The process of validating and potentially saving an incoming attachment is also called ingesting, and it MUST happen to newly obtained attachments, whether they come from other peers or are made as local writes. Earthstar libraries MUST use this ingestion process:
// Pseudocode
IngestAttachment(doc, attachment):
    // Check that the doc for this attachment is already ingested
    if !isIngested(doc):
        return "rejected an attachment we don't have the doc for"
    // Check if it's obsolete
    // (Do we already have an attachment with the same hash, for the same format?)
    let existingAttachment = query({hash: doc.attachmentHash, format: doc.format})
    if existingAttachment exists:
        return "already have this attachment"
    // Check attachment validity
    let attachmentHash = hash(attachment)
    let attachmentSize = size(attachment)
    if attachmentHash !== doc.attachmentHash || attachmentSize !== doc.attachmentSize:
        return "attachment does not match doc"
    // Persist the attachment
    save(attachment)
    return "persisted an attachment"
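The attachment procedure could be sketched like this. AttachmentStore and AttachmentDocInfo are hypothetical names, and both the hash function and the "is this doc ingested" check are supplied by the caller, so the sketch stays independent of any particular sha256 or storage library.

```typescript
// Hypothetical subset of document fields needed to ingest an attachment.
interface AttachmentDocInfo {
  format: string;
  attachmentHash: string;
  attachmentSize: number;
}

class AttachmentStore {
  // Attachments are stored once per (format, hash), however many
  // documents refer to them.
  private blobs: { [key: string]: Uint8Array } = {};

  constructor(
    // Assumed to return the sha256 of the bytes as "b" + base32.
    private hash: (bytes: Uint8Array) => string,
    // Assumed to report whether the replica already holds this doc.
    private isIngested: (doc: AttachmentDocInfo) => boolean,
  ) {}

  ingest(doc: AttachmentDocInfo, attachment: Uint8Array): string {
    if (!this.isIngested(doc)) {
      return "rejected an attachment we don't have the doc for";
    }
    const key = doc.format + "\n" + doc.attachmentHash;
    if (this.blobs[key] !== undefined) {
      return "already have this attachment";
    }
    // Both the hash and the size must match what the document claims.
    if (
      this.hash(attachment) !== doc.attachmentHash ||
      attachment.length !== doc.attachmentSize
    ) {
      return "attachment does not match doc";
    }
    this.blobs[key] = attachment;
    return "persisted an attachment";
  }
}
```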
Timestamps and Clock Skew
Earthstar uses timestamps to order documents, as integers can be ordered even when there are gaps (i.e. missing documents).
For this reason, peers MUST NOT save two documents at the same path with the same timestamp.
When a user sets a new document without specifying a timestamp themselves, the document MUST be written with a timestamp of whichever value is greater:
- The current time in microseconds since the Unix epoch
- The timestamp of the latest document at the same path, plus one (so that the new document never shares a timestamp with it)
Or in pseudocode:
let timestamp = max(previouslyLatestDoc.timestamp + 1, nowInMicroseconds)
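A small sketch of that rule, assuming hypothetical helper names nextTimestamp and nowInMicroseconds, and treating "no document at this path yet" as undefined:

```typescript
// Current wall-clock time in microseconds since the Unix epoch.
// Date.now() has millisecond resolution, so the last three digits are zero.
function nowInMicroseconds(): number {
  return Date.now() * 1000;
}

// Pick a timestamp that is strictly greater than the latest document
// at the same path, and otherwise tracks the wall clock.
function nextTimestamp(
  latestAtPath: number | undefined,
  now: number,
): number {
  if (latestAtPath === undefined) {
    return now;
  }
  return Math.max(latestAtPath + 1, now);
}
```

Passing the clock reading in as an argument keeps the function deterministic, which makes the "latest document is ahead of my clock" case easy to exercise.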
Trusting timestamps
Because peers write their own timestamps using their clocks or user input, there are several scenarios in which timestamps can be inaccurate:
- A user writes a timestamp in the past
- A user writes a timestamp in the future (presumably to make their document take precedence)
- A peer writes a timestamp of 2^53 - 2 to cause overflow problems.
- A peer writes an inaccurate timestamp due to clock skew.
In order to mitigate scenarios 2, 3, and 4, peers MUST consider documents with timestamps further than ten minutes in the future invalid.
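The ten-minute limit might be checked as below. The constant and function names are hypothetical, and the sketch assumes "further than ten minutes" means strictly more than ten minutes ahead of the local clock.

```typescript
// Ten minutes, expressed in microseconds to match document timestamps.
const MAX_FUTURE_SKEW_US = 10 * 60 * 1000 * 1000;

// True when a document timestamp lies too far ahead of the local clock
// and the document must (for now) be considered invalid.
function isTooFarInFuture(timestampUs: number, nowUs: number): boolean {
  return timestampUs > nowUs + MAX_FUTURE_SKEW_US;
}
```

Because the check depends on the current time, a document rejected now can pass the same check on a later sync.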
Identities, Authors, Shares
Character Set Definitions
ALPHA_LOWER = any of "abcdefghijklmnopqrstuvwxyz"
ALPHA_UPPER = any of "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
DIGIT = any of "0123456789"
B32_CHAR = ALPHA_LOWER + any of "234567"
ALPHA_LOWER_OR_DIGIT = ALPHA_LOWER + DIGIT
ASCII = decimal character code 0 to 127 inclusive
= hex character code 0x00 to 0x7F inclusive
(no "extended ASCII" > 0x7F)
PRINTABLE_ASCII = ASCII characters " " to "~", inclusive
= decimal character code 32 to 126 inclusive
= hex character code 0x20 to 0x7E inclusive
In this document, everywhere we say ASCII we mean "standard ASCII and not extended ASCII". Only characters less than or equal to 0x7f, none higher.
Share Addresses
SHARE_ADDRESS = "+" NAME "." B32_PUBKEY
NAME = one ALPHA_LOWER followed by 0 up to and including 14 ALPHA_LOWER_OR_DIGIT
B32_PUBKEY = "b" followed by 52 B32_CHAR
SHARE_SECRET = "b" followed by 52 B32_CHAR
Example:
address: +gardening.bhyux4opeug2ieqcy36exrf4qymc56adwll4zeazm42oamxtr7heq
secret: buaqth6jr5wkksnhdlpfi64cqcnjzfx3r6cssnfqdvitjmfygsk3q
A share address starts with + and is followed by a name, period ., and a public key.
It MUST have those four elements in that order.
- Names are chosen by users when generating the share public address and secret. They cannot be changed later.
- Public keys are 32-byte ed25519 public keys (just the integer portion, no wrapper or surrounding data structures), encoded as base32 with an extra leading "b". This results in 52 characters of base32 plus the "b", for a total of 53 characters.
- Private keys (called "secrets") are also 32 bytes of binary data (just the secret integer), encoded as base32 in the same way as the public key.
The name:
- MUST be 1 to 15 characters long, inclusive.
- MUST only contain digits 0-9 and lowercase ASCII letters a-z
- MUST NOT start with a digit
No uppercase letters are allowed.
Why these rules?
These rules allow share addresses to be used as the location part of regular URLs, after removing the +.
Note that anyone can instantiate a replica for a share if they know its full share address, so it's important to keep share addresses secret if you want to limit their audience. Write access to the share is granted by the share secret.
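One way to check the share address format is a single regular expression. This is only a sketch: parseShareAddress is a hypothetical name, it follows the prose rule that names are 1 to 15 characters long, and it checks the shape of the key without verifying that it is a usable ed25519 public key.

```typescript
// "+" NAME "." B32_PUBKEY
// NAME: one lowercase letter, then up to 14 lowercase letters or digits.
// B32_PUBKEY: "b" followed by 52 base32 characters.
const SHARE_ADDRESS = /^\+([a-z][a-z0-9]{0,14})\.(b[a-z2-7]{52})$/;

interface ParsedShareAddress {
  name: string;
  pubkey: string;
}

// Returns the parts of a well-formed share address, or null otherwise.
function parseShareAddress(address: string): ParsedShareAddress | null {
  const match = SHARE_ADDRESS.exec(address);
  if (match === null) {
    return null;
  }
  return { name: match[1], pubkey: match[2] };
}
```

Anchoring the pattern with ^ and $ is what enforces "those four elements in that order" with nothing before or after them.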
Identity Addresses
IDENTITY_ADDRESS = "@" SHORTNAME "." B32_PUBKEY
SHORTNAME = one ALPHA_LOWER followed by three ALPHA_LOWER_OR_DIGIT
B32_PUBKEY = "b" followed by 52 B32_CHAR
AUTHOR_SECRET = "b" followed by 52 B32_CHAR
Examples
address: @suzy.bo5sotcncvkr7p4c3lnexxpb4hjqi5tcxcov5b4irbnnz2teoifua
secret: becvcwa5dp6kbmjvjs26pe76xxbgjn3yw4cqzl42jqjujob7mk4xq
address: @js80.bnkivt7pdzydgjagu4ooltwmhyoolgidv6iqrnlh5dc7duiuywbfq
secret: b4p3qioleiepi5a6iaalf6pm3qhgapkftxnxcszjwa352qr6gempa
An identity address starts with @ and combines a shortname with a public key.
- Shortnames are chosen by users when creating an identity keypair. They cannot be changed later. They are exactly 4 lowercase ASCII letters or digits, and cannot start with a digit.
- Public keys are 32-byte ed25519 public keys (just the integer portion, no wrapper or surrounding data structures), encoded as base32 with an extra leading "b". This results in 52 characters of base32 plus the "b", for a total of 53 characters.
- Private keys (called "secrets") are also 32 bytes of binary data (just the secret integer), encoded as base32 in the same way as the public key.
Apps MUST treat identities as separate and distinct when their addresses differ, even if only the shortname is different and the pubkeys are the same.
Note that identities also have Unicode display names stored in their profile documents, and those can be changed and allow more freedom of expression. See the next section.
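Identity addresses can be checked the same way as share addresses. In this sketch parseIdentityAddress and isSameIdentity are hypothetical names; the second function illustrates the rule that identities are compared by their whole address, not by public key alone.

```typescript
// "@" SHORTNAME "." B32_PUBKEY
// SHORTNAME: one lowercase letter, then exactly three letters or digits.
const IDENTITY_ADDRESS = /^@([a-z][a-z0-9]{3})\.(b[a-z2-7]{52})$/;

interface ParsedIdentityAddress {
  shortname: string;
  pubkey: string;
}

// Returns the parts of a well-formed identity address, or null otherwise.
function parseIdentityAddress(address: string): ParsedIdentityAddress | null {
  const match = IDENTITY_ADDRESS.exec(address);
  if (match === null) {
    return null;
  }
  return { shortname: match[1], pubkey: match[2] };
}

// Two identities are the same only if their full addresses are equal,
// so a shared pubkey with different shortnames counts as two identities.
function isSameIdentity(a: string, b: string): boolean {
  return a === b;
}
```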
FAQ: Identity Shortnames
Why shortnames?
Impersonation is a difficult problem in distributed social networks where account identifiers can't be both unique and memorable. Users have to vigilantly check for imposters. Typically apps will treat following relationships as trust signals, displaying the accounts of people you follow in a different way to help you avoid imposters.
Shortnames make user identifiers "somewhat memorable" to defend against impersonation.
For example: In Scuttlebutt, users are identified by a bare public key and their display names are mutable.
A user could create an account with a display name of "Cat Pictures" and get many followers. They could then change the display name to match another user that they wish to impersonate. Anyone who previously followed "Cat Pictures" is still following the account under the new name, causing the account to appear trustworthy in the app's UI. Users decided to trust the account in one context (to provide cat pictures) but after trust was granted, the account changed context (to impersonate a friend).
For example, let's say an app shows "✅" when you're following an account. "✅ Cat Pictures @3hj29dhj..." renames itself to "✅ Samantha @3hj29dhj...", which is hard to tell apart from your actual friend "✅ Samantha @9c2j392hx...".
Adding an immutable shortname to the identity address makes this attack more difficult. Users can now notice when a display name is different than expected.
For example "✅ Cat Pictures @cats.3hj29dhj..." renames itself to "✅ Samantha @cats.3hj29dhj...", which is easier to tell apart from your actual friend "✅ Samantha @samm.9c2j392hx...".
Of course the attacker could choose to start off as "✅ Cat Pictures @samm.3hj29dhj...". Users are expected to notice this as a suspicious situation when following an account.
Why are shortnames 4 characters?
Shortnames need to be long enough that they can express a clear relationship to the real identity of the account.
They need to be short enough for users to intuitively understand that they are non-unique.
Why limit shortnames to ASCII?
Users would be better served if they could use their native language in shortnames, but this creates potential vulnerabilities from Unicode normalization.
This usability shortfall is limited because shortnames don't need to be very expressive; users can use Unicode in the display name in their profile.
What if users want to change their shortnames?
Users can change their display names freely but their shortnames are fixed. Modifying the shortname effectively creates a new identity and the user's followers will not automatically follow the new identity.
Humane software must allow users to change their names. (See Falsehoods programmers believe about names). Choosing and changing your own name is a basic human right.
Software should also help users avoid impersonation attacks, a common harassment technique which can be quite destructive. Earthstar attempts to find a reasonable trade-off between these competing needs in the difficult context of a distributed system with no central name authority.
Users who anticipate name changes, or dislike the permanence of shortnames, can choose shortnames which are memorable but non-meaningful, like zzzz or oooo.
Can users create two identities with the same pubkey but different shortnames?
Yes. They are considered two distinct identities, although you can infer that they belong to the same person.
Identity Display Names and Profile Info
An identity can have a profile containing their display name, biographic information, etc. Profile data is stored in the content of a variety of documents under /about/:
displayNamePath = "/about/~" + identityAddress + "/displayName.txt"
Example:
/about/~@suzy.bo5sotcncvkr7p4c3lnexxpb4hjqi5tcxcov5b4irbnnz2teoifua/displayName
Display names stored in profile information can be changed frequently and can contain Unicode.
The expected paths and format of the profile documents are described in our wiki at Standard paths and data formats used by apps. They are not part of this lower-level specification.
We may add more standard pieces of profile information later, such as following and blocking of other users, a paragraph about yourself, a user icon, etc, but this is not standardized yet.
However, apps SHOULD consider the /about/ namespace to be a standardizable area and be extra thoughtful about what they write there.
Why "about"?
Secure Scuttlebutt uses "about" messages to describe people's profile information, and we've adopted that vocabulary.
Also, "about" comes towards the beginning of the alphabet, so if peers sync their documents in alphabetical order by path (which may or may not happen), the /about/ data will be some of the first data synced.
Paths and Write Permissions
Paths
Similar to a key in leveldb or a path in a filesystem, each document is stored at a specific path.
Rules:
// note that double quote is not included,
// it's just part of our notation in this specification
PATH_PUNCTUATION = any of "/'()-._~!$&+,:=@%"
PATH_CHARACTER = ALPHA_LOWER + ALPHA_UPPER + DIGIT + PATH_PUNCTUATION
PATH_SEGMENT = "/" + one or more PATH_CHARACTER
PATH = one or more PATH_SEGMENT
- A path MUST be between 2 and 512 characters long (inclusive).
- A path MUST begin with a /
- A path MUST NOT end with a /
- A path MUST NOT begin with /@, but it may contain /@ in the middle.
- A path MUST NOT contain // (because each PATH_SEGMENT must have at least one PATH_CHARACTER)
- Paths are case sensitive.
- Paths MAY contain upper and/or lower case ASCII letters plus the punctuation and numbers described above.
- Paths MUST NOT contain any characters except those listed above. To include other characters such as spaces, double quotes, emojis, or other non-ASCII characters, apps SHOULD use URL-style percent-encoding as defined in RFC3986. First encode the string as utf-8, then percent-encode the utf-8 bytes.
- A path MUST contain one or more ! characters, anywhere, IF AND ONLY IF the document is ephemeral (that is, it has a deleteAfter value). See the section on Ephemeral Documents.
In the following examples, ... is used to shorten identity addresses for easier reading. ... is not actually related to the Path specification.
Example paths
Valid:
/todos/123
/wiki/shared/Dolphins
/wiki/shared/Dolphin%20Sounds.mp3
/about/~@suzy.bo5sotcn...fua/bio
/wall/@suzy.bo5sotcn...fua/post123.md
Invalid: path segment must have one or more path characters
/
Invalid: missing leading slash
todos/123.json
Invalid: starts with "/@"
/@suzy.bo5sotcn...fua/profile.json
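The path rules can be collected into one checking function. This is a sketch under stated assumptions: pathErrors is a hypothetical name, it covers only the general path rules (not write permissions or file extensions), and whether the document is ephemeral is passed in by the caller.

```typescript
// Every character a path may contain: letters, digits, and the
// PATH_PUNCTUATION set /'()-._~!$&+,:=@%
const ALLOWED_PATH_CHARS = /^[a-zA-Z0-9\/'()\-._~!$&+,:=@%]+$/;

// Returns a list of rule violations; an empty list means the path
// passes the general path rules.
function pathErrors(path: string, isEphemeral: boolean): string[] {
  const errors: string[] = [];
  if (path.length < 2 || path.length > 512) {
    errors.push("length must be between 2 and 512");
  }
  if (path[0] !== "/") {
    errors.push("must begin with /");
  }
  if (path[path.length - 1] === "/") {
    errors.push("must not end with /");
  }
  if (path.slice(0, 2) === "/@") {
    errors.push("must not begin with /@");
  }
  if (path.indexOf("//") !== -1) {
    errors.push("must not contain //");
  }
  if (!ALLOWED_PATH_CHARS.test(path)) {
    errors.push("contains a disallowed character");
  }
  // "!" is required for ephemeral documents and forbidden otherwise.
  if ((path.indexOf("!") !== -1) !== isEphemeral) {
    errors.push("! must appear if and only if the document is ephemeral");
  }
  return errors;
}
```

Returning every violation rather than stopping at the first makes it easier for an app to explain to a user why a path was refused.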
Why these specific punctuation characters?
Earthstar paths are designed to work well in the path portion of a regular web URL.
Why can't a path start with /@?
When building web URLs out of Earthstar pieces, we may want to use formats like this:
https://mypub.com/WORKSPACE/PATH_OR_AUTHOR
https://mypub.com/+gardening.friends/wiki/Dolphins
https://mypub.com/+gardening.friends/@suzy.bo5sotcncvkr7...
(etc)
The restriction on /@ allows us to tell paths and identity addresses apart in this setting. It also encourages app authors to put their data in a more organized top-level prefix such as /wiki/ instead of putting each identity at the root of the path.
Another solution was to use a double slash // to begin paths and avoid confusion with identities:
Don't do this (note the double slash before wiki):
https://mypub.com/+gardening.gardening.bhyux4opeug2ieqcy36exrf4qymc56adwll4zeazm42oamxtr7heq//wiki/Dolphins
...but some webservers treat this as user error and rewrite the double slash to a single slash. So we have to carefully avoid the double slash when building URLs.
Path Characters With Special Meaning
- / - starts paths; separates path segments
- ! - used if and only if the document is ephemeral
- ~ - (tilde) defines author write permissions
- % - for percent-encoding other characters
- +@. - used in share and author addresses but allowed elsewhere too
Path Patterns with Special Meaning
- A path ending with a file extension denotes a document has an attachment
Disallowed Path Characters
The list of ALLOWED characters up above is canonical and exhaustive. This list of disallowed characters is provided only for convenience and is non-normative if it accidentally conflicts with the allowed list.
See the source code src/util/characters.ts for longer notes.
Character - reason for being disallowed
- space - not allowed in URLs
- ASCII whitespace (tab, etc) - not allowed in URLs
- ASCII control characters (bell, etc) - not allowed in URLs, and not visible
- <>"[\]^{|} - not allowed in URLs. {} MAY be used for path templates (see below)
- * - MAY be used for glob-style querying (see below)
- ` backtick - not allowed in URLs
- ? - to avoid confusion with URL query parameters
- # - to avoid confusion with URL anchors
- ; - useful for separating several paths while still being legal in URLs
- non-ASCII chars - (above 0x7F) to avoid trouble with Unicode normalization and canonicalization for signatures, and phishing attacks
Path templates and glob-style querying
Earthstar libraries MAY offer extra ways of querying paths that use the {}*? characters. This is not standardized, but those characters are available because they're not allowed in normal paths.
Example: you might be able to query for /blog/v1/{category}/{postId}.json and get back matching documents with the category and postId extracted into variables for you, similar to the way URL routes are specified in libraries like Express.
Example: you might be able to do "glob-style" queries like /blog/v1/**/*.json.
Side note: The ASCII range of allowed path characters
When handling path strings, you may find yourself needing to choose a separator character that will lexicographically sort before or after all allowed paths.
If you're handling entire paths, this is easy, because all legal paths start with /.
If you're handling path segments (the parts between slashes), the range is wider:
Amongst the allowed path characters, the lowest ASCII value is exclamation mark ! and the highest ASCII value is tilde ~. (Of course, not all ASCII values between those extremes are allowed.)
Therefore if you need an ASCII value that's lower than any possible path segment, anything less than or equal to space (0x20, decimal 32) will do. And the only ASCII value higher than all path characters is DEL (0x7F, decimal 127). Only standard ASCII values are allowed in paths, so there's nothing higher than DEL.
Write Permissions and Path Ownership
Paths can encode information about which identities are allowed to write to them. Documents that break these rules are invalid and will be ignored.
A path is shared if it contains no ~ (tilde) characters. Any author can write a document to a shared path.
A path is owned if it contains at least one ~. An author address immediately following a ~ is allowed to write to this path. Multiple authors can be listed, each preceded by their own ~, anywhere in the path. The author address must begin with its usual leading @.
Example shared paths:
Anyone can write here:
/todos/123
Anyone can write here because there's no tilde "~"
/wall/@suzy.bo5sotcncvkr7p4c3lnexxpb4hjqi5tcxcov5b4irbnnz2teoifua/info.txt
Example owned paths:
Only suzy can write here:
/about/~@suzy.bo5sotcncvkr7p4c3lnexxpb4hjqi5tcxcov5b4irbnnz2teoifua/displayName.txt
Suzy and matt can write here, and nobody else can:
/chat/~@suzy.bo5sotcncvkr7p4c3lnexxpb4hjqi5tcxcov5b4irbnnz2teoifua~@matt.bwnhvniwd3agqclyxl4lirbf3qpfrzq7lnkzvfelg4afexcodz27a/messages.json
The following path can't be written by anyone. It's owned because it contains a tilde ~, but an owner is not specified. Even though the tilde appears without a @ following it, it still acts as a marker of an owned path:
/nobody/can/ever/write/this/path/~
The tilde + identity address pattern can occur anywhere in the path: beginning, middle or end.
Note that documents are mutable but their path can never change (or it would be a different document!) so the ownership of a particular path/document is permanent. You can't change the ownership of a document; you have to create a new document at a different path.
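The ownership rules reduce to a short check. The sketch below uses a hypothetical function name, canWrite, and a plain substring search for the tilde followed by the author's full address; it assumes the address itself has already been validated.

```typescript
// Decide whether an author may write to a path, based on tilde placement.
function canWrite(authorAddress: string, path: string): boolean {
  // No tilde at all: the path is shared and anyone may write.
  if (path.indexOf("~") === -1) {
    return true;
  }
  // Owned path: the author must appear directly after some tilde.
  // A bare "~" with no address after it therefore matches nobody.
  return path.indexOf("~" + authorAddress) !== -1;
}
```

Note that an author address appearing in a path without a preceding tilde, as in the /wall/ example above, grants no special rights and takes none away.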
File Extensions
Documents may have arbitrary binary bytes associated with them, referred to as attachments. If a document is associated with an attachment, it MUST have a file extension.
The file extension is there to help applications know how to interpret attachment data. For example, documents with JPEG image attachments SHOULD have paths ending in .jpg, and documents with MP3 audio attachments SHOULD have paths ending in .mp3.
The file extension MUST be positioned at the end of the path to indicate a document with an attachment:
Valid path for a document with an attachment:
/images/squirrel.png
Invalid path for a document with an attachment:
/images.png/squirrel
A path ending with an identity address MUST NOT be interpreted as a document with an attachment:
/info/~@suzy.bo5sotcncvkr7p4c3lnexxpb4hjqi5tcxcov5b4irbnnz2teoifua
Documents without attachments MUST NOT use the filename extension to indicate how to interpret their contents.
Invalid paths for documents without attachments:
/todos/123.json
/blog/post.md
Valid paths for documents without attachments:
/todos/123
/blog/post
Instead, documents SHOULD enclose the file extension of the text contents in parentheses, excluding the leading . character:
/todos/123(json)
/blog/post(md)
Path and Filename Conventions
Multiple apps can put data in the same share. Here are guidelines to help them interoperate:
The first path segment SHOULD be a description of the data type OR the application that will read/write it. Examples: /wiki/, /chess/, /chat/, /posts/, /earthstagram/, /sillywiki/.
Why?
Peers can selectively sync only certain documents. Starting a path with a descriptive name like /wiki/ makes it easy to sync only wiki documents and ignore the rest. It also lets apps avoid accidentally reading or writing documents from other apps.
Sometimes this first path segment will represent a data type that many apps will support; sometimes it will be named after a specific app.
Consider including a version number in the path representing the version of the data format, like /wiki-v1/ or /wiki/v1/.
Consider choosing a unique name for your app's data, like /magic-todo-list/ instead of an obvious choice like /todos/, to avoid accidental collision with other apps you might not even know about.
Documents and Their Fields
This example document is shown as JSON though it can exist in many serialization formats:
{
author: "@suzy.bce576gvty3ecz5unzynwqwutjzqe6bvhcujec2mimz7n2o5ilkfa",
text: "Flowers are pretty",
textHash: "bt3u7gxpvbrsztsm4ndq3ffwlrtnwgtrctlq4352onab2oys56vhq",
format: "es.5",
path: "/wiki/shared/Flowers",
timestamp: 1668780332430000,
signature: "bjodtzvedk7cgqdngjt4zdj3ufvlsmqc363jht2ygftf73rb6a2huexi6vlopdk6pihijyuv643c3olxardyy2iqzgncj7mss5hqqsdq",
share: "+gardening.bhyux4opeug2ieqcy36exrf4qymc56adwll4zeazm42oamxtr7heq",
shareSignature: "bf5ifu7jjuxbsmyyxq2wcxmylydtdkwaz3h32zedznezlu4icflg3xwrqtds5ooilavr5zfoyasd6lfdccfyet2wegxmhuvwmjwot6dq",
}
Document schema in TypeScript:
interface Doc {
  author: string; // an author address
  text: string; // an arbitrary string of utf-8
  textHash: string; // sha256(text) encoded as base32 with a leading 'b'
  format: "es.5"; // the format version that this document adheres to.
  path: string; // a path
  signature: string; // ed25519 signature of encoded document, signed by author
  timestamp: number; // integer. when the document was created
  share: string; // a share address
  shareSignature: string; // ed25519 signature of encoded document, signed by share
  // optional fields
  deleteAfter?: number; // integer. when the document expires. absent for non-expiring documents.
  attachmentSize?: number; // integer. The size of the document's attachment in bytes. Absent for docs without attachments.
  attachmentHash?: string; // string. The sha256 hash of the document's attachment. Absent for docs without attachments.
}
Here we use the words "fields" and "properties" to mean the same thing.
The fields above are called the "core fields". All core fields are REQUIRED. Some core fields may be null; these MUST NOT be omitted; they MUST be explicitly set to null if they are null.
Extra fields are FORBIDDEN as part of this core document schema.
All string fields MUST be limited to PRINTABLE_ASCII characters except for text, which is utf-8, or string fields can be null if specified above. PRINTABLE_ASCII is defined earlier, and notably does not contain newline or tab characters, which are reserved for use in the serialization format we use for hashing and signing.
All number fields MUST be integers, and cannot be NaN or Infinity, but they can be null if specified above.
The order of fields is unspecified except for hashing and signing purposes (see section below). For consistency, the recommended canonical order is sorted lexicographically by field name.
Document Validity
A document MUST be valid in order to be ingested into a replica, whether from a local write, or a sync, or anywhere else.
Invalid documents MUST be individually ignored when peers are syncing, and the sync MUST NOT be halted just because an invalid document was encountered. Continue syncing in case there are more valid documents.
Documents can be temporarily invalid depending on their timestamp and the current wall clock. Next time a sync occurs, maybe some of the invalid documents will have become valid by that time.
To be valid a document MUST pass ALL these rules, which are described in more detail in the following sections:
- author is a valid identity address string
- text is a compliant string holding utf-8 data
- textHash is the sha256 hash of the text, encoded as base32 with a leading b, for a total length of 53 characters.
- timestamp is an integer between 10^13 and 2^53-2, inclusive
- deleteAfter is absent, or is a timestamp integer in the same range as timestamp
- format is a string of printable ASCII characters
- path is a valid path string
- signature and shareSignature are each a base32 string with a leading b. For the es.5 format it must be 104 characters long including the b.
- share is a valid share address string which matches the local share we are intending to write the document to
- attachmentSize is an integer between 0 and 2^53-2 inclusive, or absent
- attachmentHash is a sha256 hash of the associated attachment, encoded as base32 with a leading b for a total length of 53 characters, or absent
- No extra fields.
- No missing fields (unless the fields may be absent)
- Additional rules about timestamp and deleteAfter relative to the current wall clock (see below)
- author has write permission to path based on tilde placement
- signature is cryptographically valid
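The encoded lengths in the rules above follow from the digest sizes: a sha256 digest is 32 bytes (52 base32 characters without padding) and an ed25519 signature is 64 bytes (103 characters), each with one extra leading b. A minimal sketch of a shape check, where the helper names are hypothetical and not part of any Earthstar API:

```typescript
// Hypothetical helpers: they only check the *shape* of encoded values,
// not whether a hash or signature is cryptographically correct.

// sha256 digest: 32 bytes -> 52 base32 characters, plus the leading "b" = 53.
const HASH_PATTERN = /^b[a-z2-7]{52}$/;

// ed25519 signature: 64 bytes -> 103 base32 characters, plus the leading "b" = 104.
const SIGNATURE_PATTERN = /^b[a-z2-7]{103}$/;

function looksLikeHash(value: string): boolean {
  return HASH_PATTERN.test(value);
}

function looksLikeSignature(value: string): boolean {
  return SIGNATURE_PATTERN.test(value);
}
```

A check like this can reject malformed values cheaply before any cryptographic verification is attempted.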
Author
The author field holds an identity address, formatted according to the rules described earlier in Author Addresses.
Text
The text field contains arbitrary utf-8 encoded data. If the data is not valid utf-8, the document is still considered valid but the library's behavior is undefined when trying to access the content.
The text field may be an empty string. In fact, the recommended way to remove data from Earthstar is to overwrite the document with a new one which has text = "". See Wiping Document Contents for more information.
The maximum length of the text field is eight thousand bytes (8,000 bytes). This is measured as "bytes when encoded as utf-8", not naive string length. (This means the overall document, when encoded as JSON, can be slightly larger than 8,000 bytes - the rest of the fields add about 450 bytes more.)
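The difference between utf-8 byte length and naive string length can be seen with a small sketch. The names MAX_TEXT_BYTES, textByteLength and textFitsLimit are illustrative only:

```typescript
// Limit from the paragraph above: 8,000 bytes when encoded as utf-8.
const MAX_TEXT_BYTES = 8000;

// Measure the text the way the limit is defined: as utf-8 bytes.
function textByteLength(text: string): number {
  return new TextEncoder().encode(text).length;
}

function textFitsLimit(text: string): boolean {
  return textByteLength(text) <= MAX_TEXT_BYTES;
}
```

For example, a string of 4,001 "é" characters has a string length well under 8,000 but encodes to 8,002 bytes, so it does not fit.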
When a document has an associated attachment, the text field's contents MUST have a length greater than zero and SHOULD be used to describe the contents of the attachment. This description can be used by peers to evaluate the content of an attachment before downloading it.
When storing formatted text (e.g. JSON or Markdown) apps SHOULD enclose the file extension in parentheses at the end of the path, e.g. /todos/123(json). If the document has an attachment and formatted text, the pathname SHOULD NOT include this information, in order to prioritize the attachment's file extension instead.
Text Hash
The textHash is the sha256 hash of the text data. The hash digest is then encoded from binary to base32 following the usual Earthstar format, with a leading b.
Note that hash digests are usually encoded in hex format, but we use base32 instead to be consistent with the rest of Earthstar's encodings.
Wrong: binary hash digest —> hex encoded string —> base32 encoded string
Correct: binary hash digest —> base32 encoded string
Also be careful not to accidentally change the text string to a different encoding (such as utf-16) before hashing it: hash the utf-8 bytes.
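The hashing steps above can be sketched as follows. This is an illustration rather than a reference implementation: it assumes a Node.js runtime (for the node:crypto module), and the function names are made up for the example.

```typescript
import { createHash } from "node:crypto";

const BASE32_ALPHABET = "abcdefghijklmnopqrstuvwxyz234567";

// RFC 4648 base32, lowercase, no padding, with the multibase "b" prefix.
function base32encode(bytes: Uint8Array): string {
  let out = "b";
  let buffer = 0; // bits waiting to be emitted
  let bits = 0; // how many bits are waiting
  for (let i = 0; i < bytes.length; i++) {
    buffer = (buffer << 8) | bytes[i];
    bits += 8;
    while (bits >= 5) {
      out += BASE32_ALPHABET.charAt((buffer >> (bits - 5)) & 31);
      bits -= 5;
    }
    buffer &= (1 << bits) - 1; // keep only the bits not yet emitted
  }
  if (bits > 0) {
    // Pad the final group with zero bits on the right.
    out += BASE32_ALPHABET.charAt((buffer << (5 - bits)) & 31);
  }
  return out;
}

// Hash the utf-8 bytes of the text, then encode the *binary* digest directly.
function computeTextHash(text: string): string {
  const binaryDigest = createHash("sha256").update(text, "utf8").digest();
  return base32encode(binaryDigest);
}
```

For an empty string, computeTextHash returns b4oymiquy7qobjgx36tejs35zeqt24qpemsnzgtfeswmrw6csxbkq, the same value this specification gives for wiped attachments.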
Format
The format is a short string describing which version of the Earthstar specification to use when validating and interpreting the document. It's like a schema name for the core Earthstar document format.
It MUST consist only of PRINTABLE_ASCII characters.
The current format version is es.5 ("es" is short for Earthstar).
If the specification is changed in a way that breaks forwards or backwards compatibility, the format version MUST be incremented. The version number SHOULD be a single integer, not a semver.
Other format families may someday exist, such as a hypothetical ssb.1 which would embed Scuttlebutt messages in Earthstar documents, with special rules for validating the original embedded Scuttlebutt signatures as part of validating the document.
Formatter Responsibilities
Earthstar libraries SHOULD separate out the code related to each format, so that they can handle old and new documents side-by-side. Code for handling a format version is called a Formatter. Formatters are responsible for:
- Hashing documents
- Generating new documents (and possibly attachments) from a given input
- Wiping user content from documents (i.e. text and attachments)
- Checking document validity when ingesting documents. See the Document Validity section for more info.
Therefore each different format can have different ways of generating, hashing, signing, and validating documents.
Path
The path field is a string following the rules described in Paths.
The document is invalid if the author does not have permission to write to the path, following the rules described in Write Permissions and Path Ownership.
The path MUST contain at least one ! character, anywhere, IF AND ONLY IF the document is ephemeral (has non-absent deleteAfter).
The path MUST end in a file extension (.something) IF AND ONLY IF the document has an attachment (has non-absent attachmentHash and attachmentSize fields). The last portion of an identity's public address MUST NOT be interpreted as a file extension.
Timestamp
Timestamps are integer microseconds (millionths of a second) since the Unix epoch.
Note this is NOT the default format used by Javascript, which uses milliseconds (thousandths of a second).
// Earthstar timestamps in javascript
let timestamp = Date.now() * 1000;
# Earthstar timestamps in python
import time
timestamp = int(time.time() * 1000 * 1000)
Timestamps MUST be within the following range (inclusive):
// 10^13
let MIN_TIMESTAMP = 10000000000000;
// 2^53 - 2 (Javascript's Number.MAX_SAFE_INTEGER - 1)
let MAX_TIMESTAMP = 9007199254740990;
let timestampIsValid = MIN_TIMESTAMP <= timestamp && timestamp <= MAX_TIMESTAMP;
Why this specific range?
The min timestamp is chosen to reject timestamps that were accidentally computed in milliseconds or seconds.
The max timestamp is one less than the largest safe integer that Javascript can represent (Number.MAX_SAFE_INTEGER - 1).
The range of valid times is approximately 1970-04-26 to 2255-06-05.
Timestamps MUST NOT be from the future from the perspective of the peer accepting them in a sync; but a limited tolerance is allowed to account for clock skew between devices. The recommended value for the future tolerance threshold is 10 minutes, but this can be adjusted depending on the clock accuracy of devices in a deployment scenario.
Timestamps from the future, beyond the tolerance threshold, are (temporarily) invalid and MUST NOT be accepted in a sync. They can be accepted later, after they are no longer from the future.
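The range check and the future tolerance can be combined into one function. The sketch below uses the recommended 10-minute tolerance as a default; the function name and the result labels are illustrative, not part of the specification.

```typescript
const MIN_TIMESTAMP = 10000000000000; // 10^13
const MAX_TIMESTAMP = 9007199254740990; // 2^53 - 2
const DEFAULT_FUTURE_TOLERANCE = 10 * 60 * 1000 * 1000; // 10 minutes in microseconds

type TimestampCheck = "ok" | "out-of-range" | "from-the-future";

// `now` is the accepting peer's wall clock, in microseconds.
function checkTimestamp(
  timestamp: number,
  now: number,
  futureTolerance: number = DEFAULT_FUTURE_TOLERANCE,
): TimestampCheck {
  if (!Number.isInteger(timestamp) || timestamp < MIN_TIMESTAMP || timestamp > MAX_TIMESTAMP) {
    return "out-of-range"; // permanently invalid
  }
  if (timestamp > now + futureTolerance) {
    return "from-the-future"; // temporarily invalid; may be accepted in a later sync
  }
  return "ok";
}
```

Keeping the two failure cases distinct lets a caller treat "from-the-future" documents as candidates for a later sync rather than discarding them as malformed.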
Choosing the future tolerance threshold
In some settings such as in-the-field embedded devices, where devices do not have accurate clocks or connectivity to NTP servers, the future tolerance may be greatly increased. However this enables some possible attacks on the network that can cause instability, so it requires greater trust in the network participants. In extreme cases we may need to add algorithms for the peers to attempt to converge on a rough understanding of the current time to account for clock skew.
In these scenarios, document timestamps should be considered more like version numbers rather than actual meaningful timestamps.
Also see the (non-normative) document How does Earthstar handle timestamps, and can it recover from a device with a very inaccurate clock?
Ephemeral documents: deleteAfter
Documents may be regular or ephemeral. Ephemeral documents have an expiration date, after which they MUST be proactively deleted by Earthstar libraries.
Why have ephemeral documents?
Deleting a regular document leaves behind a small empty document which takes up space. Ephemeral documents are completely removed when they expire, so they are a good choice for applications which will write many short-lived documents.
They also provide more privacy. Users can always delete their regular documents, but that deletion must propagate across all the peers. Ephemeral documents will be deleted from the entire network when they expire even if some peers have lost connectivity or you are not there to request a deletion at that time.
Libraries MUST check for and delete all expired documents at least once an hour (while they are running). Deleted documents MUST be actually deleted, not just marked as ignored. If an expired document has an attachment which no other documents refer to, this attachment MUST also be deleted.
Libraries MUST filter out expired documents from queries and lookups and not return them. Libraries MAY delete them when they are encountered during querying, or MAY wait until the next scheduled hourly deletion time. In addition, libraries MUST NOT return the attachments for expired documents.
Expired documents MUST NOT be sent or accepted during a sync. Both peers in a sync SHOULD filter the incoming and outgoing documents to enforce this. This is the responsibility of the Formatter.
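Filtering expired documents from query results or from a sync batch could look like the sketch below. It assumes a document counts as expired once the current time is strictly later than deleteAfter; the interface and function names are hypothetical.

```typescript
// Only the fields this sketch needs; real documents carry more.
interface MaybeEphemeral {
  path: string;
  deleteAfter?: number; // microseconds; absent for regular documents
}

// Assumption: expired means "now" is strictly past deleteAfter.
function isExpired(doc: MaybeEphemeral, now: number): boolean {
  return doc.deleteAfter !== undefined && now > doc.deleteAfter;
}

// Use on query results and on both incoming and outgoing sync batches.
function withoutExpired<T extends MaybeEphemeral>(docs: T[], now: number): T[] {
  return docs.filter((doc) => !isExpired(doc, now));
}
```

Because the filter does not delete anything, it fits the case where a library waits for its scheduled hourly deletion pass to actually remove the documents.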
The deleteAfter field holds the timestamp after which a document is to be deleted. It is a timestamp with the same format and range requirements as the regular timestamp field.
Regular, non-ephemeral documents omit the deleteAfter field.
Unlike the timestamp field, the deleteAfter field is expected to be in the future compared to the current wall-clock time. Once the deleteAfter time is in the past, the document becomes invalid.
The deleteAfter time MUST be strictly greater than the document's timestamp.
The document path MUST contain at least one exclamation mark (!) character IF AND ONLY IF the document is ephemeral. Regular, non-ephemeral documents MUST NOT have any ! characters in their paths.
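The deleteAfter rules above can be gathered into a single check. This sketch returns a description of the first broken rule, or null when every rule passes. The function name and messages are illustrative, and it assumes a deleteAfter exactly equal to the current time is still acceptable.

```typescript
const MIN_TIMESTAMP = 10000000000000; // 10^13
const MAX_TIMESTAMP = 9007199254740990; // 2^53 - 2

interface EphemeralFields {
  path: string;
  timestamp: number;
  deleteAfter?: number;
}

function checkEphemeralRules(doc: EphemeralFields, now: number): string | null {
  const hasBang = doc.path.includes("!");
  if (doc.deleteAfter === undefined) {
    // Regular document: no "!" allowed anywhere in the path.
    return hasBang ? "regular documents must not have ! in their path" : null;
  }
  if (!hasBang) {
    return "ephemeral documents must have ! in their path";
  }
  if (
    !Number.isInteger(doc.deleteAfter) ||
    doc.deleteAfter < MIN_TIMESTAMP ||
    doc.deleteAfter > MAX_TIMESTAMP
  ) {
    return "deleteAfter is out of range";
  }
  if (doc.deleteAfter <= doc.timestamp) {
    return "deleteAfter must be strictly greater than timestamp";
  }
  if (doc.deleteAfter < now) {
    return "document has already expired";
  }
  return null;
}
```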
Ephemeral documents MAY be edited by users to change the expiration date. This works best if the expiration date is increased into the future. If it's decreased so it expires sooner, the document may sync in unpredictable ways (see below for another example of this). If it's set to expire in the past, the document won't even sync off of the current peer because other peers will reject it, so the edit won't propagate. When shortening the expiration date there should be time for the edit to propagate across the entire network of peers before the document expires.
Why ephemeral documents need a ! in their path
Regular and ephemeral documents with the same path could interact in surprising ways. To avoid this, we enforce that they can never collide on the same path.
(An ephemeral document could propagate halfway across a network of peers, overwriting a regular document with the same path, and then expire and get deleted wherever it has spread. Then the regular document would regrow to fill the empty space.
But if the ephemeral document traveled across the entire network and exterminated the regular document, and THEN expired, there would be nothing left.
Which of these cases occurred would depend on how long the document took to spread, which could be very fast or could take months if there was a peer that was usually offline. We'd like to avoid this unpredictability.)
Signature
The ed25519 signature by the author, encoded in base32 with a leading b.
See Serialization for Hashing and Signing, below, for details.
Like the hashes and crypto keys in Earthstar, this is the raw binary signature encoded directly into base32. Do not encode the binary signature into a hex string and then into base32.
Share
The share field holds a share address, formatted according to the rules described in Share Addresses.
As a consequence, each document belongs to exactly one share and cannot be moved to another share (because that would cause the signature to become invalid).
Share signature
The ed25519 signature by the share keypair (share address + secret), encoded in base32 with a leading b.
See Serialization for Hashing and Signing, below, for details.
Like the hashes and crypto keys in Earthstar, this is the raw binary signature encoded directly into base32. Do not encode the binary signature into a hex string and then into base32.
attachmentHash and attachmentSize
Attachments: Documents may have attachments, which are arbitrary binary data. To be considered valid, a document must have the following IF AND ONLY IF said document has an attachment:
- An attachmentSize field with the size of the document's attachment in bytes.
- An attachmentHash field with the sha256 hash of the attachment encoded as base32.
- A path ending with a file extension, e.g. /docs/notes.pdf
Why documents with attachments need file extensions
File extensions are a human-readable and compact way to indicate what an attachment contains and how it should be interpreted. In addition, because only documents with attachments may have a file extension, documents with and without attachments can never collide on the same path.
Wiping document contents
The only way to truly delete every trace of a document is to use an ephemeral document.
However, non-ephemeral documents may have their contents (text and attachment) wiped from them. These documents retain all other fields, such as their path, author, and timestamp, and will continue to be synced with other peers. The attachment's bytes will be erased if no reference to them exists in any other document.
A document can be wiped by setting the text field to an empty string.
To erase an attachment, the attachmentSize field MUST be set to zero and the attachmentHash field MUST be set to the sha256 hash of an empty string encoded to base32 (b4oymiquy7qobjgx36tejs35zeqt24qpemsnzgtfeswmrw6csxbkq). The text field of the same document MUST be an empty string.
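As an illustration of the field changes involved in a wipe, the sketch below produces the wiped field values from an existing document. The interface and function name are hypothetical. Since textHash is defined as the hash of text, it becomes the empty-string hash as well, and both signatures would then need to be regenerated (not shown here).

```typescript
// sha256 of the empty string, base32-encoded with a leading "b".
const EMPTY_SHA256_BASE32 = "b4oymiquy7qobjgx36tejs35zeqt24qpemsnzgtfeswmrw6csxbkq";

// Only the fields this sketch touches, plus the path for identification.
interface WipeableDoc {
  path: string;
  text: string;
  textHash: string;
  attachmentSize?: number;
  attachmentHash?: string;
}

// Returns a copy with its contents wiped; the input is not mutated.
function wipeContents(doc: WipeableDoc): WipeableDoc {
  const wiped: WipeableDoc = { ...doc, text: "", textHash: EMPTY_SHA256_BASE32 };
  if (doc.attachmentHash !== undefined) {
    // Documents with attachments keep both fields, set to the "empty" values.
    wiped.attachmentSize = 0;
    wiped.attachmentHash = EMPTY_SHA256_BASE32;
  }
  return wiped;
}
```

Note that the attachment fields are not removed: the path still ends in a file extension, so the document must still count as having an attachment.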
A peer MUST delete an attachment with no corresponding documents within an hour.
Document Serialization
There are 3 scenarios when we need to serialize a document to/from a series of bytes:
- Hashing and signing
- Network transmission
- Storage
They have different needs and we use different formats for each.
Serialization for Hashing and Signing
When a signature is produced for a document, it's actually signing a hash of the document. We need a deterministic, standardized, and simple way to serialize a document to a sequence of bytes that we can hash. This is a one-way conversion — we never need to deserialize this format.
Earthstar libraries MUST use this exact process.
To hash a document:
// Pseudocode
let hashDocument(document): string => {
    // Get a deterministic hash of a document as a base32 string.
    // Preconditions:
    //   All string fields must be printable ASCII only
    //   Fields must have one of the following types:
    //     string
    //     integer
    //   Note that "string | integer" is not allowed
    //   because we'd have no way of telling "123" apart from 123.
    let accum: string = '';
    For each field and value in the document, sorted in lexicographic order by field name: {
        // Skip the text and signature fields
        if (field === 'text' || field === 'signature') { continue; }
        // Otherwise, append the field name and value.
        // Tab and newline are our field separators.
        // Convert integers to strings here.
        accum += field + "\t" + value + "\n"
        // (The newline is included on the last field.)
    }
    // Binary digest, not hex digest string!
    let binaryHashDigest = sha256(accum).digest();
    // Convert bytes to Earthstar b32 format with leading 'b'
    return base32encode(binaryHashDigest);
}
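The serialization step of the pseudocode above can be sketched in runnable TypeScript. The function name is hypothetical, and the skip list mirrors the pseudocode literally and is a parameter so it can be adjusted. The resulting string would then be hashed with sha256 and the binary digest encoded as base32, which is not shown here.

```typescript
type FieldValue = string | number;

// Build the tab/newline-separated string that gets hashed.
// Absent optional fields (undefined) contribute nothing.
function serializeForHashing(
  doc: Record<string, FieldValue | undefined>,
  skipFields: string[] = ["text", "signature"],
): string {
  let accum = "";
  // Default sort() orders plain ASCII field names lexicographically.
  for (const field of Object.keys(doc).sort()) {
    const value = doc[field];
    if (skipFields.includes(field) || value === undefined) {
      continue;
    }
    // Integers are converted to decimal strings; the last field also ends in "\n".
    accum += field + "\t" + String(value) + "\n";
  }
  return accum;
}
```

Because the field names are sorted before serializing, the order in which fields appear in the in-memory object does not affect the result.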
To sign a document:
// Pseudocode
let signDocument(authorOrShareKeypair, document): void => {
    // Sign the document and store the signature into the document (mutating it).
    // authorOrShareKeypair contains a pubkey and a private key.
    let binarySignature = ed25519sign(
        authorOrShareKeypair,
        hashDocument(document)
    );
    // Convert bytes to Earthstar b32 format with leading 'b'
    let base32signature = base32encode(binarySignature);
    // When signing with the share keypair, the result is stored
    // in document.shareSignature instead.
    document.signature = base32signature;
}
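The signing step can be tried out with Node.js's built-in ed25519 support. This is a sketch under assumptions: it uses Node KeyObject keys rather than Earthstar's base32-encoded keypairs, and it assumes the bytes being signed are the utf-8 bytes of the document hash string. The function names are hypothetical.

```typescript
import { generateKeyPairSync, sign, verify, KeyObject } from "node:crypto";

const BASE32_ALPHABET = "abcdefghijklmnopqrstuvwxyz234567";

// RFC 4648 base32, lowercase, no padding, with the multibase "b" prefix.
function base32encode(bytes: Uint8Array): string {
  let out = "b";
  let buffer = 0;
  let bits = 0;
  for (let i = 0; i < bytes.length; i++) {
    buffer = (buffer << 8) | bytes[i];
    bits += 8;
    while (bits >= 5) {
      out += BASE32_ALPHABET.charAt((buffer >> (bits - 5)) & 31);
      bits -= 5;
    }
    buffer &= (1 << bits) - 1;
  }
  if (bits > 0) {
    out += BASE32_ALPHABET.charAt((buffer << (5 - bits)) & 31);
  }
  return out;
}

// Assumption: the signed message is the utf-8 bytes of the base32 hash string.
function signDocHash(docHash: string, privateKey: KeyObject): Uint8Array {
  // Ed25519 takes no separate digest algorithm, so the first argument is null.
  return sign(null, Buffer.from(docHash, "utf8"), privateKey);
}

function verifyDocHash(docHash: string, signature: Uint8Array, publicKey: KeyObject): boolean {
  return verify(null, Buffer.from(docHash, "utf8"), publicKey, signature);
}
```

The raw signature is 64 bytes, so base32encode(signature) yields the 104-character string (including the leading b) required by the validity rules.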
Preconditions that make this work:
- Documents can only hold integers, strings, and null — no floats or nested objects that could increase complexity or be nondeterministic
- No document field name or field content can contain \t or \n, except text, which is not directly used (we use textHash instead). So we can safely use tab and newline as field separators.
- We don't need to worry about telling strings and integers apart because each field can hold an integer, or a string, but not both. So we don't need to quote our strings with quote marks.
Why use textHash instead of text for hashing documents?
This lets us drop the actual content (to save space) but still verify the document signature.
Serialization for Network
This is a two-way conversion between memory and bytes.
Earthstar doesn't have strong opinions about networking. This format does not need to be standardized, but it's good to choose widely used familiar tools.
Apps and libraries SHOULD use JSON (encoded as UTF-8) as a default choice unless there are important reasons to choose otherwise. JSON is widely known, widely supported, and fits within most network protocols easily.
Serialization for Storage
This is a two-way conversion between memory and bytes.
It does not need to be standardized; each implementation can use its own format.
It needs to support efficient mutation and deletion of documents, and querying by various properties.
It would be nice if this was an archival format (corruption-resistant and widely known).
Options to consider:
- SQLite
- Postgres
- IndexedDB
- leveldb or similar key-value databases (with extra indexes)
- a bunch of JSON files, one for each document (with extra indexes)
For exporting and importing data:
- one giant newline-delimited JSON file, one document per line, is easier to parse than a giant JSON array of documents, and streamable.
Querying
Libraries SHOULD support a standard variety of queries against a database of Earthstar documents. A query is specified by a single object with optional fields for each kind of query operation.
This query format will become standardized because it will be used for querying from one peer to another. It's not quite stable yet.
This only supports relatively simple ways of querying and filtering documents because we want to make it easy to use many different kinds of backend storage which may have limited query capabilities. Apps and libraries MAY add extensions for more powerful querying if they're able to, but this should be considered the minimal set for compatibility across peers.
The recommended query object format, expressed in Typescript:
export interface Query {
/** Whether to fetch all historical versions of a document or just the latest versions. */
historyMode?: HistoryMode;
// "path ASC" is actually "path ASC then break ties with timestamp DESC"
// "path DESC" is the reverse of that
/** The order to return docs in. Defaults to `path ASC`. */
orderBy?: "path ASC" | "path DESC" | "localIndex ASC" | "localIndex DESC";
/** Only fetch documents which come after a certain point. */
startAfter?: {
/** Only documents after this localIndex. Only works when ordering by localIndex. */
localIndex?: number;
/** Only documents after this path. Only works when ordering by path. */
path?: string;
};
// then apply filters, if any
filter?: {
path?: string;
pathStartsWith?: string;
pathEndsWith?: string;
author?: string;
timestamp?: number;
timestampGt?: number;
timestampLt?: number;
};
/** The maximum number of documents to return. */
limit?: number;
formats?: string[];
}
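As a rough illustration of how the filter part of a query might be applied to documents held in memory, consider the sketch below. The matching semantics are assumptions read off the field names (exact match for path, author and timestamp; strict comparison for timestampGt and timestampLt), and the function and interface names are hypothetical.

```typescript
// Same shape as the `filter` field of the Query interface above.
interface QueryFilter {
  path?: string;
  pathStartsWith?: string;
  pathEndsWith?: string;
  author?: string;
  timestamp?: number;
  timestampGt?: number;
  timestampLt?: number;
}

// Only the document fields the filter looks at.
interface FilterableDoc {
  path: string;
  author: string;
  timestamp: number;
}

// A document matches only if it satisfies every filter field that is present.
function matchesFilter(doc: FilterableDoc, filter: QueryFilter): boolean {
  if (filter.path !== undefined && doc.path !== filter.path) return false;
  if (filter.pathStartsWith !== undefined && !doc.path.startsWith(filter.pathStartsWith)) return false;
  if (filter.pathEndsWith !== undefined && !doc.path.endsWith(filter.pathEndsWith)) return false;
  if (filter.author !== undefined && doc.author !== filter.author) return false;
  if (filter.timestamp !== undefined && doc.timestamp !== filter.timestamp) return false;
  if (filter.timestampGt !== undefined && !(doc.timestamp > filter.timestampGt)) return false;
  if (filter.timestampLt !== undefined && !(doc.timestamp < filter.timestampLt)) return false;
  return true;
}
```

A storage backend with real indexes would translate these fields into its own query language instead of scanning, but the results should agree with a scan like this one.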
Syncing
Syncing is the process of trading documents between two peers to bring each other up to date.
Syncing can occur locally (within a process, between two Storage instances) as well as across a network.
Documents are locked into specific shares; therefore syncing can't transfer documents between shares, only between different peers that hold the same share.
The method used by peers to sync with each other is not yet stable, and will be standardized in a separate specification.
Share Secrecy
Knowing a share address makes it possible to create a replica for that share and sync data from other peers who know about it. Therefore a share address should only be shared with those you wish to have read access to the share's data. This could be the wider public, or a small group, or a single individual.
Writing new documents to a share is only possible with that share's secret. This secret should not be shared publicly.
It MUST be impossible to discover new shares through the syncing process. Peers MUST keep their shares secret and only transmit data when they are sure the other peer also knows the address of the same share.
One method of disclosing which shares two peers have in common is as follows:
- Peer A generates a random string to be used as a salt during hashing.
- Peer A hashes the share addresses it knows of using the salt.
- Peer A sends the salt and hashed addresses to Peer B.
- Peer B hashes the share addresses it knows of using the salt obtained from Peer A.
- Peer B sees if any of the hashes it has produced matches any of the hashes it received from Peer A.
If there is a match, then both peers have that share in common. Where there are no matches, each peer will still have no knowledge of which shares the other knows of.
They can now proceed to sync each of their common shares.
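The salted-hash exchange described above can be sketched as follows. How the salt and address are combined, the hash function, and the output encoding are not fixed by this document; the sketch assumes sha256 over the salt concatenated with the address, hex-encoded, on a Node.js runtime. All names and example share addresses are hypothetical.

```typescript
import { createHash, randomBytes } from "node:crypto";

interface ShareOffer {
  salt: string;
  hashes: string[];
}

// Assumption: hash = sha256(salt + address), hex-encoded.
function saltedHash(salt: string, shareAddress: string): string {
  return createHash("sha256").update(salt + shareAddress, "utf8").digest("hex");
}

// Peer A: pick a random salt and hash every known share address with it.
function makeOffer(knownShares: string[]): ShareOffer {
  const salt = randomBytes(16).toString("hex");
  return { salt, hashes: knownShares.map((share) => saltedHash(salt, share)) };
}

// Peer B: hash its own addresses with A's salt and keep the ones that match.
function findCommonShares(offer: ShareOffer, knownShares: string[]): string[] {
  const received = new Set(offer.hashes);
  return knownShares.filter((share) => received.has(saltedHash(offer.salt, share)));
}
```

Peer B learns nothing about Peer A's other shares, because it can only recognise hashes of addresses it already knows.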
Eavesdropping
An eavesdropper observing this exchange will know the salt and the hashed addresses, and can confirm that the peers have or don't have shares that the eavesdropper already knows about, but can't un-hash the exchanged values to get the share addresses they don't already know.
But once the peers start trading actual share data, an eavesdropper can observe the share addresses in plaintext in the exchanged documents.
Peers SHOULD thus talk to each other over an encrypted connection such as HTTPS.
Resolving Conflicts
See the Data model section for details about conflict resolution.
Future Directions
These are not implemented or specified yet:
Transport Encryption
Peers could encrypt their communications with SSL, Noise Protocol, etc.
Document Encryption
The document text field can be encrypted by apps in any way they like. We have some convenient keys available already:
- You can write a private message to one author using their public key
- You can write a private message to the entire share using the share public key (if it's an invite-only share)
However, the rest of the document metadata will be in plaintext including the author, path, and timestamp, which might reveal important information. App authors would have to use uninformative paths. This issue discusses ways of nesting the metadata inside another document to obscure it.
Immutable Documents
Documents that can't be edited. They may or may not be able to be deleted, and they may or may not be ephemeral (expiring).
This would probably involve a new optional document field, immutable.
See more in this issue.