Z85m: a proposal for padding odd-sized inputs for the Z85 (Ascii85 / Base85) encoding system, with good space-efficiency

Possibly the most space-efficient ASCII- or text-based encodings are the Base85 or Ascii85 series of formats, as described on the Wikipedia page:


Ascii85, also called Base85, is a form of binary-to-text encoding developed by Paul E. Rutter for the btoa utility. By using five ASCII characters to represent four bytes of binary data (making the encoded size 1/4 larger than the original, assuming eight bits per ASCII character), it is more efficient than uuencode or Base64, which use four characters to represent three bytes of data (1/3 increase, assuming eight bits per ASCII character).

https://en.wikipedia.org/wiki/Ascii85

There are two fairly minor but annoying issues which come with the version of Ascii85 as defined in RFC1924:

  1. the output of the encoding is not very “shell-safe” or “source-code safe” – the output may contain characters that have unexpected side-effects under several string-quoting conventions.
  2. no padding mechanism is defined; this is a problem because Ascii85 requires all input to be an integer multiple of 4 bytes in length, due to the method of encoding. In the real world it’s quite common for binary data to be a multiple of 4 bytes in length, but it’s not assured.

The Wikipedia page references several implementation-specific solutions to address padding and truncation, including magic postamble sequences (Adobe) with characters that are in the set of potential encoding characters.

If one is choosing to use a Base85 representation to squeeze optimum space efficiency when encoding binary data, it seems odd to impose a fixed overhead upon all strings; also it would be desirable for the encoded strings to be moderately shell-safe.

The Z85 encoding scheme is a lot cleaner and more shell-safe than Ascii85 (contains tilde, single-quote, double-quote, backtick, and dollar) or RFC1924 (contains tilde, backtick and dollar) – the only particularly worrying Z85 encoding-character is dollar, which at least is a fairly visible character which (where applicable) developers are frequently used to “escaping” mid-string.

However, Z85 (again) does not propose a standard padding algorithm.

One option for padding data prior to Z85 encoding might be to use PKCS#7 padding, where it is assumed that the last decoding block is either entirely padding or has been partly padded, and where each pad byte represents a count of how many bytes to truncate from the output. This is popular, but again is space-inefficient, especially where the input is already a multiple of 4 bytes, in which case an entire extra 4 bytes of input padding (5 characters of output) will be generated.
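To make that overhead concrete, here is a minimal Python sketch of the length arithmetic. The function names are illustrative only and do not come from any library; a PKCS#7 block size of 4 bytes is assumed so that the padded data always fits the Z85 block size.

```python
def z85_len(n_bytes: int) -> int:
    """Characters needed to Z85-encode n_bytes (must be a multiple of 4)."""
    assert n_bytes % 4 == 0
    return n_bytes // 4 * 5


def pkcs7_encoded_len(n_bytes: int, block: int = 4) -> int:
    """Encoded length when PKCS#7 padding to `block` bytes is applied first."""
    pad = block - (n_bytes % block)  # always 1..block, never 0
    return z85_len(n_bytes + pad)


for n in (3, 4, 5, 8):
    print(n, "bytes ->", pkcs7_encoded_len(n), "characters with PKCS#7")
```

For a 4-byte input this gives 10 characters, against 5 for the same input encoded without padding.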

Z85 with Muffett Padding (“Z85m”)

If space is at a premium (for instance, if small binary objects are being compactly safety-encoded to Z85 ASCII strings that will then be re-encoded to QR codes or similar), then it would be optimal to “waste” at most one extra byte of encoded data in order to do so. We proceed as follows:

Version 1.0 — 26 April 2021

  • We propose an extension of the Z85 encoding standard at https://rfc.zeromq.org/spec/32/
  • The standard is extended so that any fixed binary input which is not an integer multiple of 4 bytes in size has its final block padded with enough bytes (1, 2, or 3) in order to make up that block to a total size of 4 bytes.
  • The VALUE of each pad byte MUST EQUAL the COUNT of pad bytes being used (i.e., 0x01, 0x02, or 0x03 for one, two, or three pad bytes respectively).
  • That VALUE is then encoded as a single extra Z85 character and appended to the primary encoded data; thus any non-4-byte-aligned data costs only 1 extra character to pad.
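The rules above can be sketched as a small Python encoder. This is an illustration under stated assumptions rather than a reference implementation: the names `Z85_ALPHABET` and `z85m_encode` are made up for this example, the alphabet string is the one given in the Z85 specification, and each 4-byte block is treated as a big-endian integer as in standard Z85.

```python
# Z85 alphabet as listed in the Z85 specification (85 characters).
Z85_ALPHABET = (
    "0123456789abcdefghijklmnopqrstuvwxyz"
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ.-:+=^!/*?&<>()[]{}@%$#"
)


def z85m_encode(data: bytes) -> str:
    """Encode bytes as Z85, appending one pad-count character if padding was needed."""
    pad = -len(data) % 4                  # 0, 1, 2 or 3 pad bytes
    padded = data + bytes([pad]) * pad    # each pad byte holds the pad count
    out = []
    for i in range(0, len(padded), 4):
        value = int.from_bytes(padded[i:i + 4], "big")
        digits = []
        for _ in range(5):                # five base-85 digits, least significant first
            value, digit = divmod(value, 85)
            digits.append(Z85_ALPHABET[digit])
        out.extend(reversed(digits))
    if pad:
        out.append(Z85_ALPHABET[pad])     # the suffix character "1", "2" or "3"
    return "".join(out)
```

Input that is already a multiple of 4 bytes passes through as plain Z85 with no suffix, so the output is then identical to that of an unmodified Z85 encoder.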

Example Encoding

Using the standard Z85 encoding specification:

  1. input: "Hi!"
  2. convert to ASCII hex: 48-69-21 (three bytes)
  3. pad with 1 byte also of VALUE 0x01: yields 48-69-21-01 (…to make a multiple of 4 bytes)
  4. convert to integer: 0x48692101 is 1214849281 decimal (for convenience)
  5. cast-out to base85: 23-23-15-19-41 (…the value is now 5 “digits” in base85)
  6. encode those base85 “digits” in z85: n-n-f-j-F
  7. append the Z85 encoding of 0x01 (which is “1”); final output: “nnfjF1”

Final blocks which require padding with two bytes will be padded with 0x0202, and similarly three padding bytes will be 0x030303, and a “2” or “3” will be suffixed respectively. Inputs which are naturally multiples of 4 bytes will not be padded.
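The arithmetic in steps 3 to 5 of the worked example can be reproduced with a few lines of Python. The helper name `base85_digits` is illustrative only, and big-endian byte order is assumed, as in standard Z85.

```python
def base85_digits(block: bytes) -> list:
    """Return the five base-85 digits of a 4-byte block, most significant first."""
    value = int.from_bytes(block, "big")
    digits = []
    for _ in range(5):
        value, digit = divmod(value, 85)
        digits.append(digit)
    return digits[::-1]


padded = b"Hi!" + b"\x01"                # step 3: pad to four bytes
print(int.from_bytes(padded, "big"))     # step 4: 1214849281
print(base85_digits(padded))             # step 5: [23, 23, 15, 19, 41]
```

Looking up those five digits in the Z85 alphabet gives the characters n, n, f, j, F from step 6.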

Decoding

Decoding is the inverse operation. Per the Z85 specification, any encoded text MUST be a multiple of 5 characters in length, EXCEPT that:

  • we now permit an optional “6th and final” trailing character
  • which MUST decode to a value in the range 1..3 inclusive
  • presence of which triggers “unpadding” code for the terminal 5-tuple
  • and ALL of the unpacked pad bytes MUST match the decoded VALUE of the trailing character, else the decoder MUST throw an error
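A minimal sketch of a decoder that applies these checks might look as follows. The names `Z85_INDEX` and `z85m_decode` are hypothetical, and the choice of `ValueError` for padding failures is an assumption of this example, not something the specification mandates.

```python
Z85_ALPHABET = (
    "0123456789abcdefghijklmnopqrstuvwxyz"
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ.-:+=^!/*?&<>()[]{}@%$#"
)
Z85_INDEX = {ch: i for i, ch in enumerate(Z85_ALPHABET)}


def z85m_decode(text: str) -> bytes:
    """Decode Z85 text, honouring an optional trailing pad-count character."""
    if len(text) % 5 not in (0, 1):
        raise ValueError("length must be 5n or 5n+1 characters")
    body_len = len(text) - len(text) % 5
    body, tail = text[:body_len], text[body_len:]
    out = bytearray()
    for i in range(0, len(body), 5):
        value = 0
        for ch in body[i:i + 5]:
            value = value * 85 + Z85_INDEX[ch]   # KeyError on a non-Z85 character
        out += value.to_bytes(4, "big")          # OverflowError if value >= 2**32
    if tail:
        pad = Z85_INDEX[tail]
        if not 1 <= pad <= 3 or len(out) < 4:
            raise ValueError("trailing character must decode to 1..3")
        if out[-pad:] != bytes([pad]) * pad:
            raise ValueError("pad bytes do not match the trailing character")
        del out[-pad:]                           # strip the verified pad bytes
    return bytes(out)
```

With this sketch, `z85m_decode("nnfjF1")` returns `b"Hi!"`, while `z85m_decode("nnfjF2")` raises because the final two bytes are not 0x0202.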

Such unpadding should be easily implementable in a streamable fashion via a one-block-behind approach to decoding.
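One way the one-block-behind idea could be realised is a generator that holds back the most recently decoded block until it knows whether a pad-count character follows. This is a sketch under assumptions: the names `z85m_decode_stream` and `_decode_block` are invented for illustration, and input is assumed to arrive as an iterable of text chunks of arbitrary size.

```python
Z85_ALPHABET = (
    "0123456789abcdefghijklmnopqrstuvwxyz"
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ.-:+=^!/*?&<>()[]{}@%$#"
)
Z85_INDEX = {ch: i for i, ch in enumerate(Z85_ALPHABET)}


def _decode_block(five: str) -> bytes:
    """Decode exactly five Z85 characters into four bytes."""
    value = 0
    for ch in five:
        value = value * 85 + Z85_INDEX[ch]
    return value.to_bytes(4, "big")


def z85m_decode_stream(chunks):
    """Yield decoded bytes from text chunks, emitting each block one step late."""
    pending = ""   # characters not yet decoded
    held = None    # latest decoded block, withheld until we see what follows
    for chunk in chunks:
        pending += chunk
        while len(pending) >= 5:
            if held is not None:
                yield held                 # a later block exists, so this one is safe
            held = _decode_block(pending[:5])
            pending = pending[5:]
    if len(pending) > 1:
        raise ValueError("length must be 5n or 5n+1 characters")
    if pending:                            # exactly one trailing pad-count character
        pad = Z85_INDEX[pending]
        if held is None or not 1 <= pad <= 3 or held[-pad:] != bytes([pad]) * pad:
            raise ValueError("bad Z85m padding")
        held = held[:-pad]
    if held:
        yield held
```

For example, `b"".join(z85m_decode_stream(["nn", "fjF", "1"]))` gives `b"Hi!"`; only the final block is ever buffered, so memory use stays constant regardless of input length.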

Test Vectors

Hex                                   Z85m
0xCA                                  :]..F3
0xCAFE                                +kICz2
0xCAFEBA                              +kO%M1
0xCAFEBABE                            +kO#^
0xCAFEBABEDEADBEEFFEEDD00DBAAD        +kO#^?MsJX@{w5wX#=2n2
0xCAFEBABEDEADBEEFFEEDD00DBAADF00D    +kO#^?MsJX@{w5wX#>Dh
Test vectors for Z85m

Risks

The chief risk of this padding scheme is, essentially, unchecked truncation; by not mandating that all inputs MUST be padded, it becomes impossible to detect truncation of input to a 4-byte boundary. Loss of the single trailing padding character might be detected by looking for a trailing 0x01, 0x0202, or 0x030303 string of bytes in the output, but detection of such would not constitute “proof”.
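The heuristic mentioned above, and the reason it falls short of proof, can be illustrated with a short Python sketch. The function name `has_suspicious_tail` is hypothetical.

```python
def has_suspicious_tail(decoded: bytes) -> bool:
    """Heuristic only: True if the output ends in 0x01, 0x0202 or 0x030303."""
    return any(decoded.endswith(bytes([n]) * n) for n in (1, 2, 3))


print(has_suspicious_tail(b"Hi!\x01"))            # suffix character was lost: True
print(has_suspicious_tail(b"\x00\x00\x00\x01"))   # genuine 4-byte data: also True
print(has_suspicious_tail(b"\xca\xfe\xba\xbe"))   # False
```

The second case is the problem: legitimate unpadded data can end in the same byte pattern, so a positive result only hints at a lost suffix.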

That said: in the modern era, I would accept this risk. There are plenty of “envelope” serialization formats that would address or detect truncated data, and I feel that in space-sensitive applications it is those which should bear the burden of integrity.

License

Note: The following text intentionally echoes Z85, however the Digistan website appears to have been having severe problems for some time, so this text may be revised for clarity/correctness in the future.

Copyright (c) 2021 Alec Muffett.

This Specification is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 3 of the License, or (at your option) any later version. This Specification is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with this program; if not, see http://www.gnu.org/licenses.

This Specification is a free and open standard and is governed by the Digital Standards Organization’s Consensus-Oriented Specification System.

The key words “MUST”, “MUST NOT”, “REQUIRED”, “SHALL”, “SHALL NOT”, “SHOULD”, “SHOULD NOT”, “RECOMMENDED”, “MAY”, and “OPTIONAL” in this document are to be interpreted as described in RFC 2119.
