I’m just reading up on BMDiff:
[BigTable] compression looks for similar values along the rows, columns, and times. They use variations of BMDiff and Zippy. BMDiff gives them high write speeds (~100MB/s) and even faster read speeds (~1000MB/s).
…which led to reading the paper:
We propose a data compression scheme that recognizes the second occurrence of the input text as a repetition. It then represents the second string with a reference to the first, using just a few bytes.
…and thence:
Because we represent only long common strings, we are free to use an inefficient representation. We will represent a repeated string by the sequence “<start,length>”, where start is the initial position and length is the size of the common sequence. For instance, the Constitution begins:
The Constitution of the United States PREAMBLE We, the people of the United States, in order to form a more perfect Union, …
Its compressed form begins:
The Constitution of the United States PREAMBLE We, the people <16,21>, in order to form a more perfect Union, …
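The notation is easy enough to play with. Here is a minimal sketch of a decoder for the `<start,length>` form, assuming `start` is a 0-based offset into the already-decoded text; the paper doesn’t spell out the exact convention, and the function name `decode` is mine:

```python
import re

def decode(compressed: str) -> str:
    """Expand <start,length> references against the text decoded so far."""
    out = []
    pos = 0
    for m in re.finditer(r"<(\d+),(\d+)>", compressed):
        # copy the literal text before this reference
        out.append(compressed[pos:m.start()])
        start, length = int(m.group(1)), int(m.group(2))
        # resolve the reference against everything decoded so far
        text_so_far = "".join(out)
        out.append(text_so_far[start:start + length])
        pos = m.end()
    out.append(compressed[pos:])  # trailing literal text
    return "".join(out)
```

So `decode("abcdef<0,3>gh")` gives back `"abcdefabcgh"` — the reference `<0,3>` re-reads three characters starting at offset 0. Because only long common strings get replaced, the inefficiency of spelling out start and length in decimal doesn’t matter much, which is exactly the paper’s point.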
The reason that I am reading this paper with joy is not only that I like Doug McIlroy’s writing style, but also their conclusions about combining BMDiff with Gzip:
The file sizes are given in megabytes, and the percentages show the effectiveness of each compression. By itself, com 1000 reduces the file to about 86% of its original size, while gzip reduces it to about 35% of its original size. If we apply gzip after applying com 1000, though, gzip is almost as effective as before: the two methods appear to factor. Our precompression algorithm compressed three different kinds of long common strings that went unnoticed by gzip …
…comments about gzip and pre-processing which are startlingly familiar from somewhere:
How does the DAWG dictionary-compression algorithm work?
Essentially it is a preprocessor for gzip that removes redundancy from a sorted list of words, and typically shrinks an input wordlist by some 50% without negatively impacting gzip’s ability to further compress the file. In the new version of the DAWG code – slightly improved over the version that ships with Crack v5.0, but fundamentally the same – all you need do is:
- sort the wordlist into normal Unix order. (beware localization!)
- for each word that the DAWG preprocessor reads…
- count how many leading characters it shares with the previous word that was read…
- encode that number as a character from the set [0-9A-Za-z] for values 0..61 (if the value is >61 then stop there)
- print said character (the encoded number) and the remaining stem of the word
- end-for-loop
[…]
Inspiration for using DAWG in Crack came from Paul Leyland back in the early 1990s, who mentioned something similar being used to encode dictionaries for crossword-puzzle solving programs; we continue to be astonished at how effective DAWG is on sorted inputs without materially impacting subsequent compression (i.e. gzip); a gzipped-DAWG file is also typically about 50% of the size of the gzipped non-DAWGed file. It just goes to prove that knowledge of the sort of input you’ll be dealing with can beat a general-purpose program hands-down; there are also interesting conclusions that can be drawn regarding the entropy of human languages after sorting.
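The recipe quoted above is classic front-coding, and it is short enough to sketch. This Python version is my own illustration (the names are mine, not from the Crack source), assuming the wordlist is already sorted:

```python
import string

# Encoding alphabet for shared-prefix lengths 0..61, in [0-9A-Za-z] order.
ALPHABET = string.digits + string.ascii_uppercase + string.ascii_lowercase

def dawg_encode(words):
    """Front-code a sorted wordlist: emit <prefix-length-char><stem> per word."""
    prev = ""
    out = []
    for word in words:
        # Count leading characters shared with the previous word, capped at 61.
        n = 0
        while n < min(len(word), len(prev), 61) and word[n] == prev[n]:
            n += 1
        out.append(ALPHABET[n] + word[n:])
        prev = word
    return out
```

For example, `dawg_encode(["apple", "applet", "apply"])` yields `["0apple", "5t", "4y"]`: each line carries only the count of characters shared with its predecessor plus the new stem, which is why sorting first matters so much — sorted neighbours share long prefixes.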
It’s a really nice feeling when you can put someone else’s white paper into a personal context.
It’s better yet when you beat them to some of the conclusions by several years. 🙂