Snappy compression format

7/25/2023

Snappy (previously known as Zippy) is a fast data compression and decompression library written in C++ by Google, based on ideas from LZ77 and open-sourced in 2011. It does not aim for maximum compression or for compatibility with any other compression library; instead, it aims for very high speeds and reasonable compression. Compression speed is 250 MB/s and decompression speed is 500 MB/s using a single core of a circa 2011 "Westmere" 2.26 GHz Core i7 processor running in 64-bit mode. The compression ratio is 20–100% lower than gzip.

Snappy is widely used in Google projects like Bigtable and MapReduce, and in compressing data for Google's internal RPC systems. It can be used in open-source projects like MariaDB ColumnStore, Cassandra, Couchbase, Hadoop, LevelDB, MongoDB, RocksDB, Lucene, Spark, and InfluxDB. Decompression is tested to detect any errors in the compressed stream. Snappy does not use inline assembler (except for some optimizations) and is portable.

Snappy encoding is not bit-oriented but byte-oriented (only whole bytes are emitted or consumed from a stream). The format uses no entropy encoder, such as a Huffman tree or an arithmetic encoder. The first bytes of the stream are the length of the uncompressed data, stored as a little-endian varint, which allows for use of a variable-length code. The lower seven bits of each byte are used for data, and the high bit is a flag that is set while more bytes follow and clear on the last byte of the length field. The remaining bytes in the stream are encoded using one of four element types.

For iWork files, the Protobuf message definitions can fortunately be recovered from the iWork binaries using proto-dump. A full dump of the Protobuf messages can be found here. The mapping between an object's MessageInfo.type and its respective Protobuf message type must be extracted from the iWork applications at runtime.
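To make the varint length preamble of a raw Snappy stream concrete, here is a minimal Python sketch of a decoder. The function name `read_varint` and the 10-byte sanity limit are my own choices for illustration, not part of any Snappy library.

```python
def read_varint(buf: bytes, pos: int = 0):
    """Decode a little-endian base-128 varint starting at buf[pos].

    Returns a (value, next_pos) tuple. Hypothetical helper for illustration.
    """
    value = 0
    shift = 0
    while True:
        if pos >= len(buf):
            raise ValueError("truncated varint")
        byte = buf[pos]
        pos += 1
        # The low seven bits carry data, least significant group first.
        value |= (byte & 0x7F) << shift
        # A clear high bit marks the last byte of the length field.
        if (byte & 0x80) == 0:
            return value, pos
        shift += 7
        if shift >= 70:  # assumed sanity limit: at most 10 bytes
            raise ValueError("varint too long")


# The bytes FE FF 7F encode an uncompressed length of 2097150.
length, offset = read_varint(bytes([0xFE, 0xFF, 0x7F]))
print(length, offset)
```

The returned offset is where the element stream begins, so a decoder would continue reading from there.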
.iwa (iWork Archive) files use a custom format consisting of a Protobuf stream wrapped in a Snappy stream.

Snappy is a compression format created by Google aimed at providing decent compression ratios at high speeds. IWA files are stored in Snappy's framing format, though they do not adhere rigorously to the spec. In particular, they do not include the required Stream Identifier chunk, and compressed chunks do not include a CRC-32C checksum. The stream is composed of contiguous chunks, each prefixed by a 4-byte header. The first byte indicates the chunk type, which in practice is always 0 for iWork, indicating a Snappy compressed chunk. The next three bytes are interpreted as a 24-bit little-endian integer indicating the length of the chunk. The 4-byte header is not included in the chunk length.

The uncompressed IWA contains the Component's objects, serialized consecutively in a Protobuf stream. Each object begins with a varint representing the length of the ArchiveInfo message, followed by the ArchiveInfo message itself. The ArchiveInfo includes a variable number of MessageInfo messages describing the encoded Payloads that follow, though in practice iWork files seem to only have one payload message per ArchiveInfo.

The format of the payload is determined by the type field of the associated MessageInfo message. The iWork applications manually map these integer values to their respective Protobuf message types, and the mappings vary slightly between Keynote, Pages and Numbers. This information can be recovered by inspecting the TSPRegistry class at runtime. Because Protobuf is not a self-describing format, applications attempting to understand the payloads must know a great deal about the data types and hierarchy of the objects serialized by iWork.

One possibility is that Index.zip is used to prevent the synchronization issues that would occur if reading and writing a document involved accessing many small files. Saving a document might involve writing out several Components, so instead of coordinating writes to the various individual .iwa files, only the Index.zip must be locked. Since .iwa files are inherently compressed (see the Snappy framing above), the zip implementation used for Index.zip could be designed to be minimal and efficient.
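As a rough illustration of the IWA chunk framing (a 1-byte chunk type followed by a 24-bit little-endian length that excludes the 4-byte header), the Python sketch below splits a byte string into chunks. The name `iter_iwa_chunks` is hypothetical, and the sketch only separates the chunks: each payload is still Snappy-compressed and would need a Snappy decompressor, which is not shown here.

```python
def iter_iwa_chunks(data: bytes):
    """Yield (chunk_type, payload) pairs from an IWA byte string.

    Hypothetical helper: payloads are returned still compressed.
    """
    pos = 0
    while pos < len(data):
        header = data[pos:pos + 4]
        if len(header) < 4:
            raise ValueError("truncated chunk header")
        chunk_type = header[0]  # in practice always 0 for iWork
        # Bytes 1..3 hold the chunk length; the header is not counted.
        length = int.from_bytes(header[1:4], "little")
        pos += 4
        payload = data[pos:pos + length]
        if len(payload) < length:
            raise ValueError("truncated chunk payload")
        pos += length
        yield chunk_type, payload


# Two made-up chunks with placeholder payloads, just to exercise the parser.
sample = bytes([0, 3, 0, 0]) + b"abc" + bytes([0, 1, 0, 0]) + b"z"
for chunk_type, payload in iter_iwa_chunks(sample):
    print(chunk_type, len(payload))
```

Because there is no Stream Identifier chunk and no CRC-32C checksum in these files, the parser deliberately does not look for either.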

Curiously, the zip implementation iWork uses for Index.zip is extremely limited. It does not support any form of compression or extensions like Zip64. Simply expanding Index.zip and then recreating it with a standard zip utility will result in a document that iWork refuses to open. The iWork '13 applications contain a separate, more complete zip implementation used for reading and writing iWork '09 documents (which are bundles that have been zipped in their entirety), so I believe the choice to forgo compression for Index.zip is intentional.
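For anyone experimenting with this, here is a small Python sketch that writes a zip with stored (uncompressed) entries and Zip64 disabled, and checks whether an existing archive uses only stored entries. The helper names and the entry name in the example are made up, and I have not verified that iWork accepts archives written this way; there may be other constraints beyond compression and Zip64.

```python
import io
import zipfile


def write_stored_zip(entries: dict) -> bytes:
    """Build a zip in memory with every entry stored uncompressed."""
    buf = io.BytesIO()
    # ZIP_STORED disables compression; allowZip64=False refuses Zip64 output.
    with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_STORED,
                         allowZip64=False) as zf:
        for name, data in entries.items():
            zf.writestr(name, data)
    return buf.getvalue()


def all_entries_stored(zip_bytes: bytes) -> bool:
    """Return True if no entry in the archive is compressed."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        return all(info.compress_type == zipfile.ZIP_STORED
                   for info in zf.infolist())


# "Index/Document.iwa" is a placeholder entry name for this example.
blob = write_stored_zip({"Index/Document.iwa": b"\x00\x03\x00\x00abc"})
print(all_entries_stored(blob))
```

The check is handy for confirming that a standard zip utility has recompressed the entries, which would be one reason a recreated Index.zip fails to open.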
