Thursday, October 29, 2015

Grab bag list of LZHAM improvements

Putting this somewhere so I don't forget it:

Patching specific:

- Allow the decompressor to refer to a static dictionary anywhere in memory, i.e. no memcpy() needed into the decompressor's dictionary buffer (see the sketch after this list)

- bsdiff is so simple looking. Why does it work so well when patching very similar files?

- For patching, investigate allowing the compressor to emit "delta rep" matches, i.e. rep matches with an optional plus/minus distance adjustment (sketched after this list).

- Use Matt Mahoney's FV tool on patch file scenarios.

- Add a visualization mode to LZHAM's compressor, like FV.
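
A rough sketch of the static dictionary idea above (hypothetical struct/function names, not LZHAM code): match distances that reach past the sliding window get serviced straight from a caller-owned buffer, so the dictionary never has to be memcpy()'d into the window.

#include <cstddef>
#include <cstdint>
#include <cstring>

struct decode_window
{
    uint8_t       *pWindow;      // decompressor's sliding window
    size_t         window_pos;   // current write position in the window
    const uint8_t *pStaticDict;  // caller-owned static dictionary (e.g. the patch source file)
    size_t         static_size;
};

inline void copy_match(decode_window &w, size_t match_dist, size_t match_len)
{
    if (match_dist <= w.window_pos)
    {
        // Ordinary in-window match (may overlap, so copy a byte at a time).
        for (size_t i = 0; i < match_len; ++i, ++w.window_pos)
            w.pWindow[w.window_pos] = w.pWindow[w.window_pos - match_dist];
    }
    else
    {
        // Distance reaches into the static dictionary, addressed back from its end.
        // (Sketch ignores matches that straddle the dictionary/window boundary.)
        size_t dict_ofs = w.static_size - (match_dist - w.window_pos);
        memcpy(&w.pWindow[w.window_pos], &w.pStaticDict[dict_ofs], match_len);
        w.window_pos += match_len;
    }
}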
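
And a tiny sketch of the "delta rep" idea: a rep match that also carries a small signed distance adjustment, so offsets that drift slightly between the source and target files stay cheap to code. Hypothetical decode step, not the current bitstream:

#include <cstdint>

struct rep_state { uint32_t rep_dist[4]; };  // most recently used match distances

// Pick one of the recent distances, apply +/- delta (which would be coded in
// just a few bits), then move the adjusted distance to the front as with a
// normal rep match.
inline uint32_t decode_delta_rep(rep_state &s, uint32_t rep_idx, int32_t delta)
{
    uint32_t dist = s.rep_dist[rep_idx] + delta;
    for (uint32_t i = rep_idx; i > 0; --i)
        s.rep_dist[i] = s.rep_dist[i - 1];
    s.rep_dist[0] = dist;
    return dist;
}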

General improvements:

- Symbol price accuracy: the parsing jobs use approximate symbol statistics that are locked at the beginning of each block. Examine how inaccurate these statistics really are (see the pricing sketch after this list).

- Try to SIMD the Huffman table update routines.

- The codec is too focused on the totally general case, which is streaming. Many useful scenarios do not involve streaming at all.

Add a "small file" optimization mode to LZHAM's compressor, which can be used as a hint to the compressor that it's OK to try encoding the first block in multiple ways.

- Add a decompressor variant that doesn't support streaming at all, to reduce the # of decompressor basic blocks (idea from John Brooks).

- Add a compile time option that reduces the size of the decompressor as much as possible, even if it sacrifices perf.

- The big Huffman tables (like for literals, delta literals, match len/dist categories) are all initialized to symbol frequencies of 1's, which is wasteful for things like text files. Implement some way for the compressor to have control over this, like escape codes to jam small regions of the symbol table frequencies to 1's, or perhaps configuration bits at the start of the stream.

- LZHAM's Huffman fastbits (symbol decoding acceleration) tables are too large on very small streams (an observation due to Charles Bloom). The decoder should start with small tables and grow them over time (see the sketch after this list).

- From John Brooks: Combine Deflate's idea of storing compressed Huffman table codelengths in the data stream with LZHAM's current approach of rebuilding the tables over time. At the start of streams, use compressed codelengths, then switch to dynamic.

- Add a configuration bit (global or per block?) to completely disable rep matches, which I believe will help a small amount on text files. Have the compressor try this optimization out.

- Steal the idea of global configuration settings from LZMA that tweak some of its prediction models, so the user can call LZHAM multiple times with different settings and choose the smallest results.

- There are many compromise decisions in LZHAM. For example, the decompressor state machine transition table can't be varied, the bitwise arithmetic coder's adaptation rate can't be varied, and only the Huffman table update interval is user controllable. Allowing the compressor to optimize along these axes could result in gains.

- Further profile and improve LZHAM's small block perf (reduce startup cost, increase throughput near start of streams).
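
Re the symbol price item above: the parse jobs price symbols off a frequency snapshot taken at block start, roughly like this (illustrative fixed-point pricing, not LZHAM's actual pricer). The further the live adaptive stats drift from the snapshot, the worse these prices get.

#include <cmath>
#include <cstdint>

static const uint32_t cBitPriceScale = 256;  // fixed point: 1 bit = 256 units

// Approximate cost, in scaled bits, of coding symbol sym given frequencies
// locked at the start of the block. Assumes pFreq[sym] is nonzero.
inline uint32_t approx_sym_price(const uint32_t *pFreq, uint32_t total_freq, uint32_t sym)
{
    double p = (double)pFreq[sym] / (double)total_freq;
    return (uint32_t)(-std::log2(p) * cBitPriceScale + .5);
}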
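
Re the fastbits item: one possible fix is to just scale the acceleration table size with how much data has been decoded so far, so tiny streams never pay for a big table build. Illustrative thresholds only:

#include <cstdint>

// Pick the number of fast-lookup bits to use for the next Huffman table rebuild.
inline uint32_t choose_fast_lookup_bits(uint64_t total_bytes_decoded)
{
    if (total_bytes_decoded < 4 * 1024)  return 6;   // 64-entry table
    if (total_bytes_decoded < 64 * 1024) return 8;   // 256-entry table
    return 10;                                       // full-size acceleration table
}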

Platform specific:

- Get Emscripten compilation of the decompressor working, for JavaScript support

- Deeply examine and optimize the generated assembly of the decompressor on ARM

Traumatic:

- Just dump arithmetic and Huffman coding and switch to something like rANS. I think I'll need to do this to ultimately compete against Brotli.
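
For reference, the core of a byte-renormalized rANS coder really is tiny. A minimal sketch of the standard construction (32-bit state, frequencies pre-quantized to sum to 1<<12; not LZHAM code, and stream setup/flush of the state is omitted):

#include <cstdint>

static const uint32_t cScaleBits = 12;        // frequencies sum to 1 << 12
static const uint32_t cRansLow   = 1u << 23;  // lower bound of the normalized state

struct rans_sym { uint32_t freq, cum; };      // symbol's frequency and cumulative frequency

// Encode one symbol. Bytes are emitted backwards through *ppOut (the encoder runs in reverse).
inline uint32_t rans_encode(uint32_t x, rans_sym s, uint8_t **ppOut)
{
    uint32_t x_max = ((cRansLow >> cScaleBits) << 8) * s.freq;
    while (x >= x_max) { *--(*ppOut) = (uint8_t)x; x >>= 8; }
    return ((x / s.freq) << cScaleBits) + (x % s.freq) + s.cum;
}

// The low cScaleBits of the state identify the next symbol's slot in [0, 1<<12).
inline uint32_t rans_decode_slot(uint32_t x) { return x & ((1u << cScaleBits) - 1); }

// Once the slot has been mapped to a symbol, advance the state and renormalize.
inline uint32_t rans_decode_advance(uint32_t x, rans_sym s, const uint8_t **ppIn)
{
    x = s.freq * (x >> cScaleBits) + (x & ((1u << cScaleBits) - 1)) - s.cum;
    while (x < cRansLow) x = (x << 8) | *(*ppIn)++;
    return x;
}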

Very long term pie in the sky stuff:

I have a frighteningly complex ROLZ branch of LZHAM alpha somewhere that works. Re-evaluate it.

3 comments:

  1. > - Allow the decompressor to refer to a static dictionary anywhere in memory (i.e. no memcpy() needed into the decompressor's dictionary buffer)

    Is this so you can map the file into memory and do sparse access as-needed?

    > - Deeply examine and optimize the generated assembly of the decompressor on ARM

    FWIW, branches are way more expensive on ARM, which is another argument against Huffman tree walks.

    -JB

    Replies
    1. Yes, and to also avoid the current memcpy(). Also, in theory I could use this to lower the decompressor's RAM cost when in patching mode, by allowing the dictionary size to be smaller than the allowable/codable match distance. Right now, the dictionary size must be at least as large as the patch source file's size.

    2. LZHAM avoids the Huffman tree walk in the most probable scenario, but yea they suck. It's possible to use multilevel tables to accelerate decoding, which would cut down the avg # of jumps to get a symbol, but this uses more RAM and they would be more expensive to build. Huffman accel table building hurts LZHAM's decomp perf a lot on small files.
