Friday, August 23, 2013

LZHAM2 notes

I've been thinking in the background about LZHAM2 again. (LZHAM is a lossless codec I released a while back on Google Code, which I'm now calling LZHAM1.) For LZHAM2 I'm aiming for much higher and more consistent decompression throughput, and more control over the various decompression rate vs. ratio tradeoffs that are fixed in stone right now. Here are all of my notes, put here mostly so I don't forget them. (I first sent them to Michael Crogan, who's interested in this stuff, but realized they were probably more useful here.)

LZHAM2 notes:

- LZHAM1's decompressor can be greatly slowed down by having to rebuild Huffman decode tables too often, especially on nearly incompressible files (because the updates are spread around way too many tables, so the decompressor gets stuck in a ditch constantly updating them). The codec needs to be way smarter about when tables are updated.

Here's the big hammer approach to this: support offloading Huffman table construction onto a pool of 1 or more worker threads. This is kinda tricky because the compressor must delay using an updated Huffman table for a while, to account for the newly introduced latency before the decompressor can switch to the new table. Determining how much latency to actually use will be an interesting problem (maybe make it adjustable/overridable by the user). There's a rough sketch of the idea below.

Worst case, if the main thread needs to switch to an updated table that isn't ready yet, it can skip waiting and just compute the table itself immediately (obviously wasting some CPU, but who cares, because most apps rarely if ever use all available CPUs anyway).

Consider sending a tiny signalling message to the decompressor that indicates when the table must be updated.

Pretty much every modern mobile/desktop/etc. platform supports multiple HW threads, so LZHAM2 should be able to get a bunch of Huffman table updates for "free" if I can make this work well enough.
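To make that concrete, here's a rough C++ sketch of the deferred table swap on the decode side (all names here are invented for illustration, this isn't LZHAM code): the rebuild runs on a worker thread, both sides count down an agreed-on latency in symbols, and if the worker isn't finished at the switch point the main thread falls back to building the table itself.

```cpp
// Rough sketch (not LZHAM's actual code) of a deferred Huffman table swap.
#include <chrono>
#include <cstdint>
#include <future>
#include <vector>

struct huff_decode_table { std::vector<uint32_t> lookup; };

// Stand-in for the real (and expensive) canonical decode table builder.
static huff_decode_table build_decode_table(std::vector<uint32_t> freqs) {
    return huff_decode_table{ std::vector<uint32_t>(freqs.size()) };
}

struct adaptive_huff_state {
    std::vector<uint32_t> freqs;             // running symbol statistics
    huff_decode_table cur;                   // table currently in use
    std::future<huff_decode_table> pending;  // rebuild running on a worker thread
    uint32_t symbols_until_switch = 0;       // latency agreed on by both sides

    void schedule_rebuild(uint32_t latency_in_symbols) {
        // Snapshot the stats and hand the rebuild to a worker; keep coding with 'cur'.
        pending = std::async(std::launch::async, build_decode_table, freqs);
        symbols_until_switch = latency_in_symbols;
    }

    void on_symbol_coded() {
        if (symbols_until_switch == 0)
            return;                          // no table swap pending
        if (--symbols_until_switch != 0)
            return;                          // not at the agreed switch point yet
        if (pending.valid() &&
            pending.wait_for(std::chrono::seconds(0)) == std::future_status::ready) {
            cur = pending.get();             // worker finished in time
        } else {
            // Worst case: the worker isn't done, so burn a little CPU and build the
            // table right here rather than stalling the stream.
            cur = build_decode_table(freqs);
        }
    }
};
```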

- SIMD-ify Huffman and/or arithmetic decompression. I'm on the fence about this, but the symbol decompression rate improvements I've heard from others experimenting in this domain are remarkable.

- Really try hard to optimize the Huffman decode table generator. SIMD it, whatever, it's way more important than I thought.

- LZHAM1 uses a simple approach to trigger resetting the Huffman table update rate: when the overall compression ratio drops sharply over the last few blocks, it can reset the update rates of all the tables. There's a bunch of lag in the current implementation (all the way up at the block level) because the compressor's design is limited to a single pass (streaming) approach, and I didn't want to go back and re-code a whole block during a reset. Try an alternative that uses either more buffering or multiple passes.
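Roughly what I mean by the update-rate logic, sketched with invented intervals and thresholds:

```cpp
// Sketch of the update-rate reset idea; the intervals and the 0.75 threshold are
// invented for illustration and would need tuning.
#include <cstdint>

struct table_update_policy {
    static constexpr uint32_t min_interval = 32;
    static constexpr uint32_t max_interval = 2048;
    uint32_t rebuild_interval = min_interval;  // symbols between table rebuilds

    void on_table_rebuilt() {
        // Back off over time: well-adapted tables don't need frequent rebuilds.
        if (rebuild_interval < max_interval)
            rebuild_interval *= 2;
    }

    void on_block_done(double cur_ratio, double prev_ratio) {
        // A sharp ratio drop between blocks suggests the data statistics changed,
        // so snap the update rate back so the tables can re-adapt quickly.
        if (prev_ratio > 0.0 && cur_ratio < prev_ratio * 0.75)
            rebuild_interval = min_interval;
    }
};
```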

- LZHAM1 uses too many Huffman tables, and is sloppy about its higher order contexts (the order-1 contexts are typically just the upper X bits of the previous byte, etc.). There's got to be a smarter way of dealing with this other than just lopping off low order bits. I went too far on using tons of tables and more contexts in order to match LZMA's ratio on large files. The codec needs to be more configurable so it can use fewer contexts for faster decompression.
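For reference, the "upper X bits of the previous byte" style of context selection is basically just this (illustrative, not LZHAM's exact formula):

```cpp
#include <cstdint>

// Illustrative order-1 literal context: keep only the top context_bits of the
// previous byte, trading modeling precision for fewer tables.
inline uint32_t literal_context(uint8_t prev_byte, uint32_t context_bits) {
    return prev_byte >> (8 - context_bits);  // e.g. 3 bits -> 8 contexts/tables
}
```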

- Do a thorough analysis across a wide range of file types and (importantly) file sizes. I just didn't spend much time on LZHAM1's small-file performance because I thought large solid files would be the more important real-world use case.

- I did a high quality integration of LZHAM directly into 7zip (both the command line tool and the 7z Windows archiver app) for testing, which helped me shake out a few of the remaining higher level API bugs. I didn't release this publicly, but I did release the API fixes that came out of this work. This was a super useful thing to do.

- Charles Bloom made several concise suggestions on how to improve LZHAM on his blog when he compared the codec against several others. Some of these suggestions are in the reply section; I need to save them.

- Finally get LZHAM's compressor into Telemetry and figure out how to better thread it. The current approach is really basic and just forks & joins on every block.

- Cloud compression is very interesting from an optimization perspective. I've been limiting myself to 1 machine with X threads and plain streaming compression (with minimal buffering) only. These are important axes to explore. I used ~200 machines (with 4-12 compile threads on each box) to compile and optimize Portal 2's vertex/pixel shaders; imagine the parsing levels and compression options you could try out on hundreds of machines.

- Switch to cmake and Linux as my primary dev platform. I no longer use Windows as my primary dev platform. Linux is clearly the path forward and Windows is now that thing I port to.

Various things I tried that didn't make it into LZHAM1:

- In the early days I spent quite a bit of time experimenting with Deflate or LZX-style static Huffman tables vs. the dynamic stuff used in LZHAM1. I delta coded the code lengths against the previous block's code lengths into the output stream (I think I first saw Bloom do this in his early LZ research codecs). At the time, the practical constraints this placed on the design (the # of Huffman tables, the # of symbols, etc.) seemed too restrictive. I hit a wall and couldn't compete against LZMA this way. I think there's still plenty of room to optimize the dynamic table rebuild approach, which is why I keep pushing it.
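The delta coding itself is trivial, something like this sketch (the entropy coding of the deltas, which is where the real work is, is omitted):

```cpp
// Sketch of the static-table experiment: code each block's Huffman code lengths as
// deltas against the previous block's lengths, so stable symbols cost almost nothing.
#include <cstdint>
#include <vector>

std::vector<int8_t> delta_code_lengths(const std::vector<uint8_t>& cur_lens,
                                       const std::vector<uint8_t>& prev_lens) {
    std::vector<int8_t> deltas(cur_lens.size());
    for (size_t i = 0; i < cur_lens.size(); ++i) {
        uint8_t prev = (i < prev_lens.size()) ? prev_lens[i] : 0;
        deltas[i] = static_cast<int8_t>(cur_lens[i] - prev);  // mostly zeros when stats are stable
    }
    return deltas;
}
```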

- Match finding using suffix arrays and longest common prefix (LCP) tables
Got it working using the best algorithms I could find back around '07 and then again in '10, but my implementation had perf/memory scaling issues with larger dictionaries. Adding new data into the dictionary (and "sliding out" old data) was extremely expensive because the tables had to be rebuilt. LZMA's matching algorithm was easier to implement and a known quantity, so I went with that.
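For anyone unfamiliar with the approach, here's a deliberately naive toy illustration of SA+LCP match finding (nothing like my old implementation); the brute-force suffix sort and per-position scans are exactly the kind of thing that stops scaling once the dictionary gets large and starts sliding:

```cpp
// Toy SA+LCP match finder, purely to illustrate the approach.
#include <algorithm>
#include <climits>
#include <string>
#include <utility>
#include <vector>

struct sa_matcher {
    std::string text;
    std::vector<int> sa, rnk, lcp;  // suffix array, rank, LCP of adjacent suffixes

    explicit sa_matcher(std::string t) : text(std::move(t)) {
        const int n = (int)text.size();
        sa.resize(n); rnk.resize(n); lcp.assign(n, 0);
        for (int i = 0; i < n; ++i) sa[i] = i;
        // Naive O(n^2 log n) suffix sort; a real matcher would use SA-IS or similar.
        std::sort(sa.begin(), sa.end(), [&](int a, int b) {
            return text.compare(a, std::string::npos, text, b, std::string::npos) < 0;
        });
        for (int i = 0; i < n; ++i) rnk[sa[i]] = i;
        // Kasai's algorithm: lcp[i] = longest common prefix of suffixes sa[i-1], sa[i].
        for (int i = 0, h = 0; i < n; ++i) {
            if (rnk[i] > 0) {
                int j = sa[rnk[i] - 1];
                while (i + h < n && j + h < n && text[i + h] == text[j + h]) ++h;
                lcp[rnk[i]] = h;
                if (h) --h;
            } else {
                h = 0;
            }
        }
    }

    // Longest match for 'pos' against any earlier position: walk outward from pos's
    // slot in suffix order, tracking the running minimum LCP. Returns (match_pos, len).
    std::pair<int, int> longest_prev_match(int pos) const {
        int best_pos = -1, best_len = 0;
        const int r = rnk[pos];
        for (int i = r, m = INT_MAX; i > 0; --i) {                  // scan toward smaller suffixes
            m = std::min(m, lcp[i]);
            if (m <= best_len) break;
            if (sa[i - 1] < pos) { best_len = m; best_pos = sa[i - 1]; break; }
        }
        for (int i = r + 1, m = INT_MAX; i < (int)sa.size(); ++i) { // scan toward larger suffixes
            m = std::min(m, lcp[i]);
            if (m <= best_len) break;
            if (sa[i] < pos) { best_len = m; best_pos = sa[i]; break; }
        }
        return { best_pos, best_len };
    }
};
```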

- I have a working branch of LZHAM that uses ROLZ (reduced offset LZ). It has a nicely improved ratio, but the sheer complexity of this beast (not to mention the lower decompression throughput due to updating the ROLZ tables) was just too much for me to handle as a side project so I put the whole experiment on ice.
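For context, the gist of ROLZ looks something like this sketch (nothing like the actual branch): each context remembers a handful of recent positions, and the stream codes a small index into that list instead of a full offset, which is why the decompressor has to pay to keep the tables updated.

```cpp
// Rough ROLZ sketch: per-context tables of recent positions, indexed by a small offset.
#include <array>
#include <cstdint>

struct rolz_tables {
    static constexpr unsigned slots = 16;                     // positions remembered per context
    std::array<std::array<uint32_t, slots>, 256> pos{};       // context = previous byte here
    std::array<uint32_t, 256> head{};

    void update(uint8_t context, uint32_t position) {         // called for every coded position
        pos[context][head[context]++ & (slots - 1)] = position;
    }
    uint32_t resolve(uint8_t context, uint32_t index) const { // small index -> dictionary position
        return pos[context][(head[context] - 1 - index) & (slots - 1)];
    }
};
```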

- Early versions of LZHAM1's parser supported favoring matches that it thought would likely be in the decompressor's L2 cache. (It actually had a whole data structure that modeled a basic L2 cache, which was used to bias the symbol prices.) This seemed like an important optimization for console CPUs, but I never measured any real benefit on the PC, so I removed it and moved on.
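The bias itself boiled down to something about this simple (toy version with invented numbers; the real code modeled the cache rather than just checking recency):

```cpp
#include <cstdint>

// Toy cache bias: add an extra price (in bits) to matches whose source lies outside
// the most recently written 'cache_bytes' of the dictionary.
inline uint32_t cache_miss_penalty(uint32_t match_dist, uint32_t cache_bytes,
                                   uint32_t penalty_bits) {
    return (match_dist > cache_bytes) ? penalty_bits : 0;
}
```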

Misc:

- I keep wondering why Google continues to invest in Deflate with Zopfli, etc. when it's clearly ancient stuff (Deflate was introduced 20 years ago). A new open codec that strikes the right balance somewhere in the spectrum between Deflate/LZX/LZMA/LZHAM/etc. would be super useful to a lot of people, and they have the clout and talent to do it. They should have enough data points from existing codecs, and enough internal experience from Zopfli, to have the confidence to build a new codec.

An effort like this would make a huge impact across the entire web stack. The gain would be relatively massive compared to the tiny improvements Zopfli's been able to achieve (~5% for a 100x increase in cost means it's time to move on).

If the new codec is made zlib API compatible (like I do in LZHAM and miniz), which is easy, then dropping it into existing codebases would be fairly straightforward.
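By "zlib API compatible" I mean exporting the same entry points zlib does, so existing call sites don't have to change. A hypothetical one-shot example:

```cpp
// Hypothetical illustration of "zlib API compatible": export the same one-shot entry
// points zlib does, so existing call sites compile unchanged. The memcpy body is just
// a stand-in so the sketch builds; a real wrapper would call into the new codec.
#include <cstring>

typedef unsigned char Bytef;
typedef unsigned long uLong, uLongf;

int compress2(Bytef* dest, uLongf* destLen,
              const Bytef* source, uLong sourceLen, int level)
{
    (void)level;
    if (*destLen < sourceLen) return -5;   // Z_BUF_ERROR in zlib's numbering
    std::memcpy(dest, source, sourceLen);  // stand-in for the actual codec call
    *destLen = sourceLen;
    return 0;                              // Z_OK
}
```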

- Someone needs to write a universal preprocessing/long range match library that properly supports streaming and is trivial to add in front of other codecs. I've been treating preprocessing as a totally separate axis from compression, assuming somebody would eventually solve this problem.

It could support various executable formats (dwarf, exe, etc.), xml, json, html, jpeg, mp3, wav, png, raw images, deflate, etc. All the best archivers already do this and the research has been done, but AFAIK it's not available as a single robust library.

- The decompressor can be viewed as a virtual CPU with a very limited but tightly compressed instruction set. I've been wondering what tricks from the CPU world could be effectively applied to LZ. I wonder if there are more useful instructions other than "here's a literal" or "copy X bytes from the dictionary using this offset".

With good parsing it's easy to add more node types to the parse graph. Right now I'm adding only literals (which are coded in various ways depending on previous state), and various matches and truncated versions of these matches.
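Written down as code, the virtual-CPU view looks something like this (purely illustrative instruction names, not LZHAM's actual symbol set):

```cpp
// The compressed stream is a "program"; the decoder is the machine that executes it
// against the output buffer.
#include <cstdint>
#include <vector>

enum class lz_op : uint8_t { literal, match, truncated_match };

struct lz_insn {
    lz_op    op;
    uint8_t  lit;   // literal byte (op == literal)
    uint32_t dist;  // backwards distance into the dictionary (matches)
    uint32_t len;   // copy length (matches)
};

void execute(const std::vector<lz_insn>& program, std::vector<uint8_t>& out) {
    for (const lz_insn& insn : program) {
        if (insn.op == lz_op::literal) {
            out.push_back(insn.lit);
        } else {
            // Byte-by-byte copy so overlapping matches (dist < len) work correctly.
            for (uint32_t i = 0; i < insn.len; ++i)
                out.push_back(out[out.size() - insn.dist]);
        }
    }
}
```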

- There are some deep/fundamental inefficiencies in the entire class of LZMA/LZHAM/etc. style algorithms. Bloom has covered this topic well on his blog, and I also realized this while working on LZHAM. For example, when a match ends, the decompressor has some knowledge from the dictionary about what the next character(s) are likely *not* to be. This knowledge can be used to exclude some dictionary strings in future matches. (However, the parser sometimes purposely truncates matches so it's possible for a match's follower byte to actually match the input but not be used.) There's code space inefficiency all over the place that seems like a big opportunity, but exploiting it seems hard to do efficiently.
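As a tiny hypothetical example of the follower-byte point:

```cpp
// After a match, the dictionary byte that would have extended it is known to both
// sides. LZMA-family coders already use this "match byte" to bias literal coding;
// fully excluding it from the code space is the harder (and only sometimes valid,
// due to truncated matches) step.
#include <cstdint>
#include <vector>

inline uint8_t match_follower_byte(const std::vector<uint8_t>& out, uint32_t match_dist) {
    return out[out.size() - match_dist];  // byte that would have continued the last match
}
```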