Thursday, April 28, 2016

Hiring Group Dynamics

So there are several interesting hiring-related phenomena I've seen at various companies. I think some of the most exaggerated hiring behavior will emerge at "flat" companies with yearly bonuses based (partially) on the data gathered during peer feedback.

Here's a description of one category of emergent behavior I noticed when the programmers have nearly free rein to run the hiring process and decide who will ultimately get hired:

Want a good bonus? Never hire new competition!

You would think the programmers doing the hiring would always be fair and unbiased in their assessments of each candidate's abilities, right? And they would, of course, always optimize for adding value to the company by making good hires. The company programmers involved in the process would choose good candidates for each opening, irrespective of politics, or concerns over their future positions or bonuses, etc.

In practice, I think especially at companies with massive yearly bonuses, the company's programmers will band together unofficially and make it practically impossible for potential competition to enter the company, make waves, and possibly eclipse the old guard. We have a classic conflict of interest situation here. This tendency to embargo your competition is especially effective when hiring specialists, such as graphics programmers, although I've seen it happen regardless of specialty.

At one well-known company, I watched around a dozen experienced graphics programmers get rejected in our interview process. Each time, without exception it was a NO HIRE, even though we were in dire need of graphics programmers. A few of the names were pretty well known in graphics circles, so my jaw dropped after several of these NO HIRE interviews.

I was involved in some of these interviews. Almost every time, these candidates would do sometimes incredible things during the whiteboard interview, but somehow one or two graphics programmers would always find some other reason to be thumbs down. (I didn't say anything at the time, because I was afraid doing so would have made enemies and hurt my career at this company. I was basically incentivized to say nothing by the peer feedback based bonus system.)

Eventually, upper management quietly noticed that irrespective of our company's dire need of graphics engineers, we weren't hiring them anyway. This company had a major upcoming threat to its primary profit generating product looming in its future, and the counter to this competitive threat involved some very specialized graphics engineering. The CEO had to step in and basically just subvert the entire completely broken hiring process and just start hiring graphics contractors almost sight unseen.

Unfortunately, these graphics contractors had virtually no path to full-time employment, so they got treated like 3rd class citizens at best and all were eventually pushed out. (Sometimes years later, even after delivering massive amounts of value to some teams.)

Anyhow, how do I know all this stuff? At this particular company, I somehow fell through the cracks and was interviewed and hired as a generalist programmer, not a graphics specialist. Eventually, the old graphics guard basically got lazy and shied away from the company's toughest graphics problems (or actually shipping anything involving new graphics code), but somebody had to do this "dirty" graphics work. The non-graphics programmers figured things out and started sending graphics work my way, and I started asking myself "why are we not hiring any graphics programmers?!"

Unfortunately I'm terrible at saying "no" to requests for help, so this resulted in a lot of work.

Turns out, that refined, "fair" hiring machine that management was so proud of was a total joke.

Sunday, April 10, 2016

Tips for Interviewing at Software Companies

Here's another blog post in the "Rich goes off the rails and reveals a bunch of shit he's learned over the years while working as a corporate programmer" category.

Companies Must be Continually Reminded that the Interview Goes Both Ways

Many corps have an internal company culture that places the company in the superior position relative to job candidates. These companies feel they can choose who they want from a seemingly endless variety of potential employees, so who cares how they're treated, right? The reasoning goes "we'll just hire someone else" if a candidate pushes back.

We need to collectively turn the tables on companies like this. Let's give them a powerful form of feedback. Let's exercise our right to "route around" bad companies and not apply to, or accept job offers from, corporations that act like we are replaceable cogs. (Alternately, let's all talk between ourselves and collectively compare notes and boost our compensation rates and working conditions. They can't stop us!)

I'm hoping this blog post will help make people more able to discern the good companies from the bad ones. For the record, I do believe there are many good companies out there, but for every good company there seems to be a bunch of bad ones.

Remember: We Write the Software Which Runs the System

Here's a key concept to internalize: We write the software that literally drives this entire system. Food production and distribution, electricity production and distribution, telecommunications, government, finance, trucking, planes, etc. It's all run by computers one way or the other, and we write the code that makes these computers work. Without computers this entire system crumbles into the dark ages.

As time goes by more and more of the system is becoming automated and computerized. This means we as programmers collectively have the power in these relationships with corporations, but we haven't effectively organized ourselves or figured out how to best exercise our power yet. We now have the technology to instantly communicate between ourselves, which if we all start using it can lead to massive changes in a relatively short period of time.

Interview Tips

Some corps are exquisitely designed to extract as much "value" from you as quickly as possible, your health and sanity be damned. Above all, I want to help other programmers avoid places like this. When applying for a position at a software corp, keep these things in mind:

- Follow your instincts.

Ask a lot of questions. Learn how to interpret body language. Are you treated with respect? Are your questions answered in a straightforward way?

Remember, the hiring pipelines of these companies are tuned to take advantage of the macro-level psychological profile of "typical" programmers. Get educated, fast. These companies are not your friends. They will try to get into your brain and "bend" you psychologically in order to make you conform and "fit in" to their brand of corporate utopia.

Trust your gut feelings during the interview! If you feel disrespected or not taken seriously, don't ignore it. It's not just in your head. Run away! You won't grow there, and it'll be a dehumanizing place.

- "You need us to achieve success, you are nothing so follow us!"

Run away fast! I've seen this tactic applied against dozens of developers after one company collapsed in Dallas. Sadly, it worked with a bunch of people and they ultimately all got screwed.

- Deeply analyze any critique given to you during the interview

Sometimes critique that seems merit-based is really bias in disguise. It's very important that if you get bad feedback, you stand back and think "Is this true? Or is there just something wrong with this company (elitism, sexism, they just didn't want to hire you, etc)?"

- How much is your time worth?

Ultimately, you are selling your time for digital digits in some bank computer somewhere. You will not get this time back, period. It's worth a lot, probably much more than you think.

How much income does the company actually make given your time? Some companies make millions of dollars per software developer, yet pay only a fraction of this to you.

Remember, this is a market and market principles apply here. By increasing our communication levels and giving feedback to the market (by routing around bad companies, demanding higher pay during negotiations, pushing back during interviews, etc.) we can collectively raise our salaries, compensation packages, and improve our working conditions.

- Are you gambling your time away trying to get your stock "lottery ticket"?

In this scenario, you're willing to be underpaid, but you're hoping the company will sell out in X months or years and make you millions. Just beware that this is a form of gambling with your time and finances. 

I've seen a few companies exquisitely exploit and continually encourage this "gambler mindset" in their workers in order to suppress wages. Be careful!

- Admit that you probably suck at negotiating

Generally, in my experience programmers make horrible negotiators. The most important thing is to approach the negotiation with the proper mindset. They need you, not vice versa, and they probably have much more money (and the capability to pay you) than you suspect.

This is a topic definitely worthy of another blog post.

- Learn how to recognize negative psychological traits like sociopathy and narcissism.

Some companies are full of sociopathic monsters whose job description (and honestly, their corporate duty) is to exploit you as much as possible.

Learn to recognize the signals. They will try to get into your head, quickly build a mental model of you, then play off your willingness to not rock the boat or be seen as a "troublemaker" to the corporation. They will find subtle ways to threaten you, if needed, to keep you in line.

Narcissists can be especially horrible to work around, especially when they are in management. Learn the signs.

- Beware of code words like "Elite", "10Xer", "no bosses", "scheduled crunch", etc.

"Elite" - Programmers willing to work endlessly in sometimes horrific conditions are labeled "Elite". Yes, we have a very screwed up system here, where people who get exploited and are worked until exhaustion are labeled "Elite". Avoid companies like the plague that use this word in their job descriptions or recruiting emails. 

(Attention recruiters: Please stop sending me emails with the word "elite" anywhere in them. Thanks!)

"10X programmer" - Someone who hacks up some shit (sometimes actually using stolen ideas), and then depends on other practicing programmers to do the actual work of making these (sometimes very sub-optimal) ideas actually shippable. These programmers tend to publicly take full public credit for things they only partially worked on or thought up. We don't need more of these "10x" assholes, instead we need to completely reboot our programmer culture so the very concept of a "10x'er" is totally alien.

"No bosses" - This recruiting shtick from 1999 means there are many powerful bosses in hiding, and/or that everyone is effectively your boss. Also avoid companies that advertise this like the plague. It's a recruiting tactic designed to attract programmers who had bad bosses in the past. (Believe it or not, there are many very good managers out there!) 

Also, without managers, you will be directly exposed to the many wolves out there who can make your programming life a living hell. A good manager will shield their programmers from endless bullshit and insanity.

"Crunching" - This means the company externalizes the cost to your health and sanity of working endless hours. They may have bad planning, or bad business models, whatever. Avoid companies that crunch endlessly like the plague unless you just don't care about your health.

"Scheduled Crunch" or "we occasionally crunch" - Again, any company that schedules crunch just doesn't understand or care how to plan, or is ignoring the cost of working crazy hours on your health.

- Ask: Does the corp own all your work?

Can you work on stuff at home, like open source software or your own commercial software? Avoid companies like the plague that don't let you work on your own stuff.

- Ask or check for insane contract clauses

Is there a non-disparagement clause? If you quit, must you wait X months or whatever before you can work on something else? Push back and avoid companies that do questionable shit like this.

- Who contacted you first?

If the company reached out to you first, did a recruiter contact you, or an actual programmer at the company? 

Ideally, a real-life programmer reached out to you. Only a programmer working at the company can genuinely answer your questions and give you a real idea about the working conditions and types of problems you will be working on there.

Remember recruiters are just part of the company's "hiring pipeline". The pipeline is basically "X Programmers In -> Y Programmers Out". You are just a replaceable number to these companies. Recruiters will say anything in order to get you to sign the dotted line.

- Is there a whiteboard interview?

Is there a whiteboard interview? Push back and say no. I've helped hire at several successful companies (easily over 100 people over the years) that didn't use whiteboard interviews at all. Anyone saying "this is just the way things are done" doesn't have perspective and is part of the serious problems our industry has.

After taking and giving way too many of these whiteboard interviews, I think they are total fucking bullshit. Whiteboard interviews test a candidate's ability to scribble uncompilable pseudo-code on a chalkboard (!) while being faced down by multiple adversarial programmers. (Who in some cases would rather be doing anything else, and who don't want more internal programmer competition in the first place!)

Some companies give programmers whiteboard questions from various books verbatim, like this one. This is just downright ridiculous, a waste of time for everyone involved, and a total demonstration that the process is completely bogus.

Whiteboard interviews are extremely stressful to candidates. I've seen amazing programmers just lock up and become dysfunctional in these conditions. We are testing candidates for the wrong abilities. I refuse to take part in any more of this insanity.

Some programmers use these bogus hazing ritual-like whiteboard interviews to help drive down the applicant's ego while simultaneously driving up their egos. This is a huge red flag -- avoid these programmers and the companies who employ them.

- "We only hire senior programmers"

Let's translate: We don't help train new programmers. The phrase "programmer empathy" isn't on our radar here. We probably treat each other like complete trash. We actually assume you are an idiot until you battle your way into a position of respect. Avoid companies like this!

Remember, all senior programmers at one time were junior programmers.

- What's the company's culture?

Ask lots of questions of people who work there. Is the official company message great, but when you pull programmers aside they actually hate working there? Search the web for reviews of the company, and search LinkedIn to find former employees and ask them about the company.

- Talk to the executives

Are they sociopaths? Raging narcissists? Ask them what they look for in programmer candidates, and see how they respond. Do they treat you with respect?

I've known execs who thought programmers were literally crazy, and trust me the companies they ran were not healthy.

- Look around the office

Little things can give you a lot of information. Is the office a mess? How much space and privacy do employees have? Is the environment quiet or loud?

- Does this company give proper attribution for ideas it uses?

I'm throwing this out here because I've noticed one very well known VR company outright steal ideas from its competitors or academics working in the space. Explicitly ask the company about its attribution policy.

Personally, I will never involve myself in any way with people or corporations who outright steal ideas for personal or corporate gain. (I can't believe I have to even say this. We have fallen to the level of stealing and re-branding ideas from each other!)

- Master your fears

What do you fear? Recognize it and look past it, because these companies are designed to exploit your fears and use them against you as a weapon.

Saturday, April 9, 2016

Why I left Unity

I left Unity about a month ago. It was a short, but amazing and enlightening, experience. I felt that my job there was done (or here). My first little contribution to the world of VR.

Instead of pivoting onto some new project or something, I've decided to basically pivot my entire life. Currently, it seems a large subset of the industry has rearranged itself around VR/AR, so I figure my timing couldn't be better. Unsurprisingly, pivoting your entire life isn't easy but I believe it's for the better. Working a full-time software job endlessly bulldozing countless lines of anonymous C/C++ code isn't on my list of happy things to do anymore. Screw that!

I'm now living in Seattle, a city I've only visited like 3 times in my entire 6 years living in Washington. Holy shit, what an amazing city! Bellevue is such a barren wasteland compared to Seattle it's ridiculous. I'm avoiding the eastside completely because the place reminds me of a mental prison. I've got some particularly bad memories there. I can't even look at downtown anymore without thinking of various horrible memories. I'm now a much happier independent consultant here in Seattle.

You know, just thinking here: I must be a bunch of game/software industry execs' worst nightmare. "Oh shit, Rich is some crazy ass SOB, he's going to spill all of our endless secrets and damage our hiring and spotless reputations!" Yes, I've seen a lot of shit in my career. A lot of it the world should know about -- someday. If it makes you feel any better, I'm almost as public as I possibly can be, and I'm staying that way. I really love blogging.

I've been thinking a lot about the word "counterculture", especially in relation to the software industry. I know exactly what the "mainstream", or "normal" software developer culture is. The corporate software development and game industry culture I've seen so far is immature, exploitative, abusive, and downright dehumanizing. Where are the alternative software developer cultures?

Imagine the kinds of amazing new software that could be created in alternative cultural environments. To really improve this industry we need to upgrade our culture, not our hardware, comp-sci curriculum, or office arrangements. We need to fix the root problem, which is to escape from this insane mainstream culture and create something healthier.

Sunday, March 27, 2016

Quotes from "My Year In Startup Hell"

I loved this article (excerpted from "Disrupted: My Misadventure in the Start-Up Bubble"). It covers so many corporate company culture things I've seen or experienced in my game development career in a single article. The article is from a startup perspective, but much of this applies to many other more mature companies. These quotes especially struck a chord with me:
"Arriving here feels like landing on some remote island where a bunch of people have been living for years, in isolation, making up their own rules and rituals and religion and language—even, to some extent, inventing their own reality. This happens at all organizations, but for some reason tech startups seem to be especially prone to groupthink. Every tech startup seems to be like this. Believing that your company is not just about making money, that there is a meaning and a purpose to what you do, that your company has a mission, and that you want to be part of that mission—that is a big prerequisite for working at one of these places."
On people that get fired:
"Dharmesh’s culture code incorporates elements of HubSpeak. For example, it instructs that when someone quits or gets fired, the event will be referred to as “graduation.” In my first month at HubSpot I’ve witnessed several graduations, just in the marketing department. We’ll get an email from Cranium saying, “Team, just letting you know that Derek has graduated from HubSpot, and we’re excited to see how he uses his superpowers in his next big adventure!” Only then do you notice that Derek is gone, that his desk has been cleared out. Somehow Derek’s boss will have arranged his disappearance without anyone knowing about it. People just go up in smoke, like Spinal Tap drummers."
On what I call "Reality Shaping":
"The ideal HubSpotter is someone who exhibits a quality known as GSD, which stands for “get shit done.” This is used as an adjective, as in “Courtney is always in super-GSD mode.” The people who lead customer training seminars are called inbound marketing professors and belong to the faculty at HubSpot Academy. Our software is magical, such that when people use it—wait for it—one plus one equals three. Halligan and Dharmesh first introduced this alchemical concept at HubSpot’s annual customer conference, with a huge slide behind them that said “1 + 1 = 3.” Since then it has become an actual slogan at the company. People use the concept of one plus one equals three as a prism through which to evaluate new ideas. One day Spinner, the woman who runs PR, tells me, “I like that idea, but I’m not sure that it’s one-plus-one-equals-three enough.”"
This is so true:
"Another thing I’m learning in my new job is that while people still refer to this business as the “tech industry,” in truth it is no longer really about technology at all. “You don’t get rewarded for creating great technology, not anymore,” says a friend of mine who has worked in tech since the 1980s, a former investment banker who now advises startups. “It’s all about the business model. The market pays you to have a company that scales quickly. It’s all about getting big fast. Don’t be profitable, just get big.”"
"On top of the fun stuff you create a mythology that attempts to make the work seem meaningful. Supposedly millennials don’t care so much about money, but they’re very motivated by a sense of mission. So, you give them a mission. You tell your employees how special they are and how lucky they are to be here. You tell them that it’s harder to get a job here than to get into Harvard and that because of their superpowers they have been selected to work on a very important mission to change the world. You make a team logo. You give everyone a hat and a T-shirt. You make up a culture code and talk about creating a company that everyone can love. You dangle the prospect that some might get rich."
Umm yea I know the feeling:
"Training takes place in a tiny room, where for two weeks I sit shoulder to shoulder with 20 other new recruits, listening to pep talks that start to sound like the brainwashing you get when you join a cult. It’s everything I ever imagined might take place inside a tech company, only even better."
On the office environment:
"Everyone works in vast, open spaces, crammed next to one another like seamstresses in Bangladeshi shirt factories, only instead of being hunched over sewing machines people are hunched over laptops. Nerf-gun battles rage, with people firing weapons from behind giant flat-panel monitors, ducking and rolling under desks. People hold standing meetings and even walking meetings, meaning the whole group goes for a walk and the meeting takes place while you’re walking."
Personally, I've learned over the years that I really need to live solidly in reality. I can't switch back and forth from "Corporate Enforced Reality #57" to "Real-World Ground-Level Reality" every single day of the week and stay happy and healthy long term.

I've learned that I don't need to associate myself with any corporation to be a "real" developer. If you aren't treated with respect by someone because you aren't associated with a company label, that person may very well not be a person you want to associate yourself with.

Here's a little rant: I think having to rigidly conform to a corporate mental model (like the insane one described at the company above) to earn money is demeaning and even dehumanizing. Folks, 1+1 does not equal 3. This is insane.

Somewhere back in time I must have fallen into some jacked parallel universe where treating workers like utterly replaceable mind controllable automatons is normal, accepted, encouraged, and even something to be proud of. In the universe of 1984, 1+1=3.

And the money the company gives you? It's just binary bits in some bank computer. There are plenty of ways of making money that don't involve becoming mentally insane. Trading off sanity for a bi-weekly set of binary digits added to your account balance actually isn't the greatest idea in my experience. Staying at a job you are unhappy with thinking that, eventually, you will achieve true long lasting happiness during "retirement" is also a pretty extreme way of living life. There are other paths to happiness.

My brain now pushes back when I think about living and working this way again. It basically says "hey that's super unhealthy and unsustainable!". I used to be irrationally fearful of working outside of a corporate bubble. Fear is your worst enemy and can make you much more manipulatable by others.

A long time ago, at Ensemble Studios in Dallas, I was completely wrapped up in our company's special little super insular "tribal" company culture. The company collapsed overnight and we all learned what was actually occurring at the corporate level for the previous 6-9 months. It became super clear that we all were living in a fairy-tale corporate enforced reality bubble. Even many of my "company friends" evaporated overnight into non-friends. It was ugly: even the very personalities of formerly awesome coworkers instantly changed.

My ego was tied solidly into this company's culture and products. After it collapsed I had to carefully hit the "ego reset button". I had to strongly resist automatically following the locally exploitative paths laid out for me after Ensemble collapsed.

Monday, January 18, 2016

Compression ratio plots of zlib competitors Brotli and BitKnit

Continuing yesterday's post, here are the compression ratios for all 12k data points (between 256 -128MB) in the LZHAM vs. LZMA test file corpus, on various codecs: LZ4HC L8, Brotli L9, Rad's BitKnit VeryHigh, LZHAM m4, and of course zlib L9.

Vertical Axis = Compression ratio (higher is more compression)
Horizontal Axis = File, sorted purely by zlib's compression ratio
Color = Codec (using the same color coding as my previous post)

The data points have been sorted by zlib's compression ratio, which is why the green line is so nice and smooth. These are the same data points as yesterday's scatter graphs.
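For anyone who wants to build this kind of plot on their own files, here's a minimal Python sketch of the bookkeeping: compute each codec's ratio per file, then order the files by zlib's ratio so the zlib series comes out monotonic. This is not the tool I used for the graphs. The function names are made up for illustration, and since Brotli, BitKnit, LZHAM and LZ4HC aren't in the Python standard library, the stdlib `lzma` and `bz2` modules are used purely as stand-in codecs.

```python
import bz2
import lzma
import zlib
from typing import Callable, Dict, List, Tuple

# Stand-in codecs (assumption: only zlib matches a codec from the plots).
CODECS: Dict[str, Callable[[bytes], bytes]] = {
    "zlib L9": lambda data: zlib.compress(data, 9),
    "lzma": lzma.compress,
    "bz2": lambda data: bz2.compress(data, 9),
}


def compression_ratio(original: bytes, compressed: bytes) -> float:
    """Uncompressed size / compressed size (higher is more compression)."""
    return len(original) / len(compressed)


def ratios_sorted_by_zlib(
    files: Dict[str, bytes]
) -> List[Tuple[str, Dict[str, float]]]:
    """Return (file name, {codec: ratio}) rows ordered by zlib's ratio."""
    rows = []
    for name, data in files.items():
        ratios = {
            codec: compression_ratio(data, compress(data))
            for codec, compress in CODECS.items()
        }
        rows.append((name, ratios))
    # Sorting by one codec's ratio is what makes that codec's line smooth.
    rows.sort(key=lambda row: row[1]["zlib L9"])
    return rows


if __name__ == "__main__":
    sample = {
        "zeros.bin": bytes(8192),
        "text.txt": b"the quick brown fox jumps over the lazy dog\n" * 200,
    }
    for file_name, codec_ratios in ratios_sorted_by_zlib(sample):
        print(file_name, codec_ratios)
```

Each row then becomes one X position on the chart, with one Y value per codec.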

LZ4HC vs. BitKnit vs. zlib:

LZ4HC vs. Brotli vs. zlib:

Brotli vs. BitKnit vs. zlib:

Here are a couple bonus plots, this time for LZHAM vs. LZ4HC or Brotli:


It's clear from looking at these plots that simply stating "codec X has a higher ratio than codec Y" is at best a gross approximation. It highly depends on the file's content, the file's size, and even how well a file's content resembles what the codec designers tuned the codec to handle best.

For example, Brotli has special optimizations (such as a precomputed static dictionary) for textual data. Also, like zlib, it uses entropy coding tables (the Huffman symbol code lengths) precomputed by the compressor, which can give it the edge on smaller files vs. codecs that use purely adaptive table updating approaches like LZHAM. (Also, I would imagine that the more stationary the data source, the more precomputed Huffman tables make sense.)

Another advantage Brotli shares with zlib due to its usage of precomputed Huffman tables: It doesn't need to spend valuable CPU time computing Huffman code lengths during decompression. LZHAM struggles to do this quickly, particularly on small files (less than approx. 4-8KB) where most decompression time is spent computing Huffman code lengths (and not actually decompressing the file!).
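To make "computing Huffman code lengths" concrete, here's a small textbook-style Python sketch of the work involved: turning symbol frequencies into code lengths. It is not LZHAM's or Brotli's actual code (real codecs typically cap the maximum code length and use much faster approaches), and the function name is hypothetical. A codec with precomputed tables just reads the lengths from the stream, while an adaptive one has to redo something like this in the decompressor whenever its statistics change.

```python
import heapq
from typing import Dict, Hashable


def huffman_code_lengths(freqs: Dict[Hashable, int]) -> Dict[Hashable, int]:
    """Return {symbol: code length in bits} for a Huffman code over freqs."""
    symbols = [s for s, f in freqs.items() if f > 0]
    if not symbols:
        return {}
    if len(symbols) == 1:
        # A lone symbol still needs one bit to be decodable.
        return {symbols[0]: 1}

    # Heap entries: (subtree frequency, unique tie-breaker, symbols in subtree).
    heap = [(freqs[s], i, [s]) for i, s in enumerate(symbols)]
    heapq.heapify(heap)
    lengths = dict.fromkeys(symbols, 0)
    next_id = len(heap)

    while len(heap) > 1:
        f1, _, group1 = heapq.heappop(heap)
        f2, _, group2 = heapq.heappop(heap)
        # Merging two subtrees pushes every symbol in them one level deeper.
        for s in group1 + group2:
            lengths[s] += 1
        heapq.heappush(heap, (f1 + f2, next_id, group1 + group2))
        next_id += 1
    return lengths


if __name__ == "__main__":
    print(huffman_code_lengths({"a": 5, "b": 2, "c": 1, "d": 1}))
```

On a tiny file this setup cost doesn't get amortized over many decoded symbols, which is the small-file effect described above.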

It's also possible to design a codec to be very strong at handling binary files. Apparently, BitKnit is tuned more in this direction. It still handles text files well, but it makes some intelligent design tradeoffs that favor really high and symmetrical compression/decompression performance with only a small sacrifice to text file ratios. This tradeoff makes a lot of sense, particularly in the game development context where a lot of data files are in various special binary formats.

Interestingly, Brotli and BitKnit seem to flip flop back and forth as to who is best ratio-wise. There are noticeable clusters of files where Brotli is slightly better, then clusters with BitKnit. I'll be analyzing these clusters soon to attempt to see what's going on. I believe this helps show that this data file corpus has a decent amount of interesting variety.

Finally, Brotli's compression ratio is just about always at least as good as zlib's (or extremely close). IMO, the Brotli team's Zopfli roots are showing strongly here.

Next Steps:

I need to break down these ratio graphs into clusters somehow, so we can then show results for "small text files" vs. "large binary files" etc. As yesterday's compressed totals show, Brotli and BitKnit have approximately equal compression power across the entire corpus. But there are categories of data where one codec is better than the other.

Looking into the future, it may be a good idea for the next major compressor to support both precomputed (Brotli-style) and semi-adaptive (LZHAM and presumably BitKnit-style) entropy table updating approaches.

Thanks to Blue Shift's CEO, John Brooks, for suggesting charting the data this way.

Sunday, January 17, 2016

zlib in serious danger of becoming obsolete


I’m now starting to deeply analyze the performance of two new general purpose data compression codecs. One is Google’s Brotli codec, another is a brand new codec from Rad Game Tools named “BitKnit”. Both codecs are attempting to displace zlib, which is used by the Linux kernel, and is one of the most used compression libraries in the world. So I’m paying very close attention to what’s going on here.

To put things into perspective, in the lossless compression world we’re lucky to see a significant advancement every 5-10 years. Now, we have two independently implemented codecs that are giving zlib serious competition on multiple axes: throughput, ratio, and even code size.


I’m now using what I think is a very interesting and insightful approach to deeply analyze the practical performance characteristics of lossless codecs. As I learned while working on the Steam Linux/SteamOS project, robust benchmarking can be extremely difficult in practice. So I'm still gathering and analyzing the data, and tweaking how it’s graphed. What I’ve seen so far looks very interesting for multiple reasons.

First, it's looking pretty certain that both BitKnit and Brotli compete extremely well against zlib's decompression performance, but at much higher (LZMA/LZHAM-like) compression ratios. Amazingly, BitKnit’s compressor is also extremely fast, around the same speed as zlib’s. (By comparison, at maximum compression levels, both Brotli’s and LZHAM's compressors are pretty slow.) The graphs in this post only focus on decompression throughput, however. I’m saving the compression throughput analysis for another post.

One rough way of judging the complexity of a compressor vs. others is to compare the number of lines of code in each implementation. BitKnit, at 2,700 lines of code (including comments), is smaller than LZ4 (3,306 - no comments), zlib (23,845 - no comments, incl. 3k lines of asm), and LZHAM (11,651 - no comments). Brotli is rather large at 47,919 lines (no comments), but some fraction of this consists of embedded static tables.

Interestingly to me, BitKnit’s decompressor uses around half the temporary work RAM of LZHAM's (16k vs. 34k).

New Benchmarking Approach

While writing and analyzing LZHAM I started with a tiny set (like 5-10) of files for early testing. I spent a huge amount of time optimizing the compressor to excel on large text XML files such as enwik8/9, which are popular in the lossless data compression world. I consider this a serious mistake, so I've been rethinking how to best benchmark these systems.

The new codec analysis approach I’m using runs each decompressor on thousands of test corpus files, then I plot the resulting (throughput, ratio) data pairs in Excel using scatter graphs. The data points are colored by codec, and the points are transparent so regions with higher density (or with data points from multiple overlapping codecs) are more easily visualized. This is far better than what the Squash Compression Benchmark does IMO, because at a single glance you can see the results on thousands of (hopefully interesting) files, instead of the results on only a single file at a time from a tiny set of corpus files.
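As a rough illustration of the data being collected, here is a minimal Python sketch that produces one (throughput, ratio) row per file and writes the rows as CSV, ready to be loaded into a spreadsheet for a scatter graph. It is not the actual benchmark app: Python's built-in zlib module stands in for all of the codecs, the three in-memory buffers are hypothetical stand-ins for corpus files, and the `measure` helper and the column names are my own inventions. It also times a single call, which is much cruder than the repeated timing described under Important Notes.

```python
import csv
import io
import time
import zlib

def measure(name, data):
    """Return one (file, codec, throughput MB/s, ratio) row for a single buffer."""
    compressed = zlib.compress(data, 9)
    start = time.perf_counter()
    restored = zlib.decompress(compressed)
    elapsed = max(time.perf_counter() - start, 1e-9)  # guard against a zero reading
    assert restored == data  # a benchmark is meaningless if the round trip fails
    throughput = len(data) / elapsed / 1e6  # MB/s of decompressed output
    ratio = len(data) / len(compressed)     # higher = more compression
    return (name, "zlib", throughput, ratio)

# Hypothetical stand-ins for corpus files: (name, contents) pairs.
corpus = [
    ("text.txt", b"the quick brown fox jumps over the lazy dog " * 2000),
    ("zeros.bin", bytes(100_000)),
    ("counter.bin", bytes(range(256)) * 400),
]

rows = [measure(name, data) for name, data in corpus]

# One CSV row per (file, codec) pair; each row becomes one point on the graph.
out = io.StringIO()
writer = csv.writer(out)
writer.writerow(["file", "codec", "throughput_mb_s", "ratio"])
writer.writerows(rows)
print(out.getvalue())
```

With several codecs, each file would contribute one row per codec, and the codec column would drive the point color.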

I generated these scatter graphs on 12k data files from the final LZHAM vs. LZMA corpus. There is some value in using these data files, because I used this same test corpus to analyze LZHAM to ensure it was competitive against LZMA. This corpus consists of a mix of game data, traditional textual data, every other compression corpus I could get my hands on, some artificial XML/JSON/UBJ test data, and lots of other interesting stuff I’ve encountered on the various projects I’ve worked on over the years. (Unfortunately, I cannot publicly distribute the bulk of these data files until someone comes up with an approach that allows me to share the corpus in a one way encrypted manner that doesn’t significantly impact throughput or ratio. I have no idea how this could really be done, or even if it's possible.)

The Data

The X axis is decompression throughput (right=faster), and the Y axis is compression ratio (higher=better ratio or more compression). The very bottom of the graph is the uncompressible line (ratio=1.0).

Color code:

Black/Gray = LZHAM
Red = Brotli
Green = zlib
Blue = BitKnit
Yellow = LZ4

Totals for 11,999 files (including uncompressible files):

Uncomp:   2,499,169,096
lz4:      1,167,777,908 2.14
zlib:     1,044,180,362 2.39
brotli:     918,949,263 2.72
bitknit:    898,621,908 2.78
lzham:      882,723,287 2.83

Totals after removing 1,330 files with a zlib compression ratio less than 1.1 (i.e. uncompressible files):

Uncomp:    2,147,530,394
lz4:         815,221,536 2.63
zlib:        693,090,474 3.10
brotli:      568,461,065 3.78
bitknit:     547,869,148 3.92
lzham:       532,235,143 4.03
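The ratio column is simply total uncompressed bytes divided by total compressed bytes. The short Python sketch below recomputes both tables from the byte totals above, and shows the kind of per-file filter that drops files with a zlib ratio under 1.1. The function names and the two sample file sizes at the end are hypothetical; only the byte totals come from the tables.

```python
# Byte totals copied from the two tables above.
UNCOMPRESSED_ALL = 2_499_169_096
COMPRESSED_ALL = {
    "lz4": 1_167_777_908,
    "zlib": 1_044_180_362,
    "brotli": 918_949_263,
    "bitknit": 898_621_908,
    "lzham": 882_723_287,
}

UNCOMPRESSED_FILTERED = 2_147_530_394
COMPRESSED_FILTERED = {
    "lz4": 815_221_536,
    "zlib": 693_090_474,
    "brotli": 568_461_065,
    "bitknit": 547_869_148,
    "lzham": 532_235_143,
}

def ratios(uncompressed, compressed_by_codec):
    """Compression ratio per codec: uncompressed bytes / compressed bytes."""
    return {codec: uncompressed / size for codec, size in compressed_by_codec.items()}

def is_compressible(uncompressed_size, zlib_size, threshold=1.1):
    """Keep a file only if zlib achieves at least the threshold ratio on it."""
    return uncompressed_size / zlib_size >= threshold

for label, total, sizes in (
    ("all files", UNCOMPRESSED_ALL, COMPRESSED_ALL),
    ("compressible only", UNCOMPRESSED_FILTERED, COMPRESSED_FILTERED),
):
    print(label)
    for codec, ratio in ratios(total, sizes).items():
        print(f"  {codec:8s} {ratio:.2f}")

# Hypothetical file sizes: the first would be dropped, the second kept.
print(is_compressible(1000, 950), is_compressible(1000, 500))
```

Removing 1,330 files from 11,999 leaves 10,669 files in the second table.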

This is a log2 log2 plot, basically an overview of the data:

This is a zoomed linear plot, looking more closely at the uncompressible (ratio=1) or nearly uncompressible (ratio very close but not 1) regions:

This log2 log2 plot is limited to just LZHAM vs. BitKnit:

Finally, another log2 log2 plot showing just BitKnit vs. zlib:

Current Observations

Fabian Giesen (Rad) and I have noticed several interesting things about these scatter plots:

- The data points with a ratio of 1 (or extremely close to 1) show how well the algorithm handles uncompressible data, which is hopefully near memcpy() performance.

(Note LZMA’s data will be very interesting, because it doesn’t have good handling for uncompressible data.)

There are a handful (around 50-60 depending on the codec) of data points with a ratio slightly below 1 (0.963-0.999). The majority are small (287 bytes to 1 KB) uncompressible files.

- Slightly "above" this ratio (very close to ratio 1, but not quite), literal handling dominates the decompressor's workload. There are distinct clusters on and near the ratio=1 line for each compressor.

LZHAM actually does kinda well here vs. the others, but it falls apart rapidly as the ratio increases.

- Notice the rough, pushed-down "<" shape of each algorithm's plot. LZHAM's is pretty noticeable. At the bottom right (ratios at or close to 1.0), literals dominate the decompression time.

Interpreting this as if all algorithms are plain LZ with discrete literals and matches:
As you go "up" to higher ratios, the decompressor has to process more and more matches, which (at least in LZHAM) are more expensive to handle vs. literals. After the “bend”, as you go up and to the right the matches grow increasingly numerous and longer (on average).

- LZHAM has an odd little cluster of data points to the right (on the ratio ~3 line) where it almost keeps up with BitKnit. Wonder what that is exactly? (Perhaps lots of easily decoded rep matches?)

- Notice zlib’s throughput stops increasing and plateaus as the ratio increases - why is that? Somebody needs to dive into zlib’s decompressor and see why it’s not scaling here.

I need to add my implementation of zlib’s core API (miniz) to see how well it compares.

Important Notes:

- The x64 benchmark command line app was run on Win10, 2x Xeon E5-2690 V2 3.0GHz (20 cores/40 threads). Benchmark app is single threaded.

- All test corpus files are between 256 bytes and 127.7 MB

- All algorithms were directly linked into the executable, and the decompressors were invoked in the same way

- Each decompressor is invoked repeatedly in a loop until 10ms has elapsed. This is done 4 times, and the shortest average runtime is taken.

- Brotli was limited to comp level 9 (level 10 can be too slow for rapid experimentation), with a 16MB dictionary size (the largest it supports)

- LZHAM was limited to using its regular parser (not its best of X arrivals parser, i.e. the "extreme parsing" flag was disabled), with a 64MB dictionary size.

- LZ4 used the LZ4HC compressor, level 8

- zlib's asm modules were not used in this run. It'll be interesting to see what difference the asm optimizations make.
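The timing scheme in the notes above (loop until 10ms has elapsed, repeat 4 times, keep the shortest average) can be sketched in a few lines of Python. This is only an approximation of the idea, not the actual benchmark app: the `shortest_average_seconds` name is made up, Python's zlib module stands in for the codec under test, and the payload is arbitrary.

```python
import time
import zlib

def shortest_average_seconds(func, min_duration=0.010, trials=4):
    """Call func in a loop until min_duration seconds have elapsed, compute the
    average time per call, and return the shortest average over all trials."""
    best = float("inf")
    for _ in range(trials):
        calls = 0
        start = time.perf_counter()
        while True:
            func()
            calls += 1
            elapsed = time.perf_counter() - start
            if elapsed >= min_duration:
                break
        best = min(best, elapsed / calls)
    return best

# Arbitrary example payload, compressed once up front so only decompression is timed.
data = b"example payload for the timing loop " * 4096
compressed = zlib.compress(data, 9)

seconds = shortest_average_seconds(lambda: zlib.decompress(compressed))
print(f"{len(data) / seconds / 1e6:.1f} MB/s")
```

Taking the shortest of several averages helps filter out noise from the OS scheduler and other background activity, and the minimum duration keeps very small files from being timed with a single, jittery call.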

Next Step

The next major update will also show LZMA, Zstd, and miniz. I'm also going to throw some classes from crunch in here to clusterize the samples, so we can get a better idea of how well each algorithm performs on different classes of data.

I feel strongly that scatter graphs like these can be used to help intelligently guide the design and implementation of new practical compressors.

A big thanks to Fabian “ryg” Giesen at Rad Game Tools for giving me access to "BitKnit" for analysis. BitKnit is going to be released in the next major version of Oodle, Rad’s network and general purpose data compression product.