shrynk – Using Machine Learning to learn how to Compress


    


My server storing cryptocurrency data started to overflow as I was storing compressed CSV files. Trying to come up with a quick solution, I figured I should just switch to a more effective compression algorithm (the data was stored using gzip). But how to quickly figure out which one would be better?

Fast forward: I made the package shrynk for compression using machine learning! It helps you by choosing (and applying) the best format to compress your dataframes, JSON, or actually, files in general.

Given example data, it is able to compress using 30% less disk space overall by using a mixed strategy learned by machine learning, compared to the best single compression algorithm (meaning: choosing any single compression algorithm to compress everything would be worse).

You can try it for yourself at https://shrynk.ai.

Bonus: If the algorithm has it wrong, the features of the data (not the data itself) will be added to the python package on the next release!

Next, I will explain Compression, Machine Learning and the library I’ve built in Python.

Compression

Compression is about reducing the size of data. The obvious reason for this is to be able to store more data. Since IO / networking (to disk or over the internet) is often the bottleneck, and not CPU processing, it makes sense to compress data before it goes in transit, and incur the time for decompressing afterwards.
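As a minimal illustration of that trade-off (plain Python, not part of shrynk), round-tripping some repetitive data through gzip spends a bit of CPU but yields far fewer bytes to store or send:

    import gzip

    # highly repetitive data compresses extremely well
    data = b"female,35\n" * 10_000

    compressed = gzip.compress(data)            # pay CPU once, before storage / transit
    print(len(data), "->", len(compressed))     # 100000 -> a few hundred bytes

    assert gzip.decompress(compressed) == data  # pay CPU again when reading it back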

Different algorithms exist for different types of data, such as images, video and audio, but also text and general-purpose files. Any data can be compressed as long as there is a pattern to use. For example, a compression algorithm might find that in images there is a lot of repetition: there will be areas of pixels with the same color, and it is possible to use compression to avoid having to store every single pixel individually.

In video, it is not just “2D” but there’s a time dimension as well, and one successful way to compress might be to only store the delta between frames, that is, only store where the video is different from the last frame. Yet another example is whitespace, which can be compressed in JSON files.
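For instance, simply dropping the optional whitespace already shrinks a JSON document (a plain-Python illustration, unrelated to shrynk):

    import json

    record = {"gender": "female", "age": 35}

    pretty = json.dumps(record, indent=4)                 # readable, lots of whitespace
    compact = json.dumps(record, separators=(",", ":"))   # no spaces after ',' and ':'
    print(len(pretty), len(compact))                      # the compact form is smaller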

Compressing tabular data

The first use case of shrynk has to do with matrix / tabular data: rows and columns of data. Here is an example file called toy_data.csv:

    gender,age
    female,...
    female,35
    female,...
    male,63
    male,69

Let’s concern ourselves with the gender variable only. For each row, storing female or male in full is not optimal, given that we know these are the only values in this column. One improvement could be to take all the values in the column and do a replacement: F for female and M for male. The shorter the string, the better, right? Of course, to decompress you need to add the extra information that F means female and M means male, but like this you only have to store the “longer” string once. This is a part of what is called dictionary encoding and is used in the Parquet format.

Another optimization of Parquet is to use run-length encoding, which makes use of the observation that very often the same value occurs in sequence. Oversimplifying a bit, we could encode the gender column as 3F2M. Decompressing this would expand it back into the original data.
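To make both ideas concrete, here is a toy sketch (illustrative only, not Parquet’s actual implementation) that dictionary-encodes and then run-length-encodes the gender column:

    from itertools import groupby

    column = ["female", "female", "female", "male", "male"]

    # dictionary encoding: store each long value once, keep short codes per row
    dictionary = {"female": "F", "male": "M"}
    encoded = [dictionary[value] for value in column]    # ['F', 'F', 'F', 'M', 'M']

    # run-length encoding: store (count, code) for each run of identical codes
    runs = [(len(list(group)), code) for code, group in groupby(encoded)]
    print(runs)                                          # [(3, 'F'), (2, 'M')] -> roughly "3F2M"

    # decompressing expands it back into the original data
    reverse = {"F": "female", "M": "male"}
    restored = [reverse[code] for count, code in runs for _ in range(count)]
    assert restored == column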

Note that at the same time, you might be able to imagine that the Parquet format is not necessarily better for floating-point values (like pi): since such values are mostly unique, we cannot use these tricks.

Tabular data in Python

In the case of tabular data, we often use pandas.read_csv to read in data from disk and produce a DataFrame.

However, CSV is often uncompressed. A dataframe can easily be written back to disk using compression though. Here are several available options, with and without compression:

    fastparquet UNCOMPRESSED
    fastparquet GZIP
    fastparquet SNAPPY
    fastparquet LZO
    fastparquet LZ4
    fastparquet ZSTD
    pyarrow brotli
    pyarrow gzip
    pyarrow lz4
    pyarrow snappy
    pyarrow zstd
    csv gzip
    csv bz2
    csv zip
    csv xz
    csv

The first part is the “engine” (csv is handled by df.to_csv, and pyarrow and fastparquet are different libraries used for df.to_parquet), whereas the latter part is the compression that is being used.
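In plain pandas (standard pandas calls, not shrynk; assuming the toy_data.csv from above exists), writing the same dataframe with a few of these combinations looks roughly like this:

    import pandas as pd

    df = pd.read_csv("toy_data.csv")

    df.to_csv("toy_data.csv.bz2", compression="bz2", index=False)                     # csv bz2
    df.to_parquet("toy_data.snappy.parquet", engine="pyarrow", compression="snappy")  # pyarrow snappy
    df.to_parquet("toy_data.gzip.parquet", engine="fastparquet", compression="GZIP")  # fastparquet GZIP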

It turns out that it is difficult to find out when to use which format, that is, to find the right “boundaries” when choosing between so many options. It depends on a lot of factors.

It would be great to know which compression would be best, wouldn’t it?

Running benchmarks

First of all, shrynk can help with running benchmarks for your own file, or of course, for a collection of files.

I took the gender / age toy data from before and show below how to get the actual profiled results per compression for a file.

    from shrynk import show_benchmark

    # takes either a(n un)compressed filename, or a DataFrame / object
    show_benchmark("toy_data.csv")

    # Output (sorted on size)
             engine   compression  size  write  read
                csv           bz2    89    ...   ...
                csv          None   ...    ...   ...
                csv          gzip   100    ...   ...
                csv            xz   126    ...   ...
                csv           zip   ...    ...   ...
        fastparquet           LZ4   286    ...   ...
        fastparquet          GZIP   ...    ...   ...
        fastparquet        SNAPPY   ...    ...   ...
        fastparquet          ZSTD   ...    ...   ...
        fastparquet           LZO   ...    ...   ...
        fastparquet  UNCOMPRESSED   ...    ...   ...
            pyarrow        brotli   ...    ...   ...
            pyarrow           lz4   ...    ...   ...
            pyarrow        snappy   ...    ...   ...
            pyarrow          zstd   ...    ...   ...
            pyarrow          gzip   ...    ...   ...

You can see that in the case of this file, it is recommended to store the data as csv bz2, as it yields a very small file on disk (and it uses the default settings).

Running all the compression benchmarks only took 3 seconds on my laptop, which might make you wonder why we wouldn’t just always run the benchmarks. Well, for any sizeable dataframe you do not want to have to run the benchmarks, as this can take a very long time - especially when an unfortunate compression algorithm takes very long for the particular data at hand. At the same time, why run the benchmarks when we can predict? Enter Machine Learning.

Machine Learning

Computers are good at automating human decisions. In order to make decisions without having to be explicitly programmed to do so, a computer can instead leverage historical data - characteristics together with the actual decisions made - to predict the best decision for new, unseen characteristics.

For example, Machine Learning helps in deciding whether a new email is spam, based on users having labeled loads of other emails in the past as either spam or not spam (the decisions), and using the words (the characteristics) that occur more frequently in the spam emails (e.g. “Nigerian Prince”) compared to the non-spam emails (e.g. “Hi John”) to drive the decision. Well, assuming that the person’s name was actually John, of course, otherwise it’s probably spam again.

The usual downside of machine learning is that it is very costly to gather the correct decisions - it is often manual labor by people.

The cool thing about machine learning for compression is that we can try all compressions for a file to find out what the optimal decision would be, without much cost (except time).

In this case, roughly 3000 files have been used to model the compression. Their characteristics have been recorded, together with the quantities we would like to learn to minimize:

- how large the files become (size on disk),
- how long a write takes (write time), and
- how long a read takes (read time).

It could be that, based on data with a certain number of rows and columns of price data, compression A is best in terms of size, but if it concerns a different number of rows and columns of time data, compression B would be best. Perhaps depending on other factors, such as the number of text columns, the situation will again be different.

To give more example features: currently it will consider the percentage of null values, the average string length, how distinct variables are (cardinality), and a few more. These are all used to figure out which compression to apply in a similar context. Check the shrynk repository for an overview of the features used.
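As a rough sketch of what computing such features with pandas could look like (illustrative only; shrynk’s exact feature set and naming may differ), assuming the toy_data.csv from earlier:

    import pandas as pd

    def dataframe_features(df: pd.DataFrame) -> dict:
        """Compute a few cheap characteristics of a dataframe."""
        text_cols = df.select_dtypes(include="object")
        avg_str_len = (
            text_cols.astype(str).apply(lambda col: col.str.len().mean()).mean()
            if not text_cols.empty else 0.0
        )
        return {
            "num_rows": len(df),
            "num_cols": df.shape[1],
            "percent_missing": float(df.isnull().mean().mean()),
            "avg_string_length": float(avg_str_len),
            "cardinality": float(df.nunique().mean()),
        }

    print(dataframe_features(pd.read_csv("toy_data.csv")))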

Machine Learning in Shrynk

Note that there are other attempts at using machine learning to do compression, most notably in compressing images. Shrynk uses a meta approach instead, as it only uses already existing compression methods.

For each file, shrynk will apply each compression, gather the resulting values for size, write time and read time, and convert them to z-scores: this makes them comparable along each dimension, and is a novel approach as far as I know. The compression with the lowest sum of z-scores, given the user weights, gets chosen as the classification label. Let’s see that in slow motion.

First to see how converting to z-scores works:

    from sklearn.preprocessing import scale

    scale([1, 2, 3])
    # array([-1.224, 0., 1.224])

You can see that the absolute scale does not matter but the relative difference does: multiplying all the values by the same factor yields exactly the same z-scores. Also note that here we are ignoring the unit (bytes vs seconds).

Here is a fake example showing the compression scores of a single imaginary file, considering only size and write:

             size    write
    compr A  114 kb  2s
    compr B  218 kb  1s
    compr C  307 kb  3s

Converting each column to z-scores:

             z-size  z-write
    compr A   -1.26     0.00
    compr B    0.06    -1.22
    compr C    1.19     1.22

Then we multiply the z-scores by the user weights (size=1, write=2):

             w*z-size  w*z-write | user-z-sum
    compr A     -1.26       0.00 |      -1.26
    compr B      0.06      -2.45 |      -2.39
    compr C      1.19       2.45 |       3.64

In the last column you can see the sum over the rows to get a weighted z-score for each compression.
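The same selection can be written out in a few lines of numpy and scikit-learn (a sketch of the idea, not shrynk’s internal code):

    import numpy as np
    from sklearn.preprocessing import scale

    # benchmark results per compression: [size_kb, write_s]
    results = np.array([
        [114, 2],   # compr A
        [218, 1],   # compr B
        [307, 3],   # compr C
    ])
    weights = np.array([1, 2])               # user weights: size=1, write=2

    z = scale(results)                       # z-score each column separately
    weighted_sum = (z * weights).sum(axis=1)
    best = ["compr A", "compr B", "compr C"][int(weighted_sum.argmin())]
    print(weighted_sum, best)                # compr B has the lowest weighted z-score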

Given the example data and s=1 and w=2, compression B would have the lowest summed z-score and thus be best! This means that the characteristics of this data (such as num_rows etc.) and the label compression B will be used to train a classification model.

In the end, the input will be this result for each file (so the sample size is number_of_files, not number_of_files * number_of_compressions).

A sample data set might look like (completely made up):

    num_cols  num_rows  missing | best_label
         ...      1000      0.1 | csv bz2
           5   1575879      0.0 | fastparquet GZIP
         ...       ...      0.2 | pyarrow brotli
         ...       ...      ... | csv gzip
    ... and so on ...
A simple RandomForestClassifier will be trained on the included benchmark data for 3000 files, based on the user weights.
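Under the hood this is ordinary supervised classification; a minimal scikit-learn sketch of the idea (the feature columns and the made-up rows below are illustrative, mirroring the fake table above, not shrynk’s actual training data):

    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier

    # each row: the features of one benchmarked file plus the winning "engine + compression"
    benchmarks = pd.DataFrame({
        "num_cols":   [12, 5, 40, 3],
        "num_rows":   [1000, 1575879, 250, 90000],
        "missing":    [0.1, 0.0, 0.2, 0.05],
        "best_label": ["csv bz2", "fastparquet GZIP", "pyarrow brotli", "csv bz2"],
    })

    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    clf.fit(benchmarks.drop(columns="best_label"), benchmarks["best_label"])

    # predict the best format for a new file, given only its cheap-to-compute features
    new_file = pd.DataFrame([[8, 50000, 0.3]], columns=["num_cols", "num_rows", "missing"])
    print(clf.predict(new_file))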

Knowing that, let’s look at some usage examples.

Usage

Basic example of saving (which predicts the best format) and loading:

    import pandas as pd
    from shrynk import save, load

    # save dataframe compressed
    my_df = pd.DataFrame({"a": [1]})
    file_path = save(my_df, "mypath.csv")
    # e.g. mypath.csv.bz2

    # load compressed file
    loaded_df = load(file_path)
If you just want the prediction, you can also use infer:

    import pandas as pd
    from shrynk import infer

    infer(pd.DataFrame({"a": [1]}))
    # {"engine": "csv", "compression": "bz2"}

For more control, it is possible to use the shrynk classes directly:

    import pandas as pd
    from shrynk import PandasCompressor

    df = pd.DataFrame({"a": [1, 2, 3]})

    pdc = PandasCompressor()
    pdc.train_model(3, 1, 1)          # user weights for size, write and read

    pdc.predict(df)
    pdc.show_benchmark(df, 3, 1, 1)

After installing, it is also available on the command line as shrynk. Use it to automagically compress, decompress and run benchmarks for files:

    shrynk compress mydata.csv
    shrynk decompress mydata.json.gz
    shrynk benchmark mydata.png

Data comes packaged

Note that the data comes packaged in shrynk: this is a feature, not a bug. It allows people to provide their own requirements for size, write and read speeds. It will then train a model on the spot (if not trained earlier). I also hope that including the data encourages others to see if they can improve the compression score :) You can also use it for validating improvements to the algorithm, or, more generally, the approach.

How well does it work? Cross-validation

I have built in a validate function that is available to all shrynk classes, so you can test your own strategies, or train and verify results on your own data.

It uses the available data for the given class and produces cross-validation results, taking into account the user-defined weights.

Validation was a tough one, mostly because it took me a while to figure out how to express the results in a meaningful way when allowing the use of weights. In the end I chose to keep the dimensions of interest - size, read and write - and show how the aggregate of shrynk’s predictions per object compares (in proportions) against always choosing a single strategy. I hope the example makes it clear.

See the code below:
    from shrynk import PandasCompressor

    pdc = PandasCompressor()
    weights = (1, 0, 0)   # size=1, write=0, read=0
    acc, result = pdc.validate(*weights)

Note that shrynk in the example below is not only 35% better in size (check the arrows), but also much better in terms of read and write time compared to always applying the single best strategy, csv xz. The size value of roughly 1.0 indicates it is only 0.1% away from what would be achievable in terms of size if it would always choose the best compression per file in the validation set. At the same time, it is roughly 6 times slower in terms of reading time compared to always choosing the best. It was, after all, optimizing for size.

    [shrynk] s=1 w=0 r=0
    ----------------
    it 0/5: accuracy shrynk 63%
    it 1/5: accuracy shrynk 100.0%
    it 2/5: accuracy shrynk …%
    it 3/5: accuracy shrynk 44%
    it 4/5: accuracy shrynk …%
    Avg Accuracy: 0.…

    results sorted on SIZE, shown in proportion increase vs ground truth best

                                 size   read   write
    shrynk               --->   1.0…   6…     7…
    csv xz               --->   1…     …      …
    csv bz2                     1.387   …      …
    csv zip                     2…     …      …
    fastparquet ZSTD            4…     5…     6…
    fastparquet GZIP            4…     6…     8…
    fastparquet LZO             5…     4…     6…
    fastparquet LZ4             5…     4…     6…
    fastparquet SNAPPY          6…     5.112  6.652
    csv                         7…     9…     …
    csv gzip                    7…     9…     …
    pyarrow brotli              8…     2…     3…
    pyarrow zstd                8.715  1…     …
    pyarrow gzip                9.0…   2…     2…
    pyarrow lz4                 9…     1…     1…
    pyarrow snappy              9…     1…     1…
    fastparquet UNCOMPRESSED    …      5…     6…

Lastly, let's look at user weights of size=3 and write=1 to see a mixed approach:

                                 size   write    z
    shrynk                      1…     4…     -4…
    csv bz2                     1…     9…     -3…
    csv zip                     2…     7…     -3…
    fastparquet ZSTD            6…     6…     -1…
    fastparquet GZIP            6…     …       …
    fastparquet LZO             6…     6…     -0…
    fastparquet LZ4             7…     6…      …
    csv xz                      1.023  …       …
    fastparquet SNAPPY          7…     6…     -0.395
    pyarrow zstd                …      1…      0…
    pyarrow brotli              …      2…      0…
    pyarrow lz4                 …      1…      0…
    pyarrow snappy              …      1…      0…
    pyarrow gzip                …      2…      0…
    csv gzip                    …      6…      1…
    csv                         …      6…      1…
    fastparquet UNCOMPRESSED    …      6…      8…
Conclusion

Since having to come up with manual rules for when to use which compression would be very complex and time-costly, this was a great case for machine learning. It is also very scalable: when a new compression algorithm shows up on the market, shrynk will be able to benchmark it, and no new rules have to be figured out by hand.

Note that I am currently saving roughly 30% of disk space compared to if I would have used only the single best strategy (in reality, you might not even always use the single best strategy but a worse one). Though, granted, this was on rather heterogeneous data: small, big, and really versatile, e.g. in terms of text and missingness. To prevent bias, the training data was balanced so that each compression algorithm occurs equally often as the best one.

You know you’re in for a win with machine learning when you can get (or even create) data cheaply.

It also illustrates an awesome way to enhance coding: by having a computer predict the best way to do “something” - in this case, compress.

What's next

If you've liked this, feel free to follow me on github and leave a star for shrynk on github. If you haven't tried it yet, you can see what it looks like in action at https://shrynk.ai. And please help out if you're interested, as any attempts at Pull Requests will be welcomed: it's a python project, by the community, for the community.

- Further investigation of which variables are most predictive (and cheap to compute). The stats computation could probably be sped up as well once we know which features are most successful.
- Add other data types to the framework (JSON and Bytes have recently been added).
- I've been working on a package to generate artificial data. Imagine creating slight variations based off requirements, or off existing dataframes, to have data evolve in the direction we want, in order to learn better boundaries!
- Other improvements?
        
