Recurring Sequences Compression (RSC) - A Theory

Introduction

I find the subject of data compression fascinating. The idea of taking something 'long', making it 'short', and still having it keep the exact same 'meaning' is pretty cool, and extremely useful - for transferring data from one place to another over lower-capacity and/or slower media, for example. Data compression is everywhere - from pictures through audio to video. It's an important part of almost any type of technological equipment. It saves two very valuable and expensive things for humans - it saves TIME and SPACE.

For years, as a hobby, I've been toying with trying to figure out and develop all sorts of compression algorithms of my own. I'd like to describe one of those in this article - I think it's pretty easy to understand, but much less easy to execute, as you will soon see.

But first:

Really quick recap of how computer data compression works

"Binary" is a language that uses exactly 2 'symbols'. In the computer-world context - those symbols are called '0' and '1'. You can build an infinite number of 'words' (just like with any language) out of these two symbols. 011101 is a word, 1110000011110 is a word, 101 is a word and so on. All data ('files') on computers is saved in binary (hardware-wise - the '0' and '1' are represented in some-sort of an 'On' and 'Off' electrical states). Traditionally - files are comprised of any number of 8-symbol long 'sequences' (each comprised of 8 0's and 1's). one symbol (0 or 1) - is called a 'bit' (Binary digIT) - and each sequence of 8 bits is called a 'byte'.

Data compression works by taking one of those bytes (or several of them together) - let's say 01100111 - and calling it something else, something shorter - let's say 01. In this case, the compression algorithm 'decides' that 01 will be equal to 01100111, and thus, when analyzing an input file, it will replace every occurrence of 01100111 with 01 (usually also adding some sort of mechanism, a 'dictionary' of sorts, so the decoder will 'know' this). If 01100111 appears in that input file 120 times (that's 120 bytes), and each time it is replaced with 01, those 120 bytes will take just 240 BITS - that's 30 bytes. 30 bytes that 'mean' exactly the same as 120 bytes. That's 75% less. That's compression.
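To make the arithmetic concrete, here is a minimal C++ sketch - illustrative only, not a real codec - that counts the occurrences of the byte 01100111 (0x67) in a buffer and compares their original size in bits with the size after each occurrence is replaced by a hypothetical 2-bit code:

```cpp
#include <cstddef>
#include <cstdint>
#include <iostream>
#include <vector>

int main() {
    // A made-up 1000-byte buffer with 120 planted occurrences of 0x67 (01100111).
    std::vector<std::uint8_t> data(1000, 0x00);
    for (std::size_t i = 0; i < 120; ++i)
        data[i * 8] = 0x67;

    // Count how many times the byte appears.
    std::size_t count = 0;
    for (std::uint8_t b : data)
        if (b == 0x67) ++count;

    std::size_t original_bits = count * 8;  // 120 bytes = 960 bits
    std::size_t coded_bits    = count * 2;  // 120 two-bit codes = 240 bits = 30 bytes

    std::cout << count << " occurrences: " << original_bits << " bits before, "
              << coded_bits << " bits after (dictionary overhead not included)\n";
}
```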

The Recurring Sequences Compression theory

The premise of any compression algorithm is to first decide which parts of a file's data are the best candidates for compression - most likely the parts that appear the most times in the file and/or are the longest.

The RSC theory starts with a number, which I will call the Reference Sequence Size (RSS). That number can be any number greater than 0; the higher it is, the better the chances for a 'substantial' amount of compression. Let's choose a value for that number: 256. (8-bit bytes have exactly 256 different combinations of bits, so this number makes a good example to explain with, although it could be any number, as said.)

Now it works like this: we take the first 256 bytes of the file (byte number 0 through byte number 255) and use this sequence of bytes as a 'reference'. Then we take the next sequence of 256 bytes after that (bytes 256 through 511) and check whether it is identical to our reference sequence; we take a note if it is, then we move on to the 'next' sequence - bytes 257 through 512 - and test again. Then we check bytes 258 through 513, then 259 through 514... We repeat that until we reach the end of the file.
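Assuming the whole file is loaded into memory, one such 'pass' for a single reference could look something like the following C++ sketch (the names here are illustrative, not taken from the author's program): it takes the RSS-byte window starting at a given offset as the reference and counts how many of the following RSS-byte windows are identical to it.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Count how many RSS-byte windows after the reference window are identical to it.
std::size_t countMatches(const std::vector<std::uint8_t>& file,
                         std::size_t refStart, std::size_t rss) {
    if (rss == 0 || refStart + rss > file.size()) return 0;
    const std::uint8_t* ref = file.data() + refStart;

    std::size_t matches = 0;
    // Every byte after the reference window starts a candidate window,
    // except the last rss-1 bytes of the file, whose windows would be too short.
    for (std::size_t pos = refStart + rss; pos + rss <= file.size(); ++pos) {
        if (std::memcmp(ref, file.data() + pos, rss) == 0)
            ++matches;
    }
    return matches;
}
```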

After that, we do the same process again, only this time we use a different sequence as the reference - byte 1 through byte 256 - and we again compare it against every other 256-byte-long sequence. Then we do it again and again, each time referencing a different sequence, until we reach the end of the file. What we will have then is statistical data about the number of occurrences of 256-byte sequences in the file.

Then we can take the most common sequence appearing in the file and give it a representing symbol - the shorter the symbol, the better. We will give that sequence the symbol 0. We can give the symbol 1 to the second most common sequence, then the symbol 00 to the next, 01 to the next, and so on. When we run out of redundant sequences, we can rewrite the file with our new symbols, thus reducing its size.

Then we can repeat the process of testing sequences against a reference sequence, only this time using a shorter RSS - for example, 255. This time we make sure we don't touch the bits that have already been compressed - those that already represent sequences of 256 bytes (we should save notes about their locations in the file). We again assign shorter symbols (ones that haven't been assigned yet) to the 255-byte sequences and rewrite them to the file. We then use an RSS of 254 and repeat, then 253, and so on, until we reach a file filled with just our own symbols, or we reach an RSS of 2. We will also have to somehow save the 'meaning' of each symbol in the file, which will add some size to the output, but basically we should end up with a much shorter file.
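One way to gather the same occurrence statistics for a single RSS is sketched below in C++, under the assumption that the file fits in memory: every RSS-byte window is counted in a hash map, and the windows are then ranked by frequency so the most common one can receive the shortest symbol. Note that this uses a hash map rather than the pairwise reference comparison described above, and it leaves out the symbol scheme and the rewriting step entirely.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Return all distinct RSS-byte windows in the file, most frequent first.
std::vector<std::pair<std::string, std::size_t>>
rankWindows(const std::vector<std::uint8_t>& file, std::size_t rss) {
    std::unordered_map<std::string, std::size_t> counts;
    for (std::size_t pos = 0; pos + rss <= file.size(); ++pos) {
        std::string window(reinterpret_cast<const char*>(file.data()) + pos, rss);
        ++counts[window];
    }

    std::vector<std::pair<std::string, std::size_t>> ranked(counts.begin(),
                                                            counts.end());
    std::sort(ranked.begin(), ranked.end(),
              [](const auto& a, const auto& b) { return a.second > b.second; });
    return ranked;  // ranked[0].first is the most common RSS-byte sequence
}
```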

Own testing (so far...)

After thinking about this, I wrote a C++ program that put this algorithm into action. My test file was about 800,000 bytes long (800KB). For doing the calculations on the fly, I soon realized that the amount of memory (RAM) on the computer (2GB) wouldn't be nearly enough - but I could use the much greater amount of space on the hard drives. But then I discovered that for the program to run and show results, it takes time - a great amount of time - way too much time. My 1-year-old PC is pretty fast and strong (2.4GHz quad-core CPU).

At first I started my test with an RSS of 65,535 bytes (the highest 'decimal value' of a 2-byte number - roughly 64K). In an 800KB file there are about 734,465 such 64K sequences (each byte in the file is actually the first byte of a 64K sequence, except the bytes in the last 64K of the file, since the sequences that start with those are too short - shorter than 64K). On my computer, each 'pass' for 1 reference took about 5-6 minutes(!) to complete. 5 * 734,465 = 3,672,325 minutes. That's about 61,205 hours, which adds up to about 2,550 days - or almost 7 years(!). When I tried lower RSSs, like 256, 16, 8 or even 4, the time it took for 1 pass to finish was about 2-3 minutes - which, on an 800KB file, still means close to 4 years to complete. :(
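Spelled out as code, that back-of-the-envelope estimate looks like this (the 5-minutes-per-pass figure is just the rough measurement quoted above):

```cpp
#include <iostream>

int main() {
    const long long fileBytes      = 800000;            // ~800KB test file
    const long long rss            = 65535;             // reference sequence size
    const long long passes         = fileBytes - rss;   // 734,465 possible references
    const double    minutesPerPass = 5.0;               // rough measurement

    const double totalMinutes = passes * minutesPerPass; // ~3,672,325 minutes
    std::cout << totalMinutes / 60.0                  << " hours, "
              << totalMinutes / (60.0 * 24.0)         << " days, "
              << totalMinutes / (60.0 * 24.0 * 365.0) << " years\n";
}
```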

Also see the next section (Revisions):

Revisions

"Reference Bank"

Something that could reduce the amount of time a full set of passes for one RSS takes is to use a "reference bank" (refbank). For each reference sequence, we check whether that reference is already inside the refbank: if it is, we skip the checks for that reference; if it isn't, we add the reference to the bank after finishing the checks for it. This could speed things up in the long run, if there are indeed enough redundant references in the file (and that's what this whole theory is counting on anyway). If the refbank is kept in RAM, then on large files it could use up a lot of memory and eventually run out of it - but again, that depends on whether or not the file has many identical references (which, again, we hope it does). I ran a test, without a refbank, on a 10KB file - it took about 8 hours to finish per RSS (of 16 bytes). I'm running another test on the same 10KB file, this time using a refbank. So far, it seems to be a little faster...
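A minimal C++ sketch of the refbank idea follows, assuming the whole file is held in memory and using a std::unordered_set of the reference windows themselves (the names and structure are illustrative, not taken from the author's actual program). It runs the same per-reference pass as the earlier sketch, and it inherits the memory caveat described above, since every distinct reference ends up stored in the set.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>
#include <unordered_set>
#include <vector>

// Run all per-reference passes for one RSS, skipping any reference window
// that has already been processed (the "reference bank").
void runAllPasses(const std::vector<std::uint8_t>& file, std::size_t rss) {
    std::unordered_set<std::string> refBank;   // references already processed

    for (std::size_t refStart = 0; refStart + rss <= file.size(); ++refStart) {
        std::string ref(reinterpret_cast<const char*>(file.data()) + refStart,
                        rss);
        if (refBank.count(ref) != 0)
            continue;                          // identical reference seen before: skip its pass

        // The expensive per-reference pass (same idea as countMatches above).
        std::size_t matches = 0;
        for (std::size_t pos = refStart + rss; pos + rss <= file.size(); ++pos)
            if (std::memcmp(ref.data(), file.data() + pos, rss) == 0)
                ++matches;
        (void)matches;                         // record/report the count as needed

        refBank.insert(std::move(ref));        // remember this reference
    }
}
```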