Data Compression is the transformation of information in order to reduce its volume. It makes more efficient use of the hardware resources that store, process, and transmit that information.
Data Compression in NetApp storage
The Data Compression process is based on eliminating redundancy, which is characteristic of raw (uncompressed) data. The simplest example of redundancy is the same word repeated many times in a text.
To remove this kind of redundancy, each repeated occurrence of the word is replaced with a reference to an earlier copy of it. The reference is encoded in a strictly specified, fixed size, which should be smaller than the word it replaces.
The “weight” of data can also be reduced by replacing frequently occurring values with short code words and rare values with longer ones (entropy coding). If the data has no redundancy (encrypted information, “white noise”, very short signals, etc.), it cannot be compressed without losing information.
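As a rough illustration of the point about redundancy, the sketch below compresses a highly repetitive input and a random input of the same length with Python's standard `zlib` module. The sample phrase and variable names are made up for the example; exact sizes depend on the zlib version, so none are quoted here.

```python
import os
import zlib

# Highly redundant input: one short phrase repeated many times.
redundant = b"data compression " * 1000

# Input with no exploitable redundancy: bytes from the OS random source.
random_bytes = os.urandom(len(redundant))

compressed_redundant = zlib.compress(redundant)
compressed_random = zlib.compress(random_bytes)

# The repetitive input shrinks dramatically; the random input does not shrink
# and usually grows by a few bytes of container overhead.
print(len(redundant), "->", len(compressed_redundant))
print(len(random_bytes), "->", len(compressed_random))
```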
Lossless Data Compression is a process that allows the original information to be restored completely when needed: the amount of information stored does not decrease, only the space it occupies.
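A minimal way to check the lossless property is a round trip: compress, decompress, and compare with the original byte for byte. The sketch below uses Python's standard `zlib` module and a made-up sample text.

```python
import zlib

# Sample text chosen only for illustration.
original = ("the quick brown fox jumps over the lazy dog. " * 50).encode("utf-8")

packed = zlib.compress(original, 9)   # 9 = strongest compression level
restored = zlib.decompress(packed)

# The occupied space went down, but every byte can be recovered.
print(len(original), "->", len(packed))
print("identical after round trip:", restored == original)
```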
Lossless compression is possible when the probabilities of the possible messages are unevenly distributed, for example when some messages that are possible in theory never actually occur in the data being encoded.
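The effect of an uneven probability distribution can be quantified with Shannon entropy, the average number of bits per symbol that an ideal lossless coder needs. The sketch below is an illustration with made-up sample strings, and the function name `entropy_bits_per_symbol` is hypothetical. It compares four equally likely symbols with four symbols where one dominates.

```python
import math
from collections import Counter

def entropy_bits_per_symbol(message: str) -> float:
    """Shannon entropy of the symbol frequencies observed in `message`."""
    counts = Counter(message)
    total = len(message)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

# Four symbols, all equally likely: no statistical redundancy to exploit.
uniform = "abcd" * 25

# The same four symbols, but "a" dominates: far fewer bits are needed on average.
skewed = "a" * 85 + "b" * 5 + "c" * 5 + "d" * 5

print(entropy_bits_per_symbol(uniform))  # 2 bits per symbol
print(entropy_bits_per_symbol(skewed))   # below 1 bit per symbol
```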
Data Compression algorithms for unknown data types
There are two main methods for compressing data of unknown format:
- Dictionary methods: each successive input character is either copied to the output buffer in its original form, or a group of several characters is replaced by a reference to a matching group of characters that has already been encoded. This method is most often used when creating self-extracting software.
- Statistical methods: statistics (how often each value occurs) are collected for the sequence being compressed, either once up front or continuously as encoding proceeds. Based on these statistics, the probability of each possible value of the next character (or sequence of characters) is estimated. Entropy coding is then used to replace frequently occurring values with short code words and rare ones with longer ones.
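The two methods in the list above can be sketched in a few lines of Python. This is a simplified illustration, not the algorithm of any particular product: the first method is shown as an LZ77-style encoder that emits either literal characters or (distance, length) references, and the second as a Huffman code built from symbol frequencies collected once up front. The function names, the window size, and the minimum match length are assumptions chosen for the example.

```python
import heapq
from collections import Counter

def lz_encode(data: str, window: int = 255, min_match: int = 3) -> list:
    """First method: emit literals or (distance, length) back-references."""
    tokens, i = [], 0
    while i < len(data):
        best_len, best_dist = 0, 0
        # Brute-force search of the recent window for the longest match.
        for j in range(max(0, i - window), i):
            length = 0
            # A match may run past position i (overlap), as in LZ77.
            while i + length < len(data) and data[j + length] == data[i + length]:
                length += 1
            if length > best_len:
                best_len, best_dist = length, i - j
        if best_len >= min_match:
            tokens.append((best_dist, best_len))
            i += best_len
        else:
            tokens.append(data[i])
            i += 1
    return tokens

def lz_decode(tokens: list) -> str:
    out = []
    for token in tokens:
        if isinstance(token, tuple):
            dist, length = token
            for _ in range(length):
                out.append(out[-dist])  # copy one character at a time so overlaps work
        else:
            out.append(token)
    return "".join(out)

def huffman_codes(data: str) -> dict:
    """Second method: prefix codes from symbol frequencies (non-empty input)."""
    freq = Counter(data)
    if len(freq) == 1:
        return {next(iter(freq)): "0"}
    # Heap entries: (weight, unique tie-breaker, {symbol: code built so far}).
    heap = [(w, k, {sym: ""}) for k, (sym, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        w1, _, left = heapq.heappop(heap)
        w2, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (w1 + w2, next_id, merged))
        next_id += 1
    return heap[0][2]

text = "abracadabra abracadabra abracadabra"
tokens = lz_encode(text)
codes = huffman_codes(text)
print(tokens)   # literals first, then references to earlier text
print(codes)    # the most frequent symbol gets one of the shortest codes
```

The brute-force match search keeps the sketch short but is quadratic in the window size; practical implementations typically use hash tables or similar structures to find matches. Formats such as DEFLATE combine both ideas, applying entropy coding to the output of the dictionary stage.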