On Implementation of Word-based Compression Methods
Jakub Jaroš, Petr Procházka, Jan Holub
Department of Computer Science and Engineering, FEE CTU Prague, Karlovo náměstí 13, 121 35 Praha 2, Czech Republic
Memics 2008 + PSC
Table of Contents
1 Introduction
  Basic Notions
  Related Work
2 Open Dense Code
  Basic Ideas
  Experiments
3 Summary
PSC J. Jaroš, P. Procházka, J. Holub (FEE CTU)
Word-based Data Compression
Memics’08+PSC
Introduction
Basic Notions
The Data Compression
A process that reduces time and/or space.
Lossless data compression: a fully reversible process.
Dictionary methods: use (build up) a dictionary, e.g. LZ77, LZ78, LZW, WBLZW, WLZW.
Statistical methods: use statistical information (probability of occurrences), e.g. Huffman code, arithmetic code, WAC, dense codes.
Character-based vs. word-based approach.
The Word-based approach
Efficient for textual data (natural languages, formal languages, ...).
Uses words instead of characters as the symbols of the alphabet.
Strictly alternating sequence of words and non-words.
Faster adaptation to the encoded data.
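The alternation of words and non-words can be illustrated with a short sketch. It assumes that a word is a maximal run of characters matched by `\w` (alphanumerics plus underscore) and that a non-word is a maximal run of anything else. The name `split_words` and this definition of a word are assumptions made for illustration, not taken from the presented implementation.

```python
import re

# Assumption: a "word" is a maximal run of \w characters,
# a "non-word" is a maximal run of everything else.
_TOKEN = re.compile(r"\w+|\W+")


def split_words(text):
    """Split text into maximal runs, which alternate between words and non-words."""
    return _TOKEN.findall(text)


tokens = split_words("To be, or not to be.")
# Joining the tokens reproduces the input, so the split is lossless.
assert "".join(tokens) == "To be, or not to be."
```

Because every token is a maximal run, two words (or two non-words) can never be adjacent, which is what makes the sequence strictly alternating.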
Introduction
Related Work
End-Tagged Dense Code (ETDC)
Word-based compression method proposed by Brisaboa et al. (2003).
Byte-oriented, which improves the speed of compression and decompression.
Uses the probability of occurrence to define the rank of each word.
A shorter codeword is assigned to a word with a higher rank.
Structure of codewords: a sequence of blocks; the most significant bit of each block marks whether it is the last block.
[Figure: a codeword of three blocks whose leading bits are 0, 0, 1]
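The rank-to-codeword mapping can be sketched as follows. This is a minimal illustration of a static end-tagged code with 8-bit blocks, assuming that ranks start at 0 and that the flag bit is 1 in the last byte and 0 elsewhere. The function names are hypothetical and the sketch is not the authors' implementation.

```python
def etdc_encode(rank):
    """Map a word rank (0 = most frequent) to an end-tagged byte codeword."""
    out = [128 + rank % 128]      # last byte: flag bit set to 1
    rank = rank // 128 - 1
    while rank >= 0:
        out.append(rank % 128)    # preceding bytes: flag bit 0
        rank = rank // 128 - 1
    return bytes(reversed(out))


def etdc_decode(code):
    """Inverse of etdc_encode for a single complete codeword."""
    rank = 0
    for b in code[:-1]:
        rank = (rank + b + 1) * 128
    return rank + code[-1] - 128


# Ranks 0..127 get one byte, the next 128 * 128 ranks get two bytes.
assert len(etdc_encode(127)) == 1
assert len(etdc_encode(128)) == 2
```

Since only the last byte has its flag bit set, a decoder can find codeword boundaries without any lookup table.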
(s, c)-Dense Code (SCDC)
Word-based compression method proposed by Brisaboa et al. (2007).
Similar to ETDC.
Distinguishes between continuers and stoppers.
A codeword is a sequence of continuers closed by one stopper.
[Figure: byte values 0, 1, ..., 255 divided into ranges of continuers and stoppers]
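A codeword built from continuers and one closing stopper can be sketched as below. The sketch assumes the usual convention that the s stoppers are the byte values 0 to s - 1 and the c continuers are the values s to s + c - 1, with s + c = 256. The partition shown in the slide's figure may differ, and the function names are hypothetical.

```python
def scdc_encode(rank, s, c):
    """Map a rank (0 = most frequent) to continuers followed by one stopper."""
    out = [rank % s]              # the closing stopper, a value in 0..s-1
    x = rank // s
    while x > 0:
        x -= 1
        out.append(s + x % c)     # a continuer, a value in s..s+c-1
        x //= c
    return bytes(reversed(out))


def scdc_decode(code, s, c):
    """Inverse of scdc_encode for a single complete codeword."""
    x = 0
    for b in code[:-1]:
        x = x * c + (b - s) + 1
    return x * s + code[-1]


# With s = 200 and c = 56: 200 one-byte and 200 * 56 two-byte codewords.
assert len(scdc_encode(199, 200, 56)) == 1
assert len(scdc_encode(200, 200, 56)) == 2
```

Choosing s = c = 128 gives the same codeword lengths as ETDC; other values of s trade the number of one-byte codewords against the number of longer ones.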
Open Dense Code
Basic Ideas
Open Dense Code (ODC)
Generalized concept of dense coding.
Covers ETDC and SCDC.
Provides a framework for defining many other dense code schemas.
Definition
The b-ary Open Dense Code (ODC) is a pair $\langle b, G \rangle$, where $b$ is the block size and $G = (N, T, P, S)$ is a grammar defining the syntax of the code. ODC assigns to the $r$-th most frequent symbol (starting with $r = 0$) a codeword $c_r$ of $k$ blocks that satisfies the following conditions:
1. $c_r \in L(G)$,
2. $c_r$ is not a prefix of any other codeword $c_i \in L(G)$,
3. $\sum_{i=1}^{k-1} \prod_{j=1}^{i} v_j \le r < \sum_{i=1}^{k} \prod_{j=1}^{i} v_j$, where $v_j$ is the number of codewords covered by block $j$.
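The third condition fixes how many blocks the codeword of rank r has. The sketch below evaluates the two sums directly. The function name `odc_block_count` is hypothetical, and the example values v_j = 128 are an assumption, chosen to mimic a byte-oriented end-tagged code in which each block contributes 128 values.

```python
def odc_block_count(r, v):
    """Return k such that sum_{i<k} prod_{j<=i} v_j <= r < sum_{i<=k} prod_{j<=i} v_j.

    v[j - 1] is the number of codewords covered by block j.
    """
    lower = 0      # ranks already covered by codewords shorter than k blocks
    product = 1
    for k, v_k in enumerate(v, start=1):
        product *= v_k
        upper = lower + product
        if lower <= r < upper:
            return k
        lower = upper
    raise ValueError("rank is not covered by the given blocks")


# Assumed example: three blocks with 128 values each.
v = [128, 128, 128]
assert odc_block_count(127, v) == 1      # 0 <= r < 128
assert odc_block_count(128, v) == 2      # 128 <= r < 128 + 128**2
assert odc_block_count(16512, v) == 3
```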
Dynamic ODC (dodc3)
Schema strongly adjusted to natural language compression.
b = 8; G = (N, T, P, S):
N = {Codeword}
T = {a, b, c, d, e}
S = Codeword
P:
  Codeword → a
  Codeword → b c
  Codeword → d
  Codeword → e
  a ∈ {1, ..., 159}
  b ∈ {160, ..., 223}
  c ∈ {0, ..., 255}
  d ∈ {224, ..., 255}
  e = 0
a is a 1B word, b is the 1st byte of a 2B word, c is the 2nd byte of a 2B word, d is a 1B non-word, e is the 1B ESC symbol.

Byte                    Meaning                    # codewords
⟨00000000⟩              ESC                        1
⟨00000001⟩ − ⟨10011111⟩  1B alphanumeric word       159
⟨10100000⟩ − ⟨11011111⟩  2B alphanumeric word       16384
⟨11100000⟩ − ⟨11111111⟩  1B non-alphanumeric word   32
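The byte ranges of the dodc3 schema mean that the first byte alone determines the kind and length of a codeword. The following sketch shows that classification and recomputes the codeword counts. The function name is hypothetical, and the data that follows an ESC symbol in a real stream is not modelled here because the slides do not specify it.

```python
def dodc3_codeword_kind(first_byte):
    """Classify a dodc3 codeword by its first byte; returns (kind, length in bytes)."""
    if first_byte == 0:
        return "ESC", 1
    if first_byte <= 159:
        return "1B word", 1
    if first_byte <= 223:
        return "2B word", 2       # the second byte may take any value 0..255
    return "1B non-word", 1


# Count the codewords of each kind over all possible first bytes.
counts = {}
for first in range(256):
    kind, length = dodc3_codeword_kind(first)
    # A 2-byte codeword has 256 possible second bytes per first byte.
    counts[kind] = counts.get(kind, 0) + (256 if length == 2 else 1)

assert counts == {"ESC": 1, "1B word": 159, "2B word": 16384, "1B non-word": 32}
```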
Open Dense Code
Experiments
Tested Corpora
Textual files in natural languages.
Canterbury and Large Canterbury Corpus.
Calgary Corpus.
Gutenberg Corpus.

File          Notation  Language  Source            Size [B]
bible.txt     canL      English   Large Canterbury  4,047,392
alice29.txt   can1      English   Canterbury        152,089
plrabn12.txt  can2      English   Canterbury        481,861
book1         cal1      English   Calgary           768,711
book2         cal2      English   Calgary           610,856
paper1        cal3      English   Calgary           53,161
paper2        cal4      English   Calgary           82,199
wrnpc11.txt   gut1      English   Gutenberg         3,217,389
17073-8.txt   gut2      Spanish   Gutenberg         1,805,493
2donq10.txt   gut3      Spanish   Gutenberg         2,106,147
8va5810.txt   gut4      Spanish   Gutenberg         979,749
Compression ratio
Substantially better than character-based methods.
Comparable to word-based dictionary methods.
Better than ETDC (SCDC) on large files.
[Figure: Compression Ratio. Compression ratio [%] (axis 0 to 100) per compressed file (canL, can2, cal1, cal4, gut1, gut2, gut3, gut4) for wac, wlzwe2, wblzw, dodc3, detdc, dscdc, huff, cac3 and lz77]
Compression speed
The best of the word-based methods.
Comparable to Huffman encoding.

Alg./File   canL   can1   can2   cal1   cal2   cal3   cal4   gut1   gut2   gut3   gut4
wac         2.40   2.14   0.60   0.64   1.26   2.13   1.79   1.04   0.31   0.39   0.88
wlzwe2      0.91   0.94   0.75   0.70   0.78   0.80   0.79   0.89   0.42   0.42   0.45
wblzw       1.14   1.13   0.96   0.92   0.98   1.07   1.06   0.85   0.51   0.50   0.54
dodc3      24.12  16.86  17.97  18.33  21.58  11.52  14.25  22.23  16.95  18.74  19.92
detdc      22.71   2.68   9.35  12.02   7.87   1.01   1.54  20.05  13.27  13.48  10.27
dscdc      21.44   3.46  10.70  13.09   9.25   1.41   2.06  18.82  12.30  13.39  11.13
huff       32.99  30.22  35.35  33.32  28.84  28.17  29.03  33.72  28.89  29.45  28.49
cac3        8.04   7.65   7.88   7.72   7.42   7.04   7.47   7.77   7.83   8.03   7.79
lz77        0.03   0.03   0.02   0.02   0.03   0.08   0.05   0.02   0.02   0.02   0.03

Table: Compression speed in MB/s, general comparison
Summary
Conclusion
Defined a generalized concept of dense coding called Open Dense Code (ODC).
Designed the dodc3 schema focused on natural language compression:
  compression and decompression speed is comparable to character-based algorithms,
  substantially better compression ratio.
The generally better compression ratio of word-based methods was confirmed by the experiments.
Future work
Tests on large and multilingual files.
Apply the ODC concept to formal languages etc.
End of the presentation
Thank you for your attention.