On Implementation of Word-based Compression Methods

Jakub Jaroš

Petr Procházka

Jan Holub

Department of Computer Science and Engineering, FEE CTU Prague, Karlovo náměstí 13, 121 35 Praha 2, Czech Republic

Memics 2008 + PSC


Table of Contents

1 Introduction
    Basic Notions
    Related Work
2 Open Dense Code
    Basic Ideas
    Experiments
3 Summary

PSC J. Jaroš, P. Procházka, J. Holub (FEE CTU)

Word-based Data Compression

Memics’08+PSC


Introduction

Basic Notions

The Data Compression

Data compression is a process that reduces the time and/or space required by data.
Lossless data compression: a fully reversible process.
Dictionary methods build up and use a dictionary: LZ77, LZ78, LZW, WBLZW, WLZW.
Statistical methods use statistical information (probabilities of occurrence): Huffman code, arithmetic code, WAC, dense codes.
Character-based vs. word-based approach.




The Word-based approach

Efficient for textual data (natural languages, formal languages, ...).
Uses words instead of characters as the symbols of the alphabet.
The input is treated as a strictly alternating sequence of words and non-words.
Faster adaptation to the encoded data.
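The alternation of words and non-words can be illustrated with a minimal sketch. It assumes that a word is a maximal run of ASCII letters and digits and that a non-word is a maximal run of any other characters; the slides do not give the exact definition, and the names `split_words_nonwords` and `is_word` are hypothetical.

```python
import re

# Assumption: a "word" is a maximal run of ASCII alphanumerics,
# a "non-word" is a maximal run of everything else.
TOKEN = re.compile(r"[A-Za-z0-9]+|[^A-Za-z0-9]+")
WORD = re.compile(r"[A-Za-z0-9]+")


def split_words_nonwords(text: str) -> list[str]:
    """Split text into maximal word and non-word tokens, in order."""
    return TOKEN.findall(text)


def is_word(token: str) -> bool:
    """True if the token consists only of ASCII letters and digits."""
    return WORD.fullmatch(token) is not None


if __name__ == "__main__":
    tokens = split_words_nonwords("Hello, world! 42")
    print(tokens)
    # Because every token is a maximal run, neighbours always differ in kind.
    print([is_word(t) for t in tokens])
```

Since the tokens are maximal runs, joining them restores the input exactly, which is what a lossless word-based method needs.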




Introduction

Related Work

End-Tagged Dense Code (ETDC)
Word-based compression method proposed by Brisaboa et al. (2003).
Byte-oriented, which improves the speed of compression and decompression.
Uses the probability of occurrence to define the rank of each word.
A shorter codeword is assigned to a word with a higher rank (a more frequent word).
Structure of codewords: a sequence of blocks; the most significant bit of each block marks whether it is the last block of the codeword.
[Figure: a codeword drawn as a sequence of blocks whose most significant bits are 0, 0, ..., 1; only the last block has the bit set.]
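The block structure described above can be sketched as follows. This is an illustration rather than the authors' implementation: it assumes 8-bit blocks, ranks starting at 0, and the usual dense assignment in which the 128 one-byte codewords are used first, then all two-byte codewords, and so on. The function names are hypothetical.

```python
def etdc_encode(rank: int) -> bytes:
    """Encode a 0-based rank; the last byte has its top bit set."""
    if rank < 0:
        raise ValueError("rank must be non-negative")
    out = [0x80 | (rank % 128)]      # final block, flag bit = 1
    x = rank // 128
    while x > 0:
        x -= 1                       # dense: no codeword value is wasted
        out.append(x % 128)          # earlier blocks, flag bit = 0
        x //= 128
    return bytes(reversed(out))


def etdc_decode(data: bytes) -> list[int]:
    """Decode a concatenation of codewords back into ranks."""
    ranks, x = [], 0
    for byte in data:
        if byte & 0x80:              # flag bit set: codeword ends here
            ranks.append(x * 128 + (byte & 0x7F))
            x = 0
        else:
            x = x * 128 + byte + 1
    return ranks


if __name__ == "__main__":
    for r in (0, 127, 128, 16511, 16512):
        print(r, etdc_encode(r).hex())
```

Because the flag bit alone marks the end of a codeword, a decoder can find codeword boundaries without knowing the dictionary.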


(s, c)-Dense Code (SCDC)

Word-based compression method proposed by Brisaboa et al. (2007).
Similar to ETDC.
Distinguishes between continuers and stoppers: the byte values are divided into s stoppers and c continuers.
A codeword is a sequence of continuers terminated by one stopper.
[Figure: the byte values 0, 1, ..., 254, 255 divided into ranges of continuers and stoppers.]
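A codeword made of continuers closed by a stopper can be sketched in the same style. The sketch assumes 8-bit blocks with c = 256 - s, and a layout in which byte values 0 to s-1 are stoppers and s to 255 are continuers; this layout is a common convention and an assumption here, not something the slide fixes. The function names are hypothetical.

```python
def scdc_encode(rank: int, s: int) -> bytes:
    """Encode a 0-based rank with s stoppers and c = 256 - s continuers."""
    if not 0 < s < 256:
        raise ValueError("s must be in 1..255")
    if rank < 0:
        raise ValueError("rank must be non-negative")
    c = 256 - s
    out = [rank % s]                 # final byte: a stopper (value < s)
    x = rank // s
    while x > 0:
        x -= 1
        out.append(s + x % c)        # preceding bytes: continuers (value >= s)
        x //= c
    return bytes(reversed(out))


def scdc_decode(data: bytes, s: int) -> list[int]:
    """Decode a concatenation of (s, c)-dense codewords into ranks."""
    c = 256 - s
    ranks, x = [], 0
    for byte in data:
        if byte < s:                 # stopper: codeword complete
            ranks.append(x * s + byte)
            x = 0
        else:                        # continuer: keep accumulating
            x = x * c + (byte - s) + 1
    return ranks


if __name__ == "__main__":
    for r in (199, 200, 11399, 11400):
        print(r, list(scdc_encode(r, s=200)))
```

With s = 200 there are 200 one-byte codewords and 56 * 200 two-byte codewords, so choosing s trades the number of short codewords against the reach of longer ones.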




Open Dense Code

Basic Ideas

Open Dense Code (ODC)
Generalized concept of dense coding.
Covers ETDC and SCDC.
Provides a framework for defining many other dense code schemas.

Definition
The b-ary Open Dense Code (ODC) is a pair $\langle b, G \rangle$, where b is the block size and $G = (N, T, P, S)$ is a grammar defining the syntax of the code. ODC assigns to the r-th most frequent symbol (starting with r = 0) a codeword $c_r$ of k blocks which satisfies the following conditions:
1 $c_r \in L(G)$,
2 $c_r$ is not a prefix of any other codeword $c_i \in L(G)$,
3 $\sum_{i=1}^{k-1} \prod_{j=1}^{i} v_j \le r < \sum_{i=1}^{k} \prod_{j=1}^{i} v_j$, where $v_j$ is the number of codewords covered by block j.
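The third condition determines the codeword length k from the rank r. A minimal sketch of that computation follows; the list `v` holds the values v_1, v_2, ... and the function name `codeword_length` is hypothetical. The example values v_j = 128 correspond to one possible reading of an ETDC-like code and are an assumption.

```python
def codeword_length(r: int, v: list[int]) -> int:
    """Return the smallest k with r < sum_{i=1..k} prod_{j=1..i} v[j-1].

    For that k the lower bound of condition 3 holds automatically,
    because r was not below the previous partial sum.
    """
    if r < 0:
        raise ValueError("rank must be non-negative")
    upper = 0      # running value of the sum of products
    product = 1    # running value of v_1 * ... * v_i
    for k, vj in enumerate(v, start=1):
        product *= vj
        upper += product
        if r < upper:
            return k
    raise ValueError("rank exceeds the capacity described by v")


if __name__ == "__main__":
    v = [128, 128, 128]   # assumed ETDC-like values
    for r in (0, 127, 128, 16511, 16512):
        print(r, codeword_length(r, v))
```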




Dynamic ODC (dodc3)
Schema tuned specifically for natural language compression.
b = 8; G = (N, T, P, S):
N = {Codeword}
T = {a, b, c, d, e}
S = Codeword
P:
    Codeword → a
    Codeword → b c
    Codeword → d
    Codeword → e
    a ∈ {1, ..., 159}
    b ∈ {160, ..., 223}
    c ∈ {0, ..., 255}
    d ∈ {224, ..., 255}
    e = 0
a is a 1B word, b is the 1st byte of a 2B word, c is the 2nd byte of a 2B word, d is a 1B non-word, e is the 1B ESC symbol.

Byte                        Meaning                     # codewords
⟨00000000⟩                  ESC                         1
⟨00000001⟩ to ⟨10011111⟩    1B alphanumeric word        159
⟨10100000⟩ to ⟨11011111⟩    2B alphanumeric word        16384
⟨11100000⟩ to ⟨11111111⟩    1B non-alphanumeric word    32
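The grammar can be read as a rule for splitting a byte stream into codewords by looking only at the first byte of each codeword. The following sketch does just that, using the byte ranges given for a, b, c, d and e. It is not the authors' decoder: the slides do not describe what follows an ESC symbol or how ranks map to byte values, so the sketch treats ESC as a standalone codeword. The name `parse_dodc3` and the category labels are hypothetical.

```python
def parse_dodc3(data: bytes) -> list[tuple[str, bytes]]:
    """Split a byte stream into (kind, codeword) pairs per the dodc3 grammar."""
    out = []
    i = 0
    while i < len(data):
        first = data[i]
        if first == 0:                    # e: ESC symbol
            out.append(("ESC", data[i:i + 1]))
            i += 1
        elif first <= 159:                # a: 1B word
            out.append(("word1", data[i:i + 1]))
            i += 1
        elif first <= 223:                # b c: 2B word, c may be any byte
            if i + 1 >= len(data):
                raise ValueError("truncated 2B codeword")
            out.append(("word2", data[i:i + 2]))
            i += 2
        else:                             # d: 1B non-word
            out.append(("nonword", data[i:i + 1]))
            i += 1
    return out


if __name__ == "__main__":
    # Capacity per category, matching the "# codewords" column.
    print(1, 159, (223 - 160 + 1) * 256, 255 - 224 + 1)
    print(parse_dodc3(bytes([1, 224, 160, 0, 255])))
```

Since the first byte alone decides the codeword length, the code is prefix-free, and most codewords of frequent words and separators occupy a single byte.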


Open Dense Code

Experiments

Tested Corpora
Textual files in natural languages.
Canterbury and Large Canterbury Corpus.
Calgary Corpus.
Gutenberg Corpus.

File          Notation  Language  Source            Size [B]
bible.txt     canL      English   Large Canterbury  4,047,392
alice29.txt   can1      English   Canterbury          152,089
plrabn12.txt  can2      English   Canterbury          481,861
book1         cal1      English   Calgary             768,711
book2         cal2      English   Calgary             610,856
paper1        cal3      English   Calgary              53,161
paper2        cal4      English   Calgary              82,199
wrnpc11.txt   gut1      English   Gutenberg         3,217,389
17073-8.txt   gut2      Spanish   Gutenberg         1,805,493
2donq10.txt   gut3      Spanish   Gutenberg         2,106,147
8va5810.txt   gut4      Spanish   Gutenberg           979,749




Compression ratio
Substantially better than character-based methods.
Comparable to word-based dictionary methods.
Better than ETDC and SCDC on large files.
[Figure: Compression Ratio. Compression ratio [%] (axis 0 to 100) of wac, wlzwe2, wblzw, dodc3, detdc, dscdc, huff, cac3 and lz77 on the compressed files canL, can2, cal1, cal4, gut1, gut2, gut3, gut4.]


Compression speed

The best of the word-based methods.
Comparable to Huffman encoding.

Alg./File   canL   can1   can2   cal1   cal2   cal3   cal4   gut1   gut2   gut3   gut4
wac         2.40   2.14   0.60   0.64   1.26   2.13   1.79   1.04   0.31   0.39   0.88
wlzwe2      0.91   0.94   0.75   0.70   0.78   0.80   0.79   0.89   0.42   0.42   0.45
wblzw       1.14   1.13   0.96   0.92   0.98   1.07   1.06   0.85   0.51   0.50   0.54
dodc3      24.12  16.86  17.97  18.33  21.58  11.52  14.25  22.23  16.95  18.74  19.92
detdc      22.71   2.68   9.35  12.02   7.87   1.01   1.54  20.05  13.27  13.48  10.27
dscdc      21.44   3.46  10.70  13.09   9.25   1.41   2.06  18.82  12.30  13.39  11.13
huff       32.99  30.22  35.35  33.32  28.84  28.17  29.03  33.72  28.89  29.45  28.49
cac3        8.04   7.65   7.88   7.72   7.42   7.04   7.47   7.77   7.83   8.03   7.79
lz77        0.03   0.03   0.02   0.02   0.03   0.08   0.05   0.02   0.02   0.02   0.03

Table: Compression speed in MB/s, general comparison



Summary

Conclusion
Defined a generalized concept of dense coding called Open Dense Code (ODC).
Designed the dodc3 schema focused on natural language compression: compression and decompression speed comparable to character-based algorithms, with a substantially better compression ratio.
The generally better compression ratio of word-based methods was confirmed by the experiments.

Future work
Tests on large and multilingual files.
Apply the ODC concept to formal languages etc.




End of the presentation

Thank you for your attention.


