Steganographic Method for Data Hiding in Microsoft Word Documents ...

Republic of Iraq Ministry of Higher Education and Scientific Research University of Technology College of Science Department of Computer Science

Steganographic Method for Data Hiding in Microsoft Word Documents Structure by a Change Tracking Technique A Thesis Submitted To the Department of Computer Science of the University of Technology in Partial Fulfillment of the Requirements for the Degree of Master of Science in Computer Science

By Amani Yousif Al-Baghdady

Supervision By Prof. Dr. Abdul Monem S. Rahma May 26, 2009

Jamada El Thaniah 2, 1430

In the name of God, the Most Gracious, the Most Merciful

"God is the Light of the heavens and the earth. The example of His light is like a niche within which is a lamp; the lamp is within glass, the glass as if it were a brilliant star, lit from the oil of a blessed olive tree, neither of the east nor of the west, whose oil would almost glow even if untouched by fire. Light upon light. God guides to His light whom He wills. God presents examples for the people, and God is Knowing of all things."

God, the Most High, the Almighty, has spoken the truth.
Surat An-Nur, verse (35)

Acknowledgment

First of all, my great thanks go to God, who helped me and gave me the ability to perform this work. My deepest gratitude and appreciation go to my supervisor, Prof. Dr. Abdul Monem S. Rahma, for his helpful comments, his bright ideas, the technical information he provided and his generosity with his knowledge; he taught me to go beyond what seemed impossible in order to reach my aim. The guidance, advice, suggestions, kind-heartedness and encouragement, as well as the fruitful assistance, of my co-supervisor Dr. Hala B. Abdul Wahab were of great help in finishing this thesis. I would like to express my gratitude to Dr. Hilal H. Saleh, Head of the Computer Science Department at the University of Technology, for his encouragement. I would like to say "thank you" to Dr. Emad K. Jabar for his parental guidance combined with sweet objective hardness. Special thanks and appreciation go to Mr. Faiq S. Baji for his advice and support during the period of my study. Furthermore, this work would not have been achieved without the support and friendship of Esrra J. Baker and Huda Abdul Ridah AL-Safar. I would like to thank all the staff members of the Computer Science Department, especially Ms. Suham Abd in the department library. Finally, I would like to thank my family for giving me so much time to improve myself and for helping me to think only of the best.

Dedication

In the name of God, Most Gracious, Most Merciful

To the City of Science and its Teacher: Prophet "Mohamed"
To my injured country: Iraq
To the guardian angel, the pure affection, the school of our age and stream of kindness, who provides me with love, strength and courage, the person to whom I am still indebted, the dearest person: my Mother
To the great man who teaches me patience and inspires me to seek the truth and all the wonderful things I know: my Father
To those who taught me to depend on myself and to be like them, the guidance without which my steps are aimless in the darkness, the bright candles: my Brothers (Dr. Mahmood, Dr. Ali, LT. Pilot Anwar & Stu. Ibraheem)
To the one who ignites my enthusiasm whenever its torch fades: my Uncle (Assis. Prof. Sulaiman M. Abbas, Head of Electrical Eng. Dep.)
To the true companions, who proved the deep meaning of friendship and enriched me with courage and love: my Friends (Huda, Dalya, Afrah, Azhar, Nuha, Issra, Zainab, Rabab, Sara, Roa'a)
To the soul of my Aunt Suham
To everyone who helped me even with a word
I hope that I will be well thought of.

The researcher

Miss Hacker

Linguistic Certification This is to certify that this thesis entitled "Steganographic Method for Data Hiding in Microsoft Word Documents structure by a Change Tracking Technique" by "Amani Y. Noori " was prepared under my linguistic supervision. Its language was amended to meet the style of the English language.

Linguistic Supervisor
Signature:
Name: K. M. Ahmed Al-Najjar
Date:   /   / 2009

Supervisor Certificate

I certify that this thesis was prepared under my supervision at Department of Computer Science in University of Technology in a partial fulfillment of the requirements for the Master's Degree in Computer Science.

Signature:
Name: Prof. Dr. Abdul Monem S. Rahma
Date:   /   / 2009

Examining Committee Certificate This is to certify that we have read this thesis entitled, "Steganographic Method for Data Hiding in Microsoft Word Documents Structure by a Change Tracking Technique", and as an examining committee, examined the student "Amani Yousif Noori", in its contents and in what is related with it, and that in our opinion, it meets the standard of a thesis for the Degree of Master in Computer Science at the Computer Science Department, University of Technology with excellent grade.

Signature:
Name: Dr. Saad K. Majeed (Chairman)
Date:   /   / 2009

Signature:
Name: Dr. Murtadha M. Hamad (Member)
Date:   /   / 2009

Signature:
Name: Dr. Rehab F. Hassan (Member)
Date:   /   / 2009

Signature:
Name: Dr. Abdul Monem S. Rahma (Supervisor)
Date:   /   / 2009

Approved by, the Computer Science Department, University of Technology

Signature:
Name: Dr. Helal H. Saleh
Date:   /   / 2009
Head of Computer Science Department

Abstract

Security is a need of every person and every society, and world security is not a responsibility or privilege accorded only to guards or security agents. Information hiding has become a focus of information security research because Web sites and network communication depend heavily on multimedia such as audio, video and images. Information hiding technology can embed secret information into a digital media source without impairing the perceptual quality of that source, so that other people cannot perceive the secret information.

In this thesis a method is proposed for data hiding that takes advantage of the physical characteristics of the computer system and of how it stores a document file, treating the file as a compound file. The unused block in the Microsoft Compound Document File Format (MCDFF) is used to conceal data. The facilities provided by the Microsoft Word processor, such as its Tools, are also used to generate a cover for hiding.

The proposed system embeds steganographic text in the structure (binary file format) of a digital and printed text document, namely a Microsoft Word document file (.doc), using two hiding processes: the Cover Generation Process and the Embedding Process. In the Cover Generation Process, the cover is a document in the Microsoft Word 2003 file format (.doc) that appears to be the product of a collaborative writing effort between authors. The Embedding Process hides a text string in an unused block of the binary file format of that cover document.

This thesis introduces a system for hiding in Microsoft Word, which is a component of the Microsoft Office System. Among the Microsoft Office applications, Microsoft Word was found to be less vulnerable than the others, according to the most recent published research. The system is implemented in Visual C# .NET 2003 under Windows XP Service Pack 2, on a P4 laptop computer with 1 GB of RAM and a 2.00 GHz Mobile Intel processor.

List of Abbreviations

Acronym    Full Name
ASCII      American Standard Code for Information Interchange
API        Application Programming Interface
APIs       Office Application Programming Interface
BAT        Block Allocation Table
BPCS       Bit Plane Complexity Segmentation
CBF        Chunk Based Format
CFG        Context Free Grammar
CLR        Common Language Runtime
COM        Component Object Model
DBF        Directory Based Format
DCT        Discrete Cosine Transform
DirID      Directory Identifier
DLL        Dynamic-Link Library
FIB        File Information Block
GIF        Graphics Interchange Format
GUI        Graphical User Interface
HAS        Human Auditory System
HTML       Hyper Text Markup Language
IEEE       Institute of Electrical and Electronics Engineers
IH         Information Hiding
JPEG       Joint Photographic Experts Group
LSB        Least Significant Bit
Mac        Macintosh
MCDFF      Microsoft Compound Document File Format
MSAT       Master Sector Allocation Table
MSDN       Microsoft Developer Network
MSDOS      Microsoft Disk Operating System
OLE        Object Linking and Embedding
PIA        Primary Interop Assembly
PInvoke    Platform Invoke
POIFS      Poor Obfuscation Implementation File System
RMD        Raw Memory Dumps
RTF        Rich Text Format
SAT        Sector Allocation Table
SBAT       Small Block Allocation Table
SecID      Sector Identifier
TCP/IP     Transmission Control Protocol / Internet Protocol
UTF        Unicode Transformation Format
VBA        Visual Basic for Applications
Win        Windows
WYSIWYG    What You See Is What You Get
XML        Extensible Markup Language

List of Figures

Figure No.  Description
1.1   Information Hiding Hierarchy
1.2   Generic digital watermarking scheme
1.3   Watermarking example
1.4   A data hiding example
2.1   Steganography basic model
2.2   Steganography Types
2.3   Text Hiding methods
2.4   Color quantization
2.5   Halftone quantization
2.6   Huffman Tree for example
2.7   Huffman tree for the 26-letter Alphabet
3.1   Word Versions for Different Operating System
3.2   External Structure of a Word Document
3.3   Track Change Example
3.4   Comments Example
3.5   File Structure Types
3.6   Logic view of file
3.7   Storage and Streams structure
3.8   Sample Word document storage format
3.9   The structure of Hard Disk
3.10  MS Compound files structure
3.11  Word Object Model
3.12  Platform Invokes call to an unmanaged Dll Function
4.1   Block Diagram for Proposed System
4.2   Screenshot of Microsoft Word in case of collaborative document authoring
4.3   Author A sends a stegodocument S to a recipient B
4.4   Hiding Algorithm Flowchart
4.5   Search Unused Block Algorithm Flowchart
4.6   Extracting Algorithm Flowchart
5.1   Word Reference
5.2   Block diagram for Unused Block path in Document file
5.3   The main menu for the proposed system
5.4   Cover Document before Track Change
5.5   Cover Document after Track Change
5.6   The Embedding Process Window
5.7   Document after Hiding
5.8   Extracting Process Window

List of Tables

Table No.  Description
2.1   Steganography Attacks
2.2   Probabilities of occurrence in English language
3.1   MCDFF Metadata
3.2   Compound document header structure
3.3   Header (block 1): 512 (0x200) bytes
3.4   Directory entry structure
3.5   Property: 128 (0x80) byte block
3.6   Block Allocation Table
3.7   Office 2003 applications and component type libraries
5.1   Comparisons between the proposed system and other text hiding methods

Glossary

Term: Description

Byte order: The order in which single bytes of a bigger data type are represented or stored.
Compound document: File format used to store several objects in a single file; objects can be organized hierarchically in storages and streams.
Compound document header: Structure in a compound document containing initial settings.
Control stream: Stream in a compound document containing internal control data.
Directory: List of directory entries for all storages and streams in a compound document.
Directory entry: Part of the directory containing relevant data for a storage or a stream.
Directory entry identifier (DirID): Zero-based index of a directory entry.
Directory stream: Sector chain containing the directory.
DirID: Zero-based index of a directory entry.
End Of Chain: Special sector identifier used to indicate the end of a SecID chain.
File offset: Physical position in a file.
Free SecID: Special sector identifier for unused sectors.
Header: Short for "compound document header".
Master sector allocation table (MSAT): SecID chain containing sector identifiers of all sectors used by the sector allocation table.
MSAT SecID: Special sector identifier used to indicate that a sector is part of the master sector allocation table.
Red-black tree: Tree structure used to organise direct members of a storage.
Root storage: Built-in storage that contains all other objects (storages and streams) in a compound document.
Root storage entry: Directory entry representing the root storage.
SecID: Zero-based index of a sector (short for "sector identifier").
SecID chain: An array of sector identifiers (SecIDs) specifying the sectors that are part of a sector chain, thus enumerating all sectors used by a stream.
Sector: Part of a compound document with fixed size that contains any kind of stream (user stream or control stream) data.

No.  Subject

Chapter One: General Introduction and Survey
1.1  Introduction
1.2  Information Hiding History
1.3  Information Hiding Hierarchy
1.4  The Difference between Cryptography, Steganography and Watermarking
1.5  Information Hiding Applications
1.6  Literature Survey
1.7  Aim of Thesis
1.8  Thesis Outlines

Chapter Two: Steganography
2.1  Introduction
2.2  Steganography Basic Model
2.3  Steganography Types
2.3.1  Pure Steganography
2.3.2  Secret Key Steganography
2.3.3  Public Key Steganography
2.4  Steganography Algorithms
2.4.1  Spatial Domain Based Steganography
2.4.2  Transform Domain Based Steganography
2.4.3  Document Based Steganography
2.4.4  File Structure Based Steganography
2.5  Steganography Under various Media
2.5.1  Hiding in Disk Space
2.5.2  Hiding in Network Packets
2.5.3  Hiding in Software and Circuitry
2.5.4  Hiding in Video
2.5.5  Hiding in Audio
2.5.6  Hiding in Image
2.5.7  Hiding in Text
2.6  Classification of Text Hiding Techniques
2.7  Steganalysis
2.8  Attacks are available to the Steganalyst
2.9  Introduction to the code
2.10  Why Encode the Data
2.11  Huffman Coding

Chapter Three: Microsoft Word Document File
3.1  Introduction
3.2  History of Word
3.3  Microsoft Word Document and its Components
3.4  Annotation and collaboration Tools
3.4.1  Track Changes
3.4.2  Comments
3.5  File Format
3.6  Identify the Type of a File
3.6.1  Filename Extension
3.6.2  Magic Number
3.7  File Structure
3.7.1  Raw Memory Dumps/Unstructured Formats (RMD)
3.7.2  Chunk Based Formats (CBF)
3.7.3  Directory Based Formats (DBF)
3.8  Structure Storage
3.9  Microsoft Compound Document File Format (MCDFF)
3.10  Structure of a Word Documents files
3.11  Format of the Main Stream
3.12  MCDFF metadata
3.12.1  Compound Document Header
3.12.2  Byte Order
3.12.3  Sector File Offset
3.12.4  Property Table (Directory)
3.12.5  Block Allocation Table (BAT)
3.12.6  Sector Allocation Table (SAT)
3.13  Office Automation
3.14  PIA for Microsoft Office 2003
3.15  Word Object Model
3.16  Platform Invoke (PInvoke)
3.17  Application Programming Interface (API)
3.18  Office Application Programming Interface (APIs)

Chapter Four: Proposed Hiding System in Document File
4.1  Introduction
4.2  Cover Generation Process
4.3  Embedding Process

Chapter Five: Experimental Results and Discussion
5.1  Introduction
5.2  System Implementation
5.2.1  Document before Hiding
5.2.2  Embedding Process
5.2.3  Document after Hiding
5.2.4  Extracting Process
5.3  Comparisons between proposed system and the most popular hiding methods

Chapter Six: Conclusions and Suggestions for Future Work
6.1  Conclusions
6.2  Suggestions for Future Work

Glossary
References
Appendix A
Appendix B
Appendix C

Chapter One
General Introduction and Survey

1.1 Introduction [XIU06]

With the development of the Internet and of information processing technologies, and with the rapid development of communication, images, audio, video and other multimedia information can be transmitted rapidly over a variety of communication networks, which brings greater convenience to compression, storage and reproduction applications. At the same time, it has become convenient to share information resources, and the network has become the main means of communication. All kinds of confidential information, including national security information, military information and personal information (such as credit card numbers), now need to be transmitted through the network. Because the Internet is an open environment, information security has become increasingly important.

Information security technology has two main branches: cryptography and information hiding. Cryptography is widely used in various industries; encryption technology has been researched for many years and many encryption algorithms exist. However, encryption makes it obvious that a document or other medium has been encrypted, so an attacker can use a variety of tools to attack the secret information. Although encryption techniques have developed rapidly, the attackers' tools have also been strengthened; this is the so-called "intruders always keep one step ahead". Because of the rapid growth in computing capability, some limitations have already appeared in the application of encryption technology, which has made people pay more attention to the other main branch, information hiding. The purpose of traditional encryption technology is to conceal the content, so that encrypted documents are difficult to read.

1.2 Information Hiding History

Hiding messages is nothing new; over the years, a multitude of methods have been used to hide information. One of the first documents describing steganography is the Histories of Herodotus. In ancient Greece, text was written on wax-covered tablets. To avoid capture of the message, the sender scraped the wax off the tablets and wrote the message on the underlying wood, then covered the tablets with wax again. The tablets appeared to be blank and unused, so they passed inspection by sentries without question [JOH99]. Historically, various steganographic techniques have been used, including:

I. Tattoo. A Roman general shaved the head of a slave and tattooed a message on his scalp. When the slave's hair grew back, the general dispatched the slave to deliver the hidden message to its intended recipient [DIC07].
II. Character marking. Selected letters of printed or typewritten text are overwritten in pencil. The marks are ordinarily not visible unless the paper is held at an angle to bright light [DOB97].
III. Invisible ink. From the 1st century through World War II, invisible inks were often used to conceal hidden messages. A number of substances (milk, vinegar, fruit juices and urine) can be used for writing. They leave no visible trace until heat or some chemical is applied to the paper.
IV. Pin punctures. Small pin punctures on selected letters are ordinarily not visible unless the paper is held up in front of a light [DOB97].
V. Microfilm. While Paris was under siege in 1870, messages were sent by carrier pigeon. A Parisian photographer used a microfilm technique to enable each pigeon to carry a higher volume of data [DIC07].
VI. Null ciphers (unencrypted messages). In this method a chosen letter of each word, such as the first or the second, spells out a message. Such messages are very hard to construct [KAH96]. The following message was actually sent by a German spy during the Second World War [RIM97]:

"Apparently neutral's protest is thoroughly discounted and ignored. Isman hard hit. Blockade issue affects pretext for embargo on by-products, ejecting suets and vegetable oils."

Decoding this message by taking the second letter in each word reveals the following secret message: "Pershing sails from NY June 1".
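The decoding rule for this null cipher can be checked mechanically. The following is a minimal Python sketch, not part of the original thesis: the function name `decode_null_cipher` and its `position` parameter are illustrative assumptions, and the message is used in its commonly quoted form, which includes the word "protest".

```python
def decode_null_cipher(text, position=1):
    """Collect the letter at `position` (0-based) of every whitespace-separated word.

    Words shorter than position + 1 characters are skipped. The helper name and
    signature are illustrative only; they do not come from the thesis.
    """
    letters = []
    for word in text.split():
        if len(word) > position:
            letters.append(word[position].lower())
    return "".join(letters)


# Commonly quoted form of the message (an assumption of this sketch).
MESSAGE = (
    "Apparently neutral's protest is thoroughly discounted and ignored. "
    "Isman hard hit. Blockade issue affects pretext for embargo on "
    "by-products, ejecting suets and vegetable oils."
)

hidden = decode_null_cipher(MESSAGE)  # second letter of each word
print(hidden)  # pershingsailsfromnyjunei
```

The output reads "Pershing sails from NY June I" once word breaks are restored, the final "I" standing for the numeral 1.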


1.3 Information Hiding Hierarchy

Information Hiding (IH) is a technique in the area of information security. It secretly embeds information into digital content such as images, audio, movies and documents so that the embedded information cannot be visually or audibly perceived; a data hiding example is shown in Figure (1.4) [YOS06]. The terminology agreed at the first international workshop on this subject is shown in Figure (1.1) [CAC98]:

• Covert channels: in the context of multilevel secure systems (e.g. military computer systems), communication paths that were neither designed nor intended to transfer information at all. These channels are typically used by untrustworthy programs to leak information to their owner while performing a service for another program [KAT00].
• Anonymity: finding ways to hide the metacontent of messages, that is, the sender and the recipients of a message [KAT00].

[Figure (1.1) Information hiding hierarchy: IH divides into covert channels, steganography, anonymity and copyright marking; copyright marking divides into robust copyright marking (fingerprinting and watermarking) and fragile watermarking.]

• Steganography: an important subdiscipline of information hiding; the art and science of communicating in a way which hides the existence of the communication [KAT00].
• Fingerprinting: a term that denotes special applications of watermarking, in which information such as the creator or recipient of digital data is embedded as a watermark [KAT00].
• Copyright marking: in contrast to steganography, copyright marking guarantees that embedded data can be reliably detected after the image has been modified (but not destroyed beyond recognition) [CAC98].
• Watermarking: the process of embedding information into digital multimedia content such that the information (called the watermark) can later be extracted or detected for a variety of purposes, including copy prevention and control; an example of watermarking is shown in Figure (1.3) [BAK05].

[Figure (1.2) Generic digital watermarking scheme [KAT00]: the watermark, the host data and a secret/public key (K) are the inputs of the marking algorithm, which outputs the watermarked data.]

There are several approaches to classifying watermarking systems. One is to categorize them according to their robustness against different types of attack:

• Fragile watermarks have only very limited robustness. The embedded watermark changes or disappears if the watermarked object is altered. This type of watermark can be used for authentication, to verify the originality of the watermarked object [BAK05].
• Robust watermarks are designed to survive "moderate to severe signal processing attacks", so that no signal transform of reasonable strength can remove the watermark. Robust watermarks are applicable in image copyright protection and fingerprinting [BAK05].

Figure (1.3) watermarking example [ROC08]

1.4 The Differences between Cryptography, Steganography and Watermarking

The cryptographer is primarily interested in obscuring the content of a message, not the fact that a message is being communicated. The steganographer, on the other hand, is concerned with hiding the very communication of the message, while the digital watermarker attempts to add sufficient metadata to a message to establish ownership, provenance, source, etc. Cryptography and steganography share the feature that the object of interest is embedded, hidden or obscured, whereas the object of interest in watermarking is the host or carrier, which is protected by the object that is embedded, hidden or obscured. Further, watermarking and steganography may be used with or without cryptography, and imperceptible watermarking shares functionality with steganography, whereas perceptible watermarking does not [BER06].

1.5 Information Hiding Applications [XIU06]

Information hiding technology has been applied in many areas, including e-commerce, electronic transaction protection, confidential communications, copyright protection, copy control, operation tracking, authentication and signatures. Recent research shows that the following applications of information hiding have stimulated research interest:

I. Military organizations and other intelligence agencies need secret communication. On the modern battlefield, where detection of a sensitive signal may quickly lead to an attack, the military often relies on transmission techniques such as spread-spectrum or atmospheric-scatter transmission to ensure accurate signal transmission.
II. Terrorists are also studying the use of information hiding technology. According to analysis by US anti-terrorist organizations, the terrorists behind the September 11 incident used steganography techniques, embedding instructions into multimedia (such as images) transmitted over the Internet. Without specialized analysis tools for hidden writing, pictures processed in this way are difficult to detect.
III. As electronic commerce springs up, information security becomes more important. In addition to encryption techniques, people are increasingly concerned with hidden-message authentication techniques.

The extensive applications of information hiding technology can be roughly categorized as follows:

• Secret communications: hiding the communication process and the communicators.
• Copyright protection: an authorized watermark is embedded in the multimedia work.
• Testing and certification: digital works can be certified and tested for tampering.
• Piracy tracking: used to trace the author or the buyers of particular copies.
• Information identification: some information is hidden in the carrier medium in order to describe elements of the medium.
• Reproduction control and access control: embedded digital watermarks express access control information.
• Information control: using information hiding technology to control certain information.
• Bill security: ensuring that the watermarks hidden in bills still exist after printing, which can guarantee the authenticity of the bills.

[Figure (1.4) A data hiding example [ROC08]: the message to be hidden, the cover image and the produced stego image.]

1.6 Literature Survey

The following is a review of related work:

I. Abdul Wahab, H. B., 2001 [ABD01], "Information Hiding in Written Text Using Context Free Grammar (CFG)". This work embedded English text, after constructing it according to a CFG, in another English text. The proposed system gives good results and can be applied in cases where sending an encrypted message would draw suspicion.

II. Al-Shamkhy, R. A., 2001 [ALS01], "Hiding Text in Text Using Dictionary Method". This thesis proposed a system that uses text as the medium in which to embed a secret text file, depending on a dictionary. The dictionary contains English words sorted in alphabetical order, from which the user selects words to build the cover message. The receiver does not need the dictionary, which decreases the amount of information needed on the receiver side and increases the security of the proposed system.

III. Al-Saady, B. Y., 2005, "Document Protection Using Digital Watermarking". In this thesis, four methods are suggested to embed a watermark in a document created by Microsoft Word. The two types of watermarking suggested are a visible watermark shown as a background and an invisible watermark that depends on the macro technique. Because a macro program can run with the document, it can be used to control the watermarking operation. Three of the suggested methods use the macro program as a tool to protect both the watermark and the document from unauthorized modification. These are powerful methods for protecting both the watermark and the document when applied to Microsoft Word documents.

IV. Al-Abaichi, A. M., 2005, "Analyzing and Detecting Information Hiding in Computer Printed Text". The proposed system analyzes and detects hidden information in printed text after converting it to a gray-scale image, and consists of two phases, analysis and detection. In the analysis phase, the boundary of the text image, the baselines on both sides, and the beginning and end of each line, word and character are fixed; the gaps between words and at the ends of lines are determined; and the numbers of lines, words, characters and inter-word gaps are calculated. The detection phase deals mainly with four methods used for hiding a secret message in formatted text: line shift (up, down), open space (inter-word space, end-of-line space and inter-sentence space), word shift (horizontal) and feature coding (shortening or lengthening the upward or downward strokes of a character).

V. Eckstein, K. and Jahnke, M., 2005, "Data Hiding in Journaling File Systems". This article structures and compares existing data hiding methods for UNIX file systems in terms of usability and countermeasures. It discusses variant techniques related to advanced file systems and proposes a new technique that stores substantial amounts of data inside journaling file systems in a robust fashion and with low detectability, which is demonstrated by a proof-of-concept implementation for the ext3 journaling file system.

VI. Liu, T. Y. and Tsai, W. H., 2007 [LIU07], "A New Steganography Method for Data Hiding in Microsoft Word Documents by a Change Tracking Technique". This research proposed a hiding method in which text segments in the document are degenerated, mimicking the work of an author with inferior writing skills, with the secret message embedded in the choices of degenerations. The degenerations are then revised with the changes being tracked, making it appear as if a cautious author is correcting the mistakes.
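The idea of embedding a message "in the choices of degenerations" can be illustrated with a toy Python sketch. It is not the algorithm of [LIU07]: the candidate table `CANDIDATES`, the one-bit-per-word encoding and both function names are hypothetical choices made for this illustration only.

```python
# Hypothetical candidate table: each correct word has two plausible degenerated
# forms, so choosing form 0 or form 1 encodes one secret bit.
CANDIDATES = {
    "their": ["there", "thier"],
    "receive": ["recieve", "recive"],
    "definitely": ["definately", "definitly"],
}


def embed_bits(words, bits):
    """Return (degenerated, corrected) pairs, one simulated tracked change per bit."""
    bits = list(bits)
    changes = []
    for word in words:
        if word in CANDIDATES and bits:
            changes.append((CANDIDATES[word][bits.pop(0)], word))
    if bits:
        raise ValueError("not enough embeddable words for the message")
    return changes


def extract_bits(changes):
    """Recover the bits from which degeneration was chosen for each corrected word."""
    return [CANDIDATES[correct].index(degenerated) for degenerated, correct in changes]


text = "they will definitely receive their copy".split()
tracked = embed_bits(text, [1, 0, 1])
print(tracked)
print(extract_bits(tracked))  # [1, 0, 1]
```

Each pair plays the role of one tracked change (the "mistake" and its "correction"); a reader who shares the candidate table recovers the bits, while to anyone else the document shows ordinary proofreading.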

1.7 Aim of Thesis

The aim of this thesis is to use information hiding technology to embed text in the structure (binary file format) of a digital and printed text document, namely a Microsoft Word 2003 document file, using a steganographic method. This is achieved as follows:

• The cover document, a Microsoft Word 2003 document, is made to appear to be the product of a collaborative writing effort between many authors, to avoid drawing suspicion that there is hidden data in the document.

1.8 Thesis Outlines

This thesis begins with an introduction to information hiding techniques and their hierarchy. The remaining chapters are organized as follows:

Chapter Two: "Steganography" presents a general description of steganography, text hiding methods and Huffman encoding.

Chapter Three: "Microsoft Word Document File Format" gives a complete description of Microsoft Word, both the software and its file format and structure.

Chapter Four: "Proposed Hiding System in Microsoft Compound Document File Format" presents the cover generation process, the MCDFF metadata and the hiding processes.

Chapter Five: "Experimental Results and Discussion" gives a complete description of the implementation of the proposed method and its results.

Chapter Six: "Conclusions and Suggestions for Future Work" presents the derived conclusions and some suggested ideas for future work.


Chapter Two Steganography

2.1 Introduction

The word Steganography comes from two roots in the Greek language: "Stegos", meaning hidden/covered or roof, and "Graphia", which simply means writing [KRE04].

The goal of steganography is to hide a message inside another, harmless message in a way that does not allow any enemy even to detect that a second, secret message is present (that is, to avoid raising suspicion) [KAT00]. Steganography uses the illusion of normality to mask the existence of covert activity. The illusion is manifested through a myriad of forms, including written documents, photographs, paintings, music, sounds, physical items, and even the human body. Two parts of the system are required to accomplish the objective: successful masking of the message, and keeping secret the key to its location and/or deciphering [DIC07].

2.2 Steganography Basic Model

[Figure (2.1): Steganography basic model. Blocks shown: Cover, Message to hide, Stego Key, Embedding Process, Stego Cover, Extracting Process, Hidden Message.]

Each data hiding method consists of:
I. Embedding process.
II. Extracting process.

The embedding process hides a secret message inside a cover (or carrier). The cover and the embedded message together form a stego-carrier. The extracting process extracts the secret message from the carrier. Hiding information may require a stego-key or password, which is additional secret information, so that only those who possess the key can access the hidden message:

Cover medium + Embedded message + Stego-key = Stego-medium.

2.3 Steganography Types

There are basically three types of steganographic protocols, as shown in Figure (2.2): pure steganography, secret key steganography, and public key steganography.

[Figure (2.2): Steganography types]

2.3.1 Pure Steganography [KAT00]

A steganographic system that does not require the prior exchange of some secret information (such as a stego-key) is called pure steganography. Both sender and receiver must have access to the embedding and extracting algorithms.

Definition (pure steganography): The quadruple σ = ⟨C, M, D, E⟩, where C is the set of possible covers, M the set of secret messages with |C| ≥ |M|, E: C × M → C the embedding function, and D: C → M the extracting function, with the property that D(E(c, m)) = m for all m ∈ M and c ∈ C, is called a pure steganographic system.

2.3.2 Secret Key Steganography

Secret key steganography is defined as a steganographic system that requires the exchange of a secret key (stego-key) prior to communication. It takes a cover message and embeds the secret message inside it by using the stego-key. Only the parties who know the secret key can reverse the process and read the secret message. Unlike pure steganography, where a perceived invisible communication channel is present, secret key steganography exchanges a stego-key, which makes it more susceptible to interception. The benefit of secret key steganography is that, even if the stego-object is intercepted, only parties who know the secret key can extract the secret message [DUN02].

Definition (secret key steganography): The quintuple σ = ⟨C, M, K, D_K, E_K⟩, where C is the set of possible covers, M the set of secret messages with |C| ≥ |M|, K the set of secret keys, E_K: C × M × K → C and D_K: C × K → M, with the property that D_K(E_K(c, m, k), k) = m for all m ∈ M, c ∈ C, and k ∈ K, is called a secret key steganographic system [KAT00].

2.3.3 Public Key Steganography

As in public key cryptography, public key steganography does not rely on the exchange of a secret key. A public key steganography system requires two keys, one private and one public; the public key is stored in a public database. The public key is used in the embedding process, whereas the private key is used to reconstruct the secret message [KAT00].
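The defining property of a secret key system, D_K(E_K(c, m, k), k) = m, can be checked on a toy example. The following Python sketch is illustrative only and is not taken from the cited sources: the cover is a list of integers, the message is a bit string, and the key seeds a pseudo-random choice of cover positions whose parity carries the bits. The function names `embed` and `extract`, the parity scheme, and the extra message-length argument of `extract` are all assumptions made to keep the example short.

```python
import random

def _positions(cover_len, n_bits, key):
    # Hypothetical scheme: the shared key seeds the choice of cover positions.
    rng = random.Random(key)
    return rng.sample(range(cover_len), n_bits)

def embed(cover, message_bits, key):
    """E_K: store each message bit in the parity of a keyed cover position."""
    if len(message_bits) > len(cover):
        raise ValueError("cover too small: the definition requires |C| >= |M|")
    stego = list(cover)
    positions = _positions(len(cover), len(message_bits), key)
    for pos, bit in zip(positions, message_bits):
        stego[pos] = (stego[pos] & ~1) | int(bit)
    return stego

def extract(stego, n_bits, key):
    """D_K: read the parities back in the same keyed order."""
    positions = _positions(len(stego), n_bits, key)
    return "".join(str(stego[pos] & 1) for pos in positions)

cover = [200, 13, 77, 54, 91, 120, 33, 8, 19, 250]
message = "10110"
stego = embed(cover, message, key="shared-secret")
recovered = extract(stego, len(message), key="shared-secret")
```

With the same key on both sides, `recovered` equals `message`, and no cover value changes by more than one. Dropping the key and always using the first positions would turn the sketch into a pure steganographic system in the sense of Section 2.3.1.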

2.4 Steganography Algorithms

Steganography algorithms are classified into five categories: (1) spatial domain based steganography; (2) transform domain based steganography; (3) document based steganography; (4) file structure based steganography; (5) other categories.

2.4.1 Spatial Domain Based Steganography

Spatial steganography mainly includes LSB (Least Significant Bit) steganography and the BPCS (Bit Plane Complexity Segmentation) algorithm. Spatial methods are the most frequently employed by steganography tools because of their fine concealment, large capacity for hidden information, and easy realization [MIN06].

• LSB Replacement & Matching

Least Significant Bit (LSB) steganography replaces the least significant bit in some bytes of the cover file with the bits of the hidden data. It includes two schemes: sequential embedding and scattered embedding. Taking images as an example, sequential embedding replaces the pixels' LSBs with the message bits one by one, in order. Scattered embedding spreads the message over the whole image, using a random sequence to control the embedding positions.

• BPCS Steganography

Analogous to bit replacement in LSB steganography, BPCS steganography hides secret data by block replacement: each bit plane of the image is segmented into pixel blocks of the same size. The capacity of BPCS can reach 50% of the cover image data. However, embedding at such a large capacity has a greater influence on the image [MIN06].

2.4.2 Transform Domain Based Steganography [KAT00]

LSB modification techniques are easy ways to embed information, but they are highly vulnerable to even small modifications of the cover. An attacker can simply apply signal processing techniques to destroy the secret information entirely. Transform domain methods hide messages in significant areas of the cover image, which makes them more robust than the LSB approach to attacks such as compression, cropping, and some image processing. While they are more robust to various kinds of signal processing, they remain imperceptible to the human sensory system. Many transform domain variations exist. One method is to use the discrete cosine transform (DCT).
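Returning to the sequential LSB replacement scheme of Section 2.4.1, a minimal Python sketch is given here. It is an illustration under stated assumptions, not the implementation of any particular tool: the cover is a plain byte string standing in for pixel values, the function names `lsb_embed` and `lsb_extract` are hypothetical, and the receiver is assumed to know the message length in bytes.

```python
def lsb_embed(cover: bytes, message: bytes) -> bytes:
    """Sequentially replace the LSB of each cover byte with one message bit."""
    # Message bits, most significant bit of each byte first.
    bits = [(byte >> shift) & 1 for byte in message for shift in range(7, -1, -1)]
    if len(bits) > len(cover):
        raise ValueError("message does not fit in the cover")
    stego = bytearray(cover)
    for i, bit in enumerate(bits):
        stego[i] = (stego[i] & 0xFE) | bit
    return bytes(stego)

def lsb_extract(stego: bytes, n_bytes: int) -> bytes:
    """Read n_bytes * 8 LSBs back and repack them into bytes."""
    out = bytearray()
    for i in range(n_bytes):
        value = 0
        for b in stego[i * 8:(i + 1) * 8]:
            value = (value << 1) | (b & 1)
        out.append(value)
    return bytes(out)

pixels = bytes(range(100, 148))          # 48 stand-in pixel values
stego_pixels = lsb_embed(pixels, b"Hi")  # 16 message bits use the first 16 bytes
```

Calling `lsb_extract(stego_pixels, 2)` recovers the two message bytes. Scattered embedding would differ only in the order in which cover positions are visited, for example an order drawn from a keyed pseudo-random sequence.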

2.4.3 Document Based Steganography

Tools of this kind embed data in document files by adding tabs or spaces to .txt or .doc files. One such steganographic tool is the software called Snow. Snow embeds data in .txt files by adding tabs and spaces at the end of text lines. Every 3 bits are encoded with 0 to 7 spaces, and the runs of spaces are separated by tabs. The number of secret bits should therefore be a multiple of 3; otherwise the message is padded with 0 bits.

2.4.4 File Structure Based Steganography

Structural embedding inserts secret data in the redundant bits of the cover file, such as the reserved bits in the file header or the marker segments in the file format. This makes the hidden data immune to visual/aural attack and statistical detection [MIN06].
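The whitespace encoding described for Snow in Section 2.4.3 can be sketched in a few lines of Python. This follows only the description given in the text (3 bits become a run of 0 to 7 spaces, runs are separated by tabs, and the message is padded with 0 bits). It is not the actual Snow program, which may differ in details not described here, and the function names `snow_like_encode` and `snow_like_decode` are hypothetical.

```python
def snow_like_encode(line: str, bits: str) -> str:
    """Append tab-separated runs of 0-7 spaces, one run per 3 message bits."""
    bits += "0" * (-len(bits) % 3)            # pad to a multiple of 3 with 0 bits
    groups = [bits[i:i + 3] for i in range(0, len(bits), 3)]
    tail = "".join("\t" + " " * int(group, 2) for group in groups)
    return line.rstrip() + tail

def snow_like_decode(line: str) -> str:
    """Recover the (padded) bit string from the trailing whitespace of a line."""
    tail = line[len(line.rstrip()):]          # trailing whitespace only
    runs = tail.split("\t")[1:]               # nothing before the first tab is data
    return "".join(format(len(run), "03b") for run in runs)

marked_line = snow_like_encode("Hello world", "10111")
decoded_bits = snow_like_decode(marked_line)
```

Here the five message bits are padded to "101110", so the line gains a tab, five spaces, a tab, and six spaces. The visible text is unchanged, but any program that trims trailing whitespace destroys the message.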

2.5 Steganography under Various Media

The onset of computer technology and the Internet has given new life to steganography and the creative methods with which it is employed. Computer-based steganographic techniques introduce changes to digital carriers to embed information foreign to the native carriers [JOH01]. Carriers of such messages may resemble innocent-sounding text, disks and storage devices, network traffic and protocols, the way software or circuits are arranged, audio, images, video, or any other digitally represented code or transmission [JOH01].

2.5.1 Hiding in Disk Space [MIK07]

Another way to hide information relies on finding unused space that is not readily apparent to an observer. Taking advantage of unused or reserved space to hold covert information provides a means of hiding information without perceptually degrading the carrier. The way operating systems store files typically results in unused space that appears to be allocated to files. Another method of hiding information in a file system is to create a hidden partition. Such partitions are not seen if the system is started normally; however, in many cases, running a disk configuration utility exposes the hidden partition. These concepts have been expanded in a novel proposal of a steganographic file system: if the user knows the file name and password, access to the file is granted; otherwise, no evidence of the hidden file exists in the system.

2.5.2 Hiding in Network Packets [JOH01]

Various network protocols have characteristics that can be used to hide information. TCP/IP packets are used to transport information, and an uncountable number of packets are transmitted daily over the Internet. Any of these packets can provide a covert communication channel. The packet headers have unused space or other values that can be manipulated to hide information. However, filters can be set to detect information in the "unused" or reserved spaces. One way to circumvent this detection is to take advantage of information in the headers that typically goes unchecked by most systems, such as the values of the sequence and identification numbers.

2.5.3 Hiding in Software and Circuitry

Data can also be hidden based on the physical arrangement of a carrier. The arrangement itself may be an embedded signature that is unique to the creator. An example is the layout of code distributed in a program or the layout of electronic circuits on a board. This type of "marking" can be used to uniquely identify the design origin and cannot be removed without significant change to the work [JOH01].

2.5.4 Hiding in Video

For video, a combination of sound and image techniques can be used, because video generally has separate inner files for the picture (consisting of many images) and the sound. Techniques can therefore be applied in both areas to hide data. Owing to the size of video files, the scope for adding large amounts of data is much greater, and the chance of the hidden data being detected is therefore quite low [CUM04].

2.5.5 Hiding in Audio

Data hiding in audio signals is especially challenging because the Human Auditory System (HAS) operates over a wide dynamic range. To put this in perspective, the HAS perceives over a range of power greater than one million to one and a range of frequencies greater than one thousand to one, making it extremely hard to add or remove data from the original data structure. The only weakness of the HAS lies in differentiating sounds (loud sounds drown out quiet sounds), and this is what must be exploited to encode secret messages in audio without being detected [DUN02].

2.5.6 Hiding in Image

Given the proliferation of digital images, especially on the Internet, and given the large number of redundant bits present in the digital representation of an image, images are the most popular cover objects for steganography [MOR00]. Using image files as hosts for steganographic messages takes advantage of the limited capabilities of the human visual system. Encoding extra data in an image file changes pixels in the image, but these changes remain imperceptible to the human eye [BER05].

2.5.7 Hiding in Text

Written text can be used as a medium to transmit secret messages. Only small amounts of data can be hidden in text, so this approach is known to have a low data rate. It is important to note that the embedding task in text requires the interaction of the user and therefore cannot be automated, whereas image and audio methods can embed the data directly and automatically according to their algorithms.

2.6 Classification of Text Hiding Techniques

Steganography methods can encode the information either directly in the text or in the text format, as shown in Figure (2.3).

I. Encoding Information Directly in the Text

Many ways have been proposed to hide information directly in text, such as the syntactic, semantic, P. Wayner, Chapman and Davida, translation-based, and HTML methods.

• Syntactic method

The structure of sentences is transformed without significantly altering their meaning. This method utilizes punctuation and diction [VIL06].

Example of using punctuation: the phrases "bread, butter, and milk" and "bread, butter and milk" are both considered correct usage of commas in a list. When the comma appears before the "and", the phrase represents a "1"; the second phrase represents a "0" [ALS01].

Example of using diction and the structure of the text: consider the sentence "Before the night is over, I will finish" and the sentence "I will finish before the night is over". This method is more transparent than the punctuation method. When the main clause comes at the beginning of the sentence, a "1" is encoded; when the adverbial comes at the beginning of the sentence, a "0" is encoded [ALS01]. The expected data rate is only several bits per kilobyte of text. The disadvantage of using punctuation is that it is noticeable even to a casual reader, and that changing the punctuation can affect the clarity and even the meaning of the text.

• Semantic method

Words are replaced by their synonyms and/or sentences are transformed via suppression or inclusion of noun phrase coreferences [VIL06].

Example of using the semantic method: the word "big" could be considered primary and the word "large" secondary. In decoding, primary words are read as ones and secondary words as zeros [ALS01].

However, syntactic and semantic methods are not suitable for all types of documents (e.g. contracts, identity documents, literary texts) and need, in general, human supervision [VIL06].
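The synonym rule of the semantic method (primary word read as "1", secondary word read as "0") can be sketched as follows. Only the pair "big"/"large" comes from the text; the second pair, the function names, and the restriction to whole lowercase words are assumptions made for the illustration.

```python
# Primary word -> secondary synonym. "big"/"large" is from the text,
# "quick"/"fast" is a hypothetical extra pair added for the example.
PRIMARY_TO_SECONDARY = {"big": "large", "quick": "fast"}
SECONDARY_TO_PRIMARY = {sec: pri for pri, sec in PRIMARY_TO_SECONDARY.items()}

def semantic_embed(words, bits):
    """Choose the primary synonym for a 1 bit and the secondary one for a 0 bit."""
    out, remaining = [], list(bits)
    for word in words:
        primary = word if word in PRIMARY_TO_SECONDARY else SECONDARY_TO_PRIMARY.get(word)
        if primary is not None and remaining:
            bit = remaining.pop(0)
            word = primary if bit == "1" else PRIMARY_TO_SECONDARY[primary]
        out.append(word)
    if remaining:
        raise ValueError("cover text has too few substitutable words")
    return out

def semantic_extract(words):
    """Read every primary word as '1' and every secondary word as '0'."""
    bits = []
    for word in words:
        if word in PRIMARY_TO_SECONDARY:
            bits.append("1")
        elif word in SECONDARY_TO_PRIMARY:
            bits.append("0")
    return "".join(bits)

cover_words = "the big dog made a quick turn near the big house".split()
stego_words = semantic_embed(cover_words, "101")
```

The cover here has exactly three substitutable words, so it carries exactly three bits, which shows how low the data rate of this method is. A practical scheme would also need a way to mark the end of the message, since substitutable words after the last message bit would otherwise be read as extra bits.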

• P. Wayner method

Peter Wayner proposed mimic functions, which exploit the statistical profile of a message. Since the stego-objects are created only according to a statistical profile, the semantic component is entirely ignored. Wayner described one of the most promising techniques: a context-free grammar (CFG) is used to create the cover text, and the productions are chosen according to the secret message to be transmitted. The secret information is not embedded in a cover; the cover itself is the secret message. If the grammar is unambiguous, the receiver can extract the information by applying standard parsing techniques [KAT00]. Wayner also proposed an extension of the mimic function technique: given a set of productions, a probability is assigned to each possible production. The sender then constructs a Huffman compression function and converts the secret message to binary bits. The receiver parses the cover in order to reconstruct the productions that were used in the embedding step; this can be accomplished by the use of a parse tree for the given CFG [ALS01]. The weakness of this technique is that it is difficult to select meaningful type categories without considering the eventual grammatical requirements of a natural-language style source [ALD05].

• Chapman and Davida method

Chapman and Davida proposed a system which consists of two functions, NICETEXT and SCRAMBLE. Given a large dictionary of words of different types, and a style source which describes how words of different types can be used to form a meaningful sentence, NICETEXT transforms secret message bits into sentences by selecting words out of the dictionary which conform to a sentence structure given in the style source [ALS01]. SCRAMBLE reconstructs the secret message if the dictionary which has been used is known. Style sources can either be created from natural-language sentences or be generated using a CFG [ALS01]. The most obvious problem with the manual method is that it takes too long to enter large lists. NICETEXT focuses on creating large, sophisticated dictionaries with thousands of words [ALD05].

• Translation-based steganography

This method uses the errors expected in the translation process, especially in machine translation, to solve the issue of producing implausible text; information is hidden in the noise that occurs in language translation. Because imperfect translations are common, the errors resulting from translation-based steganography are inconspicuous. The translation-based approach, however, may be vulnerable to active attacks [LIU07].

• HTML

Information is hidden in HTML files by adding useless spaces and line breaks or by changing the case of letters in the tags [JOH98]. HTML files are good candidates for including extra spaces because Web browsers ignore these "extra" spaces, and they go unnoticed until the source of the page is revealed [KAT00].
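Hiding bits in the letter case of HTML tags can be sketched as follows. This is a minimal illustration under stated assumptions, not a published tool: uppercase letters are read as "1" and lowercase letters as "0", the regular expression only recognizes simple tag names, the receiver is assumed to know the message length, and the function names are hypothetical.

```python
import re

# Matches the opening "<" or "</" and the tag name that follows it.
TAG = re.compile(r"(</?)([A-Za-z][A-Za-z0-9]*)")

def html_embed(html: str, bits: str) -> str:
    """Set the case of tag-name letters: uppercase for 1, lowercase for 0."""
    queue = list(bits)

    def recase(match):
        name = []
        for ch in match.group(2):
            if ch.isalpha():
                # Letters beyond the end of the message are left lowercase.
                ch = ch.upper() if queue and queue.pop(0) == "1" else ch.lower()
            name.append(ch)
        return match.group(1) + "".join(name)

    marked = TAG.sub(recase, html)
    if queue:
        raise ValueError("not enough tag letters to carry the message")
    return marked

def html_extract(html: str, n_bits: int) -> str:
    """Read the case of the first n_bits tag-name letters."""
    letters = [ch for m in TAG.finditer(html) for ch in m.group(2) if ch.isalpha()]
    return "".join("1" if ch.isupper() else "0" for ch in letters[:n_bits])

page = "<html><body><p>Hello</p></body></html>"
marked_page = html_embed(page, "1011001")
```

Because HTML tag names are case-insensitive, `marked_page` renders exactly like `page`; the difference is visible only in the page source, where mixed-case tags may themselves look suspicious.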

[Figure (2.3): Text hiding methods. Text hiding techniques divide into (I) encoding information directly in the text (syntactic, semantic, P. Wayner, Chapman and Davida, translation-based, and HTML methods) and (II) encoding information in the text format (open-space, line-shift, word-shift, and feature encoding, color quantization, and halftone quantization); each method yields a binary code.]

II. Encoding Information in the Text Format [ALS01]

Information can be embedded in the format rather than in the message itself. Secret information can be stored in the size of inter-line or inter-word spaces. If the space between two lines is smaller than some threshold, a "0" is encoded; otherwise a "1" is encoded. Infrequent additional white space characters are introduced to form the secret message.

• Open Space method

Data is encoded through manipulation of white space (unused space) on the printed page. There are three methods for using white space to encode data.

A. Inter-Sentence Spacing [ALS01]

This method encodes a binary message into a text by placing one or two spaces after each sentence, such that one space represents "0" and two spaces represent "1". The disadvantage of this method is that it is inefficient, requiring a great deal of text to encode very few bits (one bit per sentence). This equates to a data rate of approximately one bit per 160 bytes, assuming sentences are on average two 80-character lines of text. Its ability to encode depends on the structure of the text, and many word processors automatically set the number of spaces after periods to one or two characters.

B. End-of-Line Spaces [ALS01]

This method inserts spaces at the end of lines. The data are encoded by allowing for a predetermined number of spaces at the end of each line. Its advantages are that it goes unnoticed by readers and that it can hide more information than the inter-sentence method; its disadvantage is that some programs, such as "sendmail", may inadvertently remove the extra space characters.

C. Inter-Word Spaces [ALS01]

Using white space between words to encode data involves right justification of the text. One space between words is interpreted as a "0"; two spaces between words are interpreted as a "1". The advantages of this method are that changing the number of spaces has little chance of changing the meaning of a phrase or sentence, and that the casual reader is unlikely to notice slight modifications in white space. The disadvantage is that, even if the reader does not notice the manipulation, the word processor may inadvertently change the number of spaces, destroying the hidden data.

• Line-Shift Coding

In this method, text lines are vertically shifted (moved up or down) according to the secret message bits, whereas other lines are kept stationary for the purpose of synchronization. If a line is moved up, a "1" is encoded; otherwise a "0" is encoded [DUC01]. The disadvantages of this method are that it is the text coding technique most visible to the reader, that large documents encode only a few bits (one bit per line), and that the need for the original document may decrease the security of the system [ALS01].

• Word-Shift Coding [ALS01]

In this method, codewords are coded into a document by shifting the horizontal or vertical locations of words within text lines, while maintaining a natural spacing appearance. This method is only applicable to documents with variable spacing between adjacent words.

As a result of this variable spacing, it is necessary to have the original image, or at least to know the spacing between words in the unencoded document.

A. Encode Codeword (Horizontal Word Shift)

For each text line, the largest and the smallest spaces between words are found. It is possible to alter every space between two words [ALS01]. For example, take Sentence 1:

We explore new steganographic and cryptographic algorithms and techniques throughout the world to produce wide variety and security in the electronic web called the Internet.

Applying a horizontal word-shifting algorithm yields Sentence 2:

[Sentence 2: the same text with the words "explore", "the world", "wide", and "web" shifted horizontally.]

Overlapping the two sentences gives the following:

[Overlay of Sentences 1 and 2, in which the shifted words "explore", "the world", "wide", and "web" stand out.]

This is achieved by expanding the space before "wide" and "web" by one point and condensing the space after "explore" and "the world" by one point in Sentence 1. The sentence containing the shifted words appears harmless, but combining it with the original sentence produces a different message: explore the world wide web.

The same method can encode a binary message instead of a codeword. For example, expanding the space before "explore", "the world", "wide", and "web" by one point encodes a "1", and condensing the space after these words by one point encodes a "0". By applying random horizontal shifts to all words in the document, an attacker could eliminate the encoding.

B. Encode Codeword (Vertical Word Shift)

Shifting the vertical locations of words can be used to help identify an original document. A similar method can be applied to display an entirely different message [ALS01]. For example, take the following sentence:

We explore new steganographic and cryptographic algorithms and techniques throughout the world to produce wide variety and security in the electronic web called the Internet.

Applying a vertical word-shifting algorithm yields the following sentence:

[The same text with the words "explore", "the world", "wide", and "web" shifted vertically.]

The same method can encode a binary message instead of a codeword. For example, shifting up the words "explore" and "the world" encodes a "1", and shifting down the words "wide" and "web" encodes a "0".

• Feature Encoding

Features such as shape, size, or position are manipulated. In this method, certain text features are altered, or not altered, depending on the codeword. For example, one could encode bits into text by extending or shortening the upward vertical endlines of letters such as b, d, h, etc. Generally, feature randomization takes place before encoding: character endline lengths are randomly lengthened or shortened, then altered again to encode the specific data. This removes the possibility of visual decoding, as the original endline lengths would not be known; to decode, one requires the original image.

Examples of using feature coding:

A long d is decoded as "1"; a short d is decoded as "0".
A long h is decoded as "1"; a short h is decoded as "0".
A long b is decoded as "1"; a short b is decoded as "0".

The advantages of this method are the large amount of data that can be encoded and the fact that the changes are largely indiscernible to the reader; the disadvantage is that feature coding can be defeated by adjusting each endline length to a fixed value [ALS01].

• Color Quantization [VIL06]

The main idea of this method is to quantize the color or luminance intensity of each character in such a manner that the human visual system is not able to distinguish between the original and the quantized characters, while the distinction can easily be made by a specialized reader machine. An example illustrating this method is shown in Figure (2.4). Therein, dark characters encode a 0, whereas light ones encode a 1. A binary sequence can be sequentially embedded into the cover text. Notice that the embedding rate is comparatively higher than the rate of the inter-line or inter-word space modulation methods.

[Figure (2.4): Color quantization: (a) original text "VAMOS A TRABAJAR"; (b) marked text (exaggerated), carrying the bit sequence 01011001000101.]

• Halftone Quantization [VIL06]

This method relies on halftoning, a widely used printing technology that enables continuous-tone images to be printed with one color of ink (grayscale) or a few colors of ink (color). Here, the discussion is restricted to black-and-white printers. In order to simulate a given gray shade, a halftone printer uses a halftone screen. This method exploits the fact that there exist several possible choices of halftone screen leading to the same gray shade. One can therefore hide data in each text character by using a different halftone screen according to the message bit m that one wishes to embed. The major strength of this method is that all characters in the stego text have the same gray shade. This method is intended mainly for printed documents.

[Figure (2.5): Halftone quantization: (a) original character; (b) marked character for m = 0; (c) marked character for m = 1.]
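Of the format-based methods in this section, the inter-word spacing rule (one space for "0", two spaces for "1") is the simplest to express in code. The sketch below is an illustration under stated assumptions: it works on a single line of plain text, ignores the right-justification step mentioned in the description, and assumes the receiver knows the number of message bits. The function names are hypothetical.

```python
import re

def interword_embed(text: str, bits: str) -> str:
    """Use one space between words for a 0 bit and two spaces for a 1 bit."""
    words = text.split()
    if len(bits) > len(words) - 1:
        raise ValueError("not enough inter-word gaps to carry the message")
    pieces = [words[0]]
    for i, word in enumerate(words[1:]):
        # Gaps beyond the end of the message keep a single space.
        gap = "  " if i < len(bits) and bits[i] == "1" else " "
        pieces.append(gap + word)
    return "".join(pieces)

def interword_extract(text: str, n_bits: int) -> str:
    """Read the first n_bits gaps: two spaces -> 1, one space -> 0."""
    gaps = re.findall(r" +", text.strip())
    return "".join("1" if len(gap) == 2 else "0" for gap in gaps[:n_bits])

cover_line = "We explore new steganographic and cryptographic algorithms"
marked_text = interword_embed(cover_line, "0110")
hidden_bits = interword_extract(marked_text, 4)
```

The words themselves are untouched, which is the advantage noted above; the disadvantage is equally visible, since any tool that normalizes runs of spaces erases the message.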

2.7 Steganalysis

A goal of steganography is to avoid drawing suspicion to the transmission of a hidden message. If suspicion is raised, this goal is defeated. Steganalysis is the art of discovering and rendering useless such covert messages [JOH01]. In other words, steganalysis attempts to detect the existence of hidden information [ALS01]. The steganalyst is one who applies steganalysis in an attempt to detect the existence of hidden information and/or render it useless. The two aspects of steganalysis are the detection and the distortion of embedded messages. Detection requires that the analyst observe various relationships between combinations of cover, message, stego-media, and steganography tool. Distortion attacks require that the analyst manipulate the stego-media to render the embedded information useless or remove it altogether [ETT98].

2.8 Attacks Available to the Steganalyst

There are many possible situations which confront the steganalyst, depending on what information is available. The different cases are shown in Table (2.1) [JAJ98].

Table (2.1) Steganography Attacks

1. Stego-only attack: only the stego-object is available for analysis.
2. Known cover attack: the "original" cover-object and the stego-object are both available.
3. Known message attack: at some point, the attacker may know the hidden message. Analyzing the stego-object for patterns that correspond to the hidden message may be beneficial for future attacks against that system. Even with the message, this may be very difficult and may even be considered equivalent to the stego-only attack.
4. Chosen stego attack: the steganography tool (algorithm) and the stego-object are known.
5. Chosen message attack: the steganalyst generates a stego-object from some steganography tool or algorithm using a chosen message. The goal in this attack is to determine corresponding patterns in the stego-object that may point to the use of specific steganography tools or algorithms.
6. Known stego attack: the steganography algorithm (tool) is known and both the original and the stego-objects are available.

2.9 Introduction to the Code [ABD01]

A code is nothing more than a set of strings over a certain alphabet. For example, the set C = {0, 10, 110, 1110} is a code over the alphabet {0, 1}. Codes are generally used to encode messages. For instance, the set C may be used to encode the first four letters of the alphabet as follows:

a → 0
b → 10
c → 110
d → 1110

Words (or messages) built up from these letters can then be encoded. The word "cab", for instance, is encoded as

cab → 110010
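The code C above can be exercised directly. The short Python sketch below encodes a word by concatenating code words and decodes a bit string by reading it left to right; the function names are hypothetical. Decoding in this simple way works because no code word of C is a prefix of another, a property that the Huffman codes discussed later in this chapter share.

```python
CODE = {"a": "0", "b": "10", "c": "110", "d": "1110"}

def encode(word: str) -> str:
    """Concatenate the code words of the letters."""
    return "".join(CODE[ch] for ch in word)

def decode(bits: str) -> str:
    """Read bits left to right, emitting a letter whenever a code word completes."""
    inverse = {code: letter for letter, code in CODE.items()}
    letters, current = [], ""
    for bit in bits:
        current += bit
        if current in inverse:
            letters.append(inverse[current])
            current = ""
    if current:
        raise ValueError("trailing bits do not form a code word")
    return "".join(letters)

encoded_cab = encode("cab")
decoded_cab = decode(encoded_cab)
```

Here `encoded_cab` is the string 110010 given in the text, and decoding it returns "cab".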

2.10 Why Encode the Data [KUO70]

There are three reasons to encode data that is about to be transmitted (through space, for instance) or stored (on a computer disk, for instance). The first reason is efficiency: it clearly makes sense to compress data as much as possible in order to save transmission time or storage space. In fact, data compression is very big business in the computer world. The second reason is error detection and/or correction. The third reason is secrecy, so that unauthorized persons cannot read the data. In other words, the goals of encoding are efficiency, error correction, and secrecy.

2.11 Huffman Coding There are different ways of encoding data and one of these ways is Huffman coding [Web06]. In 1952, D.A.Huffman published a method for constructing highly efficient instantaneous encoding schemes. This method is now known as Huffman Encoding [ROM96]. The idea behind Huffman coding is simply to use shorter bit patterns for more common characters, and longer bit patterns for less common characters [Web06]. The method starts by building a list of the entire alphabet symbols in descending order of their probabilities .It then constructs a tree with a symbol at every leaf, from the bottom up. This is done in steps where, at each step, the two symbols with smallest probabilities are selected, added to the top of the partial tree, deleted from the list, and replaced with an auxiliary symbol representing both of them. When the list is reduced to just one auxiliary symbol (representing the entire alphabet) the tree is complete [SAL95]. : An Example [Web06] To encode the letters A (0.12), E (0.42), I (0.09), O (0.30), U (0.07), listed with their respective probabilities. Go through the following steps: 1. Consider each of the letters as a symbol with its respective probability. 2. Find the two symbols with the smallest probability and combine them into a new symbol with both letters by adding

34

Chapter Two

Steganography

the probabilities. (Note1: There may be a choice between two symbols with the same probability, if this is the case, a symbol can be chosen, the final tree and codes will be different, but the overall

efficiency

of

the

code

will

be

the

same)

(Note 2: Frequency counts or other values may be used instead of probabilities) 3. Repeat step 2 until there is only one symbol left with a probability of 1. 4. To see the code, redraw all the symbols in the form of a tree, where each symbol contains either a single letter or splits up into two smaller symbols. Label all the left branches of the tree with a 0 and all the right branches with a 1. The code for each of the letters is the sequence of 0's and 1's that lead to it on the tree, starting from the symbol with a probability of 1.

Figure (2.6) Huffman Tree for example

5. Thus the codes for the letters are: A = 100, E = 0, I = 1011, O = 11, U = 1010.
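The steps of the example can be sketched in Python as follows. This is only an illustrative sketch, not the implementation used in this work: the function name `huffman_codes` and the use of a priority queue are assumptions made for the illustration. The two symbols with the smallest probabilities are merged repeatedly, the first one taken becomes the left (0) branch, and the codes are then read from the root down.

```python
import heapq
from itertools import count


def huffman_codes(probabilities):
    """Build a Huffman code table from a {symbol: probability} mapping."""
    tie = count()  # tie-breaker so the heap never has to compare trees
    # Heap entries are (probability, tie-breaker, tree). A tree is either a
    # single symbol (leaf) or a (left, right) pair (auxiliary symbol).
    heap = [(p, next(tie), symbol) for symbol, p in probabilities.items()]
    heapq.heapify(heap)

    # Step 2 and step 3: merge the two least probable symbols until one is left.
    while len(heap) > 1:
        p_left, _, left = heapq.heappop(heap)
        p_right, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (p_left + p_right, next(tie), (left, right)))

    # Step 4: label left branches 0 and right branches 1, starting at the root.
    codes = {}

    def walk(tree, prefix):
        if isinstance(tree, tuple):
            walk(tree[0], prefix + "0")
            walk(tree[1], prefix + "1")
        else:
            codes[tree] = prefix or "0"  # a one-symbol alphabet still needs a bit

    walk(heap[0][2], "")
    return codes


vowel_codes = huffman_codes({"A": 0.12, "E": 0.42, "I": 0.09, "O": 0.30, "U": 0.07})
for letter in sorted(vowel_codes):
    print(letter, vowel_codes[letter])
```

With this branch-labelling convention the sketch yields the same codes as step 5. A different but equally efficient code would result if the left and right branches were swapped at any merge.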

Figure (2.7) Huffman tree for the 26-letter alphabet [ROM96]. The letters, probabilities, and codes at the leaves of the tree are those listed in Table (2.2).

Table (2.2) shows the letters of the alphabet with their approximate probabilities of occurrence in English, based on statistical data. The third column of the table shows the corresponding Huffman code; the encoding scheme of Table (2.2) is the one used in this work [ROM96].

Table (2.2) Probabilities of Occurrence in English Text

Symbol  Probability  Huffman code
E       0.1300       000
T       0.0900       0010
A       0.0800       0011
O       0.0800       0100
N       0.0700       0101
R       0.0650       0110
I       0.0650       0111
H       0.0600       10000
S       0.0600       10001
D       0.0400       10010
L       0.0350       10011
C       0.0300       10100
U       0.0300       10101
M       0.0300       10110
F       0.0200       10111
P       0.0200       11000
Y       0.0200       11001
B       0.0150       11010
W       0.0150       11011
G       0.0150       11100
V       0.0100       11101
J       0.0050       111100
K       0.0050       111101
X       0.0050       111110
Q       0.0025       1111110
Z       0.0025       1111111
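As an illustration of how the codes of Table (2.2) can be applied, the following Python sketch encodes a message into a bit string and decodes it again. The function names and the decision to skip characters that are not in the table (spaces, digits, punctuation) are assumptions made only for this sketch. Decoding works bit by bit because no code word is a prefix of another.

```python
# Code table copied from Table (2.2).
HUFFMAN_TABLE = {
    "E": "000", "T": "0010", "A": "0011", "O": "0100", "N": "0101",
    "R": "0110", "I": "0111", "H": "10000", "S": "10001", "D": "10010",
    "L": "10011", "C": "10100", "U": "10101", "M": "10110", "F": "10111",
    "P": "11000", "Y": "11001", "B": "11010", "W": "11011", "G": "11100",
    "V": "11101", "J": "111100", "K": "111101", "X": "111110",
    "Q": "1111110", "Z": "1111111",
}
DECODE_TABLE = {code: letter for letter, code in HUFFMAN_TABLE.items()}


def encode(text):
    """Return the bit string for text; characters outside A-Z are skipped."""
    return "".join(HUFFMAN_TABLE[ch] for ch in text.upper() if ch in HUFFMAN_TABLE)


def decode(bits):
    """Read bits left to right and emit a letter whenever a code word is complete."""
    letters, current = [], ""
    for bit in bits:
        current += bit
        if current in DECODE_TABLE:
            letters.append(DECODE_TABLE[current])
            current = ""
    if current:
        raise ValueError("bit string ends in the middle of a code word")
    return "".join(letters)


secret_bits = encode("the")
print(secret_bits, decode(secret_bits))
```

Here "THE" needs 12 bits (4 + 5 + 3), compared with 24 bits for three 8-bit characters.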

Chapter Three
Microsoft Word Document File Format (*.doc)

3.1 Introduction

Microsoft Word is word processing software. Many versions of Word were written for several platforms¹, including the IBM PC running DOS, the Apple Macintosh, and Microsoft Windows, as shown in Figure (3.1). Word is a component of the Microsoft Office System, and Microsoft began calling it Microsoft Office Word instead of merely Microsoft Word.

Figure (3.1) Word Versions for Different Operating Systems (series: MS-DOS, Macintosh, Windows; horizontal axis: years of issuing, 1983 to 2008; vertical axis: Word versions number)

1. Platform: the underlying hardware or software for a system.

3.2 History of Word

Many concepts and ideas of Word were brought from Bravo, the original GUI word processor developed at Xerox PARC¹ [Web08]. Bravo's creator, Charles Simonyi, left PARC to work for Microsoft in 1981. That summer Simonyi hired Richard Brodie, who had worked with him on Bravo, away from PARC [Web02]. Word featured the concept of "What You See Is What You Get", or WYSIWYG, and was the first application with such features as the ability to display bold and italic text on an IBM PC. Word made full use of the mouse, which was so unusual at the time that Microsoft offered a bundled Word-with-Mouse package [Web08]. Although MS-DOS was a character-based system, Microsoft Word was the first word processor for the IBM PC that showed actual line breaks and typeface markups such as bold and italics directly on the screen while editing. This was still not a true WYSIWYG system, because the available displays did not have the resolution to show actual typefaces [Web02].

• Word 97: Word 97 had the same general operating performance as later versions such as Word 2000. It was the first version of Word featuring the "Office Assistant"², an animated helper used in all Office programs [Web08].
• Word 2000: For most users, one of the most obvious changes introduced with Word 2000 (and the rest of the Office 2000 suite) was a clipboard³ that could hold multiple objects at once. Another noticeable change was that the Office Assistant, whose frequent unsolicited appearance in Word 97 had annoyed many users, was changed to be less intrusive [Web08].
• Word 2002: Word 2002 was bundled with Office XP and was released in 2001. Although its appearance was different, it had many of the same features as Word 2003. One of the key advertising strategies for the software was the removal of the Office Assistant in favor of a new help system, although the Assistant was simply disabled by default in Word 2002 [Web08].
• Word 2003: For the 2003 version, the Office programs, including Word, were rebranded to emphasize the unity of the Office suite, so that Microsoft Word officially became Microsoft Office Word. Users continue to use both names [Web08].
• Word 2007: This release includes numerous changes, including a new XML-based file format, a redesigned interface, and an integrated equation editor [Web08].
• Word 2008: Word 2008 is the most recent version of Microsoft Word for the Mac, released on January 15, 2008. It includes some new features from Word 2007 [Web08].

1. Xerox PARC: research and development company, 1970.
2. Office Assistant: animated helper used in all Office programs.
3. Clipboard: a special file or memory area (buffer) where data is stored temporarily before being copied to another location; used for copy and paste.

3.3 Microsoft Word Document and its Components [Web11]

Documents in Word have a hierarchical structure, as shown in Figure (3.2).

Figure (3.2) External Structure of a Word Document

Different types of properties apply to different units in the hierarchy:

• Section. By default a document is a single section, but the settings for margins, headers and footers, footnotes, and columns apply to whole sections, so a section break is needed to change any of these for only part of a document. A new section is made using Insert | Break and selecting one of the four types of "section breaks".
• Paragraph. Most formatting in Word applies at the paragraph level: indents, line spacing, default font properties, bullets, etc. Many aspects of paragraph formatting can be applied all at once to a paragraph using paragraph styles.
• Character. Some formatting attributes apply at the level of the individual character, such as the bold font in the first word of this paragraph. A set of character attributes can be applied together using character styles.

In addition to these parts of the main document, there are other special kinds of text, which Word refers to as "stories". These include footnotes, comments, headers, and footers. These items are stored separately from the main text and require special commands to access and edit.

• Customizations. Customizations such as definitions, macros, and toolbars may be stored either in the document or in the document's associated template.
• Styles. Styles are collections of format specifications which can be applied all together to a paragraph or a group of characters. The advantage of using styles to apply formatting is that the formatting of all paragraphs of a certain type (e.g. examples, section headings, or footnotes) can easily be changed simply by redefining the style. A linguistics paper usually goes through a number of stages: as a term paper, as a draft circulated for comments, as a conference handout, as a journal submission, and as camera-ready copy for a volume. Each of these stages has its own format requirements. Using styles right from the beginning for all formatting can save a huge amount of time over the life of a paper.

3.4 Annotation and Collaboration Tools [Web11]

A linguist will often be working together with someone else on a document, either as a co-author or in a student-teacher relationship. Word has some easy-to-use tools to facilitate such collaborative work.

3.4.1 Track Changes

The "Track Changes" tool gives access to a simple method of keeping track of the changes a particular user makes to a document. Insertions are displayed in color and underlined; deletions and format changes are displayed in bubbles, like comments. An example of Track Changes is shown in Figure (3.3) [Web11].

Track Changes is a way for Microsoft Word to keep track of the changes made to a document. It is also known as redline, or redlining, because some industries traditionally draw a vertical red line in the margin to show that some text has changed [Web04].

Figure (3.3) Track Changes example

3.4.2 Comments

The "Comment" feature allows comments to be added to the document. In Page Layout view, recent versions of Word display comments in "bubbles" on the right side of the text (moving the text over to make room in the margin for the comment). Comments from different reviewers appear in different colors. An example of comments is shown in Figure (3.4) [Web11].

Figure (3.4) Comments example

3.5 File Format [Web03]

A file format is a particular way to encode information for storage in a computer file. Since a disk drive, or indeed any computer storage, can store only bits, the computer must have some way of converting information to 0s and 1s and vice versa. There are different kinds of formats for different kinds of information. Within any format type, e.g. word processor documents, there will typically be several different formats, and sometimes these formats compete with each other. Some file formats are designed to store very particular sorts of data: the JPEG format, for example, is designed only to store static photographic images. Other file formats, however, are designed for the storage of several different types of data.

3.6 Identifying the Type of a File [Web03]

Since files are seen by programs as streams of data, a method is required to determine the format of a particular file within the file system; the format of a file is an example of metadata. Different operating systems have traditionally taken different approaches to this problem, with each approach having its own advantages and disadvantages, as follows.

3.6.1 Filename Extension

One popular method, in use by several operating systems including DOS and Windows, is to determine the format of a file based on the section of its name following the final period. This portion of the filename is known as the filename extension. For example, HTML documents are identified by names that end with .html (or .htm) [Web03].

3.6.2 Magic Number

An alternative method, often associated with UNIX and its derivatives, is to store a "magic number" inside the file itself. Originally, this term was used for a specific set of 2-byte identifiers at the beginning of a file, but since any undecoded binary sequence can be regarded as a number, any feature of a file format which uniquely distinguishes it can be used for identification. GIF images, for instance, always begin with the ASCII representation of either GIF87a or GIF89a, depending upon the standard to which they adhere [Web03].
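The two identification methods can be contrasted with a small Python sketch. The helper names `extension_of` and `looks_like_gif` are hypothetical and chosen only for this illustration. The first function looks only at the file name, while the second looks only at the first six bytes of the content, using the GIF signatures mentioned above.

```python
import os

# ASCII signatures that open every GIF file, as described in Section 3.6.2.
GIF_SIGNATURES = (b"GIF87a", b"GIF89a")


def extension_of(filename):
    """Return the part of the name after the final period, in lower case."""
    return os.path.splitext(filename)[1].lower()


def looks_like_gif(data):
    """Check the magic number: the first six bytes of the file content."""
    return data[:6] in GIF_SIGNATURES


print(extension_of("index.html"))            # name-based identification
print(looks_like_gif(b"GIF89a\x01\x00\x01"))  # content-based identification
```

The extension can be changed by renaming the file, whereas the magic number stays with the content; a file named picture.html that starts with GIF89a is still recognized as a GIF by the second check.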

3.7 File Structure

Each format uses a structure (a way of organizing data for storage) in a file [FOL98]. There are several ways to structure data in a file. The most usual ones are shown in Figure (3.5).

Figure (3.5) File Structure Types: raw memory dumps (RMD), chunk based formats (CBF), and directory based formats (DBF)

3.7.1 Raw Memory Dumps/Unstructured Formats (RMD) [Web03]

Earlier file formats used raw data formats that consisted of directly dumping the memory images of one or more structures into the file. This has several drawbacks. Unless the memory images also have reserved space for future extensions, extending and improving this type of file is very difficult. On the other hand, developing tools for reading and writing these types of files is very simple. The limitations of the unstructured formats led to the development of other types of file formats that could be easily extended and be backward compatible at the same time.

3.7.2 Chunk Based Formats (CBF) [Web03]

In this kind of file structure, each piece of data is embedded in a container that contains a signature identifying the data, as well as the length of the data (for binary encoded files). This type of container is called a chunk. The signature is usually called a chunk id, chunk identifier, or tag identifier. With this type of file structure, tools that do not know certain chunk identifiers simply skip the chunks that they do not understand. Even XML can be considered a kind of chunk based format, since each data element is surrounded by tags which are akin to chunk identifiers.

3.7.3 Directory Based Formats (DBF) [Web03]

This is another extensible format that closely resembles a file system (OLE documents are actual file systems). The file is composed of "directory entries" that contain the location of the data within the file itself, as well as its signature (and in certain cases its type). Good examples of this type of file structure are disk images and OLE documents [Web03].
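The skipping behaviour of chunk based formats (Section 3.7.2) can be illustrated with a minimal Python sketch. The chunk layout used here (a 4-byte identifier followed by a 4-byte Little-Endian length and then the data) and the identifiers TEXT, XTRA, and AUTH are assumptions invented for the illustration; they do not describe any particular real format.

```python
import struct


def make_chunk(chunk_id, payload):
    """Wrap payload in a chunk: 4-byte id, 4-byte length, then the data."""
    return struct.pack("<4sI", chunk_id, len(payload)) + payload


def iter_chunks(data):
    """Yield (chunk_id, payload) pairs by hopping from chunk to chunk."""
    offset = 0
    while offset + 8 <= len(data):
        chunk_id, length = struct.unpack_from("<4sI", data, offset)
        offset += 8
        yield chunk_id, data[offset:offset + length]
        offset += length  # the stored length lets a reader skip the payload


def read_known_chunks(data, known_ids):
    """Keep the chunks a tool understands and silently skip the rest."""
    return {cid: payload for cid, payload in iter_chunks(data) if cid in known_ids}


sample = (make_chunk(b"TEXT", b"hello")
          + make_chunk(b"XTRA", b"??")      # a chunk the reader does not know
          + make_chunk(b"AUTH", b"me"))
known = read_known_chunks(sample, {b"TEXT", b"AUTH"})
print(known)
```

Because each chunk carries its own length, an older tool can step over the unknown XTRA chunk without understanding its content, which is what makes this kind of format easy to extend.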

3.8 Structure Storage

The lowest level of organization that is normally imposed on a file is a stream of bytes. When data is stored in a file merely as a stream of bytes, the ability to distinguish among the fundamental information units of the data is lost. These fundamental pieces of information are called fields. Fields are grouped together to form records, and records are grouped together to form blocks [FOL98], as shown in Figure (3.6).

Figure (3.6) Logical view of a file (block, record, field, stream of bits 0 and 1)

In persistent storage, files are normally stored in the form of bytes. A file is treated as a raw sequence of bytes. The entire file is stored in blocks on the disk, and these blocks are scattered over the disk. When the file is read, the file system manages its pointers and returns a sequence of bytes [CHA00].

Structure storage follows a different approach to storing a file and its data on persistent storage. It defines how to treat a file as a structured collection of objects. These objects are storages and streams, as shown in Figure (3.7).

Figure (3.7) Storage and Stream Structure (a root containing storages and streams)

A storage object is a kind of directory: it can contain other storage objects and stream objects. A stream object can be thought of as a file. Like a file, a stream contains data stored as a consecutive sequence of bytes. A compound file is a combination of these two objects [CHA00]. A compound file is a file which contains different types of data saved in a structured format, for example a file which has some text, some images, and other data. Suppose that one more object is to be added to such a file. In the traditional approach, when a file is saved, the file system rewrites the entire data. The structured storage approach eliminates this rewriting process and increases the read/write performance: the new data is written to the next available location in permanent storage, and the storage object updates the table of pointers it maintains to track the locations of its storage objects and stream objects [CHA00].

Here are some other benefits:

• The structured storage approach provides control over separate objects: it can read/write separate objects instead of the entire compound file [CHA00].
• More than one user can concurrently read/write the same file [CHA00].

3.9 Microsoft Compound Document File Format (MCDFF)

A Word file that contains an Excel sheet and chart, an image, a table, and some macros is an example of a compound file. Files which use MCDFF (Microsoft Compound Document File Format) include the output files of MS Office 97-2003 applications such as MS Word, PowerPoint, and Excel [CHA00]. The Microsoft Compound Document File Format (MCDFF) 2003 is a document file format based on OLE (Object Linking and Embedding), which is used by Microsoft for saving various resources as an integrated document [MIC07].

A storage component may exist as a standalone component. Each storage component may have one or more sub-storage components and stream components. The root component may also have stream components directly within it [JIT06].

3.10 Structure of Word Document Files

Figure (3.8) shows the structure of a Word document with an embedded Excel object.

Figure (3.8) Sample of Word document storage format (root component "MS Word" with the streams Data, Table, CompObj, Word Document, Summary Information, Document Summary Information, and JPEG Image, and the storage Object Pool; the Object Pool holds the storage Excel Sheet with the streams Workbook, Summary Information, and Document Summary Information)

The binary format for Microsoft Word 97 and later versions is based on a structure referred to as a .doc file or compound file. A Word .doc file consists of [MIC07]:

I. Word Document (main stream)
II. Summary information stream
III. Table stream
IV. Data stream
V. Custom XML storage (added in Word 2007)
VI. Zero or more object streams which contain private data for OLE 2.0 objects embedded within the Word document

The "MS Word" component is the root component, containing several streams and one storage item. Different parts of the document, such as the actual contents, any inserted table, the CompObj associated with the DLL files for the objects, the Summary Information for the content, any inserted image, and the Document Summary Information, all take the form of streams under the root component. The ObjectPool is the collective storage of all the sub-storage components. Figure (3.8) displays a sample Excel sub-storage component: the Excel Sheet itself is a storage component within the ObjectPool and has its own streams of information, namely the Workbook, SummaryInformation, and DocumentSummaryInformation [JIT06].

• Custom XML data store (added in Word 2007): The custom XML data store specifies custom defined XML files contained in the binary Microsoft Word 97 format or the Office Open XML formats [MIC07].
• Data stream: The stream within a Word .doc file that contains various data that anchor to characters in the main stream, for example binary data describing in-line pictures and/or form fields [MIC07].
• Main stream: The stream within a Word .doc file that contains the bulk of Word's binary data [MIC07].
• Object storage: A storage that contains binary data for an embedded OLE 2.0 object. Multiple instances are referred to as storages [MIC07].
• Stream: The physical encoding of a Word document's text and sub data structures in a random access stream within a .doc file [MIC07].
• Summary information stream: The stream within a Word .doc file that contains the document summary information [MIC07].
• Table stream: The stream within a Word .doc file that contains the various plcf's and tables that describe a document's structures [MIC07].

3.11 Format of the Main Stream

The main stream of a Word binary file (complex format) consists of the Word file header (FIB), the text, and the formatting information.

• FIB (File Information Block): The header of a Word file begins at offset 0 in the file. It gives the beginning offsets and lengths of the document's text stream and subsidiary data structures within the file, and it also stores other file status information. The FIB contains a "magic number" and pointers to the various other parts of the file, as well as information about the length of the file. The FIB is defined in the structure definition section of the specification [MIC07].
• Text: The text part contains all the text of the document (including footnotes, header and footer lines, etc.). The document's text is also located in the main stream [DIA08].

Word has used this same file format since its first version. This means that Word 1.0 can read Word 5.0 files and vice versa. This compatibility was accomplished by defining all structures to be larger than they needed to be and setting all reserved fields to zero for use in future versions. Reserved pointers in the document header have been used to add entirely new document sections (such as document retrieval information and bookmark tables) [Web09].

Because of the important issue of compatibility with future versions, all fields in all structures which are not currently being used MUST be filled with zeros. When the fields are finally defined for a new feature, zero will be made either the default value of those fields or a representation of an uninitialized state, which will be ignored [Web09].

3.12 MCDFF Metadata

MCDFF uses metadata to manage information about streams and storages. Table (3.1) describes the type of information contained in each metadata structure in MCDFF [HYU08].

Table (3.1) MCDFF Metadata

Name of metadata  Information contained
Header            Signature, pointer table of the BAT
BAT               Block Allocation Table
SBAT              Small Block Allocation Table
Directory         Stream and storage information

The exact format of these metadata structures is documented by the Spreadsheet Project of OpenOffice.org, in its Documentation of the Microsoft Compound Document File Format [DAN07], and by the Apache POIFS Project of Apache.org [MAR07]. POIFS file systems are called "file systems" because they contain multiple embedded files in a manner similar to traditional file systems. A word processor file with the extension ".doc" is therefore actually a POIFS file system with a document file archived inside that file system [MAR07].

Most operating systems, including Microsoft Windows, manage hard disk drives by dividing their storage space into units known as partitions. Before data can be stored on a partition, the partition must be formatted. Formatting a partition organizes the associated space into what is called a file system, which provides space for storing the names and attributes of files as well as the data they contain. Microsoft Windows supports several types of file systems, such as FAT and FAT32. Formatting a disk divides the disk into tracks, and each track is divided into sectors, sometimes called disk blocks, as shown in Figure (3.9). Partitions comprise the logical structure of a disk drive, the way humans and most computer programs understand the structure. However, disk drives also have an underlying physical structure that more closely resembles the actual structure of the hardware.

Figure (3.9) The structure of a hard disk [MCC99]

MCDFF uses two types of data unit: the small block (sector) and the big block (block) [HYU08]. If the size of a stream (file) is less than 4096 bytes, it is stored in small blocks, and the SBAT is used to walk the small blocks making up the file. If the size is 4096 bytes or larger, the file is stored in big blocks, and the main BAT is used to walk the big blocks making up the file [MAR07].

The (zero-based) index of a sector is called the sector identifier (SecID). SecIDs are signed 32-bit integer values. If a SecID is not negative, it must refer to an existing sector. If a SecID is negative, it has a special meaning [DAN07]:

• -1 Free SecID: free sector; it may exist in the file, but it is not part of any stream.
• -2 End Of Chain SecID: trailing SecID in a SecID chain.
• -3 SAT SecID: the sector is used by the sector allocation table.
• -4 MSAT SecID: the sector is used by the master sector allocation table.

3.12.1 Compound Document Header

The compound document header (simply "header" in the following) contains all the data needed to start reading a compound document file. The header is always located at the beginning of the file and occupies 512 bytes; this implies that the first sector (with SecID 0) always starts at file offset 512. The first 64 bits of the header form the identifier (magic number) of an Office file. The header also contains an array of block numbers. These block numbers refer to blocks in the file; when these blocks are read together, they form the Block Allocation Table. The header also contains a pointer to the first element in the property table, also known as the root element, and a pointer to the Small Block Allocation Table (SBAT) [MAR07]. The Block Allocation Table (BAT), along with the property table, specifies which blocks in the file system belong to which files [MAR07]. The contents of the compound document header structure are described in the following table.

Table (3.2) Compound document header structure [DAN07]

Offset  Size  Contents
0       8     Compound document file identifier: D0 CF 11 E0 A1 B1 1A E1
8       16    Unique identifier (UID) of this file
24      2     Revision number of the file format (most used is 003EH)
26      2     Version number of the file format (most used is 0003H)
28      2     Byte order identifier: FEH FFH = Little-Endian, FFH FEH = Big-Endian
30      2     Size of a sector in the compound document file in power-of-two (ssz); real sector size is sec_size = 2^ssz bytes (minimum value is 7, which means 128 bytes; most used value is 9, which means 512 bytes)
32      2     Size of a short-sector in the short-stream container stream in power-of-two (sssz); real short-sector size is short_sec_size = 2^sssz bytes (maximum value is the sector size ssz, see above; most used value is 6, which means 64 bytes)
34      10    Not used
44      4     Total number of sectors used for the sector allocation table
48      4     SecID of first sector of the directory stream
52      4     Not used
56      4     Minimum size of a standard stream (in bytes; minimum allowed and most used size is 4096 bytes); streams with an actual size smaller than (and not equal to) this value are stored as short-streams
60      4     SecID of first sector of the short-sector allocation table, or -2 (End Of Chain SecID) if not extant
64      4     Total number of sectors used for the short-sector allocation table
68      4     SecID of first sector of the master sector allocation table, or -2 (End Of Chain SecID) if no additional sectors are used
72      4     Total number of sectors used for the master sector allocation table
76      436   First part of the master sector allocation table, containing 109 SecIDs
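To make the layout of Table (3.2) concrete, the following Python sketch decodes the header fields with the standard `struct` module and turns a SecID into a file offset with the rule that sector 0 starts at offset 512. It is a sketch under stated assumptions, not the program used in this work: the function names and the returned dictionary keys are hypothetical, only Little-Endian files are handled, and the sample header is synthetic, filled with the "most used" values quoted in the table.

```python
import struct

MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")  # compound document file identifier
HEADER_SIZE = 512


def sec_pos(sec_id, ssz):
    """File offset of a sector: 512 + SecID * 2**ssz."""
    return HEADER_SIZE + sec_id * (1 << ssz)


def parse_header(header):
    """Decode the fields of Table (3.2) needed to start reading a file."""
    if len(header) < HEADER_SIZE or header[:8] != MAGIC:
        raise ValueError("not a compound document header")
    if header[28:30] != b"\xFE\xFF":
        raise ValueError("this sketch handles Little-Endian files only")
    revision, version = struct.unpack_from("<HH", header, 24)
    ssz, sssz = struct.unpack_from("<HH", header, 30)
    # Eight consecutive 32-bit fields at offsets 44 to 72.
    (sat_sectors, dir_secid, _unused, min_stream_size,
     ssat_secid, ssat_sectors, msat_secid, msat_sectors) = struct.unpack_from(
        "<8i", header, 44)
    msat_head = struct.unpack_from("<109i", header, 76)
    return {
        "revision": revision,
        "version": version,
        "sector_size": 1 << ssz,
        "short_sector_size": 1 << sssz,
        "sat_sectors": sat_sectors,
        "directory_first_secid": dir_secid,
        "directory_offset": sec_pos(dir_secid, ssz),
        "min_stream_size": min_stream_size,
        "ssat_first_secid": ssat_secid,
        "ssat_sectors": ssat_sectors,
        "msat_first_secid": msat_secid,
        "msat_sectors": msat_sectors,
        "sat_secids": [s for s in msat_head if s >= 0],  # -1 marks unused slots
    }


def build_sample_header():
    """Synthetic header (hypothetical values) used only to exercise the parser."""
    header = bytearray(HEADER_SIZE)
    header[0:8] = MAGIC
    struct.pack_into("<HHH", header, 24, 0x003E, 0x0003, 0xFFFE)
    struct.pack_into("<HH", header, 30, 9, 6)
    struct.pack_into("<8i", header, 44, 1, 1, 0, 4096, -2, 0, -2, 0)
    struct.pack_into("<109i", header, 76, 0, *([-1] * 108))
    return bytes(header)


info = parse_header(build_sample_header())
print(info["sector_size"], info["short_sector_size"], info["directory_offset"])
```

For a real file, the first 512 bytes read in binary mode would be passed to `parse_header` instead of the synthetic header.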

The header format structure in Table (3.3) is used to give block information if the file is stored in blocks. Note: The shaded cells in Table (3.3) are the ones used in this work.

Table (3.3) Header (block 1) -- 512 (0x200) bytes [MAR07]

Field                   Offset         Length     Default value or const  Description
FILETYPE                0x0000         Long       0xE11AB1A1E011CFD0      Magic number identifying this as a POIFS file system
UK1                     0x0008         Integer    0                       Unknown constant
UK2                     0x000C         Integer    0                       Unknown constant
UK3                     0x0014         Integer    0                       Unknown constant
UK4                     0x0018         Short      0x003B                  Unknown constant (revision?)
UK5                     0x001A         Short      0x0003                  Unknown constant (version?)
UK6                     0x001C         Short      -2                      Unknown constant
LOG_2_BIG_BLOCK_SIZE    0x001E         Short      9 (2 ^ 9 = 512 bytes)   Log, base 2, of the big block size
LOG_2_SMALL_BLOCK_SIZE  0x0020         Integer    6 (2 ^ 6 = 64 bytes)    Log, base 2, of the small block size
UK7                     0x0024         Integer    0                       Unknown constant
UK8                     0x0028         Integer    0                       Unknown constant
BAT_COUNT               0x002C         Integer    required                Number of elements in the BAT array
PROPERTIES_START        0x0030         Integer    required                Block index of the first block of the property table
UK9                     0x0034         Integer    0                       Unknown constant
UK10                    0x0038         Integer    0x00001000              Unknown constant
SBAT_START              0x003C         Integer    -2                      Block index of the first big block containing the small block allocation table (SBAT)
SBAT_Block_Count        0x0040         Integer    1                       Number of big blocks holding the SBAT
XBAT_START              0x0044         Integer    -2                      Block index of the first block in the Extended Block Allocation Table (XBAT)
XBAT_COUNT              0x0048         Integer    0                       Number of elements in the Extended Block Allocation Table (to be added to the BAT)
BAT_ARRAY               0x004C-0x01FC  Integer[]  -1 for unused elements  Array of block indices constituting the Block Allocation Table (BAT); at least the first element must be filled
N/A                     N/A            N/A        -1                      Header block data not otherwise described in this table

3.12.2 Byte Order [DAN07]

All data items containing more than one byte may be stored using the Little-Endian or the Big-Endian method, but in real-world applications only the Little-Endian method is used. The Little-Endian method stores the least significant byte first and the most significant byte last. This applies to all data types, such as 16-bit integers, 32-bit integers, and Unicode characters.

Example: The 32-bit integer value 13579BDFH is converted into the Little-Endian byte sequence DFH 9BH 57H 13H, or into the Big-Endian byte sequence 13H 57H 9BH DFH.
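The byte order example can be checked with a few lines of Python, using the built-in `int.to_bytes` and `int.from_bytes` methods. The variable names are chosen only for this illustration.

```python
value = 0x13579BDF

# Least significant byte first (the order used in real compound documents).
little = value.to_bytes(4, byteorder="little")
# Most significant byte first.
big = value.to_bytes(4, byteorder="big")

print(little.hex(" ").upper())  # bytes in Little-Endian order
print(big.hex(" ").upper())     # bytes in Big-Endian order

# Reading the bytes back with the matching byte order restores the value.
restored = int.from_bytes(little, byteorder="little")
```

Reading the Little-Endian bytes with the wrong byte order would give DF9B5713H instead of the original value, which is why the byte order identifier in the header has to be checked first.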

3.12.3 Sector File Offsets [DAN07]

With the values from the header it is possible to calculate a file offset from a SecID:

\[ \mathrm{sec\_pos}(\mathrm{SecID}) = 512 + \mathrm{SecID} \cdot \mathrm{sec\_size} = 512 + \mathrm{SecID} \cdot 2^{ssz} \tag{3.1} \]

Example with ssz = 10 and SecID = 5:

\[ \mathrm{sec\_pos}(5) = 512 + 5 \cdot 2^{10} = 512 + 5 \cdot 1024 = 5632 \]

Note: Equation (3.1) is also used to calculate the block position.

3.12.4 Property Table (Directory)

The property table is essentially nothing more than the directory system. Properties (directory entries) are 128-byte records contained within the 512-byte blocks. Each directory entry refers to a storage or a stream in the compound document. The zero-based index of a directory entry is called the directory entry identifier (DirID). There is a special directory entry at the beginning of the directory (with the DirID 0). It represents the root storage and is called the root storage entry [DAN07]. The contents of the directory entry structure are described in the following table.

Table (3.4) Directory entry structure [DAN07]

Offset  Size  Contents
0       64    Character array of the name of the entry, always 16-bit Unicode characters, with trailing zero character (results in a maximum name length of 31 characters)
64      2     Size of the used area of the character buffer of the name (not character count), including the trailing zero character (e.g. 12 for a name with 5 characters: (5+1)·2 = 12)
66      1     Type of the entry: 00H = Empty, 01H = User storage, 02H = User stream, 03H = LockBytes (unknown), 04H = Property (unknown), 05H = Root storage
67      1     Node colour of the entry: 00H = Red, 01H = Black
68      4     DirID of the left child node inside the red-black tree of all direct members of the parent storage (if this entry is a user storage or stream), -1 if there is no left child
72      4     DirID of the right child node inside the red-black tree of all direct members of the parent storage (if this entry is a user storage or stream), -1 if there is no right child
76      4     DirID of the root node entry of the red-black tree of all storage members (if this entry is a storage), -1 otherwise
80      16    Unique identifier, if this is a storage (not of interest in the following, may be all 0)
96      4     User flags (not of interest in the following, may be all 0)
100     8     Time stamp of creation of this entry. Most implementations do not write a valid time stamp, but fill up this space with zero bytes.
108     8     Time stamp of last modification of this entry. Most implementations do not write a valid time stamp, but fill up this space with zero bytes.
116     4     SecID of first sector or short-sector, if this entry refers to a stream; SecID of first sector of the short-stream container stream, if this is the root storage entry; 0 otherwise
120     4     Total stream size in bytes, if this entry refers to a stream; total size of the short-stream container stream, if this is the root storage entry; 0 otherwise
124     4     Not used
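A 128-byte directory entry laid out as in Table (3.4) can be decoded with the following Python sketch. The function names, the dictionary keys, and the sample entry (a stream hypothetically named "Table", starting at SecID 3 with a size of 5000 bytes) are assumptions made for the illustration; Little-Endian byte order is assumed throughout.

```python
import struct

ENTRY_SIZE = 128
ENTRY_TYPES = {0: "empty", 1: "user storage", 2: "user stream",
               3: "lock bytes", 4: "property", 5: "root storage"}


def parse_directory_entry(entry):
    """Decode the name, type, tree links, first SecID and size of one entry."""
    if len(entry) != ENTRY_SIZE:
        raise ValueError("a directory entry is exactly 128 bytes long")
    (name_size,) = struct.unpack_from("<H", entry, 64)
    # name_size counts bytes and includes the 2-byte trailing zero character.
    name = entry[:max(name_size - 2, 0)].decode("utf-16-le")
    entry_type, colour = struct.unpack_from("<BB", entry, 66)
    left, right, root = struct.unpack_from("<iii", entry, 68)
    first_secid, size = struct.unpack_from("<iI", entry, 116)
    return {
        "name": name,
        "name_size": name_size,
        "type": ENTRY_TYPES.get(entry_type, "unknown"),
        "colour": "red" if colour == 0 else "black",
        "left": left, "right": right, "root": root,
        "first_secid": first_secid,
        "size": size,
    }


def build_sample_entry(name, entry_type, first_secid, size):
    """Synthetic entry (hypothetical values) used only to exercise the parser."""
    entry = bytearray(ENTRY_SIZE)
    raw = name.encode("utf-16-le") + b"\x00\x00"
    entry[0:len(raw)] = raw
    struct.pack_into("<HBB", entry, 64, len(raw), entry_type, 1)
    struct.pack_into("<iii", entry, 68, -1, -1, -1)
    struct.pack_into("<iI", entry, 116, first_secid, size)
    return bytes(entry)


parsed = parse_directory_entry(build_sample_entry("Table", 2, 3, 5000))
print(parsed["name"], parsed["type"], parsed["first_secid"], parsed["size"])
```

The five-character sample name gives a name size of 12, matching the (5+1)·2 example in the table.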

The property format structure in Table (3.5) gives the Block information when the file is stored in Blocks. Note: the shaded cells in Table (3.5) are used in this work.


Table (3.5) Property -- 128 (0x80) byte block [MAR07]

Field | Description | Offset | Length | Default value or const
NAME | A unicode null-terminated uncompressed 16bit string (lose the high bytes) containing the name of the property. | 0x00, 0x02, 0x04, ... 0x3E | Short[] | 0x0000 for unused elements, field required, 32 (0x40) element max
NAME_SIZE | Number of characters in the NAME field | 0x40 | Short | Required
PROPERTY_TYPE | Property type (directory, file, or root) | 0x42 | Byte | 1 (directory), 2 (file), or 5 (root entry)
NODE_COLOR | Node color | 0x43 | Byte | 0 (red) or 1 (black)
PREVIOUS_PROP | Previous property index | 0x44 | Integer | -1
NEXT_PROP | Next property index | 0x48 | Integer | -1
CHILD_PROP | First child property index | 0x4c | Integer | -1
SECONDS_1 | Seconds component of the created timestamp? | 0x64 | Integer | 0
DAYS_1 | Days component of the created timestamp? | 0x68 | Integer | 0
SECONDS_2 | Seconds component of the modified timestamp? | 0x6C | Integer | 0
DAYS_2 | Days component of the modified timestamp? | 0x70 | Integer | 0
START_BLOCK | Starting block of the file, used as the first block in the file and the pointer to the next block from the BAT | 0x74 | Integer | Required
SIZE | Actual size of the file this property points to. (Used to truncate the blocks to the real size). | 0x78 | Integer | 0


3.12.5 Block Allocation Table (BAT)
The BAT (Block Allocation Table) is the main table for space within MCDFF, and it is needed to read any other Stream in the file [HYU08]. The BAT blocks are pointed at by the BAT array contained in the header; these blocks form a large table of integers. These integers are block numbers. The Block Allocation Table holds chains of integers [MAR07]. The elements in these chains refer to blocks in the file. The starting block of a file is NOT specified in the BAT; it is specified by the property of the given file. Each element in the BAT is both a block number (within the file minus the header) and the number of the next BAT element in the chain. This can be thought of as a linked list of blocks. The BAT array contains the links from one block to the next, including the end-of-chain marker [MAR07]. The BAT format structure is shown in Table (3.6). As an example, assume that the BAT begins as follows:

BAT[0] = 2
BAT[1] = 5
BAT[2] = 3
BAT[3] = 4
BAT[4] = 6
BAT[5] = -1
BAT[6] = 7
BAT[7] = -2


Now, if a file's Property Table entry says that it begins at index 0, walking the BAT array shows that the file consists of blocks 0 (because the start block is 0), 2 (because BAT[0] is 2), 3 (BAT[2] is 3), 4 (BAT[3] is 4), 6 (BAT[4] is 6), and 7 (BAT[6] is 7). It ends at block 7 because BAT[7] is -2, which is the end-of-chain marker. Similarly, a file beginning at index 1 consists of blocks 1 and 5, and BAT[5] is -1, which marks an unused block. The other special number in a BAT array is -3, which indicates a "special" block, such as a block used to make up the Small Block Array, the Property Table, the main BAT, or the SBAT [MAR07].
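The walk described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions rather than the code of the proposed system: walk_chain and EXAMPLE_BAT are hypothetical names, the BAT is assumed to be already loaded as a list of integers, and any negative value (-1, -2 or -3) is treated as the end of the walk.

```python
from typing import List

UNUSED = -1        # block is not used
END_OF_CHAIN = -2  # last block of a file
SPECIAL = -3       # block used by the BAT, Property Table, etc.

# The example BAT from the text.
EXAMPLE_BAT = [2, 5, 3, 4, 6, -1, 7, -2]


def walk_chain(bat: List[int], start_block: int) -> List[int]:
    """Return the block numbers of a file, following the BAT from start_block."""
    blocks: List[int] = []
    seen = set()
    index = start_block
    while index >= 0:                      # stop at any negative marker
        if index >= len(bat):
            raise ValueError(f"block {index} lies outside the BAT")
        if index in seen:                  # guard against a corrupted, cyclic chain
            raise ValueError(f"chain loops back to block {index}")
        seen.add(index)
        blocks.append(index)
        index = bat[index]                 # the entry names the next block
    return blocks
```

With the example BAT, walk_chain(EXAMPLE_BAT, 0) gives [0, 2, 3, 4, 6, 7] and walk_chain(EXAMPLE_BAT, 1) gives [1, 5], matching the two chains worked out in the text.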

Table (3.6) Block Allocation Table Block [MAR07]

Field | Description | Offset | Length | Default value or const
BAT_ELEMENT | Any given element in the BAT block | 0x0000, 0x0004, 0x0008, ... 0x01FC | Integer | -1 = unused; -2 = end of chain; -3 = special (e.g., BAT block). All other values point to the next element in the chain and the next index of a block composing the file.

In the physical structure of an MCDFF file, each Block that follows the Header is numbered with an index number. Figure (3.10) shows the process of accessing "Sample A Stream". The first index number for "Sample A Stream" is stored in its Directory entry. The BAT is then consulted to find the index numbers of the other Blocks that "Sample A Stream" uses. In this example, if the first index number in the Directory entry is 1, "Sample A Stream" consists of three Blocks, the 1st, 4th and 5th, according to the BAT [HYU08].


Figure (3.10) MS Compound files structure [HYU08]

3.12.6 Sector Allocation Table (SAT)
The Sector Allocation Table (SAT) is an array of SecIDs. It contains the SecID chain of all user streams. The size of the SAT (number of SecIDs) is equal to the number of existing sectors in the compound document file [DAN07].
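Equation (3.1) and the SecID chain of the SAT can be combined to collect the bytes of a stream. The following sketch is illustrative only: sec_pos follows equation (3.1), while read_stream and its parameters are hypothetical names, the whole file is assumed to be held in memory as bytes, and negative SAT values are assumed to end the chain in the same way as the BAT markers described in Section 3.12.5.

```python
from typing import List

HEADER_SIZE = 512  # the header occupies the first 512 bytes of the file


def sec_pos(sec_id: int, ssz: int) -> int:
    """File offset of a sector, equation (3.1): 512 + SecID * 2**ssz."""
    return HEADER_SIZE + sec_id * (1 << ssz)


def read_stream(data: bytes, sat: List[int], first_sec_id: int,
                stream_size: int, ssz: int) -> bytes:
    """Concatenate the sectors of one stream and cut to its real size."""
    sec_size = 1 << ssz
    chunks = []
    visited = set()
    sec_id = first_sec_id
    while sec_id >= 0:                     # assumption: negative value ends the chain
        if sec_id in visited:
            raise ValueError(f"SecID chain loops back to sector {sec_id}")
        visited.add(sec_id)
        pos = sec_pos(sec_id, ssz)
        chunks.append(data[pos:pos + sec_size])
        sec_id = sat[sec_id]               # SAT entry gives the next SecID
    # The last sector is usually only partly used, so truncate.
    return b"".join(chunks)[:stream_size]
```

For instance, sec_pos(5, 10) reproduces the value 5632 from the worked example of Section 3.12.3. The first SecID and the stream size passed to read_stream would come from the directory entry of the stream (offsets 116 and 120 in Table (3.4)).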

3.13 Office Automation
Office Automation / OLE Automation (later renamed by Microsoft to just Automation) is an inter-process communication mechanism based on the Component Object Model (COM). It was intended for use by scripting languages, originally Visual Basic, but is now used by other languages that run on Windows. It provides an infrastructure whereby applications called automation controllers can access and manipulate (i.e. set properties of or call methods on) shared automation objects that are exported by other applications in OLE Automation. The automation controller is the "client" and the application exporting the automation objects is the "server" [Web10].

3.14 PIA for Microsoft Office 2003
Table (3.7) lists the Primary Interop Assemblies (PIAs) used with Office 2003, that is, the Microsoft Office 2003 applications and component type libraries that have the same version number and that are signed with the same key [KHO05].

Table (3.7) Office 2003 applications and component type libraries with the same version number, signed with the same key [KHO05]

Office 2003 application or component | PIA Name | PIA Namespace
Microsoft Office 11.0 Object Library | Office.dll | Microsoft.Office.Core
Microsoft Word 11.0 Object Library | Microsoft.Office.Interop.Word.dll | Microsoft.Office.Interop.Word

3.15 Word Object Model
Word provides hundreds of objects. These objects are organized in a hierarchy that closely follows the user interface. The Word Visual Basic Help contains a diagram of Word's object model. The figure is "live": clicking on an object opens the Help topic for that object. Figure (3.11) shows the portion of the object model diagram that describes the Document object [GRA01]. The key object in Word is Document, which represents a single, open document; the Document object has many properties and methods. Many of its properties are references to collections such as Paragraphs, Tables and Sections. Each of these collections contains references to objects of the indicated type, and each object contains information about the appropriate piece of the document. For example, the Paragraph object has properties like KeepWithNext and Style, as well as methods like Indent and Outdent [GRA01].

Figure (3.11) Word Object Model – The Word Visual Basic Help file offers a global view of Word's structure [GRA01].


3.16 Platform Invoke (PInvoke)
It is sometimes necessary to call a function located in an unmanaged DLL from within the .NET Framework. Platform invoke (PInvoke) is the technique used to make this happen [Web01].

Figure (3.12) A platform invoke call to an unmanaged DLL function [Web01].

When platform invoke calls an unmanaged function, it performs the following sequence of actions [Web01]:
I. Locates the DLL containing the function.
II. Loads the DLL into memory.
III. Locates the address of the function in memory and pushes its arguments onto the stack, marshaling data as required.
Note: Locating and loading the DLL, and locating the address of the function in memory, occur only on the first call to the function.
IV. Transfers control to the unmanaged function.
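The four steps above can be mirrored, by way of analogy only, with Python's ctypes module. This is not .NET platform invoke and not the mechanism used in this work; it is a different foreign-function interface that goes through comparable steps. The helper names load_c_runtime and c_strlen are hypothetical, and the C runtime function strlen is chosen merely because it is available on most systems.

```python
import ctypes
import ctypes.util
import sys


def load_c_runtime() -> ctypes.CDLL:
    """Steps I and II: locate the C runtime library and load it into memory."""
    if sys.platform.startswith("win"):
        return ctypes.CDLL("msvcrt")
    # find_library may return None; CDLL(None) then exposes the symbols
    # already loaded into the current process, which include the C runtime.
    return ctypes.CDLL(ctypes.util.find_library("c"))


def c_strlen(text: bytes) -> int:
    """Call the unmanaged C function strlen on a byte string."""
    lib = load_c_runtime()
    strlen = lib.strlen                    # Step III: look up the function address
    strlen.argtypes = [ctypes.c_char_p]    # describe how the argument is marshaled
    strlen.restype = ctypes.c_size_t       # describe how the result is marshaled
    return strlen(text)                    # Step IV: transfer control to the C code
```

Unlike platform invoke, this sketch repeats the lookup on every call; a caller that wanted the behaviour described in the note would keep the loaded library and function object and reuse them.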


3.17 Application Programming Interfaces (API) [Web12]
An API is a set of functions that can be used to work with a component, application, or operating system. Typically, an API consists of one or more DLLs that provide some specific functionality. DLLs are files that contain functions that can be called from any application running in Microsoft Windows.

3.18 Office Application Programming Interfaces (APIs) [Web05]
Office binary file formats are designed to be accessed through the Office Application Programming Interfaces (APIs), instead of by direct manipulation of the file format. Because of the complexity of the formats, direct manipulation can cause corruption and is strongly discouraged. The Office 97-2003 binary file formats use the Windows Structured Storage APIs. The Office-specific information is stored as streams in this more generalized format. Common elements, such as document properties, can be accessed through the Structured Storage APIs.


4 Chapter Four Proposed Hiding System in (MCDFF