Beyond Passive Audiobook: How Digital Audiobooks Get Interactive Marco Furini Computer Science Department - University of Piemonte Orientale 15100 Alessandria, Italy Email: [email protected]
Abstract: Although the audiobook market is growing at an exceptional speed, an audiobook is limited to passive listening. In this paper we propose a novel approach to transform a passive listener into a story director. By introducing interactivity into digital audiobooks, a user can interact with the storyline and the story develops according to the user's choices. We propose an architecture that produces interactive audiobooks in a transparent and secure way. Transparency is achieved by using the AAC format for the audio stream, the MPEG7-DDL for the audio description and the ISO Base Media File Format for the audiobook file format; security is achieved with a mechanism that protects the digital content from illegal usage and prevents malicious alteration of the content. The simplicity of our approach, along with the growing usage of portable devices, may open new markets for the audiobook in the educational and entertainment worlds.
I. INTRODUCTION
Recent research puts the total size of the digital home entertainment market at $129 billion in 2004 and $166 billion in 2005, and projects $411 billion by 2010, with an average annual growth rate of nearly 20% over the next five years. Several players participate in this market, but in this paper we focus on a particular one: the audiobook. Audiobooks are increasingly used around the world, and the Audio Publishers Association estimates the size of the audiobook market at $832 million. Simply put, an audiobook is an audio representation of a written book. Although most people think of audiobooks as a support for sight-impaired people, and although sight-impaired people and slow learners were indeed the main users in the early days, nowadays the interest around audiobooks is growing at an exceptional speed. Looking at the revenues of the different audiobook formats (cassette tapes, standard CDs, MP3 CDs and downloadable format), the downloadable digital format had the largest growth in the audiobook industry: in 2001, download sales were $5,143,000; in 2002, an increase of 112 percent led download sales to $10,940,000; and in 2003 sales reached $18,490,000 (plus 69 percent). This trend is expected to continue due to the proliferation of broadband networks and the falling cost of portable audio players. The audiobook industry is thus moving from the audiobook to the digital audiobook. This shift benefits both the audiobook industry and the consumer. The former can reduce production costs by eliminating manufacturing costs, while the latter can buy and download an audiobook wherever
and whenever he/she wants at a lower price than any other audiobook format (e.g., an audiobook costs around $40 on tape/CD, and around $20 if downloaded from a webstore). This large interest exists even though the audiobook is limited to passive listening: a listener cannot interact with the story, but only with the media player (pause, rewind, fast-forward). The contribution of this paper is to introduce interactivity inside a digital audiobook in order to involve the listener more actively and take him/her beyond conventional passive listening. By providing interactivity, a user may affect the storyline and the story may develop differently. In this way, the user is no longer a passive listener, but becomes the director of the story. Hence, an interactive audiobook may be played over and over and may provide a different version every time. To realize an interactive audiobook, we propose an architecture composed of a script manager, a scene manager and an interaction manager. These entities cooperate to produce an interactive audiobook. In essence, the idea is to divide the audio story into multiple audio scenes; for some of these scenes, the user can select the direction the story takes. In this way, the storyline may differ from one play out to the next. To realize this, all the audio scenes, as well as all the possible scene transitions, have to be identified and described. Since we seek transparency, we adopt the ISO Base Media File Format as the file format for our digital interactive audiobook. In this way, our interactive audiobook can be directly used on systems that support the play out of 3GP and MP4 files. Furthermore, the ISO Base Media File Format allows us to use Advanced Audio Coding (AAC) for the audio stream and the MPEG7-DDL markup description language for all the audio descriptions.
Content protection is achieved through a security mechanism that ensures that only a legal audiobook owner can enjoy the full features of an interactive audiobook and that (malicious) alterations cannot be made to the produced interactive audiobook. This security mechanism is built with classic security tools (encryption, watermarking and hash functions). A digital interactive audiobook can be an exciting entertainment application for all owners of a portable audio player (in particular, commuters and students), but can also be an educational application (learning through interactivity has been proved to
1-4244-0667-6/07/$25.00 © 2007 IEEE
be more effective) and a game application (e.g., adventure games). Hence, our proposal may help the audiobook industry to explore new markets. The remainder of this paper is organized as follows. In Section II we briefly review the ISO Base Media File Format, the AAC, the MPEG7-DDL and the security tools used by our approach; in Section III we present details of our proposal. Conclusions are drawn in Section IV.
Fig. 1. Example of an ISO Base Media File structure. (Boxes shown in the figure: ftyp; moov; trak; mdia; meta with hdlr = 'mp7t'.)
II. PRELIMINARIES
For compatibility reasons, we adopt: i) the ISO Base Media File Format as the file format of our interactive audiobook; ii) the AAC format as the audio stream format; and iii) the MPEG7-DDL as the markup language to describe the audio information. In this section we briefly review their main characteristics, together with the basics of the security tools we use to protect audiobook contents.
A. The ISO Base Media File Format
The ISO Base Media File Format has been defined to combine different multimedia streams (e.g., audio, video, 2D/3D graphics, text, etc.) into a single file. The file structure consists of several different boxes, where each box has a name, a size and a type. A box may contain actual media data (e.g., audio data) or metadata (information to define the access, placement and timing properties of the media). By combining these boxes, different multimedia streams can be organized in order to realize a presentation (also called movie). Briefly, a movie is logically divided into one or more tracks; each track is independent from the other tracks and represents a timed sequence of media (e.g., audio or video). It contains a media box and a meta box, which contains the name of the exact media type and any parameter the decoder will need. Figure 1 shows a simple example. The file contains three different boxes: the File Type Box (ftyp), the Movie Box (moov), and the Media Data Box (mdat). The ftyp box is used to improve interoperability and file type compatibility, the moov box is the movie container and is designed to include all the data related to the presentation, and the mdat box is the media data container. The moov box contains the trak box, which is the container for an individual stream. The trak box contains mdia (the media information container) and meta (the metadata container).
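The box organization described above can be illustrated with a short sketch. The following Python fragment is not part of the proposed system: it assumes the standard box header of the ISO Base Media File Format (a 32-bit big-endian size followed by a four-character type), and the helper names (make_box, iter_boxes) as well as the sample bytes are hypothetical and chosen only for illustration.

```python
import struct
from typing import Iterator, Optional, Tuple

# Boxes that this sketch treats as plain containers of other boxes.
CONTAINERS = {b"moov", b"trak", b"mdia"}


def make_box(box_type: bytes, payload: bytes = b"") -> bytes:
    """Build a box: 32-bit size (header included), 4-char type, payload."""
    return struct.pack(">I4s", 8 + len(payload), box_type) + payload


def iter_boxes(data: bytes, start: int = 0, end: Optional[int] = None,
               depth: int = 0) -> Iterator[Tuple[int, str, int]]:
    """Yield (depth, type, size) for every box found in data[start:end]."""
    end = len(data) if end is None else end
    pos = start
    while pos + 8 <= end:
        size, box_type = struct.unpack_from(">I4s", data, pos)
        header = 8
        if size == 1:        # a 64-bit size follows the type field
            (size,) = struct.unpack_from(">Q", data, pos + 8)
            header = 16
        elif size == 0:      # the box runs to the end of the enclosing range
            size = end - pos
        if size < header or pos + size > end:
            raise ValueError(f"malformed box at offset {pos}")
        yield depth, box_type.decode("ascii"), size
        if box_type in CONTAINERS:
            # Recurse into the children stored after the header.
            yield from iter_boxes(data, pos + header, pos + size, depth + 1)
        pos += size


# Hypothetical file: ftyp, moov > trak > (mdia, meta), mdat.
sample = (
    make_box(b"ftyp", b"isom" + bytes(4))
    + make_box(b"moov",
               make_box(b"trak",
                        make_box(b"mdia") + make_box(b"meta", bytes(4))))
    + make_box(b"mdat", bytes(4))
)

for depth, name, size in iter_boxes(sample):
    print("  " * depth + f"{name} ({size} bytes)")
```

The walker only descends into the boxes it knows to be plain containers; a real reader would also need the per-box payload layouts defined by the standard, which this sketch deliberately leaves out.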
In addition, a user data box (udta box) can be used to store descriptive metadata related to the whole presentation. All the descriptive metadata shall be stored using the MPEG-7 language. B. The MPEG7-DDL The MPEG7-DDL  is a markup description language based on XML and is provided with a set of tags to describe multimedia contents. It produces a description of the spatial layout of different media elements (video, audio, graphics, text) as well as the temporal order in which these elements will be played out during the presentation. These tags have attributes and values and are usually enclosed between angle
brackets together with their attributes, with the exception of tags that do not have attributes. Among the several pre-defined tags, it is worth presenting those that will be used to describe audio data in this paper. The audiosegment tags allow decomposing an audio stream into several audio segments by specifying the start time and the duration (through the mediatimepoint and mediaduration tags) of each audio segment. A further pair of tags allows labelling a portion of the MPEG7 description. The textannotation tags are used to write simple text in the description.
C. The AAC File Format
Advanced Audio Coding (AAC) is a standard audio encoding algorithm proposed by the MPEG group and has been chosen as one of the audio formats that can be stored inside the ISO Base Media File Format. The AAC format provides high audio quality at low bit-rates and is gaining wide adoption in the marketplace (it has been adopted by major standards: MPEG-4, 3GPP and 3GPP2). Readers can refer to the AAC literature listed in the References (Brandenburg; Bosi et al.; Herre and Purnhagen) for further details about this format. Here, we simply highlight that an AAC file is composed of a set of audio blocks, where each block contains sufficient data to be decoded. This is the main difference from other encoding algorithms, where decoding an audio block requires several audio blocks (usually the adjacent ones). Conversely, an AAC player can jump from one block to another without any problem.
D. Security Tools
To protect digital content, many vendors use digital rights management (DRM) systems, which wrap a media file with a control mechanism that enforces some limitations and discloses the material only to authorized users. Encryption and information hiding are the key components of a typical DRM system. Encryption is carried out with typical symmetric-key encryption algorithms. Hence, to play out an encrypted media file, the cryptographic key must be available to the software player (but not to users, in order to avoid unauthorized usage or distribution).
Information hiding is a technique, used by many DRM systems, to hide information inside the media file. The hiding is achieved through watermarking techniques. The inserted information (called watermark) is hidden, imperceptible and directly connected to the media content (i.e.
it is spread out in the media file). In addition, a watermark should be statistically invisible (users should not gain any advantage from comparing different watermarked copies), robust (it cannot be removed by simple manipulation of the media file, for instance compression or conversion) and tamper resistant (an unauthorized person should not be able to alter, identify or remove the watermark, or to insert a valid one). To hide the information, a watermarking technique typically uses a secret key to generate a random sequence during the embedding process. On the decoding side, the same watermarking key is used to extract the watermark. If the watermarking scheme is statistically invisible, robust and tamper resistant, extracting or altering a watermark is hard without knowledge of the watermarking key.
III. OUR PROPOSAL
In this section we present the details of our proposal, which aims at giving users the ability to interact with the storyline of an audiobook, so that they can play an active role in the story development. To better highlight what story interaction means, let us consider two famous movies: Sliding Doors and Pulp Fiction. The former is about a woman whose love life and career both hinge on whether she catches a train or not. In the movie, we see it both ways, in parallel. In an interactive system, the user would be asked whether the woman catches the train or not and the story would develop accordingly. In Pulp Fiction, a main story is retold from different perspectives. In an interactive system, the user may be given the possibility of selecting the preferred point of view. The contribution of this paper is to define an architecture to produce interactive audiobooks, so that users can interact with the audiobook and the storyline may be modified by user interaction. In this way, our proposal takes users beyond passive listening. In doing this, we pursue two other important goals: transparency and security.
By transparency we mean that the interactive audiobook should be built inside a known file format. By security we mean that only a legal owner of an audiobook can play it out and that alteration of the audiobook data (or a part of it) is not allowed. To meet transparency, we adopt the increasingly used ISO Base Media File Format (recently adopted by the mobile industry with 3GP files and by the MPEG group with MP4 files). To meet security, we develop a mechanism that, by using classic security tools such as watermarking and hash functions, is able to verify the integrity of an audiobook file and to ensure that only a legal audiobook owner can enjoy the full features of an interactive audiobook. In the following we first describe the entities involved in our system (script manager, scene manager, interaction manager and audiobook player) and then we show how these entities cooperate to produce, protect and play out an interactive audiobook.
A. System Architecture
Since an interactive audiobook allows users to interact and to modify the story development, the audiobook has to contain
Fig. 2. Our proposed system architecture to produce interactive audiobooks. (Figure labels: Audiobook Protection and Production; Multiple Stories; Audiobook Play out; Audio Scene; Audio Scene Description; Scene Transition Table; Scene Manager; Interaction Manager.)
multiple storylines so that a user can move from one storyline to another by interacting with the system. Hence, multiple storylines and interactions have to be organized. To this aim, we propose the architecture depicted in Figure 2 and we describe it in the following.
1) Script manager: The script manager is in charge of: i) dividing every storyline into several audio scenes; ii) describing each audio scene with its timing characteristics; and iii) defining the possible paths (i.e., scene transitions) that a user can take by interacting with the audio scene. Hence, the basic unit of our system is the audio scene, which can be of four possible types: initial, interactive, sequential and ending. The initial scene is the first audiobook scene and only one initial scene per audiobook is allowed. The interactive scene is a scene that allows users to interact with the story. The sequential scene is a scene where interaction is not allowed. The ending scene is a scene that ends a storyline (note that multiple ending scenes may be present). For instance, let us consider a simple story: a woman, named Camilla, dreams about six lucky numbers and wants to bet on them. Whether she reaches the bet agency in time to bet depends on the actions taken. Figure 3 and Table I show a simple example of the tasks done by a script manager. The story is divided into six different scenes:
• AS 1. Initial-Interactive scene. Camilla dreams six numbers and she goes out to bet on them. Subway or bus is the first choice she has to make.
• AS 2.1. Interactive scene. Camilla catches the bus, but since there is a traffic jam, she wonders whether to get off or to wait on the bus.
• AS 2.2. Sequential scene. Camilla catches the subway and she reaches the bet agency.
• AS 3.1. Ending scene. Camilla realizes that the agency is about to close. She gets off the bus and she goes shopping.
• AS 3.2. Sequential scene. Camilla decides to get off the bus and to walk to the agency.
• AS 4.1. Ending scene. Camilla wins one million dollars.
Although the example is very simple and has only two interactive scenes, the user is provided with three different stories and two different endings: 1-2.1-3.1, 1-2.1-3.2-4.1 and 1-2.2-4.1. Once all the scenes are defined, the script manager describes them, along with all the possible scene transitions, with the
Fig. 3. Multiple scenes can describe multiple storylines. Scene boxes in the figure:
AS 1: A dream reveals six lucky numbers to Camilla. When she wakes up, she goes to the bet agency. Subway or bus?
AS 2.1: Camilla gets the bus, but there is a traffic jam, so she wonders whether to get off and go on foot, or to wait on the bus.
AS 2.2: Camilla gets the subway and reaches the bet agency in 5 minutes.
AS 3.1: After a while, Camilla realizes that the agency is about to close and so she gets off the bus and she goes shopping.
AS 3.2: She decides to get off and to go on foot. She arrives just in time to bet.
AS 4.1: After one hour Camilla won 1 million dollars.

TABLE I. THE USAGE OF MPEG7-DDL TO DESCRIBE AN AUDIO SCENE.
Audio segment "41": media time point 00:02:30; media duration 00:00:17; text annotation: After one hour Camilla won 1 million dollar.
MPEG7-DDL language. Table I shows an example of an audio scene description. The audiosegment tag is used to define an audio scene; every audio scene (identified with a unique label) specifies the beginning (mediatimepoint) and the duration (mediaduration) of the audio segment. If available, a text description of the audio scene may be specified through the textannotation tags. To handle all the possible user choices, the script manager defines a table (the scene transition table), shown in Table II. Each entry is identified by the SID (Scene IDentifier) and includes a possible question (only for interactive scenes) and the possible audio scene destinations. If the scene is interactive, there are at least two possible audio scene destinations; if the scene is sequential, a single destination is present; if the scene is of an ending type, no destination is present.

TABLE II. SCENE TRANSITION TABLE.
SID   Question                  Option 1   Option 2
1     Subway or bus?            2.2        2.1
2.1   Get off the bus or not?   3.2        3.1
2.2   -                         4.1        -
3.1   -                         -          -
3.2   -                         4.1        -
4.1   -                         -          -

2) Interaction manager: The interaction manager is in charge of handling the interactions between the user and the system. The interaction interface depends on the available hardware. For instance, the user can interact with a pad, a keyboard, a voice recognition system or a visual gesture recognition system. Regardless of the interface used, the interaction manager is activated when an interactive audio scene is played out. Using the scene transition table, it identifies the question to pose to the user; using the user's answer and the scene transition table, it identifies the next audio scene to play out and passes it to the scene manager.
3) Scene manager: The scene manager is in charge of identifying the audio scene to play out. It controls both the audiobook player and the interaction manager. By cooperating with the interaction manager, it gets the SID of the scene that has to be played out; using this SID, it accesses the MPEG7-DDL description and finds the audio segment of the scene to play out; it then analyzes the audio-segment description and passes the play out timing information to the media player.
4) The Player: The player is in charge of playing out the audio segments requested by the scene manager. The player has to cooperate with the scene manager, as it has to jump from one segment to another depending on the audio segment that the scene manager asks to play out. Hence, an enhanced audiobook player should be available. In fact, if played with a classic player, the interactive audiobook would simply be played as a normal audiobook, sequentially from the first audio scene to the last one.
B. Production, Protection and Playout
The entities just described cooperate to produce, protect and play out an interactive audiobook.
1) Audiobook Production: Once the script manager has produced all the audio scenes and the scene transition table, the audiobook production can begin. The audiobook production first involves the production and the description of the audio data. Since we are using the ISO Base Media File Format, we consider the AAC audio format and the MPEG7-DDL markup description language.
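The interplay between the scene transition table and the interaction manager described earlier in this section can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the table content is read from Table II and the Camilla example, while the data layout and the function names (scene_type, play, storylines) are hypothetical and not part of the proposal.

```python
from typing import Dict, Iterable, List, Optional, Tuple

# Scene transition table: SID -> (question, audio scene destinations).
TRANSITIONS: Dict[str, Tuple[Optional[str], List[str]]] = {
    "1":   ("Subway or bus?", ["2.2", "2.1"]),
    "2.1": ("Get off the bus or not?", ["3.2", "3.1"]),
    "2.2": (None, ["4.1"]),
    "3.1": (None, []),
    "3.2": (None, ["4.1"]),
    "4.1": (None, []),
}


def scene_type(sid: str) -> str:
    """Classify a scene from its number of destinations."""
    destinations = TRANSITIONS[sid][1]
    if not destinations:
        return "ending"
    return "interactive" if len(destinations) > 1 else "sequential"


def play(choices: Iterable[int], start: str = "1") -> List[str]:
    """Follow the table from `start`, consuming one choice (a 0-based
    option index) at every interactive scene; return the visited SIDs."""
    answers = iter(choices)
    sid = start
    path = [sid]
    while scene_type(sid) != "ending":
        destinations = TRANSITIONS[sid][1]
        if scene_type(sid) == "interactive":
            sid = destinations[next(answers)]   # the user's answer decides
        else:
            sid = destinations[0]               # sequential: single destination
        path.append(sid)
    return path


def storylines(sid: str = "1") -> List[List[str]]:
    """Enumerate every storyline reachable from `sid`."""
    destinations = TRANSITIONS[sid][1]
    if not destinations:
        return [[sid]]
    return [[sid] + tail for d in destinations for tail in storylines(d)]


for story in storylines():
    print("-".join(story))
```

Enumerating the table in this way gives the three stories and two endings listed for the Camilla example, which is a quick consistency check a script manager could run on any table before production.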
The audio stream is produced according to the following constraints:
• A single pre-defined storyline is linearly encoded so that the audio scenes that compose the story can be found in order from the first to the last. This story is stored in the clear in order to let any player play it out.
• All the other audio scenes that compose the multiple storylines are stored in the second part of the file. This part is encrypted. Playing out this part without the appropriate key would produce audio noise.
• The first and the second part are separated by 120 seconds of silence, so that, if played with a normal player, the pre-defined storyline is not immediately followed by audio noise.
By meeting the above constraints, every user can listen to a pre-defined story and only a legal interactive audiobook owner can enjoy the full features of an interactive audiobook. In fact, only those who own the decryption key can access the content stored in the second part of the file. The second task of the audiobook production is the description of all the audio scenes and of the scene transition table.
This description is done by the script manager and is produced in MPEG7-DDL format. Finally, the audio stream and the audiobook description have to be stored inside the ISO Base Media File Format. As described in Section II, the ISO Base Media File Format is organized through boxes. Hence, the AAC audio stream is stored inside a trak box, while the static MPEG7-DDL description is stored inside the user data box (udta box). In this way, thanks to the use of this standard format, every player compatible with the ISO Base Media File Format is able to read our interactive audiobook file. Once the file is completed, it is passed to the security manager that adds protection to it.
C. Audiobook Protection
For security reasons, the audiobook file has the following data watermarked in it, as shown in Figure 4:
• α, the key for decrypting the audio stream;
• k, the key for decrypting the audio description $A_{desc}$;
• $W_{ID} = H(E_k(A_{desc}))$, for the weak integrity check;
• SID, the audio scene identifier (watermarked inside every audio scene, with the exception of ending scenes).
Data are hidden in the file using classical spread-spectrum techniques and the watermarking key of the user's software player. Notice that the software player must know the watermarking key in order to read the watermarked data. This key is hidden in the player's code with suitable software engineering techniques. Copies of the software player released to different users use different watermarking keys. Therefore each interactive audiobook is released for a specific instance of the audiobook player, by embedding all the information using that player's watermarking key. The α key is stored inside the audiobook file as it has to be provided to the software player. The key is stored in the first audio scene and the player can retrieve it while playing it out. The k key is needed by the player to decrypt the audio scene descriptions and the scene transition table.
This key is watermarked inside the first audio scene. $W_{ID} = H(E_k(A_{desc}))$, where $H$ is a hash function and $E_k$ denotes encryption with the key k, is watermarked to protect the audiobook from unauthorized alteration. Integrity is checked via a lightweight verification procedure that compares the hash of the whole (encrypted) audiobook description against $W_{ID}$. Note that, for performance reasons, the use of stronger cryptographic tools is avoided. A unique SID (Scene IDentifier) is watermarked inside every audio scene (with the exception of ending scenes). This is done to uniquely identify each audio scene, as this knowledge is needed by the scene manager to activate the interaction manager.
D. Audiobook Play Out
The player, while rendering the first audio scene, extracts the embedded watermark data: $W_{ID}$ (for weak integrity verification), k (for decrypting the audio scene description and the scene transition table), and α (for decrypting the second part of the audio stream).
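The key-driven embedding and extraction mentioned above can be illustrated with a toy spread-spectrum sketch. It is a simplification under explicit assumptions, not the scheme used for AAC audio: it works on raw time-domain samples, the function names (pn_sequence, embed, extract), the chip count and the embedding strength are hypothetical, and the strength is chosen for a clear demonstration rather than for imperceptibility.

```python
import math
import random
from typing import List, Sequence


def pn_sequence(key: int, length: int) -> List[int]:
    """Key-dependent pseudo-random sequence of +1/-1 chips."""
    rng = random.Random(key)
    return [rng.choice((-1, 1)) for _ in range(length)]


def embed(samples: Sequence[float], bits: Sequence[int], key: int,
          chips_per_bit: int = 4096, strength: float = 0.05) -> List[float]:
    """Spread each bit over `chips_per_bit` samples using the keyed sequence."""
    needed = len(bits) * chips_per_bit
    if len(samples) < needed:
        raise ValueError("host signal too short for the payload")
    pn = pn_sequence(key, needed)
    marked = list(samples)
    for i in range(needed):
        sign = 1 if bits[i // chips_per_bit] else -1
        marked[i] += strength * sign * pn[i]
    return marked


def extract(samples: Sequence[float], n_bits: int, key: int,
            chips_per_bit: int = 4096) -> List[int]:
    """Recover the bits by correlating each block with the keyed sequence."""
    pn = pn_sequence(key, n_bits * chips_per_bit)
    bits = []
    for b in range(n_bits):
        lo = b * chips_per_bit
        corr = sum(samples[lo + j] * pn[lo + j] for j in range(chips_per_bit))
        bits.append(1 if corr > 0 else 0)
    return bits


# Hypothetical host signal (a 440 Hz tone) and payload.
payload = [1, 0, 1, 1, 0, 0, 1, 0]
host = [0.5 * math.sin(2 * math.pi * 440 * n / 44100)
        for n in range(len(payload) * 4096)]
marked = embed(host, payload, key=2007)
recovered = extract(marked, len(payload), key=2007)
print(recovered == payload)
```

The sketch shows why the player needs the watermarking key: without it the pseudo-random sequence cannot be regenerated, so the correlation that recovers each bit cannot be computed.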
Fig. 4. Information watermarked inside the audiobook file. (Figure labels: SID1, α, K, WID, SID2; Clear Audio Data; AS21, AS22, AS31, AS32; Silence; Encrypted Audio Data.)
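The weak integrity check lends itself to a compact sketch. The fragment below assumes SHA-256 as the hash function H, which the paper does not specify, and treats the encrypted description as opaque bytes read from the file; the function names and the sample bytes are hypothetical.

```python
import hashlib
import hmac


def weak_integrity_id(encrypted_description: bytes) -> bytes:
    """Compute W_ID as the hash of the encrypted audiobook description.
    SHA-256 is an assumption of this sketch, not the paper's choice."""
    return hashlib.sha256(encrypted_description).digest()


def integrity_ok(encrypted_description: bytes, watermarked_wid: bytes) -> bool:
    """Compare the recomputed hash with the W_ID extracted from the watermark."""
    recomputed = weak_integrity_id(encrypted_description)
    return hmac.compare_digest(recomputed, watermarked_wid)


# Hypothetical encrypted description as stored in the audiobook file.
stored = b"\x8f\x12 hypothetical encrypted description bytes"
wid = weak_integrity_id(stored)          # value embedded at protection time

print(integrity_ok(stored, wid))                 # unmodified file
print(integrity_ok(stored + b"tampered", wid))   # altered description
```

Since the hash is taken over the encrypted description, the player can run this check before decrypting anything, and it interrupts reproduction when the comparison fails.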
After extraction, the player checks the weak integrity by computing $H(E_k(A_{desc}))$ and comparing it with the retrieved $W_{ID}$. If the integrity check fails, reproduction is interrupted; otherwise the audiobook description and the second part of the audio track are decrypted using the retrieved k and α, and reproduction of the audiobook continues normally.
ACKNOWLEDGEMENTS
This work has been partially supported by the Italian M.I.U.R. under the MOMA initiative.
IV. CONCLUSIONS AND FUTURE WORK
In this paper we proposed an architecture to produce interactive audiobooks. We showed that users can play an active role in the story development. Transparency and security were also considered. By using the ISO Base Media File Format, the AAC and the MPEG7-DDL, our approach can be directly implemented on systems that support 3GP and MP4 files. Security is taken into account by developing a protection mechanism that ensures that i) only legitimate users can enjoy the full potential of an interactive audiobook and ii) the audiobook cannot be altered. A future development of our system is the integration with a multimodal recognition system. The simplicity of our proposal may help expand the audiobook into the entertainment and educational markets.
REFERENCES
Business Communications Company Inc., "Digital Home Entertainment", March 2006. http://www.bccresearch.com/
B. Pfitzmann and M. Schunter, "Asymmetric Fingerprinting," in Advances in Cryptology - EUROCRYPT '96, LNCS 1070, 1996, pp. 84-95.
Audio Publishers Association. [Online]. Available: www.audiopub.org
Information technology - Coding of audio-visual objects - Part 12: ISO base media file format. ISO/IEC 14496-12:2005.
K. Brandenburg, "Perceptual coding of high quality digital audio," in Applications of Digital Signal Processing to Audio and Acoustics, Chapter 2, pp. 39-83, Kluwer, Boston, 1998.
M. Bosi, K. Brandenburg, S. Quackenbush, L. Fielder, K. Akagiri, H. Fuchs, M. Dietz, J. Herre, G. Davidson, Y. Oikawa, "ISO/IEC MPEG-2 Advanced Audio Coding," Journal Audio Eng. Soc., Vol. 45, No. 10, pp. 789-814, October 1997.
J. Herre, H. Purnhagen, "General audio coding," in The MPEG-4 Book, Chapter 11, Prentice Hall, 2002.
B. Pfitzmann, "Trials of Traced Traitors," in Information Hiding: First International Workshop, LNCS 1174, 1996, pp. 49-64.
B. Pfitzmann, M. Waidner, "Anonymous Fingerprinting," in Advances in Cryptology - EUROCRYPT '97, LNCS 1233, 1997, pp. 88-102.
S. Cheng, H. Yu and Z. Xiong, "Enhanced spread spectrum watermarking of MPEG-2 AAC audio," in Proceedings of the IEEE International Conf. on Acoustics, Speech, and Signal Processing, pp. 3728-3731, 2002.
I. J. Cox, J. Kilian, F. T. Leighton and T. Shamoon, "Secure spread spectrum watermarking for multimedia," IEEE Trans. Image Process. 6 (Dec.), pp. 1673-1687, 1997.
MPEG7-DDL Home Page. [Online]. Available: http://archive.dstc.edu.au/mpeg7-ddl/
I. Cox, M. Miller and J. Bloom, "Digital Watermarking", Morgan Kaufmann, 2001.