The Lehigh Steel Collection


The Lehigh Steel Collection: A New Open Dataset for Document Recognition Research*

Barri Bruno and Daniel Lopresti

Department of Computer Science and Engineering, Lehigh University, 19 Memorial Drive West, Bethlehem, PA 18015 USA

ABSTRACT

Document image analysis is a data-driven discipline. For a number of years, research was focused on small, homogeneous datasets such as the University of Washington corpus of scanned journal pages. More recently, library digitization efforts have raised many interesting problems with respect to historical documents and their recognition. In this paper, we present the Lehigh Steel Collection (LSC), a new open dataset we are currently assembling which will be, in many ways, unique in the field. LSC is an extremely large, heterogeneous set of documents dating from the 1960's through the 1990's relating to the wide-ranging research activities of Bethlehem Steel, a now-bankrupt company that was once the second-largest steel producer and the largest shipbuilder in the United States. As a result of the bankruptcy process and the disposition of the company's assets, an enormous quantity of documents (we estimate hundreds of thousands of pages) was left abandoned in buildings recently acquired by Lehigh University. Rather than see this history destroyed, we stepped in to preserve a portion of the collection via digitization. Here we provide an overview of LSC, including our efforts to collect and scan the documents, a preliminary characterization of what the collection contains, and our plans to make this data available to the research community for non-commercial purposes.

Keywords: OCR, Graphics Recognition, Document Analysis, Performance Evaluation, Datasets

1. BACKGROUND AND HISTORY

1.1 Bethlehem Steel Corporation and Homer Research Labs

The Bethlehem Steel Corporation, founded in 1857 and located at the base of South Mountain in Bethlehem, Pennsylvania, was once the largest shipbuilder and second-largest steel producer in the United States. The corporation’s downfall and eventual bankruptcy are widely considered one of the most prominent examples of the country’s shift away from industrial manufacturing. Through the late 1800’s and early 1900’s Bethlehem Steel prospered, supplying armor plating and ships for World Wars I and II. In 1958, the corporation invested $25 million to convert the top of South Mountain into a 31-acre state-of-the-art research facility (Figure 1). Three years later, Bethlehem Steel Chairman Arthur B. Homer dedicated the eight-building facility and officially opened the Homer Research Labs. The complex eventually employed nearly 1,000 engineers, scientists, and other technicians and became the world’s largest steel laboratory, specializing in process metallurgy, ceramics, and thermodynamics.

The corporation's prosperity following World War II lasted about two decades, during which the United States saw little foreign competition in steel manufacturing, but by the 1970’s Bethlehem Steel had begun to decline. Executives placed little emphasis on modernizing techniques to keep up with industry changes, and small foreign factories were able to undersell the large U.S. corporation. Furthermore, many Bethlehem Steel employees began retiring and collecting the generous pensions they had been promised during the corporation’s heyday. As more and more plants closed due to

* Send correspondence to Daniel Lopresti.
Barri Bruno: E-mail: [email protected], Telephone: (516) 359-9276
Daniel Lopresti: E-mail: [email protected], Telephone: (610) 758-5782

diminishing profits, Bethlehem Steel felt the growing pressure of paying out these benefits and severance to the unemployed. In 1986, the corporation was forced to sell the five largest Homer Lab buildings and much of the surrounding land to Lehigh University, which converted the space into its Mountaintop Campus. Today this space houses much of the university’s College of Education and resources for Civil Engineering, Chemical Engineering, and Biological Sciences. Bethlehem Steel was able to keep three buildings and about 220,000 square feet of lab space. Nine years later, the corporation was forced to close the steel manufacturing plants at the base of South Mountain; however, the Homer Research Labs remained open, still employing roughly 75 researchers.

On October 5, 2001, Bethlehem Steel finally filed for bankruptcy, and in 2003 the International Steel Group (ISG) bought all of the corporation’s assets, including the research labs. In 2005, Mittal Steel Co. purchased Homer Labs from ISG and shut down the complex for good. A sixth building was purchased by Lehigh University and the remaining two were closed by the end of the year. Eventually, the property and plants at the base of South Mountain were sold to Sands BethWorks in an attempt to revitalize the Bethlehem community. By 2009, this space had been converted into a performing arts center, a casino and resort, and three outdoor music venues, but the buildings at the top of the mountain remained intact and unoccupied.

In May of 2013, Lehigh University signed agreements to purchase the remaining two buildings of the eight-building complex that used to be Homer Research Labs. Much of the large machinery was cleared out by Mittal Steel and one work bay was outfitted for student research projects. The lab offices (Figure 1), however, remained the untouched home of hundreds of thousands of original documents from the corporation’s last few decades.
The Lehigh Steel Collection (LSC) is a fraction of these papers that have been collected, scanned, and organized to represent a potentially large and unique compilation of documents [1-5].

Figure 1. Left: Office from Building C as of July 2013. Although in greater disarray than most other offices, this room gives an example of the wealth of data contained here. Right: Hallway of Building C as of July 2013. Evidence of extensive water damage and mold can be found throughout the unoccupied buildings. Photo courtesy of Glenn Piper.

1.2 Current Datasets and the LSC

The LSC is an extremely large heterogeneous collection with the potential for more homogeneous sub-collections. Based on working estimates, the LSC could grow to include hundreds of thousands of unedited modern documents. Due to the varied nature of the collection, it shares and even extends the attributes of many current datasets. Unlike many pre-existing datasets, which contain images scanned at 300 dpi in bitonal or grayscale only, the LSC is scanned at 600 dpi, and color is preserved in all documents that naturally contain it. Like the Persian Heritage Image Binarization Dataset (PHIBD) of 2012, the LSC contains numerous documents exhibiting different types of degradation. These include, but are not limited to, fading and imperfections in the paper itself. However, the PHIBD contains manuscript images from historical documents, some of which consist of a few lines from a page of writing [6]. The full LSC contains only images of entire pages. Moreover, the LSC exhibits

characteristic degradation such as malformed letters and artifacts of the duplication process, which can also be found in the Eisenhower Communiques. Still, the Eisenhower collection consists of only 610 facsimiles, whereas the LSC has already grown to encompass thousands of documents [9].

The LSC also contains employee and customer information. Like the RIMES dataset of handwritten documents and faxes, the LSC includes many examples of personally identifiable information (PII). However, whereas RIMES uses fabricated identities [10], the identities in the LSC are real and must be protected where necessary. A full discussion of PII and identity protection efforts can be found in the next section.

Much of our current information about this collection lies in the organizational structure we have recorded. Unfortunately, unlike datasets such as Reuters-21578, which was characterized and assembled by Reuters personnel, we have almost no inside information on the documents and the data they contain [8]. Nevertheless, we see potential in the organizational information we do have. We compare the LSC to the Enron dataset, which contains email correspondence and information about the folders containing it [7]. Just as that corpus is suitable for evaluating email classification methods, the LSC may be useful for evaluating classification of printed correspondence.

We propose that, due to the size of the dataset, numerous sub-collections may be organized to deal with specific document and graphics analysis problems. These may include, but are not limited to, the following:

Segmentation: The LSC contains countless variations of printed text, handwriting, tables, images, and other possible segments for which ground truth is useful. Many sub-collections of the LSC are possible for testing general segmentation algorithms.

Technical symbol identification: The LSC includes a series of electrical diagrams which contain certain technical symbols. Similar to many of the TC-10 datasets [11], these models may be collected for testing symbol identification algorithms.

Signature verification: Within the LSC are the signatures of various Bethlehem Steel employees and contracted vendors. Simple attempted forgeries could easily be created and merged to generate a signature verification training set and test set.

Named entity detection and organization: Recently, researchers at the Asia Research Institute developed a basic extraction system for determining author affiliations in a group of scholarly papers [12]. Given that much of the LSC exists as correspondence with vendors and employees, we put forward a similar problem: developing algorithms for finding common threads within correspondence and among its authors.

1.3 Privacy

As previously mentioned, the LSC contains a fair amount of personally identifiable information (PII). The United States Department of Commerce defines PII as “any information about an individual…including any information that can be used to distinguish or trace an individual’s identity and any information that is linked to an individual.” This includes, but is not limited to, name, social security number, and biometric data. In contrast, a list of only credit scores is not considered PII because an individual cannot be distinguished from the data alone [13].

We are currently in discussions with the Office of the General Counsel of Lehigh University on this matter, and specific guidelines will be formulated prior to dataset release. Our primary concern is the privacy of those included. We also want to protect the interests of the former Bethlehem Steel, its subsidiaries, vendors, and current asset holders. We consider the LSC to be discarded or abandoned documents, and in the United States there is no common law expectation of privacy for discarded materials. There are, however, limits to what can legally be taken from a company's garbage [14]. Our discussions with attorneys representing Lehigh and Mittal Steel will determine the limits of what can be released in an open dataset. We expect, however, that we will soon be able to make a significant fraction of this large quantity of documents openly available to the research community for non-commercial purposes.

2. DIGITIZATION EFFORTS

In this section we discuss the general procedure for collecting, scanning, and organizing the documents. We refer to two specific types of documents when scanning: packets and loose documents. For our purposes, any documents that sit directly in the folder under discussion, and not in any deeper level of organization, are considered “loose” in that folder. For example, if a paper is contained in a file drawer but not placed in another folder, we refer to it as loose in the drawer. A packet is a group of papers that logically go together. Because “logical coherency” is difficult to describe, we define a packet as any papers that are bound together, contain page numbers suggesting they come from one document, share a heading and time stamp, or share other similar features. Because the possibilities are nearly unlimited, some packets may be determined by human interpretation.

2.1 Collection and Documentation Procedure

First, a room was chosen to scan. For the initial test, this room had to be fairly clean and contain a myriad of document types; thus our documents contain no mold, which could potentially be harmful to the researchers and equipment. Once the room was chosen, the layout was thoroughly documented in pictures. Photos were also taken of some of the larger documents, such as engineering drawings, which needed to be collected and scanned separately. Once the entire layout was recorded, documents could be removed. Each file cabinet and drawer was emptied one at a time, with special care to preserve the order of the folders and documents inside. The contents of each drawer were placed in a file box and labeled according to the cabinet, drawer, and location of the documents. A sample box label can be seen in Figure 2.

Figure 2. The label on Box 1 shows the documents come from room 207, a file cabinet labeled “CRB 29”, and a drawer labeled “RAS 2 Stoves”. The letter “A” denotes that this box is the first of multiple boxes from this drawer and that these documents start at the front of the drawer.

2.2 Scanning and Organizing the Collection

All documents were scanned on an Aficio MP 6002 printer/scanner/copier. Sheets from 5.5” x 8.5” to 11” x 17” can be scanned on the flatbed scanner, while 8.5” x 11” sheets can be scanned on one or two sides by an auto-feeder with a 150-sheet capacity. The Aficio has a true copy resolution of 600x600 dpi and a grayscale gradient with 256 levels. All documents were scanned on auto-size at 600 dpi. Smaller documents which could not be automatically sized were scanned as A5. The auto-color option allowed us to scan in full color, grayscale, or bitonal according to contents. Output formats include TIFF/JPEG and PDF [15]. Each document was scanned and sent as a PDF to a specific directory. File names and hierarchies were recorded after each document was scanned, and files were then moved to the correct directory.

Documents were scanned and stored as a hierarchy of directories. Starting with the full box, loose papers were scanned and recorded. Next, loose packets (those that were loose in the box but show coherency) were scanned and recorded. Once all the loose documents were dealt with, each folder was removed one by one and scanned according to the same procedure. The loose papers within the folder were scanned first, followed by the packets. This continued through as many levels as necessary until all papers in the box were scanned. If a packet was stapled, clipped, or bound in any way, that binding was temporarily removed for scanning and then redone once the files were recorded.

Order was also retained if deemed necessary. Generally, loose papers in a folder were not kept in order. Packets, however, were kept in order. Similarly, if a bound packet consisted of loose papers and packets, the order was deemed important based on the higher-level binding and was preserved. Figure 3 illustrates a sample hierarchy of documents.

Box 1
    Loose
        Loose Papers (A.pdf)
        Loose Packet 1 (B.pdf)
        Loose Packet 2 (C.pdf)
    Folder 1
        Packet 1.1
            Loose (D.pdf)
            Packet (E.pdf)
        Packet 1.2
            Loose (F.pdf)
            Packet (G.pdf)
    Folder 2
        Loose Papers (H.pdf)
        Packet
            Loose 1 (I.pdf)
            Packet (J.pdf)
            Loose 2 (K.pdf)

Figure 3. The chart above represents a sample box of documents. Within the box are two folders and a few loose documents. Among the loose documents are two packets and assorted papers. The loose papers will be scanned together and saved as A.pdf while each packet will be scanned by itself and saved as B.pdf and C.pdf. The first folder contains two packets. Due to the organization, we can infer that each packet contains a loose sheet and a subordinate packet. Each will be saved as a separate PDF file. Finally, folder 2 contains loose papers saved as H.pdf and a packet. In this case, the packet must have been bound, suggesting that order should be preserved. Thus, we get a few loose papers (I.pdf) followed by a packet (J.pdf) and more loose papers (K.pdf).
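The sample hierarchy of Figure 3 can also be written down as a small tree, which makes the scanning order explicit. The following is a minimal sketch in Python, not the recording scheme actually used for the LSC: the `Node` class and `walk` function are hypothetical names introduced here, and only the labels and file names are taken from the figure.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional, Tuple


@dataclass
class Node:
    """One level of the physical hierarchy: a box, folder, packet, or loose group."""
    label: str
    pdf: Optional[str] = None  # file name if this node was scanned as a single PDF
    children: List["Node"] = field(default_factory=list)


def walk(node: Node, path: Tuple[str, ...] = ()) -> Iterator[Tuple[str, str]]:
    """Yield (hierarchy path, PDF name) pairs, visiting children in stored order."""
    here = path + (node.label,)
    if node.pdf is not None:
        yield "/".join(here), node.pdf
    for child in node.children:
        yield from walk(child, here)


# The sample box from Figure 3 (hypothetical encoding; labels follow the figure).
box = Node("Box 1", children=[
    Node("Loose", children=[
        Node("Loose Papers", "A.pdf"),
        Node("Loose Packet 1", "B.pdf"),
        Node("Loose Packet 2", "C.pdf"),
    ]),
    Node("Folder 1", children=[
        Node("Packet 1.1", children=[Node("Loose", "D.pdf"), Node("Packet", "E.pdf")]),
        Node("Packet 1.2", children=[Node("Loose", "F.pdf"), Node("Packet", "G.pdf")]),
    ]),
    Node("Folder 2", children=[
        Node("Loose Papers", "H.pdf"),
        Node("Packet", children=[
            Node("Loose 1", "I.pdf"),
            Node("Packet", "J.pdf"),
            Node("Loose 2", "K.pdf"),
        ]),
    ]),
])

for location, pdf in walk(box):
    print(f"{pdf}\t{location}")
```

Because children are stored in the order they were found, a depth-first walk visits loose material before folders and keeps the order inside bound packets, which mirrors the procedure described in this section.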

A few conditional changes to the procedure were added along the way for efficiency. If a packet contained more than one double-sided sheet and more than one single-sided sheet, these were scanned separately to save time and make use of the dual scanning features of the sheet feeder. Both file names were then recorded under the same packet. Similarly, if a packet consisted of multiple sheets and a Post-it or small note, the sheets were fed through the auto-feeder and the note was scanned separately. Due to their size, smaller notes were scanned on the flatbed as A5-size documents. The PDF containing the sheets and the PDF containing the note were both recorded under the same packet. Although most documents were scanned with the sheet feeder, 11” x 17” engineering drawings were scanned on the flatbed, as were some facsimiles where the paper was deemed too fragile to run through the auto-feeder.

Most documents in the LSC were scanned and saved according to the above protocol; however, a few restrictions apply. The largest document the flatbed scanner can accommodate is an 11” x 17” sheet. Many of the boxes contained large drawings or layouts that could not be included in the current scanning run. We outline our intentions for these oversized documents in the future plans section. Certain publications were also left out for ease of scanning. Any publication which was professionally bound (e.g., a book or pamphlet) could not easily be taken apart for scanning and reattached, so such publications were excluded.

2.3 Characterizing the Collection

Currently, close to 30,000 pages have been collected and roughly one third of these have been scanned and organized. The typical document size is 8.5” x 11”, with the smallest documents scanned at 148mm x 210mm (A5 size) and the largest in the set scanned at 11” x 17”. There are also over 3,500 oversized drawings and data readouts that will need to be scanned professionally, and ten published pamphlets and books that could not be included. About 90% of the documents consist of printed text and graphics, and 15%-20% of those contain handwritten notes either as cover sheets or in the margins. A majority of the documents are printed on standard white paper; however, a few facsimiles exist on thin fax paper as well. All the dated documents thus far have been from the 1970’s, 1980’s, 1990’s and 2000’s. Although not all documents are originals (some are photocopies), they are all original with respect to the collection as a whole. No documents were created for the LSC dataset. Most documents are bitonal or grayscale, but some do contain color. Aside from simple highlights, these are mostly graphs and drawings. Figures 4, 5, and 6 show some samples of documents that have been found. Pages can be seen at

Figure 4. Above is a document from the LSC dataset. It comes from a company fax detailing certain specifications for purchasing brick.

Figure 5. Above is a full color layout plan from a collection of photocopied drawings.

Figure 6. Above is a document from the LSC dataset. It comes from a packet of information on a specific purchase. The name and signature associated with the page have been removed to ensure no PII is leaked prior to discussions with attorneys representing Lehigh and Mittal Steel.
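The attributes used in this section to characterize the collection (printed text, handwriting, color) lend themselves to a simple per-document record from which shares like those quoted above could be recomputed as scanning proceeds. The following is a minimal sketch under stated assumptions: the `PageRecord` fields and the `share` helper are hypothetical names, and the four sample records are invented placeholders, not LSC statistics.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class PageRecord:
    """Hypothetical per-document attributes; field names are illustrative only."""
    pdf: str
    printed: bool
    handwriting: bool
    color: bool


def share(records: Iterable[PageRecord], attribute: str) -> float:
    """Return the fraction of records for which the named boolean attribute is True."""
    rows = list(records)
    if not rows:
        return 0.0
    return sum(bool(getattr(row, attribute)) for row in rows) / len(rows)


# Invented placeholder data, used only to exercise the helper.
sample = [
    PageRecord("A.pdf", printed=True, handwriting=False, color=False),
    PageRecord("B.pdf", printed=True, handwriting=True, color=False),
    PageRecord("C.pdf", printed=True, handwriting=False, color=True),
    PageRecord("D.pdf", printed=False, handwriting=False, color=True),
]

for name in ("printed", "handwriting", "color"):
    print(f"{name}: {share(sample, name):.0%}")
```

Keeping the attributes as plain booleans makes it straightforward to select candidate sub-collections later, for example all documents with handwriting.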

3. FUTURE PLANS

We estimate that it takes a researcher one day to scan a box containing roughly 2,000 sheets. At the rate the LSC has been growing, it should take 1-2 more months to scan the first office in its entirety. Once an entire office has been scanned and organized, more accurate document statistics will be collected and possible sub-collections, such as those previously mentioned, will be looked at more closely. We plan to have the larger drawings and documents scanned professionally and entered into the collection.

The whole collection will then be run through Tesseract OCR [16] and analyzed. We recognize that Tesseract is not the most accurate OCR software on the market; however, it is a free, well-known program which can be run from the command line. In this respect, it fits our needs for initial testing [18]. Once the full OCR output has been gathered, it will likely require some post-processing. Studies have highlighted the importance of post-processing OCR results to increase their accuracy. We are exploring techniques such as variations of Statistical Language Modeling (SLM) and testing against the dynamic Google spelling dictionary [17, 19, 20]. The next logical step may be to crowd-source some of the ground truth work and round out the dataset with full documentation of each image. Barring unforeseen circumstances, we hope to have some of the LSC available for public use by the time of the 2014 Document Recognition and Retrieval Conference.

Our current effort is collecting metadata for classifying the entire corpus more accurately. We have been working to keep a database of each document and its attributes. This includes, but is not limited to, the presence of color, handwriting, graphs, signatures, logos, and drawings.
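Because Tesseract can be run from the command line, the planned OCR pass could be scripted as a simple batch job. The following is a minimal sketch, not the authors' pipeline. It assumes the `tesseract` executable is installed and on the PATH, and that pages have been exported from the scanned PDFs to TIFF images in a single directory (an assumption, since the collection is stored as PDF). The function names and directory layout are hypothetical.

```python
import subprocess
from pathlib import Path
from typing import List


def tesseract_command(image: Path, out_dir: Path, lang: str = "eng") -> List[str]:
    """Build `tesseract <image> <outputbase> -l <lang>`; Tesseract writes <outputbase>.txt."""
    output_base = out_dir / image.stem
    return ["tesseract", str(image), str(output_base), "-l", lang]


def ocr_directory(image_dir: Path, out_dir: Path, dry_run: bool = True) -> List[List[str]]:
    """Build one command per *.tif file in image_dir, in sorted order.

    With dry_run=True (the default) nothing is executed and nothing is created;
    the commands are only returned so they can be inspected first.
    """
    commands = [tesseract_command(p, out_dir) for p in sorted(image_dir.glob("*.tif"))]
    if not dry_run:
        out_dir.mkdir(parents=True, exist_ok=True)
        for command in commands:
            # check=True raises CalledProcessError if Tesseract exits with an error.
            subprocess.run(command, check=True)
    return commands
```

Calling `ocr_directory(Path("pages"), Path("ocr"))` returns the planned commands for a hypothetical `pages` directory without running them; passing `dry_run=False` executes them and leaves one text file per page image, ready for the post-processing step.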

4. DISCUSSION AND CONCLUSION

The LSC dataset is currently unlike any other we have seen. This collection is unique because it is a large, modern, heterogeneous set. A number of the current datasets involve historic documents, while the LSC contains data from the last few decades. It is likely that for some kinds of research, sub-collections of documents that share a specific property will have to be identified. A dataset like this also makes possible new kinds of research questions, for example, clustering documents based on topics or identifying threads in communications across multiple disconnected documents. We see great potential in a set with these attributes and are pursuing its release for non-commercial purposes.

ACKNOWLEDGMENTS

This work is supported in part by a DARPA IPTO grant administered by Raytheon BBN Technologies. Additional support was provided by the Lehigh University Summer Mountaintop Project.

REFERENCES

1. Shope, Dan, and Kurt Blumenau. "Vaunted Steel Research Lab Closing." Morning Call [Bethlehem, PA] 23 04 2005, Web. 22 Jul. 2013.
2. Shope, Dan, and Kurt Blumenau. "New Owners Shut Down Former Homer Labs." Morning Call [Bethlehem, PA] 22 04 2005, Web. 22 Jul. 2013.
3. Shope, Dan. "Homer Research Laboratories Has Only a Glorious History to Carry On." Morning Call [Bethlehem, PA] 31 12 2005, Web. 22 Jul. 2013.
4. Shope, Dan. "Fate of Steel Remnant Uncertain." Morning Call [Bethlehem, PA] 12 12 2004, Web. 22 Jul. 2013.
5. Loomis, Carol. "The Sinking Of Bethlehem Steel A hundred years ago one of the 500’s legendary names was born. Its decline and ultimate death took nearly half that long. A FORTUNE autopsy." Fortune [New York, NY] 05 04 2004, Web. 22 Jul. 2013.
6. Nafchi, Hossein. "Persian Heritage Image Binarization Dataset (PHIBD 2012)." IAPR TC11. N.p., 03 Jul 2013. Web. 2 Aug 2013.
7. Klimt, Bryan, and Yiming Yang. "Introducing the Enron Corpus." First Conference on Email and Anti-Spam (CEAS) Proceedings. (2004): n. page. Web. 20 Nov. 2013.
8. Lewis, David. "Reuters-21578 Text Categorization Test Collection." UCI Knowledge Discovery in Databases Archive. AT&T Labs - Research, 26 Sept 1997. Web. 18 Nov 2013.
9. "The Eisenhower Communiques." BYU Harold B. Lee Library. Brigham Young University, n.d. Web. 18 Nov 2013.
10. The Rimes Database. N.p., 02 Mar 2011. Web. 2 Aug 2013.
11. "Description of Final Tests." TC 10 Technical Committee on Graphics Representation. N.p. Web. 2 Aug 2013.
12. Do, Huy, Muthu Chandrasek, and Philip Cho. "Extracting and Matching Authors and Affiliations in Scholarly Documents." Indianapolis: 2013. Web. 1 Aug. 2013.
13. McCallister, Erika, Tim Grance, and Karen Scarfone. United States. Department of Commerce. Guide to Protecting the Confidentiality of Personally Identifiable Information (PII). Gaithersburg: 2010. Web.
14. CALIFORNIA v. GREENWOOD. The Oyez Project at IIT Chicago-Kent College of Law. 30 July 2013.
15. "Ricoh Aficio MP 6002/MP 7502/MP 9002." Ricoh-USA. Ricoh-USA. Web. 4 Aug 2013.
16. "tesseract-ocr." Google Code. Google Code. Web. 5 Aug 2013.
17. Bassil, Youssef, and Mohammad Alwani. "OCR Post-Processing Error Correction Algorithm Using Google's Online Spelling Suggestion." Journal of Emerging Trends in Computing and Information Sciences. 3.1 (2012): n. page. Web. 24 Nov. 2013.
18. Smith, Ray. "An Overview of the Tesseract OCR Engine." Document Analysis and Recognition. 2. (2007): 629-633. Web. 24 Nov. 2013.
19. Tong, Xiang, and David Evans. "A Statistical Approach to Automatic OCR Error Correction in Context." Workshop on Very Large Corpora. (1996): 88-100. Print.
20. Zhuang, Li, and Xiaoyan Zhu. "An OCR Post-Processing Approach Based on Multi-Knowledge." Knowledge-Based Intelligent Information and Engineering Systems. (2005): 346-352. Web. 24 Nov. 2013.