Minolta PS 7000 - Carnegie Mellon University

Million Book Universal Library Project: Manual for Metadata Capture, Digitization, and OCR

Gabrielle V. Michalek, editor. Carnegie Mellon University. May 7, 2003


Table of Contents

Data Production
Getting MARC Records From OCLC
Creating Metadata Using Dublin Core
Minolta PS 7000 – Quickscan Software Instructions
ABBYY FineReader 6.0 Instructions


Data Production

• Bitonal images with a pixel depth of 1 bit per pixel were scanned at a resolution of 600 dots per inch (dpi). Images stored as "Intel" TIFF (Tagged Image File Format) files, with the header content specified. The compression algorithm used is ITU (formerly CCITT) Group 4.

• TIFF version 5.0 is acceptable. Subject to testing, version 6.0 (or later) may also be acceptable.

• Initial-capture system includes dynamic thresholding or a similar feature to capture variability of darkness in the imprint and possibly darker (e.g., foxed) backgrounds from decay. Images should be as readable as the original pages.

• "Typical" or "expected" data to be provided for most TIFF tags (normally, the data supplied by software default settings). A specification for the TIFF header to be produced to include scanner technical information, filename, and other data, but to be in no way a burden on the production service.

• Images written in sequential order, with corresponding 8.3 file names, e.g., 00000001.tif as first image in volume sequence and 00000341.tif as 341st image in volume sequence.

• Volumes to be provided to Million Book Project by libraries with unique identifiers that conform to 8.3 format; images should be in directories named with corresponding identifier (e.g., akf3435.001 as identifier for volume will result in directory with same name, and 00000001.tif through 0000000N.tif within that directory).

• Images and directories (as specified above) to be written by Million Book Project to gold CD-ROM meeting agreed upon specifications, and using ISO9660 format.

• Skew to be within a specified range of degrees allowed.

Excerpt from NCF Million Book Proposal
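
The naming and byte-order requirements in this section can be checked automatically before a volume is written to disc. The Python sketch below is illustrative only and is not part of the Million Book Project tooling. All function names are hypothetical. It assumes that 8.3 volume identifiers use only letters and digits, that image names are lowercase, and that an "Intel" TIFF is one whose header begins with the little-endian byte-order mark "II" followed by the TIFF magic number 42.

```python
import re
import struct

# Assumption: image names are exactly eight digits plus a lowercase ".tif".
IMAGE_NAME = re.compile(r"^(\d{8})\.tif$")
# Assumption: an 8.3 identifier is 1-8 alphanumerics, a dot, 1-3 alphanumerics.
VOLUME_ID = re.compile(r"^[A-Za-z0-9]{1,8}\.[A-Za-z0-9]{1,3}$")


def is_valid_volume_id(name):
    """True if a directory name such as akf3435.001 fits the 8.3 pattern."""
    return VOLUME_ID.fullmatch(name) is not None


def missing_images(names):
    """Return sequence numbers missing from 1..highest among the image names."""
    numbers = set()
    for name in names:
        match = IMAGE_NAME.match(name)
        if match:
            numbers.add(int(match.group(1)))
    if not numbers:
        return []
    return [n for n in range(1, max(numbers) + 1) if n not in numbers]


def is_intel_tiff(header):
    """True if the first bytes of a file are a little-endian TIFF header."""
    if len(header) < 4 or header[:2] != b"II":
        return False
    (magic,) = struct.unpack("<H", header[2:4])
    return magic == 42
```

To use it on a real volume, pass the directory name to is_valid_volume_id, the list of file names in that directory to missing_images, and the first few bytes of each image file to is_intel_tiff. The sketch does not check resolution, bit depth, or Group 4 compression; those would require reading the TIFF tags.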


Getting MARC Records From OCLC


You receive MARC data from a company called OCLC, which maintains an international database of library holdings. The OCLC product you use to get the MARC record is called Connexion. It can be accessed at: http://connexion.oclc.org/

You will use Connexion to search the OCLC database to determine whether a book has already been catalogued and to export a MARC binary record.

1. Go to the Connexion URL listed above and click on the logon icon. The Authorization number is 110-250-490 and the Password is BOOKS.
2. Select the General tab.
3. Select the Admin tab.
4. Select Export Options.
5. Select the MARC option.
6. Select the Export to File option. That is all you need to fill out in this section.
7. Next, go to the Cataloging tab.
8. Go to the Search menu and select WorldCat. You will be presented with a search interface that allows you to search for materials by title, author, etc.
9. Perform your search.
10. Once you receive a hit, display the full record you believe matches the item you are looking for, and compare the record to the item in hand to confirm that they match.
11. Once the record is displayed and you are sure they match, go to the View menu and select MARC Text Area.
12. Go to the Action menu and select Export Record in MARC.
13. Save the record in the metadata file created for each book. Add the extension .mrc to each file. This will create a MARC binary record.

For materials not already cataloged, or materials that cannot be located in OCLC, you should create a Dublin Core record.
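
After export, it can be useful to confirm that a saved .mrc file really contains a MARC binary record and not, for example, a saved web page. The Python sketch below is a minimal check offered as an illustration, not part of Connexion or of this project's software. It relies on two facts about the MARC binary (ISO 2709) layout: each record begins with a 24-character leader whose first five characters give the record length in digits, and each record ends with the record terminator byte 0x1D. The function name is hypothetical, and only the first record in the data is examined.

```python
RECORD_TERMINATOR = b"\x1d"   # ends every MARC binary record
LEADER_LENGTH = 24            # fixed-size leader at the start of a record
MINIMUM_RECORD = LEADER_LENGTH + 2  # leader, empty directory, terminator


def looks_like_marc(data):
    """Rough sanity check on the first record in a block of bytes."""
    if len(data) < MINIMUM_RECORD:
        return False
    length_field = data[:5]
    if not length_field.isdigit():
        return False
    record_length = int(length_field.decode("ascii"))
    if record_length < MINIMUM_RECORD or record_length > len(data):
        return False
    # The last byte of the record must be the record terminator.
    return data[record_length - 1:record_length] == RECORD_TERMINATOR
```

To check a file, read it in binary mode and pass the bytes to looks_like_marc. A True result means only that the framing is plausible; it does not confirm that the record describes the book in hand.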


Creating Metadata Using Dublin Core


Materials that have not been catalogued should be catalogued using Dublin Core. Dublin Core is a much simpler element set than MARC, and its fields represent the lowest common denominator for cataloging any type of library holdings. To read more about Dublin Core go to: http://dublincore.org/documents/dces/

There is a Dublin Core template that produces an HTML output of a Dublin Core record. It can be accessed here: http://www.lub.lu.se/cgi-bin/nmdc.pl

Dublin Core Metadata Element Set, Version 1.1

The definitions provided here include both the conceptual and representational form of the Dublin Core elements. The Definition attribute captures the semantic concept and the Datatype and Comment attributes capture the data representation. Each Dublin Core definition refers to the resource being described. A resource is defined in [RFC2396] as "anything that has identity". For the purposes of Dublin Core metadata, a resource will typically be an information or service resource, but may be applied more broadly.

Element: Title
Name: Title
Identifier: Title
Definition: A name given to the resource.
Comment: Typically, a Title will be a name by which the resource is formally known.

Element: Creator
Name: Creator
Identifier: Creator
Definition: An entity primarily responsible for making the content of the resource.
Comment: Examples of a Creator include a person, an organization, or a service. Typically, the name of a Creator should be used to indicate the entity.

Element: Subject
Name: Subject and Keywords
Identifier: Subject
Definition: The topic of the content of the resource.
Comment: Typically, a Subject will be expressed as keywords, key phrases or classification codes that describe a topic of the resource. Recommended best practice is to select a value from a controlled vocabulary or formal classification scheme.

Element: Description
Name: Description
Identifier: Description
Definition: An account of the content of the resource.
Comment: Description may include but is not limited to: an abstract, table of contents, reference to a graphical representation of content or a free-text account of the content.

Element: Publisher
Name: Publisher
Identifier: Publisher
Definition: An entity responsible for making the resource available.
Comment: Examples of a Publisher include a person, an organization, or a service. Typically, the name of a Publisher should be used to indicate the entity.

Element: Contributor
Name: Contributor
Identifier: Contributor
Definition: An entity responsible for making contributions to the content of the resource.
Comment: Examples of a Contributor include a person, an organization, or a service. Typically, the name of a Contributor should be used to indicate the entity.

Element: Date
Name: Date
Identifier: Date
Definition: A date associated with an event in the life cycle of the resource.
Comment: Typically, Date will be associated with the creation or availability of the resource. Recommended best practice for encoding the date value is defined in a profile of ISO 8601 [W3CDTF] and follows the YYYY-MM-DD format.

Element: Type
Name: Resource Type
Identifier: Type
Definition: The nature or genre of the content of the resource.
Comment: Type includes terms describing general categories, functions, genres, or aggregation levels for content. Recommended best practice is to select a value from a controlled vocabulary (for example, the working draft list of Dublin Core Types [DCT1]). To describe the physical or digital manifestation of the resource, use the FORMAT element.

Element: Format
Name: Format
Identifier: Format
Definition: The physical or digital manifestation of the resource.
Comment: Typically, Format may include the media-type or dimensions of the resource. Format may be used to determine the software, hardware or other equipment needed to display or operate the resource. Examples of dimensions include size and duration. Recommended best practice is to select a value from a controlled vocabulary (for example, the list of Internet Media Types [MIME] defining computer media formats).

Element: Identifier
Name: Resource Identifier
Identifier: Identifier
Definition: An unambiguous reference to the resource within a given context.
Comment: Recommended best practice is to identify the resource by means of a string or number conforming to a formal identification system. Example formal identification systems include the Uniform Resource Identifier (URI) (including the Uniform Resource Locator (URL)), the Digital Object Identifier (DOI) and the International Standard Book Number (ISBN).

Element: Source
Name: Source
Identifier: Source
Definition: A reference to a resource from which the present resource is derived.
Comment: The present resource may be derived from the Source resource in whole or in part. Recommended best practice is to reference the resource by means of a string or number conforming to a formal identification system.

Element: Language
Name: Language
Identifier: Language
Definition: A language of the intellectual content of the resource.
Comment: Recommended best practice for the values of the Language element is defined by RFC 1766 [RFC1766], which includes a two-letter Language Code (taken from the ISO 639 standard [ISO639]), followed optionally by a two-letter Country Code (taken from the ISO 3166 standard [ISO3166]). For example, 'en' for English, 'fr' for French, or 'en-uk' for English used in the United Kingdom.

Element: Relation
Name: Relation
Identifier: Relation
Definition: A reference to a related resource.
Comment: Recommended best practice is to reference the resource by means of a string or number conforming to a formal identification system.

Element: Coverage
Name: Coverage
Identifier: Coverage
Definition: The extent or scope of the content of the resource.
Comment: Coverage will typically include spatial location (a place name or geographic coordinates), temporal period (a period label, date, or date range) or jurisdiction (such as a named administrative entity). Recommended best practice is to select a value from a controlled vocabulary (for example, the Thesaurus of Geographic Names [TGN]) and that, where appropriate, named places or time periods be used in preference to numeric identifiers such as sets of coordinates or date ranges.

Element: Rights
Name: Rights Management
Identifier: Rights
Definition: Information about rights held in and over the resource.
Comment: Typically, a Rights element will contain a rights management statement for the resource, or reference a service providing such information. Rights information often encompasses Intellectual Property Rights (IPR), Copyright, and various Property Rights. If the Rights element is absent, no assumptions can be made about the status of these and other rights with respect to the resource.
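
Two of the elements above, Date and Language, have recommended encodings that are simple enough to check mechanically. The Python sketch below is offered as an illustration under stated assumptions and is not part of the Dublin Core specification or the template. The function names are hypothetical. The date check accepts only the full YYYY-MM-DD form mentioned above, and the language check accepts only a two-letter code with an optional two-letter country code; it does not verify that the codes actually appear in ISO 639 or ISO 3166.

```python
import re
from datetime import date

DATE_SHAPE = re.compile(r"\d{4}-\d{2}-\d{2}")
LANGUAGE_SHAPE = re.compile(r"[A-Za-z]{2}(-[A-Za-z]{2})?")


def is_valid_dc_date(value):
    """True if value is a real calendar date written as YYYY-MM-DD."""
    if DATE_SHAPE.fullmatch(value) is None:
        return False
    try:
        date.fromisoformat(value)
    except ValueError:
        return False
    return True


def is_valid_dc_language(value):
    """True if value has the shape 'en' or 'en-uk' (codes are not looked up)."""
    return LANGUAGE_SHAPE.fullmatch(value) is not None
```

A cataloguer could run such checks on a record before saving it, so that values like 05/07/2003 or "english" are caught and corrected to the recommended form.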

Dublin Core Template

Complete as many fields as possible without guessing.
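
For readers who want to see what a completed record might look like outside the web template, the Python sketch below builds a small XML record from whichever of the fifteen elements have been filled in. This is an assumption-laden illustration: the template referred to in this manual produces HTML, whereas this sketch swaps in XML using the standard Dublin Core element namespace. The wrapper element name "metadata", the function name, and the sample values are hypothetical. Empty fields are left out, which follows the advice to avoid guessing.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"
DC_ELEMENTS = (
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
)


def build_record(fields):
    """Build an XML element holding the supplied Dublin Core fields."""
    unknown = set(fields) - set(DC_ELEMENTS)
    if unknown:
        raise ValueError("not Dublin Core elements: " + ", ".join(sorted(unknown)))
    ET.register_namespace("dc", DC_NS)
    root = ET.Element("metadata")  # hypothetical wrapper element
    for name in DC_ELEMENTS:
        value = fields.get(name)
        if value:  # skip fields that were left blank
            child = ET.SubElement(root, "{%s}%s" % (DC_NS, name))
            child.text = value
    return root


# Sample values are placeholders for illustration only.
record = build_record({
    "title": "Operation of Electron Microscope",
    "language": "en",
    "identifier": "akf3435.001",
})
print(ET.tostring(record, encoding="unicode"))
```

Rejecting unknown field names is a deliberate choice here: it catches misspelled element names early instead of silently dropping them from the record.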


Minolta PS 7000 Quick Scan Scanning Instructions

QuickScan Software Instructions

1. Click on the QuickScan icon on the desktop.

2. Go under the Options menu and click on Scanner Settings.

• The Mode box should read “Black and White.”
• Change the Dots per inch setting to 600 dpi.
• Change the Page Size to accommodate the size of the book to be scanned.
• Change brightness from Manual to Automatic.


3. Next, click on the More button for the Special Features settings.

• Change Scan Mode to Split (Left page then Right page), or to whichever mode you wish to scan in.
• Make sure that the Frame Masking, Finger Masking and Centering options are selected.
• Click on the Center-Line Erase option and make sure to select Automatic Detection.
• Under Scan Method, at the bottom of the screen, make sure to select the Front Panel option.
• Click OK.

4. Go to the File menu, select the Scan Batch to File option, then select Create New Batch.


5. The book to be scanned should already have a file created using the template. Go to the F:\ directory, or to whichever directory contains the appropriate folder. In that folder you should find your book folder; click on it.

• Give your book a file name using the OCLC number or ISBN number.
• Under the Schema Activation option, choose Use Schema. (Warning: If you do not select this option, your files will not be saved properly.)
• Then, check the Warn on Overwrite option.
• Click OK.
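
Because the file name is taken from the OCLC number or ISBN, a mistyped ISBN leads to a misnamed batch. The Python sketch below shows one way to catch typing errors before the name is entered. It is an illustration only and is not a feature of QuickScan. The function names are hypothetical. It uses the standard ISBN check-digit rules: for a 10-character ISBN the digits weighted 10 down to 1 must sum to a multiple of 11 (with X standing for 10 in the last position), and for a 13-digit ISBN the digits weighted alternately 1 and 3 must sum to a multiple of 10.

```python
def normalize_isbn(raw):
    """Remove hyphens and spaces and upper-case a trailing x."""
    return raw.replace("-", "").replace(" ", "").upper()


def is_valid_isbn(raw):
    """True if raw is an ISBN-10 or ISBN-13 with a correct check digit."""
    isbn = normalize_isbn(raw)
    if not isbn.isascii():
        return False
    if len(isbn) == 10:
        body, last = isbn[:9], isbn[9]
        if not body.isdigit() or not (last.isdigit() or last == "X"):
            return False
        digits = [int(c) for c in body] + [10 if last == "X" else int(last)]
        weighted = sum(w * d for w, d in zip(range(10, 0, -1), digits))
        return weighted % 11 == 0
    if len(isbn) == 13 and isbn.isdigit():
        weighted = sum(int(c) * (3 if i % 2 else 1) for i, c in enumerate(isbn))
        return weighted % 10 == 0
    return False
```

The normalized form returned by normalize_isbn (digits only, no hyphens) is one reasonable choice for the file name itself, though this manual does not prescribe a particular form.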


5. Select Start Scanning.

6. At this point you can place the book on the scanner and scan the pages using the buttons on the front panel of the scanner.


7. Error Messages. If you get an error message that interrupts the scanning process, it is probably because the book shifted and the scanner needs to readjust. To continue scanning:

• Under the File menu, go to Scan Batch to File and choose the Insert Pages option.
• A Prepare Scanner window will appear. Select the Start Scanning option.
• Most likely, you will get another error message; when you do, select OK.
• Repeat the above procedure. The scanner will work the second time.
• Thumbnail images of your pages will appear along the left-hand side of the screen.
• When you go to insert pages, make sure that the blinking cursor bar is to the right of the page that should come before the pages you wish to insert.

8. To open a file that has been previously scanned, go to File and click on Open.


9. Select the name of the file you wish to open by clicking on it. A list of “tif” files will appear under the file name. Click on the Select All button to your right in the Open Document window.

10. A list of all the “tif” files will appear under Selected Files. Click OK.


11. A new screen will appear. To start scanning, select Start Scanning.

12. Inserting Pages. Go to File, Scan Batch to File, and select Insert Pages.


13. Make sure that your cursor is to the right of the page that you want to come before the inserted pages.

14. Go to File, Scan Batch to File, and select Insert Pages.

16. Click OK.

17. Select Start Scanning.

18. Deleting Pages.

• Put your cursor to the right of the image you wish to delete.
• With your mouse, hold down the left button and drag.
• The image should now be highlighted in black.
• Press the Delete key on your keyboard.
• The image should now be deleted.

19. Using ScandAll software to post-process. (This only works if you are using the ScandAll IP options.) When scanning is complete, you can post-process your scanned pages. Go to the IP menu and choose the Configure option.


20. Highlight the options you wish to select and click Add. Your choices should then be added to the Selected Filters section of the screen. To crop, double click on Crop1 and select your margins. After you have chosen what you wish to correct, click OK. Go to the IP menu and choose the Run on Document option. Post-processing will now start.


ABBYY FineReader 6.0 Instructions

1. Click on the ABBYY FineReader icon.

2. Select Tools, Options


3. Select Formatting. Select Retain font and font size. Check Keep pictures. Click OK. Select Format Settings.

4. Select RTF/DOC. Check Keep line breaks. Check Retain text color. Click OK.

5. Select HTML. Check Keep line breaks. Check Retain text color. Click OK.

6. Select TXT. Check Keep line breaks. Check Use blank line as paragraph separator. Click OK. Click Close.


7. Select Process, then click on Open & Read.

8. When the Open window appears, select the file that you wish to OCR. In this example, the file was called Operation of Electron Microscope. Next, the OTIFF image file was selected.


9. Highlight all the TIF files you wish to OCR and select Open.

10. A new screen will then appear. The OCR’ed sections will appear on the right of the screen in the Text Page window.

11. To save your results, select Save.

11. After pressing the Save button, the Save Wizard screen will pop up. Choose Save to File in the Choose save type window. Select Retain font and font size. Select Keep pictures. Select All pages. Click OK.

12. Save OCR’ed images to your file as Text Document. Select All pages. Select Name files as source images. Select Save.
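
Once the text files are saved, it is worth confirming that every scanned page has a matching OCR output. The Python sketch below is an illustration and not part of FineReader. The function name is hypothetical, and the sketch assumes that the Name files as source images option produces one text file per page with the same base name as the TIFF and a .txt extension (for example, 00000001.txt for 00000001.tif). That behaviour should be confirmed against the actual output before relying on the check.

```python
from pathlib import Path


def pages_missing_text(image_names, text_names):
    """Return the TIFF names that have no same-named .txt file."""
    text_stems = {
        Path(name).stem.lower()
        for name in text_names
        if Path(name).suffix.lower() == ".txt"
    }
    return sorted(
        name
        for name in image_names
        if Path(name).suffix.lower() == ".tif"
        and Path(name).stem.lower() not in text_stems
    )
```

Given the file names found in a book folder, an empty result suggests that every page was saved; any names returned identify pages to run through OCR again.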

13. To close the file, go to File and select Close Batch.
