Shades of Grey: (Yet) Another Role for BIS-TDWG
Arturo H. Ariño*
University of Navarra, Department of Environmental Biology *[email protected]
TDWG Annual Conference, Jönköping, 2014 -- Presentation Slides
The existence of a “long tail” of biodiversity research data [1], composed of a multitude of hard-to-reach datasets from small publishers [2], has often been postulated as a potential source of much biodiversity information [3] that remains “dark data” [4]: we know they are there but do not know where. Much of this heap of data could eventually be mobilized [5] and become usable, but there is a distinct risk that such data, once known, may be lost forever (“unknown knowns” [6]) as, e.g., their media become obsolete or the people who might know the metadata vanish. Still, a vast body of data exists in a variety of “grey” forms as regards their usability in biodiversity research, which now generally requires data to be digitally accessible (DAK [7]): those residing in paper publications or reports that have not (yet) been digitized, and may well never be, e.g. very short-run printings, local reports, or mere notebooks never formalized into any form of publication. These may be known and localizable (hence not completely “dark”) but are not practically usable until they are mobilized in digital form. TDWG’s evolving standards are the state-of-the-art semantic tools enabling such mobilization, and can further be used to ascertain the fitness-for-use [8] of the derivative datasets. However, the disproportionate effort required to mobilize undigitized data (as compared to digital datasets) might readily move such grey data into darkness. Incentives for mobilization (in effect, rescue boards for data) may emerge if the semantic corpus already in TDWG is progressively expanded to capture ancillary data types that may appeal to a wider community. As a starting point, perhaps the Ecology community could be targeted, and eventually a linkage to Ecological Metadata Language (EML) could again be considered (cf. Charter of the Observations Task Group) to make it more desirable to extract and digitize, within TDWG standards, biodiversity information currently fading from grey shades into the darkness.
Heads Up! A story on why preserving taxonomical data, while important, is fraught with hurdles.
[An oribatid mite pictured on an image reference card and under the microscope. From
Ariño A.H., Baquero E., Jordana R., 2005: Imaging Soil Mesofauna. In: Steiner et al., Digital Imaging of Biological type Specimens. ENBI, Stuttgart.]
The Pérez-Iñigo collection as deposited at the Museum of Zoology of the University of Navarra.
Original taxonomical description of Pergalumnidae oribatid acari by Pérez-Iñigo and Baggio.
Pérez-Iñigo’s legacy files in 5.25-inch floppy disks.
Old AT-type PC being salvaged for parts including a 5.25-inch floppy disk drive.
Functional, rescued PC from the ’80s operating under DOS 3.2.
Pérez-Iñigo’s original specimen listings as IBM Filing Assistant and IBM Writing Assistant DOS-based files.
The destruction of the collections at the Butantan Institute in the 2010 fire.
The long tail of dark data
Long and Dark Data. From Heidorn (2008): “[…] the growing number of scientists globally and the increase in the amount of data each scientist can generate […] does not in any way insure that the data is accessible now or that it will be accessible in the future. It has always been the case that scientists have generated more data than they eventually publish […]” “There is a wealth of science data that is almost impossible to see. This is science’s dark data. We can find much of this dark data in the long tail of science data. Because it is difficult to find dark data, it is underutilized and routinely lost.” P. Bryan Heidorn: Shedding Light on the Dark Data in the Long Tail of Science. Libr. Trends 57, (2008).
When properly assessed for fitness-for-use, mined, and cleaned, digital data from multiple provenances (Digital Accessible Knowledge) can be directly used in research. It is the opposite of dark data. Grey data are in between. Digital Accessible Knowledge: regarding biodiversity, “primary data that are both digital and accessible in standard formats (TDWG, 2007)” (Sousa-Baena, Couto Garcia & Peterson, 2013)
The Shades of Data
DAK: Public. Can be located and reached.
  Digital: Online papers, Datasets, Files, Databases, Repositories.
  Analog: Printed literature, Prints, Recordings, Catalogues.
Gray: Difficult to access.
  Digital: Isolated media, Obsolete media, Private files.
  Analog: Card files, Ledgers, Notebooks, Reports, Gray literature, Private known papers.
Dark: Cannot be located or accessed.
  Digital: Unreadable media, Uninterpretable files, Unknown files.
  Analog: Untracked ledgers, Notes, Locked reports, Private unknown papers.
Where are biodiversity data hiding? Obvious places to look
Where are biodiversity data hiding?
Even digital data can be(come) dark
Rate of Attrition
The rate of attrition of knowledge increases as the documentation used to interpret datasets is lost (Michener, 2000). But lifespan can be shortened by obsolescence or accidents:
Conjecture #1: Unless meta-documented, the maximum lifespan of a document is at best that of the researcher who generated it.
Conjecture #2: The probability of inheritance of document understanding is inversely proportional to its complexity.
Legacy and obsolete media containing potentially valuable biodiversity data: a Sinclair QL with memory expansion and floppy drive controller rigged as a portable computer, QL microdrives, a half-inch (9-track) IBM tape, 5.25-inch floppy disks, 3.5-inch floppy disks, and a Zip cartridge.
Biodiversity Data Access: DAK & Dark data (from Ariño et al., in review: GBIF Best Practice Guide for ‘Data Gap Analysis for Biodiversity Stakeholders’. GBIF, Copenhagen.)
Access to, and forms of, biodiversity data.
DAK – Digital Accessible Knowledge: Primary data that are both digital and accessible in standard formats.
LK – Locked knowledge: Data that are known to exist, but cannot be accessed because of some barrier (e.g. paywall, obsolete digital systems, inability to digitize).
BK – Buried knowledge: Data that exist but whose existence is not known or cannot be ascertained by users.
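The three access categories above can be read as a simple decision rule. The following is only a minimal Python sketch of that reading: the `Source` fields, the `classify` function and the example sources are illustrative assumptions, not part of the GBIF guide.

```python
from dataclasses import dataclass
from enum import Enum


class Access(Enum):
    DAK = "Digital Accessible Knowledge"
    LK = "Locked knowledge"
    BK = "Buried knowledge"


@dataclass
class Source:
    """Hypothetical description of a data source (field names are assumptions)."""
    name: str
    known_to_exist: bool    # can users ascertain that the data exist?
    digital: bool           # are the data in digital form?
    standard_format: bool   # are they available in a standard format?
    reachable: bool         # no barrier such as a paywall or obsolete media?


def classify(src: Source) -> Access:
    """Assign a source to DAK, LK or BK following the definitions above."""
    if not src.known_to_exist:
        return Access.BK
    if src.digital and src.standard_format and src.reachable:
        return Access.DAK
    # Known to exist, but some barrier prevents practical access.
    return Access.LK


# Hypothetical examples, loosely modelled on the cases in this talk.
published = Source("dataset in an open index", True, True, True, True)
floppies = Source("legacy files on floppy disks", True, True, False, False)
ledger = Source("untracked field ledger", False, False, False, False)

for s in (published, floppies, ledger):
    print(f"{s.name}: {classify(s).name}")
```

Under this reading, digitizing a locked source and publishing it in a standard format is what moves it from LK to DAK.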
The Fate of the Long Tail
Fate #1: Become DAK. Get digitized, published, recovered, usable.
Fate #2: Become Dark. Get lost!
Fate #3: There is no #3.
Michener’s classic example of information entropy, or “normal degradation in information content associated with data and metadata over time. Specific details about problems with individual items […] are lost relatively rapidly. General details about data collection are lost through time. Retirement or career change makes access by scientists to “mental storage” difficult or unlikely. […] Accidents or changes in storage technology (dashed line) may eliminate access to remaining raw data and metadata any time” (Michener et al., 1997).
The Effect of Article Age on Four Obstacles to Receiving Data from the Authors. Figure 1 in Vines et al., 2014.
Grey and dark data: Surely it is secret, but is it safe?
In the case of the type of Spinosaurus aegyptiacus, destroyed in WWII (holotype destroyed during the night of 24/25 April 1944 in a British bombing raid on Munich), reconstruction was possible because photographs and descriptions had been published: data existed.
Did we think that we were exempt from Dark Data because our specimens were put in a Museum? That if our data were lost, they could be reconstructed from the collections? Not true. Both the data and the possibility of reconstructing them can be lost, too.
The ruins of the Museum Bocage, Lisbon, after the fire of 1978. The destruction of the collection registries during the fire has made it impossible to know exactly what was lost.
How much more knowledge could have been gained were data now dark still available?
Data availability means data use. Example: Number of papers using data from
various sources made available through GBIF since 2007, and number of available data records (from A.H. Ariño: Filling Biodiversity Knowledge Gaps: GBIF supporting research for conservation management. XXI GBIF Governing Board, New Delhi, 2014).
In the GBIF index, the data use rate has stabilized; hence the impact of data availability on scientific output becomes predictable.
The reaction
A World’s Endeavor: Digitization [Online Journals, Google Books, BHL, Museums, Webs, Europeana/OpenUp…]
A World’s Attitude: Facilitate Discovery and Sharing [RDA, OA, TDWG, GBIF, OpenUp!…]
A World’s Library: Repositories [Dryad, Fedora, ArXiv…]
The keys to success:
STANDARDS [DwC, ABCD…]
METADATA [XML, EML, OAI, BML?] -> Essential Biodiversity Variables/Indicators?
INCENTIVES [Mining, Attribution, Data Papers…]
STRATEGY
Rules of Thumb (Michener 2000)
• the more comprehensive the metadata, the greater the longevity (and value) of the data
• structured metadata can greatly facilitate data discovery, encourage “best metadata practices” and support data and metadata use by others
• metadata implementation takes time!!!
• start implementing metadata for new data collection efforts and then prioritize “legacy” and ongoing data sets that are of greatest benefit to the broadest user community
Success of GBIF strongly tied to TDWG’s standards
Why not repeat that success in a new role?
- Helping set standards for gray data capture
- Start somewhere – RDA? EML?
- Use strategy
Research Data Alliance (RDA)
RDA: Biodiversity Integration Interest Group
Some TDWG members involved. Current chair: Dimitris Koureas
Will focus on NAMES as linking points
Potential work arena in the extraction and linking of scientific names from grey data
Expected to deliver a names/GNA-related sketch in 2015
RDA: Long Tail of Research Data Interest Group
Cross-interest group with Biodiversity: sharing similar problems
Demonstrating the usefulness of workflow tools (Scratchpads) to facilitate capture of long-tail data in BI
Other groups: Focused on institution-level policies and practices to improve LT recovery
Ecological Metadata Language (EML)
XML schemas for use with ecological data:
Common structure to document ecological data
Structure for software applications development
Developed in the context of LTER
Relatively free descriptors for experimental and observational data
Formalization of reported data: Giving a paper/report/dataset semantic content
Is it sensible to merge both schemas?
Options:
Expand DwC’s collection-level descriptors
Atomize EML with DwC
Ecological Metadata Language (EML):
EML focuses on the whole dataset: it could be considered an envelope for PBR, a higher-level descriptor for specimen-based or collection-based data
Common interoperability with DwC is possible at many levels, e.g. geographical, taxonomical; NAMES is an obvious link
Makes sense to reduce duplication when describing/analyzing grey data
The combination of BIS standards and EML might completely describe a grey source: an incentive for digitization and recovery
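To make the “envelope” idea concrete, here is a rough Python sketch, under stated assumptions, of a dataset-level description wrapping record-level Darwin Core terms. The element names `dataset`, `title`, `creator`, `individualName` and `surName` follow EML, and `scientificName`, `locality` and `basisOfRecord` are Darwin Core terms; the `records` and `record` containers, the function name and all values are hypothetical, namespaces are omitted, and the output is not meant to validate against either schema.

```python
import xml.etree.ElementTree as ET


def describe_grey_source(title, creator_surname, records):
    """Wrap DwC-style records in an EML-like dataset envelope (illustrative only)."""
    root = ET.Element("eml")
    dataset = ET.SubElement(root, "dataset")

    # Dataset-level description, using EML element names.
    ET.SubElement(dataset, "title").text = title
    creator = ET.SubElement(dataset, "creator")
    individual = ET.SubElement(creator, "individualName")
    ET.SubElement(individual, "surName").text = creator_surname

    # Record-level content: 'records'/'record' are hypothetical containers,
    # and each child element is named after a Darwin Core term.
    container = ET.SubElement(dataset, "records")
    for rec in records:
        node = ET.SubElement(container, "record")
        for term, value in rec.items():
            ET.SubElement(node, term).text = value
    return root


# Placeholder values standing in for one entry transcribed from a notebook.
notebook_entry = {
    "scientificName": "Genus species",
    "locality": "hypothetical locality",
    "basisOfRecord": "PreservedSpecimen",
}

doc = describe_grey_source("Transcribed field notebook", "Surname", [notebook_entry])
print(ET.tostring(doc, encoding="unicode"))
```

The point of the sketch is the division of labour: the outer elements describe the grey source as a whole, while the inner terms carry the occurrence data, with names as the obvious link between the two.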
Summary:
Why let all gray data die in darkness?
We may have tools for rescue enticement:
TDWG standards
EML
RDA fora – support?
The sooner we join forces, the lower the “extinction” rate.
“Whoever saves one data saves the entire data.”
1. Heidorn, P. B. Shedding Light on the Dark Data in the Long Tail of Science. Libr. Trends 57, (2008).
2. Koureas, D. Scratchpads for community involvement for natural history collections. (2014). at
3. Chavan, V., O’Tuama, E., Gaiji, S., Remsen, D. & King, N. Every datum counts! Capitalising on small contributions to the big dreams of mobilising biodiversity information. (2008). at
4. Khalsa, S. J. et al. Brokering for EarthCube Communities: A Road Map. (National Snow and Ice Data Center, 2013). doi:http://dx.doi.org/10.7265/N59C6VBC
5. Chavan, V. & Penev, L. The data paper: a mechanism to incentivize data publishing in biodiversity science. BMC Bioinformatics 12 Suppl 1, S2 (2011).
6. Garrity, G. M., Lyons, C. & Cole, J. R. Knowledge bleed, Phenbank, and NamesforLife. (2006). at
7. Sousa-Baena, M. S., Couto Garcia, L. & Peterson, A. T. Completeness of digital accessible knowledge of the plants of Brazil and priorities for survey and inventory. Divers. Distrib. 2013, 1–13 (2013).
8. Hill, A. W., Otegui, J., Ariño, A. H. & Guralnick, R. P. GBIF Position Paper on Future Directions and Recommendations for Enhancing Fitness-for-Use Across the GBIF Network. GBIF 25 (Global Biodiversity Information Facility, 2010). at
9. Ariño A.H., Baquero E., Jordana R., 2005: Imaging Soil Mesofauna. In: Steiner et al.: Digital Imaging of Biological type Specimens. ENBI, Stuttgart.
10. Sousa-Baena M.S., Couto Garcia L., & Peterson A.T.: Completeness of digital accessible knowledge of the plants of Brazil and priorities for survey and inventory. Diversity and Distributions, (2013) 1–13. DOI: 10.1111/ddi.12136.
11. Ariño, A.H., et al., in review: GBIF Best Practice Guide for ‘Data Gap Analysis for Biodiversity Stakeholders’. GBIF, Copenhagen.
12. Michener W.K., Brunt J.W., Helly J.J., Kirchner T.B., Stafford S.G., 1997: Nongeospatial Metadata for Ecological Sciences. Ecological Applications, 7(1): 330–342.
13. Vines, T.H., Albert, A.Y.K., Andrew, R.L., Debarre, F., Bock, D.G., Franklin, M.T., Gilbert, K.J., Moore, J.-S., Renaut, S., Rennison, D.J. The availability of research data declines rapidly with article age. Current Biology, 24: 1-4.
14. A.H. Ariño: Filling Biodiversity Knowledge Gaps: GBIF supporting research for conservation management. XXI GBIF Governing Board, New Delhi, 2014.
15. Leonard Krishtalka: Strategic Thinking: An Investment/Return Portfolio. A presentation to the Governing Board of GBIF, Lillehammer, 2012.
With special thanks to:
ANABEL PEREZ DE ZABALZA
GAIL KAMPMEIER, CYNDY PARR, ANDERS TELENIUS AND THE EXECUTIVE OF BIODIVERSITY INFORMATION STANDARDS TDWG
ANA AMEZCUA, ANGEL CHAVES, MARIA IMAS, AND THE PEOPLE AT THE MUSEUM OF ZOOLOGY OF THE UNIVERSITY OF NAVARRA
THE GLOBAL BIODIVERSITY INFORMATION FACILITY (GBIF)
THE ASSOCIATED CENTER OF THE NATIONAL UNIVERSITY FOR DISTANCE EDUCATION AT PAMPLONA (SPAIN)
THE DEPARTMENT OF ENVIRONMENTAL BIOLOGY AND THE UNIVERSITY OF NAVARRA, SPAIN
THE AUTHORS QUOTED AND CITED HERE UNDER THE FAIR USE ACT
No bytes were seriously harmed while preparing this PPTX. (And copies exist of those that actually were, anyway.) This file used 1328 watt-hours and 16 cups of black coffee.
The opinions expressed here are mine and not my employers’. All images, plots and analyses by the author except where otherwise noted. PPTX © 2014 A.H. Ariño, University of Navarra www.unav.edu/departamento/ambiun/