OAI Metadata Harvesting Workshop

A full-day workshop held at JCDL2003 (workshops program) in Houston, TX on Saturday, 31 May 2003.

Position statements and suggested topics for discussion

The table below lists the position statements and suggested topics submitted by each workshop participant. Links to participants' slides are provided where available.

Participant Position statements and suggested topics

Donatella Castelli

(Istituto di Elaborazione della Informazione, Pisa, Italy)

(registered at conference)

Naomi Dushay

(Cornell Information Science)

(NSDL)

Position statement

A primary goal of the National Science Digital Library (NSDL) project is to transform the use of digital resources in science education, in the broadest sense. As part of this effort, we are creating a repository of relevant metadata by gathering large amounts of descriptive metadata pertaining to resources in the fields of science, technology, engineering and mathematics. Because of the NSDL's strong focus on education, many of its funded projects are collecting or developing complex learning objects, often with similarly complex metadata. At the same time, the projects often have little metadata expertise or OAI-relevant technical expertise available.

Currently the NSDL central repository harvests metadata almost exclusively using the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH). It has been our experience that these harvests rarely go smoothly, usually due to the data provider’s lack of technical expertise. Once the technical obstacles with the protocol interactions are overcome, we generally encounter problems with the metadata itself. In a traditional library, this metadata would be carefully vetted by librarians. In the NSDL model, we have a single central metadata specialist (librarian) overseeing a largely automated process that pulls XML-formatted metadata from vetted resources and normalizes it for our local repository. Finding ways to minimize human effort in harvest and metadata ingest while supporting the availability of high-quality, consistent metadata for NSDL services is a major challenge for the NSDL project.

Proposed topics for discussion

Topic 1: Aggregator issues
  • metadata quality issues: inconsistent incoming, normalize locally?
  • non-persistent deleted records
  • rights/permissions: for metadata, for resource
  • provenance in practice: nested provenance, responseDate problems
  • duplicate metadata records for same resource
Topic 2: Data provider points of confusion - the knowledge gap
  • OAI problems: identifiers (OAI vs. dc), datestamps (header, responseDate, dc:date), multiple metadata formats allowed, abouts (rights here vs. dc:rights), sets, resumptionTokens.
  • XML problems: encoding (UTF-8, XML entities, URL, HTML), namespaces, schemas, xsi:schemaLocation and validation.
  • chunk size
  • OAI in a box is not one size fits all. Static repositories, better documentation, and better testing tools will help, but not fix all of this.
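
To illustrate the identifier and datestamp confusion listed above, here is a minimal Python sketch. It is not taken from NSDL code: the sample record, the repo.example.org names, and the summarize function are hypothetical. It shows that the OAI header identifier and datestamp describe the metadata record, while dc:identifier and dc:date describe the resource.

```python
import xml.etree.ElementTree as ET

# Namespace URIs used by OAI-PMH v2.0 responses and unqualified Dublin Core.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Hypothetical record, invented for illustration only.
SAMPLE_RECORD = """\
<record xmlns="http://www.openarchives.org/OAI/2.0/">
  <header>
    <identifier>oai:repo.example.org:42</identifier>
    <datestamp>2003-05-01</datestamp>
  </header>
  <metadata>
    <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
               xmlns:dc="http://purl.org/dc/elements/1.1/">
      <dc:identifier>http://repo.example.org/items/42</dc:identifier>
      <dc:date>1998</dc:date>
    </oai_dc:dc>
  </metadata>
</record>
"""


def summarize(record_xml):
    """Pull out the four values that data providers most often mix up."""
    record = ET.fromstring(record_xml)
    return {
        # Header values: identify and date the metadata record itself.
        "oai_identifier": record.findtext("oai:header/oai:identifier", namespaces=NS),
        "oai_datestamp": record.findtext("oai:header/oai:datestamp", namespaces=NS),
        # Dublin Core values: identify and date the described resource.
        "dc_identifier": record.findtext(".//dc:identifier", namespaces=NS),
        "dc_date": record.findtext(".//dc:date", namespaces=NS),
    }


if __name__ == "__main__":
    for key, value in summarize(SAMPLE_RECORD).items():
        print(f"{key}: {value}")
```

A harvester doing incremental harvests would compare against the header datestamp only; the dc:date value plays no part in the protocol.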

[ Slides: PPT, PDF, 6up PDF ]

Edward Fox

(Dept. of Computer Science, Virginia Tech)

(CSTC, NDLTD, NCSTRL, CITIDEL)

Position statement

Background:

  • I attended the Santa Fe meeting.
  • I've been serving on the OAI Steering Committee, and have run several OAI workshops.
  • We use OAI in CSTC (www.cstc.org), NDLTD (www.ndltd.org), NCSTRL (www.ncstrl.org), CITIDEL (www.citidel.org).
  • I helped the Mellon / SOLINET / Emory project AmericanSouth.org to make use of OAI.
  • I served as supervisor for Hussein Suleman's dissertation related to OAI.

Proposed topics for discussion

Topic 1: Foundations for OAI-PMH
  • We should develop a layer below OAI-PMH and standardize that. It then would be a foundation for OAI-PMH as well as for XOAI, as proposed in Hussein Suleman's dissertation.
  • We should formalize and standardize both OAI-PMH and XOAI as part of the broad framework of OAI.
Topic 2: Integration of regular and static OAI repositories
  • We should provide an integrated framework for providing and harvesting at all levels of engagement across a community with common interest but with differing sizes of collections.
  • We should package up and carefully integrate: a) Regular OAI tools b) The gateway effort now in alpha mode at LANL, being developed by Herbert et al.

[ Slides: PPT, PDF, 6up PDF ]

Thomas Habing

(Grainger Engineering Library Information Center, University of Illinois at Urbana-Champaign)

(Grainger DLI)

Position statement

I am a Research Programmer at the Grainger Engineering Library Information Center, University of Illinois at Urbana-Champaign. I have been involved with the OAI protocol since its early days, developing one of the data providers for the OAI alpha test around October, 2000. After this I helped develop a number of the OAI toolkits, both providers and harvesters, for Grainger's Mellon-funded OAI project. I have developed and continue to maintain many of these tools as Open Source, downloadable from SourceForge.

I am currently involved with a number of research initiatives at Grainger which involve the OAI protocol, including an IMLS Digital Collections and Content (DCC) project and an NSDL project to make mathematical content available for harvesting. I have also developed a search system utilizing OAI-harvested records of science- and engineering-related resources, which is in active use in the Grainger Library.

Proposed topics for discussion

Topic 1: Better turnkey provider solutions

We have done a fair amount of work helping various potential data providers make their metadata OAI harvestable. However, we are finding it difficult to develop turnkey or out-of-the-box solutions. What is everyone's experience in attempting to develop or use the various OAI toolkits? What is needed to make it easier? Is the static, gateway protocol an answer? Is some lower-level standard needed? Do we need better documentation or best practices guidelines?

Topic 2: De-duping OAI harvested records

Many service providers are choosing to make the metadata aggregations they've harvested available for re-harvesting by other service providers, with the result that some service providers are re-harvesting duplicate or overlapping content without knowing it. Other content providers are providing OAI harvestable metadata records for resources held by multiple sites or resources like Websites not under their institutional control -- with the result that there are overlaps in collections of OAI metadata being offered. In the latter case, the descriptive information in records typically varies, but the records are describing essentially the same object or at least one instance of that object. Should harvesting services be collaborating on possible technologies and/or best practices for de-duping records they harvest?
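
As one possible starting point for such a discussion, the following Python sketch groups harvested records by a normalized form of the resource URL. It is an illustration under stated assumptions, not a method proposed in the statement above: the record layout (dicts with oai_id and dc_identifier keys), the function names, and the normalization rules are all hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlsplit, urlunsplit

DEFAULT_PORTS = {"http": 80, "https": 443}


def normalize_url(url):
    """Reduce trivially different spellings of a URL to one key (illustrative rules only)."""
    parts = urlsplit(url.strip())
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    # Keep an explicit port only when it is not the default for the scheme.
    if parts.port and parts.port != DEFAULT_PORTS.get(scheme):
        host = f"{host}:{parts.port}"
    # Ignore a trailing slash, but keep "/" for an otherwise empty path.
    path = parts.path.rstrip("/") or "/"
    # Drop the fragment; keep the query, since it may select the resource.
    return urlunsplit((scheme, host, path, parts.query, ""))


def group_duplicates(records):
    """Map each normalized resource URL to the OAI identifiers that describe it,
    keeping only URLs described by more than one record."""
    groups = defaultdict(list)
    for rec in records:
        groups[normalize_url(rec["dc_identifier"])].append(rec["oai_id"])
    return {url: ids for url, ids in groups.items() if len(ids) > 1}


if __name__ == "__main__":
    # Hypothetical harvested records from two repositories.
    harvested = [
        {"oai_id": "oai:a.example:1", "dc_identifier": "http://example.org/paper/"},
        {"oai_id": "oai:b.example:9", "dc_identifier": "HTTP://Example.org:80/paper#abstract"},
        {"oai_id": "oai:a.example:2", "dc_identifier": "http://example.org/other"},
    ]
    print(group_duplicates(harvested))
```

URL matching of this kind only catches records that point at the same location. Records that describe the same object held at different sites, with varying descriptive information, would need some form of fuzzy matching on the descriptive fields.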

Katrina Hagedorn

(Digital Library Production Service, University of Michigan)

(OAIster)

Position statement

OAIster is one of the first large-scale harvesters of OAI metadata records. Beginning in June 2002 with around 60K records from around 60 repositories harvested, we have grown to over 1.1 million records from over 150 repositories to date. We use UIUC's Java-version harvester and have developed our own Java-based transformation scripts to filter the harvested records and transform them into our DLXS Bibliographic Class encoding format for use with DLXS XPAT software and middleware. OAIster filters out records that do not link to digital resource representations, and makes these records searchable to end-users at http://www.oaister.org/. Future plans include improvements to the searching interface, refinement of the filtering methods, and co-ordination with other campus services.

Proposed topics for discussion

Topic 1: Rights, restrictions and access

We harvest records that point to objects that are restricted to certain communities and/or people, so even though we provide free access to the metadata, users can be surprised when they attempt to access the digital object itself. A restricted flag (yes/no) within OAI-PMH could assist harvesters in gathering just the records they need. An expanded version of the flag that indicates which communities the digital object is restricted to bleeds into the current DC Rights field, and couldn't be standardized easily because of metadata inconsistency issues.

Topic 2: Automated repository discovery

There is a partner to the idea of automated repository discovery -- when is a repository not viable anymore? Should the records be deleted? How do we discover this?

[ Slides: PPT, PDF, 6up PDF ]

Terry Harrison

(Old Dominion University)

(registered at conference)

Xiaoming Liu

(Research Library, Los Alamos National Laboratory)

(Arc, Kepler, DP9)

Position statement

Prior to joining LANL, I was a PhD student at Old Dominion University. I worked closely with the Open Archives Initiative during the development effort that led to OAI-PMH 1.x and 2.0. I also developed or co-developed several services at ODU: Arc (http://arc.cs.odu.edu), a cross-archive searching tool; Kepler (http://kepler.cs.odu.edu), a P2P-based publication framework; and DP9 (http://dlib.cs.odu.edu/dp9).

Proposed topics for discussion

Topic 1: Improving freshness of service providers

The lack of adequate synchronization of metadata records between data providers and service providers can distort the results a user obtains from a service provider. In the current OAI-PMH framework, the only way to minimize asynchrony is for the harvester to harvest more frequently. However, frequent harvesting is inefficient when data providers have significantly varying update frequencies.

We propose several approaches that may be used to detect changes in a repository:

  • Best Estimation: The harvester may estimate the record update frequency by learning from the harvest history.
  • Syndication: A data provider may describe its update frequency explicitly in OAI-PMH Identify response.
  • Subscribe/Notify: A data provider may notify a service provider whenever its content is changed. This model requires communications outside of OAI-PMH and might be implemented as an additional verb.
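
The "Best Estimation" approach might be sketched in Python as below. This is only one simple way to learn from harvest history, and everything specific in it is an assumption: the history format (a list of harvest time and changed-record count pairs), the function name, and the minimum and maximum gaps between harvests.

```python
from datetime import datetime, timedelta


def estimate_next_harvest(history,
                          min_gap=timedelta(hours=1),
                          max_gap=timedelta(days=30)):
    """Suggest when to harvest next.

    history: list of (harvest_time, changed_record_count), oldest first.
    The estimate is the observed time span divided by the number of later
    harvests that found changes, clamped to [min_gap, max_gap].
    """
    if not history:
        raise ValueError("at least one past harvest is required")
    last_time = history[-1][0]
    if len(history) < 2:
        # No interval to learn from yet, so check again soon.
        return last_time + min_gap

    span = last_time - history[0][0]
    # Count harvests after the first that returned at least one changed record.
    harvests_with_changes = sum(1 for _, changed in history[1:] if changed > 0)
    if harvests_with_changes == 0:
        gap = max_gap
    else:
        gap = span / harvests_with_changes
    gap = max(min_gap, min(gap, max_gap))
    return last_time + gap


if __name__ == "__main__":
    # Hypothetical daily harvests; changes were seen on two of the last four.
    past = [
        (datetime(2003, 5, 1), 5),
        (datetime(2003, 5, 2), 0),
        (datetime(2003, 5, 3), 3),
        (datetime(2003, 5, 4), 0),
        (datetime(2003, 5, 5), 2),
    ]
    print(estimate_next_harvest(past))
```

A harvester can only observe changes at the moments it harvests, so an estimate like this is bounded by the harvest frequency already in use. The Syndication and Subscribe/Notify approaches avoid that limitation by having the data provider supply the information.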

[ Slides: PPT, PDF, 6up PDF ]

Michael Nelson

(Old Dominion University)

(registered at conference)

[ Slides: PPT, PDF, 6up PDF ]

Heinrich Stamerjohanns

(Institute for Science Networking at the University of Oldenburg)

(www.physnet.net)

Position statement

I am a researcher at the Institute for Science Networking at the University of Oldenburg. We have long experience with harvesting metadata from distributed archives, which we have been doing since 1995 with PhysDoc (http://physnet.uni-oldenburg.de/PhysNet).

We have implemented a Data Provider for this heterogeneous data and a Service Provider to collect data from OAI Data Providers and other collections.

I have implemented a Data Provider for PhysDoc in PHP, which is available at http://www.physnet.uni-oldenburg.de/projects/OAD/software.html.

We are currently developing a PEAR package for Data- and Service-Providers which is available at SourceForge.

Besides our focus on the physics community, we help libraries and other institutions to become OAI Data Providers and to use the OAI protocol for internal metadata transfer.

Through DINI, the Deutsche Initiative für Netzwerkinformation e.V., we support the dissemination and setup of OAI-compatible archives by organising workshops and giving tutorials on OAI.

Proposed topics for discussion

Topic 1: Metadata issues
  • Normalizing metadata
  • Should the Service-Provider have a method to tell the Data-Provider how many records it would like to receive in one chunk?
  • Current usage of metadata-formats other than DC.
  • How to convert proprietary markup (LaTeX, Word) in metadata to UTF-8.
Topic 2: Automation
  • How to organize sets, so that Service-Providers may be automatically guided to the metadata they are interested in.
  • How to (automatically) discover new Data-Providers?
Topic 3: SOAP
  • OAI and SOAP.

[ Slides: PDF ]

Simeon Warner

(Cornell Information Science)

(arXiv.org)

Position statement

I am one of the maintainers and developers of the arXiv e-print archive (http://arXiv.org/), and have worked with the Open Archives Initiative (OAI) since its inception. I wrote and maintain arXiv's data-provider implementation and thus deal with occasional problem reports from harvesters.

During development of the OAI-PMH v1.0, I wrote a test harvester in Perl which has been extended for the subsequent v1.1 and v2.0 releases. I use this harvester for testing and it has also been used by the NSDL. To cope with bad XML data, I wrote the utf8conditioner, which replaces bad codes in UTF-8/XML streams with dummy codes that (usually) allow the XML to be parsed. This has proved invaluable in testing and in the diagnosis of problems with repository implementations. I am currently engaged in the creation of "Harvie", new harvesting software written in Java and designed to be deployed in automated production systems. This work is in conjunction with Oyvind Raad (Cornell).
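
The general idea of conditioning a stream before parsing can be sketched in Python as follows. This is not the utf8conditioner itself, and it makes no claim about how that tool behaves: the function names and the choice of "?" as the dummy character are assumptions. The sketch decodes bytes leniently, then substitutes any character outside the XML 1.0 character range.

```python
import re
import xml.etree.ElementTree as ET

# Matches anything outside the XML 1.0 "Char" production:
# #x9 | #xA | #xD | #x20-#xD7FF | #xE000-#xFFFD | #x10000-#x10FFFF
_ILLEGAL_XML_CHARS = re.compile(
    r"[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def condition_utf8(raw, dummy="?"):
    """Return text that an XML parser is more likely to accept.

    Invalid UTF-8 byte sequences become U+FFFD (Python's replacement
    character); characters that XML 1.0 forbids become `dummy`.
    """
    text = raw.decode("utf-8", errors="replace")
    return _ILLEGAL_XML_CHARS.sub(dummy, text)


def parse_conditioned(raw):
    """Condition the bytes, then parse them as XML."""
    return ET.fromstring(condition_utf8(raw))


if __name__ == "__main__":
    # A Latin-1 byte (0xE9) and a control character (0x01) in a UTF-8 stream.
    bad = b"<title>caf\xe9 \x01 ok</title>"
    print(repr(parse_conditioned(bad).text))
```

Conditioning of this kind makes a response parseable for diagnosis; it does not recover the characters the data provider intended.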

Proposed topics for discussion

Topic 1: Semantics and use of responseDate and the actual time of response issue

The OAI-PMH v2.0 specification states that the responseDate in each response should be the UTC date and time of the response. The provenance guidelines say that it is the responseDate that should be used to create the harvestDate entry in the provenance container. Experience shows that the responseDate is not always expressed correctly and is not always accurate, which has implications for the accuracy of provenance information. There is a need for tighter validity checking, sanity checks, and guidelines for dealing with problem cases. (Thanks to Naomi Dushay, Jon Phipps and Tim Cornwell for input on this issue.)
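
One form such a sanity check could take is sketched below in Python. The function name, the five-minute tolerance, and the fallback policy (using the harvester's own clock when the responseDate fails the check) are assumptions for illustration, not recommendations from the specification or the provenance guidelines.

```python
from datetime import datetime, timedelta, timezone

# UTC form with seconds granularity, e.g. 2003-05-31T14:59:30Z
RESPONSE_DATE_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def check_response_date(response_date, received_at, tolerance=timedelta(minutes=5)):
    """Validate a responseDate string against the harvester's clock.

    received_at must be a timezone-aware UTC datetime recorded by the harvester.
    Returns (harvest_date, problem). problem is None when the responseDate
    parses and is within `tolerance` of received_at; otherwise harvest_date
    falls back to received_at and problem names the reason.
    """
    try:
        parsed = datetime.strptime(response_date, RESPONSE_DATE_FORMAT)
    except ValueError:
        return received_at, "malformed"
    parsed = parsed.replace(tzinfo=timezone.utc)
    if abs(parsed - received_at) > tolerance:
        return received_at, "clock skew"
    return parsed, None


if __name__ == "__main__":
    now = datetime(2003, 5, 31, 15, 0, 0, tzinfo=timezone.utc)
    for value in ("2003-05-31T14:59:30Z",       # plausible
                  "2003-05-31T09:59:30-05:00",  # local offset instead of Z
                  "2003-05-31T10:00:00Z"):      # five hours off
        print(value, check_response_date(value, now))
```

Whether a harvester should silently substitute its own clock, flag the record, or reject the response is exactly the kind of guideline this topic asks for.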

Topic 2: Tightening specification and schema

There are a number of places in which both the OAI-PMH v2.0 specification and the OAI-PMH response schema are not as explicit or as tight as they might be. I propose running through a laundry list of known issues, suggesting the "correct" interpretation, and discussing problems that might arise from tightening/clarifying the specification.

[ Slides: PPT, PDF, 6up PDF ]

Simeon Warner, $Date: 2003/11/24 20:48:51 $