A full-day workshop held at JCDL 2003 (workshops program) in Houston, TX, on Saturday, 31 May 2003.
The table below lists the position statements and suggested topics submitted by each workshop participant. Links to participants' slides are provided where available.
Participant | Position statements and suggested topics |
Donatella Castelli (Istituto di Elaborazione della Informazione, Pisa, Italy) |
(registered at conference) |
Naomi Dushay (Cornell Information Science) (NSDL) |
Position statement
A primary goal of the National Science Digital Library (NSDL) project is to transform the use of digital resources in science education, in the broadest sense. As part of this effort, we are creating a repository of relevant metadata by gathering large amounts of descriptive metadata pertaining to resources in the fields of science, technology, engineering and mathematics. Because of the NSDL's strong focus on education, many of its funded projects are collecting or developing complex learning objects, often with similarly complex metadata. At the same time, the projects often do not have much metadata or OAI-relevant technical expertise available. Currently the NSDL central repository harvests metadata almost exclusively using the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH). It has been our experience that these harvests rarely go smoothly, usually due to the data provider's lack of technical expertise. Once the technical obstacles with the protocol interactions are overcome, we generally encounter problems with the metadata itself. In a traditional library, this metadata would be carefully vetted by librarians. In the NSDL model, we have a single central metadata specialist (librarian) overseeing a largely automated process that pulls XML-formatted metadata from vetted resources and normalizes it for our local repository. Finding ways to minimize human effort in harvest and metadata ingest while supporting the availability of high-quality, consistent metadata for NSDL services is a major challenge for the NSDL project.
Proposed topics for discussion
Topic 1: Aggregator issues
|
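The harvesting workflow described above starts with parsing OAI-PMH responses from data providers. A minimal sketch in Python of extracting record headers and Dublin Core titles from an OAI-PMH v2.0 ListRecords response; the sample response and its record values are invented for illustration, while the namespace URIs come from the OAI-PMH v2.0 specification:

```python
# Parse an OAI-PMH v2.0 ListRecords response with the standard library.
# SAMPLE is a fabricated example response, not real repository data.
import xml.etree.ElementTree as ET

OAI = "{http://www.openarchives.org/OAI/2.0/}"
DC = "{http://purl.org/dc/elements/1.1/}"

SAMPLE = """<?xml version="1.0"?>
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <responseDate>2003-05-31T12:00:00Z</responseDate>
  <ListRecords>
    <record>
      <header>
        <identifier>oai:example.org:item-1</identifier>
        <datestamp>2003-05-01</datestamp>
      </header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>An example resource</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""

def parse_records(xml_text):
    """Return (identifier, datestamp, title) tuples from a ListRecords response."""
    root = ET.fromstring(xml_text)
    out = []
    for rec in root.iter(OAI + "record"):
        header = rec.find(OAI + "header")
        ident = header.findtext(OAI + "identifier")
        stamp = header.findtext(OAI + "datestamp")
        title = rec.findtext(".//" + DC + "title")
        out.append((ident, stamp, title))
    return out

print(parse_records(SAMPLE))
```

In practice a harvester would fetch this XML over HTTP and loop on resumptionTokens; malformed provider responses (a problem several participants raise) would make the `fromstring` call fail, which is where validation and conditioning tools come in.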
Edward Fox (Dept. of Computer Science, Virginia Tech) |
Position statement
Background:
Proposed topics for discussion
Topic 1: Foundations for OAI-PMH
|
Thomas Habing (Grainger Engineering Library Information Center, University of Illinois at Urbana-Champaign) |
Position statement
I am a Research Programmer at the Grainger Engineering Library Information Center, University of Illinois at Urbana-Champaign. I have been involved with the OAI protocol since its early days, developing one of the data providers for the OAI alpha test around October 2000. After this I helped develop a number of the OAI toolkits, both providers and harvesters, for Grainger's Mellon-funded OAI project. I have developed and continue to maintain many of these tools as Open Source, downloadable from SourceForge. I am currently involved with a number of research initiatives at Grainger which involve the OAI protocol, including an IMLS Digital Collections and Content (DCC) project and an NSDL project to make mathematical content available for harvesting. I have also developed a search system utilizing OAI-harvested records of scientific and engineering related resources which is in active use in the Grainger Library.
Proposed topics for discussion
Topic 1: Better turnkey provider solutions
We have done a fair amount of work helping various potential data providers make their metadata OAI-harvestable. However, we are finding it difficult to develop turnkey or out-of-the-box solutions. What is everyone's experience in attempting to develop or use the various OAI toolkits? What is needed to make it easier? Is the static, gateway protocol an answer? Is some lower-level standard needed? Do we need better documentation or best-practices guidelines?
Topic 2: De-dupping OAI harvested records
Many service providers are choosing to make the metadata aggregations they've harvested available for re-harvesting by other service providers, with the result that some service providers are re-harvesting duplicate or overlapping content without knowing it.
Other content providers are providing OAI-harvestable metadata records for resources held by multiple sites, or for resources like Websites not under their institutional control, with the result that there are overlaps in collections of OAI metadata being offered. In the latter case, the descriptive information in the records typically varies, but the records are describing essentially the same object, or at least one instance of that object. Should harvesting services be collaborating on possible technologies and/or best practices for de-dupping records they harvest? |
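One family of de-dupping techniques for overlapping harvested records is to normalize a few descriptive fields into a fuzzy match key and keep one record per key. A minimal sketch; the field choices (title and creator) and the normalization rules are illustrative assumptions, not an agreed best practice:

```python
# Sketch: detect near-duplicate harvested records by hashing a
# normalized (title, creator) pair. Field names are assumptions.
import hashlib
import re

def match_key(record):
    """Build a fuzzy match key from title and creator fields."""
    parts = []
    for field in ("title", "creator"):
        value = record.get(field, "")
        # lowercase, strip punctuation, collapse whitespace
        value = re.sub(r"[^a-z0-9 ]", "", value.lower())
        value = " ".join(value.split())
        parts.append(value)
    return hashlib.sha1("|".join(parts).encode("utf-8")).hexdigest()

def dedup(records):
    """Keep the first record seen for each match key."""
    seen = {}
    for rec in records:
        seen.setdefault(match_key(rec), rec)
    return list(seen.values())

records = [
    {"title": "The OAI-PMH Protocol", "creator": "Smith, J."},
    {"title": "The OAI-PMH protocol.", "creator": "Smith, J"},   # same object, varied record
    {"title": "Another Resource", "creator": "Jones, A."},
]
print(len(dedup(records)))  # two distinct objects remain
```

The hard part, as the position statement notes, is that descriptive information varies across providers; a real service would need fuzzier matching and a policy for which duplicate record to keep.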
Katrina Hagedorn (Digital Library Production Service, University of Michigan) (OAIster) |
Position statement
OAIster is one of the first large-scale harvesters of OAI metadata records. Beginning in June 2002 with around 60K records harvested from around 60 repositories, we have grown to over 1.1 million records from over 150 repositories to date. We use UIUC's Java-version harvester and have developed our own Java-based transformation scripts to filter the harvested records and transform them into our DLXS Bibliographic Class encoding format for use with DLXS XPAT software and middleware. OAIster filters out records that do not link to digital resource representations, and makes these records searchable to end-users at http://www.oaister.org/. Future plans include improvements to the searching interface, refinement of the filtering methods, and co-ordination with other campus services.
Proposed topics for discussion
Topic 1: Rights, restrictions and access
We harvest records that point to objects that are restricted to certain communities and/or people, so even though we provide free access to the metadata, users can be surprised when they attempt to access the digital object itself. A restricted flag (yes/no) within OAI-PMH could assist harvesters in gathering just the records they need. An expanded version of the flag that indicates which communities the digital object is restricted to bleeds into the current DC Rights field, and couldn't be standardized easily because of metadata inconsistency issues.
Topic 2: Automated repository discovery
There is a partner to the idea of automated repository discovery: when is a repository no longer viable? Should the records be deleted? How do we discover this? |
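The filtering step OAIster describes, dropping records that do not link to a digital resource representation, can be sketched as a simple predicate over Dublin Core identifier values. The records and the URL heuristic below are illustrative assumptions, not OAIster's actual rules:

```python
# Sketch: keep only harvested records whose dc:identifier values
# include something that looks like a resolvable URL.
def links_to_resource(record):
    """True if any identifier value looks like a URL."""
    return any(v.startswith(("http://", "https://", "ftp://"))
               for v in record.get("identifier", []))

records = [
    {"identifier": ["http://example.org/thesis.pdf"]},
    {"identifier": ["ISBN 0-123-45678-9"]},   # describes a physical object
    {"identifier": []},                        # no identifier at all
]
kept = [r for r in records if links_to_resource(r)]
print(len(kept))  # only the record with a URL survives
```

A production filter would also need to handle URLs buried in other fields and dead links, which connects to the repository-viability question in Topic 2.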
Terry Harrison (Old Dominion University) |
(registered at conference) |
Xiaoming Liu (Research Library, Los Alamos National Laboratory) |
Position statement
Prior to joining LANL, I was a PhD student at Old Dominion University. I have worked closely with the Open Archives Initiative during the development effort that led to OAI-PMH v1.x and v2.0. I also developed/co-developed the Arc (http://arc.cs.odu.edu) cross-archive searching tool, the Kepler (http://kepler.cs.odu.edu) P2P-based publication framework, and the DP9 (http://dlib.cs.odu.edu/dp9) service at ODU.
Proposed topics for discussion
Topic 1: Improving freshness of service providers
The lack of adequate synchronization of metadata records between data providers and service providers can distort the results a user obtains from a service provider. In the current OAI-PMH framework, the only approach to minimizing asynchrony is for the harvester to harvest more frequently. However, frequent harvesting is inefficient when data providers have significantly varying update frequencies. We propose several possible approaches that may be used to detect changes in a repository:
|
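Within the current OAI-PMH framework, the "harvest more frequently" strategy is usually implemented as incremental harvesting: each request is limited with the `from` argument to records changed since the last harvest. A minimal sketch of building such a request URL; the base URL is arXiv's published OAI-PMH endpoint, but the datestamp bookkeeping is an assumed convention:

```python
# Sketch: build an incremental OAI-PMH ListRecords request using the
# `from` datestamp argument defined by the protocol.
from urllib.parse import urlencode

def list_records_url(base_url, metadata_prefix, last_harvest=None):
    """Build a ListRecords request, optionally limited to recent changes."""
    args = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if last_harvest is not None:
        # conventionally, the responseDate of the previous harvest
        args["from"] = last_harvest
    return base_url + "?" + urlencode(args)

url = list_records_url("http://arXiv.org/oai2", "oai_dc", "2003-05-01")
print(url)
```

This reduces transfer volume but does not solve the scheduling problem the position statement raises: the harvester still has to guess how often each repository changes.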
Michael Nelson (Old Dominion University) |
(registered at conference) |
Heinrich Stamerjohanns (Institute for Science Networking at the University of Oldenburg) |
Position statement
I am a researcher at the Institute for Science Networking at the University of Oldenburg. We have long experience with harvesting metadata from distributed archives, which we have been doing since 1995 with PhysDoc (http://physnet.uni-oldenburg.de/PhysNet). We have implemented a Data Provider for this heterogeneous data and a Service Provider to collect data from OAI Data Providers and other collections. I have implemented a Data Provider for PhysDoc in PHP, which is available at http://www.physnet.uni-oldenburg.de/projects/OAD/software.html. We are currently developing a PEAR package for Data and Service Providers which is available at SourceForge. Besides our focus on the physics community, we help libraries and other institutions to become OAI Data Providers and use the OAI protocol for internal metadata transfer. Through DINI, the Deutsche Initiative für Netzwerkinformation e.V., we support the dissemination and setup of OAI-compatible archives by organising workshops and giving tutorials on OAI.
Proposed topics for discussion
Topic 1: Metadata issues
[ Slides: PDF ] |
Simeon Warner (Cornell Information Science) |
Position statement
I am one of the maintainers and developers of the arXiv e-print archive (http://arXiv.org/), and have worked with the Open Archives Initiative (OAI) since its inception. I wrote and maintain arXiv's data-provider implementation and thus deal with occasional problem reports from harvesters. During development of the OAI-PMH v1.0, I wrote a test harvester in Perl which has been extended for the subsequent v1.1 and v2.0 releases. I use this harvester for testing and it has also been used by the NSDL. To cope with bad XML data, I wrote the utf8conditioner, which replaces bad codes in UTF-8/XML streams with dummy codes that (usually) allow the XML to be parsed. This has proved invaluable in testing and in the diagnosis of problems with repository implementations. I am currently engaged in the creation of "Harvie", new harvesting software written in Java and designed to be deployed in automated production systems. This work is in conjunction with Oyvind Raad (Cornell).
Proposed topics for discussion
Topic 1: Semantics and use of responseDate and the actual time of response
There are a number of places in which both the OAI-PMH v2.0 specification and the OAI-PMH response schema are not as explicit or as tight as they might be. I propose running through a laundry list of known issues, suggesting the "correct" interpretation, and discussing problems that might arise from tightening/clarifying the specification. |
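The idea behind utf8conditioner mentioned above, substituting bad codes in UTF-8/XML streams so the document can still be parsed, can be illustrated in a few lines. This is a Python sketch of the concept only: the real tool is a separate C program, and the choice of `?` as the dummy character is an assumption:

```python
# Sketch of the utf8conditioner idea: decode a byte stream as UTF-8,
# substituting invalid sequences and the control characters that
# XML 1.0 forbids with a dummy character so parsing can proceed.
def condition(raw_bytes, dummy="?"):
    """Return a parse-safe string built from possibly-bad UTF-8 bytes."""
    text = raw_bytes.decode("utf-8", errors="replace").replace("\ufffd", dummy)
    # XML 1.0 forbids control characters below U+0020 except tab, LF, CR
    return "".join(c if c >= " " or c in "\t\n\r" else dummy for c in text)

print(condition(b"good \xff\xfe bad \x01 ctrl"))
```

The substitution loses information, but as the position statement notes, it usually lets the XML parse, which is what matters when diagnosing a broken repository implementation.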