zimeon


arXiv and the 2015 NSF mandate

30 Mar 2015 | scholarly communication

Notes on how arXiv or other publication repositories might meet the requirement of the NSF open-access mandate, based on the NSF report issued 2015-03-19: http://www.nsf.gov/pubs/2015/nsf15052/nsf15052.pdf

Requirements for publications

In section 3.1 the report states:

NSF will require that either the version of record or the final accepted peer-reviewed manuscript in peer-reviewed scholarly journals and papers in juried conference proceedings or transactions described in the scope above (Section 2.0) and resulting from new awards resulting from proposals submitted, or due, on or after the January 2016 effective date must:

  • Be deposited in a public access compliant repository designated by NSF;
  • Be available for download, reading, and analysis free of charge no later than 12 months after initial publication;
  • Possess a minimum set of machine-readable metadata elements in a metadata record to be made available free of charge upon initial publication (Section 7.3.1);
  • Be managed to ensure long-term preservation (Section 7.7); and
  • Be reported in annual and final reports during the period of the award with a unique persistent identifier that provides links to the full text of the publication as well as other metadata elements.

From which we have a number of questions:

  1. Could arXiv become a “compliant repository designated by NSF”?
  2. What are the metadata requirements (and how do these tie with requirements for other OA initiatives such as the UK HEFCE mandate and the OpenAIRE requirements)?
  3. What are the preservation requirements (would ingestion of arXiv content into Cornell University Library Archival Repository (CULAR) meet these)?

The answer to the first question is initially “no” because a single repository run by the DOE has been chosen, but there is a suggestion of later expansion to additional repositories:

In the initial implementation, NSF has identified the Department of Energy’s PAGES (Public Access Gateway for Energy and Science) system as its designated repository and will require NSF-funded authors to upload a copy of their journal articles or juried conference paper to the DOE PAGES repository in the PDF/A format, an open, non-proprietary standard (ISO 19005-1:2005). Either the final accepted version or the version of record may be submitted. NSF’s award terms already require authors to make available copies of publications to the Cognizant Program Officers as part of the current reporting requirements. As described more fully in Sections 7.8 and 8.2, NSF will extend the current reporting system to enable automated compliance.

Future expansions, described in Section 7.3.1, may provide additional repository services.

This initial plan and extension intention is reiterated in section 7.0:

In the initial implementation, NSF has identified the DOE PAGES system to support managing journal articles and juried conference papers. In the future, NSF may add additional partners and repository services in a federated system.

In section 7.1.2 the MOU with DOE for repository services is described:

In the initial implementation, NSF has entered into a non-exclusive relationship for repository services with the DOE that will enable authors whose work is subject to the NSF public access requirement to make their articles publicly available in the DOE repository system (PAGES). A MOU was executed in 2014. This agreement sets forth the terms of the cooperation between the two Federal agencies and provides for pilots and testing required to ensure that relevant information from the two systems flows correctly. The system will be available for NSF-funded authors to use on a voluntary basis by the end of calendar 2015.

This brings to mind the interesting question of whether there will be articles accepted into PAGES that would not meet arXiv’s moderation standards. Perhaps the conditions for acceptance into PAGES will be that an author be NSF-funded and that the manuscript was accepted or published. It is not clear how the second condition would be verified, nor which venues would be considered acceptable.

The NSF plan and the DOE system require PDF/A for the article but support other formats for ancillary materials, as described in section 7.2.1:

NSF will require awardee institutions to ensure that authors of articles and papers that fall within the scope of this plan (as defined in Section 2.0) deposit copies of the author’s final accepted peer-reviewed manuscript or the version of record in the PDF/A standard in a repository maintained on behalf of NSF by DOE. … The DOE system also supports non-text formats (images, video, and supporting digital data).

In future implementations, NSF will explore options that would allow authors to deposit copies in repositories maintained by other Federal agencies or by other public/private third parties that meet all of the criteria set forth in the OSTP memorandum and to report that submission back to NSF. Further explanation is provided in Section 7.3.1.

The deposit process is described as manual in section 7.3.1:

The DOE PAGES system offers centralized metadata and indexing together with the flexibility of a distributed system of linking to authoritative copies of the full-text of the material (either the final accepted manuscript or the publisher’s version of record). PAGES also accepts manual upload of PDF/A-compliant documents, which will be required of all NSF-funded authors.

In later implementations, NSF expects to add additional partners (discussed in the next paragraphs), leveraging DOE PAGES’ capability to maintain centralized metadata records and link to other repository systems. This enables NSF to maintain management control of the information without unnecessary duplication of submission and the associated burden on the awardees and investigators, or the risk of multiple and inconsistent version. A list of designated public access repository systems will be maintained on the NSF website (Sections 7.1.4 and 7.1.5).

Public/private partnerships. Various groups (including publishers, Federal agencies, and academic libraries) are actively working on initiatives that will maintain consistent metadata (CrossRef); identification of agency funding (FundRef); identification of rights and Open Access status as proposed by NISO/NFAIS; and consistent identification of authors and other contributors (ORCID, ResearcherID). The DOE PAGES system relies on the concept of “best available version” and takes advantage of the publishers’ consolidated metadata repository, CrossRef, to link metadata records in DOE’s system to full-text versions of papers maintained by cooperating publishers.

Additional text in 7.3.1 notes the possibility of the DOE’s dark archive being illuminated in the case that a publisher’s site becomes inaccessible. Then there is a more detailed description of the expansion ideas:

NSF supports these public-private, cross-agency activities and will incorporate new infrastructure capabilities as they mature. The Foundation’s incremental approach allows NSF to evolve its systems as new capabilities become available. Over time, NSF expects to expand the range of eligible repositories as follows:

  • Systems operated by other Federal agencies. NSF investigators typically have multiple funding sources. Since a given item may be based on funding from more than one agency, NSF expects to allow submissions of articles and papers to public access repositories operated by other Federal agencies that meet the standards of the OSTP February 22, 2013, memorandum and for which the investigator can provide a persistent identifier as an element in annual or final reports. Implementation of this expansion is likely to begin no earlier than FY 2016.
  • Systems operated by third parties. A coalition of publishers Clearinghouse for Open Research of the United States (CHORUS) and a group of institutions of higher education in combination with the research libraries under the leadership of AAU/ARL/APLU (e.g., SHARE) have also proposed potential solutions. NSF will continue discussions with these groups (and others that may come forward) with the expectation that information from them may be incorporated into subsequent implementations of the proposed system.

The ability to expand NSF’s system to accommodate multiple repositories in a federated system depends both on the capabilities of potential partner systems and on the level of technological complexity that NSF’s internal systems can efficiently support. In collaboration with other Federal agencies and interested parties, NSF will develop criteria for eligible repositories, based on the criteria set forth in the OSTP memorandum, and will provide appropriate guidance for awardees and investigators on the website.

NSF may initiate these discussions as early as FY 2016.

The remainder of section 7.3.1 outlines metadata requirements:

Metadata. To facilitate integration with NSF’s internal administrative systems, support simple searches, maintain the connection between the metadata record and the full content of the article, preserve the attribution to the author and to the original publisher, and provide access to a description of the material independently of its embargo status, NSF will require a minimum of eight metadata fields in a record that will become available free of charge upon initial publication of the article:

  1. Persistent identifier (e.g., DOI, NSF identifier created by DOE, etc.), which links the metadata record with the associated content;
  2. Author names(s) with associated persistent identifiers (such as the NSF investigator ID or, eventually, ORCID or a similar system);
  3. Title of the article;
  4. Journal or serial title, preferably with identifiers (e.g., ISSN);
  5. Name(s) of agency/agencies and award number(s);
  6. Representation of intellectual property rights;
  7. Link(s) to underlying data including but not limited to the Supplementary Material published with the journal article itself; and
  8. Author- or publisher-supplied Abstract

Note the inconsistency here with the earlier suggestion of FundRef identifiers for agency information. Currently arXiv has a limited implementation of ORCID associations, does not include funding information, and does not indicate whether a manuscript is the “final accepted” or “version of record” (or some other version).
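To make the gap concrete, here is a minimal sketch of a check against the eight-field minimum. The class name, field names, and the example record are all hypothetical; they are not taken from the NSF report, from DOE PAGES, or from arXiv's actual metadata schema.

```python
from dataclasses import dataclass, field, fields
from typing import List, Optional


@dataclass
class MinimalRecord:
    """Hypothetical container for the eight minimum fields listed in section 7.3.1."""
    persistent_identifier: Optional[str] = None          # 1. e.g. a DOI
    authors: List[str] = field(default_factory=list)     # 2. names, ideally with identifiers
    title: Optional[str] = None                          # 3. article title
    journal: Optional[str] = None                        # 4. journal or serial title
    funding: List[str] = field(default_factory=list)     # 5. agency names and award numbers
    rights: Optional[str] = None                         # 6. intellectual property rights
    data_links: List[str] = field(default_factory=list)  # 7. links to underlying data
    abstract: Optional[str] = None                       # 8. abstract


def missing_fields(record: MinimalRecord) -> List[str]:
    """Return the names of fields that are empty (None, "" or an empty list).

    Assumption: every field is treated as required. A real check would
    probably allow an empty data_links list for papers with no underlying data.
    """
    return [f.name for f in fields(record) if not getattr(record, f.name)]


# Made-up record resembling what a preprint server might hold today.
example = MinimalRecord(
    persistent_identifier="placeholder-identifier",
    authors=["A. Author"],
    title="An example title",
    abstract="An example abstract.",
)
print(missing_fields(example))
```

For this made-up record the check reports the journal, funding, rights, and data-link fields as missing, which roughly matches the kinds of information noted above as absent or only partially supported.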

Discussion of collaborating search services specifies that any cooperating system will need to be 508-compliant but is not very specific. In section 7.4 there is discussion of future federated implementation:

Future implementations. NSF’s approach recognizes that search is an area in which there is substantial activity and innovation in the research community and in commercial applications and services. These services are increasingly providing platforms for collaboration. In the proposed federated system that NSF envisions, search capabilities would be available at the partner nodes (e.g., DOE, other Federal agencies, third party systems hosted by publishers and universities/academic libraries) as well as through third parties (Google, Bing, and so on).

The strategy of developing a federated multi-organization repository system with public access to NSF-funded publications implies several search-related technical requirements for participating organizations:

  • Support for standardized publication metadata, e.g., Dublin Core, MARC, etc. Ideally, a single metadata standard would be adopted, but more realistically it should be possible to support a small number of metadata standards.
  • Support for a standard query language, e.g., SQL, Z39.50, Datalog, etc.
  • Support for a standard application program interface (API), e.g., REST API, to access various functions on the metadata and/or document collection.
  • Support for a standard data exchange protocol, e.g., HTML, FTP, and so on (for the actual metadata and documents).

The list of technical requirements seems an odd acronym soup that has not been carefully thought out and includes a number of inappropriate and outdated standards.
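Of these, the metadata bullet is the easiest to make concrete. The following is a rough sketch of how the eight minimum fields might be serialized as simple Dublin Core using only the Python standard library. The mapping from fields to Dublin Core elements is my own assumption (the report does not specify one), and the input field names and the `record` wrapper element are hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace of the Dublin Core Metadata Element Set, version 1.1.
DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)

# Assumed mapping from hypothetical field names to Dublin Core elements.
# The choices for funding and data links in particular are debatable.
FIELD_TO_DC = {
    "persistent_identifier": "identifier",
    "authors": "creator",
    "title": "title",
    "journal": "source",
    "funding": "contributor",
    "rights": "rights",
    "data_links": "relation",
    "abstract": "description",
}


def to_dublin_core(record: dict) -> str:
    """Serialize a dict of metadata fields as an XML string of dc: elements."""
    root = ET.Element("record")  # hypothetical wrapper element
    for name, dc_element in FIELD_TO_DC.items():
        value = record.get(name)
        if not value:
            continue  # skip fields that are absent or empty
        values = value if isinstance(value, list) else [value]
        for v in values:
            child = ET.SubElement(root, f"{{{DC_NS}}}{dc_element}")
            child.text = v
    return ET.tostring(root, encoding="unicode")


print(to_dublin_core({"title": "An example title", "authors": ["A. Author", "B. Author"]}))
```

Repeated values such as multiple authors become repeated elements, which is how simple Dublin Core usually handles multiplicity. Note that plain Dublin Core has no obvious home for award numbers or for the distinction between accepted manuscript and version of record.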

In section 7.7 the report discusses the use of various mechanisms to ensure long-term access to publications and data. The focus regarding format is on the use of PDF/A and possible conversion from other versions of PDF. Bit preservation is ensured via dark archives with good management and storage redundancy.

Section 13.0 summarizes the timeline for implementation and regarding other repositories includes:

  1. Implement expansion of the federated system to include DOE and other partners, repositories, and research products. Planning will begin in FY 2016 with the addition of one partner, which will be identified by the NSF Public Access Working Group, and implementation of this second partnership may begin in FY 2017.

In conclusion, it seems that at the moment the NSF are focusing on the single DOE repository and the timeline for considering other repositories is unclear. However, if and when there is a willingness to consider other deposit venues, it seems that arXiv would not have to do too much to meet the requirements. While the details of these requirements are different from the UK HEFCE and OpenAIRE requirements, there are many similarities.
