utf8conditioner

This is a small C program that either checks or `fixes' a UTF-8 byte stream. It was designed for use within an OAI harvester, to remove bad codes from supposedly UTF-8 byte streams so that they can then be parsed by a standard XML parser that would otherwise fail. Numeric character references (e.g. &#176;, which stands for °) are also decoded, and the resulting Unicode character is checked for validity.
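
To illustrate what checking a numeric character reference involves, here is a minimal sketch. It is not taken from the utf8conditioner source: the function names parse_ncr and xml_char_ok are hypothetical, and the validity test assumes the Char production of the XML 1.0 specification, which may differ from the rules the program actually applies.

```c
#include <stddef.h>

/* Returns 1 if cp is permitted by the XML 1.0 Char production, else 0.
   (Assumption: the real tool may apply a different validity rule.) */
int xml_char_ok(unsigned long cp)
{
    return cp == 0x9 || cp == 0xA || cp == 0xD ||
           (cp >= 0x20 && cp <= 0xD7FF) ||
           (cp >= 0xE000 && cp <= 0xFFFD) ||
           (cp >= 0x10000 && cp <= 0x10FFFF);
}

/* Parses a reference such as "&#176;" or "&#xB0;" at the start of s.
   On success stores the code point in *cp and returns the number of
   bytes consumed; returns 0 if s does not start with a well-formed
   numeric character reference. */
size_t parse_ncr(const char *s, unsigned long *cp)
{
    const char *p = s;
    unsigned long value = 0;
    int base = 10, digits = 0;

    if (p[0] != '&' || p[1] != '#') return 0;
    p += 2;
    if (*p == 'x') { base = 16; p++; }

    for (;; p++, digits++) {
        int d;
        if (*p >= '0' && *p <= '9') d = *p - '0';
        else if (base == 16 && *p >= 'a' && *p <= 'f') d = *p - 'a' + 10;
        else if (base == 16 && *p >= 'A' && *p <= 'F') d = *p - 'A' + 10;
        else break;
        /* Stop accumulating once the value is out of range; this avoids
           overflow and leaves a value that xml_char_ok will reject. */
        if (value <= 0x10FFFF) value = value * base + d;
    }
    if (digits == 0 || *p != ';') return 0;

    *cp = value;
    return (size_t)(p - s) + 1;
}
```

With these definitions, parse_ncr("&#176;", &cp) consumes 6 bytes and sets cp to 176, which xml_char_ok accepts; a reference such as "&#0;" is well formed but names a character that xml_char_ok rejects.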

Without knowing why an illegal code is present in a supposedly UTF-8 byte stream, and thus what it should have been, it is impossible to correct the stream properly. However, experience has shown that some OAI server implementations that do not use good XML-writing libraries sometimes emit byte streams containing codes from other character sets, which makes otherwise correctly structured XML responses unparsable. The utf8conditioner simply substitutes a dummy character or string (? by default) in place of bad byte sequences. This causes local corruption but usually allows the harvesting process to continue. Any harvester should, of course, flag such errors to the operator, and the operator should report them to the data-provider so that they can be corrected.
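
The substitution idea can be sketched as follows. This is an illustration only, not the utf8conditioner code: the names utf8_seq_len and condition_utf8 are hypothetical, and the policy shown (one substitute character per offending byte, then resynchronising at the next byte) is an assumption about one reasonable way to do it.

```c
#include <stddef.h>
#include <string.h>

/* Returns the length (1 to 4) of the valid UTF-8 sequence starting at s,
   or 0 if the bytes there are not valid UTF-8. n (>= 1) is the number
   of bytes available. */
static size_t utf8_seq_len(const unsigned char *s, size_t n)
{
    unsigned char c = s[0];
    size_t len, i;
    unsigned long cp, min;

    if (c < 0x80) return 1;
    else if ((c & 0xE0) == 0xC0) { len = 2; cp = c & 0x1F; min = 0x80; }
    else if ((c & 0xF0) == 0xE0) { len = 3; cp = c & 0x0F; min = 0x800; }
    else if ((c & 0xF8) == 0xF0) { len = 4; cp = c & 0x07; min = 0x10000; }
    else return 0;                      /* stray continuation or bad lead byte */

    if (len > n) return 0;              /* sequence truncated by end of input */
    for (i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80) return 0;    /* not a continuation byte */
        cp = (cp << 6) | (unsigned long)(s[i] & 0x3F);
    }
    if (cp < min) return 0;                     /* overlong encoding */
    if (cp >= 0xD800 && cp <= 0xDFFF) return 0; /* surrogate code point */
    if (cp > 0x10FFFF) return 0;                /* beyond Unicode range */
    return len;
}

/* Copies in[0..n-1] to out, replacing each byte that does not start a
   valid UTF-8 sequence with sub. out must have room for n bytes.
   Returns the number of bytes written; *bad receives the number of
   substitutions made. */
size_t condition_utf8(const unsigned char *in, size_t n,
                      unsigned char *out, unsigned char sub, size_t *bad)
{
    size_t i = 0, o = 0;

    *bad = 0;
    while (i < n) {
        size_t len = utf8_seq_len(in + i, n - i);
        if (len == 0) {
            out[o++] = sub;     /* substitute and resynchronise */
            (*bad)++;
            i++;
        } else {
            memcpy(out + o, in + i, len);
            o += len;
            i += len;
        }
    }
    return o;
}
```

For example, the seven bytes "caf\xE9 ok" (an ISO-8859-1 é in a supposedly UTF-8 stream) come out as "caf? ok" with one substitution reported, while the correctly encoded "caf\xC3\xA9" passes through unchanged. A non-zero substitution count is what a harvester would flag to the operator.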

This software is supplied under the GNU General Public License; see the file COPYING for details. It makes use of the routine getopt.c (included with the source), developed by the University of California, Berkeley and its contributors.

2007-02-09: A Java port of this code is available as part of the kopal Library for Retrieval and Ingest (koLibRI). Thanks to Stefan E. Funk of Göttingen State and University Library.

Download

Current: utf8conditioner.tar.gz (26kB), utf8conditioner.zip (35kB).

All versions (see HISTORY file)
25 October 2005: tar.gz, zip.
15 April 2003: tar.gz, zip.
14 January 2003: tar.gz, zip.

Compiling and testing

This code was written and tested on an Intel-based Linux system using gcc. It is written in ANSI C and should compile on other platforms. I'd be interested to hear of problems/solutions for use on other platforms.

  1. Unpack the archive and change to the source directory (it unpacks into the directory utf8).
  2. Edit the Makefile as necessary to set the C compiler etc.
  3. Run make.
  4. The executable is called utf8conditioner; it can be moved anywhere that is convenient.
  5. Run with the -h flag to show options, e.g. ./utf8conditioner -h.

Use

Typical use for checking (-c flag) an XML file (-x flag for XML) is:

cat utf8file | ./utf8conditioner -c -x

Typical use for checking and `fixing' an XML file is:

cat utf8file | ./utf8conditioner -x > fixedFile

The way I use utf8conditioner for OAI harvesting is as follows:

   Make OAI request to repository
   Attempt to parse response
   if (failed) {
     Pipe response through utf8conditioner (with -x flag)
     Attempt to parse conditioned response
     if (success) {
       Write warning message for operator
       Use conditioned response in place of original
     } else {
       Abort harvest
     }
   }
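
The control flow above could be expressed in C roughly as follows. This is a sketch under stated assumptions, not code from any harvester: handle_response, parse_fn and condition_fn are hypothetical names, the two hooks stand for a real XML parser and for a call out to utf8conditioner, and stub_parse and stub_condition are stand-ins included only so that the example is self-contained.

```c
#include <stdio.h>
#include <string.h>

typedef enum { HARVEST_OK, HARVEST_CONDITIONED, HARVEST_ABORT } harvest_status;

/* Hypothetical hooks. parse returns non-zero if the XML parses.
   condition writes a conditioned copy of in to out (at most outsize
   bytes including the terminator) and returns non-zero on success. */
typedef int (*parse_fn)(const char *xml);
typedef int (*condition_fn)(const char *in, char *out, size_t outsize);

/* Tries the response as received, then falls back to a conditioned
   copy. *usable is set to the text to use, or NULL on abort. */
harvest_status handle_response(const char *response,
                               char *buf, size_t bufsize,
                               parse_fn parse, condition_fn condition,
                               const char **usable)
{
    *usable = NULL;
    if (parse(response)) {
        *usable = response;
        return HARVEST_OK;
    }
    if (condition(response, buf, bufsize) && parse(buf)) {
        /* Warn the operator so the data-provider can be told. */
        fprintf(stderr, "warning: response parsed only after conditioning\n");
        *usable = buf;
        return HARVEST_CONDITIONED;
    }
    return HARVEST_ABORT;
}

/* Stand-ins for demonstration only: "parsing" fails on any 0xFF byte,
   which can never occur in valid UTF-8. */
int stub_parse(const char *xml)
{
    return strchr(xml, '\xFF') == NULL;
}

int stub_condition(const char *in, char *out, size_t outsize)
{
    size_t i, n = strlen(in);

    if (n + 1 > outsize) return 0;      /* no room for the copy */
    for (i = 0; i <= n; i++)            /* i == n copies the terminator */
        out[i] = (in[i] == '\xFF') ? '?' : in[i];
    return 1;
}
```

In a real harvester the condition hook would pipe the response through utf8conditioner with the -x flag, and the parse hook would call whatever XML parser the harvester already uses.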

You should, of course, report any problems in an OAI response to the data-provider so that they can be corrected. This utility simply aids debugging and makes it possible to complete the rest of a harvest under certain error conditions.

Bugs

Please send bug reports to me at simeon@cs.cornell.edu.


Simeon Warner $Id: index.html,v 1.7 2005/11/08 23:41:19 simeon Exp $