utf8conditioner
This is a small C program that will either check or `fix' a UTF-8 byte stream. It was designed to be used within an OAI harvester to remove bad codes from supposedly UTF-8 byte streams so that they can then be parsed by a standard XML parser, which would otherwise fail. Numeric character references (e.g. &#176; for °) are also decoded and the resulting Unicode character is checked for validity.
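To make the numeric character reference check concrete, here is a minimal sketch in C. It is not taken from the utf8conditioner source: the function names parse_ncr and is_xml_char are hypothetical, and the validity test assumes the Char production of XML 1.0, which may differ from the exact rule the program applies.

```c
#include <stddef.h>

/* Nonzero if cp is a legal character under the XML 1.0 Char production.
 * (Assumption: the real tool may use a different validity rule.) */
int is_xml_char(unsigned long cp)
{
    return cp == 0x9 || cp == 0xA || cp == 0xD ||
           (cp >= 0x20 && cp <= 0xD7FF) ||
           (cp >= 0xE000 && cp <= 0xFFFD) ||
           (cp >= 0x10000 && cp <= 0x10FFFF);
}

/* Parses a numeric character reference such as "&#176;" or "&#xB0;" at the
 * start of the NUL-terminated string s. On success stores the code point in
 * *cp and returns the number of bytes consumed; returns 0 if s does not
 * start with a well-formed reference. Hypothetical helper. */
size_t parse_ncr(const char *s, unsigned long *cp)
{
    const char *p = s;
    unsigned long value = 0;
    int base = 10, digits = 0;

    if (p[0] != '&' || p[1] != '#')
        return 0;
    p += 2;
    if (*p == 'x') {            /* hexadecimal form: &#x...; */
        base = 16;
        p++;
    }
    for (;; p++) {
        int d;
        if (*p >= '0' && *p <= '9')
            d = *p - '0';
        else if (base == 16 && *p >= 'a' && *p <= 'f')
            d = *p - 'a' + 10;
        else if (base == 16 && *p >= 'A' && *p <= 'F')
            d = *p - 'A' + 10;
        else
            break;
        value = value * base + d;
        if (value > 0x10FFFF)   /* clamp so very long digit runs cannot wrap */
            value = 0x110000;
        digits++;
    }
    if (digits == 0 || *p != ';')
        return 0;
    *cp = value;
    return (size_t)(p - s) + 1; /* include the closing ';' */
}
```

With these definitions, "&#176;" and "&#xB0;" both decode to code point 0xB0, which is_xml_char accepts, whereas "&#x1;" decodes to a control character that it rejects.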
Without knowing why an illegal code is present in a supposedly UTF-8
byte stream, and thus what it should have been, it is impossible to properly
correct an illegal UTF-8 byte stream.
However, experience has shown that OAI server implementations that do not
use good XML writing libraries sometimes write byte streams containing codes
from other character sets, which makes otherwise correctly structured XML
responses unparsable.
The utf8conditioner simply substitutes a dummy character or string
(? by default) in place of bad byte sequences. This causes local
corruption but usually allows the harvesting process to continue. Any harvester
should, of course, flag such errors to the operator and the operator should
report such errors to the data-provider so that they can be corrected.
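The substitution idea can be sketched in a few lines of C. This is an illustration only, not the code shipped in the distribution: the names utf8_seq_len and condition_utf8 are hypothetical, and replacing each offending byte with a single substitute character is an assumption about the policy (the real program may substitute once per bad sequence or use a longer string).

```c
#include <stddef.h>

/* Returns the length (1 to 4) of the well-formed UTF-8 sequence starting at
 * s[0], or 0 if the bytes there are not well-formed. n is the number of
 * bytes available. Hypothetical helper, not from utf8conditioner. */
static size_t utf8_seq_len(const unsigned char *s, size_t n)
{
    size_t len, i;
    unsigned long cp, min;

    if (n == 0)
        return 0;
    if (s[0] < 0x80)
        return 1;                                   /* ASCII */
    else if ((s[0] & 0xE0) == 0xC0) { len = 2; cp = s[0] & 0x1F; min = 0x80; }
    else if ((s[0] & 0xF0) == 0xE0) { len = 3; cp = s[0] & 0x0F; min = 0x800; }
    else if ((s[0] & 0xF8) == 0xF0) { len = 4; cp = s[0] & 0x07; min = 0x10000; }
    else
        return 0;           /* stray continuation byte or 0xF8..0xFF */

    if (len > n)
        return 0;           /* sequence truncated at end of input */
    for (i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80)
            return 0;       /* expected a continuation byte */
        cp = (cp << 6) | (s[i] & 0x3F);
    }
    if (cp < min)
        return 0;           /* overlong encoding */
    if (cp >= 0xD800 && cp <= 0xDFFF)
        return 0;           /* UTF-16 surrogate, not a character */
    if (cp > 0x10FFFF)
        return 0;           /* beyond the Unicode range */
    return len;
}

/* Copies in[0..n) to out, replacing every byte that does not start a
 * well-formed sequence with sub. out must have room for n bytes, and the
 * output is always exactly n bytes long. Returns the number of
 * substitutions made. */
size_t condition_utf8(const unsigned char *in, size_t n,
                      unsigned char *out, unsigned char sub)
{
    size_t i = 0, o = 0, bad = 0, len;

    while (i < n) {
        len = utf8_seq_len(in + i, n - i);
        if (len == 0) {
            out[o++] = sub;     /* local corruption, but parsing can go on */
            i++;
            bad++;
        } else {
            while (len--)
                out[o++] = in[i++];
        }
    }
    return bad;
}
```

For example, the three bytes 'a', 0xE9, 'b' (a Latin-1 e-acute embedded in ASCII) come out as a?b with a return value of 1, which a harvester could use to decide whether to warn the operator.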
This software is supplied under the GNU General Public License,
see the file COPYING for details.
This software makes use of the routine getopt.c (included
with the source) developed by the University of California, Berkeley
and its contributors.
2007-02-09: A Java port of this code is available as part of the kopal Library for Retrieval and Ingest (koLibRI). Thanks to Stefan E. Funk of Göttingen State and University Library.
Current: utf8conditioner.tar.gz (26kB), utf8conditioner.zip (35kB).
All versions (see HISTORY file):
25 October 2005: tar.gz, zip.
15 April 2003: tar.gz, zip.
14 January 2003: tar.gz, zip.
This code was written and tested on an Intel-based Linux system using gcc. It is written in ANSI C and should compile on other platforms. I'd be interested to hear of problems/solutions for use on other platforms.
Edit Makefile as necessary to set the C compiler etc.
Run make.
The executable is utf8conditioner; it can be moved anywhere
that is convenient.
Typical use for checking (-c flag) an XML file (-x flag for XML) is:
cat utf8file | ./utf8conditioner -c -x
Typical use for checking and `fixing' an XML file is:
cat utf8file | ./utf8conditioner -x > fixedFile
The way I use utf8conditioner for OAI harvesting is as follows:
    Make OAI request to repository
    Attempt to parse response
    if (failed) {
        Pipe response through utf8conditioner (with -x flag)
        Attempt to parse conditioned response
        if (success) {
            Write warning message for operator
            Use conditioned response in place of original
        } else {
            Abort harvest
        }
    }
You should, of course, report any problems in OAI responses to the data-provider so that they can be corrected. This utility simply aids debugging and makes it possible to complete the rest of a harvest under certain error conditions.
Please send bug reports to me at simeon@cs.cornell.edu.