ZMARCO 0.2 README (2003-06-20)

ZMARCO is an Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) 2.0 compliant data provider. The 'Z' in ZMARCO stands for Z39.50; 'MARC' stands for MAchine-Readable Cataloging; and the 'O' stands for OAI, as in the Open Archives Initiative. Essentially, ZMARCO allows MARC records that are available through a Z39.50 server to be made available via the OAI-PMH with relatively little effort.

The rationale for ZMARCO is that Z39.50 and MARC are fairly ubiquitous in the traditional library world, while the OAI-PMH is quickly being adopted as a lightweight protocol for sharing metadata within the digital library community. It therefore seems useful to develop a tool that allows the ubiquitous (but complex) Z39.50 and MARC standards to be used to create OAI data providers, thereby making the huge amount of data already available via these older standards also available via the newer OAI-PMH. ZMARCO is an attempt toward that end.

Platform and Environment

Development Details

Installation

Source Code

The source code for the ZMARCOPopulator is included in a separate zip file which was installed with this package, ZMARCOPop_0.2_src.zip. The source code for the VBZOOM.dll can be obtained from the VBZOOM site on SourceForge. The source code for the YAZ.dll can be obtained from Index Data. And, of course, the source code for the ASP scripts is included with this package.

Database

ZMARCO consists of two components. The first is a simple ZMARCO database which must be pre-populated with some minimal data that are periodically harvested from the Z39.50 server. This is required because most typical Z39.50/MARC catalogs do not allow queries based on the last modified datestamp for the metadata itself (BIB-1 USE attributes 1011 or 1012). However, access to these datestamps is essential for the OAI protocol. Fortunately, even if they are not directly queryable via Z39.50, most (but not all) MARC records carry these datestamps, either in the 005 field or in the first six characters of the 008 field. If neither date is available, the current date is arbitrarily used for the record.
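The datestamp fallback described above can be sketched roughly as follows. This is an illustration in Python, not the ZMARCOPopulator code itself (which is a Windows program built on VBZOOM); the function name and the two-digit-year handling (Python's own %y pivot) are assumptions.

```python
from datetime import datetime

def last_transaction_datestamp(field_005, field_008, now=None):
    """Return (datestamp, generated) for one MARC record.

    field_005: content of MARC 005 (yyyymmddhhmmss.f), or None if absent.
    field_008: content of MARC 008 (positions 00-05 are yymmdd), or None.
    generated is True when neither field gave a usable date.
    """
    if field_005:
        try:
            # Use the first 14 characters: yyyymmddhhmmss
            return datetime.strptime(field_005[:14], "%Y%m%d%H%M%S"), False
        except ValueError:
            pass  # malformed 005, fall through to 008
    if field_008:
        try:
            # 008/00-05 has a two-digit year; %y applies Python's pivot
            # (an assumption: 69-99 -> 19xx, 00-68 -> 20xx)
            return datetime.strptime(field_008[:6], "%y%m%d"), False
        except ValueError:
            pass  # blank or malformed 008/00-05
    # No usable date in the record: use the current date and flag it
    return (now or datetime.now()), True
```

The second return value corresponds to the idea of recording that a date was generated rather than read from the record.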

In order to populate the ZMARCO database, some assumptions (which may not be universally true) had to be made. First, the Z39.50/MARC catalog must support queries for the year of publication (BIB-1 USE attribute 31), and all records of interest must have a valid year of publication which can be queried. Second, the Z39.50/MARC catalog must support queries for the local number (BIB-1 USE attribute 12, also found in the MARC 001, Control Number field). Third, the Z39.50 server must not arbitrarily limit the number of hits allowed for a single query; if a query returns 125,362 hits, the server must make all 125,362 of those records available for presentation. A simple Z39.50 client program, ZMARCOPopulator.exe, relies on these assumptions to sequentially harvest all records within a range of publication years. It then pulls out the MARC 001 field and the 005 or 008/00-05 field and adds them to the database. It also stores the publication year used to find the records, so that the year can be used as a set parameter for OAI requests. Once ZMARCOPopulator.exe has populated the ZMARCO database, the database can be used by the actual OAI provider to make all the indexed records in the Z39.50/MARC catalog available for harvesting. To keep the ZMARCO database current, ZMARCOPopulator.exe will need to be completely re-run periodically.
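As a rough sketch of the year-by-year harvest described above, the loop might look like the following Python. The search callable is hypothetical and stands in for whatever the Z39.50 client layer provides; the query uses the PQF notation understood by YAZ, and whether ZMARCOPopulator builds its queries exactly this way is an assumption.

```python
def year_query(year):
    """PQF query on publication year (BIB-1 USE attribute 31)."""
    return f"@attr 1=31 {year}"

def harvest(search, start_year, end_year):
    """Yield one row per record, counting down from start_year to end_year.

    `search` is a hypothetical callable: it takes a query string and
    returns an iterable of dicts keyed by MARC tag ("001", "005", "008").
    """
    for year in range(start_year, end_year - 1, -1):
        for rec in search(year_query(year)):
            yield {
                "ControlNumber": rec["001"],
                "field_005": rec.get("005"),   # may be missing
                "field_008": rec.get("008"),   # may be missing
                "PublicationYear": year,       # later usable as an OAI set
            }
```

Because each year is a separate query, an interrupted run can be resumed from the year that was in progress.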

Currently, all of the above assumptions are hard-coded into ZMARCOPopulator.exe. However, since the source code is available, it should be relatively easy for a programmer to modify the code to support the nuances of any particular Z39.50/MARC catalog. Basic changes to the programs should not require any detailed knowledge of either Z39.50 or MARC. All of the code for handling these is encapsulated in an easy-to-use ActiveX DLL called VBZOOM, which is used by both ZMARCOPopulator.exe and the actual ZMARCO OAI provider ASP scripts. We hope eventually to make the program more flexible by expressing the various assumptions in configuration files, so that non-programmers can also modify the behavior of the program.

Using ActiveX Data Objects (ADO and ADOX), the ZMARCOPopulator.exe will automatically create an Access database the first time it is run. The database consists of a single table which is described below. For testing purposes, we have used the Access database to harvest and provide about 4 million MARC records from the University of Illinois' online catalog, and the performance was acceptable. However, for high usage and frequent simultaneous harvesting scenarios, using a more robust database such as Oracle or SQL Server would be preferred. This can be accomplished with some fairly simple modifications to the database connection strings used in the programs. However, if this is done, the database will need to be created and set up manually before the ZMARCOPopulator.exe is run for the first time.

Database details

Records Table
Column:               ControlNumber       LastTransactionDateStamp        PublicationYear               Deleted         GeneratedDate
Datatypes:            Text (20)           Date/Time                       Number (Integer)              Yes/No          Yes/No
Indexed:              Unique Primary Key  Non-Unique Index                Non-Unique Index              Not Indexed     Not Indexed
Bib-1 Use Attribute:  12                  1011 or 1012                    31                            N/A             N/A
MARC Field:           001                 005 or 008/00-05                008/07-10 or 008/11-14        N/A             N/A
OAI Use:              Unique Identifier   Selective Harvesting Datestamp  Selective Harvesting SetSpec  Deleted Status  N/A
Note: The 005 or 008/00-05 fields may be missing from some MARC records, in which case the LastTransactionDateStamp is set to the date/time that the record was added to this database. If this is done, the GeneratedDate column will be set to 'Yes'. The GeneratedDate column is currently informational only, and is not used by the OAI provider.
Note: The Deleted column is currently not used. The deletedRecord element of the Identify verb is always set to 'no'. However, during the periodic repopulation of this database, it would be possible to flag records that appear to have been deleted. This might be done in a future version of this tool.
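
To illustrate how a table of this shape supports selective harvesting, here is a minimal sketch using Python's built-in sqlite3 module as a stand-in for the Access database. The function names, index names, and the choice to store datestamps as ISO 8601 text are assumptions made for the example, not details of the ZMARCO scripts.

```python
import sqlite3

def create_records_table(conn):
    """Create a Records table mirroring the columns described above."""
    conn.execute("""
        CREATE TABLE IF NOT EXISTS Records (
            ControlNumber            TEXT PRIMARY KEY,
            LastTransactionDateStamp TEXT NOT NULL,   -- ISO 8601, sorts as text
            PublicationYear          INTEGER,
            Deleted                  INTEGER NOT NULL DEFAULT 0,
            GeneratedDate            INTEGER NOT NULL DEFAULT 0
        )""")
    conn.execute("CREATE INDEX IF NOT EXISTS idx_stamp "
                 "ON Records (LastTransactionDateStamp)")
    conn.execute("CREATE INDEX IF NOT EXISTS idx_year "
                 "ON Records (PublicationYear)")

def select_identifiers(conn, from_=None, until=None, set_year=None):
    """Return control numbers matching OAI-style from/until/set filters."""
    sql = "SELECT ControlNumber FROM Records WHERE 1=1"
    params = []
    if from_ is not None:
        sql += " AND LastTransactionDateStamp >= ?"
        params.append(from_)
    if until is not None:
        sql += " AND LastTransactionDateStamp <= ?"
        params.append(until)
    if set_year is not None:
        sql += " AND PublicationYear = ?"
        params.append(set_year)
    sql += " ORDER BY ControlNumber"
    return [row[0] for row in conn.execute(sql, params)]
```

The two non-unique indexes match the two columns that selective harvesting filters on: the datestamp (from/until) and the publication year (set).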

ZMARCOPopulator.exe details

Following is an image of the ZMARCOPopulator user interface:

Status Fields
The two top-most text fields are read-only status fields. The top, larger field displays continuing messages as the program runs. These include the number of records found and stored for each year, as well as any error or warning messages. The text in this field is saved to a text file named OMZPopLog.txt when the program exits.

The second, smaller status field displays information about the record currently being processed. This information will probably scroll past too fast to read conveniently, but it does let you see whether the program is still working.
Start Button
Just what it says, start processing. Once processing starts, the only way to stop is to exit the program. If you need to stop the program but don't want to restart from scratch, take note of the year that is currently being harvested, and then use that year as the Starting Year when the program is restarted.
Z39.50 -- host:port/database Field
This field must contain the host name or IP address and port number of the Z39.50 server. It must also contain the database name which contains the records to be harvested. The expected syntax is host:port/database. NOTE: There is currently no provision to support user ids or passwords for the Z39.50 server. Once again some simple modifications to the code could allow this -- maybe a future addition.
Starting Year and Ending Year Fields
These fields must contain the range of publication years to be harvested from the Z39.50 server. The program counts down from the Starting Year to the Ending Year, inclusive, so the Starting Year must be greater than the Ending Year. Only valid 4-digit numbers will be accepted in these fields.
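
The input rules for the host:port/database field and the two year fields can be sketched as follows. This is a hedged Python illustration of the validation described above; the function names and error messages are hypothetical, and the real program may check its inputs differently.

```python
def parse_target(target):
    """Split 'host:port/database' into (host, port, database)."""
    hostport, slash, database = target.partition("/")
    host, colon, port = hostport.rpartition(":")
    ok = (slash and colon and host and database
          and port.isascii() and port.isdigit())
    if not ok:
        raise ValueError(f"expected host:port/database, got {target!r}")
    return host, int(port), database

def parse_year_range(start_text, end_text):
    """Validate the Starting Year and Ending Year fields."""
    for text in (start_text, end_text):
        # Only 4-digit numbers are accepted
        if not (len(text) == 4 and text.isascii() and text.isdigit()):
            raise ValueError(f"not a 4-digit year: {text!r}")
    start, end = int(start_text), int(end_text)
    # The program counts down, so start must be the later year
    if start <= end:
        raise ValueError("Starting Year must be greater than Ending Year")
    return start, end
```

For example, parse_target("z3950.example.org:210/voyager") would return the host name, the port as an integer, and the database name; the host and database shown here are placeholders.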

IIS ASP Scripts

The second component of ZMARCO is the actual OAI data provider. These are various Active Server Page scripts written in VBScript and JScript. They parse out and handle the various OAI requests, interfacing with the ZMARCO database or the Z39.50 server as required.

File details:

global.asa
Initializes several global, application-level objects and variables that are needed by the other scripts, such as database, XML, and Z39.50 objects, and the configuration parameters for them.
functions.inc
This file contains various functions that are needed by several of the other scripts. It is included in the scripts that need access to the functions. The functions cover tasks such as parsing OAI identifiers, generating UTC datestamps, and creating and parsing resumption tokens. They should mostly be reusable across many different OAI implementations, and are not specific to the ZMARCO implementation.
header.inc
This file contains a function for writing the standard header element for an OAI record. It is included in other scripts that need this ability.
metadata.inc
This file contains functions for writing the actual metadata in various formats. It contains the code for retrieving records from the Z39.50 server and transforming them into OAI metadata, such as Dublin Core or MARC XML.
OAI.asp
This is the base script to which all requests come. This script writes the standard OAI wrapper elements, checks parameters for validity, returns error elements as needed, and then dispatches the request to one of the other scripts depending on the OAI verb. This script is mostly independent of any particular implementation details and could be used as the base script for other OAI implementations, not just the ZMARCO implementation.
GetRecord.asp
Identify.asp
ListIdentifiers.asp
ListMetadataFormats.asp
ListRecords.asp
ListSets.asp
Each of the above scripts handles a different OAI verb, just as their names imply. These scripts will access the ZMARCO database and the Z39.50 server as needed to respond to different requests. The responses to some of these verbs are essentially hardcoded into the scripts, such as the Identify.asp or ListMetadataFormats.asp scripts. The other scripts may require a fair amount of program logic to access the database and the Z39.50 server.
MARC21slim2DC.xsl
XSLT that transforms MARC21 XML into the old OAI version 1.1 Dublin Core.
MARC21slim2MODS.xsl
XSLT that transforms MARC21 XML into the Metadata Object Description Schema (MODS) format.
MARC21slim2OAI_DC.xsl
XSLT that transforms MARC21 XML into the new OAI version 2.0 Dublin Core.
MARC21slim2OAI_MARC.xsl
XSLT that transforms MARC21 XML into the old OAI version 1.1 MARC XML.
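
The kind of parameter checking and verb dispatching attributed to OAI.asp above can be sketched as follows. This is a Python illustration, not the ASP code: the argument sets and error codes come from the OAI-PMH 2.0 specification, while the function name, the dict-based request, and the returned script names are assumptions for the example.

```python
# Arguments per verb in OAI-PMH 2.0: (required, optional)
VERBS = {
    "Identify":            (set(), set()),
    "ListMetadataFormats": (set(), {"identifier"}),
    "ListSets":            (set(), set()),
    "GetRecord":           ({"identifier", "metadataPrefix"}, set()),
    "ListIdentifiers":     ({"metadataPrefix"}, {"from", "until", "set"}),
    "ListRecords":         ({"metadataPrefix"}, {"from", "until", "set"}),
}
# Verbs that accept resumptionToken as an exclusive argument
RESUMABLE = {"ListSets", "ListIdentifiers", "ListRecords"}

def dispatch(params):
    """Return (script, None) on success or (None, error_code) on failure.

    `params` is a dict of request arguments; repeated arguments are
    not modelled in this sketch.
    """
    verb = params.get("verb")
    if verb not in VERBS:
        return None, "badVerb"
    args = set(params) - {"verb"}
    if "resumptionToken" in args:
        # resumptionToken must be the only argument besides the verb
        if verb in RESUMABLE and args == {"resumptionToken"}:
            return verb + ".asp", None
        return None, "badArgument"
    required, optional = VERBS[verb]
    if not (required <= args) or not (args <= (required | optional)):
        return None, "badArgument"
    return verb + ".asp", None
```

Keeping the per-verb rules in one table is what makes a base script of this kind reusable: only the handlers behind each verb are implementation-specific.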

Supported metadata formats:

oai_dc
This is the simple Dublin Core metadata required by the OAI 2.0 standard.
oai_dc_1.1
This is the simple Dublin Core metadata required by the old OAI 1.1 standard.
marc21
This is the MARC 21 XML format defined by the Library of Congress standards office.
oai_marc
This is the MARC XML format defined by the old OAI 1.1 standard.
mods
This is the Metadata Object Description Schema format defined by the Library of Congress standards office.
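
One way to picture how the metadata prefixes relate to the stylesheets listed earlier is a simple lookup table. The pairing below is inferred from the file descriptions in this README and is an assumption, as is the idea that marc21 needs no transformation; the actual logic lives in metadata.inc and may differ. The sketch is in Python rather than the ASP scripting languages.

```python
# metadataPrefix -> stylesheet applied to the MARC21 XML record.
# None means the MARC21 XML is assumed to be returned untransformed.
STYLESHEETS = {
    "marc21":     None,
    "oai_dc":     "MARC21slim2OAI_DC.xsl",
    "oai_dc_1.1": "MARC21slim2DC.xsl",
    "oai_marc":   "MARC21slim2OAI_MARC.xsl",
    "mods":       "MARC21slim2MODS.xsl",
}

def stylesheet_for(prefix):
    """Return the stylesheet name for a prefix, or raise for unknown ones.

    The message uses the OAI-PMH error code for an unsupported format.
    """
    if prefix not in STYLESHEETS:
        raise KeyError("cannotDisseminateFormat")
    return STYLESHEETS[prefix]
```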

Questions, Comments, or Bug Reports can be sent to Tom Habing.

Thanks for trying our code!