readme w e winkler 990226 9a
[ Edited Abowd and Vilhuber, 2005-04-07 ]

Preliminaries

This example contains executables (programs) that can be run on an Itanium running Linux. The original version was for DOS/Windows and is still available. Differences between the versions are pointed out below. If you wish to run on other computers (for instance, Mac OS X), you will need to recompile the programs (source code is available in winkler-mt_src.zip; 64-bit-clean code is available from John Abowd or Lars Vilhuber).

To use this example, you will need to know how to copy and rename files, how to use a text editor, and how to run programs from a Linux shell or from the MS-DOS shell under Windows 95, 98, or NT. Your mileage in the GUI may vary.

See the Tips and Tricks page for help on working on Linux systems.

How to run the gdriver standardizer

The purpose of the program is to append a standardized address and a standardized name to the end of each record in a file.

Inputs

Configuration and primary data

The program has three main inputs.

parmn.txt - gives the name of the input file.
parmdr.txt - gives the start location and length of the name field, the start location and length of the address field, and the logical record length (number of column positions) of the input file.
test.dat - the input file itself, as named in parmn.txt. Make sure that all records have the same length.
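
The requirement that all records have the same length can be checked before running the standardizer. The following is a minimal Python sketch and is not part of the distribution. The function name record_lengths is hypothetical, and the sketch assumes one record per line, with the line terminator not counted in the record length.

```python
from collections import Counter


def record_lengths(path):
    """Return a Counter mapping record length -> number of records.

    Assumption: one record per line; trailing CR/LF characters are
    stripped and do not count toward the length.
    """
    counts = Counter()
    with open(path, "rb") as f:
        for line in f:
            counts[len(line.rstrip(b"\r\n"))] += 1
    return counts
```

Calling record_lengths("test.dat") should return a single entry if the file is well formed. More than one entry means some records are shorter or longer than the rest and should be fixed before running gdriver.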

Auxiliary data

The name standardization subportion has a large number of tables that need to be present in the directory where gdriver is run:

BLNK.DAT
CONJ.DAT
JRSR.DAT
NICK.DAT
PATT.DAT
PREF.DAT
SKPL.DAT
SUFF.DAT
THRE.DAT
TITL.DAT
TWO.DAT

The address standardization subportion has three tables that must be present in the directory where gdriver is run.

PATTERNS.DAT
MASTER1.DAT
CLUE_A1.DAT

You also need the file initfile.dat. It has three lines, each giving the name of one of the address tables, in the order listed above.
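
Since gdriver expects all of these files in its working directory, a quick presence check can save a failed run. The sketch below is an illustration only. The function name missing_files is hypothetical, and the file names are taken exactly as spelled in this README; on Linux the comparison is case-sensitive, so adjust the spelling if your copies differ.

```python
import os

# File names as listed in this README.
NAME_TABLES = [
    "BLNK.DAT", "CONJ.DAT", "JRSR.DAT", "NICK.DAT", "PATT.DAT", "PREF.DAT",
    "SKPL.DAT", "SUFF.DAT", "THRE.DAT", "TITL.DAT", "TWO.DAT",
]
ADDRESS_TABLES = ["PATTERNS.DAT", "MASTER1.DAT", "CLUE_A1.DAT"]
CONFIG_FILES = ["parmn.txt", "parmdr.txt", "initfile.dat"]


def missing_files(directory="."):
    """Return the required files that are not present in `directory`."""
    required = NAME_TABLES + ADDRESS_TABLES + CONFIG_FILES
    return [
        name for name in required
        if not os.path.isfile(os.path.join(directory, name))
    ]
```

An empty list from missing_files("standardizer") means every table and parameter file named above was found. The input data file itself is not checked, because its name comes from parmn.txt.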

Outputs

sumnp: A summary output of the name standardization process. The summary file also gives the locations of the standardized components of the address and of the name.
stanout.txt: The main output file. The standardized address is in the first set of columns after the original record, and the standardized name is in the second set of columns.
otherout.txt: Secondary output, written when names of the form "Jane and John Smith" are processed (in this example, the file will be empty).
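
Because the standardized fields follow the original columns, a record in stanout.txt can be split at the logical record length given in parmdr.txt. This is a rough sketch under that assumption. The function name split_output_record is hypothetical, and the exact positions of the individual standardized components must still be read from sumnp.

```python
def split_output_record(line, record_length):
    """Split one output line into (original record, appended fields).

    Assumption: the first `record_length` columns are the unchanged
    input record and everything after them was added by the standardizer.
    """
    line = line.rstrip("\r\n")
    return line[:record_length], line[record_length:]
```

For example, with a logical record length of 80, split_output_record(line, 80) returns the original 80 columns and, separately, the appended address and name columns.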

Running the standardizer

Unzip the contents of standardizer-clean-data.zip. This creates a sub-directory called "standardizer" containing the input data and the auxiliary data.

Unzip the appropriate standardizer for your platform according to the table below.

Platform                           Archive
DOS or Windows (32bit or 64bit)    standardizer-win32.zip
Linux 32bit native                 standardizer-linux-i386.tgz
Linux 64bit (Itanium) native       standardizer-linux-ia64.tgz

These will expand into the same subdirectory.

Ensure all the primary, parameter, and auxiliary data files are present.
Double-click gdriver or, preferably, run it from a command shell:
- open a command shell (Start->Run..., type 'cmd', and press ENTER)
- navigate to that directory ('cd Desktop\standard')
- run the standardizer by typing 'gdriver'. This shows additional diagnostic output that is lost if you double-click the file.
Analyze the output. Then try modifying the input files to see whether the standardizer can handle your own data.
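
To keep the diagnostic output for later inspection, the run can also be scripted. The following is a minimal sketch using Python's standard subprocess module. The function name run_and_log and the log file name in the usage note are hypothetical, and the sketch assumes gdriver takes no command-line arguments, as in the steps above.

```python
import subprocess


def run_and_log(command, workdir, log_path):
    """Run `command` inside `workdir` and save its output to `log_path`.

    Standard output and standard error are written to the same log file.
    `log_path` is resolved relative to the caller's directory, not `workdir`.
    Returns the exit code of the program.
    """
    with open(log_path, "w") as log:
        result = subprocess.run(
            command, cwd=workdir, stdout=log, stderr=subprocess.STDOUT
        )
    return result.returncode
```

On Linux, a call might look like run_and_log(["./gdriver"], "standardizer", "gdriver.log"). On Windows, giving the full path to the gdriver executable is the safer choice.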

Compiling from source

Andrey Baliakin has cleaned up some of Bill Winkler's original code. The source code is known to compile using GCC; your mileage with other compilers and on other systems may vary. Some problems with addressing are known on ia64 systems.

To compile:

Download and untar the sources (standardizer-source.tgz)
Edit the makefile. The default compiler is GCC; change the options to reflect your setup.
Compile
Test the binary. You'll want standardizer-clean-data.zip. It contains all pattern files, plus some very simple example data and configuration files that you can use to check that your compiled version works.
Please send us builds for any new platforms not mentioned above, so that we can provide them to the wider community.