readme w e winkler 990226 9a
[ Edited by Abowd and Vilhuber, 2005-04-07 ]
Preliminaries
This example contains executables (programs) that can be run on an
Itanium running Linux. The original version was for DOS/Windows and is
still available. Differences between the versions are pointed out
below. If you wish to run on other computers (e.g. Mac OS X), you will
need to recompile the programs (source code is available in
winkler-mt_src.zip; 64-bit clean code is available from John Abowd or
Lars Vilhuber).
To use this example, you will need to know how to copy and rename
files, to use a text editor, and to run programs in the Linux shell or
in the MS-DOS shell under Windows 95, 98, or NT. Your mileage in the GUI
may vary.
See the Tips and Tricks page for help on working on Linux systems.
How to run the gdriver standardizer
The purpose of the program is to append a standardized address and a
standardized name to the end of each record in a file.
Inputs
Configuration and primary data
The program has three main inputs.
- parmn.txt:
gives the name of the input file.
- parmdr.txt:
gives the start location and length of the name field, the start
location and length of the address field, and the logical record length
(number of column positions) in the input file.
- test.dat:
the input file in this example. Make sure that all records have the
same length.
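The requirement that all records have the same length can be checked before running gdriver. The following is a minimal sketch in Python, not part of the standardizer itself; the function names are hypothetical, and it assumes that records are separated by line terminators that do not count toward the record length.

```python
from collections import Counter


def record_lengths(path):
    """Count how many records have each length (line terminators excluded)."""
    counts = Counter()
    with open(path, "rb") as f:
        for raw in f:
            # Strip only CR/LF so that trailing blanks still count as columns.
            counts[len(raw.rstrip(b"\r\n"))] += 1
    return counts


def has_uniform_length(path):
    """Return True when every record in the file has the same length."""
    return len(record_lengths(path)) <= 1
```

Calling record_lengths("test.dat") returns a mapping from record length to number of records; more than one key means the file needs fixing, and the single key should agree with the logical record length given in parmdr.txt.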
Auxiliary data
The name standardization subportion has a large number of tables that
need to be present in the directory where gdriver is run:
- BLNK.DAT
- CONJ.DAT
- JRSR.DAT
- NICK.DAT
- PATT.DAT
- PREF.DAT
- SKPL.DAT
- SUFF.DAT
- THRE.DAT
- TITL.DAT
- TWO.DAT
The address standardization subportion has three tables that must be
present in the directory where gdriver is run:
- PATTERNS.DAT
- MASTER1.DAT
- CLUE_A1.DAT
The file initfile.dat must also be present. It has three lines, each
giving the name of one of the three address standardization tables, in
the order listed above.
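Since gdriver needs all of these files in its working directory, a quick presence check can save a failed run. The sketch below is one way to do this in Python; the function name is hypothetical, and the file names are assumed to be spelled (including upper and lower case, which matters on Linux) exactly as in this readme.

```python
from pathlib import Path

# File names as listed in this readme; the exact case is an assumption.
CONFIG_AND_DATA = ["parmn.txt", "parmdr.txt", "test.dat", "initfile.dat"]
NAME_TABLES = [
    "BLNK.DAT", "CONJ.DAT", "JRSR.DAT", "NICK.DAT", "PATT.DAT", "PREF.DAT",
    "SKPL.DAT", "SUFF.DAT", "THRE.DAT", "TITL.DAT", "TWO.DAT",
]
ADDRESS_TABLES = ["PATTERNS.DAT", "MASTER1.DAT", "CLUE_A1.DAT"]


def missing_files(workdir="."):
    """Return the required files that are not present in workdir."""
    workdir = Path(workdir)
    required = CONFIG_AND_DATA + NAME_TABLES + ADDRESS_TABLES
    return [name for name in required if not (workdir / name).is_file()]
```

An empty list from missing_files("standardizer") means that every file named in this readme was found in that directory.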
Outputs
- sumnp:
A summary output of the name standardization process. The summary file
also gives the locations of the standardized components of the address
and of the name.
- stanout.txt:
The main output file. The standardized address is in the first set of
columns after the columns of the original input file, and the
standardized name is in the second set of columns.
- otherout.txt:
Secondary output, produced when names of the form "Jane and John Smith"
are processed (in this example, the file will be empty).
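Because the standardized columns follow the original input columns, each record of stanout.txt can be split at the logical record length of the input file. The Python sketch below illustrates this; the function names and the latin-1 encoding are assumptions, and the positions of the individual standardized components inside the appended part have to be taken from sumnp rather than guessed.

```python
def split_record(line, lrecl):
    """Split one output record into (original columns, appended columns)."""
    record = line.rstrip("\r\n")
    return record[:lrecl], record[lrecl:]


def read_appended(path, lrecl):
    """Yield the appended standardized columns of every record in the file."""
    # latin-1 is an assumed encoding; it decodes any byte without error.
    with open(path, "r", encoding="latin-1", newline="") as f:
        for line in f:
            yield split_record(line, lrecl)[1]
```

Here lrecl is the logical record length that parmdr.txt specifies for the input file.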
Running the standardizer
- Unzip the contents of standardizer-clean-data.zip. It will create a
sub-directory called "standardizer" containing the input data and the
auxiliary data.
- Unzip the appropriate standardizer for your platform according to
the table below. The files will expand into the same subdirectory.
- Ensure all the primary, parameter, and auxiliary data files are present.
- Click on gdriver, or preferably,
  - open a command shell (Start->Run... and type 'cmd' followed by
    ENTER),
  - navigate to that directory ('cd Desktop\standard'),
  - run the standardizer by typing 'gdriver'. You will get additional
    diagnostic output that gets lost if you double-click on the file.
- Analyze the output. You may then modify the input files to see
whether the standardizer can handle your own data.
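The diagnostic output mentioned in the steps above can also be captured from a script. The following Python sketch runs gdriver inside the data directory and returns what it printed; the function name and the default "./gdriver" path are assumptions for a Linux setup, and the sketch assumes that gdriver takes no command-line arguments, as when it is started by typing 'gdriver'.

```python
import subprocess


def run_standardizer(workdir, command=("./gdriver",)):
    """Run the standardizer in workdir; return (exit code, diagnostic text)."""
    result = subprocess.run(
        list(command),
        cwd=workdir,          # gdriver looks for its tables in this directory
        capture_output=True,  # keep the diagnostics instead of losing them
        text=True,
    )
    return result.returncode, result.stdout + result.stderr
```

For example, run_standardizer("standardizer") runs the program in the unzipped sub-directory, and the returned text can be saved next to sumnp for later inspection.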
Compiling from source
Andrey Baliakin has cleaned up some of Bill Winkler's original code.
The source code is known to compile using GCC; your mileage with other
compilers and on other systems may vary. Some problems with addressing
are known on ia64 systems.
To compile:
- Download and untar the sources (standardizer-source.tgz).
- Edit the makefile. The default compiler is GCC; change the options
to reflect your setup.
- Compile.
- Test the binary. You'll want standardizer-clean-data.zip, which
contains all pattern files and some very simple example data and
configuration files that you can use to check that your compiled
version works.
- Please send us binaries for any platforms not mentioned above, so
that we can provide them to the wider community.