readme w e winkler 000112 3p
(modified Abowd and Vilhuber, 2005-03-29)

Preliminaries

This example contains executables (programs) that can be run on an Itanium running Linux. The original version was for DOS/Windows and is still available. Differences between the versions are pointed out below. If you wish to run on other computers (e.g. Mac OS X), you will need to recompile the programs (source code is available in winkler-mt_src.zip; 64-bit clean code is available from John Abowd or Lars Vilhuber).

To use this example, you will need to know how to copy and rename files, to use a text editor, and to run programs in the Linux or MS-DOS shell under Windows 95, 98, or NT. Your mileage in the GUI may vary...

See the Tips and Tricks page for help on working on Linux systems.

Setup


A matching example

(More details of the programs are given at the end of this file.)

Input data

Two inputs, each sorted by cluster and first character of surname. Inputs must be sorted by the appropriate keys before the first matching pass.

Parameters

Running the code

First pass

  1. Create frequencies
    1. copy parmn1.txt to parmn.txt
    2. run cb1 to produce counter file cntcb.dat.
  2. Create probabilities
    1. rename/copy/move cntcb.dat to cntcb3.dat.
    2. replace lines 2-11 of cntcb3.dat with the 10 lines in c_parm.txt
    3. run eci to produce file of probabilities initi.dat
  3. Run first match pass
    1. rename/copy/move initi.dat as init.dat.
    2. rename/copy/move parmmf1.txt as parmmf.txt
    3. run matcher to get matching outputs and pointer file
  4. Create formatted output file
    1. run runprt to get printout print.dat
    2. rename print.dat as pr1.dat
  5. Determine high and low cutoffs
    1. Inspect pr1.dat
    2. You should find that (8.0, 1.19) are good values for the high and low cutoffs, respectively
    3. enter the two numbers as the first line of cutoff.dat
  6. Identify links, non-links, and clericals
    1. run resid2 to get r_sorta.dat & r_sortb.dat
    2. (If you prefer to use SAS instead of resid2.exe: run resid2.sas in SAS to get pub1ab.dat & pub1bb.dat)
  7. Save the parameter files
    1. rename/move/copy pointer file pntmf.dat as pntmf1.dat
    2. rename/move/copy parmn.txt as parmn1.txt
    3. rename/move/copy parmmf.txt as parmmf1.txt
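
As an illustration of steps 2, 5, and 6 above, here is a minimal shell sketch for Linux. The function names splice_counts and classify_weight are hypothetical and are not part of the distributed programs. The first rebuilds cntcb3.dat from cntcb.dat and c_parm.txt as described in step 2. The second shows how a matching weight relates to the two cutoffs, assuming that weights above the high cutoff are links, weights below the low cutoff are non-links, and everything in between is a clerical case. How resid2 treats a weight exactly equal to a cutoff is not stated in this file, so the boundary handling here is an assumption.

```shell
# Sketch of step 2: copy cntcb.dat to cntcb3.dat, replacing
# lines 2-11 with the 10 lines of c_parm.txt.
splice_counts() {
    counts=${1:-cntcb.dat}
    parms=${2:-c_parm.txt}
    out=${3:-cntcb3.dat}
    nparm=$(awk 'END { print NR }' "$parms")
    if [ "$nparm" -ne 10 ]; then
        echo "expected 10 lines in $parms, found $nparm" >&2
        return 1
    fi
    {
        head -n 1 "$counts"        # keep line 1
        awk '{ print }' "$parms"   # new lines 2-11
        tail -n +12 "$counts"      # keep line 12 onward
    } > "$out"
}

# Sketch of steps 5-6: classify one matching weight.
# Arguments: weight, high cutoff, low cutoff.
# Assumption: strict comparisons; ties fall into "clerical".
classify_weight() {
    awk -v w="$1" -v hi="$2" -v lo="$3" 'BEGIN {
        if (w + 0 > hi + 0)      print "link"
        else if (w + 0 < lo + 0) print "non-link"
        else                     print "clerical"
    }'
}
```

For example, "splice_counts cntcb.dat c_parm.txt cntcb3.dat" prepares the input for eci, and "classify_weight 9.5 8.0 1.19" reports how a weight of 9.5 compares with the cutoffs suggested in step 5.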

Second pass

  1. Prepare the data files:
  2. Prepare parameter files
    1. rename/move/copy parmmf2.txt as parmmf.txt
    2. rename/move/copy parmn2.txt as parmn.txt
  3. Run second match pass
    1. run matcher to get matching outputs and pointer file
  4. Create formatted output file
    1. run runprt to get printout print.dat
    2. rename/move/copy print.dat as pr2.dat
  5. Save the parameter files
    1. rename/move/copy pointer file pntmf.dat as pntmf2.dat
    2. rename/move/copy parmn.txt as parmn2.txt
    3. rename/move/copy parmmf.txt as parmmf2.txt
  6. Look at pr2.dat to determine whether any additional matches are found. If new matches are found, why were they not found on the first pass?
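
The copy-run-save pattern of steps 2 through 5 can be scripted. The following is a minimal sketch for Linux. The function name run_pass and its optional second argument (the directory holding the executables) are hypothetical, and the sketch assumes that matcher and runprt exist as executables in that directory and take no command-line arguments. For pass 1, init.dat must already have been created as described under "First pass".

```shell
# Sketch: run match pass N in the current directory.
# Usage: run_pass N [directory containing matcher and runprt]
run_pass() {
    n=$1
    bindir=${2:-.}
    # Put the pass-specific parameter files in place.
    cp "parmmf$n.txt" parmmf.txt || return 1
    cp "parmn$n.txt" parmn.txt || return 1
    # Run the matcher, then produce the formatted printout.
    "$bindir/matcher" || return 1
    "$bindir/runprt" || return 1
    # Save the printout and pointer file under pass-specific names.
    # parmmfN.txt and parmnN.txt were only copied from, so they are
    # still intact and need no separate saving in this sketch.
    cp print.dat "pr$n.dat" || return 1
    cp pntmf.dat "pntmf$n.dat"
}
```

With the executables in the current directory, "run_pass 2" would leave pr2.dat and pntmf2.dat ready for inspection.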

More details on programs

Look at read.cnt, read.em, & read.mt (drag-and-drop on Textpad). The following programs and scripts are provided:
Program        Function                                            Platform notes
cb1[.exe]      Create counts                                       C source code available, all platforms
eci[.exe]      EM algorithm                                        Fortran source available, all platforms
matcher[.exe]  SRD Matcher                                         C source code available, all platforms
prt1[.exe]     Create output from matcher                          C source code available, all platforms
resid2[.exe]   Extracts records from match output and input data   C++ source code available, all platforms
runprt[.bat]   Combines sort and prt1 to produce output            Script file
sort.exe       GNU sort                                            Provided as binary only for DOS platform. Compiled from GNU sources

resid2[.exe]

The program resid2[.exe] uses inputs parmf.txt, pntmf.dat, parmn.txt, pub1a.dat, and pub1b.dat to produce outputs r_sorta.dat and r_sortb.dat, where the names of the outputs are obtained from the names in parmn.txt.

resid2.sas

The SAS program resid2.sas performs the same functions, but outputs to pub1ab.dat and pub1bb.dat, which are already sorted.

sort.exe

This program is only provided for DOS and Windows. On Linux, use the native 'sort' command; see 'man sort' for options. They are both derived from the same GNU sources, although the DOS version provided here is much older.

The GNU sort program sort.exe works like the Unix sort. Only a few options are described below.

`sort' [-cmuV] [-t c] [-o `file'] [-T `dir'] [-bdfiMnr] [+n [-m] ...] [`file' ...]

`-o file'
Send output to `file' (overwriting).

`-r'
Sort in reverse.

How to specify the sort keys
============================

Keys are zero based, thus the first field has number 0, and so on.

`+num1.num2'
Start a new key at character num2 of field num1.

`-num1.num2'
Extend the key up to (not including) character num2 of field num1.
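
Newer versions of sort on Linux may not accept the zero-based `+num1.num2' and `-num1.num2' forms and expect the one-based `-k' option instead. The following sketch converts one old-style key pair into the `-k' form, using the equivalence given in the POSIX description of sort. The function name old_key_to_k is hypothetical, and the arguments are given without the leading `+' and `-'.

```shell
# Sketch: convert an old zero-based key pair (+start -end) into
# the one-based -k form.
# Usage: old_key_to_k START END    e.g. old_key_to_k 1.2 1.5
old_key_to_k() {
    start=$1
    end=$2
    sf=${start%%.*}              # start field (zero-based)
    sc=${start#*.}               # start character (zero-based)
    if [ "$sc" = "$start" ]; then sc=0; fi   # no ".num2" given
    ef=${end%%.*}                # end field (zero-based)
    ec=${end#*.}                 # end character (exclusive)
    if [ "$ec" = "$end" ]; then ec=0; fi
    if [ "$ec" -eq 0 ]; then
        # Key ends at the end of the preceding field.
        printf '%s\n' "-k $((sf + 1)).$((sc + 1)),$ef"
    else
        printf '%s\n' "-k $((sf + 1)).$((sc + 1)),$((ef + 1)).$ec"
    fi
}
```

For instance, the old key `+1.2 -1.5' could be passed to a newer sort as "sort $(old_key_to_k 1.2 1.5) -o sorted.dat input.dat", where sorted.dat and input.dat are placeholder file names.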
