Good morning
I have inherited some legacy code to convert criminal arrest literals to NCIC offense codes. The problem at hand is that it takes several hours, 11 hours, to do the conversion on 200,000 records. This code is made up of several thousand

IF INDEX(arrest_string, "substring") > 0

statements. I have been given the task of researching better approaches to the conversion process, possibly rewriting all 20,000 lines of code. I have thought about SOUNDEX or METAPHONE, but do not want to go that route if the outcome is still going to take close to the same time, or the results will not be a great match. Right now, we get about a 95% match on everything, but I would like to increase this if at all possible. I am wondering what others have used in the past, or are currently using, to do a similar process?
At 12:18 PM 5/12/2011, JKRockStomper wrote:
>I have inherited some legacy code to convert criminal arrest literals to
>NCIC offense codes; the problem at hand is that it takes several hours, 11
>hours, to do the conversion on 200,000 records. This code is made up of
>several thousand
>
>IF INDEX(arrest_string, "substring") > 0 statements.

It sounds like "arrest_string" is a single string listing one or more offenses, as English terms or descriptions, and you wish to find all of the 20,000(?) NCIC codes that correspond to an offense described in the string.

If you have a SINGLE offense description with no preceding or following text, RECODE is much faster than an IF chain:

RECODE description
   ('literal01'    = 1)
   ('literal02'    = 2)
   ('literal20000' = 20000)
   (ELSE           = 99999)
  INTO NCIC.

(Replace the 'literals' by the descriptions you have, and the numbers by the corresponding NCIC codes.)

In the string, are the descriptions uniformly separated by commas, or otherwise parseable into the individual descriptions? Could you post a few examples of strings listing more than one offense? If so, something like this should work well; here, allowing for at most 5 offenses per arrest-string, and descriptions no longer than 20 letters (and not tested):

STRING  #Describe1 TO #Describe5 (A20).
NUMERIC NCIC1 TO NCIC5 (F5).
*  Blank out the description fields -- this is important .
*  when using scratch variables:                         .
RECODE #Describe1 TO #Describe5 (ELSE=' ').

<Code to separate the descriptions from "arrest_string"
 into five or fewer descriptions>

RECODE #Describe1 TO #Describe5
   (' '            = SYSMIS)
   ('literal01'    = 1)
   ('literal02'    = 2)
   ('literal20000' = 20000)
   (ELSE           = 99999)
  INTO NCIC1 TO NCIC5.

If you can do something like this, it should run a good deal faster. (The NCIC codes will come out in their order in the arrest string. If you want some other order, that'll take some additional code.)

>Right now, we get about a 95% match on everything but I would like
>to increase this if at all possible.
It often works well to refine your process progressively. That is, look at what descriptions come out as 99999 (unrecognized); see how many occur repeatedly, and whether you can identify by eye what NCIC code they should have; and add those to the RECODE list. (There is no problem, in RECODE, with having any number of different literals translating to the same NCIC code.)

-Best of luck,
 Richard Ristow

=====================
To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L. For a list of commands to manage subscriptions, send the command INFO REFCARD.
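[Editorial aside: Richard's "refine progressively" loop — tally the descriptions that fall through to 99999 and promote the frequent ones into the RECODE list — can be sketched outside SPSS with a simple frequency count. A Python 3 illustration; the sample pairs are invented, not real output:]

```python
from collections import Counter

# (ncic_code, description) pairs as they might come out of a RECODE pass;
# 99999 marks "unrecognized".  Invented sample data for illustration.
coded = [
    (2404,  "THEFT"),
    (99999, "USE/UNDER INFL CONTRLD SU"),
    (99999, "USE/UNDER INFL CONTRLD SU"),
    (99999, "INF CORP INJ,SPOUSE/COHAB"),
]

# Tally the unrecognized descriptions, most frequent first.
misses = Counter(desc for code, desc in coded if code == 99999)
for desc, n in misses.most_common():
    # Eyeball these and add the frequent ones to the RECODE list.
    print(n, desc)
```

The point is that the descriptions worth adding first are exactly the ones at the top of this tally.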
In reply to this post by JKRockStomper
Please give a little more detail.
Please provide a small mock-up of the situation, i.e., a small number of cases with the type of variables that the real data has.

What does the input data look like? Is there a set of variables that have strings in them, each of which would result in a single code? Are there strings each of which would result in several codes? Are there several different strings that would yield the same code? Is there an order to the variables? Is there an order to the contents of the strings? Are the substrings entered consistently, i.e., could a substring be Robbery, robbery, robry, etc.?
Art Kendall
Social Research Consultants
In reply to this post by JKRockStomper
JR,
Perhaps you have already considered this, but an alternative that others have suggested for this sort of thing is to use a look-up file. Basically, a file of criminal arrest literals and NCIC offense codes is treated as a 'table' in a MATCH FILES command.

One important problem in this method is the overhead of creating the look-up file. However, if you aggregated a current criminal arrest literals file and then ran the current syntax, the resulting file would be the base of the look-up.

This method won't deal with misspellings or new criminal arrest literals. The current system doesn't either. Thus the look-up file has to be constantly updated. On the other hand, so does the current code.

Since you seem to have large numbers and, perhaps, a regular succession of files to recode, you might be able to analyze the structure of failed matches for 'consistent' structural features and use those features, if they are also unique, for matching in a second pass.

Gene Maguin
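[Editorial aside: Gene's look-up file amounts to an exact-match dictionary keyed on already-coded literals. In Python 3 terms it reduces to a `dict.get`; the table contents and codes below are invented for illustration:]

```python
# Historical table: literal -> NCIC code, built once from the records the
# legacy syntax has already coded.  Illustrative values only.
lookup = {
    "001 COUNTS OF THEFT": 2404,
    "TRANSPORT/SELL NARC CONT SUB F": 3562,
}

def code_record(literal, table):
    """Return the known NCIC code, or None for a failed match that
    needs a second pass (or a human)."""
    return table.get(literal)

print(code_record("001 COUNTS OF THEFT", lookup))   # 2404
print(code_record("NEW UNSEEN LITERAL", lookup))    # None
```

The failed matches (None) are the ones to analyze for "consistent structural features" in the second pass Gene describes.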
JR,
Someone on this list turned me on to a program available for free from the CDC called Link Plus. It does either Soundex or NYSIIS fuzzy matching. It's designed for matching on name and demographic data, but there is a field called generic string; you could use that. It's really easy. I don't know how much more accurate NYSIIS is than Soundex or Metaphone. Here's the link:

http://www.cdc.gov/cancer/npcr/tools/registryplus/lp.htm

Thanks
Matt

Matthew Pirritano, Ph.D.
Research Analyst IV
Medical Services Initiative (MSI)
Orange County Health Care Agency
(714) 568-5648
In reply to this post by Art Kendall
Art
Usually the arrest field is whatever the police officer enters into the computer. Here are some nicer examples:
001 COUNTS OF UNDER INFLUENCE CNTL SUB
001 COUNTS OF VANDALISM -$400
001 COUNTS OF PETTY THEFT RETAIL/ETC
001 COUNTS OF THEFT
001 COUNTS OF USE/UNDER INFL CONTRLD SU
BENCH WARRANT #476195 / 001 COUNTS OF INF CORP INJ,SPOUSE/COHAB - F
BENCH WARRANT #616250 / 001 COUNTS OF DRIVE,LIC SUSPENDED/ETC - M
BENCH WARRANT #476195 / 001 COUNTS OF INFLICT CRPL INJ SP/COHAB - F
TRANSPORT/SELL NARC CONT SUB F
DISORDERLY CONDUCT PROSTITUTION M
POSS COCAINE BASE SALE WRT 410341 - F
DISTURB PEACE WRT #410341 - M
001 COUNTS OF POSS-PUR COKE BASE F-SALE - F
001 COUNTS OF REC KNWN STOLN PROP $400 PLUS - F
01 COUNTS OF POSSESS BURGLARY TOOLS
001 COUNTS OF VIO ORD,PREVNT DOMES VIOL M
BENCH WARRANT #129503 / 001 COUNTS OF POS/PUR F/SALE NARC/C/SUB - F
00067F PREVENT/DISUADE WITNESS/VICTIM F/REPORTING CRIM
09611F PROBATION VIOLATION
BENCH WARRANT #469182 / 001 COUNTS OF FORCE/ADW NOT FIREARM,GBI - F
BENCH WARRANT #476195 INFLICT CRPL INJ SP/COHAB F
001 COUNTS OF POSS/PUR COKE BASE F/SALE - F
001 COUNTS OF POSS/PUR COKE BASE F/SALE-F
001 COUNTS OF DRIVE,LIC SUSPENDED/ETC M
The current code looks something like this to figure out a few of the NCIC codes.
IF INDEX(arstliteral,"TREAS0N") >0 AND INDEX(arstliteral,"MISPRIS0N") =0 NCIC =0101.
IF INDEX(arstliteral,"TREASON") >0 AND INDEX(arstliteral,"MISPRISON") >0 NCIC =0102 .
IF INDEX(arstliteral,"TREAS0N") >0 AND INDEX(arstliteral,"MISPRIS0N") >0 NCIC =0102.
IF INDEX(arstliteral,"ESPION") >0 NCIC =0103.
IF INDEX(arstliteral,"ESPI0N") >0 NCIC =0103.
IF INDEX(arstliteral,"SABOT") >0 NCIC =0104 .
IF INDEX(arstliteral,"SAB0T") >0 NCIC =0104 .
IF INDEX(arstliteral,"SEDIT") >0 NCIC =0105 .
IF INDEX(arstliteral,"SELECT") >0 AND INDEX(arstliteral,"SERV") >0 NCIC=0106 .
IF INDEX(arstliteral,"SELECT") >0 AND INDEX(arstliteral,"SRVC") >0 NCIC=0106.
IF INDEX(arstliteral,"SOVER") >0 NCIC =0199.
IF INDEX(arstliteral,"S0VER") >0 NCIC =0199.
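[Editorial aside: many of the duplicated lines above exist only because of 0-for-O substitutions. Normalizing the string once before matching would collapse each such pair into one rule. A Python 3 sketch of the idea, with invented codes — note it also rewrites genuine digits (counts, warrant numbers), which is harmless when only alphabetic keywords are searched afterwards:]

```python
def normalize(literal):
    """Uppercase and fold the digit 0 to the letter O, so that
    'TREAS0N' and 'TREASON' hit the same rule.  Caveat: real digits
    (counts, warrant numbers) are rewritten too."""
    return literal.upper().replace("0", "O")

# keyword -> NCIC code; codes invented for illustration
rules = {"TREASON": 101, "ESPION": 103, "SABOT": 104}

def match(literal):
    s = normalize(literal)
    return [code for key, code in rules.items() if key in s]

print(match("001 counts of TREAS0N"))   # [101]
```

One normalization step replaces a whole family of duplicated IF lines.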
At 02:26 PM 5/12/2011, JKRockStomper wrote:
>Usually the arrest field is whatever the police officer enters into
>the computer. Here are some nicer examples:
>
>001 COUNTS OF UNDER INFLUENCE CNTL SUB
>001 COUNTS OF VANDALISM -$400
>001 COUNTS OF USE/UNDER INFL CONTRLD SU
>[...]

One thing that may be going for you is that each charge seems to begin with "001 COUNTS OF", presumably allowing for more than one count as well. That should let you separate the text into the individual charges.

>The current code looks something like this to figure out a few of
>the NCIC codes.
>
>IF INDEX(arstliteral,"TREAS0N") >0
>   AND INDEX(arstliteral,"MISPRIS0N") =0 NCIC =0101.
>IF INDEX(arstliteral,"TREASON") >0
>   AND INDEX(arstliteral,"MISPRISON") >0 NCIC =0102 .
>IF INDEX(arstliteral,"TREAS0N") >0
>   AND INDEX(arstliteral,"MISPRIS0N") >0 NCIC =0102.
>[...]

I trust that the actual code is laid out more like the first few lines that I've reformatted; the completely wrapped code you sent is very difficult indeed to read, let alone debug. (Notice that the test for "TREASON" and "MISPRISON" occurs twice in the code you sent.)

It does seem you're interested in groups of offenses. Laying aside the question of parsing into individual charges, the following would run noticeably faster than a sequence of IFs, and I think be easier to understand as well:

DO IF   INDEX(arstliteral,"TREAS0N") >0.
.  DO IF   INDEX(arstliteral,"MISPRIS0N") =0 .
.     COMPUTE NCIC = 0101.
.  ELSE.
.     COMPUTE NCIC = 0102.
.  END IF.
END IF.

Calls to the INDEX function are probably slow, and taking up much of your time. The above code has at most two INDEX calls, and in many cases only one. The corresponding IF statements,

IF INDEX(arstliteral,"TREAS0N") >0
   AND INDEX(arstliteral,"MISPRIS0N") =0 NCIC =0101.
IF INDEX(arstliteral,"TREASON") >0
   AND INDEX(arstliteral,"MISPRISON") >0 NCIC =0102 .

have four INDEX calls for every case.

-Onward,
 Richard
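[Editorial aside: the saving Richard describes — nest the tests so each case pays for as few INDEX calls as possible — can be demonstrated with an instrumented stand-in for INDEX. Python 3, purely illustrative; per Richard's post, SPSS evaluates both operands of AND, so the flat version forces both calls:]

```python
calls = 0

def index(haystack, needle):
    """Stand-in for SPSS INDEX, counting invocations; 0 means 'not found'."""
    global calls
    calls += 1
    return haystack.find(needle) + 1

literal = "001 COUNTS OF THEFT"
ncic = None

# Flat IF chain: both operands of each AND are evaluated -> 4 calls.
calls = 0
a, b = index(literal, "TREAS0N"), index(literal, "MISPRIS0N")
if a > 0 and b == 0:
    ncic = 101
a, b = index(literal, "TREASON"), index(literal, "MISPRISON")
if a > 0 and b > 0:
    ncic = 102
flat_calls = calls

# Nested DO IF: the inner test runs only when the outer one succeeds -> 1 call.
calls = 0
if index(literal, "TREAS0N") > 0:
    ncic = 101 if index(literal, "MISPRIS0N") == 0 else 102
nested_calls = calls

print(flat_calls, nested_calls)   # 4 1
```

Multiplied over 200,000 records and thousands of rules, that per-case difference is where the hours go.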
In reply to this post by JKRockStomper
JR
This is actually a text mining problem, and I would suggest SPSS Text Mining is a route to explore. I think there are potential benefits in added flexibility and accuracy to be had, and I am sure your 11-hour run would be much reduced.

Regards
Garry Gelade
In reply to this post by JKRockStomper
It appears that you have semi-structured data, so nothing is likely to work 100%, but there are a number of approaches you could take. One would be to set this up in Text Mining for Surveys, or using Modeler Premium or another text mining product. That would be the most general solution. But it is not necessarily good at spelling corrections unless you build a dictionary with spelling variations for a specialized vocabulary.
Another approach would be a coding scheme for the text that attempts to extract a canonicalized phonetic value for each word and then match on those. There are several algorithms in common use for this. You mentioned Soundex. That is the most primitive, albeit still popular, but it is designed for names - and English ones at that - and there are better choices. NYSIIS is a more sophisticated such algorithm, although it was also designed mainly for names. These are both available in the extendedTransforms.py programmability module and can be used most easily with the SPSSINC TRANS extension command. You would just encode each word and then match each against a lookup table of similarly transformed terms. The vlookup class in that same module is a convenient way of doing that.

Perhaps the best approach, if you are really just looking for fuzzy matches with spelling errors, would be to use a difference metric for each word against a lookup table of terms and take the closest, without the somewhat artificial phonetic schemes used for names. This is more like what a spelling checker would do. There are two algorithms for string differencing also available in the extendedTransforms module. Levenshtein distance calculates the number of permutations and substitutions of the text required to match two strings. Jaro-Winkler distance is a more sophisticated version that weights different parts of a word differently. What's best tends to vary depending on how the errors were originally produced, which is in part a function of the data entry method.

Using these is a little more complicated, because you should ideally compare each term in the data with everything in your lookup dictionary, or some approximation thereof, but it would not be too hard to set up using programmability.
HTH,
Jon Peck
Senior Software Engineer, IBM
[hidden email]
new phone: 720-342-5621
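[Editorial aside: the Levenshtein distance Jon mentions is small enough to sketch directly. A minimal Python 3 version with a closest-term lookup; the term list is a made-up stand-in for a real lookup dictionary:]

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance
    (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

terms = ["ROBBERY", "BURGLARY", "FORGERY"]   # stand-in lookup dictionary

def closest(word):
    """Return the dictionary term with the smallest edit distance."""
    return min(terms, key=lambda t: levenshtein(word, t))

print(levenshtein("ROBRY", "ROBBERY"))   # 2
print(closest("ROBRY"))                  # ROBBERY
```

As Jon notes, the naive version compares each word against the whole dictionary, so some pruning or caching is worthwhile at 200,000 records.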
In reply to this post by Maguin, Eugene
Gene
This is what is currently being done. Every time the code is run, the code is updated with new strings. But you did give me a new approach to this problem: since I have the full strings with appropriately matched NCIC codes, I could just build a huge dictionary of full-string matches to compare with the new strings, updating this dictionary every time a new string is encountered. This might be the least intensive approach, since I have 20 years of arrest strings and their corresponding matches already.
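[Editorial aside: that 20-year history, kept as a full-string dictionary and appended to whenever a new literal is coded by hand, might look like the following Python 3 sketch. The filename and codes are hypothetical; the csv module handles the commas that occur inside literals like "INF CORP INJ,SPOUSE/COHAB":]

```python
import csv
import os

HISTORY = "ncic_history.csv"   # hypothetical file: one "literal,code" row each

def load_history(path):
    """Read the accumulated literal -> NCIC dictionary; empty if absent."""
    if not os.path.exists(path):
        return {}
    with open(path, newline="") as f:
        return dict(csv.reader(f))

def append_history(path, new_pairs):
    """Append literals coded by hand this run, so the next run knows them."""
    with open(path, "a", newline="") as f:
        csv.writer(f).writerows(new_pairs.items())

history = load_history(HISTORY)
history.update({"001 COUNTS OF THEFT": "2404"})   # e.g. after manual review
```

Each run then only has to hand-code the genuinely new strings, and everything coded once stays coded.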
In reply to this post by Richard Ristow
Richard
The reason why the code appears to be duplicated is because the officer sometimes enters a zero instead of the letter "O"; this is the case with other letters in the alphabet as well. I have thought about using DO IF instead of the current IF statements for the same reason you mention. This will take a ton of time to recode, but I am sure it would reduce the run time and help with figuring out whether there is duplicate code. Currently, whenever a new instance of a string appears that is not assigned the appropriate NCIC number, a new IF statement just gets added to the syntax. I agree that this might not be the best solution, because we do get a lot of duplicate IF statements.
In reply to this post by Jon K Peck
Jon
What you talk about is something I was considering when I mentioned Soundex or Metaphone. I have considered turning every word into some sort of code and then looking up the codes. My thought on using Soundex was that, since sometimes there are misspellings, the Soundex result might overlook these errors and reduce the number of code statements.
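[Editorial aside: for reference, American Soundex produces only a four-character code (first letter plus three digits), so it is coarse, but a standard Python 3 implementation is short — and because it drops non-letters, it happens to absorb the 0-for-O typo seen in this thread's data:]

```python
def soundex(word):
    """American Soundex: first letter kept, remaining letters mapped to
    digit classes; H and W do not break a run of equal codes."""
    codes = {}
    for digit, letters in enumerate(("BFPV", "CGJKQSXZ", "DT", "L", "MN", "R"), 1):
        for ch in letters:
            codes[ch] = str(digit)
    word = "".join(ch for ch in word.upper() if ch.isalpha())
    if not word:
        return ""
    result = word[0]
    prev = codes.get(word[0], "")
    for ch in word[1:]:
        code = codes.get(ch, "")
        if code and code != prev:
            result += code
        if ch not in "HW":
            prev = code
    return (result + "000")[:4]

print(soundex("TREASON"), soundex("TREAS0N"))   # T625 T625
print(soundex("ROBBERY"), soundex("ROBERY"))    # R160 R160
```

The downside is the flip side of that tolerance: many distinct words collapse to the same four characters, which is why coarse phonetic codes can hurt match precision.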
In reply to this post by JKRockStomper
Hello,
The code below reads a lookup table (a csv based on your sps file; should be easy to create), pre-compiles all the needed regexes, and stores them in a fast data structure. Next, all the records in the dataset are compared with all the regexes, and an NCIC code + line number is written to another csv.

The terms in the lookup table are interpreted as regular expressions. This offers flexibility, but you need to watch out for metacharacters. For instance, "." means any character in regex speak, so you'll want to write "\." instead. The question mark can be nicely used to specify optional chars, e.g. usuall?y stands for 'usually' and 'usualy'. The code works using this mini-sample.

from __future__ import with_statement   # only Python 2.5
import re, csv

# could be any iterable
lines = ["001 COUNTS OF UNDER INFLUENCE CNTL SUB 001 COUNTS OF VANDALISM",
         " -$400 001 COUNTS OF PETTY THEFT RETAIL/ETC 001 COUNTS OF THEFT 001",
         " -$400 001 COUNTS OF PETTY THEFT RETAIL/ETC 001 COUNTS OF ESPION 001",
         "001 COUNTS OF UNDER MISPRIS0N TREAS0N COUNTS OF VANDALISM"]

def compileRegexes(lookupCsv):
    """ Create a fast lookup table for pattern matches.
    Compiles a set of regexes of the form set([(ncic, compiled_regex)]) """
    sniffer = csv.Sniffer()
    dialect = sniffer.sniff(open(lookupCsv).readline())
    reader = csv.reader(open(lookupCsv), delimiter=dialect.delimiter)
    header = reader.next()
    ncic = header.index("NCIC")    # must be very first column
    term1 = header.index("TERM1")
    compiledRegexes = set()
    for line in reader:
        search_terms = []
        # add TERM1 thru TERMn
        for col in range(len(header)):
            if col >= term1 and line[col]:
                search_terms.append(line[col])
        search_terms = sorted(set(search_terms))
        compiledRegexes.update([(line[ncic], re.compile("|".join(search_terms)))])
    return compiledRegexes

def deriveNCICs(lookupCsv, resultCsv, lines):
    """ Compare every line with every regex pattern in the lookup table;
    if the line matches, spit out the NCIC code """
    compiledRegexes = compileRegexes(lookupCsv)
    with open(resultCsv, "wb") as f:
        writer = csv.writer(f)
        for lino, line in enumerate(lines):
            for item in compiledRegexes:
                ncic, regex = item[0], item[1]
                m = re.findall(regex, line)
                if m:
                    #print "%04d. %s <--> %s" % (lino+1, ", ".join(m), ncic)
                    writer.writerow([lino+1] + [ncic])
                    break
    print "Done!"

# lookup table has columns NCIC TERM1 TERM2 ...
# Terms may be specified as regexes; metacharacters may need to be escaped.
# This will make a lot of comparisons! n x m.
deriveNCICs(lookupCsv="d:/temp/lookup-table.csv",
            resultCsv="d:/temp/result.csv",
            lines=lines)

Albert-Jan
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
All right, but apart from the sanitation, the medicine, education, wine, public order, irrigation, roads, a fresh water system, and public health, what have the Romans ever done for us?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In reply to this post by Richard Ristow
Richard was up late last night thinking about this problem. Here is a copy of his email that he sent me:
Excuse me, but you've got me thinking about this.

At 02:26 PM 5/12/2011, JKRockStomper wrote:

>The current code looks something like this to figure out a few of
>the NCIC codes.

This has great potential for bugs, some that may go undetected for years.

>IF INDEX(arstliteral,"TREAS0N") >0
>   AND INDEX(arstliteral,"MISPRIS0N") =0 NCIC =0101.
>IF INDEX(arstliteral,"TREASON") >0
>   AND INDEX(arstliteral,"MISPRISON") >0 NCIC =0102 .
>IF INDEX(arstliteral,"TREAS0N") >0
>   AND INDEX(arstliteral,"MISPRIS0N") >0 NCIC =0102.
>
>IF INDEX(arstliteral,"ESPION") >0 NCIC =0103.
>IF INDEX(arstliteral,"ESPI0N") >0 NCIC =0103.
>
>IF INDEX(arstliteral,"SABOT") >0 NCIC =0104 .
>IF INDEX(arstliteral,"SAB0T") >0 NCIC =0104 .
>
>IF INDEX(arstliteral,"SEDIT") >0 NCIC =0105 .
>
>IF INDEX(arstliteral,"SELECT") >0
>   AND INDEX(arstliteral,"SERV") >0 NCIC=0106 .
>IF INDEX(arstliteral,"SELECT") >0
>   AND INDEX(arstliteral,"SRVC") >0 NCIC=0106.
>IF INDEX(arstliteral,"SOVER") >0 NCIC =0199.
>IF INDEX(arstliteral,"S0VER") >0 NCIC =0199.

First, do you really have all those tests in your code twice, or is that a glitch in how you extracted it to send to us? If you're worried about speed and size, duplicating tests wastes both to no purpose.

Second, the logic you posted returns *one* NCIC code: that for the last criterion met. Is that what you want? Or do you actually do something else? You pointed out that one arrest can match multiple NCIC codes.

Third, if you have a misspelled keyword in your code, the only way you'll catch it is by noticing that some NCIC code never turns up. And misspelled keywords in the arrest report won't be caught at all -- see, for example, two abbreviations of "controlled substance" in the data you sent:

>001 COUNTS OF UNDER INFLUENCE CNTL SUB
>001 COUNTS OF USE/UNDER INFL CONTRLD SU

-Best of luck. But I'm worried,
 Richard
Richard
The reason it appears there is duplicate code in the syntax I provided is that sometimes the officer misspells the word. An example is entering a zero instead of the letter "O" in the word "treason". I have talked to the previous coders, and whenever they encountered this in the past they just entered another line of IF statements.

On your second point, you point out a vital flaw in the code. Yes, it does assign the last NCIC code it encounters. This has always been a problem, and the solution in the past has been to manually look at a sample of the outcome and re-assign values to the arrest strings. I believe there is a better solution out there than this approach.

Third, as previously stated, whenever a new misspelling is encountered, a new line of code has always been added. This is just how it has been done for years, but now that I have inherited the project, I feel that this approach, while it has worked in the past, might not be the most efficient.
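[Editor's note] The zero-for-"O" problem described above can be handled once, up front, instead of by doubling every rule: normalize the common keying errors before matching, so "TREAS0N" and "TREASON" need only a single rule. A minimal Python sketch (note the caveat: this also mangles genuine digits such as warrant numbers, so apply it only to the text used for keyword matching):

```python
# Hypothetical cleanup step: map the digit zero onto the letter O before
# keyword matching. Extend the replacements as new keying errors turn up.
def normalize(arrest_string):
    return arrest_string.upper().replace("0", "O")
```

With this in place, every `IF INDEX(...,"TREAS0N")` duplicate in the legacy syntax could be deleted, roughly halving the rule count.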
In reply to this post by JKRockStomper
I would suggest using NYSIIS rather than Soundex, but the two string-distance functions would likely work better if you can isolate the terms of interest. If the terms are just mixed into the string, then I think building a dictionary of terms and codes would work better and would certainly be much easier to maintain.
The steps if you use programmability would be:

1. Create an SPSS dataset listing each term of interest and its code. (Decide whether to keep terms in upper or lower case.) Updates then just mean adding a new case.

2. Using the SPSSINC TRANS extension command, use the INITIALIZE subcommand to load the table, and use a snippet of code to split the text field and/or do any other prefiltering, case conversion, word breaking, etc. That code would attempt a lookup in the table for each word and return the code if found; otherwise it would return None, which would become sysmis.

I can't say how fast this would run compared to the current system -- where the runtime seems excessive for the problem -- but it would be vastly easier to validate and maintain. The advantage of vlookup is that the table does not require any ordering, and values can be found any number of times, so if you are currently spending a lot of time sorting, that becomes unnecessary, especially with field extraction from a longer string.

Jon Peck
Senior Software Engineer, IBM
[hidden email]
new phone: 720-342-5621

From: JKRockStomper <[hidden email]>
To: [hidden email]
Date: 05/13/2011 09:24 AM
Subject: Re: [SPSSX-L] String Matches
Sent by: "SPSSX(r) Discussion" <[hidden email]>

Jon

What you describe is something I was considering when I mentioned Soundex or Metaphone: turning every word into some sort of code and then looking up the codes. My thought on using Soundex was that, since there are sometimes misspellings, the Soundex codes might absorb these errors and reduce the number of code statements.

--
View this message in context: http://spssx-discussion.1045642.n5.nabble.com/String-Matches-tp4390718p4393409.html
Sent from the SPSSX Discussion mailing list archive at Nabble.com.

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the command.
To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD
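[Editor's note] The word-lookup Jon describes would run as a Python snippet inside SPSSINC TRANS; the core idea can be sketched in plain Python. `TERM_CODES` is a hypothetical term-to-code dictionary you would build from your own offense list:

```python
import re

# Hypothetical sample of the term->code dictionary (step 1 above).
TERM_CODES = {
    "TREASON": 101,
    "ESPIONAGE": 103,
    "SABOTAGE": 104,
    "SEDITION": 105,
}

def lookup_codes(arrest_string):
    """Break the string into words and return the code for each known term.
    Unknown words are skipped (in SPSS they would come back as sysmis)."""
    words = re.findall(r"[A-Z]+", arrest_string.upper())
    return [TERM_CODES[w] for w in words if w in TERM_CODES]
```

As Jon notes, the payoff is maintainability: a new term or misspelling is one new dictionary entry, not a new line of IF syntax.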
In reply to this post by JKRockStomper
I think I have developed a game plan to accomplish this for the time being. It combines everything everybody has suggested.
1) Using past arrest strings, develop a list of unique arrest strings and run the legacy syntax on them to create the NCIC codes. Use this outcome to build a dictionary file of unique arrest strings and their corresponding NCIC codes.

2) When new data comes in, SORT the arrest strings and MATCH the new file with the dictionary file. The unmatched strings will have sysmis values.

3) Run the strings with sysmis values through the legacy syntax to find NCIC codes for them. Create code to add these new arrest strings and codes to the dictionary.

4) Repeat steps 2 and 3 as necessary.

Over time this would produce a huge dictionary file, but the MATCH command is much faster than our current setup. Plus, there should be fewer unmatched strings each time, so the run time will keep shrinking.

How does this sound?
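[Editor's note] The dictionary-plus-fallback loop in this plan can be sketched in Python terms. In SPSS the dictionary step would be SORT CASES plus MATCH FILES against a table file; here a plain dict plays the dictionary file and `legacy_classify` stands in for the 20,000 lines of legacy IF syntax:

```python
def classify_batch(arrest_strings, dictionary, legacy_classify):
    """dictionary: known arrest_string -> NCIC code.
    legacy_classify: slow fallback, called only for unseen strings."""
    results = {}
    for s in arrest_strings:
        if s in dictionary:              # step 2: match against the dictionary
            results[s] = dictionary[s]
        else:                            # step 3: fall back to legacy syntax
            code = legacy_classify(s)
            dictionary[s] = code         # and grow the dictionary for next time
            results[s] = code
    return results
```

This is a classic memoization pattern: the expensive classifier runs once per unique string, ever, which is exactly why the 190,000-unique-out-of-1,000,000 figure reported later in the thread is encouraging.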
Well, I took 1 million arrest strings from the past and found that there are only 190 thousand unique strings. This makes me think that my proposed setup is the right way to go. I am going to try it out and report back to you all on Monday after I run the 190 thousand through the process.
In reply to this post by JKRockStomper
At 01:42 PM 5/13/2011, JKRockStomper wrote:
>I think I have developed a game plan to accomplish this for the time
>being. It uses a combination of everything everybody is saying.
>
>1) Using the past arrest string. Develop a list of unique arrest
>string and using the legacy syntax create the NCIC codes. Use this
>outcome data to develop a dictionary of unique arrest strings and
>corresponding NCIC codes.
>
>2) When new data comes in, SORT the arrest string and MATCH the new
>file with the dictionary file. The unmatch strings will have sysmis values.

One thing you don't talk about, though you may be handling it: an arrest string can include SEVERAL charges. I'm not clear where one arrest string ends and the next begins in your sample data, but this appears to be one that includes several charges:

001 COUNTS OF INFLICT CRPL INJ SP/COHAB - F TRANSPORT/SELL NARC CONT SUB F DISORDERLY CONDUCT PROSTITUTION M POSS COCAINE BASE SALE WRT 410341 - F DISTURB PEACE WRT #410341 - M

And some strings have material that is not relevant for identifying the charge, such as references to bench warrants:

001 COUNTS OF USE/UNDER INFL CONTRLD SU BENCH WARRANT #476195 /

Before you build or use a dictionary, especially using exact-match techniques like RECODE or MATCH FILES, you'll need logic to separate multiple charges and remove incidental text. But I dare say you're already doing this.

-Best of luck to you,
Richard
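[Editor's note] The pre-processing Richard calls for -- stripping incidental text and splitting multi-charge strings -- might start like the sketch below. The patterns are guesses based on the two sample strings in this thread, not a tested parser, and real data would surely need more of them:

```python
import re

def clean_charges(arrest_string):
    """Strip warrant references and the leading 'NNN COUNTS OF', then
    split on ' - ' into individual charge fragments."""
    text = arrest_string.upper()
    # drop bench-warrant references like "BENCH WARRANT #476195"
    text = re.sub(r"\b(BENCH\s+)?WARRANT\s*#?\s*\d+", " ", text)
    # drop a leading count prefix like "001 COUNTS OF "
    text = re.sub(r"^\s*\d+\s+COUNTS\s+OF\s+", " ", text)
    parts = [p.strip(" /-") for p in text.split(" - ")]
    return [p for p in parts if p]
```

Only after a step like this does an exact-match dictionary lookup stand a chance, since "USE/UNDER INFL CONTRLD SU BENCH WARRANT #476195 /" and "USE/UNDER INFL CONTRLD SU" would otherwise be two different dictionary keys.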
Hi all,
Here is yet another "how do I tell SPSS to do this" type of question. I have a file that looks like this:

ID  Number
 1    1
 1    2
 1    3
 1    4
 2    1
 2    2
 3    1
 3    2
 3    3
 3    4
 3    5
 4    1

etc. For some IDs, Numbers range from 1 to 103. I need to be able to eliminate all IDs that don't have at least one Number GE 3. I was thinking that I could compute a MaxNum, which is the maximum Number (occurrence count) for each ID. Using the above data, my intermediate result would look like this:

ID  Number  MaxNum
 1    1       4
 1    2       4
 1    3       4
 1    4       4
 2    1       2
 2    2       2
 3    1       5
 3    2       5
 3    3       5
 3    4       5
 3    5       5
 4    1       1

I could then SELECT IF MaxNum GE 3 and save my reduced file. Would someone either:

1) share code that can do this, or
2) suggest a better option?

Many thanks!
Carol
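[Editor's note] Carol's own plan is sound; in SPSS it would typically be AGGREGATE with MODE=ADDVARIABLES and BREAK=ID to attach MaxNum=MAX(Number) to every row, followed by SELECT IF MaxNum GE 3. The same logic in a small Python sketch, using her sample data:

```python
def filter_by_max(rows, threshold=3):
    """rows: list of (id, number) pairs. Attach each ID's maximum number
    to its rows and keep only IDs whose maximum meets the threshold."""
    max_by_id = {}
    for rec_id, num in rows:                      # pass 1: max per ID
        max_by_id[rec_id] = max(num, max_by_id.get(rec_id, num))
    return [(rec_id, num, max_by_id[rec_id])      # pass 2: filter
            for rec_id, num in rows if max_by_id[rec_id] >= threshold]
```

Two passes are needed because a row early in a group can't know the group's maximum until the whole group has been seen -- the same reason the SPSS version uses AGGREGATE before SELECT IF.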