SPSSX Discussion

substring counts

pspangler1

substring counts

I have a variable, String1, with 1,900 string records. I would like to count how many times each of them occurs within String2, which has 800,000 different string records. I have been looking at COUNT, but I believe I need to index on String1? A past conversation mentioned some Python code to solve this, but I never saw the code posted. Any suggestions?

data list/String1 String2
begin data
'female red shoes' (a20) 'female red shoes for sale' (a80)
'black hat' (a20) 'black hat for sale' (a80)
'mens necktie' (a20) 'mens necktile for sale' (a80)
end data.

This email may contain confidential information for the sole use of the intended recipient(s). If you are not an intended recipient, please notify the sender and delete all copies immediately.

Richard Ristow

Re: substring counts

At 05:00 PM 3/19/2014, Peter Spangler wrote:

>I have a variable, String1, of 1,900 string records. I would like to
>count their occurrence within String2 that is 800,000 different
>string records.

First of all, I take it that String1 and String2 are in *different*
files -- that your DATA LIST (which, anyhow, doesn't work) has
nothing to do with what you actually have.

So (if I'm right about what you mean) you want to test for the
occurrence of each of your 1900 String1 values in each of the 800,000
String2 values. That's 1.52E9 tests, which is going to strain even a
large modern-day computer.

You want to, first, generate all 1.52E9 pairs of String1 and String2
values as records in a combined file. That's a Cartesian product of
the inputs; I've just posted a macro that does that, but it's
optimized for matching based on a key value, and for your purposes I
recommend the extension command STATS CARTPROD (1).

At that, you'll probably have to make several runs, each with a piece
of your 1900 String1 values.

Once you have a pair of values, one String1 and one String2, in the
same record, the test is

COMPUTE IS_IN = INDEX(String2,String1) NE 0.
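
For readers who want to see the pairwise test outside SPSS, here is a minimal sketch in plain Python. It is not the STATS CARTPROD route described above: it loops over the pairs instead of writing them out as records. The function name `count_occurrences` and the list inputs are assumptions made for illustration, as is the stripping of trailing blanks.

```python
def count_occurrences(string1_values, string2_values):
    """Count, for each String1 value, the String2 values that contain it."""
    # Trailing blanks are stripped on the assumption that fixed-width
    # string values may arrive padded; drop this if yours are not.
    haystacks = [s.rstrip() for s in string2_values]
    counts = {}
    for raw in string1_values:
        needle = raw.rstrip()
        # `needle in haystack` plays the role of INDEX(String2,String1) NE 0.
        counts[needle] = sum(1 for haystack in haystacks if needle in haystack)
    return counts


# Sample values taken from the original post.
string1 = ["female red shoes", "black hat", "mens necktie"]
string2 = [
    "female red shoes for sale",
    "black hat for sale",
    "mens necktile for sale",
]
counts = count_occurrences(string1, string2)
print(counts)
```

With the sample values, 'mens necktie' gets a count of 0 because the posted String2 value reads 'necktile'. The loop still performs all 1900 x 800,000 tests on the full data, so it avoids storing the pairs but not the comparisons.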

=======================================
(1) See
Date: Tue, 4 Mar 2014 14:32:29 -0700
From: Jon K Peck <[hidden email]>
Subject: News from the SPSS Community
To: [hidden email]

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD

MaxJasper

RE: substring counts

In reply to this post by pspangler1

To enjoy life, you might wish to isolate each string into its components, then replace them with NOMINAL labels. For example:

'female red shoes for sale' replaced by:

· Gender: {0=F, 1=M}

· Color: {0=Black, 1=Red, ...}

· Sale: {0=No, 1=Yes}
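
The coding idea above could be sketched in Python roughly as follows. The numeric codes come from the list above; the keyword lists ('female', 'mens', 'black', 'red', 'sale') and the function name `encode` are assumptions made for illustration, and a real coding scheme would need a much fuller vocabulary.

```python
# Hypothetical keyword-to-code tables; the codes follow the scheme above.
GENDER_CODES = {"female": 0, "mens": 1}   # 0=F, 1=M
COLOR_CODES = {"black": 0, "red": 1}      # 0=Black, 1=Red


def encode(text):
    """Replace a free-text description with nominal codes."""
    words = text.lower().split()
    # Take the first word that appears in each table; None means "not found".
    gender = next((GENDER_CODES[w] for w in words if w in GENDER_CODES), None)
    color = next((COLOR_CODES[w] for w in words if w in COLOR_CODES), None)
    sale = 1 if "sale" in words else 0    # 0=No, 1=Yes
    return {"gender": gender, "color": color, "sale": sale}


coded = encode("female red shoes for sale")
print(coded)
```

Once every record is coded this way, the counting becomes a comparison of small numeric fields instead of a substring search.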


David Marso

Re: substring counts

Administrator

In reply to this post by pspangler1

See INDEX function.
Why would you believe Python is a solution to this simple problem?
Sorry, I'm getting out of the habit of writing code for people.
See INDEX in the manual and try something for yourself.

Please reply to the list and not to my personal email.
Those desiring my consulting or training services please feel free to email me.
---
"Nolite dare sanctum canibus neque mittatis margaritas vestras ante porcos ne forte conculcent eas pedibus suis."
Cum es damnatorum possederunt porcos iens ut salire off sanguinum cliff in abyssum?"

David Marso

Re: substring counts

Administrator

In reply to this post by Richard Ristow

I would personally rip everything apart into single words, aggregate them and match with a table.
Rebuild from matches of the TABLE.
Won't write the code for free.
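
One possible reading of the word-table idea, sketched in Python and not taken from the poster: split every String2 value into words, keep a table from each word to the rows that contain it, and use that table to narrow the candidates for each String1 phrase before confirming the full phrase. All names here (`build_word_table`, `count_with_word_table`) are hypothetical.

```python
from collections import defaultdict


def build_word_table(string2_values):
    """Map each word to the set of String2 row numbers that contain it."""
    table = defaultdict(set)
    for row_id, text in enumerate(string2_values):
        for word in set(text.split()):
            table[word].add(row_id)
    return table


def count_with_word_table(string1_values, string2_values):
    table = build_word_table(string2_values)
    counts = {}
    for phrase in string1_values:
        words = phrase.split()
        if not words:
            counts[phrase] = 0
            continue
        # Rows that contain every word of the phrase, in any order.
        candidates = set.intersection(*(table.get(w, set()) for w in words))
        # Confirm that the words appear together as the full phrase.
        counts[phrase] = sum(1 for r in candidates if phrase in string2_values[r])
    return counts


string1 = ["female red shoes", "black hat", "mens necktie"]
string2 = [
    "female red shoes for sale",
    "black hat for sale",
    "mens necktile for sale",
]
word_counts = count_with_word_table(string1, string2)
print(word_counts)
```

This version matches on whole words only, so a phrase such as 'shoe' would not be found inside 'shoes', whereas a plain substring test would find it. Whether that difference matters depends on the data.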

Rich Ulrich

Re: substring counts

In reply to this post by pspangler1

Are all the matches this precise and systematic?
Then: Starting with String1 in one file and String2 in another,
create a variable for IsString1 (1=yes, 0=no), and
make a file that has IsString1, String, like this --

1 female red shoes
0 female red shoes for sale
.
.
.

Sort by String (A) IsString1 (D) so that each String1 value comes first
in the group of lines that start with the same words.

Use Lag and Index to count the number of matches.
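
The sort-and-lag steps above can be sketched in Python as follows. This is an illustration under the same condition as the question that opens this reply: every match is a String1 value at the very start of a String2 value. The function name `count_prefix_matches` is an assumption, and the `current` variable stands in for the lagged String1 value.

```python
def count_prefix_matches(string1_values, string2_values):
    """Count String2 values that begin with each String1 value."""
    # Tag each record: 1 = from the String1 file, 0 = from the String2 file.
    rows = [(s, 1) for s in string1_values] + [(s, 0) for s in string2_values]
    # Sort by String ascending, then IsString1 descending.
    rows.sort(key=lambda r: (r[0], -r[1]))
    counts = {s: 0 for s in string1_values}
    current = None  # most recent String1 value seen (the "lag")
    for text, is_string1 in rows:
        if is_string1:
            current = text
        elif current is not None and text.startswith(current):
            # startswith corresponds to INDEX(String, lagged String1) = 1.
            counts[current] += 1
    return counts


string1 = ["female red shoes", "black hat", "mens necktie"]
string2 = [
    "female red shoes for sale",
    "black hat for sale",
    "mens necktile for sale",
    "black hat",
]
prefix_counts = count_prefix_matches(string1, string2)
print(prefix_counts)
```

This needs one sort and one pass through the combined file, with no pairwise comparison. The sketch compares each line only with the most recent String1 value, so if one String1 value is itself the start of another (say 'black' and 'black hat'), the shorter one would be undercounted.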

--
Rich Ulrich
