I have a variable, String1, of 1,900 string records. I would like to count their occurrence within String2 that is 800,000 different string records. I have been looking at COUNT, but I believe I need to index on String1? There was past conversation on some Python code to solve this but I never saw a post for the code. Any suggestions?
data list/String1 String2
begin data
'female red shoes' (a20) 'female red shoes for sale' (a80)
'black hat' (a20) 'black hat for sale' (a80)
'mens necktie' (a20) 'mens necktile for sale' (a80)
end data.

This email may contain confidential information for the sole use of the intended recipient(s). If you are not an intended recipient, please notify the sender and delete all copies immediately.
At 05:00 PM 3/19/2014, Peter Spangler wrote:
>I have a variable, String1, of 1,900 string records. I would like to
>count their occurrence within String2 that is 800,000 different
>string records.

First of all, I take it that String1 and String2 are in *different* files -- that your DATA LIST (which, anyhow, doesn't work) has nothing to do with what you actually have.

So (if I'm right about what you mean) you want to test for the occurrence of each of your 1900 String1 values in each of the 800,000 String2 values. That's 1.52E9 tests, which is going to strain even a large modern-day computer.

You want to, first, generate all 1.52E9 pairs of String1 and String2 values as records in a combined file. That's a Cartesian product of the inputs; I've just posted a macro that does that, but it's optimized for matching based on a key value, and for your purposes I recommend the extension command STATS CARTPROD (1). At that, you'll probably have to make several runs, each with a piece of your 1900 String1 values.

Once you have a pair of values, one String1 and one String2, in the same record, the test is

COMPUTE IS_IN = INDEX(String2,String1) NE 0.

=======================================
(1) See
Date: Tue, 4 Mar 2014 14:32:29 -0700
From: Jon K Peck <[hidden email]>
Subject: News from the SPSS Community
To: [hidden email]
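Since the original question asked about Python, here is a minimal sketch of the same pairwise test outside SPSS. It does not use STATS CARTPROD; it loops over every (String1, String2) pair in memory and applies the equivalent of INDEX(String2,String1) NE 0. The function name `count_occurrences` and the stripping of trailing blanks are assumptions made for illustration, and the sample data are the three rows from the original post.

```python
def count_occurrences(needles, haystacks):
    """For each needle, count how many haystack strings contain it."""
    # Assumption: trailing blanks from fixed-width strings should not count.
    cleaned = [needle.rstrip() for needle in needles]
    counts = {needle: 0 for needle in cleaned}
    for hay in haystacks:
        for needle in cleaned:
            # Substring test, the analogue of INDEX(String2,String1) NE 0.
            if needle and needle in hay:
                counts[needle] += 1
    return counts


string1 = ["female red shoes", "black hat", "mens necktie"]
string2 = [
    "female red shoes for sale",
    "black hat for sale",
    "mens necktile for sale",
]
result = count_occurrences(string1, string2)
for needle, n in result.items():
    print(needle, n)
```

This still performs all 1,900 x 800,000 comparisons, so it illustrates the logic rather than avoiding the cost discussed above.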
In reply to this post by pspangler1
To enjoy life, you might wish to isolate each string into its components, then replace them with NOMINAL labels. For example, 'female red shoes for sale' is replaced by:

· Gender: {0=F, 1=M}
· Color: {0=Black, 1=Red, …}
· Sale: {0=No, 1=Yes}
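A rough sketch of the component-coding idea, in Python. The code tables `GENDER` and `COLOR`, the function name `encode`, and the rule that "mens" maps to M are all hypothetical; a real codebook would have to be built from the actual data.

```python
# Hypothetical code tables, following the example labels in the post.
GENDER = {"female": 0, "mens": 1}   # 0=F, 1=M (assumed mapping)
COLOR = {"black": 0, "red": 1}      # 0=Black, 1=Red


def encode(text):
    """Replace a free-text description by nominal codes (None if absent)."""
    lowered = text.lower()
    words = lowered.split()
    gender = next((GENDER[w] for w in words if w in GENDER), None)
    color = next((COLOR[w] for w in words if w in COLOR), None)
    sale = 1 if "for sale" in lowered else 0   # 0=No, 1=Yes
    return {"gender": gender, "color": color, "sale": sale}


coded = encode("female red shoes for sale")
print(coded)
```

Once the strings are coded this way, counting becomes an ordinary frequency table on the nominal variables rather than a substring search.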
In reply to this post by pspangler1
See INDEX function.
Why would you believe Python is a solution to this simple problem? Sorry, I'm getting out of the habit of writing code for people. See INDEX in the manual and try something for yourself.
Please reply to the list and not to my personal email.
Those desiring my consulting or training services please feel free to email me.
---
"Nolite dare sanctum canibus neque mittatis margaritas vestras ante porcos ne forte conculcent eas pedibus suis."
Cum es damnatorum possederunt porcos iens ut salire off sanguinum cliff in abyssum?
In reply to this post by Richard Ristow
I would personally rip everything apart into single words, aggregate them and match with a table.
Rebuild from matches of the TABLE. Won't write the code for free.
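One possible reading of the word-table idea, sketched in Python rather than SPSS syntax. This is not the poster's code: the function names are hypothetical, and the final substring check is an added assumption to make sure the words appear together and in order. Words are split on whitespace only, so a phrase that matches only part of a word is not counted.

```python
from collections import defaultdict


def build_word_table(records):
    """Map each word to the set of record numbers in which it occurs."""
    table = defaultdict(set)
    for rec_id, text in enumerate(records):
        for word in text.split():
            table[word].add(rec_id)
    return table


def count_phrase_matches(phrases, records):
    """Count records containing each phrase, using the word table
    to narrow down the candidates before the substring check."""
    table = build_word_table(records)
    counts = {}
    for phrase in phrases:
        words = phrase.split()
        if not words:
            counts[phrase] = 0
            continue
        # Records that contain every word of the phrase as a whole word.
        candidates = set.intersection(*(table.get(w, set()) for w in words))
        # Confirm the phrase itself appears in each candidate record.
        counts[phrase] = sum(1 for i in candidates if phrase in records[i])
    return counts


string1 = ["female red shoes", "black hat", "mens necktie"]
string2 = [
    "female red shoes for sale",
    "black hat for sale",
    "mens necktile for sale",
]
matches = count_phrase_matches(string1, string2)
print(matches)
```

The table is built once over the 800,000 records, so each of the 1,900 phrases is checked only against the records that share all of its words instead of against every record.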
In reply to this post by pspangler1
Are all the matches this precise and systematic?
Then: Starting with String1 in one file and String2 in another, create a variable for IsString1 (1=yes, 0=no), and make a file that has IsString1, String, like this --

1 female red shoes
0 female red shoes for sale
. . .

Sort by String(A) IsString1(D) so that each String1 value comes immediately before the String2 values that start with the same words. Use Lag and Index to count the number of matches.

--
Rich Ulrich

Date: Wed, 19 Mar 2014 14:00:30 -0700
From: [hidden email]
Subject: substring counts
To: [hidden email]
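The sort-and-scan approach in Rich Ulrich's reply can be sketched in Python as follows. The function name `count_prefix_matches` is hypothetical, and carrying the most recent String1 value forward stands in for the LAG step. The sketch only counts String2 values that *begin* with a String1 value, and it assumes no String1 value is a prefix of another one. Python's default string ordering is used, which may differ from the SPSS sort order.

```python
def count_prefix_matches(string1, string2):
    """Count String2 values that start with each String1 value."""
    # Stack both lists with a flag: 1 = String1, 0 = String2.
    rows = [(s, 1) for s in string1] + [(s, 0) for s in string2]
    # String ascending, flag descending, so that each String1 value
    # sorts directly ahead of the String2 values that start with it.
    rows.sort(key=lambda row: (row[0], -row[1]))

    counts = {s: 0 for s in string1}
    current = None  # most recent String1 value seen (the "lagged" value)
    for text, is_string1 in rows:
        if is_string1:
            current = text
        elif current is not None and text.startswith(current):
            counts[current] += 1
    return counts


string1 = ["female red shoes", "black hat", "mens necktie"]
string2 = [
    "female red shoes for sale",
    "black hat for sale",
    "mens necktile for sale",
]
prefix_counts = count_prefix_matches(string1, string2)
print(prefix_counts)
```

This takes one sort and one pass over the combined records instead of 1.52E9 comparisons, which is why it only pays off if the matches really are as systematic as in the sample data.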