|
Hi,
I have two columns and for each value in column A, I want to find if the same value exists in column B. Example: Col A Col B 12 11 13 14 14 12 11 10 In the above example, I would say that 13 is not present in column B. How can I do this using arrays and loops in SPSS? Thanks, Sowmya ===================== To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD |
|
Hi
> I have two columns and for each value in column A, I want to find if the > same value exists in column B. > Example: > Col A Col B > 12 11 > 13 14 > 14 12 > 11 10 > > In the above example, I would say that 13 is not present in column B. How > can I do this using arrays and loops in SPSS? > Quick&dirty solution (I'm sure that more elegant ones will be provided by other people): * Your sample dataset *. DATA LIST LIST/ColA ColB (2 F8). BEGIN DATA 12 11 13 14 14 12 11 10 END DATA. * Code assumes SPSS 14 or newer is used *. DATASET NAME Data. DATASET DECLARE APresentInB WINDOW=MINIMIZED. PRESERVE. SET MXLOOPS=1000. /* If sample size GT 1000, change MXLOOPS *. * WARNING: This code assumes no missing data *. MATRIX. GET ColA/VAR=ColA. GET ColB/VAR=ColB. COMPUTE n=NROW(ColA). COMPUTE Present=MAKE(n,1,0). LOOP i=1 TO n. . LOOP j=1 TO n. . DO IF ColA(i) EQ ColB(j). . COMPUTE Present(i)=1. . END IF. . END LOOP. END LOOP. SAVE Present /OUTFILE=APresentInB. END MATRIX. RESTORE. MATCH FILES /FILE=* /FILE='APresentInB'. DATASET CLOSE APresentInB. VALUE LABEL COL1 0'Absent in ColB' 1'Present in ColB'. FORMAT COL1(F8). LIST. HTH, Marta García-Granero -- For miscellaneous statistical stuff, visit: http://gjyp.nl/marta/ ===================== To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD |
|
sowmya sankar wrote:
> One question: what would happen if there are missing values? For > example, in my data, column A has 4731 rows and column B has 4861 > rows. Would the same code run? As a general rule, a complete description of your dataset and your goal will save a lot of time. You described your goal perfectly, but omitted some important information concerning your dataset (this is not a rebuke, just an advise that will help you to get better an faster responses from the list). BTW, given the sample size you mention (over 4000) be patient, this code performs two nested loops, and the time needed increases exponentially with sample size. MATRIX will give an error message if missing data are present on either column. This modification will make the code work OK: * Your sample dataset (modified to add some missing data to test the code) *. DATA LIST LIST/ColA ColB (2 F8). BEGIN DATA 12 11 13 14 14 12 11 10 -99 15 10 -99 END DATA. RECODE ALL (-99=SYSMIS) . LIST. * Code assumes SPSS 14 or newer is used *. DATASET NAME Data. DATASET DECLARE APresentInB WINDOW=MINIMIZED. PRESERVE. SET MXLOOPS=5000. /* If sample size GT 5000, change MXLOOPS *. * Code modified to accept missing data *. TEMPORARY. LIST. MATRIX. GET ColA/VAR=ColA /MISSING=ACCEPT /SYSMIS=-99. GET ColB/VAR=ColB /MISSING=ACCEPT /SYSMIS=-99. COMPUTE n=NROW(ColA). COMPUTE Present=MAKE(n,1,0). LOOP i=1 TO n. . LOOP j=1 TO n. . DO IF ColA(i) EQ ColB(j). . COMPUTE Present(i)=1. . END IF. . END LOOP. END LOOP. SAVE Present /OUTFILE=APresentInB. END MATRIX. RESTORE. MATCH FILES /FILE=* /FILE='APresentInB'. DATASET CLOSE APresentInB. VALUE LABEL COL1 0'Absent in ColB' 1'Present in ColB'. * Cleaning missing data in result column *. IF MISSING(ColA) COL1=$sysmis. FORMAT COL1(F8). LIST. Now it will work. Marta ===================== To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD |
|
Here's a Python way to do this. It takes two passes but is linear in running time. The code assumes that sysmis values should never match. The result is a new variable, present, with values of 0 or 1. (Logically, the result should perhaps be sysmis if ColA is sysmis. That could be easily accommodated, but it probably doesn’t matter here.)
This code will work with SPSS 15 or later (and maybe 14.0.1). HTH, Jon Peck * create some data. DATA LIST LIST/ColA ColB (2 F8). BEGIN DATA 12 11 13 14 14 12 11 10 -99 15 10 -99 END DATA. RECODE ALL (-99=SYSMIS) . exec. BEGIN PROGRAM. import spss, spssdata caseset = set() curs = spssdata.Spssdata(accessType="w") for case in curs: if not case.ColB is None: caseset.add(case.ColB) curs.restart() curs.append("present") curs.commitdict() for case in curs: curs.casevalues([case.ColA in caseset]) curs.CClose() END PROGRAM. -----Original Message----- From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of Marta García-Granero Sent: Tuesday, August 19, 2008 9:42 AM To: [hidden email] Subject: Re: [SPSSX-L] arrays in SPSS sowmya sankar wrote: > One question: what would happen if there are missing values? For > example, in my data, column A has 4731 rows and column B has 4861 > rows. Would the same code run? As a general rule, a complete description of your dataset and your goal will save a lot of time. You described your goal perfectly, but omitted some important information concerning your dataset (this is not a rebuke, just an advise that will help you to get better an faster responses from the list). BTW, given the sample size you mention (over 4000) be patient, this code performs two nested loops, and the time needed increases exponentially with sample size. MATRIX will give an error message if missing data are present on either column. This modification will make the code work OK: * Your sample dataset (modified to add some missing data to test the code) *. DATA LIST LIST/ColA ColB (2 F8). BEGIN DATA 12 11 13 14 14 12 11 10 -99 15 10 -99 END DATA. RECODE ALL (-99=SYSMIS) . LIST. * Code assumes SPSS 14 or newer is used *. DATASET NAME Data. DATASET DECLARE APresentInB WINDOW=MINIMIZED. PRESERVE. SET MXLOOPS=5000. /* If sample size GT 5000, change MXLOOPS *. * Code modified to accept missing data *. TEMPORARY. LIST. MATRIX. GET ColA/VAR=ColA /MISSING=ACCEPT /SYSMIS=-99. GET ColB/VAR=ColB /MISSING=ACCEPT /SYSMIS=-99. COMPUTE n=NROW(ColA). COMPUTE Present=MAKE(n,1,0). LOOP i=1 TO n. . LOOP j=1 TO n. . DO IF ColA(i) EQ ColB(j). . COMPUTE Present(i)=1. . END IF. . END LOOP. END LOOP. SAVE Present /OUTFILE=APresentInB. END MATRIX. RESTORE. MATCH FILES /FILE=* /FILE='APresentInB'. DATASET CLOSE APresentInB. VALUE LABEL COL1 0'Absent in ColB' 1'Present in ColB'. * Cleaning missing data in result column *. IF MISSING(ColA) COL1=$sysmis. FORMAT COL1(F8). LIST. Now it will work. Marta ===================== To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD ===================== To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD |
|
Team:
When running FA for data reduction, I have always used eighenvalue to determine number of components along with screen plot. However, I realize, through posts, that this approach is facing much criticism. I will really appreciate if anyone can share any documents or explanation that explains in detail why relying on typical method (retaining factors near to eigenvalue of 1) is not the best approach? Thanks, MS This e-mail and any files transmitted with it are for the sole use of the intended recipient(s) and may contain confidential and privileged information. If you are not the intended recipient, please contact the sender by reply e-mail and destroy all copies of the original message. Any unauthorised review, use, disclosure, dissemination, forwarding, printing or copying of this email or any action taken in reliance on this e-mail is strictly prohibited and may be unlawful. ====================To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD |
|
In reply to this post by sowmya sankar
At 09:59 AM 8/19/2008, sowmya sankar wrote:
>I have two columns and for each value in column A, I want to find if >the same value exists in column B. >Example: |-----------------------------|---------------------------| |Output Created |19-AUG-2008 21:13:39 | |-----------------------------|---------------------------| [TestData] CaseID Col_A Col_B Alpha 12 11 Beta 13 14 Gamma 14 12 Delta 11 10 Number of cases read: 4 Number of cases listed: 4 (Grin) It's an interesting problem, not the least as an exercise in the different tools that SPSS offers. Here's a solution in native SPSS transformation language. It should be N*log(N) in running time, dominated by the SORTs. (Jon, are you sure yours is linear? I think N*log(N) should be theoretical best case.) Like Marta's solution, this uses datasets, requiring SPSS 14+. However, it could easily be changed to use disk scratch files instead, for earlier versions. DATASET COPY Demo. DATASET ACTIVATE Demo WINDOW=FRONT. * Create a sorted lookup table of the column B values, . * renaming them "Column A". You'll see why, later. . ADD FILES /FILE=Demo /RENAME=(Col_A Col_B = Dump Col_A) /KEEP = Col_A. DATASET NAME LookUp WINDOW=FRONT. SORT CASES BY Col_A. . /**/ LIST /*-*/. List |-----------------------------|---------------------------| |Output Created |19-AUG-2008 21:13:41 | |-----------------------------|---------------------------| [LookUp] Col_A 10 11 12 14 Number of cases read: 4 Number of cases listed: 4 DATASET ACTIVATE Demo WINDOW=FRONT. * The kludgiest feature of this solution is variable . * CaseSeq#, which is necessary to restore the original . * case order after the sort and lookup: . NUMERIC CaseSeq# (F4). COMPUTE CaseSeq# = $CASENUM. * Now, it's easy: . SORT CASES BY Col_A. MATCH FILES /FILE=* /FILE=LookUp /IN=IsInColB /BY Col_A. FORMATS IsInColB (F2). * Drop values that occur only in column B: . SELECT IF NOT MISSING(CaseSeq#). * Restore original order of cases in file: . SORT CASES BY CaseSeq#. LIST. List |-----------------------------|---------------------------| |Output Created |19-AUG-2008 21:13:42 | |-----------------------------|---------------------------| [Demo] CaseID Col_A Col_B CaseSeq# IsInColB Alpha 12 11 1 1 Beta 13 14 2 0 Gamma 14 12 3 1 Delta 11 10 4 1 Number of cases read: 4 Number of cases listed: 4 ============================= APPENDIX: Test data, and code ============================= DATA LIST LIST/ CaseID Col_A Col_B (A6, F3, F3). BEGIN DATA Alpha 12 11 Beta 13 14 Gamma 14 12 Delta 11 10 END DATA. DATASET NAME TestData WINDOW=FRONT. LIST. DATASET COPY Demo. DATASET ACTIVATE Demo WINDOW=FRONT. * Create a sorted lookup table of the column B values, . * renaming them "Column A". You'll see why, later. . ADD FILES /FILE=Demo /RENAME=(Col_A Col_B = Dump Col_A) /KEEP = Col_A. DATASET NAME LookUp WINDOW=FRONT. SORT CASES BY Col_A. . /**/ LIST /*-*/. DATASET ACTIVATE Demo WINDOW=FRONT. * The kludgiest feature of this solution is variable . * CaseSeq#, which is necessary to restore the original . * case order after the sort and lookup: . NUMERIC CaseSeq# (F4). COMPUTE CaseSeq# = $CASENUM. * Now, it's easy: . SORT CASES BY Col_A. MATCH FILES /FILE=* /FILE=LookUp /IN=IsInColB /BY Col_A. FORMATS IsInColB (F2). * Drop values that occur only in column B: . SELECT IF NOT MISSING(CaseSeq#). * Restore original order of cases in file: . SORT CASES BY CaseSeq#. LIST. ===================== To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD |
|
In reply to this post by sowmya sankar
Whoops! Revision. At 09:59 AM 8/19/2008, sowmya sankar wrote:
>For each value in column A, I want to find if the same value exists >in column B. The previous solution won't work if the same value occurs more than once in either column A or column B. Here, with revised code and test data: DATASET COPY Demo. DATASET ACTIVATE Demo WINDOW=FRONT. * Create a sorted lookup table of the UNIQUE column B . * values, named as if they're from column A: . ADD FILES /FILE=Demo /RENAME=(Col_A Col_B = Dump Col_A) /KEEP = Col_A. DATASET NAME LookUp WINDOW=FRONT. SORT CASES BY Col_A. * Drop duplicates from the list: . SELECT IF MISSING(LAG(Col_A)) OR Col_A NE LAG(Col_A). . /**/ LIST /*-*/. List |-----------------------------|---------------------------| |Output Created |19-AUG-2008 21:54:07 | |-----------------------------|---------------------------| [LookUp] Col_A 10 11 12 14 15 Number of cases read: 5 Number of cases listed: 5 DATASET ACTIVATE Demo WINDOW=FRONT. * The kludgiest feature of this solution is variable . * CaseSeq#, which is necessary to restore the original . * case order after the sort and lookup: . NUMERIC CaseSeq# (F4). COMPUTE CaseSeq# = $CASENUM. * Now, it's easy: . SORT CASES BY Col_A. * Revision: Use /TABLE= instead of /FILE= for the . * lookup table. (So, it is no longer necessary to . * explicitly drop values that occur only in Column B. . * Howeve, it's mainly to handle values that occur more . * than once in column A.) . MATCH FILES /FILE=* /TABLE=LookUp /IN=IsInColB /BY Col_A. FORMATS IsInColB (F2). * Restore original order of cases in file: . SORT CASES BY CaseSeq#. LIST. List |-----------------------------|---------------------------| |Output Created |19-AUG-2008 21:54:08 | |-----------------------------|---------------------------| [Demo] CaseID Col_A Col_B CaseSeq# IsInColB Alpha 12 11 1 1 Beta 13 14 2 0 Gamma 14 12 3 1 Delta 11 10 4 1 Epsilon 12 15 5 1 Zeta 9 11 6 0 Number of cases read: 6 Number of cases listed: 6 ============================= APPENDIX: Test data, and code ============================= DATA LIST LIST/ CaseID Col_A Col_B (A8, F3, F3). BEGIN DATA Alpha 12 11 Beta 13 14 Gamma 14 12 Delta 11 10 Epsilon 12 15 Zeta 9 11 END DATA. DATASET NAME TestData WINDOW=FRONT. LIST. DATASET COPY Demo. DATASET ACTIVATE Demo WINDOW=FRONT. * Create a sorted lookup table of the UNIQUE column B . * values, named as if they're from column A: . ADD FILES /FILE=Demo /RENAME=(Col_A Col_B = Dump Col_A) /KEEP = Col_A. DATASET NAME LookUp WINDOW=FRONT. SORT CASES BY Col_A. * Drop duplicates from the list: . SELECT IF MISSING(LAG(Col_A)) OR Col_A NE LAG(Col_A). . /**/ LIST /*-*/. DATASET ACTIVATE Demo WINDOW=FRONT. * The kludgiest feature of this solution is variable . * CaseSeq#, which is necessary to restore the original . * case order after the sort and lookup: . NUMERIC CaseSeq# (F4). COMPUTE CaseSeq# = $CASENUM. * Now, it's easy: . SORT CASES BY Col_A. * Revision: Use /TABLE= instead of /FILE= for the . * lookup table. (So, it is no longer necessary to . * explicitly drop values that occur only in Column B. . * Howeve, it's mainly to handle values that occur more . * than once in column A.) . MATCH FILES /FILE=* /TABLE=LookUp /IN=IsInColB /BY Col_A. FORMATS IsInColB (F2). * Restore original order of cases in file: . SORT CASES BY CaseSeq#. LIST. ===================== To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD |
|
In reply to this post by Richard Ristow
I think mine is linear: two data passes - O(N). Set insertion and membership text are O(1). (They use a hash table, which is constant time or within an epsilon or so.) No sorting.
Of course, that doesn't mean it's always faster, since there is overhead passing the data to Python, but for large enough N it will always win. -Jon -----Original Message----- From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of Richard Ristow Sent: Tuesday, August 19, 2008 7:21 PM To: [hidden email] Subject: Re: [SPSSX-L] arrays in SPSS [snip] At 09:59 AM 8/19/2008, sowmya sankar wrote: >I have two columns and for each value in column A, I want to find if >the same value exists in column B. >Example: |-----------------------------|---------------------------| |Output Created |19-AUG-2008 21:13:39 | |-----------------------------|---------------------------| [TestData] CaseID Col_A Col_B Alpha 12 11 Beta 13 14 Gamma 14 12 Delta 11 10 Number of cases read: 4 Number of cases listed: 4 (Grin) It's an interesting problem, not the least as an exercise in the different tools that SPSS offers. Here's a solution in native SPSS transformation language. It should be N*log(N) in running time, dominated by the SORTs. (Jon, are you sure yours is linear? I think N*log(N) should be theoretical best case.) [>>>Peck, Jon] [snip] ===================== To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD |
| Free forum by Nabble | Edit this page |
