Dear SPSS Experts,
I would greatly appreciate seeing any syntax for extending the Identify Duplicate Cases procedure beyond its current limit of 64 variables. I have pasted the syntax developed through the SPSS drop down menu and tried to add a variable, but I see that the procedure is stopped by any attempt to sort on more than 64 keys. Thus, I welcome any syntax that anyone has developed to identify duplicate cases for more than 64 variables.

I would prefer SPSS syntax as the solution. However, I should note that I have the Python extension installed; consequently, even though I do not know how one programs in Python, I suspect that, with appropriate guidance, I could install a Python program if Python script is the answer to my problem.

I welcome your feedback on this matter.

Thank you in advance.

David B. Nolle

=====================
To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD
In reply to this post by David B. Nolle-2
The Sort Cases command has a limit of 64
sort variables in a single pass. You should be able to use multiple Sort
Cases commands to get around that limit.
Rick Oliver
Senior Information Developer
Business Analytics (SPSS)
Subsequent SORTS will 'undo' any previous sorts ;-(
Just occurred to me that one could do multiple sorts with the following addendum.

SORT CASES BY var001 TO var064.
COMPUTE Key_01_64=$CASENUM.
SORT CASES BY Key_01_64 var065 TO var128.
COMPUTE Key_01_128=$CASENUM.
SORT CASES BY Key_01_128 var129 TO var192.
etc....

The above might be rather inefficient with huge files. Depending upon the nature of the data you could concatenate multiple data fields and then sort on the results. The following works for single digit numbers. Should be easy to adapt for other situations.
----
*TEST DATA GENERATOR*.
INPUT PROGRAM.
LOOP CASE=1 to 1000.
DO REPEAT V=Var001 to var200.
COMPUTE V=TRUNC(UNIFORM(10)).
END REPEAT.
END CASE.
END LOOP.
END FILE.
END INPUT PROGRAM.

**ACTUAL SOLUTION BEGINS HERE **.
NUMERIC Keys01 TO KEYS13.
RECODE Keys01 TO KEYS13(ELSE=0).
VECTOR VARS=Var001 to var200 / Keys=Keys01 TO KEYS13.
COMPUTE #Index=1.
COMPUTE #POW=1.
LOOP #=1 TO 200.
+ COMPUTE KEYS(#INDEX)=KEYS(#INDEX)*10+VARS(#).
+ COMPUTE #POW=#POW+1.
+ DO IF #POW=17.
+ COMPUTE #POW=1.
+ COMPUTE #INDEX=#INDEX+1.
+ END IF.
END LOOP.
FORMATS Keys01 to keys13 (N16.0).
FORMATS Var001 TO Var200 (F1.0).
SORT CASES BY Keys01 TO Keys13.
Please reply to the list and not to my personal email.
Those desiring my consulting or training services please feel free to email me. --- "Nolite dare sanctum canibus neque mittatis margaritas vestras ante porcos ne forte conculcent eas pedibus suis." Cum es damnatorum possederunt porcos iens ut salire off sanguinum cliff in abyssum?" |
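For anyone who wants to see the concatenation idea outside SPSS, here is a minimal Python sketch (not SPSS syntax). It packs a row of single-digit values into a handful of integer keys, 16 digits per key as in the post above, so that two rows are identical exactly when their packed keys are identical. The function name `pack_digits`, the `width` parameter, and the three sample rows are made up for illustration.

```python
from typing import List, Sequence, Tuple


def pack_digits(values: Sequence[int], width: int = 16) -> Tuple[int, ...]:
    """Pack single-digit values into integers holding at most `width` digits each."""
    keys: List[int] = []
    for start in range(0, len(values), width):
        key = 0
        for digit in values[start:start + width]:
            if not 0 <= digit <= 9:
                raise ValueError("only single-digit values are supported")
            # Shift the running key one decimal place and append the digit.
            key = key * 10 + digit
        keys.append(key)
    return tuple(keys)


# Hypothetical sample data: three rows of 200 single-digit values each.
rows = [
    [3, 1, 4] + [0] * 197,
    [2, 7, 1] + [0] * 197,
    [3, 1, 4] + [0] * 197,
]

packed = [pack_digits(row) for row in rows]

# Sorting the row indexes by packed key puts identical rows next to each other.
order = sorted(range(len(rows)), key=lambda i: packed[i])
```

With 200 values and a width of 16, each row collapses to 13 keys. Python integers are exact at any size; SPSS numeric variables are double-precision floats, so it is worth checking that 16-digit keys survive exactly there before relying on them (a narrower width with more keys is the obvious fallback).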
No need for the key calculation. The
sort is a stable sort. So if you sort from right to left, you get
everything in order.
E.g.
sort cases by z.
sort cases by y.
sort cases by x.
gives you the same result as
sort cases by x y z.

Jon Peck (no "h")
Senior Software Engineer, IBM
[hidden email]
new phone: 720-342-5621
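The stable-sort argument above is easy to check outside SPSS. The following is a minimal Python sketch, offered only as an illustration: Python's `list.sort` is guaranteed stable, so sorting on chunks of keys from right to left should reproduce a single sort on all keys. The helper name `sort_in_passes`, the `max_keys` parameter, and the sample rows are assumptions for the example.

```python
from operator import itemgetter


def sort_in_passes(rows, keys, max_keys):
    """Sort rows on all `keys` using passes of at most `max_keys` keys each.

    The chunks are processed from right to left; because each pass is a
    stable sort, the earlier (less significant) ordering is kept for ties.
    """
    result = list(rows)
    chunks = [keys[i:i + max_keys] for i in range(0, len(keys), max_keys)]
    for chunk in reversed(chunks):
        result.sort(key=itemgetter(*chunk))
    return result


# Hypothetical sample data with three sort keys.
rows = [
    {"x": 2, "y": 1, "z": 9},
    {"x": 1, "y": 2, "z": 3},
    {"x": 2, "y": 1, "z": 4},
    {"x": 1, "y": 1, "z": 7},
    {"x": 1, "y": 2, "z": 1},
]

single_pass = sorted(rows, key=itemgetter("x", "y", "z"))
two_passes = sort_in_passes(rows, ["x", "y", "z"], max_keys=2)
three_passes = sort_in_passes(rows, ["x", "y", "z"], max_keys=1)
```

Here `two_passes` sorts by z first and then by x y, and `three_passes` sorts by z, then y, then x; both should come out in the same order as `single_pass`.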
In reply to this post by Rick Oliver-3
Rick,
Thank you for your response. However, I am
not sure about how one can incorporate your suggestion of multiple SORT
procedures into the standard syntax generated by the developers of SPSS for the
procedure called "Identify Duplicate Cases..." under the Data menu for SPSS
19.0.0.1.
For example, I have pasted the syntax generated by
the "Identify Duplicate Cases..." procedure in SPSS 19.0.0.1 below. I have
specified exactly 64 variables (the limit with this current procedure) but
I want to add 50 additional variables (say, var01 through var50).
Do you know how I can use your suggestion of
multiple SORT procedures within the framework of your (IBM SPSS) standard syntax
(see below) to produce an identification of duplicate cases on all 114
variables?
I should note that simply running the standard
procedure twice (once for 64 variables and once for the remaining 50 variables)
is not the answer because the duplicates under the first run may not be
appropriately linked to the duplicates under the second run.
Thus, I welcome your advice on this
matter.
* Identify Duplicate Cases.
SORT CASES BY q2(A) q3a(A) q3b(A) q3c(A) q3d(A) q3e(A) q4a(A) q4b(A) q4c(A) q4d(A) q4e(A) q4g(A) q4f(A)
  q5(A) q6a(A) q6b(A) q6c(A) q6d(A) q6e(A) q6f(A) q6g(A) q6h(A) q6i(A) q6j(A) q6k(A) q6l(A) q6m(A)
  q7a(A) q7b(A) q7c(A) q7d(A) q7e(A) q7f(A) q7g(A) q7h(A) q7i(A) q7j(A) q7k(A) q7l(A) q7m(A)
  q8(A) q9(A) q10(A) q11(A) q12(A) q13a(A) q13b(A) q13c(A) q14a(A) q14b(A) q14c(A) q14d(A) q14e(A)
  q15a(A) q15b(A) q15c(A) q15d(A) q15e(A) q15f(A) q15g(A) q15h(A) q16a(A) q16b(A) q16c(A).
MATCH FILES
  /FILE=*
  /BY q2 q3a q3b q3c q3d q3e q4a q4b q4c q4d q4e q4g q4f
  q5 q6a q6b q6c q6d q6e q6f q6g q6h q6i q6j q6k q6l q6m
  q7a q7b q7c q7d q7e q7f q7g q7h q7i q7j q7k q7l q7m
  q8 q9 q10 q11 q12 q13a q13b q13c q14a q14b q14c q14d q14e
  q15a q15b q15c q15d q15e q15f q15g q15h q16a q16b q16c
  /FIRST=PrimaryFirst3
  /LAST=PrimaryLast.
DO IF (PrimaryFirst3).
COMPUTE MatchSequence=1-PrimaryLast.
ELSE.
COMPUTE MatchSequence=MatchSequence+1.
END IF.
LEAVE MatchSequence.
FORMATS MatchSequence (f7).
COMPUTE InDupGrp=MatchSequence>0.
SORT CASES InDupGrp(D).
MATCH FILES
  /FILE=*
  /DROP=PrimaryLast InDupGrp MatchSequence.
VARIABLE LABELS PrimaryFirst3 'Indicator of each first matching case as Primary'.
VALUE LABELS PrimaryFirst3 0 'Duplicate Case' 1 'Primary Case'.
VARIABLE LEVEL PrimaryFirst3 (ORDINAL).
FREQUENCIES VARIABLES=PrimaryFirst3.
EXECUTE.

Thank you for your time and interest.
David
Reading Jon (no-h)'s suggestion.
SORT CASES BY var01 TO Var50.
SORT CASES BY q2 q3a q3b q3c q3d q3e q4a q4b q4c q4d q4e q4g q4f
  q5 q6a q6b q6c q6d q6e q6f q6g q6h q6i q6j q6k q6l q6m
  q7a q7b q7c q7d q7e q7f q7g q7h q7i q7j q7k q7l q7m
  q8 q9 q10 q11 q12 q13a q13b q13c q14a q14b q14c q14d q14e
  q15a q15b q15c q15d q15e q15f q15g q15h q16a q16b q16c.
MATCH FILES
  /FILE=*
  /BY q2 q3a q3b q3c q3d q3e q4a q4b q4c q4d q4e q4g q4f
  q5 q6a q6b q6c q6d q6e q6f q6g q6h q6i q6j q6k q6l q6m
  q7a q7b q7c q7d q7e q7f q7g q7h q7i q7j q7k q7l q7m
  q8 q9 q10 q11 q12 q13a q13b q13c q14a q14b q14c q14d q14e
  q15a q15b q15c q15d q15e q15f q15g q15h q16a q16b q16c
  var01 TO var50
  /FIRST=PrimaryFirst3
  /LAST=PrimaryLast.
DO IF (PrimaryFirst3).
+ COMPUTE MatchSequence=1-PrimaryLast.
ELSE.
+ COMPUTE MatchSequence=MatchSequence+1.
END IF.
LEAVE MatchSequence.
FORMATS MatchSequence (f7).
COMPUTE InDupGrp=MatchSequence>0.
SORT CASES InDupGrp(D).
MATCH FILES
  /FILE=*
  /DROP=PrimaryLast InDupGrp MatchSequence.
VARIABLE LABELS PrimaryFirst3 'Indicator of each first matching case as Primary'.
VALUE LABELS PrimaryFirst3 0 'Duplicate Case' 1 'Primary Case'.
VARIABLE LEVEL PrimaryFirst3 (ORDINAL).
FREQUENCIES VARIABLES=PrimaryFirst3.
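The primary/duplicate flag that the syntax above produces can also be sketched in Python, for readers who find that easier to follow. This is only an illustration of the idea (sort on every key, then mark the first case in each run of identical keys as primary and the rest as duplicates), not a translation of the SPSS procedure. The function name `flag_primary`, the `primary_first` field, and the sample rows are hypothetical.

```python
from itertools import groupby


def flag_primary(rows, keys):
    """Return rows sorted on `keys`, each with a `primary_first` flag.

    primary_first is 1 for the first case in each group of identical key
    values and 0 for every later (duplicate) case in that group.
    """
    def key_of(row):
        return tuple(row[k] for k in keys)

    ordered = sorted(rows, key=key_of)  # stable sort on all keys at once
    flagged = []
    for _, group in groupby(ordered, key=key_of):
        for position, row in enumerate(group):
            flagged.append({**row, "primary_first": 1 if position == 0 else 0})
    return flagged


# Hypothetical sample data: cases 1 and 3 agree on every key.
rows = [
    {"id": 1, "a": 1, "b": 2},
    {"id": 2, "a": 0, "b": 5},
    {"id": 3, "a": 1, "b": 2},
    {"id": 4, "a": 1, "b": 3},
]

flagged = flag_primary(rows, ["a", "b"])
summary = [(row["id"], row["primary_first"]) for row in flagged]
```

Because a tuple of any length can serve as the sort key here, the 64-key limit discussed in this thread has no counterpart in the sketch; the number of keys is simply the length of the `keys` list.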
In reply to this post by David B. Nolle-2
David, Rick, and Jon,
Thank you very much for providing the solution to
the problem of extending the identification of duplicates beyond 64
variables (see below). Rick initiated the solution, Jon clarified the solution,
and David followed Jon's suggestion and built the solution to handle my
problem.
I think that your expert contributions to this list
are fantastic, and I commend you for your generous willingness to help all of us
to handle a wide variety of problems within the framework of SPSS. I think
that the many consistent contributors to SPSSX-L have clearly
demonstrated that the intellectual capital and programming skills
undergirding SPSSX-L are outstanding.
David
David B. Nolle
SORT CASES BY var01 TO Var50.
SORT CASES BY q2 q3a q3b q3c q3d q3e q4a q4b q4c q4d q4e q4g q4f
  q5 q6a q6b q6c q6d q6e q6f q6g q6h q6i q6j q6k q6l q6m
  q7a q7b q7c q7d q7e q7f q7g q7h q7i q7j q7k q7l q7m
  q8 q9 q10 q11 q12 q13a q13b q13c q14a q14b q14c q14d q14e
  q15a q15b q15c q15d q15e q15f q15g q15h q16a q16b q16c.
MATCH FILES
  /FILE=*
  /BY q2 q3a q3b q3c q3d q3e q4a q4b q4c q4d q4e q4g q4f
  q5 q6a q6b q6c q6d q6e q6f q6g q6h q6i q6j q6k q6l q6m
  q7a q7b q7c q7d q7e q7f q7g q7h q7i q7j q7k q7l q7m
  q8 q9 q10 q11 q12 q13a q13b q13c q14a q14b q14c q14d q14e
  q15a q15b q15c q15d q15e q15f q15g q15h q16a q16b q16c
  var01 TO var50
  /FIRST=PrimaryFirst3
  /LAST=PrimaryLast.
DO IF (PrimaryFirst3).
+ COMPUTE MatchSequence=1-PrimaryLast.
ELSE.
+ COMPUTE MatchSequence=MatchSequence+1.
END IF.
LEAVE MatchSequence.
FORMATS MatchSequence (f7).
COMPUTE InDupGrp=MatchSequence>0.
SORT CASES InDupGrp(D).
MATCH FILES
  /FILE=*
  /DROP=PrimaryLast InDupGrp MatchSequence.
VARIABLE LABELS PrimaryFirst3 'Indicator of each first matching case as Primary'.
VALUE LABELS PrimaryFirst3 0 'Duplicate Case' 1 'Primary Case'.
VARIABLE LEVEL PrimaryFirst3 (ORDINAL).
FREQUENCIES VARIABLES=PrimaryFirst3.