SPSSX Discussion

Extending the "Identify Duplicate Cases..." Procedure Beyond 64 Variables

David B. Nolle-2

Extending the "Identify Duplicate Cases..." Procedure Beyond 64 Variables

Dear SPSS Experts,

I would greatly appreciate seeing any syntax for extending the Identify
Duplicate Cases procedure beyond its current limit of 64 variables. I have
pasted the syntax developed through the SPSS drop-down menu and tried to add
a variable, but I see that the procedure is stopped by any attempt to sort
on more than 64 keys. Thus, I welcome any syntax that anyone has developed
to identify duplicate cases for more than 64 variables.

I would prefer SPSS syntax as the solution. However, I should note that I
have the Python extension installed; consequently, even though I do not know
how one programs in Python, I suspect that, with appropriate guidance, I
could install a Python program if a Python script is the answer to my problem.

I welcome your feedback on this matter.

Thank you in advance.

David B. Nolle

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD

Rick Oliver-3

Re: Extending the "Identify Duplicate Cases..." Procedure Beyond 64 Variables

In reply to this post by David B. Nolle-2

The Sort Cases command has a limit of 64 sort variables in a single pass. You should be able to use multiple Sort Cases commands to get around that limit.

Rick Oliver
Senior Information Developer
Business Analytics (SPSS)

David Marso

Re: Extending the "Identify Duplicate Cases..." Procedure Beyond 64 Variables

Administrator

Rick Oliver wrote

The Sort Cases command has a limit of 64 sort variables in a single pass.
You should be able to use multiple Sort Cases commands to get around that
limit.

Please reply to the list and not to my personal email.
Those desiring my consulting or training services please feel free to email me.
---
"Nolite dare sanctum canibus neque mittatis margaritas vestras ante porcos ne forte conculcent eas pedibus suis."
Cum es damnatorum possederunt porcos iens ut salire off sanguinum cliff in abyssum?"

Jon K Peck

Re: Extending the "Identify Duplicate Cases..." Procedure Beyond 64 Variables

No need for the key calculation. The sort is a stable sort. So if you sort from right to left, you get everything in order.
E.g. sort cases by z.
sort cases by y.
sort cases by x.
gives you the same result as
sort cases by x y z.
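The same point can be checked outside SPSS. The Python sketch below is only an illustration: it relies on Python's built-in `sorted`, which is documented as stable, and the rows and the column names `x`, `y`, `z` are made-up data rather than anything from this thread.

```python
# Illustration: sorting on one key at a time, rightmost key first,
# gives the same order as a single multi-key sort when the sort is stable.
import random

random.seed(0)  # fixed seed so the made-up data is reproducible
rows = [{"x": random.randint(0, 2),
         "y": random.randint(0, 2),
         "z": random.randint(0, 2),
         "id": i} for i in range(50)]

# One pass on all three keys (like: sort cases by x y z).
one_pass = sorted(rows, key=lambda r: (r["x"], r["y"], r["z"]))

# Three stable passes, least significant key first
# (like: sort cases by z. sort cases by y. sort cases by x.).
multi_pass = sorted(rows, key=lambda r: r["z"])
multi_pass = sorted(multi_pass, key=lambda r: r["y"])
multi_pass = sorted(multi_pass, key=lambda r: r["x"])

same_order = [r["id"] for r in one_pass] == [r["id"] for r in multi_pass]
print(same_order)  # prints True
```

The comparison uses the `id` field so that it checks the order of the cases themselves, including cases that tie on all three keys.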

Jon Peck (no "h")
Senior Software Engineer, IBM
[hidden email]
new phone: 720-342-5621

From: David Marso <[hidden email]>
To: [hidden email]
Date: 10/14/2011 04:11 PM
Subject: Re: [SPSSX-L] Extending the "Identify Duplicate Cases..." Procedure Beyond 64 Variables
Sent by: "SPSSX(r) Discussion" <[hidden email]>

Subsequent SORTS will 'undo' any previous sorts ;-(
Just occurred to me that one could do multiple sorts with the following addendum.

SORT CASES BY var001 TO var064.
COMPUTE Key_01_64=$CASENUM.
SORT CASES BY Key_01_64 var065 TO var128.
COMPUTE Key_01_128=$CASENUM.
SORT CASES BY Key_01_128 var129 TO var192.
etc....

The above might be rather inefficient with huge files.
Depending upon the nature of the data you could concatenate multiple data fields and then sort on the results.
The following works for single digit numbers. Should be easy to adapt for other situations.
----
*TEST DATA GENERATOR*.
INPUT PROGRAM.
LOOP CASE=1 to 1000.
DO REPEAT V=Var001 to var200.
COMPUTE V=TRUNC(UNIFORM(10)).
END REPEAT.
END CASE.
END LOOP.
END FILE.
END INPUT PROGRAM.

**ACTUAL SOLUTION BEGINS HERE **.
NUMERIC Keys01 TO KEYS13.
RECODE Keys01 TO KEYS13(ELSE=0).
VECTOR VARS=Var001 to var200 / Keys=Keys01 TO KEYS13.
COMPUTE #Index=1.
COMPUTE #POW=1.
LOOP #=1 TO 200.
+ COMPUTE KEYS(#INDEX)=KEYS(#INDEX)*10+VARS(#).
+ COMPUTE #POW=#POW+1.
+ DO IF #POW=17.
+ COMPUTE #POW=1.
+ COMPUTE #INDEX=#INDEX+1.
+ END IF.
END LOOP.
FORMATS Keys01 to keys13 (N16.0).
FORMATS Var001 TO Var200 (F1.0).
SORT CASES BY Keys01 TO Keys13.
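The digit-packing idea in the quoted message can be sketched in Python as follows. This is an illustration under stated assumptions, not a translation of the syntax above: `pack_digits` is a hypothetical helper and the rows are made-up data. The sketch also packs 15 digits per key rather than 16, on the assumption that the keys end up stored as double-precision numbers, which represent integers exactly only up to 2^53 (about 9.007 × 10^15).

```python
# Hypothetical helper: pack runs of single-digit values into integer keys,
# so that a long list of fields can be compared through a few keys.
def pack_digits(values, digits_per_key=15):
    keys = []
    for start in range(0, len(values), digits_per_key):
        key = 0
        for v in values[start:start + digits_per_key]:
            if not 0 <= v <= 9:
                raise ValueError("only single-digit values are supported")
            key = key * 10 + v
        keys.append(key)
    return keys

# Made-up rows with 30 single-digit fields each.
row_a = [1, 2, 3] * 10
row_b = list(row_a)      # exact copy of row_a
row_c = list(row_a)
row_c[-1] = 4            # differs from row_a in the last field only

keys_a = pack_digits(row_a)
keys_b = pack_digits(row_b)
keys_c = pack_digits(row_c)

print(len(keys_a))       # number of keys needed for 30 fields
print(keys_a == keys_b)  # identical rows give identical keys
print(keys_a == keys_c)  # a single differing field changes a key
```

Because every row has the same number of fields, each key always covers the same field positions, so equal keys correspond to equal field values.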

David B. Nolle-2

Re: Extending the "Identify Duplicate Cases..." Procedure Beyond 64 Variables

In reply to this post by Rick Oliver-3

Rick,

Thank you for your response. However, I am not sure about how one can incorporate your suggestion of multiple SORT procedures into the standard syntax generated by the developers of SPSS for the procedure called "Identify Duplicate Cases..." under the Data menu for SPSS 19.0.0.1.

For example, I have pasted the syntax generated by the "Identify Duplicate Cases..." procedure in SPSS 19.0.0.1 below. I have specified exactly 64 variables (the limit with this current procedure), but I want to add 50 additional variables (say, var01 through var50).

Do you know how I can use your suggestion of multiple SORT procedures within the framework of your (IBM SPSS) standard syntax (see below) to produce an identification of duplicate cases on all 114 variables?

I should note that simply running the standard procedure twice (once for 64 variables and once for the remaining 50 variables) is not the answer because the duplicates under the first run may not be appropriately linked to the duplicates under the second run. Thus, I welcome your advice on this matter.
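A tiny made-up example shows why two separate runs cannot simply be combined. The Python sketch below is for illustration only: field `a` stands in for the first 64 variables, field `b` for the remaining 50, and `duplicated_ids` is a hypothetical helper.

```python
from collections import Counter

# Made-up cases: "a" stands in for the first 64 variables,
# "b" for the remaining 50.
cases = [
    {"id": 1, "a": (1, 1), "b": (5, 5)},
    {"id": 2, "a": (1, 1), "b": (7, 7)},
    {"id": 3, "a": (2, 2), "b": (5, 5)},
]

def duplicated_ids(cases, key):
    """Return the ids of cases whose key value occurs more than once."""
    counts = Counter(key(c) for c in cases)
    return {c["id"] for c in cases if counts[key(c)] > 1}

dup_a = duplicated_ids(cases, lambda c: c["a"])                # first run
dup_b = duplicated_ids(cases, lambda c: c["b"])                # second run
dup_all = duplicated_ids(cases, lambda c: (c["a"], c["b"]))    # all variables

print(sorted(dup_a))    # [1, 2]
print(sorted(dup_b))    # [1, 3]
print(sorted(dup_all))  # []
```

Case 1 is flagged in both runs, yet no case matches it on all variables, so intersecting the two results would report a duplicate that does not exist.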

* Identify Duplicate Cases.
SORT CASES BY q2(A) q3a(A) q3b(A) q3c(A) q3d(A) q3e(A) q4a(A) q4b(A) q4c(A) q4d(A) q4e(A) q4g(A)
    q4f(A) q5(A) q6a(A) q6b(A) q6c(A) q6d(A) q6e(A) q6f(A) q6g(A) q6h(A) q6i(A) q6j(A) q6k(A) q6l(A)
    q6m(A) q7a(A) q7b(A) q7c(A) q7d(A) q7e(A) q7f(A) q7g(A) q7h(A) q7i(A) q7j(A) q7k(A) q7l(A) q7m(A)
    q8(A) q9(A) q10(A) q11(A) q12(A) q13a(A) q13b(A) q13c(A) q14a(A) q14b(A) q14c(A) q14d(A) q14e(A)
    q15a(A) q15b(A) q15c(A) q15d(A) q15e(A) q15f(A) q15g(A) q15h(A) q16a(A) q16b(A) q16c(A).
MATCH FILES
/FILE=*
/BY q2 q3a q3b q3c q3d q3e q4a q4b q4c q4d q4e q4g q4f q5 q6a q6b q6c q6d q6e q6f q6g q6h q6i q6j
    q6k q6l q6m q7a q7b q7c q7d q7e q7f q7g q7h q7i q7j q7k q7l q7m q8 q9 q10 q11 q12 q13a q13b q13c
    q14a q14b q14c q14d q14e q15a q15b q15c q15d q15e q15f q15g q15h q16a q16b q16c
/FIRST=PrimaryFirst3
/LAST=PrimaryLast.
DO IF (PrimaryFirst3).
COMPUTE MatchSequence=1-PrimaryLast.
ELSE.
COMPUTE MatchSequence=MatchSequence+1.
END IF.
LEAVE MatchSequence.
FORMATS MatchSequence (f7).
COMPUTE InDupGrp=MatchSequence>0.
SORT CASES InDupGrp(D).
MATCH FILES
/FILE=*
/DROP=PrimaryLast InDupGrp MatchSequence.
VARIABLE LABELS PrimaryFirst3 'Indicator of each first matching case as Primary'.
VALUE LABELS PrimaryFirst3 0 'Duplicate Case' 1 'Primary Case'.
VARIABLE LEVEL PrimaryFirst3 (ORDINAL).
FREQUENCIES VARIABLES=PrimaryFirst3.
EXECUTE.
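For readers who find the flag logic in the pasted syntax hard to follow, here is a rough Python sketch of what it appears to compute, based on reading the syntax above. The function `flag_duplicates` and the sample rows are hypothetical; the flag names mirror `PrimaryFirst3`, `PrimaryLast`, and `MatchSequence`.

```python
from itertools import groupby

def flag_duplicates(rows, key_names):
    """Sort rows on key_names and flag first/last/sequence within each group."""
    def key(r):
        return tuple(r[k] for k in key_names)

    flagged = []
    for _, group in groupby(sorted(rows, key=key), key=key):
        group = list(group)
        for pos, row in enumerate(group):
            flagged.append({
                **row,
                # 1 for the first case in a group of identical keys
                "PrimaryFirst": int(pos == 0),
                # 1 for the last case in the group
                "PrimaryLast": int(pos == len(group) - 1),
                # 0 for a unique case, otherwise 1, 2, 3, ... within the group
                "MatchSequence": 0 if len(group) == 1 else pos + 1,
            })
    return flagged

# Made-up data: cases 1 and 3 share the same key value.
rows = [{"id": 1, "q": 1}, {"id": 2, "q": 2}, {"id": 3, "q": 1}]
result = flag_duplicates(rows, ["q"])
for r in result:
    print(r["id"], r["PrimaryFirst"], r["PrimaryLast"], r["MatchSequence"])
```

A case with `MatchSequence` greater than 0 belongs to a group of duplicates, which is what `InDupGrp` records in the syntax before it is dropped.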

Thank you for your time and interest.

David

David Marso

Re: Extending the "Identify Duplicate Cases..." Procedure Beyond 64 Variables

Administrator

Following Jon (no-h)'s suggestion, sort on the 50 extra variables first and then on the original 64:
SORT CASES BY var01 TO Var50.
SORT CASES BY q2 q3a q3b q3c q3d q3e q4a q4b q4c q4d q4e q4g
q4f q5 q6a q6b q6c q6d q6e q6f q6g q6h q6i q6j q6k q6l
q6m q7a q7b q7c q7d q7e q7f q7g q7h q7i q7j q7k q7l q7m
q8 q9 q10 q11 q12 q13a q13b q13c q14a q14b q14c q14d q14e
q15a q15b q15c q15d q15e q15f q15g q15h q16a q16b q16c.
MATCH FILES /FILE=*
 /BY q2 q3a q3b q3c q3d q3e q4a q4b q4c q4d q4e q4g
    q4f q5 q6a q6b q6c q6d q6e q6f q6g q6h q6i q6j q6k q6l
    q6m q7a q7b q7c q7d q7e q7f q7g q7h q7i q7j q7k q7l q7m
    q8 q9 q10 q11 q12 q13a q13b q13c q14a q14b q14c q14d q14e
    q15a q15b q15c q15d q15e q15f q15g q15h q16a q16b q16c
    var01 TO var50
 /FIRST=PrimaryFirst3 /LAST=PrimaryLast.
DO IF (PrimaryFirst3).
+ COMPUTE MatchSequence=1-PrimaryLast.
ELSE.
+ COMPUTE MatchSequence=MatchSequence+1.
END IF.
LEAVE MatchSequence.
FORMATS MatchSequence (f7).
COMPUTE InDupGrp=MatchSequence>0.
SORT CASES InDupGrp(D).
MATCH FILES /FILE=* /DROP=PrimaryLast InDupGrp MatchSequence.
VARIABLE LABELS PrimaryFirst3 'Indicator of each first matching case as Primary'.
VALUE LABELS PrimaryFirst3 0 'Duplicate Case' 1 'Primary Case'.
VARIABLE LEVEL PrimaryFirst3 (ORDINAL).
FREQUENCIES VARIABLES=PrimaryFirst3.
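The two-pass sort above generalizes to any number of keys. The Python sketch below illustrates that idea under assumptions: `sort_in_chunks` is a hypothetical helper, the limit of 64 keys per pass is passed in as a parameter, and the key names `q1` to `q64` and the data are made up rather than taken from the actual file.

```python
import random

def sort_in_chunks(rows, key_names, max_keys=64):
    """Sort rows on all key_names using passes of at most max_keys keys."""
    chunks = [key_names[i:i + max_keys]
              for i in range(0, len(key_names), max_keys)]
    ordered = list(rows)
    # Least significant chunk first; each later stable pass keeps the
    # earlier order among cases that tie on its own keys.
    for chunk in reversed(chunks):
        ordered = sorted(ordered, key=lambda r, c=chunk: tuple(r[k] for k in c))
    return ordered

# 114 made-up key names: 64 "q" keys followed by var01 to var50.
key_names = (["q%d" % i for i in range(1, 65)]
             + ["var%02d" % i for i in range(1, 51)])

random.seed(1)  # fixed seed so the made-up data is reproducible
rows = []
for i in range(200):
    row = {name: 0 for name in key_names}
    row["q1"] = random.randint(0, 1)      # varies within the first chunk
    row["var50"] = random.randint(0, 1)   # varies within the second chunk
    row["id"] = i
    rows.append(row)

chunked = sort_in_chunks(rows, key_names)
direct = sorted(rows, key=lambda r: tuple(r[k] for k in key_names))
matches = [r["id"] for r in chunked] == [r["id"] for r in direct]
print(matches)  # prints True
```

As in the syntax, the keys that should matter most (the first 64) are sorted last, so the final order is the same as one sort on all 114 keys.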

David B. Nolle-2

Re: Extending the "Identify Duplicate Cases..." Procedure Beyond 64 Variables

In reply to this post by David B. Nolle-2

David, Rick, and Jon,

Thank you very much for providing the solution to the problem of extending the identification of duplicates beyond 64 variables (see below). Rick initiated the solution, Jon clarified the solution, and David followed Jon's suggestion and built the solution to handle my problem.

I think that your expert contributions to this list are fantastic, and I commend you for your generous willingness to help all of us handle a wide variety of problems within the framework of SPSS. I think that the many consistent contributors to SPSSX-L have clearly demonstrated that the intellectual capital and programming skills undergirding SPSSX-L are outstanding.

David

David B. Nolle

SORT CASES BY var01 TO Var50.
SORT CASES BY q2 q3a q3b q3c q3d q3e q4a q4b q4c q4d q4e q4g
    q4f q5 q6a q6b q6c q6d q6e q6f q6g q6h q6i q6j q6k q6l
    q6m q7a q7b q7c q7d q7e q7f q7g q7h q7i q7j q7k q7l q7m
    q8 q9 q10 q11 q12 q13a q13b q13c q14a q14b q14c q14d q14e
    q15a q15b q15c q15d q15e q15f q15g q15h q16a q16b q16c.
MATCH FILES /FILE=* /BY q2 q3a q3b q3c q3d q3e q4a q4b q4c q4d q4e q4g q4f
q5 q6a q6b q6c q6d q6e q6f q6g q6h q6i q6j q6k q6l q6m q7a q7b q7c q7d q7e
q7f q7g q7h q7i q7j q7k q7l q7m q8 q9 q10 q11 q12 q13a q13b q13c q14a q14b
q14c q14d q14e q15a q15b q15c q15d q15e q15f q15g q15h q16a q16b q16c var01
TO var50
/FIRST=PrimaryFirst3 /LAST=PrimaryLast.
DO IF (PrimaryFirst3).
+ COMPUTE MatchSequence=1-PrimaryLast.
ELSE.
+ COMPUTE MatchSequence=MatchSequence+1.
END IF.
LEAVE MatchSequence.
FORMATS MatchSequence (f7).
COMPUTE InDupGrp=MatchSequence>0.
SORT CASES InDupGrp(D).
MATCH FILES /FILE=* /DROP=PrimaryLast InDupGrp MatchSequence.
VARIABLE LABELS PrimaryFirst3 'Indicator of each first matching case as Primary'.
VALUE LABELS PrimaryFirst3 0 'Duplicate Case' 1 'Primary Case'.
VARIABLE LEVEL PrimaryFirst3 (ORDINAL).
FREQUENCIES VARIABLES=PrimaryFirst3.