Identify Duplicate Cases


emma78
Hi,
I have a question regarding duplicate rows in a dataset.
I want to find those IDs that have the same values on all variables in the whole dataset.
I found some syntax on the list:


SORT CASES BY var01 TO Var50.
SORT CASES BY q2 q3a q3b q3c q3d q3e q4a q4b q4c q4d q4e q4g
    q4f q5 q6a q6b q6c q6d q6e q6f q6g q6h q6i q6j q6k q6l
    q6m q7a q7b q7c q7d q7e q7f q7g q7h q7i q7j q7k q7l q7m
    q8 q9 q10 q11 q12 q13a q13b q13c q14a q14b q14c q14d q14e
    q15a q15b q15c q15d q15e q15f q15g q15h q16a q16b q16c.
MATCH FILES  /FILE=* /BY q2 q3a q3b q3c q3d q3e q4a q4b q4c q4d q4e q4g q4f
q5 q6a q6b q6c q6d q6e q6f q6g q6h q6i q6j  q6k q6l q6m q7a q7b q7c q7d q7e
q7f q7g q7h q7i q7j q7k q7l q7m q8 q9 q10 q11 q12 q13a q13b q13c q14a q14b
q14c q14d q14e q15a q15b q15c q15d q15e q15f q15g q15h q16a q16b q16c var01
TO var50
  /FIRST=PrimaryFirst3  /LAST=PrimaryLast.
DO IF (PrimaryFirst3).
+  COMPUTE  MatchSequence=1-PrimaryLast.
ELSE.
+  COMPUTE  MatchSequence=MatchSequence+1.
END IF.
LEAVE  MatchSequence.
FORMATS  MatchSequence (f7).
COMPUTE  InDupGrp=MatchSequence>0.
SORT CASES InDupGrp(D).
MATCH FILES /FILE=* /DROP=PrimaryLast InDupGrp MatchSequence.
VARIABLE LABELS  PrimaryFirst3 'Indicator of each first matching case as Primary'.
VALUE LABELS  PrimaryFirst3 0 'Duplicate Case' 1 'Primary Case'.
VARIABLE LEVEL  PrimaryFirst3 (ORDINAL).
FREQUENCIES VARIABLES=PrimaryFirst3.


The problem is that I have 200 variables, so I have to split the SORT CASES command as shown in the syntax above because of the 64-variable limit.

But when I tried the syntax above, it didn't work.
The error message says I have to sort them all in the same order.

Does anybody know how to change the syntax so that it works?


Thank you!




Re: Identify Duplicate Cases

Maguin, Eugene
I first want to suggest that you investigate the Identify Duplicate Cases function in the Data menu. It may not be adequate; I cannot comment on its adequacy for this specific problem because I have never used it.

If it proves inadequate, I suggest the following scheme, which, depending on the nature of your data, may not be workable.

My suggestion is to use a DO REPEAT structure to concatenate your variables into a single string variable and sort on that. It looks like you have around 100-110 variables, and if those variables are all true F1.0 variables (i.e., range 0-9), then you have a 100-110 character string, which is no problem. But if those variables are true floating-point numbers, then I don't know whether this idea will work.
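A minimal, untested sketch of that scheme follows. It rests on assumptions that are not from the thread: the variables are adjacent in the file and named q1 TO q200, each holds a single digit 0-9, and allvals and dup are made-up names. Treat it as an illustration of the idea rather than a finished solution.

```spss
* Assumption: q1 TO q200 are adjacent in the file and hold single digits 0-9.
* The names allvals and dup and the A200 width are illustrative only.
STRING allvals (A200).
DO REPEAT v = q1 TO q200.
  COMPUTE allvals = CONCAT(RTRIM(allvals), STRING(v, F1)).
END DO REPEAT.

* One string key replaces the 200 numeric sort keys.
SORT CASES BY allvals.

* dup is 1 for the second and later copies of an identical row.
COMPUTE dup = (allvals = LAG(allvals)).
FREQUENCIES dup.
```

Because each case is compared only with its predecessor, the first member of each duplicate group keeps dup = 0.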

Gene Maguin



-----Original Message-----
From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of emma78
Sent: Wednesday, November 18, 2015 10:41 AM
To: [hidden email]
Subject: Identify Duplicate Cases








--
View this message in context: http://spssx-discussion.1045642.n5.nabble.com/Identify-Duplicate-Cases-tp5730968.html
Sent from the SPSSX Discussion mailing list archive at Nabble.com.

=====================
To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD


Re: Identify Duplicate Cases

David Marso
Administrator
This post was updated on .
In reply to this post by emma78
SORT CASES BY ALL.
MATCH FILES ... / BY ALL.
----
Oops, my bad.
Try the following idea.

/* simulate some data for testing */.
MATRIX.
SAVE TRUNC(UNIFORM(10000,300)*3) /OUTFILE * / VARIABLES x001 TO x300.
END MATRIX.

SORT CASES BY x257 TO x300.
SORT CASES BY x193 TO x256.
SORT CASES BY x129 TO x192.
SORT CASES BY x065 TO x128.
SORT CASES BY x001 TO x064.

MATCH FILES / FILE * / FIRST=@TOP / LAST=@BOT  / BY ALL.
* A case is unique only if it is both first and last in its BY group.
COMPUTE @FLAGDUP = NOT (@TOP AND @BOT).
FREQUENCIES @FLAGDUP.

Please reply to the list and not to my personal email.
Those desiring my consulting or training services please feel free to email me.
---
"Nolite dare sanctum canibus neque mittatis margaritas vestras ante porcos ne forte conculcent eas pedibus suis."
Cum es damnatorum possederunt porcos iens ut salire off sanguinum cliff in abyssum?"

Re: Identify Duplicate Cases

Maguin, Eugene
David, More than likely you have more experience with this sort of problem but I was surprised by your advice as I would have assumed that the sort command has some limitation in the number of variables that can be specified as keys. Same comment, actually, with match files.

Gene Maguin

-----Original Message-----
From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of David Marso
Sent: Wednesday, November 18, 2015 11:22 AM
To: [hidden email]
Subject: Re: Identify Duplicate Cases


Re: Identify Duplicate Cases

David Marso
Administrator
I suspect my edit didn't percolate through to Nabble.
The idea is that SORT preserves the existing order of cases that tie on the specified keys.
If one sorts on the LATER variables first, then that order will be preserved within ties when sorting afterwards on the earlier keys.
/*Build a simulation file */.
MATRIX.
SAVE TRUNC(UNIFORM(10000,300)*3) /OUTFILE * / VARIABLES x001 TO x300.
END MATRIX.

SORT CASES BY x257 TO x300.
SORT CASES BY x193 TO x256.
SORT CASES BY x129 TO x192.
SORT CASES BY x065 TO x128.
SORT CASES BY x001 TO x064.

MATCH FILES / FILE * / FIRST=@TOP / LAST=@BOT  / BY ALL.
* A case is unique only if it is both first and last in its BY group.
COMPUTE @FLAGDUP = NOT (@TOP AND @BOT).
FREQUENCIES @FLAGDUP.
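
The multi-pass idea can be checked on a file small enough to verify by hand. The sketch below is only an illustration. The variable names (id, a, b, c, top, bot, dupgrp) and the data are made up, and it assumes that cases tied on the sort keys keep their existing order when re-sorted. It uses NOT (top AND bot) so that the middle cases of a group with three or more members are flagged too.

```spss
* Toy data: cases 1, 3 and 5 are identical on a, b and c.
DATA LIST FREE / id a b c.
BEGIN DATA
1 2 1 3
2 1 2 3
3 2 1 3
4 1 1 3
5 2 1 3
END DATA.

* Sort on the later keys first, then on the earlier key.
SORT CASES BY b c.
SORT CASES BY a.

* FIRST and LAST mark the boundaries of each group of identical key values.
MATCH FILES /FILE=* /BY a b c /FIRST=top /LAST=bot.
COMPUTE dupgrp = NOT (top AND bot).
LIST.
```

In this toy file, cases 1, 3 and 5 should end up with dupgrp = 1 and cases 2 and 4 with dupgrp = 0.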




Re: Identify Duplicate Cases

Maguin, Eugene
Thank you for explaining that. The example code made it clear to me. Always good to get a better understanding of something. Gene Maguin

-----Original Message-----
From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of David Marso
Sent: Wednesday, November 18, 2015 11:52 AM
To: [hidden email]
Subject: Re: Identify Duplicate Cases


Re: Identify Duplicate Cases

emma78
In reply to this post by David Marso
Perfect, thank you!

Re: Identify Duplicate Cases

emma78
In reply to this post by David Marso
Hi,
One more question regarding this topic:
I tried to find the duplicates within a single column. I know that I can use the normal syntax for this, but I want to apply it to hundreds of columns, one column after another.
I tried to adapt this syntax with a loop:

sort cases lfdn v_1.
compute dub= (lfdn=lag(lfdn,1)).
sort lfdn (A) v_1(D).
if (dub=0)  dub =2* (lfdn=lag (lfdn,1)).

But it doesn't work.

Isn't it possible to use SORT CASES and LOOP in one syntax, or am I doing something wrong?

Thank you!

Re: Identify Duplicate Cases

Jon K Peck
No, SORT is an operation over the entire dataset while transformations are applied case by case.  I missed the original objective, but there are ways to do this without any sorting depending on exactly what you want to do.
 
 

Jon Peck (no "h") aka Kim
Senior Software Engineer, IBM
[hidden email]
phone: 720-342-5621
 
 
----- Original message -----
From: emma78 <[hidden email]>
Sent by: "SPSSX(r) Discussion" <[hidden email]>
To: [hidden email]
Cc:
Subject: Re: [SPSSX-L] Identify Duplicate Cases
Date: Sun, Nov 22, 2015 5:24 AM
 

Re: Identify Duplicate Cases

emma78
Ah, good to know:-)

What I want to do:

I have, e.g., 20 variables (mostly string) in my dataset.
I want to check each variable for values that occur in more than one case.
I can do it for each variable one after another, but it would be great if I could handle all variables in one syntax to save time ;-)

Thank you!

Re: Identify Duplicate Cases

Rich Ulrich
"... mostly string" is something that confuses me here.  Are the numeric
fields associated with particular string fields?  - It seems that duplication
of numbers is hard to avoid except when they are IDs.

I'm assuming that you have something like Names, which you
want to check for duplication, when it comes to strings that are
all the same length.  Then, maybe, the numbers are like Social Security
numbers, and each is associated with a name.

Well, ignoring numbers, you might re-write your file from wide-form to
long-form, VarsToCases, and do your sorting.  You can use Split Files to
report on each original variable alone. Repeat separately for numbers
if numbers are separate from names.  What you do after sorting depends on
what you are trying to accomplish -- Counting?  Listing?
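
A rough, untested sketch of the wide-to-long route follows. It assumes that v1 to v5 are string variables of the same width and that id is the case identifier. The names txt, source and n_same are made up for illustration, and AGGREGATE with BREAK on the source variable stands in for SPLIT FILE here.

```spss
* Assumption: v1 to v5 are strings of equal width and id identifies the case.
* One output row per original cell, with source holding the variable name.
VARSTOCASES
  /MAKE txt FROM v1 v2 v3 v4 v5
  /INDEX = source (txt)
  /KEEP = id.

* Count how often each text occurs within each original variable.
AGGREGATE
  /OUTFILE = * MODE = ADDVARIABLES
  /BREAK = source txt
  /n_same = NU.

* List only the values that occur more than once.
SORT CASES BY source txt id.
TEMPORARY.
SELECT IF (n_same > 1).
LIST id source txt n_same.
```

Whether the next step is counting or listing, the long file keeps the id attached to each repeated value, so the affected cases can be traced back.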

--
Rich Ulrich



> Date: Sun, 22 Nov 2015 08:28:55 -0700

> From: [hidden email]
> Subject: Re: Identify Duplicate Cases
> To: [hidden email]

Re: Identify Duplicate Cases

emma78
Yes, the string variables are the most interesting ones:

I have

ID   v1     v2       v3       v4        v5
1    test   house    tree     nothing   none of these
2    test   garden   car      nothing   key
3    sky    ---      people   key       nothing

IDs 1 and 2 need to be flagged because 'test' (v1) and 'nothing' (v4) each occur twice.

Re: Identify Duplicate Cases

Art Kendall
From your example it seems that you are asking to find cases (rows) that have identical  values on any number of variables.

This sounds like an unusual thing to do, so I most likely do not understand your question.

Please explain in more detail the reasons you are doing this.
What is the study about?
How did you gather your data?
Are you trying to find out whether the data from what is supposed to be the same case is correctly entered?
Are you planning to use the results of your effort in further analyses?

Are the variables entered in free text fields so that "TEST" IS THE SAME AS "Test" or "test" or "tEst"?
Art Kendall
Social Research Consultants

Re: Identify Duplicate Cases

Art Kendall
P.S. It appears that the syntax you posted is a lot like what "Identify Duplicate Cases" in the GUI produces.

The GUI is a great way to draft your syntax. Be sure to exit via <Paste> so that you have what you want.

Is this a one-time effort, or is this something you expect to do often?
Art Kendall
Social Research Consultants

Re: Identify Duplicate Cases

John F Hall
In reply to this post by emma78
I'm as confused as everyone else on this one: why on earth are you using
strings in the first place?

Try something like:

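* Flag the specific repeated values (1 = present, 0 = absent).
Recode v1 ("test" = 1) (else = 0) into v1test
      /v4 ("nothing" = 1) (else = 0) into v4test.

* Count how many of the flags are set for each case.
Count dup = v1test v4test (1).

* List the IDs where both flagged values occur.
Temporary.
Select if dup gt 1.
List id.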

John F Hall (Mr)
[Retired academic survey researcher]

Email:   [hidden email]  
Website: www.surveyresearch.weebly.com
SPSS start page:  www.surveyresearch.weebly.com/1-survey-analysis-workshop




-----Original Message-----
From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of
emma78
Sent: 22 November 2015 19:57
To: [hidden email]
Subject: Re: Identify Duplicate Cases


Re: Identify Duplicate Cases

emma78
In reply to this post by Art Kendall
Sorry for any confusion 😊
Actually, I have some cases with different IDs but identical text in string fields. It's not the ideal dataset, but that's the way it is 😉
And because there are many of those string variables, it would be great to get an idea of how to loop through those variables 😊 and mark the cases with identical text. It has to be 100% identical, so 'test' and 'Test' must not be marked as matching.



Re: Identify Duplicate Cases

Art Kendall
Would duplicate cases be identical on all variables except the case ID, with the values in the same order?
Would these cases count as identical except for the ID?
CaseID myvar1 myvar2 myvar3
123 apple pear apple
456 pear apple apple

Or would you be looking for these to be identical except for CaseID
123 apple pear apple
456 apple pear apple

Again, are your string variables created by something that is consistent (software), or are they entered by people and therefore prone to typos, etc.?

Are you saying that "typo" variations are NOT identical?

Do variables have restricted ranges of legitimate values?  

It may be that you need to protect private data, but if you could elaborate on your question it would make it possible for list members to help you.


Art Kendall
Social Research Consultants

Re: Identify Duplicate Cases

Bruce Weaver
Administrator
In reply to this post by emma78
Does the following syntax do what you want for V1?  (I have no SPSS on this machine, so it has not been tested.)

AGGREGATE
 /BREAK = V1
 /V1flag = NU.
RECODE V1flag (1=0) (ELSE=1).
FORMATS V1flag(F1).
VARIABLE LABELS V1flag "V1 value appears 2 or more times".
VALUE LABELS V1flag 1 "Yes" 0 "No".
FREQUENCIES V1flag.

If that does what you want, it could form the basis of a macro that loops through all of the variables.
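
One possible shape for such a macro is sketched below, untested. The macro name !flagdup, its vars argument and the anydup variable are hypothetical. It assumes that names such as v1flag are not already in use and that id is the case identifier.

```spss
* Hypothetical macro: adds <var>flag = 1 when the value of <var> occurs in 2 or more cases.
DEFINE !flagdup (vars = !CMDEND)
!DO !v !IN (!vars)
AGGREGATE
  /OUTFILE = * MODE = ADDVARIABLES
  /BREAK = !v
  /!CONCAT(!v, flag) = NU.
RECODE !CONCAT(!v, flag) (1 = 0) (ELSE = 1).
!DOEND
!ENDDEFINE.

!flagdup vars = v1 v2 v3 v4 v5.

* anydup is 1 when at least one of the variables holds a repeated value.
COMPUTE anydup = MAX(v1flag, v2flag, v3flag, v4flag, v5flag).
TEMPORARY.
SELECT IF (anydup = 1).
LIST id v1flag v2flag v3flag v4flag v5flag.
```

String comparison is case-sensitive, so 'test' and 'Test' count as different values. Blank strings match each other, so empty fields may need to be excluded before running it.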

--
Bruce Weaver
bweaver@lakeheadu.ca
http://sites.google.com/a/lakeheadu.ca/bweaver/

"When all else fails, RTFM."

PLEASE NOTE THE FOLLOWING: 
1. My Hotmail account is not monitored regularly. To send me an e-mail, please use the address shown above.
2. The SPSSX Discussion forum on Nabble is no longer linked to the SPSSX-L listserv administered by UGA (https://listserv.uga.edu/).

Re: Identify Duplicate Cases

emma78
Yes, it does exactly what I want :-)

Re: Identify Duplicate Cases

Maguin, Eugene
In reply to this post by emma78
I know you have a good solution from Bruce (and a good solution is good enough).
I'm curious about your problem and I have something in mind, which may not work out, but to begin I'd like to clearly understand the problem.

Is the following statement correct?
In a dataset consisting of non-duplicated id numbers, determine whether a variable has the same value for
any two or more cases (rows). The variables to be checked are all strings but of varying widths.

Should a variable have the same value for multiple cases, what happens next? If four variables have the same values on the same subset of cases, what happens next?

Gene Maguin


-----Original Message-----
From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of emma78
Sent: Sunday, November 22, 2015 7:24 AM
To: [hidden email]
Subject: Re: Identify Duplicate Cases
