SPSSX Discussion

Identify Duplicate Cases

emma78

Identify Duplicate Cases

Hi,
I have a question regarding duplicate rows in a dataset.
I want to find those IDs that have the same value in all variables in the whole dataset.
I found some syntax on the list:

SORT CASES BY var01 TO Var50.
SORT CASES BY q2 q3a q3b q3c q3d q3e q4a q4b q4c q4d q4e q4g
q4f q5 q6a q6b q6c q6d q6e q6f q6g q6h q6i q6j q6k q6l
q6m q7a q7b q7c q7d q7e q7f q7g q7h q7i q7j q7k q7l q7m
q8 q9 q10 q11 q12 q13a q13b q13c q14a q14b q14c q14d q14e
q15a q15b q15c q15d q15e q15f q15g q15h q16a q16b q16c.
MATCH FILES /FILE=* /BY q2 q3a q3b q3c q3d q3e q4a q4b q4c q4d q4e q4g q4f
q5 q6a q6b q6c q6d q6e q6f q6g q6h q6i q6j q6k q6l q6m q7a q7b q7c q7d q7e
q7f q7g q7h q7i q7j q7k q7l q7m q8 q9 q10 q11 q12 q13a q13b q13c q14a q14b
q14c q14d q14e q15a q15b q15c q15d q15e q15f q15g q15h q16a q16b q16c var01
TO var50
/FIRST=PrimaryFirst3 /LAST=PrimaryLast.
DO IF (PrimaryFirst3).
+ COMPUTE MatchSequence=1-PrimaryLast.
ELSE.
+ COMPUTE MatchSequence=MatchSequence+1.
END IF.
LEAVE MatchSequence.
FORMATS MatchSequence (f7).
COMPUTE InDupGrp=MatchSequence>0.
SORT CASES InDupGrp(D).
MATCH FILES /FILE=* /DROP=PrimaryLast InDupGrp MatchSequence.
VARIABLE LABELS PrimaryFirst3 'Indicator of each first matching case as Primary'.
VALUE LABELS PrimaryFirst3 0 'Duplicate Case' 1 'Primary Case'.
VARIABLE LEVEL PrimaryFirst3 (ORDINAL).
FREQUENCIES VARIABLES=PrimaryFirst3.

The problem is that I have 200 variables, so I have to split the SORT CASES command as shown in the syntax above because of the 64-variable limit.

But when I tried the syntax above, it doesn't work.
The error message says I have to sort them all in the same order.

Does anybody know how to change the syntax so that it works?

Thank you!

Maguin, Eugene

Re: Identify Duplicate Cases

I first want to suggest that you investigate the Identify Duplicate Cases function in the Data menu. It may not be adequate, and I cannot comment on its adequacy for this specific problem because I have never used it.

If it turns out to be inadequate, I suggest the following scheme, which, depending on the nature of your data, may not be workable.

My suggestion is to use a DO REPEAT structure to concatenate your variables into a single string variable and sort on that. It looks like you have around 100-110 variables, and if those variables are all true F1.0 variables, i.e., in the 0-9 range, then you have a 100-110 character string, which is no problem. But if those variables are true floating-point numbers, then I don't know whether this idea will work.
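
The concatenation idea can be illustrated outside SPSS. The following is only a minimal Python sketch, not SPSS syntax and not something posted in the thread: the toy data and the names `rows`, `keys`, and `is_duplicate` are made up for illustration, and it assumes every value is a single digit, as in the F1.0 case described above.

```python
from collections import Counter

# Hypothetical toy data: each row is one case, each entry a single-digit code (0-9).
rows = [
    (1, 0, 3, 2),
    (2, 2, 0, 1),
    (1, 0, 3, 2),   # exact copy of the first row
    (0, 1, 1, 1),
]

# Build one string key per case by concatenating the digits.
keys = ["".join(str(v) for v in row) for row in rows]

# A key that occurs more than once marks a group of duplicate cases.
counts = Counter(keys)
is_duplicate = [counts[k] > 1 for k in keys]
print(is_duplicate)  # [True, False, True, False]
```

With values wider than one digit, a plain concatenation becomes ambiguous ("12" + "3" and "1" + "23" give the same key), so a fixed width or a separator would be needed.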

Gene Maguin

-----Original Message-----
From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of emma78
Sent: Wednesday, November 18, 2015 10:41 AM
To: [hidden email]
Subject: Identify Duplicate Cases

--
View this message in context: http://spssx-discussion.1045642.n5.nabble.com/Identify-Duplicate-Cases-tp5730968.html
Sent from the SPSSX Discussion mailing list archive at Nabble.com.

=====================
To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD

David Marso

Re: Identify Duplicate Cases

In reply to this post by emma78

SORT CASES BY ALL.
MATCH FILES ... / BY ALL.
----
Oops, my bad.
Try the following idea.

/* simulate some data for testing */.
MATRIX.
SAVE TRUNC(UNIFORM(10000,300)*3) /OUTFILE * / VARIABLES x001 TO x300.
END MATRIX.

SORT CASES BY x257 TO x300.
SORT CASES BY x193 TO x256.
SORT CASES BY x129 TO x192.
SORT CASES BY x065 TO x128.
SORT CASES BY x001 TO x064.

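MATCH FILES / FILE * / FIRST=@TOP / LAST=@BOT / BY ALL.
/* A unique case is both first and last in its key group; flag all others */.
/* (@TOP NE @BOT would miss the middle cases of a group of three or more) */.
COMPUTE @FLAGDUP=((@TOP + @BOT) LT 2).
FREQUENCIES @FLAGDUP.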

Please reply to the list and not to my personal email.
Those desiring my consulting or training services please feel free to email me.
---
"Nolite dare sanctum canibus neque mittatis margaritas vestras ante porcos ne forte conculcent eas pedibus suis."
Cum es damnatorum possederunt porcos iens ut salire off sanguinum cliff in abyssum?"

Maguin, Eugene

Re: Identify Duplicate Cases

David, more than likely you have more experience with this sort of problem, but I was surprised by your advice, as I would have assumed that the SORT command has some limit on the number of variables that can be specified as keys. Same comment, actually, with MATCH FILES.

Gene Maguin

-----Original Message-----
From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of David Marso
Sent: Wednesday, November 18, 2015 11:22 AM
To: [hidden email]
Subject: Re: Identify Duplicate Cases

David Marso

Re: Identify Duplicate Cases

I suspect my edit didn't percolate through to Nabble.
The idea is that SORT preserves the existing order of cases that tie on the specified keys.
If one sorts on the LATER variables first, then that order will be preserved in the later sorts on the earlier keys.
/*Build a simulation file */.
MATRIX.
SAVE TRUNC(UNIFORM(10000,300)*3) /OUTFILE * / VARIABLES x001 TO x300.
END MATRIX.

SORT CASES BY x257 TO x300.
SORT CASES BY x193 TO x256.
SORT CASES BY x129 TO x192.
SORT CASES BY x065 TO x128.
SORT CASES BY x001 TO x064.

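MATCH FILES / FILE * / FIRST=@TOP / LAST=@BOT / BY ALL.
/* A unique case is both first and last in its key group; flag all others */.
/* (@TOP NE @BOT would miss the middle cases of a group of three or more) */.
COMPUTE @FLAGDUP=((@TOP + @BOT) LT 2).
FREQUENCIES @FLAGDUP.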
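
The reasoning behind the blockwise sorts can be checked in any language with a stable sort. Below is a minimal Python sketch, not SPSS: the sizes `N_VARS` and `CHUNK` and all variable names are hypothetical stand-ins for the 300 variables and 64-key limit discussed here. It sorts on the last block of columns first and works backwards, then confirms that the result matches a single sort on all columns, and marks every case that is not both first and last in its group.

```python
import random
from collections import Counter

random.seed(0)  # fixed seed so the sketch is reproducible

N_VARS, CHUNK = 10, 4   # hypothetical sizes standing in for 300 variables, 64 keys per sort
rows = [tuple(random.randrange(3) for _ in range(N_VARS)) for _ in range(200)]
rows += rows[:5] + rows[:2]   # plant duplicates: two rows now occur at least three times

# Sort on the LAST block of keys first, then work backwards to the first block.
# list.sort is stable, so each pass keeps the order of earlier passes among ties.
chunked = list(rows)
for start in reversed(range(0, N_VARS, CHUNK)):
    chunked.sort(key=lambda r, s=start: r[s:s + CHUNK])

# The blockwise result equals one sort on all keys at once.
assert chunked == sorted(rows)

# First/last indicators per group of identical rows (the role of FIRST= and LAST=).
n = len(chunked)
first = [i == 0 or chunked[i] != chunked[i - 1] for i in range(n)]
last = [i == n - 1 or chunked[i] != chunked[i + 1] for i in range(n)]

# A case is unique only if it is both first and last in its group.
flag_dup = [not (f and l) for f, l in zip(first, last)]

# Cross-check against a direct count of identical rows.
counts = Counter(rows)
assert flag_dup == [counts[r] > 1 for r in chunked]
print(sum(flag_dup), "of", n, "cases belong to a duplicate group")
```

Testing "not both first and last" rather than "first differs from last" matters for groups of three or more, where the middle cases are neither first nor last.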

Maguin, Eugene

Re: Identify Duplicate Cases

Thank you for explaining that. The example code made it clear to me. Always good to get a better understanding of something. Gene Maguin

-----Original Message-----
From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of David Marso
Sent: Wednesday, November 18, 2015 11:52 AM
To: [hidden email]
Subject: Re: Identify Duplicate Cases

emma78

Re: Identify Duplicate Cases

In reply to this post by David Marso

Perfect, thank you!

emma78

Re: Identify Duplicate Cases

In reply to this post by David Marso

Hi,
one question regarding this topic:
I tried to find the duplicates in one column. I know that I can use the normal syntax for this, but I want to apply it to hundreds of columns at once, one column after another.
I tried to adapt this syntax with a loop:

sort cases lfdn v_1.
compute dub= (lfdn=lag(lfdn,1)).
sort cases lfdn (A) v_1(D).
if (dub=0) dub =2* (lfdn=lag (lfdn,1)).

But it doesn't work.

Isn't it possible to use SORT CASES and LOOP in one syntax, or am I doing something wrong?

Thank you!

Jon K Peck

Re: Identify Duplicate Cases

No, SORT is an operation over the entire dataset while transformations are applied case by case. I missed the original objective, but there are ways to do this without any sorting depending on exactly what you want to do.

Jon Peck (no "h") aka Kim
Senior Software Engineer, IBM
[hidden email]
phone: 720-342-5621

----- Original message -----
From: emma78 <[hidden email]>
Sent by: "SPSSX(r) Discussion" <[hidden email]>
To: [hidden email]
Cc:
Subject: Re: [SPSSX-L] Identify Duplicate Cases
Date: Sun, Nov 22, 2015 5:24 AM

emma78

Re: Identify Duplicate Cases

Ah, good to know:-)

What I want to do:

I have, e.g., 20 variables (mostly string) in my dataset.
For each variable, I want to check whether it contains duplicate values across cases.
I can do it for each variable one after another, but it would be great to have all variables in one syntax to save time ;-)

Thank you!

Rich Ulrich

Re: Identify Duplicate Cases

"... mostly string" is something that confuses me here. Are the numeric
fields associated with particular string fields? - It seems that duplication
of numbers is hard to avoid except when they are IDs.

I'm assuming that you have something like Names, which you
want to check for duplication, when it comes to strings that are
all the same length. Then, maybe, the numbers are like Social Security
numbers, and each is associated with a name.

Well, ignoring numbers, you might rewrite your file from wide form to
long form with VARSTOCASES, and do your sorting. You can use SPLIT FILE to
report on each original variable alone. Repeat separately for numbers
if numbers are separate from names. What you do after sorting depends on
what you are trying to accomplish: counting? Listing?
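
The wide-to-long idea can be sketched as follows. This is a rough Python illustration rather than SPSS syntax; the data and the names `wide`, `long_form`, and `dups` are invented for the example. Each cell becomes one record tagged with its original variable, and records are then grouped by variable and value.

```python
from collections import defaultdict

# Hypothetical wide-form data: one dict per case, keyed by variable name.
wide = [
    {"id": 1, "name": "Ann", "city": "Oslo"},
    {"id": 2, "name": "Bob", "city": "Oslo"},
    {"id": 3, "name": "Ann", "city": "Rome"},
]

# Long form: one (variable, value, id) record per cell, sorted by variable and value.
long_form = [
    (var, val, case["id"])
    for case in wide
    for var, val in case.items()
    if var != "id"
]
long_form.sort()

# Group ids by (variable, value); a group with more than one id is a duplicated value.
groups = defaultdict(list)
for var, val, case_id in long_form:
    groups[(var, val)].append(case_id)
dups = {key: ids for key, ids in groups.items() if len(ids) > 1}
print(dups)  # {('city', 'Oslo'): [1, 2], ('name', 'Ann'): [1, 3]}
```

Grouping by the pair (variable, value) keeps the check separate for each original variable, so the same text in two different variables is not counted as a duplicate.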

--
Rich Ulrich

> Date: Sun, 22 Nov 2015 08:28:55 -0700

> From: [hidden email]
> Subject: Re: Identify Duplicate Cases
> To: [hidden email]

emma78

Re: Identify Duplicate Cases

Yes, the string variables are the most interesting ones:

I have

ID  v1    v2      v3      v4       v5
1   test  house   tree    nothing  none of these
2   test  garden  car     nothing  key
3   sky   ---     people  key      nothing

IDs 1 and 2 need to be flagged because 'test' (in v1) and 'nothing' (in v4) each appear twice.

Art Kendall

Re: Identify Duplicate Cases

From your example it seems that you are asking to find cases (rows) that have identical values on any number of variables.

This sounds like an unusual thing to do, so I most likely do not understand your question.

Please explain in more detail the reasons you are doing this.
What is the study about?
How did you gather your data?
Are you trying to find out whether the data from what is supposed to be the same case is correctly entered?
Are you planning to use the results of your effort in further analyses?

Are the variables entered in free-text fields, so that "TEST" should count as the same as "Test", "test", or "tEst"?

Art Kendall
Social Research Consultants

Art Kendall

Re: Identify Duplicate Cases

P.S. It appears that the syntax you posted is a lot like that produced by "Identify Duplicate Cases" in the GUI.

The GUI is a great way to draft your syntax. Be sure to exit via Paste so that you have what you want.

Is this a one-time effort, or is this something you expect to do often?

Art Kendall
Social Research Consultants

John F Hall

Re: Identify Duplicate Cases

In reply to this post by emma78

I'm as confused as everyone else on this one: why on earth are you using
strings in the first place?

Try something like:

Recode v1 ("test" = 1)(else= 0) into v1test
/v4 ("nothing" = 1) (else = 0) into v4test.

Count dup = v1test v4test (1).

Temp.
Select if dup gt 1.

List id.

John F Hall (Mr)
[Retired academic survey researcher]

Email: [hidden email]
Website: www.surveyresearch.weebly.com
SPSS start page: www.surveyresearch.weebly.com/1-survey-analysis-workshop

-----Original Message-----
From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of
emma78
Sent: 22 November 2015 19:57
To: [hidden email]
Subject: Re: Identify Duplicate Cases

emma78

Re: Identify Duplicate Cases

In reply to this post by Art Kendall

Sorry for any confusion 😊
Actually, I have some cases with different IDs but identical text in string fields. It's not the ideal dataset, but that's the way it is 😉
And because there are many of those string variables, it would be great to get an idea of how to loop through those variables 😊 and mark those with identical text. It has to be 100% identical, so 'test' and 'Test' must not be marked.

Art Kendall

Re: Identify Duplicate Cases

Would the cases be identical on all variables except the case ID, with the values in the same order?
For example, would these cases count as identical except for the ID?
CaseID  myvar1  myvar2  myvar3
123     apple   pear    apple
456     pear    apple   apple

Or are you looking for cases like these, which are identical except for CaseID?
123     apple   pear    apple
456     apple   pear    apple

Again, are your string variables created by something consistent (software), or are they entered by people and prone to typos, etc.?

Are you saying that "typo" variations are NOT identical?

Do the variables have restricted ranges of legitimate values?

It may be that you need to protect private data, but if you could elaborate on your question, it would make it possible for list members to help you.

Art Kendall
Social Research Consultants

Bruce Weaver

Re: Identify Duplicate Cases

In reply to this post by emma78

Does the following syntax do what you want for V1? (I have no SPSS on this machine, so it has not been tested.)

AGGREGATE
/BREAK = V1
/V1flag = NU.
RECODE V1flag (1=0) (ELSE=1).
FORMATS V1flag(F1).
VARIABLE LABELS V1flag "V1 value appears 2 or more times".
VALUE LABELS V1flag 1 "Yes" 0 "No".
FREQUENCIES V1flag.

If that does what you want, it could form the basis of a macro that loops through all of the variables.
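
To make the logic of such a loop concrete, here is a small Python sketch of the same count-then-flag steps applied to every variable in turn. It is not SPSS macro code; the `cases` list reuses the example rows posted earlier in the thread, while the names `variables` and `flagged_ids` are hypothetical.

```python
from collections import Counter

# Example rows from earlier in the thread.
cases = [
    {"ID": 1, "v1": "test", "v2": "house",  "v3": "tree",   "v4": "nothing", "v5": "none of these"},
    {"ID": 2, "v1": "test", "v2": "garden", "v3": "car",    "v4": "nothing", "v5": "key"},
    {"ID": 3, "v1": "sky",  "v2": "---",    "v3": "people", "v4": "key",     "v5": "nothing"},
]
variables = ["v1", "v2", "v3", "v4", "v5"]

for var in variables:
    # Count how many cases share each value of this variable (the role of NU per BREAK group).
    counts = Counter(case[var] for case in cases)
    for case in cases:
        # 1 if the value appears in two or more cases, else 0 (the role of the RECODE step).
        case[var + "flag"] = int(counts[case[var]] > 1)

# IDs with at least one flagged variable.
flagged_ids = [c["ID"] for c in cases if any(c[v + "flag"] for v in variables)]
print(flagged_ids)  # [1, 2]
```

The comparison is exact and case-sensitive, so 'test' and 'Test' would not match, and a value such as 'nothing' is only compared within its own variable.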

--
Bruce Weaver
bweaver@lakeheadu.ca
http://sites.google.com/a/lakeheadu.ca/bweaver/

"When all else fails, RTFM."

PLEASE NOTE THE FOLLOWING:
1. My Hotmail account is not monitored regularly. To send me an e-mail, please use the address shown above.
2. The SPSSX Discussion forum on Nabble is no longer linked to the SPSSX-L listserv administered by UGA (https://listserv.uga.edu/).

emma78

Re: Identify Duplicate Cases

Yes, it does exactly what I want :-)

Maguin, Eugene

Re: Identify Duplicate Cases

In reply to this post by emma78

I know you have a good solution from Bruce (and a good solution is good enough).
I'm curious about your problem and I have something in mind, which may not work out, but to begin I'd like to clearly understand the problem.

Is the following statement correct?
In a dataset consisting of non-duplicated ID numbers, determine whether a variable has the same value for
any two or more cases (rows). The variables to be checked are all strings, but of varying widths.

If a variable has the same value for multiple cases, what happens next? If four variables have the same values on the same subset of cases, what happens next?

Gene Maguin

-----Original Message-----
From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of emma78
Sent: Sunday, November 22, 2015 7:24 AM
To: [hidden email]
Subject: Re: Identify Duplicate Cases
