Identify Duplicate Cases

Re: Identify Duplicate Cases

Bruce Weaver
Administrator
In that case, here is a quick stab at including the AGGREGATE command in a macro that loops through the variables.



* Create a sample dataset.
NEW FILE.
DATASET CLOSE all.
DATA LIST list / ID (F5.0)  v1 to v5 (5A12).
BEGIN DATA
1     test    house   tree    nothing    none
2     test    garden  car     nothing    key
3      sky     ---    people  key        nothing
END DATA.
LIST.

* Insert the AGGREGATE command in a looping macro.

DEFINE !Flags (
 Root = !CHAREND('/') /
 First = !CHAREND('/') /
 Last = !CMDEND )

!DO !i = !First !TO !LAST
!LET !V = !CONCAT(!Root,!i)
!LET !Flag = !CONCAT("Flag",!V)
AGGREGATE
 /BREAK = !V
 /!Flag = NU.
RECODE !Flag (1=0) (ELSE=1).
FORMATS !Flag(F1).
VARIABLE LABELS !Flag !CONCAT(!V," value appears 2 or more times").
!DOEND
EXECUTE.
!ENDDEFINE.

* Call the macro.
*SET MPRINT ON.
!Flags Root = V / First = 1 / Last = 5.
*SET MPRINT OFF.
LIST FlagV1 to FlagV5.

Output from LIST:

FlagV1 FlagV2 FlagV3 FlagV4 FlagV5
 
   1      0      0      1      0
   1      0      0      1      0
   0      0      0      0      0
 
Number of cases read:  3    Number of cases listed:  3
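
For readers following along, each pass of the loop should expand to roughly the commands below, shown here for v1 only. This is a sketch rather than actual MPRINT output. The explicit OUTFILE = * MODE = ADDVARIABLES line is written out on the assumption that it matches what AGGREGATE does when OUTFILE is omitted, as it is in the macro.

```spss
* One loop iteration written out for v1 (sketch, not MPRINT output).
AGGREGATE
 /OUTFILE = * MODE = ADDVARIABLES
 /BREAK = v1
 /FlagV1 = NU.
* NU is the number of cases sharing this v1 value, so 1 means unique.
RECODE FlagV1 (1=0) (ELSE=1).
FORMATS FlagV1 (F1).
VARIABLE LABELS FlagV1 "v1 value appears 2 or more times".
EXECUTE.
```

In the sample data, cases 1 and 2 both have v1 = test, so NU is 2 for those two cases and FlagV1 becomes 1, which agrees with the listing above.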

emma78 wrote
Yes for it does exactly what I want:-)
--
Bruce Weaver
bweaver@lakeheadu.ca
http://sites.google.com/a/lakeheadu.ca/bweaver/

"When all else fails, RTFM."

PLEASE NOTE THE FOLLOWING: 
1. My Hotmail account is not monitored regularly. To send me an e-mail, please use the address shown above.
2. The SPSSX Discussion forum on Nabble is no longer linked to the SPSSX-L listserv administered by UGA (https://listserv.uga.edu/).

Re: Identify Duplicate Cases

David Marso
Administrator
 Bruce Weaver posted:
"In that case, here is a quick stab at including the AGGREGATE command in a macro that loops through the variables. "
<SNIP>

I don't know if I like the one data pass per variable here ;-(
Why not simply wrap FREQUENCIES in OMS with  SPLIT FILE on ID ?
Not sure why someone would want to do this in the first place.
Please reply to the list and not to my personal email.
Those desiring my consulting or training services please feel free to email me.
---
"Nolite dare sanctum canibus neque mittatis margaritas vestras ante porcos ne forte conculcent eas pedibus suis."
Cum es damnatorum possederunt porcos iens ut salire off sanguinum cliff in abyssum?"

Re: Identify Duplicate Cases

Bruce Weaver
Administrator
Good point, FREQUENCIES with OMS would eliminate all those data passes.  But if I understood correctly, Emma does not want the SPLIT FILE  by ID.  
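
A minimal sketch of the FREQUENCIES-inside-OMS idea, without SPLIT FILE, might look like the following. The dataset names raw and freqs are made up for illustration. The exact layout of the captured table (which columns hold the variable name, the value and the count) is best checked by inspecting the resulting dataset rather than taken from this sketch.

```spss
* Name the active dataset so it survives the switch to the OMS output.
DATASET NAME raw.
DATASET DECLARE freqs.
OMS
 /SELECT TABLES
 /IF COMMANDS = ['Frequencies'] SUBTYPES = ['Frequencies']
 /DESTINATION FORMAT = SAV OUTFILE = 'freqs' VIEWER = NO.
* One data pass produces the frequency tables for all five variables.
FREQUENCIES VARIABLES = v1 TO v5.
OMSEND.
* Values whose count exceeds 1 in the captured table are the repeats.
DATASET ACTIVATE freqs.
LIST.
```

Unlike the macro, this gives a table of repeated values rather than a flag on each case, so a further step would be needed to mark the cases themselves.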



Re: Identify Duplicate Cases

emma78
In reply to this post by Maguin, Eugene
Yes, that's right. If I find some with the same value, I will take a closer look at the data and probably delete those cases.

Re: Identify Duplicate Cases

emma78
In reply to this post by Bruce Weaver
Hi Bruce,
your syntax works very well :-)

Is there a chance to make it more general? For example, if the variables are not named v_1 up to v_100 but have less regular names like q3_1, q3_1_1, q_3_2, q4?

I tried to adapt it but I didn't succeed...

SPLIT FILE by ID: what do you mean by this?

I really appreciate your help!

Re: Identify Duplicate Cases

Maguin, Eugene
An alternative way to work this problem is VARSTOCASES followed by AGGREGATE in MODE=ADDVARIABLES. A value that is not duplicated will show an NU count of 1. What you do next depends on what the resulting dataset is to be. If the intent is to retain the first instance, blank out instances 2 to n, and then restructure the dataset back to wide format, that takes just one more data pass (I believe, but have not actually tested that part) plus a CASESTOVARS.
Gene Maguin
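
The restructuring route described above could be sketched as follows, using the v1 to v5 sample data from earlier in the thread. The names strval, varnum, n_same and dup are invented for this illustration, and breaking on varnum as well as on the value is an assumption made so that the result matches the per-variable flags from the macro. The sketch only flags repeats; it does not blank out the second and later instances.

```spss
* Long format: one row per ID and original variable.
VARSTOCASES
 /MAKE strval FROM v1 TO v5
 /INDEX = varnum
 /KEEP = ID
 /NULL = KEEP.
* Count how often each value occurs within each original variable.
AGGREGATE
 /OUTFILE = * MODE = ADDVARIABLES
 /BREAK = varnum strval
 /n_same = NU.
COMPUTE dup = (n_same > 1).
FORMATS dup (F1).
EXECUTE.
* Back to wide format, one row per ID.
SORT CASES BY ID varnum.
CASESTOVARS
 /ID = ID
 /INDEX = varnum
 /GROUPBY = VARIABLE.
```

After CASESTOVARS the flags should come back as one variable per original column, with names along the lines of dup.1 to dup.5 if the default separator is in effect.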



Re: Identify Duplicate Cases

Jon K Peck
In reply to this post by emma78

This would be a good place to use the SPSSINC SELECT VARIABLES extension command.


SPSSINC SELECT VARIABLES MACRONAME="!qish"
/PROPERTIES PATTERN = "q\d*_".

defines a macro named !qish listing all variables whose names start with q followed by zero or more digits followed by _. This pattern could, of course, be generalized.

Then just refer to that macro where you need a variable list.
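
As a small illustration of referring to the macro, the sketch below builds !qish and then uses it in an ordinary command. It assumes the SPSSINC SELECT VARIABLES extension is installed and that the active dataset contains an ID variable plus the q variables.

```spss
* Build the macro !qish from the variable names in the active dataset.
SPSSINC SELECT VARIABLES MACRONAME="!qish"
 /PROPERTIES PATTERN = "q\d*_".
* The macro name now stands in for the list of selected variables.
LIST VARIABLES = ID !qish.
```

Going by the description above, a name such as q4, which has no underscore, would not be picked up by this particular pattern, so the pattern would need loosening to cover it.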



Jon Peck (no "h") aka Kim
Senior Software Engineer, IBM
[hidden email]
phone: 720-342-5621


Re: Identify Duplicate Cases

emma78

Sorry for my stupid questions.
But how can I use it in the macro? Obviously I haven't got a clue... This doesn't work:


DEFINE !Flags (
 First = !CHAREND('/') /
 Last = !CMDEND )

!DO !i = !First !TO !LAST
!LET !V = !qish
!LET !Flag = !CONCAT("Flag",!i)
AGGREGATE
 /BREAK = !V
 /!Flag = NU.
RECODE !Flag (1=0) (ELSE=1).
FORMATS !Flag(F1).
VARIABLE LABELS !Flag !CONCAT(!V,"mehrfach").
!DOEND
EXECUTE.
!ENDDEFINE.

* Call the macro.
*SET MPRINT ON.
!Flags First = 1 / Last = 3.

Re: Identify Duplicate Cases

David Marso
Administrator
In reply to this post by Maguin, Eugene
I for one am having extreme difficulty grasping what the OP actually wants to achieve here and whether there is even any utility in such a thing. What is the point of this exercise in the first place? Do you REALLY want to remove all cases for which there is a duplicated value in any variable? What is the point of this? Sounds truly suspect and a very weird thing to do in my opinion.

Re: Identify Duplicate Cases

emma78
Hi David,
Yes, I want to delete them because I do not know whether one person filled out the survey twice. If the data in a string variable is the same, that is a hint of duplicates for me.
It sounds weird, but unfortunately the data looks like that...

Re: Identify Duplicate Cases

David Marso
Administrator
Sounds like overkill as stated.  Wouldn't it be more reasonable to consider cases with some substantial number of same answers as duplicates rather than basing it on a single match?
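
One rough way to move in that direction, assuming the FlagV1 to FlagV5 variables from the macro earlier in the thread are still in the file, is to count how many variables are flagged for each case. The name nDupVars and the cutoff of 2 are hypothetical. Note that this counts flagged answers per case; it does not confirm that all of them are shared with the same other case.

```spss
* Count, for each case, how many variables carry a repeated value.
COMPUTE nDupVars = SUM(FlagV1 TO FlagV5).
FORMATS nDupVars (F2).
* Hypothetical cutoff: look only at cases flagged on 2 or more variables.
TEMPORARY.
SELECT IF nDupVars >= 2.
LIST ID v1 TO v5 nDupVars.
```

With the three-case sample data, cases 1 and 2 are flagged on v1 and v4, so both would have nDupVars = 2 and appear in the listing, while case 3 would not.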


Re: Identify Duplicate Cases

Timothy Hennigar
The probability that someone filled out a survey WITH MULTIPLE OPENS and
wrote the exact same thing in all of them (case and all) is VIRTUALLY 0.
I would argue that this would not find duplicate cases at all.

If that's what you are looking for, it is a waste of time.







Re: Identify Duplicate Cases

Bruce Weaver
Administrator
In reply to this post by emma78
Here is a modified version of the macro that takes a single argument, which is a list of variables.  

* Emma had problems with macro because variable names
* are odd (e.g., q3_1, q3_1_1, q_3_2, q4).
* Create a sample dataset using those variable names.
NEW FILE.
DATASET CLOSE all.
DATA LIST list / ID (F5.0)  q3_1 q3_1_1 q_3_2 q4 q4_1 (5A12).
BEGIN DATA
1     test    house   tree    nothing    none
2     test    garden  car     nothing    key
3      sky     ---    people  key        nothing
END DATA.
LIST.

* Here is a modified version of the macro that takes
* a LIST of variable names.

DEFINE !Flags (Vlist = !CMDEND )

!DO !V !IN (!Vlist)
!LET !Flag = !CONCAT("Flag_",!V)
AGGREGATE
 /BREAK = !V
 /!Flag = NU.
RECODE !Flag (1=0) (ELSE=1).
FORMATS !Flag(F1).
VARIABLE LABELS !Flag !CONCAT(!V," value appears 2 or more times").
!DOEND
EXECUTE.
!ENDDEFINE.

* Call the macro.
*SET MPRINT ON.
!Flags Vlist = q3_1 q3_1_1 q_3_2 q4 q4_1.
*SET MPRINT OFF.
LIST Flag_q3_1 to Flag_q4_1.

Output from LIST:
Flag_q3_1 Flag_q3_1_1 Flag_q_3_2 Flag_q4 Flag_q4_1
 
    1          0           0        1        0
    1          0           0        1        0
    0          0           0        0        0
 
Number of cases read:  3    Number of cases listed:  3

The SPLIT FILE by ID stuff was referring to a completely different way to approach this that David was suggesting.  But that would assume you are only looking for duplicate values WITHIN the same ID (in a file with multiple rows per ID).  As I said before, that does not appear to be what you want to do.  
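
For anyone unsure what SPLIT FILE by ID does in general, here is a minimal sketch using the sample variable names above. It is only meant to show the mechanics, not a recommendation for this problem.

```spss
* SPLIT FILE makes each later procedure run separately within each ID.
SORT CASES BY ID.
SPLIT FILE BY ID.
FREQUENCIES VARIABLES = q3_1 TO q4_1.
SPLIT FILE OFF.
```

With one row per ID, as in the sample data, every split group holds a single case, so the per-ID tables cannot reveal values shared between different IDs.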

p.s. - Like David and some others, I too am not entirely clear on WHY you want to do what you're doing.  But never mind!  ;-)



Re: Identify Duplicate Cases

Art Kendall
In reply to this post by Maguin, Eugene
It seems that you gathered the data via some online survey system.  It makes a big difference whether the responses are "choose one" or "type in an answer".
Are the questions presented in a fixed order, a branched order, or a randomized order?

If there are answers that are logically inconsistent, you might subset your cases, e.g., (males vs females)  by (retired vs still working) etc.

Would it be a clue if some cases were incomplete?

How many cases do you have?



Art Kendall
Social Research Consultants