
Re: Dropping duplicates

Posted by Melissa Ives on Mar 21, 2007; 2:33pm
URL: http://spssx-discussion.165.s1.nabble.com/Dropping-duplicates-tp1074608p1074613.html

Just a thought: it seems like you could sort so that the record you want
to drop always FOLLOWS the one you want to keep, then use the LAG
function to identify duplicates.  Something like this:

Compute drop=(id=lag(id) and outcome="N" and lag(outcome)="1").

This will create drop=1 for any record that has the same ID as the
record before it, where the current outcome is N and the previous
record's outcome is 1.
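
A fuller sketch of this idea, using the variable names from the DATA LIST in the original post (incidentnumber, Outcome) in place of id and outcome. It assumes an ascending sort puts '1' ahead of 'N', which holds for these two codes, and it has not been run against the real file.

```spss
* Sort so that within each incident the '1' records precede the 'N' records.
SORT CASES BY incidentnumber (A) Outcome (A).

* Flag an 'N' record that directly follows a '1' record in the same incident.
COMPUTE drop = (incidentnumber = LAG(incidentnumber)
    AND Outcome = 'N' AND LAG(Outcome) = '1').
EXECUTE.

* Keep only the unflagged records.
SELECT IF (drop = 0).
EXECUTE.
```

Note that this flags at most one 'N' per incident, the one directly after the last '1'. An incident with two '1' records and two 'N' records would lose only one 'N', so a per-incident count may be needed to cover that case.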

Melissa
The bubbling brook would lose its song if you removed the rocks.


-----Original Message-----
From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of
ariel barak
Sent: Tuesday, March 20, 2007 3:26 PM
To: [hidden email]
Subject: [SPSSX-L] Dropping duplicates

Fellow SPSS users,

I have a set of data which I know has duplicates in it. Having the data
provider go through their records and flag which are duplicates and
which aren't is not an option. I have run the duplicate cases by
incident number and age in order to weed out cases which I don't believe
to be duplicates and am left with a set of data similar to that at the
end of the e-mail below. There are around 400 cases which are
differentiated from each other only by incident number and outcome;
the ages of the offenders are the same. It is possible that this same
syntax will have to be run against a much larger number of cases in the
future.

In this case, '1' stands for arrested and 'N' for not arrested. I need
syntax that will delete one record with an 'N' for each record where
there is a '1' on the incident. Here are some of the possible scenarios
and what I would like to keep using syntax. In each scenario, you can
assume that all cases have the same incident number although the
complete data set has 199 incident numbers. The number of offenders per
incident is always between 2 and 9.

The datasets at the bottom go through each of these scenarios in the
same order as they are presented here. The first set (Problem) is the
data including the duplicates I want to delete, and the second
(Solution) is the same data with those duplicates dropped.

I greatly appreciate any help that you may be able to give and will be
glad to clarify any questions. Thanks!

-Ariel Barak

Scenario 1)
Data Solution
N N
N N

Scenario 2)
Data Solution
1 1
N

Scenario 3)
Data Solution
1 1
1 1
N

Scenario 4)
Data Solution
1 1
N N
N

Scenario 5)
Data Solution
1 1
N N
N N
N N
N

Scenario 6)
Data Solution
1 1
1 1
N
N

Scenario 7)
Data Solution
1 1
1 1

data list / incidentnumber 1-9 (F) age 10-11 Outcome 12 (A) .
begin data
14386912419N
14386912419N
264872871231
26487287123N
371863475451
371863475451
37186347545N
648172350341
64817235034N
64817235034N
715484287291
71548428729N
71548428729N
71548428729N
71548428729N
864708752551
864708752551
86470875255N
86470875255N
904687125411
904687125411
end data.

value labels outcome
'1' 'Arrested'
'N' 'Not Arrested'.

DATASET NAME Problem.

data list / incidentnumber 1-9 (F) age 10-11 Outcome 12 (A) .
begin data
14386912419N
14386912419N
264872871231
371863475451
371863475451
648172350341
64817235034N
715484287291
71548428729N
71548428729N
71548428729N
864708752551
864708752551
904687125411
904687125411
end data.

value labels outcome
'1' 'Arrested'
'N' 'Not Arrested'.

DATASET NAME Solution.
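
One possible way to get from the Problem dataset to the Solution dataset, offered as a sketch and not as a tested solution: count the '1' records in each incident, number the 'N' records within the incident, and drop an 'N' whenever its sequence number does not exceed the arrest count. The helper variables arrested, n_arrested, n_seq and drop are hypothetical names, and the sketch assumes incidentnumber alone defines the group, as the scenarios above describe.

```spss
* Work on the Problem dataset defined above.
DATASET ACTIVATE Problem.

* Put the '1' records ahead of the 'N' records within each incident.
SORT CASES BY incidentnumber (A) Outcome (A).

* Attach the number of arrests in the incident to every record.
COMPUTE arrested = (Outcome = '1').
AGGREGATE
  /OUTFILE=* MODE=ADDVARIABLES
  /BREAK=incidentnumber
  /n_arrested = SUM(arrested).

* Number the 'N' records within each incident, leaving 0 on '1' records.
COMPUTE n_seq = 0.
IF (Outcome = 'N') n_seq = 1.
IF (Outcome = 'N' AND incidentnumber = LAG(incidentnumber)
    AND LAG(Outcome) = 'N') n_seq = LAG(n_seq) + 1.

* Flag one 'N' record for each '1' record in the same incident.
COMPUTE drop = (Outcome = 'N' AND n_seq <= n_arrested).
EXECUTE.

* Keep the unflagged records.
SELECT IF (drop = 0).
EXECUTE.
```

Traced by hand against the seven scenarios, this keeps the records listed in the Solution dataset, with the helper variables left in the file. Because the count is taken per incident, it should also handle incidents with several '1' and several 'N' records.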

