Dropping duplicates

classic Classic list List threaded Threaded
7 messages Options
Reply | Threaded
Open this post in threaded view
|

Dropping duplicates

ariel barak
Fellow SPSS users,

I have a set of data which I know has duplicates in it. The option of having
the data provider go through their records and signify which are duplicates
and which aren't is not an option. I have run the duplicate cases by
incident number and age in order to weed out cases which I don't believe to
be duplicates and am left with a set of data similar to that at the end of
the e-mail below. There are around 400 cases which are differentiated from
each other only by incident number and outcome - the age of the offenders
are the same. It is possible that this same syntax will have to be run
against a much larger number of cases in the future.

In this case, '1' stands for arrested and 'N' for not arrested. I need
syntax that will delete one record with an 'N' for each record where there
is a '1' on the incident. Here are some of the possible scenarios and what i
would like to keep using syntax. In each scenario, you can assume that all
cases have the same incident number although the complete data set has 199
incident numbers. The number of offenders per incident is always between 2
and 9.

The datasets at the bottom go through each of these scenarios in the same
order as they are presented here. The first set is the data with the
duplicates I want to delete and the second is with the duplicates I wish to
delete dropped...problem and solution.

I greatly appreciate any help that you may be able to give and will be glad
to clarify any questions. Thanks!

-Ariel Barak

Scenario 1)
Data Solution
N N
N N

Scenario 2)
Data Solution
1 1
N

Scenario 3)
Data Solution
1 1
1 1
N

Scenario 4)
Data Solution
1 1
N N
N

Scenario 5)
Data Solution
1 1
N N
N N
N N
N
Scenario 6)
Data Solution
1 1
1 1
N
N

Scenario 7)
Data Solution
1 1
1 1

data list / incidentnumber 1-9 (F) age 10-11 Outcome 12 (A) .
begin data
14386912419N
14386912419N
264872871231
26487287123N
371863475451
371863475451
37186347545N
648172350341
64817235034N
64817235034N
715484287291
71548428729N
71548428729N
71548428729N
71548428729N
864708752551
864708752551
86470875255N
86470875255N
904687125411
904687125411
end data.

value labels outcome
'1' 'Arrested'
'N' 'Not Arrested'.

DATASET NAME Problem.

data list / incidentnumber 1-9 (F) age 10-11 Outcome 12 (A) .
begin data
14386912419N
14386912419N
264872871231
371863475451
371863475451
648172350341
64817235034N
715484287291
71548428729N
71548428729N
71548428729N
864708752551
864708752551
 904687125411
904687125411
end data.

value labels outcome
'1' 'Arrested'
'N' 'Not Arrested'.

DATASET NAME Solution.
Reply | Threaded
Open this post in threaded view
|

Re: Dropping duplicates

Melissa Ives
Just a thought, it seems like you could sort so that the one you want to
drop always FOLLOWS the one you would want to keep then use the LAG
function to identify duplicates.  Something like this:

Compute drop=(id=lag(id) and outcome="N" and lag(outcome="1")).

This will create drop=1 for any record with the same ID where the
current outcome is N and there exists another outcome=1.

Melissa
The bubbling brook would lose its song if you removed the rocks.


-----Original Message-----
From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of
ariel barak
Sent: Tuesday, March 20, 2007 3:26 PM
To: [hidden email]
Subject: [SPSSX-L] Dropping duplicates

Fellow SPSS users,

I have a set of data which I know has duplicates in it. The option of
having the data provider go through their records and signify which are
duplicates and which aren't is not an option. I have run the duplicate
cases by incident number and age in order to weed out cases which I
don't believe to be duplicates and am left with a set of data similar to
that at the end of the e-mail below. There are around 400 cases which
are differentiated from each other only by incident number and outcome -
the age of the offenders are the same. It is possible that this same
syntax will have to be run against a much larger number of cases in the
future.

In this case, '1' stands for arrested and 'N' for not arrested. I need
syntax that will delete one record with an 'N' for each record where
there is a '1' on the incident. Here are some of the possible scenarios
and what i would like to keep using syntax. In each scenario, you can
assume that all cases have the same incident number although the
complete data set has 199 incident numbers. The number of offenders per
incident is always between 2 and 9.

The datasets at the bottom go through each of these scenarios in the
same order as they are presented here. The first set is the data with
the duplicates I want to delete and the second is with the duplicates I
wish to delete dropped...problem and solution.

I greatly appreciate any help that you may be able to give and will be
glad to clarify any questions. Thanks!

-Ariel Barak

Scenario 1)
Data Solution
N N
N N

Scenario 2)
Data Solution
1 1
N

Scenario 3)
Data Solution
1 1
1 1
N

Scenario 4)
Data Solution
1 1
N N
N

Scenario 5)
Data Solution
1 1
N N
N N
N N
N
Scenario 6)
Data Solution
1 1
1 1
N
N

Scenario 7)
Data Solution
1 1
1 1

data list / incidentnumber 1-9 (F) age 10-11 Outcome 12 (A) .
begin data
14386912419N
14386912419N
264872871231
26487287123N
371863475451
371863475451
37186347545N
648172350341
64817235034N
64817235034N
715484287291
71548428729N
71548428729N
71548428729N
71548428729N
864708752551
864708752551
86470875255N
86470875255N
904687125411
904687125411
end data.

value labels outcome
'1' 'Arrested'
'N' 'Not Arrested'.

DATASET NAME Problem.

data list / incidentnumber 1-9 (F) age 10-11 Outcome 12 (A) .
begin data
14386912419N
14386912419N
264872871231
371863475451
371863475451
648172350341
64817235034N
715484287291
71548428729N
71548428729N
71548428729N
864708752551
864708752551
 904687125411
904687125411
end data.

value labels outcome
'1' 'Arrested'
'N' 'Not Arrested'.

DATASET NAME Solution.


PRIVILEGED AND CONFIDENTIAL INFORMATION
This transmittal and any attachments may contain PRIVILEGED AND
CONFIDENTIAL information and is intended only for the use of the
addressee. If you are not the designated recipient, or an employee
or agent authorized to deliver such transmittals to the designated
recipient, you are hereby notified that any dissemination,
copying or publication of this transmittal is strictly prohibited. If
you have received this transmittal in error, please notify us
immediately by replying to the sender and delete this copy from your
system. You may also call us at (309) 827-6026 for assistance.
Reply | Threaded
Open this post in threaded view
|

Re: Dropping duplicates

ariel barak
Hi Melissa and Fellow SPSS users,

I think you're pointing me in the right direction...however, when I tried
the syntax you suggested and I got this error:


COMPUTE drop=(incidentnumber=lag(incidentnumber) AND outcome='N' AND
lag(outcome
   ='1')).

>Error # 4323 in column 85.  Text: )
>The first argument of the LAG function must be a variable.  It must not be
>a constant or an expression.
>This command not executed.

EXE.
The issue is with the second lag command...any thoughts on how to get around
this?

Your help is GREATLY appreciated.

-Ariel


On 3/21/07, Melissa Ives <[hidden email]> wrote:

>
> Just a thought, it seems like you could sort so that the one you want to
> drop always FOLLOWS the one you would want to keep then use the LAG
> function to identify duplicates.  Something like this:
>
> Compute drop=(id=lag(id) and outcome="N" and lag(outcome="1")).
>
> This will create drop=1 for any record with the same ID where the
> current outcome is N and there exists another outcome=1.
>
> Melissa
> The bubbling brook would lose its song if you removed the rocks.
>
>
> -----Original Message-----
> From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of
> ariel barak
> Sent: Tuesday, March 20, 2007 3:26 PM
> To: [hidden email]
> Subject: [SPSSX-L] Dropping duplicates
>
> Fellow SPSS users,
>
> I have a set of data which I know has duplicates in it. The option of
> having the data provider go through their records and signify which are
> duplicates and which aren't is not an option. I have run the duplicate
> cases by incident number and age in order to weed out cases which I
> don't believe to be duplicates and am left with a set of data similar to
> that at the end of the e-mail below. There are around 400 cases which
> are differentiated from each other only by incident number and outcome -
> the age of the offenders are the same. It is possible that this same
> syntax will have to be run against a much larger number of cases in the
> future.
>
> In this case, '1' stands for arrested and 'N' for not arrested. I need
> syntax that will delete one record with an 'N' for each record where
> there is a '1' on the incident. Here are some of the possible scenarios
> and what i would like to keep using syntax. In each scenario, you can
> assume that all cases have the same incident number although the
> complete data set has 199 incident numbers. The number of offenders per
> incident is always between 2 and 9.
>
> The datasets at the bottom go through each of these scenarios in the
> same order as they are presented here. The first set is the data with
> the duplicates I want to delete and the second is with the duplicates I
> wish to delete dropped...problem and solution.
>
> I greatly appreciate any help that you may be able to give and will be
> glad to clarify any questions. Thanks!
>
> -Ariel Barak
>
> Scenario 1)
> Data Solution
> N N
> N N
>
> Scenario 2)
> Data Solution
> 1 1
> N
>
> Scenario 3)
> Data Solution
> 1 1
> 1 1
> N
>
> Scenario 4)
> Data Solution
> 1 1
> N N
> N
>
> Scenario 5)
> Data Solution
> 1 1
> N N
> N N
> N N
> N
> Scenario 6)
> Data Solution
> 1 1
> 1 1
> N
> N
>
> Scenario 7)
> Data Solution
> 1 1
> 1 1
>
> data list / incidentnumber 1-9 (F) age 10-11 Outcome 12 (A) .
> begin data
> 14386912419N
> 14386912419N
> 264872871231
> 26487287123N
> 371863475451
> 371863475451
> 37186347545N
> 648172350341
> 64817235034N
> 64817235034N
> 715484287291
> 71548428729N
> 71548428729N
> 71548428729N
> 71548428729N
> 864708752551
> 864708752551
> 86470875255N
> 86470875255N
> 904687125411
> 904687125411
> end data.
>
> value labels outcome
> '1' 'Arrested'
> 'N' 'Not Arrested'.
>
> DATASET NAME Problem.
>
> data list / incidentnumber 1-9 (F) age 10-11 Outcome 12 (A) .
> begin data
> 14386912419N
> 14386912419N
> 264872871231
> 371863475451
> 371863475451
> 648172350341
> 64817235034N
> 715484287291
> 71548428729N
> 71548428729N
> 71548428729N
> 864708752551
> 864708752551
> 904687125411
> 904687125411
> end data.
>
> value labels outcome
> '1' 'Arrested'
> 'N' 'Not Arrested'.
>
> DATASET NAME Solution.
>
Reply | Threaded
Open this post in threaded view
|

Re: Dropping duplicates

Melissa Ives
In reply to this post by ariel barak
My bad.  Move the close parentheses to after 'lag(outcome' and before
the '='1')' instead of after it... Like this.

COMPUTE drop=(incidentnumber=lag(incidentnumber) AND outcome='N' AND
lag(outcome)='1').

Melissa
The bubbling brook would lose its song if you removed the rocks.


-----Original Message-----
From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of
ariel barak
Sent: Wednesday, March 21, 2007 10:26 AM
To: [hidden email]
Subject: Re: [SPSSX-L] Dropping duplicates

Hi Melissa and Fellow SPSS users,

I think you're pointing me in the right direction...however, when I
tried the syntax you suggested and I got this error:


COMPUTE drop=(incidentnumber=lag(incidentnumber) AND outcome='N' AND
lag(outcome
   ='1')).

>Error # 4323 in column 85.  Text: )
>The first argument of the LAG function must be a variable.  It must not

>be a constant or an expression.
>This command not executed.

EXE.
The issue is with the second lag command...any thoughts on how to get
around this?

Your help is GREATLY appreciated.

-Ariel


On 3/21/07, Melissa Ives <[hidden email]> wrote:
>
> Just a thought, it seems like you could sort so that the one you want
> to drop always FOLLOWS the one you would want to keep then use the LAG

> function to identify duplicates.  Something like this:
>
> Compute drop=(id=lag(id) and outcome="N" and lag(outcome="1")).
>
> This will create drop=1 for any record with the same ID where the
> current outcome is N and there exists another outcome=1.
>
> Melissa
> The bubbling brook would lose its song if you removed the rocks.
>
>
> -----Original Message-----
> From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf
> Of ariel barak
> Sent: Tuesday, March 20, 2007 3:26 PM
> To: [hidden email]
> Subject: [SPSSX-L] Dropping duplicates
>
> Fellow SPSS users,
>
> I have a set of data which I know has duplicates in it. The option of
> having the data provider go through their records and signify which
> are duplicates and which aren't is not an option. I have run the
> duplicate cases by incident number and age in order to weed out cases
> which I don't believe to be duplicates and am left with a set of data
> similar to that at the end of the e-mail below. There are around 400
> cases which are differentiated from each other only by incident number

> and outcome - the age of the offenders are the same. It is possible
> that this same syntax will have to be run against a much larger number

> of cases in the future.
>
> In this case, '1' stands for arrested and 'N' for not arrested. I need

> syntax that will delete one record with an 'N' for each record where
> there is a '1' on the incident. Here are some of the possible
> scenarios and what i would like to keep using syntax. In each
> scenario, you can assume that all cases have the same incident number
> although the complete data set has 199 incident numbers. The number of

> offenders per incident is always between 2 and 9.
>
> The datasets at the bottom go through each of these scenarios in the
> same order as they are presented here. The first set is the data with
> the duplicates I want to delete and the second is with the duplicates
> I wish to delete dropped...problem and solution.
>
> I greatly appreciate any help that you may be able to give and will be

> glad to clarify any questions. Thanks!
>
> -Ariel Barak
>
> Scenario 1)
> Data Solution
> N N
> N N
>
> Scenario 2)
> Data Solution
> 1 1
> N
>
> Scenario 3)
> Data Solution
> 1 1
> 1 1
> N
>
> Scenario 4)
> Data Solution
> 1 1
> N N
> N
>
> Scenario 5)
> Data Solution
> 1 1
> N N
> N N
> N N
> N
> Scenario 6)
> Data Solution
> 1 1
> 1 1
> N
> N
>
> Scenario 7)
> Data Solution
> 1 1
> 1 1
>
> data list / incidentnumber 1-9 (F) age 10-11 Outcome 12 (A) .
> begin data
> 14386912419N
> 14386912419N
> 264872871231
> 26487287123N
> 371863475451
> 371863475451
> 37186347545N
> 648172350341
> 64817235034N
> 64817235034N
> 715484287291
> 71548428729N
> 71548428729N
> 71548428729N
> 71548428729N
> 864708752551
> 864708752551
> 86470875255N
> 86470875255N
> 904687125411
> 904687125411
> end data.
>
> value labels outcome
> '1' 'Arrested'
> 'N' 'Not Arrested'.
>
> DATASET NAME Problem.
>
> data list / incidentnumber 1-9 (F) age 10-11 Outcome 12 (A) .
> begin data
> 14386912419N
> 14386912419N
> 264872871231
> 371863475451
> 371863475451
> 648172350341
> 64817235034N
> 715484287291
> 71548428729N
> 71548428729N
> 71548428729N
> 864708752551
> 864708752551
> 904687125411
> 904687125411
> end data.
>
> value labels outcome
> '1' 'Arrested'
> 'N' 'Not Arrested'.
>
> DATASET NAME Solution.
>


PRIVILEGED AND CONFIDENTIAL INFORMATION
This transmittal and any attachments may contain PRIVILEGED AND
CONFIDENTIAL information and is intended only for the use of the
addressee. If you are not the designated recipient, or an employee
or agent authorized to deliver such transmittals to the designated
recipient, you are hereby notified that any dissemination,
copying or publication of this transmittal is strictly prohibited. If
you have received this transmittal in error, please notify us
immediately by replying to the sender and delete this copy from your
system. You may also call us at (309) 827-6026 for assistance.
Reply | Threaded
Open this post in threaded view
|

Re: Dropping duplicates

ariel barak
Melissa and Fellow SPSS users,

Thanks for the quick response Melissa. The syntax you suggested is now
working almost perfectly. The only issue is that it does not flag both
'N's in Scenario 6 in my initial e-mail:

Scenario 6)
Data Solution
1 1
1 1
N
N

It flags the first 'N' but not the second. Further, if the data were as in
the new scenario below, it would only flag the first 'N' instead of all of
them.

Scenario 8)
Data Solution
1 1
1 1
1 1
1 1
1 1
N
N
N

Basically, if the number of 1's on an incident is equal to or greater than
the number of 'N's, all 'N's need to be flagged.

Any ideas on how to correct this?

Thanks again,
-Ariel


On 3/21/07, Melissa Ives <[hidden email]> wrote:

> My bad.  Move the close parentheses to after 'lag(outcome' and before
> the '='1')' instead of after it... Like this.
>
> COMPUTE drop=(incidentnumber=lag(incidentnumber) AND outcome='N' AND
> lag(outcome)='1').
>
> Melissa
> The bubbling brook would lose its song if you removed the rocks.
>
>
> -----Original Message-----
> From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of
> ariel barak
> Sent: Wednesday, March 21, 2007 10:26 AM
> To: [hidden email]
> Subject: Re: [SPSSX-L] Dropping duplicates
>
> Hi Melissa and Fellow SPSS users,
>
> I think you're pointing me in the right direction...however, when I
> tried the syntax you suggested and I got this error:
>
>
> COMPUTE drop=(incidentnumber=lag(incidentnumber) AND outcome='N' AND
> lag(outcome
>   ='1')).
>
> >Error # 4323 in column 85.  Text: )
> >The first argument of the LAG function must be a variable.  It must not
>
> >be a constant or an expression.
> >This command not executed.
>
> EXE.
> The issue is with the second lag command...any thoughts on how to get
> around this?
>
> Your help is GREATLY appreciated.
>
> -Ariel
>
>
> On 3/21/07, Melissa Ives <[hidden email]> wrote:
> >
> > Just a thought, it seems like you could sort so that the one you want
> > to drop always FOLLOWS the one you would want to keep then use the LAG
>
> > function to identify duplicates.  Something like this:
> >
> > Compute drop=(id=lag(id) and outcome="N" and lag(outcome="1")).
> >
> > This will create drop=1 for any record with the same ID where the
> > current outcome is N and there exists another outcome=1.
> >
> > Melissa
> > The bubbling brook would lose its song if you removed the rocks.
> >
> >
> > -----Original Message-----
> > From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf
> > Of ariel barak
> > Sent: Tuesday, March 20, 2007 3:26 PM
> > To: [hidden email]
> > Subject: [SPSSX-L] Dropping duplicates
> >
> > Fellow SPSS users,
> >
> > I have a set of data which I know has duplicates in it. The option of
> > having the data provider go through their records and signify which
> > are duplicates and which aren't is not an option. I have run the
> > duplicate cases by incident number and age in order to weed out cases
> > which I don't believe to be duplicates and am left with a set of data
> > similar to that at the end of the e-mail below. There are around 400
> > cases which are differentiated from each other only by incident number
>
> > and outcome - the age of the offenders are the same. It is possible
> > that this same syntax will have to be run against a much larger number
>
> > of cases in the future.
> >
> > In this case, '1' stands for arrested and 'N' for not arrested. I need
>
> > syntax that will delete one record with an 'N' for each record where
> > there is a '1' on the incident. Here are some of the possible
> > scenarios and what i would like to keep using syntax. In each
> > scenario, you can assume that all cases have the same incident number
> > although the complete data set has 199 incident numbers. The number of
>
> > offenders per incident is always between 2 and 9.
> >
> > The datasets at the bottom go through each of these scenarios in the
> > same order as they are presented here. The first set is the data with
> > the duplicates I want to delete and the second is with the duplicates
> > I wish to delete dropped...problem and solution.
> >
> > I greatly appreciate any help that you may be able to give and will be
>
> > glad to clarify any questions. Thanks!
> >
> > -Ariel Barak
> >
> > Scenario 1)
> > Data Solution
> > N N
> > N N
> >
> > Scenario 2)
> > Data Solution
> > 1 1
> > N
> >
> > Scenario 3)
> > Data Solution
> > 1 1
> > 1 1
> > N
> >
> > Scenario 4)
> > Data Solution
> > 1 1
> > N N
> > N
> >
> > Scenario 5)
> > Data Solution
> > 1 1
> > N N
> > N N
> > N N
> > N
> > Scenario 6)
> > Data Solution
> > 1 1
> > 1 1
> > N
> > N
> >
> > Scenario 7)
> > Data Solution
> > 1 1
> > 1 1
> >
> > data list / incidentnumber 1-9 (F) age 10-11 Outcome 12 (A) .
> > begin data
> > 14386912419N
> > 14386912419N
> > 264872871231
> > 26487287123N
> > 371863475451
> > 371863475451
> > 37186347545N
> > 648172350341
> > 64817235034N
> > 64817235034N
> > 715484287291
> > 71548428729N
> > 71548428729N
> > 71548428729N
> > 71548428729N
> > 864708752551
> > 864708752551
> > 86470875255N
> > 86470875255N
> > 904687125411
> > 904687125411
> > end data.
> >
> > value labels outcome
> > '1' 'Arrested'
> > 'N' 'Not Arrested'.
> >
> > DATASET NAME Problem.
> >
> > data list / incidentnumber 1-9 (F) age 10-11 Outcome 12 (A) .
> > begin data
> > 14386912419N
> > 14386912419N
> > 264872871231
> > 371863475451
> > 371863475451
> > 648172350341
> > 64817235034N
> > 715484287291
> > 71548428729N
> > 71548428729N
> > 71548428729N
> > 864708752551
> > 864708752551
> > 904687125411
> > 904687125411
> > end data.
> >
> > value labels outcome
> > '1' 'Arrested'
> > 'N' 'Not Arrested'.
> >
> > DATASET NAME Solution.
Reply | Threaded
Open this post in threaded view
|

Re: Dropping duplicates

Melissa Ives
In reply to this post by ariel barak
You should be able to have the 1 response from the first N response
'cascade' down to any other responses with something like this:

If (incidentnumber=lag(incidentnumber) AND outcome='N' AND
lag(drop)='1') drop=lag(drop).

Melissa
The bubbling brook would lose its song if you removed the rocks.


-----Original Message-----
From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of
ariel barak
Sent: Wednesday, March 21, 2007 11:01 AM
To: [hidden email]
Subject: Re: [SPSSX-L] Dropping duplicates

Melissa and Fellow SPSS users,

Thanks for the quick response Melissa. The syntax you suggested is now
working almost perfectly. The only issue is that it does not flag both
'N's in Scenario 6 in my initial e-mail:

Scenario 6)
Data Solution
1 1
1 1
N
N

It flags the first 'N' but not the second. Further, if the data were as
in the new scenario below, it would only flag the first 'N' instead of
all of them.

Scenario 8)
Data Solution
1 1
1 1
1 1
1 1
1 1
N
N
N

Basically, if the number of 1's on an incident is equal to or greater
than the number of 'N's, all 'N's need to be flagged.

Any ideas on how to correct this?

Thanks again,
-Ariel


On 3/21/07, Melissa Ives <[hidden email]> wrote:

> My bad.  Move the close parentheses to after 'lag(outcome' and before
> the '='1')' instead of after it... Like this.
>
> COMPUTE drop=(incidentnumber=lag(incidentnumber) AND outcome='N' AND
> lag(outcome)='1').
>
> Melissa
> The bubbling brook would lose its song if you removed the rocks.
>
>
> -----Original Message-----
> From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf
> Of ariel barak
> Sent: Wednesday, March 21, 2007 10:26 AM
> To: [hidden email]
> Subject: Re: [SPSSX-L] Dropping duplicates
>
> Hi Melissa and Fellow SPSS users,
>
> I think you're pointing me in the right direction...however, when I
> tried the syntax you suggested and I got this error:
>
>
> COMPUTE drop=(incidentnumber=lag(incidentnumber) AND outcome='N' AND
> lag(outcome
>   ='1')).
>
> >Error # 4323 in column 85.  Text: )
> >The first argument of the LAG function must be a variable.  It must
> >not
>
> >be a constant or an expression.
> >This command not executed.
>
> EXE.
> The issue is with the second lag command...any thoughts on how to get
> around this?
>
> Your help is GREATLY appreciated.
>
> -Ariel
>
>
> On 3/21/07, Melissa Ives <[hidden email]> wrote:
> >
> > Just a thought, it seems like you could sort so that the one you
> > want to drop always FOLLOWS the one you would want to keep then use
> > the LAG
>
> > function to identify duplicates.  Something like this:
> >
> > Compute drop=(id=lag(id) and outcome="N" and lag(outcome="1")).
> >
> > This will create drop=1 for any record with the same ID where the
> > current outcome is N and there exists another outcome=1.
> >
> > Melissa
> > The bubbling brook would lose its song if you removed the rocks.
> >
> >
> > -----Original Message-----
> > From: SPSSX(r) Discussion [mailto:[hidden email]] On
> > Behalf Of ariel barak
> > Sent: Tuesday, March 20, 2007 3:26 PM
> > To: [hidden email]
> > Subject: [SPSSX-L] Dropping duplicates
> >
> > Fellow SPSS users,
> >
> > I have a set of data which I know has duplicates in it. The option
> > of having the data provider go through their records and signify
> > which are duplicates and which aren't is not an option. I have run
> > the duplicate cases by incident number and age in order to weed out
> > cases which I don't believe to be duplicates and am left with a set
> > of data similar to that at the end of the e-mail below. There are
> > around 400 cases which are differentiated from each other only by
> > incident number
>
> > and outcome - the age of the offenders are the same. It is possible
> > that this same syntax will have to be run against a much larger
> > number
>
> > of cases in the future.
> >
> > In this case, '1' stands for arrested and 'N' for not arrested. I
> > need
>
> > syntax that will delete one record with an 'N' for each record where

> > there is a '1' on the incident. Here are some of the possible
> > scenarios and what i would like to keep using syntax. In each
> > scenario, you can assume that all cases have the same incident
> > number although the complete data set has 199 incident numbers. The
> > number of
>
> > offenders per incident is always between 2 and 9.
> >
> > The datasets at the bottom go through each of these scenarios in the

> > same order as they are presented here. The first set is the data
> > with the duplicates I want to delete and the second is with the
> > duplicates I wish to delete dropped...problem and solution.
> >
> > I greatly appreciate any help that you may be able to give and will
> > be
>
> > glad to clarify any questions. Thanks!
> >
> > -Ariel Barak
> >
> > Scenario 1)
> > Data Solution
> > N N
> > N N
> >
> > Scenario 2)
> > Data Solution
> > 1 1
> > N
> >
> > Scenario 3)
> > Data Solution
> > 1 1
> > 1 1
> > N
> >
> > Scenario 4)
> > Data Solution
> > 1 1
> > N N
> > N
> >
> > Scenario 5)
> > Data Solution
> > 1 1
> > N N
> > N N
> > N N
> > N
> > Scenario 6)
> > Data Solution
> > 1 1
> > 1 1
> > N
> > N
> >
> > Scenario 7)
> > Data Solution
> > 1 1
> > 1 1
> >
> > data list / incidentnumber 1-9 (F) age 10-11 Outcome 12 (A) .
> > begin data
> > 14386912419N
> > 14386912419N
> > 264872871231
> > 26487287123N
> > 371863475451
> > 371863475451
> > 37186347545N
> > 648172350341
> > 64817235034N
> > 64817235034N
> > 715484287291
> > 71548428729N
> > 71548428729N
> > 71548428729N
> > 71548428729N
> > 864708752551
> > 864708752551
> > 86470875255N
> > 86470875255N
> > 904687125411
> > 904687125411
> > end data.
> >
> > value labels outcome
> > '1' 'Arrested'
> > 'N' 'Not Arrested'.
> >
> > DATASET NAME Problem.
> >
> > data list / incidentnumber 1-9 (F) age 10-11 Outcome 12 (A) .
> > begin data
> > 14386912419N
> > 14386912419N
> > 264872871231
> > 371863475451
> > 371863475451
> > 648172350341
> > 64817235034N
> > 715484287291
> > 71548428729N
> > 71548428729N
> > 71548428729N
> > 864708752551
> > 864708752551
> > 904687125411
> > 904687125411
> > end data.
> >
> > value labels outcome
> > '1' 'Arrested'
> > 'N' 'Not Arrested'.
> >
> > DATASET NAME Solution.


PRIVILEGED AND CONFIDENTIAL INFORMATION
This transmittal and any attachments may contain PRIVILEGED AND
CONFIDENTIAL information and is intended only for the use of the
addressee. If you are not the designated recipient, or an employee
or agent authorized to deliver such transmittals to the designated
recipient, you are hereby notified that any dissemination,
copying or publication of this transmittal is strictly prohibited. If
you have received this transmittal in error, please notify us
immediately by replying to the sender and delete this copy from your
system. You may also call us at (309) 827-6026 for assistance.
Reply | Threaded
Open this post in threaded view
|

Re: Dropping duplicates

Richard Ristow
In reply to this post by ariel barak
At 04:25 PM 3/20/2007, ariel barak wrote:

>There are around 400 cases which are differentiated from each other
>only by incident number and outcome - the age of the offenders are the
>same.
>
>In this case, '1' stands for arrested and 'N' for not arrested. I need
>syntax that will delete one record with an 'N' for each record where
>there is a '1' on the [same] incident [and age].

Try this; it appears to get the result you want. It uses SPSS draft
variables, which are carried over from case to case, to look back as
many cases as necessary. SPSS 15 draft output.

DATASET ACTIVATE PROBLEM.
LIST.

List
|-----------------------------|---------------------------|
|Output Created               |21-MAR-2007 13:43:56       |
|-----------------------------|---------------------------|
[PROBLEM]

incidentnumber age Outcome

   143,869,124   19 N
   143,869,124   19 N
   264,872,871   23 1
   264,872,871   23 N
   371,863,475   45 1
   371,863,475   45 1
   371,863,475   45 N
   648,172,350   34 1
   648,172,350   34 N
   648,172,350   34 N
   715,484,287   29 1
   715,484,287   29 N
   715,484,287   29 N
   715,484,287   29 N
   715,484,287   29 N
   864,708,752   55 1
   864,708,752   55 1
   864,708,752   55 N
   864,708,752   55 N
   904,687,125   41 1
   904,687,125   41 1

Number of cases read:  21    Number of cases listed:  21


*  Identify beginning of an incident/person .

COMPUTE #New_Guy = 9.
DO IF   MISSING(LAG(incidentnumber))
      OR MISSING(LAG(age)).
.    COMPUTE #New_Guy = 1.
ELSE IF incidentnumber NE LAG(incidentnumber)
      OR age            NE LAG(age).
.    COMPUTE #New_Guy = 1.
ELSE.
.    COMPUTE #New_Guy = 0.
END IF.

*  Bug check .
DO IF   #New_Guy EQ 9.
.  PRINT / 'New_Guy bug: Case ' $CASENUM
            ' incid. ' incidentnumber ' age ' age.
END IF.

*  "#Drop_N" is the number of 'N' records to be dropped .
IF #New_Guy EQ 1  #Drop_N = 0.

NUMERIC #Drop_It (F2).
DO IF   outcome EQ '1'.
.  COMPUTE #Drop_It = 0.
.  COMPUTE #Drop_N  = #Drop_N + 1.
ELSE IF outcome EQ 'N'.
.  COMPUTE #Drop_It = #Drop_N GT 0.
.  COMPUTE #Drop_N  = #Drop_N - 1.
ELSE.
.  PRINT / 'Drop_N  bug: Case ' $CASENUM
            ' incid. ' incidentnumber ' age ' age
            ' outcome ' outcome.
END IF.

.  FORMATS #New_Guy #Drop_N (F3).

.  /*--    DO IF   #New_Guy EQ 1                       /*-*/.
.  /*--       PRINT / ' '                              /*-*/.
.  /*--    END IF                                      /*-*/.

.  /*--    PRINT /                                     /*-*/
    /*--    incidentnumber age outcome                  /*-*/
    /*--    'New: ' #New_Guy ' Drop this: ' #Drop_It    /*-*/
    /*--    'Drop yet '#Drop_It                         /*-*/.
.  /*--    EXECUTE.

SELECT IF NOT #Drop_It.

FREQUENCIES outcome.

Frequencies
|-----------------------------|---------------------------|
|Output Created               |21-MAR-2007 13:43:57       |
|-----------------------------|---------------------------|
[PROBLEM]
Statistics [suppressed - no missing data]

Outcome
|-----|---------------|---------|-------|-------------|---------------|
|     |               |Frequency|Percent|Valid Percent|Cumulative     |
|     |               |         |       |             |Percent        |
|-----|---------------|---------|-------|-------------|---------------|
|Valid|1  Arrested    |9        |60.0   |60.0         |60.0           |
|     |N  Not Arrested|6        |40.0   |40.0         |100.0          |
|     |---------------|---------|-------|-------------|---------------|
|     |Total          |15       |100.0  |100.0        |               |
|-----|---------------|---------|-------|-------------|---------------|


LIST.

List
|-----------------------------|---------------------------|
|Output Created               |21-MAR-2007 13:43:57       |
|-----------------------------|---------------------------|
[PROBLEM]

incidentnumber age Outcome

   143,869,124   19 N
   143,869,124   19 N
   264,872,871   23 1
   371,863,475   45 1
   371,863,475   45 1
   648,172,350   34 1
   648,172,350   34 N
   715,484,287   29 1
   715,484,287   29 N
   715,484,287   29 N
   715,484,287   29 N
   864,708,752   55 1
   864,708,752   55 1
   904,687,125   41 1
   904,687,125   41 1

Number of cases read:  15    Number of cases listed:  15


*  ........................................ .
*  Cross-check: desired result              .

DATASET ACTIVATE  PROBLEM.
DATASET DECLARE   Selct_N.
AGGREGATE OUTFILE=Selct_N
    /BREAK    = incidentnumber age outcome
    /N_Selctd = NU.

DATASET ACTIVATE  SOLUTION.
DATASET DECLARE   Soln_N.
AGGREGATE OUTFILE=Soln_N
    /BREAK    = incidentnumber age outcome
    /N_Wanted = NU.

MATCH FILES
    /FILE=Selct_N
    /FILE=Soln_N
    /BY         incidentnumber age outcome.

RECODE N_Selctd N_Wanted (MISSING = 0).

TEMPORARY.
SELECT IF N_Selctd NE N_Wanted.
LIST.

List
|-----------------------------|---------------------------|
|Output Created              |21-MAR-2007 13:43:59       |
|-----------------------------|---------------------------|
Number of cases read:  0    Number of cases listed:  0



===================
APPENDIX: Test data
===================
* ...................................................          .
* ............   Test data, from posting ............          .
* ...................................................          .
* I.   Input to edit   .
* ...................................................          .
NEW FILE.
data list
   / incidentnumber 1-9 (F) age 10-11 Outcome 12 (A) .
begin data
14386912419N
14386912419N
264872871231
26487287123N
371863475451
371863475451
37186347545N
648172350341
64817235034N
64817235034N
715484287291
71548428729N
71548428729N
71548428729N
71548428729N
864708752551
864708752551
86470875255N
86470875255N
904687125411
904687125411
end data.

value labels outcome
   '1' 'Arrested'
   'N' 'Not Arrested'.

FORMATS incidentnumber (COMMA12)
        /age            (F3).

SORT CASES BY incidentnumber age (A)
               outcome            (A).
DATASET NAME TestData.
* ...................................................          .
* II.  Result desired  .
* ...................................................          .


NEW FILE.
data list
   / incidentnumber 1-9 (F) age 10-11 Outcome 12 (A) .
begin data
14386912419N
14386912419N
264872871231
371863475451
371863475451
648172350341
64817235034N
715484287291
71548428729N
71548428729N
71548428729N
864708752551
864708752551
904687125411
904687125411
end data.

FORMATS incidentnumber (COMMA12)
        /age            (F3).

value labels outcome
'1' 'Arrested'
'N' 'Not Arrested'.
DATASET NAME SOLUTION.


* ...................................................          .
* III. Load input data, to run.
* ...................................................          .
DATASET ACTIVATE TestData.
DATASET COPY     PROBLEM.
* ...................................................          .