SPSSX Discussion - Re: Dropping duplicates

Posted by ariel barak on Mar 21, 2007; 4:26pm
URL: http://spssx-discussion.165.s1.nabble.com/Dropping-duplicates-tp1074608p1074614.html

Hi Melissa and Fellow SPSS users,

I think you're pointing me in the right direction...however, when I tried
the syntax you suggested, I got this error:

COMPUTE drop=(incidentnumber=lag(incidentnumber) AND outcome='N' AND
lag(outcome='1')).

>Error # 4323 in column 85. Text: )
>The first argument of the LAG function must be a variable. It must not be
>a constant or an expression.
>This command not executed.

EXE.

The issue is with the second LAG call...any thoughts on how to get around
this?
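
The error message points at the likely cause: in lag(outcome='1') the comparison sits inside the LAG call, so LAG receives an expression instead of a variable. A minimal sketch of a corrected line moves the comparison outside the parentheses. It assumes the cases are sorted so that each '1' record comes before the 'N' records of the same incident; the SORT CASES line is added here for that reason and is not part of the original suggestion.

```spss
* Assumes sorting puts each arrested record before the N records of its incident.
SORT CASES BY incidentnumber (A) outcome (A).
COMPUTE drop=(incidentnumber=LAG(incidentnumber)
  AND outcome='N' AND LAG(outcome)='1').
EXECUTE.
```

Note that this flags only an 'N' that immediately follows a '1' within the same incident. With two '1' records followed by two 'N' records (Scenario 6 in the original post), only the first 'N' would be flagged.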

Your help is GREATLY appreciated.

-Ariel

On 3/21/07, Melissa Ives <[hidden email]> wrote:

>
> Just a thought, it seems like you could sort so that the one you want to
> drop always FOLLOWS the one you would want to keep then use the LAG
> function to identify duplicates. Something like this:
>
> Compute drop=(id=lag(id) and outcome="N" and lag(outcome="1")).
>
> This will create drop=1 for any record with the same ID where the
> current outcome is N and there exists another outcome=1.
>
> Melissa
> The bubbling brook would lose its song if you removed the rocks.
>
>
> -----Original Message-----
> From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of
> ariel barak
> Sent: Tuesday, March 20, 2007 3:26 PM
> To: [hidden email]
> Subject: [SPSSX-L] Dropping duplicates
>
> Fellow SPSS users,
>
> I have a set of data which I know has duplicates in it. Having the data
> provider go through their records and flag which are duplicates and
> which aren't is not an option. I have run the duplicate
> cases by incident number and age in order to weed out cases which I
> don't believe to be duplicates and am left with a set of data similar to
> that at the end of the e-mail below. There are around 400 cases which
> are differentiated from each other only by incident number and outcome -
> the ages of the offenders are the same. It is possible that this same
> syntax will have to be run against a much larger number of cases in the
> future.
>
> In this case, '1' stands for arrested and 'N' for not arrested. I need
> syntax that will delete one record with an 'N' for each record where
> there is a '1' on the incident. Here are some of the possible scenarios
> and what I would like to keep using syntax. In each scenario, you can
> assume that all cases have the same incident number although the
> complete data set has 199 incident numbers. The number of offenders per
> incident is always between 2 and 9.
>
> The datasets at the bottom go through each of these scenarios in the
> same order as they are presented here. The first set is the data with
> the duplicates still present and the second is the same data with those
> duplicates dropped...problem and solution.
>
> I greatly appreciate any help that you may be able to give and will be
> glad to clarify any questions. Thanks!
>
> -Ariel Barak
>
> Scenario 1)
> Data   Solution
> N      N
> N      N
>
> Scenario 2)
> Data   Solution
> 1      1
> N
>
> Scenario 3)
> Data   Solution
> 1      1
> 1      1
> N
>
> Scenario 4)
> Data   Solution
> 1      1
> N      N
> N
>
> Scenario 5)
> Data   Solution
> 1      1
> N      N
> N      N
> N      N
> N
>
> Scenario 6)
> Data   Solution
> 1      1
> 1      1
> N
> N
>
> Scenario 7)
> Data   Solution
> 1      1
> 1      1
>
> data list / incidentnumber 1-9 (F) age 10-11 Outcome 12 (A) .
> begin data
> 14386912419N
> 14386912419N
> 264872871231
> 26487287123N
> 371863475451
> 371863475451
> 37186347545N
> 648172350341
> 64817235034N
> 64817235034N
> 715484287291
> 71548428729N
> 71548428729N
> 71548428729N
> 71548428729N
> 864708752551
> 864708752551
> 86470875255N
> 86470875255N
> 904687125411
> 904687125411
> end data.
>
> value labels outcome
> '1' 'Arrested'
> 'N' 'Not Arrested'.
>
> DATASET NAME Problem.
>
> data list / incidentnumber 1-9 (F) age 10-11 Outcome 12 (A) .
> begin data
> 14386912419N
> 14386912419N
> 264872871231
> 371863475451
> 371863475451
> 648172350341
> 64817235034N
> 715484287291
> 71548428729N
> 71548428729N
> 71548428729N
> 864708752551
> 864708752551
> 904687125411
> 904687125411
> end data.
>
> value labels outcome
> '1' 'Arrested'
> 'N' 'Not Arrested'.
>
> DATASET NAME Solution.
>
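
One possible way to cover all seven scenarios is to count the '1' records per incident and then drop that many 'N' records from the same incident. The sketch below is only an illustration under stated assumptions: the helper variables arrested, n_arrested, n_seq and drop are made-up names, incident number alone is assumed to identify a group, and the AGGREGATE step assumes a release that supports MODE=ADDVARIABLES.

```spss
* Sketch only. arrested, n_arrested, n_seq and drop are illustrative names.
DATASET ACTIVATE Problem.
SORT CASES BY incidentnumber (A) outcome (A).

* Count the arrested records within each incident.
COMPUTE arrested = (outcome = '1').
AGGREGATE
  /OUTFILE=* MODE=ADDVARIABLES
  /BREAK=incidentnumber
  /n_arrested = SUM(arrested).

* Number the N records 1, 2, 3 and so on within each incident.
COMPUTE n_seq = 0.
IF (outcome = 'N') n_seq = 1.
IF (outcome = 'N' AND incidentnumber = LAG(incidentnumber)
    AND LAG(outcome) = 'N') n_seq = LAG(n_seq) + 1.

* Flag one N record per arrested record in the same incident.
COMPUTE drop = (outcome = 'N' AND n_seq <= n_arrested).
* Run the pending LAG calls before any cases are removed.
EXECUTE.

SELECT IF (drop = 0).
EXECUTE.
```

Because '1' sorts before 'N', the 'N' records of an incident end up contiguous, which is what lets the LAG-based counter work. Run against the Problem data in the original post, the remaining cases should match the Solution data, but it is worth checking on a copy of the real file first.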