
Identify Duplicate Cases

Asil Ozdogru

Hello,

 

Is there a way that I can identify duplicate cases based on four variables, of which I would like the first three to be the same and the last one to be different? In other words, two cases would be labeled as duplicates if they match on the first 3 variables (StudyID Service PeriodS) and differ on the 4th variable (Source).

 

Example:

StudyID Service  PeriodS   Source  Duplicate
30001    446     7/1/2006   1      Yes
30001    446     7/1/2006   2      No
30002    452     7/1/2004   1      No
30002    452     7/1/2004   1      No

 

Thanks,

 

Asil

 


Re: Identify Duplicate Cases

ariel barak
Hi Asil,

You seem to be defining duplicate cases differently than most do. Typically I would say that the 3rd and 4th rows that you showed are duplicates, because all 4 of the variables are the same. However, the syntax below gives the results that you listed.

If there is a list of 4 cases where the StudyID Service and PeriodS are the same but the Source values are all different, say 1,2,3, and 4, only Source=4 will NOT be listed as a duplicate. See the last 4 records in the sample data below.

Hope this is what you want. If it's not, respond to the list again.

-Ari
*************************************************.
DATA LIST LIST /StudyID (F8) Service (F8) PeriodS (ADATE) Source (F8).
BEGIN DATA
30001 446 07/01/2006 1
30001 446 07/01/2006 2
30002 452 07/01/2004 1
30002 452 07/01/2004 1
30002 459 01/01/2004 1
30002 459 01/01/2004 2
30002 459 01/01/2004 3
30002 459 01/01/2004 4
END DATA.
DATASET NAME Test.

SORT CASES BY StudyID Service PeriodS (A) Source (D).

COMPUTE Duplicate=0.
IF (Lag(StudyID)=StudyID) AND (Lag(Service)=Service) AND
(Lag(PeriodS)=PeriodS) AND (Lag(Source)>Source) Duplicate=1.
EXECUTE.

SORT CASES BY StudyID Service PeriodS Source (A).

VALUE LABELS Duplicate 0 'No' 1 'Yes'.
*************************************************.



Re: Identify Duplicate Cases

Dennis Deck
In reply to this post by Asil Ozdogru

It looks like StudyID and Service are strings.

I assume that the date in PeriodS is limited in some way - here I use only Month - but you could expand on that idea. 

 

Try something like:

 

STRING MonthS (A2) .
COMPUTE MonthS = STRING(XDATE.MONTH(PeriodS), F2) .

STRING Combined (A9) .
COMPUTE Combined = CONCAT(StudyID, Service, MonthS) .

*( If the vars were numbers use: COMPUTE Combined = StudyID*100000 + Service*100 + Month ) .

SORT CASES BY Combined  Source .

COMPUTE Match = 0 .
IF ($casenum>1 AND Combined=LAG(Combined) AND Source NE LAG(Source)) Match = 1 .
* This flags a record that matches the preceding one .

* If you need to identify pairs you could reverse sort and flag the other of the pair.
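A minimal, untested sketch of that reverse-sort idea, continuing from the Combined and Match variables above (Match2 and InPair are made-up names):

SORT CASES BY Combined (A) Source (D).
COMPUTE Match2 = 0.
IF ($casenum>1 AND Combined=LAG(Combined) AND Source NE LAG(Source)) Match2 = 1 .
* A record belongs to a duplicate pair if either pass flagged it .
COMPUTE InPair = MAX(Match, Match2).
EXECUTE.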

 

 

Dennis Deck

RMC Research

 



Re: Identify Duplicate Cases

Richard Ristow
A comment on sort keys: At 02:17 AM 4/16/2010, Dennis Deck wrote,

Try something like:
 
STRING MonthS (A2) .
COMPUTE MonthS = STRING(XDATE.MONTH(PeriodS), F2) .
 
STRING Combined (A9) .
COMPUTE Combined = CONCAT(StudyID, Service, MonthS) .
 
*( If the vars were numbers use:
*  COMPUTE  Combined = StudyID*100000 + Service*100 +  Month ) .
 
SORT CASES BY Combined  Source .

It's common to build a catenated sort key like this, but I don't recommend it. It's a bother, and it isn't necessary. SORT CASES, MATCH FILES, ADD FILES, and AGGREGATE (I think those are the relevant commands, and anyhow it'll apply for other commands too) cheerfully accept sets of variables as keys:

SORT CASES BY StudyID Service MonthS  Source .

In this formulation, you needn't assume that "the date in PeriodS is limited in some way, [e.g., to] Month". If appropriate, you can use

SORT CASES BY StudyID Service PeriodS  Source .

(Also, I think the code posted has a bug: it won't work properly unless all the PeriodS dates are in the same calendar year.)
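For illustration, a rough, untested sketch of the same duplicate flag built directly on the multi-variable key, using the full PeriodS date so the calendar-year issue does not arise (DupFlag is a made-up name):

SORT CASES BY StudyID Service PeriodS (A) Source (D).
COMPUTE DupFlag = 0.
IF ($casenum > 1
    AND StudyID = LAG(StudyID)
    AND Service = LAG(Service)
    AND PeriodS = LAG(PeriodS)
    AND Source  NE LAG(Source)) DupFlag = 1.
VALUE LABELS DupFlag 0 'No' 1 'Yes'.
EXECUTE.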

Re: Identify Duplicate Cases

Richard Ristow
In reply to this post by Asil Ozdogru
At 05:35 PM 4/15/2010, Asil Ozdogru wrote:

Is there a way that I can identify duplicate cases based on four variables of which I would like first three to be the same and last one to be different? In another words, two cases would be labeled to be duplicate if they match on first 3 variables (StudyID Service PeriodS) and differ on the 4th variable (Source).
 
Example:
StudyID Service  PeriodS   Source  Duplicate
30001    446     7/1/2006   1      Yes
30001    446     7/1/2006   2      No
30002    452     7/1/2004   1      No
30002    452     7/1/2004   1      No

What bothers me is, why isn't the second row a duplicate?

Anyway, Ariel Barak and Dennis Deck have posted solutions using LAG. Here's one using AGGREGATE. It labels any group of records with same StudyID, Service, and PeriodS as 'duplicates' if more than one value of Source occurs in the group; see the second record. If you want different logic, post again. This code is tested:

*  The following SORT is necessary for MATCH FILES after AGGREGATE.  .
*  That AGGREGATE must be into a separate dataset, since there's a   .
*  subsequent AGGREGATE that relies on *this* AGGREGATE producing    .
*  exactly one record per BREAK group.                               .

SORT CASES BY     StudyID Service  PeriodS.

*  The following dataset has one record for each source encountered  .
*  for each StudyID-Service-PeriodS combination:                     .
*  (Using OUTFILE=* without MODE=ADDVARIABLES puts the AGGREGATE     .
*  output in a new, unnamed, active dataset.)                        .

AGGREGATE OUTFILE=*
  /BREAK=StudyID Service  PeriodS   Source
  /NBYSource 'No. of occurrences of this source' = NU.

*  Here, there's one record for each source:                         .

.  /**/ LIST /*-*/.
 
List
|-----------------------------|---------------------------|
|Output Created               |16-APR-2010 11:09:39       |
|-----------------------------|---------------------------|
StudyID Service    PeriodS Source NBYSource

 30001    446   07/01/2006    1          1
 30001    446   07/01/2006    2          1
 30002    452   07/01/2004    1          2

Number of cases read:  3    Number of cases listed:  3

 
*  Counting these records gives a count of the number of different   .
*  sources.                                                          .
*  (This, by the way, is the operation that cannot be done in the    .
*  original file.)                                                   .

AGGREGATE OUTFILE=*
  /BREAK=StudyID Service  PeriodS
  /NSources
  'No. of sources for this StudyID-Service-PeriodS combination' = NU.

*  And here, there's one record for each StudyID-Service-PeriodS     .
*  combination, with a count of the number of sources seen:          .

DATASET NAME      SourceCount WINDOW=FRONT.
.  /**/ LIST /*-*/.
 
List
|-----------------------------|---------------------------|
|Output Created               |16-APR-2010 11:09:39       |
|-----------------------------|---------------------------|
[SourceCount]
 
StudyID Service    PeriodS NSources

 30001    446   07/01/2006        2
 30002    452   07/01/2004        1

Number of cases read:  2    Number of cases listed:  2

*  Attach the count of sources to the original records, and classify .
*  as duplicate or not.                                              .

MATCH FILES
   /FILE =Data
   /TABLE=SourceCount
   /BY    StudyID Service  PeriodS.

DATASET NAME      ID_Dups     WINDOW=FRONT.
STRING DupsByMe (A3).
RECODE NSources (1='No') (2 THRU HI='Yes') INTO DupsByMe.
LIST.
 
List
|-----------------------------|---------------------------|
|Output Created               |16-APR-2010 11:09:40       |
|-----------------------------|---------------------------|
[ID_Dups]
 
StudyID Service    PeriodS Source Duplicate NSources DupsByMe

 30001    446   07/01/2006    1   Yes              2 Yes
 30001    446   07/01/2006    2   No               2 Yes
 30002    452   07/01/2004    1   No               1 No
 30002    452   07/01/2004    1   No               1 No

Number of cases read:  4    Number of cases listed:  4
=============================
APPENDIX: Test data, and code
=============================
DATA LIST LIST/
   StudyID Service  PeriodS   Source  Duplicate
   (N5,     F3,     ADATE10,   F2,    A4).
BEGIN DATA  
   30001    446     7/1/2006   1      Yes
   30001    446     7/1/2006   2      No
   30002    452     7/1/2004   1      No
   30002    452     7/1/2004   1      No
END DATA.
DATASET NAME      Data.

LIST.

*  The following SORT is necessary for MATCH FILES after AGGREGATE.  .
*  That AGGREGATE must be into a separate dataset, since there's a   .
*  subsequent AGGREGATE that relies on *this* AGGREGATE producing    .
*  exactly one record per BREAK group.                               .

SORT CASES BY     StudyID Service  PeriodS.

*  The following dataset has one record for each source encountered  .
*  for each StudyID-Service-PeriodS combination:                     .
*  (Using OUTFILE=* without MODE=ADDVARIABLES puts the AGGREGATE     .
*  output in a new, unnamed, active dataset.)                        .

AGGREGATE OUTFILE=*
  /BREAK=StudyID Service  PeriodS   Source
  /NBYSource 'No. of occurrences of this source' = NU.
 
*  Here, there's one record for each source:                         .

.  /**/ LIST /*-*/.

*  Counting these records gives a count of the number of different   .
*  sources.                                                          .
*  (This, by the way, is the operation that cannot be done in the    .
*  original file.)                                                   .

AGGREGATE OUTFILE=*
  /BREAK=StudyID Service  PeriodS
  /NSources
  'No. of sources for this StudyID-Service-PeriodS combination' = NU.

*  And here, there's one record for each StudyID-Service-PeriodS     .
*  combination, with a count of the number of sources seen:          .

DATASET NAME      SourceCount WINDOW=FRONT.

.  /**/ LIST /*-*/.

*  Attach the count of sources to the original records, and classify .
*  as duplicate or not.                                              .

MATCH FILES
   /FILE =Data
   /TABLE=SourceCount
   /BY    StudyID Service  PeriodS.

DATASET NAME      ID_Dups     WINDOW=FRONT.

STRING DupsByMe (A3).
RECODE NSources (1='No') (2 THRU HI='Yes') INTO DupsByMe.
LIST
.

Re: Identify Duplicate Cases

Asil Ozdogru

Thank you all. I actually want to choose between different Sources, keeping whichever has the greater number of Units, as in the example below. The LAG approach seems to be working, but I could not modify it to do what I want. Any suggestions on that?

 

Thanks,

 

StudyID Service  PeriodS   Source  Units  Duplicate
30001    408     1/1/2002    1     0.75   No
30001    408     1/1/2002    2     0.5    Yes
30002    201     7/1/2008    1     2      No
30002    201     7/1/2008    1     0.5    No
30003    564     1/1/2006    1     1      Yes
30003    564     1/1/2006    2     1.5    No

 

Asil

 



Re: Identify Duplicate Cases

Richard Ristow
At 03:44 PM 4/16/2010, Asil Ozdogru wrote:

Thank you all. I actually want to choose between different Sources, keeping whichever has the greater number of Units, as in the example below. The LAG approach seems to be working, but I could not modify it to do what I want. Any suggestions on that?

I've added StudyIDs 90001 and 90002 to your test data, and I've converted your numbers of units into integers. The logic below looks for exact equality between different counts of units, and fractions that are exactly equal mathematically may not come out exactly equal computationally.
|-----------------------------|---------------------------|
|Output Created               |16-APR-2010 17:36:24       |
|-----------------------------|---------------------------|
[Data]
StudyID Service    PeriodS Source  Units Dup

 30001    408   01/01/2002    1       75 No
 30001    408   01/01/2002    2       50 Yes
 30002    201   07/01/2008    1      200 No
 30002    201   07/01/2008    1       50 No
 30003    564   01/01/2006    1      100 Yes
 30003    564   01/01/2006    2      150 No
 90001    101   02/03/2010    1      150 --
 90001    101   02/03/2010    2       70 --
 90001    101   02/03/2010    3      150 --
 90002    103   03/01/2010    1       50 --
 90002    103   03/01/2010    2       30 --
 90002    103   03/01/2010    2       50 --
 90002    103   03/01/2010    3       10 --
 90002    103   03/01/2010    3       30 --
 90002    103   03/01/2010    4       60 --
 90002    103   03/01/2010    4       20 --

Number of cases read:  16    Number of cases listed:  16

What you now say you want looks very different from how you formerly described your problem. We'd appreciate it if you'd make clear what you want, from the beginning. Why the dickens didn't you mention 'Units', when that's a crucial variable in the selection?? You're abusing the good will of the respondents.

This no longer looks like detecting duplicates at all. Under the rule you're now describing, records can be rejected for not belonging to the source with most units, without duplicating anything. You have, rather, a complicated take/drop criterion. Anyway, here's a three-AGGREGATE solution (tested) that seems to do it.

*  Total units for each source w/in StudyID-Service-PeriodS:         .

AGGREGATE OUTFILE=* MODE=ADDVARIABLES
  /BREAK=StudyID Service  PeriodS   Source
  /TotUnits 'Total units from this source'=SUM(Units).
FORMATS
   TotUnits (F6).

*  Largest number of units for any source within                     .
*  StudyID-Service-PeriodS:                                          .

AGGREGATE OUTFILE=* MODE=ADDVARIABLES
  /BREAK=StudyID Service  PeriodS
  /MaxUnits 'Highest no. of units for any source' = MAX(TotUnits).

*  If more than one source has the same number of units, take the    .
*  lowest-numbered source                                            .

AGGREGATE OUTFILE=* MODE=ADDVARIABLES
  /BREAK=StudyID Service  PeriodS  TotUnits
  /MinSrce 'Lowest numbered source with this many units'= MIN(Source).

STRING TakeIt (A4).

IF   TotUnits EQ MaxUnits
   & Source   EQ MinSrce   TakeIt='Take'.

RECODE TakeIt (' '='Drop').

LIST.

List
|-----------------------------|---------------------------|
|Output Created               |16-APR-2010 17:36:25       |
|-----------------------------|---------------------------|
[Data]
StudyID Service    PeriodS Source  Units Dup  TotUnits MaxUnits MinSrce TakeIt

 30001    408   01/01/2002    1       75 No        75       75      1   Take
 30001    408   01/01/2002    2       50 Yes       50       75      2   Drop
 30002    201   07/01/2008    1      200 No       250      250      1   Take
 30002    201   07/01/2008    1       50 No       250      250      1   Take
 30003    564   01/01/2006    1      100 Yes      100      150      1   Drop
 30003    564   01/01/2006    2      150 No       150      150      2   Take
 90001    101   02/03/2010    1      150 --       150      150      1   Take
 90001    101   02/03/2010    2       70 --        70      150      2   Drop
 90001    101   02/03/2010    3      150 --       150      150      1   Drop
 90002    103   03/01/2010    1       50 --        50       80      1   Drop
 90002    103   03/01/2010    2       30 --        80       80      2   Take
 90002    103   03/01/2010    2       50 --        80       80      2   Take
 90002    103   03/01/2010    3       10 --        40       80      3   Drop
 90002    103   03/01/2010    3       30 --        40       80      3   Drop
 90002    103   03/01/2010    4       60 --        80       80      2   Drop
 90002    103   03/01/2010    4       20 --        80       80      2   Drop

Number of cases read:  16    Number of cases listed:  16
=============================
APPENDIX: Test data, and code
=============================
DATA LIST LIST/
   StudyID Service  PeriodS  Source Units   Dup
   (N5,      F3,    ADATE10,  F2,   F5.2,   A4).
BEGIN DATA
   30001     408    1/1/2002   1     0.75   No
   30001     408    1/1/2002   2     0.5   Yes
   30002     201    7/1/2008   1     2   No
   30002     201    7/1/2008   1     0.5   No
   30003     564    1/1/2006   1     1   Yes
   30003     564    1/1/2006   2     1.5   No
   90001     101    2/3/2010   1     1.5   --
   90001     101    2/3/2010   2     0.7   --
   90001     101    2/3/2010   3     1.5   --
   90002     103    3/1/2010   1     0.5   --
   90002     103    3/1/2010   2     0.3   --
   90002     103    3/1/2010   2     0.5   --
   90002     103    3/1/2010   3     0.1   --
   90002     103    3/1/2010   3     0.3   --
   90002     103    3/1/2010   4     0.6   --
   90002     103    3/1/2010   4     0.2   --
END DATA.
DATASET NAME     Data          WINDOW=FRONT.
*  One pain in the neck is that code requires exact comparison of    .
*  numbers of units; and fractions rarely compare exactly.           .
*  So, multiply number of units by 100 and make the result an        .
*  integer:                                                          .

COMPUTE Units = RND(Units*100).
FORMAT  Units (F6).

LIST.



*  Total units for each source w/in StudyID-Service-PeriodS:         .

AGGREGATE OUTFILE=* MODE=ADDVARIABLES
  /BREAK=StudyID Service  PeriodS   Source
  /TotUnits 'Total units from this source'=SUM(Units).
FORMATS
   TotUnits (F6).

*  Largest number of units for any source within                     .
*  StudyID-Service-PeriodS:                                          .

AGGREGATE OUTFILE=* MODE=ADDVARIABLES
  /BREAK=StudyID Service  PeriodS
  /MaxUnits 'Highest no. of units for any source' = MAX(TotUnits).

*  If more than one source has the same number of units, take the    .
*  lowest-numbered source                                            .

AGGREGATE OUTFILE=* MODE=ADDVARIABLES
  /BREAK=StudyID Service  PeriodS  TotUnits
  /MinSrce 'Lowest numbered source with this many units'= MIN(Source).
 
STRING TakeIt (A4).

IF   TotUnits EQ MaxUnits
   & Source   EQ MinSrce   TakeIt='Take'.
  
RECODE TakeIt (' '='Drop').

LIST.


Re: Identify Duplicate Cases

Albert-Jan Roskam
In reply to this post by Asil Ozdogru
Hi,

If you SORT CASES BY  StudyID (a) Service (a) PeriodS (a) units (d), the highest number of units are 'on top'. Then you can apply the COMPUTE command with LAG.

An entirely different approach would be to use CASESTOVARS and use COMPUTE x = max (unit.1 to unit.4). Here, too, the file should be sorted by id.
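A rough, untested sketch of the first (LAG) variant, assuming at most one record per Source within each StudyID-Service-PeriodS group (KeepTop is a made-up name):

SORT CASES BY StudyID (A) Service (A) PeriodS (A) Units (D).
* The first record in each group now carries the largest Units value; keep it .
COMPUTE KeepTop = 1.
IF ($casenum > 1
    AND StudyID = LAG(StudyID)
    AND Service = LAG(Service)
    AND PeriodS = LAG(PeriodS)) KeepTop = 0.
EXECUTE.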

Cheers!!
Albert-Jan

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
All right, but apart from the sanitation, the medicine, education, wine, public order, irrigation, roads, a fresh water system, and public health, what have the Romans ever done for us?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~



Re: Identify Duplicate Cases

Richard Ristow
At 03:19 AM 4/17/2010, Albert-Jan Roskam wrote:

If you SORT CASES BY  StudyID (a) Service (a) PeriodS (a) units (d), the highest number of units are 'on top'. Then you can apply the COMPUTE command with LAG.

This is a good approach, possibly superior to the three-AGGREGATE solution I posted: probably a little more efficient; possibly clearer, though it's hard to tell that without implementing.

Albert-Jan's logic, as posted, requires at most one record per Source for any StudyID-Service-PeriodS combination. There'd have to be one AGGREGATE to create a file satisfying that criterion, and then a MATCH FILES to match the results to the original file. And an ADD FILES with a /FIRST= clause would probably be easier than LAG.
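A tentative, untested sketch of that combination, assuming the original file is in a dataset named Data and reusing the TotUnits idea from the earlier post (PickSource, TakeSrc, and ID_TakeSrc are made-up names):

DATASET ACTIVATE Data.
*  One record per Source within each group, with its total units.
*  (OUTFILE=* puts the result in a new, unnamed active dataset.)
AGGREGATE OUTFILE=*
  /BREAK=StudyID Service  PeriodS   Source
  /TotUnits=SUM(Units).
*  Put the source with the most units first in each group, and let ADD FILES flag it instead of using LAG.
SORT CASES BY StudyID Service  PeriodS (A) TotUnits (D).
ADD FILES /FILE=* /BY StudyID Service PeriodS /FIRST=TakeSrc.
SORT CASES BY StudyID Service  PeriodS  Source.
DATASET NAME      PickSource.
*  Attach the flag to the original records.
DATASET ACTIVATE Data.
SORT CASES BY StudyID Service  PeriodS  Source.
MATCH FILES
   /FILE =*
   /TABLE=PickSource
   /BY    StudyID Service  PeriodS  Source.
DATASET NAME      ID_TakeSrc  WINDOW=FRONT.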

And that's about all I'd change. Thanks, Albert-Jan.

Re: Identify Duplicate Cases

Asil Ozdogru

Thank you all. There is more than one record per source for some StudyID-Service-PeriodS combinations. The AGGREGATE approach seems to be working well.

 

Excuse me for not providing a detailed description of the problem initially. I thought that I might work it out once I had the initial steps.

 

On another note, is there an easy way to have value labels of a numeric variable as another string variable rather than recode all the values?

 

Finally, could the moderator of the list (SPSSX-L) change the settings of the listserv so that our email addresses are not visible in the online archives? As it is, we are easy prey for spammers.

 

Thanks,

 

Asil

 



Re: Identify Duplicate Cases

Albert-Jan Roskam
Hi,

>>On another note, is there an easy way to have value labels of a numeric variable as >>another string variable rather than recode all the values?


==> string newvar (a80).

compute newvar = valuelabel(numvar).


Re: hiding mail addresses: good idea, although it's too late for my own address I'm afraid.



Cheers!!
Albert-Jan

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
All right, but apart from the sanitation, the medicine, education, wine, public order, irrigation, roads, a fresh water system, and public health, what have the Romans ever done for us?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~



Re: Identify Duplicate Cases

Asil Ozdogru

The valuelabel function did not work; it gave me the error below. I am using v. 13, if that makes any difference.

 

data list list / var1 (f3).
begin data
120
121
122
end data.
val lab var1
120 'A'
121 'B'
122 'C'.
string var2 (a1).
compute var2=valuelabel(var1).

>Error # 4309 in column 256.  Text: (End of Command)
>Invalid combination of data types in an assignment.  Character strings may
>only be assigned to string variables.  Numeric and logical quantities may
>only be assigned to numeric variables.  Consider using the STRING or NUMBER
>function.
>This command not executed.

exe.

 

 

Thanks,

 

Asil

 



Re: Identify Duplicate Cases

Jon K Peck

The VALUELABEL function was added in Version 14.

Sorry,

Jon Peck
SPSS, an IBM Company
[hidden email]
312-651-3435





Re: Identify Duplicate Cases [Value labels as string for numeric variables]

John F Hall
In reply to this post by Albert-Jan Roskam
Not quite sure what you mean, but I seem to remember a /COPY sub-command somewhere for automated transfer of labels, missing values, etc. to new vars.

Desperately busy rewriting all my tutorials from 15 to 18, so no time to ferret around. Searched for /COPY in the manual, but no luck. Jon or ViaAnn should know, or Bruce or Raynald.


VALUELABEL function without VALUELABEL function

Ruben Geert van den Berg
In reply to this post by Jon K Peck
I think you could mimic the VALUELABEL function with a (somewhat nasty) workaround, but I'm not entirely sure whether all the ingredients are present in V13. The syntax below gives an example; I hope it may be useful.
 
Best regards,

Ruben van den Berg

Methodologist

TNS NIPO

E: [hidden email]

P: +31 20 522 5738

I: www.tns-nipo.com


*Create testdata.
inp pro.
loop id=1 to 100.
comp v1=tru(rv.uni(1,4)).
end cas.
end loop.
end fil.
end inp pro.
datas nam d1.
val lab v1 1'bla'2'bli'3'blo'.
 
*The last match files command requires the testdata file to be sorted according to v1, therefore.
 
sor cas v1.
 
*Here the solution starts.
 
* OMS.
DATASET DECLARE  vals.
OMS
  /SELECT TABLES
  /IF COMMANDS=['File Information'] SUBTYPES=['Variable Values']
  /DESTINATION FORMAT=SAV NUMBERED=TableNumber_
   OUTFILE='vals' VIEWER=NO
  /TAG='vals'.
 
disp dic
/ var v1.
 
omsend tag=['vals'].
datas act vals.
 
*Now we have a second dataset containing the values and corresponding value labels of v1. First we'll clean it up a bit.
 
matc fil fil *
/kee var2 label.
exe.

ren var (var2 Label=v1 valuelabel_v1).
 
*And now we'll merge the value labels into the original data.
 
matc fil fil d1
/tab vals
/by v1.
exe.
 
*Now we'll close the redundant datasets and rename the merged dataset 'd1', like our original dataset.
 
datas clo all.
datas nam d1.
 
*Finished.



 



Re: Identify Duplicate Cases

Art Kendall
In reply to this post by Asil Ozdogru
VALUELABEL Function
snipped from <help>
STRING labelvar (A120).
COMPUTE labelvar=VALUELABEL(var1).
DO REPEAT varlist=var2, var3, var4
         /newvars=labelvar2, labelvar3, labelvar4.
- STRING newvars(A120).
- COMPUTE newvars=VALUELABEL(varlist).
END REPEAT.
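
The DO REPEAT simply expands to one STRING/COMPUTE pair per listed variable; for var2, for instance, it is equivalent to:

STRING labelvar2 (A120).
COMPUTE labelvar2=VALUELABEL(var2).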
  
Art Kendall
Social Research Consultants

Reply | Threaded
Open this post in threaded view
|

Re: Identify Duplicate Cases

John F Hall
In reply to this post by Asil Ozdogru

Asil

 

It's not clear what you are trying to do, but does this help?  Data list needs to know which column(s) are to be read.

 

data list list / var1 1-3 .

begin data

120

121

122

end data.

val lab var1

  120 'A'

  121 'B'

  122 'C'.

exe.

----- Original Message -----
Sent: Tuesday, April 20, 2010 11:04 PM
Subject: Re: Identify Duplicate Cases

The valuelabel command did not work; it gave me the error below. I am using v. 13 if that makes any difference.

 

 

data list list / var1 (f3).

begin data

120

121

122

end data.

val lab var1

120 'A'

121 'B'

122 'C'.

string var2 (a1).

compute var2=valuelabel(var1).

 

Thanks,

 

Asil

 


Reply | Threaded
Open this post in threaded view
|

Re: Identify Duplicate Cases

Asil Ozdogru

Thanks John,

 

My question was how I could create a new string variable based on the value labels of a numeric variable. The command I was advised to use (compute StringVar=valuelabel(NumVar)) seems to be working in versions after 14. I have version 13 and am currently using a hefty recode command to do it now.

 

Asil
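
For anyone on a release where VALUELABEL is not available, the "hefty recode" route for the small var1 example above boils down to something like this (a sketch only; for variables with many value labels this is exactly the part that Albert-Jan's Excel trick below automates):

STRING var2 (A1).
RECODE var1 (120='A') (121='B') (122='C') INTO var2.
EXECUTE.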

 


Reply | Threaded
Open this post in threaded view
|

Re: Identify Duplicate Cases

Albert-Jan Roskam
Hi,
 
In that case, you could make your life slightly easier by running DISPLAY DICTIONARY, pasting the part with the value labels into Excel, and then using string concatenation to generate SPSS syntax. It may be easiest to generate IF statements that fill a separate, pre-declared string variable. Something like:
"IF (varname eq "&B2&") labelvar = '"&C2&"'."
Note the use of single and double quotes.
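
With the earlier var1 example pasted into columns B and C, the generated lines would come out roughly as follows (assuming the target is a separate string variable declared beforehand, so the type-mismatch error quoted above does not recur):

STRING var2 (A1).
IF (var1 eq 120) var2 = 'A'.
IF (var1 eq 121) var2 = 'B'.
IF (var1 eq 122) var2 = 'C'.
EXECUTE.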

Cheers!!
Albert-Jan

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
All right, but apart from the sanitation, the medicine, education, wine, public order, irrigation, roads, a fresh water system, and public health, what have the Romans ever done for us?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~


Reply | Threaded
Open this post in threaded view
|

Re: Identify Duplicate Cases

Bruce Weaver
Administrator
In reply to this post by Asil Ozdogru
Asil Ozdogru wrote
Thanks John,

My question was how I could create a new string variable based on value labels of a numeric variable. The command I was advised (compute StringVar=valuelabel(NumVar)) seems to be working in versions after 14. I have a version 13 and currently using a hefty recode command to do it now.

Asil
http://spsstools.net/Scripts/Labels/PopulateVariableWithValueLabelsOfAnotherVariable.txt

--
Bruce Weaver
bweaver@lakeheadu.ca
http://sites.google.com/a/lakeheadu.ca/bweaver/

"When all else fails, RTFM."

PLEASE NOTE THE FOLLOWING: 
1. My Hotmail account is not monitored regularly. To send me an e-mail, please use the address shown above.
2. The SPSSX Discussion forum on Nabble is no longer linked to the SPSSX-L listserv administered by UGA (https://listserv.uga.edu/).