
Identify Duplicate Cases

Asil Ozdogru

Hello,

 

Is there a way that I can identify duplicate cases based on four variables, of which I would like the first three to be the same and the last one to be different? In other words, two cases would be labeled as duplicates if they match on the first 3 variables (StudyID Service PeriodS) and differ on the 4th variable (Source).

 

Example:

StudyID Service  PeriodS   Source  Duplicate
30001    446     7/1/2006   1      Yes
30001    446     7/1/2006   2      No
30002    452     7/1/2004   1      No
30002    452     7/1/2004   1      No

 

Thanks,

 

Asil

 


Re: Identify Duplicate Cases

ariel barak
Hi Asil,

You seem to be defining duplicate cases differently than most do. Typically I would say that the 3rd and 4th rows that you showed are duplicates, because all 4 of the variables are the same. However, the syntax below gives the results that you listed.

If there is a list of 4 cases where the StudyID Service and PeriodS are the same but the Source values are all different, say 1,2,3, and 4, only Source=4 will NOT be listed as a duplicate. See the last 4 records in the sample data below.

Hope this is what you want. If it's not, respond to the list again.

-Ari
*************************************************.
DATA LIST LIST /StudyID (F8) Service (F8) PeriodS (ADATE) Source (F8).
BEGIN DATA
30001 446 07/01/2006 1
30001 446 07/01/2006 2
30002 452 07/01/2004 1
30002 452 07/01/2004 1
30002 459 01/01/2004 1
30002 459 01/01/2004 2
30002 459 01/01/2004 3
30002 459 01/01/2004 4
END DATA.
DATASET NAME Test.

SORT CASES BY StudyID Service PeriodS (A) Source (D).

COMPUTE Duplicate=0.
IF (Lag(StudyID)=StudyID) AND (Lag(Service)=Service) AND
(Lag(PeriodS)=PeriodS) AND (Lag(Source)>Source) Duplicate=1.
EXECUTE.

SORT CASES BY StudyID Service PeriodS Source (A).

VALUE LABELS Duplicate 0 'No' 1 'Yes'.
*************************************************.



Re: Identify Duplicate Cases

Dennis Deck
In reply to this post by Asil Ozdogru

It looks like StudyID and Service are strings.

I assume that the date in PeriodS is limited in some way - here I use only Month - but you could expand on that idea. 

 

Try something like:

 

STRING MonthS (A2) .
COMPUTE MonthS = STRING(XDATE.MONTH(PeriodS), F2) .

STRING Combined (A9) .
COMPUTE Combined = CONCAT(StudyID, Service, MonthS) .

*( If the vars were numbers use: COMPUTE Combined = StudyID*100000 + Service*100 + Month ) .

SORT CASES BY Combined  Source .

COMPUTE Match = 0 .
IF ($casenum>1 AND Combined=LAG(Combined) AND Source NE LAG(Source)) Match = 1 .
* This flags a record that matches the preceding one .

* If you need to identify pairs you could reverse sort and flag the other of the pair.
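A minimal, untested sketch of that reverse-sort idea, continuing from the Combined and Match variables above (Match2 and InPair are made-up names):

SORT CASES BY Combined (A) Source (D).
COMPUTE Match2 = 0.
IF ($casenum>1 AND Combined=LAG(Combined) AND Source NE LAG(Source)) Match2 = 1 .
* A record belongs to a duplicate pair if either pass flagged it .
COMPUTE InPair = MAX(Match, Match2).
EXECUTE.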

 

 

Dennis Deck

RMC Research

 



Re: Identify Duplicate Cases

Richard Ristow
A comment on sort keys: At 02:17 AM 4/16/2010, Dennis Deck wrote,

Try something like:
 
STRING MonthS (A2) .
COMPUTE MonthS = STRING(XDATE.MONTH(PeriodS), F2) .
 
STRING Combined (A9) .
COMPUTE Combined = CONCAT(StudyID, Service, MonthS) .
 
*( If the vars were numbers use:
*  COMPUTE  Combined = StudyID*100000 + Service*100 +  Month ) .
 
SORT CASES BY Combined  Source .

It's common to build a catenated sort key like this, but I don't recommend it. It's a bother, and it isn't necessary. SORT CASES, MATCH FILES, ADD FILES, and AGGREGATE (I think those are the relevant commands, and anyhow it'll apply for other commands too) cheerfully accept sets of variables as keys:

SORT CASES BY StudyID Service MonthS  Source .

In this formulation, you needn't assume that "the date in PeriodS is limited in some way, [e.g., to] Month". If appropriate, you can use

SORT CASES BY StudyID Service PeriodS  Source .

(Also, I think the code posted has a bug: it won't work properly unless all the PeriodS dates are in the same calendar year.)
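For illustration, a rough, untested sketch of the same duplicate flag built directly on the multi-variable key, using the full PeriodS date so the calendar-year issue does not arise (DupFlag is a made-up name):

SORT CASES BY StudyID Service PeriodS (A) Source (D).
COMPUTE DupFlag = 0.
IF ($casenum > 1
    AND StudyID = LAG(StudyID)
    AND Service = LAG(Service)
    AND PeriodS = LAG(PeriodS)
    AND Source  NE LAG(Source)) DupFlag = 1.
VALUE LABELS DupFlag 0 'No' 1 'Yes'.
EXECUTE.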

Re: Identify Duplicate Cases

Richard Ristow
In reply to this post by Asil Ozdogru
At 05:35 PM 4/15/2010, Asil Ozdogru wrote:

Is there a way that I can identify duplicate cases based on four variables of which I would like first three to be the same and last one to be different? In another words, two cases would be labeled to be duplicate if they match on first 3 variables (StudyID Service PeriodS) and differ on the 4th variable (Source).
 
Example:
StudyID Service  PeriodS   Source  Duplicate
30001    446     7/1/2006   1      Yes
30001    446     7/1/2006   2      No
30002    452     7/1/2004   1      No
30002    452     7/1/2004   1      No

What bothers me is, why isn't the second row a duplicate?

Anyway, Ariel Barak and Dennis Deck have posted solutions using LAG. Here's one using AGGREGATE. It labels any group of records with same StudyID, Service, and PeriodS as 'duplicates' if more than one value of Source occurs in the group; see the second record. If you want different logic, post again. This code is tested:

*  The following SORT is necessary for MATCH FILES after AGGREGATE.  .
*  That AGGREGATE must be into a separate dataset, since there's a   .
*  subsequent AGGREGATE that relies on *this* AGGREGATE producing    .
*  exactly one record per BREAK group.                               .

SORT CASES BY     StudyID Service  PeriodS.

*  The following dataset has one record for each source encountered  .
*  for each StudyID-Service-PeriodS combination:                     .
*  (Using OUTFILE=* without MODE=ADDVARIABLES puts the AGGREGATE     .
*  output in a new, unnamed, active dataset.)                        .

AGGREGATE OUTFILE=*
  /BREAK=StudyID Service  PeriodS   Source
  /NBYSource 'No. of occurrences of this source' = NU.

*  Here, there's one record for each source:                         .

.  /**/ LIST /*-*/.
 
List
|-----------------------------|---------------------------|
|Output Created               |16-APR-2010 11:09:39       |
|-----------------------------|---------------------------|
StudyID Service    PeriodS Source NBYSource

 30001    446   07/01/2006    1          1
 30001    446   07/01/2006    2          1
 30002    452   07/01/2004    1          2

Number of cases read:  3    Number of cases listed:  3

 
*  Counting these records gives a count of the number of different   .
*  sources.                                                          .
*  (This, by the way, is the operation that cannot be done in the    .
*  original file.)                                                   .

AGGREGATE OUTFILE=*
  /BREAK=StudyID Service  PeriodS
  /NSources
  'No. of sources for this StudyID-Service-PeriodS combination' = NU.

*  And here, there's one record for each StudyID-Service-PeriodS     .
*  combination, with a count of the number of sources seen:          .

DATASET NAME      SourceCount WINDOW=FRONT.
.  /**/ LIST /*-*/.
 
List
|-----------------------------|---------------------------|
|Output Created               |16-APR-2010 11:09:39       |
|-----------------------------|---------------------------|
[SourceCount]
 
StudyID Service    PeriodS NSources

 30001    446   07/01/2006        2
 30002    452   07/01/2004        1

Number of cases read:  2    Number of cases listed:  2

*  Attach the count of sources to the original records, and classify .
*  as duplicate or not.                                              .

MATCH FILES
   /FILE =Data
   /TABLE=SourceCount
   /BY    StudyID Service  PeriodS.

DATASET NAME      ID_Dups     WINDOW=FRONT.
STRING DupsByMe (A3).
RECODE NSources (1='No') (2 THRU HI='Yes') INTO DupsByMe.
LIST.
 
List
|-----------------------------|---------------------------|
|Output Created               |16-APR-2010 11:09:40       |
|-----------------------------|---------------------------|
[ID_Dups]
 
StudyID Service    PeriodS Source Duplicate NSources DupsByMe

 30001    446   07/01/2006    1   Yes              2 Yes
 30001    446   07/01/2006    2   No               2 Yes
 30002    452   07/01/2004    1   No               1 No
 30002    452   07/01/2004    1   No               1 No

Number of cases read:  4    Number of cases listed:  4
=============================
APPENDIX: Test data, and code
=============================
DATA LIST LIST/
   StudyID Service  PeriodS   Source  Duplicate
   (N5,     F3,     ADATE10,   F2,    A4).
BEGIN DATA  
   30001    446     7/1/2006   1      Yes
   30001    446     7/1/2006   2      No
   30002    452     7/1/2004   1      No
   30002    452     7/1/2004   1      No
END DATA.
DATASET NAME      Data.

LIST.

*  The following SORT is necessary for MATCH FILES after AGGREGATE.  .
*  That AGGREGATE must be into a separate dataset, since there's a   .
*  subsequent AGGREGATE that relies on *this* AGGREGATE producing    .
*  exactly one record per BREAK group.                               .

SORT CASES BY     StudyID Service  PeriodS.

*  The following dataset has one record for each source encountered  .
*  for each StudyID-Service-PeriodS combination:                     .
*  (Using OUTFILE=* without MODE=ADDVARIABLES puts the AGGREGATE     .
*  output in a new, unnamed, active dataset.)                        .

AGGREGATE OUTFILE=*
  /BREAK=StudyID Service  PeriodS   Source
  /NBYSource 'No. of occurrences of this source' = NU.
 
*  Here, there's one record for each source:                         .

.  /**/ LIST /*-*/.

*  Counting these records gives a count of the number of different   .
*  sources.                                                          .
*  (This, by the way, is the operation that cannot be done in the    .
*  original file.)                                                   .

AGGREGATE OUTFILE=*
  /BREAK=StudyID Service  PeriodS
  /NSources
  'No. of sources for this StudyID-Service-PeriodS combination' = NU.

*  And here, there's one record for each StudyID-Service-PeriodS     .
*  combination, with a count of the number of sources seen:          .

DATASET NAME      SourceCount WINDOW=FRONT.

.  /**/ LIST /*-*/.

*  Attach the count of sources to the original records, and classify .
*  as duplicate or not.                                              .

MATCH FILES
   /FILE =Data
   /TABLE=SourceCount
   /BY    StudyID Service  PeriodS.

DATASET NAME      ID_Dups     WINDOW=FRONT.

STRING DupsByMe (A3).
RECODE NSources (1='No') (2 THRU HI='Yes') INTO DupsByMe.
LIST
.

Re: Identify Duplicate Cases

Asil Ozdogru

Thank you all. I actually want to choose between different Sources, keeping whichever has the greater number of Units, as in the example below. The LAG approach seems to be working, but I could not modify it to do what I want. Any suggestions on that?

 

Thanks,

 

StudyID Service  PeriodS   Source  Units  Duplicate
30001    408     1/1/2002    1     0.75   No
30001    408     1/1/2002    2     0.5    Yes
30002    201     7/1/2008    1     2      No
30002    201     7/1/2008    1     0.5    No
30003    564     1/1/2006    1     1      Yes
30003    564     1/1/2006    2     1.5    No

 

Asil

 



Re: Identify Duplicate Cases

Richard Ristow
At 03:44 PM 4/16/2010, Asil Ozdogru wrote:

Thank you all. I actually want to choose between different Sources, keeping whichever has the greater number of Units, as in the example below. The LAG approach seems to be working, but I could not modify it to do what I want. Any suggestions on that?

I've added StudyIDs 90001 and 90002 to your test data, and I've converted your numbers of units into integers. The logic below looks for exact equality between different counts of units, and fractions that are exactly equal mathematically may not come out exactly equal computationally.
|-----------------------------|---------------------------|
|Output Created               |16-APR-2010 17:36:24       |
|-----------------------------|---------------------------|
[Data]
StudyID Service    PeriodS Source  Units Dup

 30001    408   01/01/2002    1       75 No
 30001    408   01/01/2002    2       50 Yes
 30002    201   07/01/2008    1      200 No
 30002    201   07/01/2008    1       50 No
 30003    564   01/01/2006    1      100 Yes
 30003    564   01/01/2006    2      150 No
 90001    101   02/03/2010    1      150 --
 90001    101   02/03/2010    2       70 --
 90001    101   02/03/2010    3      150 --
 90002    103   03/01/2010    1       50 --
 90002    103   03/01/2010    2       30 --
 90002    103   03/01/2010    2       50 --
 90002    103   03/01/2010    3       10 --
 90002    103   03/01/2010    3       30 --
 90002    103   03/01/2010    4       60 --
 90002    103   03/01/2010    4       20 --

Number of cases read:  16    Number of cases listed:  16

What you now say you want looks very different from how you formerly described your problem. We'd appreciate it if you'd make clear what you want, from the beginning. Why the dickens didn't you mention 'Units', when that's a crucial variable in the selection?? You're abusing the good will of the respondents.

This no longer looks like detecting duplicates at all. Under the rule you're now describing, records can be rejected for not belonging to the source with most units, without duplicating anything. You have, rather, a complicated take/drop criterion. Anyway, here's a three-AGGREGATE solution (tested) that seems to do it.

*  Total units for each source w/in StudyID-Service-PeriodS:         .

AGGREGATE OUTFILE=* MODE=ADDVARIABLES
  /BREAK=StudyID Service  PeriodS   Source
  /TotUnits 'Total units from this source'=SUM(Units).
FORMATS
   TotUnits (F6).

*  Largest number of units for any source within                     .
*  StudyID-Service-PeriodS:                                          .

AGGREGATE OUTFILE=* MODE=ADDVARIABLES
  /BREAK=StudyID Service  PeriodS
  /MaxUnits 'Highest no. of units for any source' = MAX(TotUnits).

*  If more than one source has the same number of units, take the    .
*  lowest-numbered source                                            .

AGGREGATE OUTFILE=* MODE=ADDVARIABLES
  /BREAK=StudyID Service  PeriodS  TotUnits
  /MinSrce 'Lowest numbered source with this many units'= MIN(Source).

STRING TakeIt (A4).

IF   TotUnits EQ MaxUnits
   & Source   EQ MinSrce   TakeIt='Take'.

RECODE TakeIt (' '='Drop').

LIST.

List
|-----------------------------|---------------------------|
|Output Created               |16-APR-2010 17:36:25       |
|-----------------------------|---------------------------|
[Data]
StudyID Service    PeriodS Source  Units Dup  TotUnits MaxUnits MinSrce TakeIt

 30001    408   01/01/2002    1       75 No        75       75      1   Take
 30001    408   01/01/2002    2       50 Yes       50       75      2   Drop
 30002    201   07/01/2008    1      200 No       250      250      1   Take
 30002    201   07/01/2008    1       50 No       250      250      1   Take
 30003    564   01/01/2006    1      100 Yes      100      150      1   Drop
 30003    564   01/01/2006    2      150 No       150      150      2   Take
 90001    101   02/03/2010    1      150 --       150      150      1   Take
 90001    101   02/03/2010    2       70 --        70      150      2   Drop
 90001    101   02/03/2010    3      150 --       150      150      1   Drop
 90002    103   03/01/2010    1       50 --        50       80      1   Drop
 90002    103   03/01/2010    2       30 --        80       80      2   Take
 90002    103   03/01/2010    2       50 --        80       80      2   Take
 90002    103   03/01/2010    3       10 --        40       80      3   Drop
 90002    103   03/01/2010    3       30 --        40       80      3   Drop
 90002    103   03/01/2010    4       60 --        80       80      2   Drop
 90002    103   03/01/2010    4       20 --        80       80      2   Drop

Number of cases read:  16    Number of cases listed:  16
=============================
APPENDIX: Test data, and code
=============================
DATA LIST LIST/
   StudyID Service  PeriodS  Source Units   Dup
   (N5,      F3,    ADATE10,  F2,   F5.2,   A4).
BEGIN DATA
   30001     408    1/1/2002   1     0.75   No
   30001     408    1/1/2002   2     0.5   Yes
   30002     201    7/1/2008   1     2   No
   30002     201    7/1/2008   1     0.5   No
   30003     564    1/1/2006   1     1   Yes
   30003     564    1/1/2006   2     1.5   No
   90001     101    2/3/2010   1     1.5   --
   90001     101    2/3/2010   2     0.7   --
   90001     101    2/3/2010   3     1.5   --
   90002     103    3/1/2010   1     0.5   --
   90002     103    3/1/2010   2     0.3   --
   90002     103    3/1/2010   2     0.5   --
   90002     103    3/1/2010   3     0.1   --
   90002     103    3/1/2010   3     0.3   --
   90002     103    3/1/2010   4     0.6   --
   90002     103    3/1/2010   4     0.2   --
END DATA.
DATASET NAME     Data          WINDOW=FRONT.
*  One pain in the neck is that code requires exact comparison of    .
*  numbers of units; and fractions rarely compare exactly.           .
*  So, multiply number of units by 100 and make the result an        .
*  integer:                                                          .

COMPUTE Units = RND(Units*100).
FORMAT  Units (F6).

LIST.



*  Total units for each source w/in StudyID-Service-PeriodS:         .

AGGREGATE OUTFILE=* MODE=ADDVARIABLES
  /BREAK=StudyID Service  PeriodS   Source
  /TotUnits 'Total units from this source'=SUM(Units).
FORMATS
   TotUnits (F6).

*  Largest number of units for any source within                     .
*  StudyID-Service-PeriodS:                                          .

AGGREGATE OUTFILE=* MODE=ADDVARIABLES
  /BREAK=StudyID Service  PeriodS
  /MaxUnits 'Highest no. of units for any source' = MAX(TotUnits).

*  If more than one source has the same number of units, take the    .
*  lowest-numbered source                                            .

AGGREGATE OUTFILE=* MODE=ADDVARIABLES
  /BREAK=StudyID Service  PeriodS  TotUnits
  /MinSrce 'Lowest numbered source with this many units'= MIN(Source).
 
STRING TakeIt (A4).

IF   TotUnits EQ MaxUnits
   & Source   EQ MinSrce   TakeIt='Take'.
  
RECODE TakeIt (' '='Drop').

LIST.


Re: Identify Duplicate Cases

Albert-Jan Roskam
In reply to this post by Asil Ozdogru
Hi,

If you SORT CASES BY  StudyID (a) Service (a) PeriodS (a) units (d), the highest number of units are 'on top'. Then you can apply the COMPUTE command with LAG.

An entirely different approach would be to use CASESTOVARS and use COMPUTE x = max (unit.1 to unit.4). Here, too, the file should be sorted by id.
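A rough, untested sketch of the first (LAG) variant, assuming at most one record per Source within each StudyID-Service-PeriodS group (KeepTop is a made-up name):

SORT CASES BY StudyID (A) Service (A) PeriodS (A) Units (D).
* The first record in each group now carries the largest Units value; keep it .
COMPUTE KeepTop = 1.
IF ($casenum > 1
    AND StudyID = LAG(StudyID)
    AND Service = LAG(Service)
    AND PeriodS = LAG(PeriodS)) KeepTop = 0.
EXECUTE.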

Cheers!!
Albert-Jan

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
All right, but apart from the sanitation, the medicine, education, wine, public order, irrigation, roads, a fresh water system, and public health, what have the Romans ever done for us?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~



Re: Identify Duplicate Cases

Richard Ristow
At 03:19 AM 4/17/2010, Albert-Jan Roskam wrote:

If you SORT CASES BY  StudyID (a) Service (a) PeriodS (a) units (d), the highest number of units are 'on top'. Then you can apply the COMPUTE command with LAG.

This is a good approach, possibly superior to the three-AGGREGATE solution I posted: probably a little more efficient; possibly clearer, though it's hard to tell that without implementing.

Albert-Jan's logic, as posted, requires at most one record per Source for any StudyID-Service-PeriodS combination. There'd have to be one AGGREGATE to create a file satisfying that criterion, and then a MATCH FILES to match the results to the original file. And an ADD FILES with a /FIRST= clause would probably be easier than LAG.
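A tentative, untested sketch of that combination, assuming the original file is in a dataset named Data and reusing the TotUnits idea from the earlier post (PickSource, TakeSrc, and ID_TakeSrc are made-up names):

DATASET ACTIVATE Data.
*  One record per Source within each group, with its total units.
*  (OUTFILE=* puts the result in a new, unnamed active dataset.)
AGGREGATE OUTFILE=*
  /BREAK=StudyID Service  PeriodS   Source
  /TotUnits=SUM(Units).
*  Put the source with the most units first in each group, and let ADD FILES flag it instead of using LAG.
SORT CASES BY StudyID Service  PeriodS (A) TotUnits (D).
ADD FILES /FILE=* /BY StudyID Service PeriodS /FIRST=TakeSrc.
SORT CASES BY StudyID Service  PeriodS  Source.
DATASET NAME      PickSource.
*  Attach the flag to the original records.
DATASET ACTIVATE Data.
SORT CASES BY StudyID Service  PeriodS  Source.
MATCH FILES
   /FILE =*
   /TABLE=PickSource
   /BY    StudyID Service  PeriodS  Source.
DATASET NAME      ID_TakeSrc  WINDOW=FRONT.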

And that's about all I'd change. Thanks, Albert-Jan.

Re: Identify Duplicate Cases

Asil Ozdogru

Thank you all. There is more than one record per source for some StudyID-Service-PeriodS combinations. The AGGREGATE approach seems to be working well.

 

Excuse me for not providing a detailed description of the problem initially. I thought that I might work it out once I had the initial steps.

 

On another note, is there an easy way to have value labels of a numeric variable as another string variable rather than recode all the values?

 

Finally, could the moderator of the list (SPSSX-L) change the settings of the listserv so that our email addresses are not visible in the online archives? As it is, we are easy prey for spammers.

 

Thanks,

 

Asil

 



Re: Identify Duplicate Cases

Albert-Jan Roskam
Hi,

>>On another note, is there an easy way to have value labels of a numeric variable as >>another string variable rather than recode all the values?


==> string newvar (a80).

compute newvar = valuelabel(numvar).


Re: hiding mail addresses: good idea, although it's too late for my own address I'm afraid.



Cheers!!
Albert-Jan

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
All right, but apart from the sanitation, the medicine, education, wine, public order, irrigation, roads, a fresh water system, and public health, what have the Romans ever done for us?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~



Re: Identify Duplicate Cases

Asil Ozdogru

The valuelabel function did not work; it gave me the error below. I am using v. 13, if that makes any difference.

 

data list list / var1 (f3).
begin data
120
121
122
end data.
val lab var1
120 'A'
121 'B'
122 'C'.
string var2 (a1).
compute var2=valuelabel(var1).

>Error # 4309 in column 256.  Text: (End of Command)
>Invalid combination of data types in an assignment.  Character strings may
>only be assigned to string variables.  Numeric and logical quantities may
>only be assigned to numeric variables.  Consider using the STRING or NUMBER
>function.
>This command not executed.

exe.

 

 

Thanks,

 

Asil

 



Re: Identify Duplicate Cases

Jon K Peck

The VALUELABEL function was added in Version 14.

Sorry,

Jon Peck
SPSS, an IBM Company
[hidden email]
312-651-3435





Re: Identify Duplicate Cases [Value labels as string for numeric variables]

John F Hall
In reply to this post by Albert-Jan Roskam
Not quite sure what you mean, but I seem to remember a /COPY sub-command somewhere for automated transfer of labels, missing values, etc. to new vars.

Desperately busy rewriting all my tutorials from 15 to 18, so no time to ferret around. Searched for /COPY in the manual, but no luck. Jon or ViaAnn should know, or Bruce or Raynald.


VALUELABEL function without VALUELABEL function

Ruben Geert van den Berg
In reply to this post by Jon K Peck
I think you could mimic the VALUELABEL function with a (somewhat nasty) workaround, but I'm not entirely sure whether all the ingredients are present in V13. The syntax below gives an example; I hope it may be useful.
 
Best regards,

Ruben van den Berg

Methodologist

TNS NIPO

E: [hidden email]

P: +31 20 522 5738

I: www.tns-nipo.com


*Create testdata.
inp pro.
loop id=1 to 100.
comp v1=tru(rv.uni(1,4)).
end cas.
end loop.
end fil.
end inp pro.
datas nam d1.
val lab v1 1'bla'2'bli'3'blo'.
 
*The last match files command requires the testdata file to be sorted according to v1, therefore.
 
sor cas v1.
 
*Here the solution starts.
 
* OMS.
DATASET DECLARE  vals.
OMS
  /SELECT TABLES
  /IF COMMANDS=['File Information'] SUBTYPES=['Variable Values']
  /DESTINATION FORMAT=SAV NUMBERED=TableNumber_
   OUTFILE='vals' VIEWER=NO
  /TAG='vals'.
 
disp dic
/ var v1.
 
omsend tag=['vals'].
datas act vals.
 
*Now we have a second dataset containing the values and corresponding value labels of v1. First we'll clean it up a bit.
 
matc fil fil *
/kee var2 label.
exe.

ren var (var2 Label=v1 valuelabel_v1).
 
*And now we'll merge the value labels into the original data.
 
matc fil fil d1
/tab vals
/by v1.
exe.
 
*Now we'll close the redundant datasets and rename the merged dataset 'd1', like our original dataset.
 
datas clo all.
datas nam d1.
 
*Finished.



 



Re: Identify Duplicate Cases

Art Kendall
In reply to this post by Asil Ozdogru
VALUELABEL Function
snipped from <help>
STRING labelvar (A120).
COMPUTE labelvar=VALUELABEL(var1).
DO REPEAT varlist=var2, var3, var4
         /newvars=labelvar2, labelvar3, labelvar4.
- STRING newvars(A120).
- COMPUTE newvars=VALUELABEL(varlist).
END REPEAT.
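
The DO REPEAT simply expands to one STRING/COMPUTE pair per listed variable; for var2, for instance, it is equivalent to:

STRING labelvar2 (A120).
COMPUTE labelvar2=VALUELABEL(var2).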
  
Art Kendall
Social Research Consultants

Reply | Threaded
Open this post in threaded view
|

Re: Identify Duplicate Cases

John F Hall
In reply to this post by Asil Ozdogru

Asil

 

It's not clear what you are trying to do, but does this help?  Data list needs to know which column(s) are to be read.

 

data list list / var1 1-3 .

begin data

120

121

122

end data.

val lab var1

  120 'A'

  121 'B'

  122 'C'.

exe.

----- Original Message -----
Sent: Tuesday, April 20, 2010 11:04 PM
Subject: Re: Identify Duplicate Cases

The valuelabel command did not work; it gave me the error below. I am using v. 13 if that makes any difference.

 

 

data list list / var1 (f3).

begin data

120

121

122

end data.

val lab var1

120 'A'

121 'B'

122 'C'.

string var2 (a1).

compute var2=valuelabel(var1).

 

Thanks,

 

Asil

 


Reply | Threaded
Open this post in threaded view
|

Re: Identify Duplicate Cases

Asil Ozdogru

Thanks John,

 

My question was how I could create a new string variable based on the value labels of a numeric variable. The command I was advised to use (compute StringVar=valuelabel(NumVar)) seems to be working in versions after 14. I have version 13 and am currently using a hefty recode command to do it now.

 

Asil
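
For anyone on a release where VALUELABEL is not available, the "hefty recode" route for the small var1 example above boils down to something like this (a sketch only; for variables with many value labels this is exactly the part that Albert-Jan's Excel trick below automates):

STRING var2 (A1).
RECODE var1 (120='A') (121='B') (122='C') INTO var2.
EXECUTE.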

 


Reply | Threaded
Open this post in threaded view
|

Re: Identify Duplicate Cases

Albert-Jan Roskam
Hi,
 
In that case, you could make your life slightly easier by running DISPLAY DICTIONARY, pasting the part with the value labels into Excel, and then using string concatenation to generate SPSS syntax. It may be easiest to generate IF statements that fill a separate, pre-declared string variable. Something like:
"IF (varname eq "&B2&") labelvar = '"&C2&"'."
Note the use of single and double quotes.
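
With the earlier var1 example pasted into columns B and C, the generated lines would come out roughly as follows (assuming the target is a separate string variable declared beforehand, so the type-mismatch error quoted above does not recur):

STRING var2 (A1).
IF (var1 eq 120) var2 = 'A'.
IF (var1 eq 121) var2 = 'B'.
IF (var1 eq 122) var2 = 'C'.
EXECUTE.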

Cheers!!
Albert-Jan

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
All right, but apart from the sanitation, the medicine, education, wine, public order, irrigation, roads, a fresh water system, and public health, what have the Romans ever done for us?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~


Reply | Threaded
Open this post in threaded view
|

Re: Identify Duplicate Cases

Bruce Weaver
Administrator
In reply to this post by Asil Ozdogru
Asil Ozdogru wrote
Thanks John,

My question was how I could create a new string variable based on value labels of a numeric variable. The command I was advised (compute StringVar=valuelabel(NumVar)) seems to be working in versions after 14. I have a version 13 and currently using a hefty recode command to do it now.

Asil
http://spsstools.net/Scripts/Labels/PopulateVariableWithValueLabelsOfAnotherVariable.txt

--
Bruce Weaver
bweaver@lakeheadu.ca
http://sites.google.com/a/lakeheadu.ca/bweaver/

"When all else fails, RTFM."

PLEASE NOTE THE FOLLOWING: 
1. My Hotmail account is not monitored regularly. To send me an e-mail, please use the address shown above.
2. The SPSSX Discussion forum on Nabble is no longer linked to the SPSSX-L listserv administered by UGA (https://listserv.uga.edu/).