Duplicates based on two columns

classic Classic list List threaded Threaded
3 messages Options
Reply | Threaded
Open this post in threaded view
|

Duplicates based on two columns

mils
This post was updated on .
Hi,

I would like to find out which doctors have entered surveys in different waves. I have the following example data:

userid wave
1006 1
1006 1
1006 1
1006 2
1006 2
1006 3
1006 3
1006 4
1006 4
1005 1
1005 1
1005 1
1005 1
1005 1

and I would like to create a colum with the following information:

userid wave repeted_waves
1006 1 1
1006 1 1
1006 1 1
1006 2 1
1006 2 1
1006 3 1
1006 3 1
1006 4 1
1006 4 1
1005 1 0
1005 1 0
1005 1 0
1005 1 0
1005 1 0

where 1 means that they entered data in different waves and 0 means they entered data only in one wave.

Doctor 1006 entered data on different waves (1,2,3 and 4). On the other hand, doctor 1005 did not. Doctor 1005 only entered data in one wave (1).

I hope it makes sense. I've tried aggregate, but couldn't get the right numbers.

Thanks for your help.

mils
Reply | Threaded
Open this post in threaded view
|

Re: Duplicates based on two columns

Jignesh Sutar
Unfortunately you can  not do this in a single step. Would be good if you could do something like this:

AGGREGATE OUTFILE=* MODE=ADDVARIABLES /BREAK=UserID /WavesPerDoc=UNIQUEVALUES(wave).

However instead you can do this:

data list free /userid (f4.0) wave (f1.0).
BEGIN DATA
1006 1
1006 1
1006 1
1006 2
1006 2
1006 3
1006 3
1006 4
1006 4
1005 1
1005 1
1005 1
1005 1
1005 1
end data.


sort cases by userid wave.
match files file=* /by userid wave /first=FirstUserWave.
aggregate outfile=* mode=addvariables /break=userid /WavesPerDoc=sum(FirstUserWave).
compute RepWaveDocInd=WavesPerDoc>1.


You can also use the same logic to calculate the number of doctors per wave i.e. Waves 1 contained 2 doctors and Waves 2,3,4 contained only 1 doctor.

aggregate outfile=* mode=addvariables /break=wave /DocPerWave=sum(FirstUserWave).

You might find it helpful also, if not done so already, to create a doctor level indicator in the file if you need to extract out any figures at a doctor level. For example:


sort cases by userid.
match files file=* /by userid /first=FirstDoc.
recode FirstDoc (0=sysmis).

/* For each doctor ID produce table for count of waves completed*/.
ctables /table (FirstDoc>userid)[c] by WavesPerDoc[s][mean 'Waves answered per doctor'].  

/* frequency count of doctors completing x number of waves */.
ctables /table WavesPerDoc[c] by FirstDoc[c].
Reply | Threaded
Open this post in threaded view
|

Re: Duplicates based on two columns

Bruce Weaver
Administrator
I was also thinking of AGGREGATE, but used the SD function, like this:

DATA LIST free /userid (f4.0) wave (f1.0).
BEGIN DATA
1006 1
1006 1
1006 1
1006 2
1006 2
1006 3
1006 3
1006 4
1006 4
1005 1
1005 1
1005 1
1005 1
1005 1
END DATA.

AGGREGATE
  /OUTFILE=* MODE=ADDVARIABLES
  /BREAK=userid
  /repeated_waves=SD(wave).
* If there is only 1 wave, SD = 0; if multiple waves, SD > 0.
COMPUTE repeated_waves = (repeated_waves GT 0).
FORMATS repeated_waves (F1).
LIST.

OUTPUT:
userid wave repeated_waves
 
 1006    1         1
 1006    1         1
 1006    1         1
 1006    2         1
 1006    2         1
 1006    3         1
 1006    3         1
 1006    4         1
 1006    4         1
 1005    1         0
 1005    1         0
 1005    1         0
 1005    1         0
 1005    1         0
 
Number of cases read:  14    Number of cases listed:  14


Jignesh Sutar wrote
Unfortunately you can  not do this in a single step. Would be good if you could do something like this:

AGGREGATE OUTFILE=* MODE=ADDVARIABLES /BREAK=UserID /WavesPerDoc=UNIQUEVALUES(wave).

However instead you can do this:

data list free /userid (f4.0) wave (f1.0).
BEGIN DATA
1006 1
1006 1
1006 1
1006 2
1006 2
1006 3
1006 3
1006 4
1006 4
1005 1
1005 1
1005 1
1005 1
1005 1
end data.


sort cases by userid wave.
match files file=* /by userid wave /first=FirstUserWave.
aggregate outfile=* mode=addvariables /break=userid /WavesPerDoc=sum(FirstUserWave).
compute RepWaveDocInd=WavesPerDoc>1.


You can also use the same logic to calculate the number of doctors per wave i.e. Waves 1 contained 2 doctors and Waves 2,3,4 contained only 1 doctor.

aggregate outfile=* mode=addvariables /break=wave /DocPerWave=sum(FirstUserWave).

You might find it helpful also, if not done so already, to create a doctor level indicator in the file if you need to extract out any figures at a doctor level. For example:


sort cases by userid.
match files file=* /by userid /first=FirstDoc.
recode FirstDoc (0=sysmis).

/* For each doctor ID produce table for count of waves completed*/.
ctables /table (FirstDoc>userid)[c] by WavesPerDoc[s][mean 'Waves answered per doctor'].  

/* frequency count of doctors completing x number of waves */.
ctables /table WavesPerDoc[c] by FirstDoc[c].
--
Bruce Weaver
bweaver@lakeheadu.ca
http://sites.google.com/a/lakeheadu.ca/bweaver/

"When all else fails, RTFM."

PLEASE NOTE THE FOLLOWING: 
1. My Hotmail account is not monitored regularly. To send me an e-mail, please use the address shown above.
2. The SPSSX Discussion forum on Nabble is no longer linked to the SPSSX-L listserv administered by UGA (https://listserv.uga.edu/).