SPSSX Discussion

Duplicates based on two columns

Classic

List

Threaded

3 messages Options

mils

Duplicates based on two columns

This post was updated on .

Hi,

I would like to find out which doctors have entered surveys in different waves. I have the following example data:

userid wave
1006 1
1006 1
1006 1
1006 2
1006 2
1006 3
1006 3
1006 4
1006 4
1005 1
1005 1
1005 1
1005 1
1005 1

and I would like to create a colum with the following information:

userid wave repeted_waves
1006 1 1
1006 1 1
1006 1 1
1006 2 1
1006 2 1
1006 3 1
1006 3 1
1006 4 1
1006 4 1
1005 1 0
1005 1 0
1005 1 0
1005 1 0
1005 1 0

where 1 means that they entered data in different waves and 0 means they entered data only in one wave.

Doctor 1006 entered data on different waves (1,2,3 and 4). On the other hand, doctor 1005 did not. Doctor 1005 only entered data in one wave (1).

I hope it makes sense. I've tried aggregate, but couldn't get the right numbers.

Thanks for your help.

mils

Jignesh Sutar

Re: Duplicates based on two columns

Unfortunately you can not do this in a single step. Would be good if you could do something like this:

AGGREGATE OUTFILE=* MODE=ADDVARIABLES /BREAK=UserID /WavesPerDoc=UNIQUEVALUES(wave).

However instead you can do this:

data list free /userid (f4.0) wave (f1.0).
BEGIN DATA
1006 1
1006 1
1006 1
1006 2
1006 2
1006 3
1006 3
1006 4
1006 4
1005 1
1005 1
1005 1
1005 1
1005 1
end data.

sort cases by userid wave.
match files file=* /by userid wave /first=FirstUserWave.
aggregate outfile=* mode=addvariables /break=userid /WavesPerDoc=sum(FirstUserWave).
compute RepWaveDocInd=WavesPerDoc>1.

You can also use the same logic to calculate the number of doctors per wave i.e. Waves 1 contained 2 doctors and Waves 2,3,4 contained only 1 doctor.

aggregate outfile=* mode=addvariables /break=wave /DocPerWave=sum(FirstUserWave).

You might find it helpful also, if not done so already, to create a doctor level indicator in the file if you need to extract out any figures at a doctor level. For example:

sort cases by userid.
match files file=* /by userid /first=FirstDoc.
recode FirstDoc (0=sysmis).

/* For each doctor ID produce table for count of waves completed*/.
ctables /table (FirstDoc>userid)[c] by WavesPerDoc[s][mean 'Waves answered per doctor'].

/* frequency count of doctors completing x number of waves */.
ctables /table WavesPerDoc[c] by FirstDoc[c].

Bruce Weaver

Re: Duplicates based on two columns

Administrator

I was also thinking of AGGREGATE, but used the SD function, like this:

DATA LIST free /userid (f4.0) wave (f1.0).
BEGIN DATA
1006 1
1006 1
1006 1
1006 2
1006 2
1006 3
1006 3
1006 4
1006 4
1005 1
1005 1
1005 1
1005 1
1005 1
END DATA.

AGGREGATE
/OUTFILE=* MODE=ADDVARIABLES
/BREAK=userid
/repeated_waves=SD(wave).
* If there is only 1 wave, SD = 0; if multiple waves, SD > 0.
COMPUTE repeated_waves = (repeated_waves GT 0).
FORMATS repeated_waves (F1).
LIST.

OUTPUT:
userid wave repeated_waves

1006 1 1
1006 1 1
1006 1 1
1006 2 1
1006 2 1
1006 3 1
1006 3 1
1006 4 1
1006 4 1
1005 1 0
1005 1 0
1005 1 0
1005 1 0
1005 1 0

Number of cases read: 14 Number of cases listed: 14

Jignesh Sutar wrote

Unfortunately you can not do this in a single step. Would be good if you could do something like this:

AGGREGATE OUTFILE=* MODE=ADDVARIABLES /BREAK=UserID /WavesPerDoc=UNIQUEVALUES(wave).

However instead you can do this:

data list free /userid (f4.0) wave (f1.0).
BEGIN DATA
1006 1
1006 1
1006 1
1006 2
1006 2
1006 3
1006 3
1006 4
1006 4
1005 1
1005 1
1005 1
1005 1
1005 1
end data.

sort cases by userid wave.
match files file=* /by userid wave /first=FirstUserWave.
aggregate outfile=* mode=addvariables /break=userid /WavesPerDoc=sum(FirstUserWave).
compute RepWaveDocInd=WavesPerDoc>1.

You can also use the same logic to calculate the number of doctors per wave i.e. Waves 1 contained 2 doctors and Waves 2,3,4 contained only 1 doctor.

aggregate outfile=* mode=addvariables /break=wave /DocPerWave=sum(FirstUserWave).

You might find it helpful also, if not done so already, to create a doctor level indicator in the file if you need to extract out any figures at a doctor level. For example:

sort cases by userid.
match files file=* /by userid /first=FirstDoc.
recode FirstDoc (0=sysmis).

/* For each doctor ID produce table for count of waves completed*/.
ctables /table (FirstDoc>userid)[c] by WavesPerDoc[s][mean 'Waves answered per doctor'].

/* frequency count of doctors completing x number of waves */.
ctables /table WavesPerDoc[c] by FirstDoc[c].

--
Bruce Weaver
bweaver@lakeheadu.ca
http://sites.google.com/a/lakeheadu.ca/bweaver/

"When all else fails, RTFM."

PLEASE NOTE THE FOLLOWING:
1. My Hotmail account is not monitored regularly. To send me an e-mail, please use the address shown above.
2. The SPSSX Discussion forum on Nabble is no longer linked to the SPSSX-L listserv administered by UGA (https://listserv.uga.edu/).