partial deduplication


Albert-Jan Roskam
Hi list,

I have some output from a probabilistic linkage job. For each ID I want to select the record pair with the highest weight (the 'dominance approach'). However, if the next record's weight is only slightly lower (a difference of less than 200 in the syntax below), I would like to include that record as well. The syntax below does the job, but I don't think it is very efficient (especially the two CREATE commands).

Does anybody have suggestions for improvement? The original dataset has a lot of variables, so I'm not sure whether CASESTOVARS is a good idea.

Thanks in advance!

Cheers!!
Albert-Jan


data list free / id_mnd weight.
begin data
0 500
0 900
0 900
1 300
2 200
2 100
3 1000
3 10
4 120
4 300
5 200
end data.


* identify non-unique double cases.
sort cases by id_mnd (a) weight (d).
compute double = 1.
if (id_mnd ne lag(id_mnd) or $casenum = 1) double = 0.

* copy id & weight of next case.
create antilag_weight = lead (weight, 1).
create antilag_id_mnd = lead (id_mnd, 1).

* calculate weight difference.
if (id_mnd =  antilag_id_mnd ) weight_diff  = weight - antilag_weight.
if (double = 1 and $casenum ne 1) weight_diff =  lag(weight) - weight.
exe.

compute filter = (double = 0 or weight_diff < 200).

add files / file = * / drop = antilag_weight antilag_id_mnd.

select if filter = 1.

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD

Re: partial deduplication

ViAnn Beadle
CREATE is a procedure, so each invocation makes its own pass through the
data. You don't need to specify it twice; a single CREATE can compute both
variables at once. However, if you reverse the sort on ID, won't LAG work,
obviating the need for the CREATE command altogether?
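
A minimal, untested sketch of both suggestions, using the variable names from the original post. The flag name keep_rec is hypothetical and only for illustration. The second variant relies on the existing sort (ID ascending, weight descending), so the previous case is always the next-higher weight within the same ID and no lead values are needed. It is meant to mirror the original filter, not to replace it with a different rule.

```spss
* Suggestion 1: request both lead series in a single CREATE, one data pass.
sort cases by id_mnd (a) weight (d).
create antilag_weight antilag_id_mnd = lead(weight id_mnd, 1).

* Suggestion 2, a sketch: skip CREATE and compare each case with the previous one.
* keep_rec is a hypothetical flag name, not part of the original syntax.
sort cases by id_mnd (a) weight (d).
compute keep_rec = 1.
if (id_mnd eq lag(id_mnd) and lag(weight) - weight ge 200) keep_rec = 0.
execute.
select if (keep_rec eq 1).
execute.
```

The first case of each ID is always kept, because the IF only fires when the previous case has the same ID. Like the original syntax, this compares each case with the one directly before it, not with the highest weight in the group, so a case can still be kept after an earlier case with the same ID was dropped.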

-----Original Message-----
From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of
Albert-jan Roskam
Sent: Wednesday, August 06, 2008 9:45 AM
To: [hidden email]
Subject: partial deduplication

