|
Hi list,
I have some output of a probabilistic linkage job. Basically, for each ID I want to select the record pair with the highest weight ('dominance approach'). However, if the weight difference is rather small, I would like to include that record as well. The syntax below does the job, but I don't think it's very efficient (esp. the CREATE commands). Does anybody have suggestions for improvement? The original dataset has a lot of variables, so I'm not sure if CASESTOVARS is a good idea.

Thanks in advance!

Cheers!!
Albert-Jan

data list free / id_mnd weight.
begin data
0 500
0 900
0 900
1 300
2 200
2 100
3 1000
3 10
4 120
4 300
5 200
end data.

* identify non-unique double cases.
sort cases by id_mnd (a) weight (d).
compute double = 1.
if (id_mnd ne lag(id_mnd) or $casenum = 1) double = 0.

* copy id & weight of next case.
create antilag_weight = lead (weight, 1).
create antilag_id_mnd = lead (id_mnd, 1).

* calculate weight difference.
if (id_mnd = antilag_id_mnd ) weight_diff = weight - antilag_weight.
if (double = 1 and $casenum ne 1) weight_diff = lag(weight) - weight.
exe.

compute filter = (double = 0 or weight_diff < 200).
add files / file = * / drop = antilag_weight antilag_id_mnd.
select if filter = 1.

=====================
To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command.
To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD
|
|
CREATE is a procedure, so each invocation makes its own pass through the data. You don't need to specify it twice; a single CREATE command can compute both variables at once. However, if you reverse-sort on ID, won't LAG work, obviating the need for the CREATE command?

-----Original Message-----
From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of Albert-jan Roskam
Sent: Wednesday, August 06, 2008 9:45 AM
To: [hidden email]
Subject: partial deduplication
|
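The two suggestions in the reply can be sketched roughly as follows. This is an untested sketch, not the poster's or the replier's actual syntax. The first fragment merges the two CREATE commands into one, so the data are passed once instead of twice. The second goes a step further and assumes that only the gap to the previous (heavier) record within the same ID matters, which is what the original filter ends up using; under that assumption LAG alone is enough and no extra sort is needed. The variable name first_rec is hypothetical; the 200 threshold and the other names come from the original post.

```
* Option 1: one CREATE command, hence one data pass, for both lead variables.
create antilag_weight = lead(weight, 1)
  /antilag_id_mnd = lead(id_mnd, 1).

* Option 2: no CREATE at all, using only LAG.
sort cases by id_mnd (a) weight (d).
* The first record of each ID has the highest weight and is always kept.
compute first_rec = ($casenum = 1 or id_mnd ne lag(id_mnd)).
* Gap to the previous record, only meaningful when first_rec is 0.
compute weight_diff = lag(weight) - weight.
select if (first_rec = 1 or weight_diff < 200).
execute.
```

In option 2 both LAG calls come before SELECT IF, so they should see every case as it is read rather than only the selected ones. With the sample data from the original post, this should drop only the records (0, 500) and (3, 10), the same result as the original filter.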
