SPSSX Discussion

partial deduplication

Albert-Jan Roskam

partial deduplication

Hi list,

I have some output from a probabilistic linkage job. For each ID I want to select the record pair with the highest weight (the 'dominance approach'). However, if the weight difference between that pair and the next one is rather small, I would like to keep that next record as well. The syntax below does the job, but I don't think it's very efficient (especially the CREATE commands).

Does anybody have suggestions for improvement? The original dataset has a lot of variables, so I'm not sure CASESTOVARS is a good idea.

Thanks in advance!

Cheers!!
Albert-Jan

data list free / id_mnd weight.
begin data
0 500
0 900
0 900
1 300
2 200
2 100
3 1000
3 10
4 120
4 300
5 200
end data.

* identify non-unique double cases.
sort cases by id_mnd (a) weight (d).
compute double = 1.
if (id_mnd ne lag(id_mnd) or $casenum = 1) double = 0.

* copy id & weight of next case.
create antilag_weight = lead (weight, 1).
create antilag_id_mnd = lead (id_mnd, 1).

* calculate weight difference.
if (id_mnd = antilag_id_mnd ) weight_diff = weight - antilag_weight.
if (double = 1 and $casenum ne 1) weight_diff = lag(weight) - weight.
exe.

compute filter = (double = 0 or weight_diff < 200).

add files / file = * / drop = antilag_weight antilag_id_mnd.

select if filter = 1.

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD

ViAnn Beadle

Re: partial deduplication

CREATE is a procedure, so each invocation makes its own pass through the
data. You don't need to specify it twice; a single CREATE can compute both
variables at once. However, if you reverse sort on ID, won't LAG work,
obviating the need for the CREATE command?
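
A minimal sketch of the LAG-only idea, not tested against the real linkage output. It assumes the same sort order as the original post (weight descending within id_mnd), so the weight of the preceding case is all the filter needs. The flag name first_in_id is made up for this illustration.

```spss
* Sort so the highest weight comes first within each ID.
sort cases by id_mnd (a) weight (d).

* Flag the first (highest-weight) case of each ID.
compute first_in_id = ($casenum = 1 or id_mnd ne lag(id_mnd)).

* For the remaining cases, take the gap to the preceding weight.
if (first_in_id = 0) weight_diff = lag(weight) - weight.

* Keep the top case, plus any case less than 200 below the one before it.
select if (first_in_id = 1 or weight_diff < 200).
execute.
```

On the sample data in the original post this should keep 8 of the 11 cases, dropping 0/500, 3/10 and nothing else. Like the original syntax, it compares each case with the one directly before it, not with the top weight of the ID. If the lead variables are still wanted, both can be requested in one pass with create antilag_weight antilag_id_mnd = lead(weight id_mnd, 1).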

-----Original Message-----
From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of
Albert-jan Roskam
Sent: Wednesday, August 06, 2008 9:45 AM
To: [hidden email]
Subject: partial deduplication
