Removing complex duplicate cases in SPSS

Enrique Ramalle Gomara
Hi:

I need to solve a complex problem: removing duplicate cases. I will try to
explain carefully what I need.

I am building a Rare Diseases (RD) Registry using an administrative database
(the CMBD) in SPSS format that collects data from hospital discharges. Each
hospital discharge creates one line in the SPSS database. Each patient is
identified by a unique personal number, the variable "History". The same
patient may generate several events if he is hospitalized several times.
Besides the variable "History", we are interested in the variables that record
diagnosis codes. There can be up to 13 diagnosis codes, in variables "C1" to
"C13". Frequently, some of those 13 possible codes are not filled in.
Sometimes only the first ones (C1 to C4 or C5) are filled in and the others
are empty.

To build our registry we need to include all patients registered in the CMBD
database with any diagnosis code corresponding to a Rare Disease (we have a
code list to use) in any of the 13 possible string variables (C1 to C13). We
need to remove duplicate cases so that the same patient is not included
several times just because he has been hospitalized several times. However,
we can't remove duplicates based only on the variable "History", because the
same patient may suffer from several RDs and we have to include each patient
once for each RD he has. Given this, we can find the following
events/problems:

1. One patient with only one RD may have several hospitalizations. In this
case, the RD code may be in the same or different variables: C1 to C13. For
instance, in the first hospitalization the code is in variable C1, in the
second one the code is in C3 and in the last one the code is in C5. That
prevents us from removing duplicates by comparing, besides the variable
History, C1 with C1, C2 with C2, C3 with C3 and so on. We need to check
that, besides having the same History number, both events contain the same
RD code, even if it is in different C1 to C13 variables. We could remove a
case only if the History was the same and both events had the same RD code,
even in different diagnosis variables.

2. One patient with more than one RD. In this case, if the patient has
several hospitalizations with the code of the first RD (in the same or
different variable) and several hospitalizations with the code of the second
RD (in the same or different variable), we would need to include the patient
in our registry twice: the first event reflecting the first RD and the first
event reflecting the second RD.

3.  One patient with more than one RD. In this case, the patient may have
several hospitalizations, some of them containing codes for more than one
RD. Then, to remove one event we would need to check that, besides having
the same History as another event, it contains the same RD code and doesn't
contain any other RD code in any of the C1 to C13 variables. I mean, if two
events with the same History number contain the same RD code, but one of
them also has a different RD code, we cannot simply remove one of the
events. We would need to register the patient twice, once for each of the
different RD codes.
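
The three cases above reduce to one rule: keep one registry entry per
distinct (History, RD code) pair, wherever the code sits among C1 to C13.
Below is a minimal Python sketch of that rule, not SPSS syntax. The function
name first_events, the dict-based row layout, and the placeholder codes "A"
and "B" are all assumptions made for illustration.

```python
def first_events(rows, rd_codes):
    """Return (history, rd_code, event_index) for the first event
    in which each patient/RD-code pair appears."""
    seen = set()
    kept = []
    for event_index, row in enumerate(rows):
        history = row["History"]
        # Scan every diagnosis slot, so the position (C1 vs C5) is irrelevant.
        for key, value in row.items():
            if key == "History" or value not in rd_codes:
                continue
            if (history, value) not in seen:
                seen.add((history, value))
                kept.append((history, value, event_index))
    return kept

# Hypothetical data: "A" and "B" stand in for real RD codes.
rows = [
    {"History": "1", "C1": "A", "C2": "x"},  # patient 1, RD A in C1
    {"History": "1", "C1": "y", "C2": "A"},  # same RD in C2: a duplicate
    {"History": "2", "C1": "A", "C2": ""},   # patient 2, RD A
    {"History": "2", "C1": "A", "C2": "B"},  # A again, plus a second RD B
]
result = first_events(rows, {"A", "B"})
```

Here patient 1 is kept once (case 1), and patient 2 is kept twice, once per
RD, even though one hospitalization carries both codes (cases 2 and 3).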

I know that it's an extremely complex situation and I don't know if it's
possible to do that with SPSS, but if someone has an idea, please help me.

Here is what the data may look like. Codes of RD we are interested in are in
bold.

History   C1      C2      C3      C4      C5
2345      25.1    56      46.5    75      .
2345      12.6    56      89      25.1    .
56768     47      32.3    43      .       .
56768     32.3    43      453.8   15      .
56768     78.0    12.5    .       .       .
467       32.3    49      40      15      .
467       40      126.5   76      155.4   .
467       26      32.3    4945    94      56

Thanks a lot!

Enrique Ramalle-Gómara, PhD
Department of Epidemiology
La Rioja (Spain)

Re: Removing complex duplicate cases in SPSS

David Marso
Administrator
See the VARSTOCASES and AGGREGATE commands.
Please reply to the list and not to my personal email.
Those desiring my consulting or training services please feel free to email me.
---
"Nolite dare sanctum canibus neque mittatis margaritas vestras ante porcos ne forte conculcent eas pedibus suis."
Cum es damnatorum possederunt porcos iens ut salire off sanguinum cliff in abyssum?"
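
The restructure-then-aggregate idea suggested in the reply can be sketched in
two steps: first turn each hospitalization into one row per diagnosis code
(wide to long), then group by patient and code. The Python below is only an
illustration of that logic on the sample rows from the question, not SPSS
syntax. The RD code list is an assumption, because the bold marking in the
original post did not survive, and "" stands in for a missing code.

```python
# ASSUMPTION: hypothetical RD code list; the real one comes from the registry.
RD_CODES = {"25.1", "32.3", "40"}

# Sample hospitalizations from the question: (History, [C1..C5]).
events = [
    ("2345",  ["25.1", "56",    "46.5",  "75",    ""]),
    ("2345",  ["12.6", "56",    "89",    "25.1",  ""]),
    ("56768", ["47",   "32.3",  "43",    "",      ""]),
    ("56768", ["32.3", "43",    "453.8", "15",    ""]),
    ("56768", ["78.0", "12.5",  "",      "",      ""]),
    ("467",   ["32.3", "49",    "40",    "15",    ""]),
    ("467",   ["40",   "126.5", "76",    "155.4", ""]),
    ("467",   ["26",   "32.3",  "4945",  "94",    "56"]),
]

# Step 1 (restructure wide to long): one row per patient, RD code and event.
# dict.fromkeys drops a code repeated within the same event, keeping order.
long_rows = [
    (history, code, event_no)
    for event_no, (history, codes) in enumerate(events, start=1)
    for code in dict.fromkeys(codes)
    if code in RD_CODES
]

# Step 2 (aggregate): break on (History, code), keep the first event number
# and count how many hospitalizations carried that code.
registry = {}
for history, code, event_no in long_rows:
    entry = registry.setdefault(
        (history, code), {"first_event": event_no, "n_events": 0}
    )
    entry["n_events"] += 1
```

With the assumed code list, the registry ends up with four entries: one for
patient 2345, one for patient 56768, and two for patient 467 (one per RD).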