Removing complex duplicate cases


Enrique Ramalle Gomara

Hi:

 

I need to solve a complex issue to remove duplicate cases. I will try to explain carefully what I would need...

 

I am building a Rare Diseases (RD) Registry using an administrative database (the CMBD) in SPSS format that collects data from hospital discharges. Each hospital discharge creates one line in the SPSS database. Each patient is identified by a unique personal number, variable "History". The same patient may generate several events if he is hospitalized many times.

Besides the variable "History", we are interested in the variables that record diagnosis codes. There can be up to 13 diagnosis codes, variables "C1" to "C13". Frequently, some of those 13 possible codes are not filled in. Sometimes only the first ones (C1 to C4 or C5) are filled in and the others are empty.

 

To build our registry we need to include all patients registered in the CMBD database with any diagnosis code corresponding to a Rare Disease (we have a code list to use) in any of the 13 possible string variables (C1 to C13). We need to remove duplicate cases to avoid including the same patient several times just because he has been hospitalized many times. However, we can't remove duplicates based only on the variable "History", because the same patient may suffer from several RD and we have to include each patient once for each RD he has. Given this situation, we can find the following events/problems:

 

1. One patient with only one RD may have several hospitalizations. In this case, the RD code may be in the same or in different variables among C1 to C13. For instance, in the first hospitalization the code is in variable C1, in the second one it is in C3 and in the last one it is in C5. That prevents us from removing duplicates by comparing, in addition to History, C1 with C1, C2 with C2, C3 with C3 and so on. We need to verify that, besides having the same History number, both events contain the same RD code, even if it is in different C1 to C13 variables. We could remove a case only if the History was the same and both events had the same RD code, even in different diagnosis variables.

 

2. One patient with more than one RD. In this case, if the patient has several hospitalizations with the code of the first RD (in the same or different variable) and several hospitalizations with the code of the second RD (in the same or different variable), we would need to include the patient in our registry twice: the first event reflecting the first RD and the first event reflecting the second RD.  

 

3. One patient with more than one RD. In this case, the patient may have several hospitalizations, some of them containing codes for more than one RD. Then, to remove one event we would need to verify that, besides having the same History as another event, it contains the same RD code and doesn't contain any other RD code in any of the C1 to C13 variables. I mean, if two events with the same History number contain the same RD code, but one of them also has a different RD code, we cannot remove the events. We would need to register the patient twice, once for each of the different RD codes.

 

I know that it's an extremely complex situation and I don't know if it's possible to do that with SPSS, but if someone has an idea, please help me.

 

Here is what the data may look like. Codes of RD we are interested in are in bold.

 

 

 

 

 

History   C1      C2      C3      C4      C5
2345      25.1    56      46.5    75      .
2345      12.6    56      89      25.1    .
56768     47      32.3    43      .       .
56768     32.3    43      453.8   15      .
56768     78.0    12.5    .       .       .
467       32.3    49      40      15      .
467       40      126.5   76      155.4   .
467       26      32.3    4945    94      56

 

 

Thanks a lot!


Enrique Ramalle-Gómara, PhD

Department of Epidemiology

La Rioja (Spain)




GOBIERNO DE LA RIOJA
LEGAL NOTICE: The information contained in this message is confidential and is intended to be read only by the person to whom it is addressed. If you are not the named recipient, please be advised that any disclosure or reproduction of this message is prohibited and may be illegal.
Before printing this e-mail, please consider carefully whether it is necessary to do so.

Re: Removing complex duplicate cases

Maguin, Eugene

Enrique,

 

You posted this last week and David Marso suggested the two procedures that you would need to use. Were you able to use them to write code? Did you get syntax errors that you don't know how to resolve? Did the code you wrote execute without errors but give incorrect results? Tell us how much progress you were able to make. Show us the code that you are using. You gave a good description of the problem and good sample data. So, what happened next?

 

Gene Maguin

 

From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of Enrique Ramalle Gomara
Sent: Thursday, November 22, 2012 6:02 AM
To: [hidden email]
Subject: Removing complex duplicate cases

 



Re: Removing complex duplicate cases

Albert-Jan Roskam
You can "melt" the data using VARSTOCASES, so the vars C1 through C13 become one var (let's call it "C").
You then have two identifiers: History and C. You can use those to deduplicate your data, either with LAG or
with AGGREGATE.


I actually read David's advice *after* coming up with my own, almost identical advice, which to me suggests that
this might be the way to proceed.
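
A minimal sketch of that idea, written in Python rather than SPSS syntax, in case it helps to see the logic spelled out. It is only an illustration under stated assumptions: the RD_CODES set is an assumed code list (the real one is not given in this thread), the rows are copied from the sample data in the original post, and melt and registry are hypothetical helper names. melt plays the role of VARSTOCASES, and keeping the first occurrence of each (History, C) pair plays the role of the LAG or AGGREGATE step.

```python
# Sample rows copied from the original post: (History, [C1..C5]).
# None marks an empty diagnosis slot (shown as "." in the post).
rows = [
    ("2345",  ["25.1", "56", "46.5", "75", None]),
    ("2345",  ["12.6", "56", "89", "25.1", None]),
    ("56768", ["47", "32.3", "43", None, None]),
    ("56768", ["32.3", "43", "453.8", "15", None]),
    ("56768", ["78.0", "12.5", None, None, None]),
    ("467",   ["32.3", "49", "40", "15", None]),
    ("467",   ["40", "126.5", "76", "155.4", None]),
    ("467",   ["26", "32.3", "4945", "94", "56"]),
]

# ASSUMPTION: illustrative RD code list, not the real registry list.
RD_CODES = {"25.1", "32.3", "453.8"}


def melt(rows):
    """Yield one (history, code) pair per filled diagnosis slot."""
    for history, codes in rows:
        for code in codes:
            if code is not None:
                yield history, code


def registry(rows, rd_codes):
    """Keep the first occurrence of each (history, RD code) pair."""
    seen = set()
    result = []
    for history, code in melt(rows):
        pair = (history, code)
        if code in rd_codes and pair not in seen:
            seen.add(pair)
            result.append(pair)
    return result


entries = registry(rows, RD_CODES)
for history, code in entries:
    print(history, code)
```

With the assumed code list, patient 56768 is listed twice (once for 32.3 and once for 453.8) and the other two patients once each, which seems to match the three situations Enrique describes: the position of the code among C1 to C13 no longer matters, and a patient appears once per distinct RD code.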


Regards,
Albert-Jan


~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
All right, but apart from the sanitation, the medicine, education, wine, public order, irrigation, roads, a
fresh water system, and public health, what have the Romans ever done for us?



>________________________________
> From: "Maguin, Eugene" <[hidden email]>
>To: [hidden email]
>Sent: Monday, November 26, 2012 3:49 PM
>Subject: Re: [SPSSX-L] Removing complex duplicate cases
