Merging Files with Duplicates

classic Classic list List threaded Threaded
7 messages Options
Reply | Threaded
Open this post in threaded view
|

Merging Files with Duplicates

Nora Douglas
Hello all,

Here is my dilema, after some manipulation of multiple data files I now have two files.  The first file has CLNT (an id variable for clients in Agency A) and CASE_ID.1 through CASE_ID.22.  The second file has VICTIM_ID (an id variable for clients in the Agency B) and CASE_ID.1 through CASE_ID.22.  CLNT and VICTIM_ID are not the same.  Only CASE_ID will match between the two files.  I had to restructure the file to get rid of duplicates in the CASE_ID variable.  That led to the multiple CASE_ID variables.

I need to merge these two files using CASE_ID as the matching variable.

I want the syntax to look for matches in the first CLIENT_ID variable, and then move on to the second, third, fourth, etc., until it has filled in CLNT and VICTIM_ID for all cases (maybe some kind of DO REPEAT or something???).  I need to get CLNT, VICTIM_ID, AND CASE_ID.1-22 all in the same file.

I have pasted some data below.

DATA LIST /CLNT 1-9 CASE_ID.1 11-18 CASE_ID.2 20-27 CASE_ID.3 29-36.
BEGIN DATA
100834501 25288885 25322272 24929988
102120701 25366755 25116450 24897575
102346301 25112698 25310977 24994276
END DATA.

DATA LIST /VICTIM_ID 1-7 CASE_ID.1 9-16 CASE_ID.2 18-25 CASE_ID.3 27-34.
BEGIN DATA
3013395 25288885 25322272 24912728
3013550 25039615 25116450 25115922
3013572 25314224 25377552 24994276
END DATA.

Please let me know if I can provide any more information.  I am extremely appreciative of any help you can give me.

Thanks,

Nora















Nora E. Douglas, PhD, Research Specialist
Adult Protective Services
Texas Department of Family and Protective Services
701 West 51st Street, Suite 500, Austin, Texas 78751
(512) 438-5324
[hidden email]
http://www.dfps.state.tx.us/adult_protection/

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD
Reply | Threaded
Open this post in threaded view
|

Re: Merging Files with Duplicates

Bob Schacht-3
At 01:17 PM 2/20/2008, Nora Douglas wrote:

>Hello all,
>
>Here is my dilema, after some manipulation of multiple data files I now
>have two files.  The first file has CLNT (an id variable for clients in
>Agency A) and CASE_ID.1 through CASE_ID.22.  The second file has VICTIM_ID
>(an id variable for clients in the Agency B) and CASE_ID.1 through
>CASE_ID.22.  CLNT and VICTIM_ID are not the same.  Only CASE_ID will match
>between the two files.  I had to restructure the file to get rid of
>duplicates in the CASE_ID variable.  That led to the multiple CASE_ID
>variables.
>
>I need to merge these two files using CASE_ID as the matching variable.
>
>I want the syntax to look for matches in the first CLIENT_ID variable, and
>then move on to the second, third, fourth, etc., until it has filled in
>CLNT and VICTIM_ID for all cases (maybe some kind of DO REPEAT or
>something???).  I need to get CLNT, VICTIM_ID, AND CASE_ID.1-22 all in the
>same file.
>
>I have pasted some data below.

Nora,
Just a few clarifying questions:

1. CLNT and VICTIM_ID both refer to individuals, right?
2. A CLNT can have (up to?) 22 cases-- or does each CLNT always have 22
cases? And therefore CLNT and CASE_ID have a one to many relationship, right?

3. However, for each CASE_ID, is there only one CLNT? Or can there be more
than one CLNT associated with a CASE_ID?

4. Is there any correspondence between CASE_IDn and a particular CASE_ID
number? For example, to use your sample data, could 25288885 be CASE_ID.1
for one CLNT, but also CASE_ID.3 for another CLNT? Your example does not
show this, but if true, then we have a different problem.

What I suspect is that you may need to construct a CASE_ID file, which will
make it easier to aggregate all information about a given case.

Bob Schacht


>DATA LIST /CLNT 1-9 CASE_ID.1 11-18 CASE_ID.2 20-27 CASE_ID.3 29-36.
>BEGIN DATA
>100834501 25288885 25322272 24929988
>102120701 25366755 25116450 24897575
>102346301 25112698 25310977 24994276
>END DATA.
>
>DATA LIST /VICTIM_ID 1-7 CASE_ID.1 9-16 CASE_ID.2 18-25 CASE_ID.3 27-34.
>BEGIN DATA
>3013395 25288885 25322272 24912728
>3013550 25039615 25116450 25115922
>3013572 25314224 25377552 24994276
>END DATA.
>
>Please let me know if I can provide any more information.  I am extremely
>appreciative of any help you can give me.
>
>Thanks,
>
>Nora
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>Nora E. Douglas, PhD, Research Specialist
>Adult Protective Services
>Texas Department of Family and Protective Services
>701 West 51st Street, Suite 500, Austin, Texas 78751
>(512) 438-5324
>[hidden email]
>http://www.dfps.state.tx.us/adult_protection/
>
>=====================
>To manage your subscription to SPSSX-L, send a message to
>[hidden email] (not to SPSSX-L), with no body text except the
>command. To leave the list, send the command
>SIGNOFF SPSSX-L
>For a list of commands to manage subscriptions, send the command
>INFO REFCARD

Robert M. Schacht, Ph.D. <[hidden email]>
Pacific Basin Rehabilitation Research & Training Center
1268 Young Street, Suite #204
Research Center, University of Hawaii
Honolulu, HI 96814

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD
Reply | Threaded
Open this post in threaded view
|

Re: Merging Files with Duplicates

hillel vardi
In reply to this post by Nora Douglas
Shalom

The way to merge these 2 fies it to restructure them to long and then merge.
It is not clear from your post if the case_id is unique  over the whole
file but the is what i assume

Her is the code

DATA LIST /CLNT 1-9 CASE_ID.1 11-18 CASE_ID.2 20-27 CASE_ID.3 29-36.
BEGIN DATA
100834501 25288885 25322272 24929988
102120701 25366755 25116450 24897575
102346301 25112698 25310977 24994276
END DATA.
VARSTOCASES  /MAKE CASE_ID FROM CASE_ID.1 CASE_ID.2 CASE_ID.3
 /INDEX = seq(3)
 /KEEP =  CLNT
 /NULL = KEEP.
sort cases by CASE_ID .
dataset name file1 .
DATA LIST /VICTIM_ID 1-7 CASE_ID.1 9-16 CASE_ID.2 18-25 CASE_ID.3 27-34.
BEGIN DATA
3013395 25288885 25322272 24912728
3013550 25039615 25116450 25115922
3013572 25314224 25377552 24994276
END DATA.
dataset name file2 .
execute .
VARSTOCASES  /MAKE CASE_ID FROM CASE_ID.1 CASE_ID.2 CASE_ID.3
 /INDEX = seq(3)
 /KEEP =  VICTIM_ID
 /NULL = KEEP.
sort cases by CASE_ID .
match file file= file1 /file= file2 /by CASE_ID .
execute .



Hillel Vardi
BGU

Nora Douglas wrote:

> Hello all,
>
> Here is my dilema, after some manipulation of multiple data files I now have two files.  The first file has CLNT (an id variable for clients in Agency A) and CASE_ID.1 through CASE_ID.22.  The second file has VICTIM_ID (an id variable for clients in the Agency B) and CASE_ID.1 through CASE_ID.22.  CLNT and VICTIM_ID are not the same.  Only CASE_ID will match between the two files.  I had to restructure the file to get rid of duplicates in the CASE_ID variable.  That led to the multiple CASE_ID variables.
>
> I need to merge these two files using CASE_ID as the matching variable.
>
> I want the syntax to look for matches in the first CLIENT_ID variable, and then move on to the second, third, fourth, etc., until it has filled in CLNT and VICTIM_ID for all cases (maybe some kind of DO REPEAT or something???).  I need to get CLNT, VICTIM_ID, AND CASE_ID.1-22 all in the same file.
>
> I have pasted some data below.
>
> DATA LIST /CLNT 1-9 CASE_ID.1 11-18 CASE_ID.2 20-27 CASE_ID.3 29-36.
> BEGIN DATA
> 100834501 25288885 25322272 24929988
> 102120701 25366755 25116450 24897575
> 102346301 25112698 25310977 24994276
> END DATA.
>
> DATA LIST /VICTIM_ID 1-7 CASE_ID.1 9-16 CASE_ID.2 18-25 CASE_ID.3 27-34.
> BEGIN DATA
> 3013395 25288885 25322272 24912728
> 3013550 25039615 25116450 25115922
> 3013572 25314224 25377552 24994276
> END DATA.
>
> Please let me know if I can provide any more information.  I am extremely appreciative of any help you can give me.
>
> Thanks,
>
> Nora
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
> Nora E. Douglas, PhD, Research Specialist
> Adult Protective Services
> Texas Department of Family and Protective Services
> 701 West 51st Street, Suite 500, Austin, Texas 78751
> (512) 438-5324
> [hidden email]
> http://www.dfps.state.tx.us/adult_protection/
>
> =====================
> To manage your subscription to SPSSX-L, send a message to
> [hidden email] (not to SPSSX-L), with no body text except the
> command. To leave the list, send the command
> SIGNOFF SPSSX-L
> For a list of commands to manage subscriptions, send the command
> INFO REFCARD
>
>

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD
Reply | Threaded
Open this post in threaded view
|

Re: Merging Files with Duplicates

Maguin, Eugene
In reply to this post by Nora Douglas
Nora,

You've gotten some replies already and if my my questions/comments duplicate
theirs, I apologize. However, I'd already deleted their replies when I wrote
this.

1) I think the most important question is whether the same caseid can have
multiple client or victim ids. Your example data show that a client or
victim id can have multiple caseids. I'm interested in the converse
question. If the answer is No, then the match is one-to-one. If the answer
is Yes, then the match is either one-to-many or many-to-many.

2) You describe caseid.1 to caseid.22. Will every one of those 22 variables
have valid values for every case in both files?

3) When your done, will the desired dataset look like this?

DATA LIST /CLNT VICTIM CASE_ID
100834501 3013395 25288885
100834501 3013395 25322272
100834501       . 24929988
102120701       . 25366755
102120701 3013550 25116450
102120701       . 24897575
102346301       . 25112698
102346301       . 25310977
102346301 3013572 24994276

Gene Maguin

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD
Reply | Threaded
Open this post in threaded view
|

Re: Merging Files with Duplicates

Nora Douglas
In reply to this post by Nora Douglas
Hi Bob,

My answers are below each question.

1) CLNT and VICTIM_ID both refer to individuals, right?

Yes, those variables represent individuals.  Each individual can have multiple cases.

2) A CLNT can have (up to?) 22 cases-- or does each CLNT always have 22 cases? And therefore CLNT and CASE_ID have a one to many relationship, right?

No, each client does not have 22 cases.  They can have up to 22 cases.  In fact only 2 out of 17,000 have 22 cases.

3) However, for each CASE_ID, is there only one CLNT? Or can there be more than one CLNT associated with a CASE_ID?

The file is organized by CLNT so that there are no duplicate CLNT ids.  There can be duplicate CASE_IDs for each CLNT.  For example, CASE_ID.1 and CASE_ID.2 could be the same.  Cases can have multiple stages so sometimes the id shows up more than once.  There is only one CLNT for each CASE_ID.

4) Is there any correspondence between CASE_IDn and a particular CASE_ID number? For example, to use your sample data, could 25288885 be CASE_ID.1 for one CLNT, but also CASE_ID.3 for another CLNT? Your example does not show this, but if true, then we have a different problem.

No, CASE_ID is unique to each CLNT.

Thank you so much for the help.  Let me know if you need any more information.

Thanks again,

Nora

-----Original Message-----

>From: Bob Schacht <[hidden email]>
>Sent: Feb 20, 2008 6:30 PM
>To: Nora Douglas <[hidden email]>, [hidden email]
>Subject: Re: Merging Files with Duplicates
>
>At 01:17 PM 2/20/2008, Nora Douglas wrote:
>>Hello all,
>>
>>Here is my dilema, after some manipulation of multiple data files I now
>>have two files.  The first file has CLNT (an id variable for clients in
>>Agency A) and CASE_ID.1 through CASE_ID.22.  The second file has VICTIM_ID
>>(an id variable for clients in the Agency B) and CASE_ID.1 through
>>CASE_ID.22.  CLNT and VICTIM_ID are not the same.  Only CASE_ID will match
>>between the two files.  I had to restructure the file to get rid of
>>duplicates in the CASE_ID variable.  That led to the multiple CASE_ID
>>variables.
>>
>>I need to merge these two files using CASE_ID as the matching variable.
>>
>>I want the syntax to look for matches in the first CLIENT_ID variable, and
>>then move on to the second, third, fourth, etc., until it has filled in
>>CLNT and VICTIM_ID for all cases (maybe some kind of DO REPEAT or
>>something???).  I need to get CLNT, VICTIM_ID, AND CASE_ID.1-22 all in the
>>same file.
>>
>>I have pasted some data below.
>
>Nora,
>Just a few clarifying questions:
>
>1. CLNT and VICTIM_ID both refer to individuals, right?
>2. A CLNT can have (up to?) 22 cases-- or does each CLNT always have 22
>cases? And therefore CLNT and CASE_ID have a one to many relationship, right?
>
>3. However, for each CASE_ID, is there only one CLNT? Or can there be more
>than one CLNT associated with a CASE_ID?
>
>4. Is there any correspondence between CASE_IDn and a particular CASE_ID
>number? For example, to use your sample data, could 25288885 be CASE_ID.1
>for one CLNT, but also CASE_ID.3 for another CLNT? Your example does not
>show this, but if true, then we have a different problem.
>
>What I suspect is that you may need to construct a CASE_ID file, which will
>make it easier to aggregate all information about a given case.
>
>Bob Schacht
>
>
>>DATA LIST /CLNT 1-9 CASE_ID.1 11-18 CASE_ID.2 20-27 CASE_ID.3 29-36.
>>BEGIN DATA
>>100834501 25288885 25322272 24929988
>>102120701 25366755 25116450 24897575
>>102346301 25112698 25310977 24994276
>>END DATA.
>>
>>DATA LIST /VICTIM_ID 1-7 CASE_ID.1 9-16 CASE_ID.2 18-25 CASE_ID.3 27-34.
>>BEGIN DATA
>>3013395 25288885 25322272 24912728
>>3013550 25039615 25116450 25115922
>>3013572 25314224 25377552 24994276
>>END DATA.
>>
>>Please let me know if I can provide any more information.  I am extremely
>>appreciative of any help you can give me.
>>
>>Thanks,
>>
>>Nora
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>Nora E. Douglas, PhD, Research Specialist
>>Adult Protective Services
>>Texas Department of Family and Protective Services
>>701 West 51st Street, Suite 500, Austin, Texas 78751
>>(512) 438-5324
>>[hidden email]
>>http://www.dfps.state.tx.us/adult_protection/
>>
>>=====================
>>To manage your subscription to SPSSX-L, send a message to
>>[hidden email] (not to SPSSX-L), with no body text except the
>>command. To leave the list, send the command
>>SIGNOFF SPSSX-L
>>For a list of commands to manage subscriptions, send the command
>>INFO REFCARD
>
>Robert M. Schacht, Ph.D. <[hidden email]>
>Pacific Basin Rehabilitation Research & Training Center
>1268 Young Street, Suite #204
>Research Center, University of Hawaii
>Honolulu, HI 96814
>

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD
Reply | Threaded
Open this post in threaded view
|

Re: Merging Files with Duplicates

Nora Douglas
In reply to this post by Nora Douglas
Hi Gene,

I answered some of these questions in an earlier reply.  So, pardon the repetition.

No, CASE_ID cannot have multiple CLNT or VICTIM_IDs.  You are right though, A CLNT or VICTIM_ID can have multiple CASE_IDs and the CASE_ID can be a duplicate.  Out cases have multiple stages and if there is more than one stage the CASE_ID is repeated in the file.

Not everyone has 22 cases.  There are only two individuals with 22 out of 17,000 in the file.

Yes, the data set can look like that.  That works for me!

Thanks for the help!

Nora

-----Original Message-----

>From: Gene Maguin <[hidden email]>
>Sent: Feb 21, 2008 8:31 AM
>To: [hidden email]
>Subject: Re: Merging Files with Duplicates
>
>Nora,
>
>You've gotten some replies already and if my my questions/comments duplicate
>theirs, I apologize. However, I'd already deleted their replies when I wrote
>this.
>
>1) I think the most important question is whether the same caseid can have
>multiple client or victim ids. Your example data show that a client or
>victim id can have multiple caseids. I'm interested in the converse
>question. If the answer is No, then the match is one-to-one. If the answer
>is Yes, then the match is either one-to-many or many-to-many.
>
>2) You describe caseid.1 to caseid.22. Will every one of those 22 variables
>have valid values for every case in both files?
>
>3) When your done, will the desired dataset look like this?
>
>DATA LIST /CLNT VICTIM CASE_ID
>100834501 3013395 25288885
>100834501 3013395 25322272
>100834501       . 24929988
>102120701       . 25366755
>102120701 3013550 25116450
>102120701       . 24897575
>102346301       . 25112698
>102346301       . 25310977
>102346301 3013572 24994276
>
>Gene Maguin
>
>=====================
>To manage your subscription to SPSSX-L, send a message to
>[hidden email] (not to SPSSX-L), with no body text except the
>command. To leave the list, send the command
>SIGNOFF SPSSX-L
>For a list of commands to manage subscriptions, send the command
>INFO REFCARD

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD
Reply | Threaded
Open this post in threaded view
|

Re: Merging Files with Duplicates

Richard Ristow
At 10:49 AM 2/21/2008, Nora Douglas wrote (in response to Gene Maguin):

>No, CASE_ID cannot have multiple CLNT or VICTIM_IDs.  [But] a CLNT
>or VICTIM_ID can have multiple CASE_IDs and the CASE_ID can be a duplicate.

That could make things simple: for each case (CASE_ID), you have one
CLNT and one VICTIM_ID. See if this gets you any closer. If I'm
missing the point, post again, with some examples of your data, if you can.

Anyway, at 06:17 PM 2/20/2008, you wrote (I'm repeating):

>I now have two files.  The first file has CLNT (an id variable for
>clients in Agency A) and CASE_ID.1 through CASE_ID.22.  The second
>file has VICTIM_ID (an id variable for clients in the Agency B) and
>CASE_ID.1 through CASE_ID.22.  CLNT and VICTIM_ID are not the
>same.  Only CASE_ID will match between the two files.  I had to
>restructure the file to get rid of duplicates in the CASE_ID
>variable.  That led to the multiple CASE_ID variables.

If I've got this right, you started with three files, something like this:

A. File CASES (I'll call it that) with variables
CASE_ID CLNT VICTIM_ID <other variables about cases>
Values of all of the variables CASE_ID CLNT and VICTIM_ID can occur
multiple times in the file.

B. File AGENCY_A with variables
CLNT <other variables about clients of Agency A>
Values of variable CLNT cannot be duplicated in this file. The file
is sorted by CLNT. (Or, sort it this way, if it isn't.)

C. File AGENCY_B with variables
VICTIM_ID <other variables about clients of Agency B>
Values of variable VICTIM_ID cannot be duplicated in this file. The
file is sorted by VICTIM_ID. (Or, sort it this way, if it isn't.)

Then, you want the variables about agency-A cases and agency-B cases
merged into the case records.

If so, something like this should work:
1.) Load file CASES as the active file or (SPSS 14-16) active dataset
2.) The following code (untested), or something similar:

SORT CASES BY CLNT.
MATCH FILES
    /FILE=*
    /TABLE=AGENCY_A
    /BY CLNT.

SORT CASES BY VICTIM_ID.
MATCH FILES
    /FILE=*
    /TABLE=AGENCY_B
    /BY VICTIM_ID.

Good luck! And let us know anything better we can do.
Richard

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD