SPSSX Discussion

Merging only selected cases (baffling!)

Classic

List

Threaded

5 messages Options

devoidx

Merging only selected cases (baffling!)

This post was updated on .

Ok so I have this problem which theoretically should be easy to solve but it doesn't appear to be so. I have two databases,1 and 2. both databases have identical variables including a unique identified variable. Database 1 has only about a 1000 cases and database 2 has about 1 billion cases. All cases in database 1 are unique (each have a different unique ID value) but in database 2 there are cases that have the same unique ID but vary in the value of their other variables. what I want to do is to add to database 1, only cases from database 2 which have the same unique ID as database 1. I don't want to create more variables, just want to add cases that match in terms of their unique ID. forexample:

Database 1 Database 2
ID diagnosis ID diagnosis
1 x 1 a
2 y 1 b
2 c
3 d

result I want:
ID diagnosis
1 x
1 a
1 b
2 y
2 c
*notice ID 3 was not added because there was no such ID in database 1*

Any suggestions???? Again, Using Match files is going to create extra variables which i don't want. I just want something like what ADD files does but to be able to select what cases are to be added instead of adding the whole 1 billion cases...
Thanks!

Jim Marks

Re: Merging only selected cases (baffling!)

There might be some extra steps, but here is one way:

* create sample data.
NEW FILE.
DATA LIST FREE /id (F8.0) diagnosis (a1).
BEGIN DATA
1 a
1 b
2 c
3 d
END DATA.
DATASET NAME large WINDOW = FRONT.

NEW FILE.
DATA LIST FREE /id (F8.0) diagnosis (a1).
BEGIN DATA
1 x
2 y
END DATA.
DATASET NAME small WINDOW = FRONT.

* create a list of ids to be matched.
DATASET COPY matchable.
DATASET ACTIVATE matchable.

DELETE VARIABLES diagnosis.

COMPUTE useit = 1.
EXECUTE.

* match the list of needed id's into your larger dataset, and select needed cases.
DATASET ACTIVATE large.

DATASET COPY selectable.
DATASET ACTIVATE selectable.
MATCH FILES FILE = * /TABLE = matchable /BY id.
FREQUENCIES useit.
SELECT IF useit.
EXECUTE.

* create a final file with all records from the small file, and the matching records from the large file.
NEW FILE.
ADD FILES /FILE = small /FILE= selectable /DROP useit.

DATASET NAME final WINDOW = FRONT.
SORT CASES BY ID.

* clean up.
DATASET CLOSE selectable.
DATASET CLOSE matchable.

HTH

Jim Marks
Sr Market Research Manager
National Market Research
Kaiser Foundation Health Plan of the Mid-Atlantic States, Inc.
2101 E. Jefferson St.
Rockville, MD 20852
Phone: (301) 816-6822
Cell Phone: (301) 456-6164

NOTICE TO RECIPIENT: If you are not the intended recipient of this e-mail, you are prohibited from sharing, copying, or otherwise using or disclosing its contents. If you have received this e-mail in error, please notify the sender immediately by reply e-mail and permanently delete this e-mail and any attachments without reading, forwarding or saving them. Thank you.

From: devoidx <[hidden email]>
To: [hidden email]
Date: 09/17/2013 02:45 PM
Subject: Merging only selected cases (baffling!)
Sent by: "SPSSX(r) Discussion" <[hidden email]>

Ok so I have this problem which theoretically should be easy to solve but it doesn't appear to be so. I have two databases,1 and 2. both databases have identical variables including a unique identified variable. Database 1 has only about a 1000 cases and database 2 has about 1 billion cases. All cases in database 1 are unique (each have a different unique ID value) but in database 2 there are cases that have the same unique ID but vary in the value of their other variables. what I want to do is to add to database 1, only cases from database 2 which have the same unique ID as database 1. I don't want to create more variables, just want to add cases that match in terms of their unique ID. forexample: Database 1 Database 2 ID diagnosis ID diagnosis 1 x 1 a 2 y 1 b 2 c 3 d result I want: ID diagnosis 1 x 1 a 1 b 2 y 2 c *notice ID 3 was not added because there was no such ID in database 1* Any suggestions???? Thanks! -- View this message in context:http://spssx-discussion.1045642.n5.nabble.com/Merging-only-selected-cases-baffling-tp5722068.htmlSent from the SPSSX Discussion mailing list archive at Nabble.com. ===================== To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD

Rich Ulrich

Re: Merging only selected cases (baffling!)

In reply to this post by devoidx

I'm not sure I follow the statement of your problem, but what I
understand is that you want to use File1 to select the same IDs from
File2, and to end up with a file that has those selected IDs from both
files.

File1 IDs can match more than one line, which means you use /TABLE.
That could be -

MATCH FILES FILE= File2/ TABLE= File1/IN=inf1 /DROP=all.
SELECT IF inf1.
EXE.
* combine the selected IDs with the original set.
ADD FILES FILE= * /FILE= File1 /BY Id /DROP inf1.
EXE.

--
Rich Ulrich

> Date: Tue, 17 Sep 2013 11:41:15 -0700

> From: [hidden email]
> Subject: Merging only selected cases (baffling!)
> To: [hidden email]
>
> Ok so I have this problem which theoretically should be easy to solve but it
> doesn't appear to be so. I have two databases,1 and 2. both databases have
> identical variables including a unique identified variable. Database 1 has
> only about a 1000 cases and database 2 has about 1 billion cases. All cases
> in database 1 are unique (each have a different unique ID value) but in
> database 2 there are cases that have the same unique ID but vary in the
> value of their other variables. what I want to do is to add to database 1,
> only cases from database 2 which have the same unique ID as database 1. I
> don't want to create more variables, just want to add cases that match in
> terms of their unique ID. forexample:
>
> Database 1 Database 2
> ID diagnosis ID
> diagnosis
> 1 x 1
> a
> 2 y 1
> b
> 2
> c
> 3
> d
>
> result I want:
> ID diagnosis
> 1 x
> 1 a
> 1 b
> 2 y
> 2 c
> *notice ID 3 was not added because there was no such ID in database 1*
>
> Any suggestions????
> Thanks!
>
>
>
> --
> View this message in context: http://spssx-discussion.1045642.n5.nabble.com/Merging-only-selected-cases-baffling-tp5722068.html
> Sent from the SPSSX Discussion mailing list archive at Nabble.com.
>
> =====================
> To manage your subscription to SPSSX-L, send a message to
> [hidden email] (not to SPSSX-L), with no body text except the
> command. To leave the list, send the command
> SIGNOFF SPSSX-L
> For a list of commands to manage subscriptions, send the command
> INFO REFCARD

devoidx

Re: Merging only selected cases (baffling!)

I Think you might be on to something! I'm gonna try it out and report back later today or tomorrow!

Bruce Weaver

Re: Merging only selected cases (baffling!)

Administrator

In reply to this post by Rich Ulrich

Just following up on Rich's approach. In the example that was given, the diagnoses were single characters. In your actual data file, I suspect you have a string variable with text descriptions of varying lengths. When you merge files that both contain a string variable with the same name, that string variable has to be formatted to the same length in both files. I think it was Andy W who recently suggested using ALTER TYPE to get around this potential problem. E.g., if the longest diagnosis string was never going to be more than 100 characters long, you could do something like the following -- look for the ALTER TYPE commands in boldface.

* Create two datasets to illustrate.

new file.
dataset close all.

data list list / ID (f2.0) diagnosis (a5).
begin data
1 'xxxxx'
2 'yyy'
end data.
dataset name file1.
alter type diagnosis(a100).

data list list / ID (f2.0) diagnosis (a3).
begin data
1 'aaa'
1 'b'
2 'cc'
3 'ddd'
end data.
dataset name file2.
alter type diagnosis(a100).

match files
file = file2 /
table = file1 / in = flag1 /
by ID.
execute.
dataset name file3.
select if flag1. /* Keep only records with IDs in file 1.
execute.

add files
file = file1 /
file = file3 /
drop = flag1 /
by ID.
execute.
dataset name final.
dataset activate final.
dataset close all.
alter type diagnosis(amin).
list.

OUTPUT:

ID diagnosis
1 xxxxx
1 aaa
1 b
2 yyy
2 cc

Number of cases read: 5 Number of cases listed: 5

HTH.

Rich Ulrich wrote

I'm not sure I follow the statement of your problem, but what I
understand is that you want to use File1 to select the same IDs from
File2, and to end up with a file that has those selected IDs from both
files.

File1 IDs can match more than one line, which means you use /TABLE.
That could be -

MATCH FILES FILE= File2/ TABLE= File1/IN=inf1 /DROP=all.
SELECT IF inf1.
EXE.
* combine the selected IDs with the original set.
ADD FILES FILE= * /FILE= File1 /BY Id /DROP inf1.
EXE.

--
Rich Ulrich

> Date: Tue, 17 Sep 2013 11:41:15 -0700
> From: [hidden email]
> Subject: Merging only selected cases (baffling!)
> To: [hidden email]
>
> Ok so I have this problem which theoretically should be easy to solve but it
> doesn't appear to be so. I have two databases,1 and 2. both databases have
> identical variables including a unique identified variable. Database 1 has
> only about a 1000 cases and database 2 has about 1 billion cases. All cases
> in database 1 are unique (each have a different unique ID value) but in
> database 2 there are cases that have the same unique ID but vary in the
> value of their other variables. what I want to do is to add to database 1,
> only cases from database 2 which have the same unique ID as database 1. I
> don't want to create more variables, just want to add cases that match in
> terms of their unique ID. forexample:
>
> Database 1 Database 2
> ID diagnosis ID
> diagnosis
> 1 x 1
> a
> 2 y 1
> b
> 2
> c
> 3
> d
>
> result I want:
> ID diagnosis
> 1 x
> 1 a
> 1 b
> 2 y
> 2 c
> *notice ID 3 was not added because there was no such ID in database 1*
>
> Any suggestions????
> Thanks!
>
>
>
> --
> View this message in context: http://spssx-discussion.1045642.n5.nabble.com/Merging-only-selected-cases-baffling-tp5722068.html
> Sent from the SPSSX Discussion mailing list archive at Nabble.com.
>
> =====================
> To manage your subscription to SPSSX-L, send a message to
> [hidden email] (not to SPSSX-L), with no body text except the
> command. To leave the list, send the command
> SIGNOFF SPSSX-L
> For a list of commands to manage subscriptions, send the command
> INFO REFCARD

--
Bruce Weaver
bweaver@lakeheadu.ca
http://sites.google.com/a/lakeheadu.ca/bweaver/

"When all else fails, RTFM."

PLEASE NOTE THE FOLLOWING:
1. My Hotmail account is not monitored regularly. To send me an e-mail, please use the address shown above.
2. The SPSSX Discussion forum on Nabble is no longer linked to the SPSSX-L listserv administered by UGA (https://listserv.uga.edu/).