SPSSX Discussion

Help Merging Two Databases!

Classic

List

Threaded

5 messages Options

Natasha

Help Merging Two Databases!

Hello,

I am looking for some help or suggestions with my current database problem. A description of the problem and the databases are below:

I have two databases with multiple variables, Database A (original) and Database B (RNA). I need to merge these two databases (and all the variables); however, the databases have multiple participant (ID) entries. Ultimately, I want to make one database with the (1) most recent entries for each participant from Database A, (2) merged with the participant entry that has the "Assessment Date" variable (Database B) that is closest in date to the "Event Date" variable (Database A).

Database A looks like this...

ID Event_Date Race Sex Age .............
1 4/2/2007
2 8/13/2008
2 1/20/2010
3 1/30/2006
4 10/27/2006
4 4/26/2007

Database B looks like this...

ID Assessment_Date Race Sex Age .............
1 11/19/2002
1 7/31/2007
2 8/25/2008
2 6/30/2010
3 11/9/2004
3 7/26/2007

So, there are multiple entries in both databases, but with Database B I would like to find a way to pick out from the multiple entries for each participant only the entry that's "Assessment Date" (Database B) is closest to the "Event Date" (Database A).

I hope this all makes sense and the example data files help, but please let me know if I can clarify anything, and again thanks for all of your time and help.

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD

David Marso

Re: Help Merging Two Databases!

Administrator

MATCH FILES has a LAST subcommand you can use to FLAG the most recent records per ID in file A.
Then it is a trivial MATCH into file B after selecting the flagged cases in file A.

Natasha Holmes wrote

Hello,

I am looking for some help or suggestions with my current database problem. A description of the problem and the databases are below:

I have two databases with multiple variables, Database A (original) and Database B (RNA). I need to merge these two databases (and all the variables); however, the databases have multiple participant (ID) entries. Ultimately, I want to make one database with the (1) most recent entries for each participant from Database A, (2) merged with the participant entry that has the "Assessment Date" variable (Database B) that is closest in date to the "Event Date" variable (Database A).

Database A looks like this...

ID Event_Date Race Sex Age .............
1 4/2/2007
2 8/13/2008
2 1/20/2010
3 1/30/2006
4 10/27/2006
4 4/26/2007

Database B looks like this...

ID Assessment_Date Race Sex Age .............
1 11/19/2002
1 7/31/2007
2 8/25/2008
2 6/30/2010
3 11/9/2004
3 7/26/2007

So, there are multiple entries in both databases, but with Database B I would like to find a way to pick out from the multiple entries for each participant only the entry that's "Assessment Date" (Database B) is closest to the "Event Date" (Database A).

I hope this all makes sense and the example data files help, but please let me know if I can clarify anything, and again thanks for all of your time and help.

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD

Please reply to the list and not to my personal email.
Those desiring my consulting or training services please feel free to email me.
---
"Nolite dare sanctum canibus neque mittatis margaritas vestras ante porcos ne forte conculcent eas pedibus suis."
Cum es damnatorum possederunt porcos iens ut salire off sanguinum cliff in abyssum?"

Maguin, Eugene

Re: Help Merging Two Databases!

I don't think it's quite a trivial match for the target record in database B because she wants to pick the B file record closest, which needs to be defined relative to whether the a) assessment date must follow, b) must precede or c) may follow or precede the event date for a given id.

I haven't tested this out but I think I'd match files using the table subcommand to put the event date on every record of the B file, use the datediff function to get the date difference (assessment-event). The above a), b), c) question comes in at this point. If the answer is a) then delete records with negative difference, then sort and aggregate again on the first record. If the answer is b) then delete records with positive difference, then sort and aggregate again on the first record. If the answer is c) then use an absolute value function on the difference, then sort and aggregate again on the first record. The leopard in the trees is if c) applies and there's two assessments, equally close but one is before and one is following.
Gene Maguin

-----Original Message-----
From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of David Marso
Sent: Monday, September 22, 2014 11:04 AM
To: [hidden email]
Subject: Re: Help Merging Two Databases!

MATCH FILES has a LAST subcommand you can use to FLAG the most recent records per ID in file A.
Then it is a trivial MATCH into file B after selecting the flagged cases in file A.

Natasha Holmes wrote

> Hello,
>
> I am looking for some help or suggestions with my current database
> problem. A description of the problem and the databases are below:
>
> I have two databases with multiple variables, Database A (original)
> and Database B (RNA). I need to merge these two databases (and all the
> variables); however, the databases have multiple participant (ID) entries.
> Ultimately, I want to make one database with the (1) most recent
> entries for each participant from Database A, (2) merged with the
> participant entry that has the "Assessment Date" variable (Database B)
> that is closest in date to the "Event Date" variable (Database A).
>
> Database A looks like this...
>
> ID Event_Date Race Sex Age .............
> 1 4/2/2007
> 2 8/13/2008
> 2 1/20/2010
> 3 1/30/2006
> 4 10/27/2006
> 4 4/26/2007
>
>
> Database B looks like this...
>
> ID Assessment_Date Race Sex Age .............
> 1 11/19/2002
> 1 7/31/2007
> 2 8/25/2008
> 2 6/30/2010
> 3 11/9/2004
> 3 7/26/2007
>
> So, there are multiple entries in both databases, but with Database B
> I would like to find a way to pick out from the multiple entries for
> each participant only the entry that's "Assessment Date" (Database B)
> is closest to the "Event Date" (Database A).
>
> I hope this all makes sense and the example data files help, but
> please let me know if I can clarify anything, and again thanks for all
> of your time and help.
>
> =====================
> To manage your subscription to SPSSX-L, send a message to

> LISTSERV@.UGA

> (not to SPSSX-L), with no body text except the command. To leave the
> list, send the command SIGNOFF SPSSX-L For a list of commands to
> manage subscriptions, send the command INFO REFCARD

-----
Please reply to the list and not to my personal email.
Those desiring my consulting or training services please feel free to email me.
---
"Nolite dare sanctum canibus neque mittatis margaritas vestras ante porcos ne forte conculcent eas pedibus suis."
Cum es damnatorum possederunt porcos iens ut salire off sanguinum cliff in abyssum?"
--
View this message in context: http://spssx-discussion.1045642.n5.nabble.com/Help-Merging-Two-Databases-tp5727348p5727349.html
Sent from the SPSSX Discussion mailing list archive at Nabble.com.

=====================
To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD

Bruce Weaver

Re: Help Merging Two Databases!

Administrator

In reply to this post by David Marso

But there's still the issue of picking the Database B record with the assessment date closest to that of the last Database A record (if I followed!).

So roughly speaking, one solution would look something like this:

DATASET ACTIVATE A.
SORT CASES by ID Event_Date. /* Not be needed if data already sorted.
MATCH FILES
file = * /
by ID /
LAST = LastRec.
EXECUTE.
SELECT if LastRec.

DATASET ACTIVATE B.
SORT CASES by ID Event_Date. /* Not be needed if data already sorted.

MATCH FILES
TABLE = 'A' / IN = FromA /
RENAME= (Age = AgeFromA) /
FILE = 'B' /IN = FromB /
RENAME= (Age = AgeFromB) /
BY ID.
EXECUTE.

Then...

1. Use the DATEDIFF function to compute the difference between Event_Date and Assessment_Date, and them compute its absolute value (assuming this is what the OP wants).
2. Use AGGREGATE to write the MINIMUM value of the absolute value of that difference to each record (BREAKing on ID).
3. Use SELECT IF to keep only those cases where the absolute value of the difference computed in step 1 = the minimum computed on step 2.

This should result in one record per ID, which is what I think the OP is aiming for.

HTH.

David Marso wrote

MATCH FILES has a LAST subcommand you can use to FLAG the most recent records per ID in file A.
Then it is a trivial MATCH into file B after selecting the flagged cases in file A.

Natasha Holmes wrote

Hello,

I am looking for some help or suggestions with my current database problem. A description of the problem and the databases are below:

I have two databases with multiple variables, Database A (original) and Database B (RNA). I need to merge these two databases (and all the variables); however, the databases have multiple participant (ID) entries. Ultimately, I want to make one database with the (1) most recent entries for each participant from Database A, (2) merged with the participant entry that has the "Assessment Date" variable (Database B) that is closest in date to the "Event Date" variable (Database A).

Database A looks like this...

ID Event_Date Race Sex Age .............
1 4/2/2007
2 8/13/2008
2 1/20/2010
3 1/30/2006
4 10/27/2006
4 4/26/2007

Database B looks like this...

ID Assessment_Date Race Sex Age .............
1 11/19/2002
1 7/31/2007
2 8/25/2008
2 6/30/2010
3 11/9/2004
3 7/26/2007

So, there are multiple entries in both databases, but with Database B I would like to find a way to pick out from the multiple entries for each participant only the entry that's "Assessment Date" (Database B) is closest to the "Event Date" (Database A).

I hope this all makes sense and the example data files help, but please let me know if I can clarify anything, and again thanks for all of your time and help.

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD

--
Bruce Weaver
bweaver@lakeheadu.ca
http://sites.google.com/a/lakeheadu.ca/bweaver/

"When all else fails, RTFM."

PLEASE NOTE THE FOLLOWING:
1. My Hotmail account is not monitored regularly. To send me an e-mail, please use the address shown above.
2. The SPSSX Discussion forum on Nabble is no longer linked to the SPSSX-L listserv administered by UGA (https://listserv.uga.edu/).

David Marso

Re: Help Merging Two Databases!

Administrator

Good catch.
I think perhaps an ADD FILES interleave approach will work too and avoid the RENAME on MATCH.
This because if there are a lot of variables that can be painful.
---

NEW FILE.
DATASET CLOSE ALL.
DATA LIST LIST / ID (F1) Event_Date (ADate) stuff (F3).
BEGIN DATA
1 04/02/2007 111
2 08/13/2008 211
2 01/20/2010 221
3 01/30/2006 311
4 10/27/2006 411
4 04/26/2007 421
END DATA.
DATASET NAME DatasetA.
COMPUTE Compare_Date=Event_Date.
COMPUTE File_Index=1.
DATA LIST LIST / ID (F1) Assessment_Date (ADATE) Stuff (F3).
BEGIN DATA
1 11/19/2002 112
1 07/31/2007 122
2 08/25/2008 212
2 06/30/2010 222
3 11/09/2004 312
3 07/26/2007 322
END DATA.
DATASET NAME DatasetB.

COMPUTE Compare_Date= Assessment_Date .
COMPUTE File_Index=2.
DATASET ACTIVATE DatasetA.
MATCH FILES / FILE * / LAST=@LastA@ / BY ID.
SELECT IF @LASTA@.
ADD FILES / FILE * / FILE DatasetB / BY ID Compare_Date.
SPLIT FILE BY ID.
CREATE Prev_Date=LAG(Assessment_Date,1)
/ Next_Date=LEAD(Assessment_Date,1).
SPLIT FILE OFF.
FORMATS Compare_Date Prev_Date Next_Date (ADATE).
COMPUTE Delta_Prev=ABS(CTIME.DAYS(Compare_Date-Prev_Date)).
COMPUTE Delta_Next=ABS(CTIME.DAYS(Compare_Date-Next_Date)).
COMPUTE Keep_Date=
SUM((Delta_Prev LT Delta_Next)*Prev_Date,(Delta_Next LT Delta_Prev)*Next_Date).
FORMATS Keep_Date (ADate).
AGGREGATE OUTFILE * MODE ADDVARIABLES /BREAK ID / Keep_Me=MAX(Keep_Date).
SELECT IF @LASTA@ OR Assessment_Date EQ Keep_Me.
EXECUTE.
DELETE VARIABLES
Compare_Date @LastA@ Prev_Date Next_Date Delta_Prev Delta_Next Keep_Date Keep_Me.
CASESTOVARS ID=ID/INDEX=File_Index.

Result:
ID Event_Date Assessment_Date stuff.1 stuff.2
1 04/02/2007 07/31/2007 111 122
2 01/20/2010 06/30/2010 221 222
3 01/30/2006 11/09/2004 311 312
4 04/26/2007 421

Bruce Weaver wrote

But there's still the issue of picking the Database B record with the assessment date closest to that of the last Database A record (if I followed!).

So roughly speaking, one solution would look something like this:

DATASET ACTIVATE A.
SORT CASES by ID Event_Date. /* Not be needed if data already sorted.
MATCH FILES
file = * /
by ID /
LAST = LastRec.
EXECUTE.
SELECT if LastRec.

DATASET ACTIVATE B.
SORT CASES by ID Event_Date. /* Not be needed if data already sorted.

MATCH FILES
TABLE = 'A' / IN = FromA /
RENAME= (Age = AgeFromA) /
FILE = 'B' /IN = FromB /
RENAME= (Age = AgeFromB) /
BY ID.
EXECUTE.

Then...

1. Use the DATEDIFF function to compute the difference between Event_Date and Assessment_Date, and them compute its absolute value (assuming this is what the OP wants).
2. Use AGGREGATE to write the MINIMUM value of the absolute value of that difference to each record (BREAKing on ID).
3. Use SELECT IF to keep only those cases where the absolute value of the difference computed in step 1 = the minimum computed on step 2.

This should result in one record per ID, which is what I think the OP is aiming for.

HTH.

David Marso wrote

MATCH FILES has a LAST subcommand you can use to FLAG the most recent records per ID in file A.
Then it is a trivial MATCH into file B after selecting the flagged cases in file A.

Natasha Holmes wrote

Hello,

I am looking for some help or suggestions with my current database problem. A description of the problem and the databases are below:

I have two databases with multiple variables, Database A (original) and Database B (RNA). I need to merge these two databases (and all the variables); however, the databases have multiple participant (ID) entries. Ultimately, I want to make one database with the (1) most recent entries for each participant from Database A, (2) merged with the participant entry that has the "Assessment Date" variable (Database B) that is closest in date to the "Event Date" variable (Database A).

Database A looks like this...

ID Event_Date Race Sex Age .............
1 4/2/2007
2 8/13/2008
2 1/20/2010
3 1/30/2006
4 10/27/2006
4 4/26/2007

Database B looks like this...

ID Assessment_Date Race Sex Age .............
1 11/19/2002
1 7/31/2007
2 8/25/2008
2 6/30/2010
3 11/9/2004
3 7/26/2007

So, there are multiple entries in both databases, but with Database B I would like to find a way to pick out from the multiple entries for each participant only the entry that's "Assessment Date" (Database B) is closest to the "Event Date" (Database A).

I hope this all makes sense and the example data files help, but please let me know if I can clarify anything, and again thanks for all of your time and help.

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD