Help Merging Two Databases!

classic Classic list List threaded Threaded
5 messages Options
Reply | Threaded
Open this post in threaded view
|

Help Merging Two Databases!

Natasha
Hello,

I am looking for some help or suggestions with my current database problem. A description of the problem and the databases are below:

I have two databases with multiple variables, Database A (original) and Database B (RNA). I need to merge these two databases (and all the variables); however, the databases have multiple participant (ID) entries. Ultimately, I want to make one database with the (1) most recent entries for each participant from Database A, (2) merged with the participant entry that has the "Assessment Date" variable (Database B) that is closest in date to the "Event Date" variable (Database A).

Database A looks like this...

 ID     Event_Date     Race      Sex     Age   .............
1       4/2/2007                
2       8/13/2008            
2       1/20/2010            
3       1/30/2006          
4       10/27/2006                    
4       4/26/2007            
 

Database B looks like this...

 ID     Assessment_Date     Race      Sex     Age   .............
1       11/19/2002                
1       7/31/2007              
2       8/25/2008              
2       6/30/2010          
3       11/9/2004                    
3       7/26/2007      

So, there are multiple entries in both databases, but with Database B I would like to find a way to pick out from the multiple entries for each participant only the entry that's "Assessment Date" (Database B) is closest to the "Event Date" (Database A).

I hope this all makes sense and the example data files help, but please let me know if I can clarify anything, and again thanks for all of your time and help.

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD
Reply | Threaded
Open this post in threaded view
|

Re: Help Merging Two Databases!

David Marso
Administrator
MATCH FILES has a LAST subcommand you can use to FLAG the most recent records per ID in file A.
Then it is a trivial MATCH into file B after selecting the flagged cases in file A.

Natasha Holmes wrote
Hello,

I am looking for some help or suggestions with my current database problem. A description of the problem and the databases are below:

I have two databases with multiple variables, Database A (original) and Database B (RNA). I need to merge these two databases (and all the variables); however, the databases have multiple participant (ID) entries. Ultimately, I want to make one database with the (1) most recent entries for each participant from Database A, (2) merged with the participant entry that has the "Assessment Date" variable (Database B) that is closest in date to the "Event Date" variable (Database A).

Database A looks like this...

 ID     Event_Date     Race      Sex     Age   .............
1       4/2/2007                
2       8/13/2008            
2       1/20/2010            
3       1/30/2006          
4       10/27/2006                    
4       4/26/2007            
 

Database B looks like this...

 ID     Assessment_Date     Race      Sex     Age   .............
1       11/19/2002                
1       7/31/2007              
2       8/25/2008              
2       6/30/2010          
3       11/9/2004                    
3       7/26/2007      

So, there are multiple entries in both databases, but with Database B I would like to find a way to pick out from the multiple entries for each participant only the entry that's "Assessment Date" (Database B) is closest to the "Event Date" (Database A).

I hope this all makes sense and the example data files help, but please let me know if I can clarify anything, and again thanks for all of your time and help.

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD
Please reply to the list and not to my personal email.
Those desiring my consulting or training services please feel free to email me.
---
"Nolite dare sanctum canibus neque mittatis margaritas vestras ante porcos ne forte conculcent eas pedibus suis."
Cum es damnatorum possederunt porcos iens ut salire off sanguinum cliff in abyssum?"
Reply | Threaded
Open this post in threaded view
|

Re: Help Merging Two Databases!

Maguin, Eugene
I don't think it's quite a trivial match for the target record in database B because she wants to pick the B file record closest, which needs to be defined relative to whether the a) assessment date must follow, b) must precede or c) may follow or precede the event date for a given id.

I haven't tested this out but I think I'd match files using the table subcommand to put the event date on every record of the B file, use the datediff function to get the date difference (assessment-event). The above a), b), c) question comes in at this point. If the answer is a) then delete records with negative difference, then sort and aggregate again on the first record. If the answer is b) then delete records with positive difference, then sort and aggregate again on the first record. If the answer is c) then use an absolute value function on the difference, then sort and aggregate again on the first record. The leopard in the trees is if c) applies and there's two assessments, equally close but one is before and one is following.
Gene Maguin






-----Original Message-----
From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of David Marso
Sent: Monday, September 22, 2014 11:04 AM
To: [hidden email]
Subject: Re: Help Merging Two Databases!

MATCH FILES has a LAST subcommand you can use to FLAG the most recent records per ID in file A.
Then it is a trivial MATCH into file B after selecting the flagged cases in file A.


Natasha Holmes wrote

> Hello,
>
> I am looking for some help or suggestions with my current database
> problem. A description of the problem and the databases are below:
>
> I have two databases with multiple variables, Database A (original)
> and Database B (RNA). I need to merge these two databases (and all the
> variables); however, the databases have multiple participant (ID) entries.
> Ultimately, I want to make one database with the (1) most recent
> entries for each participant from Database A, (2) merged with the
> participant entry that has the "Assessment Date" variable (Database B)
> that is closest in date to the "Event Date" variable (Database A).
>
> Database A looks like this...
>
>  ID     Event_Date     Race      Sex     Age   .............
> 1       4/2/2007                
> 2       8/13/2008            
> 2       1/20/2010            
> 3       1/30/2006          
> 4       10/27/2006                    
> 4       4/26/2007            
>  
>
> Database B looks like this...
>
>  ID     Assessment_Date     Race      Sex     Age   .............
> 1       11/19/2002                
> 1       7/31/2007              
> 2       8/25/2008              
> 2       6/30/2010          
> 3       11/9/2004                    
> 3       7/26/2007      
>
> So, there are multiple entries in both databases, but with Database B
> I would like to find a way to pick out from the multiple entries for
> each participant only the entry that's "Assessment Date" (Database B)
> is closest to the "Event Date" (Database A).
>
> I hope this all makes sense and the example data files help, but
> please let me know if I can clarify anything, and again thanks for all
> of your time and help.
>
> =====================
> To manage your subscription to SPSSX-L, send a message to

> LISTSERV@.UGA

>  (not to SPSSX-L), with no body text except the command. To leave the
> list, send the command SIGNOFF SPSSX-L For a list of commands to
> manage subscriptions, send the command INFO REFCARD





-----
Please reply to the list and not to my personal email.
Those desiring my consulting or training services please feel free to email me.
---
"Nolite dare sanctum canibus neque mittatis margaritas vestras ante porcos ne forte conculcent eas pedibus suis."
Cum es damnatorum possederunt porcos iens ut salire off sanguinum cliff in abyssum?"
--
View this message in context: http://spssx-discussion.1045642.n5.nabble.com/Help-Merging-Two-Databases-tp5727348p5727349.html
Sent from the SPSSX Discussion mailing list archive at Nabble.com.

=====================
To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD
Reply | Threaded
Open this post in threaded view
|

Re: Help Merging Two Databases!

Bruce Weaver
Administrator
In reply to this post by David Marso
But there's still the issue of picking the Database B record with the assessment date closest to that of the last Database A record (if I followed!).  

So roughly speaking, one solution would look something like this:

DATASET ACTIVATE A.
SORT CASES by ID Event_Date. /* Not be needed if data already sorted.
MATCH FILES
 file = * /
 by ID /
 LAST = LastRec.
EXECUTE.
SELECT if LastRec.

DATASET ACTIVATE B.
SORT CASES by ID Event_Date. /* Not be needed if data already sorted.

MATCH FILES
  TABLE = 'A' / IN = FromA /
  RENAME= (Age = AgeFromA) /
  FILE = 'B' /IN = FromB /
  RENAME= (Age = AgeFromB) /
  BY ID.
EXECUTE.

Then...

1. Use the DATEDIFF function to compute the difference between Event_Date and Assessment_Date, and them compute its absolute value (assuming this is what the OP wants).
2. Use AGGREGATE to write the MINIMUM value of the absolute value of that difference to each record (BREAKing on ID).
3. Use SELECT IF to keep only those cases where the absolute value of the difference computed in step 1 = the minimum computed on step 2.  

This should result in one record per ID, which is what I think the OP is aiming for.

HTH.



David Marso wrote
MATCH FILES has a LAST subcommand you can use to FLAG the most recent records per ID in file A.
Then it is a trivial MATCH into file B after selecting the flagged cases in file A.

Natasha Holmes wrote
Hello,

I am looking for some help or suggestions with my current database problem. A description of the problem and the databases are below:

I have two databases with multiple variables, Database A (original) and Database B (RNA). I need to merge these two databases (and all the variables); however, the databases have multiple participant (ID) entries. Ultimately, I want to make one database with the (1) most recent entries for each participant from Database A, (2) merged with the participant entry that has the "Assessment Date" variable (Database B) that is closest in date to the "Event Date" variable (Database A).

Database A looks like this...

 ID     Event_Date     Race      Sex     Age   .............
1       4/2/2007                
2       8/13/2008            
2       1/20/2010            
3       1/30/2006          
4       10/27/2006                    
4       4/26/2007            
 

Database B looks like this...

 ID     Assessment_Date     Race      Sex     Age   .............
1       11/19/2002                
1       7/31/2007              
2       8/25/2008              
2       6/30/2010          
3       11/9/2004                    
3       7/26/2007      

So, there are multiple entries in both databases, but with Database B I would like to find a way to pick out from the multiple entries for each participant only the entry that's "Assessment Date" (Database B) is closest to the "Event Date" (Database A).

I hope this all makes sense and the example data files help, but please let me know if I can clarify anything, and again thanks for all of your time and help.

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD
--
Bruce Weaver
bweaver@lakeheadu.ca
http://sites.google.com/a/lakeheadu.ca/bweaver/

"When all else fails, RTFM."

PLEASE NOTE THE FOLLOWING: 
1. My Hotmail account is not monitored regularly. To send me an e-mail, please use the address shown above.
2. The SPSSX Discussion forum on Nabble is no longer linked to the SPSSX-L listserv administered by UGA (https://listserv.uga.edu/).
Reply | Threaded
Open this post in threaded view
|

Re: Help Merging Two Databases!

David Marso
Administrator
Good catch.
I think perhaps an ADD FILES interleave approach will work too and avoid the RENAME on MATCH.
This because if there are a lot of variables that can be painful.
---

NEW FILE.
DATASET CLOSE ALL.
DATA LIST LIST / ID (F1) Event_Date (ADate) stuff (F3).
BEGIN DATA
1       04/02/2007    111          
2       08/13/2008    211        
2       01/20/2010    221        
3       01/30/2006    311      
4       10/27/2006   411                  
4       04/26/2007    421        
END DATA.
DATASET NAME DatasetA.
COMPUTE Compare_Date=Event_Date.
COMPUTE File_Index=1.
DATA LIST LIST / ID (F1)    Assessment_Date  (ADATE) Stuff (F3).
BEGIN DATA
1      11/19/2002      112          
1       07/31/2007       122        
2       08/25/2008       212        
2       06/30/2010       222    
3       11/09/2004         312            
3       07/26/2007       322
END DATA.
DATASET NAME DatasetB.

COMPUTE Compare_Date= Assessment_Date .
COMPUTE File_Index=2.
DATASET ACTIVATE DatasetA.
MATCH FILES / FILE * / LAST=@LastA@ / BY ID.
SELECT IF @LASTA@.
ADD FILES / FILE * / FILE DatasetB / BY ID Compare_Date.
SPLIT FILE BY ID.
CREATE Prev_Date=LAG(Assessment_Date,1)
     / Next_Date=LEAD(Assessment_Date,1).
SPLIT FILE OFF.
FORMATS Compare_Date Prev_Date Next_Date (ADATE).
COMPUTE Delta_Prev=ABS(CTIME.DAYS(Compare_Date-Prev_Date)).
COMPUTE Delta_Next=ABS(CTIME.DAYS(Compare_Date-Next_Date)).
COMPUTE Keep_Date=
        SUM((Delta_Prev LT Delta_Next)*Prev_Date,(Delta_Next LT Delta_Prev)*Next_Date).
FORMATS  Keep_Date (ADate).
AGGREGATE OUTFILE * MODE ADDVARIABLES /BREAK ID / Keep_Me=MAX(Keep_Date).
SELECT IF @LASTA@ OR Assessment_Date EQ Keep_Me.
EXECUTE.
DELETE VARIABLES
  Compare_Date @LastA@ Prev_Date Next_Date Delta_Prev Delta_Next Keep_Date Keep_Me.
CASESTOVARS ID=ID/INDEX=File_Index.

Result:
ID Event_Date Assessment_Date stuff.1 stuff.2
1 04/02/2007 07/31/2007 111 122
2 01/20/2010 06/30/2010 221 222
3 01/30/2006 11/09/2004 311 312
4 04/26/2007 421
Bruce Weaver wrote
But there's still the issue of picking the Database B record with the assessment date closest to that of the last Database A record (if I followed!).  

So roughly speaking, one solution would look something like this:

DATASET ACTIVATE A.
SORT CASES by ID Event_Date. /* Not be needed if data already sorted.
MATCH FILES
 file = * /
 by ID /
 LAST = LastRec.
EXECUTE.
SELECT if LastRec.

DATASET ACTIVATE B.
SORT CASES by ID Event_Date. /* Not be needed if data already sorted.

MATCH FILES
  TABLE = 'A' / IN = FromA /
  RENAME= (Age = AgeFromA) /
  FILE = 'B' /IN = FromB /
  RENAME= (Age = AgeFromB) /
  BY ID.
EXECUTE.

Then...

1. Use the DATEDIFF function to compute the difference between Event_Date and Assessment_Date, and them compute its absolute value (assuming this is what the OP wants).
2. Use AGGREGATE to write the MINIMUM value of the absolute value of that difference to each record (BREAKing on ID).
3. Use SELECT IF to keep only those cases where the absolute value of the difference computed in step 1 = the minimum computed on step 2.  

This should result in one record per ID, which is what I think the OP is aiming for.

HTH.



David Marso wrote
MATCH FILES has a LAST subcommand you can use to FLAG the most recent records per ID in file A.
Then it is a trivial MATCH into file B after selecting the flagged cases in file A.

Natasha Holmes wrote
Hello,

I am looking for some help or suggestions with my current database problem. A description of the problem and the databases are below:

I have two databases with multiple variables, Database A (original) and Database B (RNA). I need to merge these two databases (and all the variables); however, the databases have multiple participant (ID) entries. Ultimately, I want to make one database with the (1) most recent entries for each participant from Database A, (2) merged with the participant entry that has the "Assessment Date" variable (Database B) that is closest in date to the "Event Date" variable (Database A).

Database A looks like this...

 ID     Event_Date     Race      Sex     Age   .............
1       4/2/2007                
2       8/13/2008            
2       1/20/2010            
3       1/30/2006          
4       10/27/2006                    
4       4/26/2007            
 

Database B looks like this...

 ID     Assessment_Date     Race      Sex     Age   .............
1       11/19/2002                
1       7/31/2007              
2       8/25/2008              
2       6/30/2010          
3       11/9/2004                    
3       7/26/2007      

So, there are multiple entries in both databases, but with Database B I would like to find a way to pick out from the multiple entries for each participant only the entry that's "Assessment Date" (Database B) is closest to the "Event Date" (Database A).

I hope this all makes sense and the example data files help, but please let me know if I can clarify anything, and again thanks for all of your time and help.

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD
Please reply to the list and not to my personal email.
Those desiring my consulting or training services please feel free to email me.
---
"Nolite dare sanctum canibus neque mittatis margaritas vestras ante porcos ne forte conculcent eas pedibus suis."
Cum es damnatorum possederunt porcos iens ut salire off sanguinum cliff in abyssum?"