Database Management Help

classic Classic list List threaded Threaded
13 messages Options
Reply | Threaded
Open this post in threaded view
|

Database Management Help

Michael, Paul G.
Hi All,

I have two data sets that I would like to merge using ID as the keyed variable, and each data file has duplicates IDs. The variables of interest in the first dataset are ID, date of contact (DD-MM-YYYY), and Body Mass Index (BMI). The variables of interest in the second data set are ID, date of contact (DD-MM-YYYY), and depression score. The duplicate IDs occur because some subjects had more than 1 BMI score from different time points and/or more than 1 depression score at different time points.

The contact dates from each file do not match up in all instances (in fact very few subjects have the same contact date in both files). I only want to keep subjects who have both BMI data and depression data but I need to preserve information from all the contact dates.

When I try a simple merge by adding variables (e.g., depression score and contact date) to the BMI data set using ID as the keyed variable, I run into the problem of having to go through thousands of cases and delete those that don't have both BMI and depression data.

For example I might have one subject whose BMI was collected 10 times on different dates and none of these dates match the depression data for this subject. So after the merge I have eleven rows for this subject that I would like to preserve (all BMI data and depression data). Another subject may have BMI scores collected at 7 different dates, but this subject has no depression data. I would like to remove all the BMI data for this subject since they have no depression data.

Is there a way in which I can merge these files in a different way to get what I need or a method to delete cases in the merged data file based on duplicate IDs and BMI data but no depression data? Any help is greatly appreciated!

Best,

Paul

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD
Reply | Threaded
Open this post in threaded view
|

Re: Database Management Help

Poes, Matthew Joseph-2
I think your only solution is to merge by adding cases (rather than variables) and stack the data set.

If you needed a wide dataset, once all the data is there, you could transpose on ID.

Matthew J Poes
Research Data Specialist
Center for Prevention Research and Development
University of Illinois
510 Devonshire Dr.
Champaign, IL 61820
Phone: 217-265-4576
email: [hidden email]



-----Original Message-----
From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of Michael, Paul G.
Sent: Wednesday, March 28, 2012 12:54 PM
To: [hidden email]
Subject: Database Management Help

Hi All,

I have two data sets that I would like to merge using ID as the keyed variable, and each data file has duplicates IDs. The variables of interest in the first dataset are ID, date of contact (DD-MM-YYYY), and Body Mass Index (BMI). The variables of interest in the second data set are ID, date of contact (DD-MM-YYYY), and depression score. The duplicate IDs occur because some subjects had more than 1 BMI score from different time points and/or more than 1 depression score at different time points.

The contact dates from each file do not match up in all instances (in fact very few subjects have the same contact date in both files). I only want to keep subjects who have both BMI data and depression data but I need to preserve information from all the contact dates.

When I try a simple merge by adding variables (e.g., depression score and contact date) to the BMI data set using ID as the keyed variable, I run into the problem of having to go through thousands of cases and delete those that don't have both BMI and depression data.

For example I might have one subject whose BMI was collected 10 times on different dates and none of these dates match the depression data for this subject. So after the merge I have eleven rows for this subject that I would like to preserve (all BMI data and depression data). Another subject may have BMI scores collected at 7 different dates, but this subject has no depression data. I would like to remove all the BMI data for this subject since they have no depression data.

Is there a way in which I can merge these files in a different way to get what I need or a method to delete cases in the merged data file based on duplicate IDs and BMI data but no depression data? Any help is greatly appreciated!

Best,

Paul

=====================
To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD
Reply | Threaded
Open this post in threaded view
|

Re: Database Management Help

Bruce Weaver
Administrator
In reply to this post by Michael, Paul G.
From what you've said, I think I'd try this:

1. Get the BMI file, keeping only the variables of interest (via /KEEP).
2. Add data from the Depression file via ADD FILES, and deleting any variables not of interest.  Also use the /IN sub-command (twice) to create flags indicating cases from the BMI and Depression files.
3. In the resulting file, sort cases by ID and date of contact.  (If you use /BY ID Date on the ADD FILES in step 2, this should be done already.)
4. Use AGGREGATE to get the MAX values of the flag variables created on Step 2 (with ID as the BREAK variable, and writing new variables to the working file).   Keep only those IDs that have data from both files.  Assuming the Step 2 flags are called BMIFlag and DepFlag:  

SELECT IF (BMIFlag and DepFlag).  
EXE.

6. Assuming it is possible to have BMI and Depression data for the same date, you need to check for that.  You could use AGGREGATE again, breaking on ID and Date this time, and using the FIRST function to keep the first value value of the BMI and Depression variables. (I assume the maximum number of records per IDxDate combination is two.)  Note that this time, you would replace the working file rather than writing new variables to the working file.  

I don't have time to concoct an example and test it right now.

HTH.


Michael, Paul G. wrote
Hi All,

I have two data sets that I would like to merge using ID as the keyed variable, and each data file has duplicates IDs. The variables of interest in the first dataset are ID, date of contact (DD-MM-YYYY), and Body Mass Index (BMI). The variables of interest in the second data set are ID, date of contact (DD-MM-YYYY), and depression score. The duplicate IDs occur because some subjects had more than 1 BMI score from different time points and/or more than 1 depression score at different time points.

The contact dates from each file do not match up in all instances (in fact very few subjects have the same contact date in both files). I only want to keep subjects who have both BMI data and depression data but I need to preserve information from all the contact dates.

When I try a simple merge by adding variables (e.g., depression score and contact date) to the BMI data set using ID as the keyed variable, I run into the problem of having to go through thousands of cases and delete those that don't have both BMI and depression data.

For example I might have one subject whose BMI was collected 10 times on different dates and none of these dates match the depression data for this subject. So after the merge I have eleven rows for this subject that I would like to preserve (all BMI data and depression data). Another subject may have BMI scores collected at 7 different dates, but this subject has no depression data. I would like to remove all the BMI data for this subject since they have no depression data.

Is there a way in which I can merge these files in a different way to get what I need or a method to delete cases in the merged data file based on duplicate IDs and BMI data but no depression data? Any help is greatly appreciated!

Best,

Paul

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD
--
Bruce Weaver
bweaver@lakeheadu.ca
http://sites.google.com/a/lakeheadu.ca/bweaver/

"When all else fails, RTFM."

PLEASE NOTE THE FOLLOWING: 
1. My Hotmail account is not monitored regularly. To send me an e-mail, please use the address shown above.
2. The SPSSX Discussion forum on Nabble is no longer linked to the SPSSX-L listserv administered by UGA (https://listserv.uga.edu/).
Reply | Threaded
Open this post in threaded view
|

Re: Database Management Help

David Marso
Administrator
In reply to this post by Michael, Paul G.
Quick and dirty would be to do
1. CASESTOVARS on both files (use different varnames for the dates in the 2 files).
2. Simple 1:1 Match at the point.
3. Let the devil sort it out later with some basic logic after nuking the obvious crap.
--

Michael, Paul G. wrote
Hi All,

I have two data sets that I would like to merge using ID as the keyed variable, and each data file has duplicates IDs. The variables of interest in the first dataset are ID, date of contact (DD-MM-YYYY), and Body Mass Index (BMI). The variables of interest in the second data set are ID, date of contact (DD-MM-YYYY), and depression score. The duplicate IDs occur because some subjects had more than 1 BMI score from different time points and/or more than 1 depression score at different time points.

The contact dates from each file do not match up in all instances (in fact very few subjects have the same contact date in both files). I only want to keep subjects who have both BMI data and depression data but I need to preserve information from all the contact dates.

When I try a simple merge by adding variables (e.g., depression score and contact date) to the BMI data set using ID as the keyed variable, I run into the problem of having to go through thousands of cases and delete those that don't have both BMI and depression data.

For example I might have one subject whose BMI was collected 10 times on different dates and none of these dates match the depression data for this subject. So after the merge I have eleven rows for this subject that I would like to preserve (all BMI data and depression data). Another subject may have BMI scores collected at 7 different dates, but this subject has no depression data. I would like to remove all the BMI data for this subject since they have no depression data.

Is there a way in which I can merge these files in a different way to get what I need or a method to delete cases in the merged data file based on duplicate IDs and BMI data but no depression data? Any help is greatly appreciated!

Best,

Paul

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD
Please reply to the list and not to my personal email.
Those desiring my consulting or training services please feel free to email me.
---
"Nolite dare sanctum canibus neque mittatis margaritas vestras ante porcos ne forte conculcent eas pedibus suis."
Cum es damnatorum possederunt porcos iens ut salire off sanguinum cliff in abyssum?"
Reply | Threaded
Open this post in threaded view
|

Re: Database Management Help

David Marso
Administrator
I will add:  YMMV depending upon your actual file sizes and the maximum number of observations per ID.
If for example you are dealing with a sampling of all the depressed fat people in the world say .5 Billion you have a situation which might be better addressed more subtly (especially if some single individual shows up 100 times when everyone else has say max 5 records).  Your problem arises from having duplicate IDs in *BOTH* files.  You definitely need to denormalize one or both.
---

David Marso wrote
Quick and dirty would be to do
1. CASESTOVARS on both files (use different varnames for the dates in the 2 files).
2. Simple 1:1 Match at the point.
3. Let the devil sort it out later with some basic logic after nuking the obvious crap.
--

Michael, Paul G. wrote
Hi All,

I have two data sets that I would like to merge using ID as the keyed variable, and each data file has duplicates IDs. The variables of interest in the first dataset are ID, date of contact (DD-MM-YYYY), and Body Mass Index (BMI). The variables of interest in the second data set are ID, date of contact (DD-MM-YYYY), and depression score. The duplicate IDs occur because some subjects had more than 1 BMI score from different time points and/or more than 1 depression score at different time points.

The contact dates from each file do not match up in all instances (in fact very few subjects have the same contact date in both files). I only want to keep subjects who have both BMI data and depression data but I need to preserve information from all the contact dates.

When I try a simple merge by adding variables (e.g., depression score and contact date) to the BMI data set using ID as the keyed variable, I run into the problem of having to go through thousands of cases and delete those that don't have both BMI and depression data.

For example I might have one subject whose BMI was collected 10 times on different dates and none of these dates match the depression data for this subject. So after the merge I have eleven rows for this subject that I would like to preserve (all BMI data and depression data). Another subject may have BMI scores collected at 7 different dates, but this subject has no depression data. I would like to remove all the BMI data for this subject since they have no depression data.

Is there a way in which I can merge these files in a different way to get what I need or a method to delete cases in the merged data file based on duplicate IDs and BMI data but no depression data? Any help is greatly appreciated!

Best,

Paul

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD
Please reply to the list and not to my personal email.
Those desiring my consulting or training services please feel free to email me.
---
"Nolite dare sanctum canibus neque mittatis margaritas vestras ante porcos ne forte conculcent eas pedibus suis."
Cum es damnatorum possederunt porcos iens ut salire off sanguinum cliff in abyssum?"
Reply | Threaded
Open this post in threaded view
|

Re: Database Management Help

Bruce Weaver
Administrator
In reply to this post by David Marso
Paul (the OP) wrote:

"For example I might have one subject whose BMI was collected 10 times on different dates and none of these dates match the depression data for this subject. So after the merge I have eleven rows for this subject that I would like to preserve (all BMI data and depression data). Another subject may have BMI scores collected at 7 different dates, but this subject has no depression data. I would like to remove all the BMI data for this subject since they have no depression data."

I understood this to mean that Paul wants the final file to have multiple rows per ID, not one row per ID.  David's method results in the latter, I think.

This illustrates once again how helpful it is to post small examples showing what the file looks like originally, and what you want it to look like afterward!  ;-)



David Marso wrote
Quick and dirty would be to do
1. CASESTOVARS on both files (use different varnames for the dates in the 2 files).
2. Simple 1:1 Match at the point.
3. Let the devil sort it out later with some basic logic after nuking the obvious crap.
--

Michael, Paul G. wrote
Hi All,

I have two data sets that I would like to merge using ID as the keyed variable, and each data file has duplicates IDs. The variables of interest in the first dataset are ID, date of contact (DD-MM-YYYY), and Body Mass Index (BMI). The variables of interest in the second data set are ID, date of contact (DD-MM-YYYY), and depression score. The duplicate IDs occur because some subjects had more than 1 BMI score from different time points and/or more than 1 depression score at different time points.

The contact dates from each file do not match up in all instances (in fact very few subjects have the same contact date in both files). I only want to keep subjects who have both BMI data and depression data but I need to preserve information from all the contact dates.

When I try a simple merge by adding variables (e.g., depression score and contact date) to the BMI data set using ID as the keyed variable, I run into the problem of having to go through thousands of cases and delete those that don't have both BMI and depression data.

For example I might have one subject whose BMI was collected 10 times on different dates and none of these dates match the depression data for this subject. So after the merge I have eleven rows for this subject that I would like to preserve (all BMI data and depression data). Another subject may have BMI scores collected at 7 different dates, but this subject has no depression data. I would like to remove all the BMI data for this subject since they have no depression data.

Is there a way in which I can merge these files in a different way to get what I need or a method to delete cases in the merged data file based on duplicate IDs and BMI data but no depression data? Any help is greatly appreciated!

Best,

Paul

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD
--
Bruce Weaver
bweaver@lakeheadu.ca
http://sites.google.com/a/lakeheadu.ca/bweaver/

"When all else fails, RTFM."

PLEASE NOTE THE FOLLOWING: 
1. My Hotmail account is not monitored regularly. To send me an e-mail, please use the address shown above.
2. The SPSSX Discussion forum on Nabble is no longer linked to the SPSSX-L listserv administered by UGA (https://listserv.uga.edu/).
Reply | Threaded
Open this post in threaded view
|

Re: Database Management Help

David Marso
Administrator
Impossible to tell *What* Paul wants as the endgame.  My basic claim is that it will be much easier to "remove all the BMI data for this subject since they have no depression data."  If the depression file is -denormalized/flattened/casetovar'd- and used as a TABLE into the 'BMI data'.  NUKE the unassociated cases then VARSTOCASES if you want to use a hammer or some flavor of VECTOR/LOOP/XSAVE if one is into tweezers.  Since we are ESPssPeculating at this point I will refrain from any further guessing pending further reply.  ADD w /INs ->FLAG followed by AGGREGATE -MODE ADDVAR MAX(FLAG) will also work but what should be associated with what? ALL BMI wth ALL depression?  Most recent? Before? After? BOTH?  Maybe start with ALL<-> ALL and then let the elves sort the fairy dust after the fact ;-).

"I understood this to mean that Paul wants the final file to have multiple rows per ID, not one row per ID.  David's method results in the latter, I think."


Bruce Weaver wrote
Paul (the OP) wrote:

"For example I might have one subject whose BMI was collected 10 times on different dates and none of these dates match the depression data for this subject. So after the merge I have eleven rows for this subject that I would like to preserve (all BMI data and depression data). Another subject may have BMI scores collected at 7 different dates, but this subject has no depression data. I would like to remove all the BMI data for this subject since they have no depression data."

I understood this to mean that Paul wants the final file to have multiple rows per ID, not one row per ID.  David's method results in the latter, I think.

This illustrates once again how helpful it is to post small examples showing what the file looks like originally, and what you want it to look like afterward!  ;-)



David Marso wrote
Quick and dirty would be to do
1. CASESTOVARS on both files (use different varnames for the dates in the 2 files).
2. Simple 1:1 Match at the point.
3. Let the devil sort it out later with some basic logic after nuking the obvious crap.
--

Michael, Paul G. wrote
Hi All,

I have two data sets that I would like to merge using ID as the keyed variable, and each data file has duplicates IDs. The variables of interest in the first dataset are ID, date of contact (DD-MM-YYYY), and Body Mass Index (BMI). The variables of interest in the second data set are ID, date of contact (DD-MM-YYYY), and depression score. The duplicate IDs occur because some subjects had more than 1 BMI score from different time points and/or more than 1 depression score at different time points.

The contact dates from each file do not match up in all instances (in fact very few subjects have the same contact date in both files). I only want to keep subjects who have both BMI data and depression data but I need to preserve information from all the contact dates.

When I try a simple merge by adding variables (e.g., depression score and contact date) to the BMI data set using ID as the keyed variable, I run into the problem of having to go through thousands of cases and delete those that don't have both BMI and depression data.

For example I might have one subject whose BMI was collected 10 times on different dates and none of these dates match the depression data for this subject. So after the merge I have eleven rows for this subject that I would like to preserve (all BMI data and depression data). Another subject may have BMI scores collected at 7 different dates, but this subject has no depression data. I would like to remove all the BMI data for this subject since they have no depression data.

Is there a way in which I can merge these files in a different way to get what I need or a method to delete cases in the merged data file based on duplicate IDs and BMI data but no depression data? Any help is greatly appreciated!

Best,

Paul

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD
Please reply to the list and not to my personal email.
Those desiring my consulting or training services please feel free to email me.
---
"Nolite dare sanctum canibus neque mittatis margaritas vestras ante porcos ne forte conculcent eas pedibus suis."
Cum es damnatorum possederunt porcos iens ut salire off sanguinum cliff in abyssum?"
Reply | Threaded
Open this post in threaded view
|

Re: Database Management Help

Bruce Weaver
Administrator
I *think* he wants one row for each unique ID x Date combination.  If he happens to have both BMI and Depression data for that row, both variables will have valid values.  On other rows (most of them), only one of the two variables will have a valid score.  But yes...it is ESPeculation on my part.  ;-)


David Marso wrote
Impossible to tell *What* Paul wants as the endgame.  My basic claim is that it will be much easier to "remove all the BMI data for this subject since they have no depression data."  If the depression file is -denormalized/flattened/casetovar'd- and used as a TABLE into the 'BMI data'.  NUKE the unassociated cases then VARSTOCASES if you want to use a hammer or some flavor of VECTOR/LOOP/XSAVE if one is into tweezers.  Since we are ESPssPeculating at this point I will refrain from any further guessing pending further reply.  ADD w /INs ->FLAG followed by AGGREGATE -MODE ADDVAR MAX(FLAG) will also work but what should be associated with what? ALL BMI wth ALL depression?  Most recent? Before? After? BOTH?  Maybe start with ALL<-> ALL and then let the elves sort the fairy dust after the fact ;-).

"I understood this to mean that Paul wants the final file to have multiple rows per ID, not one row per ID.  David's method results in the latter, I think."


Bruce Weaver wrote
Paul (the OP) wrote:

"For example I might have one subject whose BMI was collected 10 times on different dates and none of these dates match the depression data for this subject. So after the merge I have eleven rows for this subject that I would like to preserve (all BMI data and depression data). Another subject may have BMI scores collected at 7 different dates, but this subject has no depression data. I would like to remove all the BMI data for this subject since they have no depression data."

I understood this to mean that Paul wants the final file to have multiple rows per ID, not one row per ID.  David's method results in the latter, I think.

This illustrates once again how helpful it is to post small examples showing what the file looks like originally, and what you want it to look like afterward!  ;-)



David Marso wrote
Quick and dirty would be to do
1. CASESTOVARS on both files (use different varnames for the dates in the 2 files).
2. Simple 1:1 Match at the point.
3. Let the devil sort it out later with some basic logic after nuking the obvious crap.
--

Michael, Paul G. wrote
Hi All,

I have two data sets that I would like to merge using ID as the keyed variable, and each data file has duplicates IDs. The variables of interest in the first dataset are ID, date of contact (DD-MM-YYYY), and Body Mass Index (BMI). The variables of interest in the second data set are ID, date of contact (DD-MM-YYYY), and depression score. The duplicate IDs occur because some subjects had more than 1 BMI score from different time points and/or more than 1 depression score at different time points.

The contact dates from each file do not match up in all instances (in fact very few subjects have the same contact date in both files). I only want to keep subjects who have both BMI data and depression data but I need to preserve information from all the contact dates.

When I try a simple merge by adding variables (e.g., depression score and contact date) to the BMI data set using ID as the keyed variable, I run into the problem of having to go through thousands of cases and delete those that don't have both BMI and depression data.

For example I might have one subject whose BMI was collected 10 times on different dates and none of these dates match the depression data for this subject. So after the merge I have eleven rows for this subject that I would like to preserve (all BMI data and depression data). Another subject may have BMI scores collected at 7 different dates, but this subject has no depression data. I would like to remove all the BMI data for this subject since they have no depression data.

Is there a way in which I can merge these files in a different way to get what I need or a method to delete cases in the merged data file based on duplicate IDs and BMI data but no depression data? Any help is greatly appreciated!

Best,

Paul

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD
--
Bruce Weaver
bweaver@lakeheadu.ca
http://sites.google.com/a/lakeheadu.ca/bweaver/

"When all else fails, RTFM."

PLEASE NOTE THE FOLLOWING: 
1. My Hotmail account is not monitored regularly. To send me an e-mail, please use the address shown above.
2. The SPSSX Discussion forum on Nabble is no longer linked to the SPSSX-L listserv administered by UGA (https://listserv.uga.edu/).
Reply | Threaded
Open this post in threaded view
|

Re: Database Management Help

Michael, Paul G.
Hi Bruce and David,

First, thank you so much for trying to help and answer my questions. Second, I apologize for not posting an example of the data file so I will do so now.

The merged data file looks like this...

ID     Contact_Date_BMI     BMI      Contact_Date_Depression     Depression_Score
1       22-Dec-2011             31.68     .                                         .
1       01-Jan-2012              27.65     .                                         .
2       13-Dec-2011             23.88     .                                         .
2       12-Mar-2012             27.35
2       .                               .            01-May-2012                      24
3       08-Nov-2010             29.85     .                                         .
3       07-Dec-2010             28.14     .                                         .
3       .                               .             11-Aug-2010                      8

So I would like to find a way to delete all of subject #1's data, since there is only BMI data and no depression data. I would like to keep all of the data for subject #2 & 3.

Something I did not mention in my earlier email, and to complicate things further, I would like to create a new variable that represents the difference in days and/or months between each BMI contact date and the depression score contact date. Then for each subject I would like to inlcude the depression contact date and score on each row. Ideally the new data file would look something like this...

ID     Contact_Date_BMI     BMI     Contact_Date_Depression     Depression_Score    Time_Diff_Months
2       13-Dec-2011             23.88    01-May-2012                      24                           5
2       12-Mar-2012             27.35    01-May-2012                      24                           1
2       .                               .           01-May-2012                      24
3       08-Nov-2010             29.85    11-Aug-2010                       8                            4
3       07-Dec-2010             28.14    11-Aug-2010                       8                            5
3       .                               .           11-Aug-2010                       8

In my example I have one depression contact date that is before all BMI contact dates (subject #3) and one depression contact date that is after all BMI contact dates (subject #2). It is okay with me if the former returns a negative time difference (in months or days). I hope this all makes sense and the example data files help, but please let me know if I can clarify anything, and again thanks for all of your time and help.

Best,

Paul

______________________________________
From: SPSSX(r) Discussion [[hidden email]] On Behalf Of Bruce Weaver [[hidden email]]
Sent: Wednesday, March 28, 2012 5:54 PM
To: [hidden email]
Subject: Re: Database Management Help

I *think* he wants one row for each unique ID x Date combination.  If he
happens to have both BMI and Depression data for that row, both variables
will have valid values.  On other rows (most of them), only one of the two
variables will have a valid score.  But yes...it is ESPeculation on my part.
;-)



David Marso wrote

>
> Impossible to tell *What* Paul wants as the endgame.  My basic claim is
> that it will be much easier to "remove all the BMI data for this subject
> since they have no depression data."  If the depression file is
> -denormalized/flattened/casetovar'd- and used as a TABLE into the 'BMI
> data'.  NUKE the unassociated cases then VARSTOCASES if you want to use a
> hammer or some flavor of VECTOR/LOOP/XSAVE if one is into tweezers.  Since
> we are ESPssPeculating at this point I will refrain from any further
> guessing pending further reply.  ADD w /INs ->FLAG followed by AGGREGATE
> -MODE ADDVAR MAX(FLAG) will also work but what should be associated with
> what? ALL BMI wth ALL depression?  Most recent? Before? After? BOTH?
> Maybe start with ALL<-> ALL and then let the elves sort the fairy dust
> after the fact ;-).
>
> "I understood this to mean that Paul wants the final file to have multiple
> rows per ID, not one row per ID.  David's method results in the latter, I
> think."
>
>
>
> Bruce Weaver wrote
>>
>> Paul (the OP) wrote:
>>
>> "For example I might have one subject whose BMI was collected 10 times on
>> different dates and none of these dates match the depression data for
>> this subject. So after the merge I have eleven rows for this subject that
>> I would like to preserve (all BMI data and depression data). Another
>> subject may have BMI scores collected at 7 different dates, but this
>> subject has no depression data. I would like to remove all the BMI data
>> for this subject since they have no depression data."
>>
>> I understood this to mean that Paul wants the final file to have multiple
>> rows per ID, not one row per ID.  David's method results in the latter, I
>> think.
>>
>> This illustrates once again how helpful it is to post small examples
>> showing what the file looks like originally, and what you want it to look
>> like afterward!  ;-)
>>
>>
>>
>>
>> David Marso wrote
>>>
>>> Quick and dirty would be to do
>>> 1. CASESTOVARS on both files (use different varnames for the dates in
>>> the 2 files).
>>> 2. Simple 1:1 Match at the point.
>>> 3. Let the devil sort it out later with some basic logic after nuking
>>> the obvious crap.
>>> --
>>>
>>>
>>> Michael, Paul G. wrote
>>>>
>>>> Hi All,
>>>>
>>>> I have two data sets that I would like to merge using ID as the keyed
>>>> variable, and each data file has duplicates IDs. The variables of
>>>> interest in the first dataset are ID, date of contact (DD-MM-YYYY), and
>>>> Body Mass Index (BMI). The variables of interest in the second data set
>>>> are ID, date of contact (DD-MM-YYYY), and depression score. The
>>>> duplicate IDs occur because some subjects had more than 1 BMI score
>>>> from different time points and/or more than 1 depression score at
>>>> different time points.
>>>>
>>>> The contact dates from each file do not match up in all instances (in
>>>> fact very few subjects have the same contact date in both files). I
>>>> only want to keep subjects who have both BMI data and depression data
>>>> but I need to preserve information from all the contact dates.
>>>>
>>>> When I try a simple merge by adding variables (e.g., depression score
>>>> and contact date) to the BMI data set using ID as the keyed variable, I
>>>> run into the problem of having to go through thousands of cases and
>>>> delete those that don't have both BMI and depression data.
>>>>
>>>> For example I might have one subject whose BMI was collected 10 times
>>>> on different dates and none of these dates match the depression data
>>>> for this subject. So after the merge I have eleven rows for this
>>>> subject that I would like to preserve (all BMI data and depression
>>>> data). Another subject may have BMI scores collected at 7 different
>>>> dates, but this subject has no depression data. I would like to remove
>>>> all the BMI data for this subject since they have no depression data.
>>>>
>>>> Is there a way in which I can merge these files in a different way to
>>>> get what I need or a method to delete cases in the merged data file
>>>> based on duplicate IDs and BMI data but no depression data? Any help is
>>>> greatly appreciated!
>>>>
>>>> Best,
>>>>
>>>> Paul
>>>>
>>>> =====================
>>>> To manage your subscription to SPSSX-L, send a message to
>>>> LISTSERV@.UGA (not to SPSSX-L), with no body text except the
>>>> command. To leave the list, send the command
>>>> SIGNOFF SPSSX-L
>>>> For a list of commands to manage subscriptions, send the command
>>>> INFO REFCARD
>>>>
>>>
>>
>


-----
--
Bruce Weaver
[hidden email]
http://sites.google.com/a/lakeheadu.ca/bweaver/

"When all else fails, RTFM."

NOTE: My Hotmail account is not monitored regularly.
To send me an e-mail, please use the address shown above.

--
View this message in context: http://spssx-discussion.1045642.n5.nabble.com/Database-Management-Help-tp5601508p5602258.html
Sent from the SPSSX Discussion mailing list archive at Nabble.com.

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD
Reply | Threaded
Open this post in threaded view
|

Re: Database Management Help

David Marso
Administrator
Simple: Abbreviated varnames
**UNTESTED**.
SORT each file by ID DateVar and SAVE.
* Copy Contact_Date_Depression as DATEVar in Depression file for ADD
* Ditto  Contact_Date_BMI >DateVar... BMI File
ADD FILES / FILE Depression/IN=DEP_FLAG/FILE BMI /IN=BMI_FLAG/ BY ID Date.
DO IF ID=LAG(ID) AND MISSING(Depression).
COMPUTE Contact_Date_Depression=LAG(Contact_Date_Depression).
COMPUTE   Depression_Score=LAG(  Depression_Score).
END IF.
AGGREGATE OUTFILE * / MODE ADDVARIABLES /BREAK ID /MAX_DEP_FLAG=MAX(DEP_FLAG).
SELECT IF MAX_DEP_FLAG.
COMPUTE Date_Delta=DATEDIFF(Contact_Date_Depression,Contact_Date_BMI,???) RTFM for syntax...
HTH, David
Michael, Paul G. wrote
Hi Bruce and David,

First, thank you so much for trying to help and answer my questions. Second, I apologize for not posting an example of the data file so I will do so now.

The merged data file looks like this...

ID     Contact_Date_BMI     BMI      Contact_Date_Depression     Depression_Score
1       22-Dec-2011             31.68     .                                         .
1       01-Jan-2012              27.65     .                                         .
2       13-Dec-2011             23.88     .                                         .
2       12-Mar-2012             27.35
2       .                               .            01-May-2012                      24
3       08-Nov-2010             29.85     .                                         .
3       07-Dec-2010             28.14     .                                         .
3       .                               .             11-Aug-2010                      8

So I would like to find a way to delete all of subject #1's data, since there is only BMI data and no depression data. I would like to keep all of the data for subject #2 & 3.

Something I did not mention in my earlier email, and to complicate things further, I would like to create a new variable that represents the difference in days and/or months between each BMI contact date and the depression score contact date. Then for each subject I would like to inlcude the depression contact date and score on each row. Ideally the new data file would look something like this...

ID     Contact_Date_BMI     BMI     Contact_Date_Depression     Depression_Score    Time_Diff_Months
2       13-Dec-2011             23.88    01-May-2012                      24                           5
2       12-Mar-2012             27.35    01-May-2012                      24                           1
2       .                               .           01-May-2012                      24
3       08-Nov-2010             29.85    11-Aug-2010                       8                            4
3       07-Dec-2010             28.14    11-Aug-2010                       8                            5
3       .                               .           11-Aug-2010                       8

In my example I have one depression contact date that is before all BMI contact dates (subject #3) and one depression contact date that is after all BMI contact dates (subject #2). It is okay with me if the former returns a negative time difference (in months or days). I hope this all makes sense and the example data files help, but please let me know if I can clarify anything, and again thanks for all of your time and help.

Best,

Paul

______________________________________
From: SPSSX(r) Discussion [[hidden email]] On Behalf Of Bruce Weaver [[hidden email]]
Sent: Wednesday, March 28, 2012 5:54 PM
To: [hidden email]
Subject: Re: Database Management Help

I *think* he wants one row for each unique ID x Date combination.  If he
happens to have both BMI and Depression data for that row, both variables
will have valid values.  On other rows (most of them), only one of the two
variables will have a valid score.  But yes...it is ESPeculation on my part.
;-)



David Marso wrote
>
> Impossible to tell *What* Paul wants as the endgame.  My basic claim is
> that it will be much easier to "remove all the BMI data for this subject
> since they have no depression data."  If the depression file is
> -denormalized/flattened/casetovar'd- and used as a TABLE into the 'BMI
> data'.  NUKE the unassociated cases then VARSTOCASES if you want to use a
> hammer or some flavor of VECTOR/LOOP/XSAVE if one is into tweezers.  Since
> we are ESPssPeculating at this point I will refrain from any further
> guessing pending further reply.  ADD w /INs ->FLAG followed by AGGREGATE
> -MODE ADDVAR MAX(FLAG) will also work but what should be associated with
> what? ALL BMI wth ALL depression?  Most recent? Before? After? BOTH?
> Maybe start with ALL<-> ALL and then let the elves sort the fairy dust
> after the fact ;-).
>
> "I understood this to mean that Paul wants the final file to have multiple
> rows per ID, not one row per ID.  David's method results in the latter, I
> think."
>
>
>
> Bruce Weaver wrote
>>
>> Paul (the OP) wrote:
>>
>> "For example I might have one subject whose BMI was collected 10 times on
>> different dates and none of these dates match the depression data for
>> this subject. So after the merge I have eleven rows for this subject that
>> I would like to preserve (all BMI data and depression data). Another
>> subject may have BMI scores collected at 7 different dates, but this
>> subject has no depression data. I would like to remove all the BMI data
>> for this subject since they have no depression data."
>>
>> I understood this to mean that Paul wants the final file to have multiple
>> rows per ID, not one row per ID.  David's method results in the latter, I
>> think.
>>
>> This illustrates once again how helpful it is to post small examples
>> showing what the file looks like originally, and what you want it to look
>> like afterward!  ;-)
>>
>>
>>
>>
>> David Marso wrote
>>>
>>> Quick and dirty would be to do
>>> 1. CASESTOVARS on both files (use different varnames for the dates in
>>> the 2 files).
>>> 2. Simple 1:1 Match at the point.
>>> 3. Let the devil sort it out later with some basic logic after nuking
>>> the obvious crap.
>>> --
>>>
>>>
>>> Michael, Paul G. wrote
>>>>
>>>> Hi All,
>>>>
>>>> I have two data sets that I would like to merge using ID as the keyed
>>>> variable, and each data file has duplicates IDs. The variables of
>>>> interest in the first dataset are ID, date of contact (DD-MM-YYYY), and
>>>> Body Mass Index (BMI). The variables of interest in the second data set
>>>> are ID, date of contact (DD-MM-YYYY), and depression score. The
>>>> duplicate IDs occur because some subjects had more than 1 BMI score
>>>> from different time points and/or more than 1 depression score at
>>>> different time points.
>>>>
>>>> The contact dates from each file do not match up in all instances (in
>>>> fact very few subjects have the same contact date in both files). I
>>>> only want to keep subjects who have both BMI data and depression data
>>>> but I need to preserve information from all the contact dates.
>>>>
>>>> When I try a simple merge by adding variables (e.g., depression score
>>>> and contact date) to the BMI data set using ID as the keyed variable, I
>>>> run into the problem of having to go through thousands of cases and
>>>> delete those that don't have both BMI and depression data.
>>>>
>>>> For example I might have one subject whose BMI was collected 10 times
>>>> on different dates and none of these dates match the depression data
>>>> for this subject. So after the merge I have eleven rows for this
>>>> subject that I would like to preserve (all BMI data and depression
>>>> data). Another subject may have BMI scores collected at 7 different
>>>> dates, but this subject has no depression data. I would like to remove
>>>> all the BMI data for this subject since they have no depression data.
>>>>
>>>> Is there a way in which I can merge these files in a different way to
>>>> get what I need or a method to delete cases in the merged data file
>>>> based on duplicate IDs and BMI data but no depression data? Any help is
>>>> greatly appreciated!
>>>>
>>>> Best,
>>>>
>>>> Paul
>>>>
>>>> =====================
>>>> To manage your subscription to SPSSX-L, send a message to
>>>> LISTSERV@.UGA (not to SPSSX-L), with no body text except the
>>>> command. To leave the list, send the command
>>>> SIGNOFF SPSSX-L
>>>> For a list of commands to manage subscriptions, send the command
>>>> INFO REFCARD
>>>>
>>>
>>
>


-----
--
Bruce Weaver
[hidden email]
http://sites.google.com/a/lakeheadu.ca/bweaver/

"When all else fails, RTFM."

NOTE: My Hotmail account is not monitored regularly.
To send me an e-mail, please use the address shown above.

--
View this message in context: http://spssx-discussion.1045642.n5.nabble.com/Database-Management-Help-tp5601508p5602258.html
Sent from the SPSSX Discussion mailing list archive at Nabble.com.

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD
Please reply to the list and not to my personal email.
Those desiring my consulting or training services please feel free to email me.
---
"Nolite dare sanctum canibus neque mittatis margaritas vestras ante porcos ne forte conculcent eas pedibus suis."
Cum es damnatorum possederunt porcos iens ut salire off sanguinum cliff in abyssum?"
Reply | Threaded
Open this post in threaded view
|

Re: Database Management Help

Bruce Weaver
Administrator
In reply to this post by Michael, Paul G.
Does this example do what you want?  It assumes that each ID will have at most one Depression score.

new file.
dataset close all.

* Create two data sets to illustrate.

data list list / ID (f5.0) Contact_Date_BMI(date11) BMI(f5.2) .
begin data
1       22-Dec-2011             31.68
1       01-Jan-2012              27.65
2       13-Dec-2011             23.88
2       12-Mar-2012             27.35
3       08-Nov-2010             29.85
3       07-Dec-2010             28.14
end data.
dataset name BMI.

data list list / ID (f5.0) Contact_Date_Depression (date11) Depression_Score (f5.0).
begin data
2  01-May-2012   24
3  11-Aug-2010      8
end data.
dataset name Dep.

**** The important stuff starts here **** .
match files
 file = 'BMI' / in = BMIflag /
 table = 'Dep' / in = Dflag /
 by ID.
execute.
dataset name merged.
dataset activate merged.
dataset close all.

* Keep only records where BMIflag and Dflag both equal 1.

select if BMIflag and Dflag.
execute.
delete variables BMIflag Dflag. /* no longer needed.

* Compute Time Diff variables.

compute timediff_days = datediff(Contact_Date_Depression, Contact_Date_BMI, "days").
compute timediff_months = datediff(Contact_Date_Depression, Contact_Date_BMI, "months").
list.

OUTPUT:

ID Contact_Date_BMI    BMI Contact_Date_Depression Depression_Score timediff_days timediff_months
 
    2    13-DEC-2011    23.88       01-MAY-2012                24           140.00           4.00
    2    12-MAR-2012    27.35       01-MAY-2012                24            50.00           1.00
    3    08-NOV-2010    29.85       11-AUG-2010                 8           -89.00          -2.00
    3    07-DEC-2010    28.14       11-AUG-2010                 8          -118.00          -3.00
 
Number of cases read:  4    Number of cases listed:  4

This output differs from what you showed in that the rows with no BMI score are deleted.  I don't see the point in including them, since the Depression score found on those rows is recorded on all the other rows too.

HTH.


Michael, Paul G. wrote
Hi Bruce and David,

First, thank you so much for trying to help and answer my questions. Second, I apologize for not posting an example of the data file so I will do so now.

The merged data file looks like this...

ID     Contact_Date_BMI     BMI      Contact_Date_Depression     Depression_Score
1       22-Dec-2011             31.68     .                                         .
1       01-Jan-2012              27.65     .                                         .
2       13-Dec-2011             23.88     .                                         .
2       12-Mar-2012             27.35
2       .                               .            01-May-2012                      24
3       08-Nov-2010             29.85     .                                         .
3       07-Dec-2010             28.14     .                                         .
3       .                               .             11-Aug-2010                      8

So I would like to find a way to delete all of subject #1's data, since there is only BMI data and no depression data. I would like to keep all of the data for subject #2 & 3.

Something I did not mention in my earlier email, and to complicate things further, I would like to create a new variable that represents the difference in days and/or months between each BMI contact date and the depression score contact date. Then for each subject I would like to inlcude the depression contact date and score on each row. Ideally the new data file would look something like this...

ID     Contact_Date_BMI     BMI     Contact_Date_Depression     Depression_Score    Time_Diff_Months
2       13-Dec-2011             23.88    01-May-2012                      24                           5
2       12-Mar-2012             27.35    01-May-2012                      24                           1
2       .                               .           01-May-2012                      24
3       08-Nov-2010             29.85    11-Aug-2010                       8                            4
3       07-Dec-2010             28.14    11-Aug-2010                       8                            5
3       .                               .           11-Aug-2010                       8

In my example I have one depression contact date that is before all BMI contact dates (subject #3) and one depression contact date that is after all BMI contact dates (subject #2). It is okay with me if the former returns a negative time difference (in months or days). I hope this all makes sense and the example data files help, but please let me know if I can clarify anything, and again thanks for all of your time and help.

Best,

Paul

______________________________________
From: SPSSX(r) Discussion [[hidden email]] On Behalf Of Bruce Weaver [[hidden email]]
Sent: Wednesday, March 28, 2012 5:54 PM
To: [hidden email]
Subject: Re: Database Management Help

I *think* he wants one row for each unique ID x Date combination.  If he
happens to have both BMI and Depression data for that row, both variables
will have valid values.  On other rows (most of them), only one of the two
variables will have a valid score.  But yes...it is ESPeculation on my part.
;-)



David Marso wrote
>
> Impossible to tell *What* Paul wants as the endgame.  My basic claim is
> that it will be much easier to "remove all the BMI data for this subject
> since they have no depression data."  If the depression file is
> -denormalized/flattened/casetovar'd- and used as a TABLE into the 'BMI
> data'.  NUKE the unassociated cases then VARSTOCASES if you want to use a
> hammer or some flavor of VECTOR/LOOP/XSAVE if one is into tweezers.  Since
> we are ESPssPeculating at this point I will refrain from any further
> guessing pending further reply.  ADD w /INs ->FLAG followed by AGGREGATE
> -MODE ADDVAR MAX(FLAG) will also work but what should be associated with
> what? ALL BMI wth ALL depression?  Most recent? Before? After? BOTH?
> Maybe start with ALL<-> ALL and then let the elves sort the fairy dust
> after the fact ;-).
>
> "I understood this to mean that Paul wants the final file to have multiple
> rows per ID, not one row per ID.  David's method results in the latter, I
> think."
>
>
>
> Bruce Weaver wrote
>>
>> Paul (the OP) wrote:
>>
>> "For example I might have one subject whose BMI was collected 10 times on
>> different dates and none of these dates match the depression data for
>> this subject. So after the merge I have eleven rows for this subject that
>> I would like to preserve (all BMI data and depression data). Another
>> subject may have BMI scores collected at 7 different dates, but this
>> subject has no depression data. I would like to remove all the BMI data
>> for this subject since they have no depression data."
>>
>> I understood this to mean that Paul wants the final file to have multiple
>> rows per ID, not one row per ID.  David's method results in the latter, I
>> think.
>>
>> This illustrates once again how helpful it is to post small examples
>> showing what the file looks like originally, and what you want it to look
>> like afterward!  ;-)
>>
>>
>>
>>
>> David Marso wrote
>>>
>>> Quick and dirty would be to do
>>> 1. CASESTOVARS on both files (use different varnames for the dates in
>>> the 2 files).
>>> 2. Simple 1:1 Match at the point.
>>> 3. Let the devil sort it out later with some basic logic after nuking
>>> the obvious crap.
>>> --
>>>
>>>
>>> Michael, Paul G. wrote
>>>>
>>>> Hi All,
>>>>
>>>> I have two data sets that I would like to merge using ID as the keyed
>>>> variable, and each data file has duplicates IDs. The variables of
>>>> interest in the first dataset are ID, date of contact (DD-MM-YYYY), and
>>>> Body Mass Index (BMI). The variables of interest in the second data set
>>>> are ID, date of contact (DD-MM-YYYY), and depression score. The
>>>> duplicate IDs occur because some subjects had more than 1 BMI score
>>>> from different time points and/or more than 1 depression score at
>>>> different time points.
>>>>
>>>> The contact dates from each file do not match up in all instances (in
>>>> fact very few subjects have the same contact date in both files). I
>>>> only want to keep subjects who have both BMI data and depression data
>>>> but I need to preserve information from all the contact dates.
>>>>
>>>> When I try a simple merge by adding variables (e.g., depression score
>>>> and contact date) to the BMI data set using ID as the keyed variable, I
>>>> run into the problem of having to go through thousands of cases and
>>>> delete those that don't have both BMI and depression data.
>>>>
>>>> For example I might have one subject whose BMI was collected 10 times
>>>> on different dates and none of these dates match the depression data
>>>> for this subject. So after the merge I have eleven rows for this
>>>> subject that I would like to preserve (all BMI data and depression
>>>> data). Another subject may have BMI scores collected at 7 different
>>>> dates, but this subject has no depression data. I would like to remove
>>>> all the BMI data for this subject since they have no depression data.
>>>>
>>>> Is there a way in which I can merge these files in a different way to
>>>> get what I need or a method to delete cases in the merged data file
>>>> based on duplicate IDs and BMI data but no depression data? Any help is
>>>> greatly appreciated!
>>>>
>>>> Best,
>>>>
>>>> Paul
>>>>
>>>> =====================
>>>> To manage your subscription to SPSSX-L, send a message to
>>>> LISTSERV@.UGA (not to SPSSX-L), with no body text except the
>>>> command. To leave the list, send the command
>>>> SIGNOFF SPSSX-L
>>>> For a list of commands to manage subscriptions, send the command
>>>> INFO REFCARD
>>>>
>>>
>>
>


-----
--
Bruce Weaver
[hidden email]
http://sites.google.com/a/lakeheadu.ca/bweaver/

"When all else fails, RTFM."

NOTE: My Hotmail account is not monitored regularly.
To send me an e-mail, please use the address shown above.

--
View this message in context: http://spssx-discussion.1045642.n5.nabble.com/Database-Management-Help-tp5601508p5602258.html
Sent from the SPSSX Discussion mailing list archive at Nabble.com.

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD
--
Bruce Weaver
bweaver@lakeheadu.ca
http://sites.google.com/a/lakeheadu.ca/bweaver/

"When all else fails, RTFM."

PLEASE NOTE THE FOLLOWING: 
1. My Hotmail account is not monitored regularly. To send me an e-mail, please use the address shown above.
2. The SPSSX Discussion forum on Nabble is no longer linked to the SPSSX-L listserv administered by UGA (https://listserv.uga.edu/).
Reply | Threaded
Open this post in threaded view
|

Re: Database Management Help

Michael, Paul G.
Hi Bruce,

This example worked great. Thank you so much for all of your time and help, I really appreciate it.

Best,

Paul


________________________________________
From: SPSSX(r) Discussion [[hidden email]] On Behalf Of Bruce Weaver [[hidden email]]
Sent: Thursday, March 29, 2012 9:51 AM
To: [hidden email]
Subject: Re: Database Management Help

Does this example do what you want?  It assumes that each ID will have *at
most* one Depression score.

new file.
dataset close all.

* Create two data sets to illustrate.

data list list / ID (f5.0) Contact_Date_BMI(date11) BMI(f5.2) .
begin data
1       22-Dec-2011             31.68
1       01-Jan-2012              27.65
2       13-Dec-2011             23.88
2       12-Mar-2012             27.35
3       08-Nov-2010             29.85
3       07-Dec-2010             28.14
end data.
dataset name BMI.

data list list / ID (f5.0) Contact_Date_Depression (date11) Depression_Score
(f5.0).
begin data
2  01-May-2012   24
3  11-Aug-2010      8
end data.
dataset name Dep.

**** The important stuff starts here **** .
match files
 file = 'BMI' / in = BMIflag /
 table = 'Dep' / in = Dflag /
 by ID.
execute.
dataset name merged.
dataset activate merged.
dataset close all.

* Keep only records where BMIflag and Dflag both equal 1.

select if BMIflag and Dflag.
execute.
delete variables BMIflag Dflag. /* no longer needed.

* Compute Time Diff variables.

compute timediff_days = datediff(Contact_Date_Depression, Contact_Date_BMI,
"days").
compute timediff_months = datediff(Contact_Date_Depression,
Contact_Date_BMI, "months").
list.

OUTPUT:

ID Contact_Date_BMI    BMI Contact_Date_Depression Depression_Score
timediff_days timediff_months

    2    13-DEC-2011    23.88       01-MAY-2012                24
140.00           4.00
    2    12-MAR-2012    27.35       01-MAY-2012                24
50.00           1.00
    3    08-NOV-2010    29.85       11-AUG-2010                 8
-89.00          -2.00
    3    07-DEC-2010    28.14       11-AUG-2010                 8
-118.00          -3.00

Number of cases read:  4    Number of cases listed:  4

This output differs from what you showed in that the rows with no BMI score
are deleted.  I don't see the point in including them, since the Depression
score found on those rows is recorded on all the other rows too.

HTH.



Michael, Paul G. wrote

>
> Hi Bruce and David,
>
> First, thank you so much for trying to help and answer my questions.
> Second, I apologize for not posting an example of the data file so I will
> do so now.
>
> The merged data file looks like this...
>
> ID     Contact_Date_BMI     BMI      Contact_Date_Depression
> Depression_Score
> 1       22-Dec-2011             31.68     .
> .
> 1       01-Jan-2012              27.65     .
> .
> 2       13-Dec-2011             23.88     .
> .
> 2       12-Mar-2012             27.35
> 2       .                               .            01-May-2012
> 24
> 3       08-Nov-2010             29.85     .
> .
> 3       07-Dec-2010             28.14     .
> .
> 3       .                               .             11-Aug-2010
> 8
>
> So I would like to find a way to delete all of subject #1's data, since
> there is only BMI data and no depression data. I would like to keep all of
> the data for subject #2 & 3.
>
> Something I did not mention in my earlier email, and to complicate things
> further, I would like to create a new variable that represents the
> difference in days and/or months between each BMI contact date and the
> depression score contact date. Then for each subject I would like to
> inlcude the depression contact date and score on each row. Ideally the new
> data file would look something like this...
>
> ID     Contact_Date_BMI     BMI     Contact_Date_Depression
> Depression_Score    Time_Diff_Months
> 2       13-Dec-2011             23.88    01-May-2012
> 24                           5
> 2       12-Mar-2012             27.35    01-May-2012
> 24                           1
> 2       .                               .           01-May-2012
> 24
> 3       08-Nov-2010             29.85    11-Aug-2010
> 8                            4
> 3       07-Dec-2010             28.14    11-Aug-2010
> 8                            5
> 3       .                               .           11-Aug-2010
> 8
>
> In my example I have one depression contact date that is before all BMI
> contact dates (subject #3) and one depression contact date that is after
> all BMI contact dates (subject #2). It is okay with me if the former
> returns a negative time difference (in months or days). I hope this all
> makes sense and the example data files help, but please let me know if I
> can clarify anything, and again thanks for all of your time and help.
>
> Best,
>
> Paul
>
> ______________________________________
> From: SPSSX(r) Discussion [SPSSX-L@.UGA] On Behalf Of Bruce Weaver
> [bruce.weaver@]
> Sent: Wednesday, March 28, 2012 5:54 PM
> To: SPSSX-L@.UGA
> Subject: Re: Database Management Help
>
> I *think* he wants one row for each unique ID x Date combination.  If he
> happens to have both BMI and Depression data for that row, both variables
> will have valid values.  On other rows (most of them), only one of the two
> variables will have a valid score.  But yes...it is ESPeculation on my
> part.
> ;-)
>
>
>
> David Marso wrote
>>
>> Impossible to tell *What* Paul wants as the endgame.  My basic claim is
>> that it will be much easier to "remove all the BMI data for this subject
>> since they have no depression data."  If the depression file is
>> -denormalized/flattened/casetovar'd- and used as a TABLE into the 'BMI
>> data'.  NUKE the unassociated cases then VARSTOCASES if you want to use a
>> hammer or some flavor of VECTOR/LOOP/XSAVE if one is into tweezers.
>> Since
>> we are ESPssPeculating at this point I will refrain from any further
>> guessing pending further reply.  ADD w /INs ->FLAG followed by AGGREGATE
>> -MODE ADDVAR MAX(FLAG) will also work but what should be associated with
>> what? ALL BMI wth ALL depression?  Most recent? Before? After? BOTH?
>> Maybe start with ALL<-> ALL and then let the elves sort the fairy dust
>> after the fact ;-).
>>
>> "I understood this to mean that Paul wants the final file to have
>> multiple
>> rows per ID, not one row per ID.  David's method results in the latter, I
>> think."
>>
>>
>>
>> Bruce Weaver wrote
>>>
>>> Paul (the OP) wrote:
>>>
>>> "For example I might have one subject whose BMI was collected 10 times
>>> on
>>> different dates and none of these dates match the depression data for
>>> this subject. So after the merge I have eleven rows for this subject
>>> that
>>> I would like to preserve (all BMI data and depression data). Another
>>> subject may have BMI scores collected at 7 different dates, but this
>>> subject has no depression data. I would like to remove all the BMI data
>>> for this subject since they have no depression data."
>>>
>>> I understood this to mean that Paul wants the final file to have
>>> multiple
>>> rows per ID, not one row per ID.  David's method results in the latter,
>>> I
>>> think.
>>>
>>> This illustrates once again how helpful it is to post small examples
>>> showing what the file looks like originally, and what you want it to
>>> look
>>> like afterward!  ;-)
>>>
>>>
>>>
>>>
>>> David Marso wrote
>>>>
>>>> Quick and dirty would be to do
>>>> 1. CASESTOVARS on both files (use different varnames for the dates in
>>>> the 2 files).
>>>> 2. Simple 1:1 Match at the point.
>>>> 3. Let the devil sort it out later with some basic logic after nuking
>>>> the obvious crap.
>>>> --
>>>>
>>>>
>>>> Michael, Paul G. wrote
>>>>>
>>>>> Hi All,
>>>>>
>>>>> I have two data sets that I would like to merge using ID as the keyed
>>>>> variable, and each data file has duplicates IDs. The variables of
>>>>> interest in the first dataset are ID, date of contact (DD-MM-YYYY),
>>>>> and
>>>>> Body Mass Index (BMI). The variables of interest in the second data
>>>>> set
>>>>> are ID, date of contact (DD-MM-YYYY), and depression score. The
>>>>> duplicate IDs occur because some subjects had more than 1 BMI score
>>>>> from different time points and/or more than 1 depression score at
>>>>> different time points.
>>>>>
>>>>> The contact dates from each file do not match up in all instances (in
>>>>> fact very few subjects have the same contact date in both files). I
>>>>> only want to keep subjects who have both BMI data and depression data
>>>>> but I need to preserve information from all the contact dates.
>>>>>
>>>>> When I try a simple merge by adding variables (e.g., depression score
>>>>> and contact date) to the BMI data set using ID as the keyed variable,
>>>>> I
>>>>> run into the problem of having to go through thousands of cases and
>>>>> delete those that don't have both BMI and depression data.
>>>>>
>>>>> For example I might have one subject whose BMI was collected 10 times
>>>>> on different dates and none of these dates match the depression data
>>>>> for this subject. So after the merge I have eleven rows for this
>>>>> subject that I would like to preserve (all BMI data and depression
>>>>> data). Another subject may have BMI scores collected at 7 different
>>>>> dates, but this subject has no depression data. I would like to remove
>>>>> all the BMI data for this subject since they have no depression data.
>>>>>
>>>>> Is there a way in which I can merge these files in a different way to
>>>>> get what I need or a method to delete cases in the merged data file
>>>>> based on duplicate IDs and BMI data but no depression data? Any help
>>>>> is
>>>>> greatly appreciated!
>>>>>
>>>>> Best,
>>>>>
>>>>> Paul
>>>>>
>>>>> =====================
>>>>> To manage your subscription to SPSSX-L, send a message to
>>>>> LISTSERV@.UGA (not to SPSSX-L), with no body text except the
>>>>> command. To leave the list, send the command
>>>>> SIGNOFF SPSSX-L
>>>>> For a list of commands to manage subscriptions, send the command
>>>>> INFO REFCARD
>>>>>
>>>>
>>>
>>
>
>
> -----
> --
> Bruce Weaver
> bweaver@
> http://sites.google.com/a/lakeheadu.ca/bweaver/
>
> "When all else fails, RTFM."
>
> NOTE: My Hotmail account is not monitored regularly.
> To send me an e-mail, please use the address shown above.
>
> --
> View this message in context:
> http://spssx-discussion.1045642.n5.nabble.com/Database-Management-Help-tp5601508p5602258.html
> Sent from the SPSSX Discussion mailing list archive at Nabble.com.
>
> =====================
> To manage your subscription to SPSSX-L, send a message to
> LISTSERV@.UGA (not to SPSSX-L), with no body text except the
> command. To leave the list, send the command
> SIGNOFF SPSSX-L
> For a list of commands to manage subscriptions, send the command
> INFO REFCARD
>
> =====================
> To manage your subscription to SPSSX-L, send a message to
> LISTSERV@.UGA (not to SPSSX-L), with no body text except the
> command. To leave the list, send the command
> SIGNOFF SPSSX-L
> For a list of commands to manage subscriptions, send the command
> INFO REFCARD
>


-----
--
Bruce Weaver
[hidden email]
http://sites.google.com/a/lakeheadu.ca/bweaver/

"When all else fails, RTFM."

NOTE: My Hotmail account is not monitored regularly.
To send me an e-mail, please use the address shown above.

--
View this message in context: http://spssx-discussion.1045642.n5.nabble.com/Database-Management-Help-tp5601508p5604394.html
Sent from the SPSSX Discussion mailing list archive at Nabble.com.

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD
Reply | Threaded
Open this post in threaded view
|

Re: Database Management Help

King Douglas
Well done, Bruce.

King Douglas
American Airlines Customer Research

-----Original Message-----
From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of
Michael, Paul G.
Sent: Friday, March 30, 2012 10:42 AM
To: [hidden email]
Subject: Re: Database Management Help

Hi Bruce,

This example worked great. Thank you so much for all of your time and help,
I really appreciate it.

Best,

Paul


________________________________________
From: SPSSX(r) Discussion [[hidden email]] On Behalf Of Bruce
Weaver [[hidden email]]
Sent: Thursday, March 29, 2012 9:51 AM
To: [hidden email]
Subject: Re: Database Management Help

Does this example do what you want?  It assumes that each ID will have *at
most* one Depression score.

new file.
dataset close all.

* Create two data sets to illustrate.

data list list / ID (f5.0) Contact_Date_BMI(date11) BMI(f5.2) .
begin data
1       22-Dec-2011             31.68
1       01-Jan-2012              27.65
2       13-Dec-2011             23.88
2       12-Mar-2012             27.35
3       08-Nov-2010             29.85
3       07-Dec-2010             28.14
end data.
dataset name BMI.

data list list / ID (f5.0) Contact_Date_Depression (date11) Depression_Score
(f5.0).
begin data
2  01-May-2012   24
3  11-Aug-2010      8
end data.
dataset name Dep.

**** The important stuff starts here **** .
match files
 file = 'BMI' / in = BMIflag /
 table = 'Dep' / in = Dflag /
 by ID.
execute.
dataset name merged.
dataset activate merged.
dataset close all.

* Keep only records where BMIflag and Dflag both equal 1.

select if BMIflag and Dflag.
execute.
delete variables BMIflag Dflag. /* no longer needed.

* Compute Time Diff variables.

compute timediff_days = datediff(Contact_Date_Depression, Contact_Date_BMI,
"days").
compute timediff_months = datediff(Contact_Date_Depression,
Contact_Date_BMI, "months").
list.

OUTPUT:

ID Contact_Date_BMI    BMI Contact_Date_Depression Depression_Score
timediff_days timediff_months

    2    13-DEC-2011    23.88       01-MAY-2012                24
140.00           4.00
    2    12-MAR-2012    27.35       01-MAY-2012                24
50.00           1.00
    3    08-NOV-2010    29.85       11-AUG-2010                 8
-89.00          -2.00
    3    07-DEC-2010    28.14       11-AUG-2010                 8
-118.00          -3.00

Number of cases read:  4    Number of cases listed:  4

This output differs from what you showed in that the rows with no BMI score
are deleted.  I don't see the point in including them, since the Depression
score found on those rows is recorded on all the other rows too.

HTH.



Michael, Paul G. wrote

>
> Hi Bruce and David,
>
> First, thank you so much for trying to help and answer my questions.
> Second, I apologize for not posting an example of the data file so I will
> do so now.
>
> The merged data file looks like this...
>
> ID     Contact_Date_BMI     BMI      Contact_Date_Depression
> Depression_Score
> 1       22-Dec-2011             31.68     .
> .
> 1       01-Jan-2012              27.65     .
> .
> 2       13-Dec-2011             23.88     .
> .
> 2       12-Mar-2012             27.35
> 2       .                               .            01-May-2012
> 24
> 3       08-Nov-2010             29.85     .
> .
> 3       07-Dec-2010             28.14     .
> .
> 3       .                               .             11-Aug-2010
> 8
>
> So I would like to find a way to delete all of subject #1's data, since
> there is only BMI data and no depression data. I would like to keep all of
> the data for subject #2 & 3.
>
> Something I did not mention in my earlier email, and to complicate things
> further, I would like to create a new variable that represents the
> difference in days and/or months between each BMI contact date and the
> depression score contact date. Then for each subject I would like to
> inlcude the depression contact date and score on each row. Ideally the new
> data file would look something like this...
>
> ID     Contact_Date_BMI     BMI     Contact_Date_Depression
> Depression_Score    Time_Diff_Months
> 2       13-Dec-2011             23.88    01-May-2012
> 24                           5
> 2       12-Mar-2012             27.35    01-May-2012
> 24                           1
> 2       .                               .           01-May-2012
> 24
> 3       08-Nov-2010             29.85    11-Aug-2010
> 8                            4
> 3       07-Dec-2010             28.14    11-Aug-2010
> 8                            5
> 3       .                               .           11-Aug-2010
> 8
>
> In my example I have one depression contact date that is before all BMI
> contact dates (subject #3) and one depression contact date that is after
> all BMI contact dates (subject #2). It is okay with me if the former
> returns a negative time difference (in months or days). I hope this all
> makes sense and the example data files help, but please let me know if I
> can clarify anything, and again thanks for all of your time and help.
>
> Best,
>
> Paul
>
> ______________________________________
> From: SPSSX(r) Discussion [SPSSX-L@.UGA] On Behalf Of Bruce Weaver
> [bruce.weaver@]
> Sent: Wednesday, March 28, 2012 5:54 PM
> To: SPSSX-L@.UGA
> Subject: Re: Database Management Help
>
> I *think* he wants one row for each unique ID x Date combination.  If he
> happens to have both BMI and Depression data for that row, both variables
> will have valid values.  On other rows (most of them), only one of the two
> variables will have a valid score.  But yes...it is ESPeculation on my
> part.
> ;-)
>
>
>
> David Marso wrote
>>
>> Impossible to tell *What* Paul wants as the endgame.  My basic claim is
>> that it will be much easier to "remove all the BMI data for this subject
>> since they have no depression data."  If the depression file is
>> -denormalized/flattened/casetovar'd- and used as a TABLE into the 'BMI
>> data'.  NUKE the unassociated cases then VARSTOCASES if you want to use a
>> hammer or some flavor of VECTOR/LOOP/XSAVE if one is into tweezers.
>> Since
>> we are ESPssPeculating at this point I will refrain from any further
>> guessing pending further reply.  ADD w /INs ->FLAG followed by AGGREGATE
>> -MODE ADDVAR MAX(FLAG) will also work but what should be associated with
>> what? ALL BMI wth ALL depression?  Most recent? Before? After? BOTH?
>> Maybe start with ALL<-> ALL and then let the elves sort the fairy dust
>> after the fact ;-).
>>
>> "I understood this to mean that Paul wants the final file to have
>> multiple
>> rows per ID, not one row per ID.  David's method results in the latter, I
>> think."
>>
>>
>>
>> Bruce Weaver wrote
>>>
>>> Paul (the OP) wrote:
>>>
>>> "For example I might have one subject whose BMI was collected 10 times
>>> on
>>> different dates and none of these dates match the depression data for
>>> this subject. So after the merge I have eleven rows for this subject
>>> that
>>> I would like to preserve (all BMI data and depression data). Another
>>> subject may have BMI scores collected at 7 different dates, but this
>>> subject has no depression data. I would like to remove all the BMI data
>>> for this subject since they have no depression data."
>>>
>>> I understood this to mean that Paul wants the final file to have
>>> multiple
>>> rows per ID, not one row per ID.  David's method results in the latter,
>>> I
>>> think.
>>>
>>> This illustrates once again how helpful it is to post small examples
>>> showing what the file looks like originally, and what you want it to
>>> look
>>> like afterward!  ;-)
>>>
>>>
>>>
>>>
>>> David Marso wrote
>>>>
>>>> Quick and dirty would be to do
>>>> 1. CASESTOVARS on both files (use different varnames for the dates in
>>>> the 2 files).
>>>> 2. Simple 1:1 Match at the point.
>>>> 3. Let the devil sort it out later with some basic logic after nuking
>>>> the obvious crap.
>>>> --
>>>>
>>>>
>>>> Michael, Paul G. wrote
>>>>>
>>>>> Hi All,
>>>>>
>>>>> I have two data sets that I would like to merge using ID as the keyed
>>>>> variable, and each data file has duplicates IDs. The variables of
>>>>> interest in the first dataset are ID, date of contact (DD-MM-YYYY),
>>>>> and
>>>>> Body Mass Index (BMI). The variables of interest in the second data
>>>>> set
>>>>> are ID, date of contact (DD-MM-YYYY), and depression score. The
>>>>> duplicate IDs occur because some subjects had more than 1 BMI score
>>>>> from different time points and/or more than 1 depression score at
>>>>> different time points.
>>>>>
>>>>> The contact dates from each file do not match up in all instances (in
>>>>> fact very few subjects have the same contact date in both files). I
>>>>> only want to keep subjects who have both BMI data and depression data
>>>>> but I need to preserve information from all the contact dates.
>>>>>
>>>>> When I try a simple merge by adding variables (e.g., depression score
>>>>> and contact date) to the BMI data set using ID as the keyed variable,
>>>>> I
>>>>> run into the problem of having to go through thousands of cases and
>>>>> delete those that don't have both BMI and depression data.
>>>>>
>>>>> For example I might have one subject whose BMI was collected 10 times
>>>>> on different dates and none of these dates match the depression data
>>>>> for this subject. So after the merge I have eleven rows for this
>>>>> subject that I would like to preserve (all BMI data and depression
>>>>> data). Another subject may have BMI scores collected at 7 different
>>>>> dates, but this subject has no depression data. I would like to remove
>>>>> all the BMI data for this subject since they have no depression data.
>>>>>
>>>>> Is there a way in which I can merge these files in a different way to
>>>>> get what I need or a method to delete cases in the merged data file
>>>>> based on duplicate IDs and BMI data but no depression data? Any help
>>>>> is
>>>>> greatly appreciated!
>>>>>
>>>>> Best,
>>>>>
>>>>> Paul
>>>>>
>>>>> =====================
>>>>> To manage your subscription to SPSSX-L, send a message to
>>>>> LISTSERV@.UGA (not to SPSSX-L), with no body text except the
>>>>> command. To leave the list, send the command
>>>>> SIGNOFF SPSSX-L
>>>>> For a list of commands to manage subscriptions, send the command
>>>>> INFO REFCARD
>>>>>
>>>>
>>>
>>
>
>
> -----
> --
> Bruce Weaver
> bweaver@
> http://sites.google.com/a/lakeheadu.ca/bweaver/
>
> "When all else fails, RTFM."
>
> NOTE: My Hotmail account is not monitored regularly.
> To send me an e-mail, please use the address shown above.
>
> --
> View this message in context:
>
http://spssx-discussion.1045642.n5.nabble.com/Database-Management-Help-tp560
1508p5602258.html

> Sent from the SPSSX Discussion mailing list archive at Nabble.com.
>
> =====================
> To manage your subscription to SPSSX-L, send a message to
> LISTSERV@.UGA (not to SPSSX-L), with no body text except the
> command. To leave the list, send the command
> SIGNOFF SPSSX-L
> For a list of commands to manage subscriptions, send the command
> INFO REFCARD
>
> =====================
> To manage your subscription to SPSSX-L, send a message to
> LISTSERV@.UGA (not to SPSSX-L), with no body text except the
> command. To leave the list, send the command
> SIGNOFF SPSSX-L
> For a list of commands to manage subscriptions, send the command
> INFO REFCARD
>


-----
--
Bruce Weaver
[hidden email]
http://sites.google.com/a/lakeheadu.ca/bweaver/

"When all else fails, RTFM."

NOTE: My Hotmail account is not monitored regularly.
To send me an e-mail, please use the address shown above.

--
View this message in context:
http://spssx-discussion.1045642.n5.nabble.com/Database-Management-Help-tp560
1508p5604394.html
Sent from the SPSSX Discussion mailing list archive at Nabble.com.

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD