|
I've been battling with the following problem over the past week, and I've had a lot of great help from the group, but I'm still having issues. It might be easier if I just explain in full what I am trying to do instead of asking a lot of individual questions.

So here it goes. I have a file with three variables - class, start date, year (either 1 or 2). I am trying to match observations from year 1 with obs from year 2 on class and start date. I also need to be sure that no one is sampled more than once. An additional caveat here is that my start date doesn't have to be identical but rather in the range of plus/minus three days of the other start date. Can anyone think of the best way to approach this problem?

I was going to match observations on dates and class using the "lag" function and then aggregate based on these matches. But now I've run into the problem where I'm not sure that I can incorporate the range feature into the lag function. I.e.

st-date stcnt
12-1    1
10-2    2
7-3     3

so in this example if the next start date is 7-2, a new stcnt value of 4 should be returned, if not then no.

Does anyone have advice on how to best approach this and whether my thinking is correct at all?

thank you!!!!

=====================
To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD |
|
Wouldn't CASESTOVARS do what you want?

*** this keeps variables together by year (v1_1 v1_2 v2_1 v2_2).
CASESTOVARS
 /ID = class stdate
 /COUNT = stcnt
 /INDEX = year
 /SEPARATOR = "_"
 /GROUPBY = VARIABLE .

Melissa

-----Original Message-----
From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of Alina Sheyman
Sent: Wednesday, October 08, 2008 11:34 AM
To: [hidden email]
Subject: [SPSSX-L] Help on the matches within a file (more q's) |
|
I've tried it with CASESTOVARS, and this way seems to be a better approach.

On Wed, Oct 8, 2008 at 12:52 PM, Melissa Ives <[hidden email]> wrote:
> Wouldn't CASESTOVARS do what you want? |
|
In reply to this post by Alina Sheyman-3
Alina,

Some things are still not clear from your explanation. You say

>> ... I have a file with three variables - class, start date, year (either 1 or 2). I am trying to match observations from year 1 with obs from year 2 on class and start date. I also need to be sure that no one is sampled more than once. An additional caveat here is that my start date doesn't have to be identical but rather in the range of plus/minus three days of the other start date. Can anyone think of the best way to approach this problem?

You say '... match observations from year 1 with obs from year 2 on class and start date.' That implies to me the presence of two files, a year 1 file and a year 2 file. I also remember that you have been working on this problem for a while, and I think I remember your initial presentation describing two files. Regardless of whether you have one file or two, your problem won't be so easy to solve because of the need to account for the date range.

So,
1) Do you have one file or two?
2) Does either file have duplicate records based on the three variables: class, start date, year?
3) What does this phrase mean: '... no one is sampled more than once'? Which file does it apply to?
4) Not relevant, but I'm curious: how big are file 1 and file 2?

Gene Maguin |
|
In reply to this post by Alina Sheyman-3
At 12:34 PM 10/8/2008, Alina Sheyman wrote:
>It might be easier if I just explain in full what I am trying to do
>instead of asking a lot of individual questions.

That's a good idea. If you ask how to implement an approach that you think will solve your problem, but do not explain the problem, we can't tell if there's a simpler approach altogether; or worse, that the approach you were thinking of was wrong.

You'll notice the trouble we've had, trying to solve your problem without having it explained. I count 19 responses to your queries, in four threads(1), and it's not solved yet. (Ouch!) That's much more work than the problem should take, if clearly laid out.

Anyway, you write,

>I have a file with three variables - class, start date, year (either 1 or 2).

Here's a point that you didn't clarify: 'start date' appears to be >month and year only<, not a full date.

>I am trying to match observations from year 1 with obs from year 2
>on class and start date. My start date doesn't have to be
>identical but rather in the range of plus/minus three days of the
>other start date.
>Ie.
> st-date stcnt
> 12-1 1
> 10-2 2
> 7-3 3
>
>so in this example if the next start date is 7-2, a new stcnt value
>of 4 should be returned, if not then no .

This still isn't clear. You write, "I have three variables - class, start date, year". However, your example doesn't include 'class' or 'year', so we can't see how they enter into the calculation. And you write, "if the next start date is 7-2, a new stcnt value of 4 should be returned, if not then no"; in that case, what >should< the value of 'stcnt' be?

Can you give a more extended example, including all variables, and illustrating all the circumstances that interest you, giving the calculated results you want?

-Best of luck,
 Richard
...........................................
(1) 'Working with multiple files without merge' - 2 responses
    '"do repeat" code' - 12 responses
    'lag function' - 3 responses
    'Help on the matches within a file (more q's)' - 2 responses |
|
On Thu, Oct 9, 2008 at 12:26 AM, Richard Ristow <[hidden email]> wrote:

> This still isn't clear. You write, "I have three variables - class, start
> date, year". However, your example doesn't include 'class' or 'year', so we
> can't see how they enter into the calculation. And you write, "if the next
> start date is 7-2, a new stcnt value of 4 should be returned, if not then
> no"; in that case, what >should< the value of 'stcnt' be?

Class    City      Start_date  Year
Science  Boston    2008/08/19  2
Science  Boston    2007/08/17  1
Science  New York  2007/08/17  1
Science  Boston    2007/09/18  1
English  Chicago   2007/09/18  1
English  Chicago   2007/04/02  1
English  Chicago   2008/04/01  2

Actually, the real problem is slightly more complicated. There are in fact not three but four variables (I was trying to make this a bit simpler) - class, city, start_date, year. For two classes to be a match, they have to be held in the same city, and their start_date in year 2 (in 2008) has to be in the range of plus/minus 0-3 days of the original (2007) start date. I want to find all the matches and then create two files - one for classes with matches and one for classes without.

Here, the first two instances of the Science class should be a match, since the 2008 start_date is within three days of the 2007 start date and they are both held in Boston. The same goes for the last two instances of English. The first instance of English and the last instance of Science have no match. |
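For readers who want to check the rule outside SPSS, here is a minimal Python sketch (not SPSS syntax) of the matching described above, run on the seven posted rows. The function names, the one-calendar-year shift applied to the 2007 date before comparing, and the "first unused candidate wins" pairing are all assumptions made for illustration, not something stated in the thread.

```python
from datetime import date

# The seven example rows from the post: (class, city, start_date, year).
ROWS = [
    ("Science", "Boston",   date(2008, 8, 19), 2),
    ("Science", "Boston",   date(2007, 8, 17), 1),
    ("Science", "New York", date(2007, 8, 17), 1),
    ("Science", "Boston",   date(2007, 9, 18), 1),
    ("English", "Chicago",  date(2007, 9, 18), 1),
    ("English", "Chicago",  date(2007, 4, 2),  1),
    ("English", "Chicago",  date(2008, 4, 1),  2),
]

def is_match(y1, y2, window=3):
    """Hypothetical rule: same class and city, and the year-2 date falls
    within `window` days of the year-1 date moved forward one year."""
    if y1[0] != y2[0] or y1[1] != y2[1]:
        return False
    # Assumption: no Feb 29 dates; replace() raises ValueError for those.
    shifted = y1[2].replace(year=y1[2].year + 1)
    return abs((y2[2] - shifted).days) <= window

def split_matches(rows):
    """Pair each year-2 row with at most one unused year-1 row."""
    used = set()  # indexes of rows already placed in a pair
    for i, y2 in enumerate(rows):
        if y2[3] != 2:
            continue
        for j, y1 in enumerate(rows):
            if y1[3] == 1 and j not in used and is_match(y1, y2):
                used.update((i, j))
                break  # each year-2 row takes only one partner
    matched = [r for k, r in enumerate(rows) if k in used]
    unmatched = [r for k, r in enumerate(rows) if k not in used]
    return matched, unmatched

matched, unmatched = split_matches(ROWS)
print(len(matched), len(unmatched))
```

On these rows the sketch pairs the two Boston Science classes and the two April English classes, leaving three classes unmatched (including the New York one). Taking the first unused candidate is the simplest way to keep any class from being used twice; picking the nearest date instead would be a different design choice.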
|
Alina,

I don't mean to be rude, because I'm sure you worked long and hard on this and with great frustration, but you did yourself a real disservice by trying to go easy on us all and not making a better presentation.

There are different ways to do this. And since you are skilled with syntax, I indicate in a narrative how I'd work the problem.

Get your file sorted correctly--class, city and start_date. Year doesn't matter since you have an actual date. Set up a comparison of adjacent records on class and city using lag. If class and city don't match, date doesn't matter. You'll need a variable to keep track of the match status. Actually, I'll write this. Also, note that I assume that there are never, ever going to be matches on adjacent pairs of records having the same year. If there are, well, that's a different problem (but solvable using similar methods).

Compute match=0.
Do if (class and city match previous record).
+ compute #date=lag(start_date).
+ if (abs(datediff(#date,start_date,"days")) le 3) match=1.
End if.

So now you have all records marked as either match=0 or match=1. Thing is, some of those match=0 records ought to be match=1 records. So re-sort the file on class and city in ascending order and match in descending order. Then

Do if (class and city match previous record).
+ compute match=1.
End if.

That's it. One file is match=1; the other is match=0.

Gene Maguin |
|
Thank you, Gene --

At 12:05 PM 10/9/2008, Gene Maguin wrote:

>... you did yourself a real disservice by trying to go easy on us
>all and not making a better presentation.
>[...]
>Get your file sorted correctly--class, city and start_date. Year
>doesn't matter since you have an actual date.

[some instructions omitted]

One modest correction. You have,

>Compute match=0.
>Do if (class and city match previous record).
>+ compute #date=lag(start_date).
>+ if (abs(datediff(#date,start_date,"days")) le 3) match=1.
>End if.

What's desired, I think, is a match within three calendar days, of dates in subsequent years. So the earlier date has to be advanced a year before the comparison, perhaps with logic like (untested),

Compute match=0.
Do if (class and city match previous record).
+ compute #date=lag(start_date).
+ COMPUTE #AdvDate = DATE.MDY(XDATE.MONTH(#date),
                              XDATE.MDAY(#date),
                              XDATE.YEAR(#date)+1).
+ if (abs(datediff(#AdvDate,start_date,"days")) le 3) match=1.
End if. |
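The effect of the one-year shift can be checked with a small Python sketch (an illustration only, not SPSS syntax). The helper name `advance_one_year` and its fallback from Feb 29 to Feb 28 are assumptions; the two dates are the Science/Boston pair from Alina's example.

```python
from datetime import date

def advance_one_year(d):
    """Move a date forward one calendar year.
    Assumption: Feb 29 falls back to Feb 28, because the following
    year has no Feb 29."""
    try:
        return d.replace(year=d.year + 1)
    except ValueError:
        return d.replace(year=d.year + 1, day=28)

year1 = date(2007, 8, 17)   # Science, Boston, year 1
year2 = date(2008, 8, 19)   # Science, Boston, year 2

# Gap in days without and with the one-year shift.
raw_gap = abs((year2 - year1).days)
shifted_gap = abs((year2 - advance_one_year(year1)).days)

print(raw_gap)      # 368
print(shifted_gap)  # 2
```

Without the shift the gap is about a year, so a "within 3 days" test can never succeed; with the shift the same pair is 2 days apart.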
|
Richard,

Thank you! That's an extremely important correction. The operation would flop without that change.

Gene Maguin

-----Original Message-----
From: Richard Ristow [mailto:[hidden email]]
Sent: Thursday, October 09, 2008 12:57 PM
To: Alina Sheyman; [hidden email]
Cc: Gene Maguin
Subject: Re: Help on the matches within a file (more q's) |
|
In reply to this post by Alina Sheyman-3
Gene, thank you for your help.

I've tried out your suggested code and it's returning all matches as 0. I've adjusted the code a little by introducing an additional variable for a year - 1 or 2 - and then changing the year in the date variable for all of the observations to 2007. This way I get around Richard's comments on the match having to be in subsequent years. I've run the original code you suggested (without this adjustment) as well, and to the same effect - all matches were returned as 0. Any ideas why this is happening?

This is the code that I ran:

SORT CASES BY program(A) center(A) start_date(A) year(A).

Compute match=0 .
Do if program=LAG(program) and center=LAG(center) and year=LAG(year)+1.
 compute #date=lag(start_date).
Else if (abs(datediff(#date,start_date,"days"))le 3).
 compute match =1.
End if. |
|
Alina,

The code I posted yesterday had an error in it--the error Richard pointed out. Richard's revised posting corrects that error. I'll respond to your reply point by point.

>>I've tried out your suggested code and it's returning all matches as 0.

Yes, it should have, because I didn't take into account the fact that the dates were one year apart.

>>I've adjusted the code a little by introducing an additional variable for a year - 1 or 2, and then changing the year in the date variable for all of the observations to 2007. This way I get around Richard's comments on the match having to be in subsequent years.

I'd like to back up a bit before answering this point. I'm concerned about whether you could have multiple records with the same values for class, city and year=1 or 2, but with different start dates. I don't know who you work for (and I'm not asking you to tell me), so it's hard to guess what the activity was that generated the data. If you crosstab your data by class, city and year (class by city by year), what is the largest number of records in any one cell of the table? As I look at your posting yesterday, you may have already told us, but I'd like a data-based answer.

This is what you posted yesterday. (I've eliminated your markup to make my point clearer. I've sorted it by class, city, year.)

Science  Boston    2007/08/17  1  (a)
Science  Boston    2007/09/18  1
Science  Boston    2008/08/19  2  (a)
Science  New York  2007/08/17  1
English  Chicago   2007/04/02  1  (b)
English  Chicago   2007/09/18  1
English  Chicago   2008/04/01  2  (b)

If you sort by class, city, and year, the result listed above is a possible outcome, and if so, my corrected posting won't work either, because the cases that should be adjacent to each other [i.e., the (a)'s and the (b)'s] are not. The crosstab on this snippet of data will show that the cell for Science-Boston-1 has two cases. That is a fatal error for the corrected code.

Let's go a step further. Were there multiple science classes in Boston that started in August, 07 or 08, maybe ones that started in the same week in August, 07 or in 08? Or, worst of all, on the same day in August, 07 or in 08?

You've posed what could be a very hard problem, and I don't feel that we know your dataset well enough yet.

Gene Maguin |
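The check described above can also be sketched outside SPSS. This Python snippet (illustrative only; the variable names are made up) counts the records in each class-by-city-by-year cell for the seven posted rows and lists the cells that hold more than one record.

```python
from collections import Counter

# (class, city, start_date, year) for the seven posted rows.
rows = [
    ("Science", "Boston",   "2007/08/17", 1),
    ("Science", "Boston",   "2007/09/18", 1),
    ("Science", "Boston",   "2008/08/19", 2),
    ("Science", "New York", "2007/08/17", 1),
    ("English", "Chicago",  "2007/04/02", 1),
    ("English", "Chicago",  "2007/09/18", 1),
    ("English", "Chicago",  "2008/04/01", 2),
]

# Count records per (class, city, year) cell, ignoring the start date.
cell_counts = Counter((cls, city, year) for cls, city, _start, year in rows)

# Cells with more than one record are where an adjacent-record
# comparison can end up comparing the wrong pair of cases.
crowded = {cell: n for cell, n in cell_counts.items() if n > 1}
print(crowded)
```

On this snippet two cells come up, Science/Boston/year 1 and English/Chicago/year 1, each with two records. Any such cell means a year-2 record has more than one candidate partner, so a comparison limited to the immediately preceding record is not enough.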
|
In reply to this post by Alina Sheyman-3
At 10:49 AM 10/10/2008, Alina Sheyman wrote:
>This is the code that I ran
>
>SORT CASES BY program(A) center(A) start_date(A) year(A).
>
>Compute match=0 .
>Do if program=LAG(program) and center=LAG(center) and year=LAG(year)+1.
> compute #date=lag(start_date).
>Else if (abs(datediff(#date,start_date,"days"))le 3).
> compute match =1.
>End if.

Without stepping into Gene's work on this (goodness, this must be about the 25th response to this query), I don't think you want that "Else if". It means the second COMPUTE is never executed if the first one is. You probably mean something more like,

Compute match=0 .
Do if program=LAG(program) and center=LAG(center) and year=LAG(year)+1.
. compute #date=lag(start_date).
. IF(abs(datediff(#date,start_date,"days"))le 3) match =1.
End if.

Note replacement of "Else if/compute" with a simple IF.

Overall, this is complicated logic, and I dare say there are yet other problems. |
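The same control-flow point can be seen in a Python analogue (an illustration, not SPSS syntax): an `elif` branch, like ELSE IF, is skipped whenever the first condition is true. The function names and the sample values are hypothetical.

```python
def match_with_else_if(same_group, day_gap):
    """Mirrors the posted structure: the date test sits in the else-if branch."""
    match = 0
    if same_group:
        pass                      # only the lagged date would be stored here
    elif abs(day_gap) <= 3:       # skipped whenever same_group is true
        match = 1
    return match

def match_with_nested_if(same_group, day_gap):
    """Mirrors the suggested fix: the date test runs inside the first branch."""
    match = 0
    if same_group:
        if abs(day_gap) <= 3:
            match = 1
    return match

print(match_with_else_if(True, 2))    # 0
print(match_with_nested_if(True, 2))  # 1
```

For a case in the same group with a 2-day gap, the else-if version leaves match at 0, while the nested version sets it to 1.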
