Help on the matches within a file (more q's)


Alina Sheyman-3
I've been battling with the following problem over the course of the past
week and I've had a lot of great help from the group, but I'm still having
more issues. It might be easier if I just explain in full what I am trying
to do instead of asking a lot of individual questions.

So here it goes. I have a file with three variables - class, start date,
year (either 1 or 2). I am trying to match observations from year 1 with obs
from year 2 on class and start date. I also need to be sure that no one is
sampled more than once. An additional caveat here is that my start date
doesn't have to be identical but rather in the range of plus/minus three
days of the other start date. Can anyone think of the best way to approach
this problem?

I was going to match observations on dates and class using the "lag"
function and then aggregate based on these matches. But now I've run into
a problem: I'm not sure that I can incorporate the range feature
into the lag function. I.e.,
     st-date       stcnt
     12-1          1
     10-2          2
     7-3           3
so in this example, if the next start date is 7-2, a new stcnt value of 4
should be returned; if not then no.

Does anyone have advice on how to best approach this and whether my thinking
is correct at all?


thank you!!!!

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD

Re: Help on the matches within a file (more q's)

Melissa Ives
Wouldn't CASESTOVARS do what you want?

* This keeps variables together by year (v1_1 v1_2 v2_1 v2_2).
CASESTOVARS
 /ID = class stdate
 /COUNT = stcnt
 /INDEX = year
 /SEPARATOR = "_"
 /GROUPBY = VARIABLE.

Melissa

Re: Help on the matches within a file (more q's)

Alina Sheyman-3
I've tried it with CASESTOVARS, and this way seems to be a better
approach.

On Wed, Oct 8, 2008 at 12:52 PM, Melissa Ives <[hidden email]> wrote:

> Wouldn't CASETOVARS do what you want?

Re: Help on the matches within a file (more q's)

Maguin, Eugene
Alina,

Some things are still not clear from your explanation. You say

>> ... I have a file with three variables - class, start date,
>> year (either 1 or 2). I am trying to match observations from year 1 with obs
>> from year 2 on class and start date.  I also need to be sure that no one is
>> sampled more than once. An additional caveat here is that my start date
>> doesn't have to be identical but rather in the range of plus/minus three
>> days of the other start date. Can anyone think of the best way to approach
>> this problem?

You say '... match observations from year 1 with obs from year 2 on class
and start date.' That implies to me the presence of two files, a year 1 file
and a year 2 file. I also remember that you have been working on this
problem for a while and I think I remember your initial presentation
describing two files. Regardless of whether you have one file or two, your
problem won't be so easy to solve because of the need to account for the
date range. So,
1) Do you have one file or two?

2) Does either file have duplicate records based on the three variables:
class, start date, year?

3) What does this phrase mean '... no one is sampled more than once.'? Which
file does it apply to?

4) Not relevant, but I'm curious: how big are file 1 and file 2?

Gene Maguin

Re: Help on the matches within a file (more q's)

Richard Ristow
At 12:34 PM 10/8/2008, Alina Sheyman wrote:

>It might be easier if I just explain in full what I am trying to do
>instead of asking a lot of individual questions.

That's a good idea. If you ask how to implement an approach that you
think will solve your problem, but do not explain the problem, we
can't tell if there's a simpler approach altogether; or worse, that
the approach you were thinking of was wrong.

You'll notice the trouble we've had, trying to solve your problem
without having it explained. I count 19 responses to your queries, in
four threads(1), and it's not solved yet. (Ouch!) That's much more
work than the problem should take, if clearly laid out.

Anyway, you write,

>I have a file with three variables - class, start date, year (either 1 or 2).

Here's a point that you didn't clarify: 'start date' appears to
be >month and year only<, not a full date.

>I am trying to match observations from year 1 with obs from year 2
>on  class and start date.  My start date doesn't have to be
>identical but rather in the range of plus/minus three days of the
>other start date.
>Ie.
>      st-date       stcnt
>      12-1          1
>      10-2          2
>      7-3            3
>
>so in this example if the next start date is 7-2, a new stcnt value
>of 4 should be returned, if not then no .

This still isn't clear. You write, "I have three variables - class,
start date, year". However, your example doesn't include 'class' or
'year', so we can't see how they enter into the calculation. And you
write, "if the next start date is 7-2, a new stcnt value of 4 should
be returned, if not then no"; in that case, what >should< the value
of 'stcnt' be?

Can you give a more extended example, including all variables, and
illustrating all the circumstances that interest you, giving the
calculated results you want?

-Best of luck,
  Richard
...........................................
(1)
'Working with multiple files without merge'    -  2 responses
'"do repeat" code'                             - 12 responses
'lag function'                                 -  3 responses
'Help on the matches within a file (more q's)' -  2 responses

Re: Help on the matches within a file (more q's)

Alina Sheyman-3
On Thu, Oct 9, 2008 at 12:26 AM, Richard Ristow <[hidden email]>wrote:

> At 12:34 PM 10/8/2008, Alina Sheyman wrote:
>
>  It might be easier if I just explain in full what I am trying to do
>> instead of asking a lot of individual questions.
>>
>
> That's a good idea. If you ask how to implement an approach that you think
> will solve your problem, but do not explain the problem, we can't tell if
> there's a simpler approach altogether; or worse, that the approach you were
> thinking of was wrong.
>
> You'll notice the trouble we've had, trying to solve your problem without
> having it explained. I count 19 responses to your queries, in four
> threads(1), and it's not solved yet. (Ouch!) That's much more work than the
> problem should take, if clearly laid out.
>
> Anyway, you write,
>
>  I have a file with three variables - class, start date, year (either 1 or
>> 2).
>>
>
> Here's a point that you didn't clarify: 'start date' appears to be >month
> and year only<, not a full date.
>
Start_date is in fact a real date, such as 06/12/2007.



>
>  I am trying to match observations from year 1 with obs from year 2 on
>>  class and start date.  My start date doesn't have to be identical but
>> rather in the range of plus/minus three days of the other start date.
>> Ie.
>>     st-date       stcnt
>>     12-1          1
>>     10-2          2
>>     7-3            3
>>
>> so in this example if the next start date is 7-2, a new stcnt value of 4
>> should be returned, if not then no .
>>
>
> This still isn't clear. You write, "I have three variables - class, start
> date, year". However, your example doesn't include 'class' or 'year', so we
> can't see how they enter into the calculation. And you write, "if the next
> start date is 7-2, a new stcnt value of 4 should be returned, if not then
> no"; in that case, what >should< the value of 'stcnt' be?

Class      City        Start_date    Year
Science    Boston      2008/08/19    2
Science    Boston      2007/08/17    1
Science    New York    2007/08/17    1
Science    Boston      2007/09/18    1
English    Chicago     2007/09/18    1
English    Chicago     2007/04/02    1
English    Chicago     2008/04/01    2

Actually, the real problem is slightly more complicated. There are in
fact not three but four variables (I was trying to make this a bit simpler):
class, city, start_date, year.

For two classes to be a match, they have to be held in the same city, and
the year 2 (2008) start_date has to be within plus/minus 0-3 days of the
original (2007) start date.

I want to find all the matches and then create two files: one for classes
with matches and one for classes without.

Here, the first two instances of the Science class should be a match, since
the 2008 start_date is within three days of the 2007 start date and both
are held in Boston. The same goes for the last two instances of English.
The first instance of English and the last instance of Science have
no match.
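
The matching rule described above can be sketched outside SPSS. What follows is a minimal Python illustration, not SPSS syntax and not anyone's posted solution. It assumes that "plus/minus three days" is measured after moving the year-1 date forward one calendar year, and that each record may be used in at most one pair. The names split_matches, advance_one_year and WINDOW_DAYS are made up for the example.

```python
from datetime import date

# The seven sample records: (class, city, start_date, year).
records = [
    ("Science", "Boston",   date(2008, 8, 19), 2),
    ("Science", "Boston",   date(2007, 8, 17), 1),
    ("Science", "New York", date(2007, 8, 17), 1),
    ("Science", "Boston",   date(2007, 9, 18), 1),
    ("English", "Chicago",  date(2007, 9, 18), 1),
    ("English", "Chicago",  date(2007, 4, 2),  1),
    ("English", "Chicago",  date(2008, 4, 1),  2),
]

WINDOW_DAYS = 3  # plus/minus tolerance taken from the post


def advance_one_year(d):
    """Move a year-1 date into the following calendar year.

    Assumption: Feb 29 is mapped to Feb 28, because the thread does not
    say how leap days should be treated.
    """
    try:
        return d.replace(year=d.year + 1)
    except ValueError:
        return d.replace(year=d.year + 1, day=28)


def split_matches(rows):
    """Pair each year-2 row with at most one unused year-1 row."""
    used = set()   # indexes of rows that already belong to a pair
    pairs = []
    for j, r2 in enumerate(rows):
        if r2[3] != 2:
            continue
        for i, r1 in enumerate(rows):
            # Candidate must be an unused year-1 row with same class and city.
            if r1[3] != 1 or i in used or r1[:2] != r2[:2]:
                continue
            gap = abs((r2[2] - advance_one_year(r1[2])).days)
            if gap <= WINDOW_DAYS:
                used.update((i, j))
                pairs.append((r1, r2))
                break
    unmatched = [r for k, r in enumerate(rows) if k not in used]
    return pairs, unmatched


pairs, unmatched = split_matches(records)
print(len(pairs), "pairs;", len(unmatched), "unmatched records")
```

The first-fit pairing is a simplification: when several year-1 classes fall inside the same window, which one gets paired depends on row order, so a real solution may need an explicit tie-breaking rule.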


Re: Help on the matches within a file (more q's)

Maguin, Eugene
Alina,

I don't mean to be rude because I'm sure you worked long and hard on this
and with great frustration but you did yourself a real disservice by trying
to go easy on us all and not making a better presentation. There are
different ways to do this. And since you are skilled with syntax, I indicate
in a narrative how I'd work the problem.

Get your file sorted correctly--class, city and start_date. Year doesn't
matter since you have an actual date.

Set up a comparison of adjacent records on class and city using lag. If
class and city don't match, date doesn't matter. You'll need a variable to
keep track of the match status. Actually, I'll write this.  Also, note that
I assume that there are never, ever going to be matches on adjacent pairs of
records having the same year. If there are, well, that's a different problem
(but solvable using similar methods).

Compute match=0.
Do if (class and city match previous record).
+  compute #date=lag(start_date).
+  if (abs(datediff(#date,start_date,"days")) le 3) match=1.
End if.

So now you have all records marked as either match=0 or match=1. Thing is,
some of those match=0 records ought to be match=1 records. So re-sort the
file on class and city in ascending order and match in descending order.
Then:

Do if (class and city match previous record).
+  compute match=1.
End if.

That's it. One file is match=1; the other is match=0.

Gene Maguin

Re: Help on the matches within a file (more q's)

Richard Ristow
Thank you, Gene -- At 12:05 PM 10/9/2008, Gene Maguin wrote:

>... you did yourself a real disservice by trying to go easy on us
>all and not making a better presentation.
>[...]
>Get your file sorted correctly--class, city and start_date. Year
>doesn't matter since you have an actual date. [some instructions omitted]

One modest correction. You have:

>Compute match=0.
>Do if (class and city match previous record).
>+  compute #date=lag(start_date).
>+  if (abs(datediff(#date,start_date,"days")) le 3) match=1.
>End if.

What's desired, I think, is match within three calendar days, of
dates in subsequent years. So the earlier date has to be advanced a
year before the comparison, perhaps with logic like (untested),

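Compute match=0.
* Compare with the previous record only when class and city are the same.
Do if (class = lag(class) and city = lag(city)).
+  compute #date=lag(start_date).
* Advance the earlier date by one year before comparing.
+  COMPUTE #AdvDate =
            DATE.MDY(XDATE.MONTH(#date),
                     XDATE.MDAY (#date),
                     XDATE.YEAR (#date)+1).
+  if (abs(datediff(#AdvDate,start_date,"days")) le 3) match=1.
End if.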
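
To see why the year has to be advanced, here is the arithmetic for the Science/Boston pair from Alina's example, worked in Python rather than SPSS as a quick check. The variable names are illustrative only, and leap-day handling is deliberately left out.

```python
from datetime import date

earlier = date(2007, 8, 17)   # year-1 start date from the thread's example
later = date(2008, 8, 19)     # year-2 start date from the same example

# Gap between the raw dates: roughly a year, so a 3-day test can never pass.
raw_gap = abs((later - earlier).days)

# Advance the earlier date by one calendar year, as the corrected logic does.
# (A Feb 29 start date would need special handling; not covered here.)
advanced = earlier.replace(year=earlier.year + 1)
adjusted_gap = abs((later - advanced).days)

print(raw_gap, adjusted_gap)
```

Without the adjustment the gap is 368 days; with it, 2 days, which falls inside the three-day window.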

Re: Help on the matches within a file (more q's)

Maguin, Eugene
Richard,

Thank you! That's an extremely important correction. The operation would flop
without that change.

Gene Maguin


Re: Help on the matches within a file (more q's)

Alina Sheyman-3
Gene, thank you for your help.
I've tried out your suggested code and it's returning all matches as 0.
I've adjusted the code a little by introducing an additional variable for the
year (1 or 2) and then changing the year in the date variable to 2007 for all
of the observations. This way I get around Richard's comment about
the match having to be in subsequent years.
I've also run the original code you suggested (without this adjustment),
to the same effect: all matches were returned as 0. Any ideas why this
is happening?

This is the code that I ran

SORT CASES BY program(A) center(A) start_date(A) year(A).

Compute match=0 .
Do if  program=LAG(program) and center=LAG(center) and year=LAG(year)+1.
   compute #date=lag(start_date).
Else if (abs(datediff(#date,start_date,"days"))le 3).
  compute match =1.
End if.

Re: Help on the matches within a file (more q's)

Maguin, Eugene
Alina,

The code I posted yesterday had an error in it--the error Richard pointed
out. Richard's revised posting corrects that error. I'll respond to your
reply point by point.

>>I've tried out your suggested code and it's returning all matches as 0.

Yes, it should have because I didn't take the fact that the dates were one
year apart into account.

>>I've adjusted the code a little by introducing an additional variable for a
>>year - 1 or 2, and then changing the year in the date variable for all of
>>the observations to 2007. This way I get around Richard's comments on the
>>the match having to be in subsequent years.

I'd like to back up a bit before answering this point. I'm concerned about
whether you could have multiple records with the same values for class, city
and year=1 or 2, but with different start dates. I don't know who you work
for (and I'm not asking you to tell me) so it's hard to guess what the
activity was that generated the data.

If you crosstab your data by class, city and year (class by city by year),
what is the largest number of records in any one cell of the table?

As I look at your posting yesterday, you may have already told us, but I'd
like a data-based answer. This is what you posted yesterday (I've eliminated
your markup to make my point clearer, and I've sorted it by class, city, year).

Science  Boston    2007/08/17  1  (a)
Science  Boston    2007/09/18  1
Science  Boston    2008/08/19  2  (a)
Science  New York  2007/08/17  1
English  Chicago   2007/04/02  1  (b)
English  Chicago   2007/09/18  1
English  Chicago   2008/04/01  2  (b)

If you sort by class, city, and year, the result listed above is a possible
outcome, and if so, my corrected posting won't work either because the cases
that should be adjacent to each other [i.e., (a)'s and (b)'s] are not. The
crosstab on this snippet of data will show that cell for Science-Boston-1
has two cases. That is a fatal error for the corrected code.
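
The cell counts for this snippet can be checked with a few lines of code. The following is a small Python stand-in for the crosstab, not SPSS syntax; it simply counts records per (class, city, year) cell, and the variable names are made up for the illustration.

```python
from collections import Counter

# The seven records from the listing above: (class, city, start_date, year).
rows = [
    ("Science", "Boston",   "2007/08/17", 1),
    ("Science", "Boston",   "2007/09/18", 1),
    ("Science", "Boston",   "2008/08/19", 2),
    ("Science", "New York", "2007/08/17", 1),
    ("English", "Chicago",  "2007/04/02", 1),
    ("English", "Chicago",  "2007/09/18", 1),
    ("English", "Chicago",  "2008/04/01", 2),
]

# One counter entry per class-by-city-by-year cell.
cell_counts = Counter((cls, city, year) for cls, city, _, year in rows)

largest_cell = max(cell_counts.values())
# Cells holding more than one record are the ones that break the
# "adjacent records" assumption.
crowded = sorted(cell for cell, n in cell_counts.items() if n > 1)

print(largest_cell, crowded)
```

On this snippet the largest cell holds two records, and both Science-Boston-1 and English-Chicago-1 are crowded.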

Let's go a step further. Were there multiple science classes in Boston that
started in August, 07 or 08, maybe ones that started in the same week in
August, 07 or in 08? Or, worst of all, on the same day in August, 07 or in
08?

You've posed what could be a very hard problem, and I don't feel that we
know your dataset well enough yet.

Gene Maguin

Re: Help on the matches within a file (more q's)

Richard Ristow
At 10:49 AM 10/10/2008, Alina Sheyman wrote:

>This is the code that I ran
>
>SORT CASES BY program(A) center(A) start_date(A) year(A).
>
>Compute match=0 .
>Do if  program=LAG(program) and center=LAG(center) and year=LAG(year)+1.
>    compute #date=lag(start_date).
>Else if (abs(datediff(#date,start_date,"days"))le 3).
>   compute match =1.
>End if.

Without stepping into Gene's work on this (goodness, this must be
about the 25th response to this query), I don't think you want that
"Else if". It means the second COMPUTE is never executed if the first
one is. You probably mean something more like,

Compute match=0 .
Do if     program=LAG(program)
       and center=LAG(center)
       and year=LAG(year)+1.
.  compute #date=lag(start_date).
.  IF(abs(datediff(#date,start_date,"days"))le 3) match =1.
End if.

Note replacement of "Else if/compute" with a simple IF.

Overall, this is complicated logic, and I dare say there are yet
other problems.
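
One way to look for such problems is to emulate the pass on the seven sample records posted earlier in the thread. The following is a rough Python emulation, not SPSS syntax. It assumes that LAG behaves like "the previous row after sorting" and that every start date has had its year forced to 2007, as Alina described. The names cases and flags are illustrative.

```python
from datetime import date

# (program, center, start_date with the year forced to 2007, year flag)
cases = [
    ("Science", "Boston",   date(2007, 8, 19), 2),
    ("Science", "Boston",   date(2007, 8, 17), 1),
    ("Science", "New York", date(2007, 8, 17), 1),
    ("Science", "Boston",   date(2007, 9, 18), 1),
    ("English", "Chicago",  date(2007, 9, 18), 1),
    ("English", "Chicago",  date(2007, 4, 2),  1),
    ("English", "Chicago",  date(2007, 4, 1),  2),
]

# Tuples sort by program, center, start_date, year, mirroring SORT CASES.
cases.sort()

flags = []
prev = None   # stands in for LAG(): the previous case, if any
for case in cases:
    match = 0
    if prev is not None:
        same_group = case[0] == prev[0] and case[1] == prev[1]
        # Mirrors: program=LAG(program) and center=LAG(center)
        #          and year=LAG(year)+1.
        if same_group and case[3] == prev[3] + 1:
            if abs((case[2] - prev[2]).days) <= 3:
                match = 1
    flags.append(match)
    prev = case

for case, flag in zip(cases, flags):
    print(flag, case[0], case[1], case[2].isoformat(), case[3])
```

Under these assumptions only the year-2 Science/Boston record is flagged. The English/Chicago pair is not: once the year is forced to 2007, the year-2 record (04/01) sorts ahead of the year-1 record (04/02), so year = LAG(year) + 1 is false for both. A test that accepts either order of the year flags might be worth trying.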
