SPSSX Discussion

Random date generator


Hashmi, Syed S

Random date generator

Dear co-listers,

A dataset that I'm analyzing has a set of dates for events (start and stop
dates) as well as how long those events occurred for. The data for each
date is in three variables (month, day, year). The years are pretty
complete if they are filled in, but the month and day are sometimes
listed as the exact month or date and other times they're listed as
beginning, middle or end of the year (for the month variable) or the month
(for the day variable).

Thus, I have 7 vars (startm, startd, starty, stopm, stopd, stopy, duration)
from which I can deduce the start and stop dates (startdt, stopdt).
Unfortunately, I have the complete start and stop date for only about half
the cases. The rest are missing parts of one or both of the dates (e.g. the
day). If I have one of the dates and a duration, I can calculate the
other date.

The reason for this post is that there is a small subset of the population
where I have the complete stop date but am missing the start day (I have
the year and month) and am also missing the duration. I had to come up
with some way to impute a start date for these cases for analysis (which
will be done with and without these specific cases). I know that the event
could not be more than a month long. Therefore, what I was planning on
doing was based on the information I have, calculate the earliest possible
start date (e_startdt) up to a month before the stop date and then randomly
pick a date between e_startdt and the stop date.

Therefore, my query here is this: how can I code for this? I have an idea
of how to do it in SAS but since I'm working in SPSS that doesn't help
much. I'm assuming that it will be something simple like:

startdt = e_startdt + RANDOM_DAYS.

where RANDOM_DAYS is a random number of days between 0 and
DATEDIFF(stopdt, e_startdt, "days").

So how would I go about doing this? I tried using the help files and all
but couldn't come up with something that worked. Is this the best way to do
this? Any other way that I can do this? Does it matter what kind of seeding
I use for the random number generator?

Thanks.

- Shahrukh

Melissa Ives

Re: Random date generator

During the assessment process, interviewers are given these instructions
for estimating a date when a client cannot remember specific days or
months. Perhaps you could create a similar algorithm?

Date Guidelines (d/e): Use the following rules if the participant is
unsure of the exact date:
DAY: Use the 5th for the beginning of the month, 15th for the middle of
the month, and 25th for the end of the month.
MONTH: Use March for early in the year, July for middle of the year, and
October for later in the year, but try to make it so the number of weeks
is about right.

Melissa


Hashmi, Syed S

Re: Random date generator

Thanks Melissa,

I'm already doing something similar to what you said, using 1st, 10th
and 20th as the dates for start, middle and end of the month. The Month
is a bit trickier since my exposures are events during pregnancy so I
have to be careful about just assigning a random month lest it falls
outside the pregnancy duration.

The question I had asked concerned dates which had month and year but
no day information - not even beginning, middle or end. Therefore, I
couldn't even depend on the 1st-10th-20th coding.

My final solution for the problem where I had a start date and had a
stop month was to pick a random date between the start date (or the
first date of the stop month) and the last day of the stop month. Gene
Maguin had emailed me earlier and suggested I use the UNIFORM function
to randomly select a date. I don't think that message was posted on the
list-serv, so the body is copied below. I've used the function since
and it works nicely.

Thanks again for your help though. I agree that there should be some
sort of algorithm in place at the interviewer level to minimize the
frequency of incomplete data.

- S. Hashmi

*copy of email from Gene*

> -----Original Message-----
> From: Gene Maguin [mailto:[hidden email]]
> Sent: Friday, July 20, 2007 8:05 AM
> To: Hashmi, Syed S
> Subject: RE: Random date generator
>
> Syed,
>
> I'd like to be helpful to you but I don't have time to make up a full
> solution. I think this would be a valid example of your question.
>
> Start date (mm/dd/yyyy): 5/x/2004
> Stop date (mm/dd/yyyy): 6/17/2004
> Possible duration range (6/17/2004)-(5/31/2004)=17 days to
> (6/17/2004)-(5/18/2004)=30 days (I assume a 30 day month)
>
> So x has to be between 18 and 31 inclusive.
>
> So I think the trick to the random draw is this command.
>
> Compute x=uniform(14).
> Compute x=trunc(x).
>
> Check this but I'm pretty sure that the range of x will be 0 to 13.
> Your actual date is then
>
> Compute x=x+18.
>
> There's lots of big 'little bits' to tidy up but this will get you what
> you want when the tidying up has been done.
>
> Best wishes, Gene Maguin
>
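
A minimal sketch of how Gene's draw might be generalized in SPSS syntax,
assuming startm, starty, stopm, stopd and stopy are numeric and that an
event lasts at most 30 days. The helper variables first_d, last_d,
l_startdt, ndays and offset, and the seed value, are illustrative and not
taken from the thread. SPSS stores dates as seconds, hence the factor 86400.

```spss
* Sketch only: names other than those in the thread are illustrative.
SET SEED=20070720.
COMPUTE stopdt = DATE.MDY(stopm, stopd, stopy).
* First and last day of the known start month.
COMPUTE first_d = DATE.MDY(startm, 1, starty).
COMPUTE last_d = DATESUM(first_d, 1, "months") - 86400.
* Earliest and latest admissible start dates (SPSS dates are seconds).
COMPUTE e_startdt = MAX(first_d, stopdt - 30*86400).
COMPUTE l_startdt = MIN(last_d, stopdt).
* Number of candidate days, then a uniform integer offset from 0 to ndays-1.
COMPUTE ndays = DATEDIFF(l_startdt, e_startdt, "days") + 1.
COMPUTE offset = TRUNC(RV.UNIFORM(0, ndays)).
COMPUTE startdt = e_startdt + offset*86400.
FORMATS stopdt first_d last_d e_startdt l_startdt startdt (ADATE10).
EXECUTE.
```

For Gene's example (start 5/x/2004, stop 6/17/2004) this gives e_startdt =
5/18/2004, l_startdt = 5/31/2004 and ndays = 14, so startdt falls between
5/18 and 5/31. Setting a seed only makes the draw reproducible; it does not
make the imputed dates any more accurate.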

> -----Original Message-----
> From: Melissa Ives [mailto:[hidden email]]
> Sent: Wednesday, July 25, 2007 9:13 AM
> To: Hashmi, Syed S; [hidden email]
> Subject: RE: [SPSSX-L] Random date generator

Richard Ristow

Re: Random date generator

In reply to this post by Melissa Ives

Somehow I missed or deleted the original posting in this thread.
Anyway, on Thursday, July 19, 2007 9:48 PM Hashmi, Syed S asked,

>A dataset that I'm analyzing has a set of dates for events (start and
>stop dates) as well as how long those events occured for. The data
>for each date is in three variables (month, day, year). The years are
>pretty complete if they are filled in but the month and day might are
>sometimes listed as the exact month or date and other times they're
>listed as beginning, middle or end of the year (for the month
>variable) or the month (for the day variable).
>
>I have [two dates as three variables each, plus a duration].
>I have the complete start and stop date for about half the cases. The
>rest are missing either parts of one of the dates (eg. day) or for
>both. If I have one of the dates and a duration, I can calculate the
>other date.

So far, so good, though be careful about how precise your 'durations'
are.

>There is a small subset of the population where I have the complete
>stop date but am missing the start day (I have the year and month) and
>am also missing the duration. I had to come up with some way to
>impute a start date for these cases for analysis (which will be done
>with and without these specific cases). I know that the event could
>not be more than a month long. I was planning to calculate the earliest
>possible start date (e_startdt) up to a month before the stop date and
>then randomly pick a date between e_startdt and the stop date.

OUCH! I would not do this. Period.

*MAYBE* the start dates and durations you get this way will be vaguely
representative of the population of events, though I doubt it. Are your
durations roughly uniformly distributed from 0 to 30 days? For goodness
sake, you ought to check that before proceeding.
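
One way this check might be run in SPSS syntax, as a rough sketch that
assumes duration is a numeric variable measured in days. With no parameters
given, the K-S test uses the observed minimum and maximum as the limits of
the uniform distribution.

```spss
* Sketch only: assumes duration is numeric and measured in days.
* Histogram of the observed durations.
FREQUENCIES VARIABLES=duration
  /FORMAT=NOTABLE
  /HISTOGRAM.
* One-sample Kolmogorov-Smirnov test against a uniform distribution.
NPAR TESTS
  /K-S(UNIFORM)=duration.
```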

But even if they're representative of the population, they have nothing
to do with the individual cases for which they're 'imputed'. No
analysis using those 'dates' will be in the least trustworthy.

A far better approach is to use true missing-value interpolation on the
*durations*, not the dates. (See SPSS 'MVA'.) I'm not clear how many
durations you'd have to impute. If it's near 50%, that won't be at all
reliable, either.

-Good luck,
Richard

Hashmi, Syed S

Re: Random date generator

> -----Original Message-----
> From: Richard Ristow [mailto:[hidden email]]
> Sent: Wednesday, July 25, 2007 2:05 PM

Richard,

Thanks for your input. I realize that I was stepping into extremely
treacherous territory when I decided to impute dates and select random
ones. As for the durations being roughly uniformly distributed, that's
what it looks like from the data I do have. Initially, I'd assumed that
durations would have a mean of about 7 days but somehow the data I do
have doesn't seem to show that. It's more or less uniformly
distributed. There were some durations that were >30 days but I doubt
if they're true. Therefore, I decided to go ahead with the uniform
distribution (although, the whole imputation and random selection still
bothers me).

The reason that I'm trying to get an idea about the dates, especially
the event start dates, is due to the nature of the study question. I'm
looking at the occurrence of certain events during pregnancy. However,
these events of interest have to occur within the first trimester, or if
I narrow it down further, the first two months of pregnancy. Therefore,
I have to know if an event occurred within a certain period of time
after the last menstrual date as reported by the woman. At the end of
the day, the variables for all the events get filtered down to a single
dichotomous variable - Y/N did the event occur during the period of
interest?
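
A rough sketch of how that final indicator might be derived in SPSS syntax.
The variable lmpdt (last menstrual date, stored as an SPSS date), the 60-day
cutoff standing in for "the first two months", and the names days_lmp,
in_window and imputed are all assumptions for illustration; startd is
assumed to be numeric and system-missing when no day was recorded.

```spss
* Sketch only: lmpdt and the 60-day cutoff are assumptions.
COMPUTE days_lmp = DATEDIFF(startdt, lmpdt, "days").
COMPUTE in_window = (days_lmp >= 0 AND days_lmp <= 60).
* Flag cases whose start day was missing, for the with and without analysis.
COMPUTE imputed = MISSING(startd).
EXECUTE.
```

Keeping the imputed flag alongside in_window makes it straightforward to
rerun the analysis with those cases excluded.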

I will do the analysis with and without the cases where the dates have
been imputed from incomplete data. I hadn't previously thought of using
true-missing value interpolation on the durations but I'll look into it.
I've never done that before so will have to read up a bit on it. I
might have an issue with the number of missings though, since more cases
have at least some part of the date than a duration value.

Thanks again for your advice. It's always nice to get a fresh look at an
issue.

- Shahrukh

Maguin, Eugene

Re: Random date generator

Syed,

It sounds like you are going to use the imputed dates to decide if something
happened or not. The new variable, 'something happened or not' might be a
dependent variable or it might be an independent variable. There's a
literature on estimating relationships in the presence of missing data. To
correctly estimate relationships (or, at least, come very close), you should
use either multiple imputation or a maximum likelihood estimation method
that incorporates the EM algorithm. So far as I know, SPSS has neither. The
key person here is Donald Rubin. But, there are other, more recent articles.

Gene Maguin

Hashmi, Syed S

Re: Random date generator

Thanks Gene,

After the comments that you and Richard made I'm thinking real hard of
rethinking the whole thing. Maximum likelihood estimation was something
that I had thought of initially but didn't follow up on. I guess it's
time that I do. Thanks again for your help.

- Shahrukh

> -----Original Message-----
> From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of
> Gene Maguin
> Sent: Wednesday, July 25, 2007 3:24 PM
> To: [hidden email]
> Subject: Re: Random date generator