Simulate cumulative probability distribution


Simulate cumulative probability distribution

Jignesh Sutar
Hi,

This question is more about understanding probabilities and then how to create a desired dataset in SPSS. I'm trying to simulate a dataset which, over a set of variables, follows a desired cumulative probability distribution curve.

Say, for example, I have twenty variables (P1 to P20). I want the individual probabilities of P1 to P20 to increase incrementally, whilst the cumulative probabilities follow a (cumulative) normal distribution, i.e. as shown here, with standard deviation = 5.

I then also want the final probability (PFinal) to have a specified probability, of say, 0.8.

Finally, I want to graph the results to show it has worked.

*******************************.
input program.
loop #j = 1 to 1000.
    compute ID=#j.
    vector P(20).
    loop #i = 1 to 20.
    compute P(#i) = rv.uniform(0,1)<0.01 /* WHAT TO PUT HERE ?! */ .
    end loop.
    end case.
end loop.
end file.
end input program.
dataset name sim.
execute.
*******************************.

compute PFinal=any(1,P1 to P20).

varstocases /make P from P1 to P20 /Index=Index.
variable level P (scale).

/* INCORRECT - NEED TO GRAPH CUMULATIVE PROBABILITY (not csum) */. 
* Chart Builder.
GGRAPH
  /GRAPHDATASET NAME="graphdataset" VARIABLES=Index SUM(P)[name="SUM_P"] MISSING=LISTWISE 
    REPORTMISSING=NO
  /GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
  SOURCE: s=userSource(id("graphdataset"))
  DATA: Index=col(source(s), name("Index"), unit.category())
  DATA: SUM_P=col(source(s), name("SUM_P"))
  GUIDE: axis(dim(1), label("Index"))
  GUIDE: axis(dim(2), label("Cumulative Sum P"))
  SCALE: linear(dim(2), include(0))
  ELEMENT: line(position(summary.sum.cumulative(Index*SUM_P)), missing.wings())
END GPL.

=====================
To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L. For a list of commands to manage subscriptions, send the command INFO REFCARD.

Re: Simulate cumulative probability distribution

Rich Ulrich
No one has answered in more than 12 hours since this was posted, and I assume
that the reason is, simply, no one understands it enough to want to disentangle it.
I will make a first effort.

People use the pdf rather than the cdf when setting up simulations; the logical and
computational connection is direct for the former, not the latter.  So, let's talk about
your problem in terms of the pdf.

You want something that is Normal.  But you have 20 variables, and you want their
"probabilities" to increase.  How do you apply "probabilities" to 20 separate variables? ...
each of which are Normal?  I admit, my internet ESP is not working very well on this, and
the only thing I come up with is this:  Maybe you want to draw a sample of 20 from a
Normal distribution; sort the 20 into ascending order; and assign them respectively to
P1 to P20.   This meets the requirement that there is a single distribution "over the 20 variables".
And the mean p-levels of the z-scores will increase, from P1 to P20.

If you are trying to tabulate something having to do with confidence limits for various
p-levels, I imagine that there are direct answers obtained from theory which make use of
ranks in the answers, using the beta distribution.  (For example: for a sample of 1000,
the 5% cutoff [rank-50] for other, similar samples will usually [with 95% confidence] be
found between the scores sampled at rank-37 and rank-65.)
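That parenthetical can be checked numerically. A minimal sketch in plain Python (not from the thread): the number of sample values falling below the population 5% cutoff in a sample of 1000 is Binomial(1000, 0.05), so the cutoff lies between the order statistics at rank 37 and rank 65 exactly when that count lands between 37 and 64 inclusive.

```python
from math import comb

def binom_pmf(n, p, k):
    """P(exactly k successes in n independent Bernoulli(p) trials)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# In a sample of n = 1000, the population 5% cutoff falls between the
# rank-37 and rank-65 order statistics exactly when the count of
# observations below it is between 37 and 64 inclusive.
n, p = 1000, 0.05
coverage = sum(binom_pmf(n, p, k) for k in range(37, 65))
print(coverage)  # close to 0.95, matching the "95% confidence" above
```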

If you want a normal with SD =5.0, you merely multiply the z  by 5.0.

I'm somewhat doubtful that the above is what you want, but I am even more at a loss
to figure out what you have in mind as PFinal.  The coding example certainly does not
help.  Or, it does not help *me*.   Do you want something to be cut off at 0.8 in the CDF?
Or to reach an asymptote at 0.8?

As usual - if you say more explicitly what you are after, there is a better chance for
a useful answer.

--
Rich Ulrich


Date: Tue, 3 Feb 2015 08:03:36 +0000
From: [hidden email]
Subject: Simulate cumulative probability distribution
To: [hidden email]


Re: Simulate cumulative probability distribution

Jignesh Sutar
Thanks for taking this up Rich.

The purpose of this exercise is exactly that: to understand probabilities, PDFs, and CDFs a little better, so thank you for raising questions and pointing out where further clarification is required.

The idea is something like this. Let's interpret P1 to P20 as being time variables, T1 to T20.

So for each subject, over time, the success of an event is measured at intervals T1 to T20 (0 = failure, 1 = success). During the earlier stages the probability is quite low, but by T20 the expected probability should asymptote to around 0.8. Given a success in the past, there could be a relapse/failure in the future, so I was thinking of this in terms of the cumulative probability of success over time rather than the probability at each individual time point. The question is then how to set the probabilities at each individual time-point variable to simulate this effect.
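A minimal sketch of that "cumulative probability of success" idea in plain Python (illustrative numbers only, assuming independent trials): the chance of having seen at least one success by time t is one minus the product of the per-trial failure probabilities up to t.

```python
def cumulative_success(per_trial_probs):
    """P(at least one success by each time point), assuming the
    trials are independent with the given per-trial probabilities."""
    curve, fail_so_far = [], 1.0
    for p in per_trial_probs:
        fail_so_far *= 1.0 - p           # P(no success yet)
        curve.append(1.0 - fail_so_far)  # P(at least one success so far)
    return curve

# Illustrative per-trial probabilities rising from 0.02 to 0.10 over T1-T20.
probs = [0.02 + 0.08 * i / 19 for i in range(20)]
curve = cumulative_success(probs)
# curve is monotonically increasing; where it levels off is controlled
# entirely by the per-trial probabilities chosen.
```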



 



On 3 February 2015 at 22:23, Rich Ulrich <[hidden email]> wrote:

Re: Simulate cumulative probability distribution

Andy W
Agree with Rich that it is difficult to figure out exactly what you want - but I don't feel too guilty spending some time on this, given that Jignesh spends time answering other people's questions as well.

So here is my take: you just want to sample a set of Bernoulli variables, but have those probabilities follow a normal CDF S curve. (It has some resemblance to simulating a type of probit model, perhaps?) Note that SPSS has inverse functions for the normal, so given a probability we can figure out the corresponding X value on the cumulative normal S curve.

*****************************************************************.
MATRIX.
SAVE T({1:20}) /OUTFILE = * /VARIABLES = 'ID'.
END MATRIX.
DATASET NAME Points.

COMPUTE #Min = 0.20.
COMPUTE #Max = 0.80.
COMPUTE #Step = (#Max - #Min)/(20-1).
COMPUTE Prob = (ID - 1)*#Step + #Min.
COMPUTE CDF_P = IDF.NORMAL(Prob,0,5).
FORMATS CDF_P (F2.0) Prob (F2.1).
EXECUTE.

MATRIX.
SAVE T({1:99}/100) /OUTFILE = * /VARIABLES = 'Prob_L'.
END MATRIX.
DATASET NAME CDF.
DATASET ACTIVATE CDF.
COMPUTE CDF_L = IDF.NORMAL(Prob_L,0,5).
FORMATS CDF_L (F2.0) Prob_L (F2.1).
EXECUTE.

GGRAPH
  /GRAPHDATASET NAME="P" DATASET="Points" VARIABLES=CDF_P Prob
  /GRAPHDATASET NAME="C" DATASET="CDF" VARIABLES=CDF_L Prob_L
  /GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
  SOURCE: P=userSource(id("P"))
  DATA: CDF_P=col(source(P), name("CDF_P"))
  DATA: Prob=col(source(P), name("Prob"))
  SOURCE: C=userSource(id("C"))
  DATA: CDF_L=col(source(C), name("CDF_L"))
  DATA: Prob_L=col(source(C), name("Prob_L"))
  GUIDE: axis(dim(1), label("X"))
  GUIDE: axis(dim(2), label("Prob"))
  SCALE: linear(dim(1), min(-10), max(10))
  ELEMENT: line(position(CDF_L*Prob_L))
  ELEMENT: point(position(CDF_P*Prob), color.interior(color.red), size(size."12"))
END GPL.

*This only works if you extend the number line to cdf locations where there are basically zero or 1 prob.
*DATA: x = iter(-20,20,0.2)
*ELEMENT: line(position(summary.proportion.cumulative(density.normal(x, mean(0), standardDeviation(5)))))
*****************************************************************.
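For readers outside SPSS, the IDF.NORMAL step above can be sketched with the Python standard library (NormalDist.inv_cdf plays the role of IDF.NORMAL; the mean 0 and SD 5 match the syntax above):

```python
from statistics import NormalDist

nd = NormalDist(mu=0, sigma=5)  # same parameters as IDF.NORMAL(p, 0, 5)

# 20 probabilities evenly spaced from 0.20 to 0.80, as in the syntax above.
probs = [0.20 + (0.80 - 0.20) * i / 19 for i in range(20)]

# X locations on the cumulative-normal S curve with those heights.
xs = [nd.inv_cdf(p) for p in probs]

# Round trip: the normal CDF at each X recovers the original probability.
assert all(abs(nd.cdf(x) - p) < 1e-9 for x, p in zip(xs, probs))
```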



<http://spssx-discussion.1045642.n5.nabble.com/file/n5728577/CDF_Sim1.png>

Now it is just a matter of drawing random Bernoulli variates given these probabilities.
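The Bernoulli draw itself can be sketched in plain Python as well, mirroring the INPUT PROGRAM below (1000 subjects, 20 probability bins stepping from 0.20 to 0.80; `random.random() < p` stands in for RV.BERNOULLI):

```python
import random

random.seed(10)  # fixed seed, like SET SEED 10 in the syntax below
n_subjects, n_bins = 1000, 20
p_min, p_max = 0.20, 0.80
step = (p_max - p_min) / (n_bins - 1)

# means[i] is the observed success proportion in bin i+1, which should
# sit close to that bin's target probability.
means = []
for i in range(n_bins):
    p = p_min + i * step
    hits = sum(random.random() < p for _ in range(n_subjects))
    means.append(hits / n_subjects)
```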

*****************************************************************.
*Now simulate data according to these same prob bins.
*******************************.
SET SEED 10.
INPUT PROGRAM.
COMPUTE #N = 20.
COMPUTE #Min = 0.20.
COMPUTE #Max = 0.80.
COMPUTE #Step = (#Max - #Min)/(#N-1).
LOOP #j = 1 TO 1000.
  LOOP #i = 1 TO #N.
    COMPUTE ID = #j.
    COMPUTE Bin = #i.
    COMPUTE #Prob = (#i - 1)*#Step + #Min.
    COMPUTE X = IDF.NORMAL(#Prob,0,5).
    COMPUTE S = RV.BERNOULLI(#Prob).
    END CASE.
  END LOOP.
END LOOP.
END FILE.
END INPUT PROGRAM.
DATASET NAME sim.
DATASET ACTIVATE sim.

AGGREGATE OUTFILE=* MODE=REPLACE
  /BREAK Bin
  /S = MEAN(S)
  /X = FIRST(X).
DATASET NAME simAgg.
FORMATS S (F2.1) X (F2.0).
EXECUTE.

*Simulation varies around line.
GGRAPH
  /GRAPHDATASET NAME="C" DATASET="CDF" VARIABLES=CDF_L Prob_L
  /GRAPHDATASET NAME="Sim" DATASET="simAgg" VARIABLES=S X
  /GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
  SOURCE: Sim=userSource(id("Sim"))
  DATA: X=col(source(Sim), name("X"))
  DATA: S=col(source(Sim), name("S"))
  SOURCE: C=userSource(id("C"))
  DATA: CDF_L=col(source(C), name("CDF_L"))
  DATA: Prob_L=col(source(C), name("Prob_L"))
  GUIDE: axis(dim(1), label("X"))
  GUIDE: axis(dim(2), label("Prob"))
  SCALE: linear(dim(1), min(-5), max(5))
  ELEMENT: line(position(CDF_L*Prob_L))
  ELEMENT: point(position(X*S), color.interior(color.grey),
           transparency.interior(transparency."0.6"), size(size."10"))
END GPL.
*******************************.
*****************************************************************.

And we can see that they meander about the expected value in this simulation run.

<http://spssx-discussion.1045642.n5.nabble.com/file/n5728577/CDF_Sim2.png>



Hopefully that is enough to confuse everyone some more!
Andy W
apwheele@gmail.com
http://andrewpwheeler.wordpress.com/

Re: Simulate cumulative probability distribution

Maguin, Eugene
In reply to this post by Jignesh Sutar

It seems like what's being described is a kind of learning model. You could think of it as a logistic distribution model or a probit distribution model; I don't know that that matters. Without working this out, one generating equation might be

Y(i,j) = B0 + B1*j + B2*j**2 + B3*j**3, where j is the j'th trial for person i.

Another might be Y(i,j) = B0 + B1*(sum(Y(i,k) for k = 1 to j-1)). Probability of success on trial j is a linear function of the number of prior successes.

 

Also, and maybe I'm thinking about this wrongly, but it seems that P1 to P20 are trial result variables and each is dichotomous. Maybe P1 is 5% pass and the pass percentage increases per some function to 80% at P20.
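One way to keep such a generating equation inside [0, 1] is to push the linear predictor through a logistic (or probit) link; a hypothetical sketch in Python, with coefficients B0 and B1 chosen purely for illustration:

```python
from math import exp

def logistic(z):
    """Squash a linear predictor into a probability in (0, 1)."""
    return 1.0 / (1.0 + exp(-z))

# P(pass on trial j) = logistic(B0 + B1*j): a rising learning curve.
B0, B1 = -3.0, 0.22  # illustrative values only, not fitted to anything
probs = [logistic(B0 + B1 * j) for j in range(1, 21)]
# Starts around 0.06 at trial 1 and reaches roughly 0.80 by trial 20.
```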

 

Gene Maguin

 

From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of Jignesh Sutar
Sent: Wednesday, February 04, 2015 3:51 AM
To: [hidden email]
Subject: Re: Simulate cumulative probability distribution

 


Re: Simulate cumulative probability distribution

Jignesh Sutar
In reply to this post by Andy W
Thanks Andy. That helps a lot (I probably still need to re-visit elementary probability theory to absorb it all again).

If you run the syntax I paste below on the simulated dataset you'll see that the first DESCRIPTIVES command shows where you have simulated the S curve and the second DESCRIPTIVES command shows where I was envisaging the S curve distribution to lie.

The simulated dataset you generated shows that after 12 trials there is effectively a 100% chance of at least one success across the previous attempts. I wanted these "net"/"cumulative" probabilities to asymptote to 0.8 after 20 attempts.

DATASET ACTIVATE sim.
casestovars /id=id /separator="@" /drop=bin x.
/* Means follow a normal CDF S curve */.
descrip s@1 to s@20.
begin program.
import spss
# Build S_i = 1 if any success occurred in trials 1..i (cumulative success).
# Python 2 syntax, as used by SPSS BEGIN PROGRAM blocks of the day.
for i in xrange(1, 20 + 1):
    if i == 1:
        spss.Submit("compute S_%(i)s=s@1." % locals())
    else:
        spss.Submit("compute S_%(i)s=any(1, s@1 to s@%(i)s)." % locals())
end program.
/* But it is here where I'd like the means to follow a normal CDF S curve */.
descrip s_1 to s_20.
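If the target is the cumulative curve rather than the per-trial one, the per-trial (conditional) probabilities can be backed out from it, assuming independent trials: with C_t the target P(at least one success by time t), 1 - C_t = (1 - C_{t-1})*(1 - p_t), so p_t = (C_t - C_{t-1}) / (1 - C_{t-1}). A sketch (plain Python, illustrative only) with the cumulative targets following a normal CDF of SD 5, scaled to asymptote at 0.8:

```python
from statistics import NormalDist

nd = NormalDist(mu=0, sigma=5)

# Target cumulative curve over 20 time points: a normal CDF (SD = 5)
# scaled so it tops out near 0.8, per the PFinal requirement.
ts = range(-10, 10)
C = [0.8 * nd.cdf(t) for t in ts]

# Back out per-trial probabilities: p_t = (C_t - C_{t-1}) / (1 - C_{t-1}).
p, prev = [], 0.0
for c in C:
    p.append((c - prev) / (1.0 - prev))
    prev = c
# Simulating independent Bernoulli trials with these p's makes
# 1 - prod(1 - p_1..p_t) equal C_t exactly, by construction.
```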

On 4 February 2015 at 14:23, Andy W <[hidden email]> wrote:

Re: Simulate cumulative probability distribution

Maguin, Eugene
In reply to this post by Andy W
Maybe I'm thinking about this wrongly, but it seems that P1 to P20 are trial result variables and each is dichotomous. Maybe P1 is 5% pass and the pass percentage increases per some function to 80% at P20. You could say it's like a kind of learning model. You could think of it as a logistic model or a probit model; I don't know that that matters. One generating equation might be

P(i,j) = B0 + B1*j + B2*j**2 + B3*j**3, where j is the j'th trial for person i. Probability of success on trial j is some polynomial curve, here cubic.
Another might be P(i,j) = B0 + B1*(sum(P(i,k) for k = 1 to j-1)). Probability of success on trial j is a linear function of the number of prior successes.

Gene Maguin

-----Original Message-----
From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of Andy W
Sent: Wednesday, February 04, 2015 9:23 AM
To: [hidden email]
Subject: Re: Simulate cumulative probability distribution

Agree with Rich that it is difficult to figure out exactly what you want - but I don't feel too guilty spending some time on this given that Jignesh spends time answering other peoples questions as well.

So here is my take, you just want to sample a set of Bernoulli variables, but have those probabilities follow a normal CDF S curve. (Has some resemblance to simulating a type of probit model perhaps?) Note that SPSS has inverse functions for the normal, so given a probability we can figure out the corresponding X value for the cumulative normal S curve.

*****************************************************************.
MATRIX.
SAVE T({1:20}) /OUTFILE = * /VARIABLES = 'ID'.
END MATRIX.
DATASET NAME Points.

COMPUTE #Min = 0.20.
COMPUTE #Max = 0.80.
COMPUTE #Step = (#Max - #Min)/(20-1).
COMPUTE Prob = (ID - 1)*#Step + #Min.
COMPUTE CDF_P = IDF.NORMAL(Prob,0,5).
FORMATS CDF_P (F2.0) Prob (F2.1).
EXECUTE.

MATRIX.
SAVE T({1:99}/100) /OUTFILE = * /VARIABLES = 'Prob_L'.
END MATRIX.
DATASET NAME CDF.
DATASET ACTIVATE CDF.
COMPUTE CDF_L = IDF.NORMAL(Prob_L,0,5).
FORMATS CDF_L (F2.0) Prob_L (F2.1).
EXECUTE.

GGRAPH
  /GRAPHDATASET NAME="P" DATASET="Points" VARIABLES=CDF_P Prob
  /GRAPHDATASET NAME="C" DATASET="CDF" VARIABLES=CDF_L Prob_L
  /GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
  SOURCE: P=userSource(id("P"))
  DATA: CDF_P=col(source(P), name("CDF_P"))
  DATA: Prob=col(source(P), name("Prob"))
  SOURCE: C=userSource(id("C"))
  DATA: CDF_L=col(source(C), name("CDF_L"))
  DATA: Prob_L=col(source(C), name("Prob_L"))
  GUIDE: axis(dim(1), label("X"))
  GUIDE: axis(dim(2), label("Prob"))
  SCALE: linear(dim(1), min(-10), max(10))
  ELEMENT: line(position(CDF_L*Prob_L))
  ELEMENT: point(position(CDF_P*Prob), color.interior(color.red),
size(size."12"))
END GPL.

*This only works if you extend the number line to cdf locations where there are basically zero or 1 prob.
*DATA: x = iter(-20,20,0.2)
*ELEMENT: line(position(summary.proportion.cumulative(density.normal(x,
mean(0), standardDeviation(5)))))
*****************************************************************.

<http://spssx-discussion.1045642.n5.nabble.com/file/n5728577/CDF_Sim1.png>

Now it is just a matter of drawing random Bernoulli variates given these probabilities.

*****************************************************************.
*Now simulate data according to these same prob bins.
*******************************.
SET SEED 10.
INPUT PROGRAM.
COMPUTE #N = 20.
COMPUTE #Min = 0.20.
COMPUTE #Max = 0.80.
COMPUTE #Step = (#Max - #Min)/(#N-1).
LOOP #j = 1 TO 1000.
  LOOP #i = 1 TO #N.
    COMPUTE ID = #j.
    COMPUTE Bin = #i.
    COMPUTE #Prob = (#i - 1)*#Step + #Min.
    COMPUTE X = IDF.NORMAL(#Prob,0,5).
    COMPUTE S = RV.BERNOULLI(#Prob).
    END CASE.
  END LOOP.
END LOOP.
END FILE.
END INPUT PROGRAM.
DATASET NAME sim.
DATASET ACTIVATE sim.

AGGREGATE OUTFILE=* MODE=REPLACE
  /BREAK Bin
  /S = MEAN(S)
  /X = FIRST(X).
DATASET NAME simAgg.
FORMATS S (F2.1) X (F2.0).
EXECUTE.

*Simulation varies around line.
GGRAPH
  /GRAPHDATASET NAME="C" DATASET="CDF" VARIABLES=CDF_L Prob_L
  /GRAPHDATASET NAME="Sim" DATASET="simAgg" VARIABLES=S X
  /GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
  SOURCE: Sim=userSource(id("Sim"))
  DATA: X=col(source(Sim), name("X"))
  DATA: S=col(source(Sim), name("S"))
  SOURCE: C=userSource(id("C"))
  DATA: CDF_L=col(source(C), name("CDF_L"))
  DATA: Prob_L=col(source(C), name("Prob_L"))
  GUIDE: axis(dim(1), label("X"))
  GUIDE: axis(dim(2), label("Prob"))
  SCALE: linear(dim(1), min(-5), max(5))
  ELEMENT: line(position(CDF_L*Prob_L))
  ELEMENT: point(position(X*S), color.interior(color.grey),
           transparency.interior(transparency."0.6"), size(size."10"))
END GPL.
*******************************.
*****************************************************************.

And we can see that they meander about the expected value in this simulation run.

<http://spssx-discussion.1045642.n5.nabble.com/file/n5728577/CDF_Sim2.png>

Hopefully that is enough to confuse everyone some more!



-----
Andy W
[hidden email]
http://andrewpwheeler.wordpress.com/
--
View this message in context: http://spssx-discussion.1045642.n5.nabble.com/Simulate-cumulative-probability-distibution-tp5728568p5728577.html
Sent from the SPSSX Discussion mailing list archive at Nabble.com.

=====================
To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD


Re: Simulate cumulative probability distribution

Andy W
Yes Gene, that is not much different from what I showed, if you write it generically:

P(Y) = f(X)

My function was simply the normal CDF (and I sampled X to follow the S curve), and yours is cubic (or, in the latter case, some type of auto-regressive function) predicting the probability of Y occurring.

Jignesh's latest example, though, is more like a survival function: once a 1 occurs, it is always a 1 in future time periods. In either your formulation or my original one, you can't force independent draws to behave like that.

So what I thought Jignesh wanted (in the latest example) was for the death rate (1 minus the life table) to follow the normal S curve. This ends up being a bit of a weird formulation though, as I'll try to show.

So in a discrete time survival model, you typically model the hazard rate, and in the simplest case the hazard rate is constant (this is not my forte, so don't take my terminology as authoritative -- hopefully my exposition is clear enough though!). In this formulation, the probability of an event (e.g. "death") having occurred by the end of period n is:

1 - (1 - HazardRate)^n

So if you have, say, a 20% hazard rate, then the probability of "death" within the first three periods is:

1 - (1 - 0.2)^1 = 0.2
1 - (1 - 0.2)^2 = 0.36
1 - (1 - 0.2)^3 = 0.488
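Those numbers are easy to verify; here is a minimal Python sketch (my own, not from the thread) of the constant-hazard arithmetic:

```python
# Probability of "death" by the end of period n under a constant hazard rate
def cum_death(hazard, n):
    return 1 - (1 - hazard) ** n

# The three periods worked out above with a 20% hazard rate
print(round(cum_death(0.2, 1), 3))  # 0.2
print(round(cum_death(0.2, 2), 3))  # 0.36
print(round(cum_death(0.2, 3), 3))  # 0.488
```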

So here Jignesh knows what the end death probability is, and wants the cumulative death curve to conform to the normal CDF S curve. But that forces the hazard rate for each period to be quite non-linear. So if we write the formula as:

1 - Product(1 - Hi)

Where Hi is the hazard rate for the individual time period and "Product" is my ASCII fudge for the product of the (1 - Hi) terms over the prior periods. So if we knew we wanted the end death table to follow a linear pattern:

1 - (1 - H1)                   = 0.2
1 - (1 - H1)*(1 - H2)          = 0.3
1 - (1 - H1)*(1 - H2)*(1 - H3) = 0.4

We can then solve for the individual H's. As another fudge, I will write Ci for the prior cumulative probability (i.e. the death % through the prior time period), and so we can solve for the H's as:

1 - (1 - Ci)*(1 - Hi) = p -> 1 - (1 - p)/(1 - Ci) = Hi

Since we have set values for p and Ci, we can then figure out Hi for the example above:

1 - (1 - 0.2)/(1 - 0  ) = 0.2
1 - (1 - 0.3)/(1 - 0.2) = 0.125
1 - (1 - 0.4)/(1 - 0.3) = 0.1428571
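The same back-solving is easy to sketch in Python (again my own cross-check, not part of the thread): recover each period's hazard from the target cumulative probabilities, then chain the hazards back together to confirm they reproduce the targets.

```python
# Target cumulative "death" probabilities (the linear example above)
targets = [0.2, 0.3, 0.4]

# Hi = 1 - (1 - p)/(1 - Ci), with Ci the prior period's cumulative probability
hazards = []
c_prior = 0.0
for p in targets:
    hazards.append(1 - (1 - p) / (1 - c_prior))
    c_prior = p

print([round(h, 7) for h in hazards])  # [0.2, 0.125, 0.1428571]

# Check: the chained product of survival terms hits each cumulative target
surv = 1.0
for h, p in zip(hazards, targets):
    surv *= 1 - h
    assert abs((1 - surv) - p) < 1e-9
```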

So for a linear death rate we need a weird function for the hazard rate. Here is an SPSS example of doing this so the death rates follow the normal CDF S curve.

*********************************.
SET SEED 10.
INPUT PROGRAM.
LOOP #j = 1 TO 100000.
COMPUTE ID = #j.
END CASE.
END LOOP.
END FILE.
END INPUT PROGRAM.
DATASET NAME Ind.

COMPUTE #Min = -15.
COMPUTE #Max = 15.
COMPUTE #N = 20.
COMPUTE #Step = (#Max - #Min)/(#N-1).
COMPUTE #C = 0.
VECTOR T(20,F1.0).
VECTOR C(20).

LOOP #i = 1 TO #N.
  COMPUTE #X = (#i - 1)*#Step + #Min.  
  COMPUTE #P = CDF.NORMAL(#X,0,5).
  COMPUTE #H = 1 - (1 - #P)/(1 - #C).
  COMPUTE #S = RV.BERNOULLI(#H).
  COMPUTE #C = #P.
  COMPUTE C(#i) = #C.
  COMPUTE T(#i) = #S.
END LOOP IF #S = 1.

*These show the approximate hazard rate for every individual time period.
FREQ T1 TO T20 /FORMAT = NOTABLE /STATISTICS = MEAN.
RECODE T1 TO T20 (SYSMIS = 1)(ELSE = COPY).
*After the recode this is the death table (or 1 minus the life table).
FREQ T1 TO T20 /FORMAT = NOTABLE /STATISTICS = MEAN.
*The C variables are the theoretical probabilities for normal CDF.
FREQ C1 TO C20 /FORMAT = NOTABLE /STATISTICS = MEAN.

DATASET DECLARE Agg.
AGGREGATE OUTFILE='Agg'
  /BREAK
  /T1 TO T20 = MEAN(T1 TO T20)
  /C1 TO C20 = MEAN(C1 TO C20).
DATASET ACTIVATE Agg.

VARSTOCASES
  /MAKE T FROM T1 TO T20
  /MAKE C FROM C1 TO C20.

COMPUTE #Step = 30/19.
COMPUTE X = ($casenum - 1)*#Step - 15.
FORMATS X (F1.0) T C (F2.1).
EXECUTE.

GGRAPH
  /GRAPHDATASET NAME="graphdataset" VARIABLES=X T C
  /GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
  SOURCE: s=userSource(id("graphdataset"))
  DATA: X=col(source(s), name("X"))
  DATA: T=col(source(s), name("T"))
  DATA: C=col(source(s), name("C"))
  GUIDE: axis(dim(1), label("X"))
  GUIDE: axis(dim(2), label("Cum. Prob."))
  ELEMENT: line(position(smooth.spline(X*C)))
  ELEMENT: point(position(X*T), size(size."8"), color.interior(color.red))
END GPL.
*********************************.

The red dots are the simulation values, and the black line is the theoretical normal CDF. With 100,000 observations the simulation matches it near perfectly.



But hopefully all of that shows that a more natural parameterization for this is to model the hazard rate, as opposed to constraining the end death rate to follow some prespecified pattern. (I suppose it is a nice illustration that typical survival models always result in death in the limit, although for simulation purposes you can constrain the end probability of death by only allowing a finite amount of time.)
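For anyone following along outside SPSS, here is a hedged Python sketch of the same simulation (my translation, with random.random() standing in for RV.BERNOULLI and NormalDist.cdf for CDF.NORMAL), confirming that the simulated death table tracks the normal CDF:

```python
import random
from statistics import NormalDist

random.seed(10)
nd = NormalDist(mu=0, sigma=5)
N, LO, HI, REPS = 20, -15.0, 15.0, 100_000
step = (HI - LO) / (N - 1)

# Back-solve the per-period hazards so each period's cumulative death
# probability lands exactly on the normal CDF (mirrors #H in the syntax above)
cumprob, hazards, c_prior = [], [], 0.0
for i in range(N):
    p = nd.cdf(LO + i * step)
    hazards.append(1 - (1 - p) / (1 - c_prior))
    cumprob.append(p)
    c_prior = p

# Simulate: once "dead", an individual stays dead for all later periods
dead_by = [0] * N
for _ in range(REPS):
    for i, h in enumerate(hazards):
        if random.random() < h:  # the Bernoulli draw, cf. RV.BERNOULLI(#H)
            for j in range(i, N):
                dead_by[j] += 1
            break

# Largest gap between the observed death table and the theoretical CDF
worst = max(abs(d / REPS - p) for d, p in zip(dead_by, cumprob))
print(worst < 0.01)  # True: a near-perfect match with 100,000 draws
```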
Andy W
apwheele@gmail.com
http://andrewpwheeler.wordpress.com/

Re: Simulate cumulative probability distribution

Rich Ulrich
In reply to this post by Jignesh Sutar
Andy's post on Feb 6 finally tackles the question in the way that I thought
was more appropriate -- referring to the Hazard or risk for survivorship, if that
were appropriate.

Constant Hazard in survivorship produces a negative exponential, which is
linear on the log scale, and it is very common.  There are also models for
increasing risk, decreasing risk, and U-shaped risk (human deaths in the
first week/ final years;  manufacturing defects/ result of wear).  None of
these produce anything much like a cumulative normal, even when you
reverse them to describe a sort of growth.

However, the "cumulative normal" is very similar to the "cumulative logistic";
and the latter is typical of "exponential growth within limits".  That seems to
me to be more like what is wanted here, even though the underlying process
has yet to be described.
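That similarity is easy to quantify with a small Python check (mine, not Rich's; the 1.702 scaling constant is the standard one from item response theory):

```python
import math
from statistics import NormalDist

phi = NormalDist().cdf  # standard normal CDF

def logistic(x):
    # Logistic CDF scaled by 1.702 to best approximate the normal CDF
    return 1 / (1 + math.exp(-1.702 * x))

# Maximum gap between the two curves over a wide grid of x values
xs = [i / 100 for i in range(-500, 501)]
worst = max(abs(phi(x) - logistic(x)) for x in xs)
print(worst < 0.01)  # True: the curves differ by less than 0.01 everywhere
```

So for simulation purposes the two S curves are nearly interchangeable; the choice matters more for the story told about the underlying process than for the shape.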

If you are going to define a limit, it could be 80% of the sample -- or, it might be
a function of the comment, "there could be a relapse/failure in the future".   If the
probability of "un-success" (previous success is reversed) is always 20%, that would
yield 80% as the asymptote -- with different IDs being the Successes at different times.
Is that desirable?  There has been no comment about how stable a Success should be.

Thus: for T1, or after a Failure, Success is sampled with the logistic p for the time frame, which nears the max at some desired period.
For each T_i after a Success, a change to Failure is sampled as a 20% chance.

By the way, the un-success rate only has to end up at 20% for the cumulative
success to end up at 80%, so it could start higher or lower, if there were any reason
for it.

--
Rich Ulrich



Date: Wed, 4 Feb 2015 08:50:44 +0000
From: [hidden email]
Subject: Re: Simulate cumulative probability distribution
To: [hidden email]

Thanks for taking this up Rich.

The purpose of this exercise is exactly that: to understand probabilities, pdfs, and cdfs a little better, so thank you for raising questions where further clarification is required.

The idea is something like this. Let's interpret P1 to P20 as being time variables, T1 to T20.

So for each subject, over time, the success of an event is measured at intervals T1 to T20 (0=failure, 1=success). During the earlier stages the probability is quite low, but by T20 the expected probability should asymptote to around 0.8. Given a success in the past, there could be a relapse/failure in the future, so I was thinking of this in terms of the cumulative probability of success over time rather than at each individual time point. The question is then how to go about setting the probabilities at each individual time point variable to simulate this effect.

 [snip, previous]

Re: Simulate cumulative probability distribution

Rich Ulrich
Okay, I screwed up here, especially on achieving the 80% asymptote.

The curve will approach an asymptote when the fraction that turns from
F (failure) to S (success) matches the fraction that turns from S to F.  That
will never be as great as 20% for those steps from 1 to 20 -- more like 5%
or less at the asymptote.

From graphs on the Wikipedia page on the logistic, it seems to me that s=2 or s=3
might work better for the logistic.  Question 1: How far from zero should the
first period start?  Question 2: How near the end should an "asymptote" seem to start?

And when I start to model this myself, I notice that I have to write out the
cumulative distribution of the continuous function in order to find the
fraction that should convert in a discrete interval.  For that purpose, it is just
as easy to start with the normal as with the logistic.  However, I do still
see a rationale for the logistic (as growth) where I have not seen a
rationale for the normal.

If there are no changes of S to F (or F to S) at the asymptote, then the modeling
is easier than otherwise, since it does not have to match the S-to-F rate to the
F-to-S rate.

--
Rich Ulrich


From: [hidden email]
To: [hidden email]; [hidden email]
Subject: RE: Simulate cumulative probability distribution
Date: Sat, 7 Feb 2015 18:02:07 -0500

 [snip, previous]