|
Hi,
I want to generate random numbers that are unique but they should not be ordered in any way (i.e. ascending or descending order). I have about 300,000 records. For e.g. Case# Random Number (Unique) 1 8 2 10 3 45 4 35 5 40 6 32 7 50 8 28 9 48 10 22 . . . . . . . . 48 2 49 5 50 18 I know that using the $CASENUM does generate unique numbers but it is in ascending order. Could somebody offer a solution? Thanks. R. Abraham |
|
Look at the RANDOM() function. It should do what you want.
Here is some syntax we use to create a new random number between 0 and 999,999. new file. input program. loop #B=0 to 999999. compute xpid=#B. compute random=rnd(rv.uniform(0,999999)). end case. end loop. end file. end input program. exe. Melissa -----Original Message----- From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of <R. Abraham> Sent: Tuesday, August 28, 2007 9:36 AM To: [hidden email] Subject: [SPSSX-L] Assigning Random Numbers Hi, I want to generate random numbers that are unique but they should not be ordered in any way (i.e. ascending or descending order). I have about 300,000 records. For e.g. Case# Random Number (Unique) 1 8 2 10 3 45 4 35 5 40 6 32 7 50 8 28 9 48 10 22 . . . . . . . . 48 2 49 5 50 18 I know that using the $CASENUM does generate unique numbers but it is in ascending order. Could somebody offer a solution? Thanks. R. Abraham PRIVILEGED AND CONFIDENTIAL INFORMATION This transmittal and any attachments may contain PRIVILEGED AND CONFIDENTIAL information and is intended only for the use of the addressee. If you are not the designated recipient, or an employee or agent authorized to deliver such transmittals to the designated recipient, you are hereby notified that any dissemination, copying or publication of this transmittal is strictly prohibited. If you have received this transmittal in error, please notify us immediately by replying to the sender and delete this copy from your system. You may also call us at (309) 827-6026 for assistance. |
|
In reply to this post by <R. Abraham>
Good point--we actually use the random number to order the file to create a new unique variable.
sort cases by random. compute xpidnew=1. exe. do if lag(xpidnew)>0. compute xpidnew=lag(xpidnew)+1. end if. exe. Melissa -----Original Message----- From: la volta statistics [mailto:[hidden email]] Sent: Tuesday, August 28, 2007 10:19 AM To: Melissa Ives; [hidden email] Subject: AW: Assigning Random Numbers To make sure that you have unique numbers, you should subsequently aggregate the numbers created with Melissa's code . new file. input program. loop #B=0 to 999999. compute xpid=#B. compute random=rnd(rv.uniform(0,999999)). end case. end loop. end file. end input program. exe. AGGREGAT OUTFILE = * /Break random /U_Nr = N. Compute case = $Casenuum. In my case the Melissa's code produced somewhat over 600,000 unique cases. have fun Christian ******************************* la volta statistics Christian Schmidhauser, Dr.phil.II Weinbergstrasse 108 Ch-8006 Zürich Tel: +41 (043) 233 98 01 Fax: +41 (043) 233 98 02 email: mailto:[hidden email] internet: http://www.lavolta.ch/ -----Ursprüngliche Nachricht----- Von: SPSSX(r) Discussion [mailto:[hidden email]]Im Auftrag von Melissa Ives Gesendet: Dienstag, 28. August 2007 16:45 An: [hidden email] Betreff: Re: Assigning Random Numbers Look at the RANDOM() function. It should do what you want. Here is some syntax we use to create a new random number between 0 and 999,999. new file. input program. loop #B=0 to 999999. compute xpid=#B. compute random=rnd(rv.uniform(0,999999)). end case. end loop. end file. end input program. exe. Melissa -----Original Message----- From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of <R. Abraham> Sent: Tuesday, August 28, 2007 9:36 AM To: [hidden email] Subject: [SPSSX-L] Assigning Random Numbers Hi, I want to generate random numbers that are unique but they should not be ordered in any way (i.e. ascending or descending order). I have about 300,000 records. For e.g. Case# Random Number (Unique) 1 8 2 10 3 45 4 35 5 40 6 32 7 50 8 28 9 48 10 22 . . . . . . . . 48 2 49 5 50 18 I know that using the $CASENUM does generate unique numbers but it is in ascending order. Could somebody offer a solution? Thanks. R. Abraham PRIVILEGED AND CONFIDENTIAL INFORMATION This transmittal and any attachments may contain PRIVILEGED AND CONFIDENTIAL information and is intended only for the use of the addressee. If you are not the designated recipient, or an employee or agent authorized to deliver such transmittals to the designated recipient, you are hereby notified that any dissemination, copying or publication of this transmittal is strictly prohibited. If you have received this transmittal in error, please notify us immediately by replying to the sender and delete this copy from your system. You may also call us at (309) 827-6026 for assistance. |
|
In reply to this post by Melissa Ives
To make sure that you have unique numbers, you should subsequently aggregate
the numbers created with Melissa's code . new file. input program. loop #B=0 to 999999. compute xpid=#B. compute random=rnd(rv.uniform(0,999999)). end case. end loop. end file. end input program. exe. AGGREGAT OUTFILE = * /Break random /U_Nr = N. Compute case = $Casenuum. In my case the Melissa's code produced somewhat over 600,000 unique cases. have fun Christian ******************************* la volta statistics Christian Schmidhauser, Dr.phil.II Weinbergstrasse 108 Ch-8006 Zurich Tel: +41 (043) 233 98 01 Fax: +41 (043) 233 98 02 email: mailto:[hidden email] internet: http://www.lavolta.ch/ -----Ursprungliche Nachricht----- Von: SPSSX(r) Discussion [mailto:[hidden email]]Im Auftrag von Melissa Ives Gesendet: Dienstag, 28. August 2007 16:45 An: [hidden email] Betreff: Re: Assigning Random Numbers Look at the RANDOM() function. It should do what you want. Here is some syntax we use to create a new random number between 0 and 999,999. new file. input program. loop #B=0 to 999999. compute xpid=#B. compute random=rnd(rv.uniform(0,999999)). end case. end loop. end file. end input program. exe. Melissa -----Original Message----- From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of <R. Abraham> Sent: Tuesday, August 28, 2007 9:36 AM To: [hidden email] Subject: [SPSSX-L] Assigning Random Numbers Hi, I want to generate random numbers that are unique but they should not be ordered in any way (i.e. ascending or descending order). I have about 300,000 records. For e.g. Case# Random Number (Unique) 1 8 2 10 3 45 4 35 5 40 6 32 7 50 8 28 9 48 10 22 . . . . . . . . 48 2 49 5 50 18 I know that using the $CASENUM does generate unique numbers but it is in ascending order. Could somebody offer a solution? Thanks. R. Abraham PRIVILEGED AND CONFIDENTIAL INFORMATION This transmittal and any attachments may contain PRIVILEGED AND CONFIDENTIAL information and is intended only for the use of the addressee. If you are not the designated recipient, or an employee or agent authorized to deliver such transmittals to the designated recipient, you are hereby notified that any dissemination, copying or publication of this transmittal is strictly prohibited. If you have received this transmittal in error, please notify us immediately by replying to the sender and delete this copy from your system. You may also call us at (309) 827-6026 for assistance. |
|
In reply to this post by Melissa Ives
At 10:35 AM 8/28/2007, <R. Abraham> wrote:
>I want to generate random numbers that are unique but they should not >be ordered in any way (i.e. ascending or descending order). I have >about 300,000 records. > >For e.g. > >Case# Random Number (Unique) > >1 8 >2 10 See detailed remarks below; but, as a summary, the random-sort method is probably the simplest. This is SPSS 15 draft output (WRR:not saved separately), for 10 cases. (And it worked on the first try, too - "Three things you should be wary of..."): |-----------------------------|---------------------------| |Output Created |28-AUG-2007 13:22:15 | |-----------------------------|---------------------------| Case# Greek 1 Alpha 2 Beta 3 Gamma 4 Delta 5 Epsilon 6 Zeta 7 Eta 8 Theta 9 Iota 10 Kappa Number of cases read: 10 Number of cases listed: 10 NUMERIC RandNmbr (F6). NUMERIC RandOrdr (F9.6). COMPUTE RandOrdr = RV.UNIFORM(0,1). SORT CASES BY RandOrdr. COMPUTE RandNmbr = $CASENUM. LIST. List |-----------------------------|---------------------------| |Output Created |28-AUG-2007 13:22:16 | |-----------------------------|---------------------------| Case# Greek RandNmbr RandOrdr 2 Beta 1 .055204 4 Delta 2 .056053 3 Gamma 3 .089339 1 Alpha 4 .289874 5 Epsilon 5 .602548 8 Theta 6 .639086 10 Kappa 7 .669676 7 Eta 8 .677862 6 Zeta 9 .870447 9 Iota 10 .959446 Number of cases read: 10 Number of cases listed: 10 SORT CASES BY Case#. LIST. List |-----------------------------|---------------------------| |Output Created |28-AUG-2007 13:22:16 | |-----------------------------|---------------------------| Case# Greek RandNmbr RandOrdr 1 Alpha 4 .289874 2 Beta 1 .055204 3 Gamma 3 .089339 4 Delta 2 .056053 5 Epsilon 5 .602548 6 Zeta 9 .870447 7 Eta 8 .677862 8 Theta 6 .639086 9 Iota 10 .959446 10 Kappa 7 .669676 Number of cases read: 10 Number of cases listed: 10 .................................... Regarding suggestions in this thread - At 10:45 AM 8/28/2007, Melissa Ives wrote: >I. Here is some syntax we use to create a new random number between 0 >and 999,999. >... >. compute random=rnd(rv.uniform(0,999999)). It doesn't much matter for a large number like 999,999, but this logic is slightly off: probabilities of getting 0 and 999,999 are half the probabilities for the other numbers. I recommend . compute random=TRUNC(rv.uniform(0,1E6)). II. At 11:14 AM 8/28/2007, Melissa Ives wrote: >We actually use the random number to order the file to create a new >unique variable. > >sort cases by random. >compute xpidnew=1. >exe. >do if lag(xpidnew)>0. >compute xpidnew=lag(xpidnew)+1. >end if. >exe. Exactly, but also note, . If you compute "random" as an integer, like - compute random=TRUNC(rv.uniform(0,1E6)). then there will be ties. That will give a slight preference for randomly sorted cases retaining their original orders in the file. For random sorting, use the raw output of the random-number generator to retain full precision: - compute random=rv.uniform(0,1). . Neither '.exe.' is necessary, and they can be inefficient if the file is large. =================== APPENDIX: Test data =================== DATA LIST FIXED /Case# 01-03 Greek 04-11 (A). BEGIN DATA 1 Alpha 2 Beta 3 Gamma 4 Delta 5 Epsilon 6 Zeta 7 Eta 8 Theta 9 Iota 10 Kappa END DATA. LIST. |
|
Thanks to all you replied to my question. I was able to successfully
produce the desired results. - Renji Richard Ristow <[hidden email]> 08/28/2007 01:58 PM To "R. Abraham" <[hidden email]>, [hidden email] cc Melissa Ives <[hidden email]>, la volta statistics <[hidden email]> Subject Re: Assigning Random Numbers At 10:35 AM 8/28/2007, <R. Abraham> wrote: >I want to generate random numbers that are unique but they should not >be ordered in any way (i.e. ascending or descending order). I have >about 300,000 records. > >For e.g. > >Case# Random Number (Unique) > >1 8 >2 10 See detailed remarks below; but, as a summary, the random-sort method is probably the simplest. This is SPSS 15 draft output (WRR:not saved separately), for 10 cases. (And it worked on the first try, too - "Three things you should be wary of..."): |-----------------------------|---------------------------| |Output Created |28-AUG-2007 13:22:15 | |-----------------------------|---------------------------| Case# Greek 1 Alpha 2 Beta 3 Gamma 4 Delta 5 Epsilon 6 Zeta 7 Eta 8 Theta 9 Iota 10 Kappa Number of cases read: 10 Number of cases listed: 10 NUMERIC RandNmbr (F6). NUMERIC RandOrdr (F9.6). COMPUTE RandOrdr = RV.UNIFORM(0,1). SORT CASES BY RandOrdr. COMPUTE RandNmbr = $CASENUM. LIST. List |-----------------------------|---------------------------| |Output Created |28-AUG-2007 13:22:16 | |-----------------------------|---------------------------| Case# Greek RandNmbr RandOrdr 2 Beta 1 .055204 4 Delta 2 .056053 3 Gamma 3 .089339 1 Alpha 4 .289874 5 Epsilon 5 .602548 8 Theta 6 .639086 10 Kappa 7 .669676 7 Eta 8 .677862 6 Zeta 9 .870447 9 Iota 10 .959446 Number of cases read: 10 Number of cases listed: 10 SORT CASES BY Case#. LIST. List |-----------------------------|---------------------------| |Output Created |28-AUG-2007 13:22:16 | |-----------------------------|---------------------------| Case# Greek RandNmbr RandOrdr 1 Alpha 4 .289874 2 Beta 1 .055204 3 Gamma 3 .089339 4 Delta 2 .056053 5 Epsilon 5 .602548 6 Zeta 9 .870447 7 Eta 8 .677862 8 Theta 6 .639086 9 Iota 10 .959446 10 Kappa 7 .669676 Number of cases read: 10 Number of cases listed: 10 .................................... Regarding suggestions in this thread - At 10:45 AM 8/28/2007, Melissa Ives wrote: >I. Here is some syntax we use to create a new random number between 0 >and 999,999. >... >. compute random=rnd(rv.uniform(0,999999)). It doesn't much matter for a large number like 999,999, but this logic is slightly off: probabilities of getting 0 and 999,999 are half the probabilities for the other numbers. I recommend . compute random=TRUNC(rv.uniform(0,1E6)). II. At 11:14 AM 8/28/2007, Melissa Ives wrote: >We actually use the random number to order the file to create a new >unique variable. > >sort cases by random. >compute xpidnew=1. >exe. >do if lag(xpidnew)>0. >compute xpidnew=lag(xpidnew)+1. >end if. >exe. Exactly, but also note, . If you compute "random" as an integer, like - compute random=TRUNC(rv.uniform(0,1E6)). then there will be ties. That will give a slight preference for randomly sorted cases retaining their original orders in the file. For random sorting, use the raw output of the random-number generator to retain full precision: - compute random=rv.uniform(0,1). . Neither '.exe.' is necessary, and they can be inefficient if the file is large. =================== APPENDIX: Test data =================== DATA LIST FIXED /Case# 01-03 Greek 04-11 (A). BEGIN DATA 1 Alpha 2 Beta 3 Gamma 4 Delta 5 Epsilon 6 Zeta 7 Eta 8 Theta 9 Iota 10 Kappa END DATA. LIST. |
|
In reply to this post by la volta statistics
If you use just rv.uniform(0,1) with no rounding, you'll be assured of uniqueness unless your number of cases exceeds 2 billion. It is rounding that raises the uniqueness issue. If you need integers, multiply by 2**31 and assign a format of F10.
Jonathan Fry SPSS Inc. -----Original Message----- From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of la volta statistics Sent: Tuesday, August 28, 2007 10:19 AM To: [hidden email] Subject: AW: Assigning Random Numbers To make sure that you have unique numbers, you should subsequently aggregate the numbers created with Melissa's code . new file. input program. loop #B=0 to 999999. compute xpid=#B. compute random=rnd(rv.uniform(0,999999)). end case. end loop. end file. end input program. exe. AGGREGAT OUTFILE = * /Break random /U_Nr = N. Compute case = $Casenuum. In my case the Melissa's code produced somewhat over 600,000 unique cases. have fun Christian ******************************* la volta statistics Christian Schmidhauser, Dr.phil.II Weinbergstrasse 108 Ch-8006 Zurich Tel: +41 (043) 233 98 01 Fax: +41 (043) 233 98 02 email: mailto:[hidden email] internet: http://www.lavolta.ch/ -----Ursprungliche Nachricht----- Von: SPSSX(r) Discussion [mailto:[hidden email]]Im Auftrag von Melissa Ives Gesendet: Dienstag, 28. August 2007 16:45 An: [hidden email] Betreff: Re: Assigning Random Numbers Look at the RANDOM() function. It should do what you want. Here is some syntax we use to create a new random number between 0 and 999,999. new file. input program. loop #B=0 to 999999. compute xpid=#B. compute random=rnd(rv.uniform(0,999999)). end case. end loop. end file. end input program. exe. Melissa -----Original Message----- From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of <R. Abraham> Sent: Tuesday, August 28, 2007 9:36 AM To: [hidden email] Subject: [SPSSX-L] Assigning Random Numbers Hi, I want to generate random numbers that are unique but they should not be ordered in any way (i.e. ascending or descending order). I have about 300,000 records. For e.g. Case# Random Number (Unique) 1 8 2 10 3 45 4 35 5 40 6 32 7 50 8 28 9 48 10 22 . . . . . . . . 48 2 49 5 50 18 I know that using the $CASENUM does generate unique numbers but it is in ascending order. Could somebody offer a solution? Thanks. R. Abraham PRIVILEGED AND CONFIDENTIAL INFORMATION This transmittal and any attachments may contain PRIVILEGED AND CONFIDENTIAL information and is intended only for the use of the addressee. If you are not the designated recipient, or an employee or agent authorized to deliver such transmittals to the designated recipient, you are hereby notified that any dissemination, copying or publication of this transmittal is strictly prohibited. If you have received this transmittal in error, please notify us immediately by replying to the sender and delete this copy from your system. You may also call us at (309) 827-6026 for assistance. |
|
At 03:17 PM 8/29/2007, Fry, Jonathan B. wrote:
>If you use just rv.uniform(0,1) with no rounding, you'll be assured of >uniqueness unless your number of cases exceeds 2 billion. It is >rounding that raises the uniqueness issue. If you need integers, >multiply by 2**31 and assign a format of F10. Should you TRUNC as well, or are all but the first 32 bits always 0? (And does this differ between SPSS's two random-number generators?) |
|
-----Original Message-----
From: Richard Ristow [mailto:[hidden email]] Sent: Wednesday, August 29, 2007 3:53 PM To: Fry, Jonathan B.; [hidden email] Subject: [BULK] Re: Assigning Random Numbers Importance: Low At 03:17 PM 8/29/2007, Fry, Jonathan B. wrote: >If you use just rv.uniform(0,1) with no rounding, you'll be assured of >uniqueness unless your number of cases exceeds 2 billion. It is >rounding that raises the uniqueness issue. If you need integers, >multiply by 2**31 and assign a format of F10. Should you TRUNC as well, or are all but the first 32 bits always 0? (And does this differ between SPSS's two random-number generators?) ---------------------- It differs between the RNG's. For the older multiplicative congruential generator (MCG), all the bits after the first 31 are zero. For the Mersenne Twister, all the bits are random. In fact, uniqueness is not guaranteed using only the first 31 bits of the Mersenne algorithm, or even all 53. The period is much longer than 2^53, but individual numbers may reappear. So my suggestion is good only for the MCG, not for the twister. One might reasonably suppose that the possibility of duplication could be ignored with the twister, but I think not. Even if you use all 53 bits, I think there's a substantial possibility of duplication in 300,000 cases. It's a birthday problem with 2^53 possible birthdays and 300,000 tries. I have not worked it out. Jonathan Fry SPSS Inc. |
|
The birthday problem -
At 06:55 PM 8/29/2007, Fry, Jonathan B. wrote: >One might reasonably suppose that the possibility of duplication could >be ignored [when drawing 300,000 numbers] with the Mersenne twister, >but I think not. Even if you use all 53 bits, I think there's a >substantial possibility of duplication in 300,000 cases. It's a >birthday problem with 2^53 possible birthdays and 300,000 tries. I >have not worked it out. The birthday problem is simpler in closed form than it looks at first. Let the population size be N (N=365, for the birthday problem). Draw 1, and there's 0 chance of a duplicate. Draw a 2nd, and there's one chance for a duplicate, or probability 1/N. If that draw isn't a duplicate, draw a 3rd and *at that draw* there are 2 chances for a duplicate, probability 2/N, etc. Generally, the probability that the *first* duplicate occurs on the kth draw is (k-1)/N. So the probability that *any* duplicate occurs *through* the kth draw is 1/N times Sum(i=1,k)(i-1) = Sum(i=0,k-1)(i) = k*(k-1)/2 -- exactly square-law, in k For k=3E5 and N=2^53, that is 4.5E10/9.0E15 = 5E-6 -- pretty small. You'd need to draw about 9.5 million numbers, for a 1% chance of a duplicate (ignoring any non-random properties of the series). |
|
Hi Richard,
I am afraid your calculation in the birthday problem is a little simpler than it should be. You don't take into account conditional probability when deriving your formula. So, in my opinion, the result would be *much* more complex. It is easier to calculate the probability that all the k birthdays are on *different* days. P= 1*(1-1/365)*(1-2/365)*...*(1-(k-1)/365) = 365!/(365^k*(365-k)!) if I am not mistaken. Or, more generally, P= n!/(n^k*(n-k)!) So the probability we wanted to find is 1-P. It is the probability that there will be at least 1 duplication. If k=3E5 and n=9E15, this results in a very similar(*almost* the same) vale to Richard's: 5E-6, but not exactly the same. But if ratio k/n was larger, the result would be much different. Just consider k=30, n=365: your formula would give a value larger than 1 for a probability! Just couldn't miss replying to your note 'simpler in closed form than it looks at first'. I hope you don't mind my correcting the formula. Zoltan Bodai Hungary -----Original Message----- From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of Richard Ristow Sent: Thursday, August 30, 2007 4:45 AM To: [hidden email] Subject: Re: Assigning Random Numbers The birthday problem - At 06:55 PM 8/29/2007, Fry, Jonathan B. wrote: >One might reasonably suppose that the possibility of duplication could >be ignored [when drawing 300,000 numbers] with the Mersenne twister, >but I think not. Even if you use all 53 bits, I think there's a >substantial possibility of duplication in 300,000 cases. It's a >birthday problem with 2^53 possible birthdays and 300,000 tries. I >have not worked it out. The birthday problem is simpler in closed form than it looks at first. Let the population size be N (N=365, for the birthday problem). Draw 1, and there's 0 chance of a duplicate. Draw a 2nd, and there's one chance for a duplicate, or probability 1/N. If that draw isn't a duplicate, draw a 3rd and *at that draw* there are 2 chances for a duplicate, probability 2/N, etc. Generally, the probability that the *first* duplicate occurs on the kth draw is (k-1)/N. So the probability that *any* duplicate occurs *through* the kth draw is 1/N times Sum(i=1,k)(i-1) = Sum(i=0,k-1)(i) = k*(k-1)/2 -- exactly square-law, in k For k=3E5 and N=2^53, that is 4.5E10/9.0E15 = 5E-6 -- pretty small. You'd need to draw about 9.5 million numbers, for a 1% chance of a duplicate (ignoring any non-random properties of the series). |
|
A better calculation of the birthday probability is 1 - P(not matching).
With two people the odds of not matching are 364/365. With three people the odds of not matching are 363/365 ... The probability of not matching is the product of the series: 364/365 * 363/365 * ... With 30 people the odds of not matching are about 30%, so the odds of a match are 1-30% = 70%. You can amaze your friends at parties (or make money betting) with the prediction of a match in a relatively small group --jim -----Original Message----- From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of Bódai Zoltán Sent: Thursday, August 30, 2007 4:39 AM To: [hidden email] Subject: Re: Assigning Random Numbers (Open) Hi Richard, I am afraid your calculation in the birthday problem is a little simpler than it should be. You don't take into account conditional probability when deriving your formula. So, in my opinion, the result would be *much* more complex. It is easier to calculate the probability that all the k birthdays are on *different* days. P= 1*(1-1/365)*(1-2/365)*...*(1-(k-1)/365) = 365!/(365^k*(365-k)!) if I am not mistaken. Or, more generally, P= n!/(n^k*(n-k)!) So the probability we wanted to find is 1-P. It is the probability that there will be at least 1 duplication. If k=3E5 and n=9E15, this results in a very similar(*almost* the same) vale to Richard's: 5E-6, but not exactly the same. But if ratio k/n was larger, the result would be much different. Just consider k=30, n=365: your formula would give a value larger than 1 for a probability! Just couldn't miss replying to your note 'simpler in closed form than it looks at first'. I hope you don't mind my correcting the formula. Zoltan Bodai Hungary -----Original Message----- From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of Richard Ristow Sent: Thursday, August 30, 2007 4:45 AM To: [hidden email] Subject: Re: Assigning Random Numbers The birthday problem - At 06:55 PM 8/29/2007, Fry, Jonathan B. wrote: >One might reasonably suppose that the possibility of duplication could >be ignored [when drawing 300,000 numbers] with the Mersenne twister, >but I think not. Even if you use all 53 bits, I think there's a >substantial possibility of duplication in 300,000 cases. It's a >birthday problem with 2^53 possible birthdays and 300,000 tries. I >have not worked it out. The birthday problem is simpler in closed form than it looks at first. Let the population size be N (N=365, for the birthday problem). Draw 1, and there's 0 chance of a duplicate. Draw a 2nd, and there's one chance for a duplicate, or probability 1/N. If that draw isn't a duplicate, draw a 3rd and *at that draw* there are 2 chances for a duplicate, probability 2/N, etc. Generally, the probability that the *first* duplicate occurs on the kth draw is (k-1)/N. So the probability that *any* duplicate occurs *through* the kth draw is 1/N times Sum(i=1,k)(i-1) = Sum(i=0,k-1)(i) = k*(k-1)/2 -- exactly square-law, in k For k=3E5 and N=2^53, that is 4.5E10/9.0E15 = 5E-6 -- pretty small. You'd need to draw about 9.5 million numbers, for a 1% chance of a duplicate (ignoring any non-random properties of the series). |
|
In reply to this post by Bódai Zoltán
At 05:39 AM 8/30/2007, Zoltan Bodai wrote:
>I am afraid your calculation in the birthday problem is a little >simpler than it should be. You don't take into account conditional >probability when deriving your formula. I'd written, >>The probability that the *first* duplicate occurs on the kth draw is >>(k-1)/N I reasoned: If no duplicate has occurred on draws 1 through k-1, then k-1 different members have been chosen once; the probability of choosing one of them again is (k-1)/N, and the probability of not choosing a duplicate is (N-k+1)/N What I missed is: there's another way not to choose the first duplicate on drawing k; namely, that the first duplicate was chosen earlier. >If k=3E5 and n=9E15, this results in a very similar(*almost* the same) >value to Richard's: 5E-6, but not exactly the same. Yes: in the range where there's little chance a duplicate will have been chosen, the effect I overlooked is small, and the difference will be small. >But if ratio k/n was larger, the result would be much different. Just >consider k=30, n=365: your formula would give a value larger than 1 >for a probability! Yes. And the formula I gave will always overestimate the current, and cumulative, probability of a duplication. Unfortunately, that overestimate will always be enough to matter (I'm not doing calculations) in the range of interest, where there's serious possibility of a duplicate being encountered. If P is the probability of no duplicate through drawing k, then (Zoltan Bodai) >P= n!/(n^k*(n-k)!) > >So the probability we wanted to find is 1-P. It is the probability >that there will be at least 1 duplication. >Just couldn't miss replying to your note 'simpler in closed form than >it looks at first'. I hope you don't mind my correcting the formula. No, I appreciate it. And am reminded how easy it is to make slips in combinatorial probability. -Many thanks, Richard |
| Free forum by Nabble | Edit this page |
