SPSSX Discussion

Assigning Random Numbers

Classic

List

Threaded

13 messages Options

<R. Abraham>

Assigning Random Numbers

Hi,
I want to generate random numbers that are unique but they should not be
ordered in any way (i.e. ascending or descending order). I have about
300,000 records.

For e.g.

Case# Random Number (Unique)

1 8
2 10
3 45
4 35
5 40
6 32
7 50
8 28
9 48
10 22
. .
. .
. .
. .
48 2
49 5
50 18

I know that using the $CASENUM does generate unique numbers but it is in
ascending order.

Could somebody offer a solution? Thanks.

R. Abraham

Melissa Ives

Re: Assigning Random Numbers

Look at the RANDOM() function. It should do what you want.
Here is some syntax we use to create a new random number between 0 and
999,999.

new file.
input program.
loop #B=0 to 999999.
compute xpid=#B.
compute random=rnd(rv.uniform(0,999999)).
end case.
end loop.
end file.
end input program.
exe.

Melissa

-----Original Message-----
From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of
<R. Abraham>
Sent: Tuesday, August 28, 2007 9:36 AM
To: [hidden email]
Subject: [SPSSX-L] Assigning Random Numbers

Hi,
I want to generate random numbers that are unique but they should not be
ordered in any way (i.e. ascending or descending order). I have about
300,000 records.

For e.g.

Case# Random Number (Unique)

1 8
2 10
3 45
4 35
5 40
6 32
7 50
8 28
9 48
10 22
. .
. .
. .
. .
48 2
49 5
50 18

I know that using the $CASENUM does generate unique numbers but it is in
ascending order.

Could somebody offer a solution? Thanks.

R. Abraham

PRIVILEGED AND CONFIDENTIAL INFORMATION
This transmittal and any attachments may contain PRIVILEGED AND
CONFIDENTIAL information and is intended only for the use of the
addressee. If you are not the designated recipient, or an employee
or agent authorized to deliver such transmittals to the designated
recipient, you are hereby notified that any dissemination,
copying or publication of this transmittal is strictly prohibited. If
you have received this transmittal in error, please notify us
immediately by replying to the sender and delete this copy from your
system. You may also call us at (309) 827-6026 for assistance.

Melissa Ives

Re: Assigning Random Numbers

In reply to this post by <R. Abraham>

Good point--we actually use the random number to order the file to create a new unique variable.

sort cases by random.
compute xpidnew=1.
exe.
do if lag(xpidnew)>0.
compute xpidnew=lag(xpidnew)+1.
end if.
exe.

Melissa

-----Original Message-----
From: la volta statistics [mailto:[hidden email]]
Sent: Tuesday, August 28, 2007 10:19 AM
To: Melissa Ives; [hidden email]
Subject: AW: Assigning Random Numbers

To make sure that you have unique numbers, you should subsequently aggregate the numbers created with Melissa's code .

new file.
input program.
loop #B=0 to 999999.
compute xpid=#B.
compute random=rnd(rv.uniform(0,999999)).
end case.
end loop.
end file.
end input program.
exe.

AGGREGAT OUTFILE = *
/Break random
/U_Nr = N.

Compute case = $Casenuum.

In my case the Melissa's code produced somewhat over 600,000 unique cases.

have fun
Christian

*******************************
la volta statistics
Christian Schmidhauser, Dr.phil.II
Weinbergstrasse 108
Ch-8006 Zürich
Tel: +41 (043) 233 98 01
Fax: +41 (043) 233 98 02
email: mailto:[hidden email]
internet: http://www.lavolta.ch/

-----Ursprüngliche Nachricht-----
Von: SPSSX(r) Discussion [mailto:[hidden email]]Im Auftrag von Melissa Ives
Gesendet: Dienstag, 28. August 2007 16:45
An: [hidden email]
Betreff: Re: Assigning Random Numbers

Look at the RANDOM() function. It should do what you want.
Here is some syntax we use to create a new random number between 0 and 999,999.

new file.
input program.
loop #B=0 to 999999.
compute xpid=#B.
compute random=rnd(rv.uniform(0,999999)).
end case.
end loop.
end file.
end input program.
exe.

Melissa

-----Original Message-----
From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of <R. Abraham>
Sent: Tuesday, August 28, 2007 9:36 AM
To: [hidden email]
Subject: [SPSSX-L] Assigning Random Numbers

Hi,
I want to generate random numbers that are unique but they should not be ordered in any way (i.e. ascending or descending order). I have about 300,000 records.

For e.g.

Case# Random Number (Unique)

1 8
2 10
3 45
4 35
5 40
6 32
7 50
8 28
9 48
10 22
. .
. .
. .
. .
48 2
49 5
50 18

I know that using the $CASENUM does generate unique numbers but it is in ascending order.

Could somebody offer a solution? Thanks.

R. Abraham

PRIVILEGED AND CONFIDENTIAL INFORMATION
This transmittal and any attachments may contain PRIVILEGED AND CONFIDENTIAL information and is intended only for the use of the addressee. If you are not the designated recipient, or an employee or agent authorized to deliver such transmittals to the designated recipient, you are hereby notified that any dissemination, copying or publication of this transmittal is strictly prohibited. If you have received this transmittal in error, please notify us immediately by replying to the sender and delete this copy from your system. You may also call us at (309) 827-6026 for assistance.

la volta statistics

AW: Assigning Random Numbers

In reply to this post by Melissa Ives

To make sure that you have unique numbers, you should subsequently aggregate
the numbers created with Melissa's code .

new file.
input program.
loop #B=0 to 999999.
compute xpid=#B.
compute random=rnd(rv.uniform(0,999999)).
end case.
end loop.
end file.
end input program.
exe.

AGGREGAT OUTFILE = *
/Break random
/U_Nr = N.

Compute case = $Casenuum.

In my case the Melissa's code produced somewhat over 600,000 unique cases.

have fun
Christian

*******************************
la volta statistics
Christian Schmidhauser, Dr.phil.II
Weinbergstrasse 108
Ch-8006 Zurich
Tel: +41 (043) 233 98 01
Fax: +41 (043) 233 98 02
email: mailto:[hidden email]
internet: http://www.lavolta.ch/

-----Ursprungliche Nachricht-----
Von: SPSSX(r) Discussion [mailto:[hidden email]]Im Auftrag von
Melissa Ives
Gesendet: Dienstag, 28. August 2007 16:45
An: [hidden email]
Betreff: Re: Assigning Random Numbers

Look at the RANDOM() function. It should do what you want.
Here is some syntax we use to create a new random number between 0 and
999,999.

new file.
input program.
loop #B=0 to 999999.
compute xpid=#B.
compute random=rnd(rv.uniform(0,999999)).
end case.
end loop.
end file.
end input program.
exe.

Melissa

-----Original Message-----
From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of
<R. Abraham>
Sent: Tuesday, August 28, 2007 9:36 AM
To: [hidden email]
Subject: [SPSSX-L] Assigning Random Numbers

Hi,
I want to generate random numbers that are unique but they should not be
ordered in any way (i.e. ascending or descending order). I have about
300,000 records.

For e.g.

Case# Random Number (Unique)

1 8
2 10
3 45
4 35
5 40
6 32
7 50
8 28
9 48
10 22
. .
. .
. .
. .
48 2
49 5
50 18

I know that using the $CASENUM does generate unique numbers but it is in
ascending order.

Could somebody offer a solution? Thanks.

R. Abraham

PRIVILEGED AND CONFIDENTIAL INFORMATION
This transmittal and any attachments may contain PRIVILEGED AND
CONFIDENTIAL information and is intended only for the use of the
addressee. If you are not the designated recipient, or an employee
or agent authorized to deliver such transmittals to the designated
recipient, you are hereby notified that any dissemination,
copying or publication of this transmittal is strictly prohibited. If
you have received this transmittal in error, please notify us
immediately by replying to the sender and delete this copy from your
system. You may also call us at (309) 827-6026 for assistance.

Richard Ristow

Re: Assigning Random Numbers

In reply to this post by Melissa Ives

At 10:35 AM 8/28/2007, <R. Abraham> wrote:

>I want to generate random numbers that are unique but they should not
>be ordered in any way (i.e. ascending or descending order). I have
>about 300,000 records.
>
>For e.g.
>
>Case# Random Number (Unique)
>
>1 8
>2 10

See detailed remarks below; but, as a summary, the random-sort method
is probably the simplest. This is SPSS 15 draft output (WRR:not saved
separately), for 10 cases. (And it worked on the first try, too -
"Three things you should be wary of..."):

|-----------------------------|---------------------------|
|Output Created |28-AUG-2007 13:22:15 |
|-----------------------------|---------------------------|
Case# Greek

1 Alpha
2 Beta
3 Gamma
4 Delta
5 Epsilon
6 Zeta
7 Eta
8 Theta
9 Iota
10 Kappa

Number of cases read: 10 Number of cases listed: 10

NUMERIC RandNmbr (F6).
NUMERIC RandOrdr (F9.6).

COMPUTE RandOrdr = RV.UNIFORM(0,1).

SORT CASES BY RandOrdr.
COMPUTE RandNmbr = $CASENUM.
LIST.

List
|-----------------------------|---------------------------|
|Output Created |28-AUG-2007 13:22:16 |
|-----------------------------|---------------------------|
Case# Greek RandNmbr RandOrdr

2 Beta 1 .055204
4 Delta 2 .056053
3 Gamma 3 .089339
1 Alpha 4 .289874
5 Epsilon 5 .602548
8 Theta 6 .639086
10 Kappa 7 .669676
7 Eta 8 .677862
6 Zeta 9 .870447
9 Iota 10 .959446

Number of cases read: 10 Number of cases listed: 10

SORT CASES BY Case#.
LIST.

List
|-----------------------------|---------------------------|
|Output Created |28-AUG-2007 13:22:16 |
|-----------------------------|---------------------------|
Case# Greek RandNmbr RandOrdr

1 Alpha 4 .289874
2 Beta 1 .055204
3 Gamma 3 .089339
4 Delta 2 .056053
5 Epsilon 5 .602548
6 Zeta 9 .870447
7 Eta 8 .677862
8 Theta 6 .639086
9 Iota 10 .959446
10 Kappa 7 .669676

Number of cases read: 10 Number of cases listed: 10
....................................
Regarding suggestions in this thread -

At 10:45 AM 8/28/2007, Melissa Ives wrote:

>I. Here is some syntax we use to create a new random number between 0
>and 999,999.
>...
>. compute random=rnd(rv.uniform(0,999999)).

It doesn't much matter for a large number like 999,999, but this logic
is slightly off: probabilities of getting 0 and 999,999 are half the
probabilities for the other numbers. I recommend

. compute random=TRUNC(rv.uniform(0,1E6)).

II. At 11:14 AM 8/28/2007, Melissa Ives wrote:

>We actually use the random number to order the file to create a new
>unique variable.
>
>sort cases by random.
>compute xpidnew=1.
>exe.
>do if lag(xpidnew)>0.
>compute xpidnew=lag(xpidnew)+1.
>end if.
>exe.

Exactly, but also note,

. If you compute "random" as an integer, like

- compute random=TRUNC(rv.uniform(0,1E6)).

then there will be ties. That will give a slight preference for
randomly sorted cases retaining their original orders in the file. For
random sorting, use the raw output of the random-number generator to
retain full precision:

- compute random=rv.uniform(0,1).

. Neither '.exe.' is necessary, and they can be inefficient if the file
is large.

===================
APPENDIX: Test data
===================
DATA LIST FIXED
/Case# 01-03
Greek 04-11 (A).
BEGIN DATA
1 Alpha
2 Beta
3 Gamma
4 Delta
5 Epsilon
6 Zeta
7 Eta
8 Theta
9 Iota
10 Kappa
END DATA.

LIST.

<R. Abraham>

Re: Assigning Random Numbers

Thanks to all you replied to my question. I was able to successfully
produce the desired results.

- Renji

Richard Ristow <[hidden email]>
08/28/2007 01:58 PM

To
"R. Abraham" <[hidden email]>, [hidden email]
cc
Melissa Ives <[hidden email]>, la volta statistics
<[hidden email]>
Subject
Re: Assigning Random Numbers

At 10:35 AM 8/28/2007, <R. Abraham> wrote:

Fry, Jonathan B.

Re: AW: Assigning Random Numbers

In reply to this post by la volta statistics

If you use just rv.uniform(0,1) with no rounding, you'll be assured of uniqueness unless your number of cases exceeds 2 billion. It is rounding that raises the uniqueness issue. If you need integers, multiply by 2**31 and assign a format of F10.

Jonathan Fry
SPSS Inc.
-----Original Message-----
From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of la volta statistics
Sent: Tuesday, August 28, 2007 10:19 AM
To: [hidden email]
Subject: AW: Assigning Random Numbers

To make sure that you have unique numbers, you should subsequently aggregate
the numbers created with Melissa's code .

new file.
input program.
loop #B=0 to 999999.
compute xpid=#B.
compute random=rnd(rv.uniform(0,999999)).
end case.
end loop.
end file.
end input program.
exe.

AGGREGAT OUTFILE = *
/Break random
/U_Nr = N.

Compute case = $Casenuum.

In my case the Melissa's code produced somewhat over 600,000 unique cases.

have fun
Christian

*******************************
la volta statistics
Christian Schmidhauser, Dr.phil.II
Weinbergstrasse 108
Ch-8006 Zurich
Tel: +41 (043) 233 98 01
Fax: +41 (043) 233 98 02
email: mailto:[hidden email]
internet: http://www.lavolta.ch/

-----Ursprungliche Nachricht-----
Von: SPSSX(r) Discussion [mailto:[hidden email]]Im Auftrag von
Melissa Ives
Gesendet: Dienstag, 28. August 2007 16:45
An: [hidden email]
Betreff: Re: Assigning Random Numbers

Look at the RANDOM() function. It should do what you want.
Here is some syntax we use to create a new random number between 0 and
999,999.

new file.
input program.
loop #B=0 to 999999.
compute xpid=#B.
compute random=rnd(rv.uniform(0,999999)).
end case.
end loop.
end file.
end input program.
exe.

Melissa

-----Original Message-----
From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of
<R. Abraham>
Sent: Tuesday, August 28, 2007 9:36 AM
To: [hidden email]
Subject: [SPSSX-L] Assigning Random Numbers

Hi,
I want to generate random numbers that are unique but they should not be
ordered in any way (i.e. ascending or descending order). I have about
300,000 records.

For e.g.

Case# Random Number (Unique)

1 8
2 10
3 45
4 35
5 40
6 32
7 50
8 28
9 48
10 22
. .
. .
. .
. .
48 2
49 5
50 18

I know that using the $CASENUM does generate unique numbers but it is in
ascending order.

Could somebody offer a solution? Thanks.

R. Abraham

PRIVILEGED AND CONFIDENTIAL INFORMATION
This transmittal and any attachments may contain PRIVILEGED AND
CONFIDENTIAL information and is intended only for the use of the
addressee. If you are not the designated recipient, or an employee
or agent authorized to deliver such transmittals to the designated
recipient, you are hereby notified that any dissemination,
copying or publication of this transmittal is strictly prohibited. If
you have received this transmittal in error, please notify us
immediately by replying to the sender and delete this copy from your
system. You may also call us at (309) 827-6026 for assistance.

Richard Ristow

Re: Assigning Random Numbers

At 03:17 PM 8/29/2007, Fry, Jonathan B. wrote:

>If you use just rv.uniform(0,1) with no rounding, you'll be assured of
>uniqueness unless your number of cases exceeds 2 billion. It is
>rounding that raises the uniqueness issue. If you need integers,
>multiply by 2**31 and assign a format of F10.

Should you TRUNC as well, or are all but the first 32 bits always 0?
(And does this differ between SPSS's two random-number generators?)

Fry, Jonathan B.

Re: [BULK] Re: Assigning Random Numbers

-----Original Message-----
From: Richard Ristow [mailto:[hidden email]]
Sent: Wednesday, August 29, 2007 3:53 PM
To: Fry, Jonathan B.; [hidden email]
Subject: [BULK] Re: Assigning Random Numbers
Importance: Low

At 03:17 PM 8/29/2007, Fry, Jonathan B. wrote:

>If you use just rv.uniform(0,1) with no rounding, you'll be assured of
>uniqueness unless your number of cases exceeds 2 billion. It is
>rounding that raises the uniqueness issue. If you need integers,
>multiply by 2**31 and assign a format of F10.

Should you TRUNC as well, or are all but the first 32 bits always 0?
(And does this differ between SPSS's two random-number generators?)
----------------------
It differs between the RNG's. For the older multiplicative congruential generator (MCG), all the bits after the first 31 are zero. For the Mersenne Twister, all the bits are random. In fact, uniqueness is not guaranteed using only the first 31 bits of the Mersenne algorithm, or even all 53. The period is much longer than 2^53, but individual numbers may reappear.

So my suggestion is good only for the MCG, not for the twister.

One might reasonably suppose that the possibility of duplication could be ignored with the twister, but I think not. Even if you use all 53 bits, I think there's a substantial possibility of duplication in 300,000 cases. It's a birthday problem with 2^53 possible birthdays and 300,000 tries. I have not worked it out.

Jonathan Fry
SPSS Inc.

Richard Ristow

Re: Assigning Random Numbers

The birthday problem -

At 06:55 PM 8/29/2007, Fry, Jonathan B. wrote:

>One might reasonably suppose that the possibility of duplication could
>be ignored [when drawing 300,000 numbers] with the Mersenne twister,
>but I think not. Even if you use all 53 bits, I think there's a
>substantial possibility of duplication in 300,000 cases. It's a
>birthday problem with 2^53 possible birthdays and 300,000 tries. I
>have not worked it out.

The birthday problem is simpler in closed form than it looks at first.
Let the population size be N (N=365, for the birthday problem). Draw 1,
and there's 0 chance of a duplicate. Draw a 2nd, and there's one chance
for a duplicate, or probability 1/N. If that draw isn't a duplicate,
draw a 3rd and *at that draw* there are 2 chances for a duplicate,
probability 2/N, etc.

Generally, the probability that the *first* duplicate occurs on the kth
draw is (k-1)/N. So the probability that *any* duplicate occurs
*through* the kth draw is 1/N times

Sum(i=1,k)(i-1)
= Sum(i=0,k-1)(i)
= k*(k-1)/2 -- exactly square-law, in k

For k=3E5 and N=2^53, that is

4.5E10/9.0E15 = 5E-6 -- pretty small.

You'd need to draw about 9.5 million numbers, for a 1% chance of a
duplicate (ignoring any non-random properties of the series).

Bódai Zoltán

Re: Assigning Random Numbers (Open)

Hi Richard,

I am afraid your calculation in the birthday problem is a little simpler than it should be. You don't take into account conditional probability when deriving your formula.

So, in my opinion, the result would be *much* more complex.

It is easier to calculate the probability that all the k birthdays are on *different* days.

P= 1*(1-1/365)*(1-2/365)*...*(1-(k-1)/365) = 365!/(365^k*(365-k)!) if I am not mistaken.

Or, more generally, P= n!/(n^k*(n-k)!)

So the probability we wanted to find is 1-P. It is the probability that there will be at least 1 duplication.

If k=3E5 and n=9E15, this results in a very similar(*almost* the same) vale to Richard's: 5E-6, but not exactly the same.

But if ratio k/n was larger, the result would be much different. Just consider k=30, n=365: your formula would give a value larger than 1 for a probability!

Just couldn't miss replying to your note 'simpler in closed form than it looks at first'. I hope you don't mind my correcting the formula.

Zoltan Bodai
Hungary

-----Original Message-----
From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of Richard Ristow
Sent: Thursday, August 30, 2007 4:45 AM
To: [hidden email]
Subject: Re: Assigning Random Numbers

The birthday problem -

At 06:55 PM 8/29/2007, Fry, Jonathan B. wrote:

>One might reasonably suppose that the possibility of duplication could
>be ignored [when drawing 300,000 numbers] with the Mersenne twister,
>but I think not. Even if you use all 53 bits, I think there's a
>substantial possibility of duplication in 300,000 cases. It's a
>birthday problem with 2^53 possible birthdays and 300,000 tries. I
>have not worked it out.

The birthday problem is simpler in closed form than it looks at first.
Let the population size be N (N=365, for the birthday problem). Draw 1, and there's 0 chance of a duplicate. Draw a 2nd, and there's one chance for a duplicate, or probability 1/N. If that draw isn't a duplicate, draw a 3rd and *at that draw* there are 2 chances for a duplicate, probability 2/N, etc.

Generally, the probability that the *first* duplicate occurs on the kth draw is (k-1)/N. So the probability that *any* duplicate occurs
*through* the kth draw is 1/N times

Sum(i=1,k)(i-1)
= Sum(i=0,k-1)(i)
= k*(k-1)/2 -- exactly square-law, in k

For k=3E5 and N=2^53, that is

4.5E10/9.0E15 = 5E-6 -- pretty small.

You'd need to draw about 9.5 million numbers, for a 1% chance of a duplicate (ignoring any non-random properties of the series).

Marks, Jim

Re: Assigning Random Numbers (Open)

A better calculation of the birthday probability is 1 - P(not matching).

With two people the odds of not matching are 364/365.
With three people the odds of not matching are 363/365
...

The probability of not matching is the product of the series:
364/365 * 363/365 * ...

With 30 people the odds of not matching are about 30%, so the odds of a match are 1-30% = 70%. You can amaze your friends at parties (or make money betting) with the prediction of a match in a relatively small group

--jim

-----Original Message-----
From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of Bódai Zoltán
Sent: Thursday, August 30, 2007 4:39 AM
To: [hidden email]
Subject: Re: Assigning Random Numbers (Open)

Hi Richard,

I am afraid your calculation in the birthday problem is a little simpler than it should be. You don't take into account conditional probability when deriving your formula.

So, in my opinion, the result would be *much* more complex.

It is easier to calculate the probability that all the k birthdays are on *different* days.

P= 1*(1-1/365)*(1-2/365)*...*(1-(k-1)/365) = 365!/(365^k*(365-k)!) if I am not mistaken.

Or, more generally, P= n!/(n^k*(n-k)!)

So the probability we wanted to find is 1-P. It is the probability that there will be at least 1 duplication.

If k=3E5 and n=9E15, this results in a very similar(*almost* the same) vale to Richard's: 5E-6, but not exactly the same.

But if ratio k/n was larger, the result would be much different. Just consider k=30, n=365: your formula would give a value larger than 1 for a probability!

Just couldn't miss replying to your note 'simpler in closed form than it looks at first'. I hope you don't mind my correcting the formula.

Zoltan Bodai
Hungary

-----Original Message-----
From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of Richard Ristow
Sent: Thursday, August 30, 2007 4:45 AM
To: [hidden email]
Subject: Re: Assigning Random Numbers

The birthday problem -

At 06:55 PM 8/29/2007, Fry, Jonathan B. wrote:

>One might reasonably suppose that the possibility of duplication could
>be ignored [when drawing 300,000 numbers] with the Mersenne twister,
>but I think not. Even if you use all 53 bits, I think there's a
>substantial possibility of duplication in 300,000 cases. It's a
>birthday problem with 2^53 possible birthdays and 300,000 tries. I
>have not worked it out.

The birthday problem is simpler in closed form than it looks at first.
Let the population size be N (N=365, for the birthday problem). Draw 1, and there's 0 chance of a duplicate. Draw a 2nd, and there's one chance for a duplicate, or probability 1/N. If that draw isn't a duplicate, draw a 3rd and *at that draw* there are 2 chances for a duplicate, probability 2/N, etc.

Generally, the probability that the *first* duplicate occurs on the kth draw is (k-1)/N. So the probability that *any* duplicate occurs
*through* the kth draw is 1/N times

Sum(i=1,k)(i-1)
= Sum(i=0,k-1)(i)
= k*(k-1)/2 -- exactly square-law, in k

For k=3E5 and N=2^53, that is

4.5E10/9.0E15 = 5E-6 -- pretty small.

You'd need to draw about 9.5 million numbers, for a 1% chance of a duplicate (ignoring any non-random properties of the series).

Richard Ristow

Re: Assigning Random Numbers (Open)

In reply to this post by Bódai Zoltán

At 05:39 AM 8/30/2007, Zoltan Bodai wrote:

>I am afraid your calculation in the birthday problem is a little
>simpler than it should be. You don't take into account conditional
>probability when deriving your formula.

I'd written,

>>The probability that the *first* duplicate occurs on the kth draw is
>>(k-1)/N

I reasoned: If no duplicate has occurred on draws 1 through k-1, then
k-1 different members have been chosen once; the probability of
choosing one of them again is (k-1)/N, and the probability of not
choosing a duplicate is (N-k+1)/N

What I missed is: there's another way not to choose the first duplicate
on drawing k; namely, that the first duplicate was chosen earlier.

>If k=3E5 and n=9E15, this results in a very similar(*almost* the same)
>value to Richard's: 5E-6, but not exactly the same.

Yes: in the range where there's little chance a duplicate will have
been chosen, the effect I overlooked is small, and the difference will
be small.

>But if ratio k/n was larger, the result would be much different. Just
>consider k=30, n=365: your formula would give a value larger than 1
>for a probability!

Yes. And the formula I gave will always overestimate the current, and
cumulative, probability of a duplication. Unfortunately, that
overestimate will always be enough to matter (I'm not doing
calculations) in the range of interest, where there's serious
possibility of a duplicate being encountered.

If P is the probability of no duplicate through drawing k, then (Zoltan
Bodai)

>P= n!/(n^k*(n-k)!)
>
>So the probability we wanted to find is 1-P. It is the probability
>that there will be at least 1 duplication.

>Just couldn't miss replying to your note 'simpler in closed form than
>it looks at first'. I hope you don't mind my correcting the formula.

No, I appreciate it. And am reminded how easy it is to make slips in
combinatorial probability.

-Many thanks,
Richard