Re: 'k/n' sampling

classic Classic list List threaded Threaded
1 message Options
Reply | Threaded
Open this post in threaded view
|

Re: 'k/n' sampling

Richard Ristow
I'd offered, in posting "Re: Nested sampling", Wed, 31 Jan 2007
(13:06:33 -0500),

>Jon Peck's noted the COMPLEX SAMPLES module. In the meantime, you can
>use the 'k/n' method to do your sampling, though that doesn't help
>with the analysis. (If it's of real interest, I'll discuss the 'k/n'
>algorithm another time.)

At 04:23 PM 1/31/2007, [hidden email] asked, off-list:

>Richard,
>I would be interested in that 'k/n' algorithm.
>Warmest Regards Paul

(I've been so long at this because I had trouble working out the proof
that all members are selected with the same probability.)

What's colloquially known as the 'k/n' algorithm is a technique for
drawing a sample of a specified exact size K, from a given population
of size N. Its advantages are,
. It works with a single pass through the population data; neither
sorting nor random access are required
. It gives the desired result, namely equi-probable inclusion of all
members of the population. (Proof follows the demonstration listing.)
. It can readily be adapted to take samples from each of as many
subgroups as desired, of an overall population. (I do that, in the
posting "Re: Nested sampling" I mentioned above.)
. With a little more effort, it can be adapted to select a set of
disjoint samples from the population

CAVEAT:
The algorithm REQUIRES, at the beginning, the size of the population
being sampled from. If the number is too low, sampling is not
equi-probable; members near the end of the population are excluded. If
it is too high, the sample may be too small.

Logic:
The algorithm uses two running parameters, conventionally called 'k'
and 'n'. Parameter 'k' is the number of members of the sample not yet
drawn; 'n' is the number of members of the population not yet
considered for drawing. It's called the 'k/n' algorithm because, at
each step, the probability of selecting the population member under
consideration is k/n. Here's SPSS code to carry it out, to select 7
members of a 20-member population.

This is SPSS 15 draft output. As always, test data is at the very end.

*  Sample 7 records from a dataset of 20 all told             .

NUMERIC
    N    /* No. of records not yet considered for sampling    */
    K    /* No. of records still to be drawn  for the sample  */
   (F4)
   /SAMPLED /* Marks current record as in the sample          */
   (F2).

LEAVE N K.
DO IF $CASENUM EQ 1.
.  COMPUTE N = 20 /* No. of records in file. MUST BE CORRECT */.
.  COMPUTE K =  7 /* No. of records desired for the sample   */.
END IF.

DO IF (K/N) GE RV.UNIFORM(0,1).
.  COMPUTE SAMPLED = 1.
.  COMPUTE K       = K-1.
ELSE.
.  COMPUTE SAMPLED = 0.
END IF.
COMPUTE    N       = N-1.

LIST.

List
|-----------------------------|---------------------------|
|Output Created               |02-MAR-2007 15:13:34       |
|-----------------------------|---------------------------|
Rcrd_Num    N    K SAMPLED

    1.00    19    7     0
    2.00    18    6     1
    3.00    17    6     0
    4.00    16    5     1
    5.00    15    4     1
    6.00    14    4     0
    7.00    13    4     0
    8.00    12    4     0
    9.00    11    3     1
   10.00    10    3     0
   11.00     9    3     0
   12.00     8    3     0
   13.00     7    3     0
   14.00     6    3     0
   15.00     5    3     0
   16.00     4    2     1
   17.00     3    1     1
   18.00     2    1     0
   19.00     1    1     0
   20.00     0    0     1

Number of cases read:  20    Number of cases listed:  20


SELECT IF SAMPLED EQ 1.
LIST.

List
|-----------------------------|---------------------------|
|Output Created               |02-MAR-2007 15:13:34       |
|-----------------------------|---------------------------|
Rcrd_Num    N    K SAMPLED

    2.00    18    6     1
    4.00    16    5     1
    5.00    15    4     1
    9.00    11    3     1
   16.00     4    2     1
   17.00     3    1     1
   20.00     0    0     1

Number of cases read:  7    Number of cases listed:  7
======================================================
PROOF that all population members are included in the sample with the
same probability:

Let N be the size of the population, and K the size of the desired
sample. (Note that these are not the same as running parameters n and
k, for the algorithm.) Then, the sample is fraction K/N of the
population.

Let

p(j) be the probability that member j of the population is selected;

k(j) be the number of sample members still to be selected, after
members 1 through j have been considered. (k(j) is a random variable.)

To show: for all j, p(j)=K/N

This is a mathematical-induction proof; it shows that, if the result is
true for j=1 to n, it is true for j=n+1.

To start the induction: by the basic formula, p(1)=K/N.

Induction hypothesis: Suppose that, for all i<=j, p(i)=K/N

Let s(i) be a variable that is 1 if element i is chosen, 0 otherwise;
so E(s(i))=K/N. ("E(v)" = expected value of v, in the
probability-theory sense)

Let S(i) be the total number of elements selected through element i.
(So, k(i)=K-S(i))

Then S(j) = sum(i=1,j):s(i)

By additivity of expected values,
E(S(j)) = sum(i=1,j):E(s(i)) = sum(i-1,j):K/N = j(K/N)

Since k(i) (the number of elements still to be selected after element i
is considered) is K-S(i),
E(k(i)) = K-j(K/N) = K(1-j/N) = K(N-j)/N = (N-j)K/N

Define  n(j)=N-j  (This 'n(j)' is the number of elements still to be
considered after element j. It's the 'n' from the algorithm, as k(j) is
the 'k' from the algorithm)

Then E(k(i)) =[from the above](N-j)K/N = n(j)K/N

Another formula for E(k(i)), summing over a different set of
contingencies:
E(k(i))=sum:(w=0,K)((P(k(i)=w)*w)  (P(<event>)=probability of <event>)

Finally, complete the proof:

By the algorithm, if k(i)=w,
p(j+1)=w/n(j)

Then overall,
p(j+1)=sum(w=0,K):P(k(j)=w)*w/n(j)

p(j+1)=(1/n(j))*sum(w=0,K):P(k(j)=w)

p(j+1)=(1/n(j))*E(k(j))=

p(j+1)=(1/n(j))*(n(j)K/N)

p(j+1)=K/N

Quod, as they say, erat demonsdrandum
===================
APPENDIX: TEST DATA
===================
*  Test data: a dataset with twenty records                  .
NEW FILE.
INPUT PROGRAM.
.  NUMERIC Rcrd_Num (F5.2).
.  LOOP    Rcrd_Num = 1 TO 20.
.     END CASE.
.  END LOOP.
END FILE.
END INPUT PROGRAM.

.  /*--  LIST   /*-*/.