I'd offered, in posting "Re: Nested sampling", Wed, 31 Jan 2007
(13:06:33 -0500),

>Jon Peck noted the COMPLEX SAMPLES module. In the meantime, you can
>use the 'k/n' method to do your sampling, though that doesn't help
>with the analysis. (If it's of real interest, I'll discuss the 'k/n'
>algorithm another time.)

At 04:23 PM 1/31/2007, [hidden email] asked, off-list:

>Richard,
>I would be interested in that 'k/n' algorithm.
>Warmest Regards
>Paul

(I've been so long at this because I had trouble working out the
proof that all members are selected with the same probability.)

What's colloquially known as the 'k/n' algorithm is a technique for
drawing a sample of a specified exact size K from a given population
of size N. Its advantages are,

. It works with a single pass through the population data; neither
  sorting nor random access is required
. It gives the desired result, namely equi-probable inclusion of all
  members of the population. (Proof follows the demonstration
  listing.)
. It can readily be adapted to take samples from each of as many
  subgroups as desired, of an overall population. (I do that, in the
  posting "Re: Nested sampling" I mentioned above.)
. With a little more effort, it can be adapted to select a set of
  disjoint samples from the population

CAVEAT: The algorithm REQUIRES, at the beginning, the size of the
population being sampled from. If that number is too low, sampling is
not equi-probable; members near the end of the population are
excluded. If it is too high, the sample may be too small.

Logic: The algorithm uses two running parameters, conventionally
called 'k' and 'n'. Parameter 'k' is the number of members of the
sample not yet drawn; 'n' is the number of members of the population
not yet considered for drawing. It's called the 'k/n' algorithm
because, at each step, the probability of selecting the population
member under consideration is k/n.

Here's SPSS code to carry it out, to select 7 members of a 20-member
population. This is SPSS 15 draft output. As always, the test data is
at the very end.

* Sample 7 records from a dataset of 20 all told .

NUMERIC
   N        /* No. of records not yet considered for sampling */
   K        /* No. of records still to be drawn for the sample */
            (F4)
  /SAMPLED  /* Marks current record as in the sample */
            (F2).
LEAVE N K.

DO IF $CASENUM EQ 1.
.  COMPUTE N = 20 /* No. of records in file. MUST BE CORRECT */.
.  COMPUTE K =  7 /* No. of records desired for the sample   */.
END IF.

DO IF (K/N) GE RV.UNIFORM(0,1).
.  COMPUTE SAMPLED = 1.
.  COMPUTE K       = K-1.
ELSE.
.  COMPUTE SAMPLED = 0.
END IF.
COMPUTE N = N-1.

LIST.

List
|-----------------------------|---------------------------|
|Output Created               |02-MAR-2007 15:13:34       |
|-----------------------------|---------------------------|

Rcrd_Num    N    K  SAMPLED

    1.00   19    7        0
    2.00   18    6        1
    3.00   17    6        0
    4.00   16    5        1
    5.00   15    4        1
    6.00   14    4        0
    7.00   13    4        0
    8.00   12    4        0
    9.00   11    3        1
   10.00   10    3        0
   11.00    9    3        0
   12.00    8    3        0
   13.00    7    3        0
   14.00    6    3        0
   15.00    5    3        0
   16.00    4    2        1
   17.00    3    1        1
   18.00    2    1        0
   19.00    1    1        0
   20.00    0    0        1

Number of cases read:  20    Number of cases listed:  20

SELECT IF SAMPLED EQ 1.
LIST.

List
|-----------------------------|---------------------------|
|Output Created               |02-MAR-2007 15:13:34       |
|-----------------------------|---------------------------|

Rcrd_Num    N    K  SAMPLED

    2.00   18    6        1
    4.00   16    5        1
    5.00   15    4        1
    9.00   11    3        1
   16.00    4    2        1
   17.00    3    1        1
   20.00    0    0        1

Number of cases read:  7    Number of cases listed:  7
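For anyone who wants to see the same selection rule outside SPSS,
here is a minimal sketch in Python. It is my own illustration, not
part of the original job; the function name kn_sample and the use of
Python's random module are my choices. (It tests U < k/n where the
SPSS job tests k/n >= U; either way the member is taken with
probability k/n.)

import random

def kn_sample(population, sample_size):
    """Single-pass 'k/n' selection: take each member with
    probability k/n, where k = draws still needed and
    n = members not yet considered."""
    n = len(population)            # records not yet considered
    k = sample_size                # records still to be drawn
    chosen = []
    for member in population:
        if random.random() < k / n:    # select with probability k/n
            chosen.append(member)
            k -= 1
        n -= 1
    return chosen

# Example: 7 records from a population of 20, as in the SPSS job above
print(kn_sample(list(range(1, 21)), 7))

When k equals n, the selection probability is 1, so the remaining
members are all taken; that is what guarantees the exact sample size.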
======================================================

PROOF that all population members are included in the sample with the
same probability:

Let N be the size of the population, and K the size of the desired
sample. (Note that these are not the same as the running parameters n
and k of the algorithm.) Then the sample is fraction K/N of the
population.

Let
   p(j) be the probability that member j of the population is
        selected;
   k(j) be the number of sample members still to be selected, after
        members 1 through j have been considered. (k(j) is a random
        variable.)

To show: for all j, p(j)=K/N

This is a mathematical-induction proof; it shows that, if the result
is true for elements 1 through j, it is true for element j+1.

To start the induction: by the basic formula, p(1)=K/N.

Induction hypothesis: Suppose that, for all i<=j, p(i)=K/N

Let s(i) be a variable that is 1 if element i is chosen, 0 otherwise;
so E(s(i))=K/N.
("E(v)" = expected value of v, in the probability-theory sense)

Let S(i) be the total number of elements selected through element i.
(So, k(i)=K-S(i))

Then
   S(j) = sum(i=1,j):s(i)

By additivity of expected values,
   E(S(j)) = sum(i=1,j):E(s(i)) = sum(i=1,j):K/N = j(K/N)

Since k(j) (the number of elements still to be selected after element
j is considered) is K-S(j),
   E(k(j)) = K-j(K/N) = K(1-j/N) = K(N-j)/N = (N-j)K/N

Define n(j)=N-j
(This 'n(j)' is the number of elements still to be considered after
element j. It's the 'n' from the algorithm, as k(j) is the 'k' from
the algorithm.)

Then
   E(k(j)) = [from the above] (N-j)K/N = n(j)K/N

Another formula for E(k(j)), summing over a different set of
contingencies:
   E(k(j)) = sum(w=0,K):(P(k(j)=w)*w)
(P(<event>) = probability of <event>)

Finally, complete the proof:

By the algorithm, if k(j)=w, element j+1 is selected with probability
w/n(j). Then overall,
   p(j+1) = sum(w=0,K):P(k(j)=w)*w/n(j)
   p(j+1) = (1/n(j))*sum(w=0,K):P(k(j)=w)*w
   p(j+1) = (1/n(j))*E(k(j))
   p(j+1) = (1/n(j))*(n(j)K/N)
   p(j+1) = K/N

Quod, as they say, erat demonstrandum.

===================
APPENDIX: TEST DATA
===================

* Test data: a dataset with twenty records .
NEW FILE.
INPUT PROGRAM.
.  NUMERIC Rcrd_Num (F5.2).
.  LOOP Rcrd_Num = 1 TO 20.
.     END CASE.
.  END LOOP.
END FILE.
END INPUT PROGRAM.
/*-- LIST /*-*/.
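As an empirical cross-check of the equal-probability result proved
above, one can repeat the draw many times and tabulate how often each
record lands in the sample; every proportion should settle near
K/N = 7/20 = 0.35. Here is a short Python sketch of such a check; it
is my own illustration, with assumed names, and reuses the kn_sample
function sketched earlier.

import random

def kn_sample(population, sample_size):
    # same single-pass k/n selection as sketched earlier
    n, k, chosen = len(population), sample_size, []
    for member in population:
        if random.random() < k / n:
            chosen.append(member)
            k -= 1
        n -= 1
    return chosen

N, K, TRIALS = 20, 7, 100000
counts = {record: 0 for record in range(1, N + 1)}
for _ in range(TRIALS):
    for record in kn_sample(list(range(1, N + 1)), K):
        counts[record] += 1

# Each proportion should be close to K/N = 0.35
for record, hits in counts.items():
    print(record, round(hits / TRIALS, 3))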