I'd offered, in posting "Re: Nested sampling", Wed, 31 Jan 2007
(13:06:33 -0500),

>Jon Peck noted the COMPLEX SAMPLES module. In the meantime, you can
>use the 'k/n' method to do your sampling, though that doesn't help
>with the analysis. (If it's of real interest, I'll discuss the 'k/n'
>algorithm another time.)

At 04:23 PM 1/31/2007, [hidden email] asked, off-list:

>Richard,
>I would be interested in that 'k/n' algorithm.
>Warmest Regards
>Paul

(I've been so long at this because I had trouble working out the
proof that all members are selected with the same probability.)

What's colloquially known as the 'k/n' algorithm is a technique for
drawing a sample of a specified exact size K from a given population
of size N. Its advantages are,

. It works with a single pass through the population data; neither
  sorting nor random access is required
. It gives the desired result, namely equi-probable inclusion of all
  members of the population. (Proof follows the demonstration
  listing.)
. It can readily be adapted to take samples from each of as many
  subgroups as desired, of an overall population. (I do that, in the
  posting "Re: Nested sampling" I mentioned above.)
. With a little more effort, it can be adapted to select a set of
  disjoint samples from the population

CAVEAT: The algorithm REQUIRES, at the beginning, the size of the
population being sampled from. If that number is too low, sampling is
not equi-probable; members near the end of the population are
excluded. If it is too high, the sample may be too small.

Logic: The algorithm uses two running parameters, conventionally
called 'k' and 'n'. Parameter 'k' is the number of members of the
sample not yet drawn; 'n' is the number of members of the population
not yet considered for drawing. It's called the 'k/n' algorithm
because, at each step, the probability of selecting the population
member under consideration is k/n.

Here's SPSS code to carry it out, to select 7 members of a 20-member
population. This is SPSS 15 draft output. As always, the test data is
at the very end.

* Sample 7 records from a dataset of 20 all told .

NUMERIC
   N        /* No. of records not yet considered for sampling */
   K        /* No. of records still to be drawn for the sample */
            (F4)
  /SAMPLED  /* Marks current record as in the sample */
            (F2).
LEAVE N K.

DO IF $CASENUM EQ 1.
.  COMPUTE N = 20 /* No. of records in file. MUST BE CORRECT */.
.  COMPUTE K =  7 /* No. of records desired for the sample   */.
END IF.

DO IF (K/N) GE RV.UNIFORM(0,1).
.  COMPUTE SAMPLED = 1.
.  COMPUTE K       = K-1.
ELSE.
.  COMPUTE SAMPLED = 0.
END IF.
COMPUTE N = N-1.

LIST.

List
|-----------------------------|---------------------------|
|Output Created               |02-MAR-2007 15:13:34       |
|-----------------------------|---------------------------|

Rcrd_Num    N    K  SAMPLED

    1.00   19    7        0
    2.00   18    6        1
    3.00   17    6        0
    4.00   16    5        1
    5.00   15    4        1
    6.00   14    4        0
    7.00   13    4        0
    8.00   12    4        0
    9.00   11    3        1
   10.00   10    3        0
   11.00    9    3        0
   12.00    8    3        0
   13.00    7    3        0
   14.00    6    3        0
   15.00    5    3        0
   16.00    4    2        1
   17.00    3    1        1
   18.00    2    1        0
   19.00    1    1        0
   20.00    0    0        1

Number of cases read:  20    Number of cases listed:  20

SELECT IF SAMPLED EQ 1.
LIST.

List
|-----------------------------|---------------------------|
|Output Created               |02-MAR-2007 15:13:34       |
|-----------------------------|---------------------------|

Rcrd_Num    N    K  SAMPLED

    2.00   18    6        1
    4.00   16    5        1
    5.00   15    4        1
    9.00   11    3        1
   16.00    4    2        1
   17.00    3    1        1
   20.00    0    0        1

Number of cases read:  7    Number of cases listed:  7
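For anyone who wants to see the same selection rule outside SPSS,
here is a minimal sketch in Python. It is my own illustration, not
part of the original job; the function name kn_sample and the use of
Python's random module are my choices. (It tests U < k/n where the
SPSS job tests k/n >= U; either way the member is taken with
probability k/n.)

import random

def kn_sample(population, sample_size):
    """Single-pass 'k/n' selection: take each member with
    probability k/n, where k = draws still needed and
    n = members not yet considered."""
    n = len(population)            # records not yet considered
    k = sample_size                # records still to be drawn
    chosen = []
    for member in population:
        if random.random() < k / n:    # select with probability k/n
            chosen.append(member)
            k -= 1
        n -= 1
    return chosen

# Example: 7 records from a population of 20, as in the SPSS job above
print(kn_sample(list(range(1, 21)), 7))

When k equals n, the selection probability is 1, so the remaining
members are all taken; that is what guarantees the exact sample size.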
======================================================

PROOF that all population members are included in the sample with the
same probability:

Let N be the size of the population, and K the size of the desired
sample. (Note that these are not the same as the running parameters n
and k of the algorithm.) Then the sample is fraction K/N of the
population.

Let
   p(j) be the probability that member j of the population is
        selected;
   k(j) be the number of sample members still to be selected, after
        members 1 through j have been considered. (k(j) is a random
        variable.)

To show: for all j, p(j)=K/N

This is a mathematical-induction proof; it shows that, if the result
is true for elements 1 through j, it is true for element j+1.

To start the induction: by the basic formula, p(1)=K/N.

Induction hypothesis: Suppose that, for all i<=j, p(i)=K/N

Let s(i) be a variable that is 1 if element i is chosen, 0 otherwise;
so E(s(i))=K/N.
("E(v)" = expected value of v, in the probability-theory sense)

Let S(i) be the total number of elements selected through element i.
(So, k(i)=K-S(i))

Then
   S(j) = sum(i=1,j):s(i)

By additivity of expected values,
   E(S(j)) = sum(i=1,j):E(s(i)) = sum(i=1,j):K/N = j(K/N)

Since k(j) (the number of elements still to be selected after element
j is considered) is K-S(j),
   E(k(j)) = K-j(K/N) = K(1-j/N) = K(N-j)/N = (N-j)K/N

Define n(j)=N-j
(This 'n(j)' is the number of elements still to be considered after
element j. It's the 'n' from the algorithm, as k(j) is the 'k' from
the algorithm.)

Then
   E(k(j)) = [from the above] (N-j)K/N = n(j)K/N

Another formula for E(k(j)), summing over a different set of
contingencies:
   E(k(j)) = sum(w=0,K):(P(k(j)=w)*w)
(P(<event>) = probability of <event>)

Finally, complete the proof:

By the algorithm, if k(j)=w, element j+1 is selected with probability
w/n(j). Then overall,
   p(j+1) = sum(w=0,K):P(k(j)=w)*w/n(j)
   p(j+1) = (1/n(j))*sum(w=0,K):P(k(j)=w)*w
   p(j+1) = (1/n(j))*E(k(j))
   p(j+1) = (1/n(j))*(n(j)K/N)
   p(j+1) = K/N

Quod, as they say, erat demonstrandum.

===================
APPENDIX: TEST DATA
===================

* Test data: a dataset with twenty records .
NEW FILE.
INPUT PROGRAM.
.  NUMERIC Rcrd_Num (F5.2).
.  LOOP Rcrd_Num = 1 TO 20.
.     END CASE.
.  END LOOP.
END FILE.
END INPUT PROGRAM.
/*-- LIST /*-*/.
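As an empirical cross-check of the equal-probability result proved
above, one can repeat the draw many times and tabulate how often each
record lands in the sample; every proportion should settle near
K/N = 7/20 = 0.35. Here is a short Python sketch of such a check; it
is my own illustration, with assumed names, and reuses the kn_sample
function sketched earlier.

import random

def kn_sample(population, sample_size):
    # same single-pass k/n selection as sketched earlier
    n, k, chosen = len(population), sample_size, []
    for member in population:
        if random.random() < k / n:
            chosen.append(member)
            k -= 1
        n -= 1
    return chosen

N, K, TRIALS = 20, 7, 100000
counts = {record: 0 for record in range(1, N + 1)}
for _ in range(TRIALS):
    for record in kn_sample(list(range(1, N + 1)), K):
        counts[record] += 1

# Each proportion should be close to K/N = 0.35
for record, hits in counts.items():
    print(record, round(hits / TRIALS, 3))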