Hi,
I'm referring to an old topic. Because I have exactly the same question, I have taken the liberty of copying it here:

I have multiple variables over which I have to select my total sample. In other words, you could call it the best quota fit from the total sample. For example: I have a total sample of 1100, out of which I have to select a sample of 1000 based on 3 different variables: Age, Gender, Region.

Gender: 500:500
Age: 250:300:300:150
Region: 334:333:333

There has been a solution for this problem by down-weighting:

*Simulation of your data*.
NEW FILE.
INPUT PROGRAM.
LOOP CASE=1 TO 1100.
COMPUTE G=TRUNC(UNIFORM(2)+1).
COMPUTE A=TRUNC(UNIFORM(4)+1).
COMPUTE R=TRUNC(UNIFORM(3)+1).
LEAVE G A R CASE.
END CASE.
END LOOP.
END FILE.
END INPUT PROGRAM.

**** GET YOUR RAW DATA AND GO for it.
*I have used G A R for your variables gender age region so substitute as needed *.
SORT CASES BY G A R.
SAVE OUTFILE "SortedRawData.sav".
AGGREGATE OUTFILE */ BREAK G A R / N_Obs=N.
COMPUTE N_TOT=SUM(LAG(N_TOT),N_OBS).
SORT CASES BY N_TOT(D).
IF $CASENUM > 1 N_TOT=LAG(N_TOT).
SORT CASES BY G A R.
SAVE OUTFILE "Obs_table.sav" .

* Create a table of desired proportions/counts *.
NEW FILE.
INPUT PROGRAM.
LOOP G=1 TO 2.
LOOP A= 1 TO 4.
LOOP R= 1 TO 3.
COMPUTE CELL=1.
LEAVE G A R.
END CASE.
END LOOP.
END LOOP.
END LOOP.
END FILE.
END INPUT PROGRAM.
RECODE G (1=.5)(2=.5) INTO P_G
  / A (1=.25)(2,3=.3)(4=.15) INTO P_A
  / R (1=.34)(2,3=.33) INTO P_R .
COMPUTE P_C_DES = P_G * P_A * P_R .
COMPUTE F_C_DES=P_C_DES*1000.
SAVE OUTFILE "BaseTableProb.sav".
MATCH FILES / FILE "Obs_table.sav" / FILE "BaseTableProb.sav" / BY G A R.
COMPUTE P_OBS=N_OBS /N_TOT.
COMPUTE C_WEIGHT=(P_C_DES /P_OBS) * (1000/N_TOT).
MATCH FILES / FILE "SortedRawData.sav" /TABLE * / BY G A R.
WEIGHT BY C_WEIGHT.
FREQ G A R.
CROSS / TABLE G BY A BY R /CELLS= ALL.

I would like to know whether it would be possible not to down-weight, but instead to mark the 100 completes that can be deleted, so that I end up with a dataset of n=1000.

Thank you so much! |
David, your solution was one to weight the data to achieve desired marginal proportions i.e. rim weighting / raking (though not interlocked).
The OP (Gaurav) and Emma are both interested in a solution that doesn't weight the data but instead selects, from the larger sample of 1,100, the cases (1,000 in this example) that best satisfy the quota requirements/proportions. So a potential solution would give priority to retaining cases in the quota cells that are furthest below target, while still selecting at random within cells so that no quota cell is systematically biased by the selection process. |
In reply to this post by emma78
I think it is worth mentioning that throwing away cases is generally an irrational thing to do when you do not need to simplify calculations so that they can be done by hand. By that standard, it has been largely obsolete since the 1970s gave us all computers that do all the computation.

If you insist on a simplified presentation with similar Ns, the down-weight option provides that. What I remember of that nature are cases where artificial Ns are extrapolated from a census or general population rather than from another group on hand.

Now, I have discarded -- well, set aside, for separate comparison -- the subjects over age 65 in one group; that is fairly safe to do when the other groups have been defined and selected as younger than that. It is more problematic, because of what it may do to attributable effects, when the imbalance is a natural feature of the groups. Of course, if a variable is irrelevant to everything else, there is little loss in selecting on it, except for loss of power.

Even when the intent is to achieve one-to-one matching of cases, it is more precise and more powerful to use the variables for statistical control of the full data set, unless there is a very strong basis for matching such as "left hand versus right hand" or "sibling matches".

I am curious to hear of any justification contrary to the above.

--
Rich Ulrich

> Date: Wed, 22 Apr 2015 02:32:54 -0700
> From: [hidden email]
> Subject: Best quota fit
> To: [hidden email]
>
> Hi,
>
> I`m referring to an old topic. Because I have exactly the same question I
> have taken the liberty to copy this:
>
> I have multiple variables over which I have to select my total sample. In
> other word, you can say best quota fit from the total sample. for example: I
> have total sample of 1100 out which I have to select 1000 sample based on 3
> different variable Age, Gender, Region.
>
> Gender: 500:500
>
> Age: 250:300:300:150
>
> Region: 334:333:333
>
>
> There has been a solution for this problem by down weighting:
> [snip]

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD
|
Hi,
thanks for your reply. The problem is that, although it might seem irrational, that's the procedure: a client wants n=1000 interviews, so we do 1000 interviews plus 10% over quota, so that we can delete those of bad quality (straightliners...). Sometimes fewer than 10% are of bad quality, so we have, e.g., 1050 'good interviews', but the client wants exactly 1000 completes in the final dataset. We therefore have to check manually in which cells (age/gender/region, for example) there are interviews we can delete, and we delete them randomly. A more automatic version would be great and would save a lot of time. Thank you! |
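The manual routine described above (find the over-filled quota categories, then delete at random from them until exactly n remain) can be sketched as a small greedy loop. This is only an illustration, written in Python rather than SPSS syntax, and it is not a method proposed by anyone in this thread. The function name `mark_surplus`, the variable names and the fixed seed are all assumptions. The loop marks cases instead of deleting them, and as a heuristic it is not guaranteed to hit every marginal target exactly.

```python
import random
from collections import Counter

def mark_surplus(cases, targets, n_keep, seed=2015):
    """Return the indices of the cases to drop so that n_keep cases remain.

    cases   : list of dicts, e.g. {"gender": 1, "age": 3, "region": 2}
    targets : {variable: {category: desired count}}  (marginal quotas)
    All names here are hypothetical; adapt them to your own data.
    """
    rng = random.Random(seed)  # fixed seed so the same selection can be reproduced
    kept = set(range(len(cases)))
    counts = {v: Counter(c[v] for c in cases) for v in targets}

    def surplus(i):
        # How far the categories of case i sit above their targets,
        # summed over all quota variables.
        return sum(counts[v][cases[i][v]] - targets[v][cases[i][v]]
                   for v in targets)

    dropped = set()
    while len(kept) > n_keep:
        scores = {i: surplus(i) for i in kept}
        worst = max(scores.values())
        # Random choice among the cases that are equally "most over quota".
        candidates = sorted(i for i, s in scores.items() if s == worst)
        victim = rng.choice(candidates)
        kept.remove(victim)
        dropped.add(victim)
        for v in targets:
            counts[v][cases[victim][v]] -= 1
    return dropped

# Tiny demonstration: 7 cases in gender 1, 5 in gender 2, quota 5:5.
cases = [{"gender": 1}] * 7 + [{"gender": 2}] * 5
drop = mark_surplus(cases, {"gender": {1: 5, 2: 5}}, n_keep=10)
# Both dropped cases come from gender 1, the only over-filled category.
```

The returned indices could be written back to the data file as a 0/1 marker variable, so that the final n=1000 file is produced with a single selection step and the removed cases stay documented.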
Okay, I see the problem. It is also troubling, perhaps, to tell your client that you are removing cases completely "randomly" when that suggests you could have seen a different set. Here is what I come up with.

Insofar as the region is concerned, it is easy to say that this should be regarded as "quota-sampling"; there is (I gather) no natural reason for the observed Ns except for your own intent. It should be fine to remove specific numbers based on region. You might argue the same about sex, but it might be harder to argue that about age.

By the way, I think I would try to convince my client that 3 x 334 = 1002 is a better number than 1000, for the sake of symmetry.

Before automating, figure out the rational strategy: I think I would see if I could meet the requirement by dropping from the *earliest* interviews, on the weak presumption that they might reflect a slightly different standard or lower quality (because of less experience of the interviewers). Using region and sex, that suggests trying the strategy of selecting for use the last 167 males and last 167 females from each region. This is pretty simple if you have dates, or if you can use ID numbers. This gives you a unique selection-set, by a method that anyone can replicate.

What makes this less-than-totally-automated is that there could be fewer than 167 in any of the sets. What cases should be used to fill in? Without knowing the full purposes and content, my naive preference would be to try to fill out the Region from the wrong sex; if that is not available, go to the next-oldest cases, taking from the other regions alternately.

--
Rich Ulrich

> Date: Wed, 22 Apr 2015 11:58:20 -0700
> From: [hidden email]
> Subject: Re: Best quota fit
> To: [hidden email]
>
> Hi,
> thanks for you reply.
> The problem is - although it might seem irrational- thats the procedure:
>
>
> A client wants n=1000 interviews, we do 1000 interviews +10% over quota so
> that we can delete those with bad quality (straightliner...). Sometimes
> those with bad quality are less than 10% so we have 1050 'good interviews'
> e.g. but the client wants exactly 1000 completes in the final dataset.
> Therefore we have to look manually in which cells (age/gender/region for
> example) we can find interviews we can delete, and we delete them randomly.
>
> Therefore a more automatic version would be great and much time-saving.
>
> Thank you! |
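The "keep the last 167 per region and sex" strategy can be sketched as follows. This is an illustration in Python rather than SPSS syntax, under the assumption that a higher `id` means a later interview. The function name `keep_latest` and the field names `id`, `region` and `sex` are hypothetical. The sketch only reports which cells fall short of 167; the fill-in rule is left as a manual decision, as in the post above.

```python
from collections import defaultdict

def keep_latest(cases, per_cell=167):
    """Keep the per_cell most recent cases in every region x sex cell.

    cases : list of dicts with the (hypothetical) keys "id", "region", "sex";
            a higher id is assumed to mean a later interview.
    Returns (ids_to_keep, shortfall), where shortfall maps each observed cell
    with fewer than per_cell cases to the number of cases it is missing.
    Cells with no cases at all never appear, so check those separately.
    """
    cells = defaultdict(list)
    for c in cases:
        cells[(c["region"], c["sex"])].append(c["id"])

    keep, shortfall = set(), {}
    for cell, ids in cells.items():
        ids.sort()                      # earliest first, latest last
        keep.update(ids[-per_cell:])    # drop only the earliest surplus cases
        if len(ids) < per_cell:
            shortfall[cell] = per_cell - len(ids)
    return keep, shortfall

# Small demonstration with a cell size of 2 instead of 167.
demo = [
    {"id": 1, "region": 1, "sex": 1},
    {"id": 2, "region": 1, "sex": 1},
    {"id": 3, "region": 1, "sex": 1},
    {"id": 4, "region": 1, "sex": 2},
]
keep, shortfall = keep_latest(demo, per_cell=2)
# keep is {2, 3, 4}; shortfall is {(1, 2): 1}
```

Because the rule is deterministic, anyone with the same file and the same ordering variable arrives at the same selection, which is the replicability argument made above.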
In reply to this post by emma78
Have you queried your client about this seemingly unreasonable approach?
If it is a matter of paying for only 1000 interviews, and since the interviews have been done, to preserve the integrity of the data you could do the analysis on the best available data set and only bill for 1000 interviews. Even if they only reimburse for 1000 cases, any expenses you have incurred would be sunk cost.
Art Kendall
Social Research Consultants |
Thanks for your answers.
Unfortunately, that's the way it works: the client pays for n=1000 and wants them in a perfect quota distribution, and not 1050 but 1000... So I guess I have to keep doing it manually? |
1002 would at least give you symmetry of effects.
Art Kendall
Social Research Consultants |
Hi, sorry I didn't get it, what do you mean by 1002?
|
See Rich Ulrich's post in this discussion.
Art Kendall
Social Research Consultants |
In reply to this post by emma78
I don't know what your job title is or how many levels of interference there are between some Client and you, so it might be awkward to push very hard for changes. However, there is someone who, presumably, will eventually present the results. If they have any sense, they should be happy to get the sort of advice that lets them look "more professional" in the eyes of people who actually understand the process. If there are any of those people around, they should like it if they are able to answer the occasional sharp questions.

"How did you happen to get exactly 1000?" Okay, you threw away the earliest excess cases, selectively by region and sex, to hit the arbitrary target for quota sampling, because if the raters gained skill, presumably the later ones would be better.

Re: 1002, which I suggested earlier: "Then, why did you pick 1000 for your fixed total, when that prevents you from having equal Ns by sex or by region?"

Your unit or company will look better to the client if you point out to the client that you can help them look better if they will follow your lead. Your unit or company will look worse to the informed members of the audience if they feel like they ought to blame you (and not the client). As a sales pitch to new clients, it works better (I think) if you can tell the prospect - who saw such a presentation - that the client turned down your suggestions.

If you never say anything, the client might answer those embarrassing questions honestly by saying, "Well, they were supposed to be the professionals, and that is what they gave me. So should I look somewhere else, next time?"

--
Rich Ulrich

> Date: Thu, 23 Apr 2015 11:14:10 -0700
> From: [hidden email]
> Subject: Re: Best quota fit
> To: [hidden email]
>
> Thanks for your answers.
> Unfortunately its the way it works, the client pays for n=1000 and wants
> them in perfect quota distribution and not 1050 but 1000...
> So guess I have to do it manually furthermore?
>
> |
In reply to this post by emma78
Hi,
I'm sorry, but that's the way it is: the client pays for n=1000 and wants the data delivered in exactly that amount. Nevertheless: I got it working for one criterion, but with more than two it is difficult. I found this syntax by Gene:

String cellid(a4).
Compute cellid=concat(string(age,f1.0),string(region,f2.0),string(sg,f1.0)).
Recode age(1=0.40)… into agep. /* this is the variable's marginal proportions.
Recode region …. Into regionp.
Recode sg … into sgp.
Compute celltarget=rnd(500*agep*regionp*sgp). /* you may want trunc instead of rnd.
* do a frequencies at this point to check that celltarget sums to 1000, which it may not due to rounding errors.
* Adjust as needed. When done.
Compute draw=uniform(1).
Sort cases by cellid draw.
Do if ($casenum eq 1 or cellid ne lag(cellid)).
+ compute ccn=1.
Else.
+ compute ccn=lag(ccn)+1.
End if.
Compute pick=0.
If (ccn le celltarget) pick=1.
Select if (pick eq 1).
Frequencies cellid.

But unfortunately it stops at n=983. Any tips for this? Thank you! |
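One plausible reason for ending up below the wanted total with per-cell selection: each cell can contribute at most min(available, target) cases, so any cell with fewer good interviews than its target pulls the total down, and rounded cell targets may not sum to the wanted n in the first place. The sketch below separates the two effects. It is written in Python rather than SPSS syntax, and the function name `diagnose` and the cell labels are hypothetical. It does not claim to reproduce the 983 reported above, since the actual cell counts are not shown.

```python
def diagnose(available, target, n_wanted):
    """Explain why per-cell selection can return fewer than n_wanted cases.

    available : {cell: number of usable cases in that cell}
    target    : {cell: (already rounded) number of cases wanted in that cell}
    Returns (rounding_gap, short_cells, n_selected):
      rounding_gap - n_wanted minus the sum of the rounded cell targets
      short_cells  - {cell: missing cases} for cells with too few cases
      n_selected   - total a "first <target> cases per cell" rule will keep
    """
    rounding_gap = n_wanted - sum(target.values())
    short_cells = {
        cell: t - available.get(cell, 0)
        for cell, t in target.items()
        if available.get(cell, 0) < t
    }
    n_selected = sum(min(available.get(cell, 0), t) for cell, t in target.items())
    return rounding_gap, short_cells, n_selected

# Tiny demonstration with two hypothetical cells.
gap, short, n_sel = diagnose(
    available={"cell_a": 7, "cell_b": 3},
    target={"cell_a": 5, "cell_b": 5},
    n_wanted=10,
)
# gap == 0, short == {"cell_b": 2}, n_sel == 8
```

If the rounding gap is zero, the whole shortfall comes from under-filled cells, and the missing cases have to be taken from cells that still have spare interviews, which is a judgment call about which neighbouring cell is the least harmful substitute.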