SPSSX Discussion

Best quota fit


emma78

Best quota fit

Hi,

I'm referring to an old topic. Because I have exactly the same question, I have taken the liberty of copying it:

I have multiple variables over which I have to select my total sample; in other words, a best quota fit from the total sample. For example: I have a total sample of 1100, out of which I have to select a sample of 1000 based on 3 different variables: Age, Gender, Region.

Gender: 500:500

Age: 250:300:300:150

Region: 334:333:333

There has been a solution for this problem by down weighting:

*Simulation of your data*.
NEW FILE.
INPUT PROGRAM.
LOOP CASE=1 TO 1100.
COMPUTE G=TRUNC(UNIFORM(2)+1).
COMPUTE A=TRUNC(UNIFORM(4)+1).
COMPUTE R=TRUNC(UNIFORM(3)+1).
LEAVE G A R CASE.
END CASE.
END LOOP.
END FILE.
END INPUT PROGRAM.
**** GET YOUR RAW DATA AND GO for it.
*I have used G A R for your variables gender age region so substitute as needed *.

SORT CASES BY G A R.
SAVE OUTFILE "SortedRawData.sav".
AGGREGATE OUTFILE */ BREAK G A R / N_Obs=N.
COMPUTE N_TOT=SUM(LAG(N_TOT),N_OBS).
SORT CASES BY N_TOT(D).
IF $CASENUM > 1 N_TOT=LAG(N_TOT).
SORT CASES BY G A R.
SAVE OUTFILE "Obs_table.sav" .

* Create a table of desired proportions/counts *.
NEW FILE.
INPUT PROGRAM.
LOOP G=1 TO 2.
LOOP A= 1 TO 4.
LOOP R= 1 TO 3.
COMPUTE CELL=1.
LEAVE G A R.
END CASE.
END LOOP.
END LOOP.
END LOOP.
END FILE.
END INPUT PROGRAM.

RECODE G (1=.5)(2=.5) INTO P_G
/ A (1=.25)(2,3=.3)(4=.15) INTO P_A
/ R (1=.334)(2,3=.333) INTO P_R .
COMPUTE P_C_DES = P_G * P_A * P_R .
COMPUTE F_C_DES=P_C_DES*1000.
SAVE OUTFILE "BaseTableProb.sav".
MATCH FILES / FILE "Obs_table.sav" / FILE "BaseTableProb.sav" / BY G A R.
COMPUTE P_OBS=N_OBS /N_TOT.
COMPUTE C_WEIGHT=(P_C_DES /P_OBS) * (1000/N_TOT).
MATCH FILES / FILE "SortedRawData.sav" /TABLE * / BY G A R.
WEIGHT BY C_WEIGHT.
FREQ G A R.
CROSS / TABLE G BY A BY R /CELLS= ALL.

I would like to know whether it is possible not to down-weight, but instead to mark the 100 completes that can be deleted, so that I end up with a dataset of n=1000.

Thank you so much!

Jignesh Sutar

Re: Best quota fit

David, your solution was one to weight the data to achieve desired marginal proportions i.e. rim weighting / raking (though not interlocked).

The OP (Gaurav) and Emma are both interested in a solution which doesn't weight the data but selects from the larger sample of 1,100 a number of cases (1,000 in this example) that best meet and satisfy the quota requirements/proportions.

So a potential solution would give priority to retaining cases in the quota cells that are furthest below target, while still allowing randomness so that no quota cell is systematically biased by the selection process.

Rich Ulrich

Re: Best quota fit

In reply to this post by emma78

I think it is worth mentioning that throwing away cases is
generally an irrational thing to do when you do not need to
simplify calculations so that they can be done by hand. By that
standard, it has been largely obsolete since the 1970s gave us
all computers that do all the computation.

If you insist on a simplified presentation that you can see with similar
Ns, the down-weight option provides that: What I remember of that
nature seems to be cases with artificial Ns are extrapolated from a
census or general population rather than another group on hand.

Now, I have discarded -- well, set aside, for separate comparison --
the subjects over age 65 in one group; that is fairly safe to do when
the other groups have been defined and selected as younger than that.
It is more problematic, because of what it may do to attributable effects,
when the unbalance is a natural feature of the groups. Of course, if
a variable is irrelevant to everything else, there is little loss in selection
on it, except for loss of power.

Even when the intent is to achieve one-to-one matching for cases, it is
more precise and more powerful to use the variables for statistical control
of the full data set, unless there is a very strong basis for match such as
"left hand versus right hand" or "sibling matches".

I am curious to hear of any justification contrary to the above.

--
Rich Ulrich

> Date: Wed, 22 Apr 2015 02:32:54 -0700

> From: [hidden email]
> Subject: Best quota fit
> To: [hidden email]


emma78

Re: Best quota fit

Hi,
thanks for your reply.
The problem is, although it might seem irrational, that's the procedure:

A client wants n=1000 interviews, we do 1000 interviews +10% over quota so that we can delete those with bad quality (straightliner...). Sometimes those with bad quality are less than 10% so we have 1050 'good interviews' e.g. but the client wants exactly 1000 completes in the final dataset.
Therefore we have to look manually in which cells (age/gender/region for example) we can find interviews we can delete, and we delete them randomly.

Therefore a more automatic version would be great and would save a lot of time.

Thank you!

Rich Ulrich

Re: Best quota fit

Okay, I see the problem. It is also troubling, perhaps, to tell your
client that you are removing cases completely "randomly" when that
suggests you could have seen a different set. Here is what I come up
with.

Insofar as the region is concerned, it is easy to say that this should be
regarded as "quota-sampling"; there is (I gather) no natural reason
for the observed Ns except for your own intent. It should be fine to
remove specific numbers based on region. You might argue the same
about sex, but it might be harder to argue that about age.

By the way, I think I would try to convince my client that 3 x 334 = 1002
is a better number than 1000, for the sake of symmetry.

Before automating, figure out the rational strategy: I think I would see
if I could meet the requirement by dropping from the *earliest* interviews,
on the weak presumption that they might reflect a slightly different standard
or lower quality (because of less experience of the interviewers).

Using region and sex, that suggests trying the strategy of selecting for use
the last 167 males and last 167 females from each region. This is pretty
simple if you have dates, or if you can use ID numbers. This gives you
a unique selection-set, by a method that anyone can replicate. What makes
this less-than-totally-automated is that there could be fewer than 167 in any
of the sets. What cases should be used to fill in? - Without knowing the full
purposes and content, my naive preference would be to try to fill out the Region
from the wrong sex; if that is not available, go to the next-oldest cases, taking
from the other regions alternately.
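
The "last 167 per region and sex" rule described above could be sketched in SPSS syntax roughly as follows. This is only a sketch under assumptions: region, sex and id are assumed variable names, id is assumed to increase with interview order, and seq and keepflag are hypothetical helper variables. It marks cases rather than deleting them, and it does not automate the fill-in step for cells with fewer than 167 cases.

```spss
* Sketch only: region, sex and id are assumed variable names.
* id is assumed to increase with interview order.
SORT CASES BY region (A) sex (A) id (D).
* Number the cases within each region-by-sex cell, latest interview first.
DO IF ($CASENUM EQ 1 OR region NE LAG(region) OR sex NE LAG(sex)).
+ COMPUTE seq = 1.
ELSE.
+ COMPUTE seq = LAG(seq) + 1.
END IF.
* keepflag = 1 marks the last 167 interviews in each cell.
COMPUTE keepflag = (seq LE 167).
EXECUTE.
* Cells showing fewer than 167 marked cases still need to be filled by hand.
CROSSTABS /TABLES=region BY sex BY keepflag.
```

Because the cases are only flagged, the crosstab shows which cells fall short before anything is removed; SELECT IF (keepflag EQ 1) can follow once the fill-in cases have been decided.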

--
Rich Ulrich

> Date: Wed, 22 Apr 2015 11:58:20 -0700

> From: [hidden email]
> Subject: Re: Best quota fit
> To: [hidden email]

Art Kendall

Re: Best quota fit

In reply to this post by emma78

Have you queried your client about this seemingly unreasonable approach?

If it is a matter of paying for only 1000 interviews, and since the interviews have been done, to preserve the integrity of the data you could do the analysis on the best available data set and only bill for 1000 interviews. Even if they only reimburse for 1000 cases, any expenses you have incurred would be sunk cost.

Art Kendall
Social Research Consultants

emma78

Re: Best quota fit

Thanks for your answers.
Unfortunately it's the way it works: the client pays for n=1000 and wants them in perfect quota distribution, and not 1050 but 1000...
So I guess I have to keep doing it manually?

Art Kendall

Re: Best quota fit

1002 would at least give you symmetry of effects.

Art Kendall
Social Research Consultants

emma78

Re: Best quota fit

Hi, sorry I didn't get it, what do you mean by 1002?

Art Kendall

Re: Best quota fit

See Rich Ulrich's post in this discussion.

Art Kendall
Social Research Consultants

Rich Ulrich

Re: Best quota fit

In reply to this post by emma78

I don't know what your job title is or how many levels of interference
there are between some Client and you, so it might be awkward to
push very hard for changes.

However, there is someone who, presumably, will eventually present
the results. If they have any sense, they should be happy to get
the sort of advice that lets them look "more professional" in the eyes
of people who actually understand the process. If there are any of
those people around, they should like it if they are able to answer
the occasional sharp questions.

"How did you happen to get exactly 1000? Okay, you threw away the
earliest excess cases, selectively by region and sex, to hit the
arbitrary target for quota sampling, because if the raters gained skill,
presumably the later would be better.

re: 1002, which I suggested earlier.
"Then, why did you pick 1000 for your fixed total, when that prevents
you from having equal Ns by sex or by region?"

Your unit or company will look better to the client if you point out to
the client that you can help them look better if they will follow your lead.
Your unit or company will look worse to the informed members of the
audience if they feel like they ought to blame you (and not the client).

As a sales pitch to new clients, it works better (I think) if you can tell
the prospect - who saw such a presentation - that the client turned
down your suggestions. If you never say anything, the client might
answer those embarrassing questions honestly by saying, "Well, they
were supposed to be the professionals, and that is what they gave me.
So should I look somewhere else, next time?"

--
Rich Ulrich

> Date: Thu, 23 Apr 2015 11:14:10 -0700

> From: [hidden email]
> Subject: Re: Best quota fit
> To: [hidden email]

emma78

Re: Best quota fit

In reply to this post by emma78

Hi,
I'm sorry, but that's the way it is: the client pays for n=1000 and wants the data to be delivered in this amount.

But nevertheless:
I got it for one criterion, but with more than two it is difficult. I found this syntax by Gene:

String cellid(a4).
Compute cellid=concat(string(age,f1.0),string(region,f2.0),string(sg,f1.0)).
Recode age(1=0.40)… into agep. /* these are the variable's marginal proportions.
Recode region …. Into regionp.
Recode sg … into sgp.
Compute celltarget=rnd(1000*agep*regionp*sgp). /* you may want trunc instead of rnd.
* do a frequencies at this point to check that celltarget sums to 1000, which it may not due to rounding errors.
* Adjust as needed. When done.
Compute draw=uniform(1).
Sort cases by cellid draw.
Do if ($casenum eq 1 or cellid ne lag(cellid)).
+ compute ccn=1.
Else.
+ compute ccn=lag(ccn)+1.
End if.
Compute pick=0.
If (ccn le celltarget) pick=1.
Select if (pick eq 1).
Frequencies cellid.

But unfortunately it stops at n=983. Any tips for this?

Thank you!
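
One plausible reason for ending up below 1000 (an assumption, not something confirmed in the thread) is that some cells contain fewer cases than their celltarget, so the rule "ccn le celltarget" can only keep what is actually there; rounding of celltarget can add to the gap. A minimal sketch for checking this, to be run before the Select if step, might look like the following. It assumes cellid and celltarget already exist on every case; raw, cells, n_avail, n_keep and shortfall are hypothetical names.

```spss
* Sketch only: assumes cellid and celltarget already exist on every case.
DATASET NAME raw.
DATASET DECLARE cells.
* One row per cell with the number of available cases and the target.
AGGREGATE
  /OUTFILE='cells'
  /BREAK=cellid
  /n_avail=N
  /celltarget=FIRST(celltarget).
DATASET ACTIVATE cells.
* A cell can contribute at most the cases it actually contains.
COMPUTE n_keep = MIN(n_avail, celltarget).
COMPUTE shortfall = celltarget - n_keep.
* The sum of n_keep is the sample size the selection can reach.
DESCRIPTIVES VARIABLES=celltarget n_keep shortfall /STATISTICS=SUM.
* List only the cells that cannot meet their target.
TEMPORARY.
SELECT IF (shortfall GT 0).
LIST VARIABLES=cellid n_avail celltarget shortfall.
DATASET ACTIVATE raw.
```

If the sum of n_keep comes out below 1000, the missing cases would have to come from cells that still have surplus cases, for example by raising celltarget there while watching the age, region and gender margins. Leaving out the Select if step keeps pick as a flag, so the surplus cases are only marked rather than deleted.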