Hi,

I wrote to you some time ago about a "best quota fit": I have multiple variables over which I have to select my total sample; in other words, a best quota fit from the total sample. For example, I have a total sample of 1,100, out of which I have to select 1,000 cases based on three different variables: Age, Gender, and Region.

Gender: 500:500
Age:    250:300:300:150
Region: 334:333:333

Irrespective of whether this approach makes sense or not, I'm grateful for any help on this.

After some consideration an idea came to me, and I would be glad if you could tell me your thoughts about it. I have a small dataset for testing: Age has 3 categories, Gender has 2, and the data are sorted by a quality variable, so the case at the top (ID 1) has the best score.

ID  Age  Gender
 1   3   1
 2   2   2
 3   3   1
 4   1   2
 5   1   1
 6   2   2
 7   1   1
 8   2   2
 9   3   1
10   3   2

I transpose the data, so I have v1 to v10 (Age) and v11 to v20 (Gender):

3 2 3 1 1 2 1 2 3 3
1 2 1 2 1 2 1 2 1 2

I thought of creating a counter variable with the LAG function, counting the occurrences of the same categories. After that I would set up a loop with # = 1 to 10. In the end I would like to have an n=5 dataset with:

Age:    n=3 for category 3 and n=2 for category 2
Gender: n=3 for category 1 and n=2 for category 2

In iteration #1, the first values of both vectors (v1 to v10: Age, and v11 to v20: Gender) are checked against the desired counts (the first pair belongs to ID 1). If they fit, a flag variable gets the value 1; if not, 0. Then in iteration #2 the second pair is checked, and so on. As a result I would get the IDs with the desired amounts: 1, 2, 3, 6, 9.

Would that be possible? Is the LAG function helpful here? Thanks for your help; if something is unclear I would be more than happy to specify it. :-)
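PS: Trying to make the counting idea concrete, here is a rough, untested sketch in plain syntax on the long (untransposed) data. The target counts are hardcoded from the toy example, and scratch variables (which keep their values across cases) stand in for the LAG-based counter:

* Greedy pass in quality order; data are assumed already sorted best-first.
VECTOR #AgeTarget(3) / #GenTarget(2) / #AgeCnt(3) / #GenCnt(2).
DO IF $CASENUM = 1.
  COMPUTE #AgeTarget2 = 2. /* Age category 2: n=2 */.
  COMPUTE #AgeTarget3 = 3. /* Age category 3: n=3; category 1 stays at 0 */.
  COMPUTE #GenTarget1 = 3. /* Gender category 1: n=3 */.
  COMPUTE #GenTarget2 = 2. /* Gender category 2: n=2 */.
END IF.
COMPUTE Flag = 0.
DO IF (#AgeCnt(Age) LT #AgeTarget(Age) AND #GenCnt(Gender) LT #GenTarget(Gender)).
  COMPUTE Flag = 1.
  COMPUTE #AgeCnt(Age) = #AgeCnt(Age) + 1.
  COMPUTE #GenCnt(Gender) = #GenCnt(Gender) + 1.
END IF.
SELECT IF Flag = 1.
LIST ID Age Gender.

On the toy data this keeps exactly IDs 1, 2, 3, 6, 9.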
You've specified the marginal totals, but would it be alright to fix the sampling for the entire N-way table? E.g.:

Region 1 & Age 1 & Gender 1 = 1000*(1/3)*(1/4)*0.5 = 41 2/3
Region 1 & Age 2 & Gender 1 = 1000*(1/3)*(3/10)*0.5 = 50
etc.

If you run "CROSSTABS Gender BY Age BY Region." you get the total for each cell in the contingency table. Then you can sort by those categories (and shuffle within each cell), and take cases from each cell until the quotas are met. If a particular cell does not have the requisite number, you may then specify rules to take from other cells in the 3-way contingency table. E.g., if there are only 35 people in the "Region 1 & Age 1 & Gender 1" cell, you might decide to take 51 from "Region 1 & Age 2 & Gender 1" and 51 from "Region 1 & Age 2 & Gender 2", etc. Spreading it out like that should keep the marginals close to the intended targets. Something like:

COMPUTE Const = 1.
COMPUTE Rand = RV.UNIFORM(0,1).
SORT CASES BY Gender Age Region Rand.
SPLIT FILE BY Gender Age Region.
CREATE OrderCell = CSUM(Const).
SPLIT FILE OFF.

If the table cells are full, then you can simply do something like:

COMPUTE Keep = 1.
IF Gender = 1 AND Age = 1 AND Region = 1 AND OrderCell > 43 Keep = 0.
IF Gender = 1 AND Age = 2 AND Region = 1 AND OrderCell > 50 Keep = 0.
. . .

I wouldn't be surprised if upon more investigation there is some R program for stratified random sampling that can accomplish this. (Searching "quota sampling" may be throwing Google off a bit compared to "stratified sampling".)
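Going back to the short-cell example: in syntax, that manual reallocation might look like the following hedged sketch (the caps simply mirror the numbers above and are judgment calls, not derived values):

COMPUTE Keep = 1.
* "Region 1 & Age 1 & Gender 1" holds only 35 cases, so take them all.
IF Gender = 1 AND Age = 1 AND Region = 1 AND OrderCell > 35 Keep = 0.
* Make up part of the difference from the neighboring Age-2 cells.
IF Gender = 1 AND Age = 2 AND Region = 1 AND OrderCell > 51 Keep = 0.
IF Gender = 2 AND Age = 2 AND Region = 1 AND OrderCell > 51 Keep = 0.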
In this case, create lookup tables rather than hardcoding any constants in your code.
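The contrast, as a minimal hedged sketch (quota_table is a placeholder name for a dataset holding one row per Gender*Age*Region cell with its quota in CellFreq, along the lines of the previous post):

* Hardcoded: one IF per cell, magic numbers baked into the syntax.
IF Gender = 1 AND Age = 1 AND Region = 1 AND OrderCell > 43 Keep = 0.
* Data driven: quotas live in a lookup table keyed by the cell identifiers.
SORT CASES BY Gender Age Region.
MATCH FILES /FILE=* /TABLE=quota_table /BY Gender Age Region.
COMPUTE Keep = (OrderCell LE CellFreq).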
Hi,
Unfortunately, the unnested (marginal) totals have to be achieved. What do you mean by a lookup table? Thank you!
Something like this, perhaps? The main point is not to litter your syntax with magic numbers in the computations but to have it data driven. Whether this will work depends entirely on how your data are distributed, and it is hard to say whether what you desire is even possible. I normally wouldn't write this sort of thing for free, so I will leave it to you to adapt and refine as will likely be necessary. It should be a decent start, and somewhat easier than what you were roughly sketching with the transpose and the like.

DATA LIST FREE / gender freq_gender.
BEGIN DATA
1 500
2 500
END DATA.
DATASET NAME gender.

DATA LIST FREE / age freq_age.
BEGIN DATA
1 250
2 300
3 300
4 150
END DATA.
DATASET NAME age.

DATA LIST FREE / region freq_region.
BEGIN DATA
1 334
2 333
3 333
END DATA.
DATASET NAME region.

* Build all 24 gender*age*region cells by crossing the three margin tables.
DATASET ACTIVATE gender.
LOOP age=1 TO 4.
XSAVE OUTFILE 'C:\TEMP\gender_age.sav'.
END LOOP.
EXECUTE.
GET FILE 'C:\TEMP\gender_age.sav'.
SORT CASES BY age.
DATASET NAME gender_age.
MATCH FILES /FILE=gender_age /TABLE=age /BY age.
LOOP region=1 TO 3.
XSAVE OUTFILE 'C:\TEMP\gender_age_region.sav'.
END LOOP.
EXECUTE.
GET FILE 'C:\TEMP\gender_age_region.sav'.
SORT CASES BY region.
DATASET NAME gender_age_region.
MATCH FILES /FILE=gender_age_region /TABLE=region /BY region.

* Expected count per cell under independence of the margins.
AGGREGATE OUTFILE=* MODE=ADDVARIABLES /BREAK /Total_N=SUM(freq_gender).
COMPUTE Total_N=Total_N/12.
COMPUTE CellFreq=RND(freq_gender/Total_N * freq_age/Total_N * freq_region).
SORT CASES BY gender age region.
DATASET NAME freq_Lookup_table.
WEIGHT BY CellFreq.
FREQUENCIES gender age region.
LIST.

/* Simulate your data (you don't provide ANY description of the distributions) */.
MATRIX.
SAVE (T({1:1100})) /OUTFILE * /VARIABLES ID.
END MATRIX.
COMPUTE gender=TRUNC(RV.UNIFORM(1,3)).
COMPUTE age=TRUNC(RV.UNIFORM(1,5)).
COMPUTE region=TRUNC(RV.UNIFORM(1,4)).
FREQUENCIES gender age region.
SORT CASES BY gender age region.
MATCH FILES /FILE=* /TABLE=freq_Lookup_table /BY gender age region.
IF $CASENUM EQ 1 OR gender NE LAG(gender) OR age NE LAG(age) OR region NE LAG(region) Counter=1.
IF MISSING(Counter) Counter=LAG(Counter)+1.
COMPUTE Candidate=Counter LE CellFreq.
CROSSTABS TABLES gender BY age BY region BY Candidate.
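As a quick sanity check on the lookup-table arithmetic, worked by hand from the targets above: SUM(freq_gender) over the 24 cells is 24*500 = 12000, so Total_N = 12000/12 = 1000, and for the first cell

CellFreq(gender=1, age=1, region=1) = RND(500/1000 * 250/1000 * 334) = RND(41.75) = 42.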
Thanks David, that is quite a good way to start.
Now I have to find a way to check that not too many cases get marked Candidate=0; in your example, 19 are missing.
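Something along these lines might be a start for that check, though it is an untested sketch and assumes the Candidate and CellFreq variables from your code:

* Per-cell shortfall: realized candidates (GotN) versus quota (CellFreq).
DATASET DECLARE shortcheck.
AGGREGATE OUTFILE=shortcheck /BREAK=gender age region
  /GotN=SUM(Candidate) /CellFreq=FIRST(CellFreq).
DATASET ACTIVATE shortcheck.
COMPUTE Shortfall = CellFreq - GotN.
* List only the cells that come up short.
TEMPORARY.
SELECT IF Shortfall GT 0.
LIST gender age region CellFreq GotN Shortfall.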
One might compare the number required in any particular cell according to the independence formula (straight multiplication of the marginal proportions) to what is actually present, and assign the cells with the smallest amount of slack ("fudge factor") first. After that is done, fill in the rest by some random method. This would probably be most easily achieved using the MATRIX language rather than Base syntax, since MATRIX has access to the complete rectangle of data, which would facilitate building a more elegant iterative algorithm for assignment/test/reassignment. One would also save a lot of time by first determining that the associations between the variables don't completely rule out satisfying the specified marginal criteria; I can envision that being likely in some situations.
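As a hedged Base-syntax starting point for the slack bookkeeping (not the full iterative MATRIX algorithm; it assumes CellFreq has been merged onto the case data as above):

* Per-cell slack = cases available minus cases required.
DATASET DECLARE slack.
AGGREGATE OUTFILE=slack /BREAK=gender age region
  /Available=N /CellFreq=FIRST(CellFreq).
DATASET ACTIVATE slack.
COMPUTE Slack = Available - CellFreq.
* Tightest cells sort to the top and would be assigned first.
SORT CASES BY Slack.
* Any cell with negative Slack cannot meet its quota on its own.
LIST gender age region Available CellFreq Slack.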
At 03:31 PM 8/23/2015, emma78 wrote:
>I have a total sample of 1,100 out of which I have to select 1,000 cases
>based on 3 different variables: Age, Gender, Region.

May I ask whether this is really the right thing to do? You'll be dropping only 10% of your current sample; to get the right counts in the rest (if it can be done at all) you'll be selecting very carefully what to drop, and may drop a sub-population that's unusual in illuminating ways. If you need to match frequencies to a large existing population, for valid comparisons, how about keeping all your cases and weighting to get the right proportions?
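For instance, a minimal hedged sketch, assuming the CellFreq lookup table from earlier in the thread has been merged onto all 1,100 cases:

* Weight to the target cell proportions instead of dropping cases.
* Target proportion = CellFreq/1000; observed proportion = CellN/1100.
AGGREGATE OUTFILE=* MODE=ADDVARIABLES /BREAK=gender age region /CellN=N.
COMPUTE CaseWeight = (CellFreq/1000) / (CellN/1100).
WEIGHT BY CaseWeight.
FREQUENCIES gender age region.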