Hi everyone,
I have a question that I'm hoping someone can help with or offer some guidance on. I have a variable with 4 categories, and I want to limit each category's frequency to a specific number (10,000), with the exception of one category. For example, let's say variable one has these frequencies:
1 = 800,000
2 = 300,000
3 = 20,000
4 = 9,000
I want to recode this variable into a new one where each category has a frequency of 10,000, except for the last category. So my new recoded variable should look like this:
1 = 10,000
2 = 10,000
3 = 10,000
4 = 9,000
Any advice or suggestions would be greatly appreciated.
Thanks!
Greg
What you are describing entails reducing the number of cases (rows) by several thousand. "Recoding" will not change the number of cases in the file. Are you really asking how to set a filter so that only 10,000 rows per category (9,000 for the last category) are used? If so, do you want the FIRST 10,000 (or 9,000) rows in each category as the file is currently sorted? Or do you want to randomly select 10,000 (or 9,000) from each category?
HTH.
--
Bruce Weaver bweaver@lakeheadu.ca http://sites.google.com/a/lakeheadu.ca/bweaver/ "When all else fails, RTFM." PLEASE NOTE THE FOLLOWING: 1. My Hotmail account is not monitored regularly. To send me an e-mail, please use the address shown above. 2. The SPSSX Discussion forum on Nabble is no longer linked to the SPSSX-L listserv administered by UGA (https://listserv.uga.edu/). |
Thank you for your response. I want to randomly select 10,000 (or 9,000) from each category.
Greg
Something like this (untested) should work. Replace CatVar with the name of your categorical variable.

* A random number for ordering cases within each category.
compute r = rv.uniform(0,1).
sort cases by CatVar r.
* Number the cases within each category: 1 on the first case, then count up.
compute case = ($casenum EQ 1) or CatVar NE LAG(CatVar).
if (case EQ 0) case = LAG(case) + 1.
* This EXECUTE may not be needed.
execute.
select if ((CatVar LE 3) and (case LE 10000)) OR ((CatVar EQ 4) and (case LE 9000)).
frequencies CatVar.

Note that SELECT IF deletes unselected cases from the working file. The other option is to filter them out temporarily without deleting them.

compute f = ((CatVar LE 3) and (case LE 10000)) OR ((CatVar EQ 4) and (case LE 9000)).
filter by f.
frequencies CatVar.

To use all of the cases again:

filter off.

HTH.
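One caveat about the random selection above: the draw depends on the state of the random number generator, so rerunning the syntax will generally select different cases. A minimal sketch of making it repeatable, assuming the MC generator is acceptable for your purposes; the seed value is arbitrary and is not from this thread:

```spss
* Fix the generator and seed before drawing r, so the same cases are selected on every run.
* The seed value below is an arbitrary example.
set rng = mc seed = 20111103.
compute r = rv.uniform(0,1).
```

If you use the Mersenne Twister generator instead (RNG=MT), the starting point is set with MTINDEX rather than SEED.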
--
Bruce Weaver
I am wondering why these unequal
sampling rates are needed. If you are really trying to design a sample,
you might consider the Complex Samples option. If you are trying
to balance a dataset for more effective data mining (Modeler has a built-in
way to do this), you might try calculating weights by raking.
Or you could just do what Bruce suggests.
Jon Peck (no "h") Senior Software Engineer, IBM [hidden email] new phone: 720-342-5621
Just a thought, but can you not produce a weight for each case which depends on the number of cases in each category and the number you want to reduce to? Totally off the top of my head this early in the morning, but something like:

do repeat x = 1 to 4
  /y = 800000 300000 20000 9000
  /z = 10000 10000 10000 9000.
if (CatVar = x) wt = z/y.
end repeat.
weight by wt.

Here CatVar stands for your categorical variable, y holds the current category counts and z the target counts. You can then turn the weight on or off (WEIGHT OFF) without affecting your original data.
John F Hall [hidden email] www.surveyresearch.weebly.com
I know about stratifying in order to save expense when designing a study. And I know that in designing a study, it is most powerful to have equal Ns in two groups, assuming equal variances. And I've used small subsets for preliminary analyses, back in the days when computing time was a real consideration.
But this application we are asked about sounds like a simple case of "throwing away data." Does anybody have an argument or reference about when or why it is ever a good idea to throw away data?
--
Rich Ulrich
I greatly appreciate everyone's (quick) responses!
The specific variable I was referring to identifies racial/ethnic groups taken from census data. My argument for restricting the data size is to prevent variation in racial/ethnic group sizes from influencing the results.
Regards,
Greg
Then use WEIGHTing.
Art Kendall
Social Research Consultants
AGGREGATE
  /OUTFILE=* MODE=ADDVARIABLES
  /BREAK=RaceEthnicOrWhatever
  /@N@=N.
COMPUTE @wt@=10000/@N@.
IF (RaceEthnicOrWhatever EQ 4) @wt@=9000/@N@.
WEIGHT BY @wt@.
--
Please reply to the list and not to my personal email.
Those desiring my consulting or training services please feel free to email me. --- "Nolite dare sanctum canibus neque mittatis margaritas vestras ante porcos ne forte conculcent eas pedibus suis." Cum es damnatorum possederunt porcos iens ut salire off sanguinum cliff in abyssum?" |
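As a quick check on the weighting approach above, the weighted frequencies should reproduce the target sizes. A minimal sketch, assuming the variable names RaceEthnicOrWhatever and @wt@ from the syntax above and the category counts Greg quoted (so the weights would be 10000/800000 = 0.0125, 10000/300000 ≈ 0.0333, 10000/20000 = 0.5 and 9000/9000 = 1):

```spss
* With WEIGHT BY @wt@ in effect, each category should show its target count.
frequencies RaceEthnicOrWhatever.
* Switch weighting off to get the original, unweighted counts back.
weight off.
frequencies RaceEthnicOrWhatever.
```

Unlike SELECT IF, this leaves every case in the working file; only the analysis weights change.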