recoding categories and random sample

recoding categories and random sample

Greg
Hi everyone,

I have a question that I'm hoping someone can either help me with or offer some guidance on.

I have a variable with 4 categories, and I want to limit each category's frequency to a specific number (10,000), with the exception of one category. For example, let's say the variable has these frequencies:

1 = 800,000
2 = 300,000
3 = 20,000
4 = 9,000

I want to recode the above variable into a new one where each category has a total frequency of 10,000, except for the last category.

So my new recoded variable should look like this:

1 = 10,000
2 = 10,000
3 = 10,000
4 = 9,000

Any advice or suggestions are greatly appreciated.

Thanks!
Greg

Re: recoding categories and random sample

Bruce Weaver
Administrator
In reply to this post by Greg
What you are describing entails reducing the number of cases (rows) by several thousand. "Recoding" will not change the number of cases in the file. Are you really asking how to set a filter so that only 10,000 rows per category (9,000 for the last category) are used? If so, do you want the FIRST 10,000 (or 9,000) rows in each category as the file is currently sorted? Or do you want to randomly select 10,000 (or 9,000) from each category?

HTH.


--
Bruce Weaver
bweaver@lakeheadu.ca
http://sites.google.com/a/lakeheadu.ca/bweaver/

"When all else fails, RTFM."

PLEASE NOTE THE FOLLOWING: 
1. My Hotmail account is not monitored regularly. To send me an e-mail, please use the address shown above.
2. The SPSSX Discussion forum on Nabble is no longer linked to the SPSSX-L listserv administered by UGA (https://listserv.uga.edu/).

Re: recoding categories and random sample

Greg
Thank you for your response. I want to randomly select 10,000 (or 9,000) from each category.

Greg

Re: recoding categories and random sample

Bruce Weaver
Administrator
Something like this (untested) should work.  Replace CatVar with the name of your categorical variable.

compute r = rv.uniform(0,1). /* a random number.
sort cases by CatVar r.
compute case = ($casenum EQ 1) or CatVar NE LAG(CatVar).
if (case EQ 0) case = LAG(case) + 1.
execute. /* this EXECUTE may not be needed .

select if ((CatVar LE 3) and (case LE 10000)) OR
           ((CatVar EQ 4) and (case LE 9000)).
frequencies CatVar.


Note that SELECT IF deletes unselected cases from the working file.  The other option is to filter them out temporarily without deleting them.

compute f = ((CatVar LE 3) and (case LE 10000)) OR
           ((CatVar EQ 4) and (case LE 9000)).
filter by f.
frequencies CatVar.

To use all of the cases again:

filter off.
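
If the draw needs to be repeatable, one variation (a sketch only, untested) is to fix the random seed first and let RANK do the within-category numbering instead of the LAG counter. The seed value and the variable name pick are arbitrary choices for illustration, and the SET line assumes the MC random number generator is acceptable for your purposes.

```spss
* Sketch only (untested). CatVar is the placeholder name used above.
* The seed value and the name pick are arbitrary assumptions.
set rng = mc seed = 20111103.
compute r = rv.uniform(0,1).
* Number the cases from 1 upward within each category, in random order.
rank variables = r by CatVar
  /rank into pick
  /ties = low
  /print = no.
compute f = ((CatVar LE 3) and (pick LE 10000)) or
            ((CatVar EQ 4) and (pick LE 9000)).
filter by f.
frequencies CatVar.
```

Because RANK is a procedure, no separate EXECUTE is needed, and the file does not have to be sorted first.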

HTH.


Re: recoding categories and random sample

Jon K Peck
 I am wondering why these unequal sampling rates are needed.  If you are really trying to design a sample, you might consider the Complex Samples option.  If you are trying to balance a dataset for more effective data mining (Modeler has a built-in way to do this), you might try calculating weights by raking.

Or you could just do what Bruce suggests.

Jon Peck (no "h")
Senior Software Engineer, IBM
[hidden email]
new phone: 720-342-5621




=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD



Re: recoding categories and random sample

John F Hall
In reply to this post by Greg
Just a thought, but can you not produce a weight for each case which depends
on the number of cases in each category and the number you want to reduce
to?  Totally off the top of my head this early in the morning, but something
like:

* CatVar stands for your categorical variable .
do repeat x = 1 to 4
   /y = 800000, 300000, 20000, 9000
   /z = 10000, 10000, 10000, 9000 .
if (CatVar = x) wt = z/y .
end repeat .
weight by wt .

You can then turn the weight on or off (WEIGHT BY wt, WEIGHT OFF) without affecting your original data.
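
To see what such weights amount to: each one is the target count divided by the current count, so for the targets in the original post they work out to 10000/800000 = 0.0125, 10000/300000 = 0.0333 (approximately), 10000/20000 = 0.5 and 9000/9000 = 1. A quick check (a sketch only, untested; CatVar is a placeholder for the categorical variable, and wt is assumed to already hold those weights) is to compare the unweighted and weighted frequencies:

```spss
* Sketch only (untested). CatVar and wt are placeholder names.
* Unweighted run: shows the original counts.
weight off.
frequencies CatVar.
* Weighted run: should show about 10,000 per category, 9,000 for the last.
weight by wt.
frequencies CatVar.
weight off.
```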

John F Hall

[hidden email]
www.surveyresearch.weebly.com








Re: recoding categories and random sample

Rich Ulrich
I know about stratifying in order to save expense, when
designing a study.   And I know that in designing a study,
it is most powerful to have equal Ns in two groups, assuming
equal variances.  And I've used small subsets for preliminary
analyses, back in the days when computing time was a real
consideration.

But this application we are asked about sounds like a simple
case of  "throwing away data."  Does anybody have an argument
or reference about when or why it is ever a good idea to throw
away data?

--
Rich Ulrich



Re: recoding categories and random sample

Greg
I greatly appreciate everyone's (quick) response!  

The specific variable I was referring to identified racial/ethnic groups taken from census data. My argument for restricting the data size is to prevent variation in racial/ethnic group sizes influencing the results.

Regards,
Greg

Re: recoding categories and random sample

Art Kendall
Then use WEIGHTing.

Art Kendall
Social Research Consultants


Re: recoding categories and random sample

David Marso
Administrator
In reply to this post by Greg
AGGREGATE /OUTFILE=* MODE=ADDVARIABLES /BREAK=RaceEthnicOrWhatever /@N@=N.
COMPUTE @wt@=10000/@N@.
IF (RaceEthnicOrWhatever EQ 4) @wt@=9000/@N@.
WEIGHT BY @wt@.
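
To inspect the weights that result, a minimal sketch (untested), assuming @N@ and @wt@ were created as above: with weighting switched off, MEANS lists the weight value and the unweighted case count for each category, and the last line switches the weight back on.

```spss
* Sketch only (untested); uses the variable names from the syntax above.
WEIGHT OFF.
MEANS TABLES=@wt@ BY RaceEthnicOrWhatever
  /CELLS=MEAN COUNT.
WEIGHT BY @wt@.
```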
--

Please reply to the list and not to my personal email.
Those desiring my consulting or training services please feel free to email me.
---
"Nolite dare sanctum canibus neque mittatis margaritas vestras ante porcos ne forte conculcent eas pedibus suis."
Cum es damnatorum possederunt porcos iens ut salire off sanguinum cliff in abyssum?"