SPSSX Discussion

Occurrences classification


Luca Meyer

Occurrences classification

I am using SPSS 15 and I am looking for a (semi)automatic procedure that
should allow me, for a given variable, to classify all less frequent
occurrences into an "other" occurrence.

For instance, assume a variable deriving from a question such as "What car
do you own?" with a list of some 25 possible brands plus a "do not know"
option:

BRAND 1
BRAND 2
BRAND 3
...
BRAND 22
BRAND 23
BRAND 24
BRAND 25
DO NOT KNOW

Now, what I am looking for is a procedure that:

(a) identifies the most frequent answers (5, for instance, but the number
should be a parameter set in the syntax); let's say BRAND 19, BRAND 5,
BRAND 12, BRAND 2 and BRAND 8
(b) reclassifies all other occurrences into a generic OTHER
(c) reports DO NOT KNOW, and possibly other auxiliary answers (such as
I DO NOT OWN A CAR, etc.), as they are
(d) maintains the order in a CTABLES command (leaving OTHER and the
auxiliary categories last)

In the example, a CTABLES command with descending count order after the
classification should yield:

BRAND 19
BRAND 5
BRAND 12
BRAND 2
BRAND 8
OTHER
DO NOT KNOW

Has any of you already developed something similar? I am doing it manually
for each variable and it is becoming a burden.

Thanks,

Luca

Mr. Luca MEYER
Market research, data analysis & more
www.lucameyer.com <http://www.lucameyer.com/> - Tel: +39.339.495.00.21

Richard Ristow

Re: Occurrences classification

At 05:40 AM 4/21/2007, Luca Meyer wrote:

>I am using SPSS 15 and I am looking for a (semi)automatic procedure
>that should allow me, for a given variable, to classify all less
>frequent occurrences into an "other" occurrence.

This may be a place to start. It's a macro, originally from Raynald
Levesque, to which I've added a lot of bells and whistles. It reproduces
AUTORECODE, but generates the RECODE and VALUE LABELS syntax so that it
can be edited afterwards. It uses AGGREGATE. With parameter OneTime set,
it assigns all one-occurrence values to '999 other'. I haven't tested it
for a long time, I think since its last-modified date of 18 Oct 2004, and
I don't think I had it in perfect form even then.

>Now, what I am looking for is a procedure that:
>
>(a) identifies the main - for instance 5, but such a parameter should
>be set in the syntax - most frequent answers (let's say BRAND 19,
>BRAND 5, BRAND 12, BRAND 2 and BRAND 8)

That'd take a little extra code. You would,
1.) AGGREGATE, to count occurrences of each answer (BRAND19, BRAND5,
...)
2.) Sort by frequency. Mark the first 5 (or however many) records to be
coded individually, and all others to be coded as 'other'.
3.) Re-sort, by alphabetical order or whatever final order you like,
except keeping the values to be individually coded at the head of the
list.
4.) Using code like that in the macro, write the RECODE and VALUE LABEL
statements to an external file.
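
The four steps above can be sketched outside SPSS to check the logic before writing any syntax. This is only a minimal Python illustration, not the macro and not anything shipped with SPSS: the function name classify_top_n and its parameters (top_n, auxiliary, other_label) are made up for this example, and the answers are assumed to be plain strings.

```python
from collections import Counter


def classify_top_n(values, top_n=5, auxiliary=("DO NOT KNOW",), other_label="OTHER"):
    """Return (recoded_values, display_order) for a list of answers.

    Hypothetical helper: keeps the top_n most frequent answers, leaves the
    auxiliary answers untouched, and maps everything else to other_label.
    """
    aux = set(auxiliary)
    # Step 1: count occurrences of each substantive answer (the AGGREGATE step).
    counts = Counter(v for v in values if v not in aux)
    # Step 2: sort by descending frequency and keep the first top_n answers.
    keep = [value for value, _ in counts.most_common(top_n)]
    keep_set = set(keep)
    # Step 4 analogue: the mapping that a generated RECODE would express.
    recoded = [v if (v in keep_set or v in aux) else other_label for v in values]
    # Step 3: final order is the kept answers by frequency, then OTHER,
    # then the auxiliary answers, skipping labels that never occur.
    present = set(recoded)
    order = keep + [lab for lab in (other_label, *auxiliary) if lab in present]
    return recoded, order


answers = (["BRAND 19"] * 9 + ["BRAND 5"] * 7 + ["BRAND 12"] * 5
           + ["BRAND 2"] * 4 + ["BRAND 8"] * 3 + ["BRAND 1"] * 2
           + ["BRAND 3"] + ["DO NOT KNOW"] * 6)
recoded, order = classify_top_n(answers, top_n=5)
print(order)
```

With this made-up data the printed order is the five kept brands by descending count, then OTHER, then DO NOT KNOW, which is the layout asked for in the original question. Ties at the cut-off are broken by first appearance here; a real procedure would need an explicit rule.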

See how far this gets you. I don't pretend it's a complete solution, so
feel free to post with questions.

Richard Ristow

Re: Occurrences classification

At 04:02 AM 4/22/2007, Luca Meyer wrote:

>Thanks Richard,
>I will work on it. Would you mind to give me a URL link to the macro?

My gosh! I wrote all that, and never gave you the macro text. I must
have been sleepier than I thought. (I'm sending this to the list, as
well, under the principle of generally sharing with the whole list.)

The macro isn't on-line. Here's the text, in this note. If it isn't
easy to extract from the note, let me know and I'll send it as an
attachment - but the List won't take that.

Good luck!
Richard

In this version,
. Fixed bug that got line of first occurrence wrong
. Parameter Presort=NO to run AGGREGATE without sorting the data first
(and without the PRESORTED subcommand). More reliable, and considerably
faster, on some large files. (In fact, Presort=NO should probably be
the default, or the only option; see posting on "SORT CASES algorithm",
with Jon Peck's advice.)
. Improved layout of RECODE specifications in output.
. Shortened lines; I hope they don't wrap so much in the e-mail message.

/* Macro !AutoRec, version 18 Oct 2004 */
/* */
/* Writes RECODE and VALUE LABEL syntax to AUTORECODE one */
/* character string variable into a numeric variable. */
/* */
/* Parameters (all keyword):                    Default     */
/* ------------------------                     -------     */
/* SrceVar  Variable to AUTORECODE              <required>  */
/*          (must be character-string)                      */
/* NewVar   Variable to AUTORECODE into         <required>  */
/* RunName  Name of output code file            AutoRcd     */
/* OneTime  If not "NO", all once-occurring     NO          */
/*          values are recoded 999, "Other".                */
/* Restore  If "NO", or not "YES", original     YES         */
/*          file is not restored.                           */
/*          ("NO" saves time & disk space                   */
/*          for large files that can                        */
/*          easily be reloaded.)                            */
/* Presort  If "NO", or not "YES", file is      YES         */
/*          not sorted before AGGREGATE.                    */
/*          (May be faster and more reliable                */
/*          on large files with many more                   */
/*          cases than separate values.)                    */
/*                                                          */
/* Path     Directory for output file(*)        c:\Tmp      */
/* Scratch  Directory for scratch files(*)      c:\Tmp      */
/*          (*) No trailing "\"                             */
/*                                                          */
/* Output                  Contains                         */
/* ------                  --------                         */
/* <Path>\<RunName>.GEN    RECODE !SrceVar into !Newvar,    */
/*                         and VALUE LABELS for !Newvar     */
/* */
/* Purpose: AUTORECODE, allowing post-editing */
/* ------- */
/* Details of service */
/* ------------------ */
/* - Like AUTORECODE, recodes values of variable into */
/* integers starting with 1, in alphabetic order, except */
/* as below */
/* - On each RECODE line, writes number of occurrences and */
/* first occurrence of the value, in comments */
/* - Recodes blank value to 0. */
/* - If parameter "OneTime" is YES, recodes all once- */
/* occurring values to 999, with value label "Other". */
/* */
/* Side effects */
/* ------------ */
/* 1. Creates and leaves on disk, */
/* <Scratch>\AutoRecd-data.SAV (unless RESTORE=NO) */
/* <Scratch>\AutoRecd-summary.SAV */
/* */
/* Known deficiencies */
/* ------------------ */
/* A. Blank value is not handled properly if any value */
/* sorts earlier than blank */
/* B. If variable has 999 or more values, value label */
/* for the 999th is "Other", whether or not */
/* parameter OneTime=YES. */
/* C. Error, if directory specified with trailing "\" */
/* D. First occurrences and occurrence numbers in output */
/* comments are 7 digits; may overflow for files of */
/* more than 10,000,000 records. */
/* E. Overwrites an existing output file without comment */
/* F. Does not run fast, even on small files. (SORT and */
/* AGGREGATE are the time-consuming steps.) */
/* G. Output RECODE syntax is well formatted, but code */
/* to generate it is long and complicated. */
/* H. Has not undergone exhaustive testing */
/* */
/* Acknowledgement */
/* --------------- */
/* Adapted from macro by Raynald Levesque, SPSSX-L */
/* "Syntax reproducing AUTORECODE", 11 Jun 2003 19:59:43 */
/* */
/* . . . . . . . . . . . . . . . . . . . . . . . . . . . . */
/* Adapted by Richard Ristow */

/* - - - - - - - Start macro definition - - - - - - - */
DEFINE !AutoRec ( RunName=!DEFAULT(AutoRcd) !TOKENS(1)
/SrceVar=!TOKENS(1) /NewVar=!TOKENS(1)
/Path =!DEFAULT('c:\Tmp') !TOKENS(1)
/Scratch=!DEFAULT('c:\Tmp') !TOKENS(1)
/OneTime=!DEFAULT(NO) !TOKENS(1)
/Restore=!DEFAULT(YES) !TOKENS(1)
/Presort=!DEFAULT(YES) !TOKENS(1)).

!LET !SPSSout = !QUOTE(!CONCAT(!UNQUOTE(!Path),'\',
!RunName,'.GEN'))
!LET !OrigSAV = !QUOTE(!CONCAT(!UNQUOTE(!Scratch),'\',
'AutoRecd-data.SAV'))
!LET !SmrySAV = !QUOTE(!CONCAT(!UNQUOTE(!Scratch),'\',
'AutoRecd-summary.SAV'))
* Macro variables for files superseded in this version .
!LET !RecdSPS = !QUOTE(!CONCAT(!UNQUOTE(!Path),'\',
!RunName,'-Recode.SPS'))
!LET !LablSPS = !QUOTE(!CONCAT(!UNQUOTE(!Path),'\',
!RunName,'-Value lbl.SPS'))

* Save original file, if it's to be restored later .
!IF (!UPCASE(!Restore) !EQ YES) !THEN
- SAVE OUTFILE= !OrigSAV.
!IFEND

* Develop the data for Recode and Value Labels. .

COMPUTE FrstOccr = $CASENUM.
!IF (!UPCASE(!Presort) !EQ YES) !THEN
- SORT CASES BY !SrceVar.
- AGGREGATE OUTFILE=*
/PRESORTED/BREAK=!SrceVar
/Num_Occr = N
/FrstOccr = MIN(FrstOccr).
!ELSE
- AGGREGATE OUTFILE=*
/BREAK=!SrceVar
/Num_Occr = N
/FrstOccr = MIN(FrstOccr).
!IFEND

* Compute recode target values. Assign values in order but:.
* 1. If first value is blank, assign it 0, not 1. .
* 2. If parameter OneTime is set (not "NO"), assign once- .
* occurring values code 999, "Other". .

NUMERIC #Cur_Tgt (F3).

DO IF ($CASENUM = 1).
- DO IF (!SrceVar = ' ').
. COMPUTE #Cur_Tgt = 0.
. COMPUTE RecNb = 0.
!IF (!UPCASE(!OneTime) !NE NO) !THEN
/* Special-case values with one occurrence */
- ELSE IF (Num_Occr = 1).
. COMPUTE #Cur_Tgt = 0.
. COMPUTE RecNb = 999.
!IFEND
- ELSE.
. COMPUTE #Cur_Tgt = 1.
. COMPUTE RecNb = 1.
- END IF.
ELSE.
!IF (!UPCASE(!OneTime) !NE NO) !THEN
/* Special-case values with one occurrence */
- DO IF (Num_Occr = 1).
. COMPUTE RecNb = 999.
- ELSE.
. COMPUTE #Cur_Tgt = #Cur_Tgt + 1.
. COMPUTE RecNb = #Cur_Tgt.
- END IF.
!ELSE
/* Regular processing for values with one occurrence */
. COMPUTE #Cur_Tgt = #Cur_Tgt + 1.
. COMPUTE RecNb = #Cur_Tgt.
!IFEND
END IF.
* The following was from the Levesque version; the saved .
* file was never used. .
/*-- SAVE OUTFILE = !SmrySAV. /*-*/

* This, added 06 Oct 2004, writes a summary with the .
* records duplicated, for the two output passes needed .
LOOP PASS=1 TO 2.
. XSAVE OUTFILE = !SmrySAV.
END LOOP.
EXECUTE /* REQUIRED 06 Oct 2004 */.

* Re-load the file with duplicated records, and sort .
* to separate the two "passes" for output .
GET FILE=!SmrySAV.
SORT CASES BY PASS !SrceVar.

MATCH FILES FILE=* /BY=PASS /FIRST=first /LAST=last.
DO IF PASS=1.
* Write syntax to recode values.
. NUMERIC #N_OTHR (F3).
. IF (RecNb = 999) #N_OTHR = #N_OTHR + 1.
. DO IF first.
. WRITE OUTFILE=!SPSSout/
'RECODE ' !QUOTE(!SrceVar).
. END IF.
/* All records: write individual RECODE specs. */
. DO IF LENGTH(!SrceVar) LE 35.
* One-line RECODE specifications .
. DO IF (SUBSTR(!SrceVar,LENGTH(!SrceVar),1)= ' ').
. COMPUTE !SrceVar = CONCAT(RTRIM(!SrceVar),'"').
. IF (!SrceVar = '"') !SrceVar = ' "'.
. WRITE OUTFILE=!SPSSout/
' ("'!SrceVar' ='RecNb(F4)
') /* N='Num_Occr(F7)' 1st='FrstOccr(F7)' */'.
. ELSE.
. WRITE OUTFILE=!SPSSout/
' ("'!SrceVar'" ='RecNb(F4)
') /* N='Num_Occr(F7)' 1st='FrstOccr(F7)' */'.
. END IF.
. ELSE.
* Two-line RECODE specifications .
. DO IF (LENGTH(RTRIM(!SrceVar)) LE 33).
. COMPUTE !SrceVar = CONCAT(RTRIM(!SrceVar),'"').
. IF (!SrceVar = '"') !SrceVar = ' "'.
. WRITE OUTFILE=!SPSSout/
' ("'!SrceVar
'=' 40 RecNb(F4)
') /* N='Num_Occr(F7)' 1st='FrstOccr(F7)' */'.
. ELSE IF (SUBSTR(!SrceVar,LENGTH(!SrceVar),1)= ' ').
. COMPUTE !SrceVar = CONCAT(RTRIM(!SrceVar),'"').
. WRITE OUTFILE=!SPSSout/
' ("'!SrceVar.
. WRITE OUTFILE=!SPSSout/
'=' 40 RecNb(F4)
') /* N='Num_Occr(F7)' 1st='FrstOccr(F7)' */'.
. ELSE.
. WRITE OUTFILE=!SPSSout/
' ("'!SrceVar'"'.
. WRITE OUTFILE=!SPSSout/
'=' 40 RecNb(F4)
') /* N='Num_Occr(F7)' 1st='FrstOccr(F7)' */'.
. END IF.
. END IF.
. DO IF last.
. WRITE OUTFILE=!SPSSout/
' INTO '!QUOTE(!NewVar)'.'.
. END IF.

ELSE IF PASS = 2.
* Write syntax for value labels.
. DO IF first.
. WRITE OUTFILE=!SPSSout/
' '/
'ADD VALUE LABELS '!QUOTE(!NewVar).
. END IF.
. DO IF NOT last.
- DO IF RecNb NE 999.
. WRITE OUTFILE=!SPSSout/
' 'RecNb(F4)' "'!SrceVar'"'.
- END IF.
. ELSE.
- DO IF #N_OTHR = 0.
. WRITE OUTFILE=!SPSSout/
' 'RecNb(F4)' "'!SrceVar'".'.
- ELSE.
- DO IF RecNb NE 999.
. WRITE OUTFILE=!SPSSout/
' 'RecNb(F4)' "'!SrceVar'"'.
- END IF.
. WRITE OUTFILE=!SPSSout/
' 999 "Other".'.
- END IF.
. END IF.
ELSE.
. WRITE OUTFILE=!SPSSout/
' BUG! Pass = ' PASS.
END IF.
EXECUTE.

* Restore original file, if it's saved for that purpose .
!IF (!UPCASE(!Restore) !EQ YES) !THEN
- GET FILE= !OrigSAV .
!IFEND
!ENDDEFINE.
/* - - - - - - - End macro definition - - - - - - - */
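
For readers who want to see what the generated file is meant to contain without tracing the WRITE logic, here is a rough Python sketch of the same idea: alphabetic codes from 1, blank recoded to 0, and once-occurring values optionally sent to 999 "Other". It is an illustration only, not a translation of the macro. The function name autorecode_syntax is invented, Python's string sort order may differ from SPSS's, and values containing double quotes are not handled.

```python
from collections import Counter


def autorecode_syntax(values, src_var, new_var, one_time=False):
    """Build RECODE / ADD VALUE LABELS text for a string variable (sketch)."""
    counts = Counter(v.rstrip() for v in values)
    codes = {}
    next_code = 0
    for value in sorted(counts):            # alphabetic order, as AUTORECODE does
        if value == "":
            codes[value] = 0                # blank value is recoded to 0
        elif one_time and counts[value] == 1:
            codes[value] = 999              # once-occurring values become "Other"
        else:
            next_code += 1
            codes[value] = next_code

    def shown(value):
        # A blank is written as a single space inside the quotes.
        return value if value else " "

    lines = [f"RECODE {src_var}"]
    for value, code in codes.items():
        # The comment carries the number of occurrences, as in the macro output.
        lines.append(f'  ("{shown(value)}" = {code})  /* N={counts[value]} */')
    lines.append(f"  INTO {new_var}.")
    lines.append("")
    lines.append(f"ADD VALUE LABELS {new_var}")
    for value, code in codes.items():
        if code != 999:
            lines.append(f'  {code} "{shown(value)}"')
    if 999 in codes.values():
        lines.append('  999 "Other"')
    lines[-1] += "."                        # terminate the last command
    return "\n".join(lines)


print(autorecode_syntax(["b", "a", "a", "c", "b", ""], "brand", "brand_n",
                        one_time=True))
```

The point of producing text rather than recoding directly is the same as in the macro: the output can be edited by hand before it is run.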

Peck, Jon

Re: Occurrences classification

In reply to this post by Luca Meyer

This is pretty easy to do with programmability. Collapsing small cell counts is straightforward, and there is already a function genCategoryList in the spssaux2 supplementary module that allows you to create a category list that is presorted but has special categories placed at the end.

I'll work something up for this later today (Sunday) or tomorrow to add to the SpecialTransforms module.

-Jon Peck
SPSS


Peck, Jon

Re: Occurrences classification

In reply to this post by Luca Meyer

As promised (almost), I have added a function to collapse infrequent variable values to the specialtransforms module available on SPSS Developer Central (www.spss.com/devcentral).

The function CollapseInfrequentValues collapses values whose counts or occurrence percentage is below a specified value into a specified new value. It deletes any value labels for the collapsed categories and creates a new one for the collapse value. It can also define an SPSS macro listing the values in ascending or descending order of frequency with the collapsed value at the end. User and system missing values are ignored.

Here is an example:

begin program.
import specialtransforms

specialtransforms.CollapseInfrequentValues("educcopy", threshold=.20, collapsevalue=99, otherlabel="low count categories")
end program.

In this example, values of the variable educcopy occurring in fewer than 20% of the cases are mapped to the value 99 which is given the value label "low count categories".

The collapse is in place, so you may want to copy the variable(s) first using COMPUTE and APPLY DICTIONARY.

If you are using Ctables, you might want the macro to control the order.

specialtransforms.CollapseInfrequentValues("educcopy", threshold=.20, collapsevalue=99, otherlabel="low count categories", macronames="!educMacro", order="D")

Then you could do
CTABLES /TABLE educcopy /CATEGORIES VARIABLE=educcopy [!educMacro].

That would show the values in descending order of frequency with the "other" category at the end.
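
To make the ordering idea concrete, here is a small Python sketch of how such a macro definition could be built from the data. It is not the code in specialtransforms, whose generated macro text is not shown in this thread; the function name category_macro and its arguments are hypothetical, and the macro body is assumed to be a comma-separated value list of the kind CTABLES accepts inside the square brackets.

```python
from collections import Counter


def category_macro(values, macro_name, collapse_value=99, descending=True):
    """Return DEFINE text listing values by frequency, collapse value last."""
    values = list(values)
    # Count every value except the collapsed "other" code.
    counts = Counter(v for v in values if v != collapse_value)
    # Order by frequency; ties are broken by the value itself.
    ordered = sorted(
        counts,
        key=lambda v: (-counts[v] if descending else counts[v], v),
    )
    # The collapsed category always goes at the end, if it occurs at all.
    if collapse_value in values:
        ordered.append(collapse_value)
    body = ", ".join(str(v) for v in ordered)
    return f"DEFINE {macro_name}() {body} !ENDDEFINE."


educ = [1] * 5 + [2] * 8 + [3] * 2 + [99] * 4
print(category_macro(educ, "!educMacro"))
```

For the made-up data above this prints a one-line macro whose body is 2, 1, 3, 99, so the collapsed value stays last even though it is more frequent than value 3.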

It can process more than one variable at a time and has a few other useful extras.

If the threshold is < 1, it is interpreted as a fraction; if it is >=1, it is interpreted as a count.
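
The threshold rule can be illustrated with a few lines of Python. This is a sketch of the rule as described in this message, not the module's source: the name values_below_threshold is invented, None stands in for missing values, and the fraction is assumed to be taken over non-missing cases, which the message does not state explicitly.

```python
from collections import Counter


def values_below_threshold(values, threshold):
    """Return the sorted values that would be collapsed (illustrative only)."""
    # Missing values (None here) are ignored entirely.
    counts = Counter(v for v in values if v is not None)
    total = sum(counts.values())
    if threshold < 1:
        # Fraction: collapse values whose share of the cases is below it.
        return sorted(v for v, n in counts.items() if n / total < threshold)
    # Count: collapse values occurring fewer than `threshold` times.
    return sorted(v for v, n in counts.items() if n < threshold)


data = [1] * 10 + [2] * 4 + [3] * 2 + [None] * 4
print(values_below_threshold(data, 0.20))   # share-based cut-off
print(values_below_threshold(data, 5))      # count-based cut-off
```

With these 16 non-missing cases, value 3 (2 of 16) falls below the 20% cut-off while value 2 (4 of 16) does not; with a count threshold of 5, both 2 and 3 are collapsed.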

I have only tested this on 15.0.1, but it should also work on 14.0.1 and later. It requires the Python programmability plug-in and supplementary modules spssaux, spssdata, and namedtuple from Developer Central.

I hope you find this useful.

Regards,
Jon Peck
SPSS


zstatman

Re: Occurrences classification

Jon, I ran the following on a dataset of about 2500 cases, variable age
(scale)

begin program.
import specialtransforms

specialtransforms.CollapseInfrequentValues("age", threshold=.20,
collapsevalue=99, otherlabel="low count categories") end program.

I say "ran" but it is really still running. The process line says Running
Program>...

Like the energizer bunny, it is still going?

W

-----Original Message-----
From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of
Peck, Jon
Sent: Tuesday, April 24, 2007 10:03 AM
To: [hidden email]
Subject: Re: Occurrences classification


Will
Statistical Services

============
info.statman@earthlink.net
http://home.earthlink.net/~z_statman/
============

Peck, Jon

Re: Occurrences classification

It's hard to tell from the formatting, but make sure that the
end program.
is on a line by itself. The email probably mangled the example program.

It runs quite quickly, so I think the problem is that it hasn't really started, because it hasn't seen the end program yet.
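
When syntax has been pasted from e-mail, this kind of mangling is easy to miss by eye. The following Python sketch scans a piece of syntax text and reports a BEGIN PROGRAM block that never reaches an END PROGRAM line of its own. The function name find_unclosed_program and the regular expressions are assumptions made for this illustration; they only approximate how SPSS recognizes the two commands.

```python
import re

# A BEGIN PROGRAM command at the start of a line (optionally "begin program python.").
BEGIN = re.compile(r"^\s*begin\s+program\b", re.IGNORECASE)
# END PROGRAM must be the only thing on its line to count as a terminator.
END = re.compile(r"^\s*end\s+program\s*\.?\s*$", re.IGNORECASE)


def find_unclosed_program(syntax_text):
    """Return the line number of an unclosed BEGIN PROGRAM, or None."""
    open_line = None
    for number, line in enumerate(syntax_text.splitlines(), start=1):
        if open_line is None:
            if BEGIN.match(line):
                open_line = number
        elif END.match(line):
            open_line = None
    return open_line


mangled = (
    "begin program.\n"
    "import specialtransforms\n"
    "\n"
    'specialtransforms.CollapseInfrequentValues("age", threshold=.20,\n'
    'collapsevalue=99, otherlabel="low count categories") end program.\n'
)
print(find_unclosed_program(mangled))
```

For the mangled text above the function reports line 1, the BEGIN PROGRAM that is still waiting for its terminator; once "end program." is moved to a line of its own, it returns None.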

-----Original Message-----
From: Will Bailey [Statman] [mailto:[hidden email]]
Sent: Tuesday, April 24, 2007 1:57 PM
To: Peck, Jon; [hidden email]
Subject: RE: Occurrences classification


zstatman

Re: Occurrences classification

Simple enough - My oversight, thought it was

Tks,
W

-----Original Message-----
From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of
Peck, Jon
Sent: Tuesday, April 24, 2007 3:02 PM
To: [hidden email]
Subject: Re: Occurrences classification


Will
Statistical Services

============
info.statman@earthlink.net
http://home.earthlink.net/~z_statman/
============