Autorecode Strangeness

Autorecode Strangeness

Leslie Horst
I'm running version 14.0.2.

I have a file with approximately 103,000 cases, people who have some relationship with a particular university.  (Many are graduates, but not all.)  They have (among many other variables) up to 6 majors and up to 4 minors, designated major1 ... major6 and minor1 ... minor4 for each person in the database. These are all string variables (of varying lengths) that simply list the major or minor.

I originally imported this file from csv, reading it into Excel 2007, in order to add a bogus record that tells SPSS how to read the incoming variables.  For something like the designation of a major I put xxxxxxxxxxxxxxxxxxx in the first (bogus) record that is later deleted.  This procedure keeps SPSS from reading what is actually a string variable with lots of blanks as numeric and then recording as sysmis the few valid string values that exist.  (It's a kludge but it works like a charm and I describe it here for the sake of others who may have experience with SPSS doing strange things to string variables when there are very few non-blank values.)  Having done all that, I exported the file again as CSV (since Version 14 doesn't deal with Excel 2007) and used SPSS's text data reading capacity to bring it into SPSS.  All of this went fine.

For the "major" and "minor" sets of variables I did the following:

recode major1 to major6 ( '     '  = 0) (else = 1) into majorcat1 to majorcat6.
freq var = majorcat1 to majorcat6.
compute nummajors = sum (majorcat1 to majorcat6).
freq var = nummajors.
autorecode var =  major1 to major6   /into majorr1 to majorr6 /group.
freq var = majorr1 to majorr6.

recode minor1 to minor4 ( '     '  = 0) (else = 1) into minorcat1 to minorcat4.
freq var = minorcat1 to minorcat4.
compute numminors = sum (minorcat1 to minorcat4).
freq var = numminors.
autorecode var =  minor1 to minor4   /into minorr1 to minorr4 /group.
freq var = minorr1 to minorr4.

The point of the "recode into" syntax was to be able to get simple counts of the number of majors or minors recorded, and these worked fine.  Checks against frequencies of the original variables matched up nicely.

In the autorecode I specified the 'group' keyword because I wanted a single value label list for the 6 major variables and another for the 4 minors.   I first noticed something funky in autorecode when I looked at the frequencies for 'minorr4'.  Oddly enough, it had the few (5, actually) cases where there was actually something entered for minor4, but it ALSO had all of the values for the NEXT variable (reading left-to-right in the data editor) appended.  This next variable happened to be a code for activities in which the person had participated.  At first I thought that was the only issue until I looked more closely at the output for the major variables, and saw the same thing creeping in with majorr3 on.

The original variables are being read cleanly (I spot-checked in the data editor and with frequencies) - it's only the autorecode process that is going haywire.  One problem this creates, aside from the fact that the data are invalid, is that the common list of value labels generated by the 'group' keyword is a mess because of all the wild codes, so as it is now it's useless.

Here's a little bit of what this output winds up looking like, for the Majorr5 variable.  Note the right-justified 'Public'.  For that particular case, Major6 is 'Public Spkg'.

Criminology             Public
Economics

The majorr6 variable reports all of the values from the adjacent (Minor1) variable plus two legitimate values.


Admiration to anyone who can figure out what's going on, and undying appreciation to anyone who can tell me a fix for it!

Leslie Horst
Senior Consultant
Maguire Associates, Inc.

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD

Re: Autorecode Strangeness

Art Kendall
Are string variables within " or '?  Are there embedded blanks, commas,
single or double quotes, etc?
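
One way to check the last of these questions directly is to flag any case whose major fields contain a double-quote character. This is only a sketch: it assumes major1 through major6 are adjacent in the file (as the original syntax already does), and the variable name anyquote is made up for illustration.

```
* Flag cases where any of the six major fields contains a double quote.
* anyquote is a hypothetical scratch variable, not part of the original file.
COMPUTE anyquote = 0.
DO REPEAT m = major1 TO major6.
  IF (INDEX(m, '"') > 0) anyquote = 1.
END REPEAT.
FREQUENCIES VARIABLES = anyquote.
```

The same loop over minor1 TO minor4 would cover the minors, and listing the flagged cases should show where the stray characters sit.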

Can you extract a few cases where this happens and send me the CSV file
and syntax?  I do not have version 14 available but I may be able to
spot something.

Art Kendall
Social Research Consultants

Re: Autorecode strangeness

Leslie Horst
Well, I think I have solved this problem in an unexpected way - as I was writing a completely different email.  I had run this with version 15 and got the same bad result; earlier, with version 16, things went fine.

I had also noticed, in earlier versions of the file, that a few cases had double quotes at the beginning of the data, like this: "Radio/TV Film.  Most of the records did not.  On a flyer, the next time I imported the csv file I told the text wizard that double quote (") was the text qualifier, and the double quotes went away.  I just tried my syntax now (on 15.0.1) on this revised file - and it worked, as Tech Support had told me it did when they tried it.  However, the other night, when I tried it, it did NOT work with my earlier version of the data file.  So, perhaps the trouble was from some combination of things - the vast majority of records being blank, slightly funky reading of variables, etc.

I can't wait to see if it works now with V14.

By the way, when I tried exporting the file from Excel (2007) in tab-delimited format and read it in to SPSS, SPSS insisted on reading every single variable as numeric!  This is despite the fact that I had my bogus record up front in order to force SPSS to read string variables as string.  If there are lots of blanks in the first N records (I don't know the size of N), SPSS will assume that the variables are numeric and therefore trash your data.  And it doesn't really warn you about that.
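
For anyone who would rather pin down both import settings in syntax than in the text wizard, here is a minimal sketch of a GET DATA command that names the double quote as the text qualifier and declares the string variables explicitly, so nothing is left to SPSS's guess about variable types. The file path, the id field, and the A20 widths are assumptions made up for illustration. The real file has many more fields, and all of them would have to be listed in file order.

```
* Sketch only: the file path, the id field, and the A20 widths are assumptions.
* Every field in the file must be listed, in the order it appears.
GET DATA
  /TYPE=TXT
  /FILE='C:\data\alumni.csv'
  /DELCASE=LINE
  /DELIMITERS=","
  /QUALIFIER='"'
  /ARRANGEMENT=DELIMITED
  /FIRSTCASE=2
  /VARIABLES=
    id F8.0
    major1 A20
    major2 A20
    major3 A20
    major4 A20
    major5 A20
    major6 A20
    minor1 A20
    minor2 A20
    minor3 A20
    minor4 A20.
EXECUTE.
```

FIRSTCASE=2 assumes the first line of the file holds variable names. For a tab-delimited export the delimiter line would be /DELIMITERS="\t". Because the formats are stated outright, a bogus first record should not be needed with this approach.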

Onward and, well, onward.


Leslie Horst, Ph.D.
Senior Consultant
Maguire Associates, Inc.
5 Concord Farms
555 Virginia Road, Suite 201
Concord, MA 01742-2727

Phone: 978-371-1775 x297
Fax: 978-371-1759
[hidden email]
www.maguireassoc.com
________________________________________
Date:    Sat, 9 May 2009 09:17:59 -0400
From:    Art Kendall <[hidden email]>
Subject: Re: Autorecode Strangeness

Are string variables within " or '?  Are there embedded blanks, commas,
single or double quotes, etc?

Can you extract a few cases where this happens and send me the CSV file
and syntax?  I do not have version 14 available but I may be able to
spot something.

Art Kendall
Social Research Consultants
