SPSSX Discussion

Data Manipulation problem

Classic

List

Threaded

4 messages Options

Leslie Horst

Data Manipulation problem

Dear SPSS Experts,

I have downloaded datasets in Excel that I want to restructure so that I
can use them analytically in a flexible way. The data are actual and
projected high school graduation numbers separately for the 50 states
plus CD and also aggregated nationally and by region (a total of 56
spreadsheets). They are all of the following form (I hope that the list
processor doesn't make a hash of this):

Year PubTotalbyrace PubAIAN PubAPI PubBlack
PubHispanic PubWhite PubTot NonPubTot TotalAll
1991-1992
-
-
-
-
2021-2022

Respectively these are Public Schools Total by race, Public-Native
American, Alaska Native, Public-Black, Public-Hispanic, Public-White,
Public Total, Non-public Total, and Overall Total.

For all fields EXCEPT Non-public Total, the years from 2005-2006 are
projections; for the Non-public total, all fields from 2003-2004 on are
projects. The documentation with the files specifies that the Public
Total by race will not equal the Public Total in the years when the data
are projections.

This is a nice way to display a lot of data and communicate it, but it's
not so easy to analyze. I want to wind up with datasets with the form
of:

Area PubTotalbyrace9192 all the way to TotAll2122. That is, 9 (data
elements) * 31 (years) = 279 variables, plus a variable that will
designate what state (or region, or the US as a whole) is represented.
One row per area. Then I will be able to aggregate, say, 3 states that
are near each other easily using this new structure. The point of this
exercise is to be able to customized projections for varying groups of
states.

I have a fair amount of experience with syntax derived from many years
of using SPSS in its various incarnations, and no experience with
macros, scripts, or Python. The syntax below (mercifully only for the
first two years) works - it creates exactly what I want. But I'll just
bet there's a more elegant way. I'm especially interested in more
elegant solutions in syntax. Maybe there's a way to loop through the 31
lines that have the year specified, but I couldn't see a way to wind up
with unique year-specific variable names other than what I've done.
Special note to Richard Ristow: the "execute" is there so that I could
see the results in the data editor as I was going along. :-)

Suggestions anyone? FYI I have version 14.0.2, and I downloaded Python
some time ago (but have done nothing with it other than glance at the
tutorial and wonder if I'm past my prime for learning new programming
languages...)

GET DATA /TYPE=XLS
/FILE='X:\Project\Higher Ed landscape 2008\WICHE data and
profiles\data\individual states\Massachusetts.xls'
/SHEET=name 'Mass.'
/CELLRANGE=full
/READNAMES=on
/ASSUMEDSTRWIDTH=32767.
DATASET NAME DataSet1 WINDOW=FRONT.

**** create appropriate area variable and value.
string area (a4).
compute area = 'MA '.

**** create new variables for each year (for all 31 years - two years
only shown here).
do if (year = '1991-92').
Do Repeat
x = PubTotalbyrace PubAIAN PubAPI PubBlack PubHispanic PubWhite
PubTot NonPubTot TotalAll
/ y = PubTotalbyrace9192 PubAIAN9192 PubAPI9192 PubBlack9192
PubHispanic9192 PubWhite9192 PubTot9192 NonPubTot9192 TotalAll9192.
compute y = x.
End repeat.

else if (year = '1992-93').
Do Repeat
x = PubTotalbyrace PubAIAN PubAPI PubBlack PubHispanic PubWhite
PubTot NonPubTot TotalAll
/ y = PubTotalbyrace9293 PubAIAN9293 PubAPI9293 PubBlack9293
PubHispanic9293 PubWhite9293 PubTot9293 NonPubTot9293 TotalAll9293.
compute y = x.
End repeat.
end if.
list var = PubTotalbyrace9293 to TotalAll9293.

**** recode missing data to 0 for purposes of the aggregate.

recode PubTotalbyrace9192 PubAIAN9192 PubAPI9192 PubBlack9192
PubHispanic9192 PubWhite9192 PubTot9192 NonPubTot9192 TotalAll9192
PubTotalbyrace9293 PubAIAN9293 PubAPI9293 PubBlack9293
PubHispanic9293 PubWhite9293 PubTot9293 NonPubTot9293 TotalAll9293
(sysmis = 0).

**** aggregate so result will be one row of data with the area name
first and then the newly created variables (all 279 of them) in a single
row.
aggregate outfile = 'x:\project\Higher Ed landscape 2008\WICHE data and
profiles\data\individual states\MA test1.sav'
/ break = area
/ nPubTotalbyrace9192 = sum (PubTotalbyrace9192)
/ nPubAIAN9192 = sum (PubAIAN9192)
/ nPubAPI9192 = sum (PubAPI9192)
/nPubBlack9192 = sum (PubBlack9192)
/nPubHispanic9192 = sum (PubHispanic9192)
/nPubWhite9192 = sum (PubWhite9192)
/nPubTot9192 = sum (PubTot9192)
/nNonPubTot9192 = sum (NonPubTot9192)
/nTotalAll9192.= sum (TotalAll9192)
/ nPubTotalbyrace9293 = sum (PubTotalbyrace9293)
/ nPubAIAN9293 = sum (PubAIAN9293)
/ nPubAPI9293 = sum (PubAPI9293)
/nPubBlack9293 = sum (PubBlack9293)
/nPubHispanic9293 = sum (PubHispanic9293)
/nPubWhite9293 = sum (PubWhite9293)
/nPubTot9293 = sum (PubTot9293)
/nNonPubTot9293 = sum (NonPubTot9293)
/nTotalAll9293.= sum (TotalAll9293).
execute.

Leslie Horst, Ph.D.
Senior Consultant
Maguire Associates, Inc.
5 Concord Farms
555 Virginia Rd., #201
Concord, MA 01742
978-371-1775
www.maguireassoc.com

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD

Maguin, Eugene

Re: Data Manipulation problem

Leslie,

Richard and many others are much better at this than I am but this will give
you a start. Here's a way to think of your problem--as I understand it. You
have 56 'long' datasets and you wish to convert each to a 'wide' dataset. A
long dataset has multiple rows per case, which in your cases is one
spreadsheet page corresponding to a one state, region, or the country as a
whole. A wide dataset has one row per case. As you said, you go from 9
variables for 31 years to 279 variables. The applicable section to read in
the syntax reference is the command Casestovars. I'm pretty sure that this
command was in v14.

As you read about the Casestovars command, pay particular attention to the
Rename, Index, and Separator subcommands because I notice that you want
specific names for the new variables such that there is a stem naming the
data element and a suffix that names the year. For example, in the 91-92
file 'PubAIAN' goes to 'PubAIAN9192'. Casestovars constructs new variable
names as the concantenation of a 'stem' or 'new-stem' and an index value.
The 'stem' is the original variable name. A 'new-stem' is created, I think,
by the Rename subcommand. The index value can be supplied by another
variable (see Index subcommand). In your case Year is the candidate, except
that you want 9192 instead of 1991-92. So you could work over year to create
a new variable that would be your index variable. Also, look at the
Separator subcommand. You'll need to use that also.

Gene Maguin

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD

Richard Ristow

Re: Data Manipulation problem

At 01:25 PM 4/15/2008, Gene Maguin wrote, responding to Leslie Horst:

>Richard and many others are much better at this than I [...]

(Grin) Thanks for the compliment. I hope I can live up to it.

To start, Gene wrote,
>As I understand it, you have 56 'long' datasets [...]

but, at the beginning, you don't; you have 56 Excel spreadsheets. So,
your first pain in the neck is to get all 56 as SPSS datasets.
(Actually, get them as saved files, which since SPSS 14 is a
meaningful distinction.)

At 11:39 AM 4/15/2008, Leslie Horst wrote that you'd used (I'm reformatting):

>GET DATA /TYPE=XLS
> /FILE='X:\Project\Higher Ed landscape 2008' +
> '\WICHE data and profiles\data' +
> '\individual states\' +
> 'Massachusetts.xls'
> /SHEET=name 'Mass.'
> /CELLRANGE=full
> /READNAMES=on
> /ASSUMEDSTRWIDTH=32767.
>
>**** create appropriate area variable and value.
>string area (a4).
>compute area = 'MA '.

and you need to do that 56 times. That's best done with some macro
facility (in the general sense); in SPSS, either macros or Python.
(It doesn't help that 'Massachusetts' is named in three different
ways in the syntax, and there doesn't seem to be a way to simplify
this. So you have to loop in parallel through three sets of
arguments: state names, sheet names ("Mass."), and state codes ("MA").

I want to give you a hand with this, but I can't start it right now.
Maybe other macro jocks can help.

>You wish to convert each ['long' dataset] to a 'wide' dataset. A
>long dataset has multiple rows per case, which in your cases is one
>spreadsheet page corresponding to a one state, region, or the
>country as a whole. A wide dataset has one row per case. As you
>said, you go from 9 variables for 31 years to 279 variables. The
>applicable section to read in the syntax reference is the command
>Casestovars. I'm pretty sure that this command was in v14.

It was. (Added in 12, maybe?) And, yes, follow Gene's advice. Use ADD
FILES to bring all your regions into a single dataset first, so you
can run CASESTOVARS once instead of 56 times. (And save that combined
file as it is, in 'long' form, before running CASESTOVARS!)

It'll take at least two ADD FILES to do it all, because "A maximum of
50 files can be combined on one ADD FILES command" (SPSS 14 Command
Syntax Reference). Or, probably better, you can use DATASET commands
and ADD FILES in your macro, you can build the combined long file
within the macro loop. It'll be significantly less efficient, I
think, except that efficiency will matter very little with the size
dataset you have.

Finally, you write,

>['Long' form] is a nice way to display a lot of data and communicate
>it, but it's not so easy to analyze.

Look hard at your analysis; this may not be the case. Overall, SPSS
works more gracefully with many cases and few variables -- of course,
you have whatever your problem is, not SPSS 'overall'.

But with your data, I'd seriously consider going the other way: Use
VARSTOCASES to get a dataset with variables 'area', 'year', and
ethnicity, and a single 'enrollment' variable.

-With best wishes to you,
Richard

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD

Richard Ristow

Re: Data Manipulation problem

In reply to this post by Leslie Horst

Postscript: At 11:39 AM 4/15/2008, Leslie Horst wrote of using:

>GET DATA /TYPE=XLS
> /FILE='X:\Project\Higher Ed landscape 2008\WICHE data and
>profiles\data\individual states\Massachusetts.xls'
> /SHEET=name 'Mass.'
> /CELLRANGE=full
> /READNAMES=on
> /ASSUMEDSTRWIDTH=32767.
>DATASET NAME DataSet1 WINDOW=FRONT.
>
>**** create appropriate area variable and value.
>string area (a4).
>compute area = 'MA '.

A macro loop (or Python loop) is the 'neat' solution, but it might be
easier to drag and drop the spreadsheets onto SPSS one at a time, and
save the results manually. That's seriously inelegant, but it could
be 56 fairly easy manual operations.

You might want to have a code snippet in a syntax window, to append
the current active dataset to a complete catenated one; that's easier
to do in syntax than from the menus. And maybe the snippet could
calculate variable 'area'; you'd have to change it manually every
time for that, but overall it could still be easier.

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD