|
Dear SPSS Experts,
I have downloaded datasets in Excel that I want to restructure so that I can use them analytically in a flexible way. The data are actual and projected high school graduation numbers separately for the 50 states plus CD and also aggregated nationally and by region (a total of 56 spreadsheets). They are all of the following form (I hope that the list processor doesn't make a hash of this): Year PubTotalbyrace PubAIAN PubAPI PubBlack PubHispanic PubWhite PubTot NonPubTot TotalAll 1991-1992 - - - - 2021-2022 Respectively these are Public Schools Total by race, Public-Native American, Alaska Native, Public-Black, Public-Hispanic, Public-White, Public Total, Non-public Total, and Overall Total. For all fields EXCEPT Non-public Total, the years from 2005-2006 are projections; for the Non-public total, all fields from 2003-2004 on are projects. The documentation with the files specifies that the Public Total by race will not equal the Public Total in the years when the data are projections. This is a nice way to display a lot of data and communicate it, but it's not so easy to analyze. I want to wind up with datasets with the form of: Area PubTotalbyrace9192 all the way to TotAll2122. That is, 9 (data elements) * 31 (years) = 279 variables, plus a variable that will designate what state (or region, or the US as a whole) is represented. One row per area. Then I will be able to aggregate, say, 3 states that are near each other easily using this new structure. The point of this exercise is to be able to customized projections for varying groups of states. I have a fair amount of experience with syntax derived from many years of using SPSS in its various incarnations, and no experience with macros, scripts, or Python. The syntax below (mercifully only for the first two years) works - it creates exactly what I want. But I'll just bet there's a more elegant way. I'm especially interested in more elegant solutions in syntax. Maybe there's a way to loop through the 31 lines that have the year specified, but I couldn't see a way to wind up with unique year-specific variable names other than what I've done. Special note to Richard Ristow: the "execute" is there so that I could see the results in the data editor as I was going along. :-) Suggestions anyone? FYI I have version 14.0.2, and I downloaded Python some time ago (but have done nothing with it other than glance at the tutorial and wonder if I'm past my prime for learning new programming languages...) GET DATA /TYPE=XLS /FILE='X:\Project\Higher Ed landscape 2008\WICHE data and profiles\data\individual states\Massachusetts.xls' /SHEET=name 'Mass.' /CELLRANGE=full /READNAMES=on /ASSUMEDSTRWIDTH=32767. DATASET NAME DataSet1 WINDOW=FRONT. **** create appropriate area variable and value. string area (a4). compute area = 'MA '. **** create new variables for each year (for all 31 years - two years only shown here). do if (year = '1991-92'). Do Repeat x = PubTotalbyrace PubAIAN PubAPI PubBlack PubHispanic PubWhite PubTot NonPubTot TotalAll / y = PubTotalbyrace9192 PubAIAN9192 PubAPI9192 PubBlack9192 PubHispanic9192 PubWhite9192 PubTot9192 NonPubTot9192 TotalAll9192. compute y = x. End repeat. else if (year = '1992-93'). Do Repeat x = PubTotalbyrace PubAIAN PubAPI PubBlack PubHispanic PubWhite PubTot NonPubTot TotalAll / y = PubTotalbyrace9293 PubAIAN9293 PubAPI9293 PubBlack9293 PubHispanic9293 PubWhite9293 PubTot9293 NonPubTot9293 TotalAll9293. compute y = x. End repeat. end if. list var = PubTotalbyrace9293 to TotalAll9293. **** recode missing data to 0 for purposes of the aggregate. recode PubTotalbyrace9192 PubAIAN9192 PubAPI9192 PubBlack9192 PubHispanic9192 PubWhite9192 PubTot9192 NonPubTot9192 TotalAll9192 PubTotalbyrace9293 PubAIAN9293 PubAPI9293 PubBlack9293 PubHispanic9293 PubWhite9293 PubTot9293 NonPubTot9293 TotalAll9293 (sysmis = 0). **** aggregate so result will be one row of data with the area name first and then the newly created variables (all 279 of them) in a single row. aggregate outfile = 'x:\project\Higher Ed landscape 2008\WICHE data and profiles\data\individual states\MA test1.sav' / break = area / nPubTotalbyrace9192 = sum (PubTotalbyrace9192) / nPubAIAN9192 = sum (PubAIAN9192) / nPubAPI9192 = sum (PubAPI9192) /nPubBlack9192 = sum (PubBlack9192) /nPubHispanic9192 = sum (PubHispanic9192) /nPubWhite9192 = sum (PubWhite9192) /nPubTot9192 = sum (PubTot9192) /nNonPubTot9192 = sum (NonPubTot9192) /nTotalAll9192.= sum (TotalAll9192) / nPubTotalbyrace9293 = sum (PubTotalbyrace9293) / nPubAIAN9293 = sum (PubAIAN9293) / nPubAPI9293 = sum (PubAPI9293) /nPubBlack9293 = sum (PubBlack9293) /nPubHispanic9293 = sum (PubHispanic9293) /nPubWhite9293 = sum (PubWhite9293) /nPubTot9293 = sum (PubTot9293) /nNonPubTot9293 = sum (NonPubTot9293) /nTotalAll9293.= sum (TotalAll9293). execute. Leslie Horst, Ph.D. Senior Consultant Maguire Associates, Inc. 5 Concord Farms 555 Virginia Rd., #201 Concord, MA 01742 978-371-1775 www.maguireassoc.com ===================== To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD |
|
Leslie,
Richard and many others are much better at this than I am but this will give you a start. Here's a way to think of your problem--as I understand it. You have 56 'long' datasets and you wish to convert each to a 'wide' dataset. A long dataset has multiple rows per case, which in your cases is one spreadsheet page corresponding to a one state, region, or the country as a whole. A wide dataset has one row per case. As you said, you go from 9 variables for 31 years to 279 variables. The applicable section to read in the syntax reference is the command Casestovars. I'm pretty sure that this command was in v14. As you read about the Casestovars command, pay particular attention to the Rename, Index, and Separator subcommands because I notice that you want specific names for the new variables such that there is a stem naming the data element and a suffix that names the year. For example, in the 91-92 file 'PubAIAN' goes to 'PubAIAN9192'. Casestovars constructs new variable names as the concantenation of a 'stem' or 'new-stem' and an index value. The 'stem' is the original variable name. A 'new-stem' is created, I think, by the Rename subcommand. The index value can be supplied by another variable (see Index subcommand). In your case Year is the candidate, except that you want 9192 instead of 1991-92. So you could work over year to create a new variable that would be your index variable. Also, look at the Separator subcommand. You'll need to use that also. Gene Maguin ===================== To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD |
|
At 01:25 PM 4/15/2008, Gene Maguin wrote, responding to Leslie Horst:
>Richard and many others are much better at this than I [...] (Grin) Thanks for the compliment. I hope I can live up to it. To start, Gene wrote, >As I understand it, you have 56 'long' datasets [...] but, at the beginning, you don't; you have 56 Excel spreadsheets. So, your first pain in the neck is to get all 56 as SPSS datasets. (Actually, get them as saved files, which since SPSS 14 is a meaningful distinction.) At 11:39 AM 4/15/2008, Leslie Horst wrote that you'd used (I'm reformatting): >GET DATA /TYPE=XLS > /FILE='X:\Project\Higher Ed landscape 2008' + > '\WICHE data and profiles\data' + > '\individual states\' + > 'Massachusetts.xls' > /SHEET=name 'Mass.' > /CELLRANGE=full > /READNAMES=on > /ASSUMEDSTRWIDTH=32767. > >**** create appropriate area variable and value. >string area (a4). >compute area = 'MA '. and you need to do that 56 times. That's best done with some macro facility (in the general sense); in SPSS, either macros or Python. (It doesn't help that 'Massachusetts' is named in three different ways in the syntax, and there doesn't seem to be a way to simplify this. So you have to loop in parallel through three sets of arguments: state names, sheet names ("Mass."), and state codes ("MA"). I want to give you a hand with this, but I can't start it right now. Maybe other macro jocks can help. >You wish to convert each ['long' dataset] to a 'wide' dataset. A >long dataset has multiple rows per case, which in your cases is one >spreadsheet page corresponding to a one state, region, or the >country as a whole. A wide dataset has one row per case. As you >said, you go from 9 variables for 31 years to 279 variables. The >applicable section to read in the syntax reference is the command >Casestovars. I'm pretty sure that this command was in v14. It was. (Added in 12, maybe?) And, yes, follow Gene's advice. Use ADD FILES to bring all your regions into a single dataset first, so you can run CASESTOVARS once instead of 56 times. (And save that combined file as it is, in 'long' form, before running CASESTOVARS!) It'll take at least two ADD FILES to do it all, because "A maximum of 50 files can be combined on one ADD FILES command" (SPSS 14 Command Syntax Reference). Or, probably better, you can use DATASET commands and ADD FILES in your macro, you can build the combined long file within the macro loop. It'll be significantly less efficient, I think, except that efficiency will matter very little with the size dataset you have. Finally, you write, >['Long' form] is a nice way to display a lot of data and communicate >it, but it's not so easy to analyze. Look hard at your analysis; this may not be the case. Overall, SPSS works more gracefully with many cases and few variables -- of course, you have whatever your problem is, not SPSS 'overall'. But with your data, I'd seriously consider going the other way: Use VARSTOCASES to get a dataset with variables 'area', 'year', and ethnicity, and a single 'enrollment' variable. -With best wishes to you, Richard ===================== To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD |
|
In reply to this post by Leslie Horst
Postscript: At 11:39 AM 4/15/2008, Leslie Horst wrote of using:
>GET DATA /TYPE=XLS > /FILE='X:\Project\Higher Ed landscape 2008\WICHE data and >profiles\data\individual states\Massachusetts.xls' > /SHEET=name 'Mass.' > /CELLRANGE=full > /READNAMES=on > /ASSUMEDSTRWIDTH=32767. >DATASET NAME DataSet1 WINDOW=FRONT. > >**** create appropriate area variable and value. >string area (a4). >compute area = 'MA '. A macro loop (or Python loop) is the 'neat' solution, but it might be easier to drag and drop the spreadsheets onto SPSS one at a time, and save the results manually. That's seriously inelegant, but it could be 56 fairly easy manual operations. You might want to have a code snippet in a syntax window, to append the current active dataset to a complete catenated one; that's easier to do in syntax than from the menus. And maybe the snippet could calculate variable 'area'; you'd have to change it manually every time for that, but overall it could still be easier. ===================== To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD |
| Free forum by Nabble | Edit this page |
