Dear fellow SPSS users, I wanted to share some data management syntax I used to delete variables. I’m sure that for experienced users, it’s old news, and that the real programmers among you might even be able to improve on it. I am using version 21 for this exercise. The problem I faced was a large data set with many empty variables, i.e. variables that had no values, questions in the survey that were answered by no one. I wanted a data file with only the variables I would work with. Of course, you could use DELETE VARIABLES (or even delete manually); but that means that you need to go through every variable to determine which are “empty” and which are not. That could be very tedious and time consuming, not to mention prone to error. So here is the syntax I used (the succinct version first; the explained version second): FLIP VARIABLES=ALL. COMPUTE FLTRvar=NVALID(var1 TO varN). FILTER OFF. USE ALL. SELECT IF (FLTRvar > 0). EXECUTE. DELETE VARIABLES FLTRvar. FLIP VARIABLES=ALL. DELETE VARIABLES CASE_LBL. FLIP VARIABLES=ALL. * Cases become variables and variables become cases; so if you have 100 cases, the new data set will have 100 variables from VAR001 to VAR100. * Also FLIP creates a new string variable CASE_LBL whose values are the names of the variables in the original data set. * CAUTION: FLIP SETS ALL STRING VARIABLES in the original data set TO SYSMIS . COMPUTE FLTRvar=NVALID(var1 TO varN). * The FLTRvar using NVALID will tell me which of the variables (now cases) in the original data set are empty, which means they are set to “0”. * NVALID tells us how many variables have valid values, i.e. are not missing; so if there 5 variables and all have values (none missing) NVALID returns a “5”. * Replace “var1” and “varN” with the names of your variables. FILTER OFF. USE ALL. SELECT IF (FLTRvar > 0). EXECUTE. * The SELECT IF deletes all the empty variables (now cases) from the original data set that have “0” on FLTRvar. DELETE VARIABLES FLTRvar. * I get rid of the filter variable since I do not need it anymore. FLIP VARIABLES=ALL. * This sets back the original variables (cases in the first FLIP) to their status of variable. DELETE VARIABLES CASE_LBL. * I delete the CASE_LBL created by the second FLIP. * Then I save the resulting data file. * ALWAYS SAVE THE ORIGINAL DATA so you can always come back to it. Best regards – Dominic ********************************************* Dominic Lusinchi Far West Research Consulting Applied Statistics – Social Research – Sociology San Francisco, California 1-415-664-3032 CV: http://www.farwestresearch.com/staff/dl/dlcv.html ********************************************* |
Well done! This is very much a common procedure that I personally use frequently also. Just some tips... You should look into working with DATASET NAMEs as when you flip a dataset a new (unnamed) dataset is created and preceding commands are intended to run on that dataset so your code should really should name that dataset and activate it. In this way you can run the entire syntax without any issues (else you have to step through it and in between select/activate the appropriate dataset, which I suspect is what you are doing to get this to work). Secondly, the code "COMPUTE FLTRvar=NVALID(var1 TO varN)." requires input from the user, again which isn't ideal, and often there are easy workaround to avoid this. You could also wrap and save this into a macro also (or even create a custom extension command), so it can be called upon by a single line than have to repeat all those lines of code. I shall let you figure out how to achieve that. Here's a revised version of your code, that you may want to play with and try understand it's benefits in practice: Ohhh and by the way, note, there is a spssaux2.FindEmptyVariables(delete=True) procedure which is dependent on python achieving this very same task. DATA LIST FREE / A B C D. BEGIN DATA 1 2 3 4 5 6 7 8 9 10 11 12 END DATA. COMPUTE C=$SYSMIS. DATASET NAME DSSIM. FLIP VARIABLES=ALL. DATASET NAME dsFlip1. DATASET ACTIVATE dsFlip1. NUMERIC MyEndVar(F1.0). COMPUTE FLTRvar=NVALID(var001 TO MyEndVar). FILTER OFF. USE ALL. SELECT IF (FLTRvar > 0). EXECUTE. DELETE VARIABLES FLTRvar. FLIP VARIABLES=ALL. DATASET NAME dsFlip2. DATASET ACTIVATE dsFlip2. SELECT IF CASE_LBL<>"MyEndVar". EXECUTE. DELETE VARIABLES CASE_LBL. On 18 November 2015 at 18:40, Dominic Lusinchi <[hidden email]> wrote:
|
Preceding commands? I meant subsequent commands after any FLIP command. On 18 November 2015 at 19:14, Jignesh Sutar <[hidden email]> wrote:
|
Administrator
|
In reply to this post by Dominic Lusinchi
The basic problem I see here is that very first line.
What if you have 1 Million Cases? FLIP will create havoc in attempting to build 1 Million variables. In other words this scales horribly. Avoid that problem as follows: FWIW, This solution does not play nice with STRING variables so you will need to float a different boat for that. MATRIX. SAVE TRUNC(UNIFORM(10000,300)*3) /OUTFILE * / VARIABLES x001 TO x300. END MATRIX. RECODE x001 TO x010 x020 TO x025 x100 TO x200 (ELSE=999). MISSING VALUES ALL (999). DATASET NAME raw. /* Sample data are all set */. DATASET DECLARE DESC2. OMS /SELECT TABLES /IF COMMANDS=['Descriptives'] SUBTYPES=['Descriptive Statistics'] /DESTINATION FORMAT=SAV NUMBERED=TableNumber_ OUTFILE='DESC2' . DESCRIPTIVES ALL. OMSEND. DATASET ACTIVATE DESC2. MATCH FILES /FILE * /KEEP Var1 N. SELECT IF N GT 0. RENAME VARIABLES (Var1 = CASE_LBL). FLIP. COMPUTE @LastValid@=1. ADD FILES /FILE * / FILE raw. COMPUTE @LastVar@=1. SELECT IF CASE_LBL NE 'N'. EXECUTE. DELETE VARIABLES CASE_LBL @LastValid@ TO @LastVar@ .
Please reply to the list and not to my personal email.
Those desiring my consulting or training services please feel free to email me. --- "Nolite dare sanctum canibus neque mittatis margaritas vestras ante porcos ne forte conculcent eas pedibus suis." Cum es damnatorum possederunt porcos iens ut salire off sanguinum cliff in abyssum?" |
This post was updated on .
Before OMS was introduced I'd designed my way to drop empty (all cases
missing) numeric variables as follows (Even now I still consider this way to be acceptable. This is a third path (the first is via FLIP, the second is via OMS). Goes as follows. Do replace missing values (RMV command) with all the variables: replace with series mean (If your variables either full or fully empty you may do replacing right in the original variables, otherwise you'll have to produce variables with new names). If a variable was empty the first its case will remain empty. Otherwise it won't be empty. So, inspect, programmically, just the first case. Namely, WRITE out the 'MATCH FILES' or 'DELETE VARIABLES' command (as a syntax file) along with the list of variable names where case 1 is empty. Finally, INSERT or INCLUDE what you've written out. The tricky task to do the above WRITEing can be performed in more than one way. One way is using MATRIX. Another way is using it within a macro, macro-looping through variable names. Such as: do if $casenum=1. !do !var !in (!varlist) - do if sysmis(!var). - write outfile= 'to_insert.sps' /' ' !quote(!var). or within DO REPEAT, something like do if $casenum=1. - do repeat var= var1 var2 var3 etc /qname= 'var1' 'var2' 'var3' etc. (quoted name list) - do if sysmis(var). - write outfile= 'to_insert.sps' /' ' qname. 18.11.2015 22:21, David Marso пишет: > The basic problem I see here is that very first line. > What if you have 1 Million Cases? FLIP will create havoc in attempting to > build 1 Million variables. > In other words this scales horribly. > Avoid that problem as follows: FWIW, This solution does not play nice with > STRING variables so you will need to float a different boat for that. > > MATRIX. > SAVE TRUNC(UNIFORM(10000,300)*3) /OUTFILE * / VARIABLES x001 TO x300. > END MATRIX. > > RECODE x001 TO x010 x020 TO x025 x100 TO x200 (ELSE=999). > MISSING VALUES ALL (999). > DATASET NAME raw. > /* Sample data are all set */. > > DATASET DECLARE DESC2. > OMS > /SELECT TABLES > /IF COMMANDS=['Descriptives'] SUBTYPES=['Descriptive Statistics'] > /DESTINATION FORMAT=SAV NUMBERED=TableNumber_ > OUTFILE='DESC2' . > DESCRIPTIVES ALL. > OMSEND. > DATASET ACTIVATE DESC2. > MATCH FILES /FILE * /KEEP Var1 N. > SELECT IF N GT 0. > RENAME VARIABLES (Var1 = CASE_LBL). > FLIP. > COMPUTE @LastValid@=1. > ADD FILES /FILE * / FILE raw. > COMPUTE @LastVar@=1. > SELECT IF CASE_LBL NE 'N'. > EXECUTE. > DELETE VARIABLES CASE_LBL @LastValid@ TO @LastVar@ . > ===================== To manage your subscription to SPSSX-L, send a message to LISTSERV@LISTSERV.UGA.EDU (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD |
Administrator
|
In reply to this post by David Marso
Nice one, David.
It's not immediately clear to me how one might tweak that to work with string variables. One can assign missing values to string variables (MISSING VALUES command), but the NVALID and NMISS functions only work with numeric variables. Perhaps one could AUTORECODE string variables to numeric, and include a BLANK sub-command to convert blanks to user-defined missing values? Maybe SORT VARIABLES BY TYPE would be useful too? Just "thinking out loud" here in case I say something that inspires you. ;-) Here's a variation on an AUTORECODE example from the FM that illustrates use of the BLANK sub-command. DATA LIST /stringVar (A1). BEGIN DATA a b c d END DATA. AUTORECODE VARIABLES=stringVar /INTO NumericVar /BLANK=MISSING. COMPUTE MissingValue = MISSING(NumericVar). FORMATS MissingValue(F1). LIST. OUTPUT: stringVar NumericVar MissingValue a 1 0 b 2 0 5 1 c 3 0 d 4 0 Number of cases read: 5 Number of cases listed: 5 v The values a, b, c, and d are autorecoded into the numeric values 1 through 4. v The blank value is autorecoded to 5, and 5 is defined as user-missing.
--
Bruce Weaver bweaver@lakeheadu.ca http://sites.google.com/a/lakeheadu.ca/bweaver/ "When all else fails, RTFM." PLEASE NOTE THE FOLLOWING: 1. My Hotmail account is not monitored regularly. To send me an e-mail, please use the address shown above. 2. The SPSSX Discussion forum on Nabble is no longer linked to the SPSSX-L listserv administered by UGA (https://listserv.uga.edu/). |
In reply to this post by Jignesh Sutar
You see folks what I mean by “real programmers”! Thank you, Jignesh, David, Kirill and Bruce!! Had to lookup “FWIW”. I’m not up on… what kind of speak is that BTW? I do have a question: are you saying that SPSS can’t handle 1M variables? If I understand the SPSS convention in naming variables, with 1M variables it would go: VAR0000001….VAR1000000. – Dominic From: [hidden email] [[hidden email]] On Behalf Of Jignesh Sutar Preceding commands? I meant subsequent commands after any FLIP command. On 18 November 2015 at 19:14, Jignesh Sutar <[hidden email]> wrote: Well done! This is very much a common procedure that I personally use frequently also.
You should look into working with DATASET NAMEs as when you flip a dataset a new (unnamed) dataset is created and preceding commands are intended to run on that dataset so your code should really should name that dataset and activate it. In this way you can run the entire syntax without any issues (else you have to step through it and in between select/activate the appropriate dataset, which I suspect is what you are doing to get this to work). You could also wrap and save this into a macro also (or even create a custom extension command), so it can be called upon by a single line than have to repeat all those lines of code. I shall let you figure out how to achieve that. Here's a revised version of your code, that you may want to play with and try understand it's benefits in practice: Ohhh and by the way, note, there is a spssaux2.FindEmptyVariables(delete=True) procedure which is dependent on python achieving this very same task. DATA LIST FREE / A B C D. BEGIN DATA 1 2 3 4 5 6 7 8 9 10 11 12 END DATA. COMPUTE C=$SYSMIS. DATASET NAME DSSIM. FLIP VARIABLES=ALL. DATASET NAME dsFlip1. DATASET ACTIVATE dsFlip1. NUMERIC MyEndVar(F1.0). COMPUTE FLTRvar=NVALID(var001 TO MyEndVar). FILTER OFF. USE ALL. SELECT IF (FLTRvar > 0). EXECUTE. DELETE VARIABLES FLTRvar. FLIP VARIABLES=ALL. DATASET NAME dsFlip2. DATASET ACTIVATE dsFlip2. SELECT IF CASE_LBL<>"MyEndVar". EXECUTE. DELETE VARIABLES CASE_LBL. On 18 November 2015 at 18:40, Dominic Lusinchi <[hidden email]> wrote: Dear fellow SPSS users, I wanted to share some data management syntax I used to delete variables. I’m sure that for experienced users, it’s old news, and that the real programmers among you might even be able to improve on it. I am using version 21 for this exercise. The problem I faced was a large data set with many empty variables, i.e. variables that had no values, questions in the survey that were answered by no one. I wanted a data file with only the variables I would work with. Of course, you could use DELETE VARIABLES (or even delete manually); but that means that you need to go through every variable to determine which are “empty” and which are not. That could be very tedious and time consuming, not to mention prone to error. So here is the syntax I used (the succinct version first; the explained version second): FLIP VARIABLES=ALL. COMPUTE FLTRvar=NVALID(var1 TO varN). FILTER OFF. USE ALL. SELECT IF (FLTRvar > 0). EXECUTE. DELETE VARIABLES FLTRvar. FLIP VARIABLES=ALL. DELETE VARIABLES CASE_LBL. FLIP VARIABLES=ALL. * Cases become variables and variables become cases; so if you have 100 cases, the new data set will have 100 variables from VAR001 to VAR100. * Also FLIP creates a new string variable CASE_LBL whose values are the names of the variables in the original data set. * CAUTION: FLIP SETS ALL STRING VARIABLES in the original data set TO SYSMIS . COMPUTE FLTRvar=NVALID(var1 TO varN). * The FLTRvar using NVALID will tell me which of the variables (now cases) in the original data set are empty, which means they are set to “0”. * NVALID tells us how many variables have valid values, i.e. are not missing; so if there 5 variables and all have values (none missing) NVALID returns a “5”. * Replace “var1” and “varN” with the names of your variables. FILTER OFF. USE ALL. SELECT IF (FLTRvar > 0). EXECUTE. * The SELECT IF deletes all the empty variables (now cases) from the original data set that have “0” on FLTRvar. DELETE VARIABLES FLTRvar. * I get rid of the filter variable since I do not need it anymore. FLIP VARIABLES=ALL. * This sets back the original variables (cases in the first FLIP) to their status of variable. DELETE VARIABLES CASE_LBL. * I delete the CASE_LBL created by the second FLIP. * Then I save the resulting data file. * ALWAYS SAVE THE ORIGINAL DATA so you can always come back to it. Best regards – Dominic ********************************************* Dominic Lusinchi Far West Research Consulting Applied Statistics – Social Research – Sociology San Francisco, California 1-415-664-3032 CV: http://www.farwestresearch.com/staff/dl/dlcv.html ********************************************* ===================== To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD |
Administrator
|
FWIW = For What It's Worth: a FLA (Four Letter Acronym).
Just an abbreviation for a commonly used expression. Sort of like WTF, RTFM, LOL, ROFLMAO ... I'm not saying that SPSS can't handle 1M variables (It likely can, I have never tried it... don't tempt me). The question is whether you want it to or need it to? How much time do you have on your hands? Or more to the point, can your computer handle 1M variables? Do you want it to or need it to?
Please reply to the list and not to my personal email.
Those desiring my consulting or training services please feel free to email me. --- "Nolite dare sanctum canibus neque mittatis margaritas vestras ante porcos ne forte conculcent eas pedibus suis." Cum es damnatorum possederunt porcos iens ut salire off sanguinum cliff in abyssum?" |
There are no limits on the number of variables that can be handled (other than available memory), but I can promise that with 1M variables, it will be slow. And FLIPping a dataset of that size is surely a bad idea.
Jon Peck (no "h") aka Kim Senior Software Engineer, IBM [hidden email] phone: 720-342-5621 ----- Original message ----- ===================== To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD |
In reply to this post by Dominic Lusinchi
Thank you, David and Jon. Well said. Fortunately, when I’ve had to deal with a data file in the millions, I did not have to delete hundreds of variables. Cheers – Dominic P.S. Thanks for the acronymish lesson… From: Jon K Peck [mailto:[hidden email]] There are no limits on the number of variables that can be handled (other than available memory), but I can promise that with 1M variables, it will be slow. And FLIPping a dataset of that size is surely a bad idea.
|
Free forum by Nabble | Edit this page |