Filtering out variables without data


Filtering out variables without data

Melissa David

I have a very large dataset with information from all over the country. I often want to tell SPSS that I want ALL available data from a specific area (let's say Alabama). I have thousands of different variables holding information from all over, and I have no way of knowing which variables are unique to Alabama; several variables may be unique to different areas. When I filter out all cases except Alabama, all of the variables still appear as options to run frequencies on, and since I do not know which variables contain Alabama data, I have to run frequencies on all of them. When I select all Alabama cases, I want to add a filter that tells SPSS to ignore (or also filter out) every variable that has no data in it, because there is nothing to analyze in those variables.

If I run the filter to export into a new file, the new dataset would ideally NOT contain empty variables. Because I have to run frequencies on all of the variables to find out which ones have data, the result would be an unmanageably long frequency table output. In practice, with so many variables and cases, SPSS crashes before it gets through all of the 15k+ variables.

I have run the syntax described in this technote http://www-01.ibm.com/support/docview.wss?uid=swg21481480 that addresses the issue, but it isn't efficient, and it takes several hours to run on my dataset with 15k+ variables. If anyone has a more efficient or streamlined method, please let me know!

===================== To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD

Re: Filtering out variables without data

Jon Peck
Here's a simple bit of Python code to delete empty variables.  For numerics, that means all values are sysmis and for strings, all values are blank.

begin program.
import spssaux2

spssaux2.delEmptyVars()
end program.

spssaux2 also has a function, FindEmptyVars, that offers some other options.
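
For readers who want to see the emptiness rule spelled out, here is a minimal pure-Python sketch of it. It is not the spssaux2 implementation and does not call any SPSS module: the function name find_empty_vars, the list-of-dicts row layout, and the use of None to stand in for sysmis are all assumptions made for illustration. It makes a single pass over the cases and stops early once every variable has been seen to hold data.

```python
def find_empty_vars(var_names, rows):
    """Return the names of variables that hold no data in any row.

    Assumptions for this sketch: a numeric value is missing when it is None
    (standing in for sysmis), and a string value is missing when it is blank.
    """
    candidates = set(var_names)  # variables not yet seen with real data
    for row in rows:
        for name in list(candidates):
            value = row.get(name)
            if value is None:
                continue  # missing numeric (or absent) value
            if isinstance(value, str) and value.strip() == "":
                continue  # blank string
            candidates.discard(name)  # found data, so the variable is not empty
        if not candidates:
            break  # every variable has data; no need to read further
    # Preserve the original variable order in the result.
    return [name for name in var_names if name in candidates]


# Hypothetical cases that have already been restricted to one state.
rows = [
    {"state": "AL", "al_only": 3.0, "tx_only": None, "note": "  "},
    {"state": "AL", "al_only": None, "tx_only": None, "note": ""},
]
empty = find_empty_vars(["state", "al_only", "tx_only", "note"], rows)
print(empty)
```

The candidate set only shrinks, so on wide data the inner loop gets cheaper as more variables are found to contain values.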


(In reply to Melissa's message of Tue, Dec 5, 2017, 3:38 PM.)



--
Jon K Peck
[hidden email]


Re: Filtering out variables without data

Rich Ulrich
In reply to this post by Melissa David

Jon shows an elegant solution to the problem as presented.

I think I would resist, as strongly as I could, any demand to maintain one flat file with 15K variables, even if most of the values were not Missing. 90% of the effort of ordinary "data analysis" is devoted to cleaning and prepping -- and that is when it is easy to spot the proper Missing, or occasional bad values. You are in worse shape when massive portions are Missing-by-definition.

My own "multiplicity" examples had multiple dates; 50 vars times 5 dates can be 250 vars, or 50 vars: 50 is far faster to edit-check and manipulate, and is in appropriate form for aggregation, selection, etc.

And in yours, I bet it is only a few "States" that have most of the problems.

Argue for separate datasets, which will be Joined at the time of analyses. Maintain standard sets of lines to do the joining.
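
To make the "separate datasets, joined at analysis time" idea concrete, here is a small pure-Python sketch of a keyed one-to-one join. It only illustrates the logic and is not SPSS syntax; the function name join_on_key, the "id" key, and the sample variables are hypothetical. Cases with no match in the second dataset get None for its variables, standing in for system-missing.

```python
def join_on_key(left, right, key):
    """Left-join two lists of dict rows on a shared key.

    Assumes at most one row per key value in `right` (a one-to-one match).
    """
    lookup = {row[key]: row for row in right}
    # Collect the variables contributed by `right`, in first-seen order.
    right_names = []
    for row in right:
        for name in row:
            if name != key and name not in right_names:
                right_names.append(name)
    joined = []
    for row in left:
        merged = dict(row)
        match = lookup.get(row[key], {})
        for name in right_names:
            merged[name] = match.get(name)  # None when there is no match
        joined.append(merged)
    return joined


# Hypothetical core file plus a small state-specific file.
core = [{"id": 1, "state": "AL"}, {"id": 2, "state": "TX"}]
alabama_items = [{"id": 1, "al_q1": 4}]
result = join_on_key(core, alabama_items, "id")
print(result)
```

Keeping the state-specific variables in their own small file means the core file stays narrow, and the join is repeated only for the analyses that need those variables.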


--

Rich Ulrich



Re: Filtering out variables without data

Jon Peck
15K variables is pretty unwieldy. Splitting up the data into separate files can conveniently be done with the STATS SPLIT DATASET (Data > Split into Files) extension command, and the STATS PROCESS FILES extension command can then iterate a file of syntax over the separate files, producing one or multiple Viewer files. The advantage of this over SPLIT FILE is that you run a whole set of procedures and have the output grouped by file, whereas SPLIT FILE iterates within a single procedure.
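
As a rough illustration of the split-then-process workflow, the sketch below groups cases by a key and then trims each group down to the variables that actually hold data for it. It is plain Python, not the extension commands themselves; split_by, drop_empty_columns, the None-for-missing convention, and the sample data are assumptions for the example, and every row is assumed to carry the same set of variables.

```python
from collections import defaultdict


def is_missing(value):
    """None stands in for sysmis; blank strings count as missing."""
    return value is None or (isinstance(value, str) and value.strip() == "")


def split_by(rows, key):
    """Group rows into separate lists, one per distinct value of `key`."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    return dict(groups)


def drop_empty_columns(rows):
    """Keep only the variables that have data in at least one row."""
    if not rows:
        return []
    keep = [
        name for name in rows[0]
        if any(not is_missing(row.get(name)) for row in rows)
    ]
    return [{name: row.get(name) for name in keep} for row in rows]


# Hypothetical combined file with state-specific variables.
rows = [
    {"state": "AL", "al_q1": 4, "tx_q1": None},
    {"state": "TX", "al_q1": None, "tx_q1": 7},
    {"state": "AL", "al_q1": 5, "tx_q1": None},
]
per_state = {
    state: drop_empty_columns(group)
    for state, group in split_by(rows, "state").items()
}
print(per_state["AL"])
print(per_state["TX"])
```

Each per-state dataset ends up with only its own variables, which is the shape you would want before running a shared block of procedures over every file.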

(In reply to Rich Ulrich's message of Wed, Dec 6, 2017, 10:39 AM.)



--
Jon K Peck
[hidden email]


Re: Filtering out variables without data

Melissa David
In reply to this post by Jon Peck

This worked, and it only took an hour on my large dataset.  Thank you so much, Jon!

 

(Replying to Jon Peck's message of Tuesday, December 05, 2017, 8:22 PM.)


Re: Filtering out variables without data

David Marso
Administrator
In reply to this post by Melissa David
I looked at that IBM link and my only comment is OMG YUCK ICKY WTF....
I would likely go with Jon's Python solution.
Perhaps you would like to discuss how you came to have 15,000 variables. There is likely a much more reasonable solution than maintaining them in one file.
Why do you think SPSS has ADD FILES and MATCH FILES commands?
I am not going to try to guess the origins of this mess.



-----
Please reply to the list and not to my personal email.
Those desiring my consulting or training services please feel free to email me.
---
"Nolite dare sanctum canibus neque mittatis margaritas vestras ante porcos ne forte conculcent eas pedibus suis."
Cum es damnatorum possederunt porcos iens ut salire off sanguinum cliff in abyssum?"
--
Sent from: http://spssx-discussion.1045642.n5.nabble.com/
