Delete columns with more than 20% missing cases

classic Classic list List threaded Threaded
8 messages Options
Reply | Threaded
Open this post in threaded view
|

Delete columns with more than 20% missing cases

Sammm313
Hi,

So I have a dataset will a couple thousand observations and a couple
thousand variables (columns) that I am tryin to clean. I now want to remove
all columns that have more than 20% missing cases. Is there a way I can do
this for all my variables without having to go through them one by one and
delete them that way?

I would do this with one line of code in python with this line. So I'm
looking for a similar thing to do in SPSS.

# drop columns with more than 20% missing
df = df.dropna(thresh=0.8*len(df), axis=1)



--
Sent from: http://spssx-discussion.1045642.n5.nabble.com/

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD
Reply | Threaded
Open this post in threaded view
|

Re: Delete columns with more than 20% missing cases

Jon Peck
This takes more code in Statistics.  If you have the Data Validation option, the following code will do it.  If not, something similar can be done with AGGREGATE, so post back if you need that.  It could also be done with a litle pure Python code, but that would be slower to execute.

This code captures the output from VALIDATE DATA as a dataset and then generates the DELETE VARIABLES command appropriately.  It's a little convoluted because the structure of the table from VALIDATE DATA is rather peculiar.

dataset name main.
preserve.
set tvars=names.
dataset declare checks.
oms select tables /if subtypes='VariableChecks' /destination outfile=checks format=sav.
VALIDATEDATA VARIABLES=ALL
  /VARCHECKS STATUS=ON PCTMISSING=20 PCTEQUAL=100 PCTUNEQUAL=100
  /CASECHECKS REPORTEMPTY=NO SCOPE=ALLVARS
  /CASEREPORT DISPLAY=NO.
omsend.
restore.

begin program.
import spss, spssdata

spss.Submit("dataset activate checks")
targetvars = spssdata.Spssdata("Var3", names=False).fetchall()
targetvars = [item[0].split("\r") for item in targetvars]
spss.Submit("dataset activate main")
if len(targetvars) > 1:
    targetvars = "\n".join(targetvars[0] + targetvars[1])
else:
    targetvars = "\n".join(targetvars[0])
spss.Submit("delete variables " + targetvars)
end program.



On Fri, Sep 21, 2018 at 10:40 AM Sammm313 <[hidden email]> wrote:
Hi,

So I have a dataset will a couple thousand observations and a couple
thousand variables (columns) that I am tryin to clean. I now want to remove
all columns that have more than 20% missing cases. Is there a way I can do
this for all my variables without having to go through them one by one and
delete them that way?

I would do this with one line of code in python with this line. So I'm
looking for a similar thing to do in SPSS.

# drop columns with more than 20% missing
df = df.dropna(thresh=0.8*len(df), axis=1)



--
Sent from: http://spssx-discussion.1045642.n5.nabble.com/

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD


--
Jon K Peck
[hidden email]

===================== To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD
Reply | Threaded
Open this post in threaded view
|

Re: Delete columns with more than 20% missing cases

Timothy Hennigar-2

That’s a complicated question depending on where the data comes from – cause if it is survey data – you likely have questions that resp.

do not answer legitimately due to skipping – presumably you would mean sysmis in variables THAT SHOULD HAVE AN ANSWER as

distinguished from general missing data and the number of questions that should have an answer will vary resp. to resp. in most surveys

 

If you simply had 100 sequentially defined variables each with known “SYSMIS” (say the system missing value) you could use count

 

COUNT x= Var1 to Var100 (SYSMIS).

 

FRE x  to view distribution

 

SEL IF (x LE 20).

 

That type of thing

 

You can do this with string variable also

 

 

 

 

 

 

 

 

 

 

 

 

*********************************

Notice: This e-mail and any attachments may contain confidential and privileged information.  If you are not the intended recipient, please notify the sender immediately by return e-mail, do not use the information, delete this e-mail and destroy any copies.  Any dissemination or use of this information by a person other than the intended recipient is unauthorized and may be illegal.  Email transmissions cannot be guaranteed to be secure or error free. The sender therefore does not accept any liability for errors or omissions in the contents of this message that arise as a result of email transmissions.

 

From: SPSSX(r) Discussion <[hidden email]> On Behalf Of Jon Peck
Sent: Friday, September 21, 2018 2:03 PM
To: [hidden email]
Subject: Re: Delete columns with more than 20% missing cases

 

This takes more code in Statistics.  If you have the Data Validation option, the following code will do it.  If not, something similar can be done with AGGREGATE, so post back if you need that.  It could also be done with a litle pure Python code, but that would be slower to execute.

 

This code captures the output from VALIDATE DATA as a dataset and then generates the DELETE VARIABLES command appropriately.  It's a little convoluted because the structure of the table from VALIDATE DATA is rather peculiar.

 

dataset name main.

preserve.

set tvars=names.

dataset declare checks.

oms select tables /if subtypes='VariableChecks' /destination outfile=checks format=sav.

VALIDATEDATA VARIABLES=ALL

  /VARCHECKS STATUS=ON PCTMISSING=20 PCTEQUAL=100 PCTUNEQUAL=100

  /CASECHECKS REPORTEMPTY=NO SCOPE=ALLVARS

  /CASEREPORT DISPLAY=NO.

omsend.

restore.

 

begin program.

import spss, spssdata

 

spss.Submit("dataset activate checks")

targetvars = spssdata.Spssdata("Var3", names=False).fetchall()

targetvars = [item[0].split("\r") for item in targetvars]

spss.Submit("dataset activate main")

if len(targetvars) > 1:

    targetvars = "\n".join(targetvars[0] + targetvars[1])

else:

    targetvars = "\n".join(targetvars[0])

spss.Submit("delete variables " + targetvars)

end program.

 

 

 

On Fri, Sep 21, 2018 at 10:40 AM Sammm313 <[hidden email]> wrote:

Hi,

So I have a dataset will a couple thousand observations and a couple
thousand variables (columns) that I am tryin to clean. I now want to remove
all columns that have more than 20% missing cases. Is there a way I can do
this for all my variables without having to go through them one by one and
delete them that way?

I would do this with one line of code in python with this line. So I'm
looking for a similar thing to do in SPSS.

# drop columns with more than 20% missing
df = df.dropna(thresh=0.8*len(df), axis=1)



--
Sent from: http://spssx-discussion.1045642.n5.nabble.com/

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD


 

--

Jon K Peck
[hidden email]

===================== To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD

===================== To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD
Reply | Threaded
Open this post in threaded view
|

Re: Delete columns with more than 20% missing cases

PRogman
In reply to this post by Jon Peck
If there are many empty variables within 'the couple of thousand' variables,
will the submitted DELETE VARIABLES command work if it exceeds 256
characters?
/PR



Jon Peck wrote

> This takes more code in Statistics.  If you have the Data Validation
> option, the following code will do it.  If not, something similar can be
> done with AGGREGATE, so post back if you need that.  It could also be done
> with a litle pure Python code, but that would be slower to execute.
>
> This code captures the output from VALIDATE DATA as a dataset and then
> generates the DELETE VARIABLES command appropriately.  It's a little
> convoluted because the structure of the table from VALIDATE DATA is rather
> peculiar.
>
> dataset name main.
> preserve.
> set tvars=names.
> dataset declare checks.
> oms select tables /if subtypes='VariableChecks' /destination
> outfile=checks
> format=sav.
> VALIDATEDATA VARIABLES=ALL
>   /VARCHECKS STATUS=ON PCTMISSING=20 PCTEQUAL=100 PCTUNEQUAL=100
>   /CASECHECKS REPORTEMPTY=NO SCOPE=ALLVARS
>   /CASEREPORT DISPLAY=NO.
> omsend.
> restore.
>
> begin program.
> import spss, spssdata
>
> spss.Submit("dataset activate checks")
> targetvars = spssdata.Spssdata("Var3", names=False).fetchall()
> targetvars = [item[0].split("\r") for item in targetvars]
> spss.Submit("dataset activate main")
> if len(targetvars) > 1:
>     targetvars = "\n".join(targetvars[0] + targetvars[1])
> else:
>     targetvars = "\n".join(targetvars[0])
> spss.Submit("delete variables " + targetvars)
> end program.
>
>
>
> On Fri, Sep 21, 2018 at 10:40 AM Sammm313 &lt;

> sam86m@

> &gt; wrote:
>
>> Hi,
>>
>> So I have a dataset will a couple thousand observations and a couple
>> thousand variables (columns) that I am tryin to clean. I now want to
>> remove
>> all columns that have more than 20% missing cases. Is there a way I can
>> do
>> this for all my variables without having to go through them one by one
>> and
>> delete them that way?
>>
>> I would do this with one line of code in python with this line. So I'm
>> looking for a similar thing to do in SPSS.
>>
>> # drop columns with more than 20% missing
>> df = df.dropna(thresh=0.8*len(df), axis=1)
>>
>>
>>
>> --
>> Sent from: http://spssx-discussion.1045642.n5.nabble.com/
>>
>> =====================
>> To manage your subscription to SPSSX-L, send a message to
>>

> LISTSERV@.UGA

>  (not to SPSSX-L), with no body text except the
>> command. To leave the list, send the command
>> SIGNOFF SPSSX-L
>> For a list of commands to manage subscriptions, send the command
>> INFO REFCARD
>>
>
>
> --
> Jon K Peck

> jkpeck@

>
> =====================
> To manage your subscription to SPSSX-L, send a message to

> LISTSERV@.UGA

>  (not to SPSSX-L), with no body text except the
> command. To leave the list, send the command
> SIGNOFF SPSSX-L
> For a list of commands to manage subscriptions, send the command
> INFO REFCARD





--
Sent from: http://spssx-discussion.1045642.n5.nabble.com/

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD
Reply | Threaded
Open this post in threaded view
|

Re: Delete columns with more than 20% missing cases

David Marso-2
In reply to this post by Sammm313
The"\n" inserts a newline between the items

if len(targetvars) > 1:
>     targetvars = "\n".join(targetvars[0] + targetvars[1])
> else:
>     targetvars = "\n".join(targetvars[0])

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD
Reply | Threaded
Open this post in threaded view
|

Re: Delete columns with more than 20% missing cases

Jon Peck
In reply to this post by PRogman
While a single line of syntax cannot exceed the 256-character limit, commands can be much longer.  And, although the code I posted wraps each name onto a separate line, the spss.Submit function actually wraps lines to conform to the line length limit.

On Sun, Sep 23, 2018 at 2:39 AM PRogman <[hidden email]> wrote:
If there are many empty variables within 'the couple of thousand' variables,
will the submitted DELETE VARIABLES command work if it exceeds 256
characters?
/PR



Jon Peck wrote
> This takes more code in Statistics.  If you have the Data Validation
> option, the following code will do it.  If not, something similar can be
> done with AGGREGATE, so post back if you need that.  It could also be done
> with a litle pure Python code, but that would be slower to execute.
>
> This code captures the output from VALIDATE DATA as a dataset and then
> generates the DELETE VARIABLES command appropriately.  It's a little
> convoluted because the structure of the table from VALIDATE DATA is rather
> peculiar.
>
> dataset name main.
> preserve.
> set tvars=names.
> dataset declare checks.
> oms select tables /if subtypes='VariableChecks' /destination
> outfile=checks
> format=sav.
> VALIDATEDATA VARIABLES=ALL
>   /VARCHECKS STATUS=ON PCTMISSING=20 PCTEQUAL=100 PCTUNEQUAL=100
>   /CASECHECKS REPORTEMPTY=NO SCOPE=ALLVARS
>   /CASEREPORT DISPLAY=NO.
> omsend.
> restore.
>
> begin program.
> import spss, spssdata
>
> spss.Submit("dataset activate checks")
> targetvars = spssdata.Spssdata("Var3", names=False).fetchall()
> targetvars = [item[0].split("\r") for item in targetvars]
> spss.Submit("dataset activate main")
> if len(targetvars) > 1:
>     targetvars = "\n".join(targetvars[0] + targetvars[1])
> else:
>     targetvars = "\n".join(targetvars[0])
> spss.Submit("delete variables " + targetvars)
> end program.
>
>
>
> On Fri, Sep 21, 2018 at 10:40 AM Sammm313 &lt;

> sam86m@

> &gt; wrote:
>
>> Hi,
>>
>> So I have a dataset will a couple thousand observations and a couple
>> thousand variables (columns) that I am tryin to clean. I now want to
>> remove
>> all columns that have more than 20% missing cases. Is there a way I can
>> do
>> this for all my variables without having to go through them one by one
>> and
>> delete them that way?
>>
>> I would do this with one line of code in python with this line. So I'm
>> looking for a similar thing to do in SPSS.
>>
>> # drop columns with more than 20% missing
>> df = df.dropna(thresh=0.8*len(df), axis=1)
>>
>>
>>
>> --
>> Sent from: http://spssx-discussion.1045642.n5.nabble.com/
>>
>> =====================
>> To manage your subscription to SPSSX-L, send a message to
>>

> LISTSERV@.UGA

>  (not to SPSSX-L), with no body text except the
>> command. To leave the list, send the command
>> SIGNOFF SPSSX-L
>> For a list of commands to manage subscriptions, send the command
>> INFO REFCARD
>>
>
>
> --
> Jon K Peck

> jkpeck@

>
> =====================
> To manage your subscription to SPSSX-L, send a message to

> LISTSERV@.UGA

>  (not to SPSSX-L), with no body text except the
> command. To leave the list, send the command
> SIGNOFF SPSSX-L
> For a list of commands to manage subscriptions, send the command
> INFO REFCARD





--
Sent from: http://spssx-discussion.1045642.n5.nabble.com/

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD


--
Jon K Peck
[hidden email]

===================== To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD
Reply | Threaded
Open this post in threaded view
|

Re: Delete columns with more than 20% missing cases

MLIves
In reply to this post by Timothy Hennigar-2

This gets the number missing in a record (row) rather than in a variable (column).

M

From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of Timothy Hennigar
Sent: Friday, September 21, 2018 2:17 PM
To: [hidden email]
Subject: Re: [SPSSX-L] Delete columns with more than 20% missing cases

 

That’s a complicated question depending on where the data comes from – cause if it is survey data – you likely have questions that resp.

do not answer legitimately due to skipping – presumably you would mean sysmis in variables THAT SHOULD HAVE AN ANSWER as

distinguished from general missing data and the number of questions that should have an answer will vary resp. to resp. in most surveys

 

If you simply had 100 sequentially defined variables each with known “SYSMIS” (say the system missing value) you could use count

 

COUNT x= Var1 to Var100 (SYSMIS).

 

FRE x  to view distribution

 

SEL IF (x LE 20).

 

That type of thing

 

You can do this with string variable also

 

 

 

 

 

 

 

 

 

 

 

 

*********************************

Notice: This e-mail and any attachments may contain confidential and privileged information.  If you are not the intended recipient, please notify the sender immediately by return e-mail, do not use the information, delete this e-mail and destroy any copies.  Any dissemination or use of this information by a person other than the intended recipient is unauthorized and may be illegal.  Email transmissions cannot be guaranteed to be secure or error free. The sender therefore does not accept any liability for errors or omissions in the contents of this message that arise as a result of email transmissions.

 

From: SPSSX(r) Discussion <[hidden email]> On Behalf Of Jon Peck
Sent: Friday, September 21, 2018 2:03 PM
To: [hidden email]
Subject: Re: Delete columns with more than 20% missing cases

 

This takes more code in Statistics.  If you have the Data Validation option, the following code will do it.  If not, something similar can be done with AGGREGATE, so post back if you need that.  It could also be done with a litle pure Python code, but that would be slower to execute.

 

This code captures the output from VALIDATE DATA as a dataset and then generates the DELETE VARIABLES command appropriately.  It's a little convoluted because the structure of the table from VALIDATE DATA is rather peculiar.

 

dataset name main.

preserve.

set tvars=names.

dataset declare checks.

oms select tables /if subtypes='VariableChecks' /destination outfile=checks format=sav.

VALIDATEDATA VARIABLES=ALL

  /VARCHECKS STATUS=ON PCTMISSING=20 PCTEQUAL=100 PCTUNEQUAL=100

  /CASECHECKS REPORTEMPTY=NO SCOPE=ALLVARS

  /CASEREPORT DISPLAY=NO.

omsend.

restore.

 

begin program.

import spss, spssdata

 

spss.Submit("dataset activate checks")

targetvars = spssdata.Spssdata("Var3", names=False).fetchall()

targetvars = [item[0].split("\r") for item in targetvars]

spss.Submit("dataset activate main")

if len(targetvars) > 1:

    targetvars = "\n".join(targetvars[0] + targetvars[1])

else:

    targetvars = "\n".join(targetvars[0])

spss.Submit("delete variables " + targetvars)

end program.

 

 

 

On Fri, Sep 21, 2018 at 10:40 AM Sammm313 <[hidden email]> wrote:

Hi,

So I have a dataset will a couple thousand observations and a couple
thousand variables (columns) that I am tryin to clean. I now want to remove
all columns that have more than 20% missing cases. Is there a way I can do
this for all my variables without having to go through them one by one and
delete them that way?

I would do this with one line of code in python with this line. So I'm
looking for a similar thing to do in SPSS.

# drop columns with more than 20% missing
df = df.dropna(thresh=0.8*len(df), axis=1)



--
Sent from: http://spssx-discussion.1045642.n5.nabble.com/

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD


 

--

Jon K Peck
[hidden email]

===================== To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD

===================== To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD




This correspondence contains proprietary information some or all of which may be legally privileged; it is for the intended recipient only. If you are not the intended recipient you must not use, disclose, distribute, copy, print, or rely on this correspondence and completely dispose of the correspondence immediately. Please notify the sender if you have received this email in error. NOTE: Messages to or from the State of Connecticut domain may be subject to the Freedom of Information statutes and regulations.

===================== To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD
Reply | Threaded
Open this post in threaded view
|

Re: Delete columns with more than 20% missing cases

Art Kendall
In reply to this post by Sammm313
Please give more detail on the nature of your analysis.

There are many circumstances in which this would be an unwise thing to do.

it is generally a good idea to clean/prepare you data until there are no
variables with values of sysmis.
Even if you have to have a missing value labelled "No idea why missing"

something like
Missing values myvar (lo thru -1).
value labels
-1 "not applicable"
-2 "refused to answer"
-3 "answer unreadable"
-4 "skipped -- interviewer error"
-999 "No idea why missing".

In the long run to is good practice to reserve SYSMIS for the situation
where the value is missing because the computer is unable to follow your
instructions on how to read input or  how to do a computation.

What is a case?
Is there a reason to believe that missingness might be related to what
actual value would be?
Are these summative scale items?
Is it reasonable to assume that values are missing at random?

etc.





-----
Art Kendall
Social Research Consultants
--
Sent from: http://spssx-discussion.1045642.n5.nabble.com/

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD
Art Kendall
Social Research Consultants