Repeat Analysis of the data.

classic Classic list List threaded Threaded
8 messages Options
Reply | Threaded
Open this post in threaded view
|

Repeat Analysis of the data.

Richard AK
Hi Team,
 
I have a data set which contains more than half million records.  These records were been collected from 74 companies. Each company does have an unique identity number with the variable name as "Org_ID". And I have the variable list of 500+ with the names as Q1, Q2, Q3, ......Q500.
 
My task is that, I need to filter out the data in the following way:.
 
For example, with the Org_id = 457895 and responses in the Each question are less than 5, i need to filter out the data in my final calculations. This procedure is do be done for all the variables and for all the companies. The  reason is that we need to apply weight on the data to normalize the scores.
 
 
I tried using the following aggregating syntax for counting n's and then applying filter on each and every company and variable separately.
 
AGGREGATE
 /OUTFILE=*
 MODE=ADDVARIABLES
 /BREAK= ORG_ID
 /Count_q1 = NU(q1).
 
And creating the following filters to separte the data.
 
if (CID = 457895 and count_q1>=1) filt_457895 = 1.
if (CID = 457897 and count_q1>=1) filt_457897 = 1.
....
....
 
So, i need to repeat the syntax for 74*500 = 37000.
 
This is huge task by looking at the data size and the repeats that i need to do after the same.  Please suggest a macro, loop some other way find a best solution.
 
 
Thanks in advance,
Richard.
 
 
 

exe.
 
 
 
 
 
 
 
Reply | Threaded
Open this post in threaded view
|

Descriptive Stats.

Humphrey Paulie

Dear all,  

I want to find out the percentage of representation of 7 content areas from 4 books in two exams.

 

As I said there are 4 books and 7 content areas (grammar, vocabulary, language functions etc….).

I have analysed the two exams and counted the number of grammar, vocabulary, etc questions in each exam and have also determined which of the 4 books they belong to.

Now I want to calculate the percentages and get some SPSS charts to represent the findings graphically.

I just don’t have a clue how to enter such data in SPSS.

Any comment will be appreciated.

 Regards

Humphrey

 

 


Reply | Threaded
Open this post in threaded view
|

Re: Repeat Analysis of the data.

Richard Ristow
In reply to this post by Richard AK
At 03:47 AM 11/19/2009, Richard AK wrote:

I have a data set which contains records from 74 companies. Each company does have an unique identity number with the variable name as "Org_ID". And I have the variable list of 500+ with the names as Q1, Q2, Q3, ......Q500.
 
My task is that, I need to filter out the data in the following way:.
 
For example, with the Org_id = 457895 and responses in the Each question are less than 5

This isn't clear, but I'll go on, assuming there's a variable named 'Each' and you want to select if that company has 5 non-missing responses to the 'Each' question.

I tried using the following aggregating syntax for counting n's and then applying filter on each and every company and variable separately.
 
AGGREGATE
 /OUTFILE=*
 MODE=ADDVARIABLES
 /BREAK= ORG_ID
 /Count_q1 = NU(q1).
 
And creating the following filters to separate the data.
 
if (CID = 457895 and count_q1>=1) filt_457895 = 1.
if (CID = 457897 and count_q1>=1) filt_457897 = 1.

 
So, i need to repeat the syntax for 74*500 = 37000.

Huh? For pity's sake, why the 'if' statements and a separate filter variable for each company? Why not

if (count_q1>=1) filt = 1.

If you really need separate filter variables, I think it could be done with AGGREGATE followed by CASESTOVARS, but I'm not going to write that code unless there's a reason you really need them all.

Going on, you write,

So, I need to repeat the syntax for 74*500 = 37000.

You mean, for 500 variables? First, what do you really want? Repeating the syntax you have for every variable will select all companies with any data in any variable. You sound like you want companies with at least five cases with data for all variables -- am I close? THAT doesn't need a macro or Python or any such thing, either, but again I'm not writing code until I'm clearer what you want.

-Best regards,
 Richard Ristow
===================== To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD
Reply | Threaded
Open this post in threaded view
|

Re: Descriptive Stats.

Steve Simon, P.Mean Consulting
In reply to this post by Humphrey Paulie
Humphrey Paulie wrote:

> I want to find out the percentage of representation of 7 content areas
> from 4 books in two exams.
>
> As I said there are 4 books and 7 content areas (grammar, vocabulary,
> language functions etc….).
>
> I have analysed the two exams and counted the number of grammar,
> vocabulary, etc questions in each exam and have also determined which of
> the 4 books they belong to.
>
> Now I want to calculate the percentages and get some SPSS charts to
> represent the findings graphically.
>
> I just don’t have a clue how to enter such data in SPSS.

I wrote about this back in 1999 at my old website:

  * www.childrens-mercy.org/stats/data/table.asp

My example is a two by two table, but conceptually, a four by seven
table is no different, other than the size. Also, please note that the
dialog boxes are from an eariler version. Again, that should not be too
confusing.

Once you get the data in, look at the options under the CROSSTABS dialog
box (in particular, the CELLS button). That should get you started.
--
Steve Simon, Standard Disclaimer
The Monthly Mean is celebrating its first anniversary.
Find out more about the newsletter that dares
to call itself "average" at www.pmean.com/news

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD
Reply | Threaded
Open this post in threaded view
|

Re: Repeat Analysis of the data.

Richard AK
In reply to this post by Richard Ristow
Hi Richard,
 
Sorry if i confused with my previous query. I will try to rephrase/explain with the following example data set.
 
org_id               q1         q2

----------------------------

457895            3          4

457895            3          4

457895            3          4

457895            5          4

457895            6          4

457896            4          3

457896            4          .

457896            5          3

457896            5          3

457896            5          .


In the above data set, I need to include the data of orgnizations whose count is >=5 in my overall analysis. For the first, question both the organizations are meeting the criteria.
 
But for Q2, i can include only the data for first org but for second org_id= 457896 i cannot include the data in my final overall analysis as it's count is <5.
 
In the same way, i have to validate a set of 74 organizations in org_id column and some 500 questions with half million records.  
 
Please suggest a syntax.
 
Thanks in advance,
Richards.
 

 
On Thu, Nov 19, 2009 at 7:50 PM, Richard Ristow <[hidden email]> wrote:
At 03:47 AM 11/19/2009, Richard AK wrote:

I have a data set which contains records from 74 companies. Each company does have an unique identity number with the variable name as "Org_ID". And I have the variable list of 500+ with the names as Q1, Q2, Q3, ......Q500.
 
My task is that, I need to filter out the data in the following way:.
 
For example, with the Org_id = 457895 and responses in the Each question are less than 5

This isn't clear, but I'll go on, assuming there's a variable named 'Each' and you want to select if that company has 5 non-missing responses to the 'Each' question.

I tried using the following aggregating syntax for counting n's and then applying filter on each and every company and variable separately.
 
AGGREGATE
 /OUTFILE=*
 MODE=ADDVARIABLES
 /BREAK= ORG_ID
 /Count_q1 = NU(q1).
 
And creating the following filters to separate the data.
 
if (CID = 457895 and count_q1>=1) filt_457895 = 1.
if (CID = 457897 and count_q1>=1) filt_457897 = 1.

 
So, i need to repeat the syntax for 74*500 = 37000.

Huh? For pity's sake, why the 'if' statements and a separate filter variable for each company? Why not

if (count_q1>=1) filt = 1.

If you really need separate filter variables, I think it could be done with AGGREGATE followed by CASESTOVARS, but I'm not going to write that code unless there's a reason you really need them all.

Going on, you write,


So, I need to repeat the syntax for 74*500 = 37000.

You mean, for 500 variables? First, what do you really want? Repeating the syntax you have for every variable will select all companies with any data in any variable. You sound like you want companies with at least five cases with data for all variables -- am I close? THAT doesn't need a macro or Python or any such thing, either, but again I'm not writing code until I'm clearer what you want.

-Best regards,
 Richard Ristow


Reply | Threaded
Open this post in threaded view
|

Re: Repeat Analysis of the data.

Maguin, Eugene
Richard,

>>Sorry if i confused with my previous query. I will try to rephrase/explain
with the following example data set.

org_id           q1         q2
----------------------------------
457895            3          4
457895            3          4
457895            3          4
457895            5          4
457895            6          4
457896            4          3
457896            4          .
457896            5          3
457896            5          3
457896            5          .

>>In the above data set, I need to include the data of orgnizations whose
count is >=5 in my overall analysis. For the first, question both the
organizations are meeting the criteria.

>>But for Q2, i can include only the data for first org but for second
org_id= 457896 i cannot include the data in my final overall analysis as
it's count is <5.

>>In the same way, i have to validate a set of 74 organizations in org_id
column and some 500 questions with half million records.

I think a better, clearer way of saying part of what you want is to say "for
items Q1 to Q500, I want to determine the total number of valid responses to
each item within each organization."  Do you agree with this restatement?

So, as you say, org_id=457896 has fewer than 5 valid responses for Q2. So,
knowing that then what do you want to do? Do you want to exclude
org_id=457896 from subsequent analyses involving Q2? Something else? If so,
what else?

>>In the same way, i have to validate a set of 74 organizations in org_id
column and some 500 questions with half million records.

The word 'validate' is not meaningful to me. Please define computationally
how you will validate the data. I ask because you want syntax to do this and
you haven't defined what 'validate' means.

Gene Maguin

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD
Reply | Threaded
Open this post in threaded view
|

Re: Repeat Analysis of the data.

Richard Ristow
In reply to this post by Richard AK
At 12:49 PM 11/21/2009, Richard AK wrote:

Sorry if i confused with my previous query. I will try to rephrase/explain with the following example data set.
 
org_id   q1   q2
----------------------------
457895            3          4
457895            3          4
457895            3          4
457895            5          4
457895            6          4
457896            4          3
457896            4          .
457896            5          3
457896            5          3
457896            5          .

I need to [analyze] the data of organizations whose count is >=5 in my overall analysis. For the first question both the organizations are meeting the criteria. But for Q2, i can include only the data for first org but for second org_id= 457896 i cannot include the data in my final overall analysis as its count is <5.
(Emphasis added)

Ah, OK. You're analyzing the variables individually, and want to keep, for each organization, those variables for which the organization has at least 5 observed values. I would handle this (with no surprise to those who know me) by 'unrolling' the data to one record per company per variable, and then count within companies. Then you can count valid values of each variable within companies without any looping in your syntax. After selection, you may want to analyze the data in this 'long' form; or, you can restore it to the original 'wide' form, with variables not eligible for analysis always missing. (Incidentally, this will lose variable labels on the q's):

*  Add a serial number for instances within organizations: .

NUMERIC Instance (F3).
DO IF   $CASENUM EQ 1.
.  COMPUTE Instance = 1.
ELSE IF org_id NE LAG(org_id).
.  COMPUTE Instance = 1.
ELSE.
.  COMPUTE Instance = LAG(Instance) + 1.
END IF.


*  If the following "LIST" is removed, EXECUTE is needed   .
LIST.

|-----------------------------|---------------------------|
|Output Created               |23-NOV-2009 11:58:47       |
|-----------------------------|---------------------------|
org_id  q1  q2 Instance

457895   3   4      1
457895   3   4      2
457895   3   4      3
457895   5   4      4
457895   6   4      5
457896   4   3      1
457896   4   .      2
457896   5   3      3
457896   5   3      4
457896   5   .      5

Number of cases read:  10    Number of cases listed:  10

 
*  'Unroll':                                               .
VARSTOCASES
  /MAKE    Value "Vbl's value"  FROM q1 q2
  /INDEX = Name  "Vbl's name"   (Value)
  /KEEP  = org_id Instance
  /NULL  = KEEP.
 
Variables to Cases
|-----------------------------|---------------------------|
|Output Created               |23-NOV-2009 11:58:48       |
|-----------------------------|---------------------------|
|(Report of generated variables suppressed)               |
|-----------------------------|---------------------------|
 
 LIST.

List
|-----------------------------|---------------------------|
|Output Created               |23-NOV-2009 11:58:50       |
|-----------------------------|---------------------------|
org_id Instance Name Value

457895      1   q1      3
457895      1   q2      4
457895      2   q1      3
457895      2   q2      4
457895      3   q1      3
457895      3   q2      4
457895      4   q1      5
457895      4   q2      4
457895      5   q1      6
457895      5   q2      4
457896      1   q1      4
457896      1   q2      3
457896      2   q1      4
457896      2   q2      .
457896      3   q1      5
457896      3   q2      3
457896      4   q1      5
457896      4   q2      3
457896      5   q1      5
457896      5   q2      .

Number of cases read:  20    Number of cases listed:  20

 
*  Count occurrences of each variable, by organization,    .
*  and keep only variables with >=5 occurrences:           .

AGGREGATE OUTFILE=* MODE=ADDVARIABLES
   /BREAK = org_id Name
   /Occurrences 'Valid values, this organization'=NU(Value).
FORMATS Occurrences (F3).

SELECT IF Occurrences GE 5.

LIST.
List
|-----------------------------|---------------------------|
|Output Created               |23-NOV-2009 11:58:52       |
|-----------------------------|---------------------------|
org_id Instance Name Value Occurrences

457895      1   q1      3        5
457895      1   q2      4        5
457895      2   q1      3        5
457895      2   q2      4        5
457895      3   q1      3        5
457895      3   q2      4        5
457895      4   q1      5        5
457895      4   q2      4        5
457895      5   q1      6        5
457895      5   q2      4        5
457896      1   q1      4        5
457896      2   q1      4        5
457896      3   q1      5        5
457896      4   q1      5        5
457896      5   q1      5        5

Number of cases read:  15    Number of cases listed:  15

 
*  Finally, if desired, roll the data back to 'wide' form: .
CASESTOVARS
 /ID = org_id Instance
 /INDEX   = Name
 /GROUPBY = VARIABLE
 /DROP    = Occurrences.

Cases to Variables
|-----------------------------|---------------------------|
|Output Created               |23-NOV-2009 11:58:52       |
|-----------------------------|---------------------------|
| (Report of generated variables, and processing          |
|  statistics, suppressed)                                |
|-----------------------------|---------------------------|
 
LIST.
List
|-----------------------------|---------------------------|
|Output Created               |23-NOV-2009 11:58:54       |
|-----------------------------|---------------------------|
org_id Instance  q1  q2

457895      1     3   4
457895      2     3   4
457895      3     3   4
457895      4     5   4
457895      5     6   4
457896      1     4   .
457896      2     4   .
457896      3     5   .
457896      4     5   .
457896      5     5   .

Number of cases read:  10    Number of cases listed:  10
=============================
APPENDIX: Test data, and code
=============================
PRESERVE.
SET MXWARNS 0.
DATA LIST LIST/
   org_id           q1         q2
  (F6,              F3,        F3).
*  -------------------------------.
BEGIN DATA
   457895            3          4
   457895            3          4
   457895            3          4
   457895            5          4
   457895            6          4
   457896            4          3
   457896            4          .
   457896            5          3
   457896            5          3
   457896            5          .
END DATA.
RESTORE.

*  Add a serial number for instances within organizations: .

NUMERIC Instance (F3).
DO IF   $CASENUM EQ 1.
.  COMPUTE Instance = 1.
ELSE IF org_id NE LAG(org_id).
.  COMPUTE Instance = 1.
ELSE.
.  COMPUTE Instance = LAG(Instance) + 1.
END IF.


*  If the following "LIST" is removed, EXECUTE is needed   .
LIST.

*  'Unroll':                                               .
VARSTOCASES 
  /MAKE    Value "Vbl's value"  FROM q1 q2
  /INDEX = Name  "Vbl's name"   (Value)
  /KEEP  = org_id Instance
  /NULL  = KEEP.

LIST.


*  Count occurrences of each variable, by organization,    .
*  and keep only variables with >=5 occurrences:           .

AGGREGATE OUTFILE=* MODE=ADDVARIABLES
   /BREAK = org_id Name
   /Occurrences 'Valid values, this organization'=NU(Value).
FORMATS Occurrences (F3).

SELECT IF Occurrences GE 5.

LIST.


*  Finally, if desired, roll the data back to 'wide' form: .
CASESTOVARS
 /ID = org_id Instance
 /INDEX   = Name
 /GROUPBY = VARIABLE
 /DROP    = Occurrences.

LIST.

===================== To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD
Reply | Threaded
Open this post in threaded view
|

Re: Repeat Analysis of the data.

Bruce Weaver
Administrator
Richard Ristow wrote
--- snip ---


* Add a serial number for instances within organizations: .

NUMERIC Instance (F3).
DO IF   $CASENUM EQ 1.
.  COMPUTE Instance = 1.
ELSE IF org_id NE LAG(org_id).
.  COMPUTE Instance = 1.
ELSE.
.  COMPUTE Instance = LAG(Instance) + 1.
END IF.
EXE.
That's how I've typically generated serial numbers too.  But a while ago, a colleague pointed out another method, using RANK.  For this case, it would look something like:

compute case = $casenum.
RANK VARIABLES=case (A) BY org_id
  /RANK
  /PRINT=YES
  /TIES=MEAN.
rename var(rcase = instance).
format instance(f2.0).


I don't know if it's any easier than the other way, but it does eliminate the DO-IF ELSE-IF structure, which some people may find a bit tricky.  (Note that RANK needs a real variable where I used CASE.  I tried with $CASENUM there, and with a scratch variable, but neither worked.)

Cheers,
Bruce
--
Bruce Weaver
bweaver@lakeheadu.ca
http://sites.google.com/a/lakeheadu.ca/bweaver/

"When all else fails, RTFM."

PLEASE NOTE THE FOLLOWING: 
1. My Hotmail account is not monitored regularly. To send me an e-mail, please use the address shown above.
2. The SPSSX Discussion forum on Nabble is no longer linked to the SPSSX-L listserv administered by UGA (https://listserv.uga.edu/).