Hi Team,
I have a data set containing more than half a million records, collected from 74 companies. Each company has a unique identity number in the variable "Org_ID", and there are 500+ question variables named Q1, Q2, Q3, ... Q500.
My task is to filter the data in the following way:
For example, if for Org_ID = 457895 the count of responses to a given question is less than 5, I need to filter that data out of my final calculations. This procedure is to be done for all the variables and for all the companies. The reason is that we need to apply weights to the data to normalize the scores.
I tried using the following AGGREGATE syntax to count n's, and then applying a filter for each company and variable separately:
AGGREGATE
  /OUTFILE=* MODE=ADDVARIABLES
  /BREAK = ORG_ID
  /Count_q1 = NU(q1).
And creating the following filters to separate the data:
if (org_id = 457895 and count_q1 >= 1) filt_457895 = 1.
if (org_id = 457897 and count_q1 >= 1) filt_457897 = 1.
...
So, I would need to repeat the syntax 74*500 = 37,000 times.
This is a huge task given the data size and the repetition involved. Please suggest a macro, a loop, or some other way to find the best solution.
Thanks in advance,
Richard.
exe.
In reply to this post by Richard AK
At 03:47 AM 11/19/2009, Richard AK wrote:
>I have a data set which contains records from 74 companies. Each company does have an unique identity number with the variable name as "Org_ID". And I have the variable list of 500+ with the names as Q1, Q2, Q3, ......Q500.

This isn't clear, but I'll go on, assuming there's a variable named 'Each' and you want to select if that company has 5 non-missing responses to the 'Each' question.

>I tried using the following aggregating syntax for counting n's and then applying filter on each and every company and variable separately.

Huh? For pity's sake, why the 'if' statements and a separate filter variable for each company? Why not

if (count_q1 >= 1) filt = 1.

If you really need separate filter variables, I think it could be done with AGGREGATE followed by CASESTOVARS, but I'm not going to write that code unless there's a reason you really need them all.

Going on, you write,

>So, I need to repeat the syntax for 74*500 = 37000.

You mean, for 500 variables? First, what do you really want? Repeating the syntax you have for every variable will select all companies with any data in any variable. You sound like you want companies with at least five cases with data for all variables -- am I close?

THAT doesn't need a macro or Python or any such thing, either, but again I'm not writing code until I'm clearer what you want.

-Best regards,
Richard Ristow

=====================
To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD
In reply to this post by Humphrey Paulie
Humphrey Paulie wrote:
> I want to find out the percentage of representation of 7 content areas
> from 4 books in two exams.
>
> As I said there are 4 books and 7 content areas (grammar, vocabulary,
> language functions etc.).
>
> I have analysed the two exams and counted the number of grammar,
> vocabulary, etc. questions in each exam and have also determined which of
> the 4 books they belong to.
>
> Now I want to calculate the percentages and get some SPSS charts to
> represent the findings graphically.
>
> I just don't have a clue how to enter such data in SPSS.

I wrote about this back in 1999 at my old website:

* www.childrens-mercy.org/stats/data/table.asp

My example is a two-by-two table, but conceptually a four-by-seven table is no different, other than the size. Also, please note that the dialog boxes are from an earlier version. Again, that should not be too confusing.

Once you get the data in, look at the options under the CROSSTABS dialog box (in particular, the CELLS button). That should get you started.

--
Steve Simon, Standard Disclaimer

The Monthly Mean is celebrating its first anniversary. Find out more about the newsletter that dares to call itself "average" at www.pmean.com/news
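For readers working outside SPSS, the same enter-the-classified-questions-then-tabulate idea can be sketched in Python with pandas. The book and content-area labels below are invented for illustration, and crosstab's normalize option plays the role of the percentage options under the CELLS button:

```python
import pandas as pd

# One row per exam question, already classified by book and content area.
# Book and area labels here are hypothetical, for illustration only.
questions = pd.DataFrame({
    "book": ["Book1", "Book1", "Book2", "Book2", "Book3", "Book4", "Book4"],
    "area": ["grammar", "vocabulary", "grammar", "functions",
             "vocabulary", "grammar", "functions"],
})

# Books-by-areas table as percentages of all questions
# (the analogue of CROSSTABS with total percentages requested in CELLS).
pct = pd.crosstab(questions["book"], questions["area"], normalize="all") * 100
print(pct.round(1))
```

Each cell is the share of all questions falling in that book/area combination, so the whole table sums to 100.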
In reply to this post by Richard Ristow
Hi Richard,
Sorry if I confused you with my previous query. I will try to rephrase and explain with the following example data set.
org_id q1 q2
----------------
457895   3   4
457895   3   4
457895   3   4
457895   5   4
457895   6   4
457896   4   3
457896   4   .
457896   5   3
457896   5   3
457896   5   .

In the above data set, I need to include in my overall analysis only the data of organizations whose count is >= 5. For the first question, both organizations meet the criterion. But for Q2 I can include only the data for the first organization; for the second, org_id = 457896, I cannot include the data in my final overall analysis because its count is < 5.
In the same way, I have to validate a set of 74 organizations in the org_id column and some 500 questions across half a million records.
Please suggest suitable syntax.
Thanks in advance,
Richards.
On Thu, Nov 19, 2009 at 7:50 PM, Richard Ristow <[hidden email]> wrote:
Richard,
>>Sorry if i confused with my previous query. I will try to rephrase/explain with the following example data set.
>>
>>org_id  q1  q2
>>----------------
>>457895   3   4
>>457895   3   4
>>457895   3   4
>>457895   5   4
>>457895   6   4
>>457896   4   3
>>457896   4   .
>>457896   5   3
>>457896   5   3
>>457896   5   .
>>
>>In the above data set, I need to include the data of orgnizations whose count is >=5 in my overall analysis. For the first question both the organizations are meeting the criteria. But for Q2, i can include only the data for first org but for second org_id= 457896 i cannot include the data in my final overall analysis as it's count is <5.

I think a better, clearer way of saying part of what you want is: "for items Q1 to Q500, I want to determine the total number of valid responses to each item within each organization." Do you agree with this restatement?

So, as you say, org_id=457896 has fewer than 5 valid responses for Q2. Knowing that, what do you want to do? Do you want to exclude org_id=457896 from subsequent analyses involving Q2? Something else? If so, what else?

>>In the same way, i have to validate a set of 74 organizations in org_id column and some 500 questions with half million records.

The word 'validate' is not meaningful to me. Please define computationally how you will validate the data. I ask because you want syntax to do this and you haven't defined what 'validate' means.

Gene Maguin
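For readers following along outside SPSS, Gene's restatement -- "the total number of valid responses to each item within each organization" -- can be sketched in Python with pandas, using the thread's example data:

```python
import numpy as np
import pandas as pd

# The thread's example data; np.nan stands for SPSS system-missing ('.').
df = pd.DataFrame({
    "org_id": [457895] * 5 + [457896] * 5,
    "q1": [3, 3, 3, 5, 6, 4, 4, 5, 5, 5],
    "q2": [4, 4, 4, 4, 4, 3, np.nan, 3, 3, np.nan],
})

# count() ignores missing values, so this is the number of valid
# responses to each item within each organization.
valid_counts = df.groupby("org_id")[["q1", "q2"]].count()
print(valid_counts)
# org_id 457895: q1 = 5, q2 = 5;  org_id 457896: q1 = 5, q2 = 3
```

As in the thread, org_id 457896 falls below 5 valid responses only for Q2.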
In reply to this post by Richard AK
At 12:49 PM 11/21/2009, Richard AK wrote:
>Sorry if i confused with my previous query. I will try to rephrase/explain with the following example data set.
>
>org_id  q1  q2
>----------------
>457895   3   4
>457895   3   4
>457895   3   4
>457895   5   4
>457895   6   4
>457896   4   3
>457896   4   .
>457896   5   3
>457896   5   3
>457896   5   .
>
>I need to [analyze] the data of organizations whose count is >=5 in my overall analysis. For the first question both the organizations are meeting the criteria. But for Q2, i can include only the data for first org but for second org_id= 457896 i cannot include the data in my final overall analysis as its count is <5. (Emphasis added)

Ah, OK. You're analyzing the variables individually, and want to keep, for each organization, those variables for which the organization has at least 5 observed values.

I would handle this (with no surprise to those who know me) by 'unrolling' the data to one record per company per variable, and then counting within companies. That way you can count valid values of each variable within companies without any looping in your syntax. After selection, you may want to analyze the data in this 'long' form; or you can restore it to the original 'wide' form, with variables not eligible for analysis always missing. (Incidentally, this will lose variable labels on the q's.)

* Add a serial number for instances within organizations: .
NUMERIC Instance (F3).
DO IF $CASENUM EQ 1.
.  COMPUTE Instance = 1.
ELSE IF org_id NE LAG(org_id).
.  COMPUTE Instance = 1.
ELSE.
.  COMPUTE Instance = LAG(Instance) + 1.
END IF.
* If the following "LIST" is removed, EXECUTE is needed .
LIST.

org_id  q1  q2  Instance
457895   3   4     1
457895   3   4     2
457895   3   4     3
457895   5   4     4
457895   6   4     5
457896   4   3     1
457896   4   .     2
457896   5   3     3
457896   5   3     4
457896   5   .     5

Number of cases read: 10   Number of cases listed: 10

* 'Unroll': .
VARSTOCASES
  /MAKE  Value "Vbl's value" FROM q1 q2
  /INDEX = Name "Vbl's name" (Value)
  /KEEP  = org_id Instance
  /NULL  = KEEP.
LIST.

org_id  Instance  Name  Value
457895     1       q1     3
457895     1       q2     4
457895     2       q1     3
457895     2       q2     4
457895     3       q1     3
457895     3       q2     4
457895     4       q1     5
457895     4       q2     4
457895     5       q1     6
457895     5       q2     4
457896     1       q1     4
457896     1       q2     3
457896     2       q1     4
457896     2       q2     .
457896     3       q1     5
457896     3       q2     3
457896     4       q1     5
457896     4       q2     3
457896     5       q1     5
457896     5       q2     .

Number of cases read: 20   Number of cases listed: 20

* Count occurrences of each variable, by organization, .
* and keep only variables with >=5 occurrences: .
AGGREGATE OUTFILE=* MODE=ADDVARIABLES
  /BREAK = org_id Name
  /Occurrences 'Valid values, this organization' = NU(Value).
FORMATS Occurrences (F3).
SELECT IF Occurrences GE 5.
LIST.

org_id  Instance  Name  Value  Occurrences
457895     1       q1     3        5
457895     1       q2     4        5
457895     2       q1     3        5
457895     2       q2     4        5
457895     3       q1     3        5
457895     3       q2     4        5
457895     4       q1     5        5
457895     4       q2     4        5
457895     5       q1     6        5
457895     5       q2     4        5
457896     1       q1     4        5
457896     2       q1     4        5
457896     3       q1     5        5
457896     4       q1     5        5
457896     5       q1     5        5

Number of cases read: 15   Number of cases listed: 15

* Finally, if desired, roll the data back to 'wide' form: .
CASESTOVARS
  /ID = org_id Instance
  /INDEX = Name
  /GROUPBY = VARIABLE
  /DROP = Occurrences.
LIST.

org_id  Instance  q1  q2
457895     1       3   4
457895     2       3   4
457895     3       3   4
457895     4       5   4
457895     5       6   4
457896     1       4   .
457896     2       4   .
457896     3       5   .
457896     4       5   .
457896     5       5   .

Number of cases read: 10   Number of cases listed: 10

=============================
APPENDIX: Test data, and code
=============================
PRESERVE.
SET MXWARNS 0.
DATA LIST LIST/ org_id q1 q2 (F6, F3, F3).
BEGIN DATA
457895 3 4
457895 3 4
457895 3 4
457895 5 4
457895 6 4
457896 4 3
457896 4 .
457896 5 3
457896 5 3
457896 5 .
END DATA.
RESTORE.

* Add a serial number for instances within organizations: .
NUMERIC Instance (F3).
DO IF $CASENUM EQ 1.
.  COMPUTE Instance = 1.
ELSE IF org_id NE LAG(org_id).
.  COMPUTE Instance = 1.
ELSE.
.  COMPUTE Instance = LAG(Instance) + 1.
END IF.
* If the following "LIST" is removed, EXECUTE is needed .
LIST.

* 'Unroll': .
VARSTOCASES
  /MAKE  Value "Vbl's value" FROM q1 q2
  /INDEX = Name "Vbl's name" (Value)
  /KEEP  = org_id Instance
  /NULL  = KEEP.
LIST.

* Count occurrences of each variable, by organization, .
* and keep only variables with >=5 occurrences: .
AGGREGATE OUTFILE=* MODE=ADDVARIABLES
  /BREAK = org_id Name
  /Occurrences 'Valid values, this organization' = NU(Value).
FORMATS Occurrences (F3).
SELECT IF Occurrences GE 5.
LIST.

* Finally, if desired, roll the data back to 'wide' form: .
CASESTOVARS
  /ID = org_id Instance
  /INDEX = Name
  /GROUPBY = VARIABLE
  /DROP = Occurrences.
LIST.
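As a cross-check for readers outside SPSS, here is a pandas sketch of the same unroll / count / select / re-roll pipeline, using the thread's example data: melt plays the role of VARSTOCASES, a grouped count plays AGGREGATE ... NU(Value), and pivot plays CASESTOVARS. The column names are my own:

```python
import numpy as np
import pandas as pd

# The thread's example data; np.nan stands for SPSS system-missing ('.').
df = pd.DataFrame({
    "org_id": [457895] * 5 + [457896] * 5,
    "q1": [3, 3, 3, 5, 6, 4, 4, 5, 5, 5],
    "q2": [4, 4, 4, 4, 4, 3, np.nan, 3, 3, np.nan],
})
df["instance"] = df.groupby("org_id").cumcount() + 1  # serial number within org

# 'Unroll' to one row per organization, instance, and variable (VARSTOCASES).
long_df = df.melt(id_vars=["org_id", "instance"],
                  var_name="name", value_name="value")

# Count valid values per organization and variable (AGGREGATE ... NU(Value)),
# then keep only combinations with at least 5 valid values (SELECT IF).
long_df["occurrences"] = (long_df.groupby(["org_id", "name"])["value"]
                          .transform("count"))
kept = long_df[long_df["occurrences"] >= 5]

# Roll back to 'wide' form (CASESTOVARS); dropped cells become missing.
wide = kept.pivot(index=["org_id", "instance"], columns="name", values="value")
print(wide)  # q2 is entirely missing for org_id 457896
```

As in the SPSS output, org_id 457896 keeps its q1 values but ends up with q2 missing throughout.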
That's how I've typically generated serial numbers too. But a while ago, a colleague pointed out another method, using RANK. For this case, it would look something like:

compute case = $casenum.
RANK VARIABLES=case (A) BY org_id
  /RANK
  /PRINT=YES
  /TIES=MEAN.
rename var (rcase = instance).
format instance (f2.0).

I don't know if it's any easier than the other way, but it does eliminate the DO-IF/ELSE-IF structure, which some people may find a bit tricky. (Note that RANK needs a real variable where I used CASE. I tried with $CASENUM there, and with a scratch variable, but neither worked.)

Cheers,
Bruce
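For comparison, the same within-organization serial number is a one-liner in pandas (a sketch; like both SPSS versions, it assumes the rows are already sorted in the desired order):

```python
import pandas as pd

df = pd.DataFrame({"org_id": [457895] * 5 + [457896] * 5})

# cumcount() numbers rows 0..n-1 within each group, in current row order;
# adding 1 gives the 1-based Instance variable used in the thread.
df["instance"] = df.groupby("org_id").cumcount() + 1
print(df["instance"].tolist())  # → [1, 2, 3, 4, 5, 1, 2, 3, 4, 5]
```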
--
Bruce Weaver
bweaver@lakeheadu.ca
http://sites.google.com/a/lakeheadu.ca/bweaver/

"When all else fails, RTFM."

PLEASE NOTE THE FOLLOWING:
1. My Hotmail account is not monitored regularly. To send me an e-mail, please use the address shown above.
2. The SPSSX Discussion forum on Nabble is no longer linked to the SPSSX-L listserv administered by UGA (https://listserv.uga.edu/).
