I'm writing a script that automatically performs KM analysis on 50000 variables and then retrieves the Sig. value from each analysis. When I run the code, it prints empty brackets: [].

Here is the XML file:

<?xml version="1.0" encoding="UTF-16"?>
-<outputTree xsi:schemaLocation="http://xml.spss.com/spss/oms http://xml.spss.com/spss/oms/spss-output-1.7.xsd" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://xml.spss.com/spss/oms">
  -<command lang="en" text="Kaplan-Meier" displayTableVariables="label" displayTableValues="label" displayOutlineVariables="label" displayOutlineValues="label" command="Kaplan-Meier">
    -<pivotTable text="Overall Comparisons" subType="Overall Comparisons">
      <caption text="Test of equality of survival distributions for the different levels of SLC3A2_group."/>
      -<dimension text="Test Type" axis="row">
        -<category text="Log Rank (Mantel-Cox) ">
          -<dimension text="Statistics" axis="column">
            +<category text="Chi-Square">
            -<category text="df">
              <cell text="1" number="1"/>
            </category>
            -<category text="Sig.">
              <cell text=".205" number="0.20491462905454" decimals="3"/>
            </category>
          </dimension>
        </category>
      </dimension>
    </pivotTable>
  </command>
</outputTree>

And here is part of the script:

for genes in genelist:
    spss.Submit(r"""
OMS
  /SELECT TABLES
  /IF COMMANDS=['Kaplan-Meier'] SUBTYPES=['Overall Comparisons']
  /DESTINATION FORMAT=OXML XMLWORKPLACE='KM_table'
  /TAG='km_out'.
KM Overall_Survival_FT BY %s
  /STATUS=ten_year_css(1)
  /PRINT NONE
  /TEST LOGRANK
  /COMPARE OVERALL POOLED.
OMSEND TAG='km_out'.
"""%(genes))
    handle="KM_table"
    context="/outputTree"
    xpath="//command[@lang='en']\
/pivotTable[@text='Overall Comparisons']\
/dimension[@text='Test Type']\
/category[@text='Log Rank (Mantel-Cox)']\
/dimension[@text='Statistics']\
/category[@text='Sig.']\
/cell/@text"
    result=spss.EvaluateXPath(handle,context,xpath)
    print result

Changing the last line to print result[0] doesn't work either. Any help would be appreciated.

Ideally I want to retrieve each Sig. value and save it to an Excel sheet on a new row.
The XPath expression isn't quite right.
Try this.
//command[@lang='en']/pivotTable[@text='Overall Comparisons']/dimension[@text='Test Type']/category[@text='Log Rank (Mantel-Cox) ']/dimension[@text='Statistics']/category[@text='Sig.']/cell/@text

Note that you could simplify the XPath expression a bit to this:

//category[@text='Log Rank (Mantel-Cox) ']//category[@text='Sig.']/cell/@text

or even this:

//category[@text='Log Rank (Mantel-Cox) ']//cell/@text

HTH,
Jon Peck (no "h") aka Kim
Senior Software Engineer, IBM
[hidden email]
new phone: 720-342-5621

View this message in context: http://spssx-discussion.1045642.n5.nabble.com/Help-retrieving-and-saving-SPSS-output-via-xpath-tp5717625.html
Sent from the SPSSX Discussion mailing list archive at Nabble.com.

=====================
To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command.
To leave the list, send the command SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command INFO REFCARD
Is there a reason for such a large gap between (Mantel-Cox) and the closing ', even though the XML file does not show this gap?

With

//category[@text='Log Rank (Mantel-Cox) ']//cell/@text

it returns '1.607', '1', '.205'. Is there a way to return only '.205', which is under the XML category with text "Sig."?
In reply to this post by Blankdots
It worked. Only the last simplification doesn't do the same thing. Cheers. I'm still wondering, though, why there is a huge gap between (Mantel-Cox) and the next '.
In reply to this post by Blankdots
Something like this:
//category[@text='Log Rank (Mantel-Cox) ']//category[@text='Sig.']/cell/@text

Of course, if you retrieved a triple and just wanted the last element, you could take the [-1] element of the list on the Python side.

The XML file does show all those trailing blanks:

<category text="Log Rank (Mantel-Cox) ">

If you want the full-precision value, BTW, you should retrieve the number attribute rather than the text attribute.

Jon Peck
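The effect of the trailing blank can be reproduced outside SPSS. What follows is only a sketch using Python's standard xml.etree.ElementTree, not the spss.EvaluateXPath API. The sample XML is a trimmed reconstruction of the posted file (the Chi-Square text of 1.607 is taken from the values reported in the thread), the helper name find_cells is hypothetical, and, unlike the thread's XPath, ElementTree needs the OMS namespace spelled out through a prefix.

```python
import xml.etree.ElementTree as ET

# ElementTree requires the document's default namespace to be named explicitly.
OMS_NS = {"oms": "http://xml.spss.com/spss/oms"}

# Trimmed reconstruction of the posted OXML; note the blank after "(Mantel-Cox)".
SAMPLE = """<outputTree xmlns="http://xml.spss.com/spss/oms">
  <command lang="en" text="Kaplan-Meier" command="Kaplan-Meier">
    <pivotTable text="Overall Comparisons" subType="Overall Comparisons">
      <dimension text="Test Type" axis="row">
        <category text="Log Rank (Mantel-Cox) ">
          <dimension text="Statistics" axis="column">
            <category text="Chi-Square"><cell text="1.607"/></category>
            <category text="df"><cell text="1" number="1"/></category>
            <category text="Sig."><cell text=".205" number="0.20491462905454" decimals="3"/></category>
          </dimension>
        </category>
      </dimension>
    </pivotTable>
  </command>
</outputTree>"""


def find_cells(root, row_label, stat_label=None):
    """Return the <cell> elements under one row category (hypothetical helper)."""
    path = ".//oms:category[@text='%s']" % row_label
    if stat_label is not None:
        path += "//oms:category[@text='%s']" % stat_label
    path += "//oms:cell"
    return root.findall(path, OMS_NS)


root = ET.fromstring(SAMPLE)

# Without the trailing blank the attribute does not match, hence the empty list.
no_blank = find_cells(root, "Log Rank (Mantel-Cox)", "Sig.")

# With the blank, the Sig. cell is found; 'number' carries the full precision.
sig_cell = find_cells(root, "Log Rank (Mantel-Cox) ", "Sig.")[0]
sig_text = sig_cell.get("text")
sig_number = float(sig_cell.get("number"))

# Dropping the Sig. filter returns every cell in the row, in document order.
all_texts = [c.get("text") for c in find_cells(root, "Log Rank (Mantel-Cox) ")]

print(no_blank, sig_text, sig_number, all_texts[-1])
```

Matching on the exact attribute value is why the original expression returned []; taking all_texts[-1] corresponds to the [-1] suggestion above.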
In reply to this post by Blankdots
I wonder what you are going to do with those 50,000-line Excel sheets? I rarely come across things that cannot be done as well in SPSS.

Perhaps instead of saving each time directly to Excel, you could save to an SPSS dataset and then do one transfer to Excel. It would be interesting to hear from list members who write scripts why there isn't a simpler way to run KM on 50,000 variables and save a single value from the listing.

50,000 variables !!??!!

Art Kendall
Social Research Consultants
I had the data in an Excel sheet originally because SPSS 16 was quite buggy. Now that I have SPSS 20 it's much easier, and I will have a 50,000-variable SPSS file which I will run from.

SPSS is much more powerful than Excel, but for basic transformations and manipulation Excel can be faster. For example, if you need to recode a single column, Excel can do this quite fast.

Right now my script writes the output to a list. I then plan on writing the list out at the end of x iterations as a CSV file which can be opened by Excel. I haven't tested the script on a lot of variables yet, but it seems to work fine on 10 variables.

If someone has a better way of running KM many times, or Cox regression many times, I would be highly interested. I will also likely post the script here to see if there is a better way to code it to enhance speed.

And yes, 50,000 variables. It's for biomedical research. :)
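As a rough illustration of the CSV plan described above, here is a minimal sketch in Python 3 using only the standard library (the thread's own script uses Python 2 syntax, so it would need adapting there). The function names, the file name km_pvalues.csv, and the second gene are hypothetical; only the SLC3A2_group value comes from the posted output.

```python
import csv
import os
import tempfile


def append_result(csv_path, gene, p_value):
    """Append one (gene, p-value) row; closing the file flushes it to disk."""
    with open(csv_path, "a", newline="") as handle:
        csv.writer(handle).writerow([gene, p_value])


def load_sorted(csv_path):
    """Read the rows back and sort them by ascending p-value."""
    with open(csv_path, newline="") as handle:
        rows = [(gene, float(p)) for gene, p in csv.reader(handle)]
    return sorted(rows, key=lambda row: row[1])


# Demo in a temporary directory so nothing is left in the working folder.
demo_path = os.path.join(tempfile.mkdtemp(), "km_pvalues.csv")
append_result(demo_path, "SLC3A2_group", 0.20491462905454)  # value from the thread
append_result(demo_path, "GENE_B_group", 0.03)              # hypothetical value
ranked = load_sorted(demo_path)
print(ranked[0])
```

Appending one row per analysis means a crash partway through 50,000 runs loses at most the current result, at the cost of reopening the file each time.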
In reply to this post by Blankdots
At 07:27 PM 1/23/2013, Blankdots wrote:
>I'm writing a script that automatically performs KM analysis on
>50000 variables then retrieves the Sig. value from each analysis.

You're using SPSS in a way that it doesn't handle very well. SPSS loops easily through cases (in fact, that's what it mostly does), but awkwardly through sets of variables; and 50,000 is an awkwardly large number of variables for SPSS in any case. (This may be why you had trouble with SPSS 16.)

If your data were unrolled so that each gene (that appears to be what you're analyzing?) is a group of cases rather than a variable, you'd probably have a bigger file, but one that is easier to use. Something like (untested)

VARSTOCASES
   /ID    = Input_Row_Number
   /MAKE Gene_Value FROM Gene1 TO Gene50000
   /INDEX = Gene_Number
   /KEEP  = ALL.
SORT CASES BY Gene_Number Input_Row_Number.
SPLIT FILE BY Gene_Number.

Then it should be fairly straightforward. You'll need a beginning OMS command, something like (I'm sure you'll have to change this)

DATASET DECLARE Results.
OMS
   /SELECT TABLES
   /IF COMMANDS=['Kaplan-Meier'] SUBTYPES=['Overall Comparisons']
   /DESTINATION FORMAT=SAV OUTFILE=Results
   /TAG='km_out'.
KM Overall_Survival_FT BY Gene_Value
   /STATUS=ten_year_css(1)
   /PRINT NONE
   /TEST LOGRANK
   /COMPARE OVERALL POOLED.
OMSEND TAG='km_out'.

That, or something like it, should get each of your 50,000 results as one row of the SPSS dataset 'Results'. Then you can proceed in SPSS or export to Excel, as best suits your needs, though like most of us on the list, I suspect that staying in SPSS would serve you best.
In reply to this post by Blankdots
"but for basic transformations and manipulation excel can be faster. For example, if you need to recode a single column, excel can do this quite fast. "
HERESY! Somebody please ignite the bonfire! ;-)

I don't know where you get your impressions of speed; do you have any benchmarks? I personally find Excel quite awkward for ANY form of data manipulation. I can't even fathom working with 50K columns in Excel (is that even possible?). As Richard stated below, 50K is rather awkward in SPSS as well (forget about variable lists in dialogs etc).

It looks like you are extracting from the XMLWORKSPACE and potentially bypassing pivot table creation within the output doc (can't help you there as my version is prehistoric).

I would certainly experiment with Richard's suggestion of going WIDE to LONG with VARSTOCASES, SORT, SPLIT. It would certainly make any scripting simpler, or perhaps unnecessary altogether!

KISSASS (Keep it Stupidly Simple and Scalable Silly)!
Please reply to the list and not to my personal email.
Those desiring my consulting or training services please feel free to email me. --- "Nolite dare sanctum canibus neque mittatis margaritas vestras ante porcos ne forte conculcent eas pedibus suis." Cum es damnatorum possederunt porcos iens ut salire off sanguinum cliff in abyssum?" |
In reply to this post by Blankdots
How many cases do you have?

It is likely that you would want to transpose your data via VARSTOCASES. Also, you might do well to explain the goals of the whole effort and what your input data are like. You might want to keep other information in the data set from each run of KM.

Are these 50,000 measurements of the same variable along some dimension (time, location, etc.)?

Art Kendall
Social Research Consultants
Thanks for the prompt responses again. Much appreciated.

Haha, Excel has its benefits sometimes. But overall SPSS is definitely more powerful.

I've modified the script further to overcome any memory problems. It's writing to a CSV file and flushing with each iteration.

There are 200 cases and roughly 50000 variables.

Each variable is a gene with 'expression data', which is a decimal number. So each patient has a value, e.g. patient 1: 1.11, patient 2: 1.21, and so forth.

1) Firstly, patients are divided into two groups: those who have died (1) and those who are still alive (0). This is needed for KM. Overall survival time is also needed, as it is another variable in KM analysis.
2) We want to divide the 200 patients into two groups, with the cutoff between groups based on the mean value of a gene's expression. These groups are 1 for low expression and 2 for high expression.
3) We then run KM with log rank. This will spit out a p value which tells us how significant the difference is between the high expression and low expression groups. The smaller the p value, the more interested in it we are.
4) Of course survival plots are important, but for the initial analysis we are simply interested in the p value.

So my plan is to run KM log rank and record the p value for all 50000 variables, then sort ascending. I will then do more detailed analysis on the ones which I find to be interesting (probably < 1000).

I don't see how VARSTOCASES would help in this scenario, but if it does, could someone please explain it?

I only have one question. Right now when I run the script in Python, it spits out each analysis in the Viewer. I want to suppress this output, as it is being written to a CSV file already.

I've looked into the OMS /DESTINATION VIEWER=NO function, but it doesn't seem to suppress. I think it also interferes with the OMSEND function. I need the program to write to an XML workspace but also to prevent results from being printed.

Cheers.

I forgot to mention: the 200 cases are patients.
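For readers unfamiliar with what step 3 computes, the following is a minimal sketch of a textbook two-group log-rank test in plain Python 3. It is not SPSS's KM implementation and may differ from it in details such as tie handling; the function names and the six-patient demo data are hypothetical. With one degree of freedom, the p value follows from the chi-square statistic as erfc(sqrt(x/2)), which reproduces the thread's reported pair of Chi-Square 1.607 and Sig. .205.

```python
import math


def chi2_df1_pvalue(chi_square):
    """Upper-tail p-value of a chi-square statistic with 1 degree of freedom."""
    return math.erfc(math.sqrt(chi_square / 2.0))


def logrank_two_groups(times, events, groups):
    """Textbook log-rank test for groups coded 1 (low) and 2 (high).

    events: 1 = died, 0 = censored. Returns (chi_square, p_value).
    Raises ZeroDivisionError if one group is empty at every death time.
    """
    observed = expected = variance = 0.0
    death_times = sorted({t for t, e in zip(times, events) if e == 1})
    for t in death_times:
        # Patients still under observation just before time t.
        at_risk = [(ti, e, g) for ti, e, g in zip(times, events, groups) if ti >= t]
        n = len(at_risk)
        n1 = sum(1 for _, _, g in at_risk if g == 1)
        d = sum(1 for ti, e, _ in at_risk if ti == t and e == 1)
        d1 = sum(1 for ti, e, g in at_risk if ti == t and e == 1 and g == 1)
        observed += d1
        expected += d * n1 / n
        if n > 1:
            variance += d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
    chi_square = (observed - expected) ** 2 / variance
    return chi_square, chi2_df1_pvalue(chi_square)


# Hypothetical demo: three low-expression patients die first, then three high.
demo_chi2, demo_p = logrank_two_groups(
    times=[1, 2, 3, 4, 5, 6],
    events=[1, 1, 1, 1, 1, 1],
    groups=[1, 1, 1, 2, 2, 2],
)

# Cross-check against the thread's output: Chi-Square 1.607 gives Sig. of about .205.
thread_p = chi2_df1_pvalue(1.607)
print(round(demo_chi2, 4), round(thread_p, 3))
```

Because the test depends on the order and timing of deaths, not just on who died, it uses the survival-time information that a plain dead/alive comparison ignores.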
In reply to this post by Blankdots
1) The way that VarsToCases helps is that it trivializes the problem of processing all the variables sequentially, and (for some procedures) results in a display of results that is already elegantly compressed. This was described when it was suggested.

2) KM hardly seems like the ideal procedure, since it requires an arbitrary split of numeric scores into dichotomies. You might even get improved *power* if you simply perform a t-test between live/dead on each of the scores. A second set of p-values (for confirmation?) could be obtained from the correlations, among those who died, of the time-to-death and the scores.

I have no experience with data of this particular kind, but I'm pretty sure I would try both things on a few variables.

Multiple testing problem? Having a second and parallel test that is independent should be useful here unless your Effects are really large. That is, if your reasonable Effect size will show up at the 1% level, your prospect is at least 500 rejections of the null hypothesis solely by chance or for tiny effects.

- If you are concerned with outliers or skew, you would probably want to use the same transformation for all scores ... but don't forget the possibility of "Winsorizing" the data by "drawing in" the top few percent of scores to a common maximum value.

--
Rich Ulrich
You would certainly lose information by dichotomizing. The problems have often been discussed on this list.

Dichotomizing is a red-flag issue to many methodologists and statisticians, so someone on the list may have already gathered up the references. In social and behavioral sciences, dichotomizing at the median is known as "committing an invidious median split".

Your description did not say you had anything but dead/not and expression values on 50,000 genes. If that is correct, and if you do not have missing data, and if you don't want to worry about equality of variances, you might consider bypassing the t-test procedure and writing syntax.

Sort the cases by dead/not. I assume you already know how many of each you have, say 90 and 110.
Flip the file, so that var1 to var90 are dead and var91 to var200 are live.

Something like this untested syntax should work. This could be done more compactly, but this should be readable.

compute mean_dead = mean(var1 to var90).
compute mean_live = mean(var91 to var200).
compute diff_means = mean_dead - mean_live.
compute sd_dead = sd(var1 to var90).
compute sd_live = sd(var91 to var200).
compute pooled_sd = sqrt((sd_dead**2 + sd_live**2)/2).
compute se = pooled_sd * sqrt(1/90 + 1/110).
compute t = diff_means / se.
compute df = 200 - 2.
compute p = 2 * (1 - cdf.t(abs(t), df)).

You can then easily get descriptive stats on mean_dead to p.

Art Kendall
Social Research Consultants
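The per-gene arithmetic in the syntax above can also be sketched in Python 3 with the standard statistics module. This version swaps in the usual sample-size-weighted pooled variance rather than the simple average of the two variances, and it stops at the t statistic and degrees of freedom because the standard library has no t-distribution CDF. The function name pooled_t and the toy numbers are hypothetical.

```python
import math
from statistics import mean, variance


def pooled_t(dead, live):
    """Equal-variance two-sample t statistic and its degrees of freedom."""
    n1, n2 = len(dead), len(live)
    diff = mean(dead) - mean(live)
    # Pooled variance weighted by each group's degrees of freedom.
    pooled_var = ((n1 - 1) * variance(dead) + (n2 - 1) * variance(live)) / (n1 + n2 - 2)
    se = math.sqrt(pooled_var * (1.0 / n1 + 1.0 / n2))
    return diff / se, n1 + n2 - 2


# Toy expression values for one gene (hypothetical, not from the thread).
t_stat, df = pooled_t([1.0, 2.0, 3.0], [2.0, 4.0, 6.0, 8.0])
print(round(t_stat, 4), df)
```

With group sizes of 90 and 110 the weighted and unweighted pooled variances differ only slightly, so either form is a reasonable screening statistic.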
In reply to this post by Rich Ulrich
Some very good points.
To clarify further, KM analysis relies on 3 variables, which are discussed in the following paragraph. In the context of survival in the medical literature, KM with the log rank (Mantel-Cox) statistic is the standard.

The problem with using a simple t test is that it doesn't take into account when a patient died, e.g. 5 years vs 3 months, which makes a huge difference in the analysis. KM calculations require both a continuous variable, i.e. 'overall survival time', and a discrete variable, i.e. dead or alive; it's not as simple as comparing dead with alive. In short, the log rank is a function of overall survival time (continuous), overall status (discrete) and gene expression (discrete).

I did however perform a t test prior to trying to batch log-rank statistics via KM. I divided the patients into those alive at 5 years and those dead at 5 years. Those who were alive but whose last known status was at less than 5 years were excluded from the test. This t test does provide some valid data, but the problem is that the gold standard for survival is a KM curve with a Mantel-Cox statistic. Because the t test does not take into account the duration of survival, as previously explained, its value is greatly diminished.

But good points by everyone, especially on the signal-to-noise issue creating up to 500 false positives in such a large data set. It's got me thinking a lot.

Much appreciated.
In reply to this post by Blankdots
OMS VIEWER=NO definitely does suppress, but it only suppresses the selected items, and this has no effect on OMSEND. Since you are selecting only the table, other items continue to appear.

To handle this you can have another OMS request running concurrently that selects everything and suppresses it:

OMS /DESTINATION VIEWER=NO.

If you run in external mode, which would be a good idea in this case for performance reasons, there is no Viewer, but output flows to the console. You shut this off by calling

spss.SetOutput("off")

This API only applies to external mode.

Jon Peck
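One way the two concurrent OMS requests might be arranged in the loop is sketched below in Python 3. It only builds the syntax strings and passes them to whatever submit function it is given, so it can be dry-run without SPSS; inside SPSS one would pass spss.Submit. The tag name 'suppress_all', the helper names, and the gene names in the dry run are hypothetical, the OMSEND TAG= form follows the original script, and the sketch uses the XMLWORKSPACE keyword spelling mentioned elsewhere in the thread.

```python
# Opened once before the loop: selects everything and keeps it out of the Viewer.
SUPPRESS_START = "OMS /DESTINATION VIEWER=NO /TAG='suppress_all'."
SUPPRESS_END = "OMSEND TAG='suppress_all'."

KM_TEMPLATE = r"""
OMS
  /SELECT TABLES
  /IF COMMANDS=['Kaplan-Meier'] SUBTYPES=['Overall Comparisons']
  /DESTINATION FORMAT=OXML XMLWORKSPACE='KM_table'
  /TAG='km_out'.
KM Overall_Survival_FT BY %(gene)s
  /STATUS=ten_year_css(1)
  /PRINT NONE
  /TEST LOGRANK
  /COMPARE OVERALL POOLED.
OMSEND TAG='km_out'.
"""


def km_syntax(gene):
    """Return the OMS + KM syntax block for one grouping variable."""
    return KM_TEMPLATE % {"gene": gene}


def run_all(genes, submit):
    """Submit the suppressing request, one KM block per gene, then end it."""
    submit(SUPPRESS_START)
    try:
        for gene in genes:
            submit(km_syntax(gene))
            # Inside SPSS, the XPath retrieval for this gene would go here.
    finally:
        submit(SUPPRESS_END)


# Dry run: collect the syntax in a list instead of sending it to SPSS.
submitted = []
run_all(["GENE_A_group", "GENE_B_group"], submitted.append)
print(len(submitted))
```

Keeping the suppressing request open across the whole loop avoids starting and ending it 50,000 times, and the finally clause ends it even if one KM run fails.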
In reply to this post by Blankdots
At 01:33 AM 1/26/2013, Blankdots wrote:
>[I have] are 200 cases and roughly 50000 variables. > >Each variable is a gene with 'expression data' which is a decimal number. >So each patient has a value. e.g. patient 1 - 1.11, patient 2. 1.21. >And so forth. As you state below, this isn't a complete list of variables. You have those 50,000 gene expression values for each patient; but you also have a variable for survival or not, and one for survival time. I hope you also have an identifying variable for each patient, so you can trace your SPSS data back to the source. >1) Firstly patients are divided into two groups, those who have died >- 1 and those who are still alive - 0. This is needed for KM. Also needed, is >overall survival time as this is another variable needed in KM analysis. >2) We want to divide the 200 patients into two groups, with the cutoff >between groups based on the mean value of a gene expression. These groups >are 1 for low expression and 2 for high expression. >3) We then run KM with logrank. This will spit out a p value which >tells us how significant the difference is between the high expression and low >expression groups. The better the p value, the more interested in it we are. >4) Of course survival plots are important, but for initial analysis we are >simply interested in pvalue. > >So my plan is to run KM log rank and record p value for all 50000 >variables then sort ascending. I don't see how varstocases would >help in this scenario, but if it does could someone please explain it? OK, here's where understanding how SPSS 'thinks' is important. As I wrote before, SPSS loops very easily through cases, and through groups of cases; but loops through sets of variables only awkwardly, often requiring something like Python. 
Now, your data is like this: Patient_ID LiveOrDead Time Gene01 Gene02 Gene03 Alpha 0 13 12.3456 78.9012 34.5678 Beta 1 10 98.7654 32.1098 76.5432 Gamma 1 15 24.6890 13.5791 36.9036 and you have to run your analysis separately for each of GeneExpr01, GeneExpr02, GeneExpr03 -- except, you have 50,000 instead of three of them. Here's what you get from VARSTOCASES and SORT -- this is an actual run, in SPSS 14: VARSTOCASES /MAKE Express FROM Gene01 Gene02 Gene03 /INDEX = Gene(Express) /KEEP = Patient_ID LiveOrDead Time /NULL = DROP. SORT CASES BY Gene Patient_ID. LIST. List |-----------------------------|---------------------------| |Output Created |28-JAN-2013 16:22:30 | |-----------------------------|---------------------------| Patient_ID LiveOrDead Time Gene Express Alpha 0 13 Gene01 12.3456 Beta 1 10 Gene01 98.7654 Gamma 1 15 Gene01 24.6890 Alpha 0 13 Gene02 78.9012 Beta 1 10 Gene02 32.1098 Gamma 1 15 Gene02 13.5791 Alpha 0 13 Gene03 34.5678 Beta 1 10 Gene03 76.5432 Gamma 1 15 Gene03 36.9036 Number of cases read: 9 Number of cases listed: 9 Notice you now have three groups of cases, corresponding to the three genes; in your data, you'll have 50,000 groups. In each group there is a case for each patient; three per group in this demo, 200 per group in yours. (So the whole file will have just one million records -- big, but well within SPSS's capacity, especially when each record is this short.) Now you don't have to write a separate KM statement to analyze each gene. If you issue command SPLIT FILES BY Gene. data in each group will be analyzed separately, and *one* KM statement will analyze all 50,000 genes. You'll use OMS to capture the results as an SPSS dataset; see my previous posting in this thread. Further, today at 01:33 AM 1/26/2013, you wrote: >2) We want to divide the 200 patients into two groups, with the cutoff >between groups based on the mean value of a gene expression. These groups >are 1 for low expression and 2 for high expression. 
You've had a lot of advice on *whether* to do this. As for *how* to do this, I'm looking at your thread "Basic Recode syntax not creating new Variable", where it looks like you're having Python generate a RECODE statement for each of your 50,000 variables, inserting the mean value. (Where do you get the means? Compute them in Excel? in Python?)

But when you have the data transformed, the operation takes only a few lines of basic SPSS for *all* the genes. This, again, is from an actual run:

*  Dichotomizing by high/low gene expression .

AGGREGATE OUTFILE=* MODE=ADDVARIABLES
   /BREAK=Gene
   /MeanVal=MEAN(Express).

FORMATS  MeanVal (F8.4).

NUMERIC  HiLo (F2).
VALUE LABEL HiLo
   (1) Low expression
   (2) Hi expression.

DO IF   Express GE MeanVal.
.  COMPUTE HiLo = 2.
ELSE.
.  Compute HiLo = 1.
END IF.

LIST.

List
|-----------------------------|---------------------------|
|Output Created               |28-JAN-2013 17:15:30       |
|-----------------------------|---------------------------|
Patient_ID LiveOrDead Time Gene    Express  MeanVal HiLo

Alpha               0   13 Gene01  12.3456  45.2667   1
Beta                1   10 Gene01  98.7654  45.2667   2
Gamma               1   15 Gene01  24.6890  45.2667   1
Alpha               0   13 Gene02  78.9012  41.5300   2
Beta                1   10 Gene02  32.1098  41.5300   1
Gamma               1   15 Gene02  13.5791  41.5300   1
Alpha               0   13 Gene03  34.5678  49.3382   1
Beta                1   10 Gene03  76.5432  49.3382   2
Gamma               1   15 Gene03  36.9036  49.3382   1

Number of cases read:  9    Number of cases listed:  9

->It helps to learn thoroughly what SPSS, itself, can do; it can save a lot of clumsy wrestling with other tools <-

=============================
APPENDIX: Test data, and code
=============================
*  C:\Documents and Settings\Richard\My Documents                   .
*    \Technical\spssx-l\Z-2013\                                     .
*    2013-01-23 Blankdots-                                          .
*       Help retrieving and saving SPSS output via xpath.SPS        .

*  In response to posting                                           .
*  Date:    Wed, 23 Jan 2013 16:27:52 -0800                         .
*  From:    Blankdots <[hidden email]>                              .
*  Subject: Help retrieving and saving SPSS output via xpath        .
*  To:      [hidden email]                                          .
*  This code illustrates wide-to-long data restructuring,           .
*  and its advantages.                                              .

DATA LIST LIST/
   Patient_ID LiveOrDead Time Gene01 Gene02 Gene03
   (A8,       F2,        F3,  F8.4,  F8.4,  F8.4).
BEGIN DATA
Alpha  0  13  12.3456  78.9012  34.5678
Beta   1  10  98.7654  32.1098  76.5432
Gamma  1  15  24.6890  13.5791  36.9036
END DATA.

LIST.

VARSTOCASES
   /MAKE  Express FROM Gene01 Gene02 Gene03
   /INDEX = Gene(Express)
   /KEEP  = Patient_ID LiveOrDead Time
   /NULL  = DROP.

SORT CASES BY Gene Patient_ID.
LIST.

*  Dichotomizing by high/low gene expression .

AGGREGATE OUTFILE=* MODE=ADDVARIABLES
   /BREAK=Gene
   /MeanVal=MEAN(Express).

FORMATS  MeanVal (F8.4).

NUMERIC  HiLo (F2).
VALUE LABEL HiLo
   (1) Low expression
   (2) Hi expression.

DO IF   Express GE MeanVal.
.  COMPUTE HiLo = 2.
ELSE.
.  Compute HiLo = 1.
END IF.

LIST.
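For readers who want to check the logic of the restructuring outside SPSS, here is a minimal plain-Python sketch of the same two steps (wide-to-long, then a mean split per gene) on the three-patient demo data. It is only an illustration of what VARSTOCASES, AGGREGATE and DO IF are doing; the function names `to_long` and `add_mean_split` are made up for this sketch and are not part of SPSS or its Python plug-in.

```python
# Illustrative mirror of VARSTOCASES + AGGREGATE + DO IF on the demo data.
# to_long and add_mean_split are hypothetical helper names.

wide = [
    {"Patient_ID": "Alpha", "LiveOrDead": 0, "Time": 13,
     "Gene01": 12.3456, "Gene02": 78.9012, "Gene03": 34.5678},
    {"Patient_ID": "Beta", "LiveOrDead": 1, "Time": 10,
     "Gene01": 98.7654, "Gene02": 32.1098, "Gene03": 76.5432},
    {"Patient_ID": "Gamma", "LiveOrDead": 1, "Time": 15,
     "Gene01": 24.6890, "Gene02": 13.5791, "Gene03": 36.9036},
]
gene_cols = ["Gene01", "Gene02", "Gene03"]


def to_long(rows, gene_cols):
    """One output record per (patient, gene), sorted by gene then patient."""
    long_rows = []
    for row in rows:
        for gene in gene_cols:
            value = row[gene]
            if value is None:  # same idea as /NULL = DROP
                continue
            long_rows.append({
                "Patient_ID": row["Patient_ID"],
                "LiveOrDead": row["LiveOrDead"],
                "Time": row["Time"],
                "Gene": gene,
                "Express": value,
            })
    long_rows.sort(key=lambda r: (r["Gene"], r["Patient_ID"]))
    return long_rows


def add_mean_split(long_rows):
    """Add the per-gene mean and HiLo (2 if Express >= mean, else 1)."""
    totals = {}
    for r in long_rows:
        total, count = totals.get(r["Gene"], (0.0, 0))
        totals[r["Gene"]] = (total + r["Express"], count + 1)
    for r in long_rows:
        total, count = totals[r["Gene"]]
        r["MeanVal"] = total / count
        r["HiLo"] = 2 if r["Express"] >= r["MeanVal"] else 1
    return long_rows


long_rows = add_mean_split(to_long(wide, gene_cols))
for r in long_rows:
    print(r["Patient_ID"], r["Gene"], r["Express"],
          round(r["MeanVal"], 4), r["HiLo"])
```

The printed means and HiLo codes should agree with the LIST output above. The point of the post still stands: once the data are in long form, native SPSS does this in a handful of lines, so the sketch is for understanding rather than for production use.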
In reply to this post by Blankdots
It is always good to stick with the standard in an area. But then you mention *two* things, KM and the log-rank test (Cox-Mantel). The Wikipedia article on log-rank mentions that it is asymptotically the same test as Cox regression (proportional hazards) -- which I suggested before as making fuller use of the continuous data.

On the other hand, even if you will use the log-rank test eventually, part of your early problem is winnowing out the false-positives. I suggested a two-test approach for that:

Yes, the t-test does not use Time. However, it *should* provide a crude screen that ought to be valid, assuming that you pay attention to the effect size (not just the p) when you determine your cutoff limit for which-genes-to-drop-first.

After that test, the Time correlation among those who died is a test that is orthogonal to the first test. I presume that it also ought to provide a valid cut-off, also judging more by effect size than by p-value.

- Otherwise ... If you do proceed only with KM/log-rank, then you will want to use alternate cut-points to set up "lowest-low-high-highest" for Expression in order to show that your apparent high/low split is consistent across the range, instead of reflecting one of the hundreds of positive tests that arise by chance.

--
Rich Ulrich

> Date: Sat, 26 Jan 2013 05:58:59 -0800
> From: [hidden email]
> Subject: Re: Help retrieving and saving SPSS output via xpath
> To: [hidden email]
>
> Some very good points.
>
> To clarify further, KM analysis relies on 3 variables which will be
> discussed in the following paragraph.
>
> In the context of survival in medical literature, KM with log rank
> (cox-mantel) statistic is the standard. The problem with using a simple T
> test is that it doesn't take into account when a patient died, e.g. 5 years
> vs 3 months, which makes a huge difference in analysis. KM calculations
> require both a continuous variable i.e. 'overall survival time' and a
> discrete variable i.e. (dead or alive), it's not as simple as comparing dead
> or alive.
> In short, the log rank is a function of overall survival time
> (continuous), overall status (discrete) and gene expression (discrete).
>
> I did however perform a t test prior to trying to batch log-rank statistics
> via KM. I divided the patients into those alive at 5 years and those dead at
> 5 years. Those who were alive but whose last known status was less than 5
> years were excluded from the test. This t-test does provide some valid data,
> but the problem is that the gold standard for survival is a KM curve with a
> cox-mantel statistic. Because the t test does not take into account the
> duration of survival as previously explained, its value is greatly
> diminished.
>
> But good points by everyone, especially on the signal to noise creating up
> to 500 false positives in such a large data set. It's got me thinking a lot.
> Much appreciated.
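The "up to 500 false positives" figure is simple arithmetic: if no gene had any real effect, the expected number of tests falling below a cutoff is the number of tests times that cutoff. A minimal sketch follows; the three alpha levels are assumptions chosen for illustration, since the thread does not state which cutoff was meant (500 corresponds to 0.01).

```python
# Expected number of chance "hits" when every null hypothesis is true.
N_GENES = 50000  # number of gene variables mentioned in the thread


def expected_false_positives(n_tests, alpha):
    """Expected count of tests with p < alpha if no gene has a real effect."""
    return n_tests * alpha


for alpha in (0.05, 0.01, 0.001):  # assumed cutoffs, for illustration only
    print(alpha, expected_false_positives(N_GENES, alpha))
```

This is why sorting 50,000 p-values in ascending order and reading off the top of the list needs the extra safeguards discussed above, such as attention to effect size and consistency across alternate cut-points.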
In reply to this post by Blankdots
Thanks for explaining VARSTOCASES and SPLIT FILE. I was not familiar with these commands in SPSS. The explanation is very thorough, and it now makes complete sense to me why this approach would be better. I will try it soon.
As for the t-test, I have already done this, but I'm not happy with what I found, which is why I've moved on to a univariate Cox-Mantel test. The Cox-Mantel test in the KM analysis I'm doing right now is univariate; I will also be doing a multivariate Cox regression using additional covariates such as age, gender, etc. Anyway, I think this thread has given me more than enough answers. So thanks again to everyone.
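For reference, the univariate two-group log-rank (Mantel-Cox) statistic discussed in this thread can be sketched in a few lines of plain Python. This is a textbook version written for illustration, not SPSS's own code, and its output has not been compared against the KM procedure; the function name `logrank_two_groups` and the four-patient data are hypothetical.

```python
import math


def logrank_two_groups(times, events, groups):
    """Two-group log-rank test; returns (chi_square, p_value) with df = 1.

    times  -- follow-up time per patient
    events -- 1 if the patient died at that time, 0 if censored
    groups -- group code per patient (exactly two distinct values)
    """
    labels = sorted(set(groups))
    if len(labels) != 2:
        raise ValueError("exactly two groups are required")
    first = labels[0]

    observed = expected = variance = 0.0
    event_times = sorted({t for t, e in zip(times, events) if e == 1})
    for t in event_times:
        # Patients still under observation just before time t.
        at_risk = [(u, e, g) for u, e, g in zip(times, events, groups) if u >= t]
        n = len(at_risk)
        n1 = sum(1 for _, _, g in at_risk if g == first)
        d = sum(1 for u, e, _ in at_risk if e == 1 and u == t)
        d1 = sum(1 for u, e, g in at_risk if e == 1 and u == t and g == first)

        observed += d1
        expected += d * n1 / n
        if n > 1:  # hypergeometric variance; undefined when only one at risk
            variance += d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)

    if variance == 0:
        raise ValueError("variance is zero; the test is not defined")
    chi_square = (observed - expected) ** 2 / variance
    # Upper tail of a chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi_square / 2.0))
    return chi_square, p_value


# Hypothetical data: two patients per group, all deaths, no censoring.
chi_square, p_value = logrank_two_groups(
    times=[1, 2, 3, 4], events=[1, 1, 1, 1], groups=[1, 1, 2, 2])
print(chi_square, p_value)
```

The inputs correspond to the three variables named earlier in the thread: survival time, status, and the dichotomized expression group.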
In reply to this post by David Marso
I think you will find that SPSS 15 can also do it a lot faster. Something strange happened in version 16 and thereafter that seemed to make large manipulations much slower. Even basic data entry was easier and faster in version 15. Unfortunately, I can't currently use version 15: I have a new PC, our IT dept wouldn't put a parallel port on it for the old dongle, and IBM wouldn't supply a new dongle (or a suitable workaround) for a USB port. Though it's still installed on the old PC, I can't use version 15 there any more either, as IBM won't licence me on a single user licence to have the two versions on separate PCs. I would still like to use the newer versions for the analyses where the performance is respectable.