Hello all,
Are there any computational performance metrics available in SPSS that one could use to compare two different syntax designs (for solving the same problem)? Thanks.

Dan Williams
Forecasting, Research and Analysis Office
Finance and Policy Analysis
Department of Human Services
State of Oregon, USA
(503) 947-5354
Dan,

Having done a little work at DHS, I would suggest that server performance may be one of the bigger performance issues you have to deal with.

For benchmarking syntax, I would suggest using one of the data sets you already have available and are familiar with. Simply run the alternative syntax against the same data set on the same server drive. Note that you can use the N OF CASES command to limit the number of cases processed. (I assume that part of the context for your question is that some of your forecasting files are quite large.) N OF CASES is also useful when you are debugging syntax. Keep in mind, however, that the performance of some operations, such as SORT CASES, will not be a simple linear function of N.

For general strategies to promote efficient code when working with large data sets, I recommend reading the efficiency tips section of Raynald Levesque's SPSS Programming and Data Management. You can purchase it in hard copy or download it from SPSS (http://www.spss.com/spss/data_management_book.htm), but a PDF copy should also be on your installation disk.

Dennis Deck, PhD
RMC Research Corporation
111 SW Columbia Street, Suite 1200
Portland, Oregon 97201-5843
voice: 503-223-8248 x715
voice: 800-788-1887 x715
fax: 503-223-8248
[hidden email]
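A minimal sketch of that benchmarking approach; the file name, variable, and case limit below are placeholders, not anything from Dan's files:

GET FILE='forecast_test.sav'.
* Limit the run to the first 50,000 cases so both versions are compared on the same subset.
N OF CASES 50000.
* Version A of the syntax under test goes here; note the start and end times.
SORT CASES BY region.
* Then re-open the file, repeat with version B, and compare the elapsed times.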
Dear All,

I have a large dataset of products in the following format, from which I will be using the last column, the unit values (or prices) of products. The problem is that some firms have reported extreme values (outliers) for some products. For instance, in the following example, the unit values for sugar hover around 11 for most firms, while one firm (firm 3) has a significantly higher value, say 100. I need to detect such firms for each product and delete them from my sample. Can anybody suggest a way to do this in SPSS?

Firm   Product   Quantity   Sales   Unit Values
A      B         C          D       C/D
1      Sugar     10         100     10
2      Sugar     15         180     12
3      Sugar     18         1800    100
4      Sugar     20         220     11
5      Meat      10         200     20
6      Meat      20         460     23
7      Meat      25         1500    60
8      Butter    100        1200    12
9      Butter    10         110     11
10     Butter    30         390     13

Thank you in advance.

Kind regards
Abdul

Abdul Azeez Erumban
Faculty of Economics
University of Groningen
Tel: +31 (0)50 363 3762
Sorry if you have already done this, but I'd first suggest verifying the outliers to see whether each is a data-entry error or a mis-report; it would be a shame to lose cases that can be recovered.

Could you also let us know your cutoff criteria? Do you want to:
 i. set an absolute cutoff unit price for each commodity, and then delete whatever is higher than that value? or
 ii. set a floating cutoff (such as > 2 or 3 standard deviations, or a percentage deviation from the mean)? or
 iii. some other idea?

-Ken
In reply to this post by Daniel E WILLIAMS
If you have SPSS 14, you might want to look into the programmability benchmark module on the SPSS Developer Central site (www.spss.com/devcentral). It provides for benchmarking alternative sets of syntax with a user-settable number of repetitions, interleaving them in order to minimize the effects of operating system caching, file buffering, memory management, virus checkers, etc. It produces as output a file designed to be read into SPSS (natch!) and analyzed. There are quite a few different performance measures available from this module corresponding more or less to the measures that you can see in the Task Manager process view.
This module is Windows only, as it is tied to the measures that can be collected from the operating system.

Regards,
Jon Peck
SPSS
In reply to this post by Ken Chui
Suppose unit values greater than or equal to 60 are outliers. Then

SELECT IF (Unit_Values < 60).

will give you the cases with unit values less than 60. You may proceed to do your analysis based on the selected records. An alternative is to create a filter variable to select the desired records for analysis.

Regards
In reply to this post by Abdul Azeez
There have been discussions on this list about the wisdom of deleting outliers. If you are quite certain that you want to remove them from further analysis, a general rule of thumb for scale variables is to classify values more than 3 standard deviations from the mean (in absolute value) as outliers. The quick way to do this is to use the DESCRIPTIVES command with /SAVE to add z-scores (mean 0, standard deviation 1) to your file. Then it is simple to filter or select cases as follows:
DESCRIPTIVES VARIABLES=a TO z
  /SAVE
  /STATISTICS=MEAN STDDEV MIN MAX.
COMPUTE maxout = MAX(za TO zz).
COMPUTE minout = MIN(za TO zz).
COMPUTE filter = 1.
IF (maxout > 3 OR minout < -3) filter = 0.
FILTER BY filter.

This example filters these cases from further analysis but does not delete them from the active data file. I use the MAX and MIN functions in this way because they take a simple variable list using the TO convention.
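For the per-product screening Abdul describes, a similar sketch using AGGREGATE; it assumes SPSS 14 or later (for MODE=ADDVARIABLES) and placeholder variable names product and unit_value:

AGGREGATE OUTFILE=* MODE=ADDVARIABLES
  /BREAK=product
  /mean_uv=MEAN(unit_value)
  /sd_uv=SD(unit_value).
* z-score of each unit value relative to its own product group.
COMPUTE z_uv = (unit_value - mean_uv) / sd_uv.
COMPUTE keepflag = (ABS(z_uv) <= 3).
FILTER BY keepflag.
* Note: a product reported by a single firm has no usable SD, so its z-score
* is missing and the case is filtered out; such products need separate handling.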
In reply to this post by Edward Boadi
The SELECT IF command will delete the outliers from the working file for good. If you save the file under the same name, they will be lost forever. Just in case, you had better use one of the following:

1. Temporary selection:

TEMPORARY.
SELECT IF (UNIT_VALUES < 60).

This is valid only for the next operation that reads the data.

2. Create a dummy variable and use it as a filter:

COMPUTE LESSTHAN60 = (UNIT_VALUES < 60).
FILTER BY LESSTHAN60.
........ (as many procedures as needed)
FILTER OFF.

The filter remains in force until FILTER OFF is issued.

Hector
In reply to this post by Ken Chui
Ken,

I won't discuss your advice, but rather take the occasion to advertise what I think is always good practice on a list like this one: reproduce the preceding messages in the thread; if not the entire thread, then at least the one message (or the part of it) you are responding to. This makes the responses self-contained and helps the rest of us. I personally do not know at first glance, without a little research, which of the "Outlier detection" messages you are actually answering here: the original query by Abdul Azeez (I suspect it is this one), the advice given by ViAnn, or perhaps some other one I missed. If the thread is longer, this can become difficult.

Hector
In reply to this post by Abdul Azeez
Hello Hector,

Thanks for the advice; I'll keep it in mind. The problem, though, was that the original poster sent HTML, and I had to use Dreamweaver to make sense of it. I tried to repost the original message in text format and failed, and thus ended up with only my response. There are numerous ways to follow this list, and I'll try to make sure the information exchange is as efficient as possible in future. Again, thanks for the advice.

-Ken
In reply to this post by Hector Maletta
Dear Ken,

I prefer to go for the second option: a floating cutoff (such as > 2 or 3 standard deviations), because it does not seem wise to use absolute cutoff points for all products, as I have more than 2,000 products.

abdul
Stephen Brand
www.statisticsdoc.com

I agree wholeheartedly with those who have suggested looking for reasons why some data points appear to be outliers, as well as using some flexibility in choosing cutoffs for outliers. Setting a cutoff based on standard deviations is useful in many situations, but there are exceptions. Standard deviations are most appropriate for normally distributed data; with skewed data (e.g., income, days of sickness absence), some valid values may lie many standard deviations from the mean. You might want to consider setting cutoffs for outliers based on plausibility (e.g., when dealing with high school students who claim that they consume 100 beers on a daily basis).

A related topic concerns "wild values" - numbers that are not that far away from the overall distribution, but which do not make sense and can be considered out of range. For example, if you know that the school year in a given school district consists of at most 200 days, and that there is no summer school, then subjects who report missing more than 200 days of school would need further examination.

HTH,
Stephen Brand

For personalized and professional consultation in statistics and research design, visit www.statisticsdoc.com
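A minimal sketch of that kind of range check, with placeholder names (days_absent, id, and the 200-day maximum are assumptions for illustration):

COMPUTE wild_flag = (days_absent > 200 OR days_absent < 0).
* List the flagged cases for review before deciding to correct or exclude them.
TEMPORARY.
SELECT IF wild_flag = 1.
LIST VARIABLES=id days_absent.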
In reply to this post by Daniel E WILLIAMS
At 12:45 PM 8/22/2006, Daniel E WILLIAMS wrote:
>Are there any computational performance metrics available in SPSS that
>one could use to compare two different syntax designs (for solving the
>same problem)?

This doesn't address your question directly, but it's a good side angle.

With extremely rare exceptions, time for SPSS runs is dominated by reading and writing disk files. A pretty good proxy for running time is the total amount read and written, and that can usually be ascertained from the file sizes, organization, and program logic, without direct measurements.

By way of tuning, that means as few procedure statements as possible. Since you don't want to forgo results, it means combining as many calculations as you can in each procedure call. It also means (here I go again) not using EXECUTE, a procedure which does nothing, except in the special cases where it's necessary. (See Raynald Levesque's book.)

If you have a release with a Virtual Active File system (most of us do - was that SPSS 11, or 12?), caching can slow processing down, or speed it up considerably. It slows processing if reading the original file is easy (simple things like DATA LIST, or GET FILE with maybe a few transformations) and you're using most of the data, maybe only two or three times. Caching is most likely to speed processing when:
- You're running multiple procedures against the same active file
- Reading the original data is expensive, e.g. some SQL extractions
- Procedures use a small subset of the variables, cases, or both. (SELECTing and KEEPing only the cases and variables you're using can help a lot.)

You can get the effect of CACHE with XSAVE followed by GET FILE for the file you've XSAVEd. That can be more flexible; for example, you can XSAVE several versions of the file, with different selections or transformations, in the same pass through the data.

Procedures that store all or much of the data - anything that calculates medians, AGGREGATE without PRESORTED, CREATE - can be very slow for very large files, but as fast as other procedures for smaller ones. If you're using these, it's an additional reason for SELECTing and KEEPing to make your working file as small as possible. (AGGREGATE without PRESORTED is generally faster than SORT followed by AGGREGATE with PRESORTED; I understand, usually a good deal faster. If you have hundreds of thousands of cases, that may well not be so, especially if your file is already in the desired order and you don't need a sort.)

Of course, if you're processing data by subsets, use SPLIT FILE rather than multiple SELECT IFs or FILTERs. That may need a SORT CASES; I'm not sure of the speed penalty for that. I'm pretty sure it will take the time that's needed to read and write the whole file; how much more, and under what circumstances, I don't know.

MATRIX is a wild card. Usually you can't choose it for efficiency; you have reason to use MATRIX, or you don't. I suspect MATRIX code is very fast if all the matrix variables can be kept in memory at once, and these days that can be done for pretty big matrices. If matrices have to be paged to disk, I imagine it slows down very badly right away.

Onward, and good luck,
Richard Ristow
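A minimal sketch of the XSAVE pattern described above, writing two subsets and running a procedure in a single data pass; the file names, the string variable product, and unit_value are placeholders:

GET FILE='products.sav'.
DO IF (product = 'Sugar').
  XSAVE OUTFILE='sugar.sav'.
ELSE IF (product = 'Meat').
  XSAVE OUTFILE='meat.sav'.
END IF.
* The XSAVEs execute on the same pass as the procedure below, so no extra
* pass through the data (and no EXECUTE) is needed.
DESCRIPTIVES VARIABLES=unit_value
  /STATISTICS=MEAN STDDEV MIN MAX.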
In reply to this post by Daniel E WILLIAMS
Hi,

Richard Ristow's summary of how to optimize SPSS commands was excellent, and I just wanted to add one additional point. The slowest part of an analysis is going to be reading and writing data files from disk, so you should use 10k or 15k RPM disks (I'm not sure whether there really is much of an advantage to 15k disks when working with large files). An even better solution is to use a RAID configuration, in which two or more disks are combined into a single volume and data are written across the disks. You can create a RAID using the Disk Management tool in Windows XP. There are a number of different RAID types that can be used, but RAID 0 would be a good solution as long as you back up regularly. More expensive RAID software and hardware options are available that will offer better speed and data protection.

--Mike
In reply to this post by statisticsdoc
To agree wholeheartedly, with a few notes:
At 09:20 PM 8/23/2006, Statisticsdoc wrote:

>Setting a cutoff related to standard deviations to detect outliers is
>useful in many situations, but there are exceptions. For example,
>standard deviations are more appropriate for normally distributed
>data. With skewed data (e.g. income, days of sickness absence, etc.)
>some valid values may [be very large, measured in] standard deviations.

Statistical folk wisdom in some circles is that you'll never see truly normally distributed data. One of the most common departures is far more extreme values than a normal distribution would have - often most marked at large multiples of the SD, where the observed values can be rare and still many times the frequency expected under a normal distribution.

>You might want to consider setting cutoffs for outliers based on
>plausibility (e.g., when dealing with high school students who claim
>that they consume 100 beers on a daily basis). [Or] "wild values" -
>numbers that do not make sense and can be considered out of range.

This is the easy case: 'outliers' that can confidently be identified as data errors. Easy, in that there is no subtlety about the analysis; the correct handling is clear: identify them and drop them from analysis (or, if possible, go back to the data source and correct them). But it IS the easy case. It disposes of a lot of apparent outliers, but still leaves you to deal with the real ones.

One point that has only recently become clear to me is that rare, unusually large values can give a variance in parameter estimates far larger than statistical procedures will assign. That's because a sample will commonly include very few of them; the variance in the number sampled will be high; and their effect on the parameters, of course, very large. The worst case is the difference between having any of the largest values in your sample and having none. Bootstrapping should probably be used to estimate parameter variances in these cases.