Hello all,
Are there any computational performance metrics available in SPSS that one could use to compare two different syntax designs (for solving the same problem)? Thanks.

Dan Williams
Forecasting, Research and Analysis Office
Finance and Policy Analysis
Department of Human Services
State of Oregon, USA
(503) 947-5354
Dan,

Having done a little work at DHS, I would suggest that server performance may be one of the bigger performance issues you have to deal with.

For benchmarking syntax, I would suggest using one of the data sets you already have available and are familiar with. Simply run the alternative syntax against the same data set on the same server drive. Note that you can use the N OF CASES command to limit the number of cases processed. (I assume that part of the context for your question is that some of your forecasting files are quite large.) N OF CASES is also useful when you are debugging syntax. Keep in mind, however, that the performance of some operations, such as SORT CASES, will not be a simple linear function of N.

For general strategies to promote efficient code when working with large data sets, I recommend reading the efficiency tips section of Raynald Levesque's SPSS Programming and Data Management. You can purchase it in hard copy or download it from SPSS (http://www.spss.com/spss/data_management_book.htm), but a PDF copy should also be on your installation disk.

Dennis Deck, PhD
RMC Research Corporation
111 SW Columbia Street, Suite 1200
Portland, Oregon 97201-5843
voice: 503-223-8248 x715
voice: 800-788-1887 x715
fax: 503-223-8248
[hidden email]
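A minimal sketch of that benchmarking approach; the file name, variable, and case limit below are placeholders, not anything from Dan's files:

GET FILE='forecast_test.sav'.
* Limit the run to the first 50,000 cases so both versions are compared on the same subset.
N OF CASES 50000.
* Version A of the syntax under test goes here; note the start and end times.
SORT CASES BY region.
* Then re-open the file, repeat with version B, and compare the elapsed times.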
Dear All,

I have a large dataset of products in the following format, from which I will be using the last column, the unit values (or prices) of products. The problem is that some firms have reported extreme values (outliers) for some products. For instance, in the following example, the unit values for sugar hover around 11 for most firms, while one firm (firm 3) has a significantly higher value, say 100. I need to detect such firms for each product and delete them from my sample. Can anybody suggest a way to do this in SPSS?

Firm   Product   Quantity   Sales   Unit Values
A      B         C          D       C/D
1      Sugar     10         100     10
2      Sugar     15         180     12
3      Sugar     18         1800    100
4      Sugar     20         220     11
5      Meat      10         200     20
6      Meat      20         460     23
7      Meat      25         1500    60
8      Butter    100        1200    12
9      Butter    10         110     11
10     Butter    30         390     13

Thank you in advance.

Kind regards
Abdul

Abdul Azeez Erumban
Faculty of Economics
University of Groningen
Tel: +31 (0)50 363 3762
Sorry if you have already done this, but I'd first suggest verifying the outliers to see whether each is a data-entry error or a mis-report; it would be a shame to lose cases that can be recovered.

Could you also let us know your cutoff criteria? Do you want to:
 i. set an absolute cutoff unit price for each commodity, and then delete whatever is higher than that value? or
 ii. set a floating cutoff (such as > 2 or 3 standard deviations, or a percentage deviation from the mean)? or
 iii. some other idea?

-Ken
In reply to this post by Daniel E WILLIAMS
If you have SPSS 14, you might want to look into the programmability benchmark module on the SPSS Developer Central site (www.spss.com/devcentral). It provides for benchmarking alternative sets of syntax with a user-settable number of repetitions, interleaving them in order to minimize the effects of operating system caching, file buffering, memory management, virus checkers, etc. It produces as output a file designed to be read into SPSS (natch!) and analyzed. There are quite a few different performance measures available from this module corresponding more or less to the measures that you can see in the Task Manager process view.
This module is Windows only, as it is tied to the measures that can be collected from the operating system.

Regards,
Jon Peck
SPSS
In reply to this post by Ken Chui
Suppose unit values greater than or equal to 60 are outliers. Then

SELECT IF (Unit_Values < 60).

will give you the cases with unit values less than 60. You may proceed to do your analysis based on the selected records. An alternative is to create a filter variable to select the desired records for analysis.

Regards
In reply to this post by Abdul Azeez
There have been discussions on this list about the wisdom of deleting outliers. If you are quite certain that you want to remove them from further analysis, a general rule of thumb for scale variables is to classify values more than 3 standard deviations from the mean (in absolute value) as outliers. The quick way to do this is to use the DESCRIPTIVES command with /SAVE to add z-scores (mean 0, standard deviation 1) to your file. Then it is simple to filter or select cases as follows:
DESCRIPTIVES VARIABLES=a TO z
  /SAVE
  /STATISTICS=MEAN STDDEV MIN MAX.
COMPUTE maxout = MAX(za TO zz).
COMPUTE minout = MIN(za TO zz).
COMPUTE filter = 1.
IF (maxout > 3 OR minout < -3) filter = 0.
FILTER BY filter.

This example filters these cases from further analysis but does not delete them from the active data file. I use the MAX and MIN functions in this way because they take a simple variable list using the TO convention.
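For the per-product screening Abdul describes, a similar sketch using AGGREGATE; it assumes SPSS 14 or later (for MODE=ADDVARIABLES) and placeholder variable names product and unit_value:

AGGREGATE OUTFILE=* MODE=ADDVARIABLES
  /BREAK=product
  /mean_uv=MEAN(unit_value)
  /sd_uv=SD(unit_value).
* z-score of each unit value relative to its own product group.
COMPUTE z_uv = (unit_value - mean_uv) / sd_uv.
COMPUTE keepflag = (ABS(z_uv) <= 3).
FILTER BY keepflag.
* Note: a product reported by a single firm has no usable SD, so its z-score
* is missing and the case is filtered out; such products need separate handling.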
In reply to this post by Edward Boadi
The SELECT IF command will delete the outliers from the working file for good. If you save the file under the same name, they will be lost forever. Just in case, you had better use one of the following:

1. Temporary selection:

TEMPORARY.
SELECT IF (UNIT_VALUES < 60).

This is valid only for the next operation that reads the data.

2. Create a dummy variable and use it as a filter:

COMPUTE LESSTHAN60 = (UNIT_VALUES < 60).
FILTER BY LESSTHAN60.
........ (as many procedures as needed)
FILTER OFF.

The filter remains in force until FILTER OFF is issued.

Hector
In reply to this post by Ken Chui
Ken,

I won't discuss your advice, but rather take the occasion to advertise what I think is always good practice on a list like this one: reproduce the preceding messages in the thread; if not the entire thread, then at least the one message (or the part of it) you are responding to. This makes the responses self-contained and helps the rest of us. I personally do not know at first glance, without a little research, which of the "Outlier detection" messages you are actually answering here: the original query by Abdul Azeez (I suspect it is this one), the advice given by ViAnn, or perhaps some other one I missed. If the thread is longer, this can become difficult.

Hector
In reply to this post by Abdul Azeez
Hello Hector,

Thanks for the advice; I'll keep it in mind. The problem, though, was that the original poster sent HTML, and I had to use Dreamweaver to make sense of it. I tried to repost the original message in text format and failed, and thus ended up with only my response. There are numerous ways to follow this list, and I'll try to make sure the information exchange is as efficient as possible in future. Again, thanks for the advice.

-Ken
In reply to this post by Hector Maletta
Dear Ken,

I prefer to go for the second option: a floating cutoff (such as > 2 or 3 standard deviations), because it does not seem wise to use absolute cutoff points for all products, as I have more than 2,000 products.

abdul
Stephen Brand
www.statisticsdoc.com

I agree wholeheartedly with those who have suggested looking for reasons why some data points appear to be outliers, as well as using some flexibility in choosing cutoffs for outliers. Setting a cutoff based on standard deviations is useful in many situations, but there are exceptions. Standard deviations are most appropriate for normally distributed data; with skewed data (e.g., income, days of sickness absence), some valid values may lie many standard deviations from the mean. You might want to consider setting cutoffs for outliers based on plausibility (e.g., when dealing with high school students who claim that they consume 100 beers on a daily basis).

A related topic concerns "wild values" - numbers that are not that far away from the overall distribution, but which do not make sense and can be considered out of range. For example, if you know that the school year in a given school district consists of at most 200 days, and that there is no summer school, then subjects who report missing more than 200 days of school would need further examination.

HTH,
Stephen Brand

For personalized and professional consultation in statistics and research design, visit www.statisticsdoc.com
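A minimal sketch of that kind of range check, with placeholder names (days_absent, id, and the 200-day maximum are assumptions for illustration):

COMPUTE wild_flag = (days_absent > 200 OR days_absent < 0).
* List the flagged cases for review before deciding to correct or exclude them.
TEMPORARY.
SELECT IF wild_flag = 1.
LIST VARIABLES=id days_absent.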
In reply to this post by Daniel E WILLIAMS
At 12:45 PM 8/22/2006, Daniel E WILLIAMS wrote:
>Are there any computational performance metrics available in SPSS that
>one could use to compare two different syntax designs (for solving the
>same problem)?

This doesn't address your question directly, but it's a good side angle.

With extremely rare exceptions, time for SPSS runs is dominated by reading and writing disk files. A pretty good proxy for running time is the total amount read and written, and that can usually be ascertained from the file sizes, organization, and program logic, without direct measurements.

By way of tuning, that means as few procedure statements as possible. Since you don't want to forgo results, it means combining as many calculations as you can in each procedure call. It also means (here I go again) not using EXECUTE, a procedure which does nothing, except in the special cases where it's necessary. (See Raynald Levesque's book.)

If you have a release with a Virtual Active File system (most of us do - was that SPSS 11, or 12?), caching can slow processing down, or speed it up considerably. It slows processing if reading the original file is easy (simple things like DATA LIST, or GET FILE with maybe a few transformations) and you're using most of the data, maybe only two or three times. Caching is most likely to speed processing when:
- You're running multiple procedures against the same active file
- Reading the original data is expensive, e.g. some SQL extractions
- Procedures use a small subset of the variables, cases, or both. (SELECTing and KEEPing only the cases and variables you're using can help a lot.)

You can get the effect of CACHE with XSAVE followed by GET FILE for the file you've XSAVEd. That can be more flexible; for example, you can XSAVE several versions of the file, with different selections or transformations, in the same pass through the data.

Procedures that store all or much of the data - anything that calculates medians, AGGREGATE without PRESORTED, CREATE - can be very slow for very large files, but as fast as other procedures for smaller ones. If you're using these, it's an additional reason for SELECTing and KEEPing to make your working file as small as possible. (AGGREGATE without PRESORTED is generally faster than SORT followed by AGGREGATE with PRESORTED; I understand, usually a good deal faster. If you have hundreds of thousands of cases, that may well not be so, especially if your file is already in the desired order and you don't need a sort.)

Of course, if you're processing data by subsets, use SPLIT FILE rather than multiple SELECT IFs or FILTERs. That may need a SORT CASES; I'm not sure of the speed penalty for that. I'm pretty sure it will take the time that's needed to read and write the whole file; how much more, and under what circumstances, I don't know.

MATRIX is a wild card. Usually you can't choose it for efficiency; you have reason to use MATRIX, or you don't. I suspect MATRIX code is very fast if all the matrix variables can be kept in memory at once, and these days that can be done for pretty big matrices. If matrices have to be paged to disk, I imagine it slows down very badly right away.

Onward, and good luck,
Richard Ristow
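A minimal sketch of the XSAVE pattern described above, writing two subsets and running a procedure in a single data pass; the file names, the string variable product, and unit_value are placeholders:

GET FILE='products.sav'.
DO IF (product = 'Sugar').
  XSAVE OUTFILE='sugar.sav'.
ELSE IF (product = 'Meat').
  XSAVE OUTFILE='meat.sav'.
END IF.
* The XSAVEs execute on the same pass as the procedure below, so no extra
* pass through the data (and no EXECUTE) is needed.
DESCRIPTIVES VARIABLES=unit_value
  /STATISTICS=MEAN STDDEV MIN MAX.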
In reply to this post by Daniel E WILLIAMS
Hi,

Richard Ristow's summary of how to optimize SPSS commands was excellent, and I just wanted to add one additional point. The slowest part of an analysis is going to be reading and writing data files from disk, so you should use 10k or 15k RPM disks (I'm not sure whether there really is much of an advantage to 15k disks when working with large files). An even better solution is to use a RAID configuration, in which two or more disks are combined into a single volume and data are written across the disks. You can create a RAID using the Disk Management tool in Windows XP. There are a number of different RAID types that can be used, but RAID 0 would be a good solution as long as you back up regularly. More expensive RAID software and hardware options are available that will offer better speed and data protection.

--Mike
In reply to this post by statisticsdoc
To agree wholeheartedly, with a few notes:
At 09:20 PM 8/23/2006, Statisticsdoc wrote:

>Setting a cutoff related to standard deviations to detect outliers is
>useful in many situations, but there are exceptions. For example,
>standard deviations are more appropriate for normally distributed
>data. With skewed data (e.g. income, days of sickness absence, etc.)
>some valid values may [be very large, measured in] standard deviations.

Statistical folk wisdom in some circles is that you'll never see truly normally distributed data. One of the most common departures is far more extreme values than a normal distribution would have - often most marked at large multiples of the SD, where the observed values can be rare and still many times the frequency expected under a normal distribution.

>You might want to consider setting cutoffs for outliers based on
>plausibility (e.g., when dealing with high school students who claim
>that they consume 100 beers on a daily basis). [Or] "wild values" -
>numbers that do not make sense and can be considered out of range.

This is the easy case: 'outliers' that can confidently be identified as data errors. Easy, in that there is no subtlety about the analysis; the correct handling is clear: identify them and drop them from analysis (or, if possible, go back to the data source and correct them). But it IS the easy case. It disposes of a lot of apparent outliers, but still leaves you to deal with the real ones.

One point that has only recently become clear to me is that rare, unusually large values can give a variance in parameter estimates far larger than statistical procedures will assign. That's because a sample will commonly include very few of them; the variance in the number sampled will be high; and their effect on the parameters, of course, very large. The worst case is the difference between having any of the largest values in your sample and having none. Bootstrapping should probably be used to estimate parameter variances in these cases.