Syntax Problem: speed


Syntax Problem: speed

drfg2008
N = 50 persons (cases) give answers on a scale from 1-5 for 10 questions (items). The syntax tries to figure out which persons give 'fake' answers by simply answering randomly. The syntax autocorrelates the 10 items of each person (lag 1). Persons whose answers show lag-1 autocorrelation are considered to be giving fake answers.

Apart from the question of whether this is the proper method, the question is: is there a possibility to make the syntax faster? Since the syntax runs sequentially, it takes very long. Also, the macro needs very long to start in the first place, and it is not possible to run the syntax over a large number of cases: already from 1,000 cases on, the syntax needs too much time (and memory).

Do you see any possibility to optimise the script?

*examp. file-------------.

input program.
loop person =1 to 100 by 1.
end case.
end loop.
end file.
end input program.
exe.

comp   v1   =RV.BINOM(5,0.5).
comp   v2   =RV.BINOM(5,0.5).
comp   v3   =RV.BINOM(5,0.5).
comp   v4   =RV.BINOM(5,0.5).
comp   v5   =RV.BINOM(5,0.5).
comp   v6   =RV.BINOM(5,0.5).
comp   v7   =RV.BINOM(5,0.5).
comp   v8   =RV.BINOM(5,0.5).
comp   v9   =RV.BINOM(5,0.5).
comp   v10   =RV.BINOM(5,0.5).

EXE .
SORT CASES BY person(A).
SAVE OUTFILE='C:\user\testfile.sav'.


*create exampl agg-file ---------------- .

input program.
loop var001 = 1 to 1 by 1.
end case.
end loop.
end file.
end input program.
EXECUTE .
comp var002 =0 .
comp person =0 .

SAVE OUTFILE='C:\user\agg.sav'.




* ----------- Macro starts here ------------------------------------------.


DEFINE  !makro_stoch (start =!tokens(1)
              /end = !tokens(1)
        /testfile = !tokens(1)
              /aggfile = !tokens(1)
              /oms_outfile = !tokens(1)
              /flipvar_1 = !tokens(1)
              /flipvar_2 = !tokens(1)).

  !do !var = !start !to !end.


GET  FILE=!testfile.

FILTER OFF.
USE !var thru !var /permanent.
EXECUTE.

FLIP VARIABLES=!flipvar_1 to !flipvar_2.

SHIFT VALUES VARIABLE=var001 RESULT=var001_shift LAG=1.

OMS /DESTINATION VIEWER=NO /TAG='suppressall'.

oms select tables
   /destination format = sav
   outfile=!oms_outfile
  /if commands = ['Correlations']
  subtypes = ['Correlations'].

CORRELATIONS
  /VARIABLES=var001 var001_shift
  /PRINT=TWOTAIL NOSIG
  /MISSING=PAIRWISE.

OMSEND .

GET  FILE=!oms_outfile.

FILTER OFF.
USE 1 thru 2 /permanent.
EXECUTE.


DELETE VARIABLES Command_ to var001.
EXECUTE .
FLIP VARIABLES=Lagvar0011.
comp person = !var.
EXECUTE .
DELETE VARIABLES CASE_LBL.
EXECUTE .


ADD FILES /FILE=*
  /FILE=!aggfile.
EXECUTE.


SAVE OUTFILE=!aggfile.



  !doend.


SORT CASES BY person(A).
MATCH FILES /FILE=*
  /FILE=!testfile
  /BY person.
EXECUTE .

formats var001 var002 (f8.3).

rename variable var001 = pearson.
rename variable var002 = probability.
exe.
RESTORE.

!enddefine.

!makro_stoch start = 1 end = 100 flipvar_1=v1  flipvar_2 =v10   testfile = 'C:\user\testfile.sav'  aggfile = 'C:\user\agg.sav' oms_outfile ='C:\user\outfile.sav'.
Dr. Frank Gaeth


Re: Syntax Problem: speed

Maguin, Eugene
If you are doing what I think you are doing, I think there are faster ways to do this, but I want to make sure I understand what you are claiming. It sounds like you are claiming that if the within-person lag-1 autocorrelation is greater than some value (what value do you have in mind?), then the responses are fakes. Is that true?

I think you are incorrect in your thinking, for this reason. Consider two scenarios: 1) Suppose the 10 items form a scale with the average inter-item correlation being .60. 2) Suppose that the 10 items have no relationship to each other. Suppose you compute a lag-1 autocorrelation of .32 for somebody. What do you conclude?

That said and before I offer something, tell me how your data are organized.
Like this.
Id v1 v2 v3 ... v10
11  2  5  1 ...   4
12  3  2  3 ...   2
13  4  1  3 ...   5

Or like this.
Id
11  2
11  5
11  1
11  ...
11  4
12  3
12  2
12  3
12  ...
12  2

Gene Maguin





-----Original Message-----
From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of
drfg2008
Sent: Sunday, February 13, 2011 2:17 PM
To: [hidden email]
Subject: Syntax Problem: speed


=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD


Re: Syntax Problem: speed

drfg2008
Thank you for your response!

My data is exactly structured as in the "example file", please see
*examp. file-------------.  (top of the text)
...

Since the items are definitely uncorrelated (randomly sorted), there should be no strong autocorrelation (lag 1).

regards. Frank
Dr. Frank Gaeth


Re: Syntax Problem: speed

Maguin, Eugene
Frank,

I think this will do what you need but let me know. I computed the lag 1
autocorrelation directly in syntax.

Gene Maguin

input program.
vector y(10,f1.0).
loop id=1 to 100.
loop #i=1 to 10.
+  compute y(#i)=RV.BINOM(5,0.5).
end loop.
end case.
end loop.
end file.
end input program.
execute.

frequencies y1 to y10.

*  'x' is y1 to y9. 'y' is y2 to y10.
vector y=y1 to y10.
compute xbar=mean(y1 to y9).
compute ybar=mean(y2 to y10).
compute sumxy=0.
compute sumx2=0.
compute sumy2=0.
loop #i=1 to 9.
+  compute sumxy=sumxy+y(#i)*y(#i+1).
+  compute sumx2=sumx2+y(#i)*y(#i).
+  compute sumy2=sumy2+y(#i+1)*y(#i+1).
end loop.
compute corr=(sumxy/9-xbar*ybar)/sqrt((sumx2/9-xbar**2)*(sumy2/9-ybar**2)).
compute xvar=sumx2/9-xbar**2.
compute yvar=sumy2/9-ybar**2.
execute.
format xbar ybar sumxy sumx2 sumy2(f3.0) corr xvar yvar(f10.6).

list id y1 to y10 corr xbar xvar ybar yvar/cases=5.

id y1 y2 y3 y4 y5 y6 y7 y8 y9 y10     corr     xbar     xvar     ybar     yvar

 1  4  0  3  0  2  0  3  2  4  2  -.508840 2.000000 2.444444 1.777778 1.950617
 2  0  2  3  3  1  4  2  3  2  3  -.366900 2.222222 1.283951 2.555556  .691358
 3  2  1  3  3  0  4  2  4  3  2  -.406250 2.444444 1.580247 2.444444 1.580247
 4  3  1  2  2  2  4  2  2  3  1  -.362933 2.333333  .666667 2.111111  .765432
 5  4  3  1  2  3  4  2  2  2  2   .189832 2.555556  .913580 2.333333  .666667

Number of cases read:  5    Number of cases listed:  5
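For readers outside SPSS, the arithmetic in Gene's loop can be sketched in Python (a rough illustration, not part of the original syntax; the function name is hypothetical). It uses the same population formula over the n-1 lagged pairs:

```python
def lag1_autocorr(values):
    """Lag-1 autocorrelation of one respondent's answers: Pearson r
    between the series and the series shifted by one item."""
    x, y = values[:-1], values[1:]            # 'x' is y1..y9, 'y' is y2..y10
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxy = sum(a * b for a, b in zip(x, y)) / n - xbar * ybar
    sx2 = sum(a * a for a in x) / n - xbar ** 2
    sy2 = sum(b * b for b in y) / n - ybar ** 2
    if sx2 == 0 or sy2 == 0:
        return None                           # constant answers: r undefined
    return sxy / (sx2 * sy2) ** 0.5
```

A perfectly alternating pattern like 1, 2, 1, 2, ... comes out near -1, a monotone run near +1, and constant answers have no computable correlation at all.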






-----Original Message-----
From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of
drfg2008
Sent: Tuesday, February 15, 2011 2:05 AM
To: [hidden email]
Subject: Re: Syntax Problem: speed


Re: Syntax Problem: speed

Garry Gelade
Frank

Here's how you can compute lag1 autocorrelations for each individual.

*Set up Sample Data in wide format.
input program.
vector y(10,f1.0).
loop id=1 to 100.
loop #i=1 to 10.
+  compute y(#i)=RV.BINOM(5,0.5).
end loop.
end case.
end loop.
end file.
end input program.
execute.

*convert to long format and calculate.

VARSTOCASES
  /MAKE score FROM y1 y2 y3 y4 y5 y6 y7 y8 y9 y10
  /INDEX=Index1(10)
  /KEEP=id
  /NULL=KEEP.


COMPUTE laggedscore = lag(score).
IF id NE lag(id) laggedscore = $SYSMIS.
EXECUTE.

SPLIT FILE LAYERED BY id.
CORRELATIONS   /VARIABLES=score laggedscore   /PRINT=TWOTAIL NOSIG
  /MISSING=PAIRWISE.





Re: Syntax Problem: speed

drfg2008
In reply to this post by Maguin, Eugene
Thank you, Gene Maguin,
thank you Garry Gelade!


@ Gene Maguin

This is really great; it works fine, very quickly. It took some time to understand (you compute the autocorrelation without using the function), but finally that's exactly what I was searching for.

You're also right with your critique of the method itself. However, the idea is to presume a fake not if there is a correlation at all (since in surveys you always have inter-item correlation), but to identify cases with either a very high or a very low correlation. In the first scenario someone would answer like 1, 1, 1, 1, 1, etc.; in the second scenario someone would answer randomly. Where to draw the line is the question. After having computed the autocorrelation over several different samples, I would first see how the r's are distributed.

Thanks!
Dr. Frank Gaeth


Re: Syntax Problem: speed

tjohnson
I have been following this discussion with interest, not least because of the looping Gene and Garry suggested (I have yet to master loops in my use of SPSS syntax).

However, given the objective, I am intrigued why you don't just compute a standard deviation for each respondent across the items of interest.  That is what I generally do to identify cases with a very high correlation amongst their item responses.
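That per-respondent SD check can be sketched like this (a minimal illustration; the helper name and threshold are purely illustrative, and a threshold of 0 flags only pure straight-lining like 3, 3, 3, ...):

```python
from statistics import pstdev

def flag_constant_responders(responses, threshold=0.0):
    """responses: dict of person id -> list of item scores.
    Flags respondents whose within-person SD is at or below threshold."""
    return {pid: pstdev(scores) <= threshold
            for pid, scores in responses.items()}
```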


-----Original Message-----
From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of drfg2008
Sent: 16 February 2011 13:48
To: [hidden email]
Subject: Re: Syntax Problem: speed


Re: Syntax Problem: speed

drfg2008
I thought about that solution, but there seem to be a few little problems:

1. There is a systematic difference in answering items between groups. For example, men answer items differently from women (more variance) -> shall I exclude women because their variance is smaller than that of men (or the other way round)? The same problem goes for age, education, ...

2. Different distributions cause different variance: binomially distributed variables may generate different (less) variance than metric ones.

3. Where is the limit (the maximum or minimum variance, variance = 0)?

4. You cannot identify random answers by variance.

5. You cannot identify systematic answers, like 1, 2, 3, 4, 5, 6, as fakes.

Correlation identifies (as far as I am convinced) pure randomness as well as systematic changes. The autocorrelation should be neither very near 0 nor very near ±1.
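A minimal sketch of that decision rule (the thresholds here are purely illustrative; as Frank says, they would only be set after looking at how the r's distribute across samples):

```python
def suspicious(r, low=0.05, high=0.9):
    """Flag a respondent's lag-1 autocorrelation r as suspicious when it is
    missing (no variance at all), extreme, or suspiciously near zero."""
    if r is None:                 # constant answers: no computable r
        return True
    return abs(r) >= high or abs(r) <= low
```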

My only problem, until now, was to compute autocorrelation for large data. That's now fixed. Thanks to Gene Maguin.

regards

Frank



Dr. Frank Gaeth


Re: Syntax Problem: speed

Art Kendall
Finding suspicious response patterns usually calls for more than one "test". A pattern within a case is more suspicious the longer the set of items under consideration. Suspicion is increased when a pattern spans logically distinct sets of items.

If the scale is on something like attitudes or values _and_ the convention of balancing with items from both ends of a bipolar construct has been followed, an SD of zero should flag a case as suspicious. Strongly agreeing (or strongly disagreeing) both with items such as "I like chocolate" and with items such as "I hate chocolate foods" is very suspicious.

On something like an achievement test, a zero SD is OK with a perfect score. However, a zero SD with the worst conceivable score should flag a case as suspicious.

Extreme scores can be a reason to consider a set of responses suspicious. High and low autocorrelations are one example.

Another approach is finding distances between cases via the different methods in PROXIMITIES and then considering cases very far from most cases as suspicious. That is, some cases are in sparsely populated regions of the multivariate space; their nearest neighbors are far away.
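That sparse-region idea might be sketched as a mean distance to the k nearest neighbours (a hypothetical helper, not SPSS PROXIMITIES; brute-force, so only practical for modest N):

```python
import math

def knn_outlier_score(cases, k=3):
    """cases: list of equal-length numeric response vectors.
    Returns one score per case: the mean Euclidean distance to its
    k nearest neighbours. Large scores suggest sparsely populated regions."""
    scores = []
    for i, a in enumerate(cases):
        dists = sorted(math.dist(a, b) for j, b in enumerate(cases) if j != i)
        scores.append(sum(dists[:k]) / k)
    return scores
```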

Nowadays there are some procedures (which I have not examined closely) under <data> <identify unusual cases>. These appear to work: in at least the few instances that I have tried this, the cases flagged as "unusual" do appear to be unusual. (One can consider "unusual" a synonym for "suspicious" in these contexts.) For example, in a 2D space of weight and age, a 300-pound 9-year-old would be suspicious.

In my experience, 3D plots with colors for different groups can give some ideas of suspicious cases. So can extreme residuals in regressions, etc.



Art Kendall
Social Research Consultants



On 2/16/2011 12:17 PM, drfg2008 wrote:

Re: Syntax Problem: speed

drfg2008
Yes, I would agree so far that detecting suspicious response patterns usually calls for more than one "test". But yours is a more context-driven approach: you have to know a lot about the data and the logical context. This is often not the case when the programmer, who is often enough under time pressure and without context knowledge, has to decide.

As a matter of fact, autocorrelation is only one of many approaches to 'fraud detection'. Logical incoherence is another possibility, and there are a few others. But it takes much more effort.

(You could also post so-called 'honey pots', where only those candidates step in whom you want to keep out.)

By the way, an achievement test should never have a zero SD, because that indicates a ceiling effect or a floor effect (it is either too easy or too difficult). This makes the whole test suspicious ;-)

Still, I find the autocorrelation (esp. for opinion surveys) an interesting approach: quick, without any pre-knowledge, simple to implement.

Frank
Dr. Frank Gaeth


Re: Syntax Problem: speed

Art Kendall
But a programmer could just click <data> <identify unusual cases> without much effort.  In fact it could take longer to select variables, etc. than to actually run the syntax.

To check this I just tried it with 14 variables and 91,00 cases in a system file. I clicked the menu, pasted the syntax, highlighted it, and clicked <run selection>.

It took about 20 to 25 seconds to click through the GUI menus. It took 11.8 seconds on a fast desktop. It took a lot longer to reply to this post.

More considered use of the menus could even run to a couple of minutes.  Now that this procedure is here, I'll most likely use it most of the time.

Over the years, I have very seldom received really clean data sets. YMMV.

One of the reasons that I have a soapbox about being able to go back to the beginning is that all through an analysis one can find things that look anomalous.

Art Kendall
Social Research Consultants


On 2/16/2011 2:43 PM, drfg2008 wrote:

Re: Syntax Problem: speed

drfg2008
maybe I missed something.

In my old SPSS 17 on my computer there is no <data> <identify unusual cases>. Is it a feature in later versions, or something to plug in? Or maybe I'm just a bit uninformed?

If it is part of a later version of SPSS, this would be a very good reason to upgrade as fast as possible.

Frank
Dr. Frank Gaeth


Re: Syntax Problem: speed

Jon K Peck
Identify Unusual Cases generates syntax for the DETECTANOMALY command. It is part of the Data Preparation option, which also includes VALIDATEDATA.

From the help...
The Anomaly Detection procedure searches for unusual cases based on deviations from the norms of their cluster groups. The procedure is designed to quickly detect unusual cases for data-auditing purposes in the exploratory data analysis step, prior to any inferential data analysis. This algorithm is designed for generic anomaly detection; that is, the definition of an anomalous case is not specific to any particular application, such as detection of unusual payment patterns in the healthcare industry or detection of money laundering in the finance industry, in which the definition of an anomaly can be well-defined.

Methods. The DETECTANOMALY procedure clusters cases into peer groups based on the similarities of a set of input variables. An anomaly index is assigned to each case to reflect the unusualness of a case with respect to its peer group. All cases are sorted by the values of the anomaly index, and the top portion of the cases is identified as the set of anomalies. For each variable, an impact measure is assigned to each case that reflects the contribution of the variable to the deviation of the case from its peer group. For each case, the variables are sorted by the values of the variable impact measure, and the top portion of variables is identified as the set of reasons why the case is anomalous.



Jon Peck
Senior Software Engineer, IBM
[hidden email]
312-651-3435




From:        drfg2008 <[hidden email]>
To:        [hidden email]
Date:        02/16/2011 01:42 PM
Subject:        Re: [SPSSX-L] Syntax Problem: speed
Sent by:        "SPSSX(r) Discussion" <[hidden email]>





Re: Syntax Problem: speed

drfg2008
oh I see. No, we don't have the licence for it.


>Error # 7079
>There is no license for SPSS Data Preparation.
>This command will not be executed.
>Specific symptom number: 18

thanks!
Dr. Frank Gaeth


Re: Syntax Problem: speed

drfg2008
In reply to this post by drfg2008
I quote that private message here because it has some interesting arguments:

Well, that conclusion is mainly wrong.

The auto-correlation should be near +1 *only*  when the
person marks each answer close to the immediately prior
one.  It will *not*  detect 1,1,1,1,1,...  because if there is
no variance, there is (technically) no computable correlation.

And "very near zero"  is what you ought to expect for legal
answers unless the items are arranged in some meaningful order.

--
Rich Ulrich


********************
Reply:

(first: you're right that 1,1,1, ... does not cause any correlation. If you can't compute a correlation due to a lack of variance, this should be a serious hint)

But: the interesting point is that, after I went through a few older studies, in those surveys there were ALWAYS meaningful arrangements of items. So if you expect a near-zero correlation for legitimate answers, that assumption (also stated in my first text) is theoretical, not empirical: you would exclude almost everyone. The point seems to be that fake answers are, so to speak, simply 'too far away' from the rest of the group.

Dr. Frank Gaeth
