SPSSX Discussion

arrays in SPSS

Classic

List

Threaded

8 messages Options

sowmya sankar

arrays in SPSS

Hi,

I have two columns and for each value in column A, I want to find if the
same value exists in column B.
Example:
Col A Col B
12 11
13 14
14 12
11 10

In the above example, I would say that 13 is not present in column B. How
can I do this using arrays and loops in SPSS?

Thanks,
Sowmya

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD

Marta Garcia-Granero

Re: arrays in SPSS

> I have two columns and for each value in column A, I want to find if the
> same value exists in column B.
> Example:
> Col A Col B
> 12 11
> 13 14
> 14 12
> 11 10
>
> In the above example, I would say that 13 is not present in column B. How
> can I do this using arrays and loops in SPSS?
>

Quick&dirty solution (I'm sure that more elegant ones will be provided
by other people):

* Your sample dataset *.
DATA LIST LIST/ColA ColB (2 F8).
BEGIN DATA
12 11
13 14
14 12
11 10
END DATA.

* Code assumes SPSS 14 or newer is used *.
DATASET NAME Data.
DATASET DECLARE APresentInB WINDOW=MINIMIZED.
PRESERVE.
SET MXLOOPS=1000. /* If sample size GT 1000, change MXLOOPS *.

* WARNING: This code assumes no missing data *.
MATRIX.
GET ColA/VAR=ColA.
GET ColB/VAR=ColB.
COMPUTE n=NROW(ColA).
COMPUTE Present=MAKE(n,1,0).
LOOP i=1 TO n.
. LOOP j=1 TO n.
. DO IF ColA(i) EQ ColB(j).
. COMPUTE Present(i)=1.
. END IF.
. END LOOP.
END LOOP.
SAVE Present /OUTFILE=APresentInB.
END MATRIX.
RESTORE.
MATCH FILES
/FILE=*
/FILE='APresentInB'.
DATASET CLOSE APresentInB.
VALUE LABEL COL1 0'Absent in ColB' 1'Present in ColB'.
FORMAT COL1(F8).
LIST.

HTH,
Marta García-Granero

--
For miscellaneous statistical stuff, visit:
http://gjyp.nl/marta/

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD

Marta Garcia-Granero

Re: arrays in SPSS

sowmya sankar wrote:
> One question: what would happen if there are missing values? For
> example, in my data, column A has 4731 rows and column B has 4861
> rows. Would the same code run?
As a general rule, a complete description of your dataset and your goal
will save a lot of time. You described your goal perfectly, but omitted
some important information concerning your dataset (this is not a
rebuke, just an advise that will help you to get better an faster
responses from the list). BTW, given the sample size you mention (over
4000) be patient, this code performs two nested loops, and the time
needed increases exponentially with sample size.

MATRIX will give an error message if missing data are present on either
column. This modification will make the code work OK:

* Your sample dataset (modified to add some missing data to test the
code) *.
DATA LIST LIST/ColA ColB (2 F8).
BEGIN DATA
12 11
13 14
14 12
11 10
-99 15
10 -99
END DATA.
RECODE ALL (-99=SYSMIS) .
LIST.

* Code assumes SPSS 14 or newer is used *.
DATASET NAME Data.
DATASET DECLARE APresentInB WINDOW=MINIMIZED.
PRESERVE.
SET MXLOOPS=5000. /* If sample size GT 5000, change MXLOOPS *.

* Code modified to accept missing data *.
TEMPORARY.
LIST.
MATRIX.
GET ColA/VAR=ColA /MISSING=ACCEPT /SYSMIS=-99.
GET ColB/VAR=ColB /MISSING=ACCEPT /SYSMIS=-99.
COMPUTE n=NROW(ColA).
COMPUTE Present=MAKE(n,1,0).
LOOP i=1 TO n.
. LOOP j=1 TO n.
. DO IF ColA(i) EQ ColB(j).
. COMPUTE Present(i)=1.
. END IF.
. END LOOP.
END LOOP.
SAVE Present /OUTFILE=APresentInB.
END MATRIX.
RESTORE.
MATCH FILES
/FILE=*
/FILE='APresentInB'.
DATASET CLOSE APresentInB.
VALUE LABEL COL1 0'Absent in ColB' 1'Present in ColB'.
* Cleaning missing data in result column *.
IF MISSING(ColA) COL1=$sysmis.
FORMAT COL1(F8).
LIST.

Now it will work.

Marta

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD

Peck, Jon

Re: arrays in SPSS

Here's a Python way to do this. It takes two passes but is linear in running time. The code assumes that sysmis values should never match. The result is a new variable, present, with values of 0 or 1. (Logically, the result should perhaps be sysmis if ColA is sysmis. That could be easily accommodated, but it probably doesn’t matter here.)

This code will work with SPSS 15 or later (and maybe 14.0.1).

HTH,
Jon Peck

* create some data.
DATA LIST LIST/ColA ColB (2 F8).
BEGIN DATA
12 11
13 14
14 12
11 10
-99 15
10 -99
END DATA.
RECODE ALL (-99=SYSMIS) .
exec.

BEGIN PROGRAM.
import spss, spssdata
caseset = set()
curs = spssdata.Spssdata(accessType="w")
for case in curs:
if not case.ColB is None:
caseset.add(case.ColB)
curs.restart()

curs.append("present")
curs.commitdict()
for case in curs:
curs.casevalues([case.ColA in caseset])
curs.CClose()
END PROGRAM.

-----Original Message-----
From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of Marta García-Granero
Sent: Tuesday, August 19, 2008 9:42 AM
To: [hidden email]
Subject: Re: [SPSSX-L] arrays in SPSS

sowmya sankar wrote:
> One question: what would happen if there are missing values? For
> example, in my data, column A has 4731 rows and column B has 4861
> rows. Would the same code run?
As a general rule, a complete description of your dataset and your goal
will save a lot of time. You described your goal perfectly, but omitted
some important information concerning your dataset (this is not a
rebuke, just an advise that will help you to get better an faster
responses from the list). BTW, given the sample size you mention (over
4000) be patient, this code performs two nested loops, and the time
needed increases exponentially with sample size.

MATRIX will give an error message if missing data are present on either
column. This modification will make the code work OK:

* Your sample dataset (modified to add some missing data to test the
code) *.
DATA LIST LIST/ColA ColB (2 F8).
BEGIN DATA
12 11
13 14
14 12
11 10
-99 15
10 -99
END DATA.
RECODE ALL (-99=SYSMIS) .
LIST.

* Code assumes SPSS 14 or newer is used *.
DATASET NAME Data.
DATASET DECLARE APresentInB WINDOW=MINIMIZED.
PRESERVE.
SET MXLOOPS=5000. /* If sample size GT 5000, change MXLOOPS *.

* Code modified to accept missing data *.
TEMPORARY.
LIST.
MATRIX.
GET ColA/VAR=ColA /MISSING=ACCEPT /SYSMIS=-99.
GET ColB/VAR=ColB /MISSING=ACCEPT /SYSMIS=-99.
COMPUTE n=NROW(ColA).
COMPUTE Present=MAKE(n,1,0).
LOOP i=1 TO n.
. LOOP j=1 TO n.
. DO IF ColA(i) EQ ColB(j).
. COMPUTE Present(i)=1.
. END IF.
. END LOOP.
END LOOP.
SAVE Present /OUTFILE=APresentInB.
END MATRIX.
RESTORE.
MATCH FILES
/FILE=*
/FILE='APresentInB'.
DATASET CLOSE APresentInB.
VALUE LABEL COL1 0'Absent in ColB' 1'Present in ColB'.
* Cleaning missing data in result column *.
IF MISSING(ColA) COL1=$sysmis.
FORMAT COL1(F8).
LIST.

Now it will work.

Marta

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD

Manmit.Shrimali

FA Eg value

Team:

When running FA for data reduction, I have always used eighenvalue to determine number of components along with screen plot. However, I realize, through posts, that this approach is facing much criticism. I will really appreciate if anyone can share any documents or explanation that explains in detail why relying on typical method (retaining factors near to eigenvalue of 1) is not the best approach?

Thanks,
MS

This e-mail and any files transmitted with it are for the sole use of the intended recipient(s) and may contain confidential and privileged information.
If you are not the intended recipient, please contact the sender by reply e-mail and destroy all copies of the original message.
Any unauthorised review, use, disclosure, dissemination, forwarding, printing or copying of this email or any action taken in reliance on this e-mail is strictly
prohibited and may be unlawful.

====================To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD

Richard Ristow

Re: arrays in SPSS

In reply to this post by sowmya sankar

At 09:59 AM 8/19/2008, sowmya sankar wrote:

>I have two columns and for each value in column A, I want to find if
>the same value exists in column B.
>Example:
|-----------------------------|---------------------------|
|Output Created |19-AUG-2008 21:13:39 |
|-----------------------------|---------------------------|
[TestData]

CaseID Col_A Col_B

Alpha 12 11
Beta 13 14
Gamma 14 12
Delta 11 10
Number of cases read: 4 Number of cases listed: 4

(Grin) It's an interesting problem, not the least as an exercise in
the different tools that SPSS offers. Here's a solution in native
SPSS transformation language. It should be N*log(N) in running time,
dominated by the SORTs. (Jon, are you sure yours is linear? I think
N*log(N) should be theoretical best case.)

Like Marta's solution, this uses datasets, requiring SPSS 14+.
However, it could easily be changed to use disk scratch files
instead, for earlier versions.

DATASET COPY Demo.
DATASET ACTIVATE Demo WINDOW=FRONT.
* Create a sorted lookup table of the column B values, .
* renaming them "Column A". You'll see why, later. .

ADD FILES
/FILE=Demo
/RENAME=(Col_A Col_B = Dump Col_A)
/KEEP = Col_A.
DATASET NAME LookUp WINDOW=FRONT.

SORT CASES BY Col_A.
. /**/ LIST /*-*/.
List
|-----------------------------|---------------------------|
|Output Created |19-AUG-2008 21:13:41 |
|-----------------------------|---------------------------|
[LookUp]
Col_A
10
11
12
14
Number of cases read: 4 Number of cases listed: 4

DATASET ACTIVATE Demo WINDOW=FRONT.
* The kludgiest feature of this solution is variable .
* CaseSeq#, which is necessary to restore the original .
* case order after the sort and lookup: .
NUMERIC CaseSeq# (F4).
COMPUTE CaseSeq# = $CASENUM.

* Now, it's easy: .

SORT CASES BY Col_A.
MATCH FILES
/FILE=*
/FILE=LookUp /IN=IsInColB
/BY Col_A.
FORMATS IsInColB (F2).

* Drop values that occur only in column B: .
SELECT IF NOT MISSING(CaseSeq#).
* Restore original order of cases in file: .
SORT CASES BY CaseSeq#.
LIST.
List
|-----------------------------|---------------------------|
|Output Created |19-AUG-2008 21:13:42 |
|-----------------------------|---------------------------|
[Demo]
CaseID Col_A Col_B CaseSeq# IsInColB

Alpha 12 11 1 1
Beta 13 14 2 0
Gamma 14 12 3 1
Delta 11 10 4 1

Number of cases read: 4 Number of cases listed: 4
=============================
APPENDIX: Test data, and code
=============================
DATA LIST LIST/
CaseID Col_A Col_B
(A6, F3, F3).
BEGIN DATA
Alpha 12 11
Beta 13 14
Gamma 14 12
Delta 11 10
END DATA.
DATASET NAME TestData WINDOW=FRONT.
LIST.

DATASET COPY Demo.
DATASET ACTIVATE Demo WINDOW=FRONT.

* Create a sorted lookup table of the column B values, .
* renaming them "Column A". You'll see why, later. .

ADD FILES
/FILE=Demo
/RENAME=(Col_A Col_B = Dump Col_A)
/KEEP = Col_A.
DATASET NAME LookUp WINDOW=FRONT.
SORT CASES BY Col_A.
. /**/ LIST /*-*/.

DATASET ACTIVATE Demo WINDOW=FRONT.
* The kludgiest feature of this solution is variable .
* CaseSeq#, which is necessary to restore the original .
* case order after the sort and lookup: .
NUMERIC CaseSeq# (F4).
COMPUTE CaseSeq# = $CASENUM.

* Now, it's easy: .

SORT CASES BY Col_A.
MATCH FILES
/FILE=*
/FILE=LookUp /IN=IsInColB
/BY Col_A.
FORMATS IsInColB (F2).

* Drop values that occur only in column B: .
SELECT IF NOT MISSING(CaseSeq#).
* Restore original order of cases in file: .
SORT CASES BY CaseSeq#.
LIST.

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD

Richard Ristow

Re: arrays in SPSS

In reply to this post by sowmya sankar

Whoops! Revision. At 09:59 AM 8/19/2008, sowmya sankar wrote:

>For each value in column A, I want to find if the same value exists
>in column B.

The previous solution won't work if the same value occurs more than
once in either column A or column B. Here, with revised code and test data:

DATASET COPY Demo.
DATASET ACTIVATE Demo WINDOW=FRONT.
* Create a sorted lookup table of the UNIQUE column B .
* values, named as if they're from column A: .

ADD FILES
/FILE=Demo
/RENAME=(Col_A Col_B = Dump Col_A)
/KEEP = Col_A.
DATASET NAME LookUp WINDOW=FRONT.

SORT CASES BY Col_A.
* Drop duplicates from the list: .
SELECT IF MISSING(LAG(Col_A))
OR Col_A NE LAG(Col_A).
. /**/ LIST /*-*/.

List
|-----------------------------|---------------------------|
|Output Created |19-AUG-2008 21:54:07 |
|-----------------------------|---------------------------|
[LookUp]
Col_A

10
11
12
14
15

Number of cases read: 5 Number of cases listed: 5

DATASET ACTIVATE Demo WINDOW=FRONT.
* The kludgiest feature of this solution is variable .
* CaseSeq#, which is necessary to restore the original .
* case order after the sort and lookup: .
NUMERIC CaseSeq# (F4).
COMPUTE CaseSeq# = $CASENUM.

* Now, it's easy: .

SORT CASES BY Col_A.
* Revision: Use /TABLE= instead of /FILE= for the .
* lookup table. (So, it is no longer necessary to .
* explicitly drop values that occur only in Column B. .
* Howeve, it's mainly to handle values that occur more .
* than once in column A.) .
MATCH FILES
/FILE=*
/TABLE=LookUp /IN=IsInColB
/BY Col_A.
FORMATS IsInColB (F2).

* Restore original order of cases in file: .
SORT CASES BY CaseSeq#.
LIST.

List
|-----------------------------|---------------------------|
|Output Created |19-AUG-2008 21:54:08 |
|-----------------------------|---------------------------|
[Demo]
CaseID Col_A Col_B CaseSeq# IsInColB

Alpha 12 11 1 1
Beta 13 14 2 0
Gamma 14 12 3 1
Delta 11 10 4 1
Epsilon 12 15 5 1
Zeta 9 11 6 0

Number of cases read: 6 Number of cases listed: 6
=============================
APPENDIX: Test data, and code
=============================
DATA LIST LIST/
CaseID Col_A Col_B
(A8, F3, F3).
BEGIN DATA
Alpha 12 11
Beta 13 14
Gamma 14 12
Delta 11 10
Epsilon 12 15
Zeta 9 11
END DATA.
DATASET NAME TestData WINDOW=FRONT.
LIST.

DATASET COPY Demo.
DATASET ACTIVATE Demo WINDOW=FRONT.

* Create a sorted lookup table of the UNIQUE column B .
* values, named as if they're from column A: .

ADD FILES
/FILE=Demo
/RENAME=(Col_A Col_B = Dump Col_A)
/KEEP = Col_A.
DATASET NAME LookUp WINDOW=FRONT.
SORT CASES BY Col_A.
* Drop duplicates from the list: .
SELECT IF MISSING(LAG(Col_A))
OR Col_A NE LAG(Col_A).
. /**/ LIST /*-*/.

DATASET ACTIVATE Demo WINDOW=FRONT.
* The kludgiest feature of this solution is variable .
* CaseSeq#, which is necessary to restore the original .
* case order after the sort and lookup: .
NUMERIC CaseSeq# (F4).
COMPUTE CaseSeq# = $CASENUM.

* Now, it's easy: .

SORT CASES BY Col_A.
* Revision: Use /TABLE= instead of /FILE= for the .
* lookup table. (So, it is no longer necessary to .
* explicitly drop values that occur only in Column B. .
* Howeve, it's mainly to handle values that occur more .
* than once in column A.) .
MATCH FILES
/FILE=*
/TABLE=LookUp /IN=IsInColB
/BY Col_A.
FORMATS IsInColB (F2).

* Restore original order of cases in file: .
SORT CASES BY CaseSeq#.
LIST.

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD

Peck, Jon

Re: arrays in SPSS

In reply to this post by Richard Ristow

I think mine is linear: two data passes - O(N). Set insertion and membership text are O(1). (They use a hash table, which is constant time or within an epsilon or so.) No sorting.

Of course, that doesn't mean it's always faster, since there is overhead passing the data to Python, but for large enough N it will always win.

-Jon

-----Original Message-----
From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of Richard Ristow
Sent: Tuesday, August 19, 2008 7:21 PM
To: [hidden email]
Subject: Re: [SPSSX-L] arrays in SPSS
[snip]
At 09:59 AM 8/19/2008, sowmya sankar wrote:

>I have two columns and for each value in column A, I want to find if
>the same value exists in column B.
>Example:
|-----------------------------|---------------------------|
|Output Created |19-AUG-2008 21:13:39 |
|-----------------------------|---------------------------|
[TestData]

CaseID Col_A Col_B

Alpha 12 11
Beta 13 14
Gamma 14 12
Delta 11 10
Number of cases read: 4 Number of cases listed: 4

(Grin) It's an interesting problem, not the least as an exercise in
the different tools that SPSS offers. Here's a solution in native
SPSS transformation language. It should be N*log(N) in running time,
dominated by the SORTs. (Jon, are you sure yours is linear? I think
N*log(N) should be theoretical best case.)

[>>>Peck, Jon] [snip]

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD