arrays in SPSS

classic Classic list List threaded Threaded
8 messages Options
Reply | Threaded
Open this post in threaded view
|

arrays in SPSS

sowmya sankar
Hi,

I have two columns and for each value in column A, I want to find if the
same value exists in column B.
Example:
Col A      Col B
12            11
13            14
14            12
11            10

In the above example, I would say that 13 is not present in column B. How
can I do this using arrays and loops in SPSS?

Thanks,
Sowmya

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD
Reply | Threaded
Open this post in threaded view
|

Re: arrays in SPSS

Marta Garcia-Granero
Hi

> I have two columns and for each value in column A, I want to find if the
> same value exists in column B.
> Example:
> Col A      Col B
> 12            11
> 13            14
> 14            12
> 11            10
>
> In the above example, I would say that 13 is not present in column B. How
> can I do this using arrays and loops in SPSS?
>

Quick&dirty solution (I'm sure that more elegant ones will be provided
by other people):

* Your sample dataset *.
DATA LIST LIST/ColA ColB (2 F8).
BEGIN DATA
12            11
13            14
14            12
11            10
END DATA.

* Code assumes SPSS 14 or newer is used *.
DATASET NAME Data.
DATASET DECLARE APresentInB WINDOW=MINIMIZED.
PRESERVE.
SET MXLOOPS=1000. /* If sample size GT 1000, change MXLOOPS *.

* WARNING: This code assumes no missing data *.
MATRIX.
GET ColA/VAR=ColA.
GET ColB/VAR=ColB.
COMPUTE n=NROW(ColA).
COMPUTE Present=MAKE(n,1,0).
LOOP i=1 TO n.
. LOOP j=1 TO n.
.  DO IF ColA(i) EQ ColB(j).
.   COMPUTE Present(i)=1.
.  END IF.
. END LOOP.
END LOOP.
SAVE Present /OUTFILE=APresentInB.
END MATRIX.
RESTORE.
MATCH FILES
 /FILE=*
 /FILE='APresentInB'.
DATASET CLOSE APresentInB.
VALUE LABEL COL1 0'Absent in ColB' 1'Present in ColB'.
FORMAT COL1(F8).
LIST.

HTH,
Marta García-Granero

--
For miscellaneous statistical stuff, visit:
http://gjyp.nl/marta/

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD
Reply | Threaded
Open this post in threaded view
|

Re: arrays in SPSS

Marta Garcia-Granero
sowmya sankar wrote:
> One question: what would happen if there are missing values? For
> example, in my data, column A has 4731 rows and column B has 4861
> rows. Would the same code run?
As a general rule, a complete description of your dataset and your goal
will save a lot of time. You described your goal perfectly, but omitted
some important information concerning your dataset (this is not a
rebuke, just an advise that will help you to get better an faster
responses from the list). BTW, given the sample size you mention (over
4000) be patient, this code performs two nested loops, and the time
needed increases exponentially with sample size.

MATRIX will give an error message if missing data are present on either
column. This modification will make the code work OK:

* Your sample dataset (modified to add some missing data to test the
code) *.
DATA LIST LIST/ColA ColB (2 F8).
BEGIN DATA
12            11
13            14
14            12
11            10
-99           15
10            -99
END DATA.
RECODE ALL (-99=SYSMIS)  .
LIST.

* Code assumes SPSS 14 or newer is used *.
DATASET NAME Data.
DATASET DECLARE APresentInB WINDOW=MINIMIZED.
PRESERVE.
SET MXLOOPS=5000. /* If sample size GT 5000, change MXLOOPS *.

* Code modified to accept missing data *.
TEMPORARY.
LIST.
MATRIX.
GET ColA/VAR=ColA /MISSING=ACCEPT /SYSMIS=-99.
GET ColB/VAR=ColB /MISSING=ACCEPT /SYSMIS=-99.
COMPUTE n=NROW(ColA).
COMPUTE Present=MAKE(n,1,0).
LOOP i=1 TO n.
. LOOP j=1 TO n.
.  DO IF ColA(i) EQ ColB(j).
.   COMPUTE Present(i)=1.
.  END IF.
. END LOOP.
END LOOP.
SAVE Present /OUTFILE=APresentInB.
END MATRIX.
RESTORE.
MATCH FILES
/FILE=*
/FILE='APresentInB'.
DATASET CLOSE APresentInB.
VALUE LABEL COL1 0'Absent in ColB' 1'Present in ColB'.
* Cleaning missing data in result column *.
IF MISSING(ColA) COL1=$sysmis.
FORMAT COL1(F8).
LIST.

Now it will work.

Marta

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD
Reply | Threaded
Open this post in threaded view
|

Re: arrays in SPSS

Peck, Jon
Here's a Python way to do this.  It takes two passes but is linear in running time.  The code assumes that sysmis values should never match.  The result is a new variable, present, with values of 0 or 1.  (Logically, the result should perhaps be sysmis if ColA is sysmis.  That could be easily accommodated, but it probably doesn’t matter here.)

This code will work with SPSS 15 or later (and maybe 14.0.1).

HTH,
Jon Peck

* create some data.
DATA LIST LIST/ColA ColB (2 F8).
BEGIN DATA
12            11
13            14
14            12
11            10
-99           15
10            -99
END DATA.
RECODE ALL (-99=SYSMIS)  .
exec.

BEGIN PROGRAM.
import spss, spssdata
caseset = set()
curs = spssdata.Spssdata(accessType="w")
for case in curs:
    if not case.ColB is None:
        caseset.add(case.ColB)
curs.restart()

curs.append("present")
curs.commitdict()
for case in curs:
    curs.casevalues([case.ColA in caseset])
curs.CClose()
END PROGRAM.




-----Original Message-----
From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of Marta García-Granero
Sent: Tuesday, August 19, 2008 9:42 AM
To: [hidden email]
Subject: Re: [SPSSX-L] arrays in SPSS

sowmya sankar wrote:
> One question: what would happen if there are missing values? For
> example, in my data, column A has 4731 rows and column B has 4861
> rows. Would the same code run?
As a general rule, a complete description of your dataset and your goal
will save a lot of time. You described your goal perfectly, but omitted
some important information concerning your dataset (this is not a
rebuke, just an advise that will help you to get better an faster
responses from the list). BTW, given the sample size you mention (over
4000) be patient, this code performs two nested loops, and the time
needed increases exponentially with sample size.

MATRIX will give an error message if missing data are present on either
column. This modification will make the code work OK:

* Your sample dataset (modified to add some missing data to test the
code) *.
DATA LIST LIST/ColA ColB (2 F8).
BEGIN DATA
12            11
13            14
14            12
11            10
-99           15
10            -99
END DATA.
RECODE ALL (-99=SYSMIS)  .
LIST.

* Code assumes SPSS 14 or newer is used *.
DATASET NAME Data.
DATASET DECLARE APresentInB WINDOW=MINIMIZED.
PRESERVE.
SET MXLOOPS=5000. /* If sample size GT 5000, change MXLOOPS *.

* Code modified to accept missing data *.
TEMPORARY.
LIST.
MATRIX.
GET ColA/VAR=ColA /MISSING=ACCEPT /SYSMIS=-99.
GET ColB/VAR=ColB /MISSING=ACCEPT /SYSMIS=-99.
COMPUTE n=NROW(ColA).
COMPUTE Present=MAKE(n,1,0).
LOOP i=1 TO n.
. LOOP j=1 TO n.
.  DO IF ColA(i) EQ ColB(j).
.   COMPUTE Present(i)=1.
.  END IF.
. END LOOP.
END LOOP.
SAVE Present /OUTFILE=APresentInB.
END MATRIX.
RESTORE.
MATCH FILES
/FILE=*
/FILE='APresentInB'.
DATASET CLOSE APresentInB.
VALUE LABEL COL1 0'Absent in ColB' 1'Present in ColB'.
* Cleaning missing data in result column *.
IF MISSING(ColA) COL1=$sysmis.
FORMAT COL1(F8).
LIST.

Now it will work.

Marta

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD
Reply | Threaded
Open this post in threaded view
|

FA Eg value

Manmit.Shrimali
Team:

When running FA for data reduction, I have always used eighenvalue to determine number of components along with screen plot. However, I realize, through posts, that this approach is facing much criticism. I will really appreciate if anyone can share any documents or explanation that explains in detail why relying on typical method (retaining factors near to eigenvalue of 1) is not the best approach?

Thanks,
MS



This e-mail and any files transmitted with it are for the sole use of the intended recipient(s) and may contain confidential and privileged information.
If you are not the intended recipient, please contact the sender by reply e-mail and destroy all copies of the original message.
Any unauthorised review, use, disclosure, dissemination, forwarding, printing or copying of this email or any action taken in reliance on this e-mail is strictly
prohibited and may be unlawful.

====================To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD
Reply | Threaded
Open this post in threaded view
|

Re: arrays in SPSS

Richard Ristow
In reply to this post by sowmya sankar
At 09:59 AM 8/19/2008, sowmya sankar wrote:

>I have two columns and for each value in column A, I want to find if
>the same value exists in column B.
>Example:
|-----------------------------|---------------------------|
|Output Created               |19-AUG-2008 21:13:39       |
|-----------------------------|---------------------------|
[TestData]

CaseID Col_A Col_B

Alpha    12    11
Beta     13    14
Gamma    14    12
Delta    11    10
Number of cases read:  4    Number of cases listed:  4

(Grin) It's an interesting problem, not the least as an exercise in
the different tools that SPSS offers. Here's a solution in native
SPSS transformation language. It should be N*log(N) in running time,
dominated by the SORTs. (Jon, are you sure yours is linear? I think
N*log(N) should be theoretical best case.)

Like Marta's solution, this uses datasets, requiring SPSS 14+.
However, it could easily be changed to use disk scratch files
instead, for earlier versions.

DATASET COPY     Demo.
DATASET ACTIVATE Demo     WINDOW=FRONT.
*  Create a sorted lookup table of the column B values, .
*  renaming them "Column A".  You'll see why, later.    .

ADD FILES
   /FILE=Demo
   /RENAME=(Col_A Col_B = Dump Col_A)
   /KEEP  = Col_A.
DATASET NAME     LookUp   WINDOW=FRONT.

SORT CASES BY Col_A.
.  /**/  LIST  /*-*/.
List
|-----------------------------|---------------------------|
|Output Created               |19-AUG-2008 21:13:41       |
|-----------------------------|---------------------------|
[LookUp]
Col_A
   10
   11
   12
   14
Number of cases read:  4    Number of cases listed:  4


DATASET ACTIVATE Demo     WINDOW=FRONT.
*  The kludgiest feature of this solution is variable   .
*  CaseSeq#, which is necessary to restore the original .
*  case order after the sort and lookup:                .
NUMERIC CaseSeq# (F4).
COMPUTE CaseSeq# = $CASENUM.

*  Now, it's easy:                                      .

SORT CASES BY Col_A.
MATCH FILES
   /FILE=*
   /FILE=LookUp /IN=IsInColB
   /BY   Col_A.
FORMATS IsInColB (F2).

*  Drop values that occur only in column B:             .
SELECT IF NOT MISSING(CaseSeq#).
*  Restore original order of cases in file:             .
SORT CASES BY CaseSeq#.
LIST.
List
|-----------------------------|---------------------------|
|Output Created               |19-AUG-2008 21:13:42       |
|-----------------------------|---------------------------|
[Demo]
CaseID Col_A Col_B CaseSeq# IsInColB

Alpha    12    11       1       1
Beta     13    14       2       0
Gamma    14    12       3       1
Delta    11    10       4       1

Number of cases read:  4    Number of cases listed:  4
=============================
APPENDIX: Test data, and code
=============================
DATA LIST LIST/
    CaseID    Col_A         Col_B
   (A6,        F3,           F3).
BEGIN DATA
    Alpha      12            11
    Beta       13            14
    Gamma      14            12
    Delta      11            10
END DATA.
DATASET NAME     TestData WINDOW=FRONT.
LIST.

DATASET COPY     Demo.
DATASET ACTIVATE Demo     WINDOW=FRONT.

*  Create a sorted lookup table of the column B values, .
*  renaming them "Column A".  You'll see why, later.    .

ADD FILES
   /FILE=Demo
   /RENAME=(Col_A Col_B = Dump Col_A)
   /KEEP  = Col_A.
DATASET NAME     LookUp   WINDOW=FRONT.
SORT CASES BY Col_A.
.  /**/  LIST  /*-*/.


DATASET ACTIVATE Demo     WINDOW=FRONT.
*  The kludgiest feature of this solution is variable   .
*  CaseSeq#, which is necessary to restore the original .
*  case order after the sort and lookup:                .
NUMERIC CaseSeq# (F4).
COMPUTE CaseSeq# = $CASENUM.

*  Now, it's easy:                                      .

SORT CASES BY Col_A.
MATCH FILES
   /FILE=*
   /FILE=LookUp /IN=IsInColB
   /BY   Col_A.
FORMATS IsInColB (F2).

*  Drop values that occur only in column B:             .
SELECT IF NOT MISSING(CaseSeq#).
*  Restore original order of cases in file:             .
SORT CASES BY CaseSeq#.
LIST.

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD
Reply | Threaded
Open this post in threaded view
|

Re: arrays in SPSS

Richard Ristow
In reply to this post by sowmya sankar
Whoops! Revision. At 09:59 AM 8/19/2008, sowmya sankar wrote:

>For each value in column A, I want to find if the same value exists
>in column B.

The previous solution won't work if the same value occurs more than
once in either column A or column B. Here, with revised code and test data:

DATASET COPY     Demo.
DATASET ACTIVATE Demo     WINDOW=FRONT.
*  Create a sorted lookup table of the UNIQUE column B  .
*  values, named as if they're from column A:           .

ADD FILES
   /FILE=Demo
   /RENAME=(Col_A Col_B = Dump Col_A)
   /KEEP  = Col_A.
DATASET NAME     LookUp   WINDOW=FRONT.

SORT CASES BY Col_A.
*  Drop duplicates from the list:                       .
SELECT IF  MISSING(LAG(Col_A))
        OR Col_A NE LAG(Col_A).
.  /**/  LIST  /*-*/.

List
|-----------------------------|---------------------------|
|Output Created               |19-AUG-2008 21:54:07       |
|-----------------------------|---------------------------|
[LookUp]
Col_A

   10
   11
   12
   14
   15

Number of cases read:  5    Number of cases listed:  5


DATASET ACTIVATE Demo     WINDOW=FRONT.
*  The kludgiest feature of this solution is variable   .
*  CaseSeq#, which is necessary to restore the original .
*  case order after the sort and lookup:                .
NUMERIC CaseSeq# (F4).
COMPUTE CaseSeq# = $CASENUM.

*  Now, it's easy:                                      .

SORT CASES BY Col_A.
*  Revision: Use /TABLE= instead of /FILE= for the      .
*  lookup table. (So, it is no longer necessary to      .
*  explicitly drop values that occur only in Column B.  .
*  Howeve, it's mainly to handle values that occur more .
*  than once in column A.)                              .
MATCH FILES
   /FILE=*
   /TABLE=LookUp /IN=IsInColB
   /BY   Col_A.
FORMATS IsInColB (F2).

*  Restore original order of cases in file:             .
SORT CASES BY CaseSeq#.
LIST.

List
|-----------------------------|---------------------------|
|Output Created               |19-AUG-2008 21:54:08       |
|-----------------------------|---------------------------|
[Demo]
CaseID   Col_A Col_B CaseSeq# IsInColB

Alpha      12    11       1       1
Beta       13    14       2       0
Gamma      14    12       3       1
Delta      11    10       4       1
Epsilon    12    15       5       1
Zeta        9    11       6       0

Number of cases read:  6    Number of cases listed:  6
=============================
APPENDIX: Test data, and code
=============================
DATA LIST LIST/
    CaseID    Col_A         Col_B
   (A8,        F3,           F3).
BEGIN DATA
    Alpha      12            11
    Beta       13            14
    Gamma      14            12
    Delta      11            10
    Epsilon    12            15
    Zeta        9            11
END DATA.
DATASET NAME     TestData WINDOW=FRONT.
LIST.



DATASET COPY     Demo.
DATASET ACTIVATE Demo     WINDOW=FRONT.

*  Create a sorted lookup table of the UNIQUE column B  .
*  values, named as if they're from column A:           .

ADD FILES
   /FILE=Demo
   /RENAME=(Col_A Col_B = Dump Col_A)
   /KEEP  = Col_A.
DATASET NAME     LookUp   WINDOW=FRONT.
SORT CASES BY Col_A.
*  Drop duplicates from the list:                       .
SELECT IF  MISSING(LAG(Col_A))
        OR Col_A NE LAG(Col_A).
.  /**/  LIST  /*-*/.


DATASET ACTIVATE Demo     WINDOW=FRONT.
*  The kludgiest feature of this solution is variable   .
*  CaseSeq#, which is necessary to restore the original .
*  case order after the sort and lookup:                .
NUMERIC CaseSeq# (F4).
COMPUTE CaseSeq# = $CASENUM.

*  Now, it's easy:                                      .

SORT CASES BY Col_A.
*  Revision: Use /TABLE= instead of /FILE= for the      .
*  lookup table. (So, it is no longer necessary to      .
*  explicitly drop values that occur only in Column B.  .
*  Howeve, it's mainly to handle values that occur more .
*  than once in column A.)                              .
MATCH FILES
   /FILE=*
   /TABLE=LookUp /IN=IsInColB
   /BY   Col_A.
FORMATS IsInColB (F2).

*  Restore original order of cases in file:             .
SORT CASES BY CaseSeq#.
LIST.

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD
Reply | Threaded
Open this post in threaded view
|

Re: arrays in SPSS

Peck, Jon
In reply to this post by Richard Ristow
I think mine is linear: two data passes - O(N).  Set insertion and membership text are O(1).  (They use a hash table, which is constant time or within an epsilon or so.)  No sorting.

Of course, that doesn't mean it's always faster, since there is overhead passing the data to Python, but for large enough N it will always win.

-Jon

-----Original Message-----
From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of Richard Ristow
Sent: Tuesday, August 19, 2008 7:21 PM
To: [hidden email]
Subject: Re: [SPSSX-L] arrays in SPSS
[snip]
At 09:59 AM 8/19/2008, sowmya sankar wrote:

>I have two columns and for each value in column A, I want to find if
>the same value exists in column B.
>Example:
|-----------------------------|---------------------------|
|Output Created               |19-AUG-2008 21:13:39       |
|-----------------------------|---------------------------|
[TestData]

CaseID Col_A Col_B

Alpha    12    11
Beta     13    14
Gamma    14    12
Delta    11    10
Number of cases read:  4    Number of cases listed:  4

(Grin) It's an interesting problem, not the least as an exercise in
the different tools that SPSS offers. Here's a solution in native
SPSS transformation language. It should be N*log(N) in running time,
dominated by the SORTs. (Jon, are you sure yours is linear? I think
N*log(N) should be theoretical best case.)

[>>>Peck, Jon] [snip]

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD