Selecting Cases based on Shared ID or Python to Recode Based on Previous Cases

classic Classic list List threaded Threaded
4 messages Options
Reply | Threaded
Open this post in threaded view
|

Selecting Cases based on Shared ID or Python to Recode Based on Previous Cases

cco14
Hello All --

I’m working with a long form data set of cases representing items that share
a Transaction ID (trans_id) (a single order). I’m interested in keeping a
specific set of items, and the other items from those orders. I created a
variable (smk_item) for whether a case is for a specific item, but I am not
sure how to select the cases that share that ID. There are over 100k cases
and the number of items in each order varies.

*If there is a way to create a list of transaction ID’s where smk_item = 1
and use SELECT IF to only keep those ID’s, that would be ideal, but I
haven’t found it.*

Instead, I used MATCH FILES after sorting by trans_id and smk_item so that
the number of items ordered was counted and smk_items were first. I removed
cases where there was just one item ordered and it wasn't a smk_item
(MatchSequence = 0) and smk_item = 0.  

So my data set still has (1) single item-orders that were smk_item =1. (2)
multi-item orders that include at least one smk_item, and (3) multi-item
orders that include no smk_items. It’s the third group I’d like to remove.

*I created a variable (“keep”):*
DO IF(smk_item = 1).
Compute keep = 1.
Else IF(smk_item = 0 AND num_items = 1).
Compute keep = 0. *This is the first item ordered and it’s not a smk_item,
so it shouldn’t be kept.
Else IF (PrimaryLast = 1 AND smk_item = 0).
Compute keep = 8. *This is the last item of the order and NOT a smk_item.
Else IF (smk_item = 0 AND num_items > 1 AND PrimaryLast = 0).
Compute keep = 9. *Item is not a smk_item, and not the first or last item in
the order.
End IF.
Execute.

I then created a filter variable (“python”) to populate using a python
program. It keeps the 1s and 0s from “keep”).
DO IF (keep =1).
Compute python = 1.
Else IF (keep = 0).
Compute python = 0.
End IF.
Execute.

Basically my data now looks like this:
v1 v2       v3 v4 v5 v6 v7
1 00001 0 1 0 0 0
2 00001 0 2 1 8
3 00002 1 1 0 1 1
4 00002 0 2 0 9
5 00002 0 3 1 8
6 00003 0 1 0 0 0
7 00003 0 2 1 8
8 00004 1 1 0 1 1
9 00004 0 2 1 0

v1 = case; v2 = trans_id; v3 = smk_items; v4=num_items; v5=PrimaryLast; v6 =
keep; v7 = python

What I want is a python program than fills in the missing cases in the
“python” variable with either 0s or 1s based on the value of the first
filled in case available above it. In this example, I want 1s in the cases
with Trans_ID 0002 and 0004.

use all.
exe.
BEGIN PROGRAM.
import spss
with spss.DataStep():
    datasetObj = spss.Dataset(name="DataSet1")
    varK = datasetObj.varlist['keep'].index
    varPy = datasetObj.varlist['python'].index
    varNum = datasetObj.varlist['num_items']
    last = 8
    for i in range(len(datasetObj.cases)):
        if datasetObj.cases[i, varK] == last:
            max = datasetObj.cases[i, varNum]
            first = (i - max) + 1
            sec = first + 1
            datasetObj.cases[second:i, varPy] = datasetObj.cases[first,
varK]
END PROGRAM.

The idea of this program is for a case when keep = 8 (the last item), the
position of the first case of that order (num_items = 1), the position of
the second case of that order (num_items = 2), and make cases second through
the last equal to the value in keep when num_items = 1.
When I run this program, the python variable is unchanged. Any help at all
with this would be appreciated -- I’ve spent the last two days trying to
come up with a systematic way to do this. Sorry for the length of this
question and if I've missed anything.

Thanks!
 Catherine




--
Sent from: http://spssx-discussion.1045642.n5.nabble.com/

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD
Reply | Threaded
Open this post in threaded view
|

Re: Selecting Cases based on Shared ID or Python to Recode Based on Previous Cases

Jon Peck
It would be possible to build a select if with the requisite case numbers, but I'll just address the last set of code above.  Assuming that the actual variable names are as above below the data rather than v1 to v7, the Python code should be as below.  The Dataset classes are a bit difficult to work with, and there were several problems.

with spss.DataStep():
    datasetObj = spss.Dataset(name="DataSet1")
    varK = datasetObj.varlist['keep'].index
    varPy = datasetObj.varlist['python'].index
    varNum = datasetObj.varlist['num_items'].index      
    last = 8
    for i in range(len(datasetObj.cases)):
        if datasetObj.cases[i, varK][0] == last:    
            maxval = datasetObj.cases[i, varNum][0]
            first = int((i - maxval) + 1)
            sec = first + 1
            for j in range(sec, i):
                datasetObj.cases[j, varPy] = datasetObj.cases[first, varK]

This still isn't quite right - there is an indexing error, but I'll let you work that out.

Things to note:
The datasetObj.cases call returns a list even if it is just one element, so the == last line test was never true, since it was comparing a list to a number
The "second" variable was undefined.
The indexes into the cases structure have to be integers, but the values retrieved from the active dataset were floats.

I hope that moves this along.

On Thu, Apr 23, 2020 at 2:52 PM cco14 <[hidden email]> wrote:
Hello All --

I’m working with a long form data set of cases representing items that share
a Transaction ID (trans_id) (a single order). I’m interested in keeping a
specific set of items, and the other items from those orders. I created a
variable (smk_item) for whether a case is for a specific item, but I am not
sure how to select the cases that share that ID. There are over 100k cases
and the number of items in each order varies.

*If there is a way to create a list of transaction ID’s where smk_item = 1
and use SELECT IF to only keep those ID’s, that would be ideal, but I
haven’t found it.*

Instead, I used MATCH FILES after sorting by trans_id and smk_item so that
the number of items ordered was counted and smk_items were first. I removed
cases where there was just one item ordered and it wasn't a smk_item
(MatchSequence = 0) and smk_item = 0. 

So my data set still has (1) single item-orders that were smk_item =1. (2)
multi-item orders that include at least one smk_item, and (3) multi-item
orders that include no smk_items. It’s the third group I’d like to remove.

*I created a variable (“keep”):*
DO IF(smk_item = 1).
Compute keep = 1.
Else IF(smk_item = 0 AND num_items = 1).
Compute keep = 0. *This is the first item ordered and it’s not a smk_item,
so it shouldn’t be kept.
Else IF (PrimaryLast = 1 AND smk_item = 0).
Compute keep = 8. *This is the last item of the order and NOT a smk_item.
Else IF (smk_item = 0 AND num_items > 1 AND PrimaryLast = 0).
Compute keep = 9. *Item is not a smk_item, and not the first or last item in
the order.
End IF.
Execute.

I then created a filter variable (“python”) to populate using a python
program. It keeps the 1s and 0s from “keep”).
DO IF (keep =1).
Compute python = 1.
Else IF (keep = 0).
Compute python = 0.
End IF.
Execute.

Basically my data now looks like this:
v1      v2             v3       v4      v5      v6      v7
1       00001   0       1       0       0       0
2       00001   0       2       1       8       
3       00002   1       1       0       1       1
4       00002   0       2       0       9       
5       00002   0       3       1       8       
6       00003   0       1       0       0       0
7       00003   0       2       1       8       
8       00004   1       1       0       1       1
9       00004   0       2       1       0       

v1 = case; v2 = trans_id; v3 = smk_items; v4=num_items; v5=PrimaryLast; v6 =
keep; v7 = python

What I want is a python program than fills in the missing cases in the
“python” variable with either 0s or 1s based on the value of the first
filled in case available above it. In this example, I want 1s in the cases
with Trans_ID 0002 and 0004.

use all.
exe.
BEGIN PROGRAM.
import spss
with spss.DataStep():
    datasetObj = spss.Dataset(name="DataSet1")
    varK = datasetObj.varlist['keep'].index
    varPy = datasetObj.varlist['python'].index
    varNum = datasetObj.varlist['num_items']
    last = 8
    for i in range(len(datasetObj.cases)):
        if datasetObj.cases[i, varK] == last:
            max = datasetObj.cases[i, varNum]
            first = (i - max) + 1
            sec = first + 1
            datasetObj.cases[second:i, varPy] = datasetObj.cases[first,
varK]
END PROGRAM.

The idea of this program is for a case when keep = 8 (the last item), the
position of the first case of that order (num_items = 1), the position of
the second case of that order (num_items = 2), and make cases second through
the last equal to the value in keep when num_items = 1.
When I run this program, the python variable is unchanged. Any help at all
with this would be appreciated -- I’ve spent the last two days trying to
come up with a systematic way to do this. Sorry for the length of this
question and if I've missed anything.

Thanks!
 Catherine




--
Sent from: http://spssx-discussion.1045642.n5.nabble.com/

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD


--
Jon K Peck
[hidden email]

===================== To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD
Reply | Threaded
Open this post in threaded view
|

Re: Selecting Cases based on Shared ID or Python to Recode Based on Previous Cases

Rich Ulrich
In reply to this post by cco14
I don't see the need for Python.
*If there is a way to create a list of transaction ID’s where smk_item = 1
and use SELECT IF to only keep those ID’s, that would be ideal, but I
haven’t found it.*

For that question, why not do this:
SELECT IF  smk_item=1.
SAVE OUTFILE= "smk_ids".

Then you can MATCH the table, and the In-file indicator serves as Keep.

But there is the easier option of using AGGREGATE and adding
MAX(keep)  [if it is 0/1] to every record.

--
Rich Ulrich



From: SPSSX(r) Discussion <[hidden email]> on behalf of cco14 <[hidden email]>
Sent: Thursday, April 23, 2020 4:52 PM

<... snipped >
===================== To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD
Reply | Threaded
Open this post in threaded view
|

Re: Selecting Cases based on Shared ID or Python to Recode Based on Previous Cases

Bruce Weaver
Administrator
Like Rich, I don't see any reason to use Python to achieve what you've
described.  For the sample data you provided, I think it is a simple matter
of using the LAG function to pull down the value of variable python from the
previous row.  The following syntax shows my understanding of what you have
done, and what you want to do.  Forgive me if I've misunderstood.  HTH.


* I think the original variables stopped at smk_item.
* The others were computed using those variables.

DATA LIST LIST / case (F5.0) trans_id (N5) smk_item (F5.0) .
BEGIN DATA
1 00001 0
2 00001 0
3 00002 1
4 00002 0
5 00002 0
6 00003 0
7 00003 0
8 00004 1
9 00004 0
END DATA.

* Use MATCH FILES to generate variables RecNum and PrimaryLast.
* I'm using RecNum in place of the OP's variable num_items.
MATCH FILES FILE=* /FIRST=RecNum /LAST=PrimaryLast /BY trans_id.
IF RecNum EQ 0 RecNum = LAG(RecNum) + 1.
LIST.

* Modification of OP's code to compute variable KEEP,
* bearing in mind that smk_item has only two possible values, 0 and 1.
DO IF smk_item EQ 1.
- COMPUTE keep = 1.
ELSE IF RecNum EQ 1. /* smk_item = 0 for this and all subsequent conditions.
- COMPUTE keep = 0.
ELSE IF PrimaryLast.
- COMPUTE keep = 8.
ELSE IF RecNum GT 1 AND NOT PrimaryLast.
- COMPUTE keep = 9.
END IF.
FORMATS keep(F1).
LIST.

* Modification of OP's code to compute variable python.
IF ANY(keep,0,1) python = keep.
FORMATS python(F1).
LIST.

*
"What I want is a python program than fills in the missing cases in the
“python” variable with either 0s or 1s based on the value of the first
filled in case available above it. In this example, I want 1s in the cases
with Trans_ID 0002 and 0004."
.
* Where variable python is missing, fill in previous value.
IF MISSING(python) python = LAG(python).
LIST.

OUTPUT from the final LIST command:

 case trans_id smk_item RecNum PrimaryLast keep python
 
    1   00001        0     1        0        0     0
    2   00001        0     2        1        8     0
    3   00002        1     1        0        1     1
    4   00002        0     2        0        9     1
    5   00002        0     3        1        8     1
    6   00003        0     1        0        0     0
    7   00003        0     2        1        8     0
    8   00004        1     1        0        1     1
    9   00004        0     2        1        8     1
 
 
Number of cases read:  9    Number of cases listed:  9



Rich Ulrich wrote

> I don't see the need for Python.
> *If there is a way to create a list of transaction ID’s where smk_item = 1
> and use SELECT IF to only keep those ID’s, that would be ideal, but I
> haven’t found it.*
>
> For that question, why not do this:
> SELECT IF  smk_item=1.
> SAVE OUTFILE= "smk_ids".
>
> Then you can MATCH the table, and the In-file indicator serves as Keep.
>
> But there is the easier option of using AGGREGATE and adding
> MAX(keep)  [if it is 0/1] to every record.
>
> --
> Rich Ulrich
>
>
> ________________________________
> From: SPSSX(r) Discussion &lt;

> SPSSX-L@.UGA

> &gt; on behalf of cco14 &lt;

> cco14@

> &gt;
> Sent: Thursday, April 23, 2020 4:52 PM
>
> <... snipped >
>
> =====================
> To manage your subscription to SPSSX-L, send a message to

> LISTSERV@.UGA

>  (not to SPSSX-L), with no body text except the
> command. To leave the list, send the command
> SIGNOFF SPSSX-L
> For a list of commands to manage subscriptions, send the command
> INFO REFCARD





-----
--
Bruce Weaver
[hidden email]
http://sites.google.com/a/lakeheadu.ca/bweaver/

"When all else fails, RTFM."

NOTE: My Hotmail account is not monitored regularly.
To send me an e-mail, please use the address shown above.

--
Sent from: http://spssx-discussion.1045642.n5.nabble.com/

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD
--
Bruce Weaver
bweaver@lakeheadu.ca
http://sites.google.com/a/lakeheadu.ca/bweaver/

"When all else fails, RTFM."

PLEASE NOTE THE FOLLOWING: 
1. My Hotmail account is not monitored regularly. To send me an e-mail, please use the address shown above.
2. The SPSSX Discussion forum on Nabble is no longer linked to the SPSSX-L listserv administered by UGA (https://listserv.uga.edu/).