Hello All --
I’m working with a long form data set of cases representing items that share a Transaction ID (trans_id) (a single order). I’m interested in keeping a specific set of items, and the other items from those orders. I created a variable (smk_item) for whether a case is for a specific item, but I am not sure how to select the cases that share that ID. There are over 100k cases and the number of items in each order varies. *If there is a way to create a list of transaction ID’s where smk_item = 1 and use SELECT IF to only keep those ID’s, that would be ideal, but I haven’t found it.* Instead, I used MATCH FILES after sorting by trans_id and smk_item so that the number of items ordered was counted and smk_items were first. I removed cases where there was just one item ordered and it wasn't a smk_item (MatchSequence = 0) and smk_item = 0. So my data set still has (1) single item-orders that were smk_item =1. (2) multi-item orders that include at least one smk_item, and (3) multi-item orders that include no smk_items. It’s the third group I’d like to remove. *I created a variable (“keep”):* DO IF(smk_item = 1). Compute keep = 1. Else IF(smk_item = 0 AND num_items = 1). Compute keep = 0. *This is the first item ordered and it’s not a smk_item, so it shouldn’t be kept. Else IF (PrimaryLast = 1 AND smk_item = 0). Compute keep = 8. *This is the last item of the order and NOT a smk_item. Else IF (smk_item = 0 AND num_items > 1 AND PrimaryLast = 0). Compute keep = 9. *Item is not a smk_item, and not the first or last item in the order. End IF. Execute. I then created a filter variable (“python”) to populate using a python program. It keeps the 1s and 0s from “keep”). DO IF (keep =1). Compute python = 1. Else IF (keep = 0). Compute python = 0. End IF. Execute. Basically my data now looks like this: v1 v2 v3 v4 v5 v6 v7 1 00001 0 1 0 0 0 2 00001 0 2 1 8 3 00002 1 1 0 1 1 4 00002 0 2 0 9 5 00002 0 3 1 8 6 00003 0 1 0 0 0 7 00003 0 2 1 8 8 00004 1 1 0 1 1 9 00004 0 2 1 0 v1 = case; v2 = trans_id; v3 = smk_items; v4=num_items; v5=PrimaryLast; v6 = keep; v7 = python What I want is a python program than fills in the missing cases in the “python” variable with either 0s or 1s based on the value of the first filled in case available above it. In this example, I want 1s in the cases with Trans_ID 0002 and 0004. use all. exe. BEGIN PROGRAM. import spss with spss.DataStep(): datasetObj = spss.Dataset(name="DataSet1") varK = datasetObj.varlist['keep'].index varPy = datasetObj.varlist['python'].index varNum = datasetObj.varlist['num_items'] last = 8 for i in range(len(datasetObj.cases)): if datasetObj.cases[i, varK] == last: max = datasetObj.cases[i, varNum] first = (i - max) + 1 sec = first + 1 datasetObj.cases[second:i, varPy] = datasetObj.cases[first, varK] END PROGRAM. The idea of this program is for a case when keep = 8 (the last item), the position of the first case of that order (num_items = 1), the position of the second case of that order (num_items = 2), and make cases second through the last equal to the value in keep when num_items = 1. When I run this program, the python variable is unchanged. Any help at all with this would be appreciated -- I’ve spent the last two days trying to come up with a systematic way to do this. Sorry for the length of this question and if I've missed anything. Thanks! Catherine -- Sent from: http://spssx-discussion.1045642.n5.nabble.com/ ===================== To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD |
It would be possible to build a select if with the requisite case numbers, but I'll just address the last set of code above. Assuming that the actual variable names are as above below the data rather than v1 to v7, the Python code should be as below. The Dataset classes are a bit difficult to work with, and there were several problems. with spss.DataStep(): datasetObj = spss.Dataset(name="DataSet1") varK = datasetObj.varlist['keep'].index varPy = datasetObj.varlist['python'].index varNum = datasetObj.varlist['num_items'].index last = 8 for i in range(len(datasetObj.cases)): if datasetObj.cases[i, varK][0] == last: maxval = datasetObj.cases[i, varNum][0] first = int((i - maxval) + 1) sec = first + 1 for j in range(sec, i): datasetObj.cases[j, varPy] = datasetObj.cases[first, varK] This still isn't quite right - there is an indexing error, but I'll let you work that out. Things to note: The datasetObj.cases call returns a list even if it is just one element, so the == last line test was never true, since it was comparing a list to a number The "second" variable was undefined. The indexes into the cases structure have to be integers, but the values retrieved from the active dataset were floats. I hope that moves this along. On Thu, Apr 23, 2020 at 2:52 PM cco14 <[hidden email]> wrote: Hello All -- |
In reply to this post by cco14
I don't see the need for Python.
*If there is a way to create a list of transaction ID’s where smk_item = 1
and use SELECT IF to only keep those ID’s, that would be ideal, but I haven’t found it.*
For that question, why not do this:
SELECT IF smk_item=1.
SAVE OUTFILE= "smk_ids".
Then you can MATCH the table, and the In-file indicator serves as Keep.
But there is the easier option of using AGGREGATE and adding
MAX(keep) [if it is 0/1] to every record.
--
Rich Ulrich
From: SPSSX(r) Discussion <[hidden email]> on behalf of cco14 <[hidden email]>
Sent: Thursday, April 23, 2020 4:52 PM <... snipped >
=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD
|
Administrator
|
Like Rich, I don't see any reason to use Python to achieve what you've
described. For the sample data you provided, I think it is a simple matter of using the LAG function to pull down the value of variable python from the previous row. The following syntax shows my understanding of what you have done, and what you want to do. Forgive me if I've misunderstood. HTH. * I think the original variables stopped at smk_item. * The others were computed using those variables. DATA LIST LIST / case (F5.0) trans_id (N5) smk_item (F5.0) . BEGIN DATA 1 00001 0 2 00001 0 3 00002 1 4 00002 0 5 00002 0 6 00003 0 7 00003 0 8 00004 1 9 00004 0 END DATA. * Use MATCH FILES to generate variables RecNum and PrimaryLast. * I'm using RecNum in place of the OP's variable num_items. MATCH FILES FILE=* /FIRST=RecNum /LAST=PrimaryLast /BY trans_id. IF RecNum EQ 0 RecNum = LAG(RecNum) + 1. LIST. * Modification of OP's code to compute variable KEEP, * bearing in mind that smk_item has only two possible values, 0 and 1. DO IF smk_item EQ 1. - COMPUTE keep = 1. ELSE IF RecNum EQ 1. /* smk_item = 0 for this and all subsequent conditions. - COMPUTE keep = 0. ELSE IF PrimaryLast. - COMPUTE keep = 8. ELSE IF RecNum GT 1 AND NOT PrimaryLast. - COMPUTE keep = 9. END IF. FORMATS keep(F1). LIST. * Modification of OP's code to compute variable python. IF ANY(keep,0,1) python = keep. FORMATS python(F1). LIST. * "What I want is a python program than fills in the missing cases in the “python” variable with either 0s or 1s based on the value of the first filled in case available above it. In this example, I want 1s in the cases with Trans_ID 0002 and 0004." . * Where variable python is missing, fill in previous value. IF MISSING(python) python = LAG(python). LIST. OUTPUT from the final LIST command: case trans_id smk_item RecNum PrimaryLast keep python 1 00001 0 1 0 0 0 2 00001 0 2 1 8 0 3 00002 1 1 0 1 1 4 00002 0 2 0 9 1 5 00002 0 3 1 8 1 6 00003 0 1 0 0 0 7 00003 0 2 1 8 0 8 00004 1 1 0 1 1 9 00004 0 2 1 8 1 Number of cases read: 9 Number of cases listed: 9 Rich Ulrich wrote > I don't see the need for Python. > *If there is a way to create a list of transaction ID’s where smk_item = 1 > and use SELECT IF to only keep those ID’s, that would be ideal, but I > haven’t found it.* > > For that question, why not do this: > SELECT IF smk_item=1. > SAVE OUTFILE= "smk_ids". > > Then you can MATCH the table, and the In-file indicator serves as Keep. > > But there is the easier option of using AGGREGATE and adding > MAX(keep) [if it is 0/1] to every record. > > -- > Rich Ulrich > > > ________________________________ > From: SPSSX(r) Discussion < > SPSSX-L@.UGA > > on behalf of cco14 < > cco14@ > > > Sent: Thursday, April 23, 2020 4:52 PM > > <... snipped > > > ===================== > To manage your subscription to SPSSX-L, send a message to > LISTSERV@.UGA > (not to SPSSX-L), with no body text except the > command. To leave the list, send the command > SIGNOFF SPSSX-L > For a list of commands to manage subscriptions, send the command > INFO REFCARD ----- -- Bruce Weaver [hidden email] http://sites.google.com/a/lakeheadu.ca/bweaver/ "When all else fails, RTFM." NOTE: My Hotmail account is not monitored regularly. To send me an e-mail, please use the address shown above. -- Sent from: http://spssx-discussion.1045642.n5.nabble.com/ ===================== To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD
--
Bruce Weaver bweaver@lakeheadu.ca http://sites.google.com/a/lakeheadu.ca/bweaver/ "When all else fails, RTFM." PLEASE NOTE THE FOLLOWING: 1. My Hotmail account is not monitored regularly. To send me an e-mail, please use the address shown above. 2. The SPSSX Discussion forum on Nabble is no longer linked to the SPSSX-L listserv administered by UGA (https://listserv.uga.edu/). |
Free forum by Nabble | Edit this page |