SPSSX Discussion

Help with a Syntax File

Classic

List

Threaded

7 messages Options

Stella Vasquez

Help with a Syntax File

I have two files, first one tells me the juvenile KNum and the date they started and finished each stage.

The second file tells me every time the juvenile has a rule violation and the type of violation (A, B or C).

I have 532 juveniles and I would like to calculate how many rule violations each juvenile had in each particular stage.

For example:

Knum	PhaseNo	StartPhase	EndPhase	TotalA	TotalB	TotalC
33486	2	10/11/2018	12/6/2018	4	5	0
33486	3	12/6/2018	12/20/2018	3	0	0
33486	4	12/20/2018	1/3/2019	2	2	0

File 1

Knum	PhaseNo	StartPhase	EndPhase
33486	2	10/11/2018	12/6/2018
33486	3	12/6/2018	12/20/2018
33486	4	12/20/2018	1/3/2019
33496	4	10/2/2018	1/1/2019

File 2

Knum	ViolationDate	RuleA	RuleB	RuleC
33486	10/15/2018 18:00	0	1	0
33486	10/24/2018 12:21	1	0	0
33486	10/25/2018 17:42	0	1	0
33486	10/26/2018 10:26	1	0	0
33486	10/30/2018 9:29	0	1	0
33486	11/2/2018 14:09	1	0	0
33486	11/5/2018 10:38	1	0	0
33486	11/10/2018 17:48	0	1	0
33486	11/25/2018 8:37	0	1	0
33486	12/8/2018 12:45	1	0	0
33486	12/11/2018 18:40	1	0	0
33486	12/17/2018 20:29	1	0	0
33486	12/22/2018 8:41	1	0	0
33486	12/22/2018 8:47	0	1	0
33486	12/30/2018 7:12	1	0	0
33486	1/3/2019 8:19	0	1	0
33486	1/13/2019 4:46	1	0	0
33486	6/20/2019 18:04	0	1	0
33486	6/24/2019 19:40	1	0	0

Any help would be appreciated.

Stella Vasquez

AZ Department of Juveniles Corrections

1624 W. Adams Ave, 3^rd Floor

Phoenix, AZ 85007

(602) 542-2272 or [hidden email]

===================== To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD

Andy W

Re: Help with a Syntax File

I think Jon Peck has some examples of using a python VLOOKUP type function to
do these range merges that is more slick than this, but here is an example
of flattening file1 to merge to file2. Biggest issue here is the vector
statements have hard coded end #'s, so if that is variable can be a pain in
this file. Can solve that with python, or adding in more variables and doing
one long index instead of 3 separate ones. (Also not sure what happens here
if you have someone in File2 but not in File1, loop may throw an error. But
maybe that is a good thing.)

****************************************************************************************.
DATA LIST FREE / Knum (F6.0) PhaseNo (F1.0) StartPhase EndPhase (2ADATE10).
BEGIN DATA.
33486 2 10/11/2018 12/6/2018
33486 3 12/6/2018 12/20/2018
33486 4 12/20/2018 1/3/2019
33496 4 10/2/2018 1/1/2019
END DATA.
DATASET NAME File1.

DATA LIST FREE / Knum (F6.0) ViolationDate (ADATE10) ViolTime (TIME5) RuleA
RuleB RuleC (3F1.0).
BEGIN DATA.
33486 10/15/2018 18:00 0 1 0
33486 10/24/2018 12:21 1 0 0
33486 10/25/2018 17:42 0 1 0
33486 10/26/2018 10:26 1 0 0
33486 10/30/2018 9:29 0 1 0
33486 11/2/2018 14:09 1 0 0
33486 11/5/2018 10:38 1 0 0
33486 11/10/2018 17:48 0 1 0
33486 11/25/2018 8:37 0 1 0
33486 12/8/2018 12:45 1 0 0
33486 12/11/2018 18:40 1 0 0
33486 12/17/2018 20:29 1 0 0
33486 12/22/2018 8:41 1 0 0
33486 12/22/2018 8:47 0 1 0
33486 12/30/2018 7:12 1 0 0
33486 1/3/2019 8:19 0 1 0
33486 1/13/2019 4:46 1 0 0
33486 6/20/2019 18:04 0 1 0
33486 6/24/2019 19:40 1 0 0
END DATA.
DATASET NAME File2.
COMPUTE ViolationDate = ViolationDate + ViolTime.
FORMATS ViolationDate (DATETIME17).
EXECUTE.

DATASET ACTIVATE FILE1.
DATASET COPY WideF1.
DATASET ACTIVATE WideF1.
SORT CASES BY Knum PhaseNo.
CASESTOVARS /ID = Knum /SEPARATOR = "_" /GROUPBY = VARIABLE /COUNT TotPhase.

DATASET ACTIVATE FILE2.
SORT CASES BY Knum ViolationDate.
MATCH FILES FILE = *
/TABLE = 'WideF1'
/IN EndF1
/BY Knum.

*I can think of hacky ways to make this not PhasNo begin/end not hard coded.
*But this should be easier to grok.
COMPUTE PhaseNo = -1.
VECTOR P = PhaseNo_1 TO PhaseNo_3.
VECTOR B = StartPhase_1 TO StartPhase_3.
VECTOR E = EndPhase_1 TO EndPhase_3.
LOOP #i = 1 TO TotPhase.
IF ViolationDate >= B(#i) AND ViolationDate <= E(#i) PhaseNo = P(#i).
END LOOP.

*Now aggregate and merge back into File1.
DATASET DECLARE AggF2.
AGGREGATE OUTFILE=AggF2
/BREAK Knum PhaseNo
/TotalA = SUM(RuleA)
/TotalB = SUM(RuleB)
/TotalC = SUM(RuleC).

DATASET ACTIVATE File1.
MATCH FILES FILE = *
/TABLE = 'AggF2'
/IN AnyViol
/BY Knum PhaseNo.

RECODE TotalA TO TotalC (SYSMIS = 0).
FORMATS TotalA TO TotalC (F5.0).
EXECUTE.
*Can clean up all those files, but may want.
*To see what they look like to understand code.
*SELECT IF AnyViol = 1.
*DATASET CLOSE WideF1.
*DATASET CLOSE AggF2.
*DATASET CLOSE File2.
****************************************************************************************.

-----
Andy W
[hidden email]
http://andrewpwheeler.wordpress.com/
--
Sent from: http://spssx-discussion.1045642.n5.nabble.com/

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD

Andy W
apwheele@gmail.com
http://andrewpwheeler.wordpress.com/

Jon Peck

Re: Help with a Syntax File

In reply to this post by Stella Vasquez

Andrew has provided a solution using (a lot) of standard syntax. Here is a solution that uses the extension command STATS CARTPROD, which is available from the Extensions menu (but see caveat below).

STATS CARTPROD will create a large file that is subsequently pruned, so it make take a bit longer than you would expect.

There is a subtle issue with the phase date matching: the file1 dates do not have a time component, so the time is 00:00. I, therefore, ignored the time portion of the violation data in order to make the phase selection work. I read in the date and time as separate fields. Andrew's code adds in the time.

This code expects that there are no overlaps in the StartPhase and EndPhase values for a subject. If there are any, it would count a violation in the overlapping interval too many times.

The final output is a dataset named count that has the count by type for each Knum and phase. Note that subjects with zero violations will not appear. Change the get commands as needed, and note that it assumes that there is a directory named c:\temp.

Here is the code.

get file="r:\ibm\users\vasquez\file1.sav".
dataset name file1.
get file="r:\ibm\users\vasquez\file2.sav".
dataset name file2.
RENAME VARIABLES Knum = Knum2.

DATASET ACTIVATE file1.
STATS CARTPROD VAR1=Knum PhaseNo StartPhase EndPhase INPUT2=file2 VAR2=Knum2 ViolationDate RuleA
RuleB RuleC
/SAVE OUTFILE="c:\temp\merged.sav" DSNAME=merged.
dataset activate merged.
select if Knum eq Knum2.
select if StartPhase <= ViolationDate and ViolationDate <= EndPhase.
DATASET DECLARE counts.
AGGREGATE
/OUTFILE=counts
/BREAK=Knum PhaseNo
/RuleA_sum=SUM(RuleA)
/RuleB_sum=SUM(RuleB)
/RuleC_sum=SUM(RuleC)
/Total=N.

In order for this to work with your dataset, a minor fix is required to the STATS CARTPROD extension command. I can send you the updated file if you want to go this route. You would install it using the Extensions > Install Local Extension Bundle menu item. If your system is locked down, you might need Administator permission to install it.

On Fri, Mar 13, 2020 at 6:01 PM Stella Vasquez <[hidden email]> wrote:

I have two files, first one tells me the juvenile KNum and the date they started and finished each stage.

The second file tells me every time the juvenile has a rule violation and the type of violation (A, B or C).

I have 532 juveniles and I would like to calculate how many rule violations each juvenile had in each particular stage.

For example:

Knum PhaseNo StartPhase EndPhase TotalA TotalB TotalC

33486 2 10/11/2018 12/6/2018 4 5 0

33486 3 12/6/2018 12/20/2018 3 0 0

33486 4 12/20/2018 1/3/2019 2 2 0

File 1

Knum PhaseNo StartPhase EndPhase

33486 2 10/11/2018 12/6/2018

33486 3 12/6/2018 12/20/2018

33486 4 12/20/2018 1/3/2019

33496 4 10/2/2018 1/1/2019

File 2

Knum ViolationDate RuleA RuleB RuleC

33486 10/15/2018 18:00 0 1 0

33486 10/24/2018 12:21 1 0 0

33486 10/25/2018 17:42 0 1 0

33486 10/26/2018 10:26 1 0 0

33486 10/30/2018 9:29 0 1 0

33486 11/2/2018 14:09 1 0 0

33486 11/5/2018 10:38 1 0 0

33486 11/10/2018 17:48 0 1 0

33486 11/25/2018 8:37 0 1 0

33486 12/8/2018 12:45 1 0 0

33486 12/11/2018 18:40 1 0 0

33486 12/17/2018 20:29 1 0 0

33486 12/22/2018 8:41 1 0 0

33486 12/22/2018 8:47 0 1 0

33486 12/30/2018 7:12 1 0 0

33486 1/3/2019 8:19 0 1 0

33486 1/13/2019 4:46 1 0 0

33486 6/20/2019 18:04 0 1 0

33486 6/24/2019 19:40 1 0 0

Any help would be appreciated.

Stella Vasquez
AZ Department of Juveniles Corrections
1624 W. Adams Ave, 3^rd Floor
Phoenix, AZ 85007

(602) 542-2272 or [hidden email]
===================== To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD

Jon K Peck
[hidden email]

Andy W

Re: Help with a Syntax File

Haven't checked out the CART PROD -- looks something like an inner join?
(Could you do those with that STAR JOIN command?)

So maybe if you have access to the database of the original tables,
something in SQL like:

SELECT
File1.Knum AS Knum,
File1.PhaseNo AS PhaseNo,
SUM(File2.RuleA) AS TotalA,
SUM(File2.RuleB) AS TotalB,
SUM(File2.RuleC) AS TotalC
FROM File2 INNER JOIN File1
ON File2.Knum = File1.Knum
WHERE File2.ViolationDate BETWEEN File1.StartPhase TO File1.EndPhase
GROUPBY File1.Knum File1.PhaseNo

Although your DBA might not be happy depending on how big the files are!

My new gig is a python shop. I've used
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.merge_asof.html
for a range lookup problem, but it has always been with pretty tiny tables,
so can't say anything about bigger data performance/memory.

-----
Andy W
[hidden email]
http://andrewpwheeler.wordpress.com/
--
Sent from: http://spssx-discussion.1045642.n5.nabble.com/

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD

Andy W
apwheele@gmail.com
http://andrewpwheeler.wordpress.com/

Jon Peck

Re: Help with a Syntax File

Actually, CARTPROD is (almost) an outer join, which is what blows up the size. I don't think the STAR JOIN command can do this, because it doesn't handle arithmetic including functions like SUM.

On Sat, Mar 14, 2020 at 12:13 PM Andy W <[hidden email]> wrote:

Haven't checked out the CART PROD -- looks something like an inner join?
(Could you do those with that STAR JOIN command?)

So maybe if you have access to the database of the original tables,
something in SQL like:

SELECT
File1.Knum AS Knum,
File1.PhaseNo AS PhaseNo,
SUM(File2.RuleA) AS TotalA,
SUM(File2.RuleB) AS TotalB,
SUM(File2.RuleC) AS TotalC
FROM File2 INNER JOIN File1
ON File2.Knum = File1.Knum
WHERE File2.ViolationDate BETWEEN File1.StartPhase TO File1.EndPhase
GROUPBY File1.Knum File1.PhaseNo

Although your DBA might not be happy depending on how big the files are!

My new gig is a python shop. I've used
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.merge_asof.html
for a range lookup problem, but it has always been with pretty tiny tables,
so can't say anything about bigger data performance/memory.

-----
Andy W
[hidden email]
http://andrewpwheeler.wordpress.com/
--
Sent from: http://spssx-discussion.1045642.n5.nabble.com/

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD

Jon K Peck
[hidden email]

Andy W

Re: Help with a Syntax File

Good call with the time -- both my SPSS and SQL examples would mess up the
end date exclusion (so maybe just add a day to PhaseEnd field would fix that
in my code examples).

-----
Andy W
[hidden email]
http://andrewpwheeler.wordpress.com/
--
Sent from: http://spssx-discussion.1045642.n5.nabble.com/

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD

Andy W
apwheele@gmail.com
http://andrewpwheeler.wordpress.com/

Jon Peck

Re: Help with a Syntax File

Just don't add in the time component.

On Sat, Mar 14, 2020 at 12:46 PM Andy W <[hidden email]> wrote:

Good call with the time -- both my SPSS and SQL examples would mess up the
end date exclusion (so maybe just add a day to PhaseEnd field would fix that
in my code examples).

-----
Andy W
[hidden email]
http://andrewpwheeler.wordpress.com/
--
Sent from: http://spssx-discussion.1045642.n5.nabble.com/

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD

Jon K Peck
[hidden email]