I have two files, first one tells me the juvenile KNum and the date they started and finished each stage. The second file tells me every time the juvenile has a rule violation and the type of violation (A, B or C). I have 532 juveniles and I would like to calculate how many rule violations each juvenile had in each particular stage. For example:
File 1
File 2
Stella Vasquez AZ Department of Juveniles Corrections 1624 W. Adams Ave, 3rd Floor Phoenix, AZ 85007
(602) 542-2272 or [hidden email] |
I think Jon Peck has some examples of using a python VLOOKUP type function to
do these range merges that is more slick than this, but here is an example of flattening file1 to merge to file2. Biggest issue here is the vector statements have hard coded end #'s, so if that is variable can be a pain in this file. Can solve that with python, or adding in more variables and doing one long index instead of 3 separate ones. (Also not sure what happens here if you have someone in File2 but not in File1, loop may throw an error. But maybe that is a good thing.) ****************************************************************************************. DATA LIST FREE / Knum (F6.0) PhaseNo (F1.0) StartPhase EndPhase (2ADATE10). BEGIN DATA. 33486 2 10/11/2018 12/6/2018 33486 3 12/6/2018 12/20/2018 33486 4 12/20/2018 1/3/2019 33496 4 10/2/2018 1/1/2019 END DATA. DATASET NAME File1. DATA LIST FREE / Knum (F6.0) ViolationDate (ADATE10) ViolTime (TIME5) RuleA RuleB RuleC (3F1.0). BEGIN DATA. 33486 10/15/2018 18:00 0 1 0 33486 10/24/2018 12:21 1 0 0 33486 10/25/2018 17:42 0 1 0 33486 10/26/2018 10:26 1 0 0 33486 10/30/2018 9:29 0 1 0 33486 11/2/2018 14:09 1 0 0 33486 11/5/2018 10:38 1 0 0 33486 11/10/2018 17:48 0 1 0 33486 11/25/2018 8:37 0 1 0 33486 12/8/2018 12:45 1 0 0 33486 12/11/2018 18:40 1 0 0 33486 12/17/2018 20:29 1 0 0 33486 12/22/2018 8:41 1 0 0 33486 12/22/2018 8:47 0 1 0 33486 12/30/2018 7:12 1 0 0 33486 1/3/2019 8:19 0 1 0 33486 1/13/2019 4:46 1 0 0 33486 6/20/2019 18:04 0 1 0 33486 6/24/2019 19:40 1 0 0 END DATA. DATASET NAME File2. COMPUTE ViolationDate = ViolationDate + ViolTime. FORMATS ViolationDate (DATETIME17). EXECUTE. DATASET ACTIVATE FILE1. DATASET COPY WideF1. DATASET ACTIVATE WideF1. SORT CASES BY Knum PhaseNo. CASESTOVARS /ID = Knum /SEPARATOR = "_" /GROUPBY = VARIABLE /COUNT TotPhase. DATASET ACTIVATE FILE2. SORT CASES BY Knum ViolationDate. MATCH FILES FILE = * /TABLE = 'WideF1' /IN EndF1 /BY Knum. *I can think of hacky ways to make this not PhasNo begin/end not hard coded. *But this should be easier to grok. COMPUTE PhaseNo = -1. VECTOR P = PhaseNo_1 TO PhaseNo_3. VECTOR B = StartPhase_1 TO StartPhase_3. VECTOR E = EndPhase_1 TO EndPhase_3. LOOP #i = 1 TO TotPhase. IF ViolationDate >= B(#i) AND ViolationDate <= E(#i) PhaseNo = P(#i). END LOOP. *Now aggregate and merge back into File1. DATASET DECLARE AggF2. AGGREGATE OUTFILE=AggF2 /BREAK Knum PhaseNo /TotalA = SUM(RuleA) /TotalB = SUM(RuleB) /TotalC = SUM(RuleC). DATASET ACTIVATE File1. MATCH FILES FILE = * /TABLE = 'AggF2' /IN AnyViol /BY Knum PhaseNo. RECODE TotalA TO TotalC (SYSMIS = 0). FORMATS TotalA TO TotalC (F5.0). EXECUTE. *Can clean up all those files, but may want. *To see what they look like to understand code. *SELECT IF AnyViol = 1. *DATASET CLOSE WideF1. *DATASET CLOSE AggF2. *DATASET CLOSE File2. ****************************************************************************************. ----- Andy W [hidden email] http://andrewpwheeler.wordpress.com/ -- Sent from: http://spssx-discussion.1045642.n5.nabble.com/ ===================== To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD |
In reply to this post by Stella Vasquez
Andrew has provided a solution using (a lot) of standard syntax. Here is a solution that uses the extension command STATS CARTPROD, which is available from the Extensions menu (but see caveat below). STATS CARTPROD will create a large file that is subsequently pruned, so it make take a bit longer than you would expect. There is a subtle issue with the phase date matching: the file1 dates do not have a time component, so the time is 00:00. I, therefore, ignored the time portion of the violation data in order to make the phase selection work. I read in the date and time as separate fields. Andrew's code adds in the time. This code expects that there are no overlaps in the StartPhase and EndPhase values for a subject. If there are any, it would count a violation in the overlapping interval too many times. The final output is a dataset named count that has the count by type for each Knum and phase. Note that subjects with zero violations will not appear. Change the get commands as needed, and note that it assumes that there is a directory named c:\temp. Here is the code. get file="r:\ibm\users\vasquez\file1.sav". dataset name file1. get file="r:\ibm\users\vasquez\file2.sav". dataset name file2. RENAME VARIABLES Knum = Knum2. DATASET ACTIVATE file1. STATS CARTPROD VAR1=Knum PhaseNo StartPhase EndPhase INPUT2=file2 VAR2=Knum2 ViolationDate RuleA RuleB RuleC /SAVE OUTFILE="c:\temp\merged.sav" DSNAME=merged. dataset activate merged. select if Knum eq Knum2. select if StartPhase <= ViolationDate and ViolationDate <= EndPhase. DATASET DECLARE counts. AGGREGATE /OUTFILE=counts /BREAK=Knum PhaseNo /RuleA_sum=SUM(RuleA) /RuleB_sum=SUM(RuleB) /RuleC_sum=SUM(RuleC) /Total=N. In order for this to work with your dataset, a minor fix is required to the STATS CARTPROD extension command. I can send you the updated file if you want to go this route. You would install it using the Extensions > Install Local Extension Bundle menu item. If your system is locked down, you might need Administator permission to install it. On Fri, Mar 13, 2020 at 6:01 PM Stella Vasquez <[hidden email]> wrote:
|
Haven't checked out the CART PROD -- looks something like an inner join?
(Could you do those with that STAR JOIN command?) So maybe if you have access to the database of the original tables, something in SQL like: SELECT File1.Knum AS Knum, File1.PhaseNo AS PhaseNo, SUM(File2.RuleA) AS TotalA, SUM(File2.RuleB) AS TotalB, SUM(File2.RuleC) AS TotalC FROM File2 INNER JOIN File1 ON File2.Knum = File1.Knum WHERE File2.ViolationDate BETWEEN File1.StartPhase TO File1.EndPhase GROUPBY File1.Knum File1.PhaseNo Although your DBA might not be happy depending on how big the files are! My new gig is a python shop. I've used https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.merge_asof.html for a range lookup problem, but it has always been with pretty tiny tables, so can't say anything about bigger data performance/memory. ----- Andy W [hidden email] http://andrewpwheeler.wordpress.com/ -- Sent from: http://spssx-discussion.1045642.n5.nabble.com/ ===================== To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD |
Actually, CARTPROD is (almost) an outer join, which is what blows up the size. I don't think the STAR JOIN command can do this, because it doesn't handle arithmetic including functions like SUM. On Sat, Mar 14, 2020 at 12:13 PM Andy W <[hidden email]> wrote: Haven't checked out the CART PROD -- looks something like an inner join? |
Good call with the time -- both my SPSS and SQL examples would mess up the
end date exclusion (so maybe just add a day to PhaseEnd field would fix that in my code examples). ----- Andy W [hidden email] http://andrewpwheeler.wordpress.com/ -- Sent from: http://spssx-discussion.1045642.n5.nabble.com/ ===================== To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD |
Just don't add in the time component. On Sat, Mar 14, 2020 at 12:46 PM Andy W <[hidden email]> wrote: Good call with the time -- both my SPSS and SQL examples would mess up the |
Free forum by Nabble | Edit this page |