Hi folks, I am working on a coding problem and need help. I have sequences of 20 Xs and Os like this XXXXOOOOXXOXXXOOOOOX. For each sequence I need to tally how many times X follows X, O follows X, X follows O, and O follows O. Then I need to find how many times X
(and O) follows each of the previous doubles, XX, XO, OO, OX. Then we move to how many times X (and O) follow all 8 of the previous triples, all 16 of the previous 4ples, all 32 of the previous 5ples, etc. These new count variables need to be saved to a SAV
file and matched back to the sequences for input into another program that uses them to compute various quantities for each sequence. I have written some code that works (see below) but it is VERY CLUNKY and it takes *way too long* to run as the number of sequences becomes large.
The Xs and Os are represented by 1s and 0s, respectively. My method makes heavy use of the LAG function and then uses CROSSTABS to do the tallying but I think this method is a dead end. With 92,378 sequences it took 52 hours! The code will need to run on
1,048,574 sequences and be extended to compute tallies for previous 6ples, 7ples, up to 18ples as well. I know there must be a better way to do this using LOOPS and VECTORS, but I don't know enough about these commands to use them efficiently. Any
help is appreciated. Jason CODE FOLLOWS----- **MAKE SURE TO SET VALUE LABLES TO 'LABELS ONLY'**. CD 'C:\MY_FOLDER'. SET MITERATE=200000. OMS /DESTINATION VIEWER=NO. OMS /SELECT TABLES /IF COMMANDS=[' Descriptives'] SUBTYPES=[' Descriptive Statistics'] /DESTINATION FORMAT=SAV OUTFILE='nGRAMs_1_A.SAV' /COLUMNS SEQUENCE=[CALL RALL LALL]. OMS /SELECT TABLES /IF COMMANDS=[' Crosstabs'] SUBTYPES=[' Crosstabulation'] /DESTINATION FORMAT=SAV OUTFILE='nGRAMs_1_A.SAV' /COLUMNS SEQUENCE=[CALL RALL LALL]. *EXAMPLE DATA*. DATA LIST FREE/ SEQNO S. BEGIN DATA. 1 1 1 1 1 1 1 1 1 0 1 0 1 0 1 0 1 1 1 1 1 0 1 1 1 1 1 1 1 0 1 0 1 0 1 0 1 0 1 1 2 1 2 1 2 1 2 0 2 0 2 0 2 0 2 1 2 1 2 1 2 0 2 1 2 1 2 0 2 0 2 0 2 0 2 0 2 1 2 1 END DATA. FORMAT S (F3.0). IF LAG(SEQNO,1)=SEQNO Lag1=Lag(S,1). IF LAG(SEQNO,2)=SEQNO Lag2=Lag(S,2). IF LAG(SEQNO,3)=SEQNO Lag3=Lag(S,3). IF LAG(SEQNO,4)=SEQNO Lag4=Lag(S,4). IF LAG(SEQNO,5)=SEQNO Lag5=Lag(S,5). RECODE LAG1 (1=1)(0=2) INTO Prv1. IF LAG1=1 & LAG2=1 Prv2=1. IF LAG1=1 & LAG2=0 Prv2=2. IF LAG1=0 & LAG2=0 Prv2=3. IF LAG1=0 & LAG2=1 Prv2=4. IF LAG1=1 & LAG2=1 & LAG3=1 Prv3=1. IF LAG1=1 & LAG2=1 & LAG3=0 Prv3=2. IF LAG1=1 & LAG2=0 & LAG3=1 Prv3=3. IF LAG1=1 & LAG2=0 & LAG3=0 Prv3=4. IF LAG1=0 & LAG2=0 & LAG3=1 Prv3=5. IF LAG1=0 & LAG2=0 & LAG3=0 Prv3=6. IF LAG1=0 & LAG2=1 & LAG3=1 Prv3=7. IF LAG1=0 & LAG2=1 & LAG3=0 Prv3=8. IF LAG1=1 & LAG2=1 & LAG3=1 & LAG4=1 Prv4=1. IF LAG1=1 & LAG2=1 & LAG3=1 & LAG4=0 Prv4=2. IF LAG1=1 & LAG2=1 & LAG3=0 & LAG4=1 Prv4=3. IF LAG1=1 & LAG2=1 & LAG3=0 & LAG4=0 Prv4=4. IF LAG1=1 & LAG2=0 & LAG3=0 & LAG4=1 Prv4=5. IF LAG1=1 & LAG2=0 & LAG3=0 & LAG4=0 Prv4=6. IF LAG1=1 & LAG2=0 & LAG3=1 & LAG4=1 Prv4=7. IF LAG1=1 & LAG2=0 & LAG3=1 & LAG4=0 Prv4=8. IF LAG1=0 & LAG2=1 & LAG3=1 & LAG4=1 Prv4=9. IF LAG1=0 & LAG2=1 & LAG3=1 & LAG4=0 Prv4=10. IF LAG1=0 & LAG2=1 & LAG3=0 & LAG4=1 Prv4=11. IF LAG1=0 & LAG2=1 & LAG3=0 & LAG4=0 Prv4=12. IF LAG1=0 & LAG2=0 & LAG3=0 & LAG4=1 Prv4=13. IF LAG1=0 & LAG2=0 & LAG3=0 & LAG4=0 Prv4=14. 
IF LAG1=0 & LAG2=0 & LAG3=1 & LAG4=1 Prv4=15. IF LAG1=0 & LAG2=0 & LAG3=1 & LAG4=0 Prv4=16. IF LAG1=1 & LAG2=1 & LAG3=1 & LAG4=1 & LAG5=1 Prv5=1. IF LAG1=1 & LAG2=1 & LAG3=1 & LAG4=1 & LAG5=0 Prv5=2. IF LAG1=1 & LAG2=1 & LAG3=1 & LAG4=0 & LAG5=1 Prv5=3. IF LAG1=1 & LAG2=1 & LAG3=1 & LAG4=0 & LAG5=0 Prv5=4. IF LAG1=1 & LAG2=1 & LAG3=0 & LAG4=0 & LAG5=1 Prv5=5. IF LAG1=1 & LAG2=1 & LAG3=0 & LAG4=0 & LAG5=0 Prv5=6. IF LAG1=1 & LAG2=1 & LAG3=0 & LAG4=1 & LAG5=1 Prv5=7. IF LAG1=1 & LAG2=1 & LAG3=0 & LAG4=1 & LAG5=0 Prv5=8. IF LAG1=1 & LAG2=0 & LAG3=1 & LAG4=1 & LAG5=1 Prv5=9. IF LAG1=1 & LAG2=0 & LAG3=1 & LAG4=1 & LAG5=0 Prv5=10. IF LAG1=1 & LAG2=0 & LAG3=1 & LAG4=0 & LAG5=1 Prv5=11. IF LAG1=1 & LAG2=0 & LAG3=1 & LAG4=0 & LAG5=0 Prv5=12. IF LAG1=1 & LAG2=0 & LAG3=0 & LAG4=0 & LAG5=1 Prv5=13. IF LAG1=1 & LAG2=0 & LAG3=0 & LAG4=0 & LAG5=0 Prv5=14. IF LAG1=1 & LAG2=0 & LAG3=0 & LAG4=1 & LAG5=1 Prv5=15. IF LAG1=1 & LAG2=0 & LAG3=0 & LAG4=1 & LAG5=0 Prv5=16. IF LAG1=0 & LAG2=1 & LAG3=1 & LAG4=1 & LAG5=1 Prv5=17. IF LAG1=0 & LAG2=1 & LAG3=1 & LAG4=1 & LAG5=0 Prv5=18. IF LAG1=0 & LAG2=1 & LAG3=1 & LAG4=0 & LAG5=1 Prv5=19. IF LAG1=0 & LAG2=1 & LAG3=1 & LAG4=0 & LAG5=0 Prv5=20. IF LAG1=0 & LAG2=1 & LAG3=0 & LAG4=0 & LAG5=1 Prv5=21. IF LAG1=0 & LAG2=1 & LAG3=0 & LAG4=0 & LAG5=0 Prv5=22. IF LAG1=0 & LAG2=1 & LAG3=0 & LAG4=1 & LAG5=1 Prv5=23. IF LAG1=0 & LAG2=1 & LAG3=0 & LAG4=1 & LAG5=0 Prv5=24. IF LAG1=0 & LAG2=0 & LAG3=1 & LAG4=1 & LAG5=1 Prv5=25. IF LAG1=0 & LAG2=0 & LAG3=1 & LAG4=1 & LAG5=0 Prv5=26. IF LAG1=0 & LAG2=0 & LAG3=1 & LAG4=0 & LAG5=1 Prv5=27. IF LAG1=0 & LAG2=0 & LAG3=1 & LAG4=0 & LAG5=0 Prv5=28. IF LAG1=0 & LAG2=0 & LAG3=0 & LAG4=0 & LAG5=1 Prv5=29. IF LAG1=0 & LAG2=0 & LAG3=0 & LAG4=0 & LAG5=0 Prv5=30. IF LAG1=0 & LAG2=0 & LAG3=0 & LAG4=1 & LAG5=1 Prv5=31. IF LAG1=0 & LAG2=0 & LAG3=0 & LAG4=1 & LAG5=0 Prv5=32. 
VALUE LABEL S 0'o' 1'x' /PRV1 1'X' 2'O' /PRV2 1'XX' 2'XO' 3'OO' 4'OX' /PRV3 1'XXX' 2'XXO' 3'XOX' 4'XOO' 5'OOX' 6'OOO' 7'OXX' 8'OXO' /PRV4 1'XXXX' 2'XXXO' 3'XXOX' 4'XXOO' 5'XOOX' 6'XOOO' 7'XOXX' 8'XOXO' 9'OXXX' 10'OXXO' 11'OXOX' 12'OXOO' 13'OOOX' 14'OOOO' 15'OOXX' 16'OOXO' /PRV5 1'XXXXX' 2'XXXXO' 3'XXXOX' 4'XXXOO' 5'XXOOX' 6'XXOOO' 7'XXOXX' 8'XXOXO' 9'XOXXX' 10'XOXXO' 11'XOXOX' 12'XOXOO' 13'XOOOX' 14'XOOOO' 15'XOOXX' 16'XOOXO' 17'OXXXX' 18'OXXXO' 19'OXXOX' 20'OXXOO' 21'OXOOX' 22'OXOOO' 23'OXOXX' 24'OXOXO' 25'OOXXX' 26'OOXXO' 27'OOXOX' 28'OOXOO' 29'OOOOX' 30'OOOOO' 31'OOOXX' 32'OOOXO'. FORMAT PRV1 PRV2 PRV3 PRV4 PRV5(F3.0).
DEFINE !Ngrams (minNO = !TOKENS(1)/maxNO = !TOKENS(1)). !DO !I = !minNO !TO !maxNO. TEMP. SELECT IF SEQNO=!I. DESCRIPTIVES SEQNO. TEMP. SELECT IF SEQNO=!I. CROSSTAB VARS= PRV1 (1,2) PRV2(1,4) PRV3(1,8) PRV4(1,16) PRV5(1,32) S(0,1) /TABLES=PRV1 PRV2 PRV3 PRV4 PRV5 BY S. !DOEND. !ENDDEFINE. !Ngrams minNO = 1 maxNO = 92378 /*set to N 92378*/. OMSEND. *------------------------------------------------------------------------------. GET FILE='nGRAMs_1_B.SAV' /KEEP=Mean_SEQNO o_X_Count o_O_Count x_X_Count x_O_Count o_XX_Count o_XO_Count o_OO_Count o_OX_Count x_XX_Count x_XO_Count x_OO_Count x_OX_Count o_XXX_Count o_XXO_Count o_XOX_Count o_XOO_Count o_OOX_Count o_OOO_Count o_OXX_Count o_OXO_Count x_XXX_Count x_XXO_Count x_XOX_Count x_XOO_Count x_OOX_Count x_OOO_Count x_OXX_Count x_OXO_Count x_XXXX_Count x_XXXO_Count x_XXOX_Count x_XXOO_Count x_XOOX_Count x_XOOO_Count x_XOXX_Count x_XOXO_Count x_OXXX_Count x_OXXO_Count x_OXOX_Count x_OXOO_Count x_OOOX_Count x_OOOO_Count x_OOXX_Count x_OOXO_Count o_XXXX_Count o_XXXO_Count o_XXOX_Count o_XXOO_Count o_XOOX_Count o_XOOO_Count o_XOXX_Count o_XOXO_Count o_OXXX_Count o_OXXO_Count o_OXOX_Count o_OXOO_Count o_OOOX_Count o_OOOO_Count o_OOXX_Count o_OOXO_Count x_XXXXX_Count x_XXXXO_Count x_XXXOX_Count x_XXXOO_Count x_XXOOX_Count x_XXOOO_Count x_XXOXX_Count x_XXOXO_Count x_XOXXX_Count x_XOXXO_Count x_XOXOX_Count x_XOXOO_Count x_XOOOX_Count x_XOOOO_Count x_XOOXX_Count x_XOOXO_Count x_OXXXX_Count x_OXXXO_Count x_OXXOX_Count x_OXXOO_Count x_OXOOX_Count x_OXOOO_Count x_OXOXX_Count x_OXOXO_Count x_OOXXX_Count x_OOXXO_Count x_OOXOX_Count x_OOXOO_Count x_OOOOX_Count x_OOOOO_Count x_OOOXX_Count x_OOOXO_Count o_XXXXX_Count o_XXXXO_Count o_XXXOX_Count o_XXXOO_Count o_XXOOX_Count o_XXOOO_Count o_XXOXX_Count o_XXOXO_Count o_XOXXX_Count o_XOXXO_Count o_XOXOX_Count o_XOXOO_Count o_XOOOX_Count o_XOOOO_Count o_XOOXX_Count o_XOOXO_Count o_OXXXX_Count o_OXXXO_Count o_OXXOX_Count o_OXXOO_Count o_OXOOX_Count o_OXOOO_Count o_OXOXX_Count o_OXOXO_Count 
o_OOXXX_Count o_OOXXO_Count o_OOXOX_Count o_OOXOO_Count o_OOOOX_Count o_OOOOO_Count o_OOOXX_Count o_OOOXO_Count. RENAME VARIABLES (Mean_SEQNO o_X_Count o_O_Count x_X_Count x_O_Count o_XX_Count o_XO_Count o_OO_Count o_OX_Count x_XX_Count x_XO_Count x_OO_Count x_OX_Count o_XXX_Count o_XXO_Count o_XOX_Count o_XOO_Count o_OOX_Count o_OOO_Count o_OXX_Count o_OXO_Count x_XXX_Count x_XXO_Count x_XOX_Count x_XOO_Count x_OOX_Count x_OOO_Count x_OXX_Count x_OXO_Count x_XXXX_Count x_XXXO_Count x_XXOX_Count x_XXOO_Count x_XOOX_Count x_XOOO_Count x_XOXX_Count x_XOXO_Count x_OXXX_Count x_OXXO_Count x_OXOX_Count x_OXOO_Count x_OOOX_Count x_OOOO_Count x_OOXX_Count x_OOXO_Count o_XXXX_Count o_XXXO_Count o_XXOX_Count o_XXOO_Count o_XOOX_Count o_XOOO_Count o_XOXX_Count o_XOXO_Count o_OXXX_Count o_OXXO_Count o_OXOX_Count o_OXOO_Count o_OOOX_Count o_OOOO_Count o_OOXX_Count o_OOXO_Count x_XXXXX_Count x_XXXXO_Count x_XXXOX_Count x_XXXOO_Count x_XXOOX_Count x_XXOOO_Count x_XXOXX_Count x_XXOXO_Count x_XOXXX_Count x_XOXXO_Count x_XOXOX_Count x_XOXOO_Count x_XOOOX_Count x_XOOOO_Count x_XOOXX_Count x_XOOXO_Count x_OXXXX_Count x_OXXXO_Count x_OXXOX_Count x_OXXOO_Count x_OXOOX_Count x_OXOOO_Count x_OXOXX_Count x_OXOXO_Count x_OOXXX_Count x_OOXXO_Count x_OOXOX_Count x_OOXOO_Count x_OOOOX_Count x_OOOOO_Count x_OOOXX_Count x_OOOXO_Count o_XXXXX_Count o_XXXXO_Count o_XXXOX_Count o_XXXOO_Count o_XXOOX_Count o_XXOOO_Count o_XXOXX_Count o_XXOXO_Count o_XOXXX_Count o_XOXXO_Count o_XOXOX_Count o_XOXOO_Count o_XOOOX_Count o_XOOOO_Count o_XOOXX_Count o_XOOXO_Count o_OXXXX_Count o_OXXXO_Count o_OXXOX_Count o_OXXOO_Count o_OXOOX_Count o_OXOOO_Count o_OXOXX_Count o_OXOXO_Count o_OOXXX_Count o_OOXXO_Count o_OOXOX_Count o_OOXOO_Count o_OOOOX_Count o_OOOOO_Count o_OOOXX_Count o_OOOXO_Count = SEQNO o_X o_O x_X x_O o_XX o_XO o_OO o_OX x_XX x_XO x_OO x_OX o_XXX o_XXO o_XOX o_XOO o_OOX o_OOO o_OXX o_OXO x_XXX x_XXO x_XOX x_XOO x_OOX x_OOO x_OXX x_OXO x_XXXX x_XXXO x_XXOX x_XXOO x_XOOX x_XOOO x_XOXX x_XOXO x_OXXX 
x_OXXO x_OXOX x_OXOO x_OOOX x_OOOO x_OOXX x_OOXO o_XXXX o_XXXO o_XXOX o_XXOO o_XOOX o_XOOO o_XOXX o_XOXO o_OXXX o_OXXO o_OXOX o_OXOO o_OOOX o_OOOO o_OOXX o_OOXO x_XXXXX x_XXXXO x_XXXOX x_XXXOO x_XXOOX x_XXOOO x_XXOXX x_XXOXO x_XOXXX x_XOXXO x_XOXOX x_XOXOO x_XOOOX x_XOOOO x_XOOXX x_XOOXO x_OXXXX x_OXXXO x_OXXOX x_OXXOO x_OXOOX x_OXOOO x_OXOXX x_OXOXO x_OOXXX x_OOXXO x_OOXOX x_OOXOO x_OOOOX x_OOOOO x_OOOXX x_OOOXO o_XXXXX o_XXXXO o_XXXOX o_XXXOO o_XXOOX o_XXOOO o_XXOXX o_XXOXO o_XOXXX o_XOXXO o_XOXOX o_XOXOO o_XOOOX o_XOOOO o_XOOXX o_XOOXO o_OXXXX o_OXXXO o_OXXOX o_OXXOO o_OXOOX o_OXOOO o_OXOXX o_OXOXO o_OOXXX o_OOXXO o_OOXOX o_OOXOO o_OOOOX o_OOOOO o_OOOXX o_OOOXO). IF NMISS(SEQNO)=1 SEQNO=LAG(SEQNO,1). AGGREGATE OUTFILE=* /BREAK=SEQNO /o_X o_O x_X x_O o_XX o_XO o_OO o_OX x_XX x_XO x_OO x_OX o_XXX o_XXO o_XOX o_XOO o_OOX o_OOO o_OXX o_OXO x_XXX x_XXO x_XOX x_XOO x_OOX x_OOO x_OXX x_OXO x_XXXX x_XXXO x_XXOX x_XXOO x_XOOX x_XOOO x_XOXX x_XOXO x_OXXX x_OXXO x_OXOX x_OXOO x_OOOX x_OOOO x_OOXX x_OOXO o_XXXX o_XXXO o_XXOX o_XXOO o_XOOX o_XOOO o_XOXX o_XOXO o_OXXX o_OXXO o_OXOX o_OXOO o_OOOX o_OOOO o_OOXX o_OOXO x_XXXXX x_XXXXO x_XXXOX x_XXXOO x_XXOOX x_XXOOO x_XXOXX x_XXOXO x_XOXXX x_XOXXO x_XOXOX x_XOXOO x_XOOOX x_XOOOO x_XOOXX x_XOOXO x_OXXXX x_OXXXO x_OXXOX x_OXXOO x_OXOOX x_OXOOO x_OXOXX x_OXOXO x_OOXXX x_OOXXO x_OOXOX x_OOXOO x_OOOOX x_OOOOO x_OOOXX x_OOOXO o_XXXXX o_XXXXO o_XXXOX o_XXXOO o_XXOOX o_XXOOO o_XXOXX o_XXOXO o_XOXXX o_XOXXO o_XOXOX o_XOXOO o_XOOOX o_XOOOO o_XOOXX o_XOOXO o_OXXXX o_OXXXO o_OXXOX o_OXXOO o_OXOOX o_OXOOO o_OXOXX o_OXOXO o_OOXXX o_OOXXO o_OOXOX o_OOXOO o_OOOOX o_OOOOO o_OOOXX o_OOOXO =FIRST(o_X o_O x_X x_O o_XX o_XO o_OO o_OX x_XX x_XO x_OO x_OX o_XXX o_XXO o_XOX o_XOO o_OOX o_OOO o_OXX o_OXO x_XXX x_XXO x_XOX x_XOO x_OOX x_OOO x_OXX x_OXO x_XXXX x_XXXO x_XXOX x_XXOO x_XOOX x_XOOO x_XOXX x_XOXO x_OXXX x_OXXO x_OXOX x_OXOO x_OOOX x_OOOO x_OOXX x_OOXO o_XXXX o_XXXO o_XXOX o_XXOO o_XOOX o_XOOO o_XOXX o_XOXO o_OXXX o_OXXO o_OXOX o_OXOO o_OOOX 
o_OOOO o_OOXX o_OOXO x_XXXXX x_XXXXO x_XXXOX x_XXXOO x_XXOOX x_XXOOO x_XXOXX x_XXOXO x_XOXXX x_XOXXO x_XOXOX x_XOXOO x_XOOOX x_XOOOO x_XOOXX x_XOOXO x_OXXXX x_OXXXO x_OXXOX x_OXXOO x_OXOOX x_OXOOO x_OXOXX x_OXOXO x_OOXXX x_OOXXO x_OOXOX x_OOXOO x_OOOOX x_OOOOO x_OOOXX x_OOOXO o_XXXXX o_XXXXO o_XXXOX o_XXXOO o_XXOOX o_XXOOO o_XXOXX o_XXOXO o_XOXXX o_XOXXO o_XOXOX o_XOXOO o_XOOOX o_XOOOO o_XOOXX o_XOOXO o_OXXXX o_OXXXO o_OXXOX o_OXXOO o_OXOOX o_OXOOO o_OXOXX o_OXOXO o_OOXXX o_OOXXO o_OOXOX o_OOXOO o_OOOOX o_OOOOO o_OOOXX o_OOOXO). FORMAT SEQNO (F5.0). SAVE OUTFILE='nGRAMS_2_B.SAV'. *------------------------------------------------------------------------------. _____________________________________________________________ Jason W. Beckstead, Ph.D. Associate Professor/Quantitative Methodologist University of South Florida College of Nursing 12901 Bruce B. Downs Blvd., MDC22, Tampa, FL 33612, USA Statistical Editor, International Journal of Nursing Studies phone: +1.813.974.7667 fax: +1.813.974.5418 personal website: http://personal.health.usf.edu/jbeckste/ International Journal of Nursing Studies http://www.elsevier.com/ijns |
It helps to make a minimal example, but a few points:
- Probably all of the time is spent generating the separate tables for each SEQNO using SELECT IF. You can do them in one swoop by using SPLIT FILE BY SEQNO and then running a single CROSSTABS. With so many tables, suppressing them in the output may save some time as well.
- I don't get the point of the OMS to begin with; you can use AGGREGATE to generate the statistics in the crosstabs, which will likely be faster.
------------------
Here is a brief, different approach to calculating the "before" n-grams. It will take a little more thought to extend it to search for all of the prior n-grams of a given length, but it is a start. The 20 million cases won't be super fast, but I would think it would be faster than 50 hours.
**************************************************.
DATASET CLOSE ALL.
OUTPUT CLOSE ALL.
DATA LIST FREE/ SEQNO S.
BEGIN DATA.
1 1 1 1 1 1 1 1 1 0 1 0 1 0 1 0 1 1 1 1 1 0 1 1 1 1 1 1 1 0 1 0 1 0 1 0 1 0 1 1
2 1 2 1 2 1 2 0 2 0 2 0 2 0 2 1 2 1 2 1 2 0 2 1 2 1 2 0 2 0 2 0 2 0 2 0 2 1 2 1
END DATA.
DATASET NAME Test.

DEFINE !NGram (Before = !ENCLOSE("[","]")
              /After = !TOKENS(1)
              /Out = !TOKENS(1) )
*Figure out how many items in Before.
!LET !L = !NULL
!DO !I !IN (!Before)
  !LET !L = !CONCAT(!L," ")
!DOEND
!LET !N = !LENGTH(!L)
*Now a DO REPEAT, cant use LAG with LOOP.
COMPUTE #T = 0.
DO REPEAT B = !Before /#i = 1 TO !N.
  IF SEQNO = LAG(SEQNO,#i) AND B = LAG(S,#i) #T = #T + 1.
END REPEAT.
COMPUTE !Out = (#T = !N).
VARIABLE LABEL !Out !QUOTE(!CONCAT("NGram for [",!Before,"] followed by a ",!After)).
!ENDDEFINE.

*X followed by X.
!NGram Before = [1] After = 1 Out = X_X.
*O followed by X.
!NGram Before = [0] After = 1 Out = O_X.
*OO followed by an O.
!NGram Before = [0 0] After = 0 Out = OO_O.
*OXO followed by an X.
!NGram Before = [0 1 0] After = 1 Out = OXO_X.

*Note for Aggregate, you can use TO - very helpful.
DATASET DECLARE SeqAgg.
AGGREGATE OUTFILE='SeqAgg'
  /BREAK SEQNO
  /X1 TO X4 = SUM(X_X TO OXO_X).
DATASET ACTIVATE SeqAgg.
*Lose labels and names though.
**************************************************.
|
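For readers who want to check the tallies outside SPSS, here is a minimal Python sketch of the "one pass, grouped by SEQNO" idea suggested above. It is not the macro from this post: the function name `tally_long`, the `(context, next_symbol)` key layout, and the choice to write contexts in chronological order are all assumptions made for illustration. It expects the long-format rows to already be in file order, with each sequence's rows contiguous.

```python
from collections import Counter
from itertools import groupby

def tally_long(rows, max_k=5):
    """rows: iterable of (seqno, s) pairs in file order, with s in {0, 1}.

    Returns {seqno: Counter} where each Counter maps
    (context, next_symbol) -> count, for contexts of length 1..max_k.
    Contexts are written in chronological order (oldest symbol first).
    """
    out = {}
    # groupby only merges adjacent rows, so each SEQNO must be contiguous.
    for seqno, grp in groupby(rows, key=lambda r: r[0]):
        seq = "".join("X" if s == 1 else "O" for _, s in grp)
        counts = Counter()
        for i in range(1, len(seq)):              # i indexes the "next" symbol
            for k in range(1, min(max_k, i) + 1):  # context length available at i
                counts[(seq[i - k:i], seq[i])] += 1
        out[seqno] = counts
    return out

# The two example sequences from the thread, rebuilt as long-format rows.
seq1 = "XXXXOOOOXXOXXXOOOOOX"
seq2 = "XXXOOOOXXXOXXOOOOOXX"
rows = [(1, 1 if ch == "X" else 0) for ch in seq1]
rows += [(2, 1 if ch == "X" else 0) for ch in seq2]

tallies = tally_long(rows)
print(tallies[1][("X", "X")])   # X follows X in sequence 1 -> 6
```

The whole file is read once and every sequence is tallied as it streams past, which is the same reason a single SPLIT FILE or AGGREGATE pass should beat 92,378 separate SELECT IF runs.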
In reply to this post by Beckstead, Jason
In the example sequence that you give, I think the count for XX is 6, XO is 3, OX is 3, and OO is 7. Triples: XXX is 3, XXO is 3, OOO is 5, OOX is 2, XOX is 1, XOO is 2, OXO is 0, OXX is 2. So the problem you have is that the number of subsequences to check is 2**n, where n is the length of the subsequence and n runs from 2 to 20. 2**20 is a big number. Second, it appears that the total number of subsequences found in one sequence is 20-n+1: 19 for n=2, 18 for n=3.

I think there are different ways of working this problem, although I don't think that a long format file is the way to go. The data apparently come in as an A20 string. Given that string, you could
a) write code that counts by looking back,
b) write code that loops through the string and matches a substring of length n against the 2**n list of possible sequences and counts matches, or
c) write code that creates 19 files. File 2 consists of 19 records per case with each record being an A2 substring; file 3 is 18 records per case with each record being an A3 substring, etc. Each file is then aggregated by caseid if you want that level of detail, or run through FREQUENCIES for file totals.

At the moment, I believe I'd choose c). Creating the substrings is a trivial task: VARSTOCASES, then AGGREGATE by caseid or FREQUENCIES. The drawback is that you know which sequences appeared and how many times each appeared but not which sequences did not appear.
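The substring idea in (b) and (c) can be sketched in a few lines of Python, which also makes it easy to verify the counts quoted above. This is only an illustration, not SPSS syntax: `ngram_counts` and its `include_zeros` flag are hypothetical names, and the zero-filling step is one possible way around the "which sequences did not appear" drawback.

```python
from collections import Counter
from itertools import product

def ngram_counts(seq, n, include_zeros=False):
    """Count every length-n substring of seq with a sliding window.

    With include_zeros=True, patterns that never occur are added with
    a count of 0, so all 2**n patterns are present in the result.
    """
    counts = Counter(seq[i:i + n] for i in range(len(seq) - n + 1))
    if include_zeros:
        for combo in product("XO", repeat=n):
            counts.setdefault("".join(combo), 0)
    return counts

seq = "XXXXOOOOXXOXXXOOOOOX"
pairs = ngram_counts(seq, 2)
triples = ngram_counts(seq, 3, include_zeros=True)

print(dict(pairs))
print(sum(pairs.values()), sum(triples.values()))  # 20-n+1 windows each
```

For the example sequence this gives XX=6, XO=3, OX=3, OO=7, with OXO present in the triples at a count of 0.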
Gene Maguin

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD
|
In reply to this post by Andy W
Whoops, my original macro forgot to evaluate whether the outcome value was a 0 or 1 (i.e., it did not use the After parameter at all). Here is an updated version, and an example of using Python to generate all of the potential permutations of a given length.
The 8 and 16 permutations are going to be a bit of a doozy; for the 16 it is 2^16 = 65536 computes. Not sure if you want all those variables or if there is a more convenient output format, but hopefully this is a good start. The 1 through 4 grams on 90,000 cases take only around a minute on my machine.
*****************************************.
*Simulate 90,000 sequences of length 20.
INPUT PROGRAM.
LOOP #i = 1 TO 90000.
  LOOP #j = 1 TO 20.
    COMPUTE SEQNO = #i.
    COMPUTE S = RV.BERNOULLI(0.5).
    END CASE.
  END LOOP.
END LOOP.
END FILE.
END INPUT PROGRAM.
DATASET NAME Test.
EXECUTE.

DEFINE !NGram2 (!POSITIONAL = !CMDEND)
*Figure out how many items in Before.
!LET !L = !NULL
!DO !I !IN (!1)
  !IF (!I = "0") !THEN
    !LET !L = !CONCAT(!L,"O")
  !ELSE
    !LET !L = !CONCAT(!L,"X")
  !IFEND
!DOEND
!LET !N = !LENGTH(!L)
*Now a DO REPEAT, cant use LAG with LOOP.
COMPUTE #T0 = 0.
COMPUTE #T1 = 0.
DO REPEAT B = !1 /#i = 1 TO !N.
  DO IF SEQNO = LAG(SEQNO,#i) AND B = LAG(S,#i).
    IF S = 0 #T0 = #T0 + 1.
    IF S = 1 #T1 = #T1 + 1.
  END IF.
END REPEAT.
*Making output variables.
COMPUTE !CONCAT(!L,"_O") = (#T0 = !N).
COMPUTE !CONCAT(!L,"_X") = (#T1 = !N).
*Making labels.
VARIABLE LABEL !CONCAT(!L,"_O") !QUOTE(!CONCAT("NGram for [",!1,"] followed by a 0")).
VARIABLE LABEL !CONCAT(!L,"_X") !QUOTE(!CONCAT("NGram for [",!1,"] followed by a 1")).
!ENDDEFINE.

*Use python to generate all permutations for NGRAM lengths 1, 2 and 4.
BEGIN PROGRAM Python.
import spss
import itertools as it
YourSet = [0,1]
NG = [1,2,4] #can add 8 and 16, total computes equals 2^n
for N in NG:
  x = it.product(YourSet,repeat=N)
  for i in x:
    b = " ".join(str(e) for e in i)
    c = "!NGram2 " + b + "."
    spss.Submit(c)
END PROGRAM.
EXECUTE.
*This takes about a minute.
*****************************************.
|
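One way to avoid submitting 2^n separate computes is to treat each window as an n-bit integer and update it as you walk along the sequence, so a single pass fills a table of all 2^n counts. The sketch below is plain Python, not part of the macro above. The name `rolling_counts` and the bit layout (X=1, O=0, oldest symbol in the highest bit) are assumptions chosen for illustration.

```python
def rolling_counts(bits, n):
    """Count every n-gram in a list of 0/1 values with a rolling integer code.

    counts[code] is the number of windows whose n bits, read oldest to
    newest, spell the binary number `code` (so 0b110 means 1,1,0 = XXO).
    """
    counts = [0] * (1 << n)
    mask = (1 << n) - 1
    code = 0
    for i, b in enumerate(bits):
        # Shift the newest symbol in and drop the one that fell out of the window.
        code = ((code << 1) | b) & mask
        if i >= n - 1:            # only count once the window is full
            counts[code] += 1
    return counts

seq = "XXXXOOOOXXOXXXOOOOOX"
bits = [1 if ch == "X" else 0 for ch in seq]

counts2 = rolling_counts(bits, 2)
counts3 = rolling_counts(bits, 3)
print(counts2[0b11], counts2[0b10], counts2[0b01], counts2[0b00])
print(counts3[0b110], counts3[0b010])
```

The work per sequence is proportional to its length for each n, regardless of how many patterns exist, and patterns that never occur simply stay at 0 in the table.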
Administrator
|
In reply to this post by Andy W
I'm in the swamp getting dirty with my own nest of angry alligators at the moment so going to just drop the MATRIX bomb!
DO THIS IN MATRIX! LAGS etc are DOOMED here! Laterz
Please reply to the list and not to my personal email.
Those desiring my consulting or training services please feel free to email me. --- "Nolite dare sanctum canibus neque mittatis margaritas vestras ante porcos ne forte conculcent eas pedibus suis." Cum es damnatorum possederunt porcos iens ut salire off sanguinum cliff in abyssum?" |
Yeah, you're probably right. The million cases will fit in memory, so it should be doable. Some of the work of generating all the permutations to loop over is in this thread, http://spssx-discussion.1045642.n5.nabble.com/How-to-enumerate-all-the-combinations-in-SPSS-td5727284.html. It will probably be slightly simpler to code it up in wide format than in long format (so just looping over rows).
My code above, for up to the 4 grams on a million sequences, will probably finish in ~20 minutes, I'm guessing. Up to the 8 grams will likely take overnight, and all the permutations for the higher orders will be progressively more of a problem. I'm pretty sure it will plug away like a champ though. (Some of them are going to take a while no matter how you slice them; some clever evaluations to eliminate specific sequences beforehand may be in order.) |
In reply to this post by Beckstead, Jason
You are absolutely correct. This method is doomed from the get go.
You need to define a general algorithm and learn MATRIX to implement it. LAG will not take you very far and your eyes will start bleeding (as mine are) from staring at that monstrosity for too long. -- Tough Love!
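As one possible reading of "a general algorithm", here is a minimal sketch in Python rather than MATRIX. It assumes each sequence is a list of 0/1 values (O = 0, X = 1) and reads every prefix as a binary number, oldest symbol first, so a prefix of length k indexes a table of 2^k rows. The function name and table layout are my own assumptions, not anything posted in this thread.

```python
def tally_by_prefix(seq, max_len):
    """Count how often 0 and 1 follow every prefix of length 1..max_len.

    counts[k][code] is a pair [times followed by 0, times followed by 1],
    where code is the length-k prefix read as a binary number (oldest first).
    """
    counts = {k: [[0, 0] for _ in range(2 ** k)] for k in range(1, max_len + 1)}
    for pos in range(1, len(seq)):
        code = 0
        # Grow the prefix backwards from pos-1; the newly added (older)
        # symbol becomes the most significant bit of the code.
        for k in range(1, min(max_len, pos) + 1):
            code |= seq[pos - k] << (k - 1)
            counts[k][code][seq[pos]] += 1
    return counts

example = "XXXXOOOOXXOXXXOOOOOX"          # the sequence from the original post
seq = [1 if ch == "X" else 0 for ch in example]
counts = tally_by_prefix(seq, max_len=5)
print(counts[1][1])     # [3, 6]: X followed by O three times, by X six times
print(counts[2][0b11])  # tallies for the prefix XX
```

One pass over a 20-symbol sequence touches at most 19 positions per prefix length, so the work per sequence is small. For long prefixes the dense 2^k table is mostly zeros, and a dictionary keyed by (k, code) would likely be the better container.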
In reply to this post by Beckstead, Jason
I would like to know what the end is supposed to be, because
this seems like the long way around the barn. If you get all the complete tabulations of everything, you then have a ridiculous problem of parsing out the overlaps. For instance, if there is a pattern of 16 that matches perfectly for 10% of the lines, that gives you an enormous number of matches of its sub-sequences of length 2, 3, 4, ..., which would *tend* to dominate all the other counts of sub-sequences. Is something like that conceivable?

By that logic, if it applies at all: to get unique counts, it must be better if you can remove the (interesting) longer ones before counting the shorter ones, like Fourier analysis removes the influence of slower frequencies first.

If you really want to do the *job* efficiently, I think you need to consider the whole job, which does include making sense of the (potentially) millions of tabulations.

Or, thinking of what else has not been said: is it interesting that a certain pattern does NOT exist, or are you only interested in the actual ones?

Further: Is it reasonable to take a sampling approach, instead of a complete enumeration? Is this a "population" or, even if it is, can it be treated as a sample? Are the lines expected to be similar as random samplings? Does the starting point in a line potentially matter? If you are "sampling", is it reasonable to assume that the progressions in a line are started at arbitrary places, so that counts that start with position 1 would be similar to counts starting at any other position?

--
Rich Ulrich

Date: Tue, 24 Mar 2015 10:33:28 +0000
From: [hidden email]
Subject: Analyzing Sequences; help with LOOPS & VECTORS
To: [hidden email]

[snip] |
In reply to this post by Beckstead, Jason
At 06:33 AM 3/24/2015, Beckstead, Jason wrote:
>I have sequences of 20 Xs and Os like this XXXXOOOOXXOXXXOOOOOX.
>For each sequence I need to tally how many times X follows X, O
>follows X, X follows O, and O follows O. Then I need to find how
>many times X (and O) follows each of the previous doubles, XX, XO,
>OO, OX. Then we move to how many times X (and O) follow all 8 of the
>previous triples, all 16 of the previous 4ples, all 32 of the
>previous 5ples, etc. These new count variables need to be matched
>back to the sequences for input into another program.

I'd do it by unrolling the data to 'long' form. It looks like you've done that, but given yourself much more trouble than necessary. If you start with data like this,

|-----------------------------|---------------------------|
|Output Created               |24-MAR-2015 21:20:02       |
|-----------------------------|---------------------------|
[TestData]

CaseID Xs.and.Os

  001  OXXXXOXOOXXXXXOXOXXO
  002  XXOXXXXXXOXXOXXOXXXX
  003  XXOXXXXXXOXXXOXXXXOX
  004  XOXXOXOOXXXOXXXOXXOO
  005  XOXOOXXOXXOXOOXXXXXO
  ...

I'll call the trailing letter the 'suffix' and the string of one or more preceding letters the 'prefix'. You can use XSAVE to write each instance of a prefix-suffix combination to a separate file (I named it 'Unroll'; it must be a disk file, not a dataset):

NUMERIC Posit  (F3).
NUMERIC PfxLen (F3).
STRING  Prefix (A5).
STRING  Suffix (A1).

VAR LABEL Posit  'Position of suffix character, in input string'
          PfxLen 'Length of prefix'
          Prefix 'Prefix text'
          Suffix 'Suffix letter'.

LOOP Posit = 2 TO 20.
.  COMPUTE Suffix = SUBSTR(Xs.and.Os,Posit,1).
.  LOOP PfxLen = 1 TO 2 /* Limit to 2-char. prefix for demo */.
.     DO IF PfxLen LT Posit.
.        COMPUTE Prefix = SUBSTR(Xs.and.Os,Posit-PfxLen,PfxLen).
.        XSAVE OUTFILE=Unroll /KEEP=CaseID Posit PfxLen Prefix Suffix.
.     END IF.
.  END LOOP.
END LOOP.

EXECUTE /* needed, in a transformation program run to do an XSAVE */.

GET FILE= Unroll.
DATASET NAME Unrolled WINDOW=FRONT.
--------------------------------------

The above counts only prefixes of length 1 or 2, not length 5 as you asked for; but it can be increased to any length prefix, by first making sure that variable Prefix is wide enough to hold all the prefixes, and then changing the upper limit on the statement

.  LOOP PfxLen = 1 TO 2 /* Limit to 2-char. prefix for demo */.

to whatever you want.

That gives you every prefix-suffix occurrence, however many times it occurs; you want to just count occurrences of different prefix-suffix values:

DATASET DECLARE Summary.
AGGREGATE OUTFILE=Summary
   /BREAK = CaseID PfxLen Prefix Suffix
   /Instances 'Occurrences of this pattern' = NU.

DATASET ACTIVATE Summary WINDOW=FRONT.
LIST /CASES=15.

List
|-----------------------------|---------------------------|
|Output Created               |24-MAR-2015 21:20:03       |
|-----------------------------|---------------------------|
[Summary]

CaseID PfxLen Prefix Suffix Instances

  001      1  O      X          5
  001      1  O      O          1
  001      1  X      O          5
  001      1  X      X          8
  001      2  OO     X          1
  001      2  OX     O          2
  001      2  OX     X          3
  001      2  XO     O          1
  001      2  XO     X          3
  001      2  XX     O          3
  001      2  XX     X          5
  002      1  O      X          4
  002      1  X      O          4
  002      1  X      X         11
  002      2  OX     X          4

Number of cases read:  15    Number of cases listed:  15

--------------------------------------------------------

For a lot of analytic purposes, this would do. But you say you want a separate variable giving the counts for each prefix-suffix pair. You can do that with CASESTOVARS:

STRING Label (A10).
VAR LABEL Label 'Length of prefix, text of prefix, and suffix'.

COMPUTE Label=CONCAT('L',STRING(PfxLen,F1),'.',
                     RTRIM(Prefix)        ,'.',
                     Suffix               ).

CASESTOVARS
   /ID      = CaseID
   /DROP    = PfxLen, Prefix, Suffix
   /INDEX   = Label
   /GROUPBY = VARIABLE
   /AUTOFIX = NO.
Cases to Variables
|----------------------------|---------------------------|
|Output Created              |24-MAR-2015 21:20:04       |
|----------------------------|---------------------------|
[WideForm]

Generated Variables [suppressed]

Processing Statistics
|---------------|----|
|Cases In       |213 |
|Cases Out      |20  |
|---------------|----|
|Variables In   |6   |
|Variables Out  |13  |
|---------------|----|
|Index Values   |12  |
|---------------|----|

RECODE ALL (SYSMIS=0).

LIST /CASES=10 /VARIABLES=CaseID TO L2.OX.X.

List
|-----------------------------|---------------------------|
|Output Created               |24-MAR-2015 21:20:04       |
|-----------------------------|---------------------------|
[WideForm]

CaseID L1.O.O L1.O.X L1.X.O L1.X.X L2.OO.O L2.OO.X L2.OX.O L2.OX.X

  001      1      5      5      8       0       1       2       3
  002      0      4      4     11       0       0       0       4
  003      0      4      4     11       0       0       0       3
  004      2      5      6      6       0       1       1       4
  005      2      5      6      6       0       2       2       3
  006      3      7      6      3       0       3       4       2
  007      0      5      5      9       0       0       2       2
  008      3      2      2     12       1       1       0       2
  009      1      6      5      7       0       1       2       4
  010      2      5      4      8       0       2       1       4

Number of cases read:  10    Number of cases listed:  10

The only drawback (aside from an awful lot of variables) is that this will only include variables for combinations that actually occur in your data.

Good luck to you!
Richard Ristow

==================================================
APPENDIX: All code, including generating test data
==================================================
* C:\Documents and Settings\Richard\My Documents .
* \Technical\spssx-l\Z-2015\ .
* 2015-03-24 Beckstead-Analyzing Sequences; help with LOOPS & VECTORS.SPS .

* In response to posting .
* Date: Tue, 24 Mar 2015 10:33:28 +0000 .
* From: "Beckstead, Jason" <[hidden email]> .
* Subject: Analyzing Sequences; help with LOOPS & VECTORS .
* To: [hidden email] .

* "I have sequences of 20 Xs and Os like this XXXXOOOOXXOXXXOOOOOX. .
* For each sequence I need to tally how many times X follows X, O .
* follows X, X follows O, and O follows O. Then I need to find how .
* many times X (and O) follows each of the previous doubles, XX, .
* XO, OO, OX. Then we move to how many times X (and O) follow all .
* 8 of the previous triples, all 16 of the previous 4ples, all 32 .
* of the previous 5ples, etc. These new count variables need to be .
* saved to a SAV file and matched back to the sequences for input .
* into another program that uses them to compute various .
* quantities for each sequence. " .

* ................................................................. .
* ................. Scratch file, for XSAVE ..................... .
FILE HANDLE Unroll
   /NAME='C:\Documents and Settings\Richard\My Documents' +
         '\Temporary\SPSS\' +
         '2015-03-24 Beckstead-Analyzing Sequences; ' +
         'help with LOOPS & VECTORS' +
         ' - ' +
         'UNROLL.SAV'.

* ................................................................. .
* ................. Test data ..................... .
SET RNG = MT /* 'Mersenne twister' random number generator */ .
SET MTINDEX = 5624 /* Boston, MA telephone book */ .

INPUT PROGRAM.
.  NUMERIC CaseID (N3).
.  STRING Xs.and.Os (A20).
.  STRING #Ltr (A1).
.  NUMERIC #Idx (F3).
.  LOOP CaseID = 1 TO 20.
*     Fill in string with 60% Xs, 40% Os ... .
.     LOOP #Idx = 1 TO 20.
.        COMPUTE #Ltr = SUBSTR('OX',1 + RV.Bernoulli(0.6),1).
.        COMPUTE SUBSTR(Xs.and.Os,#Idx,1) = #Ltr.
.     END LOOP.
.     END CASE.
.  END LOOP.
END FILE.
END INPUT PROGRAM.
DATASET NAME TestData WINDOW=FRONT.

. /**/ LIST /*-*/.

* ................ Post after this point ..................... .
* ................................................................ .
DATASET ACTIVATE TestData WINDOW=FRONT.

* Unroll, to give a separate record for each prefix-suffix pair .
* in the string: .

NUMERIC Posit  (F3).
NUMERIC PfxLen (F3).
STRING  Prefix (A5).
STRING  Suffix (A1).

VAR LABEL Posit  'Position of suffix character, in input string'
          PfxLen 'Length of prefix'
          Prefix 'Prefix text'
          Suffix 'Suffix letter'.

LOOP Posit = 2 TO 20.
.  COMPUTE Suffix = SUBSTR(Xs.and.Os,Posit,1).
.  LOOP PfxLen = 1 TO 2 /* Limit to 2-char. prefix for demo */.
.     DO IF PfxLen LT Posit.
.        COMPUTE Prefix = SUBSTR(Xs.and.Os,Posit-PfxLen,PfxLen).
.        XSAVE OUTFILE=Unroll /KEEP=CaseID Posit PfxLen Prefix Suffix.
.     END IF.
.  END LOOP.
END LOOP.

EXECUTE /* needed, in a transformation program run to do an XSAVE */.

GET FILE= Unroll.
DATASET NAME Unrolled WINDOW=FRONT.

* Count instances of distinct prefix-suffix pairs, by input record:.

DATASET DECLARE Summary.
AGGREGATE OUTFILE=Summary
   /BREAK = CaseID PfxLen Prefix Suffix
   /Instances 'Occurrences of this pattern' = NU.

DATASET ACTIVATE Summary WINDOW=FRONT.
LIST /CASES=15.

* Finally, if you really want to, you can put the data back in .
* 'wide' form, with a separate variable for each prefix-suffix .
* combination: .

DATASET COPY WideForm.
DATASET ACTIVATE WideForm WINDOW=FRONT.

STRING Label (A10).
VAR LABEL Label 'Length of prefix, text of prefix, and suffix'.

COMPUTE Label=CONCAT('L',STRING(PfxLen,F1),'.',
                     RTRIM(Prefix)        ,'.',
                     Suffix               ).

CASESTOVARS
   /ID      = CaseID
   /DROP    = PfxLen, Prefix, Suffix
   /INDEX   = Label
   /GROUPBY = VARIABLE
   /AUTOFIX = NO.

RECODE ALL (SYSMIS=0).

LIST /CASES=10 /VARIABLES=CaseID TO L2.OX.X.

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD
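For readers who want to check the tallies outside SPSS, the unroll-then-aggregate idea above can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the function names are hypothetical, and the wide_row helper enumerates every possible prefix-suffix label so that combinations that never occur still get a zero column, which sidesteps the drawback mentioned above.

```python
import itertools
from collections import Counter

def prefix_suffix_counts(seq, max_prefix_len=2):
    """Tally (prefix length, prefix, suffix) occurrences in one string."""
    counts = Counter()
    for posit in range(1, len(seq)):            # 0-based position of the suffix
        suffix = seq[posit]
        for pfx_len in range(1, max_prefix_len + 1):
            if pfx_len <= posit:                # enough characters to the left
                prefix = seq[posit - pfx_len:posit]
                counts[(pfx_len, prefix, suffix)] += 1
    return counts

def wide_row(counts, max_prefix_len=2, alphabet="OX"):
    """One value per possible label, zero-filled, named like L2.OX.X."""
    row = {}
    for n in range(1, max_prefix_len + 1):
        for letters in itertools.product(alphabet, repeat=n):
            prefix = "".join(letters)
            for suffix in alphabet:
                row[f"L{n}.{prefix}.{suffix}"] = counts.get((n, prefix, suffix), 0)
    return row

counts = prefix_suffix_counts("OXXXXOXOOXXXXXOXOXXO")   # CaseID 001 above
row = wide_row(counts)
print(row)
```

For CaseID 001 this reproduces the first line of the wide listing (1, 5, 5, 8, 0, 1, 2, 3 for L1.O.O through L2.OX.X).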
Great points Gene and Richard, way better to calculate all of the observed sub-sequences, which is

20 of length 1
19 of length 2
18 of length 3
....
1 of length 20

Which summed up is a total of 210. So even with 1 million cases, there are basically 210 million sub-sequences, large but not unmanageable. As opposed to the way I was doing it originally, calculating presence/absence of all the potential sub-sequences:

2^1 of length 1
2^2 of length 2
2^3 of length 3
.....
2^20 of length 20

Which results in over 2 million. Presumably there is no need to hold the empty vectors in memory; for data mining purposes I don't see how a zero-variance vector would help any. I can imagine, though, wanting to calculate population-wise stats, e.g. if the null model for 2-length tuples was:

OX - 20%
XO - 30%
OO - 5%
XX - 45%

You could compare observed versus expected for the entire sample (although I don't imagine you would want to do this for every set of tuples at once, for the reasons Rich mentioned). For all the different substrings this is only a few over 2 million cases though, so you just need to generate them.

Here is some syntax, https://dl.dropbox.com/s/alnxe3l0oaxil6c/AllPerm_XOXO.sps?dl=0, it works in around ~4 minutes for 100,000 cases on my machine. I note the bottleneck, though, is an aggregate where my machine runs out of memory for larger numbers of starting sequences. (On a server it should not be a problem.) At least with that you could chunk it up and be finished before lunchtime. |
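The two totals above are easy to verify. A quick Python check, with variable names of my own choosing:

```python
seq_len = 20

# Observed sub-sequences in one sequence: 20 of length 1, ..., 1 of length 20.
observed_per_sequence = sum(seq_len - k + 1 for k in range(1, seq_len + 1))

# Potential distinct patterns: 2^1 + 2^2 + ... + 2^20.
possible_patterns = sum(2 ** k for k in range(1, seq_len + 1))

print(observed_per_sequence)                # 210
print(observed_per_sequence * 1_000_000)    # 210000000 for a million sequences
print(possible_patterns)                    # 2097150, i.e. 2^21 - 2
```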
At 11:05 AM 3/25/2015, Andy W wrote:
>Here is some syntax,
>https://dl.dropbox.com/s/alnxe3l0oaxil6c/AllPerm_XOXO.sps?dl=0, it
>works in around ~4 minutes for 100,000 cases on my machine. I note
>the bottleneck though with an aggregate that my machine runs out
>memory for larger #'s of starting sequences.

As you conjectured in comments in that code, sorting, and then using PRESORTED on AGGREGATE, will help a lot.

In my experience, letting AGGREGATE build its tables in memory, i.e. not using PRESORTED, works well up to about 100,000 BREAK groups, is seriously slow with over 1,000,000 groups, with the degradation happening at various points in between. |
See below...
Jon Peck (no "h") aka Kim
Senior Software Engineer, IBM
[hidden email]
phone: 720-342-5621

From: Richard Ristow <[hidden email]>
To: [hidden email]
Date: 03/26/2015 01:01 PM
Subject: Re: [SPSSX-L] Analyzing Sequences; help with LOOPS & VECTORS
Sent by: "SPSSX(r) Discussion" <[hidden email]>

At 11:05 AM 3/25/2015, Andy W wrote:

>Here is some syntax,
>https://dl.dropbox.com/s/alnxe3l0oaxil6c/AllPerm_XOXO.sps?dl=0, it
>works in around ~4 minutes for 100,000 cases on my machine. I note
>the bottleneck though with an aggregate that my machine runs out
>memory for larger #'s of starting sequences.

As you conjectured in comments in that code, sorting, and then using PRESORTED on AGGREGATE, will help a lot.

In my experience, letting AGGREGATE build its tables in memory, i.e. not using PRESORTED, works well up to about 100,000 BREAK groups, is seriously slow with over 1,000,000 groups, with the degradation happening at various points in between.

>>>My guess is that this observation is highly dependent on the amount of available memory and whether using the 64- or 32-bit version of Statistics. For the slow scenario, it would be interesting to watch the Task Manager to see if the paging rate goes way up with a huge number of breaks. |
I'd written,
>>In my experience, letting AGGREGATE build its tables in memory, i.e.
>>not using PRESORTED, works well up to about 100,000 BREAK groups, is
>>seriously slow with over 1,000,000 groups, with the degradation
>>happening at various points in between.

At 03:26 PM 3/26/2015, Jon K Peck wrote:

>My guess is that this observation is highly dependent on the amount
>of available memory and whether using the 64- or 32-bit version of Statistics.

Point most definitely taken. All my experience is with 32-bit SPSS, running with about a gig of memory.

By the way, Jon, thanks for pointing out, years back, how well AGGREGATE now works without presorting, in most ordinary cases. I've found that useful in coding, ever since. |
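The memory trade-off being discussed can be illustrated with a loose Python analogy. This says nothing about how AGGREGATE is implemented internally; it only contrasts the two general strategies. An unsorted aggregation has to keep one table entry per break group in memory, while a presorted one can stream through the data holding a single group at a time.

```python
from collections import Counter
from itertools import groupby

def aggregate_in_memory(keys):
    """Unsorted input: keep a table with one entry per distinct group."""
    return dict(Counter(keys))

def aggregate_presorted(sorted_keys):
    """Sorted input: emit each group as soon as its run of keys ends."""
    for key, run in groupby(sorted_keys):
        yield key, sum(1 for _ in run)

keys = ["XO", "XX", "OX", "XX", "XO", "XX"]      # toy break values
table = aggregate_in_memory(keys)
streamed = dict(aggregate_presorted(sorted(keys)))
print(table == streamed)
```

The sort itself has a cost, so which route is faster will depend on data size and available memory, as noted above.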
In reply to this post by Beckstead, Jason
At 06:33 AM 3/24/2015, Beckstead, Jason wrote:
>I have sequences of 20 Xs and Os like this XXXXOOOOXXOXXXOOOOOX.
>For each sequence I need to tally how many times X follows X, O
>follows X, X follows O, and O follows O. Then I need to find how
>many times X (and O) follows each of the previous doubles, XX, XO, OO, OX. ...
>
>The code will need to run on 1,048,574 sequences and be extended to
>compute tallies for previous 6ples, 7ples, up to 18ples as well.

The number "1,048,574" is striking, being exactly 2 less than the number of distinct 20-character sequences made of Xs and Os (2^20 = 1,048,576). Will you be working with an enumeration of all possible sequences? |
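The arithmetic behind that observation, as a one-off Python check. Which two sequences would be left out is not stated anywhere in the thread, so the snippet only confirms the count.

```python
all_sequences = 2 ** 20          # distinct 20-character strings over {X, O}
stated_count = 1_048_574         # figure given in the original post

print(all_sequences)                  # 1048576
print(all_sequences - stated_count)   # 2
```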