Hey folks,
I have tried for many hours to figure out how to write a macro to divide a 380,000 case file into 38 files with 10,000 cases. My most recent attempt gives error: "A macro expansion required more storage than was available. Try running with more memory." Suggestions for fixing this code or another method that should work would be greatly appreciated!!

Thanks, Jim

DEFINE !Looper ()
!DO !i=1 !to 38.
Get file='D:\sbec\tea ids\Master ID List.sav'.
dataset name SSNList.
/* Define Ending point.*/
!let !temp=!blanks(0).
!do !cnt=1 !to !i
!Let !temp=!concat(!temp,!blanks(10000))
!doEnd.
!Let !EndNum=!Length(!temp).
/* Define start point.*/
!Let !j=!length(!substr(!blanks(!temp),9999)).
!Let !StartNum=!length(!concat(!blanks(!j),!blanks(1))).
Select if $casenum<=!StartNum & $casenum >=!EndNum.
SAVE TRANSLATE OUTFILE=!QUOTE(!CONCAT("d:\sbec\tea ids\newIDs\ID",!i,".txt"))
 /TYPE=CSV /MAP /REPLACE /CELLS=VALUES.
!DOEND.
!ENDDEFINE.
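For context on the error itself: the !BLANKS / !LENGTH counters build macro strings hundreds of thousands of characters long by the later iterations, which is what exhausts macro storage. The counting can instead be left to the transformation language, since the macro only has to paste !i into an arithmetic expression. A minimal, untested sketch of that one change to the same macro (file names and SAVE TRANSLATE options kept from the original):

DEFINE !Looper ()
!DO !i = 1 !TO 38
GET FILE='D:\sbec\tea ids\Master ID List.sav'.
DATASET NAME SSNList.
/* ID preserves the original case number; RANGE keeps the current block of 10,000 cases.*/
COMPUTE ID = $CASENUM.
SELECT IF RANGE(ID, (!i - 1) * 10000 + 1, !i * 10000).
SAVE TRANSLATE OUTFILE=!QUOTE(!CONCAT("d:\sbec\tea ids\newIDs\ID",!i,".txt"))
 /TYPE=CSV /MAP /REPLACE /CELLS=VALUES.
!DOEND
!ENDDEFINE.
!Looper.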
And, yes, I see I did my >, < in the wrong directions.
In reply to this post by Van Overschelde, Jim
see the archive for writing out separate files. IIRC there is a Python
method.
UNTESTED. Not sure if you need the +1 on the mod. You may not need to randomize the order of cases. NCY -- NO Coffee Yet. This is what you would want the macro or Python to do, or you can just write the 38 sets of xsaves.

compute RandomOrder = uniform(31**2).
sort cases by RandomOrder.
compute WhichFile = mod($casenum, 10000)+1.
do if WhichFile eq 1.
xsave outfile = 'j:Get file='D:\sbec\tea ids\Master ID List subset 1.sav'.
else if WhichFile eq 2.
xsave outfile = 'j:Get file='D:\sbec\tea ids\Master ID List subset 2.sav'.
else if WhichFile eq 3.
xsave outfile = 'j:Get file='D:\sbec\tea ids\Master ID List subset 3.sav'.
. . .
else if WhichFile eq 38.
xsave outfile = 'j:Get file='D:\sbec\tea ids\Master ID List subset 38.sav'.
else.
print /'oops WhichFile is ' WhichFile.
end if.
frequencies variables = WhichFile.

However, why are you doing this? There may be other approaches.

Art Kendall
Social Research Consultants
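A small correction for anyone reusing the sketch above (still untested): the xsave OUTFILE strings have picked up a stray "Get file=" fragment, and mod($casenum, 10000)+1 yields up to 10,000 distinct values rather than 38, so nearly every case would fall through to the ELSE branch. Either of the following grouping computations produces exactly 38 files of 10,000 cases when fed to the same do if / xsave structure:

* Contiguous blocks of 10,000 cases, in the original order.
compute WhichFile = trunc(($casenum - 1)/10000) + 1.

* Or, after the random sort, deal cases round-robin into 38 files.
compute WhichFile = mod($casenum - 1, 38) + 1.

The xsave lines themselves would then read, e.g., xsave outfile = 'D:\sbec\tea ids\Master ID List subset 1.sav'.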
In reply to this post by Van Overschelde, Jim
Here is a simple way to do this using the
SPSSINC SPLIT DATASET extension command available from the SPSS Community
website (www.ibm.com/developerworks/spssdevcentral)
with far fewer data passes. In this example the output file names
would be fraction_1, fraction_2...
COMPUTE group=trunc($casenum/10000).
SPSSINC SPLIT DATASET SPLITVAR=group
 /OUTPUT DIRECTORY="c:\temp\splits"
 /OPTIONS NAMES=VALUES NAMEPREFIX="fraction".

This extension command and its companion, SPSSINC PROCESS FILES, require the Python Essentials available through the Community site. This example produces sav files while your macro was attempting to produce text files. That could be addressed with another step.

Jon Peck (no "h") aka Kim
Senior Software Engineer, IBM
[hidden email]
phone: 720-342-5621
In reply to this post by Art Kendall
I don't think Art meant to have those GET
commands in there. But putting that aside, this is an example of
how NOT to do a task even though it would work.
It is painful to write all that code, and, worse, the chances of getting it exactly right are not great - boredom will set in long before that many XSAVE commands are written, so careful testing and code review is required. Finally, it is very specific to these particular numbers, so it doesn't make a good model for a general solution.

Generalization + correctness + pain reduction = Python

Jon Peck (no "h") aka Kim
Senior Software Engineer, IBM
[hidden email]
phone: 720-342-5621
In reply to this post by Jon K Peck
I am breathlessly awaiting the day when IBM incorporates these extension commands into the BASE system rather than requiring external dependencies such as Python!
--
Please reply to the list and not to my personal email.
Those desiring my consulting or training services please feel free to email me. --- "Nolite dare sanctum canibus neque mittatis margaritas vestras ante porcos ne forte conculcent eas pedibus suis." Cum es damnatorum possederunt porcos iens ut salire off sanguinum cliff in abyssum?" |
Good luck with that.
--
Bruce Weaver bweaver@lakeheadu.ca http://sites.google.com/a/lakeheadu.ca/bweaver/ "When all else fails, RTFM." PLEASE NOTE THE FOLLOWING: 1. My Hotmail account is not monitored regularly. To send me an e-mail, please use the address shown above. 2. The SPSSX Discussion forum on Nabble is no longer linked to the SPSSX-L listserv administered by UGA (https://listserv.uga.edu/). |
I figure sometimes squeaky wheels get oiled ;-)
--
Please reply to the list and not to my personal email.
Those desiring my consulting or training services please feel free to email me. --- "Nolite dare sanctum canibus neque mittatis margaritas vestras ante porcos ne forte conculcent eas pedibus suis." Cum es damnatorum possederunt porcos iens ut salire off sanguinum cliff in abyssum?" |
In reply to this post by Jon K Peck
Thanks Jon. I added a short macro to loop through and convert the files from .sav to .txt and the whole thing worked great! The size of the subset files has to vary given the requirements of the state's data system, so this will save many hours of my time.
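For anyone wanting the same extra step, a minimal, untested sketch of the sort of conversion loop Jim describes. The file names and path (fraction_1.sav through fraction_38.sav under c:\temp\splits) are illustrative assumptions, not Jim's actual code:

DEFINE !SavToTxt (nfiles = !TOKENS(1))
!DO !i = 1 !TO !nfiles
/* Open one split file and re-save it as CSV text.*/
GET FILE=!QUOTE(!CONCAT("c:\temp\splits\fraction_",!i,".sav")).
SAVE TRANSLATE OUTFILE=!QUOTE(!CONCAT("c:\temp\splits\fraction_",!i,".txt"))
 /TYPE=CSV /MAP /REPLACE /CELLS=VALUES.
!DOEND
!ENDDEFINE.
!SavToTxt nfiles=38.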
*Numbering cases and Selecting every nth parcel.
*Sample data created for 380,000 cases.
input program.
loop #i = 1 to 380000.
do repeat x = v1 v2 v3.
compute x = trunc(uniform(10000))+1.
end repeat.
end case.
end loop.
end file.
end input program.
execute.
*Cross check (control end) to be sure the cases are numbered 380,000.
COMPUTE newvars = MOD($CASENUM-1,10000)+1.
FORMATS newvars (F1.0).
EXECUTE.
*As a cross check, selects only each 10000th case.
SELECT IF newvars = 1.
EXECUTE.

Sackey
I'm pretty certain that's not what Jim was looking for.
Reread the original post. --
Please reply to the list and not to my personal email.
Those desiring my consulting or training services please feel free to email me. --- "Nolite dare sanctum canibus neque mittatis margaritas vestras ante porcos ne forte conculcent eas pedibus suis." Cum es damnatorum possederunt porcos iens ut salire off sanguinum cliff in abyssum?" |
In reply to this post by Jon K Peck
For those who have an instinctive resistance to unnecessarily adding Python to the mix:
No Python, No Pain, pretty general and.......CORRECT !!!!!!!
I'll drink the Koolaid when I need to build an application which requires metadata access or data cursors. Maybe I am a heretic, but this is too simple to break, and KISS!!! Viva La Resistance.

DEFINE BreakOut (NBreak !TOKENS(1) / BlockSize !TOKENS(1)).
+ COMPUTE ID=$CASENUM.
+ !DO !I=1 !TO !NBreak .
+ DO IF RANGE(ID, (!I-1)*!BlockSize +1,!I*!BlockSize ).
+ XSAVE OUTFILE !QUOTE(!CONCAT(!UNQUOTE('G:\TEMP3\junk'),!I,'.sav')).
+ END IF.
+ !DOEND .
+ EXECUTE.
!ENDDEFINE.

**SET MPRINT ON.
BreakOut NBreak=38 BlockSize=10000.
Please reply to the list and not to my personal email.
Those desiring my consulting or training services please feel free to email me. --- "Nolite dare sanctum canibus neque mittatis margaritas vestras ante porcos ne forte conculcent eas pedibus suis." Cum es damnatorum possederunt porcos iens ut salire off sanguinum cliff in abyssum?" |
This ugly code will fail if there are more
than 10 or 64 groups, depending on the Statistics version.
Jon Peck (no "h") aka Kim
Senior Software Engineer, IBM
[hidden email]
phone: 720-342-5621
I'll bet the hair on my ugly monkey macro that the internals of SPLIT DATASET are pretty ugly as well.
Maybe an elegant white-space-sensitive sort of ugly. OTOH: maybe IBM should spend some time on enhancing old workhorses rather than peddling Python as the panacea for all our woes! Oh, they would rather port everything to the MVM (at that point I'm out for good).
---------------
If there are more than 64 groups, simply toss an EXE into the mix (probably takes a couple more lines of 'ugly' code).
Please reply to the list and not to my personal email.
Those desiring my consulting or training services please feel free to email me. --- "Nolite dare sanctum canibus neque mittatis margaritas vestras ante porcos ne forte conculcent eas pedibus suis." Cum es damnatorum possederunt porcos iens ut salire off sanguinum cliff in abyssum?" |
Just a shout from the crowd in approval of David Marso.
Although Jon Peck's Python-based contribution to the development of Statistics is enormous, SPSS shouldn't generally rely on Python. Python is quite a slow language and is better suited to the front end, I think (you won't generally wish to compose statistical algorithms in Python). Old facilities such as MATRIX and MACROS *should* be revised and updated, and that must be done in a more low-level language such as Fortran. Also, while Python-composed functions/commands can be packaged as SPSS-style syntax (Jon does it very well), Python itself, *inside*, is very different from SPSS syntax, so using Python statements between BEGIN PROGRAM - END PROGRAM in conjunction with SPSS syntax looks *very* ugly.

On 09.04.2013 3:15, David Marso wrote:
> I'll bet the hair on my ugly monkey macro that the internals of SPLIT DATASET are pretty ugly as well. [...]
Subject: Re: [SPSSX-L] Dividing file into 10,000 case chunks
> Just a shout from the crowd in approval of David Marso. [...] using Python statements between BEGIN PROGRAM - END PROGRAM in conjunction with SPSS syntax looks *very* ugly.

I don't really care whether it's ugly or not. I do care whether code is readable, maintainable, and whether it gets the job done in the first place. Python is a rapid development language, so it's much easier to achieve things than in e.g. Fortran. It's true that Python is not the fastest language around, but in many cases that is not important at all (e.g., when working with metadata). If really needed, computationally intensive things can be done in numpy, scipy, or using Cython or even C/ctypes. Data giants such as Google and YouTube use Python. Linux uses Python. GIMP and LibreOffice use Python. That's because Python is more than just a string parser.
(A wandering remark) The very fact that so many rabbits and such tall giants endorse it is enough to decide to reject using it. For me, people and languages must be divergent. Something "readable, maintainable" is rather a shortcoming than an asset, in the end.
On 09.04.2013 17:33, Albert-Jan Roskam wrote:
> I don't really care whether it's ugly or not. I do care whether code is readable, maintainable, and whether it gets the job done in the first place. [...]
> For me, people and languages must be divergent. Something "readable, maintainable" is rather a shortcoming than an asset, in the end.

Why? *puzzled* Does this also imply that cryptic, duplicated code is a good thing? How does this relate to software quality standards, e.g. https://en.wikipedia.org/wiki/ISO/IEC_9126 ?
In reply to this post by Kirill Orlov
Although I think this is what Kirill meant when he said "Python is quite slow language and is better for front end, I think (you won't generally wish to compose statistical algorithms on Python)", I think a clearer distinction should be made here. Some Python code is used to write SPSS code (some call this metaprogramming). Ignoring opinions on readability, ugliness and maintainability, in these cases there will be little overhead in calling the code. Also, it adds a relatively fixed amount of time to the job, which does not grow with the size of the data (at least the part where Python writes the code, as opposed to the total time it takes to run the job). In these cases Python acts very much like current macros, but is much more flexible and has access to various metadata (in addition to the output). I believe Jon's SPSSINC PROCESS FILES is an example of this. (There are of course counter-examples, but hopefully this is a reasonable characterization.)
The other case is actually passing data to and from Python, which will cause code to run slower and will be problematic for really big data (ditto for R). In those cases it is preferable to find a solution to data management in native SPSS code, although one shouldn't be concerned about using Python for the metaprogramming tools, IMO. My 2 cents (from someone who rarely uses Python and works mostly with relatively big data files), Andy
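To make the metaprogramming case concrete, here is a rough, untested sketch of Python writing ordinary SPSS syntax rather than touching the data itself. The output path and variable name are illustrative only, and the XSAVE open-file limit Jon mentioned still applies within the generated transformation pass:

BEGIN PROGRAM.
# Python only assembles the command text; SPSS runs it and does all the data passes.
import spss
nfiles, blocksize = 38, 10000
cmds = ["COMPUTE chunk = TRUNC(($CASENUM - 1)/%d) + 1." % blocksize,
        "DO IF chunk = 1.",
        "XSAVE OUTFILE='c:/temp/splits/chunk1.sav'."]
for i in range(2, nfiles + 1):
    cmds.append("ELSE IF chunk = %d." % i)
    cmds.append("XSAVE OUTFILE='c:/temp/splits/chunk%d.sav'." % i)
cmds += ["END IF.", "EXECUTE."]
spss.Submit(cmds)
END PROGRAM.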
In reply to this post by David Marso
Yep! 2 more lines of ugly monkey macro poop. When will it ever learn to add?
Added code is the !LET / !IF bookkeeping, which forces an EXECUTE after every 64 XSAVEs.
--
DATASET DECLARE test.
MATRIX.
SAVE (UNIFORM(1000000,5)) /OUTFILE test /VARIABLES V01 TO V05.
END MATRIX.

DEFINE BreakOut (NBreak !TOKENS(1) / BlockSize !TOKENS(1)).
+ COMPUTE ID=$CASENUM.
+ !LET !L = "".
+ !DO !I=1 !TO !NBreak .
+ DO IF RANGE(ID, (!I-1)*!BlockSize +1,!I*!BlockSize ).
+ XSAVE OUTFILE !QUOTE(!CONCAT(!UNQUOTE('G:\TEMP3\junk'),!I,'.sav')).
+ END IF.
+ !LET !L=!CONCAT(!L,"x")
+ !IF ( !LENGTH(!L) !EQ 64 ) !THEN
+ EXECUTE.
+ !LET !L=""
+ !IFEND
+ !DOEND .
EXECUTE.
!ENDDEFINE.

SET MPRINT ON PRINTBACK ON.
BreakOut NBreak=100 BlockSize=10000.
Please reply to the list and not to my personal email.
Those desiring my consulting or training services please feel free to email me. --- "Nolite dare sanctum canibus neque mittatis margaritas vestras ante porcos ne forte conculcent eas pedibus suis." Cum es damnatorum possederunt porcos iens ut salire off sanguinum cliff in abyssum?" |