Hey folks,
I have tried for many hours to figure out how to write a macro to divide a 380,000 case file into 38 files with 10,000 cases. My most recent attempt gives error: "A macro expansion required more storage than was available. Try running with more memory." Suggestions for fixing this code or another method that should work would be greatly appreciated!!

Thanks, Jim

DEFINE !Looper ()
!DO !i=1 !to 38.
Get file='D:\sbec\tea ids\Master ID List.sav'.
dataset name SSNList.
/* Define Ending point.*/
!let !temp=!blanks(0).
!do !cnt=1 !to !i
!Let !temp=!concat(!temp,!blanks(10000))
!doEnd.
!Let !EndNum=!Length(!temp).
/* Define start point.*/
!Let !j=!length(!substr(!blanks(!temp),9999)).
!Let !StartNum=!length(!concat(!blanks(!j),!blanks(1))).
Select if $casenum<=!StartNum & $casenum >=!EndNum.
SAVE TRANSLATE OUTFILE=!QUOTE(!CONCAT("d:\sbec\tea ids\newIDs\ID",!i,".txt"))
 /TYPE=CSV /MAP /REPLACE /CELLS=VALUES.
!DOEND.
!ENDDEFINE.
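For context on the error itself: the !BLANKS / !LENGTH counters build macro strings hundreds of thousands of characters long by the later iterations, which is what exhausts macro storage. The counting can instead be left to the transformation language, since the macro only has to paste !i into an arithmetic expression. A minimal, untested sketch of that one change to the same macro (file names and SAVE TRANSLATE options kept from the original):

DEFINE !Looper ()
!DO !i = 1 !TO 38
GET FILE='D:\sbec\tea ids\Master ID List.sav'.
DATASET NAME SSNList.
/* ID preserves the original case number; RANGE keeps the current block of 10,000 cases.*/
COMPUTE ID = $CASENUM.
SELECT IF RANGE(ID, (!i - 1) * 10000 + 1, !i * 10000).
SAVE TRANSLATE OUTFILE=!QUOTE(!CONCAT("d:\sbec\tea ids\newIDs\ID",!i,".txt"))
 /TYPE=CSV /MAP /REPLACE /CELLS=VALUES.
!DOEND
!ENDDEFINE.
!Looper.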
And, yes, I see I did my >, < in the wrong directions.
In reply to this post by Van Overschelde, Jim
see the archive for writing out separate files. IIRC there is a Python
method.
UNTESTED. Not sure if you need the +1 on the mod. You may not need to randomize the order of cases. NCY -- NO Coffee Yet. This is what you would want the macro or Python to do, or you can just write the 38 sets of xsaves.

compute RandomOrder = uniform(31**2).
sort cases by RandomOrder.
compute WhichFile = mod($casenum, 10000)+1.
do if WhichFile eq 1.
xsave outfile = 'j:Get file='D:\sbec\tea ids\Master ID List subset 1.sav'.
else if WhichFile eq 2.
xsave outfile = 'j:Get file='D:\sbec\tea ids\Master ID List subset 2.sav'.
else if WhichFile eq 3.
xsave outfile = 'j:Get file='D:\sbec\tea ids\Master ID List subset 3.sav'.
. . .
else if WhichFile eq 38.
xsave outfile = 'j:Get file='D:\sbec\tea ids\Master ID List subset 38.sav'.
else.
print /'oops WhichFile is ' WhichFile.
end if.
frequencies variables = WhichFile.

However, why are you doing this? There may be other approaches.

Art Kendall
Social Research Consultants
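A small correction for anyone reusing the sketch above (still untested): the xsave OUTFILE strings have picked up a stray "Get file=" fragment, and mod($casenum, 10000)+1 yields up to 10,000 distinct values rather than 38, so nearly every case would fall through to the ELSE branch. Either of the following grouping computations produces exactly 38 files of 10,000 cases when fed to the same do if / xsave structure:

* Contiguous blocks of 10,000 cases, in the original order.
compute WhichFile = trunc(($casenum - 1)/10000) + 1.

* Or, after the random sort, deal cases round-robin into 38 files.
compute WhichFile = mod($casenum - 1, 38) + 1.

The xsave lines themselves would then read, e.g., xsave outfile = 'D:\sbec\tea ids\Master ID List subset 1.sav'.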
In reply to this post by Van Overschelde, Jim
Here is a simple way to do this using the
SPSSINC SPLIT DATASET extension command available from the SPSS Community
website (www.ibm.com/developerworks/spssdevcentral)
with far fewer data passes. In this example the output file names
would be fraction_1, fraction_2...
COMPUTE group=trunc($casenum/10000).
SPSSINC SPLIT DATASET SPLITVAR=group
 /OUTPUT DIRECTORY="c:\temp\splits"
 /OPTIONS NAMES=VALUES NAMEPREFIX="fraction".

This extension command and its companion, SPSSINC PROCESS FILES, require the Python Essentials available through the Community site. This example produces sav files while your macro was attempting to produce text files. That could be addressed with another step.

Jon Peck (no "h") aka Kim
Senior Software Engineer, IBM
[hidden email]
phone: 720-342-5621
In reply to this post by Art Kendall
I don't think Art meant to have those GET
commands in there. But putting that aside, this is an example of
how NOT to do a task even though it would work.
It is painful to write all that code, and, worse, the chances of getting it exactly right are not great - boredom will set in long before that many XSAVE commands are written, so careful testing and code review is required. Finally, it is very specific to these particular numbers, so it doesn't make a good model for a general solution.

Generalization + correctness + pain reduction = Python

Jon Peck (no "h") aka Kim
Senior Software Engineer, IBM
[hidden email]
phone: 720-342-5621
In reply to this post by Jon K Peck
I am breathlessly awaiting the day when IBM incorporates these extension commands into the BASE system rather than requiring external dependencies such as Python!
--
Please reply to the list and not to my personal email.
Those desiring my consulting or training services please feel free to email me. --- "Nolite dare sanctum canibus neque mittatis margaritas vestras ante porcos ne forte conculcent eas pedibus suis." Cum es damnatorum possederunt porcos iens ut salire off sanguinum cliff in abyssum?" |
Good luck with that.
--
Bruce Weaver bweaver@lakeheadu.ca http://sites.google.com/a/lakeheadu.ca/bweaver/ "When all else fails, RTFM." PLEASE NOTE THE FOLLOWING: 1. My Hotmail account is not monitored regularly. To send me an e-mail, please use the address shown above. 2. The SPSSX Discussion forum on Nabble is no longer linked to the SPSSX-L listserv administered by UGA (https://listserv.uga.edu/). |
I figure sometimes squeaky wheels get oiled ;-)
--
Please reply to the list and not to my personal email.
Those desiring my consulting or training services please feel free to email me. --- "Nolite dare sanctum canibus neque mittatis margaritas vestras ante porcos ne forte conculcent eas pedibus suis." Cum es damnatorum possederunt porcos iens ut salire off sanguinum cliff in abyssum?" |
In reply to this post by Jon K Peck
Thanks Jon. I added a short macro to loop through and convert the files from .sav to .txt and the whole thing worked great! The size of the subset files has to vary given the requirements of the state's data system, so this will save many hours of my time.
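For anyone wanting the same extra step, a minimal, untested sketch of the sort of conversion loop Jim describes. The file names and path (fraction_1.sav through fraction_38.sav under c:\temp\splits) are illustrative assumptions, not Jim's actual code:

DEFINE !SavToTxt (nfiles = !TOKENS(1))
!DO !i = 1 !TO !nfiles
/* Open one split file and re-save it as CSV text.*/
GET FILE=!QUOTE(!CONCAT("c:\temp\splits\fraction_",!i,".sav")).
SAVE TRANSLATE OUTFILE=!QUOTE(!CONCAT("c:\temp\splits\fraction_",!i,".txt"))
 /TYPE=CSV /MAP /REPLACE /CELLS=VALUES.
!DOEND
!ENDDEFINE.
!SavToTxt nfiles=38.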
*Numbering cases and Selecting every nth parcel.
*Sample data created for 380,000 cases.
input program.
loop #i = 1 to 380000.
do repeat x = v1 v2 v3.
compute x = trunc(uniform(10000))+1.
end repeat.
end case.
end loop.
end file.
end input program.
execute.
*Cross check (control end) to be sure the cases are numbered 380,000.
COMPUTE newvars = MOD($CASENUM-1,10000)+1.
FORMATS newvars (F1.0).
EXECUTE.
*As a cross check, selects only each 10000th case.
SELECT IF newvars = 1.
EXECUTE.

Sackey
I'm pretty certain that's not what Jim was looking for.
Reread the original post. --
Please reply to the list and not to my personal email.
Those desiring my consulting or training services please feel free to email me. --- "Nolite dare sanctum canibus neque mittatis margaritas vestras ante porcos ne forte conculcent eas pedibus suis." Cum es damnatorum possederunt porcos iens ut salire off sanguinum cliff in abyssum?" |
In reply to this post by Jon K Peck
For those who have an instinctive resistance to unnecessarily adding Python to the mix:
No Python, No Pain, pretty general and.......CORRECT !!!!!!!
I'll drink the Koolaid when I need to build an application which requires metadata access or data cursors. Maybe I am a heretic, but this is too simple to break, and KISS!!! Viva La Resistance.

DEFINE BreakOut (NBreak !TOKENS(1) / BlockSize !TOKENS(1)).
+ COMPUTE ID=$CASENUM.
+ !DO !I=1 !TO !NBreak .
+ DO IF RANGE(ID, (!I-1)*!BlockSize +1,!I*!BlockSize ).
+ XSAVE OUTFILE !QUOTE(!CONCAT(!UNQUOTE('G:\TEMP3\junk'),!I,'.sav')).
+ END IF.
+ !DOEND .
+ EXECUTE.
!ENDDEFINE.

**SET MPRINT ON.
BreakOut NBreak=38 BlockSize=10000.
Please reply to the list and not to my personal email.
Those desiring my consulting or training services please feel free to email me. --- "Nolite dare sanctum canibus neque mittatis margaritas vestras ante porcos ne forte conculcent eas pedibus suis." Cum es damnatorum possederunt porcos iens ut salire off sanguinum cliff in abyssum?" |
This ugly code will fail if there are more
than 10 or 64 groups, depending on the Statistics version.
Jon Peck (no "h") aka Kim
Senior Software Engineer, IBM
[hidden email]
phone: 720-342-5621
I'll bet the hair on my ugly monkey macro that the internals of SPLIT DATASET are pretty ugly as well.
Maybe an elegant white-space-sensitive sort of ugly. OTOH: maybe IBM should spend some time on enhancing old workhorses rather than peddling Python as the panacea for all our woes! Oh, they would rather port everything to the MVM (at that point I'm out for good).
---------------
If there are more than 64 groups, simply toss an EXE into the mix (probably takes a couple more lines of 'ugly' code).
Please reply to the list and not to my personal email.
Those desiring my consulting or training services please feel free to email me. --- "Nolite dare sanctum canibus neque mittatis margaritas vestras ante porcos ne forte conculcent eas pedibus suis." Cum es damnatorum possederunt porcos iens ut salire off sanguinum cliff in abyssum?" |
Just a shout from the crowd in approval of David Marso.
Although Jon Peck's Python-based contribution to the development of Statistics is enormous, SPSS shouldn't generally rely on Python. Python is quite a slow language and is better suited to the front end, I think (you won't generally wish to compose statistical algorithms in Python). Old facilities such as MATRIX and MACROS *should* be revised and updated, and that must be done in a more low-level language such as Fortran. Also, while Python-composed functions/commands can be packaged as SPSS-style syntax (Jon does it very well), Python itself, *inside*, is very different from SPSS syntax, so using Python statements between BEGIN PROGRAM - END PROGRAM in conjunction with SPSS syntax looks *very* ugly.

On 09.04.2013 3:15, David Marso wrote:
> I'll bet the hair on my ugly monkey macro that the internals of SPLIT DATASET are pretty ugly as well. [...]
Subject: Re: [SPSSX-L] Dividing file into 10,000 case chunks
> Just a shout from the crowd in approval of David Marso. [...] using Python statements between BEGIN PROGRAM - END PROGRAM in conjunction with SPSS syntax looks *very* ugly.

I don't really care whether it's ugly or not. I do care whether code is readable, maintainable, and whether it gets the job done in the first place. Python is a rapid development language, so it's much easier to achieve things than in e.g. Fortran. It's true that Python is not the fastest language around, but in many cases that is not important at all (e.g., when working with metadata). If really needed, computationally intensive things can be done in numpy, scipy, or using Cython or even C/ctypes. Data giants such as Google and YouTube use Python. Linux uses Python. GIMP and LibreOffice use Python. That's because Python is more than just a string parser.
(A wandering remark) The very fact that so many rabbits and such tall giants endorse it is enough to decide to reject using it. For me, people and languages must be divergent. Something "readable, maintainable" is rather a shortcoming than an asset, in the end.
On 09.04.2013 17:33, Albert-Jan Roskam wrote:
> I don't really care whether it's ugly or not. I do care whether code is readable, maintainable, and whether it gets the job done in the first place. [...]
> For me, people and languages must be divergent. Something "readable, maintainable" is rather a shortcoming than an asset, in the end.

Why? *puzzled* Does this also imply that cryptic, duplicated code is a good thing? How does this relate to software quality standards, e.g. https://en.wikipedia.org/wiki/ISO/IEC_9126 ?
In reply to this post by Kirill Orlov
Although I think this is what Kirill meant when he said "Python is quite slow language and is better for front end, I think (you won't generally wish to compose statistical algorithms on Python)", I think a clearer distinction should be made here. Some Python code is used to write SPSS code (some call this metaprogramming). Ignoring opinions on readability, ugliness and maintainability, in these cases there will be little overhead in calling the code. Also, it adds a relatively fixed amount of time to the job, which does not grow with the size of the data (at least the part where Python writes the code, as opposed to the total time it takes to run the job). In these cases Python acts very much like current macros, but is much more flexible and has access to various metadata (in addition to the output). I believe Jon's SPSSINC PROCESS FILES is an example of this. (There are of course counter-examples, but hopefully this is a reasonable characterization.)
The other case is actually passing data to and from Python, which will cause code to run slower and will be problematic for really big data (ditto for R). In those cases it is preferable to find a solution to data management in native SPSS code, although one shouldn't be concerned about using Python for the metaprogramming tools, IMO. My 2 cents (from someone who rarely uses Python and works mostly with relatively big data files), Andy
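To make the metaprogramming case concrete, here is a rough, untested sketch of Python writing ordinary SPSS syntax rather than touching the data itself. The output path and variable name are illustrative only, and the XSAVE open-file limit Jon mentioned still applies within the generated transformation pass:

BEGIN PROGRAM.
# Python only assembles the command text; SPSS runs it and does all the data passes.
import spss
nfiles, blocksize = 38, 10000
cmds = ["COMPUTE chunk = TRUNC(($CASENUM - 1)/%d) + 1." % blocksize,
        "DO IF chunk = 1.",
        "XSAVE OUTFILE='c:/temp/splits/chunk1.sav'."]
for i in range(2, nfiles + 1):
    cmds.append("ELSE IF chunk = %d." % i)
    cmds.append("XSAVE OUTFILE='c:/temp/splits/chunk%d.sav'." % i)
cmds += ["END IF.", "EXECUTE."]
spss.Submit(cmds)
END PROGRAM.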
In reply to this post by David Marso
Yep! 2 more lines of ugly monkey macro poop. When will it ever learn to add?
Added code is the !LET / !IF bookkeeping, which forces an EXECUTE after every 64 XSAVEs.
--
DATASET DECLARE test.
MATRIX.
SAVE (UNIFORM(1000000,5)) /OUTFILE test /VARIABLES V01 TO V05.
END MATRIX.

DEFINE BreakOut (NBreak !TOKENS(1) / BlockSize !TOKENS(1)).
+ COMPUTE ID=$CASENUM.
+ !LET !L = "".
+ !DO !I=1 !TO !NBreak .
+ DO IF RANGE(ID, (!I-1)*!BlockSize +1,!I*!BlockSize ).
+ XSAVE OUTFILE !QUOTE(!CONCAT(!UNQUOTE('G:\TEMP3\junk'),!I,'.sav')).
+ END IF.
+ !LET !L=!CONCAT(!L,"x")
+ !IF ( !LENGTH(!L) !EQ 64 ) !THEN
+ EXECUTE.
+ !LET !L=""
+ !IFEND
+ !DOEND .
EXECUTE.
!ENDDEFINE.

SET MPRINT ON PRINTBACK ON.
BreakOut NBreak=100 BlockSize=10000.
Please reply to the list and not to my personal email.
Those desiring my consulting or training services please feel free to email me. --- "Nolite dare sanctum canibus neque mittatis margaritas vestras ante porcos ne forte conculcent eas pedibus suis." Cum es damnatorum possederunt porcos iens ut salire off sanguinum cliff in abyssum?" |