SPSSX Discussion

Splitting a large file

Hector Maletta

Splitting a large file

I need to generate separate files for each country-year combination. Easy enough to do with some ordinary SELECT IF and SAVE commands, of course, but the trouble is the time it would take to execute. SPSS would read the entire file of 140 million cases each time it has to find and save the cases showing a certain country-year combination. Reading the file (and saving the selected cases) takes a significant amount of time, and dramatically slows my computer for other tasks while it’s at it.

Therefore I’d like to produce the 40 files in possibly one read, or in as few reads as feasible. Ideally, the process would read and save the first n cases (sharing the same xyz value), then proceed to read and save the second batch of m cases, and so on till saving the last batch. The various files thus produced should be automatically named, say xyz.sav for any particular xyz value.

Does anyone have a solution ready? Otherwise I will have to take on the thankless (and possibly time-consuming) task of programming this, or (still worse) have my computer tied up for two days while it does this boring chore.

Thanks in advance

Hector
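The one-read idea described above can be sketched outside SPSS. This is only an illustration under stated assumptions: it uses plain CSV rows as a stand-in for SPSS cases, it assumes the rows arrive already sorted by the ID, and the function name `split_sorted_rows` and the sample values are hypothetical.

```python
import csv
import itertools
import os
import tempfile


def split_sorted_rows(rows, key_index, out_dir):
    """Write each run of consecutive rows sharing a key to <key>.csv.

    Assumes the rows are already sorted by the key, so every group is one
    contiguous run and the input is consumed exactly once. If the input
    were not sorted, a key seen again later would overwrite its file.
    """
    written = {}
    for key, group in itertools.groupby(rows, key=lambda r: r[key_index]):
        path = os.path.join(out_dir, f"{key}.csv")
        with open(path, "w", newline="") as handle:
            writer = csv.writer(handle)
            count = 0
            for row in group:
                writer.writerow(row)
                count += 1
        written[key] = count
    return written


# Hypothetical sample: (census ID, person number), sorted by census ID.
sample = [(431, 1), (431, 2), (432, 1), (511, 1), (511, 2), (511, 3)]
with tempfile.TemporaryDirectory() as tmp:
    counts = split_sorted_rows(iter(sample), key_index=0, out_dir=tmp)
    files = sorted(os.listdir(tmp))

print(counts)
print(files)
```

Only one output file is open at a time, which is what makes the sorted case cheap: each batch is written and closed before the next one starts.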

David Marso

Re: Splitting a large file

Administrator

Hector,
Please see the XSAVE command.
From FM: "Limitations: Maximum of 64 XSAVE commands are allowed within a single set of transformations."
If you are clever you could likely automate this, but otherwise just copy/paste assuming you know the values of the variable you are splitting the file on.
--
DO IF (var EQ 1).
XSAVE OUTFILE 'f1'.
ELSE IF (var EQ 2).
XSAVE OUTFILE 'f2'.
* ... one ELSE IF and XSAVE pair for each remaining value.
END IF.
EXECUTE.
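If the values are known in advance, the copy/paste step can be replaced by generating the syntax text. The following is a minimal sketch, not anything posted in the thread: the function name `xsave_chain` and the `prefix` parameter are hypothetical, and the limit of 64 is taken from the manual quote above.

```python
def xsave_chain(var_name, values, prefix="f"):
    """Return SPSS syntax with one DO IF / ELSE IF branch and XSAVE per value."""
    if not values:
        raise ValueError("need at least one value")
    if len(values) > 64:
        # Limit quoted from the manual: at most 64 XSAVE commands per
        # set of transformations.
        raise ValueError("more than 64 XSAVE commands in one block")
    lines = []
    for i, value in enumerate(values):
        # The first branch opens the block; every later one continues it.
        keyword = "DO IF" if i == 0 else "ELSE IF"
        lines.append(f"{keyword} ({var_name} EQ {value}).")
        lines.append(f"XSAVE OUTFILE '{prefix}{value}'.")
    lines.append("END IF.")
    lines.append("EXECUTE.")
    return "\n".join(lines)


print(xsave_chain("var", [1, 2, 3]))
```

The printed text can be pasted into a syntax window; the generator itself never touches the data.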

Please reply to the list and not to my personal email.
Those desiring my consulting or training services please feel free to email me.
---
"Nolite dare sanctum canibus neque mittatis margaritas vestras ante porcos ne forte conculcent eas pedibus suis."
Cum es damnatorum possederunt porcos iens ut salire off sanguinum cliff in abyssum?"

Hector Maletta

Re: Splitting a large file

Thanks a lot, David. Will try your DO IF sequence. I should have remembered
XSAVE, of course, but I've practically never used it, and completely forgot
about it.

Hector

-----Original message-----
From: SPSSX(r) Discussion [mailto:[hidden email]] On behalf of David Marso
Sent: Monday, January 07, 2013 18:28
To: [hidden email]
Subject: Re: Splitting a large file

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command SIGNOFF SPSSX-L For a list of
commands to manage subscriptions, send the command INFO REFCARD

Jon K Peck

Re: Splitting a large file

In reply to this post by Hector Maletta

As David said, XSAVE can do this if you are willing to write all the conditional code.

Alternatively, if you install the SPSSINC SPLIT DATASET extension command, which requires the Python Essentials, it will do all the work for you. It has a dialog box interface as well as traditional syntax. In fact, it generates all the DO IF... XSAVE conditions for you, and it handles the situation where there are more than 64 groups, although it has to do additional data passes for that.

It requires a minimum of two data passes, but the data don't have to be sorted.
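To illustrate why more than 64 groups forces additional data passes, here is a rough sketch of the batching involved. It is not the extension's actual code; the function name `batches` and the 150 group codes are made up for the example.

```python
def batches(values, limit=64):
    """Split the value list into consecutive chunks of at most `limit` items."""
    return [values[i:i + limit] for i in range(0, len(values), limit)]


# Hypothetical case: 150 distinct group codes.
groups = list(range(1, 151))
plan = batches(groups)

# Each chunk would become one DO IF ... XSAVE ... EXECUTE block,
# and therefore one full read of the data.
for number, chunk in enumerate(plan, start=1):
    print(f"pass {number}: {len(chunk)} XSAVE commands")
```

With 40 groups, as in the original question, the plan contains a single chunk.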

HTH,

Jon Peck (no "h") aka Kim
Senior Software Engineer, IBM
[hidden email]
new phone: 720-342-5621

From: Hector Maletta <[hidden email]>
To: [hidden email],
Date: 01/07/2013 02:15 PM
Subject: [SPSSX-L] Splitting a large file
Sent by: "SPSSX(r) Discussion" <[hidden email]>

I have a large SPSS (.SAV format) file, containing 140 million cases with about 100 variables per case, consisting of various population census samples from different countries and dates. Thus the first n cases are from a census taken in a certain country (say the US) and year (say 1970), the next m cases are from a different country-year combination, and so on (One to five censuses per country). Each country-year combination is identified with a single ID variable, with numerical values xyz where xy is the country code and z is the ordinal identifier of successive censuses for that country; thus for a particular country the country code may be 43, and then all cases from the 1970 census may be coded 431, those from the 1980 census would be 432, and so on. The file is sorted by this ID variable, and thus ordered by country and year. A total of about 40 census samples from about 15 countries are included.

Hector Maletta

Re: Splitting a large file

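Thank you, Jon. Quite useful. Just a question: if SPSSINC SPLIT DATASET requires two passes, does David Marso’s solution also require two passes, or only one? (In my case, as explained before, the cases are sorted.)

Hector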

From: Jon K Peck [mailto:[hidden email]]
Sent: Monday, January 07, 2013 19:55
To: Hector Maletta
CC: SPSSX-L@listserv.uga.edu
Subject: Re: [SPSSX-L] Splitting a large file

Mike

Re: Splitting a large file

In reply to this post by Hector Maletta

You might want to take a look at entries #20 and #21 on
http://www.spsstools.net/SampleSyntax.htm#WorkingWithManyFiles

David M may be able to elaborate.

-Mike Palij
New York University
[hidden email]

Jon K Peck

Re: Splitting a large file

In reply to this post by Hector Maletta

If you have no more than 64 groups, it can be done in one pass with XSAVE, but that assumes that you crafted all the conditions and file names for the groups correctly the first time. SPLIT DATASET builds all that from the variable values, but it runs AGGREGATE first so that it knows all the split values.

From: "Hector Maletta" <[hidden email]>
To: Jon K Peck/Chicago/IBM@IBMUS,
Cc: <[hidden email]>
Date: 01/07/2013 04:03 PM
Subject: RE: [SPSSX-L] Splitting a large file

Thank you, Jon. Quite useful. Just a question: if SPSSINC SPLIT DATASET requires two passes, does David Marso’s solution also require two passes, or only one? (In my case, as explained before, the cases are sorted.)

Hector

David Marso

Re: Splitting a large file

Administrator

With 140 million cases I would consider that an expensive data pass.
If you have a frequency output somewhere of all your combinations, you can even bypass the laborious, error-prone manual DO IF biz by using the following macro, which I came up with in about 5 minutes ;-)
Even though some people believe MACRO to be an atrocious, inelegant, primitive, bear-skin-and-flint-shard, old skool, difficult-to-learn, unsophisticated, 20th-century dinosaur technology (did I miss anything there? ;-))), it has the advantage of not requiring Python, and in this case it will save you perhaps half an hour for that extra 140M data pass ;-)

DEFINE BIGXSAV (Varname !TOKENS(1)/ VList !ENCLOSE ("(",")" ) ).
DO IF (!VarName EQ !HEAD(!VLIST)).
XSAVE OUTFILE !QUOTE(!CONCAT ("C:\TEMP\File_",!HEAD(!VLIST))).
!DO !V !IN (!TAIL(!VLIST))
ELSE IF (!VarName EQ !V).
XSAVE OUTFILE !QUOTE(!CONCAT ("C:\TEMP\File_",!V)).
!DOEND .
END IF.
EXECUTE.
!ENDDEFINE .
SET PRINTBACK ON MPRINT ON.

DATA LIST FREE / X .
BEGIN DATA
1 1 1 1 1 1 2 2 2 2 2 3 3 3 3 3 3 4 4 4 4 4 5 5 5 5 5 6 6 6 6 6
END DATA.

BIGXSAV VARNAME X VLIST (1 2 3 4 5 6).

BIGXSAV VARNAME X VLIST (1 2 3 4 5 6).
90 M>
91 M> .
92 M> DO IF ( X EQ 1 ).
93 M> XSAVE OUTFILE 'C:\TEMP\File_1'.
94 M> ELSE IF ( X EQ 2 ).
95 M> XSAVE OUTFILE 'C:\TEMP\File_2'.
96 M> ELSE IF ( X EQ 3 ).
97 M> XSAVE OUTFILE 'C:\TEMP\File_3'.
98 M> ELSE IF ( X EQ 4 ).
99 M> XSAVE OUTFILE 'C:\TEMP\File_4'.
100 M> ELSE IF ( X EQ 5 ).
101 M> XSAVE OUTFILE 'C:\TEMP\File_5'.
102 M> ELSE IF ( X EQ 6 ).
103 M> XSAVE OUTFILE 'C:\TEMP\File_6'.
104 M>
105 M> END IF.
106 M> EXECUTE
107 M> .

David Marso

Re: Splitting a large file

Administrator

In reply to this post by Mike

The only thing I wish to elaborate regarding those two examples is that they are utterly inexcusable crap!
Not only do they use my ancient, horrible hack/red-headed bastard child method (writing syntax for a later INCLUDE), which can be implemented more readily with old skool BASIC scripting or, errr, ahem, Python; more shameful still, the main macro commits the egregious act of an unnecessary EXECUTE following SAVE. For Hector that involves one data pass for the AGGREGATE to do the red-headed bastard child thing and then 2 data passes for each value of the control variable. Raynald really ought to delete those two, and probably anything similar, from his site.
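The pass counts mentioned in this thread can be put side by side with a little arithmetic. This is a back-of-the-envelope sketch: the function names are hypothetical, and the SPLIT DATASET figure assumes one AGGREGATE pass plus one XSAVE pass per batch of 64, which matches the "minimum of two" stated earlier but is otherwise a guess.

```python
import math


def passes_select_if(n_groups):
    # One SELECT IF + SAVE job per group; each job reads the whole file.
    return n_groups


def passes_written_syntax(n_groups):
    # As described above: one AGGREGATE pass, then two passes per value
    # (the SAVE plus the unnecessary EXECUTE).
    return 1 + 2 * n_groups


def passes_xsave(n_groups, limit=64):
    # One DO IF ... XSAVE block per batch of at most `limit` groups.
    return math.ceil(n_groups / limit)


def passes_split_dataset(n_groups, limit=64):
    # Assumption: one AGGREGATE pass plus the XSAVE passes.
    return 1 + passes_xsave(n_groups, limit)


n = 40  # about 40 census samples in the original question
summary = {
    "select_if": passes_select_if(n),
    "written_syntax": passes_written_syntax(n),
    "xsave": passes_xsave(n),
    "split_dataset": passes_split_dataset(n),
}
print(summary)
```

For a 140-million-case file, the gap between 1 pass and 81 passes is the whole argument.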
Bruce Weaver

Re: Splitting a large file

Administrator

In reply to this post by David Marso

Very nice NPR solution, David. Good illustration of how useful !HEAD and !TAIL can be.

--
Bruce Weaver
bweaver@lakeheadu.ca
http://sites.google.com/a/lakeheadu.ca/bweaver/

"When all else fails, RTFM."

PLEASE NOTE THE FOLLOWING:
1. My Hotmail account is not monitored regularly. To send me an e-mail, please use the address shown above.
2. The SPSSX Discussion forum on Nabble is no longer linked to the SPSSX-L listserv administered by UGA (https://listserv.uga.edu/).

David Marso

Re: Splitting a large file

Administrator

Thanks Bruce,
However, upon reflection I strangely prefer the following:
--
DEFINE BIGXSAV (Varname !TOKENS(1)/ VList !ENCLOSE ("(",")" ) ).
DO IF (1 EQ 0).
!DO !V !IN (!VLIST)
ELSE IF (!VarName EQ !V).
XSAVE OUTFILE !QUOTE(!CONCAT ("C:\TEMP\File_",!V)).
!DOEND .
END IF.
EXECUTE.
!ENDDEFINE .
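The appeal of the never-true opening branch is that every value is then handled by the same loop body, with no special case for the first one. A small Python mirror of the macro's text generation shows this; the function name `bigxsav_lines` is hypothetical and the code only builds strings, it does not run SPSS.

```python
def bigxsav_lines(var_name, values, prefix=r"C:\TEMP\File_"):
    """Build the lines the macro is expected to expand to."""
    # DO IF (1 EQ 0) is never true; it exists only to open the block,
    # so that every real value can be an identical ELSE IF branch.
    lines = ["DO IF (1 EQ 0)."]
    for value in values:
        lines.append(f"ELSE IF ({var_name} EQ {value}).")
        lines.append(f"XSAVE OUTFILE '{prefix}{value}'.")
    lines.append("END IF.")
    lines.append("EXECUTE.")
    return lines


for line in bigxsav_lines("X", [1, 2, 3]):
    print(line)
```

Compared with the earlier !HEAD/!TAIL version, the list no longer has to be taken apart before the loop starts.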

Bruce Weaver

Re: Splitting a large file

Administrator

If I follow, that will expand as follows (using your earlier data set):

M> DO IF ( 1 EQ 0 ).
M> ELSE IF ( X EQ 1 ).
M> XSAVE OUTFILE 'C:\TEMP\File_1'.
M> ELSE IF ( X EQ 2 ).
M> XSAVE OUTFILE 'C:\TEMP\File_2'.
M> ELSE IF ( X EQ 3 ).
M> XSAVE OUTFILE 'C:\TEMP\File_3'.
M> ELSE IF ( X EQ 4 ).
M> XSAVE OUTFILE 'C:\TEMP\File_4'.
M> ELSE IF ( X EQ 5 ).
M> XSAVE OUTFILE 'C:\TEMP\File_5'.
M> ELSE IF ( X EQ 6 ).
M> XSAVE OUTFILE 'C:\TEMP\File_6'.
M>
M> END IF.
M> EXECUTE
M> .

I like it.

David Marso wrote

Thanks Bruce,
However upon reflection I strangely prefer the following :
--
DEFINE BIGXSAV (Varname !TOKENS(1)/ VList !ENCLOSE ("(",")" ) ).
DO IF (1 EQ 0).
!DO !V !IN (!VLIST)
ELSE IF (!VarName EQ !V).
XSAVE OUTFILE !QUOTE(!CONCAT ("C:\TEMP\File_",!V)).
!DOEND .
END IF.
EXECUTE.
!ENDDEFINE .
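
For readers who would rather generate the same syntax outside the macro facility, here is a rough Python sketch that builds the text this macro expands to. The function name and the `prefix` parameter are hypothetical; the result is plain text that would still have to be pasted or submitted to SPSS.

```python
def build_xsave_syntax(var_name, values, prefix="C:\\TEMP\\File_"):
    """Return SPSS syntax that routes each value of var_name to its own file."""
    lines = ["DO IF (1 EQ 0)."]  # always-false opener, as in the macro above
    for value in values:
        # Every value gets the same ELSE IF / XSAVE pair; no special first case.
        lines.append(f"ELSE IF ({var_name} EQ {value}).")
        lines.append(f"XSAVE OUTFILE '{prefix}{value}'.")
    lines.append("END IF.")
    lines.append("EXECUTE.")
    return "\n".join(lines)


print(build_xsave_syntax("X", [1, 2, 3]))
```

Because the opening condition is never true, the loop body is identical for every value, which is the tidiness the macro version is after.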

Bruce Weaver wrote

Very nice NPR solution, David. Good illustration of how useful !HEAD and !TAIL can be.

David Marso wrote

With 140 million cases I would consider that an expensive data pass.
If you have a frequency output somewhere of all your combinations, you can even bypass the laborious, error-prone manual DO IF biz by using the following macro I came up with in about 5 minutes ;-)
Even though some people believe MACRO to be an atrocious, inelegant, primitive, bear skin and flint shard old skool, difficult to learn, unsophisticated, 20th century dinosaur technology (did I miss anything there? ;-))), it has the advantage of not requiring Python, and in this case it will save you perhaps half an hour for that extra 140M data pass ;-)

DEFINE BIGXSAV (Varname !TOKENS(1)/ VList !ENCLOSE ("(",")" ) ).
DO IF (!VarName EQ !HEAD(!VLIST)).
XSAVE OUTFILE !QUOTE(!CONCAT ("C:\TEMP\File_",!HEAD(!VLIST))).
!DO !V !IN (!TAIL(!VLIST))
ELSE IF (!VarName EQ !V).
XSAVE OUTFILE !QUOTE(!CONCAT ("C:\TEMP\File_",!V)).
!DOEND .
END IF.
EXECUTE.
!ENDDEFINE .
SET PRINTBACK ON MPRINT ON.

DATA LIST FREE / X .
BEGIN DATA
1 1 1 1 1 1 2 2 2 2 2 3 3 3 3 3 3 4 4 4 4 4 5 5 5 5 5 6 6 6 6 6
END DATA.

BIGXSAV VARNAME X VLIST (1 2 3 4 5 6).

BIGXSAV VARNAME X VLIST (1 2 3 4 5 6).
M>
M> .
M> DO IF ( X EQ 1 ).
M> XSAVE OUTFILE 'C:\TEMP\File_1'.
M> ELSE IF ( X EQ 2 ).
M> XSAVE OUTFILE 'C:\TEMP\File_2'.
M> ELSE IF ( X EQ 3 ).
M> XSAVE OUTFILE 'C:\TEMP\File_3'.
M> ELSE IF ( X EQ 4 ).
M> XSAVE OUTFILE 'C:\TEMP\File_4'.
M> ELSE IF ( X EQ 5 ).
M> XSAVE OUTFILE 'C:\TEMP\File_5'.
M> ELSE IF ( X EQ 6 ).
M> XSAVE OUTFILE 'C:\TEMP\File_6'.
M>
M> END IF.
M> EXECUTE
M> .

-----

Jon K Peck wrote

If you have no more than 64 groups, it can be done in one pass with XSAVE,
but that assumes that you crafted all the conditions and file names for
the groups correctly the first time. SPLIT DATASET builds all that from
the variable values, but it runs AGGREGATE first so that it knows all the
split values.
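
The purpose of that first pass can be illustrated with a short Python sketch: one scan over the split variable is enough to learn every distinct value (and, as a by-product, the case count per value). This is an illustration of the idea only, not the extension's actual code, and the sample IDs are made up.

```python
from collections import Counter


def count_split_values(ids):
    """One pass over the ID column: distinct values with their case counts."""
    return Counter(ids)


ids = [431, 431, 431, 432, 432, 511]  # made-up sample of the split variable
counts = count_split_values(ids)
split_values = sorted(counts)

print(split_values)   # [431, 432, 511]
print(counts[431])    # 3
```

With the list of values in hand, the conditions and file names can be generated rather than typed, which is what removes the risk of getting one wrong.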

Jon Peck (no "h") aka Kim
Senior Software Engineer, IBM
[hidden email]
new phone: 720-342-5621

From: "Hector Maletta" <[hidden email]>
To: Jon K Peck/Chicago/IBM@IBMUS,
Cc: <[hidden email]>
Date: 01/07/2013 04:03 PM
Subject: RE: [SPSSX-L] Splitting a large file

Thank you, Jon. Quite useful. Just a question: if SPSSINC SPLIT DATASET
requires two passes, is it also the case that David Marso’s solution
requires two passes, or only one? (In my case, as explained before, the cases
are sorted.)

Hector

From: Jon K Peck [mailto:[hidden email]]
Sent: Monday, January 07, 2013 19:55
To: Hector Maletta
CC: [hidden email]
Subject: Re: [SPSSX-L] Splitting a large file

As David said, XSAVE can do this if you are willing to write all the
conditional code.

Alternatively, if you install the SPSSINC SPLIT DATASET extension command,
which requires the Python Essentials, it will do all the work for you. It
has a dialog box interface as well as traditional syntax. In fact, it
generates all the DO IF... XSAVE conditions for you, and it handles the
situation where there are more than 64 groups, although it has to do
additional data passes for that.

It requires a minimum of two data passes, but the data don't have to be
sorted.
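
One way to picture the extra passes for more than 64 groups is to cut the list of split values into batches of at most 64, with one XSAVE pass per batch. The sketch below is an assumption about how such batching could work, not a description of the extension's internals; the limit of 64 is the one quoted from the documentation earlier in this thread.

```python
MAX_XSAVE = 64  # documented limit on XSAVE commands per set of transformations


def batch_values(values, batch_size=MAX_XSAVE):
    """Cut the split values into consecutive batches of at most batch_size."""
    values = list(values)
    return [values[i:i + batch_size] for i in range(0, len(values), batch_size)]


batches = batch_values(range(1, 151))  # 150 hypothetical split values
print(len(batches))                    # 3
print([len(b) for b in batches])       # [64, 64, 22]
```

With about 40 census samples, as in the original question, everything fits in a single batch.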

HTH,

Jon Peck (no "h") aka Kim
Senior Software Engineer, IBM
[hidden email]
new phone: 720-342-5621

From: Hector Maletta <[hidden email]>
To: [hidden email],
Date: 01/07/2013 02:15 PM
Subject: [SPSSX-L] Splitting a large file
Sent by: "SPSSX(r) Discussion" <[hidden email]>


David Marso

Re: Splitting a large file

Administrator

Exactly.
Tossing a FALSE into the main clause allows everything else to be ELSE IF'd.
Makes the programming logic a bit tidier.
--
