Hi all,
I have a data file with 30,000 variables and 12,000 observations. I am using SPSS 24 on Windows 7. The variables are a mix of character, numeric, date, and time. I have a very large .dat file and the syntax to turn it into a .sav file. The syntax runs on a smaller data set. When I try to run the syntax file to create a .sav file of all the variables, SPSS becomes very slow and/or stops working. Is there a way to improve performance with this file? Currently it appears to take more than 6 GB of RAM for the syntax to try to run. Yes, I know it is stupid to want a data file with that many variables, but that is what some of our clients want.
Thank you,
Merlin Marshall
Center for Human Resource Research
The Ohio State University
Columbus Ohio
===================== To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD
|
You will need a lot of available memory for that. In fact, if you have string variables with length more than 8 bytes, they count as extra variables and take extra dictionary slots. If you are running out of memory, you might try creating two sav files, each with half the variables, and then doing MATCH FILES on the two.
Another trick that might help a lot would be using external mode to run the syntax, because that eliminates all the memory and CPU overhead of the user interface and, in particular, the Data Editor, which needs a lot of updating when there are so many variables. I have on occasion seen code run an order of magnitude faster or more in external mode for certain tasks.
To use external mode, you need a tiny amount of Python code. You could do it like this. Open a command prompt (DOS) window and cd to the Python directory, which is under the Statistics installation directory. Start the Python interpreter there (type python) and then type this code:
    import spss
    spss.Submit(r"""INSERT FILE="filespec".""")
where filespec is the path to a syntax file to execute. Note the r before the quote and the use of three " surrounding the command. Ctrl-Z (followed by Enter) will exit the session.
If it produces a lot of output that you don't want, you can write
    spss.SetOutput("off")
before the Submit line. On the other hand, if you want to capture the output, you would wrap the syntax in OMS and OMSEND commands to get a Viewer file or just plain text (less overhead).
BTW, you might consider using zsav rather than sav with large datasets, as zsav format compresses much more effectively than sav in most cases.
On Mon, Jul 17, 2017 at 10:20 AM, Merlin Marshall <[hidden email]> wrote: Hi all,
|
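The external-mode steps described above can be sketched as a small script rather than typed interactively. This is only a sketch under stated assumptions: the helper names `build_job` and `run_external` and the example paths are hypothetical, the script must be run with the Python that ships with Statistics (so that `import spss` succeeds), and the OMS line is one plausible way to capture everything as plain text, not the only one.

```python
def build_job(syntax_path, text_output_path=None):
    """Return a list of SPSS commands that run syntax_path via INSERT.

    If text_output_path is given, the INSERT is wrapped in OMS/OMSEND so
    that all output is routed to a plain-text file (assumed selection:
    /SELECT ALL).
    """
    commands = []
    if text_output_path is not None:
        commands.append(
            'OMS /SELECT ALL /DESTINATION FORMAT=TEXT OUTFILE="{}".'.format(
                text_output_path
            )
        )
    commands.append('INSERT FILE="{}".'.format(syntax_path))
    if text_output_path is not None:
        commands.append("OMSEND.")
    return commands


def run_external(syntax_path, text_output_path=None):
    """Run the job in external mode, without the Statistics user interface."""
    # Imported here so that build_job can be used (and checked) on a
    # machine where the spss module is not installed.
    import spss

    spss.SetOutput("off")  # suppress output echoed to the console
    spss.Submit(build_job(syntax_path, text_output_path))


if __name__ == "__main__":
    # Hypothetical paths: print the commands that would be submitted.
    for command in build_job(r"C:\jobs\read_all.sps", r"C:\jobs\read_all.txt"):
        print(command)
```

Calling `run_external(r"C:\jobs\read_all.sps")` from a script started with the Statistics Python would then do the same thing as the two interactive lines in the post.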
In reply to this post by Merlin Marshall
Hi,
It is unclear what you mean by the phrase "smaller data set". Do you mean: (1) you use all observations/cases but a subset of the 30,000 variables, or (2) you use a subset of cases but all 30,000 variables?
Jon Peck or others may know if there is a more sophisticated way of creating the master file of 30k vars & 12K cases, but if situation (1) above is true, why don't you group the cases into manageable numbers (say 5k cases), and then do an add files after all of the cases have been saved to a system data file (about 6 data files). If (2) is the case, create three system data files with 10k variables each (assuming that this can be done without a problem) but with some unique identifiers in each file (e.g., an ID number/name, other variables, or create a unique identifier for each case that appears in each system data file). Then do a match files on the identifier(s) to create the master file.
It could be that you can do the whole analysis in one run as a production job, perhaps with increased RAM (does SPSS still allow this? I know the mainframe versions did), or by making sure that no other programs are running and unneeded background programs are turned off. However, this may require you to be the administrator on the Windows PC. Again, more tech-savvy people may be able to suggest a more sophisticated (powerful) method of doing this. Otherwise, a piecewise approach may get you to your goal.
To Jon and other SPSS personnel (past & present): I assume that one can still run SPSS as a production job in a "DOS" or system window, which should reduce the amount of RAM SPSS uses for running itself. I believe this was true in earlier (i.e., DOS) versions, but maybe I am wrong. If I am clear in what I am saying, is there a way of doing this now (i.e., do a command line call of SPSS with specifications for input file(s), output file(s), and other "/" specifications)?
-Mike Palij New York University [hidden email] ----- Original Message ----- From: "Merlin Marshall" <[hidden email]> To: <[hidden email]> Sent: Monday, July 17, 2017 12:20 PM Subject: trouble reading large syntax file |
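The piecewise approach suggested in the two replies above (save several smaller system files, then combine them) can be sketched by generating the combining syntax from Python. This is a sketch, not anyone's actual job: the variable names V1..V30000, the key CASEID, and the file names are hypothetical, and it assumes every piece contains the key and is sorted by it before MATCH FILES runs.

```python
def chunk(names, size):
    """Split a list of variable names into consecutive groups of at most size."""
    return [names[i:i + size] for i in range(0, len(names), size)]


def match_files_syntax(piece_paths, key, out_path):
    """Build syntax that merges variable-wise pieces on a shared key.

    The result is saved with /ZCOMPRESSED, i.e. in zsav format.
    """
    lines = ["MATCH FILES"]
    lines += ['  /FILE="{}"'.format(path) for path in piece_paths]
    lines.append("  /BY {}.".format(key))
    lines.append('SAVE OUTFILE="{}" /ZCOMPRESSED.'.format(out_path))
    return "\n".join(lines)


def add_files_syntax(piece_paths, out_path):
    """Build syntax that stacks case-wise pieces holding the same variables."""
    lines = ["ADD FILES"]
    lines += ['  /FILE="{}"'.format(path) for path in piece_paths]
    lines[-1] += "."
    lines.append('SAVE OUTFILE="{}" /ZCOMPRESSED.'.format(out_path))
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical layout: 30,000 variables split into three pieces.
    variables = ["V{}".format(i) for i in range(1, 30001)]
    groups = chunk(variables, 10000)
    pieces = ["part{}.sav".format(n) for n in range(1, len(groups) + 1)]
    print(match_files_syntax(pieces, "CASEID", "master.zsav"))
```

Each group of names would drive a separate read of the .dat file that keeps only those variables plus the key; the generated MATCH FILES command then assembles the master file.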
It is possible to run a production job from a DOS box via stats -production ... (Search Help > Topics for "command line" for details.) In production mode, you would need to create an spj file with control information. If you run from the command line but not in production mode, you get the UI and would not save the UI overhead. On Mon, Jul 17, 2017 at 11:00 AM, Mike Palij <[hidden email]> wrote: Hi, |
In reply to this post by Merlin Marshall
Hi Mike,
Sorry I wasn't clear. By smaller I mean fewer variables. The syntax code does not contain errors; the problem comes when one tries to get all the variables in one job. Merlin Marshall |
In reply to this post by Merlin Marshall
Thank you Jon Peck and everyone else who responded to my question.
With regard to an earlier comment, the file is too large (too many variables) to open in Excel, Stata or R. It did open in SAS. Because it won't open in almost all the stats packages we support, we are probably just going to tell people that we cannot comply with their request for code to read the entire data set in one go. Of course it can be read in pieces, and that is how it should be done, but that is not what these users want. Jon, I don't know Python, but the support person who is working on this issue does. I don't know whether he tried it or not; he is out of the office, so I can't tell you how it went. Thank you everyone again, Merlin Marshall Center for Human Resource Research The Ohio State University Columbus Ohio |
In reply to this post by Jon Peck
Jon,
As one of the members of the list who used to run SPSS jobs on Vax and other mainframe systems (I started out with UNIVAC), I am used to writing syntax in an editor, creating a raw text file of data (if not included "inline"), and submitting a job at a system command prompt. What you say below is somewhat at variance with my experience because of the use of Python (not that there is anything wrong with that). However, looking at the PDF "IBM SPSS Statistics Batch Facility User's Guide" shows that the old mainframe functionality can be used without Python.

After some fiddling with Windows' environment variables (i.e., identifying where to put temporary files, etc.; see page 12 in the manual for Ver 23), one can open a "DOS" window (Win 10 really doesn't have DOS in it, right? Should this be called a "command line window"?) and at the command prompt enter something like the following:

General format:
C:> statisticsb -f syntaxfile -type outputtype -out outputfile

Specific example:
C:> statisticsb -f "C:\syntaxjobs\bank.sps" -type text -out C:\output\bank.txt

where "statisticsb" invokes the SPSS production program (I assume that it is a limited front end to the SPSS statistics software), "-f" identifies the location and the name of the SPSS syntax file (what happened to "in=" or am I confusing that switch with BMDP or another program?), "-type" specifies the output format (NOTE: SPSS's pivot tables are a PITA, but I believe one can specify such tables to be "deconstructed" into component parts if old-fashioned text format is used -- this might also work with HTML/XML format), and "-out" identifies the location and name of the output file (note that one has to provide the right file extension).

This would seem to me to put minimal demands on system resources (no need to open SPSS; one can just have Windows Explorer open to access the output file and make sure that the system data file was created).

So, Jon, this *should* work, right? And there is no need for a *.spj file, right?

-Mike Palij
New York University
|
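For anyone who does have statisticsb available, the command line quoted above could be driven from a script. The sketch below only assembles the three switches shown in the post (-f, -type, -out); the helper names are hypothetical, statisticsb is assumed to be on the PATH, and nothing is actually run unless `run_batch` is called.

```python
import subprocess


def statisticsb_command(syntax_file, output_type, output_file, exe="statisticsb"):
    """Assemble the argument list for the batch call shown in the post."""
    return [exe, "-f", syntax_file, "-type", output_type, "-out", output_file]


def run_batch(syntax_file, output_type, output_file):
    """Run the batch job and raise CalledProcessError on a non-zero exit code."""
    command = statisticsb_command(syntax_file, output_type, output_file)
    return subprocess.run(command, check=True)


if __name__ == "__main__":
    # The specific example from the post, printed rather than executed.
    print(statisticsb_command(r"C:\syntaxjobs\bank.sps", "text", r"C:\output\bank.txt"))
```

Passing the arguments as a list avoids quoting problems with paths that contain spaces.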
Mike, I think statisticsb is only available with Statistics Server, so most people do not have it. As for Win 10 and the "DOS" window, Microsoft officially calls it the "Command Prompt", and Win 10 certainly does not have old DOS code in it, but it is still commonly referred to as the DOS window, since it works a lot like real DOS from days of yore. But either with statisticsb or a production job or the Python external-mode approach, the point is the same: less stress on system resources and in some cases much less. On Wed, Jul 19, 2017 at 11:56 AM, Mike Palij <[hidden email]> wrote:
|
Jon,
Thanks for the feedback, but I need to get one point clarified. By "Statistics Server" do you mean that SPSS programs are located on a separate PC server or cloud platform? That is, Statistics Server is not part of single-user versions of SPSS, which would mean that statisticsb is not available? If so, when was this change implemented? I vaguely remember being able to do this in older Windows versions of SPSS (then again, I might be confusing this with BMDP, which used command line submission and didn't really have a Windows interface until after SPSS bought and sold the company; I also think such command line submission of SPSS was possible on OS/2, though the "Windows" interface was rather nice there).
-Mike Palij
New York University
|
SPSS Statistics Server is a different product from the SPSS Client that most people have. The Server handles multiple users and consists of a central server program that clients connect to, which spawns "slave" processes for each connection. The server would generally be running on a remote system, so long jobs do not tie up the desktop. It has some features not available in the Client system. One distinction that used to exist - Client was limited in the number of processors it could use - was removed in V24.
Here is an extract from the Batch Facility User's Guide. Statisticsb is part of this.
The IBM® SPSS® Statistics Batch Facility is a batch processing utility that is included with the IBM SPSS Statistics Server product. This guide describes the Batch Facility and how to use it.
Introduction to the Batch Facility
IBM SPSS Statistics Server is client/server based. It distributes client requests for resource-intensive operations to powerful server software. Typically, the client for IBM SPSS Statistics Server is a version of IBM SPSS Statistics running on a desktop computer. The Batch Facility is an alternative way to use the power of IBM SPSS Statistics Server, and it runs on the server computer.
----------
The Production Facility on the client machine can run jobs on Statistics Server, monitor them, and fetch the output. This has been the case for quite a few releases - maybe back to SPSS version 10, but I don't recall exactly.
On Wed, Jul 19, 2017 at 12:23 PM, Mike Palij <[hidden email]> wrote:
|