Computer Buying Help


Computer Buying Help

Chris Cronin
I downloaded and scanned all previous posts related to the topic.  There
are many but I remain unclear how to prioritize.  I have to spec a new
computer to run SPSS 13 to replace an existing installation.

My general daily computing comprises reprocessing about 4 million records.
Sorting is the most time-consuming task, followed closely by case-by-case
transformations.  The daily routine (a couple hundred lines of syntax that
read text data, process, save, and produce about 120 reports) can take
upwards of 12 hours on a Windows 2000 machine with 512 MB RAM, a 2 GHz
Pentium 4, and an 80 GB IDE hard disk.  My data volume will be increasing
by 30%, and I need to get the routine down to the least overall processing
time possible.  Which factors will most greatly increase processing speed?

- Windows XP
- Up to 2G RAM
- Processor Speed
- Multiple Processors
- Bus Speed
- SCSI vs IDE / Disk speed / space

Thanks for your assistance.
Chris Cronin

Re: Computer Buying Help

BLAND, GLEN
Chris,

Windows XP would probably be a good start.  2 GB of RAM would greatly
increase your computing power for heavy analysis.  An Intel Core 2 Duo
processor would be a great help; if not, a processor with at least a
2 GHz clock speed would do pretty well for you.  As far as bus speed goes,
I can't answer that.  A high-capacity hard drive (at least 120 GB) would
be beneficial, and an even larger one never hurts.  You can never have
too much power ;)

Hope that helps.  Perhaps someone else has a better suggestion for him.



Re: Computer Buying Help

Richard Ristow
In reply to this post by Chris Cronin
At 09:47 AM 1/30/2007, Christopher Cronin wrote:

>My general daily computing comprises reprocessing about 4 million
>records. Sorting is the most time consuming task, followed closely by
>case-by-case transformations.  The daily routine can take upwards of
>12 hours on a win2000 machine with 512M RAM and Pentium 4 2GHz and 80G
>IDE hard disk. Which factors will most greatly increase processing
>speed?
>
>- Windows XP
>- Up to 2G RAM
>- Processor Speed
>- Multiple Processors
>- Bus Speed
>- SCSI vs IDE / Disk speed / space

Here's general advice, regarding these:

- Windows XP
It's time to go to Windows XP anyway, but I don't expect it, by
itself, to speed up your job much.

- Up to 2G RAM
Definitely, do it. By current standards, especially with XP, 512 MB is
far too little. You should notice an improvement, but it's hard
to tell how much. It will make more difference with very 'wide' records
(many variables); it will make a great deal of difference if any part
of SPSS is being paged out.

- Processor Speed
- Multiple Processors
You're likely to see little or no effect from either. (I'm not aware
that SPSS can even take advantage of multiple processors.) Your job's
time probably goes mostly to data transfer, especially to and from
disk, with computation taking only a small part.

- Bus Speed
Same answer as for processor speed, though it might have a somewhat
larger effect, since a faster bus speeds transmission between main
memory and the CPU.

- SCSI vs IDE / Disk speed / space
Here's the big one: your time is probably dominated by disk transfers.
I'd say,
. Space won't matter much once you have 'plenty', but be generous about
that. Calculate space for your input file and all files you're
creating, and make sure you have at least double that, free. (Among
other things, that avoids contention if SPSS keeps an old copy of a
file active until a new one is completely created.)
. Speed, at almost any price. As for protocol (SCSI/IDE), go for the
fastest. (Isn't that SATA now?)
. Strongly consider two disks. Arrange your job so that, when you're
creating or modifying a file, input is read from one disk and output is
written to the other.
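
A minimal sketch of that two-disk arrangement (untested; drive letters
and file names are hypothetical):

* Input is read from one physical disk (D:), output written to another (E:).
FILE HANDLE rawdata  /NAME='D:\data\daily.txt'.
FILE HANDLE workfile /NAME='E:\work\current.sav'.
DATA LIST FILE=rawdata FIXED / id 1-8  amount 10-17 (2).
SAVE OUTFILE=workfile.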
....................
You've likely enough checked all this, but it's worth mentioning ways
to arrange your job to run faster. You write that your job has "a
couple hundred lines of read text data, process, save, and produce
about 120 reports". Techniques for efficiency include,

. If that's 120 different passes through the data file, see whether you
can arrange your logic to do several reports with each pass.
. Create the smallest working file you possibly can. Drop all variables
you aren't using; select out all cases you don't need.
. Use CACHE (a sketch follows this list). That may have a huge effect
if your initial input is expensive (like most GET DATA /TYPE=ODBC, or a
complex GET DATA /TYPE=TXT), or if you can select a small subset of your
variables or cases. (Failing to use CACHE can eliminate the
effectiveness of selection.)
. Of course, taking full advantage of separate disks, if you have them.
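
A minimal sketch of the CACHE pattern (untested; the file name, column
positions, and selection are hypothetical):

* Read the text file once and keep only the cases needed.
DATA LIST FILE='D:\data\daily.txt' FIXED
   / id 1-8  rt 10-13  amount 15-22 (2).
SELECT IF (rt LE 20).
* CACHE plus EXECUTE writes the cached working copy in a single pass.
* Later procedures reread that copy instead of the raw text file.
CACHE.
EXECUTE.
<procedures that use the selected data>.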

Re: Computer Buying Help

Richard Ristow
This is an on-list response to an off-list follow-up.

At 08:46 AM 1/31/2007, [hidden email] wrote, off-list:

>The routine starts by reading 4 million lines of
>text data of 55 variables each, then sorting by
>11 variables, computing a few new variables, and
>saving to a .sav file.  I need to pull it in and
>keep as a .sav file for further processing -
>rereading the text data for doing reports is impossibly slow.

You're certainly right about that last point.

>The way their programs work, data can be added
>anywhere in their storage file so I have to
>re-read the whole thing every day.

If you can identify, when you're reading them,
which records have been added since the last run,
then A., below, will probably help you; B. MAY
help you. These assume that the previous complete
version of the file, sorted, is saved under name
or handle PREVIOUS. New records will be saved in
file TODAY, and the complete new file written to
file CURRENT. PREVIOUS and TODAY may be on the
same disk; CURRENT should be on a different disk
from both of them. To make this work well with a
series of updates, you'll have to change your
file handle definitions for each new run. (If
you're from the old IBM days, think 'generation data groups'.)

A. Save time in transformations, sorting and saving. Code not tested:

DATA LIST FILE=<big text file> {FIXED|FREE|LIST}
    / <variable specifications>.
SELECT IF     <it's a new record>.
<transformation commands>
SORT CASES BY <sort variables>.
ADD FILES
   /FILE=PREVIOUS
   /FILE=*
   /BY <sort variables>.
XSAVE OUTFILE=CURRENT.
<First procedure that uses the updated file>
GET FILE=CURRENT.
<Other procedures that use the updated file>.
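
For concreteness, a filled-in version of A (untested; the paths, column
positions, variables, and the date test for "new record" are all
hypothetical):

* PREVIOUS and the raw text are read from drive D; CURRENT is written to E.
FILE HANDLE previous /NAME='D:\data\previous.sav'.
FILE HANDLE current  /NAME='E:\data\current.sav'.
DATA LIST FILE='D:\data\vendor.txt' FIXED
   / id 1-8  recdate 10-17 (ADATE)  amount 19-26 (2).
* Keep only records added since the last run (hypothetical date test).
SELECT IF (recdate GT DATE.MDY(1,30,2007)).
COMPUTE logamt = LN(amount).
SORT CASES BY id recdate.
* Merge the new, sorted records into the previous complete file.
ADD FILES
   /FILE=previous
   /FILE=*
   /BY id recdate.
XSAVE OUTFILE=current.
* The first procedure makes the single data pass that also writes CURRENT.
FREQUENCIES VARIABLES=amount.
GET FILE=current.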

B. Save computing time in DATA LIST (which may or
may not matter). This isn't tested, either, and I
have less experience with this logic.

INPUT PROGRAM.
.  DATA LIST FILE=<big text file> {FIXED|FREE|LIST}
      / <variables needed for selection>.
.  DO IF <it's a new record>.
.     REREAD.
.     DATA LIST FILE=<big text file> {FIXED|FREE|LIST}
          / <all variables>.
.     END CASE.
.  END IF.
END INPUT PROGRAM.

>I do as many transformation commands as possible
>between each "exe" command.

WHOOPS! Biggie! Red alert!

'EXECUTE' commands are very expensive with big
files. "As many commands as possible"? There's no
limit, certainly not one you're likely to hit, in
the number of commands in a transformation program.

There are specific reasons you need EXECUTE: see
"Use EXECUTE sparingly" in any edition of Raynald
Levesque's book(1), or more postings of mine than
I care to count(2). If there's not one of these
specific reasons for any of your EXECUTEs, take
it out. There's a very large chance you need no EXECUTEs at all.

>The second step selects only data gathered on
>weekdays and saves out, keeping only the
>variables needed for the first batch of reports,
>gets the modified version back, and then uses
>temporary - select if (var1 = 1 through 200,
>which means 200 re-reads through the data) for
>each report. Then repeat the process for
>Saturday and Sunday, deleting the partial files as it goes.

The following should be MUCH faster, if (as
sounds likely) all 200 reports are the same
except for the cases selected. Code still not tested:

GET FILE=CURRENT.
*  The following SORT isn't necessary, if   .
*  'var1' is the first variable in list     .
*  '<sort variables>', above                .
.  SORT CASES BY var1.
SPLIT FILE BY var1.
<Run report>.

If you don't run the same report each time, you
can do 20 data passes (or 19) instead of 200 by
using XSAVE to write 10 or 11 of the selected
files in one data pass. (Only 11, because 10 is
the maximum number of XSAVEs in one input
program.) The TEMPORARY/ SELECT IF/ SAVE to save
the 11th may be more trouble than it's worth; in
that case, replace it by a <gasp!> EXECUTE.
You'll probably want macro loops or Python loops
to generate this code, which is two nested loops:
through values 1 thru 200, 11 or 10 at a time,
generating one transformation program per pass;
then through 11 or 10 consecutive values,
generating the <test>/XSAVE pairs.

Still untested:

GET FILE=CURRENT.
DO IF     var1  EQ  001.
.  XSAVE OUTFILE=Rpt001.
ELSE IF   var1  EQ  002.
.  XSAVE OUTFILE=Rpt002.
...
ELSE IF   var1  EQ  010.
.  XSAVE OUTFILE=Rpt010.
END IF.
TEMPORARY.
SELECT IF var1  EQ  011.
SAVE     OUTFILE=Rpt011.
<etc., for the rest of the reports>
GET FILE=Rpt001.
<run report for var1=1>
GET FILE=Rpt002.
<run report for var1=2>
...
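
The macro-loop idea might be sketched like this (untested; the argument
names and output path are hypothetical). The macro generates one
DO IF / ELSE IF / XSAVE block covering ten consecutive values of var1:

DEFINE !xsplit (first = !TOKENS(1)  /last = !TOKENS(1)).
!DO !i = !first !TO !last
!IF (!i !EQ !first) !THEN
DO IF   (var1 EQ !i).
!ELSE
ELSE IF (var1 EQ !i).
!IFEND
   XSAVE OUTFILE=!QUOTE(!CONCAT('c:\temp\rpt', !i, '.sav')).
!DOEND
END IF.
!ENDDEFINE.

GET FILE=CURRENT.
!xsplit first=1 last=10.
* A procedure, the SAVE of an eleventh file, or one EXECUTE triggers the pass.
EXECUTE.
GET FILE='c:\temp\rpt1.sav'.
<run report for var1=1>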

Better hardware is a good thing, and you should
probably have it for jobs your size. But there
are more ways to solve problems than by throwing hardware at them.

-Cheers, and good luck,
  Richard
....................
(1) Levesque, Raynald, "SPSS® Programming and
Data Management/A Guide for SPSS® and SAS® Users".
SPSS, Inc., Chicago, IL, 2005.

You can download it free as a PDF file, from
http://www.spss.com/spss/SPSS_programming_data_mgmt.pdf.

(2) For example, Ristow, Richard, "EXECUTE is
sometimes useful", SPSSX-L Thu, 4 May 2006 (12:10:14 -0400).
....................

>Thank you.  This is very helpful indeed.  Yes, I
>didn't specify but the routine starts by reading
>4 million lines of text data of 55 variables
>each, then sorting by 11 variables, computing a
>few new variables, and saving to a .sav file.  I
>do as many transformation commands as possible
>between each "exe" command.
>
>The reason is that our vendor's equipment
>stores gathered data as text files and I need to
>pull it in and keep as a .sav file for further
>processing - rereading the text data for doing
>reports is impossibly slow.  The way their
>programs work, data can be added anywhere in
>their storage file so I have to re-read the
>whole thing every day.  (I've mentioned to them
>that they are about 20 years behind but they won't
>budge).  I have to keep the full file the first
>time.  The second step selects only data
>gathered on weekdays and saves out, keeping only
>the specific variables needed for the first
>batch of reports, gets the modified version
>back, and then uses temporary - select if (var1
>= 1 through 200, which means 200 re-reads
>through the data) for each report.  Then repeat
>the process for Saturday and Sunday, deleting
>the partial files as it goes.  Anyway, I'll use
>your recommendations to design the computer with
>multiple disks and do my reads and writes to
>separate disks.  Maximize RAM as well.  Thanks again.
>
>Chris

Re: Computer Buying Help

Albert-Jan Roskam
Hi,

Testing (debugging) your syntax on a small subset
of your data --fewer cases-- can also save a lot of
time. For example:
N OF CASES 1000.
[... your syntax ...]

or:
SET SEED 12345.
SAMPLE .05.
[... your syntax ...]

Maybe I overlooked it in Richard's e-mail, but Boolean
short-circuiting is a way to speed up the processing
of e.g. DO IF structures. It means that the most
common/likely condition should be specified first, as
illustrated below.
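
For example (untested; the variable and its values are hypothetical):
in a DO IF ... ELSE IF structure the conditions are tested in order, so
most cases stop at the first test.

* Most records are weekdays (daytype = 1), so that test comes first.
DO IF   (daytype EQ 1).
   COMPUTE svcfactor = 1.00.
ELSE IF (daytype EQ 2).
   COMPUTE svcfactor = 0.60.
ELSE IF (daytype EQ 3).
   COMPUTE svcfactor = 0.45.
END IF.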

I am not sure whether using BY processing is faster,
but I believe so. If not, it still saves lots of time
when you process your output in e.g. Excel.
SORT CASES BY myvar. /* yes, this is time-consuming!.
TEMPORARY.
SPLIT FILE LAYERED BY myvar.
[... your syntax ...]

The following link is about programming efficiency in
SAS, but can still be of use for SPSS users.
http://www.ats.ucla.edu/stat/SAS/library/nesug00/bt3005.pdf
Btw, the e-book from Raynald Levesque is a must.


Cheers!
Albert-Jan



Streamlining (was: Computer Buying Help)

Chris Cronin
Thanks, all, for your assistance.  The suggestions related to changing my
logic to speed things up are particularly helpful.  Streamlining code is
an art in every coding language that will always present challenges and
new rewards as new solutions are found.

One suggestion was that I use SPLIT FILE and layering to reduce the number
of data passes for the 200-report routine.  I haven't been able to get all
the pieces to work together.  Specifically, the 4-million-record file in
its entirety produces reports too large for others to use.  They don't
have SPSS, so I have to save the reports as text files, and Wordpad is not
meant to handle 4-million-line text files.  So I break the job into many
separate reports, repeated iteratively in the syntax file; each report
uses the /OUTFILE subcommand to produce reasonably sized text reports
others can use, arranged by logical file names, with specific page titles
inside each report.  I've trained people how to interpret the file names
to find their data.  For example (paraphrased for brevity):

temporary.
select if (var1 = 1).
report
 /FORMAT= <<format statements>>
 /outfile='g:\apc_reports\ride_checks\01wkwi06.txt'
    /* 01 is the value of the first select if: "route 1 weekday winter 2006" */
      /TITLE=CENTER
                     '*****************************'
                     'Route 1'    /* 1 is the value of the first select if */
                     'Winter, 2006'
                     'TRIP ORDER'
                     '*****************************'
 /vars = varlist
 /break = <<break & summary lists>>.

temporary.
select if (var1 = 2).
report
 /FORMAT= <<format statements>>
 /outfile='g:\apc_reports\ride_checks\02wkwi06.txt'
    /* 02 is the value of the second select if: "route 2 weekday winter 2006" */
      /TITLE=CENTER
                     '*****************************'
                     'Route 2'    /* 2 is the value of the second select if */
                     'Winter, 2006'
                     'TRIP ORDER'
                     '*****************************'
 /vars = varlist
 /break = <<break & summary lists>>.

etc., for var1 = 3 through 200.

I know I can use the /STRING subcommand for the page titles.  Can you
generate the output file name within a SPLIT FILE solution?  I also
realize that it could be done in a script using macros, but again, you'd
be 'including' the same report syntax file 200 times - would it be any
faster?

The final issue is that the values for var1 are not completely contiguous.
A partial list of valid values looks like
1,2,3,4,5,6,7,8,11,12,13,14,15,16,18,19,20,22, etc.  This would be fine
for SPLIT FILE by var1, but not for a loop from 1 to 200, as some values
would produce empty reports.

Any thoughts ....

Re: Computer Buying Help - drives, ram, processor, etc.

Jeff-125
In reply to this post by Richard Ristow
At 09:27 PM 1/30/2007, Richard R. wrote:

>- SCSI vs IDE / Disk speed / space
>Here's the big one: your time is probably dominated by disk transfers.
>I'd say,
>. Space won't matter much once you have 'plenty', but be generous about
>that. Calculate space for your input file and all files you're
>creating, and make sure you have at least double that, free. (Among
>other things, that avoids contention if SPSS keeps an old copy of a
>file active until a new one is completely created.)
>. Speed, at almost any price. As for protocol (SCSI/IDE), go for the
>fastest. (Isn't that SIDE now?)
>. Strongly consider two disks. Arrange your job so, when you're
>creating or modifying a file, input is read from one, output file
>written to the other.
>....................


...just a few comments from someone who builds computers.

RAM is probably the most important factor, and hard drive speed the next
most important, for the type of analyses I think you're describing.
Standard Windows XP or Windows 2000 will handle up to 4 GB of RAM;
anything more requires a 64-bit version of Windows. I'm not positive,
however, whether SPSS will run on a 64-bit machine, nor whether it can
take advantage of 64-bit hardware and operating systems.

Some software can take advantage of multiple processors and multiple
cores, but other software can't. I last experimented with SPSS and
multiple processors back around version 9, and at that time I could get
no benefit from two processors for what I was doing. Things may have
changed. It also matters whether you are talking about running SPSS
alone or alongside other programs. E.g., multiple processors might help
if you are running Office while waiting for SPSS to perform some long
task, but not if you are simply waiting for SPSS to finish - I'm not
sure here. My understanding is that Vista will be better able to make
use of multiple processors and multiple cores.

Hard drives - there is much misinformation out there. Manufacturers
advertise "interface" speeds, e.g. 3 Gbit/s SATA drives, but all else
being equal the speed of the interface is all but irrelevant to the
actual performance of the drive. The old drives were labelled IDE/ATA
(essentially the same thing). These same drives are now called PATA
(parallel ATA) - you could place up to two on the same cable in
parallel. The newer version is SATA (serial ATA) - each drive connects
serially, one drive per cable. The SATA interface speed (either 1.5 or
3.0 Gbit/s), again, is essentially irrelevant unless you are running a
server with multiple drives on the same channel. The drive itself runs
nowhere close to that speed: a very fast drive sustains about 75 MB/s,
considerably slower than the advertised interface speed. The newer
drives do run slightly faster than older models, so a newer SATA drive
will probably be faster than an older PATA drive, but not because of
the interface per se. Where the exact same hardware is available in
both PATA and SATA versions, the two transfer data at the same speed.

SCSI is being replaced by SAS (serial-attached SCSI). SAS drives (and
the older SCSI drives) are primarily designed for servers and are much
more expensive. I have four 73 GB SAS drives on a server that cost
about $450 each; the backup SATA drive for the same machine is 400 GB
and costs about $120. The advantage of SCSI or SAS is primarily its low
latency - how long it takes from when a request is made until data is
delivered. SCSI and SAS are far superior there, but that is nearly
irrelevant for statistical software. More important is the sustained
transfer rate, which is largely determined by the rotational speed of
the drive. So a 7,200 rpm SCSI drive will transfer at almost the same
rate as a 7,200 rpm SATA drive; the SCSI drive just starts the transfer
a split second sooner - again, largely irrelevant for stat software.
SCSI and SAS drives do, however, come in higher rpm ratings: the
current maximum is 15,000 rpm for SCSI/SAS versus 10,000 rpm for SATA
(the Western Digital Raptor), with 7,200 rpm being the more common SATA
speed.

Instead of more expensive drives, you might consider a RAID system
(redundant array of inexpensive, or independent, disks), which combines
two or more drives. Consider RAID 0 with two disks if you have a small
budget and want roughly twice the transfer rate of a single drive;
consider RAID 10 or a few other arrangements for other advantages
(search the web on this one - it gets more complex). My four-drive
15,000 rpm SAS RAID 10 array can read data at over 270 MB/s
(unfortunately it is not the machine I run stat software on, which
would fly at that speed), whereas the typical single-drive rate is
under 70 MB/s.

If you want a good drive setup for running SPSS, here is what I would
consider, in order of increasing cost.

1. (Lowest cost, but better than the typical stock computer) A single
Western Digital Raptor 10,000 rpm SATA drive in a typical single-drive
setup.
2. Dual WD Raptors, non-RAID - use one for the OS and one for the data,
or one to read and one to write.
3. Dual WD Raptors in RAID 0 (double the cost, the combined space of
both drives, but more chance of failure - if one drive goes bad, you
lose your data and OS, so make sure you have a backup on another drive).
4. Three WD Raptors in RAID 5 (triple the cost, twice the storage
space; one drive can go bad and you still keep the data, with about
twice the read and write speed of a single drive, but it may slow down
the processor in most setups unless you get a hardware RAID card -
check Google for more info).
5. Four Raptors in RAID 10 (four times the cost, twice the storage
space; one or even two drives can go bad, with up to four times the
read speed and twice the write speed of a single drive, and less
processor slow-down than RAID 5).
6. 15,000 rpm SCSI or SAS drives (now you're getting really expensive)
in one of the RAID setups above.
7. 15,000 rpm SCSI or SAS in RAID with a hardware RAID controller
(start thinking about $3,000 just for the drives and controller
hardware).

I would definitely go with a RAID setup of two or more slower drives
before considering SCSI or SAS for your application.

Re: Streamlining (was: Computer Buying Help)

Richard Ristow
In reply to this post by Chris Cronin
At 10:03 AM 2/1/2007, Christopher Cronin wrote:

>The 4-million record file in its entirety produces reports too large
>for others to use.  I have to save the reports as text files.  Wordpad
>is not meant to handle 4 million line text files.

First thought: If this is so important, and so burdensome, wouldn't
there be something better than Wordpad? I don't know editors and
text-handlers well, but there must be something that could read the
huge output file and separate it into parts.

>So I break [the job] into many separate reports, repeated iteratively
>in the syntax file, and each report uses the /outfile command to
>produce reasonably sized text reports others can use, [with] logical
>file names, and specific page titles.  For example (paraphrased for
>brevity)
>
>temporary.
>select if (var1 = 1)
>report
[...]
>temporary.
>select if (var1 = 2)
>report
[...]
>etc 3 => 200
>
>I know I can use the /string statement for the page titles. Can you
>create the filename statement in a split file solution?  I also
>realize that it could be done in a script using macros, but again,
>you'd be 'including' the same report syntax file 200 times, would it
>be any faster?

All right. The big issue is: your production job reads the entirety of
your big file 200 times. That's an inefficiency that's beyond glaring.
When you started having questions about speed, your first thought
should have been to look at that and say, "There HAS to be a better
way."

I have no idea whether the time to read, transform, and save the data
is even noticeable, by comparison with the time for those 200 report
passes.

For this job, almost the only relevant hardware speed parameter is
overall transfer rate from disk.

Breaking up the large file into 10 pieces, using XSAVE logic, and
running each report against the pertinent smaller file, should give
near a ten-fold saving, very easily. No hardware improvement will come
near that. It may be shorter to sort the 10 smaller files individually,
rather than the large file all together.

It looks like the syntax file for your reports has all 200 (or almost
200) reports hard-coded in. That's clumsy for you to write and
maintain, but it makes changing to use the smaller files very easy.
(Macros, or Python code-generating code, could make your code
considerably more compact and maintainable, but no faster.)

-Cheers, and good luck,
  Richard

Re: Streamlining (was: Computer Buying Help)

Chris Cronin
In reply to this post by Chris Cronin
At 02/01/2007 02:01:38 PM  Richard Ristow wrote:

>First thought: If this is so important, and so burdensome, wouldn't
>there be something better than Wordpad? I don't know editors and
>text-handlers well, but there must be something that could read the
>huge output file and separate it into parts.

I maintain what's analogous to a library - the files in one directory are
like the books in one aisle, named by their content.  Many end users,
many of whom I won't know, look at them, so I can't require proprietary
software or assume advanced computer skills.  Each "book" refers to one
data set, which still contains up to 20,000 lines, and SPSS's break and
summary commands in reports make them very easy to flip through and
understand.  Wordpad or any text editor they have will work.  I have been
praised for their organization and ease of use.

>Breaking up the large file into 10 pieces, using XSAVE logic, and
>running each report against the pertinent smaller file, should give
>near a ten-fold saving, very easily.

I absolutely agree that cutting the big file in pieces saves time.
Unfortunately, the XSAVE command has eluded my grasp.  I understand the
manual's definition that SAVE causes an immediate read and write of all
the data, while XSAVE "stores up" multiple saves until something triggers
a data pass.  When I use LOOP and XSAVE, for example,

loop a = 1 to 5.
        xsave /outfile = 'c:test.sav' /compressed.
end loop.
exe.

I get a file that contains 5 iterations of my original file with the
values 1 to 5 in a new variable 'a' for each iteration.  I'm not
succeeding in understanding how it will reduce multiple saves to one data
pass.  I tried this, with and without the 'temporary' commands:

temporary.
select if (rt >= 1 and rt <= 11).
xsave /outfile = 'c:\testa.sav' /compressed.
temporary.
select if (rt >= 12 and rt <= 20).
xsave /outfile = 'c:\testb.sav' /compressed.
temporary.
select if (rt >= 21 and rt <= 30).
xsave /outfile = 'c:\testc.sav' /compressed.
temporary.
select if (rt >= 31 and rt <= 50).
xsave /outfile = 'c:\testd.sav' /compressed.
temporary.
select if (rt >= 51 and rt <= 70).
xsave /outfile = 'c:\teste.sav' /compressed.
temporary.
select if (rt >= 71 and rt <= 100).
xsave /outfile = 'c:\testf.sav' /compressed.
temporary.
select if (rt >= 101 and rt <= 121).
xsave /outfile = 'c:\testg.sav' /compressed.
temporary.
select if (rt >= 122 and rt <= 220).
xsave /outfile = 'c:\testh.sav' /compressed.
exe.

It saved 8 files with the right names in one data pass, and the first file
'testa.sav' correctly contains the records where rt = 1 through 11.  The
rest of the files are empty.  What am I missing?

Re: Streamlining (was: Computer Buying Help)

Richard Ristow
At 03:24 PM 2/1/2007, Christopher Cronin wrote:

>At 02/01/2007 02:01:38 PM  Richard Ristow wrote:
>
>>Breaking up the large file into 10 pieces, using XSAVE logic, and
>>running each report against the pertinent smaller file, should give
>>near a ten-fold saving, very easily.
>
>I absolutely agree that cutting the big file in pieces saves time. I
>tried this, with and without the 'temporary' commands:
>
>temporary.
>select if (rt >= 1 and rt <= 11).
>xsave /outfile = 'c:\testa.sav' /compressed.
>temporary.
>select if (rt >= 12 and rt <= 20).
>xsave /outfile = 'c:\testb.sav' /compressed.
>temporary.
>select if (rt >= 21 and rt <= 30).
>xsave /outfile = 'c:\testc.sav' /compressed.
>temporary.
>select if (rt >= 31 and rt <= 50).
>xsave /outfile = 'c:\testd.sav' /compressed.
>temporary.
>select if (rt >= 51 and rt <= 70).
>xsave /outfile = 'c:\teste.sav' /compressed.
>temporary.
>select if (rt >= 71 and rt <= 100).
>xsave /outfile = 'c:\testf.sav' /compressed.
>temporary.
>select if (rt >= 101 and rt <= 121).
>xsave /outfile = 'c:\testg.sav' /compressed.
>temporary.
>select if (rt >= 122 and rt <= 220).
>xsave /outfile = 'c:\testh.sav' /compressed.
>exe.
>
>It saved 8 files with the right names in one data pass, and the first
>file 'testa.sav' correctly contains the records where rt = 1 through
>11.  The rest of the files are empty.  What am I missing?

XSAVE is a transformation command, not a procedure. All XSAVEs (until a
procedure, SAVE, or EXECUTE) are in the same transformation program,
and you can't have separate 'TEMPORARY' states in the same
transformation program.

Instead of TEMPORARY/SELECT IF logic, use DO IF logic, as outlined in
my posting "Re: Computer Buying Help", Wed, 31 Jan 2007 (12:40:54
-0500).
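
For concreteness, the same eight saves written with DO IF logic
(untested) - all eight files should then be written in a single data
pass:

do if (rt >= 1 and rt <= 11).
xsave /outfile = 'c:\testa.sav' /compressed.
else if (rt >= 12 and rt <= 20).
xsave /outfile = 'c:\testb.sav' /compressed.
else if (rt >= 21 and rt <= 30).
xsave /outfile = 'c:\testc.sav' /compressed.
else if (rt >= 31 and rt <= 50).
xsave /outfile = 'c:\testd.sav' /compressed.
else if (rt >= 51 and rt <= 70).
xsave /outfile = 'c:\teste.sav' /compressed.
else if (rt >= 71 and rt <= 100).
xsave /outfile = 'c:\testf.sav' /compressed.
else if (rt >= 101 and rt <= 121).
xsave /outfile = 'c:\testg.sav' /compressed.
else if (rt >= 122 and rt <= 220).
xsave /outfile = 'c:\testh.sav' /compressed.
end if.
* One EXECUTE (or the first report procedure) triggers the single pass.
exe.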

Re: Streamlining (was: Computer Buying Help)

Gary Rosin
In reply to this post by Chris Cronin
>pass.  I tried this, with and without the 'temporary' commands:
>
>temporary.
>select if (rt >= 1 and rt <= 11).
>xsave /outfile = 'c:\testa.sav' /compressed.
>temporary.
>select if (rt >= 12 and rt <= 20).
>xsave /outfile = 'c:\testb.sav' /compressed.
>temporary.
>select if (rt >= 21 and rt <= 30).
>xsave /outfile = 'c:\testc.sav' /compressed.
>temporary.
>* * *
>select if (rt >= 122 and rt <= 220).
>xsave /outfile = 'c:\testh.sav' /compressed.
>exe.
>
>It saved 8 files with the right names in one data pass, and the first file
>'testa.sav' correctly contains the records where rt = 1 through 11.  The
>rest of the files are empty.  What am I missing?

I'm hesitant to jump in and try to run with the big dogs, but the
way I read the Command Syntax Reference for 15.0, temporary
transformations remain in effect until the next time the data file is
read.  XSAVE also is not executed until the next time the data are read.
Doesn't that mean that the *first* SELECT IF is still pending when the
later TEMPORARY/SELECT IF commands are entered?  Because the
intervals do not overlap, the later selections are the null set.

Gary





     ---

Prof. Gary S. Rosin              Internet:  [hidden email]
South Texas College of Law
1303 San Jacinto                   Voice:  (713) 646-1854
Houston, TX  77002-7000           Fax:  (713) 646-1766

Re: Streamlining (was: Computer Buying Help)

Richard Ristow
At 05:02 PM 2/1/2007, Gary Rosin wrote:

>>I tried this, with and without the 'temporary' commands:
>>
>>temporary.
>>select if (rt >= 1 and rt <= 11).
>>xsave /outfile = 'c:\testa.sav' /compressed.
>>temporary.
>>select if (rt >= 12 and rt <= 20).
>>xsave /outfile = 'c:\testb.sav' /compressed.
>>* * *
>>select if (rt >= 122 and rt <= 220).
>>xsave /outfile = 'c:\testh.sav' /compressed.
>>exe.
>>
>>It saved 8 files with the right names in one data pass, and the first
>>file 'testa.sav' correctly contains the records where rt = 1 through
>>11.  The rest of the files are empty.  What am I missing?
>
>The way I read the Command Syntax Reference for 15.0, temporary
>transformations remain in effect until the next time the data file is
>read.  XSAVE also is not executed until the next time the data are read.
>Doesn't that mean that the *first* SELECT IF is still pending when the
>later TEMPORARY/SELECT IF commands are entered?  Because the intervals
>do not overlap, the later selections are the null set.

Bingo. Exactly.
Richard

Re: Computer Buying Help

Richard Ristow
In reply to this post by Jeff-125
At 10:48 AM 2/1/2007, Jeff wrote:

>At 09:27 PM 1/30/2007, Richard R. wrote:
>>- SCSI vs IDE / Disk speed / space
>>Here's the big one: your time is probably dominated by disk
>>transfers.
>>[...]
>
>...just a few comments from someone who builds computers.

Thank you very much for those. I certainly learned a lot I hadn't known
about best disk transfer speed per cost, and that's critically important
when running statistical programs with large input files.

Thanks!
Richard