I downloaded and scanned all previous posts related to the topic. There are many, but I remain unclear on how to prioritize. I have to spec a new computer to run SPSS 13 to replace an existing installation.

My general daily computing comprises reprocessing about 4 million records. Sorting is the most time-consuming task, followed closely by case-by-case transformations. The daily routine (a couple hundred lines of read text data, process, save, and produce about 120 reports) can take upwards of 12 hours on a Windows 2000 machine with 512 MB of RAM, a 2 GHz Pentium 4, and an 80 GB IDE hard disk. My data volume will be increasing by 30%, and I need to get the routine down to the least overall processing time possible.

Which factors will most greatly increase processing speed?

- Windows XP
- Up to 2 GB RAM
- Processor speed
- Multiple processors
- Bus speed
- SCSI vs. IDE / disk speed / space

Thanks for your assistance.

Chris Cronin
Chris,
Windows XP would probably be a good start. 2 GB of RAM would greatly increase your computing power for heavy analysis. An Intel Core 2 Duo processor would be a great help; failing that, a processor with at least a 2 GHz clock speed would do pretty well for you. As for bus speed, I can't answer that. A high-volume hard drive (at least 120 GB) would be beneficial, and an even larger one can never hurt. You can never have too much power ;)

Hope that helps - perhaps someone else has a better suggestion for him.
In reply to this post by Chris Cronin
At 09:47 AM 1/30/2007, Christopher Cronin wrote:
>My general daily computing comprises reprocessing about 4 million records. Sorting is the most time-consuming task, followed closely by case-by-case transformations. The daily routine can take upwards of 12 hours on a win2000 machine with 512M RAM and Pentium 4 2GHz and 80G IDE hard disk. Which factors will most greatly increase processing speed?
>
>- Windows XP
>- Up to 2G RAM
>- Processor Speed
>- Multiple Processors
>- Bus Speed
>- SCSI vs IDE / Disk speed / space

Here's general advice regarding these:

- Windows XP

It's time to go to Windows XP anyway, but I don't expect that, by itself, it will speed up your job much.

- Up to 2G RAM

Definitely, do it. By current standards, especially with XP, 512 MB is considerably too low. You should notice an improvement, but it's hard to tell how much. It will make more difference with very 'wide' records (many variables); it will make a great deal of difference if any part of SPSS is being paged out.

- Processor Speed
- Multiple Processors

You're likely to see little or no effect from either. (I'm not aware that SPSS can even take advantage of multiple processors.) Your job's time probably goes mostly to data transfer, especially to and from disk, with computation taking only a small part.

- Bus Speed

Same as for processor speed, though it might have a somewhat larger effect, since it speeds transmission between main memory and the CPU.

- SCSI vs IDE / Disk speed / space

Here's the big one: your time is probably dominated by disk transfers. I'd say,

. Space won't matter much once you have 'plenty', but be generous about that. Calculate space for your input file and all files you're creating, and make sure you have at least double that, free. (Among other things, that avoids contention if SPSS keeps an old copy of a file active until a new one is completely created.)
. Speed, at almost any price. As for protocol (SCSI/IDE), go for the fastest. (Isn't that SATA now?)
. Strongly consider two disks. Arrange your job so that, when you're creating or modifying a file, the input is read from one disk and the output file written to the other.

....................

You've likely enough checked all this, but it's worth mentioning ways to arrange your job to run faster. You write that your job has "a couple hundred lines of read text data, process, save, and produce about 120 reports". Techniques for efficiency include:

. If that's 120 different passes through the data file, see whether you can arrange your logic to do several reports with each pass.
. Create the smallest working file you possibly can. Drop all variables you aren't using; select out all cases you don't need.
. Use CACHE. That may have a huge effect if your initial input is expensive (like most GET DATA /TYPE=ODBC, or a complex GET DATA /TYPE=TXT), or if you can select a small subset of your variables or cases. (Failing to use CACHE can eliminate the effectiveness of selection. A small sketch follows this message.)
. Of course, take full advantage of separate disks, if you have them.
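A minimal, untested sketch of the CACHE point above; the file name, column layout, variable names, and selection condition are all invented for illustration:

* A minimal, untested sketch of the CACHE idea.
* The file name, column positions, variable names, and selection are invented.
* The expensive text parse and the SELECT IF happen once, and later
  procedures reread the cached working file instead of the text file.
DATA LIST FILE='C:\data\daily.txt' FIXED
  /rt 1-3 stop 4-8 boardings 9-13 alightings 14-18.
SELECT IF (rt LE 220).
CACHE.
EXECUTE.
* Procedures from here on (FREQUENCIES, REPORT, and so on) read the cache.
FREQUENCIES VARIABLES=rt.

The point of the CACHE/EXECUTE pair is that the single forced pass both applies the selection and writes the cache, so no later procedure has to re-parse the raw text.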
This is an on-list response to an off-list follow-up.
At 08:46 AM 1/31/2007, [hidden email] wrote, off-list:

>The routine starts by reading 4 million lines of text data of 55 variables each, then sorting by 11 variables, computing a few new variables, and saving to a .sav file. I need to pull it in and keep as a .sav file for further processing - rereading the text data for doing reports is impossibly slow.

You're certainly right about that last point.

>The way their programs work, data can be added anywhere in their storage file so I have to re-read the whole thing every day.

If you can identify, when you're reading them, which records have been added since the last run, then A., below, will probably help you; B. MAY help you. These assume that the previous complete version of the file, sorted, is saved under name or handle PREVIOUS. New records will be saved in file TODAY, and the complete new file written to file CURRENT. PREVIOUS and TODAY may be on the same disk; CURRENT should be on a different disk from both of them. To make this work well with a series of updates, you'll have to change your file handle definitions for each new run. (If you're from the old IBM days, think 'generation data groups'.)

A. Save time in transformations, sorting and saving. Code not tested:

DATA LIST FILE=<big text file> {FIXED|FREE|LIST}
   / <variable specifications>.
SELECT IF <it's a new record>.
<transformation commands>
SORT CASES BY <sort variables>.
ADD FILES
   /FILE=PREVIOUS
   /FILE=*
   /BY <sort variables>.
XSAVE OUTFILE=CURRENT.
<First procedure that uses the updated file>
GET FILE=CURRENT.
<Other procedures that use the updated file>.

B. Save computing time in DATA LIST (which may or may not matter). This isn't tested, either, and I have less experience with this logic.

INPUT PROGRAM.
.  DATA LIST FILE=<big text file> {FIXED|FREE|LIST}
      / <variables needed for selection>.
.  DO IF <it's a new record>.
.     REREAD.
.     DATA LIST FILE=<big text file> {FIXED|FREE|LIST}
         / <all variables>.
.     END CASE.
.  END IF.
END INPUT PROGRAM.

>I do as many transformation commands as possible between each "exe" command.

WHOOPS! Biggie! Red alert!

'EXECUTE' commands are very expensive with big files. "As many commands as possible"? There's no limit, certainly not one you're likely to hit, in the number of commands in a transformation program. There are specific reasons you need EXECUTE: see "Use EXECUTE sparingly" in any edition of Raynald Levesque's book(1), or more postings of mine than I care to count(2). If there's not one of these specific reasons for any of your EXECUTEs, take it out. There's a very large chance you need no EXECUTEs at all.

>The second step selects only data gathered on weekdays and saves out, keeping only the variables needed for the first batch of reports, gets the modified version back, and then uses temporary - select if (var1 = 1 through 200, which means 200 re-reads through the data) for each report. Then repeat the process for Saturday and Sunday, deleting the partial files as it goes.

The following should be MUCH faster, if (as sounds likely) all 200 reports are the same except for the cases selected. Code still not tested:

GET FILE=CURRENT.
* The following SORT isn't necessary if 'var1' is
  the first variable in list '<sort variables>', above.
.  SORT CASES BY var1.
SPLIT FILE BY var1.
<Run report>.

If you don't run the same report each time, you can do 20 data passes (or 19) instead of 200 by using XSAVE to write 10 or 11 of the selected files in one data pass. (Only 11, because 10 is the maximum number of XSAVEs in one input program.) The TEMPORARY/ SELECT IF/ SAVE to save the 11th may be more trouble than it's worth; in that case, replace it by a <gasp!> EXECUTE. You'll probably want macro loops or Python loops to generate this code, which is two nested loops: through values 1 thru 200, 11 or 10 at a time, generating one transformation program per pass; then through 11 or 10 consecutive values, generating the <test>/XSAVE pairs. (A sketch of such a macro is appended after this message.)

Still untested:

GET FILE=CURRENT.
DO IF var1 EQ 001.
.  XSAVE OUTFILE=Rpt001.
ELSE IF var1 EQ 002.
.  XSAVE OUTFILE=Rpt002.
...
ELSE IF var1 EQ 010.
.  XSAVE OUTFILE=Rpt010.
END IF.
TEMPORARY.
SELECT IF var1 EQ 011.
SAVE OUTFILE=Rpt011.
<etc., for the rest of the reports>
GET FILE=Rpt001.
<run report for var1=1>
GET FILE=Rpt002.
<run report for var1=2>
...

Better hardware is a good thing, and you should probably have it for jobs your size. But there are more ways to solve problems than by throwing hardware at them.

-Cheers, and good luck,
Richard
....................
(1) Levesque, Raynald, "SPSS® Programming and Data Management: A Guide for SPSS® and SAS® Users". SPSS Inc., Chicago, IL, 2005. You can download it free as a PDF file from http://www.spss.com/spss/SPSS_programming_data_mgmt.pdf.

(2) For example, Ristow, Richard, "EXECUTE is sometimes useful", SPSSX-L, Thu, 4 May 2006 (12:10:14 -0400).
....................

>Thank you. This is very helpful indeed. Yes, I didn't specify, but the routine starts by reading 4 million lines of text data of 55 variables each, then sorting by 11 variables, computing a few new variables, and saving to a .sav file. I do as many transformation commands as possible between each "exe" command.
>
>The reason is that one of our vendor's equipment stores gathered data as text files and I need to pull it in and keep as a .sav file for further processing - rereading the text data for doing reports is impossibly slow. The way their programs work, data can be added anywhere in their storage file so I have to re-read the whole thing every day. (I've mentioned to them that they are about 20 years behind, but they won't budge.) I have to keep the full file the first time. The second step selects only data gathered on weekdays and saves out, keeping only the specific variables needed for the first batch of reports, gets the modified version back, and then uses temporary - select if (var1 = 1 through 200, which means 200 re-reads through the data) for each report. Then repeat the process for Saturday and Sunday, deleting the partial files as it goes. Anyway, I'll use your recommendations to design the computer with multiple disks and do my reads and writes to separate disks. Maximize RAM as well. Thanks again.
>
>Chris
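A sketch, untested, of the kind of macro loop just described. The macro name, its keyword arguments, and the file-name stem are invented; the quoted file specification is built with the usual !UNQUOTE/!QUOTE idiom:

* Untested sketch of the macro-loop idea above (names and paths invented).
* One call writes one DO IF / ELSE IF / END IF block of at most 10 XSAVEs.
* File names are not zero-padded here.
DEFINE !xsplit (first = !TOKENS(1) /last = !TOKENS(1)
               /stem  = !TOKENS(1) /ext  = !TOKENS(1))
!DO !i = !first !TO !last
!IF (!i !EQ !first) !THEN
DO IF var1 EQ !i.
!ELSE
ELSE IF var1 EQ !i.
!IFEND
  XSAVE OUTFILE = !QUOTE(!CONCAT(!UNQUOTE(!stem), !i, !UNQUOTE(!ext))).
!DOEND
END IF.
!ENDDEFINE.

* Each call below is one transformation program, and the EXECUTE
  (or any following procedure) makes the one data pass that writes its files.
GET FILE = CURRENT.
!xsplit first =  1 last = 10 stem = 'D:\work\Rpt' ext = '.sav'.
EXECUTE.
!xsplit first = 11 last = 20 stem = 'D:\work\Rpt' ext = '.sav'.
EXECUTE.

Extending the calls across all 200 values, 10 at a time, gives the roughly 20 data passes described above instead of 200.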
Hi,
Testing (debugging) your syntax on a small subset of your data -- fewer cases -- can also save a lot of time. For example:

N OF CASES 1000.
[... your syntax ...]

or:

SET SEED 12345.
SAMPLE .05.
[... your syntax ...]

Maybe I overlooked it in Richard's e-mail, but Boolean short-circuiting is a way to speed up the processing of, e.g., DO IF structures: the most common/likely condition should be specified first. (A small example follows this message.)

I am not sure whether using BY processing is faster, but I believe so. If not, it still saves a lot of time when you process your output in, e.g., Excel.

SORT CASES BY myvar. /* yes, this is time-consuming!.
TEMPORARY.
SPLIT FILE LAYERED BY myvar.
[... your syntax ...]

The following link is about programming efficiency in SAS, but it can still be of use to SPSS users:
http://www.ats.ucla.edu/stat/SAS/library/nesug00/bt3005.pdf

Btw, the e-book by Raynald Levesque is a must.

Cheers!
Albert-Jan
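A small, untested illustration of that branch-ordering point; the variable name, its values, and the category shares are invented:

* Roughly five of every seven cases are weekdays, so test that first --
  the two weekend tests then run for far fewer cases.
DO IF (daytype EQ 'WEEKDAY').
  COMPUTE daygrp = 1.
ELSE IF (daytype EQ 'SATURDAY').
  COMPUTE daygrp = 2.
ELSE IF (daytype EQ 'SUNDAY').
  COMPUTE daygrp = 3.
END IF.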
Thanks all for your assistance. The suggestions related to changing my logic to speed things up are particularly helpful. Streamlining code is an art in every coding language that will always present challenges and new rewards as new solutions are found.

One suggestion was that I use split file and layering to reduce the number of data passes for the 200-report routine. I haven't been able to get all the pieces to work together. Specifically, the 4-million record file in its entirety produces reports too large for others to use. They don't have SPSS, so I have to save the reports as text files, and Wordpad is not meant to handle 4-million-line text files. So I break it into many separate reports, repeated iteratively in the syntax file, and each report uses the /OUTFILE subcommand to produce reasonably sized text reports others can use, arranged by logical file names and, inside each report, specific page titles. I've trained people how to interpret the file names to find their data. For example (paraphrased for brevity):

temporary.
select if (var1 = 1).
report
  /format = <<format statements>>
  /outfile = 'g:\apc_reports\ride_checks\01wkwi06.txt'
     (01 is the value of the first select if - this means "route 1, weekday, winter 2006")
  /title = center
     '*****************************'
     'Route 1'
     (1 is the value of the first select if)
     'Winter, 2006'
     'TRIP ORDER'
     '*****************************'
  /vars = varlist
  /break = <<break & summary lists>>.

temporary.
select if (var1 = 2).
report
  /format = <<format statements>>
  /outfile = 'g:\apc_reports\ride_checks\02wkwi06.txt'
     (02 is the value of the second select if - this means "route 2, weekday, winter 2006")
  /title = center
     '*****************************'
     'Route 2'
     'Winter, 2006'
     'TRIP ORDER'
     '*****************************'
  /vars = varlist
  /break = <<break & summary lists>>.

etc., 3 => 200.

I know I can use the /STRING statement for the page titles. Can you create the filename statement in a split-file solution? I also realize that it could be done in a script using macros, but again, you'd be 'including' the same report syntax file 200 times - would it be any faster?

The final issue is that the values for var1 are not completely contiguous. A partial valid-value list looks like 1, 2, 3, 4, 5, 6, 7, 8, 11, 12, 13, 14, 15, 16, 18, 19, 20, 22, etc. This would be fine for SPLIT FILE BY var1, but not for a loop over 1 to 200, as some values would produce empty reports. Any thoughts?
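One possible direction, sketched and untested: drive each report from an explicit list of the valid var1 values, and let a macro build the /OUTFILE name from each value. The macro name, the file-name stem and suffix, and the report variables are invented, the file names are not zero-padded, and the route number could be spliced into the title the same way (or via /STRING, as you note):

* Untested sketch: loop over an explicit list of valid var1 values, so
  non-contiguous values never generate an empty report.
DEFINE !ridechk (stem = !TOKENS(1) /ext = !TOKENS(1) /routes = !CMDEND)
!DO !r !IN (!routes)
TEMPORARY.
SELECT IF (var1 EQ !r).
REPORT
  /FORMAT = LIST
  /OUTFILE = !QUOTE(!CONCAT(!UNQUOTE(!stem), !r, !UNQUOTE(!ext)))
  /VARIABLES = trip stop ons offs
  /TITLE = CENTER '*****************************'
           'Winter, 2006' 'TRIP ORDER'
           '*****************************'
  /BREAK = trip
  /SUMMARY = SUM(ons) SUM(offs).
!DOEND
!ENDDEFINE.

* One REPORT (one pass over the working file) per listed value.
!ridechk stem   = 'g:\apc_reports\ride_checks\'
         ext    = 'wkwi06.txt'
         routes = 1 2 3 4 5 6 7 8 11 12 13 14 15 16 18 19 20 22.

This keeps the same one-pass-per-report pattern you have now; combining it with the XSAVE splitting discussed earlier would make each of those passes run over a much smaller file.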
In reply to this post by Richard Ristow
At 09:27 PM 1/30/2007, Richard R. wrote:
>- SCSI vs IDE / Disk speed / space
>Here's the big one: your time is probably dominated by disk transfers. I'd say,
>. Space won't matter much once you have 'plenty', but be generous about that. Calculate space for your input file and all files you're creating, and make sure you have at least double that, free. (Among other things, that avoids contention if SPSS keeps an old copy of a file active until a new one is completely created.)
>. Speed, at almost any price. As for protocol (SCSI/IDE), go for the fastest.
>. Strongly consider two disks. Arrange your job so, when you're creating or modifying a file, input is read from one, output file written to the other.

...just a few comments from someone who builds computers.

RAM is probably the most important factor, and hard drive speed the next most important, for the type of analyses I think you're describing.

Standard Windows XP or Windows 2000 will handle up to 4 GB of RAM; anything more requires a 64-bit version of Windows. I'm not positive, however, whether SPSS will run on a 64-bit machine, nor whether it can take advantage of 64-bit hardware and operating systems.

Some software can take advantage of multiple processors and multiple cores, but other software can't. I've messed with SPSS and multiple processors back around version 9, but not since. Back then, I could get no benefit from 2 processors. Things may have changed. It also matters whether you are speaking about running SPSS alone, or with other programs at the same time. E.g., multiple processors might help if you are running Office while waiting for SPSS to perform some long task, but not if you are simply waiting for SPSS to finish - I'm not sure here. My understanding is that Vista will be better able to make use of multiple processors and multiple cores.

Hard drives - there is much misinformation out there. The manufacturers advertise "interface" speeds, e.g. 3 Gb/sec SATA drives. The speed of the interface is all but irrelevant to the actual performance of the drive, all else being equal. The old drives were labelled IDE/ATA (essentially the same thing); now these same drives are called PATA (parallel ATA) - you could place up to 2 on the same cable in parallel. The new version is called SATA (serial ATA) - each drive connects serially, one drive = one cable. The SATA interface speed (either 1.5 Gb/sec or 3.0 Gb/sec), again, is essentially irrelevant unless you are running a server with multiple drives on the same channel. The drive itself will run nowhere close to this speed: a very fast drive will run at about 75 MB/sec, considerably slower than the advertised interface speed. However, newer drives will run slightly faster than older models, so a newer SATA will probably be faster than an older PATA, but not because of the interface per se. Where the exact same hardware is available in both PATA and SATA, they will transfer data at the same speed.

SCSI is being replaced by SAS (serial SCSI). SAS drives (and the older SCSI) are primarily designed for servers and are much more expensive. I have four 73 GB SAS drives on a server that cost about $450 each; the backup SATA drive for the machine is 400 GB and cost about $120. The advantage of SCSI or SAS is primarily its low latency - how long between a request being made and data being delivered. SCSI or SAS is far superior here, but that is nearly irrelevant for statistical software. More important is the transfer rate, which for any interface is largely determined by the rotational speed of the drive. So a 7200 rpm SCSI drive will transfer at almost the same rate as a 7200 rpm SATA, but the SCSI will start the transfer just a split second sooner - again, I think this is largely irrelevant for stat software. The SCSI (and SAS) drives do, however, come in higher rpm ratings: the current maximum for SCSI/SAS is 15,000 rpm, and the maximum for SATA is 10,000 rpm (in the Western Digital Raptor). The more common rpm for SATA is now 7200.

Instead of more expensive drives, you might consider a RAID system. RAID is a redundant array of inexpensive (or independent) disks: you combine two or more. Consider RAID 0 with 2 disks if you have a small budget and wish to get about twice the transfer rate of a single drive. Consider RAID 10 or a few other arrangements for other advantages (search the web on this one; it gets more complex). My 4-drive, 15K rpm SAS RAID 10 array can read data at over 270 MB/sec (unfortunately not the machine that I run stat software on, because it would fly at this speed), whereas the typical rate is under 70.

If you want a good drive setup for running SPSS, here is what I would consider, in order of increasing cost:

1. (Lowest cost, but more than the performance of the typical stock computer) A Western Digital Raptor 10,000 rpm SATA drive in a typical single-drive setup.
2. Dual WD Raptors in a standard non-RAID setup - use one for the OS and one for the data, or one to read and one to write.
3. Dual WD Raptors in RAID 0 (double the cost, same storage space as one drive, more chance of malfunction - if one drive goes bad, you lose your data and OS, so make sure you have a backup on another drive).
4. Three WD Raptors in RAID 5 (triple the cost, twice the storage space, one drive can go bad and you still keep the data, about twice the read and write speed of a single drive, but it may slow down the processor in most setups unless you get a hardware RAID card - check Google for more info).
5. Four Raptors in RAID 10 (4x the cost, twice the storage space, one or two drives can go bad, up to 4 times the read speed and twice the write speed of a single drive, less processor slow-down than the RAID 5).
6. 15K rpm SCSI or SAS drives (now you're getting really expensive) in one of the RAID setups.
7. 15K rpm SCSI or SAS in RAID with a hardware RAID controller (start thinking about $3000 just for the drives and controller hardware).

I would definitely go with a RAID setup of 2 or more slower drives before considering SCSI or SAS for your application.
In reply to this post by Chris Cronin
At 10:03 AM 2/1/2007, Christopher Cronin wrote:
>The 4-million record file in its entirety produces reports too large for others to use. I have to save the reports as text files. Wordpad is not meant to handle 4 million line text files.

First thought: if this is so important, and so burdensome, wouldn't there be something better than Wordpad? I don't know editors and text-handlers well, but there must be something that could read the huge output file and separate it into parts.

>So I break [the job] into many separate reports, repeated iteratively in the syntax file, and each report uses the /outfile command to produce reasonably sized text reports others can use, [with] logical file names, and specific page titles. For example (paraphrased for brevity):
>
>temporary.
>select if (var1 = 1).
>report [...]
>temporary.
>select if (var1 = 2).
>report [...]
>etc 3 => 200
>
>I know I can use the /string statement for the page titles. Can you create the filename statement in a split file solution? I also realize that it could be done in a script using macros, but again, you'd be 'including' the same report syntax file 200 times, would it be any faster?

All right. The big issue is: your production job reads the entirety of your big file 200 times. That's an inefficiency that's beyond glaring. When you started having questions about speed, your first thought should have been to look at that and say, "There HAS to be a better way."

I have no idea whether the time to read, transform, and save the data is even noticeable, by comparison with the time for those 200 report passes. For this job, almost the only relevant hardware speed parameter is overall transfer rate from disk.

Breaking up the large file into 10 pieces, using XSAVE logic, and running each report against the pertinent smaller file, should give near a ten-fold saving, very easily. No hardware improvement will come near that. It may also be quicker to sort the 10 smaller files individually, rather than the large file all together.

It looks like the syntax file for your reports has all 200 (or almost 200) reports hard-coded in. That's clumsy for you to write and maintain, but it makes changing over to use the smaller files very easy. (Macros, or Python code-generating code, could make your code considerably more compact and maintainable, but no faster.)

-Cheers, and good luck,
Richard
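A rough back-of-envelope supports this, assuming 8 bytes per numeric value in an uncompressed working file and the roughly 70 MB/sec sustained disk rate quoted earlier in the thread:

   4,000,000 cases x 55 variables x 8 bytes  =  about 1.76 GB per full pass
   1.76 GB at 70 MB/sec                      =  about 25 seconds of raw transfer per pass
   200 report passes x 25 seconds            =  about 84 minutes of transfer alone

Compression and seek overhead shift these figures, but the ratio is the point: cutting 200 full passes to 20 - or to one pass over a small per-report file - buys more than any affordable disk upgrade.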
In reply to this post by Chris Cronin
At 02/01/2007 02:01:38 PM Richard Ristow wrote:
>First thought: If this is so important, and so burdensome, wouldn't there be something better than Wordpad? I don't know editors and text-handlers well, but there must be something that could read the huge output file and separate it into parts.

I maintain what's analogous to a library - the files in one directory are the books in, say, one aisle, named by their content. Many end users, many of whom I won't know, look at them, so I can't require proprietary software or assume advanced computer skills. Each "book" refers to one data set, which still contains up to 20,000 lines, and SPSS's break and summary commands in REPORT make them very easy to flip through and understand. Wordpad or any text editor they have will work. I have been well praised for their organization and ease of use.

>Breaking up the large file into 10 pieces, using XSAVE logic, and running each report against the pertinent smaller file, should give near a ten-fold saving, very easily.

I absolutely agree that cutting the big file into pieces saves time. Unfortunately, the XSAVE command has eluded my grasp. I understand the manual's definition that SAVE causes an immediate read and write of all the data, and XSAVE "stores up" multiple commands until something triggers an execute. When I use LOOP and XSAVE, for example:

loop a = 1 to 5.
xsave /outfile = 'c:\test.sav' /compressed.
end loop.
exe.

I get a file that contains 5 iterations of my original file, with the values 1 to 5 in a new variable 'a' for each iteration. I'm not succeeding in understanding how XSAVE will reduce multiple saves to one data pass. I tried this, with and without the 'temporary' commands:

temporary.
select if (rt >= 1 and rt <= 11).
xsave /outfile = 'c:\testa.sav' /compressed.
temporary.
select if (rt >= 12 and rt <= 20).
xsave /outfile = 'c:\testb.sav' /compressed.
temporary.
select if (rt >= 21 and rt <= 30).
xsave /outfile = 'c:\testc.sav' /compressed.
temporary.
select if (rt >= 31 and rt <= 50).
xsave /outfile = 'c:\testd.sav' /compressed.
temporary.
select if (rt >= 51 and rt <= 70).
xsave /outfile = 'c:\teste.sav' /compressed.
temporary.
select if (rt >= 71 and rt <= 100).
xsave /outfile = 'c:\testf.sav' /compressed.
temporary.
select if (rt >= 101 and rt <= 121).
xsave /outfile = 'c:\testg.sav' /compressed.
temporary.
select if (rt >= 122 and rt <= 220).
xsave /outfile = 'c:\testh.sav' /compressed.
exe.

It saved 8 files with the right names in one data pass, and the first file 'testa.sav' correctly contains the records where rt = 1 through 11. The rest of the files are empty. What am I missing?
At 03:24 PM 2/1/2007, Christopher Cronin wrote:
>At 02/01/2007 02:01:38 PM Richard Ristow wrote:
>
>>Breaking up the large file into 10 pieces, using XSAVE logic, and running each report against the pertinent smaller file, should give near a ten-fold saving, very easily.
>
>I absolutely agree that cutting the big file in pieces saves time. I tried this, with and without the 'temporary' commands:
>
>temporary.
>select if (rt >= 1 and rt <= 11).
>xsave /outfile = 'c:\testa.sav' /compressed.
>temporary.
>select if (rt >= 12 and rt <= 20).
>xsave /outfile = 'c:\testb.sav' /compressed.
>* * *
>temporary.
>select if (rt >= 122 and rt <= 220).
>xsave /outfile = 'c:\testh.sav' /compressed.
>exe.
>
>It saved 8 files with the right names in one data pass, and the first file 'testa.sav' correctly contains the records where rt = 1 through 11. The rest of the files are empty. What am I missing?

XSAVE is a transformation command, not a procedure. All XSAVEs (until a procedure, SAVE, or EXECUTE) are in the same transformation program, and you can't have separate 'TEMPORARY' states in the same transformation program. Instead of TEMPORARY/SELECT IF logic, use DO IF logic, as outlined in my posting "Re: Computer Buying Help", Wed, 31 Jan 2007 (12:40:54 -0500).
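Along those lines, an untested reworking of the posted attempt, keeping the original file names and ranges:

DO IF (rt >= 1 AND rt <= 11).
  XSAVE OUTFILE = 'c:\testa.sav' /COMPRESSED.
ELSE IF (rt >= 12 AND rt <= 20).
  XSAVE OUTFILE = 'c:\testb.sav' /COMPRESSED.
ELSE IF (rt >= 21 AND rt <= 30).
  XSAVE OUTFILE = 'c:\testc.sav' /COMPRESSED.
ELSE IF (rt >= 31 AND rt <= 50).
  XSAVE OUTFILE = 'c:\testd.sav' /COMPRESSED.
ELSE IF (rt >= 51 AND rt <= 70).
  XSAVE OUTFILE = 'c:\teste.sav' /COMPRESSED.
ELSE IF (rt >= 71 AND rt <= 100).
  XSAVE OUTFILE = 'c:\testf.sav' /COMPRESSED.
ELSE IF (rt >= 101 AND rt <= 121).
  XSAVE OUTFILE = 'c:\testg.sav' /COMPRESSED.
ELSE IF (rt >= 122 AND rt <= 220).
  XSAVE OUTFILE = 'c:\testh.sav' /COMPRESSED.
END IF.
EXECUTE.

One data pass writes all eight files, and each case lands in at most one of them - which is what the stacked TEMPORARY/SELECT IF version could not do.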
In reply to this post by Chris Cronin
>I tried this, with and without the 'temporary' commands:
>
>temporary.
>select if (rt >= 1 and rt <= 11).
>xsave /outfile = 'c:\testa.sav' /compressed.
>temporary.
>select if (rt >= 12 and rt <= 20).
>xsave /outfile = 'c:\testb.sav' /compressed.
>temporary.
>select if (rt >= 21 and rt <= 30).
>xsave /outfile = 'c:\testc.sav' /compressed.
>temporary.
>* * *
>select if (rt >= 122 and rt <= 220).
>xsave /outfile = 'c:\testh.sav' /compressed.
>exe.
>
>It saved 8 files with the right names in one data pass, and the first file 'testa.sav' correctly contains the records where rt = 1 through 11. The rest of the files are empty. What am I missing?

I'm hesitant to jump in and try to run with the big dogs, but the way I read the Command Syntax Reference for 15.0, temporary transformations remain in effect until the next time the data file is read. XSAVE also is not executed until the next time the data are read. Doesn't that mean that the *first* SELECT IF is still pending when the later TEMPORARY/SELECT IF commands are entered? Because the intervals do not overlap, the later selections are the null set.

Gary

---
Prof. Gary S. Rosin               Internet:  [hidden email]
South Texas College of Law
1303 San Jacinto                  Voice:  (713) 646-1854
Houston, TX  77002-7000           Fax:    (713) 646-1766
At 05:02 PM 2/1/2007, Gary Rosin wrote:
>>I tried this, with and without the 'temporary' commands:
>>
>>temporary.
>>select if (rt >= 1 and rt <= 11).
>>xsave /outfile = 'c:\testa.sav' /compressed.
>>temporary.
>>select if (rt >= 12 and rt <= 20).
>>xsave /outfile = 'c:\testb.sav' /compressed.
>>* * *
>>select if (rt >= 122 and rt <= 220).
>>xsave /outfile = 'c:\testh.sav' /compressed.
>>exe.
>>
>>It saved 8 files with the right names in one data pass, and the first file 'testa.sav' correctly contains the records where rt = 1 through 11. The rest of the files are empty. What am I missing?
>
>The way I read the Command Syntax Reference for 15.0, temporary transformations remain in effect until the next time the data file is read. XSAVE also is not executed until the next time the data are read. Doesn't that mean that the *first* SELECT IF is still pending when the later TEMPORARY/SELECT IF commands are entered? Because the intervals do not overlap, the later selections are the null set.

Bingo. Exactly.

Richard
In reply to this post by Jeff-125
At 10:48 AM 2/1/2007, Jeff wrote:
>At 09:27 PM 1/30/2007, Richard R. wrote:
>>- SCSI vs IDE / Disk speed / space
>>Here's the big one: your time is probably dominated by disk transfers.
>>[...]
>
>...just a few comments from someone who builds computers.

Thank you very much for those. I certainly learned a lot I hadn't known about best disk transfer speed per cost, and that's critically important when running statistical programs with large input files.

Thanks!
Richard