|
Does anybody know about the plans to make a Java interface to SPSS?
The web page for the European 2007 user conference mentions a "new Java interface to SPSS". See "What’s New in SPSS 16.0?" at: http://www.spss.com/spssdirections/prague/tech.htm I very much appreciate the Python interface to SPSS. It allows very flexible access to SPSS at the programming level. Now I'm wondering whether a similar Java interface is planned. I would prefer Java to Python because our other program development is Java-oriented. Joachim |
|
Yes, V16 is Java based and the beta really looks nice
W SPSS Beta Site
Will
Statistical Services ============ info.statman@earthlink.net http://home.earthlink.net/~z_statman/ ============ |
|
Just a quick add-on to Will's point. Will is correct that SPSS 16 will have a user interface implemented in Java. This will allow for new functionality such as resizable dialogs, and it is portable to Mac and Linux versions of SPSS in addition to Windows. However, this is different from the Python interface. The Python interface is a programming language interface which allows SPSS and Python to interact programmatically. In contrast, in SPSS 16 the "Java interface" is the user interface. At this time there are no plans for a Java programmability plug-in to be added to the existing Python and .NET plug-ins.
More information on the new feature set for SPSS 16 will be forthcoming soon. Regards. Kyle Weeks, Ph.D. Director of Product Management, SPSS Product Line Product Management SPSS Inc. [hidden email] www.spss.com SPSS Inc. helps organizations turn data into insight through predictive analytics. |
|
Thanks for clarifying that point.
My interest would be in Java as a programming language interface. Joachim |
|
Joachim
It depends what kinds of customisations you are looking to make but you might want to investigate SPSS Web App http://www.spss.com/webapp/. I'm over-simplifying but the WebApp framework gives you access to the SPSS back end functionality through JSP and has some additional portal-style capabilities. john John McConnell applied insights |
|
John,
Thank you for the hint. Currently I'm using the OMS XML output to transform the metadata to DDI 3.0 format (Data Documentation Initiative). A custom BEGIN PROGRAM/END PROGRAM Python block does an XSLT transformation using Pyana (Python access to Xalan). This module should be usable by anyone interested in DDI: people with a regular SPSS installation should be able to use this export facility, so we don't want to depend on additional software like SPSS Web App. I'm using Python for this purpose because, as Kyle Weeks pointed out, there are currently no plans for a Java programmability plug-in. Joachim -- GESIS - German Social Science Infrastructure Services http://www.gesis.org/en/ |
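For readers who want a feel for the pipeline Joachim describes (XML produced by OMS, transformed into DDI-style XML from Python), here is a minimal sketch. It is not his module and does not use Pyana or XSLT: the Python standard library has no XSLT processor, so the mapping is done by hand with xml.etree.ElementTree as a stand-in. The input sample and the output element names (VariableScheme, Variable, Name, Label) are invented for illustration and are far simpler than real OMS output or the DDI 3.0 schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified stand-in for OMS XML output.
# Real OMS output follows the SPSS output schema and is much richer.
SAMPLE_OMS = """
<output>
  <variable name="age" label="Age of respondent"/>
  <variable name="sex" label="Sex of respondent"/>
</output>
"""


def oms_to_ddi_like(oms_xml):
    """Map each <variable> element to a simplified, DDI-style <Variable>."""
    source = ET.fromstring(oms_xml)
    # Assumed element names; these are NOT taken from the DDI 3.0 schema.
    target = ET.Element("VariableScheme")
    for var in source.iter("variable"):
        out = ET.SubElement(target, "Variable")
        ET.SubElement(out, "Name").text = var.get("name")
        ET.SubElement(out, "Label").text = var.get("label", "")
    return ET.tostring(target, encoding="unicode")


if __name__ == "__main__":
    # Inside SPSS, code like this would sit in a BEGIN PROGRAM/END PROGRAM block.
    print(oms_to_ddi_like(SAMPLE_OMS))
```

An XSLT stylesheet, as used in the actual module, keeps the mapping rules in a separate file that can be maintained without touching the Python code; the hand-written loop above only shows the shape of the transformation.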
|
Joachim
I understand ... sounds like you have developed a useful tool. john |
|
On Thu, 5 Jul 2007 14:02:52 -0500, Weeks, Kyle <[hidden email]> wrote:
>More information on the new feature set for SPSS 16 will be forthcoming soon.

Hi Kyle

When you post more details about V16, could you please include some information on the following:

* Whether V16 is better able to make use of all available memory. At present SPSS can't seem to use more than about 700-900 MB. It is now fairly common for high-end setups to have 2 GB, and I think the practical (i.e. usable) maximum for 32-bit Windows is 3 GB.
* The degree of support for multi-core processors and motherboards with multiple CPUs. A particularly desirable feature would be sorting algorithms that can spread the sort over multiple cores. I don't mean to confine the question solely to sorting, although in my work it's the biggest time-consumer: the more that ANY of the heavy-duty data-manipulation tasks could be spread like this, the better.
* Plans for support of 64-bit versions.
* Whether a 32-bit version of SPSS could make use of more than 4 GB of RAM when running on 64-bit Windows.
* Whether a 64-bit version is planned in the foreseeable future.

I am particularly interested in these questions because some projects are now pushing the boundaries of what is feasible with current versions, even with the most advanced hardware, and the software does not exploit all the advantages that hardware has.

I appreciate that these sorts of things are non-trivial to implement, but some idea of where things are headed and when we might expect to get there would be much appreciated. I'm aware of plans for the sorts of project that may not be feasible with what we have now.

Regards
Adrian Barnett |
|
At 01:00 AM 7/10/2007, Adrian Barnett wrote:
>On Thu, 5 Jul 2007 14:02:52 -0500, Weeks, Kyle <[hidden email]> wrote:
>
>>More information on the new feature set for SPSS 16 will be forthcoming soon.
>
>Could you please include some info on the following:
>
>* If V16 is better able to make use of all available memory - presently [SPSS] can't seem to make use of more than about 700-900 MB.
>
>* Degree of support for multi-core processors and motherboards with multiple CPUs. - Particularly desirable, sorting algorithms which can spread the sort over multiple cores. [Although sorting is] the biggest time-consumer - the more ANY of the heavy-duty data-manipulation tasks could be spread like this the better.
>
>I am interested because some projects are now pushing the boundaries of what is feasible with current versions even with the most advanced hardware.

Butting in, of course: I assume those projects have been analyzed for inefficiencies in design and implementation?

The question comes up for me because SPSS data-manipulation tasks are most commonly limited by disk I/O speed.

Sorting is something of a special case, and may get CPU bound under some conditions - I've no idea when, or how often. However - you've probably looked at this, but to speed sorting, what would be the relative importance of a second disk drive, so data can be read from one disk and written to the other; of more memory, with existing algorithms; of a dual-core algorithm? I'd expect that they'd usually be important in that order.

For other manipulations, it's rare a transformation program is CPU-bound; and faster, or dual, CPUs aren't likely to help with one that isn't.

There was an exchange a while ago(*), that began with an inquiry about optimal hardware for speeding up SPSS, in which it became clear that the project had correctable inefficiencies that slowed it perhaps an order of magnitude. You've probably analyzed adequately, but that instance reminded me of the importance of doing just that.

Good luck!
Richard

(*) See the two connected threads "Computer Buying Help", Tue, 30 Jan 2007 <09:47:07 -0500>, ff; "Streamlining (was: Computer Buying Help)", Thu, 1 Feb 2007 <10:03:47 -0500>, ff. |
|
Hi Richard
-----Original Message-----
From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of Richard Ristow
Sent: Wednesday, 11 July 2007 9:12 AM
To: [hidden email]
Subject: Optimization (was, re: SPSS and Java Interface?)

>Butting in, of course: I assume those projects have been analyzed for inefficiencies in design and implementation?

I had in mind classes of projects rather than specific ones. In the area I work in (government) there is increasing interest in analysing operational data. These projects tend to involve pretty large volumes of data. Record linkage is becoming much more recognized as a way to go, linking data from multiple sources and so generating even bigger files. When these involve transactional data recording lots of different contacts over possibly decades, they get pretty big indeed. In Western Australia a group has been linking health-related data from an ever-growing list of data providers for over 20 years, so these things can get pretty big if you were to try to deal with all of it.

In my experience, the final file for analysis is much smaller than the initial one, but there is a stage where you are preparing the initial data where things are pretty big for a while. Your point is well-made though, that the design of the data structures needs a lot of thought and planning to try to ensure that processing is efficient.

>The question comes up for me because SPSS data-manipulation tasks are most commonly limited by disk I/O speed.
>
>Sorting is something of a special case, and may get CPU bound under some conditions - I've no idea when, or how often. However - you've probably looked at this, but to speed sorting, what would be the relative importance of a second disk drive, so data can be read from one disk and written to the other; of more memory, with existing algorithms; of a dual-core algorithm? I'd expect that they'd usually be important in that order.

Indeed, keeping the swap file on a separate disk from the one the data lives on is a way of reducing some of the impact of I/O during a sort. Writing and reading the temporary files that SPSS builds when sorting a file that won't fit in memory takes up the bulk of the time in a sort. Now that SPSS reports CPU time separately from elapsed time it is much easier to quantify the effect.

From what I've been able to read about sorting, the more memory available, the less writing to disk and the faster the sort will run (holding all other factors constant). The thing I and others on the list have noticed is that current and previous versions of SPSS don't seem to make use of memory beyond about 700-900 MB. Whilst the biggest files I've worked on would take more than the 2 GB available on one of my systems, theory suggests that if SPSS did use all of the available RAM, it would have improved things.

If I tell SPSS to increase the Workspace beyond those levels, it whinges that it can't get that much memory, even though the Task Manager is showing there is another gigabyte or so that isn't being used. It's not hard now to find motherboards that support 8 GB, and if the operating system would support it (which 64-bit versions do), large projects would benefit a lot.

In processing these big data files, the number of times the data has to be sorted different ways prior to analysis seems amazing. And no matter how careful one is, and how often testing is done on small subsets, it always seems that the main data gets put through the whole series of programs in the cycle lots more times than was ever intended.

Computer scientists seem to have been working on sorting algorithms that take advantage of more than one CPU for at least 20 years now, and these seem to provide improvements that scale well with additional CPUs. Given dual core is quite common now amongst new computers in the general market, and quad-core is not hard to obtain, the hardware is well and truly available to take advantage. So sorting would definitely benefit from the available algorithms that work well with 2+ CPUs.

>For other manipulations, it's rare a transformation program is CPU-bound; and faster, or dual, CPUs aren't likely to help with one that isn't.

Yes, I/O seems to be the major bottleneck, but there's nothing like more RAM for curing that - if only the system will make use of it!

An awful lot of work of a statistical nature isn't involved with processing and restructuring large volumes of data. The stuff I'm aware of at university research labs is rarely data-intensive and wouldn't much be affected by how much RAM is available or how many processors there were. So I guess most of the people on this list could not care less about efficiency of memory use or sophistication of sorting algorithms.

The stuff I have been banging on about concerns a very different type of work in a different type of organization. In the context in which I work, the projects being talked about are starting to reach a point where the capabilities of the hardware and software are a much more important consideration in the feasibility of the project than they have been. Previously, if things were taking too long, it was because you had an old computer, and buying a new one would fix it. If the software doesn't change soon, new hardware is no longer going to cure it.

Anyway, thanks for your thoughtful observations

Regards
Adrian Barnett |
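Adrian's point that more memory means less writing to disk can be illustrated with a textbook external merge sort. The sketch below is only that, a generic illustration and not the algorithm SPSS uses (which the thread does not describe). The function name external_sort and the memory_limit parameter (the number of values held in memory at once) are made up for the example. Each time the in-memory chunk fills up, it is sorted and written to a temporary "run" file; the runs are then merged with heapq.merge.

```python
import heapq
import os
import tempfile


def read_run(handle):
    """Yield the integers stored one per line in an open run file."""
    for line in handle:
        yield int(line)


def external_sort(values, memory_limit):
    """Sort integers while holding at most `memory_limit` of them in a chunk.

    Returns (sorted_list, number_of_runs_written_to_disk). The result is
    collected into a list only to keep the sketch short; a real external
    sort would stream the merged output to a file instead.
    """
    run_paths = []
    chunk = []

    def flush():
        # Sort the current chunk in memory and spill it to a temporary file.
        chunk.sort()
        fd, path = tempfile.mkstemp(suffix=".run", text=True)
        with os.fdopen(fd, "w") as handle:
            handle.writelines(f"{value}\n" for value in chunk)
        run_paths.append(path)
        chunk.clear()

    for value in values:
        chunk.append(value)
        if len(chunk) >= memory_limit:
            flush()
    if chunk:
        flush()

    handles = [open(path) for path in run_paths]
    try:
        # heapq.merge reads the sorted runs lazily and merges them in order.
        result = list(heapq.merge(*(read_run(h) for h in handles)))
    finally:
        for handle in handles:
            handle.close()
        for path in run_paths:
            os.remove(path)
    return result, len(run_paths)
```

With 10,000 values, a limit of 1,000 produces ten run files while a limit of 5,000 produces two, so the amount of temporary-file traffic falls as the memory available to the sort grows. That is the effect being asked for when the thread talks about SPSS using more than 700-900 MB.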
|
A propos of this discussion it is worth noting that careful benchmarking can sometimes (often) reveal that time spent is not necessarily where one expects, and experimenting with alternative syntax can be productive.
It is tricky to benchmark on modern operating systems, because they play a lot of tricks in attempting to boost performance, including caching of DLLs/modules, using extra memory for I/O buffering when it is not required elsewhere, and dynamic adjustment of process priorities, just to mention a few large-grain issues. And your friendly neighborhood virus checker will contribute its bit of drag to the process, too.

On Windows, if you have SPSS programmability installed, you can make use of the benchmark.py Python module to run repetitions of command sequences and to run alternative versions interleaved in order to minimize OS effects. You can choose from among a wide variety of performance measures, including a variety of time, memory, and I/O measures - basically any measure you could see in the Windows Task Manager. This can help both in finding the most efficient way to structure your SPSS jobs and in predicting what hardware changes are likely to be most beneficial. The output of the benchmark module is a text file of measures set up for loading into SPSS for analysis.

You can download this module along with other programmability materials from SPSS Developer Central, www.spss.com/devcentral. The benchmark module requires some third-party (free) downloads that are detailed in the documentation.

Note also that driving SPSS externally using programmability, which removes the Data Editor and the rest of the SPSS user interface, can speed up jobs. Sometimes this makes a large difference. Using SPSSB, which is part of SPSS Server, also allows dispensing with the user interface. In these modes, there is no SPSS Viewer, so no spo files can be created (plain text, html, and other formats are available), but for data preparation, the output isn't usually all that interesting anyway.
Regards, Jon Peck |
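The idea of running alternative versions interleaved, as Jon describes, can be sketched in a few lines of plain Python. This is not the benchmark.py module and does not reproduce its interface, which is not shown in the thread; benchmark_interleaved, sort_once and sort_twice are made-up names, and the two candidates are ordinary Python functions standing in for whatever would run the competing SPSS command sequences. The sketch records both elapsed (wall-clock) time and CPU time of the Python process, echoing the CPU-versus-elapsed distinction raised earlier in the thread.

```python
import statistics
import time


def benchmark_interleaved(candidates, repetitions=5):
    """Time each callable in `candidates`, alternating between them.

    `candidates` maps a label to a zero-argument callable. Returns a dict
    mapping each label to (median_wall_seconds, median_cpu_seconds).
    """
    wall = {label: [] for label in candidates}
    cpu = {label: [] for label in candidates}
    for _ in range(repetitions):
        # Run A, B, A, B, ... so that caching and background load
        # affect every candidate roughly equally.
        for label, func in candidates.items():
            wall_start = time.perf_counter()
            cpu_start = time.process_time()
            func()
            cpu[label].append(time.process_time() - cpu_start)
            wall[label].append(time.perf_counter() - wall_start)
    return {
        label: (statistics.median(wall[label]), statistics.median(cpu[label]))
        for label in candidates
    }


def sort_once():
    """Stand-in workload: one sort of reversed data."""
    sorted(range(200_000, 0, -1))


def sort_twice():
    """Stand-in workload: the same data sorted two different ways."""
    data = list(range(200_000, 0, -1))
    data.sort()
    data.sort(reverse=True)


if __name__ == "__main__":
    results = benchmark_interleaved(
        {"sort_once": sort_once, "sort_twice": sort_twice}
    )
    for label, (wall_s, cpu_s) in results.items():
        print(f"{label}: wall={wall_s:.4f}s cpu={cpu_s:.4f}s")
```

Medians are used rather than means so that a single run disturbed by a virus scan or a cache miss does not dominate the result. When elapsed time is much larger than CPU time, the job is mostly waiting, typically on disk I/O, which is a hint that faster or additional CPUs would help little.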
