SPSS and Java Interface?

Joachim Wackerow-2
Does anybody know about the plans to make a Java interface to SPSS?

On the web page regarding the European 2007 user conference something is
mentioned about a "new Java interface to SPSS".

See "What’s New in SPSS 16.0?" at:
http://www.spss.com/spssdirections/prague/tech.htm

I very much appreciate the Python interface to SPSS. It allows very
flexible access to SPSS at the programming level. Now I'm wondering if a
similar Java interface is planned. I would prefer Java to Python
because our other program development is Java-oriented.

Joachim

Re: SPSS and Java Interface?

zstatman
Yes, V16 is Java based and the beta really looks nice

W
SPSS Beta Site

-----Original Message-----
From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of
Joachim Wackerow
Sent: Thursday, July 05, 2007 12:57 PM
To: [hidden email]
Subject: SPSS and Java Interface?

[snip]

Will
Statistical Services
 
============
info.statman@earthlink.net
http://home.earthlink.net/~z_statman/
============

Re: SPSS and Java Interface?

Weeks, Kyle
Just a quick add-on to Will's point.  Will is correct that SPSS 16 will have a user interface implemented in Java.  This will allow for new functionality such as resizable dialogs, and it is portable to Mac and Linux versions of SPSS in addition to Windows.  However, this is different from the Python interface.  The Python interface is a programming language interface which allows SPSS and Python to interact programmatically.  In contrast, in SPSS 16 the "Java interface" is the user interface.  At this time there are no plans for a Java programmability plug-in to be added to the existing Python and .NET plug-ins.

More information on the new feature set for SPSS 16 will be forthcoming soon.

Regards.


Kyle Weeks, Ph.D.
Director of Product Management, SPSS Product Line
Product Management
SPSS Inc.
[hidden email]
www.spss.com
SPSS Inc. helps organizations turn data into insight through predictive analytics.


-----Original Message-----
From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of Statman (WMB)
Sent: Thursday, July 05, 2007 12:13 PM
To: [hidden email]
Subject: Re: SPSS and Java Interface?

[snip]

Re: SPSS and Java Interface?

Joachim Wackerow-2
Thanks for clarifying that point.
My interest would be in Java as a programming language interface.

Joachim


Re: SPSS and Java Interface?

John McConnell
Joachim

It depends what kinds of customisations you are looking to make but you
might want to investigate SPSS Web App http://www.spss.com/webapp/ .

I'm over-simplifying but the WebApp framework gives you access to the
SPSS back end functionality through JSP and has some additional
portal-style capabilities.

john

John McConnell
applied insights

-----Original Message-----
From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of
Joachim Wackerow
Sent: 06 July 2007 10:05
To: [hidden email]
Subject: Re: SPSS and Java Interface?

[snip]

Re: SPSS and Java Interface?

Joachim Wackerow-2
John,

Thank you for the hint.
Currently I'm using the OMS XML output to transform the metadata to DDI
3.0 format (Data Documentation Initiative). A custom BEGIN PROGRAM/END
PROGRAM Python block does an XSLT transformation using Pyana (Python
access to Xalan). This module should be usable by people interested in
DDI: people with a regular SPSS installation should be able to use this
export facility, so we don't want to use additional software like SPSS
Web App. I'm using Python for this purpose now because, as Kyle Weeks
pointed out, there are currently no plans for a Java programmability plug-in.
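
A minimal sketch of this kind of metadata transformation follows. It is
not the module described above: it uses Python's standard-library
xml.etree.ElementTree instead of XSLT via Pyana so that it runs without
extra software, and the input and output element names are hypothetical
simplifications, not the real OMS or DDI 3.0 schemas. Inside SPSS, code
like this would sit between BEGIN PROGRAM and END PROGRAM.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for OMS XML output; the real OMS schema differs.
SAMPLE_METADATA_XML = """
<variables>
  <variable name="age" label="Age in years"/>
  <variable name="sex" label="Sex of respondent"/>
</variables>
"""


def to_ddi_like(metadata_xml):
    """Map each <variable> to a simplified, DDI-like <Variable> element."""
    source = ET.fromstring(metadata_xml.strip())
    target = ET.Element("Variables")
    for var in source.findall("variable"):
        out = ET.SubElement(target, "Variable")
        ET.SubElement(out, "Name").text = var.get("name")
        ET.SubElement(out, "Label").text = var.get("label", "")
    return ET.tostring(target, encoding="unicode")


ddi_like = to_ddi_like(SAMPLE_METADATA_XML)
print(ddi_like)
```

An XSLT stylesheet would express the same mapping declaratively; the
procedural version here only shows the shape of the input-to-output
step.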

Joachim

--
GESIS - German Social Science Infrastructure Services
http://www.gesis.org/en/

Re: SPSS and Java Interface?

John McConnell
Joachim

I understand ... sounds like you have developed a useful tool.


john


-----Original Message-----
From: Joachim Wackerow [mailto:[hidden email]]
Sent: 09 July 2007 07:54
To: John McConnell
Cc: [hidden email]
Subject: Re: SPSS and Java Interface?

[snip]

Re: SPSS and Java Interface?

Barnett, Adrian (DECS)
On Thu, 5 Jul 2007 14:02:52 -0500, Weeks, Kyle <[hidden email]> wrote:


>More information on the new feature set for SPSS 16 will be forthcoming soon.
>

Hi Kyle
When you post more details about v16, could you please include some info on
the following:

* Whether V16 is better able to make use of all available memory - presently
it can't seem to make use of more than about 700-900 MB. It is fairly common
now in high-end setups to have 2 GB, and I think the practical (i.e. usable)
maximum for 32-bit Windows is 3 GB.

* Degree of support for multi-core processors and motherboards with multiple
CPUs.

A particularly desirable feature would be sorting algorithms which can
spread the sort over multiple cores. I don't mean to confine the question
solely to sorting - although in my work it's the biggest time-consumer - the
more ANY of the heavy-duty data-manipulation tasks could be spread like this
the better.

* Plans for support of 64 bit versions

* Whether SPSS in a 32 bit version could make use of > 4 GB RAM when running
on 64-bit Windows.

* Whether a 64-bit version is planned in the foreseeable future

I am particularly interested in these questions because some projects are
now pushing the boundaries of what is feasible with current versions even
with the most advanced hardware. Even with advanced hardware, the software
does not exploit all the advantages the hardware has.

I appreciate that these sorts of things are non-trivial to implement, but
some idea of where things are headed and when we might expect to get there
would be much appreciated. I'm aware of plans for the sorts of project that
may not be feasible with what we have now.

Regards

Adrian Barnett

Optimization (was, re: SPSS and Java Interface?)

Richard Ristow
At 01:00 AM 7/10/2007, Adrian Barnett wrote:

>On Thu, 5 Jul 2007 14:02:52 -0500, Weeks, Kyle <[hidden email]>
>wrote:
>
>>More information on the new feature set for SPSS 16 will be
>>forthcoming soon.
>>Regards.
>>Kyle Weeks, Ph.D.
>
>Could you please include some info on the following:
>
>* If V16 is better able to make use of all available memory -
>presently [SPSS] can't seem to make use of more than about 700-900 MB.
>
>* Degree of support for multi-core processors and motherboards with
>multiple CPUs. - Particularly desirable, sorting algorithms which can
>spread the sort over multiple cores. [Although sorting is] the biggest
>time-consumer - the more ANY of the heavy-duty data-manipulation tasks
>could be spread like this the better.
>
>I am interested because some projects are now pushing the boundaries
>of what is feasible with current versions even with the most advanced
>hardware.

Butting in, of course: I assume those projects have been analyzed for
inefficiencies in design and implementation?

The question comes up for me because SPSS data-manipulation tasks are
most commonly limited by disk I/O speed.

Sorting is something of a special case, and may get CPU bound under
some conditions - I've no idea when, or how often. However - you've
probably looked at this, but to speed sorting, what would be the
relative importance of a second disk drive, so data can be read from
one disk and written to the other; of more memory, with existing
algorithms; of a dual-core algorithm? I'd expect that they'd usually be
important in that order.

For other manipulations, it's rare a transformation program is
CPU-bound; and faster, or dual, CPUs aren't likely to help with one
that isn't.

There was an exchange a while ago(*), that began with an inquiry about
optimal hardware for speeding up SPSS, in which it became clear that
the project had correctable inefficiencies that slowed it perhaps an
order of magnitude. You've probably analyzed adequately, but that
instance reminded me of the importance of doing just that.

Good luck!
Richard


(*)See the two connected threads

"Computer Buying Help", Tue, 30 Jan 2007 <09:47:07 -0500>, ff;

"Streamlining (was: Computer Buying Help)", Thu, 1 Feb 2007 <10:03:47
-0500>, ff.

Re: Optimization (was, re: SPSS and Java Interface?)

Barnett, Adrian (DECS)
Hi Richard

-----Original Message-----
From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of
Richard Ristow
Sent: Wednesday, 11 July 2007 9:12 AM
To: [hidden email]
Subject: Optimization (was, re: SPSS and Java Interface?)

>Butting in, of course: I assume those projects have been analyzed for
>inefficiencies in design and implementation?

I had in mind classes of projects rather than specific ones. In the area
where I work (government) there is increasing interest in analysing
operational data.  These projects tend to involve pretty large volumes
of data.  Record linkage is becoming much more recognized as a way to
go, linking data from multiple sources and so generating even bigger
files. When these involve transactional data recording lots of different
contacts over possibly decades, they get pretty big indeed. In Western
Australia a group has been linking health-related data from an
ever-growing list of data providers for over 20 years, so these things
can get pretty big if you were to try to deal with all of it. In my
experience, the final file for analysis is much smaller than the initial
one, but there is a stage where you are preparing the initial data where
things are pretty big for a while.

Your point is well-made though, that the design of the data structures
needs a lot of thought and planning to try to ensure that processing
is efficient.


>The question comes up for me because SPSS data-manipulation tasks are
>most commonly limited by disk I/O speed.
>
>Sorting is something of a special case, and may get CPU bound under
>some conditions - I've no idea when, or how often. However - you've
>probably looked at this, but to speed sorting, what would be the
>relative importance of a second disk drive, so data can be read from
>one disk and written to the other; of more memory, with existing
>algorithms; of a dual-core algorithm? I'd expect that they'd usually be
>important in that order.

Indeed, keeping the swap file on a separate disk from the one the data
lives on is a way of reducing some of the impact of I/O during a sort.
The process of writing and reading temporary files SPSS builds when
sorting a file that won't fit in memory takes up the bulk of the time in
a sort. Now that SPSS reports CPU time separately from elapsed time it
is much easier to quantify the effect.
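
As a rough illustration of why that split is informative, here is a
small Python sketch (not SPSS output, and the function names are made up
for the example). It times a task that mostly waits, with time.sleep
standing in for disk I/O, and one that mostly computes. A large gap
between elapsed and CPU time suggests the job is waiting rather than
calculating.

```python
import time


def timed(task):
    """Return (elapsed_seconds, cpu_seconds) for one call of task()."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    task()
    cpu = time.process_time() - cpu_start
    wall = time.perf_counter() - wall_start
    return wall, cpu


def waiting_task():
    # time.sleep stands in for waiting on disk I/O:
    # the clock keeps running while the CPU is idle.
    time.sleep(0.2)


def computing_task():
    # A pure computation: elapsed time and CPU time should be close.
    sum(i * i for i in range(200_000))


wait_wall, wait_cpu = timed(waiting_task)
calc_wall, calc_cpu = timed(computing_task)
print(f"waiting:   elapsed={wait_wall:.3f}s cpu={wait_cpu:.3f}s")
print(f"computing: elapsed={calc_wall:.3f}s cpu={calc_cpu:.3f}s")
```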

From what I've been able to read about sorting, the more memory
available, the less writing to disk and the faster the sort will run
(holding all other factors constant). The thing I and others on the
list have noticed is that current and previous versions of SPSS don't
seem to make use of memory beyond about 700-900 MB. Whilst the biggest
files I've worked on would take more than the 2GB available on one of
my systems, theory suggests that if SPSS did use all of the available
RAM, it would have improved things. If I tell SPSS to increase the
Workspace beyond those levels, it whinges that it can't get that much
memory, even though the Task Manager is showing there is another
gigabyte or so that isn't being used. It's not hard now to find
motherboards that support 8GB, and if the operating system would
support it (which 64-bit versions do), large projects would benefit a
lot.
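
The relationship between memory and disk traffic can be sketched with a
textbook external merge sort. This is only an illustration in Python,
not a description of how SPSS sorts; external_sort and max_in_memory are
names invented for the example. The smaller the memory budget, the more
sorted runs have to be written to temporary files and read back.

```python
import heapq
import pickle
import random
import tempfile


def external_sort(values, max_in_memory):
    """Sort values while buffering at most max_in_memory of them at once.

    Returns (sorted_list, number_of_runs_written_to_disk). A real tool
    would stream the merged result to a file; a list keeps the demo short.
    """
    run_files = []
    chunk = []

    def spill():
        # Sort the buffered chunk and write it out as one sorted run.
        chunk.sort()
        f = tempfile.TemporaryFile()
        for v in chunk:
            pickle.dump(v, f)
        f.seek(0)
        run_files.append(f)
        chunk.clear()

    for v in values:
        if len(chunk) == max_in_memory:
            spill()
        chunk.append(v)
    if not run_files:
        return sorted(chunk), 0  # everything fit in memory, no disk I/O
    spill()  # write the final partial chunk as the last run

    def read_run(f):
        # Stream one run back from disk, one value at a time.
        while True:
            try:
                yield pickle.load(f)
            except EOFError:
                return

    merged = list(heapq.merge(*(read_run(f) for f in run_files)))
    for f in run_files:
        f.close()
    return merged, len(run_files)


random.seed(0)
data = [random.randrange(1_000_000) for _ in range(10_000)]
for budget in (500, 2_000, 10_000):
    result, runs = external_sort(data, budget)
    print(f"memory budget {budget:>6}: {runs} runs written to disk")
```

With a budget large enough to hold all the values, no runs are written
at all, which is the case the extra RAM would make possible.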

In processing these big data files, the number of times the data has to
be sorted different ways prior to analysis seems amazing. And no
matter how careful one is, and how often testing is done on small
subsets, it always seems that the main data gets put through the
whole series of programs in the cycle lots more times than was ever
intended.

Computer scientists seem to have been working on sorting algorithms that
take advantage of more than one CPU for at least 20 years now, and these
seem to provide improvements that scale well with additional CPUs. Given
that dual core is quite common now amongst new computers in the general
market, and quad-core is not hard to obtain, the hardware is well and
truly available to take advantage of. So sorting would definitely benefit
from the available algorithms that work well with 2+ CPUs.

>For other manipulations, it's rare a transformation program is
>CPU-bound; and faster, or dual, CPUs aren't likely to help with one
>that isn't.

Yes, I/O seems to be the major bottleneck, but there's nothing like more
RAM for curing that - if only the system will make use of it!

An awful lot of work of a statistical nature isn't involved with
processing and restructuring large volumes of data. The stuff I'm aware
of at university research labs is rarely data-intensive and wouldn't
much be affected by how much RAM is available or how many processors
there were. So I guess most of the people on this list could not care
less about efficiency of memory use or sophistication of sorting
algorithms.

The stuff I have been banging on about concerns a very different type of
work in a different type of organization. In the context in which I
work, the projects being talked about are starting to reach a point
where the capabilities of the hardware and software are a much more
important consideration in the feasibility of the project than they have
been. Previously, if things were taking too long, it was because you had
an old computer, and buying a new one would fix it. If the software
doesn't change soon, new hardware is no longer going to cure it.

Anyway, thanks for your thoughtful observations

Regards

Adrian Barnett

Re: Optimization (was, re: SPSS and Java Interface?)

Peck, Jon
Apropos of this discussion, it is worth noting that careful benchmarking can sometimes (often) reveal that time spent is not necessarily where one expects, and experimenting with alternative syntax can be productive.

It is tricky to benchmark on modern operating systems, because they play a lot of tricks in attempting to boost performance, including caching of DLLs/modules, using extra memory for I/O buffering when it is not required elsewhere, and dynamic adjustment of process priorities, just to mention a few large-grain issues.  And your friendly neighborhood virus checker will contribute its bit of drag to the process, too.

On Windows, if you have SPSS programmability installed, you can make use of the benchmark.py Python module to run repetitions of command sequences and to run alternative versions interleaved in order to minimize OS effects.  You can choose from among a wide variety of performance measures including a variety of time, memory, and I/O measures - basically any measure you could see in the Windows Task Manager.  This can help both in finding the most efficient way to structure your SPSS jobs and in predicting what hardware changes are likely to be most beneficial.
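
The idea of interleaving repetitions can be sketched in plain Python. This is not the benchmark.py API; interleaved_benchmark and the two sample tasks are hypothetical names used only to show the technique, and only elapsed time is measured.

```python
import statistics
import time


def interleaved_benchmark(alternatives, repetitions=5):
    """Time each alternative `repetitions` times, alternating between them.

    alternatives: dict mapping a label to a zero-argument callable.
    Returns a dict mapping each label to its median elapsed seconds.
    """
    timings = {label: [] for label in alternatives}
    for _ in range(repetitions):
        # Each pass runs every alternative once, so slow drifts in system
        # load (caching, background tasks) affect all of them alike.
        for label, task in alternatives.items():
            start = time.perf_counter()
            task()
            timings[label].append(time.perf_counter() - start)
    return {label: statistics.median(t) for label, t in timings.items()}


# Two hypothetical ways of doing the same job, standing in for two
# versions of an SPSS command sequence.
def build_by_concatenation():
    text = ""
    for i in range(20_000):
        text += str(i)
    return text


def build_by_join():
    return "".join(str(i) for i in range(20_000))


medians = interleaved_benchmark(
    {"concatenation": build_by_concatenation, "join": build_by_join}
)
for label, seconds in medians.items():
    print(f"{label}: median {seconds * 1000:.2f} ms")
```

Using the median rather than the mean keeps one unlucky repetition, such as one interrupted by a virus scan, from dominating the comparison.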

The output of the benchmark module is a text file of measures set up for loading into SPSS for analysis.

You can download this module along with other programmability materials from SPSS Developer Central, www.spss.com/devcentral.  The benchmark module requires some third-party (free) downloads that are detailed in the documentation.

Note also that driving SPSS externally using programmability, which removes the Data Editor and the rest of the SPSS user interface, can speed up jobs.  Sometimes this makes a large difference.  Using SPSSB, which is part of SPSS Server, also allows dispensing with the user interface.  In these modes, there is no SPSS Viewer, so no spo files can be created (plain text, html, and other formats are available), but for data preparation, the output isn't usually all that interesting anyway.

Regards,
Jon Peck

-----Original Message-----
From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of Barnett, Adrian (DECS)
Sent: Wednesday, July 11, 2007 7:47 PM
To: [hidden email]
Subject: Re: [SPSSX-L] Optimization (was, re: SPSS and Java Interface?)
[snip]
