Is it possible to run several database queries at the same time from one single SPSS syntax file?
Dr. Frank Gaeth
Frank,
If you thought about it for one minute you would realize your question is meaningless! Think about it! How would this possibly work? Why? ---
Please reply to the list and not to my personal email.
Those desiring my consulting or training services please feel free to email me. --- "Nolite dare sanctum canibus neque mittatis margaritas vestras ante porcos ne forte conculcent eas pedibus suis." Cum es damnatorum possederunt porcos iens ut salire off sanguinum cliff in abyssum?"
It isn't a meaningless question. One might hope that multiple GET DATA queries building different datasets could be executed in parallel, but Statistics does not work that way: each command, e.g. each GET DATA command, is executed sequentially. Parallelism is currently limited to operations within certain procedures (the list is under SET in the Command Syntax Reference).
However, you could run multiple Statistics processes at the same time, each pulling from a database and creating a .sav file, and then process those files in a subsequent job.

Jon Peck (no "h") aka Kim
Senior Software Engineer, IBM
phone: 720-342-5621
In reply to this post by David Marso
David, don't think about meaning, think about solutions.
Our data is distributed over 7 databases, and I can't change that. Running the queries in parallel would mean 1/7 of the time; currently some queries take days. I thought it might be possible in connection with Python, which has multiprocessing, so it should be possible somehow. Starting several instances, though, might cause a problem of coordinating them at the end.
Dr. Frank Gaeth
To make this work across several instances, you would need a small framework that starts each Statistics process but does not wait. The START command in Windows 7, for example, allows you to do this. But you then need to wait for each started process to complete; after that you can run your analysis job. I hope that you are using Server for all this; then it would be statisticsb that you would use here.
Whether this is a good idea or not depends on whether it makes sense to write a lot of sav or csv files to then be pulled into Statistics, and on whether database access is actually the bottleneck; that is, would jobs like this run faster if the data had already been pulled into these external files? What some customers do, especially when the DB admin does not want apps hitting the database without controls, is to schedule a csv dump of preset queries to run overnight, so that Statistics can start by gobbling up these csv files in the morning, or whenever the data are available. Collaboration and Deployment Services (C&DS) can help to automate and manage such jobs. If you really have that many databases and queries that take days, you need more than just Statistics to manage this.

Jon Peck, Senior Software Engineer, IBM
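In Python, such a start-without-waiting framework is only a few lines with the standard subprocess module. A minimal sketch; the statisticsb command line shown in the comment is an assumption, so adjust the executable name and arguments to your own installation:

```python
import subprocess

def run_all(commands):
    """Start every command at once without waiting, then block until
    all of them have finished; return their exit codes in order."""
    procs = [subprocess.Popen(cmd) for cmd in commands]  # fire them all off
    return [p.wait() for p in procs]                     # then wait for each

# Hypothetical usage, one batch job per database (names are placeholders):
#   run_all([["statisticsb", "-f", "pull_db%d.sps" % i] for i in range(1, 8)])
# Once run_all returns, every extract exists and the analysis job can start.
```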
In reply to this post by drfg2008
Firing up six production facility jobs in subprocesses is easy, but indeed you'd have to periodically check whether all the jobs have finished. This may be useful: http://stackoverflow.com/questions/6102535/python-how-to-determine-subprocess-children-have-all-finished-running

But it's indeed cooler to use the multiprocessing module, in conjunction with pyodbc or similar. Then, assuming you somehow merge all these datasets, you could use SPSS at the end.
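A sketch of that multiprocessing approach, with the standard library's sqlite3 standing in for pyodbc (for real ODBC sources you would swap in pyodbc.connect); the database paths and queries are placeholders:

```python
import multiprocessing
import sqlite3

def run_query(job):
    """Worker: open one database, run one query, return all of its rows."""
    db_path, sql = job
    with sqlite3.connect(db_path) as conn:   # pyodbc.connect(...) for ODBC
        return conn.execute(sql).fetchall()

def query_in_parallel(jobs):
    """Run one worker per (database, query) pair and merge every result
    set into a single list of rows."""
    with multiprocessing.Pool(processes=len(jobs)) as pool:
        per_db = pool.map(run_query, jobs)   # blocks until all workers finish
    return [row for rows in per_db for row in rows]

# Note: on Windows, the call to query_in_parallel must sit under
# `if __name__ == "__main__":`, as the multiprocessing docs require.
```

The merged list could then be written out as a single file for SPSS to read at the end.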
In reply to this post by drfg2008
At 03:53 PM 9/13/2013, drfg2008 wrote:
>Our data is distributed over 7 databases. I can't change it. To run
>queries in parallel would mean 1/7 of time. Currently some queries take days.

Good. Now we've got to the real question, to which running in parallel *may* be an answer. Whether parallel running can work or not, it's important to look hard at what other sources of inefficiency there may be. You haven't said anything about the jobs, so I can only guess wildly. But:

- How big are the 7 databases?
- Is your data read from some source outside SPSS? It's possible that the limiting factor for your jobs is transferring the data from a disk or over a network connection. If so, and if all seven jobs use the same data path, running them in parallel might only result in contention for that path. You'd be unlikely to increase throughput much, and might well decrease it, i.e. lengthen total running time.
- If your data is read from outside SPSS and is going to be used more than once in a job, store it within SPSS: use the CACHE command.
- SPSS jobs are usually I/O bound, and optimizing them consists principally of minimizing how much data is read, and how many times. For example (I'm sure you've thought of this already), if your job includes transformation code, remove all EXECUTE statements in the code, except the very few that are needed for certain special purposes.

I may be writing things that you've addressed long ago. But if I were looking at SPSS jobs with very long running times, I'd start by exploring those things I've noted.

Best of luck,
Richard
In reply to this post by Albert-Jan Roskam
Good idea, pyodbc. We'll check that; it might work. Thanks, everyone!
Dr. Frank Gaeth
In reply to this post by drfg2008
"Currently some queries take days.."
Maybe your DB design is sub optimal. Maybe take a look at what is going on in the queries. Is all the data that is being extracted fresh or are you reading data which has already been processed last week and the past 6 days as well unnecessarily? This several instance idea works only if you don't need to work with the files concurrently. Otherwise you need to SAVE each data set to disk before mangling them further. Enough rope?
Or maybe update the DB statistics (full scan). That could help a LOT.
https://www.simple-talk.com/sql/performance/sql-server-statistics-questions-we-were-too-shy-to-ask/

And what I don't like about VIEWs is that one can easily create very complex queries that look deceptively simple. It may be better to use a temporary table.

And since you're asking about parallel processing: do you know Hadoop? That's a big step, but wow, it's powerful!

Regards,
Albert-Jan
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
All right, but apart from the sanitation, the medicine, education, wine, public order, irrigation, roads, a fresh water system, and public health, what have the Romans ever done for us?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Yes, IT suggested Hadoop already. But what about IBM SPSS and Hadoop integration then? I couldn't find anything so far.
Dr. Frank Gaeth
SPSS Modeler 15.0.2 (the current version) with IBM SPSS Analytic Server supports Hadoop.

Jon Peck, Senior Software Engineer, IBM
Well, we have IBM SPSS Statistics 20 licenses right now.
Dr. Frank Gaeth
I am planning to check this out on a single-node Hadoop cluster: https://pypi.python.org/pypi/mrjob/0.4-dev

Not sure how easy it is to install Hadoop under Windows. It is not entirely trivial on Linux, but some good guides are available.

Regards,
Albert-Jan
In reply to this post by drfg2008
Good afternoon,
I would definitely agree with David Marso: the DBs and queries should be optimized, and that will greatly reduce the processing time. But we cannot do anything about that; what we have is 7 DBs and 7 queries, and we have to process the results of those queries. Since Python is integrated in SPSS, we can use its powerful functionality, e.g. apply multiprocessing: our jobs list would be a list of 7 tuples (dbName, sqlQuery); we use 7 worker processes to access these DBs in parallel; the output is a list of all resulting records; and, if we wish, we can also make the separate query results available as a dictionary:

    dict1 = {'dbName1': resultingRecordsList1, ..., 'dbName7': resultingRecordsList7}

To be able to do all this we have to add 'import sqlite3, multiprocessing, time' to the imports, then:

    def dbManager():
        q = multiprocessing.Queue(1024*1024)
        maxworkers = 7
        jobs = [(dbName1, sqlQuery1), ..., (dbName7, sqlQuery7)]
        workers = []            # live worker processes
        allRecordsList = []
        allRecordsDict = {}
        while jobs or workers:
            # drop workers that have finished
            workers = [w for w in workers if w.is_alive()]
            if len(workers) < maxworkers and len(jobs) != 0:
                j = jobs.pop()
                wd = multiprocessing.Process(target=dbWorker, args=(j, q))
                wd.start()
                workers.append(wd)
            time.sleep(0.1)
            # drain whatever the workers have queued so far
            while not q.empty():
                dbName, records = q.get()
                if records:
                    allRecordsList.extend(records)
                    allRecordsDict[dbName] = records
        return (allRecordsList, allRecordsDict)

    def dbWorker(args, q):
        dbName, sql = args
        conn = sqlite3.connect(dbName)
        c = conn.cursor()
        c.execute(sql)
        records = c.fetchall()
        q.put((dbName, records))
The bottleneck here, most likely, is in the query processing and record transmission to the client system. While Python multiprocessing could be used to access the databases in parallel, I doubt that this would be any faster than running multiple Statistics processes in external mode, and you still have the task of merging the resulting datasets. Before adopting any of these strategies, it would be important to understand where the system is spending the time. There could be I/O bottlenecks, database execution overhead, network delays, or other problems upstream of actually getting the data into Statistics.
Jon Peck, Senior Software Engineer, IBM
In reply to this post by drfg2008
To Jon K Peck:
"it would be important to understand where the system is spending the time" - cannot disagree on this, but, unfortunately, DB creation and administration is very often out of our hands. Mostly we are stuck with what we have and have to adapt. I agree that in majority of cases long DB query time would be due to "i/o bottlenecks, database execution overhead, network delays", etc. But since we have to solve our problem anyway, without touching DBs and queries we can try python multiprocessing here which would also solve "the task of merging the resulting datasets". The python manager function def dbManager() will do it on the fly, picking queried records from the queue and appending them to the all records list: while not q.empty(): dbName,records=q.get() if records: allRecordsList.extend(records) Then we have multiple choices: 1. to dump data into CSV file; 2. to pickle our data; 3. to create SPSS dataset from this data; Regards, |
In reply to this post by Jon K Peck
@ Jon:
R has such a 'multiprocessing' or 'parallel' feature: the mclapply function. With do.call(rbind, ...) the resulting SQL query outputs would then be aggregated into one final table. It would be nice to have that as an R plugin for IBM SPSS; it would make things much, much easier.
Dr. Frank Gaeth
I guess you are referring to the R parallel package. There is nothing stopping you from using that package via the R plugin. Could you elaborate on what you want to do with mclapply et al.?
Jon Peck, Senior Software Engineer, IBM
No idea right now how to integrate it; I just learned of an R parallel package, and that via do.call(rbind, ...) the different queries can be merged into one table.
It would be nice to have an example of how to integrate this in IBM SPSS and run it from IBM SPSS syntax.
Dr. Frank Gaeth