Is it possible to run several database queries at the same time from one single SPSS syntax file?
Dr. Frank Gaeth
Frank,
If you thought about it for one minute you would realize your question is meaningless! Think about it! How would this possibly work? Why? ---
Please reply to the list and not to my personal email.
Those desiring my consulting or training services please feel free to email me. --- "Nolite dare sanctum canibus neque mittatis margaritas vestras ante porcos ne forte conculcent eas pedibus suis." Cum es damnatorum possederunt porcos iens ut salire off sanguinum cliff in abyssum?"
It isn't a meaningless question. One might hope that multiple GET DATA queries building different datasets could be executed in parallel, but Statistics does not work that way: each command, e.g. each GET DATA command, is executed sequentially. Parallelism is currently limited to operations within certain procedures (the list is under SET in the Command Syntax Reference).
However, you could run multiple Statistics processes at the same time, each pulling from a database and creating a .sav file, and then process those files in a subsequent job.

Jon Peck (no "h") aka Kim
Senior Software Engineer, IBM
phone: 720-342-5621
In reply to this post by David Marso
David, don't think about meaning, think about solutions.
Our data is distributed over 7 databases, and I can't change that. Running the queries in parallel would mean 1/7 of the time; currently some queries take days. I thought it might be possible in connection with Python, which has multiprocessing, so it should be possible somehow. Starting several instances, though, might cause a problem of coordinating them at the end.
Dr. Frank Gaeth
To make this work across several instances, you would need a small framework that starts each Statistics process but does not wait. The START command in Windows 7, for example, allows you to do this. But you then need to wait for each started process to complete; after that you can run your analysis job. I hope that you are using Server for all this; then it would be statisticsb that you would use here.
Whether this is a good idea or not depends on whether it makes sense to write a lot of sav or csv files to then be pulled into Statistics, and on whether database access is actually the bottleneck; that is, would jobs like this run faster if the data had already been pulled into these external files? What some customers do, especially when the DB admin does not want apps hitting the database without controls, is to schedule a csv dump of preset queries to run overnight, so that Statistics can start by gobbling up these csv files in the morning, or whenever the data are available. Collaboration and Deployment Services (C&DS) can help to automate and manage such jobs. If you really have that many databases and queries that take days, you need more than just Statistics to manage this.

Jon Peck, Senior Software Engineer, IBM
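In Python, such a start-without-waiting framework is only a few lines with the standard subprocess module. A minimal sketch; the statisticsb command line shown in the comment is an assumption, so adjust the executable name and arguments to your own installation:

```python
import subprocess

def run_all(commands):
    """Start every command at once without waiting, then block until
    all of them have finished; return their exit codes in order."""
    procs = [subprocess.Popen(cmd) for cmd in commands]  # fire them all off
    return [p.wait() for p in procs]                     # then wait for each

# Hypothetical usage, one batch job per database (names are placeholders):
#   run_all([["statisticsb", "-f", "pull_db%d.sps" % i] for i in range(1, 8)])
# Once run_all returns, every extract exists and the analysis job can start.
```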
In reply to this post by drfg2008
Firing up six production facility jobs in subprocesses is easy, but indeed you'd have to periodically check whether all the jobs have finished. This may be useful: http://stackoverflow.com/questions/6102535/python-how-to-determine-subprocess-children-have-all-finished-running

But it's indeed cooler to use the multiprocessing module, in conjunction with pyodbc or similar. Then, assuming you somehow merge all these datasets, you could use SPSS at the end.
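A sketch of that multiprocessing approach, with the standard library's sqlite3 standing in for pyodbc (for real ODBC sources you would swap in pyodbc.connect); the database paths and queries are placeholders:

```python
import multiprocessing
import sqlite3

def run_query(job):
    """Worker: open one database, run one query, return all of its rows."""
    db_path, sql = job
    with sqlite3.connect(db_path) as conn:   # pyodbc.connect(...) for ODBC
        return conn.execute(sql).fetchall()

def query_in_parallel(jobs):
    """Run one worker per (database, query) pair and merge every result
    set into a single list of rows."""
    with multiprocessing.Pool(processes=len(jobs)) as pool:
        per_db = pool.map(run_query, jobs)   # blocks until all workers finish
    return [row for rows in per_db for row in rows]

# Note: on Windows, the call to query_in_parallel must sit under
# `if __name__ == "__main__":`, as the multiprocessing docs require.
```

The merged list could then be written out as a single file for SPSS to read at the end.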
In reply to this post by drfg2008
At 03:53 PM 9/13/2013, drfg2008 wrote:
>Our data is distributed over 7 databases. I can't change it. To run
>queries in parallel would mean 1/7 of time. Currently some queries take days.

Good. Now we've got to the real question, to which running in parallel *may* be an answer. Whether parallel running can work or not, it's important to look hard at what other sources of inefficiency there may be. You haven't said anything about the jobs, so I can only guess wildly. But:

- How big are the 7 databases?
- Is your data read from some source outside SPSS? It's possible that the limiting factor for your jobs is transferring the data from a disk or over a network connection. If so, and if all seven jobs use the same data path, running them in parallel might only result in contention for that path. You'd be unlikely to increase throughput much, and might well decrease it, i.e. lengthen total running time.
- If your data is read from outside SPSS and is going to be used more than once in a job, store it within SPSS: use the CACHE command.
- SPSS jobs are usually I/O bound, and optimizing them consists principally of minimizing how much data is read, and how many times. For example (I'm sure you've thought of this already), if your job includes transformation code, remove all EXECUTE statements in the code, except the very few that are needed for certain special purposes.

I may be writing things that you've addressed long ago. But if I were looking at SPSS jobs with very long running times, I'd start by exploring those things I've noted.

Best of luck,
Richard
In reply to this post by Albert-Jan Roskam
Good idea, pyodbc. We'll check that; it might work. Thanks, everyone!
Dr. Frank Gaeth
In reply to this post by drfg2008
"Currently some queries take days.."
Maybe your DB design is sub optimal. Maybe take a look at what is going on in the queries. Is all the data that is being extracted fresh or are you reading data which has already been processed last week and the past 6 days as well unnecessarily? This several instance idea works only if you don't need to work with the files concurrently. Otherwise you need to SAVE each data set to disk before mangling them further. Enough rope?
Or maybe update the DB statistics (full scan). That could help a LOT.
https://www.simple-talk.com/sql/performance/sql-server-statistics-questions-we-were-too-shy-to-ask/

And what I don't like about VIEWs is that one can easily create very complex queries that look deceptively simple. It may be better to use a temporary table.

And since you're asking about parallel processing: do you know Hadoop? That's a big step, but wow, it's powerful!

Regards,
Albert-Jan
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
All right, but apart from the sanitation, the medicine, education, wine, public order, irrigation, roads, a fresh water system, and public health, what have the Romans ever done for us?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Yes, IT suggested Hadoop already. But what about IBM SPSS and Hadoop integration then? I couldn't find anything so far.
Dr. Frank Gaeth
SPSS Modeler 15.0.2 (the current version) with IBM SPSS Analytic Server supports Hadoop.

Jon Peck, Senior Software Engineer, IBM
Well, we have IBM SPSS Statistics 20 licenses right now.
Dr. Frank Gaeth
I am planning to check this out on a single-node Hadoop cluster: https://pypi.python.org/pypi/mrjob/0.4-dev

Not sure how easy it is to install Hadoop under Windows. It is not entirely trivial on Linux, but some good guides are available.

Regards,
Albert-Jan
In reply to this post by drfg2008
Good afternoon,
I would definitely agree with David Marso: the DBs and queries should be optimized, and that will greatly reduce the processing time. But we cannot do anything about that; what we have is 7 DBs and 7 queries, and we have to process the results of those queries. Since Python is integrated in SPSS, we can use its powerful functionality, e.g. apply multiprocessing: our jobs list would be a list of 7 tuples (dbName, sqlQuery); we use 7 worker processes to access these DBs in parallel; the output is a list of all resulting records; and, if we wish, we can also make the separate query results available as a dictionary:

    dict1 = {'dbName1': resultingRecordsList1, ..., 'dbName7': resultingRecordsList7}

To be able to do all this we have to add 'import sqlite3, multiprocessing, time' to the imports, then:

    def dbManager():
        q = multiprocessing.Queue(1024*1024)
        maxworkers = 7
        jobs = [(dbName1, sqlQuery1), ..., (dbName7, sqlQuery7)]
        workers = []            # live worker processes
        allRecordsList = []
        allRecordsDict = {}
        while jobs or workers:
            # drop workers that have finished
            workers = [w for w in workers if w.is_alive()]
            if len(workers) < maxworkers and len(jobs) != 0:
                j = jobs.pop()
                wd = multiprocessing.Process(target=dbWorker, args=(j, q))
                wd.start()
                workers.append(wd)
            time.sleep(0.1)
            # drain whatever the workers have queued so far
            while not q.empty():
                dbName, records = q.get()
                if records:
                    allRecordsList.extend(records)
                    allRecordsDict[dbName] = records
        return (allRecordsList, allRecordsDict)

    def dbWorker(args, q):
        dbName, sql = args
        conn = sqlite3.connect(dbName)
        c = conn.cursor()
        c.execute(sql)
        records = c.fetchall()
        q.put((dbName, records))
The bottleneck here, most likely, is in the query processing and record transmission to the client system. While Python multiprocessing could be used to access the databases in parallel, I doubt that this would be any faster than running multiple Statistics processes in external mode, and you still have the task of merging the resulting datasets. Before adopting any of these strategies, it would be important to understand where the system is spending the time. There could be I/O bottlenecks, database execution overhead, network delays, or other problems upstream of actually getting the data into Statistics.
Jon Peck, Senior Software Engineer, IBM
In reply to this post by drfg2008
To Jon K Peck:
"it would be important to understand where the system is spending the time" - cannot disagree on this, but, unfortunately, DB creation and administration is very often out of our hands. Mostly we are stuck with what we have and have to adapt. I agree that in majority of cases long DB query time would be due to "i/o bottlenecks, database execution overhead, network delays", etc. But since we have to solve our problem anyway, without touching DBs and queries we can try python multiprocessing here which would also solve "the task of merging the resulting datasets". The python manager function def dbManager() will do it on the fly, picking queried records from the queue and appending them to the all records list: while not q.empty(): dbName,records=q.get() if records: allRecordsList.extend(records) Then we have multiple choices: 1. to dump data into CSV file; 2. to pickle our data; 3. to create SPSS dataset from this data; Regards, |
In reply to this post by Jon K Peck
@ Jon:
R has such a 'multiprocessing' or 'parallel' feature: the mclapply function. With do.call(rbind, ...) the resulting SQL query outputs would then be aggregated into one final table. It would be nice to have that as an R plugin for IBM SPSS; it would make things much, much easier.
Dr. Frank Gaeth
I guess you are referring to the R parallel package. There is nothing stopping you from using that package via the R plugin. Could you elaborate on what you want to do with mclapply et al.?
Jon Peck, Senior Software Engineer, IBM
No idea right now how to integrate it; I just learned of an R parallel package, and that via do.call(rbind, ...) the different queries can be merged into one table.
It would be nice to have an example of how to integrate this in IBM SPSS and run it from IBM SPSS syntax.
Dr. Frank Gaeth