SPSSX Discussion

Limiting Queries Involving Two or More Datasets From Different Sources

Classic

List

Threaded

3 messages Options

Krinsky, Alan-2

Limiting Queries Involving Two or More Datasets From Different Sources

Is there a manageable way in SPSS to limit a query on one dataset with by a large subset of the dataset?

That is, let's say I have data for 100,000 people, but I am only interested in a particular subset of 1,000 people from this group. The way I would ordinarily work this out would be to run whatever query I needed on the entire group of 100,000 and then link those results with the group of 1,000 and select only for those cases (matching, e.g., on an identification number common to both sets). That's fine for a single query, but for something more complicated, involving multiple steps or even datasets, this can get bulky. So, I was wondering if there is not a way to make part of the query of the larger group a limitation to the people of interest, and to do so without, say, having to list out 1,000 names or identification numbers.

Another way of asking my question is like this: For the above example, let's assume the data for the 100,000 is on a server, whereas the group of 1,000 is identified in a dataset previously created from data on the server, but not itself sitting on the server. Can a single New Query to create a dataset draw from both the server and the SPSS dataset, or from two servers?

A third possibility: Can I export to the server a copy of my dataset of 1,000? Then my New Query would be able to join the small and large datasets as part of the query. And if I can do this, can I code the export without having to export manually?

A last question: Is my problem that I am attempting to carry out a query appropriate to a relational database, but SPSS is not itself a relational database (in the way, e.g., Access is)?

Thanks for any assistance with this.

Alan

“In theory, there is no difference between theory and practice. But in practice, there is."
---source uncertain; attributed to Yogi Berra among others

Alan D. Krinsky PhD, MPH
Medical Management Interventions Manager
UMass Memorial Health Care
Hahnemann Campus
281 Lincoln St.
Worcester, MA 01605
Phone: 508-334-5854
Fax: 508-793-6086
E-mail: [hidden email]

The information transmitted is intended only for the person or entity to which it is addressed and may contain confidential and/or privileged material. Any review, transmission, re-transmission, dissemination or other use of, or taking of any action in reliance upon this information by persons or entities other than the intended recipient is prohibited. If you received this in error, please contact the sender and delete the material from any computer.

Richard Ristow

Re: Limiting Queries Involving Two or More Datasets

At 04:41 PM 11/17/2009, Krinsky, Alan wrote:

That is, let's say I have data for 100,000 people, but I am only interested in a particular subset of 1,000 people from this group. The way I would ordinarily work this out would be to run whatever query I needed on the entire group of 100,000 and then link those results with the group of 1,000 and select only for those cases (matching, e.g., on an identification number common to both sets).

So you're all set with how to MATCH FILES two files, and select only records in one that have matches in the other.

I was wondering if there is not a way to make part of the query of the larger group a limitation to the people of interest, and to do so without, say, having to list out 1,000 names or identification numbers.

If the data for the 100,000 is on a server, whereas the group of 1,000 is identified in a dataset previously created from data on the server, but not itself sitting on the server.

That sounds like (untested)

MATCH FILES /FILE=FromServer /TABLE=Subset /IN=InSubset /BY IDnum. SELECT IF InSubset.After this, it may be well to issue a

CACHE.so that SPSS doesn't go back to the server and run the whole thing again, the next time you run a procedure on the same subset. If you have to use GET DATA to get data from the server, then try

GET DATA TYPE=ODBC.... DATASET NAME FromServer. MATCH FILES /FILE=FromServer /TABLE=Subset /IN=InSubset /BY IDnum. SELECT IF InSubset. DATASET NAME SubsetData.I think that would pass through the data only once.

But it look like you know all this, so there's some reason it doesn't work for you. Like, the server data doesn't come in sorted by IDnum?

A third possibility: Can I export to the server a copy of my dataset of 1,000? Then my New Query would be able to join the small and large datasets as part of the query. And if I can do this, can I code the export without having to export manually?

SAVE TRANSLATE to your server, followed by the SQL query?

A last question: Is my problem that I am attempting to carry out a query appropriate to a relational database, but SPSS is not itself a relational database (in the way, e.g., Access is)?

SPSS is a sort-merge database manager -- that is, files (relational 'tables') can be joined, but only if they're in order on the key to be used.

Sort-merge systems can do most of what relational systems can, if you don't mind sorting the data as needed. (An exception is many-to-many matching, which is natural in SQL as an inner join, but a screaming pain in the neck in SPSS.)

Anyway: You should be able to do what you want by GET DATA from the server, followed by MATCH FILES (with the subset list) in SPSS; or by SAVE TRANSLATE to send the subset list to the server, then GET DATA specifying SQL that selects only cases from the subset.

So I'm pretty sure I'm missing something. Why don't any of these work for you?

-Best of luck,
Richard
===================== To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD

Jon K Peck

Re: Limiting Queries Involving Two or More Datasets From Different Sources

In reply to this post by Krinsky, Alan-2

To prefilter the cases, I would suggest using SQL in the GET DATA command so that the filtering would happen in the database rather than bringing the data down to SPSS and then discarding most of the cases.

In your ODBC code, you would write sql code as below. This assumes that table1 has all the records and table2 is also in the database and is just a list of id's.

SELECT Table1.*
from table1, table2
WHERE Table1.ID In ([table2].[id]);

So your whole command would be something like
GET DATA
/TYPE=ODBC
/CONNECT='DSN=blah blah blah'
/SQL='SELECT Table1.* from table1, table2 WHERE Table1.ID In ([table2].[id]);'.

If you initially determined the cases of interest using SPSS, you could use SAVE TRANSLATE to write table2 back to the database for future use, assuming that you have the appropriate permissions.

HTH,

Jon Peck
SPSS, an IBM Company
[hidden email]
312-651-3435

From:	"Krinsky, Alan" <[hidden email]>
To:	[hidden email]
Date:	11/17/2009 06:35 PM
Subject:	[SPSSX-L] Limiting Queries Involving Two or More Datasets From Different Sources
Sent by:	"SPSSX(r) Discussion" <[hidden email]>

Is there a manageable way in SPSS to limit a query on one dataset with by a large subset of the dataset?

A last question: Is my problem that I am attempting to carry out a query appropriate to a relational database, but SPSS is not itself a relational database (in the way, e.g., Access is)?

Thanks for any assistance with this.

Alan

“In theory, there is no difference between theory and practice. But in practice, there is."
---source uncertain; attributed to Yogi Berra among others