Re: Matching ... cases

classic Classic list List threaded Threaded
11 messages Options
Reply | Threaded
Open this post in threaded view
|

Re: Matching ... cases

Bob Schacht-3
My job is such that I perform the most complicated, subtle task of the
entire year only once per year. I am trying to construct a manual so that
when the next year comes around, I don't have to re-invent the wheel. My
manual still has gaps in it, however.

Please allow me to review, and simplify.
In previous years, I spent too much time worrying about duplicate cases,
and trying to salvage loose ends. This year, I think I have resolved the
duplicate case issues sufficiently, and will deal with odds and ends by
discarding them, after duly noting their existence. However, one
complication remains: The fundamental unit of observation is not the
person, but the case, since a person may be served by this system more than
once during a year. Therefore, there may be more than one case per person.

So, to review the simplified situation:
    * I have an aggregated file representing the merger of 4 quarterly case
files. Duplicate cases have been removed. Let's call the key ID field the
CASEKEY. In addition, the file includes a personal identifier, the SSN. The
SSN + Date of Application (DATEAPPL), if correctly specified, should also
provide a unique case key. Let's call this file QuExtAll.
    * I also have an annual case summary file,  called RSA911, that is
supposed to summarize information on the same cases. However, RSA911 is not
derived from QuExtAll. Rather, it is derived independently from the same
computerized database.  RSA911 does not have the CASEKEY. Instead, it has
the SSN and DATEAPPL.
I need to combine these two files into one, because although many of the
variables are the same, there are important variables in the first file
that are not in the second, and important variables in the second file that
are not in the first.
Wherever there is any conflict in the value of a variable common to both
files, I want the value from RSA911 to be used, unless it is "missing" or
undefined.

I am assuming that the key field to match the two files will be SSN +
DATEAPPL.
I plan to discard any cases that do not have a match in both files.

The question of the hour is, should I be using the syntax for ADD FILES, or
for UPDATE?
And if either one finds a discrepancy in any of a variable's attributes or
characteristics (e.g., field width, type) what will it do? Last year, your
choice was UPDATE:

At 12:01 PM 5/1/2006, Richard Ristow wrote:

>. . . I think you want UPDATE. On the UPDATE command, as I'm outlining it
>(not tested), when a variable has data in both input files, the value from
>the LATER SPECIFIED one will be taken. "Missing or blank values in the
>transaction [i.e., second] files are not used to update values in the
>master [first] file."
>
>So, if
>-> You've done the matching of people. You have a variable PRSN_KEY in
>files ANNUAL and QTRLY. You're confident that if PRSN_ID is the same in
>records in the two files, they represent the same person; and if PRSN_ID
>is different, they represent different people.

But this is not quite correct, since I will be matching cases, not people.
There is a PRSN_ID = SSN, and that will be unique for the person (but not
for the case) because I've already thrown out the cases where the SSN is
all 0's or all 9's.


>-> You've done the selection of quarterly records, so QTRLY contains at
>most one record for any PRSN_KEY.

Not quite. It can have more than one record for a PRSN_KEY if the DATEAPPL
is different.

>  (That was already true of file ANNUAL, right?)

Same thing.


>-> You want all variables that appear in either file. For variables that
>appear in both files, you want the non-missing value from either. If both
>have non-missing values, you want the value found in ANNUAL.

Yes! Well stated.


>Then, this should work (but very much not tested):
>
>UPDATE
>   /FILE=QTRLY
>   /FILE=ANNUAL
>   /BY PRSN_KEY.

...except that this should be CASE_KEY, which needs two variables, i.e.,
SSN and DATEAPPL. How do I designate that syntax?


>>But what will it do when it finds a case in one database that is not in
>>the other database?
>
>Take the values from the database the case is in. Variables that are only
>in the other database will have missing values.
>
>What you're doing has a lot of room for glitches; it doesn't sound like
>the files are perfectly clean, at all. But I think this should get you
>started; I hope, most of the way.
>
>-Good luck,
>  Richard

Thanks!
Bob Schacht




Robert M. Schacht, Ph.D. <[hidden email]>
Pacific Basin Rehabilitation Research & Training Center
1268 Young Street, Suite #204
Research Center, University of Hawaii
Honolulu, HI 96814
Reply | Threaded
Open this post in threaded view
|

Re: Matching ... cases

Maguin, Eugene
Bob,

I'd like to offer a suggestion. I haven't been following the postings very
carefully and I know there have been several back and forths between you and
Richard that may have contained important elements that I either didn't read
or don't recall. But, based on your most recent email only, this is how I
work the problem. It seems that you have two data sets with a common pair of
keys. In theory, every case in one dateset should be the other and
vice-versa. I would first satisfy myself that this is, in fact, true. It
might not be. This is one problem. You have the additional problem that you
want to select which variables to save contingently (i.e., look at the
values of the same variable in both datasets and decide based on that
evaluation which to keep). This is the other problem.

This may be overkill but I think it is the only way to be really sure about
both problems. First, go into both datasets and rename to a dataset specific
name the variables that you are going to want to keep. Then do a Match Files
by the two keys. Although I would add a dataset id variable to each dataset
before matching, the In subcommand will, I think, do the same thing. If you
crosstab the dataset id vars you'll find out whether one dataset has more
cases than the other. That might be revealing in and of itself! When you are
satisfied that you have the cases right, you can write some code for your
contingent variables (I think Do Repeat would work fine here), i.e., to
decide whether to use the value from the first dataset or replace it with
the value from the other dataset. Finally, do a big rename operation and
save the finished dataset.

Gene Maguin
Reply | Threaded
Open this post in threaded view
|

Re: Matching ... cases

Richard Ristow
In reply to this post by Bob Schacht-3
Thinking along, maybe complementary to Gene and maybe going different
directions, but here goes:

At 05:09 PM 6/21/2007, Bob Schacht wrote:

>[In my data,] one complication remains: The fundamental unit of
>observation is not the person, but the case, since a person may be
>served by this system more than once during a year. Therefore, there
>may be more than one case per person.

So far, so good. Not a difficulty in itself. Data can have any unit of
observation, as long as it's clear what it is, and you have keys that
identify units. (Data can also, as easily, have several units of
observation, in a hierarchy of decreasing aggregation; and that works
fine, too.)

>So, to review the simplified situation:
>    * I have an aggregated file representing the merger of 4 quarterly
> case files. Duplicate cases have been removed.

I take it that this merger is a catenation or interleave. That is, you
have the 4 quarterly files, all with the same set of variables, and
different records. The merged file has the same set of variables, and
all records from all the quarterly files, except (as you say) where
duplicates have been removed. (In SPSS, such a merger is usually done
with ADD FILES.)

And you're all set, here; you already have the merged file with the
records from all 4 quarters.

>[In this merged file, there is a] key ID field [that uniquely
>identifies each case, as defined above]; let's call [it] the CASEKEY.
>In addition, the file includes a personal identifier, the SSN. The SSN
>+ Date of Application (DATEAPPL), if correctly specified, should also
>provide a unique case key.
>Let's call this [merged file of the 4 quarters' data] QuExtAll.

And, since you've eliminated duplicates, this file has at most one
record for any one case.

>    * I also have an annual case summary file,  called RSA911, that is
> supposed to summarize information on the same cases. However, RSA911
> is not derived from QuExtAll. Rather, it is derived independently
> from the same computerized database.  RSA911 does not have the
> CASEKEY. Instead, it has the SSN and DATEAPPL.
>
>I need to combine these two files into one, because although many of
>the variables are the same, there are important variables in the first
>file that are not in the second, and important variables in the second
>file that are not in the first. Wherever there is any conflict in the
>value of a variable common to both files, I want the value from RSA911
>to be used, unless it is "missing" or undefined.

Clear enough specification here, too,so far.

>I am assuming that the key field to match the two files will be SSN +
>DATEAPPL.

(Grin) A reasonable assumption, since that pair of variables is the
only shared key you have.

>The question of the hour is, should I be using the syntax for ADD
>FILES, or for UPDATE?

This one is definitely UPDATE. Like this, but not tested:

UPDATE
   /FILE=QuExtAll  /IN=IN_Qtrs
   /FILE=RSA911    /IN=IN_RSA
   /BY SSN DATEAPPL.


>I plan to discard any cases that do not have a match in both files.

That would be, any case for which 'IN_Qtrs' and 'IN_RSA' are not both
1.

>And if either one finds a discrepancy in any of a variable's
>attributes or characteristics (e.g., field width, type) what will it
>do?

It will abandon the UPDATE, and give an error message.

Then, in the UPDATE statement, you have to rename that variable in one
of the files, using the /RENAME subcommand. After you've done that, you
can,
a. Keep both versions of the variable, with their different types but
now also with different names.
b. Drop the version of the variable in one of the files. If you do
that, rename the version in the file from which you're going to drop
the variable, and then drop the renamed version.
c. Use COMPUTE to copy the value of the renamed variable to the
original name; then, later, drop the renamed version. IF you do that,
. Since you want the value from RSA to take precedence, rename the copy
in QuExtAll. If the variable is named 'Vbl', rename the QuExtAll copy
something like 'QuX_Vbl'.
. Then, probably this logic (but not tested)
IF   MISSING(Vbl) Vbl = QuX_Vbl.

You'll probably want to drop 'QuX_Vbl', but *later*, not on the UPDATE
statement.

-Good night and good luck,
  Richard
Reply | Threaded
Open this post in threaded view
|

Re: Matching ... cases

Bob Schacht-3
At 07:15 PM 6/21/2007, Richard Ristow wrote:

>Thinking along, maybe complementary to Gene and maybe going different
>directions, but here goes:
>
>At 05:09 PM 6/21/2007, Bob Schacht wrote:
>
>[snip]
>
>>So, to review the simplified situation:
>>    * I have an aggregated file representing the merger of 4 quarterly
>>case files. Duplicate cases have been removed.
>
>I take it that this merger is a catenation or interleave. That is, you
>have the 4 quarterly files, all with the same set of variables, and
>different records. The merged file has the same set of variables, and
>all records from all the quarterly files, except (as you say) where
>duplicates have been removed. (In SPSS, such a merger is usually done
>with ADD FILES.)
>
>And you're all set, here; you already have the merged file with the
>records from all 4 quarters.

Right!


>>[In this merged file, there is a] key ID field [that uniquely
>>identifies each case, as defined above]; let's call [it] the CASEKEY.
>>In addition, the file includes a personal identifier, the SSN. The SSN
>>+ Date of Application (DATEAPPL), if correctly specified, should also
>>provide a unique case key.
>>Let's call this [merged file of the 4 quarters' data] QuExtAll.
>
>And, since you've eliminated duplicates, this file has at most one record
>for any one case.

Yes.


>>    * I also have an annual case summary file,  called RSA911, that is
>>supposed to summarize information on the same cases. However, RSA911
>>is not derived from QuExtAll. Rather, it is derived independently
>>from the same computerized database.  RSA911 does not have the
>>CASEKEY. Instead, it has the SSN and DATEAPPL.
>>
>>I need to combine these two files into one, because although many of
>>the variables are the same, there are important variables in the first
>>file that are not in the second, and important variables in the second
>>file that are not in the first. Wherever there is any conflict in the
>>value of a variable common to both files, I want the value from RSA911
>>to be used, unless it is "missing" or undefined.
>
>Clear enough specification here, too,so far.
>
>>I am assuming that the key field to match the two files will be SSN +
>>DATEAPPL.
>
>(Grin) A reasonable assumption, since that pair of variables is the
>only shared key you have.
>
>>The question of the hour is, should I be using the syntax for ADD FILES,
>>or for UPDATE?
>
>This one is definitely UPDATE. Like this, but not tested:
>
>UPDATE
>   /FILE=QuExtAll  /IN=IN_Qtrs
>   /FILE=RSA911    /IN=IN_RSA
>   /BY SSN DATEAPPL.

I didn't think to add the /IN=   tags.

>>I plan to discard any cases that do not have a match in both files.
>
>That would be, any case for which 'IN_Qtrs' and 'IN_RSA' are not both 1.

This part of my "plan" might be over-hasty. I may not discard these cases
after all; depends on the results.
In any case, having the 'IN_xxx' tags may come in handy.


>>And if either one finds a discrepancy in any of a variable's
>>attributes or characteristics (e.g., field width, type) what will it
>>do?
>
>It will abandon the UPDATE, and give an error message.

I've already encountered this, as it points out to me half a dozen common
variables that are defined differently.
Very nice error message.


>Then, in the UPDATE statement, you have to rename that variable in one
>of the files, using the /RENAME subcommand. After you've done that, you
>can,
>a. Keep both versions of the variable, with their different types but
>now also with different names.
>b. Drop the version of the variable in one of the files. If you do
>that, rename the version in the file from which you're going to drop
>the variable, and then drop the renamed version.
>c. Use COMPUTE to copy the value of the renamed variable to the
>original name; then, later, drop the renamed version. IF you do that,
>. Since you want the value from RSA to take precedence, rename the copy
>in QuExtAll. If the variable is named 'Vbl', rename the QuExtAll copy
>something like 'QuX_Vbl'.
>. Then, probably this logic (but not tested)
>IF   MISSING(Vbl) Vbl = QuX_Vbl.
>
>You'll probably want to drop 'QuX_Vbl', but *later*, not on the UPDATE
>statement.

Rather than do those things, I think I'll fix the files before the merge so
that the variables will merge properly in the first place.

Thanks for your recommendations!
Bob

Robert M. Schacht, Ph.D. <[hidden email]>
Pacific Basin Rehabilitation Research & Training Center
1268 Young Street, Suite #204
Research Center, University of Hawaii
Honolulu, HI 96814
Reply | Threaded
Open this post in threaded view
|

Exploratory data analysis using macros of the MVA, FREQUENCIES and EXAMINE commands

Gary Oliver
Colleagues

Over the last year I have gradually moved from using the menus in SPSS
v14 to creating macros. The last stage of this has been to create a file
to do exploratory data analysis. Below is a summary of my new syntax.
I'd appreciate any comments or suggestions for improvements.

Essentially I considered that checking missing values was the stage 1.
Then in stage 2 seeing the distribution of the survey responses as
frequencies using bar graph was next. Then in stage 3 using Tukey style
graphics to reconsider the distributions. The in stage 4 to obtain the
spot values for central tendency and dispersion.

I realise I have not isolated extreme values nor have I separated
kurtosis and skew for the variables to make it easy to focus on them (as
this seems not to be allowed by the command syntax). There also seems no
requirement to do a list of the variables as I just cust and paste the
list from the variables view where I have the placeholder <list of
variables>.

TIA/gary

* TITLE "STAGE 1 MISSING VALUES" .
MVA
<list of variables>
EXECUTE .

TITLE "STAGE 2 GRAPHICAL DISTRIBUTION OF FREQUENCIES OF KEY UNIVARIATE
VARIABLES" .
FREQUENCIES
 VARIABLES=
<list of variables>
/HISTOGRAM
 /ORDER= ANALYSIS .
EXECUTE .

TITLE "STAGE 3 TUKEY GRAPHICAL UNIVARIATE VARIABLES" .
EXAMINE
  VARIABLES=
<list of variables>
  /PLOT STEMLEAF BOXPLOT NPPLOT .
EXECUTE .

TITLE "STAGE 4 DESCRIPTIVE CENTRAL TENDENCY & DISPERSION UNIVARIATE
VARIABLES" .
 EXAMINE
  VARIABLES=
<list of variables>
  /STATISTICS DESCRIPTIVE .
EXECUTE .
Reply | Threaded
Open this post in threaded view
|

Re: Matching ... cases

Bob Schacht-3
In reply to this post by Richard Ristow
At 07:15 PM 6/21/2007, Richard Ristow wrote:

[snip]

>[Bob had written:]
>>    * I also have an annual case summary file,  called RSA911, that is
>>supposed to summarize information on the same cases. However, RSA911
>>is not derived from QuExtAll. Rather, it is derived independently
>>from the same computerized database.  RSA911 does not have the
>>CASEKEY. Instead, it has the SSN and DATEAPPL.
>>
>>I need to combine these two files into one, because although many of
>>the variables are the same, there are important variables in the first
>>file that are not in the second, and important variables in the second
>>file that are not in the first. Wherever there is any conflict in the
>>value of a variable common to both files, I want the value from RSA911
>>to be used, unless it is "missing" or undefined.
>
>Clear enough specification here, too,so far.
>
>>I am assuming that the key field to match the two files will be SSN +
>>DATEAPPL.
>
>(Grin) A reasonable assumption, since that pair of variables is the
>only shared key you have.
>
>>The question of the hour is, should I be using the syntax for ADD
>>FILES, or for UPDATE?
>
>This one is definitely UPDATE. Like this, but not tested:
>
>UPDATE
>   /FILE=QuExtAll  /IN=IN_Qtrs
>   /FILE=RSA911    /IN=IN_RSA
>   /BY SSN DATEAPPL.

This *seemed to* work like a charm.

>>I plan to discard any cases that do not have a match in both files.
>
>That would be, any case for which 'IN_Qtrs' and 'IN_RSA' are not both 1. .
>. .

I changed my mind about this, because I want to retain information about
active cases, and that information is available ONLY from the Qtrs files.
In other words, I kept cases where IN_Qtrs = 1 but IN_RSA = 0.

However, the Update did not work as expected. That is, a field present in
both files but blank in the Qtrs file was NOT filled in by values from the
field in the RSA file. Why not?

Bob

Robert M. Schacht, Ph.D. <[hidden email]>
Pacific Basin Rehabilitation Research & Training Center
1268 Young Street, Suite #204
Research Center, University of Hawaii
Honolulu, HI 96814
Reply | Threaded
Open this post in threaded view
|

Re: Matching ... cases

Richard Ristow
Ah, ever onward. At 05:34 PM 6/28/2007, Bob Schacht wrote:

>At 07:15 PM 6/21/2007, Richard Ristow wrote:
>
>>
>>This one is definitely UPDATE. Like this, but not tested:
>>
>>UPDATE
>>   /FILE=QuExtAll  /IN=IN_Qtrs
>>   /FILE=RSA911    /IN=IN_RSA
>>   /BY SSN DATEAPPL.
>
>This *seemed to* work like a charm.
>
>However, the Update did not work as expected. That is, a field present
>in both files but blank in the Qtrs file was NOT filled in by values
>from the field in the RSA file. Why not?

Well, I'm not sure.

May I remind you that, if some code doesn't work, it's hard to diagnose
from the bare statement that it doesn't? It's much better if you can
give some test data, with a statement of what doesn't work - in your
case, which variable this happens to.

And in your case, among other things: Are there any string variables
that *do* appear to be processed as you desire?

Off the top of my head, I don't see it. From the Command Syntax
Reference article on "UPDATE", it should work as you desire. This is
from the SPSS 15 version, but I think it's been the same for many
releases:

>If the master and transaction files contain common variables for
>matched cases, the values for those variables are taken from the
>transaction file, provided that the values are not missing or blanks.
>Missing or blank values in the transaction files are not used to
>update values in the master file. (p.1840)

Running with the /MAP subcommand may help, by at least telling you
whether SPSS thinks that the two variables *are* the same.

That's all I can think of, with what you've given us to work with. Try
/MAP, and ask again if that doesn't solve it.

-Good luck,
  Richard
Reply | Threaded
Open this post in threaded view
|

Re: Matching ... cases

Bob Schacht-3
At 03:49 PM 6/28/2007, Richard Ristow wrote:

>Ah, ever onward. At 05:34 PM 6/28/2007, Bob Schacht wrote:
>
>>At 07:15 PM 6/21/2007, Richard Ristow wrote:
>>
>>>
>>>This one is definitely UPDATE. Like this, but not tested:
>>>
>>>UPDATE
>>>   /FILE=QuExtAll  /IN=IN_Qtrs
>>>   /FILE=RSA911    /IN=IN_RSA
>>>   /BY SSN DATEAPPL.
>>
>>This *seemed to* work like a charm.
>>
>>However, the Update did not work as expected. That is, a field present in
>>both files but blank in the Qtrs file was NOT filled in by values from
>>the field in the RSA file. Why not?
>
>Well, I'm not sure.
>
>May I remind you that, if some code doesn't work, it's hard to diagnose
>from the bare statement that it doesn't? It's much better if you can give
>some test data, with a statement of what doesn't work - in your case,
>which variable this happens to.

Sorry, I sent in too much haste.  Here's an example. A variable in common
to both files is Employment Status at Closure, or CLOEMPST. This is a one
digit numeric field coded 1-7. Code '5' is for homemakers. A variable found
only in the RSA file is Occupation at Closure. It is coded '599999' for
Homemakers. However, when I did a crosstab on these two variables using the
combined datafile, I got nuthin': The table was empty:

>Warnings
>
>The crosstabulation of Occupation@Clo * Employment Status @Closure is empty.

But when I backed up and used the original RSA file, I got a table
populated with 667 cases-- which was the number of cases closed in
employment. Does that help?


>And in your case, among other things: Are there any string variables that
>*do* appear to be processed as you desire?

In this case, Occupation at Closure was a string variable, but CLOEMPST
was  numeric. No problems with Occupation at Closure, but it was only in
the RSA file. Why should it make any difference if its a string variable?


>Off the top of my head, I don't see it. From the Command Syntax Reference
>article on "UPDATE", it should work as you desire. This is from the SPSS
>15 version, but I think it's been the same for many releases:

Mine is version 12.


>>If the master and transaction files contain common variables for matched
>>cases, the values for those variables are taken from the transaction
>>file, provided that the values are not missing or blanks. Missing or
>>blank values in the transaction files are not used to update values in
>>the master file. (p.1840)
>
>Running with the /MAP subcommand may help, by at least telling you whether
>SPSS thinks that the two variables *are* the same.

Yes, it thinks they're the same.
I re-ran the Occupation @ Closure by Employment Status @ closure, and it
was empty again.
Any suggestions? In any case, thanks with your efforts to help!

Bob


>That's all I can think of, with what you've given us to work with. Try
>/MAP, and ask again if that doesn't solve it.
>
>-Good luck,
>  Richard
Reply | Threaded
Open this post in threaded view
|

Re: Matching ... cases

Richard Ristow
At 10:55 PM 6/28/2007, Bob Schacht wrote:

>>UPDATE
>>   /FILE=QuExtAll  /IN=IN_Qtrs
>>   /FILE=RSA911    /IN=IN_RSA
>>   /BY SSN DATEAPPL.
>>
>>>The Update did not work as expected. A field present in both files
>>>but blank in the Qtrs file was NOT filled in by values from the
>>>field in the RSA file. Why not?
>>
>>If code doesn't work, it's much better if you give some test data,
>>with a statement of what doesn't work - in your case, which variable
>>this happens to.
>
>A variable in common to both files is Employment Status at Closure, or
>CLOEMPST. This is a one-digit numeric field coded 1-7. Code '5' is for
>homemakers.
>
>A variable found only in the RSA file is Occupation at Closure. It is
>coded '599999' for Homemakers. However, when I did a crosstab on these
>two variables using the combined datafile, I got nuthin': The table
>was empty:
>
>>Warnings
>>
>>The crosstabulation of Occupation@Clo * Employment Status @Closure is
>>empty.
>
>But when I backed up and used the original RSA file, I got a table
>populated with 667 cases-- which was the number of cases closed in
>employment. Does that help?

It's closer. Still a couple of remarks, though:

- You wrote "A field present in both files but blank in the Qtrs file
was NOT filled in by values from the field in the RSA file."
   Now it looks like the field wasn't blank (which applies to string
variables) but system-missing (which applies to numeric variables).
That is just the kind of thing it's important to state accurately. If
you're not clear about the difference, give some test data, and a
listing of the variables' attributes. Use DISPLAY
DICTIONARY/VARIABLES=...

- You wrote, "when I did a crosstab, the table was empty." Best
practice is to 'think globally, debug locally.' Crosstab is global: it
gives you overall statistics from the file, but no indication of what
records were involved.
   It would not be hard to list the QuExtAll records, the RSA911
records, and the merged records, for a few cases that have values for
both variables in the RSA911 file. If there are too many variables, use
the /VARIABLES subcommand on LIST to restrict to the relevant
variables. You might even solve your own problem, doing this.
   Besides, your CROSSTABS doesn't tell you anything about the
individual variables, only that they never both have valid values in
the same record. What happens if you do FREQUENCIES on both variables,
before and after the merge? That'll test whether or not the variables
individually seem to have come through.

Do you have any user-missing values for either variable, in either
file? FREQUENCIES would catch that.

In other words, you have some very potent diagnostic tools available
easily. I don't see your problem offhand, so I'd just have to run the
same diagnostics myself (except you've sent no data, as noted
previously). So, give it a go. LIST plus FREQUENCIES will tell you a
lot you don't know now; there's a fair chance they'll solve your
problem.

And it's getting late, here in EDT. Good luck, and let's see what
tomorrow  brings.

-Best wishes,
  Richard
Reply | Threaded
Open this post in threaded view
|

Re: Matching ... cases

Bob Schacht-3
At 06:03 PM 6/28/2007, Richard Ristow wrote:

>At 10:55 PM 6/28/2007, Bob Schacht wrote:
>
>>>UPDATE
>>>   /FILE=QuExtAll  /IN=IN_Qtrs
>>>   /FILE=RSA911    /IN=IN_RSA
>>>   /BY SSN DATEAPPL.
>>>
>>>>The Update did not work as expected. A field present in both files but
>>>>blank in the Qtrs file was NOT filled in by values from the field in
>>>>the RSA file. Why not?

[snip]

>>A variable in common to both files is Employment Status at Closure, or
>>CLOEMPST. This is a one-digit numeric field coded 1-7. Code '5' is for
>>homemakers. . . .
>>
>>But when I backed up and used the original RSA file, I got a table
>>populated with 667 cases-- which was the number of cases closed in
>>employment. Does that help?
>
>It's closer. Still a couple of remarks, though:
>
>- You wrote "A field present in both files but blank in the Qtrs file was
>NOT filled in by values from the field in the RSA file."
>   Now it looks like the field wasn't blank (which applies to string
> variables) but system-missing (which applies to numeric variables).  .  .  .

You are right. I should not have described the field as blank.

>- You wrote, "when I did a crosstab, the table was empty." Best practice
>is to 'think globally, debug locally.' Crosstab is global: it gives you
>overall statistics from the file, but no indication of what records were
>involved.
>   It would not be hard to list the QuExtAll records, the RSA911 records,
> and the merged records, for a few cases that have values for both
> variables in the RSA911 file. If there are too many variables, use the
> /VARIABLES subcommand on LIST to restrict to the relevant variables. You
> might even solve your own problem, doing this.

I looked at the original QuExtAll file, the original RSA911 file, and the
combined file, for these 17 cases closed "in employment" as "homemakers."
Since presenting tables in this forum is awkward, a few limited examples
will have to suffice. Three variables, SSN, Comp-Emp, and CLOEMPST were
available in both files. Variables OccuCode and CLOSTYPE were present only
in the RSA911 file, and STAT was present only in QuExtAll. The 17 cases in
question were selected in Quextall using Comp-emp + STAT; the same cases
were isolated in the RSA911 file using Comp-Emp + CLOSTYPE. The SSN were
cross-checked to align the case information.

Selected                Selected
Fields in               Fields in
QuExtAll                RSA911
-------------           ------------
SSN                     SSN
                         OccuCode
Comp-Emp                Comp-Emp
STAT
                         CLOSTYPE
CLOEMPST                CLOEMPST

The value of CLOEMPST in these 17 cases was always "5" in the RSA911 file,
and always ". " in the QuExtAll file, indicating system missing, right?

In the combined file, the value of CLOEMPST was always ". ".  Obviously,
this is not what I wanted, which was that the result, if the field is
present in both files, is the non-missing value. I left the syntax commands
from the original message up top to show the order in which they were
invoked by my UPDATE command.



>   Besides, your CROSSTABS doesn't tell you anything about the individual
> variables, only that they never both have valid values in the same
> record. What happens if you do FREQUENCIES on both variables, before and
> after the merge? That'll test whether or not the variables individually
> seem to have come through.

Frequencies on CLOEMPST, original QuExtAll file (closed + active cases):
Employment Status @Closure

  FrequencyPercent
Missing036.5
  16488.3
  310.1
  45.1
  517.2
  711.1
  System704790.6
  Total7774100.0

  Frequencies on CLOEMPST, original RSA911 File (closed cases only):
Employment status@clo

  FrequencyPercentValid PercentCumulative Percent
Valid162628.393.993.9
  39.41.395.2
  44.2.695.8
  517.82.598.4
  711.51.6100.0
  Total66730.1100.0
MissingSystem154869.9
Total2215100.0

This draws my attention to the fact that for the QuExtAll file, I declared
"0" to be a missing value, but did not do the same for the RSA911 file.

The frequencies for the combined file were as follows:
Employment Status @Closure

  FrequencyPercent
Missing036.5
  16488.3
  310.1
  45.1
  517.2
  711.1
  System704790.6
  Total7774100.0


It finally has dawned on me that I was using the missing value
specification incorrectly, and had inadvertently specified all possible
values as missing. The combination of files process copied this
specification along with everything else, so the values all got copied, but
were all considered missing! My bad.

Thanks for pushing me towards the correct solution.

Bob Schacht
Reply | Threaded
Open this post in threaded view
|

Re: Matching ... cases

Richard Ristow
At 08:59 PM 6/29/2007, Bob Schacht wrote:
>At 06:03 PM 6/28/2007, Richard Ristow wrote [double-quoted text,
>below]:

>>>>UPDATE
>>>>   /FILE=QuExtAll  /IN=IN_Qtrs
>>>>   /FILE=RSA911    /IN=IN_RSA
>>>>   /BY SSN DATEAPPL.

>>- You wrote "A field present in both files but blank in the Qtrs file
>>was
>>NOT filled in by values from the field in the RSA file."
>>   Now it looks like the field wasn't blank (which applies to string
>>variables) but system-missing (which applies to numeric
>>variables).  .  .  .
>
>You are right. I should not have described the field as blank.

Thank you. That kind of accuracy is a useful discipline to remember,
for the future.

>>   It would not be hard to list the QuExtAll records, the RSA911
>> records, and the merged records, for a few cases that have values
>> for both variables in the RSA911 file. If there are too many
>> variables, use the /VARIABLES subcommand on LIST to restrict to the
>> relevant variables. You might even solve your own problem, doing this.
>
>I looked at the original QuExtAll file, the original RSA911 file, and
>the combined file, for these 17 cases closed "in employment" as
>"homemakers." Since presenting tables in this forum is awkward

Goodness, is it really that bad? I don't see that it's more than 5
variables from any of the files. It would be an awfully useful tool to
work with.

Is it awkward because there are a lot of *other* variables? There are
several ways of getting only the relevant ones, including /VARIABLES=
on LIST (see previous posting), and /KEEP on GET FILE. Either way, use
LIST to print a table to a draft output file, copy and paste.

>Three variables, SSN, Comp-Emp, and CLOEMPST were available in both
>files. Variables OccuCode and CLOSTYPE were present only in the RSA911
>file, and STAT was present only in QuExtAll. The 17 cases in question
>were selected in Quextall using Comp-emp + STAT; the same cases were
>isolated in the RSA911 file using Comp-Emp + CLOSTYPE. The SSN were
>cross-checked to align the case information.

OK. The following is the /MAP listing from UPDATE, right?

>Selected                Selected
>Fields in               Fields in
>QuExtAll                RSA911
>-------------           ------------
>SSN                     SSN
>                         OccuCode
>Comp-Emp                Comp-Emp
>STAT
>                         CLOSTYPE
>CLOEMPST                CLOEMPST
>
>The value of CLOEMPST in these 17 cases was always "5" in the RSA911
>file, and always "." in the QuExtAll file, indicating system missing,
>right?

Yes, that's how system-missing appears in LIST output.

>In the combined file, the value of CLOEMPST was always ".".  What I
>wanted, which was, if the field is present in both files, the
>non-missing value. I left the syntax commands from the original
>message ...

Namely this, right?

>UPDATE
>   /FILE=QuExtAll  /IN=IN_Qtrs
>   /FILE=RSA911    /IN=IN_RSA
>   /BY SSN DATEAPPL.

One thing raises question whether this is, after all, be the syntax you
ran:
The UPDATE command includes DATEAPPL as a key variable; but DATEAPPL
does not appear in the map from UPDATE, above.

What happens if you do FREQUENCIES on both variables, before and
>>after the merge? That'll test whether or not the variables
>>individually
>>seem to have come through.
>
>Frequencies on CLOEMPST, original QuExtAll file (closed + active
>cases):
>Employment Status @Closure
[...]
>This draws my attention to the fact that for the QuExtAll file, I
>declared "0" to be a missing value, but did not do the same for the
>RSA911 file.
>
>It finally has dawned on me that I was using the missing value
>specification incorrectly, and had inadvertently specified all
>possible values as missing. The combination of files process copied
>this specification along with everything else, so the values all got
>copied, but were all considered missing! My bad.

These things happen. To say once more, as we close this:

It helps a great deal, it would have helped a great deal, to have all
the relevant information, when working on a problem.

In this case, giving us complete dictionary information on the relevant
variables, and a listing of the data, would probably have got you to
the solution several posting cycles, and several days, sooner, which
would likely have been to your advantage. In fact, you might well have
solved the problem yourself from that information, as in fact you did.

>Thanks for pushing me towards the correct solution.

You're most welcome. And, with a deep sigh of relief, on to the next.

-Wishing you the best of good analysis,
  Richard Ristow