Merging files

classic Classic list List threaded Threaded
6 messages Options
Reply | Threaded
Open this post in threaded view
|

Merging files

Bob Schacht-3
It is time for my annual task of merging files:
    * First, merging quarterly files of closed and open cases that have the
same format, and then deleting duplicates;
    * Then merging the set of quarterly files with a file of annual case
reports of the same closed cases, but with additional variables.
Richard Ristow has provided helpful advise on this procedure, and it looks
like a case for UPDATE syntax.

However, I have a question about the Menu "Data/Merge files" function. If I
chose the "add variables" option, the matching algorithm fails to match
dozens of variables that have exactly the same name, type, and width in
both files. Some it lists in the "excluded variables" window, and others it
puts into the included variables window.

For example, "AgeatApp" is numeric 8.2 in both files, but are not matched.
"App_GA" is a string 1 in both files, but not matched.

There are dozens of variables like this. How does it know which variables
to "include," and which to "exclude"?

Second part:
Last time I tried to use UPDATE, this is what happened:

UPDATE FILE='C:\. . . AllQuExt2005withSSN_NoDupes.sav'
/FILE='C:\. . . RSA911-2005Original_rev.sav'
/BY SSN  DATEAPPL.
/*The above update generated 942 multiple closures with no match for one of
the RSA911 cases, so these RSA911 cases were removed from the file and
transferred to a new file
/*
UPDATE FILE='C:\. . . AllQuExt2005withSSN_NoDupes.sav'
/FILE='C:\. . . .RSA911QuExt2005ver2x.sav'
/BY SSN DATEAPPL.

/*This doesn't work. SPSS gives the following error:
/*File #2Y
/*     KEY: 575253056       1.31E+10
/* This SSN exists in one file but not the other.
/*>Error # 5141
/*>File out of order doing ADD FILES or UPDATE.  All files must be in
non-descending
/*>order on the BY variables.  Use SORT CASES to sort the file in order.
/*But both files were sorted by SSN & DATEAPPL!

What's wrong with this picture?

Bob


Bob Schacht


Robert M. Schacht, Ph.D. <[hidden email]>
Pacific Basin Rehabilitation Research & Training Center
1268 Young Street, Suite #204
Research Center, University of Hawaii
Honolulu, HI 96814
Reply | Threaded
Open this post in threaded view
|

Re: Merging files

Richard Ristow
First part only:

At 10:18 PM 6/19/2007, Bob Schacht wrote:

>It is time for my annual task of merging files:
>    * First, merging quarterly files of closed and open cases that
> have the same format, and then deleting duplicates;
>
>However, I have a question about the Menu "Data/Merge files" function.
>If I chose the "add variables" option, the matching algorithm fails to
>match dozens of variables that have exactly the same name, type, and
>width in both files. Some it lists in the "excluded variables" window,
>and others it puts into the included variables window.

I think you want "add cases", don't you? You're trying, as you say, to
"merge files of closed and open cases that have the same format, and
then delete duplicates". That sounds like keeping cases from both files
(it expands as ADD FILES), rather than combining variables from
multiple cases into one case.

>For example, "AgeatApp" is numeric 8.2 in both files, but are not
>matched. "App_GA" is a string 1 in both files, but not matched.
>
>There are dozens of variables like this. How does it know which
>variables
>to "include," and which to "exclude"?

If you "add variables" (which you probably don't want to, per above)
and have variables with the same names in the two files, it will, by
default, take the one from the first file named, and delete the one in
the second file. It that what you're seeing?

I'm pretty sure that using "add cases" will give you what you want.
Reply | Threaded
Open this post in threaded view
|

Re: Merging files

Richard Ristow
In reply to this post by Bob Schacht-3
Second part: At 10:18 PM 6/19/2007, Bob Schacht wrote:

>    * Then merging the set of quarterly files with a file of annual
> case
>reports of the same closed cases, but with additional variables.
>Richard Ristow has provided helpful advise on this procedure, and it
>looks like a case for UPDATE syntax.
>
>Second part:
>Last time I tried to use UPDATE, this is what happened:
>
>UPDATE FILE='C:\. . . AllQuExt2005withSSN_NoDupes.sav'
>   /FILE='C:\. . . RSA911-2005Original_rev.sav'
>   /BY SSN  DATEAPPL.
>
>The above update generated 942 multiple closures

"Multiple closures"? I don't recognize the term.

>with no match for one of the RSA911 cases, so these RSA911 cases were
>removed from the file and transferred to a new file

Wait a minute. You removed records "with no match for one of the RSA911
cases" from the RSA911 file? I don't understand this; if they're in the
RSA911 file, how could they not have a match in the RSA911 file?

>UPDATE FILE='C:\. . . AllQuExt2005withSSN_NoDupes.sav'
>/FILE='C:\. . . .RSA911QuExt2005ver2x.sav'
>/BY SSN DATEAPPL.
>
>/*This doesn't work. SPSS gives the following error:
>/*File #2Y
>/*     KEY: 575253056       1.31E+10
>/* This SSN exists in one file but not the other.
>/*>Error # 5141
>/*>File out of order doing ADD FILES or UPDATE.  All files must be in
>non-descending order on the BY variables.  Use SORT CASES to sort the
>file in order.
>/*But both files were sorted by SSN & DATEAPPL!

I'm afraid this trail is too long and cold to follow; there are too
many inferences to make, about just what you did to get here.

Do you have some test data, and syntax which will carry out this
process and produce the error? Only a fraction of your real data, of
course.

With that, it's probably soluble. Or maybe someone else on the List has
more energy and enthusiasm for this kind of inferring than I do,
tonight.

(By the way, if this is an annual job, do you have code that worked
last year? What happens if you try the same code this year?)

-Cheers, and good luck,
  Richard
Reply | Threaded
Open this post in threaded view
|

Re: Merging files

Bob Schacht-3
At 06:20 PM 6/19/2007, Richard Ristow wrote:

>Second part: At 10:18 PM 6/19/2007, Bob Schacht wrote:
>
>>    * Then merging the set of quarterly files with a file of annual case
>>reports of the same closed cases, but with additional variables. Richard
>>Ristow has provided helpful advise on this procedure, and it looks like a
>>case for UPDATE syntax.
>>
>>Second part:
>>Last time I tried to use UPDATE, this is what happened:
>>
>>UPDATE FILE='C:\. . . AllQuExt2005withSSN_NoDupes.sav'
>>   /FILE='C:\. . . RSA911-2005Original_rev.sav'
>>   /BY SSN  DATEAPPL.
>>
>>The above update generated 942 multiple closures
>
>"Multiple closures"? I don't recognize the term.

"Multiple closures" means, or is supposed to mean, that the same person was
in the system more than once. I am supposed to base my summaries on
*cases*, rather than *persons*, so multiple closures are alright, in
principle, as long as I have a match from both files.The total service pool
is more than 6,000, but that makes 942 a rather high fraction of the
service population, so I am skeptical about the interpretation.


>>with no match for one of the RSA911 cases, so these RSA911 cases were
>>removed from the file and transferred to a new file
>
>Wait a minute. You removed records "with no match for one of the RSA911
>cases" from the RSA911 file? I don't understand this; if they're in the
>RSA911 file, how could they not have a match in the RSA911 file?

I think what I meant here is that the RSA911 file had 942 multiple
closures, but of each pair, there was a match for only one of them in the
other file. So that meant that I had 942 cases that looked like multiple
closures in RSA911, but there appeared to be only one closure for each of
these cases in the companion file.


>>UPDATE FILE='C:\. . . AllQuExt2005withSSN_NoDupes.sav'
>>/FILE='C:\. . . .RSA911QuExt2005ver2x.sav'
>>/BY SSN DATEAPPL.
>>
>>/*This doesn't work. SPSS gives the following error:
>>/*File #2Y
>>/*     KEY: 575253056       1.31E+10
>>/* This SSN exists in one file but not the other.
>>/*>Error # 5141
>>/*>File out of order doing ADD FILES or UPDATE.  All files must be in
>>non-descending order on the BY variables.  Use SORT CASES to sort the
>>file in order.
>>/*But both files were sorted by SSN & DATEAPPL!
>
>I'm afraid this trail is too long and cold to follow; there are too many
>inferences to make, about just what you did to get here.
>
>Do you have some test data, and syntax which will carry out this process
>and produce the error? Only a fraction of your real data, of course.

I'll try it out again with this year's data and see if I get the same error.


>With that, it's probably soluble. Or maybe someone else on the List has
>more energy and enthusiasm for this kind of inferring than I do, tonight.
>
>(By the way, if this is an annual job, do you have code that worked last
>year? What happens if you try the same code this year?)

Thanks for your attempt. But for this application, which do you think would
work best? ADD FILE or UPDATE?

When variables match in both files, I want the value in the RSA911 file to
be used-- unless the RSA911 value is "missing" (i.e., field is blank), in
which case I would prefer the non-blank value, if any, from the companion
file. That situation should not occur, but with this merging process, I
don't consider any possibility out of the question.

Thanks,
Bob in HI
Reply | Threaded
Open this post in threaded view
|

Re: Merging files

Richard Ristow
At 02:39 AM 6/20/2007, Bob Schacht wrote:

>At 06:20 PM 6/19/2007, Richard Ristow wrote:
>>Second part: At 10:18 PM 6/19/2007, Bob Schacht wrote:
>>
>>>Last time I tried to use UPDATE, this is what happened:
>>>
>>>UPDATE FILE='C:\. . . AllQuExt2005withSSN_NoDupes.sav'
>>>   /FILE='C:\. . . RSA911-2005Original_rev.sav'
>>>   /BY SSN  DATEAPPL.
>>>
>>>The above update generated 942 multiple closures
>>
>>"Multiple closures"? I don't recognize the term.
>
>"Multiple closures" means, or is supposed to mean, that the same
>person was in the system more than once.

So, the same SSN turns up with more than one DATEAPPL?

>I am supposed to base my summaries on *cases*, rather than *persons*,
>so multiple closures are alright, in principle, as long as I have a
>match from both files.

So, every case (combination of SSN and DATEAPPL) is supposed to occur
in both files, or in neither.

>The total service pool is more than 6,000, but that makes 942 a rather
>high fraction of the service population, so I am skeptical about the
>interpretation.

That is, taken on the face of it, 942 of your 6,000 are people who are
more than one case each. I can see that might be high; at least, you
would know.

>>You removed records "with no match for one of the RSA911 cases" from
>>the RSA911 file? I don't understand this; if they're in the RSA911
>>file, how could they not have a match in the RSA911 file?
>
>I think what I meant here is that the RSA911 file had 942 multiple
>closures,

i.e., SSNs with more than one DATEAPPL, yes?

>but of each pair, there was a match for only one of them in the other
>file. So that meant that I had 942 cases that looked like multiple
>closures in RSA911, but there appeared to be only one closure for each
>of these cases in the companion file.

So, of those 942 SSNs that occur twice, *none* of them matched more
than one of their DATEAPPL values in the "companion file"? Did they all
match one of their DATEAPPL values; that is, where you have two dates
for the same person, it is *always* true that one is found in the
companion file, and the other isn't? (You see I am being niggly. But
the distinction between 'always' and 'usually' is crucial.)

>>>UPDATE FILE='C:\. . . AllQuExt2005withSSN_NoDupes.sav'
>>>/FILE='C:\. . . .RSA911QuExt2005ver2x.sav'
>>>/BY SSN DATEAPPL.
>>>
>>>/*This doesn't work. SPSS gives the following error:
>>>/*File #2Y
>>>/*     KEY: 575253056       1.31E+10
>>>/* This SSN exists in one file but not the other.
>>>/*>Error # 5141
>>>/*>File out of order doing ADD FILES or UPDATE.  All files must be
>>>in
>>>non-descending order on the BY variables.  Use SORT CASES to sort
>>>the
>>>file in order.
>>>/*But both files were sorted by SSN & DATEAPPL!
>>
>>I'm afraid this trail is too long and cold to follow; there are too
>>many
>>inferences to make, about just what you did to get here.
>>
>>Do you have some test data, and syntax which will carry out this
>>process
>>and produce the error? Only a fraction of your real data, of course.
>
>I'll try it out again with this year's data and see if I get the same
>error.
>
>
>>With that, it's probably soluble. Or maybe someone else on the List
>>has
>>more energy and enthusiasm for this kind of inferring than I do,
>>tonight.
>>
>>(By the way, if this is an annual job, do you have code that worked
>>last
>>year? What happens if you try the same code this year?)
>
>Thanks for your attempt. But for this application, which do you think
>would work best? ADD FILE or UPDATE?
>
>When variables match in both files, I want the value in the RSA911
>file to be used-- unless the RSA911 value is "missing" (i.e., field is
>blank), in which case I would prefer the value from the companion file.

This is what UPDATE is for. But remember, this is step 2. Step 1 sounds
different. *And* you may need to write syntax. I don't think the menus
'like' this situation, which is why you had trouble with the menus in
step 1.

Don't try the menus, for this problem. Skip what I said about step 1.,
drop the menus altogether, and write syntax for it.

Keep asking. I'm still trying to understand what you're doing, and I'm
working this out a step at a time.
Reply | Threaded
Open this post in threaded view
|

Re: Merging files

Bob Schacht-3
At 04:56 AM 6/20/2007, Richard Ristow wrote:

>At 02:39 AM 6/20/2007, Bob Schacht wrote:
>
>>At 06:20 PM 6/19/2007, Richard Ristow wrote:
>>>Second part: At 10:18 PM 6/19/2007, Bob Schacht wrote:
>>>
>>>>Last time I tried to use UPDATE, this is what happened:
>>>>
>>>>UPDATE FILE='C:\. . . AllQuExt2005withSSN_NoDupes.sav'
>>>>   /FILE='C:\. . . RSA911-2005Original_rev.sav'
>>>>   /BY SSN  DATEAPPL.
>>>>
>>>>The above update generated 942 multiple closures
>>>
>>>"Multiple closures"? I don't recognize the term.
>>
>>"Multiple closures" means, or is supposed to mean, that the same person
>>was in the system more than once.
>
>So, the same SSN turns up with more than one DATEAPPL?

Yes! If there are multiple closures, there has to be two different
application dates.


>>I am supposed to base my summaries on *cases*, rather than *persons*, so
>>multiple closures are alright, in principle, as long as I have a match
>>from both files.
>
>So, every case (combination of SSN and DATEAPPL) is supposed to occur in
>both files, or in neither.

In "theory".


>>The total service pool is more than 6,000, but that makes 942 a rather
>>high fraction of the service population, so I am skeptical about the
>>interpretation.
>
>That is, taken on the face of it, 942 of your 6,000 are people who are
>more than one case each. I can see that might be high; at least, you would
>know.
>
>>>You removed records "with no match for one of the RSA911 cases" from the
>>>RSA911 file? I don't understand this; if they're in the RSA911 file, how
>>>could they not have a match in the RSA911 file?
>>
>>I think what I meant here is that the RSA911 file had 942 multiple closures,
>
>i.e., SSNs with more than one DATEAPPL, yes?

Yes


>>but of each pair, there was a match for only one of them in the other
>>file. So that meant that I had 942 cases that looked like multiple
>>closures in RSA911, but there appeared to be only one closure for each of
>>these cases in the companion file.
>
>So, of those 942 SSNs that occur twice, *none* of them matched more than
>one of their DATEAPPL values in the "companion file"?
>Did they all match one of their DATEAPPL values; that is, where you have
>two dates for the same person, it is *always* true that one is found in
>the companion file, and the other isn't?

Yes, I believe so.

>(You see I am being niggly. But the distinction between 'always' and
>'usually' is crucial.)
>
>>>>UPDATE FILE='C:\. . . AllQuExt2005withSSN_NoDupes.sav'
>>>>/FILE='C:\. . . .RSA911QuExt2005ver2x.sav'
>>>>/BY SSN DATEAPPL.
>>>>
>>>>/*This doesn't work. SPSS gives the following error:
>>>>/*File #2Y
>>>>/*     KEY: 575253056       1.31E+10
>>>>/* This SSN exists in one file but not the other.
>>>>/*>Error # 5141
>>>>/*>File out of order doing ADD FILES or UPDATE.  All files must be in
>>>>non-descending order on the BY variables.  Use SORT CASES to sort the
>>>>file in order.
>>>>/*But both files were sorted by SSN & DATEAPPL!
>>>
>>>I'm afraid this trail is too long and cold to follow; there are too many
>>>inferences to make, about just what you did to get here.
>>>
>>>Do you have some test data, and syntax which will carry out this process
>>>and produce the error? Only a fraction of your real data, of course.
>>
>>I'll try it out again with this year's data and see if I get the same error.
>>
>>
>>>With that, it's probably soluble. Or maybe someone else on the List has
>>>more energy and enthusiasm for this kind of inferring than I do, tonight.
>>>
>>>(By the way, if this is an annual job, do you have code that worked last
>>>year? What happens if you try the same code this year?)
>>
>>Thanks for your attempt. But for this application, which do you think
>>would work best? ADD FILE or UPDATE?
>>
>>When variables match in both files, I want the value in the RSA911 file
>>to be used-- unless the RSA911 value is "missing" (i.e., field is blank),
>>in which case I would prefer the value from the companion file.
>
>This is what UPDATE is for. But remember, this is step 2. Step 1 sounds
>different. *And* you may need to write syntax. I don't think the menus
>'like' this situation, which is why you had trouble with the menus in step 1.
>
>Don't try the menus, for this problem. Skip what I said about step 1.,
>drop the menus altogether, and write syntax for it.

That's what I intend to do. Thanks!

>Keep asking. I'm still trying to understand what you're doing, and I'm
>working this out a step at a time.

I'm still trying to understand, too. Thanks for your help.

Bob Schacht