SPSSX Discussion

matching data files without unique ids


Khaleel Hussaini

matching data files without unique ids

Hello Listers,
I have the following variables but no unique id and I want to match
the two files and create a new file with one additional variable. Here is
the variable information in both datasets A and B but also some other
information in dataset B.
1) Mother's Last Name (A & B)
2) Mother's First Name (A & B)
3) Mother's DOB (A & B)
4) Child's Last name (A & B)
5) Child's First name (A & B)
6) Child's DOB (A & B)
7) CERTNUM (B)

I want the new datasets matched by mother's last name, first name, mother's
dob, child's last name, first name, and child's dob. Any ideas?

KH.

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD

Art Kendall

Re: matching data files without unique ids

We need a better description of your problem.

Make a copy of your files as they stand for backup.
For both files, determine how frequently the 6 variables constitute
unique identification of cases:
click <data>
click <find duplicate cases>
enter the 6 variables as the unique identifier set.
paste the syntax
run the syntax.
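
For readers who want to see what the duplicate check amounts to, here is a minimal sketch in Python (not SPSS syntax). The records are made up for illustration, and the field order MLN, MFN, MDOB, CLN, CFN, CDOB is an assumption taken from the variable list in the original question.

```python
from collections import Counter

# Hypothetical records in the order MLN, MFN, MDOB, CLN, CFN, CDOB.
records = [
    ("SMITH", "ANN", "1975-08-13", "SMITH", "TOM", "2007-11-29"),
    ("SMITH", "ANN", "1975-08-13", "SMITH", "TOM", "2007-11-29"),  # repeated key
    ("JONES", "EVE", "1980-07-04", "JONES", "SAM", "2006-01-15"),
]

def duplicate_keys(rows):
    """Return every key combination that occurs more than once, with its count."""
    counts = Counter(rows)
    return {key: n for key, n in counts.items() if n > 1}

duplicates = duplicate_keys(records)
print(len(duplicates), "key combination(s) are not unique")
```

If the result is empty for both files, the six variables identify cases uniquely and can serve as a composite key.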

Is it possible you could accomplish what you want by aggregating each
file so that there is a summary record for each unique combination of
the six variables?

Art Kendall
Social Research Consultants

Art Kendall

Re: matching data files without unique ids

I still do not understand what the problem is.
MATCH FILES can use several variables that have to be the same to create
a match. Both files must already be sorted by those variables.
E.g.,
match files
 /file=seta /in = ina
 /file=setb /in = inb
 /by MLN MFN MDOB CLN CFN CDOB.
select if ina eq 1 and inb eq 1.
. . .
Why a _probabilistic_ match?
Are dates only approximate, e.g., one has July 4, 1980 for Mother's
birthdate and one has July 5, 1980?
Are there hours, minutes etc. in the date on one file and not the other?
Are there variations in the spelling of names for a particular mother or
child? Or sometimes a nickname?

Can you create a small example with two files that shows what the
problem is?
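
To illustrate two of the questions above (time-of-day parts in dates, and trivial differences in how names are typed), here is a rough Python sketch, not SPSS syntax, of an exact match on all six variables after light normalization. The records and the CERTNUM values are hypothetical, and the sketch assumes the six-variable key is unique within setb.

```python
from datetime import datetime

NAME_FIELDS = ("MLN", "MFN", "CLN", "CFN")
DATE_FIELDS = ("MDOB", "CDOB")

def match_key(rec):
    """Build the comparison key: trimmed, upper-cased names and date-only DOBs."""
    names = tuple(rec[f].strip().upper() for f in NAME_FIELDS)
    dates = tuple(rec[f].date() for f in DATE_FIELDS)  # drops hours and minutes
    return names + dates

def exact_match(seta, setb):
    """Keep only cases present in both files, adding CERTNUM from setb."""
    index_b = {match_key(b): b for b in setb}  # assumes keys are unique in setb
    merged = []
    for a in seta:
        b = index_b.get(match_key(a))
        if b is not None:
            merged.append({**a, "CERTNUM": b["CERTNUM"]})
    return merged

# Hypothetical example data.
seta = [
    {"MLN": "Smith ", "MFN": "ann", "MDOB": datetime(1975, 8, 13, 14, 30),
     "CLN": "Smith", "CFN": "Tom", "CDOB": datetime(2007, 11, 29)},
    {"MLN": "Jones", "MFN": "Eve", "MDOB": datetime(1980, 7, 4),
     "CLN": "Jones", "CFN": "Sam", "CDOB": datetime(2006, 1, 15)},
]
setb = [
    {"MLN": "SMITH", "MFN": "Ann", "MDOB": datetime(1975, 8, 13),
     "CLN": "Smith", "CFN": "Tom", "CDOB": datetime(2007, 11, 29),
     "CERTNUM": "C001"},
    {"MLN": "Jones", "MFN": "Eve", "MDOB": datetime(1980, 7, 5),
     "CLN": "Jones", "CFN": "Sam", "CDOB": datetime(2006, 1, 15),
     "CERTNUM": "C002"},
]

merged = exact_match(seta, setb)
```

The first pair matches once case, blanks, and the time of day are ignored; the second does not, because the mother's birthdate differs by a day. That second kind of difference is what an exact match cannot resolve.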

Art Kendall
Social Research Consultants
Khaleel Hussaini wrote:

> Thank you for your response. There are no duplicate cases. The problem
> essentially is how does one match files without unique ids? If there
> are unique ids in more than one database that correspond to an individual
> X and each file stores information about the individual X then
> matching the data file based on unique ids is easy. However, the
> problem arises when the information available is either string or a
> date variable (or both) and one has to match X to X using the criteria
> that will result in a probabilistic match. In my case it would be that
> the records for individual X agree not only on the mother's last name,
> first name, and dob, but also on the child's last name, first name, and dob.
> The purpose of doing this is that I have mother and child's records in
> two databases and one database has more information than the other. I
> want to build a new dataset that contains information on variables
> from both datasets essentially matching and merging two datasets into
> one, but only those cases that are common to both datasets. Best,
> KH
>
>
> On 3/17/08, Art Kendall wrote:
>
> We need a better description of your problem.
>
> make a copy of your files as they stand for backup.
> For both files determine how frequently the 6 variables constitute
> unique identification of cases.
> click <data>
> click <find duplicate cases>
> enter the 6 variable as unique identifier set.
> paste the syntax
> run the syntax.
>
> Is it possible you could accomplish what you want by aggregating each
> file so that there is a summary record for each unique combination of
> the six variables?
>
>
> Art Kendall
> Social Research Consultants

Bob Schacht-3

Re: matching data files without unique ids

At 08:57 AM 3/18/2008, Art Kendall wrote:

>I still do not understand what the problem is.
>Match files can use several variables that have to be the same to create
>a match.
>E.G.,
>match files
> file=seta /in = ina
> file=setb /in = inb
>/by MLN MFN MDOB CLN CFN CDOB
>select if ina eq 1 and inb eq 1.
>. . .
>Why a _probabilistic_ match?

"Fuzzy logic" models are, I suppose, based on the idea that some of the
data may be wrong, and if two records match on, say, 5 out of 6 criteria
(or, "a sufficiently large number of variables"), then the records must be
associated with the same case, and the value for the sixth variable must be
in error. I would suggest exercising considerable caution in this regard.
For example, 5/6 might produce too many matches. Or it might produce a
false match. All such "probabilistic" matches should be tagged for future
reference so that they can easily be removed if challenged.

My own experience in this regard was with trying to find a combination of
variables to serve as the key field (in the aggregate.) Strings of
variables that I thought surely would be unique turned out not to be unique
(i.e., identified more than one case.) But of course that's a different
scenario.

Take care!

Bob Schacht

Robert M. Schacht, Ph.D. <[hidden email]>
Pacific Basin Rehabilitation Research & Training Center
1268 Young Street, Suite #204
Research Center, University of Hawaii
Honolulu, HI 96814


Art Kendall

Re: matching data files without unique ids

I originally suggested using aggregate or "find duplicate cases" to see
if the 6 variables produced duplicates. The OP said they did not.

Art Kendall
Social Research Consultants


Dennis Deck

Re: matching data files without unique ids

In reply to this post by Khaleel Hussaini

There is a literature on matching with incomplete information using
deterministic and/or probabilistic approaches outside of the SPSS
Listserve. Art says there are no duplicates so this may be overkill
here - it may be sufficient to run MATCH FILES using all the ID
variables you have in the two files.

However, if there is missing or inconsistent data among these variables
then you may need to consider one of the more sophisticated methods. CDC
provides such a matching program for free for such a purpose.

Dennis Deck, PhD
RMC Research Corporation
111 SW Columbia Street, Suite 1200
Portland, Oregon 97201-5843
voice: 503-223-8248 x715
voice: 800-788-1887 x715
fax: 503-223-8248
[hidden email]


Albert-Jan Roskam

Re: matching data files without unique ids

Hi,

Other free probabilistic linkage programs include:
--Febrl (Python based, now has a GUI)
--The Link King (SAS based).

I am currently using Febrl. It's well-documented, and
because it's open source you're free (and even
encouraged) to change or improve the program. You can
find it on sourceforge.net.

Another option would be to use "n-1 deterministic
matching": allow one of the matching variables to be a
non-match.
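
A minimal sketch of what "n-1 deterministic matching" could look like, written in Python rather than SPSS syntax. All records and CERTNUM values are hypothetical. Pairs that agree on only five of the six variables are tagged (exact = False) so they can be reviewed or removed later, as suggested earlier in this thread. Ambiguous cases with more than one candidate are skipped here, which is just one possible design choice.

```python
FIELDS = ("MLN", "MFN", "MDOB", "CLN", "CFN", "CDOB")

def agreement(a, b):
    """Count how many of the six matching fields agree exactly."""
    return sum(a[f] == b[f] for f in FIELDS)

def n_minus_1_match(seta, setb):
    """Pair each seta record with the single setb record that agrees on
    at least five of the six fields, tagging pairs that are not exact."""
    pairs = []
    for a in seta:
        candidates = [b for b in setb if agreement(a, b) >= len(FIELDS) - 1]
        if len(candidates) != 1:
            continue  # no candidate, or ambiguous: leave for manual review
        b = candidates[0]
        pairs.append({**a, "CERTNUM": b["CERTNUM"],
                      "exact": agreement(a, b) == len(FIELDS)})
    return pairs

def rec(mln, mfn, mdob, cln, cfn, cdob, **extra):
    """Helper for building the hypothetical example records below."""
    return dict(zip(FIELDS, (mln, mfn, mdob, cln, cfn, cdob)), **extra)

seta = [
    rec("SMITH", "ANN", "1975-08-13", "SMITH", "TOM", "2007-11-29"),
    rec("JONES", "EVE", "1980-07-04", "JONES", "SAM", "2006-01-15"),
    rec("BROWN", "KAY", "1982-03-02", "BROWN", "LEE", "2005-06-30"),
]
setb = [
    rec("SMITH", "ANN", "1975-08-13", "SMITH", "TOM", "2007-11-29", CERTNUM="C001"),
    rec("JONES", "EVE", "1980-07-05", "JONES", "SAM", "2006-01-15", CERTNUM="C002"),
    rec("BROWNE", "KAYE", "1982-03-02", "BROWN", "LEE", "2005-06-30", CERTNUM="C003"),
]

pairs = n_minus_1_match(seta, setb)
```

The comparison is all-against-all, so it is only practical for small files; larger files would need some form of blocking, for example comparing only records that share a child's date of birth.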

Cheers!!
Albert-Jan


Art Kendall

Re: matching data files without unique ids

In reply to this post by Art Kendall

How many cases do you have in each file?
Might you also have gender for the child?

I have a client coming in a few minutes and cannot fully develop and
test this idea.

A start might be to write two new files with just the 6 variables, a
flag variable to indicate which file the record came from, and the
sequence number in its original file.
Use this to test and debug a file of syntax that you can apply to the
whole data set.

Then something like this:

* List the cases where mother's and child's last names differ.
temporary.
select if mln ne cln.
list.

* From that listing, generate a set of specific data patch statements.
* OLDMLN and OLDCLN must be declared as strings first (A30 is only an example width).
STRING OLDMLN OLDCLN (A30).
DO IF fileflag eq 1 and
ANY(caseseq, 101, 222, 333, 1234).
COMPUTE OLDMLN = MLN.
COMPUTE MLN = CLN.
ELSE IF
fileflag eq 1 and
ANY(caseseq, 121, 212, 335, 2341).
COMPUTE OLDCLN = CLN.
COMPUTE CLN = MLN.
END IF.
* Do the same for fileflag eq 2.

AUTORECODE VARIABLES= MLN CLN
/INTO N_MLN N_CLN
/BLANK= MISSING
/SAVE TEMPLATE='filespec1'
/GROUP
/PRINT.

Using judgment, decide which last names are possibly the same by looking back at the other variables and relevant cases in the input.

recode n_MLN (17, 23=17) (24,28=28) (else=copy) into N2_MLN.
recode n_CLN (98,99=99)(101,222=101)(else=copy) into N2_CLN.

AUTORECODE VARIABLES= MFN CFN
/INTO N_MFN N_CFN
/BLANK= MISSING
/SAVE TEMPLATE='filespec2'
/PRINT.

Using judgment, decide which mother (child) first names are possibly the same by looking back at the other variables and relevant cases in the input.

recode n_mfn ... into n2_mfn.
recode n_cfn ... into n2_cfn.

Then open the original files and patch them, cannibalizing the patches above, and autorecode them using the templates in filespec1 and filespec2.

Then match the files using n2_MFN n2_MLN etc.
Examine the results, then go back to the beginning and tweak the process.
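
The "using judgment" steps above could be supported by a similarity score that shortlists name pairs for manual review. The following is only a sketch in Python (not SPSS syntax) using difflib from the standard library. The 0.7 threshold is an arbitrary assumption, and the names are the ones from the original poster's Arroyo example in this thread.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Return a 0..1 similarity ratio, ignoring case and surrounding blanks."""
    return SequenceMatcher(None, a.strip().upper(), b.strip().upper()).ratio()

def review_candidates(names_a, names_b, threshold=0.7):
    """List non-identical name pairs similar enough to deserve a manual look."""
    return [(a, b) for a in names_a for b in names_b
            if a != b and similarity(a, b) >= threshold]

# Names from the example record pair posted in this thread.
pairs = review_candidates(["Arroyo", "Letici", "George"],
                          ["Arroya", "Leticia", "Jorge"])
for a, b in pairs:
    print(a, "~", b, round(similarity(a, b), 2))
```

A score like this only proposes candidates. Whether two spellings really belong to the same person still has to be decided by looking at the other variables, as described above.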

Art Kendall
Social Research Consultants.

autorecode MLN and CLN

Khaleel Hussaini wrote:

> Thanks for all your responses. The dataset example is given below. Now
> if we were to do an exact match this would not be matched in SPSS, I
> presume. How would we go about matching these datasets, as the only
> condition that is equivalent is MDOB and CDOB? There are cases in the
> databases where the date of birth for either mother and/or child is
> incorrect. I know that the data in dataset 2 is most reliable. How do I
> then match the files? I am aware of the CDC software; however, due to
> data integrity and security issues I am unable to download it on my
> work computer.
> KH.
>
> Dataset1
>
> MFN     MLN      MDOB       CLN     CFN     CDOB
> Arroyo  Letici   8/13/1975  Arroyo  George  11/29/2007
>
> Dataset2
>
> MFN     MLN      MDOB       CLN     CFN     CDOB
> Arroya  Leticia  8/13/1975  Arroyo  Jorge   11/29/2007
>
> MLN = Mother's Last Name
> MFN = Mother's First Name
> MDOB = Mother's Date of Birth
> CLN = Child's last name
> CFN = Child's first name
> CDOB = Child's birthdate
>
> I th


Johnny Amora

Extracting year from a lengthy string variable

In reply to this post by Art Kendall

A list of publications consists of more than 3000
articles presented in html format. The first 2 entries
are as follows:

1. Abbott, A. A. (2003). A confirmatory factor
analysis of the Professional Opinion Scale: A
values assessment instrument. Research on Social
Work Practice, 13(5), 641-666.
2. Abbott, G. N., White, F. A., & Charles, M. A.
(2005). Linking values and organizational
commitment: A correlational and experimental
investigation in two organizations. Journal of
Occupational and Organizational Psychology, 78,
531-551.

I want to create spss database for such data. My
variables would be author, year of publication, title,
and title of journal. Do you know how to do it using
spss syntax?

Thank you.

Johnny


Florio Arguillas

Re: Extracting year from a lengthy string variable

Johnny,

Try this. For an explanation of the functions and the behavior of
scratch variables (those that start with #), see the SPSS Help documentation.

The assumptions in this code are:
1. The name of the variable containing the list of publications is A.
2. And that each value starts with the author name right away (i.e.,
no record numbers preceding the Author name -- meaning the 1
preceding Abbott should not be included in variable A).

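* Author: everything before the first opening parenthesis.
STRING AUTHOR (A200).
COMPUTE AUTHOR = SUBSTR(A,1,CHAR.INDEX(A,'(')-1).

* Year: the four characters after the opening parenthesis.
COMPUTE YEAR = NUMBER(SUBSTR(A,CHAR.INDEX(A,'(')+1,4),F4.0).
EXE.

STRING TITLE (A300).
STRING JOURNAL1 (A300).
STRING JOURNAL2 (A300).
STRING #AAA (A300).
STRING #AA (A300).
* #AA holds the text after the year, with the leading blank trimmed.
COMPUTE #AA= LTRIM(SUBSTR(A,CHAR.INDEX(A,'(')+7)).
* Title: up to the first period.
COMPUTE TITLE = SUBSTR(#AA,1,CHAR.INDEX(#AA,'.')-1).
* JOURNAL1 keeps journal, volume, and pages; JOURNAL2 keeps the journal name only.
COMPUTE JOURNAL1 = LTRIM(SUBSTR(#AA,CHAR.INDEX(#AA,'.')+1)).
COMPUTE #AAA= LTRIM(SUBSTR(#AA,CHAR.INDEX(#AA,'.')+1)).
COMPUTE JOURNAL2 = SUBSTR(#AAA,1,CHAR.INDEX(#AAA,',')-1).
EXE.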
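
The same splitting logic can be tried outside SPSS. Below is a rough Python sketch (not SPSS syntax) using a regular expression. It assumes every entry follows the layout "Authors (YYYY). Title. Journal, volume, pages." and, unlike the syntax above, it tolerates a leading record number. Like the syntax above, it will cut a title short if the title itself contains a period followed by a blank.

```python
import re

# Assumed layout: "Authors (YYYY). Title. Journal, volume, pages."
ENTRY = re.compile(
    r"^\s*(?:\d+\.\s+)?"          # optional leading record number such as "1. "
    r"(?P<author>.+?)\s*"         # authors, up to the parenthesised year
    r"\((?P<year>\d{4})\)\.\s*"   # four-digit year, e.g. "(2003)."
    r"(?P<title>.+?)\.\s+"        # title, up to the first period plus blank
    r"(?P<journal>[^,]+),"        # journal, up to the first comma
)

def parse_entry(text):
    """Split one reference into author, year, title and journal, or return None."""
    m = ENTRY.match(" ".join(text.split()))  # collapse line breaks and blank runs
    if m is None:
        return None
    fields = m.groupdict()
    fields["year"] = int(fields["year"])
    return fields

# First entry from the question, with its original line breaks.
entry = """1. Abbott, A. A. (2003). A confirmatory factor
analysis of the Professional Opinion Scale: A
values assessment instrument. Research on Social
Work Practice, 13(5), 641-666."""

parsed = parse_entry(entry)
print(parsed)
```

Entries that do not fit the assumed layout return None, which makes it easy to list them for manual handling.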

Hope this helps.

Florio

At 12:34 AM 3/22/2008, John Amora wrote:

>A list of publications consists of more than 3000
>articles presented in html format. The first 2 entries
>are as follows:
>
>1. Abbott, A. A. (2003). A confirmatory factor
> analysis of the Professional Opinion Scale: A
> values assessment instrument. Research on Social
> Work Practice, 13(5), 641-666.
>2. Abbott, G. N., White, F. A., & Charles, M. A.
> (2005). Linking values and organizational
> commitment: A correlational and experimental
> investigation in two organizations. Journal of
> Occupational and Organizational Psychology, 78,
> 531-551.
>
>I want to create spss database for such data. My
>variables would be author, year of publication, title,
>and title of journal. Do you know how to do it using
>spss syntax?
>
>Thank you.
>
>Johnny