sorting out a nested data structure

classic Classic list List threaded Threaded
8 messages Options
Reply | Threaded
Open this post in threaded view
|

sorting out a nested data structure

Terry Westover

Hello SPSS users –

 

We are working on a statewide study of subsidized childcare.  We have a large data set with a nested structure - children nested within families.  Each row (record) is an individual child. Each record contains a family ID and a variable identifying which agency is providing care for that individual child –  each childcare agency assigns each family a unique (to that agency) identifier but there is no statewide system of issuing a unique family identifier.  So, families may have children receiving services from more than one agency and thus have multiple family IDs and we need to be able to identify all programs/agencies that are serving each family.  The family is our unit of analysis.

 

We can, using  concatenation and duplicate functions, identify duplicate children across agencies (e.g. with different family ids) – this unique child id is a string variable - but we still have a few problems to solve that I hope you can help with:

1.       Because the dataset is so large, manually combing thru the duplicates to assign our own unique family or child identifiers is not practical. 

2.       Is there an “assign” function that will automate assigning unique ids to children using the string variable we have constructed?

3.       Once we figure out how to assign unique child identifiers we are still faced with the problem of finding some automated way of grouping all the children within families so each family has a unique identifier, regardless of how many individual agencies/programs are providing services to the children w/i that family.  Since the family is our unit of analysis this is critical.   Any suggestions?

 

I realize that most of the queries that come across this listserv are much more sophisticated and apologize for asking what is likely a simple question but it’s one I can’t seem to get my head around.

 

Thanks for any assistance you can provide,

 

 

Theresa (Terry) Westover, Director

Center for Education and Evaluation Services

CRESS, School of Education

1 Shields Ave.

University of California, Davis  95616-8729

530-754-9523 office

530-752-6135 fax

 

Reply | Threaded
Open this post in threaded view
|

Re: sorting out a nested data structure

Art Kendall
I am not sure I understand your request.
see AUTORECODE in <help> to see if that is what you are looking for.

AUTORECODE takes a string variable and assign sequential whole numbers to each value of the string that it finds.
all identical strings will have the same number assigned.

You may want to create a new string variable where your force all characters to be upper case.
string newid (a20)
compute newid = upcase(oldid).
autorecode variables = newid /into idnum.

Art Kendall
Social Research Consultants

On 9/1/2010 2:17 PM, Terry Westover wrote:

Hello SPSS users –

 

We are working on a statewide study of subsidized childcare.  We have a large data set with a nested structure - children nested within families.  Each row (record) is an individual child. Each record contains a family ID and a variable identifying which agency is providing care for that individual child –  each childcare agency assigns each family a unique (to that agency) identifier but there is no statewide system of issuing a unique family identifier.  So, families may have children receiving services from more than one agency and thus have multiple family IDs and we need to be able to identify all programs/agencies that are serving each family.  The family is our unit of analysis.

 

We can, using  concatenation and duplicate functions, identify duplicate children across agencies (e.g. with different family ids) – this unique child id is a string variable - but we still have a few problems to solve that I hope you can help with:

1.       Because the dataset is so large, manually combing thru the duplicates to assign our own unique family or child identifiers is not practical. 

2.       Is there an “assign” function that will automate assigning unique ids to children using the string variable we have constructed?

3.       Once we figure out how to assign unique child identifiers we are still faced with the problem of finding some automated way of grouping all the children within families so each family has a unique identifier, regardless of how many individual agencies/programs are providing services to the children w/i that family.  Since the family is our unit of analysis this is critical.   Any suggestions?

 

I realize that most of the queries that come across this listserv are much more sophisticated and apologize for asking what is likely a simple question but it’s one I can’t seem to get my head around.

 

Thanks for any assistance you can provide,

 

 

Theresa (Terry) Westover, Director

Center for Education and Evaluation Services

CRESS, School of Education

1 Shields Ave.

University of California, Davis  95616-8729

530-754-9523 office

530-752-6135 fax

 

===================== To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD
Art Kendall
Social Research Consultants
Reply | Threaded
Open this post in threaded view
|

Re: sorting out a nested data structure

Maguin, Eugene
In reply to this post by Terry Westover
Terry,

I'd like to make sure that I understand your data structure (and it might
help to post some sample data with the relevant sections shown). I
understand your data this way.


AgencyID FamID KidID  TrueFamID TrueKidID      IdentifyingDataString
A123     1002  B234    100876      2
WW89     A003  QWE2    101023      1
A123     3421  A346    120945      3
UA90     1002  RTYQ    100876      2
A123     3421  A389    120945      2

One family can have one kid at one agency.
Same family can have the same kid at two (or more) different agencies.
Same family can have two (or more) different kids at same agencies.

Does this cover the possibilities.

So you make up what I'm calling the IdentifyingDataString that is composed
of, for instance,
DOB, gender, kid birth certificate first name, etc. Exactly what doesn't
matter per se but just that you have something that is unique to each kid.

Is all this correct?

Gene Maguin




________________________________

From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of
Terry Westover
Sent: Wednesday, September 01, 2010 2:17 PM
To: [hidden email]
Subject: sorting out a nested data structure

Hello SPSS users -

We are working on a statewide study of subsidized childcare.  We have a
large data set with a nested structure - children nested within families.
Each row (record) is an individual child. Each record contains a family ID
and a variable identifying which agency is providing care for that
individual child -  each childcare agency assigns each family a unique (to
that agency) identifier but there is no statewide system of issuing a unique
family identifier.  So, families may have children receiving services from
more than one agency and thus have multiple family IDs and we need to be
able to identify all programs/agencies that are serving each family.  The
family is our unit of analysis.

We can, using  concatenation and duplicate functions, identify duplicate
children across agencies (e.g. with different family ids) - this unique
child id is a string variable - but we still have a few problems to solve
that I hope you can help with:
1.       Because the dataset is so large, manually combing thru the
duplicates to assign our own unique family or child identifiers is not
practical.
2.       Is there an "assign" function that will automate assigning unique
ids to children using the string variable we have constructed?
3.       Once we figure out how to assign unique child identifiers we are
still faced with the problem of finding some automated way of grouping all
the children within families so each family has a unique identifier,
regardless of how many individual agencies/programs are providing services
to the children w/i that family.  Since the family is our unit of analysis
this is critical.   Any suggestions?

I realize that most of the queries that come across this listserv are much
more sophisticated and apologize for asking what is likely a simple question
but it's one I can't seem to get my head around.

Thanks for any assistance you can provide,

Theresa (Terry) Westover, Director
Center for Education and Evaluation Services
CRESS, School of Education
1 Shields Ave.
University of California, Davis  95616-8729
530-754-9523 office
530-752-6135 fax

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD
Reply | Threaded
Open this post in threaded view
|

Re: sorting out a nested data structure

Maguin, Eugene
In reply to this post by Terry Westover
-----Original Message-----
From: Terry Westover [mailto:[hidden email]]
Sent: Wednesday, September 01, 2010 4:34 PM
To: 'Gene Maguin'
Subject: RE: sorting out a nested data structure

Yes, that's basically the situation.  The data look like this (each case is
an individual child):

Family ID (different ones for different agencies)    Kid ID (diff for diff
agencies)       Agency ID   Funding Program ID

We need to get to something like your structure below where we assign a
"true" unique identifier for both the child and family.  Our unit of
analysis is family.

We have developed a variable like your "identifying datastring" by combining
name, birthdate, and zipcode.  So we can identify multiple entries for the
same child.  Where we're a little "stuck" is in figuring out how to use that
variable to modify the data structure so that we capture all the services
information per child and then group all children into the proper "true" or
unique families and assign a family unique identifier.

We need to be able to capture what services each family uses for all of
their children and the total out of pocket costs the family pays for
childcare.

Thanks much for your help
Terry

-----Original Message-----
From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of
Gene Maguin
Sent: Wednesday, September 01, 2010 1:10 PM
To: [hidden email]
Subject: Re: sorting out a nested data structure

Terry,

I'd like to make sure that I understand your data structure (and it might
help to post some sample data with the relevant sections shown). I
understand your data this way.


AgencyID FamID KidID  TrueFamID TrueKidID      IdentifyingDataString
A123     1002  B234    100876      2
WW89     A003  QWE2    101023      1
A123     3421  A346    120945      3
UA90     1002  RTYQ    100876      2
A123     3421  A389    120945      2

One family can have one kid at one agency.
Same family can have the same kid at two (or more) different agencies.
Same family can have two (or more) different kids at same agencies.

Does this cover the possibilities.

So you make up what I'm calling the IdentifyingDataString that is composed
of, for instance,
DOB, gender, kid birth certificate first name, etc. Exactly what doesn't
matter per se but just that you have something that is unique to each kid.

Is all this correct?

Gene Maguin




________________________________

From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of
Terry Westover
Sent: Wednesday, September 01, 2010 2:17 PM
To: [hidden email]
Subject: sorting out a nested data structure

Hello SPSS users -

We are working on a statewide study of subsidized childcare.  We have a
large data set with a nested structure - children nested within families.
Each row (record) is an individual child. Each record contains a family ID
and a variable identifying which agency is providing care for that
individual child -  each childcare agency assigns each family a unique (to
that agency) identifier but there is no statewide system of issuing a unique
family identifier.  So, families may have children receiving services from
more than one agency and thus have multiple family IDs and we need to be
able to identify all programs/agencies that are serving each family.  The
family is our unit of analysis.

We can, using  concatenation and duplicate functions, identify duplicate
children across agencies (e.g. with different family ids) - this unique
child id is a string variable - but we still have a few problems to solve
that I hope you can help with:
1.       Because the dataset is so large, manually combing thru the
duplicates to assign our own unique family or child identifiers is not
practical.
2.       Is there an "assign" function that will automate assigning unique
ids to children using the string variable we have constructed?
3.       Once we figure out how to assign unique child identifiers we are
still faced with the problem of finding some automated way of grouping all
the children within families so each family has a unique identifier,
regardless of how many individual agencies/programs are providing services
to the children w/i that family.  Since the family is our unit of analysis
this is critical.   Any suggestions?

I realize that most of the queries that come across this listserv are much
more sophisticated and apologize for asking what is likely a simple question
but it's one I can't seem to get my head around.

Thanks for any assistance you can provide,

Theresa (Terry) Westover, Director
Center for Education and Evaluation Services
CRESS, School of Education
1 Shields Ave.
University of California, Davis  95616-8729
530-754-9523 office
530-752-6135 fax

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD
Reply | Threaded
Open this post in threaded view
|

Re: sorting out a nested data structure

Mike
If each case represents a kid's data and each kid has a unique
identifier, then you can group all "cases" for a single kid. So,
if "unique_kid_ID" is the name of your variable; a listing of your
data might look like this:

unique_kid_ID (family variables) (agency vars) (funding vars) etc.
100000001    etc
100000001
100000001
100000002
100000002
100000003
100000004
100000004

Above the first four lines correspond to the same kid but for
different agencies, funding sources, etc.  With data in this format
you could come up with an aggregated datafile that summarizes
the variables for the "cases" of each kid or possible use
CASESTOVARIABLES to make a single case for each kid.

But you still need a unique identifier for family (i.e., unique_family_ID)
that will allow you to group different kids together.  Once you
have a unique family id you might use aggregate to get summary
or use CASESTOVARIABLES to create a single case for each
family that contains all of the variables for all kids in the family that
you have data for (though this makes me think that your dataset
would be "sparse", that is, some families may have many kids
but most many have only one or two which would leave variables
with missing values), Consider

Unique_Family_ID Kid01 (kid #1 vars) Kid02 (kid #2 vars) ... Kid0k (kth Kid's vars)

presumably some of the info you used in creating unique_kid_ID will
allow you to create unique_family_ID and then use aggregate or
create a case structure that has variables representing all kids in
the family (then using compute statements to get the summary info
you want).

HTH

-Mike Palij
New York University
[hidden email]


----- Original Message -----
From: "Gene Maguin" <[hidden email]>
To: <[hidden email]>
Sent: Wednesday, September 01, 2010 4:48 PM
Subject: Re: sorting out a nested data structure


> -----Original Message-----
> From: Terry Westover [mailto:[hidden email]]
> Sent: Wednesday, September 01, 2010 4:34 PM
> To: 'Gene Maguin'
> Subject: RE: sorting out a nested data structure
>
> Yes, that's basically the situation.  The data look like this (each case is
> an individual child):
>
> Family ID (different ones for different agencies)    Kid ID (diff for diff
> agencies)       Agency ID   Funding Program ID
>
> We need to get to something like your structure below where we assign a
> "true" unique identifier for both the child and family.  Our unit of
> analysis is family.
>
> We have developed a variable like your "identifying datastring" by combining
> name, birthdate, and zipcode.  So we can identify multiple entries for the
> same child.  Where we're a little "stuck" is in figuring out how to use that
> variable to modify the data structure so that we capture all the services
> information per child and then group all children into the proper "true" or
> unique families and assign a family unique identifier.
>
> We need to be able to capture what services each family uses for all of
> their children and the total out of pocket costs the family pays for
> childcare.
>
> Thanks much for your help
> Terry
>
> -----Original Message-----
> From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of
> Gene Maguin
> Sent: Wednesday, September 01, 2010 1:10 PM
> To: [hidden email]
> Subject: Re: sorting out a nested data structure
>
> Terry,
>
> I'd like to make sure that I understand your data structure (and it might
> help to post some sample data with the relevant sections shown). I
> understand your data this way.
>
>
> AgencyID FamID KidID  TrueFamID TrueKidID      IdentifyingDataString
> A123     1002  B234    100876      2
> WW89     A003  QWE2    101023      1
> A123     3421  A346    120945      3
> UA90     1002  RTYQ    100876      2
> A123     3421  A389    120945      2
>
> One family can have one kid at one agency.
> Same family can have the same kid at two (or more) different agencies.
> Same family can have two (or more) different kids at same agencies.
>
> Does this cover the possibilities.
>
> So you make up what I'm calling the IdentifyingDataString that is composed
> of, for instance,
> DOB, gender, kid birth certificate first name, etc. Exactly what doesn't
> matter per se but just that you have something that is unique to each kid.
>
> Is all this correct?
>
> Gene Maguin
>
>
>
>
> ________________________________
>
> From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of
> Terry Westover
> Sent: Wednesday, September 01, 2010 2:17 PM
> To: [hidden email]
> Subject: sorting out a nested data structure
>
> Hello SPSS users -
>
> We are working on a statewide study of subsidized childcare.  We have a
> large data set with a nested structure - children nested within families.
> Each row (record) is an individual child. Each record contains a family ID
> and a variable identifying which agency is providing care for that
> individual child -  each childcare agency assigns each family a unique (to
> that agency) identifier but there is no statewide system of issuing a unique
> family identifier.  So, families may have children receiving services from
> more than one agency and thus have multiple family IDs and we need to be
> able to identify all programs/agencies that are serving each family.  The
> family is our unit of analysis.
>
> We can, using  concatenation and duplicate functions, identify duplicate
> children across agencies (e.g. with different family ids) - this unique
> child id is a string variable - but we still have a few problems to solve
> that I hope you can help with:
> 1.       Because the dataset is so large, manually combing thru the
> duplicates to assign our own unique family or child identifiers is not
> practical.
> 2.       Is there an "assign" function that will automate assigning unique
> ids to children using the string variable we have constructed?
> 3.       Once we figure out how to assign unique child identifiers we are
> still faced with the problem of finding some automated way of grouping all
> the children within families so each family has a unique identifier,
> regardless of how many individual agencies/programs are providing services
> to the children w/i that family.  Since the family is our unit of analysis
> this is critical.   Any suggestions?
>
> I realize that most of the queries that come across this listserv are much
> more sophisticated and apologize for asking what is likely a simple question
> but it's one I can't seem to get my head around.
>
> Thanks for any assistance you can provide,
>
> Theresa (Terry) Westover, Director
> Center for Education and Evaluation Services
> CRESS, School of Education
> 1 Shields Ave.
> University of California, Davis  95616-8729
> 530-754-9523 office
> 530-752-6135 fax
>
> =====================
> To manage your subscription to SPSSX-L, send a message to
> [hidden email] (not to SPSSX-L), with no body text except the
> command. To leave the list, send the command
> SIGNOFF SPSSX-L
> For a list of commands to manage subscriptions, send the command
> INFO REFCARD
>
> =====================
> To manage your subscription to SPSSX-L, send a message to
> [hidden email] (not to SPSSX-L), with no body text except the
> command. To leave the list, send the command
> SIGNOFF SPSSX-L
> For a list of commands to manage subscriptions, send the command
> INFO REFCARD

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD
Reply | Threaded
Open this post in threaded view
|

Re: sorting out a nested data structure

Barnett, Adrian (DECD)
In reply to this post by Terry Westover

Hi Terry

Your question is not stupid and although your purpose is straightforward, the issue is not simple

 

Your project is similar to a large on I worked on a number of years ago in which we attempted to identify the same people who were clients of a wide variety of service agencies within the health/human services area.

Yours is a classic example of a record-linkage task, and would benefit from some purpose-designed record-linkage software. Record linkage software pulls together your individuals and allow you to assign each of them a unique identifier. Once you had done that you could begin the task of determining family identifiers (not sure what you would do with kids whose parents are separated but share custody and have since started new families – aargh!)

 

The advantage record-linkage software gives you is that it combines and weights all the identifying data you have before presenting you with the best-matching candidates. Well-designed software copes with mis-spelled names and dates where elements have been transposed or the year is out by one etc and gives them partial weights etc. methods which you can apply for yourself in things like SPSS or a database package tend to nudge you towards doing exact matches in a hierarchy – for example, if the family name is a mismatch because of a typo or a phonetic mis-spelling, even though you might have reasonable or even exact matches on given name, date of birth, gender etc, the match is most likely tossed aside because it failed at the first matching variable.

 

I use a program called LinkageWiz and it works pretty well. It is a commercial product ( details at www.linkagewiz.com) so if your budget does not allow for purchasing additional software you might consider the open-source FEBRL linkage software. (LinkageWiz also has a free version limited to datasets of 10,000 cases). FEBRL is developed by the Data Mining Group in the Computer Science department at the Australian National University. http://datamining.anu.edu.au/software/febrl/febrldoc/

 

Whatever method you use, whether record linkage or a hand-built approach in SPSS, you will need to do a lot of checking to get rid of false positives from twins in particular, but also close matches from other family members. I’d also recommend fairly rigorous data cleaning before you begin, correcting or getting rid of absurd dates, plugging missing data in gender, removing garbage characters in names, ensuring consistency of case in names, removing nicknames embedded in given names – for example: JONES, TALLULAH ROSE (aka “SOPHIE”). It will give you much better matches and reduce the amount of checking of matches in what might be called the ‘grey area’ between the obviously good and the obviously bad matches. There are a squillion extremely tedious details to think about, all of which affect how clean a match you will get. Given that your data is coming out of a number of different operational data systems, you can rely on it being dirty, and some systems will be much dirtier than others, so time spent going over each to spot just what sort of unspeakable practices have been followed in entering data will pay dividends, because if you clean the data first, your life will be much easier later.

 

Once you’ve git your data linked, you can clean it up further by imputing data which is present one agency’s collection to plug holes in another’s.

 

Best of luck

 

 

Adrian Barnett

Project Officer

Educational Measurement and Analysis

Data and Educational Measurement

DECS

ph 82261080

 


From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of Terry Westover
Sent: Thursday, 2 September 2010 3:47 AM
To: [hidden email]
Subject: sorting out a nested data structure

 

Hello SPSS users –

 

We are working on a statewide study of subsidized childcare.  We have a large data set with a nested structure - children nested within families.  Each row (record) is an individual child. Each record contains a family ID and a variable identifying which agency is providing care for that individual child –  each childcare agency assigns each family a unique (to that agency) identifier but there is no statewide system of issuing a unique family identifier.  So, families may have children receiving services from more than one agency and thus have multiple family IDs and we need to be able to identify all programs/agencies that are serving each family.  The family is our unit of analysis.

 

We can, using  concatenation and duplicate functions, identify duplicate children across agencies (e.g. with different family ids) – this unique child id is a string variable - but we still have a few problems to solve that I hope you can help with:

1.       Because the dataset is so large, manually combing thru the duplicates to assign our own unique family or child identifiers is not practical. 

2.       Is there an “assign” function that will automate assigning unique ids to children using the string variable we have constructed?

3.       Once we figure out how to assign unique child identifiers we are still faced with the problem of finding some automated way of grouping all the children within families so each family has a unique identifier, regardless of how many individual agencies/programs are providing services to the children w/i that family.  Since the family is our unit of analysis this is critical.   Any suggestions?

 

I realize that most of the queries that come across this listserv are much more sophisticated and apologize for asking what is likely a simple question but it’s one I can’t seem to get my head around.

 

Thanks for any assistance you can provide,

 

 

Theresa (Terry) Westover, Director

Center for Education and Evaluation Services

CRESS, School of Education

1 Shields Ave.

University of California, Davis  95616-8729

530-754-9523 office

530-752-6135 fax

 

Reply | Threaded
Open this post in threaded view
|

Re: sorting out a nested data structure

Jon K Peck

To add to Adrian's excellent note, there are two functions available as extensions in SPSS Statistics for constructing a phonetic representation of names.  They are soundex and nysiis.  You can apply them with the SPSSINC TRANS extension command downloadable from SPSS Developer Central, www.spss.com/devcentral.  You would also need to download the extendedTranforms module and install the Python Essentials.

If surname holds the last name, here is a usage example creating a variable named nysiis.
SPSSINC TRANS RESULT=nysiis TYPE=20
/FORMULA "extendedTransforms.nysiis(surname)".

HTH,
Jon Peck
SPSS, an IBM Company
[hidden email]
312-651-3435



From: "Barnett, Adrian (DECS)" <[hidden email]>
To: [hidden email]
Date: 09/01/2010 06:51 PM
Subject: Re: [SPSSX-L] sorting out a nested data structure
Sent by: "SPSSX(r) Discussion" <[hidden email]>





Hi Terry
Your question is not stupid and although your purpose is straightforward, the issue is not simple
 
Your project is similar to a large on I worked on a number of years ago in which we attempted to identify the same people who were clients of a wide variety of service agencies within the health/human services area.
Yours is a classic example of a record-linkage task, and would benefit from some purpose-designed record-linkage software. Record linkage software pulls together your individuals and allow you to assign each of them a unique identifier. Once you had done that you could begin the task of determining family identifiers (not sure what you would do with kids whose parents are separated but share custody and have since started new families – aargh!)
 
The advantage record-linkage software gives you is that it combines and weights all the identifying data you have before presenting you with the best-matching candidates. Well-designed software copes with mis-spelled names and dates where elements have been transposed or the year is out by one etc and gives them partial weights etc. methods which you can apply for yourself in things like SPSS or a database package tend to nudge you towards doing exact matches in a hierarchy – for example, if the family name is a mismatch because of a typo or a phonetic mis-spelling, even though you might have reasonable or even exact matches on given name, date of birth, gender etc, the match is most likely tossed aside because it failed at the first matching variable.
 
I use a program called LinkageWiz and it works pretty well. It is a commercial product ( details at www.linkagewiz.com) so if your budget does not allow for purchasing additional software you might consider the open-source FEBRL linkage software. (LinkageWiz also has a free version limited to datasets of 10,000 cases). FEBRL is developed by the Data Mining Group in the Computer Science department at the Australian National University. http://datamining.anu.edu.au/software/febrl/febrldoc/
 
Whatever method you use, whether record linkage or a hand-built approach in SPSS, you will need to do a lot of checking to get rid of false positives from twins in particular, but also close matches from other family members. I’d also recommend fairly rigorous data cleaning before you begin, correcting or getting rid of absurd dates, plugging missing data in gender, removing garbage characters in names, ensuring consistency of case in names, removing nicknames embedded in given names – for example: JONES, TALLULAH ROSE (aka “SOPHIE”). It will give you much better matches and reduce the amount of checking of matches in what might be called the ‘grey area’ between the obviously good and the obviously bad matches. There are a squillion extremely tedious details to think about, all of which affect how clean a match you will get. Given that your data is coming out of a number of different operational data systems, you can rely on it being dirty, and some systems will be much dirtier than others, so time spent going over each to spot just what sort of unspeakable practices have been followed in entering data will pay dividends, because if you clean the data first, your life will be much easier later.
 
Once you’ve git your data linked, you can clean it up further by imputing data which is present one agency’s collection to plug holes in another’s.
 
Best of luck
 
 
Adrian Barnett
Project Officer
Educational Measurement and Analysis
Data and Educational Measurement
DECS
ph 82261080
 



From: SPSSX(r) Discussion [[hidden email]] On Behalf Of Terry Westover
Sent:
Thursday, 2 September 2010 3:47 AM
To:
[hidden email]
Subject:
sorting out a nested data structure

 
Hello SPSS users –
 
We are working on a statewide study of subsidized childcare.  We have a large data set with a nested structure - children nested within families.  Each row (record) is an individual child. Each record contains a family ID and a variable identifying which agency is providing care for that individual child –  each childcare agency assigns each family a unique (to that agency) identifier but there is no statewide system of issuing a unique family identifier.  So, families may have children receiving services from more than one agency and thus have multiple family IDs and we need to be able to identify all programs/agencies that are serving each family.  The family is our unit of analysis.
 
We can, using  concatenation and duplicate functions, identify duplicate children across agencies (e.g. with different family ids) – this unique child id is a string variable - but we still have a few problems to solve that I hope you can help with:
1.       Because the dataset is so large, manually combing thru the duplicates to assign our own unique family or child identifiers is not practical.  
2.       Is there an “assign” function that will automate assigning unique ids to children using the string variable we have constructed?
3.       Once we figure out how to assign unique child identifiers we are still faced with the problem of finding some automated way of grouping all the children within families so each family has a unique identifier, regardless of how many individual agencies/programs are providing services to the children w/i that family.  Since the family is our unit of analysis this is critical.   Any suggestions?
 
I realize that most of the queries that come across this listserv are much more sophisticated and apologize for asking what is likely a simple question but it’s one I can’t seem to get my head around.
 
Thanks for any assistance you can provide,
 
 
Theresa (Terry) Westover, Director
Center for Education and Evaluation Services
CRESS, School of Education
1 Shields Ave.
University of California, Davis  95616-8729
530-754-9523 office
530-752-6135 fax
 

Reply | Threaded
Open this post in threaded view
|

Re: sorting out a nested data structure

Richard Ristow
In reply to this post by Terry Westover
At 02:17 PM 9/1/2010, Terry Westover wrote:

(It's a few hours since I've been able to check mail, and I can neither check nor send at the moment; so, apologies if I'm way behind in this thread.)

I realize that most of the queries that come across this listserv are much more sophisticated and apologize for asking what is likely a simple question but it’s one I can’t seem to get my head around.

In the first place, we're here to answer questions, whatever they are. In the second, as you've seen by responses, this is actually a data-handling problem of some subtlety.

We are working on a statewide study of subsidized childcare.  We have a large data set with a nested structure - children nested within families.  Each row (record) is an individual child. Each record contains a family ID and a variable identifying which agency is providing care for that individual child –  each childcare agency assigns each family a unique (to that agency) identifier but there is no statewide system of issuing a unique family identifier.  So, families may have children receiving services from more than one agency and thus have multiple family IDs and we need to be able to identify all programs/agencies that are serving each family.  The family is our unit of analysis.
 
We can, using  concatenation and duplicate functions, identify duplicate children across agencies (e.g. with different family ids) – this unique child id is a string variable:
2.       Is there an “assign” function that will automate assigning unique ids to children using the string variable we have constructed?

I'm missing what you need, here. Do you already have "a unique child id, [which] is a string variable", or do you need to construct one?

If the latter, given that you can identify duplicate children (I take it, that means you can recognize when two records represent the same child, even though the records are from different agencies), then if you sort the data so all records for each child are together in the file (I presume that's possible), and you have a way (with LAG or something) to determine when a record represents the same child as its predecessor, than something like (untested)

NUMERIC Our_Child_ID(F6).
LEAVE   Our_Child_ID.
DO IF   $CASENUM EQ 1.
.  COMPUTE Our_Child_ID = 1.
ELSE IF NOT <same child as previous record>.
.  COMPUTE Our_Child_ID = Our_Child_ID + 1.
END IF.
 
That assigns a numeric ID, which is easier to calculate. You can convert it to a string using the STRING function, but I'd see no need to, unless I'm misunderstanding your needs.

1.       Because the dataset is so large, manually combing thru the duplicates to assign our own unique family or child identifiers is not practical.

For child identifiers, see above.

To construct family identifiers, I'd start with the view that you have a single family identifying key, consisting of an agency ID and that agency's family identifier. Your problem then is, a family may be in the file under several keys, and you want to recognize which (different) keys refer to the same family.

3.       Once we figure out how to assign unique child identifiers we are still faced with the problem of finding some automated way of grouping all the children within families so each family has a unique identifier, regardless of how many individual agencies/programs are providing services to the children w/i that family.  Since the family is our unit of analysis this is critical.   Any suggestions?

If I read you correctly, the only way you can tell if two family IDs represent the same family is if some child has both of those IDs. If that's so, I don't see a way to recognize families where some of the children are served by one group of agencies, others are served by another group, but there's no child served by any agency that's in both groups.

However, you can use children that are served by multiple agencies to identify families that are served by multiple agencies by a transitive-closure process (hi, Jon!). It identifies a family by the earliest-sorting child ID within that family, by repeating the following two steps until there is no further change:

A. WITHIN EACH AGENCY, assign all children in any family the lowest ID of any child in that family.

B. Give all records for EACH CHILD the lowest ID associated with any record for that child, in any agency.

These can each be carried out by a simple AGGREGATE. They must be repeated, because carrying out either of them can invalidate the other; however, after a certain number of steps there will be no further change.

Example: Suppose you have families Alpha, Beta, Gamma, and Delta, each with several children and receiving services from several agencies, i.e.
|-----------------------------|---------------------------|
|Output Created               |09-SEP-2010 00:51:35       |
|-----------------------------|---------------------------|
[Families]
Family Child Child_ID Agency Agency_Family_ID

Alpha     1  A123     A      PQ345
Alpha     1  A123     B      67890
Alpha     2  B456     B      67890
Beta      1  B789     A      RS234
Beta      2  C987     A      RS234
Beta      2  C987     B      12345
Gamma     3  C123     A      AC987
Gamma     4  C987     A      AC987
Gamma     2  B890     B      23456
Gamma     3  C123     B      23456
Gamma     1  A789     C      01ABC
Gamma     2  B890     C      01ABC
Delta     1  A456     A      FG987
Delta     2  A987     A      FG987
Delta     2  A987     B      98765
Delta     3  C101     B      98765
Delta     4  C456     C      01DEF

Number of cases read:  17    Number of cases listed:  17

but your files have only the record of services by child and agency, with family identified only within agency,
|-----------------------------|---------------------------|
|Output Created               |09-SEP-2010 00:51:37       |
|-----------------------------|---------------------------|
[Cases]
Agency Agency_Family_ID Child_ID

A      AC987            C123
A      AC987            C987
A      FG987            A456
A      FG987            A987
A      PQ345            A123
A      RS234            B789
A      RS234            C987
B      12345            C987
B      23456            B890
B      23456            C123
B      67890            A123
B      67890            B456
B      98765            A987
B      98765            C101
C      01ABC            A789
C      01ABC            B890
C      01DEF            C456

Number of cases read:  17    Number of cases listed:  17

then, first set up a dataset in which the 'family ID' is the same as the child ID:

STRING  Pass           (A2).
COMPUTE Pass          = '0 '.
STRING  Our_Family_ID  (A4).
COMPUTE Our_Family_ID = Child_ID.

DATASET NAME Reconstruct WINDOW=FRONT.
DATASET COPY Log.
LIST.

List
|-----------------------------|---------------------------|
|Output Created               |09-SEP-2010 00:51:38       |
|-----------------------------|---------------------------|
[Reconstruct]
Agency Agency_Family_ID Child_ID Pass Our_Family_ID

A      AC987            C123     0    C123
A      AC987            C987     0    C987
A      FG987            A456     0    A456
A      FG987            A987     0    A987
A      PQ345            A123     0    A123
A      RS234            B789     0    B789
A      RS234            C987     0    C987
B      12345            C987     0    C987
B      23456            B890     0    B890
B      23456            C123     0    C123
B      67890            A123     0    A123
B      67890            B456     0    B456
B      98765            A987     0    A987
B      98765            C101     0    C101
C      01ABC            A789     0    A789
C      01ABC            B890     0    B890
C      01DEF            C456     0    C456

Number of cases read:  17    Number of cases listed:  17

and repeat the following steps as many times as needed:
*  ................................................................. .
*  Then, iterate two steps until there are no more changes:          .
*  A.  Identify families WITHIN AGENCIES, by assigning each child    .
*      the lowest ID of any child in the same family in the same     .
*      agency.                                                       .
*  B.  Make family keys consistent WITHIN CHILDREN, by assigning     .
*      each child the lowest ID given that child within any agency.  .

*  Pass 1:                                                           .
*  A.  Assign each child the lowest family ID of any child in the    .
*      same family in the same agency.                               .

COMPUTE Pass = '1A'.
AGGREGATE OUTFILE=* MODE=ADDVARIABLES OVERWRITE=YES
  /BREAK=Agency Agency_Family_ID
  /Our_Family_ID = MIN(Our_Family_ID).

*  B.  Assign each child the lowest family ID given that child in    .
*      any agency's records                                          .

COMPUTE Pass = '1B'.
AGGREGATE OUTFILE=* MODE=ADDVARIABLES OVERWRITE=YES
  /BREAK=Child_ID
  /Our_Family_ID = MIN(Our_Family_ID).

How many times are 'needed'? Well, three times is enough for this test data, and I wrote the test code accordingly. But it would be better to issue steps A. and B. from a loop within Python; and after each loop pass, check to see whether any ID values have changed, and terminate the loop if none have. (That Python exercise is the next step.)

Here's the result, re-merged with the original data where the families were identified:

DATASET ACTIVATE Families.
DATASET COPY     Compare.
DATASET ACTIVATE Compare WINDOW=FRONT.
SORT CASES
     BY Agency Agency_Family_ID Child_ID.
MATCH FILE
   /FILE=*
   /FILE=Reconstruct
   /BY Agency Agency_Family_ID Child_ID
   /KEEP=Family Our_Family_ID
         Child  Child_ID
         Agency Agency_Family_ID.

SORT CASES
     BY Our_Family_ID Child_ID Agency Agency_Family_ID.
LIST.

List
|-----------------------------|---------------------------|
|Output Created               |09-SEP-2010 00:51:46       |
|-----------------------------|---------------------------|
[Compare]
Family Our_Family_ID Child Child_ID Agency Agency_Family_ID

Alpha  A123             1  A123     A      PQ345
Alpha  A123             1  A123     B      67890
Alpha  A123             2  B456     B      67890
Delta  A456             1  A456     A      FG987
Delta  A456             2  A987     A      FG987
Delta  A456             2  A987     B      98765
Delta  A456             3  C101     B      98765
Gamma  A789             1  A789     C      01ABC
Gamma  A789             2  B890     B      23456
Gamma  A789             2  B890     C      01ABC
Gamma  A789             3  C123     A      AC987
Gamma  A789             3  C123     B      23456
Gamma  A789             4  C987     A      AC987
Beta   A789             2  C987     A      RS234
Beta   A789             2  C987     B      12345
Beta   B789             1  B789     A      RS234
Delta  C456             4  C456     C      01DEF

Number of cases read:  17    Number of cases listed:  17

The families are correctly identified by variable Our_Family_ID, the lowest-valued child ID found in the family. They aren't listed in their originial alphabetical order; that wouldn't have been possible. And family "Delta" is not completely recognized; that is, its 4th child (ID 'C456') is not recognized as belonging to the same family. That's because there's no chain of children served by multiple agencies that joins that child to the other three in the family.

Finally, the code actually run also logs the 'family ID' values for all records at each step; here is the path by which the final values were reached:
|-----------------------------|---------------------------|
|Output Created               |09-SEP-2010 00:51:48       |
|-----------------------------|---------------------------|
[Summary]
Age Agency_F
ncy amily_ID Child_ID Pass_0 Pass_1A Pass_1B Pass_2A Pass_2B Pass_3A Pass_3B

A   AC987    C123     C123   C123    B890    B789    A789    A789    A789
A   AC987    C987     C987   C123    B789    B789    B789    A789    A789
A   FG987    A456     A456   A456    A456    A456    A456    A456    A456
A   FG987    A987     A987   A456    A456    A456    A456    A456    A456
A   PQ345    A123     A123   A123    A123    A123    A123    A123    A123
A   RS234    B789     B789   B789    B789    B789    B789    B789    B789
A   RS234    C987     C987   B789    B789    B789    B789    B789    A789
B   12345    C987     C987   C987    B789    B789    B789    B789    A789
B   23456    B890     B890   B890    A789    A789    A789    A789    A789
B   23456    C123     C123   B890    B890    A789    A789    A789    A789
B   67890    A123     A123   A123    A123    A123    A123    A123    A123
B   67890    B456     B456   A123    A123    A123    A123    A123    A123
B   98765    A987     A987   A987    A456    A456    A456    A456    A456
B   98765    C101     C101   A987    A987    A456    A456    A456    A456
C   01ABC    A789     A789   A789    A789    A789    A789    A789    A789
C   01ABC    B890     B890   A789    A789    A789    A789    A789    A789
C   01DEF    C456     C456   C456    C456    C456    C456    C456    C456

Number of cases read:  17    Number of cases listed:  17
=============================
APPENDIX: Test data, and code
=============================
*  C:\Documents and Settings\Richard\My Documents                .
*    \Technical\spssx-l\Z-2010abc\                               .
*    2001-09-01 Westover-sorting out a nested data structure.SPS .

*  In response to posting                            .
*  Date:    Wed, 1 Sep 2010 11:17:16 -0700           .
*  From:    Terry Westover <[hidden email]>  .
*  Subject: sorting out a nested data structure      .
*  To:      [hidden email]                 .


*  The problem is to assign a unique family ID to each family, where .
*  one family may have different IDs in each of several agencies.    .
*  agencies.                                                         .
*                                                                    .
*  By hypothesis, a child may be served by several agencies;         .
*  there's a unique across-agency *child* ID;                        .
*  and there are unique within-agency (but not across-agency)        .
*  *family* IDs;                                                     .
*                                                                    .
*  This solution uses transitive closure to assign each family its   .
*  earliest child ID, so far as that can be determined.              .


*  ................................................................. .
*  .................   Test data               ..................... .
*  (This may be the hardest part of the whole problem.)              .



DATA LIST LIST/
  Family Child  Child_ID Agency Agency_Family_ID
  (A5,    F2,   A4,      A1,     A5).
BEGIN DATA
  Alpha    1    A123      A      PQ345
  Alpha    1    A123      B      67890
  Alpha    2    B456      B      67890
  Beta     1    B789      A      RS234
  Beta     2    C987      A      RS234
  Beta     2    C987      B      12345
  Gamma    3    C123      A      AC987
  Gamma    4    C987      A      AC987
  Gamma    2    B890      B      23456
  Gamma    3    C123      B      23456
  Gamma    1    A789      C      01ABC             
  Gamma    2    B890      C      01ABC
  Delta    1    A456      A      FG987
  Delta    2    A987      A      FG987
  Delta    2    A987      B      98765
  Delta    3    C101      B      98765
  Delta    4    C456      C      01DEF
END DATA.
DATASET NAME Families WINDOW=FRONT.

*  This is the file of families, with children listed in birth order,.
*  each with records for all agencies where they receive services.   .
LIST.

*  ................................................................. .
NEW FILE.
ADD FILE
  /FILE=Families
  /DROP=Family Child
  /KEEP=Agency Agency_Family_ID Child_ID.
SORT CASES
     BY Agency Agency_Family_ID Child_ID.
DATASET NAME Cases    WINDOW=FRONT.

*  This is the file you have to work with. There's no longer a key   .
*  to identify families, except within clients of each agency.       .
LIST.

*  ................................................................. .
*  Re-identify families, as well as possible. The 'family ID' will   .
*  the earliest child ID found in the family.                        .

*  First, set up the file with the new 'family ID' in it. Initially, .
*  the 'family ID' is the same as the child ID.                      .

NEW FILE.
ADD FILE
   /FILE=Cases.

STRING  Pass           (A2).
COMPUTE Pass          = '0 '.
STRING  Our_Family_ID  (A4).
COMPUTE Our_Family_ID = Child_ID.

DATASET NAME Reconstruct WINDOW=FRONT.
DATASET COPY Log.
LIST.

*  ................................................................. .
*  Then, iterate two steps until there are no more changes:          .
*  A.  Identify families WITHIN AGENCIES, by assigning each child    .
*      the lowest ID of any child in the same family in the same     .
*      agency.                                                       .
*  B.  Make family keys consistent WITHIN CHILDREN, by assigning     .
*      each child the lowest ID given that child within any agency.  .
*  (They have to be repeated, because carrying out B. may make A.    .
*  necessary again. It's easy to show the process will terminate.)   .


*  Pass 1:                                                           .
*  A.  Assign each child the lowest family ID of any child in the    .
*      same family in the same agency.                               .

COMPUTE Pass = '1A'.
AGGREGATE OUTFILE=* MODE=ADDVARIABLES OVERWRITE=YES
  /BREAK=Agency Agency_Family_ID
  /Our_Family_ID = MIN(Our_Family_ID).

* ... Record results from this pass                              ... .
.     DATASET ACTIVATE Log        WINDOW=FRONT.
.     ADD FILES
         /FILE=*
         /FILE=Reconstruct
         /BY Agency Agency_Family_ID Child_ID Pass.
.     DATASET ACTIVATE Reconstruct WINDOW=FRONT.
* ...                                                            ... .


*  B.  Assign each child the lowest family ID given that child in    .
*      any agency's records                                          .

COMPUTE Pass = '1B'.
AGGREGATE OUTFILE=* MODE=ADDVARIABLES OVERWRITE=YES
  /BREAK=Child_ID
  /Our_Family_ID = MIN(Our_Family_ID).

* ... Record results from this pass                              ... .
.     DATASET ACTIVATE Log        WINDOW=FRONT.
.     ADD FILES
         /FILE=*
         /FILE=Reconstruct
         /BY Agency Agency_Family_ID Child_ID Pass.
.     DATASET ACTIVATE Reconstruct WINDOW=FRONT.
* ...                                                            ... .

*  Pass 2:                                                           .
*  A.  Assign each child the lowest family ID of any child in the    .
*      same family in the same agency.                               .

COMPUTE Pass = '2A'.
AGGREGATE OUTFILE=* MODE=ADDVARIABLES OVERWRITE=YES
  /BREAK=Agency Agency_Family_ID
  /Our_Family_ID = MIN(Our_Family_ID).

* ... Record results from this pass                              ... .
.     DATASET ACTIVATE Log        WINDOW=FRONT.
.     ADD FILES
         /FILE=*
         /FILE=Reconstruct
         /BY Agency Agency_Family_ID Child_ID Pass.
.     DATASET ACTIVATE Reconstruct WINDOW=FRONT.
* ...                                                            ... .

*  B.  Assign each child the lowest family ID given that child in    .
*      any agency's records                                          .

COMPUTE Pass = '2B'.
AGGREGATE OUTFILE=* MODE=ADDVARIABLES OVERWRITE=YES
  /BREAK=Child_ID
  /Our_Family_ID = MIN(Our_Family_ID).

* ... Record results from this pass                              ... .
.     DATASET ACTIVATE Log        WINDOW=FRONT.
.     ADD FILES
         /FILE=*
         /FILE=Reconstruct
         /BY Agency Agency_Family_ID Child_ID Pass.
.     DATASET ACTIVATE Reconstruct WINDOW=FRONT.
* ...                                                            ... .


*  Pass 3:                                                           .
*  A.  Assign each child the lowest family ID of any child in the    .
*      same family in the same agency.                               .

COMPUTE Pass = '3A'.
AGGREGATE OUTFILE=* MODE=ADDVARIABLES OVERWRITE=YES
  /BREAK=Agency Agency_Family_ID
  /Our_Family_ID = MIN(Our_Family_ID).

* ... Record results from this pass                              ... .
.     DATASET ACTIVATE Log        WINDOW=FRONT.
.     ADD FILES
         /FILE=*
         /FILE=Reconstruct
         /BY Agency Agency_Family_ID Child_ID Pass.
.     DATASET ACTIVATE Reconstruct WINDOW=FRONT.
* ...                                                            ... .

*  B.  Assign each child the lowest family ID given that child in    .
*      any agency's records                                          .

COMPUTE Pass = '3B'.
AGGREGATE OUTFILE=* MODE=ADDVARIABLES OVERWRITE=YES
  /BREAK=Child_ID
  /Our_Family_ID = MIN(Our_Family_ID).

* ... Record results from this pass                              ... .
.     DATASET ACTIVATE Log        WINDOW=FRONT.
.     ADD FILES
         /FILE=*
         /FILE=Reconstruct
         /BY Agency Agency_Family_ID Child_ID Pass.
.     DATASET ACTIVATE Reconstruct WINDOW=FRONT.
* ...                                                            ... .


*  ................................................................. .
*  See how far the reconstruction of original families has succeeded .

DATASET ACTIVATE Families.
DATASET COPY     Compare.
DATASET ACTIVATE Compare WINDOW=FRONT.
SORT CASES
     BY Agency Agency_Family_ID Child_ID.

MATCH FILE
   /FILE=*
   /FILE=Reconstruct
   /BY Agency Agency_Family_ID Child_ID
   /KEEP=Family Our_Family_ID
         Child  Child_ID
         Agency Agency_Family_ID.

SORT CASES
     BY Our_Family_ID Child_ID Agency Agency_Family_ID.

LIST.


*  ................................................................. .
*  Show how final family IDs were reached                            .

DATASET ACTIVATE Log.
DATASET COPY     Summary.
DATASET ACTIVATE Summary  WINDOW=FRONT.
SORT CASES BY Agency Agency_Family_ID Child_ID Pass .

STRING  NewName(A7).
COMPUTE NewName = CONCAT('Pass_',Pass).

CASESTOVARS
 /ID        = Agency Agency_Family_ID Child_ID
 /INDEX     = NewName
 /GROUPBY   = VARIABLE
 /DROP      = Pass.

LIST.

====================To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD