Tomorrow I
have to work on a set of records which are supposed to have unique IDs for people.
There are multiple records per person. But a different number of records per person. In the past I would do something like aggregate outfile=aggfile /break = ID /lo_SSN = min (SSN) /lo_dob = min (dob) /lo_race= min(race) /lo_sex = min(sex) /lo_zip = min(zipcode) /hi_ssn = max (ssn) /hi_dob = max(dob) /hi_race = max(race) /hi_sex= max(sex) /hi_zip = max(zipcode). do repeat redflag = redflag1 to redflag6/ xlo = lo_ssn to lo_zip xhi = hi_ssn to hi_zip. compute redflag = xlo ne xhi. end repeat. temporary. select if sum(redflag1 to redflag6 ne 0). list variables = ID redflag1 to redflag6. Is there an extension command to find ID's that have changes across a series of records in a varlist? Has anyone figured out a way to trick duplicate cases to do this? -- Art Kendall Social Research Consultants
Art Kendall
Social Research Consultants |
Just a minor change to what you have already, aggregate the min (or max) value to the same data-file, then compare that aggregated value to the original values in long format. This will flag any discrepancies.
You do something with sort and lag to give you more info, but that is an easy start. |
In reply to this post by Art Kendall
Dear Art,
Couldn't you use something like aggregate /outfile * mode = addvariables /break = id /check_v1=sd(v1) /check_v2=sd(v2) /check_v3=sd(v3) /check_v4=sd(v4) /check_v5=sd(v5). If the varlist gets long, you can have Python spit out the necessary aggregate command for you. Best, Ruben |
Thank you and Andy for
replying
Yes using add variables is something that I have done and will do. using SD(var) would work for numeric variables. Also, now that I think back I used FIRST and LAST sometimes. The last time I thought about how to do this kind of task was when SPSS made "addvariables" available. Every now and then I think to myself, "this is a task I have done many times over the decades. I wonder if it has been made more automated?" It looks like I'll keep on doing it this old way. Art Kendall Social Research ConsultantsOn 4/29/2013 5:05 AM, Ruben van den Berg [via SPSSX Discussion] wrote: Dear Art,
Art Kendall
Social Research Consultants |
Thank you for your email. Please note that I will be out of the office beginning Monday, April 29th till Wednesday May 1st with limited access to emails. I will respond to your email upon my return to the office on Thursday May 2.
Thank you. Genevieve Odoom Policy and Program Analyst OANHSS 7050 Weston Rd. Suite 700 Woodbridge, On L4L 8G7 Tel: (905) 851-8821 x241 Fax: (905) 851-0744 [hidden email] www.oanhss.org ===================== To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD |
Administrator
|
In reply to this post by Art Kendall
Why not
AGGREGATE OUTFILE * / BREAK ALL/ N=N. COMPUTE RedFlag = ID EQ LAG(ID).
Please reply to the list and not to my personal email.
Those desiring my consulting or training services please feel free to email me. --- "Nolite dare sanctum canibus neque mittatis margaritas vestras ante porcos ne forte conculcent eas pedibus suis." Cum es damnatorum possederunt porcos iens ut salire off sanguinum cliff in abyssum?" |
I'm not sure about the second command. If there are consistencies within a case, it will result in a case group with N > 1. The first case within this group will not be flagged while it should. Perhaps
AGGREGATE OUTFILE * / BREAK ALL/ N=N. AGGREGATE OUTFILE * / BREAK id/ N=N. COMPUTE RedFlag = n gt 1. may get around this? |
In reply to this post by Art Kendall
A different approach -
You could use Aggregate to Break on All, adding the Count to the file; Break on ID, adding the ID-count to the file; then Select and list all the cases where the two counts are not equal. -- Rich Ulrich Date: Sun, 28 Apr 2013 13:19:42 -0700 From: [hidden email] Subject: finding inconsistent within person information To: [hidden email] Tomorrow I have to work on a set of records which are supposed to have unique IDs for people. There are multiple records per person. But a different number of records per person. In the past I would do something like aggregate outfile=aggfile /break = ID /lo_SSN = min (SSN) /lo_dob = min (dob) /lo_race= min(race) /lo_sex = min(sex) /lo_zip = min(zipcode) /hi_ssn = max (ssn) /hi_dob = max(dob) /hi_race = max(race) /hi_sex= max(sex) /hi_zip = max(zipcode). do repeat redflag = redflag1 to redflag6/ xlo = lo_ssn to lo_zip xhi = hi_ssn to hi_zip. compute redflag = xlo ne xhi. end repeat. temporary. select if sum(redflag1 to redflag6 ne 0). list variables = ID redflag1 to redflag6. Is there an extension command to find ID's that have changes across a series of records in a varlist? Has anyone figured out a way to trick duplicate cases to do this? ... |
In reply to this post by Ruben Geert van den Berg
this appears to work.
new file. data list list /id (f2) year (f4) gender (a1) race (a1)yearborn(f4). begin data 01 1990 M W 1950 01 1991 M W 1950 01 1992 M W 1950 01 1993 F W 1950 01 1994 M W 1950 01 1995 M W 1950 02 1991 F W 1960 02 1992 F W 1975 02 1993 F W 1960 02 1994 F W 1960 02 1995 F B 1960 02 1996 F W 1960 03 1992 F W 1972 03 1993 F W 1972 03 1994 F W 1972 03 1995 F W 1972 04 1990 M W 1968 04 1991 M W 1968 04 1992 M W 1968 04 1993 M W 1968 04 1994 M W 1968 04 1995 M W 1968 end data. AGGREGATE OUTFILE * / BREAK ID gender Race yearborn/ Kount=N. AGGREGATE OUTFILE * / BREAK id/ Kount=N. COMPUTE RedFlag = Kount gt 1. list. Art Kendall Social Research ConsultantsOn 4/29/2013 10:55 AM, Ruben van den Berg [via SPSSX Discussion] wrote: I'm not sure about the second command. If there are consistencies within a case, it will result in a case group with N > 1. The first case within this group will not be flagged while it should. Perhaps
Art Kendall
Social Research Consultants |
Administrator
|
In reply to this post by Ruben Geert van den Berg
Maybe better to use MATCH create FIRST and LAST flags. Both should be 1 (True) unless there are cases with same ID but diffs in the variable values.
Please reply to the list and not to my personal email.
Those desiring my consulting or training services please feel free to email me. --- "Nolite dare sanctum canibus neque mittatis margaritas vestras ante porcos ne forte conculcent eas pedibus suis." Cum es damnatorum possederunt porcos iens ut salire off sanguinum cliff in abyssum?" |
Free forum by Nabble | Edit this page |