My apologies beforehand, this is a lengthy post. I tried to respond to a
previous note from Eugene requesting clarity and data examples. Thank you to all who read and respond. The data I have are stacked by ID number, with numerous entries (cases) per ID, the original data were medical claims, thus each record or entry represented a separate claim. I wish to delete all duplicate records, but I want to retain the data that may exist in the separate cases. For example, the variable "PrimDxGrpr" has 23 possible values (1-23), each representing a diagnostic group, every member ID (person) includes data regarding the diagnostic group. Regardless of the number of duplicate cases per ID, some IDs have only a SINGLE diagnostic group number in all entries, whereas other IDs have multiple diagnostic codes spread across all entries. Again, I wish to have only single entries per ID, but I also wish to retain any unique data that may exist in the duplicate entries. DATA EXAMPLE ID=0818, has 131 entries, for the variable "PrimDxGrpr" this ID has data for 9 diagnostic groups. I recoded PrimDxGrpr into 23 new variables (PrimDxGrpr_01 through PrimDxGrpr_23), and the appropriate data for each diagnostic group carried over to its New corresponding variable. In order to retain all data after I have deleted duplicate records, I need the values in New variable to represent the total number or SUM of all cases with that diagnostic group code. In the case of ID=0818, DxGrpr_01 should reflect the total number of cases that included that diagnostic group code. I have tried TRANSFORM, Count Values within cases, with no success. ===================== To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD |
You have to provide a specific data example showing before and after the transformation you wish to do. Your description implies multiple possible meanings. I noticed that you included a screen shot on your reply to me but I can't access it. And, please reply to the list rather than to me because there others on the list that are very skilled analysts.
Gene maguin -----Original Message----- From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of Joseph Youngblood Sent: Tuesday, April 15, 2014 2:29 PM To: [hidden email] Subject: Removing Duplicate Records but Retaining all Data My apologies beforehand, this is a lengthy post. I tried to respond to a previous note from Eugene requesting clarity and data examples. Thank you to all who read and respond. The data I have are stacked by ID number, with numerous entries (cases) per ID, the original data were medical claims, thus each record or entry represented a separate claim. I wish to delete all duplicate records, but I want to retain the data that may exist in the separate cases. For example, the variable "PrimDxGrpr" has 23 possible values (1-23), each representing a diagnostic group, every member ID (person) includes data regarding the diagnostic group. Regardless of the number of duplicate cases per ID, some IDs have only a SINGLE diagnostic group number in all entries, whereas other IDs have multiple diagnostic codes spread across all entries. Again, I wish to have only single entries per ID, but I also wish to retain any unique data that may exist in the duplicate entries. DATA EXAMPLE ID=0818, has 131 entries, for the variable "PrimDxGrpr" this ID has data for 9 diagnostic groups. I recoded PrimDxGrpr into 23 new variables (PrimDxGrpr_01 through PrimDxGrpr_23), and the appropriate data for each diagnostic group carried over to its New corresponding variable. In order to retain all data after I have deleted duplicate records, I need the values in New variable to represent the total number or SUM of all cases with that diagnostic group code. In the case of ID=0818, DxGrpr_01 should reflect the total number of cases that included that diagnostic group code. I have tried TRANSFORM, Count Values within cases, with no success. ===================== To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD ===================== To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD |
In reply to this post by JosephYoungblood
The first example shows multiple records for the same ID, and the diagnostic
group code assigned. Line # ID PrimDxGrpr 1 0818 1 2 0818 1 3 0818 2 4 0818 2 5 0818 2 6 0818 3 7 0818 3 8 0742 2 9 0742 1 10 0742 3 11 0742 3 12 0742 3 This shows the ideal: New Variables show cell count for each diagnostic group by ID. Line # ID P_DxGp_1 P_DxGp_3 P_DxGp_3 1 0818 2 3 2 2 0742 1 1 3 ===================== To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD |
Oh .... Ok. This:
Aggregate outfile=*/break=id primdxgrpr/ P_DxGp=nu. Yields ID PrimDxGrpr P_DxGp 0818 1 2 0818 2 3 0818 3 2 0742 1 1 0742 2 2 0742 3 3 Then: Casestovars id=id. But before you do that, run a frequencies on primdxgrpr and check that you have cases for every one of your 13 or 23 dx groups because you are using the resulting dx group variable names to also indicate the dx groups themselves. So suppose that's true. Then you get ID PrimDxGrpr.1 PrimDxGrpr.2 PrimDxGrpr.3 P_DxGp.1 P_DxGp.2 P_DxGp.3. 0818 1 2 3 2 3 2 0742 1 2 3 1 2 3 There's some tidying up to do but that is trivial, I think. But, if that is not true, then you will need to do a loop and vector chunk of code to shift the dx counts to the variable name. Gene Maguin -----Original Message----- From: Joseph Youngblood [mailto:[hidden email]] Sent: Tuesday, April 15, 2014 3:18 PM To: [hidden email]; Maguin, Eugene Cc: Joseph Youngblood Subject: Re: Removing Duplicate Records but Retaining all Data The first example shows multiple records for the same ID, and the diagnostic group code assigned. Line # ID PrimDxGrpr 1 0818 1 2 0818 1 3 0818 2 4 0818 2 5 0818 2 6 0818 3 7 0818 3 8 0742 2 9 0742 1 10 0742 3 11 0742 3 12 0742 3 This shows the ideal: New Variables show cell count for each diagnostic group by ID. Line # ID P_DxGp_1 P_DxGp_3 P_DxGp_3 1 0818 2 3 2 2 0742 1 1 3 ===================== To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD |
Administrator
|
In reply to this post by JosephYoungblood
For that example, you could use AGGREGATE followed by CASESTOVARS. E.g.,
* Create a sample data set. DATA LIST list / Line(f5.0) ID(N4.0) PrimDxGrpr (f1). BEGIN DATA 1 0818 1 2 0818 1 3 0818 2 4 0818 2 5 0818 2 6 0818 3 7 0818 3 8 0742 2 9 0742 1 10 0742 3 11 0742 3 12 0742 3 END DATA. FORMATS ID(N4). * Use AGGREGATE to get the counts. AGGREGATE /OUTFILE=* /BREAK=ID PrimDxGrpr /P_DxGp_=N. * Now restructure from LONG to WIDE. CASESTOVARS /ID=ID /INDEX=PrimDxGrpr /SEPARATOR = "" /GROUPBY=VARIABLE. LIST. OUTPUT: ID P_DxGp_1 P_DxGp_2 P_DxGp_3 0742 1 1 3 0818 2 3 2 Number of cases read: 2 Number of cases listed: 2
--
Bruce Weaver bweaver@lakeheadu.ca http://sites.google.com/a/lakeheadu.ca/bweaver/ "When all else fails, RTFM." PLEASE NOTE THE FOLLOWING: 1. My Hotmail account is not monitored regularly. To send me an e-mail, please use the address shown above. 2. The SPSSX Discussion forum on Nabble is no longer linked to the SPSSX-L listserv administered by UGA (https://listserv.uga.edu/). |
In reply to this post by JosephYoungblood
At 02:28 PM 4/15/2014, Joseph Youngblood wrote:
>The data I have are stacked by ID number, with numerous entries >(cases) per ID. Each record or entry represents a separate [medical] claim. > >I wish to delete all duplicate records, but I want to retain the >data that may exist in the separate cases. I'll come back to this, but the first question is, what makes a record a 'duplicate'? It sounds like every record has information not duplicated on any other record -- for example, presumably, the date of the medical claim. So, you aren't removing duplicates; you're summarizing. That's fine; it just needs to be recognized as such. >For example, the variable "PrimDxGrpr" has 23 possible values >(1-23), each representing a diagnostic group ... Here, I presume each diagnostic group is a set of diagnoses, presumably ICD codes. >Every member ID (person) includes data regarding the diagnostic group. I presume, by "Every member ID", you mean every medical claim record. What data is there, regarding the diagnostic group Date of diagnosis? Amount of the medical claim? You say this is data regarding the diagnostic group, but you say each record is about an individual claim, so you could have multiple records for the same person and same diagnostic group, but with different information. Or, am I guessing wrong what is in each of the records? >Regardless of the number of duplicate cases per ID, OK, we're not going to say 'duplicate' anymore. These are not duplicates of the same information; they're separate records with different information. Starting again, >Regardless of the number of xxxxxxxxx cases per ID, some IDs have >only a SINGLE diagnostic group number in all entries, whereas other >IDs have multiple diagnostic codes spread across all entries. In other words, some people have had claims (possibly more than one claim) within only one diagnostic group, others have had claims in several diagnostic groups, perhaps with more than one claim in one or more of these groups. >Again, I wish to have only single entries per ID, but I also wish to retain >any unique data that may exist in the duplicate entries. > >I recoded PrimDxGrpr into 23 new variables (PrimDxGrpr_01 through >PrimDxGrpr_23), and the appropriate data for each diagnostic group >carried over to its New corresponding variable. Now we're getting somewhere. What "appropriate data" are you talking about, that becomes the value of whichever new variable? Date? Cost? Do you really only have one variable's worth of information you want to keep? And, I remind you again: at *this* point, you don't have information about the diagnostic group. You have information about the claim, which falls into one of the diagnostic groups. >In order to retain all data after I have deleted duplicate records, >I need the values in New variable to represent the total number or >SUM of all cases with that diagnostic group code. OK, I'm going to say that you have one quantity you want to add up; I'll call it MedicalQuantity. Start with your original data; forget the 23 new variables. Use code like this: DATASET DECLARE Summary. AGGREGATE OUTFILE=Summary /BREAK=ID PrimDxGrpr /TotQuant 'Sum of whatever "MedicalQuantity" represents' =SUM(MedicalQuantity). DATASET ACTIVATE Summary WINDOW=FRONT. Now you have a long-form summary file. It has one record for each ID for each diagnostic group in which that person had *any* claim. If you really want one record per ID, then this (but not tested, and CASESTOVARS is sometimes hard to get right): CASESTOVARS /ID=ID /INDEX=PrimDxGrpr. And if, as seems likely to me, you need to keep more than one summary quantity per diagnositic group (like, say, the number of claims), say that, and we'll tell you how to refine this l0gic. -Good luck, Richard Ristow ===================== To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD |
Administrator
|
In reply to this post by JosephYoungblood
Joseph,
Please study the AGGREGATE and CASESTOVARS commands! 2 lines of code! David
Please reply to the list and not to my personal email.
Those desiring my consulting or training services please feel free to email me. --- "Nolite dare sanctum canibus neque mittatis margaritas vestras ante porcos ne forte conculcent eas pedibus suis." Cum es damnatorum possederunt porcos iens ut salire off sanguinum cliff in abyssum?" |
In reply to this post by Maguin, Eugene
At 02:45 PM 4/15/2014, Maguin, Eugene wrote:
>You have to provide a specific data example showing before and after >the transformation you wish to do. It does help, doesn't it? Four answers, you got! Looks like all of us thinking pretty much the same thing. Again: the truly confusing word was 'duplicate'. 'Duplicate' records repeat the same information; your 'duplicates' actually have different information, which you want to keep in summary form. ===================== To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD |
Administrator
|
In reply to this post by David Marso
Here's the code I had in mind.
DATA LIST LIST /Line# ID PrimDxGrpr . BEGIN DATA 1 0818 1 2 0818 1 3 0818 2 4 0818 2 5 0818 2 6 0818 3 7 0818 3 8 0742 2 9 0742 1 10 0742 3 11 0742 3 12 0742 3 END DATA. DATASET DECLARE agg. AGGREGATE OUTFILE agg/BREAK ID PrimDxGrpr /P_DxGp_=N. DATASET ACTIVATE agg. CASESTOVARS ID=ID /INDEX=PrimDxGrpr.
Please reply to the list and not to my personal email.
Those desiring my consulting or training services please feel free to email me. --- "Nolite dare sanctum canibus neque mittatis margaritas vestras ante porcos ne forte conculcent eas pedibus suis." Cum es damnatorum possederunt porcos iens ut salire off sanguinum cliff in abyssum?" |
In reply to this post by Maguin, Eugene
Joe,
When you did the frequencies after the aggregate was every one of your dx groups represented? If all of them were, then I don’t understand this, ‘…but the cell values are not the quantities of each Dx code summed for cases.’ Again a specific example is needed. If all of them were not represented, then please re-run the casestovars using either Bruce’s syntax or David’s syntax (I think both are the same but I’ve deleted
Bruce’s yesterday) because theirs is better thought out than mine was, specifically with respect to the index function.
Gene Maguin From: Joseph A. Youngblood [mailto:[hidden email]]
Gene,
Sincerely,
Joseph A. Youngblood
|
In reply to this post by David Marso
Funny note to that syntax, my old SPSS (15) I had to add
FORMAT PrimDxGrpr (F2.0). before aggregating to ensure SPSS could use the column for var names afterwards: the data shows with 2 decimal places and without the FORMAT an error appears and the new var names get replaced with v1 to v3 (as the decimal sign, both "." and "," - for us Europeans - are not valid for use in variable names). |
Free forum by Nabble | Edit this page |