Hi, I'm a new SPSS user and was hoping to get some help with an epidemiology project that I'm working on. I'm using a data set that has multiple values (medical diagnoses) in one variable. So in the variable Diagnosis, there are up to five separate values in each entry, with hundreds of possible values. I could make a new variable for each diagnosis, but like I said there are hundreds of different ones. We are trying to associate different risk factors with each diagnosis. So, what would be a good strategy to separate the different diagnoses?

Thanks
-OJ
OJ,
How is the variable formatted? How are the five diagnoses separated? Do they occupy separate digits or string positions?

Stephen Brand
For personalized and professional consultation in statistics and research design, visit www.statisticsdoc.com
Is the current variable a string? Is each diagnosis a fixed number of characters, like an ICD code? Are there special separator characters between diagnoses? Please post a few examples of what the current variable looks like and what you want the results to be.

Art Kendall
Social Research Consultants
At 05:28 PM 2/3/2007, OJ wrote:
>I could make a new variable for each diagnosis, but like I said there are hundreds of different ones. We are trying to associate different risk factors with each diagnosis. So, what would be a good strategy to separate the different diagnoses?

Art and Stephen Brand ("Statisticsdoc") have asked questions about your data representation. Those are important, but more critical is: what does your data mean, and what do you want to do with it?

Strictly, a "variable with multiple values" is impossible in SPSS: a variable holds exactly one value per case. That's just a feature of SPSS's representation of data. However, in the real world there are frequently categorical concepts where the categories are not mutually exclusive, and one case may have several: Which magazines do you subscribe to? Or, in your case, diagnoses.

In SPSS, these are called 'multiple response' sets or groups. See commands MRSETS and MULT RESPONSE in the Command Syntax Reference, plus documentation for the CTABLES module, if you have it. There are two ways to represent them.

In 'multiple dichotomies', there is one yes/no variable for each category; that's the "new variable for each diagnosis", and indeed it's clumsy when there are many categories.

In 'multiple response groups' or 'multiple category sets', there is a set of categorical variables, each giving one category applicable to the case; or, of course, 'N/A'. The group has enough variables for a reasonable estimate of the maximum number of categories per case. That's usual for diagnoses: variables like DX1 through DX5, or some such. Your data may already be in this form; "variable with multiple values" makes sense as a description, and may simply be a mis-phrasing in SPSS terminology.

As Stephen and Art have said, if your data doesn't have this form, it probably should; and you should tell us how the diagnoses are represented now, so we can help you convert.

Now, the big question: How to analyze your data? You're "trying to associate different risk factors with each diagnosis". You have two conceptual problems:

. With hundreds of possible diagnoses, many of them are probably too rare in your data to assess risk factors. The usual solutions are to drop these and assess only the more common diagnoses; or to combine related diagnosis categories into larger categories, and analyze by those categories.

. With multiple diagnoses per patient, you need a conceptual way to decide which diagnoses should be considered associated with the risk factors you're assessing. A common approach is to drop all but one of the diagnoses (the 'primary diagnosis'), and analyze by the primary diagnosis.

I could imagine associating the risk factors with *all* the diagnoses for a case, in which case you'd 'unroll' your data: create one record for each diagnosis present for the patient. That would raise difficulties, because records from the same patient are clearly not independent. I'm getting out of my depth here; if you like this approach, maybe others can say whether there's a legitimate way to analyze it.

-Good luck,
Richard
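To make the two representations described in this post concrete, here is a minimal sketch in Python rather than SPSS syntax. The patient IDs, the diagnosis names, the DX1 through DX5 width, and both helper functions are assumptions made up for illustration; nothing here is an SPSS command or the poster's own method.

```python
# Hypothetical toy data: patient ID -> list of diagnoses (illustrative names only).
patients = {
    101: ["CATARACT", "ASTIGMATISM"],
    102: ["PRESBYOPIA"],
    103: ["ASTIGMATISM", "PRESBYOPIA", "CATARACT"],
}

MAX_DX = 5  # assumed upper bound on diagnoses per case (DX1 through DX5)


def to_multiple_categories(dx_list, width=MAX_DX):
    """Multiple category set: slots DX1..DXn, each holding one category or None.

    Diagnoses beyond `width` are dropped, so pick a width that fits the data.
    """
    kept = dx_list[:width]
    padded = kept + [None] * (width - len(kept))
    return {f"DX{i + 1}": value for i, value in enumerate(padded)}


def to_multiple_dichotomies(dx_list, all_categories):
    """Multiple dichotomies: one 0/1 flag per possible category."""
    present = set(dx_list)
    return {category: int(category in present) for category in all_categories}


all_categories = sorted({dx for dxs in patients.values() for dx in dxs})
category_rows = {pid: to_multiple_categories(dxs) for pid, dxs in patients.items()}
dichotomy_rows = {
    pid: to_multiple_dichotomies(dxs, all_categories) for pid, dxs in patients.items()
}
```

With three categories the dichotomy form needs three columns; with hundreds of diagnoses it would need hundreds, while the category-set form stays at five columns regardless, which is the clumsiness the post points out.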
Thanks to Richard and everyone else for the clarification. I appreciate the help.

So, I have one variable, and the data set I am using is for ocular diagnoses. It contains all ICD-9 codes dumped into this one variable. So one entry might look like this:

[IMMATURE CATARACT][POSTERIOR CAPSULAR OPACIFICATION][HYPERMETROPIA][ASTIGMATISM][PRESBYOPIA][Pseudophakia (SIO)]

I understand that I can create a program to separate these into separate variables, but then the issue is how to properly analyze these, as mentioned by Richard. I could put these into further categories, such as "Cornea", "Retina", etc., but that would take a lot of manual labor to categorize all of the diagnoses. There are hundreds of diagnoses, maybe on the order of 200-300, and the data set is fairly large (by the end it may be up to a few thousand patients).

I am already looking at select diagnoses by using the 'multiple dichotomies' method you described, but I am interested in some of the rarer diseases. Even if there is a relatively small number of cases for a given diagnosis, I think I could find a significant difference. So, I don't think using a 'primary diagnosis' category is the best way to go about this either. While some of the diagnoses are less important and can be teased out, there could still be three important diagnoses in one case.

Someone mentioned that I could use a piece of code to separate out these diagnoses and then use VARSTOCASES to create up to 5 records from each original record, then CASESTOVARS to create the 'hundreds of diagnoses' flags. Any more suggestions?

Thanks so much
Regards,
-OJ
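The first step OJ mentions, splitting one bracketed entry into separate diagnoses, can be sketched as follows. This is a Python illustration and not SPSS syntax; it assumes that every diagnosis sits inside its own pair of square brackets, that brackets are never nested, and that no diagnosis text contains a square bracket. The function name `split_diagnoses` is hypothetical.

```python
import re

# Matches the text between one "[" and the next "]" (no nested brackets assumed).
BRACKETED = re.compile(r"\[([^\[\]]*)\]")


def split_diagnoses(entry):
    """Return the bracketed diagnoses in their original order, trimmed of
    surrounding whitespace and with empty brackets skipped."""
    return [text.strip() for text in BRACKETED.findall(entry) if text.strip()]


# The sample entry quoted in the post above.
entry = (
    "[IMMATURE CATARACT][POSTERIOR CAPSULAR OPACIFICATION][HYPERMETROPIA]"
    "[ASTIGMATISM][PRESBYOPIA][Pseudophakia (SIO)]"
)
diagnoses = split_diagnoses(entry)
```

Note that the sample entry yields six diagnoses, so it is worth checking the real maximum per entry before fixing the number of target variables.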
At 09:12 PM 2/5/2007, Osamah Saeedi wrote:
>I have one variable and the data set I am using is for ocular diagnoses. [The variable] contains all ICD-9 codes dumped into this one variable. So one entry might look like this:
>
>[IMMATURE CATARACT][POSTERIOR CAPSULAR OPACIFICATION][HYPERMETROPIA][ASTIGMATISM][PRESBYOPIA][Pseudophakia (SIO)]
>
>I understand that I can create a program to separate these into separate variables,

That would be the first step. Parse the string, and get a list of diagnoses in (to start with) the 'multiple categories' form: a list of variables, each with one ICD-9 code. I might start, actually, by writing a separate *record* for each ICD-9 code, with the patient ID and the order of the diagnosis code within the list of diagnoses for the patient.

Then, use MULT RESPONSE (if it's a set of variables) or FREQUENCIES (if it's different records) to get a list of what diagnoses you find, and how frequent they are. That's not what you're looking for in your study, but you need it to get any sense of your data.

>I am already looking at select diagnoses by using the 'multiple dichotomies' method you described, but I am interested in some of the more rare diseases.

I don't think 'multiple dichotomies' is the best representation when there are as many categories as you have - hundreds.

>Even if there is a relatively small number of cases for a given diagnosis, I think I could find a significant difference.

Worth a try. Rare events are hard to analyze, but maybe you've got the data to do it.

Then, you just have the conceptual problem: if a patient has several DXs, what do you say, about which DX is associated with which risk factors? There, you have to think about your conceptual model; and that's more than we can solve quickly, here on the list.
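The "one record per diagnosis" layout suggested above, followed by a frequency listing, might look roughly like this. It is a Python sketch under stated assumptions and not SPSS syntax: the patient IDs and diagnosis lists are invented, `unroll` is a hypothetical helper, and the counting step only plays the role that FREQUENCIES would play on the unrolled file.

```python
from collections import Counter


def unroll(cases):
    """Turn (patient_id, [diagnosis, ...]) pairs into one record per diagnosis:
    (patient_id, order, diagnosis), where order is the 1-based position of the
    diagnosis within that patient's list."""
    for patient_id, dx_list in cases:
        for order, dx in enumerate(dx_list, start=1):
            yield (patient_id, order, dx)


# Hypothetical input: already-parsed diagnosis lists for three patients.
cases = [
    (1, ["IMMATURE CATARACT", "ASTIGMATISM"]),
    (2, ["ASTIGMATISM", "PRESBYOPIA"]),
    (3, ["ASTIGMATISM"]),
]

records = list(unroll(cases))

# Frequency of each diagnosis across all unrolled records.
frequencies = Counter(dx for _, _, dx in records)
for dx, count in frequencies.most_common():
    print(f"{dx}: {count}")
```

Keeping the patient ID on every record is what later lets the diagnoses be joined back to risk factors, and the order field preserves which diagnosis was listed first.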
Hi Osamah,

I'd like to add one more thing that we realized when we conducted a similar kind of investigation. It's good to keep in mind that it's not a computer, but a real doctor who makes the diagnoses. This might mean that the average doctor never writes down more than, say, five diagnoses per visit, and does not write down the same diagnosis at each and every visit. For instance, when a patient has been diagnosed with depressive disorder, he/she might not be re-diagnosed at the next visit(s). We therefore checked the average number of diagnoses per patient. In the end, we used only the first diagnosis of a certain disease.

Second, known comorbidities should also be kept in mind when analyzing your data. For example, anxiety and depressive disorder co-occur more often than they occur alone. Doctor X might write down Anxiety, Y writes down Depression, and Z both. This is important when you analyze risk factors.

Cheers!
Albert-Jan
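Keeping only the first diagnosis of each disease per patient, as described in this post, could be sketched like this. It is a Python illustration with invented data; the record layout (patient ID, visit date, diagnosis) and the function name `first_diagnoses` are assumptions, since the thread does not say how visits are stored.

```python
from datetime import date


def first_diagnoses(visit_records):
    """Keep the earliest visit date for each (patient_id, diagnosis) pair.

    visit_records: iterable of (patient_id, visit_date, diagnosis).
    Returns a sorted list of (patient_id, visit_date, diagnosis).
    """
    earliest = {}
    for patient_id, visit_date, dx in visit_records:
        key = (patient_id, dx)
        if key not in earliest or visit_date < earliest[key]:
            earliest[key] = visit_date
    return sorted((pid, seen, dx) for (pid, dx), seen in earliest.items())


# Hypothetical visits: patient 1 has depression recorded at two visits.
visits = [
    (1, date(2006, 3, 1), "DEPRESSION"),
    (1, date(2006, 1, 10), "DEPRESSION"),
    (1, date(2006, 1, 10), "ANXIETY"),
    (2, date(2006, 5, 5), "ANXIETY"),
]
first = first_diagnoses(visits)
```

Repeat entries of the same diagnosis collapse to one record, so counts reflect patients with a disease rather than the number of visits at which a doctor happened to write it down.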
Hi listers,
Please disregard the previous post. After reviewing the earlier post today (thread "summarize") I realized the solution was to make the output file active (/outfile=*), which resolved my difficulty.

Thanks
Mike