SPSSX Discussion

Variable with multiple values

OJ-3

Variable with multiple values

Hi, I'm a new SPSS user and was hoping to get some help with an
epidemiology project that I'm working on. I'm using a data set that has
multiple values (medical diagnoses) in one variable. So in the variable
Diagnosis, there are up to five separate values in each entry, with
hundreds of possible values. I could make a new variable for each
diagnosis, but like I said there are hundreds of different ones. We are
trying to associate different risk factors with each diagnosis. So, what
would be a good strategy to separate the different diagnoses? Thanks

-OJ

statisticsdoc

Re: Variable with multiple values

OJ,

How is the variable formatted? How are the five diagnoses separated? Do
they occupy separate digits or string positions?

Stephen Brand

For personalized and professional consultation in statistics and research
design, visit
www.statisticsdoc.com

Art Kendall-2

Re: Variable with multiple values

In reply to this post by OJ-3

Is the current variable a string?
Is each diagnosis a fixed number of characters, like an ICD code?
Are there special separator characters between diagnoses?

Please post a few examples of what the current variable looks like and
what you want the results to be.

Art Kendall
Social Research Consultants

Richard Ristow

Re: Variable with multiple values

In reply to this post by OJ-3

At 05:28 PM 2/3/2007, OJ wrote:

>I'm hoping to get some help with an epidemiology project. I'm using a
>data set that has multiple values (medical diagnoses) in one variable.
>So in the variable Diagnosis, there are up to five separate values
>in each entry, with hundreds of possible values. I could make a new
>variable for each diagnosis, but like I said there are hundreds of
>different ones. We are trying to associate different risk factors
>with each diagnosis. So, what would be a good strategy to separate
>the different diagnoses? Thanks

Art and Stephen Brand ("Statisticsdoc") have asked questions about your
data representation. Those are important, but more critical is: what
does your data mean, and what do you want to do with it?

Strictly, a "variable with multiple values" is impossible in SPSS;
that's just a feature of SPSS's representation of data. However, in the
real world there are frequently categorical concepts where the
categories are not mutually exclusive, and one case may have several:
Which magazines do you subscribe to? Or, in your case, diagnoses.

In SPSS, these are called 'multiple response' sets or groups. See
commands MRSETS and MULT RESPONSE in the Command Syntax reference, plus
documentation for the CTABLES module, if you have it. There are two
ways to represent them. In 'multiple dichotomies', there is one yes/no
variable for each category; that's the "new variable for each
diagnosis", and indeed it's clumsy when there are many categories. In
'multiple response groups' or 'multiple category sets', there is a set
of categorical variables, each giving one category applicable to the
case; or, of course, 'N/A'. The group has enough variables for a
reasonable estimate of the maximum number of categories per case.
That's usual for diagnoses: variables like DX1 through DX5, or some
such.
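
To make the two layouts concrete, here is a minimal sketch in Python rather than SPSS syntax. The patient IDs, the DX1..DX3 variable names and the assignment of diagnoses to patients are all made up for illustration; only the diagnosis labels come from this thread.

```python
# Hypothetical patients; the diagnosis labels are borrowed from this thread.
patients = {
    "P1": ["IMMATURE CATARACT", "ASTIGMATISM"],
    "P2": ["PRESBYOPIA"],
    "P3": ["ASTIGMATISM", "HYPERMETROPIA", "PRESBYOPIA"],
}

# Multiple dichotomies: one 0/1 variable per diagnosis that occurs anywhere.
all_dx = sorted({dx for dx_list in patients.values() for dx in dx_list})
dichotomies = {
    pid: {dx: int(dx in dx_list) for dx in all_dx}
    for pid, dx_list in patients.items()
}

# Multiple category set: a fixed number of slots (DX1..DX3 here), each
# holding one diagnosis, with None standing in for 'N/A'.
MAX_DX = 3
category_set = {
    pid: {f"DX{i + 1}": (dx_list[i] if i < len(dx_list) else None)
          for i in range(MAX_DX)}
    for pid, dx_list in patients.items()
}

print(dichotomies["P2"])
print(category_set["P2"])
```

The dichotomy layout gains one column for every distinct diagnosis in the file, while the category set stays at MAX_DX columns however many distinct diagnoses exist.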

Your data may already be in this form; "variable with multiple values"
makes sense in ordinary language, and may simply be a mis-phrasing in
SPSS terminology. As
Stephen and Art have said, if your data doesn't have this form, it
probably should; and you should tell us how the diagnoses are
represented now, so we can help you convert.

Now, the big question: How to analyze your data? You're "trying to
associate different risk factors with each diagnosis". You have two
conceptual problems:

. With hundreds of possible diagnoses, many of them are probably too
rare in your data to assess risk factors. The usual solutions are to
drop these and assess only for the more common diagnoses; or to combine
related diagnosis categories into larger categories, and analyze by
those categories.

. With multiple diagnoses per patient, you need a conceptual way to
decide which diagnoses should be considered associated with the risk
factors you're assessing. A common approach is to drop all but one of
the diagnoses (the 'primary diagnosis'), and analyze by the primary
diagnosis. I could imagine associating the risk factors with *all* the
diagnoses for a case, in which case you'd 'unroll' your data: create
one record for each diagnosis present for the patient. That would raise
difficulties because you'd clearly have non-independent records. I'm
getting out of my depth here; if you like this approach, maybe others
can say if there's a legitimate way to analyze it.
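
The 'unrolling' idea can be sketched as follows, in Python rather than SPSS syntax. This is only an illustration under assumptions: the variable names (id, smoker, DX1..DX3) and the data are hypothetical, and the sketch restructures the data without addressing the non-independence problem.

```python
# Hypothetical wide records: one row per patient, diagnoses in DX1..DX3.
wide = [
    {"id": 1, "smoker": 1, "DX1": "ASTIGMATISM", "DX2": "PRESBYOPIA", "DX3": None},
    {"id": 2, "smoker": 0, "DX1": "HYPERMETROPIA", "DX2": None, "DX3": None},
]
dx_vars = ["DX1", "DX2", "DX3"]


def unroll(records, dx_vars):
    """Return one record per diagnosis present, copying the other variables."""
    long_records = []
    for rec in records:
        # Everything that is not a diagnosis slot (ID, risk factors) is repeated.
        fixed = {k: v for k, v in rec.items() if k not in dx_vars}
        for order, var in enumerate(dx_vars, start=1):
            if rec.get(var) is not None:
                long_records.append({**fixed, "dx_order": order, "dx": rec[var]})
    return long_records


long_records = unroll(wide, dx_vars)
for rec in long_records:
    print(rec)
```

Patient 1 contributes two records that share the same risk-factor values, which is exactly the non-independence mentioned above.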

-Good luck,
Richard

OJ-3

Re: Variable with multiple values

Thanks to Richard and everyone else for the clarification. I appreciate the help. I have one variable, and the data set I am using is for ocular diagnoses. It contains all the ICD-9 codes dumped into this one variable. One entry might look like this:

[IMMATURE CATARACT][POSTERIOR CAPSULAR OPACIFICATION][HYPERMETROPIA][ASTIGMATISM][PRESBYOPIA][Pseudophakia (SIO)]

I understand that I can create a program to separate these into separate variables, but then the issue is how to properly analyze them, as Richard mentioned. I could group these into broader categories, such as "Cornea", "Retina", etc., but it would take a lot of manual labor to categorize all of the diagnoses. There are hundreds of diagnoses, maybe on the order of 200-300, and the data set is fairly large (by the end it may be up to a few thousand patients).

I am already looking at select diagnoses by using the 'multiple dichotomies' method you described, but I am interested in some of the rarer diseases. Even if there is a relatively small number of cases for a given diagnosis, I think I could find a significant difference. So, I don't think using a 'primary diagnosis' category is the best way to go about this either. While some of the diagnoses are less important and can be teased out, there could still be three important diagnoses in one case.

Someone mentioned that I could use a piece of code to separate out these diagnoses and then use VARSTOCASES to create up to 5 records from each original record, then CASESTOVARS to create the 'hundreds of diagnoses' flags. Any more suggestions? Thanks so much.

Regards,

-OJ

Richard Ristow

Re: Variable with multiple values

At 09:12 PM 2/5/2007, Osamah Saeedi wrote:

>I have one variable and the data set I am using is for ocular
>diagnoses. [The variable] contains all ICD-9 codes dumped into this
>one variable. So one entry might look like this:
>
>[IMMATURE CATARACT][POSTERIOR CAPSULAR
>OPACIFICATION][HYPERMETROPIA][ASTIGMATISM][PRESBYOPIA][Pseudophakia
>(SIO)]
>
>I understand that I can create a program to separate these into
>separate variables,

That would be the first step. Parse the string, and get a list of
diagnoses in (to start with) the 'multiple categories' form: a list of
variables, each with one ICD-9 code. I might start, actually, by
writing a separate *record* for each ICD-9 code, with the patient ID
and the order of the diagnosis code within the list of diagnoses for
the patient.
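
A minimal sketch of that parsing step, in Python rather than SPSS syntax. It assumes every diagnosis sits between one '[' and the next ']', as in the quoted example; the function name, the field names and the upper-casing of labels are choices made for illustration.

```python
import re


def parse_diagnoses(patient_id, raw):
    """Split a bracketed diagnosis string into one record per diagnosis."""
    # Assumption: each diagnosis is the text between '[' and the next ']'.
    labels = re.findall(r"\[([^\]]*)\]", raw)
    return [
        {
            "patient_id": patient_id,
            "dx_order": order,
            # Collapse stray whitespace and unify case so that
            # 'Pseudophakia (SIO)' and 'PSEUDOPHAKIA (SIO)' count as one label.
            "dx": " ".join(label.split()).upper(),
        }
        for order, label in enumerate(labels, start=1)
        if label.strip()
    ]


raw = ("[IMMATURE CATARACT][POSTERIOR CAPSULAR OPACIFICATION][HYPERMETROPIA]"
       "[ASTIGMATISM][PRESBYOPIA][Pseudophakia (SIO)]")
records = parse_diagnoses(101, raw)  # 101 is a made-up patient ID
for rec in records:
    print(rec["patient_id"], rec["dx_order"], rec["dx"])
```

Keeping dx_order preserves the position of each diagnosis in the original string, in case the order turns out to carry meaning (for example, a primary diagnosis listed first).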

Then, use MULT RESPONSE (if it's a set of variables) or FREQUENCIES (if
it's different records) to get a list of what diagnoses you find, and
how frequent they are. That's not what you're looking for in your
study, but you need it to get any sense of your data.
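
For the one-record-per-diagnosis layout, the frequency listing could be sketched like this in Python (not SPSS syntax). The records are invented, and the sketch counts each diagnosis at most once per patient, which is an assumption about what "how frequent" should mean here.

```python
from collections import Counter

# Hypothetical long-format records: one per (patient, diagnosis).
records = [
    {"patient_id": 1, "dx": "ASTIGMATISM"},
    {"patient_id": 1, "dx": "PRESBYOPIA"},
    {"patient_id": 2, "dx": "ASTIGMATISM"},
    {"patient_id": 2, "dx": "ASTIGMATISM"},  # repeated entry, counted once
    {"patient_id": 3, "dx": "HYPERMETROPIA"},
]

# Count patients per diagnosis, ignoring repeats within a patient.
distinct_pairs = {(r["patient_id"], r["dx"]) for r in records}
dx_counts = Counter(dx for _, dx in distinct_pairs)
n_patients = len({r["patient_id"] for r in records})

# List the most frequent diagnoses first, ties in alphabetical order.
for dx, n in sorted(dx_counts.items(), key=lambda kv: (-kv[1], kv[0])):
    print(f"{dx:<15} {n:>4} {100 * n / n_patients:6.1f}% of patients")
```

A listing like this shows at a glance which diagnoses have enough cases to be worth analyzing and which are too sparse.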

>I am already looking at select diagnoses by using the 'multiple
>dichotomies' method you described, but I am interested in some of the
>more rare diseases.

I don't think 'multiple dichotomies' is the best representation when
there are as many categories as you have - hundreds.

>Even if there is a relatively small number of cases for a given
>diagnosis, I think I could find a significant difference.

Worth a try. Rare events are hard to analyze, but maybe you've got the
data to do it.

Then you just have the conceptual problem: if a patient has several
DXs, what do you say about which DX is associated with which risk
factors? There, you have to think about your conceptual model, and
that's more than we can solve quickly here on the list.

Albert-Jan Roskam

Re: Variable with multiple values

Hi Osamah,

I'd like to add one more thing that we realized when
we conducted a similar kind of investigation. It's
good to keep in mind that it's not a computer, but a
real doctor who makes the diagnoses. This might mean
that the average doctor never writes down more than,
say, five diagnoses per visit, and does not write down
the same diagnosis for each and every visit. For
instance, when a patient has been diagnosed with
depressive disorder, he/she might not be
re-diagnosed at the next visit(s). We therefore checked
the average number of diagnoses per patient. In the
end, we used only the first diagnosis of a certain
disease.
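
A sketch of those two checks in Python, under assumptions: the records carry a visit number, the field names are hypothetical, and the data are invented. It keeps only the first record of each diagnosis per patient, then computes the mean number of distinct diagnoses per patient.

```python
from collections import Counter

# Hypothetical visit-level records.
visits = [
    {"patient_id": 1, "visit": 1, "dx": "DEPRESSIVE DISORDER"},
    {"patient_id": 1, "visit": 2, "dx": "ANXIETY"},
    {"patient_id": 1, "visit": 3, "dx": "DEPRESSIVE DISORDER"},
    {"patient_id": 2, "visit": 1, "dx": "ANXIETY"},
]


def first_diagnoses(records):
    """Keep the earliest record of each diagnosis for each patient."""
    seen = set()
    first = []
    for rec in sorted(records, key=lambda r: (r["patient_id"], r["visit"])):
        key = (rec["patient_id"], rec["dx"])
        if key not in seen:
            seen.add(key)
            first.append(rec)
    return first


first = first_diagnoses(visits)
dx_per_patient = Counter(rec["patient_id"] for rec in first)
mean_dx = sum(dx_per_patient.values()) / len(dx_per_patient)
print(len(first), "first diagnoses; mean per patient:", mean_dx)
```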

Second, known comorbidities should also be kept in
mind when analyzing your data. For example, anxiety
and depressive disorder co-occur more often than
they occur alone. Doctor X might write down Anxiety, Y
writes down Depression, and Z both. This is important
when you analyze risk factors.
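
One simple way to see which diagnoses travel together is to count how many patients carry each pair. The following Python sketch uses invented data and hypothetical names; it only tabulates co-occurrence and does not test whether a pair occurs more often than chance.

```python
from collections import Counter
from itertools import combinations

# Hypothetical set of diagnoses per patient.
patient_dx = {
    1: {"ANXIETY", "DEPRESSIVE DISORDER"},
    2: {"ANXIETY"},
    3: {"ANXIETY", "DEPRESSIVE DISORDER"},
    4: {"DEPRESSIVE DISORDER"},
}

pair_counts = Counter()
for dx_set in patient_dx.values():
    # Sorting gives each pair one canonical order, so (A, B) and (B, A)
    # are counted together.
    pair_counts.update(combinations(sorted(dx_set), 2))

for pair, n in pair_counts.most_common():
    print(pair, n)
```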

Cheers!
Albert-Jan

Roberts, Michael

Re: Aggregating datafiles in Python

Hi listers,

Please disregard the previous post. After reviewing the earlier post
today (thread "summarize") I realized the solution was to make the
output file active (/outfile=*), which resolved my difficulty.

Thanks

Mike