How to do analyses (mainly linear regression/repeated measures ANOVA) with data sets in "long" form

joseph.williams
Hi Everyone,

I have a gigantic data set (5 million+ observations) in long form: each line currently represents a person's behavior with a particular object, and there are a bunch of different variables to do regression analyses on, in tandem with looking at the effects of an experimental variable.

I've been playing around with using "Restructure" to put it into "wide" form with a participant on each line – which is how I've used SPSS the last 8 years – but it takes forever and actually makes it harder to specify variables.

Can anyone point me to some documentation/links or provide information on whether/how I can do the analyses and graphs I want with the data in long form?

Thanks a lot,

Joseph

Re: How to do analyses (mainly linear regression/repeated measures ANOVA) with data sets in "long" form

Rich Ulrich
Useful information is most apt to be relevant if you find it within your own
subject area, which is not something that you mention.  What do you have?

If it is 5 million subjects with multiple "objects" (and time?), then conventional
testing should be largely irrelevant, since the nominal power is huge.
And that same comment may apply if it is "only" 5 million lines in long form.
What do you have?
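
To put a rough number on "the nominal power is huge", here is a back-of-the-envelope sketch in Python (not SPSS syntax). It assumes, purely for illustration, two equal groups of independent observations with unit variance; the function name, the group size, and the effect sizes are all hypothetical.

```python
import math

def two_group_z(d, n_per_group):
    """Approximate z statistic for a standardized mean difference d
    between two independent, equal-sized groups (unit variance assumed)."""
    standard_error = math.sqrt(2.0 / n_per_group)
    return d / standard_error

# Hypothetical split of 5 million independent observations into two groups.
n_per_group = 2_500_000
for d in (0.001, 0.01, 0.1):
    print(f"d = {d}: z is about {two_group_z(d, n_per_group):.1f}")
```

Under these assumptions a difference of one hundredth of a standard deviation already gives a z above 11. Rows in a long file are repeated within subjects, so they are not independent and the real figure would differ, but the order of magnitude is the point.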

What I have to suggest assumes that conventional regression testing is largely
irrelevant because of N.  (Generating useful tests based on specific, selected d.f.
is a separate problem, one which usually is ignored.)

An obvious, potential starting point for graphs and analyses would be to explicitly
subtract off the Subject effects, which is what Repeated Measures does implicitly: 
  1) AGGREGATE with MODE=ADDVARIABLES and subtract in order to get deviation scores within Subject.
I would consider preserving that overall level as a potential predictor.
  2) If the Objects have notably different means, AGGREGATE first (that is, before step (1))
by Object, across Subjects, with MODE=ADDVARIABLES, in order to get deviation scores for Objects,
which you then use as the starting place for step (1).
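
The arithmetic of steps (1) and (2) can be sketched as follows. This is a minimal Python illustration of the subtraction only, not SPSS syntax; the toy rows, the function names, and the `center_objects` switch are hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical toy rows: (subject, object, score), one row per observation.
rows = [
    (1, 1, 0.06), (1, 2, 0.14), (1, 3, 0.43),
    (2, 1, 0.60), (2, 2, 0.25),
    (3, 1, 0.33), (3, 2, 0.43), (3, 3, 0.52), (3, 4, 0.03),
]

def group_means(records, key_index, value_index):
    """Mean of the value column within each level of the key column."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec[key_index]].append(rec[value_index])
    return {key: mean(values) for key, values in groups.items()}

def within_subject_deviations(records, center_objects=False):
    """Return (subject, object, deviation, subject_mean) tuples."""
    recs = list(records)
    if center_objects:
        # Step (2), done first: subtract each object's mean across subjects.
        object_means = group_means(recs, 1, 2)
        recs = [(s, o, y - object_means[o]) for s, o, y in recs]
    # Step (1): subtract each subject's own mean level.
    subject_means = group_means(recs, 0, 2)
    return [(s, o, y - subject_means[s], subject_means[s]) for s, o, y in recs]

deviations = within_subject_deviations(rows)
for record in deviations[:3]:
    print(record)
```

The subject mean is carried along in the last position of each tuple so that the overall level stays available as a potential predictor.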

I list (2) as (2) rather than making it (1) because I think I would avoid doing it when the
item differences are merely "significant" for the huge N and are not "notable."  I say
this because the ipsatizing procedure - subtracting means - puts weight on the assumption
that the data are, indeed, measured on a good scale, with equal intervals and no
basement/ceiling effects.   Repeated Measures does give you a good form of data
(pre/post; object1/object2) for examining linearity and equal intervals.  Problems are
more likely when the ratings do not use the same part of the scale for all Objects.

Hope this helps.  Please give more detail, anyway.

--
Rich Ulrich



Date: Sun, 6 Oct 2013 18:46:32 -0700

Re: How to do analyses (mainly linear regression/repeated measures ANOVA) with data sets in "long" form

Joseph Jay Williams
Thanks a lot for your help! My subject area is psychology/education, and the 5 million observations are not so much the issue as the 25+ variables that could be relevant.

For a couple of reasons, right now I'm trying to understand how to use SPSS to analyze long-form data using tests I often use with wide data (but I plan to consider the statistical issues you're pointing out).

I think the technical question I have is: how do I run a typical ANOVA/regression with SPSS data in long form, without restructuring the data to wide form?

E.g., if the data is in long form as shown below:

Participant  ExerciseNumber  AccuracyType1  AccuracyType2
1            1               0.06           0.03
1            2               0.14           0.36
1            3               0.43           0.94
2            1               0.60           0.32
2            2               0.25           0.03
3            1               0.33           0.60
3            2               0.43           0.31
3            3               0.52           0.54
3            4               0.03           0.32
  
How would I tell SPSS to do an ANOVA with ExerciseNumber as the between-subjects variable and AccuracyType (AccuracyType1, AccuracyType2) as the within-subjects variable?

Or tell SPSS to regress AccuracyType2 onto AccuracyType1 and ExerciseNumber?
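
One point that may help frame both questions: the example is long with respect to ExerciseNumber but still wide with respect to accuracy type, since the two accuracies sit in separate columns. Long-form procedures such as SPSS MIXED take a single dependent variable, so a within-subjects AccuracyType factor would need one outcome column plus an indicator. The Python sketch below (not SPSS syntax) shows only that reshaping; the function and variable names are hypothetical.

```python
# Rows as posted: (Participant, ExerciseNumber, AccuracyType1, AccuracyType2)
posted_rows = [
    (1, 1, 0.06, 0.03), (1, 2, 0.14, 0.36), (1, 3, 0.43, 0.94),
    (2, 1, 0.60, 0.32), (2, 2, 0.25, 0.03),
    (3, 1, 0.33, 0.60), (3, 2, 0.43, 0.31), (3, 3, 0.52, 0.54),
    (3, 4, 0.03, 0.32),
]

def stack_accuracy(rows):
    """Turn each posted row into two rows:
    (Participant, ExerciseNumber, AccuracyType, Accuracy)."""
    stacked = []
    for participant, exercise, acc1, acc2 in rows:
        stacked.append((participant, exercise, 1, acc1))
        stacked.append((participant, exercise, 2, acc2))
    return stacked

stacked = stack_accuracy(posted_rows)
for record in stacked[:4]:
    print(record)
```

For the regression of AccuracyType2 on AccuracyType1 and ExerciseNumber no reshaping is needed, because each posted row already holds the outcome and both predictors; the open issue there is the dependence among rows from the same participant.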

Joseph


Joseph Jay Williams, Ph.D.
Graduate School of Education, Stanford University


On Sun, Oct 6, 2013 at 8:19 PM, Rich Ulrich <[hidden email]> wrote:

Re: How to do analyses (mainly linear regression/repeated measures ANOVA) with data sets in "long" form

Maguin, Eugene
In reply to this post by joseph.williams

To begin, look at the examples for the Mixed command.

Gene Maguin

 

From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of Joseph Jay Williams
Sent: Monday, October 07, 2013 4:31 PM

Re: How to do analyses (mainly linear regression/repeated measures ANOVA) with data sets in "long" form

Art Kendall
In reply to this post by Joseph Jay Williams
From the example data you posted, is this a correct description of your data?
You have participants, each with doubly repeated data.  The two repeated factors are
exercise, with up to 4 levels, and type of accuracy, with 2 levels.
Some participants are missing data on some exercises, but have up to 4 exercises. When an exercise is missing, of course its accuracy is missing.

Is exercise 1 always the same exercise?  Is exercise 4 always the same exercise?
If so, that gives you a table of 8 means for the interaction, 2 for type of accuracy, and 4 for exercise.
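
As a concrete illustration of those three tables of means, here is a small Python sketch (not SPSS syntax) run on the nine rows posted in the thread. The function and variable names are hypothetical, and it assumes that a given exercise number always refers to the same exercise.

```python
from collections import defaultdict
from statistics import mean

# Rows as posted: (Participant, ExerciseNumber, AccuracyType1, AccuracyType2)
posted_rows = [
    (1, 1, 0.06, 0.03), (1, 2, 0.14, 0.36), (1, 3, 0.43, 0.94),
    (2, 1, 0.60, 0.32), (2, 2, 0.25, 0.03),
    (3, 1, 0.33, 0.60), (3, 2, 0.43, 0.31), (3, 3, 0.52, 0.54),
    (3, 4, 0.03, 0.32),
]

def summarize(groups):
    """Mean of each group's values, with keys in sorted order."""
    return {key: mean(values) for key, values in sorted(groups.items())}

def cell_means(rows):
    """Return (interaction, exercise, accuracy-type) tables of means."""
    by_cell = defaultdict(list)      # (exercise, accuracy_type) -> values
    by_exercise = defaultdict(list)  # exercise -> values
    by_type = defaultdict(list)      # accuracy_type -> values
    for _participant, exercise, acc1, acc2 in rows:
        for acc_type, value in ((1, acc1), (2, acc2)):
            by_cell[(exercise, acc_type)].append(value)
            by_exercise[exercise].append(value)
            by_type[acc_type].append(value)
    return summarize(by_cell), summarize(by_exercise), summarize(by_type)

interaction, exercise_means, type_means = cell_means(posted_rows)
print(len(interaction), len(exercise_means), len(type_means))
```

In this small sample only participant 3 has exercise 4, so each exercise-4 cell rests on a single observation and no SD can be computed for it.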

Where do the 25 variables come into the model? The example data only has 4 variables.

It is late in the day but IIRC repeated measures GLM does not accommodate missing data but MIXED does.
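
To show what that difference would mean for the posted example, the Python sketch below (not SPSS syntax) counts which participants have all four exercises. It assumes that a wide-format repeated measures analysis drops any participant with a missing value on one of the repeated variables; the names are hypothetical.

```python
from collections import defaultdict

# (Participant, ExerciseNumber) pairs taken from the rows posted in the thread.
pairs = [
    (1, 1), (1, 2), (1, 3),
    (2, 1), (2, 2),
    (3, 1), (3, 2), (3, 3), (3, 4),
]

def complete_participants(observations, required_levels):
    """Participants observed at every level in required_levels."""
    seen = defaultdict(set)
    for participant, exercise in observations:
        seen[participant].add(exercise)
    return sorted(p for p, levels in seen.items() if required_levels <= levels)

complete = complete_participants(pairs, {1, 2, 3, 4})
kept_rows = [pair for pair in pairs if pair[0] in complete]
print(f"complete participants: {complete}")
print(f"rows kept under listwise deletion: {len(kept_rows)} of {len(pairs)}")
```

Under that assumption only one of the three participants would survive in wide form, whereas an analysis on the long file can use all nine rows.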

What questions are you asking of the data?  It is extremely likely that the means will be statistically distinct (significantly different).  But the question then becomes whether the differences are meaningful.

BUT, as Rich says, maybe you should just use MEANS or AGGREGATE to get the means and SDs.
Art Kendall
Social Research Consultants
On 10/7/2013 5:24 PM, Joseph Jay Williams [via SPSSX Discussion] wrote: