SPSSX Discussion

Cases to variables with timestamp-based unitizing

emrinke

Cases to variables with timestamp-based unitizing

Dear all,

I have a data transformation problem and would greatly appreciate any suggestions on how to solve it.

I am analyzing data from a rating task with multiple raters. The ratings concerned audiovisual material, i.e. continuous data which had to be properly segmented (unitized) by the raters. Each segment coded by the raters has a timestamp attached to it in the "start" and "end" variables in which start and end times of the segment are recorded (in sec).

The data look like this (this is a simplified version with only two raters when in actuality there are nine):

       Rater  Start    End      Var1  Var2  ...
case1  R1      17.54   123.29   4     2
case2  R2      18.02   123.76   4     3
case3  R1     128.43   171.53   2     1
case4  R2     130.13   148.21   2     1
.
.
.

I now intend to do analyses for which the data need to be set up
differently. The ratings of the separate judges for all variables
should be represented in individual columns while the rows should
correspond to a single observed unit each. This means that a relatively
simple "cases to variables" procedure is in order. However, the issue
is complicated by the need to identify the units and match cases
accordingly beforehand. I do not expect agreement on start and end
times of the segments to be exactly the same for them to be considered
a unit. Instead what is expected here is agreement between raters in a
range of, say, 5 sec for both start and end time of the segment. That
is, in the above data example, cases 1 and 2 should be counted as a
unit and both ratings put into its single row, while for cases 3 and 4
raters disagree too much on the end time of the segment. Therefore,
cases 3 and 4 should be kept as single units within the dataset.

Consequently, this is what the data should look like in the end:

       Start    End      Var1_R1  Var1_R2  Var2_R1  Var2_R2  ...
case1   17.54   123.29   4        4        2        3
case2  128.43   171.53   2        .        1        .
case3  130.13   148.21   .        2        .        1
.
.
.

For the intended analyses it does not matter much whether the start and end times of the cases (now "units") equal those set by the first rater (as in the example data matrix) or, more elegantly, the mean of all ratings subsumed under the case/unit.

I am unsure how to go about solving this transformation task in an automated fashion in Stata - hence any help is much appreciated.

Eike

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD

Maguin, Eugene

Re: Cases to variables with timestamp-based unitizing

Eike,

>>I am unsure how to go about solving this transformation task in an
automated fashion in Stata - hence any help is much appreciated.

By the way, this is an SPSS list, not a Stata list. Nothing anybody writes
here will work in Stata. Concepts, yes. Code, no. Why don't you post this on
the Stata list? I know there is one, and I'm sure there are seriously
competent people on that list.

Gene Maguin

emrinke

Re: Cases to variables with timestamp-based unitizing

Dear Gene and other SPSS listers,

Sorry, this was a late-night slip on my part. I am indeed more interested in getting this problem solved in SPSS than in Stata, as I am currently using SPSS for the reliability analyses of the rating data. My question is exactly the one in my original post above, this time with the correct software name: I am unsure how to go about solving this transformation task in an automated fashion in SPSS (!), and I'd still appreciate any help with this data structure problem.

Eike

Maguin, Eugene

Re: Cases to variables with timestamp-based unitizing

Eike,

How many segments per judge are you talking about? 10? 100? 1000? 5000?

Ok. I don't know that I can give a complete solution, because I suspect that
'issues', maybe quite a few, will come up in the transformation and analysis
phases and need further fixing. I don't have a lot of confidence that what
follows will be helpful, but here's how I'd work on it. If you haven't
already done so, run frequencies on both start and end and see whether the
times group. My impression from your description is that each judge
identifies nonoverlapping segments such that start1 < end1 < start2 < end2,
etc. The problem is that any pair of judges may not agree very well on the
start and end times. The killer problem is a judge identifying as one
segment what other judges identify as two adjacent segments. The simple
check is whether all judges have the same number of segments. After that you
might want to compute segment lengths and the inter-segment start and
inter-segment end intervals. Do that by judge and look at the distributions
by judge to see whether you can find odd values. Keep in mind that equal
counts prove little: all judges could have the same number of segments while
each concatenates different segments.
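The checks described above can be sketched outside SPSS as follows. This is only a minimal illustration in plain Python, not SPSS syntax; the record layout, the name per_judge_diagnostics, and the sample rows (copied from the example in the question) are assumptions made for the sketch.

```python
from collections import defaultdict

# Hypothetical long-format records (rater, start, end), times in seconds,
# copied from the example rows in the question.
SEGMENTS = [
    ("R1", 17.54, 123.29),
    ("R2", 18.02, 123.76),
    ("R1", 128.43, 171.53),
    ("R2", 130.13, 148.21),
]

def per_judge_diagnostics(records):
    """Per rater: segment count, segment lengths, inter-segment gaps, overlap flag."""
    by_rater = defaultdict(list)
    for rater, start, end in records:
        by_rater[rater].append((start, end))

    report = {}
    for rater, segs in by_rater.items():
        segs.sort()  # order each judge's segments by start time
        pairs = list(zip(segs, segs[1:]))  # consecutive segments of the same judge
        report[rater] = {
            "n": len(segs),
            "lengths": [end - start for start, end in segs],
            "start_gaps": [b[0] - a[0] for a, b in pairs],
            "end_gaps": [b[1] - a[1] for a, b in pairs],
            # True if some segment starts before the previous one has ended
            "overlaps": any(b[0] < a[1] for a, b in pairs),
        }
    return report

report = per_judge_diagnostics(SEGMENTS)
same_count = len({stats["n"] for stats in report.values()}) == 1
for rater, stats in sorted(report.items()):
    print(rater, stats)
print("all judges have the same number of segments:", same_count)
```

Equal counts are only a first screen here; the length and gap lists are what one would inspect by judge for odd values.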

So, let's say there are no problems, or any problems are now fixed. Now redo
the frequency distributions on the start and end times. Basically you are
going to be somewhere on this continuum: at one end, segment start (end)
times are 'tightly' grouped; at the other, start (end) times are 'loosely'
grouped. You have to define 'tightly' and 'loosely'. I can't. You mentioned
the idea of defining a window of a certain width, say 5 seconds, keeping
records that fall within the window and throwing out those outside it.
Think about this window for a minute. It has a width, but how do you decide
where on the time scale to place the left edge for a given segment? I have
no idea what to tell you to do, nor do I know what I'd do without having
worked with the data and knowing what the long-term analytical goals are.
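One possible answer to the window-placement question, offered purely as an assumption and not as a recommendation, is to anchor each window on the earliest segment not yet assigned to a unit and to accept another rater's segment when both its start and its end lie within the tolerance of that anchor. A minimal sketch in plain Python (not SPSS syntax); the function name assign_units, the tol parameter, and the rule that a rater contributes at most one segment per unit are all hypothetical choices:

```python
# Hypothetical long-format records (rater, start, end), times in seconds,
# copied from the example rows in the question.
RECORDS = [
    ("R1", 17.54, 123.29),
    ("R2", 18.02, 123.76),
    ("R1", 128.43, 171.53),
    ("R2", 130.13, 148.21),
]

def assign_units(records, tol=5.0):
    """Greedy unitizing; returns one unit number per record, in input order.

    Assumption: the first segment placed in a unit (in start-time order) is the
    anchor, and a later segment joins the unit only if its rater is not yet in
    the unit and both its start and its end are within `tol` seconds of the anchor.
    """
    order = sorted(range(len(records)), key=lambda i: (records[i][1], records[i][2]))
    units = []                       # anchor times and raters of each unit
    unit_of = [None] * len(records)  # unit index for every input record
    for i in order:
        rater, start, end = records[i]
        for uid, unit in enumerate(units):
            if (rater not in unit["raters"]
                    and abs(start - unit["start"]) <= tol
                    and abs(end - unit["end"]) <= tol):
                unit["raters"].add(rater)
                unit_of[i] = uid
                break
        else:
            # no existing unit fits, so this segment anchors a new unit
            units.append({"start": start, "end": end, "raters": {rater}})
            unit_of[i] = len(units) - 1
    return [uid + 1 for uid in unit_of]

unit_numbers = assign_units(RECORDS, tol=5.0)
print(unit_numbers)
```

On the four example rows this puts cases 1 and 2 in one unit and leaves cases 3 and 4 as separate units, which is the grouping asked for in the question. Anchoring on the earliest segment is only one convention; a different anchor (for instance a running mean of the unit's times) can group borderline segments differently, and scanning every existing unit is quadratic, so a large file may call for something smarter.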

If the number of segments were small, I might define window boundaries in
syntax for each segment and, using those boundaries, number the segments.
With the segments numbered, a casestovars operation is easy. But with
hundreds or thousands of segments, I'm not sure how to escape the burden of
keying the boundary times. Far worse would be such loose groupings that a
segment's start time (or end time) distribution overlaps with that for an
adjacent segment.
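Once every record carries a segment (unit) number, the restructuring itself is mechanical. The sketch below shows the idea in plain Python rather than SPSS syntax; the unit numbers are assumed to be assigned already, the name cases_to_vars is hypothetical, and taking the mean of the raters' start and end times is just one of the two options mentioned in the question.

```python
# Hypothetical numbered records: (unit, rater, start, end, Var1, Var2).
# The unit numbers are assumed to come from an earlier matching step.
ROWS = [
    (1, "R1", 17.54, 123.29, 4, 2),
    (1, "R2", 18.02, 123.76, 4, 3),
    (2, "R1", 128.43, 171.53, 2, 1),
    (3, "R2", 130.13, 148.21, 2, 1),
]

def cases_to_vars(rows, raters=("R1", "R2"), var_names=("Var1", "Var2")):
    """Return one dict per unit with a column per variable and rater."""
    collected = {}
    for unit, rater, start, end, *values in rows:
        rec = collected.setdefault(unit, {"starts": [], "ends": [], "ratings": {}})
        rec["starts"].append(start)
        rec["ends"].append(end)
        for name, value in zip(var_names, values):
            rec["ratings"][f"{name}_{rater}"] = value

    table = []
    for unit in sorted(collected):
        rec = collected[unit]
        out = {
            "Unit": unit,
            # mean of the start/end times of all segments subsumed under the unit
            "Start": sum(rec["starts"]) / len(rec["starts"]),
            "End": sum(rec["ends"]) / len(rec["ends"]),
        }
        for name in var_names:
            for rater in raters:
                # None marks a rater who contributed no segment to this unit
                out[f"{name}_{rater}"] = rec["ratings"].get(f"{name}_{rater}")
        table.append(out)
    return table

wide = cases_to_vars(ROWS)
for row in wide:
    print(row)
```

Each output row corresponds to one unit, with None playing the role of the system-missing "." in the target layout shown in the question.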

There's another angle on this. The above babblings assume a long-form file,
but a wide-form file might work better for some things. If you did a
casestovars right away, so that the rater IDs were the case IDs, then you
could run frequencies on start and end times by segment and so on. There may
be advantages to this data structure, but I'm not sure. At some point you
will probably need a casestovars operation, if only for your analyses; the
question is when the optimal time is, and that simply depends.

Gene Maguin
