restructuring or aggregating data to change unique identifier

classic Classic list List threaded Threaded
6 messages Options
Reply | Threaded
Open this post in threaded view
|

restructuring or aggregating data to change unique identifier

Miriam L. Gerver
Since my last question to this list was answered so quickly I'm going
to ask another one. I have a dataset with email addresses and
departments. Each email address can be associated with one or more
departments. Is there a way to restructure the data so that the  email
address is the unique identifier instead of the department?

This is what the data looks like:

Department        email address
1                              abc@abc
2                              abc@abc
3                              abc@def
4                              abc@def
5                              abc@ghi
6                              abc@jkl

email address   Department 1    Department 2
abc@abc                           1                      2
abc@def                            3                      4
abc@ghi                            4
abc@jkl                             5


Thank you,
Miriam Gerver
Reply | Threaded
Open this post in threaded view
|

Re: restructuring or aggregating data to change unique identifier

Richard Ristow
At 06:53 PM 11/8/2006, Miriam L. Gerver wrote:

>I have a dataset with email addresses and departments. Each email
>address can be associated with one or more departments. Is there a way
>to restructure the data so that the email address is the unique
>identifier instead of the department?
>
>[I have]
>Department    email address
>1             abc@abc
>2             abc@abc
>3             abc@def
>4             abc@def
>5             abc@ghi
>6             abc@jkl
>[and I want]
>email address Department 1  Department 2
>abc@abc       1             2
>abc@def       3             4
>abc@ghi       4
>abc@jkl       5

Sure. That's called going from 'long' to 'wide' organization, and it's
what CASESTOVARS is for. You can click it up rather than writing the
syntax: Menu Data>Restructure, select "Restructure selected cases into
variables." It won't be difficult. Sort by email address (and probably
department) first.

NOW (OK, this is one of the bees in my bonnet), consider carefully
whether you want to do it. Many operations, both data manipulation and
analysis, are easier with 'long' organization; plenty often, changing
from 'wide' to 'long' makes the problem easier overall, even with the
additional step. Think, anyhow, what you plan to do, and which
organization is better for doing it.

-Onward, and good luck to you,
  Richard
Reply | Threaded
Open this post in threaded view
|

Re: restructuring or aggregating data to change unique identifier

Miriam L. Gerver
Thanks, Richard. I tried that and ended up with an indicator variable
for each department, rather than a variable for the first department
associated with a particular email address and a second department
associated with that particular email address, etc. So it gave me
about 30,000 indicator variables!

So now I have:

email address         ind1       ind2       ind3       ind4       ind5
      ind6
abc@abc                    1           1           0            0
     0          0
abc@def                     0           0           1            1
      0          0
abc@ghi                     0           0           0            0
      1          0
abc@jkl                      0           0           0            0
       0          1

I'd still like the format to be:
email address         1st dept listed       2nd dept listed
abc@abc                    1                              2
abc@def                     3                              4
abc@ghi                     5
abc@jkl                      6

Any ideas?

Thanks,
Miriam


On 11/8/06, Richard Ristow <[hidden email]> wrote:

> At 06:53 PM 11/8/2006, Miriam L. Gerver wrote:
>
> >I have a dataset with email addresses and departments. Each email
> >address can be associated with one or more departments. Is there a way
> >to restructure the data so that the email address is the unique
> >identifier instead of the department?
> >
> >[I have]
> >Department    email address
> >1             abc@abc
> >2             abc@abc
> >3             abc@def
> >4             abc@def
> >5             abc@ghi
> >6             abc@jkl
> >[and I want]
> >email address Department 1  Department 2
> >abc@abc       1             2
> >abc@def       3             4
> >abc@ghi       5
> >abc@jkl       6
>
> Sure. That's called going from 'long' to 'wide' organization, and it's
> what CASESTOVARS is for. You can click it up rather than writing the
> syntax: Menu Data>Restructure, select "Restructure selected cases into
> variables." It won't be difficult. Sort by email address (and probably
> department) first.
>
> NOW (OK, this is one of the bees in my bonnet), consider carefully
> whether you want to do it. Many operations, both data manipulation and
> analysis, are easier with 'long' organization; plenty often, changing
> from 'wide' to 'long' makes the problem easier overall, even with the
> additional step. Think, anyhow, what you plan to do, and which
> organization is better for doing it.
>
> -Onward, and good luck to you,
>   Richard
>
>
Reply | Threaded
Open this post in threaded view
|

Re: restructuring or aggregating data to change unique identifier

Richard Ristow
At 09:12 PM 11/8/2006, I suggested:

>>You're going from 'long' to 'wide' organization, and it's what
>>CASESTOVARS is for. You can click it up rather than writing the
>>syntax: Menu Data>Restructure, select "Restructure selected cases
>>into variables." It won't be difficult. Sort by email address (and
>>probably department) first.

At 11:06 AM 11/9/2006, Miriam L. Gerver wrote:
>Thanks, Richard. I tried that and ended up with an indicator variable
>for each department, rather than a variable for the first department
>associated with a particular email address and a second department
>associated with that particular email address, etc. So it gave me
>about 30,000 indicator variables!

OK: There's no way to tell what happened. SPSS menus work by generating
and running SPSS syntax. When you don't get the result you wanted from
the menus, give the syntax that was generated and run.

Anyway, I entered your data, clicked up the command, and it ran the
first time; see output below. ("Three things you should be wary of: A
new kid in his prime; A man with all the answers; and code that runs
first time.")

Again, do think whether you want to restructure this way.

This is tested; SPSS draft output. The SORT CASES and CASESTOVARS
commands were clicked up from the Data>Restructure menu.
* ...................................................          .
LIST.
|-----------------------------|---------------------------|
|Output Created               |09-NOV-2006 11:43:29       |
|-----------------------------|---------------------------|
Dept E_mail

   1  abc@abc
   2  abc@abc
   3  abc@def
   4  abc@def
   5  abc@ghi
   6  abc@jkl

Number of cases read:  6    Number of cases listed:  6


SORT CASES BY E_mail .
CASESTOVARS
  /ID = E_mail
  /GROUPBY = VARIABLE .


Cases to Variables
|--------------------------|---------------------------|
|Output Created            |09-NOV-2006 11:45:09       |
|--------------------------|---------------------------|

Generated Variables
|---------|------|
|Original |Result|
|Variabl  |------|
|e        |Name  |
|-------|-|------|
|Dept   |1|Dept.1|
|       |-|------|
|       |2|Dept.2|
|-------|-|------|

Processing Statistics
|---------------|---|
|Cases In       |6  |
|---------------|---|
|Cases Out      |4  |
|---------------|---|
|Cases In/Cases |1.5|
|Out            |   |
|---------------|---|
|Variables In   |2  |
|---------------|---|
|Variables Out  |3  |
|---------------|---|
|Index Values   |2  |
|---------------|---|


LIST.
|-----------------------------|---------------------------|
|Output Created               |09-NOV-2006 11:45:40       |
|-----------------------------|---------------------------|
E_mail  Dept.1 Dept.2

abc@abc    1      2
abc@def    3      4
abc@ghi    5      .
abc@jkl    6      .

Number of cases read:  4    Number of cases listed:  4
Reply | Threaded
Open this post in threaded view
|

Re: restructuring data to change unique identifier

Richard Ristow
In reply to this post by Miriam L. Gerver
Postscript:

At 11:06 AM 11/9/2006, Miriam L. Gerver wrote:

>I ended up with an indicator variable for each department, rather than
>a variable for the first department associated with a particular email
>address and a second department associated with that particular email
>address, etc. So it gave me about 30,000 indicator variables!

I think I see how you did that. In the restructuring wizard, you
. Panel 1: Selected "Restructure selected cases into variables",
correctly
. Panel 2:
- Selected E-mail address as Identifier variable (correctly)
+ Selected Department as an index variable. THAT'S what's wrong, but
it's an easy mistake to make: you know you want to do *something* with
"Department", and those look like the only two choices. (In fact, you
want to leave "Department" unselected from either category, at this
point.)
. Panel 3: (Sort the data or not, is OK)
. Panel 4: Checked "Create indicator variables", which is the second
problem.

That generates the following:

SORT CASES BY E_mail Dept .
CASESTOVARS
  /ID = E_mail
  /INDEX = Dept
  /GROUPBY = VARIABLE
  /VIND ROOT = ind.

And that gives you what you saw. This is SPSS draft output:

LIST.
|-----------------------------|---------------------------|
|Output Created               |09-NOV-2006 16:55:44       |
|-----------------------------|---------------------------|
Dept E_mail

   1  abc@abc
   2  abc@abc
   3  abc@def
   4  abc@def
   5  abc@ghi
   6  abc@jkl

Number of cases read:  6    Number of cases listed:  6


SORT CASES BY E_mail Dept .
CASESTOVARS
  /ID = E_mail
  /INDEX = Dept
  /GROUPBY = VARIABLE
  /VIND ROOT = ind.


Cases to Variables
|--------------------------|---------------------------|
|Output Created            |09-NOV-2006 16:55:44       |
|--------------------------|---------------------------|

Generated Variables
|---------|----|------|
|Original |Dept|Result|
|Variable |    |------|
|         |    |Name  |
|---------|----|------|
|Indicator|1   |ind1  |
|         |----|------|
|         |2   |ind2  |
|         |----|------|
|         |3   |ind3  |
|         |----|------|
|         |4   |ind4  |
|         |----|------|
|         |5   |ind5  |
|         |----|------|
|         |6   |ind6  |
|---------|----|------|

Processing Statistics
|---------------|---|
|Cases In       |6  |
|---------------|---|
|Cases Out      |4  |
|---------------|---|
|Cases In/Cases |1.5|
|Out            |   |
|---------------|---|
|Variables In   |2  |
|---------------|---|
|Variables Out  |7  |
|---------------|---|
|Index Values   |6  |
|---------------|---|


LIST.
|-----------------------------|---------------------------|
|Output Created               |09-NOV-2006 16:55:44       |
|-----------------------------|---------------------------|
E_mail  ind1 ind2 ind3 ind4 ind5 ind6

abc@abc   1    1    0    0    0    0
abc@def   0    0    1    1    0    0
abc@ghi   0    0    0    0    1    0
abc@jkl   0    0    0    0    0    1

Number of cases read:  4    Number of cases listed:  4
Reply | Threaded
Open this post in threaded view
|

Re: restructuring data to change unique identifier

Miriam L. Gerver
Thank you, Richard. The syntax you sent worked perfectly! Yes, I
didn't realize that I could leave "department" unselected. And yes, I
selected the indicator variable option (for no rational reason).

Miriam

On 11/9/06, Richard Ristow <[hidden email]> wrote:

> Postscript:
>
> At 11:06 AM 11/9/2006, Miriam L. Gerver wrote:
>
> >I ended up with an indicator variable for each department, rather than
> >a variable for the first department associated with a particular email
> >address and a second department associated with that particular email
> >address, etc. So it gave me about 30,000 indicator variables!
>
> I think I see how you did that. In the restructuring wizard, you
> . Panel 1: Selected "Restructure selected cases into variables",
> correctly
> . Panel 2:
> - Selected E-mail address as Identifier variable (correctly)
> + Selected Department as an index variable. THAT'S what's wrong, but
> it's an easy mistake to make: you know you want to do *something* with
> "Department", and those look like the only two choices. (In fact, you
> want to leave "Department" unselected from either category, at this
> point.)
> . Panel 3: (Sort the data or not, is OK)
> . Panel 4: Checked "Create indicator variables", which is the second
> problem.
>
> That generates the following:
>
> SORT CASES BY E_mail Dept .
> CASESTOVARS
>   /ID = E_mail
>   /INDEX = Dept
>   /GROUPBY = VARIABLE
>   /VIND ROOT = ind.
>
> And that gives you what you saw. This is SPSS draft output:
>
> LIST.
> |-----------------------------|---------------------------|
> |Output Created               |09-NOV-2006 16:55:44       |
> |-----------------------------|---------------------------|
> Dept E_mail
>
>    1  abc@abc
>    2  abc@abc
>    3  abc@def
>    4  abc@def
>    5  abc@ghi
>    6  abc@jkl
>
> Number of cases read:  6    Number of cases listed:  6
>
>
> SORT CASES BY E_mail Dept .
> CASESTOVARS
>   /ID = E_mail
>   /INDEX = Dept
>   /GROUPBY = VARIABLE
>   /VIND ROOT = ind.
>
>
> Cases to Variables
> |--------------------------|---------------------------|
> |Output Created            |09-NOV-2006 16:55:44       |
> |--------------------------|---------------------------|
>
> Generated Variables
> |---------|----|------|
> |Original |Dept|Result|
> |Variable |    |------|
> |         |    |Name  |
> |---------|----|------|
> |Indicator|1   |ind1  |
> |         |----|------|
> |         |2   |ind2  |
> |         |----|------|
> |         |3   |ind3  |
> |         |----|------|
> |         |4   |ind4  |
> |         |----|------|
> |         |5   |ind5  |
> |         |----|------|
> |         |6   |ind6  |
> |---------|----|------|
>
> Processing Statistics
> |---------------|---|
> |Cases In       |6  |
> |---------------|---|
> |Cases Out      |4  |
> |---------------|---|
> |Cases In/Cases |1.5|
> |Out            |   |
> |---------------|---|
> |Variables In   |2  |
> |---------------|---|
> |Variables Out  |7  |
> |---------------|---|
> |Index Values   |6  |
> |---------------|---|
>
>
> LIST.
> |-----------------------------|---------------------------|
> |Output Created               |09-NOV-2006 16:55:44       |
> |-----------------------------|---------------------------|
> E_mail  ind1 ind2 ind3 ind4 ind5 ind6
>
> abc@abc   1    1    0    0    0    0
> abc@def   0    0    1    1    0    0
> abc@ghi   0    0    0    0    1    0
> abc@jkl   0    0    0    0    0    1
>
> Number of cases read:  4    Number of cases listed:  4
>
>