Since my last question to this list was answered so quickly I'm going
to ask another one. I have a dataset with email addresses and departments. Each email address can be associated with one or more departments. Is there a way to restructure the data so that the email address is the unique identifier instead of the department? This is what the data looks like: Department email address 1 abc@abc 2 abc@abc 3 abc@def 4 abc@def 5 abc@ghi 6 abc@jkl email address Department 1 Department 2 abc@abc 1 2 abc@def 3 4 abc@ghi 4 abc@jkl 5 Thank you, Miriam Gerver |
At 06:53 PM 11/8/2006, Miriam L. Gerver wrote:
>I have a dataset with email addresses and departments. Each email >address can be associated with one or more departments. Is there a way >to restructure the data so that the email address is the unique >identifier instead of the department? > >[I have] >Department email address >1 abc@abc >2 abc@abc >3 abc@def >4 abc@def >5 abc@ghi >6 abc@jkl >[and I want] >email address Department 1 Department 2 >abc@abc 1 2 >abc@def 3 4 >abc@ghi 4 >abc@jkl 5 Sure. That's called going from 'long' to 'wide' organization, and it's what CASESTOVARS is for. You can click it up rather than writing the syntax: Menu Data>Restructure, select "Restructure selected cases into variables." It won't be difficult. Sort by email address (and probably department) first. NOW (OK, this is one of the bees in my bonnet), consider carefully whether you want to do it. Many operations, both data manipulation and analysis, are easier with 'long' organization; plenty often, changing from 'wide' to 'long' makes the problem easier overall, even with the additional step. Think, anyhow, what you plan to do, and which organization is better for doing it. -Onward, and good luck to you, Richard |
Thanks, Richard. I tried that and ended up with an indicator variable
for each department, rather than a variable for the first department associated with a particular email address and a second department associated with that particular email address, etc. So it gave me about 30,000 indicator variables! So now I have: email address ind1 ind2 ind3 ind4 ind5 ind6 abc@abc 1 1 0 0 0 0 abc@def 0 0 1 1 0 0 abc@ghi 0 0 0 0 1 0 abc@jkl 0 0 0 0 0 1 I'd still like the format to be: email address 1st dept listed 2nd dept listed abc@abc 1 2 abc@def 3 4 abc@ghi 5 abc@jkl 6 Any ideas? Thanks, Miriam On 11/8/06, Richard Ristow <[hidden email]> wrote: > At 06:53 PM 11/8/2006, Miriam L. Gerver wrote: > > >I have a dataset with email addresses and departments. Each email > >address can be associated with one or more departments. Is there a way > >to restructure the data so that the email address is the unique > >identifier instead of the department? > > > >[I have] > >Department email address > >1 abc@abc > >2 abc@abc > >3 abc@def > >4 abc@def > >5 abc@ghi > >6 abc@jkl > >[and I want] > >email address Department 1 Department 2 > >abc@abc 1 2 > >abc@def 3 4 > >abc@ghi 5 > >abc@jkl 6 > > Sure. That's called going from 'long' to 'wide' organization, and it's > what CASESTOVARS is for. You can click it up rather than writing the > syntax: Menu Data>Restructure, select "Restructure selected cases into > variables." It won't be difficult. Sort by email address (and probably > department) first. > > NOW (OK, this is one of the bees in my bonnet), consider carefully > whether you want to do it. Many operations, both data manipulation and > analysis, are easier with 'long' organization; plenty often, changing > from 'wide' to 'long' makes the problem easier overall, even with the > additional step. Think, anyhow, what you plan to do, and which > organization is better for doing it. > > -Onward, and good luck to you, > Richard > > |
At 09:12 PM 11/8/2006, I suggested:
>>You're going from 'long' to 'wide' organization, and it's what >>CASESTOVARS is for. You can click it up rather than writing the >>syntax: Menu Data>Restructure, select "Restructure selected cases >>into variables." It won't be difficult. Sort by email address (and >>probably department) first. At 11:06 AM 11/9/2006, Miriam L. Gerver wrote: >Thanks, Richard. I tried that and ended up with an indicator variable >for each department, rather than a variable for the first department >associated with a particular email address and a second department >associated with that particular email address, etc. So it gave me >about 30,000 indicator variables! OK: There's no way to tell what happened. SPSS menus work by generating and running SPSS syntax. When you don't get the result you wanted from the menus, give the syntax that was generated and run. Anyway, I entered your data, clicked up the command, and it ran the first time; see output below. ("Three things you should be wary of: A new kid in his prime; A man with all the answers; and code that runs first time.") Again, do think whether you want to restructure this way. This is tested; SPSS draft output. The SORT CASES and CASESTOVARS commands were clicked up from the Data>Restructure menu. * ................................................... . LIST. |-----------------------------|---------------------------| |Output Created |09-NOV-2006 11:43:29 | |-----------------------------|---------------------------| Dept E_mail 1 abc@abc 2 abc@abc 3 abc@def 4 abc@def 5 abc@ghi 6 abc@jkl Number of cases read: 6 Number of cases listed: 6 SORT CASES BY E_mail . CASESTOVARS /ID = E_mail /GROUPBY = VARIABLE . Cases to Variables |--------------------------|---------------------------| |Output Created |09-NOV-2006 11:45:09 | |--------------------------|---------------------------| Generated Variables |---------|------| |Original |Result| |Variabl |------| |e |Name | |-------|-|------| |Dept |1|Dept.1| | |-|------| | |2|Dept.2| |-------|-|------| Processing Statistics |---------------|---| |Cases In |6 | |---------------|---| |Cases Out |4 | |---------------|---| |Cases In/Cases |1.5| |Out | | |---------------|---| |Variables In |2 | |---------------|---| |Variables Out |3 | |---------------|---| |Index Values |2 | |---------------|---| LIST. |-----------------------------|---------------------------| |Output Created |09-NOV-2006 11:45:40 | |-----------------------------|---------------------------| E_mail Dept.1 Dept.2 abc@abc 1 2 abc@def 3 4 abc@ghi 5 . abc@jkl 6 . Number of cases read: 4 Number of cases listed: 4 |
In reply to this post by Miriam L. Gerver
Postscript:
At 11:06 AM 11/9/2006, Miriam L. Gerver wrote: >I ended up with an indicator variable for each department, rather than >a variable for the first department associated with a particular email >address and a second department associated with that particular email >address, etc. So it gave me about 30,000 indicator variables! I think I see how you did that. In the restructuring wizard, you . Panel 1: Selected "Restructure selected cases into variables", correctly . Panel 2: - Selected E-mail address as Identifier variable (correctly) + Selected Department as an index variable. THAT'S what's wrong, but it's an easy mistake to make: you know you want to do *something* with "Department", and those look like the only two choices. (In fact, you want to leave "Department" unselected from either category, at this point.) . Panel 3: (Sort the data or not, is OK) . Panel 4: Checked "Create indicator variables", which is the second problem. That generates the following: SORT CASES BY E_mail Dept . CASESTOVARS /ID = E_mail /INDEX = Dept /GROUPBY = VARIABLE /VIND ROOT = ind. And that gives you what you saw. This is SPSS draft output: LIST. |-----------------------------|---------------------------| |Output Created |09-NOV-2006 16:55:44 | |-----------------------------|---------------------------| Dept E_mail 1 abc@abc 2 abc@abc 3 abc@def 4 abc@def 5 abc@ghi 6 abc@jkl Number of cases read: 6 Number of cases listed: 6 SORT CASES BY E_mail Dept . CASESTOVARS /ID = E_mail /INDEX = Dept /GROUPBY = VARIABLE /VIND ROOT = ind. Cases to Variables |--------------------------|---------------------------| |Output Created |09-NOV-2006 16:55:44 | |--------------------------|---------------------------| Generated Variables |---------|----|------| |Original |Dept|Result| |Variable | |------| | | |Name | |---------|----|------| |Indicator|1 |ind1 | | |----|------| | |2 |ind2 | | |----|------| | |3 |ind3 | | |----|------| | |4 |ind4 | | |----|------| | |5 |ind5 | | |----|------| | |6 |ind6 | |---------|----|------| Processing Statistics |---------------|---| |Cases In |6 | |---------------|---| |Cases Out |4 | |---------------|---| |Cases In/Cases |1.5| |Out | | |---------------|---| |Variables In |2 | |---------------|---| |Variables Out |7 | |---------------|---| |Index Values |6 | |---------------|---| LIST. |-----------------------------|---------------------------| |Output Created |09-NOV-2006 16:55:44 | |-----------------------------|---------------------------| E_mail ind1 ind2 ind3 ind4 ind5 ind6 abc@abc 1 1 0 0 0 0 abc@def 0 0 1 1 0 0 abc@ghi 0 0 0 0 1 0 abc@jkl 0 0 0 0 0 1 Number of cases read: 4 Number of cases listed: 4 |
Thank you, Richard. The syntax you sent worked perfectly! Yes, I
didn't realize that I could leave "department" unselected. And yes, I selected the indicator variable option (for no rational reason). Miriam On 11/9/06, Richard Ristow <[hidden email]> wrote: > Postscript: > > At 11:06 AM 11/9/2006, Miriam L. Gerver wrote: > > >I ended up with an indicator variable for each department, rather than > >a variable for the first department associated with a particular email > >address and a second department associated with that particular email > >address, etc. So it gave me about 30,000 indicator variables! > > I think I see how you did that. In the restructuring wizard, you > . Panel 1: Selected "Restructure selected cases into variables", > correctly > . Panel 2: > - Selected E-mail address as Identifier variable (correctly) > + Selected Department as an index variable. THAT'S what's wrong, but > it's an easy mistake to make: you know you want to do *something* with > "Department", and those look like the only two choices. (In fact, you > want to leave "Department" unselected from either category, at this > point.) > . Panel 3: (Sort the data or not, is OK) > . Panel 4: Checked "Create indicator variables", which is the second > problem. > > That generates the following: > > SORT CASES BY E_mail Dept . > CASESTOVARS > /ID = E_mail > /INDEX = Dept > /GROUPBY = VARIABLE > /VIND ROOT = ind. > > And that gives you what you saw. This is SPSS draft output: > > LIST. > |-----------------------------|---------------------------| > |Output Created |09-NOV-2006 16:55:44 | > |-----------------------------|---------------------------| > Dept E_mail > > 1 abc@abc > 2 abc@abc > 3 abc@def > 4 abc@def > 5 abc@ghi > 6 abc@jkl > > Number of cases read: 6 Number of cases listed: 6 > > > SORT CASES BY E_mail Dept . > CASESTOVARS > /ID = E_mail > /INDEX = Dept > /GROUPBY = VARIABLE > /VIND ROOT = ind. > > > Cases to Variables > |--------------------------|---------------------------| > |Output Created |09-NOV-2006 16:55:44 | > |--------------------------|---------------------------| > > Generated Variables > |---------|----|------| > |Original |Dept|Result| > |Variable | |------| > | | |Name | > |---------|----|------| > |Indicator|1 |ind1 | > | |----|------| > | |2 |ind2 | > | |----|------| > | |3 |ind3 | > | |----|------| > | |4 |ind4 | > | |----|------| > | |5 |ind5 | > | |----|------| > | |6 |ind6 | > |---------|----|------| > > Processing Statistics > |---------------|---| > |Cases In |6 | > |---------------|---| > |Cases Out |4 | > |---------------|---| > |Cases In/Cases |1.5| > |Out | | > |---------------|---| > |Variables In |2 | > |---------------|---| > |Variables Out |7 | > |---------------|---| > |Index Values |6 | > |---------------|---| > > > LIST. > |-----------------------------|---------------------------| > |Output Created |09-NOV-2006 16:55:44 | > |-----------------------------|---------------------------| > E_mail ind1 ind2 ind3 ind4 ind5 ind6 > > abc@abc 1 1 0 0 0 0 > abc@def 0 0 1 1 0 0 > abc@ghi 0 0 0 0 1 0 > abc@jkl 0 0 0 0 0 1 > > Number of cases read: 4 Number of cases listed: 4 > > |
Free forum by Nabble | Edit this page |