|
Hello All,
I often deal with SPSS data sets that come from an HTML-based research survey creation program. This program exports well to SPSS 14.0 and automatically provides Variable Names, Variable Labels, and Value Labels. The problem is that the survey program reads in the HTML text as the Variable Labels and Value Labels, with the HTML tags included (such as </b>).

Example Variable Name: q7xg_qs1_9_q7x
Example Variable Label: <b>Product One</b>(remote access) :
Example Value Label: 1 '<b>Not aware of it</b>'

I have been cleaning the HTML tags out of the Variable Labels by copying them to another program and using the wildcard "<*>" to Find/Replace the tags with a space, but I do not know of a simple way to clean up the Value Labels. Some of the files are fairly large (17,000 or more variables), and this can be a very time-consuming clean-up process. Any suggestions would be greatly appreciated.

Thanks,
Benton Smith
[hidden email]
|
I'm sorry; I left off the Subject line on my earlier request for help.
Benton
|
This would be easy to automate if you can use the Python programmability. Here is some untested code.
begin program.
import spss, spssaux, re

vardict = spssaux.VariableDict()
for v in vardict:
    vlset = vardict[v].ValueLabels
    for val in vlset:
        vlset[val] = re.sub("<.*?>", "", vlset[val])
    vardict[v].ValueLabels = vlset
end program.

It
- creates a Python variable dictionary
- loops over all the variables
- gets the value labels for each variable as a Python dictionary
- substitutes out html tags
- assigns the modified labels back to the variable

The regular expression that gets rid of the tags, "<.*?>", has one subtlety. The usual behavior is to do "greedy" matching, so with <b>abc</b>, you would match the entire string, not the tag. By using the form .*?, Python uses the shortest matching string.

You could also do the variable labels in this code.

HTH,
Jon Peck
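The greedy versus non-greedy point can be checked outside SPSS with plain Python. This is only a small sketch using the example variable label from the original post; the names label, greedy, and lazy are made up for the illustration.

```python
import re

# Example variable label from the original post.
label = "<b>Product One</b>(remote access) :"

# Greedy: ".*" runs on to the LAST ">" in the string, so everything from
# the first "<" through the end of "</b>" is removed, text included.
greedy = re.sub("<.*>", "", label)

# Non-greedy: ".*?" stops at the FIRST ">" it can, so each tag is matched
# on its own and the text between the tags survives.
lazy = re.sub("<.*?>", "", label)

print(repr(greedy))
print(repr(lazy))
```

With the greedy pattern the words "Product One" are lost along with the tags, which is why the code above uses "<.*?>".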
|
Thanks Jon,
I just downloaded Python today, so it may be a while before I get far enough up the learning curve to execute your suggestion. I'm still just trying to get SPSS and Python to sync with one another. I will let you know if I'm able to run the code you created. I have also received a suggestion to use an export of the data dictionary in the same way that I clean my Variable Labels.

I appreciate your time and recommendations,
Benton
|
Hi Jon,
I tried to run the code you provided, but it gives me the following error:

Traceback (most recent call last):
  File "<string>", line 8, in ?
  File "C:\Python24\lib\site-packages\spssaux.py", line 654, in _ValLabSet
    spss.Submit("VALUE LABELS " + spss.GetVariableName(self.index) + " " + vllist)
  File "C:\Python24\lib\site-packages\spss\spss150\spss.py", line 772, in GetVariableName
    raise SpssError,error
spss.spss150.errMsg.SpssError: [errLevel 1000] Expects an integer argument.

Can you please take a look and see what's wrong?

Many thanks,
Vlad.

--
Vlad Simion
Data Analyst
Tel: +40 720130611
|
Try this slightly modified version.
begin program.
import spss, spssaux, re

vardict = spssaux.VariableDict()
for v in vardict:
    vlset = vardict[v].ValueLabels
    for val in vlset:
        vlset[val] = re.sub("<.*?>", "", vlset[val])
    vardict[v.VariableName].ValueLabels = vlset
end program.
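For anyone who wants to try the tag stripping on a few labels before running it against a real file, here is a small stand-alone sketch that does not need SPSS. The helper names strip_tags and clean_value_labels are hypothetical, not part of spss or spssaux. As an assumption, it replaces each tag with a space (as in Benton's Find/Replace approach) and then collapses the extra whitespace, rather than deleting the tag outright as the code above does.

```python
import re

# Non-greedy tag pattern, the same one used in the thread.
TAG = re.compile(r"<.*?>")

def strip_tags(text):
    """Replace each tag with a space, then collapse runs of whitespace."""
    return " ".join(TAG.sub(" ", text).split())

def clean_value_labels(labels):
    """Return a new dict with tags stripped from every label; keys are unchanged."""
    return {value: strip_tags(label) for value, label in labels.items()}

# Sample data taken from the original post.
value_labels = {1: "<b>Not aware of it</b>"}
cleaned = clean_value_labels(value_labels)
variable_label = strip_tags("<b>Product One</b>(remote access) :")

print(cleaned)
print(variable_label)
```

The same strip_tags call could be applied to the value labels dictionary inside the loop above in place of the re.sub line, if the space-then-collapse behavior is preferred.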
|
Thank you very much Jon, it's working :)
All the best,
Vlad.
