SPSSX Discussion

String removal from ctables runs

Timothy Hennigar

String removal from ctables runs

OK, so I have several large CTABLES runs, on the order of 500-1000 tables each.

I can use SPSSINC MODIFY TABLES to hide the footnotes, but it takes anywhere from 20 minutes to half an hour to run on each file, and if I need to run 30 decks of these you can see the time is enormous.

I have found it's faster for me to export as HTML and search and replace the superscripts by searching for the tags, i.e., <sup>text</sup>. There are various forms it takes, so this still takes some time, but only 5-10 minutes (so maybe a third of the time of the SPSS script).

I am thinking that a Python script that opens the HTML (it's only text), searches for the opening tag, and deletes it and everything up to and including the end tag might be faster still. The HTML files are about 30 MB.

Any comments?

Thanks!

===================== To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD

Jon Peck

Re: String removal from ctables runs

MODIFY TABLES uses the scripting APIs, and it can be slow, although this depends a lot on the table sizes and other factors. However, if you export the HTML, the following code will strip all the superscripts. Set inputdir to the directory holding the files to convert and outputdir to a directory to hold the converted files. The output directory must already exist. The search expression uses non-greedy matching; otherwise you would remove everything from the first <sup> through the last </sup>. It uses the re.S (DOTALL) flag so that "." also matches newlines and the search can match across lines.

import re, glob, os

inputdir = "c:/temp"
outputdir = "c:/tempout"

for f in glob.glob(os.path.join(inputdir, "*.htm")):
    outname = os.path.join(outputdir, os.path.basename(f))
    with open(f) as infile, open(outname, "w") as outfile:
        text = infile.read()
        # Non-greedy match; re.S lets "." span line breaks
        text = re.sub(r"<sup>.*?</sup>", "", text, flags=re.S)
        outfile.write(text)

print("done")
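To see why the non-greedy quantifier and the flag matter, here is a small sketch. It assumes the superscripts are plain <sup>...</sup> elements; the sample string is made up for illustration and is not real CTABLES output. Note that in Python's re module it is re.S (DOTALL), not re.M, that lets "." match a newline.

```python
import re

# Hypothetical sample; a real CTABLES HTML export will look different.
sample = "Mean<sup>a</sup> 4.2\nCount<sup>\nb</sup> 17"

# Greedy: runs from the first <sup> to the last </sup>, eating the data between.
greedy = re.sub(r"<sup>.*</sup>", "", sample, flags=re.S)

# Non-greedy with DOTALL: each superscript is removed separately,
# including the one that is split across two lines.
lazy = re.sub(r"<sup>.*?</sup>", "", sample, flags=re.S)

# Non-greedy with MULTILINE only: "." stops at the newline,
# so the superscript that spans two lines is left in place.
lazy_multiline_only = re.sub(r"<sup>.*?</sup>", "", sample, flags=re.M)

print(repr(greedy))
print(repr(lazy))
print(repr(lazy_multiline_only))
```

If every superscript in the exported file sits on a single line, the two flags give the same result; the difference only shows up when a tag pair is split across lines.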


Jon K Peck
[hidden email]

Timothy Hennigar

Re: String removal from ctables runs

Wow, that’s super fast (only a few seconds). FANTASTIK! I can likely modify that for other uses also.

Thanks!
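One way the script might be adapted for other uses is to pass the tag name in as a parameter. The following is a minimal sketch under stated assumptions: strip_tag is a hypothetical helper name (not part of SPSS or the script above), the elements to remove are not nested inside elements of the same name, and opening tags may carry attributes or differ in case.

```python
import re

def strip_tag(html, tag):
    """Remove every <tag ...>...</tag> element, including its content.

    Hypothetical helper for illustration only.
    """
    name = re.escape(tag)  # guard against regex metacharacters in the name
    # \b stops "sup" from matching "<support>"; [^>]* allows attributes;
    # .*? is non-greedy so each element is removed on its own.
    pattern = r"<%s\b[^>]*>.*?</%s\s*>" % (name, name)
    return re.sub(pattern, "", html, flags=re.S | re.I)

print(strip_tag('x<SUP class="fn">a</sup>y<sub>2</sub>', "sup"))
```

A regular expression like this is a pragmatic shortcut for machine-generated HTML with a predictable shape; for arbitrary HTML a real parser would be the safer choice.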
