Importing HTML

DKUKEC
Hi Everyone,

We have a report in HTML format that was generated by a public safety
agency.  Regrettably, the format of the data prevents us from aggregating
records or conducting other descriptive analyses.  The following is an
example of the way the data are generated by the agency – they are unable
to provide the data in any other format.  I should note that each record
starts with “-- ACKNOWLEDGEMENT--” and ends with “--END--”.

-- ACKNOWLEDGEMENT--

ENTER ACCEPTED AS FOLLOWING RECORD
LOST GUN
   SERIAL NO: TBP12345                               LOSS DATE: 11/27/2009
        MAKE: SAT                                   ENTRY DATE: 11/27/2009
       MODEL: TP123                                        PCN: A123456789
     CALIBER: 9                                            NIC: A123456789
        TYPE: PI
     CASE NO: 08-1234567
ENTERING MNE: A12345678
ENTERING AGY: AL12345678K7 – WINNIPEG COUNTY SHERIFF'S OFFICE
  NOTIFY AGY: NO NOTIFY/PUBLICLY AVAILABLE
        MISC: 2009-11WPG12345  TAURUS 3" BBL TITANIUM - FULLY LOADED

--END--

I was wondering if anyone could suggest syntax that would allow us to
import the data in tabular format. The desired dataset would contain the
following variables, which correspond to the fields reported above.


RTYPE     DATEREP      ENTRYD       PCN         NIC         SERIALNO
LOST GUN  27-NOV-2009  27-NOV-2009  A123456789  A123456789  TBP12345

VARS continued...

MAKE  MODEL  CALIBRE  TYPE  CASENO      ENTERINGMNE
SAT   TP123  9        PI    08-1234567  A12345678

VARS continued...

ENTERINGAGY                                    NOTIFYAGY
AL12345678K7-WINNIPEG COUNTY SHERIFF’S OFFICE  NO NOTIFY/PUBLICLY AVAILABLE

VARS continued...

MISC
2009-11WPG12345  TAURUS 3" BBL TITANIUM - FULLY LOADED

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD

Re: Importing HTML

Maguin, Eugene
Damir,

I can offer some incomplete suggestions on where to start. Please understand that I don't have experience with HTML-format data; others who do will likely have better, more complete advice.

The big problem is the HTML format. I don't think SPSS can read HTML data; I looked at Get Data and found nothing there. I'd try opening the file in Word and saving it as a .txt file. I'm only guessing that this would get rid of the HTML crap. If not, try a text editor, maybe Notepad.

Once you have a text file, the next big question is whether the record structure is absolutely constant. That is, is the number of records (lines) from one '-- ACKNOWLEDGEMENT--' to the next always the same, and is the layout of the fields within the set of records that make up a case always the same? If so, I'd use Data List, I think, though I'd guess that Get Data /type=txt would work also.
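One way to answer the "is the structure constant?" question before writing any SPSS syntax is to count the lines in each record. The following is only a rough Python sketch: the function name `record_line_counts` is made up for illustration, and it assumes the HTML has already been converted to plain text with the markers spelled as in the sample record.

```python
from collections import Counter

START = "ACKNOWLEDG"  # matches both ACKNOWLEDGEMENT and ACKNOWLEDGMENT
END = "--END--"


def record_line_counts(lines):
    """Tally how many non-blank lines each record contains."""
    counts = Counter()
    size = None  # None means we are currently outside a record
    for line in lines:
        text = line.strip()
        if START in text:
            size = 0  # a new record begins
        elif size is not None and text.replace(" ", "") == END:
            counts[size] += 1  # record finished, remember its length
            size = None
        elif size is not None and text:
            size += 1
    return counts


# Shortened sample, for illustration only
sample = """\
-- ACKNOWLEDGEMENT--

ENTER ACCEPTED AS FOLLOWING RECORD
LOST GUN
   SERIAL NO: TBP12345                               LOSS DATE: 11/27/2009
        TYPE: PI

--END--
""".splitlines()

print(record_line_counts(sample))
```

If the result holds a single line count, every record has the same length and a fixed-layout read is plausible. Several different counts suggest the case structure varies.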

If the case structure is not constant, then you have a hard job ahead of you. If there is structure within RTYPE, then you might be able to use an input program structure. Look specifically at the Input Program, Reread and File Type commands in the syntax reference.

If there is not enough structure, even within the RTYPE variable, then you have a hell of a job. I'd read the file as single string records and then start writing syntax to eventually be able to either write out a file for a new read operation or run a casestovars operation. At that point you are on your own in seriously tall grass.
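For comparison, here is a hedged Python sketch of the same "read each line as a string and pick it apart" idea. It is not SPSS syntax. It assumes every field appears as an uppercase label followed by a colon, that two fields on one line are separated by at least two spaces, and that the unlabelled line after "ENTER ACCEPTED AS FOLLOWING RECORD" is the record type. The names `FIELD` and `parse_records` are invented for this example.

```python
import re

# LABEL: value, where the value stops before "two or more spaces + next LABEL:"
FIELD = re.compile(r"([A-Z][A-Z ]*?):\s*(.*?)(?=\s{2,}[A-Z][A-Z ]*:|$)")


def parse_records(lines):
    """Yield one dict per record found between the start and end markers."""
    record = None  # None while we are outside a record
    for line in lines:
        text = line.strip()
        if "ACKNOWLEDG" in text:
            record = {}
        elif record is None or not text:
            continue
        elif text.replace(" ", "") == "--END--":
            yield record
            record = None
        elif FIELD.match(text):
            for label, value in FIELD.findall(text):
                record[label] = value.strip()
        elif not text.startswith("ENTER ACCEPTED"):
            record["RTYPE"] = text  # assumption: unlabelled line is the record type


# Shortened sample, for illustration only
sample = """\
-- ACKNOWLEDGEMENT--

ENTER ACCEPTED AS FOLLOWING RECORD
LOST GUN
   SERIAL NO: TBP12345                               LOSS DATE: 11/27/2009
     CALIBER: 9                                            NIC: A123456789
  NOTIFY AGY: NO NOTIFY/PUBLICLY AVAILABLE
        MISC: 2009-11WPG12345  TAURUS 3" BBL TITANIUM - FULLY LOADED

--END--
""".splitlines()

records = list(parse_records(sample))
for label, value in records[0].items():
    print(f"{label!r}: {value!r}")
```

Because the pattern only splits where two or more spaces are followed by something that looks like a label, free text such as the MISC line stays in one piece. A value that itself contains an uppercase word followed by a colon would break this assumption.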

But, I'd bet somebody knows a simple way to read html format data.

Gene Maguin




Re: Importing HTML

Garry Gelade
In reply to this post by DKUKEC
Damir

I would look at the possibility of converting your HTML to text as a first step. You can buy HTML-to-text converters, e.g. Detagger (http://www.jafsoft.com/detagger/), Convert Doc (http://www.softinterface.com/), and Total HTML Converter (http://www.coolutils.com/). I'm sure there are others as well.

You will probably need to experiment to find a converter that offers some flexibility in controlling the output. Once you have produced a text file in some sort of organized format, it should be relatively easy to convert it into a form that SPSS can read such as a csv or mdb file, or to write a Python routine to complete the reformatting.
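As a rough illustration of what such a Python routine might look like for the last step, the sketch below writes already-parsed records to CSV, using the variable names requested in the original post. The mapping of LOSS DATE to DATEREP is a guess (both dates are identical in the sample), and `COLUMNS`, `to_spss_date` and `write_csv` are hypothetical names used only here.

```python
import csv
import io
from datetime import datetime

# Report label -> requested variable name (LOSS DATE -> DATEREP is assumed)
COLUMNS = {
    "RTYPE": "RTYPE", "LOSS DATE": "DATEREP", "ENTRY DATE": "ENTRYD",
    "PCN": "PCN", "NIC": "NIC", "SERIAL NO": "SERIALNO",
    "MAKE": "MAKE", "MODEL": "MODEL", "CALIBER": "CALIBRE",
    "TYPE": "TYPE", "CASE NO": "CASENO", "ENTERING MNE": "ENTERINGMNE",
    "ENTERING AGY": "ENTERINGAGY", "NOTIFY AGY": "NOTIFYAGY", "MISC": "MISC",
}
DATE_LABELS = {"LOSS DATE", "ENTRY DATE"}
MONTHS = ("JAN", "FEB", "MAR", "APR", "MAY", "JUN",
          "JUL", "AUG", "SEP", "OCT", "NOV", "DEC")


def to_spss_date(value):
    """Convert MM/DD/YYYY to DD-MON-YYYY, e.g. 11/27/2009 to 27-NOV-2009."""
    d = datetime.strptime(value, "%m/%d/%Y")
    return f"{d.day:02d}-{MONTHS[d.month - 1]}-{d.year}"


def write_csv(records, handle):
    """Write one CSV row per record dict; missing fields become empty cells."""
    writer = csv.DictWriter(handle, fieldnames=list(COLUMNS.values()))
    writer.writeheader()
    for record in records:
        row = {}
        for label, name in COLUMNS.items():
            value = record.get(label, "")
            if label in DATE_LABELS and value:
                value = to_spss_date(value)
            row[name] = value
        writer.writerow(row)


# One partly filled record, for illustration only
example = [{"RTYPE": "LOST GUN", "LOSS DATE": "11/27/2009", "SERIAL NO": "TBP12345"}]
buffer = io.StringIO()
write_csv(example, buffer)
print(buffer.getvalue())
```

To write a real file, open it with `newline=""` and pass the file handle in place of the `StringIO` buffer. The csv module quotes values that contain commas or double quotes, which matters for free-text fields like MISC.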

Garry Gelade
Business Analytic Ltd.





Re: Importing HTML

Bruce Weaver
Administrator
In reply to this post by DKUKEC
The same question was posted to the SPSS newsgroup.  Assuming the input file is a "text" file that looks like the sample in the original post, I've posted some syntax in the newsgroup that seems to do the job.  You can see it (and the entire newsgroup thread) here:

http://groups.google.com/group/comp.soft-sys.stat.spss/browse_frm/thread/21c3f768e8c6a2e7/f9e8df627cea5efa#f9e8df627cea5efa


--
Bruce Weaver
bweaver@lakeheadu.ca
http://sites.google.com/a/lakeheadu.ca/bweaver/

"When all else fails, RTFM."

PLEASE NOTE THE FOLLOWING: 
1. My Hotmail account is not monitored regularly. To send me an e-mail, please use the address shown above.
2. The SPSSX Discussion forum on Nabble is no longer linked to the SPSSX-L listserv administered by UGA (https://listserv.uga.edu/).

Re: Importing HTML

Albert-Jan Roskam
In reply to this post by Garry Gelade
Hi,
 
In Python you can make a BeautifulSoup object from your html data. You have to install that module first.
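With BeautifulSoup installed, something like `BeautifulSoup(html, "html.parser").get_text()` returns the text without the tags. If installing a module is a problem, the standard library's `html.parser` can do a similar job. The sketch below uses that stdlib alternative rather than BeautifulSoup. The class and function names are made up, and the `<pre>` wrapper in the test page is only a guess at how the agency's report is laid out.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the text content of an HTML document, dropping the tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "br":
            self.chunks.append("\n")  # keep explicit line breaks

    def handle_data(self, data):
        self.chunks.append(data)


def html_to_text(html):
    """Return the text of an HTML string with all tags removed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.chunks)


# Hypothetical page layout, for illustration only
page = (
    "<html><body><pre>-- ACKNOWLEDGEMENT--\n"
    "LOST GUN\n"
    "   SERIAL NO: TBP12345\n"
    "--END--</pre></body></html>"
)
print(html_to_text(page))
```

The resulting text keeps the original line breaks and spacing, so it can be saved to a .txt file and read from there.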

Cheers!!
Albert-Jan

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
All right, but apart from the sanitation, the medicine, education, wine, public order, irrigation, roads, a fresh water system, and public health, what have the Romans ever done for us?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
