|
Hi Everyone,
We have a report in HTML format that was generated by a public safety agency. Regrettably, the format of the data prevents us from aggregating records or conducting other descriptive analyses. The following is an example of the way the data are generated by the agency; they are unable to provide the data in any other format. I should note that each record starts with "-- ACKNOWLEDGEMENT--" and ends with "--END--".

-- ACKNOWLEDGEMENT-- ENTER ACCEPTED AS FOLLOWING RECORD LOST GUN SERIAL NO: TBP12345 LOSS DATE: 11/27/2009 MAKE: SAT ENTRY DATE: 11/27/2009 MODEL: TP123 PCN: A123456789 CALIBER: 9 NIC: A123456789 TYPE: PI CASE NO: 08-1234567 ENTERING MNE: A12345678 ENTERING AGY: AL12345678K7 – WINNIPEG COUNTY SHERIFF'S OFFICE NOTIFY AGY: NO NOTIFY/PUBLICLY AVAILABLE MISC: 2009-11WPG12345 TAURUS 3" BBL TITANIUM - FULLY LOADED --END--

I was wondering if anyone could suggest syntax that would allow us to import the data into a tabular format. The desired database would include the following variables, which correspond to the fields reported above.

RTYPE     DATEREP      ENTRYD       PCN         NIC         SERIALNO
LOST GUN  27-NOV-2009  27-NOV-2009  A123456789  A123456789  TBP12345

VARS continued...
MAKE  MODEL  CALIBRE  TYPE  CASENO      ENTERINGMNE
SAT   TP123  9        PI    08-1234567  A12345678

VARS continued...
ENTERINGAGY                                    NOTIFYAGY
AL12345678K7-WINNIPEG COUNTY SHERIFF'S OFFICE  NO NOTIFY/PUBLICLY AVAILABLE

VARS continued...
MISC
2009-11WPG12345 TAURUS 3" BBL TITANIUM - FULLY LOADED

=====================
To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command INFO REFCARD |
|
Damir,
I'd like to offer some incomplete suggestions on where to start. Please understand that I don't have experience with HTML-format data; I'm sure others do, and they will have better, more complete advice.

The big problem is the HTML format. I don't think SPSS can read HTML data; I looked at GET DATA and found nothing there. I'd try opening the file in Word and saving it as a .txt file. I'm only guessing that this would get rid of the HTML markup. If not, try a text editor, maybe Notepad.

Once you have a text file, the next big question is whether the record structure is absolutely constant. That is, is the number of records (lines) from '-- ACKNOWLEDGEMENT--' to '-- ACKNOWLEDGEMENT--' always the same, and is the layout of the fields within the set of records that make up a case always the same? If so, then I'd use DATA LIST, I think, but I'd guess that GET DATA /TYPE=TXT would also work.

If the case structure is not the same, then you have a hard job ahead of you. If there is structure within RTYPE, then you might be able to use an input program structure. Look specifically at the INPUT PROGRAM, REREAD and FILE TYPE commands in the syntax reference. If there is not enough structure, even within the RTYPE variable, then you have a hell of a job. I'd read the file as single string records and then start writing syntax to eventually be able to write out a file for either a new read operation or a CASESTOVARS operation. At that point you are on your own in seriously tall grass.

But I'd bet somebody knows a simple way to read HTML-format data.

Gene Maguin |
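As one possible alternative to the Word or Notepad route, the tags can be stripped with Python's standard library, and the "is the record structure constant?" question can be checked by counting the lines in each record. This is only a minimal sketch: the function names `html_to_text` and `record_line_counts` are made up for illustration, the list of tags treated as line breaks is a guess, and the agency's actual HTML markup is not known here, so the line breaks it produces may differ from the real report.

```python
from html.parser import HTMLParser

# Assumption: these tags mark line breaks in the agency's report.
BLOCK_TAGS = {"br", "p", "div", "tr", "li", "pre", "table"}


class TextExtractor(HTMLParser):
    """Collect text nodes and turn block-level tags into newlines."""

    def __init__(self):
        super().__init__(convert_charrefs=True)  # decodes &amp; etc.
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_data(self, data):
        self.parts.append(data)


def html_to_text(html):
    """Return the visible text of an HTML string, one non-empty line per row."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    lines = (line.strip() for line in "".join(parser.parts).splitlines())
    return "\n".join(line for line in lines if line)


def record_line_counts(text):
    """Count the lines between each ACKNOWLEDGEMENT marker and its END marker."""
    counts, current = [], None
    for line in text.splitlines():
        compact = line.replace(" ", "").upper()
        if "ACKNOWLEDGEMENT" in compact:
            current = 0                      # a new record starts
        elif compact == "--END--" and current is not None:
            counts.append(current)           # record closed
            current = None
        elif current is not None:
            current += 1                     # a line inside the record
    return counts
```

If `set(record_line_counts(text))` contains a single value, every record has the same number of lines and a fixed-layout read (DATA LIST or GET DATA) is at least plausible; more than one value suggests the variable-structure case described above. The check only looks at line counts, not at which fields appear on each line.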
|
In reply to this post by DKUKEC
Damir
I would look at the possibility of converting your HTML to text as a first step. You can buy HTML-to-text converters, e.g. Detagger (http://www.jafsoft.com/detagger/), Convert Doc (http://www.softinterface.com/) and Total HTML Converter (http://www.coolutils.com/). I'm sure there are others as well. You will probably need to experiment to find a converter that offers some flexibility in controlling the output.

Once you have produced a text file in some sort of organized format, it should be relatively easy to convert it into a form that SPSS can read, such as a csv or mdb file, or to write a Python routine to complete the reformatting.

Garry Gelade
Business Analytic Ltd. |
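A Python routine of the kind mentioned above might look roughly like the following. It is a sketch under several assumptions, not a tested solution for the real report: the field labels are taken from the single sample record in the original post and may be incomplete, the helper names (`parse_record`, `parse_report`, `to_csv`) are invented for illustration, and it assumes no field value ever contains one of the labels followed by a colon. It locates each record between the ACKNOWLEDGEMENT and END markers, splits it on the known labels regardless of how the lines are laid out, and reformats the two dates as DD-MON-YYYY.

```python
import csv
import io
import re
from datetime import datetime

# Report label -> desired variable name (labels taken from the sample record).
LABELS = {
    "SERIAL NO": "SERIALNO", "LOSS DATE": "DATEREP", "MAKE": "MAKE",
    "ENTRY DATE": "ENTRYD", "MODEL": "MODEL", "PCN": "PCN",
    "CALIBER": "CALIBRE", "NIC": "NIC", "TYPE": "TYPE", "CASE NO": "CASENO",
    "ENTERING MNE": "ENTERINGMNE", "ENTERING AGY": "ENTERINGAGY",
    "NOTIFY AGY": "NOTIFYAGY", "MISC": "MISC",
}
COLUMNS = ["RTYPE", "DATEREP", "ENTRYD", "PCN", "NIC", "SERIALNO", "MAKE",
           "MODEL", "CALIBRE", "TYPE", "CASENO", "ENTERINGMNE",
           "ENTERINGAGY", "NOTIFYAGY", "MISC"]
MONTHS = "JAN FEB MAR APR MAY JUN JUL AUG SEP OCT NOV DEC".split()

# Everything between the start and end markers of one record.
RECORD_RE = re.compile(r"--\s*ACKNOWLEDGE?MENT\s*--(.*?)--\s*END\s*--", re.DOTALL)
# Any known label followed by a colon; the group keeps the label in re.split.
LABEL_RE = re.compile(r"\b(" + "|".join(re.escape(k) for k in LABELS) + r"):")


def parse_record(body):
    """Turn the text of one record into a dict keyed by variable name."""
    flat = " ".join(body.split())            # collapse line breaks and spacing
    pieces = LABEL_RE.split(flat)            # [header, label, value, label, ...]
    row = {"RTYPE": pieces[0].rpartition("RECORD")[2].strip()}
    for label, value in zip(pieces[1::2], pieces[2::2]):
        row[LABELS[label]] = value.strip()
    for key in ("DATEREP", "ENTRYD"):        # 11/27/2009 -> 27-NOV-2009
        try:
            d = datetime.strptime(row.get(key, ""), "%m/%d/%Y")
        except ValueError:
            continue                          # leave unparseable dates as-is
        row[key] = f"{d.day:02d}-{MONTHS[d.month - 1]}-{d.year}"
    return row


def parse_report(text):
    """Return one dict per record found in the plain-text report."""
    return [parse_record(m.group(1)) for m in RECORD_RE.finditer(text)]


def to_csv(rows):
    """Serialize the rows as CSV text with one column per desired variable."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS, restval="")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

The text returned by `to_csv(parse_report(text))` can be saved to a file and read into SPSS as an ordinary delimited file. Splitting on labels rather than on column positions is a deliberate choice here, since the line layout of the converted text is not known in advance.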
|
Administrator
|
In reply to this post by DKUKEC
The same question was posted to the SPSS newsgroup. Assuming the input file is a plain text file that looks like the sample record in the original post, I've posted some syntax in the newsgroup that seems to do the job. You can see it (and the entire newsgroup thread) here:
http://groups.google.com/group/comp.soft-sys.stat.spss/browse_frm/thread/21c3f768e8c6a2e7/f9e8df627cea5efa#f9e8df627cea5efa
--
Bruce Weaver
bweaver@lakeheadu.ca
http://sites.google.com/a/lakeheadu.ca/bweaver/
"When all else fails, RTFM."
PLEASE NOTE THE FOLLOWING:
1. My Hotmail account is not monitored regularly. To send me an e-mail, please use the address shown above.
2. The SPSSX Discussion forum on Nabble is no longer linked to the SPSSX-L listserv administered by UGA (https://listserv.uga.edu/). |
