|
Dear SPSS users:
I have used the Transform - Compute Variable function to compute a variable, and now I want to see the computation for that variable, but I am unable to do this. How does one see the actual computation, or formula, used to compute the variable?
Thank you!
- Swetal Sindhvad
=====================
To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD |
|
Dear Swetal,
You can see the transformations in the Journal (the log), but only if journaling is turned on and set to Append mode. In the menu, go to Edit -> Options and look for the Journal settings (on the General tab or the File Locations tab, depending on your version). Then open the journal file in a text editor and find the commands.
Best regards
Jan |
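Jan's suggestion of opening the journal in a text editor and hunting for the commands can also be scripted. The following is only a minimal sketch in Python, assuming the journal is a plain-text file of SPSS syntax in which each command ends with a period. The function name `find_compute_commands` and the sample text are made up for illustration; the real journal path is whatever Edit -> Options shows on your installation.

```python
import re

def find_compute_commands(journal_text, var_name):
    """Return every COMPUTE command in journal_text that assigns to var_name.

    Hypothetical helper: assumes each command ends with a period and that a
    command may continue over several lines.
    """
    start = re.compile(r"^\s*COMPUTE\s+" + re.escape(var_name) + r"\s*=",
                       re.IGNORECASE)
    found, current = [], None
    for line in journal_text.splitlines():
        if current is None and start.match(line):
            current = []                      # a matching command begins here
        if current is not None:
            current.append(line.strip())
            if line.rstrip().endswith("."):   # command terminator reached
                found.append(" ".join(current))
                current = None
    return found

# Made-up journal contents, for illustration only.
SAMPLE_JOURNAL = """\
GET FILE='survey.sav'.
COMPUTE total = q1 + q2
  + q3.
EXECUTE.
compute ratio=total / 3.
"""

for command in find_compute_commands(SAMPLE_JOURNAL, "total"):
    print(command)
```

To use it on a real journal, read the file into a string first (for example with `open(path).read()`) and pass the name of the computed variable.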
|
In reply to this post by Swetal Sindhvad
Good Morning List,
I would appreciate some advice on how to read in text data where header information is followed by the related data, like so:

Header information:
Line1: Var1 value, var2 value
Line2: Var3 value
Line3: var4 value
Line4: var5 value, var6 value
Line5: var6 value, var7 value, var8 value
Data:
Var9 var10, var11,...,varn

The pattern repeats after about 21 lines of data, with different header information each time. I have already tried the multi-line read-in, but because of the differing formats, the resulting data is ridiculously difficult to work with. I would appreciate any suggestions.
TIA
Mike |
|
In reply to this post by Swetal Sindhvad
Two other options,
1) Use 'Paste' instead of running the command from the dialog, so that the syntax is pasted into a syntax window, or
2) Set your Viewer options (Edit--Options, Viewer tab) to "Display commands in log" (the check box at the bottom left); the syntax will then be included automatically in your output file.
Melissa |
|
In reply to this post by Roberts, Michael
My advice would be to use a Python script to read the lines of data and create a list containing all the values, e.g. [2,3,4,ade,45,er]. From there you could loop over the list to write out a new file with one entry per line. Then use a standard import from SPSS to read the data.
Mike |
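The approach described above (read the lines, collect every value into one list, write the list back out one entry per line) might look roughly like this in Python. It is a sketch only: the function names are hypothetical, and it assumes values are separated by commas or whitespace, which would split multi-word values such as names or addresses.

```python
import re

def flatten_values(lines):
    """Split each line on commas and whitespace; return all values in one list."""
    values = []
    for line in lines:
        values.extend(tok for tok in re.split(r"[,\s]+", line.strip()) if tok)
    return values

def write_one_per_line(values, path):
    """Write one value per line, ready for a plain text import."""
    with open(path, "w", encoding="utf-8") as out:
        out.write("\n".join(values) + "\n")

# Made-up input lines matching the example list in the post.
print(flatten_values(["2, 3, 4", "ade,45 er"]))
```

Note that flattening discards which header a value belonged to, so on its own it does not keep header information attached to the data lines.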
|
In reply to this post by Roberts, Michael
Michael,
Perhaps others will better understand what you are describing. I have a couple of clarifying questions. You say the structure is the following.

Header information:
Line1: Var1 value, var2 value
Line2: Var3 value
Line3: var4 value
Line4: var5 value, var6 value
Line5: var6 value, var7 value, var8 value
Data:
Var9 var10, var11,...,varn

May I assume that the data actually look like this?

Var1 2.073, var2 1.999
Var3 .0087
var4 99999
var5 234.5, var6 -12.89
var6 -1.00, var7 873.2, var8 10000
2.78 3.238, -1.34,..., 3.4

That is, the header section contains a combination of text that is the names of the variables, followed by the values of the variables. Do you want to keep the header information? Or can it be discarded? Is there a single line of data containing the values for var9 to varn in each header, data sequence?
Gene Maguin |
|
Gene,
Thank you for the question. Your reading of the data layout for the header is right on, although the data is alphanumeric. The second set of data, however, is on two lines (two lines make a case) and includes a field header for each variable. One twist is that the header data is not consistent: while I show five lines of data, sometimes one of those lines is not included, rather like an additional address element that does not exist for a particular address. This layout has played havoc with my attempts to read the data into SPSS!

The problem is that I need to keep the header information, since each of the subsequent data cases is associated with the header data. This data file was generated by our systems people from a mainframe as a report, but it is practically useless in its present form, and any help would be very much appreciated!
TIA
Mike |
|
At 07:44 PM 3/11/2008, Roberts, Michael wrote:
[notes on data layout omitted]
>The second set of data is on two lines (two lines make a case), and
>includes a field header for each variable, however. One twist is
>that the header data is not consistent - while I show five lines of
>data, sometimes one of those lines is not included - sort of like an
>additional address element that does not exist for an address. [And]
>I need to keep the header information,

It looks like Gene's onto this one, but it would be clearer if you'd post a few cases, including ones with differing numbers of header lines. Right now, I'd think about using an INPUT PROGRAM, but I haven't seen your data, nor looked at your problem as hard as Gene has. |
|
In reply to this post by Roberts, Michael
Michael,
When you look at your data file, how do you know 1) where one case ends and another begins, and 2) within a case, where the header ends and the two data lines begin? I think the problem is finding a pattern in the data structure that you can use to structure the data-reading operation. It also sounds as if the data structure varies; that will pose additional problems.

While you are working on using SPSS to read this, another possibility to investigate is going back to the systems people and asking them whether they would write a post-processing program to restructure the data more to SPSS's liking.

You might also begin reading up on the Input Program. I'm almost certain that you will need a custom input procedure.

Also, I agree with Richard on the usefulness of posting some data. I'd suggest that you post 3-4 cases of data, selected to illustrate the variability in the data structure.
Gene Maguin |
|
In reply to this post by Richard Ristow
Thank you all for responding. Here is what the simulated data looks like, warts and all! The second set of header data is missing the suite information, so there are only 4 lines of data:

1RUN DATE: 02/06/08 blah blah MANAGEMENT INFORMATION SYSTEM PAGE: 1
 blah GROUP blah PROJECT AD HOC 08-035
-blah BASE: 0100053 blah blah INC. 1234 WEST blah STREET SUITE 123 TAMPA ,FL 33607-4173
-blah: 010005300 COUNTY: 29 COUNTY DESCRIPTION: HILLSBOROUGH
-INDIVIDUAL NAME ADDRESS TYPE TPI TOMY ZIP STATUS PHONE IND BDATE EDATE
+____________________________________________________________________________________________________________________________________
 123456700 blah blah INC blah blah 1234 W blah AVENUE TAMPA FL 33000-0000 90 1598841579 1 555-555-5555 07/01/06 99/99/99
 543210101 Doe, John J., Md. 1234 N blah STREE PLANT CITY FL 33555-4302 27 1003804675 1 123-456-7890 07/01/06 99/99/99
1RUN DATE: 02/06/08 blah blah MANAGEMENT INFORMATION SYSTEM PAGE: 1185
 blah GROUP blah PROJECT AD HOC 08-035
-blah BASE: 0150009 more blah blah, INC 5678 SW ATH STREET
0CORAL GABLES ,FL 33134
-blah: 015000900 COUNTY: 06 COUNTY DESCRIPTION: BROWARD
-INDIVIDUAL NAME ADDRESS TYPE TPI TOMY ZIP STATUS PHONE IND BDATE EDATE
+____________________________________________________________________________________________________________________________________
 011234450 Junior, John DO blah blah 3456 S Blah ROAD FT LAUDERDALE FL 33333-0000 26 1235128794 1 777-123-3456 07/01/06 99/99/99
 023456700 Brown, John A The blah Group 1000 W Anywhere RD CORAL SPRINGS FL 33000-0000 25 1740275528 1 555-444-5555 07/01/06 99/99/99

TIA
Mike |
|
I've said it before and I'll say it again: use Python. A small program will do this. |
|
To expand: make a map listing the variables, skip over all the irrelevant info at the start (if it's fixed), then read in what you need. |
|
In reply to this post by Roberts, Michael
At 05:57 PM 3/12/2008, Roberts, Michael wrote:
>Thank you all for responding. Here is what the simulated data looks
>like, warts and all!

OK. This is printer-image data, in the form for the line printers that were standard for a long time on IBM mainframes and elsewhere. It's hard to tell your line length, but on most of these printers the maximum line length was 132 characters. Records were usually 133 characters long, with the first character controlling the printer:

blank - Print on next line
1     - Print at head of a new page
0     - Skip a line, then print
-     - Skip 2 lines, then print
+     - Print on top of the previous line (used for underscores, etc.)

I see the following types of line:

A. Page header - two lines, or three? (That is, should "AD HOC 08-035" be on the second line?) This isn't really what they look like; I've shortened them to fit here:

1RUN DATE: 02/06/08 blah blah MANAGEMENT INFORMATION SYSTEM PAGE: 1
 blah GROUP blah PROJECT
 AD HOC 08-035

Conjecture: No information from these lines is needed in the final dataset.

B. "Base" groups of lines:

-blah BASE: 0100053 blah blah INC.
 1234 WEST blah STREET
 SUITE 123
 TAMPA ,FL 33607-4173

Conjecture: The "base" is that 7-digit number. It, and the name and address on these four lines, are retained and apply to all subsequent data until the next "base" group.

C. "County" lines:

-blah: 010005300 COUNTY: 29 COUNTY DESCRIPTION: HILLSBOROUGH

Conjecture: There may be several counties for one "base", though you have only one per base in the example data. County data is retained and applies to all subsequent data until the next "base" group or "county" line.

D. "Individual" headers and records: It looks like there are two lines with field headers, then a line that prints underscores under them, then any number of lines, each with data for an individual. Here is what I can see, with "element" names from a set of header lines, and data ("values") for two individuals:

Element     Value               Value
INDIVIDUAL  011234450           023456700
NAME        Junior, John DO     Brown, John A
??????      blah blah           The blah Group
ADDRESS     3456 S Blah ROAD    1000 W Anywhere RD
(wrapped)   FT LAUDERDALE FL    CORAL SPRINGS FL
TYPE        ????                ????
TPI         ????                ????
TOMY        ????                ????
ZIP         33333-0000          33000-0000
STATUS      26                  25
??????      1235128794          1740275528
??????      1                   1
PHONE       777-123-3456        555-444-5555
IND         ????                ????
BDATE       07/01/06            07/01/06
EDATE       99/99/99            99/99/99

. I've matched values to element names partly by order, partly by morphology ("777-123-3456" has the form of a phone number, for example)
. The addresses above are on two lines, but they're on a single line in the data; I've wrapped them so the lines fit, above
. Where there's "??????" for an element name, there's a data element that doesn't seem to match any name in the headers, after making the assignments that seem clearly right
. Where there's "????" for a value, I don't see any data that seems to match the name. (Are TYPE, TPI and TOMY part of the address, somehow?)

...............

I expect you need logic like this; or, anyway, this is how I did it (in SAS) the last time I had to:

A. Read a line. Classify it into one of the above categories.
B. If it's a page header line, or one of the lines of element names for individuals, ignore it. (However, if it's a line of element names, that may be useful as an indication that individual data will follow.)
C. If it's the start of a "base" group, as indicated by "BASE:" being the second token on the line, read the values from the four lines in the group, and keep them (LEAVE statement) for future use. (What is the meaning of the "blah" that precedes the word "BASE:"? Is it a value that needs to be kept?)
D. If it's a "county" line ("COUNTY:" is the 4th token), read the county (number) and description, and save for future use. Does the "blah" that begins the line need to be saved?
E. If it's lines for an individual (probably indicated by preceding lines of element names with underscores), read the elements as above, except correct and fill in the things I couldn't get. Write a record (END CASE) with the individual data plus the last preceding "base" and "county" data.

It'd be an INPUT PROGRAM, of course. Python? Python's probably better suited to writing parsers in than native SPSS is. I'm not sure how you'd do the path from external file to Python to SPSS data file. Use Python without the SPSS interface to pre-process the file into easier-to-recognize lines, write that out, then read it in SPSS? Or how would you do it?

-Onward, ever onward,
Richard |
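Steps A to E above could be sketched as a small pre-processing script in Python, run outside SPSS. This is only an illustration under several assumptions that the thread does not confirm: the physical line breaks in `SAMPLE` are guessed from the carriage-control characters, "COUNTY:" is looked for anywhere on the line rather than at a fixed token position, an individual record is recognised by a 9-digit first token, and the individual fields are left as one unparsed string because the column positions are not known. The names `parse_report` and `SAMPLE` are hypothetical.

```python
import re

def parse_report(lines):
    """Yield one dict per individual, carrying forward the last base and county."""
    base, base_text = None, []
    county, county_desc = None, None
    state = "skip"                        # one of: skip, base, individuals
    for raw in lines:
        control, text = raw[:1], raw[1:].strip()   # column 1 is printer control
        tokens = text.split()
        if control == "1":                # new page: ignore the header lines
            state = "skip"
            continue
        if control == "+" or not tokens:  # overprinted underscores, blank lines
            continue
        if len(tokens) > 2 and tokens[1] == "BASE:":
            base = tokens[2]
            base_text = [" ".join(tokens[3:])] if len(tokens) > 3 else []
            county, county_desc = None, None
            state = "base"
        elif "COUNTY:" in tokens:
            i = tokens.index("COUNTY:")
            county = tokens[i + 1] if i + 1 < len(tokens) else None
            county_desc = (text.split("DESCRIPTION:", 1)[1].strip()
                           if "DESCRIPTION:" in text else None)
            state = "skip"
        elif tokens[0] == "INDIVIDUAL":   # element names: individuals follow
            state = "individuals"
        elif state == "base":             # name and address lines of the base
            base_text.append(text)
        elif state == "individuals" and re.fullmatch(r"\d{9}", tokens[0]):
            yield {
                "id": tokens[0],
                "rest": text[len(tokens[0]):].strip(),   # fields left unparsed
                "base": base,
                "base_text": list(base_text),
                "county": county,
                "county_desc": county_desc,
            }

# Line breaks below are an assumption; the posted sample was flattened by email.
SAMPLE = [
    "1RUN DATE: 02/06/08 blah blah MANAGEMENT INFORMATION SYSTEM PAGE: 1185",
    " blah GROUP blah PROJECT AD HOC 08-035",
    "-blah BASE: 0150009 more blah blah, INC",
    " 5678 SW ATH STREET",
    "0CORAL GABLES ,FL 33134",
    "-blah: 015000900 COUNTY: 06 COUNTY DESCRIPTION: BROWARD",
    "-INDIVIDUAL NAME ADDRESS TYPE TPI TOMY ZIP STATUS PHONE IND BDATE EDATE",
    "+________________________",
    " 011234450 Junior, John DO blah blah 3456 S Blah ROAD FT LAUDERDALE FL"
    " 33333-0000 26 1235128794 1 777-123-3456 07/01/06 99/99/99",
    " 023456700 Brown, John A The blah Group 1000 W Anywhere RD CORAL SPRINGS FL"
    " 33000-0000 25 1740275528 1 555-444-5555 07/01/06 99/99/99",
]

records = list(parse_report(SAMPLE))
for rec in records:
    print(rec["id"], rec["base"], rec["county"], rec["county_desc"])
```

One way to finish the path into SPSS would be to write these dictionaries to a delimited text file (for example with Python's `csv.DictWriter`) and read that file with an ordinary text import. If the "two lines make a case" layout holds in the real file, the individual branch would need to join each pair of lines before emitting a record.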
|
Good Afternoon Listers,
I want to thank all responders for taking the time to send helpful ideas and solutions to my query. I especially want to thank Gene and Richard for their time, knowledge, and willingness to help with this knotty data problem with only a skeleton dataset to go on. The syntax and ideas from Gene and Richard have allowed me to extract the text data, which Richard described as 'printer-image data', with relative ease. For those of you who haven't experienced it: this is not the best-formatted data to work with! Am I glad there are such talented and experienced members here on this list! Again, my thanks to you.
Sincerely
Mike |
