Hi List,

The code from Albert-Jan helped me deal with the non-printing characters in the data file. I am still struggling with how to read in the variables for each case, and would dearly like some help with how to approach reading these data from a text file. The Programming and Data Management for SPSS 17.0 guide by R. Levesque shows how to read in data for simple delimited/fixed and complex delimited/fixed files, but I think the data files I am working with don't fit any of these groups :) The format looks "mixed", since not all variables are included in all records, and each record sits on one loooong line! I think an Input Program approach should work, but the documentation seems to suggest that I identify the positions of the variables, which is impossible given the data(?). How can I get around this?

The following is what I am working with:
- Each record starts and ends with an ASCII character (converted from a non-printing character at the beginning using Albert-Jan's approach). In the sample data below the record begins with "G1" and ends before the next "G1".
- There is a 68-character transaction header, the only consistent thing about these records, carrying information about the record.
- Next there are segment identifiers, which identify related components of the record, followed by variables that are themselves preceded by variable identifiers (Ex, Gx, Dx, Mx, etc.).
- Not all variables are required, so the following record may have a variable in a different position from the previous record.
- There is a set of core variables that every record must include.
- Some variables are required depending upon a couple of other variables.

Here is an example of 3 records of generated data (no real personal identifiers), but similar to what the text data files look like; the variable indicators are genuine and so are the segment identifiers (AMxx).

________________________________
[1]G1001050865501335251B1M02501335210123456789AB 20060523070109DR01‑ AM04 C23163749020 C1YLJOURNEYX‑ AM01 CX99 CY9876543210 C4185001212 C51 CAIOMPQ CBVLVYN C700‑ AM07 EM1 D20126565 E103 D700378180301 E730000 D30 D530 D61 D80 DE20060420 DF99 DK9 C80 28EA‑ AM11 D962D DC20{ DQ82D DU82D‑ AM03 EZ01 DB1234000009
________________________________
[1]G1001050876201224251B1M0250133521012468013579 20030325070109DR01‑ AM04 C27495343567 C1YLJOURNEYX‑ AM01 CX99 CY7468357123 C428460621 C52 CAYYZDBCA CBAXBEYMHV C700‑ AM07 EM1 D20430007 E103 D745802004064 E7120000 D30 D515 D61 D80 DE20040417 DF99 DK9 C80 28ML‑ AM11 D985E DC20{ DQ105E DU105E‑ AM03 EZ01 DB1750355699
________________________________
[1]G1001050897501335251B1M0250133521011117771234 20021223070109DR01‑ AM04 C28800437556 C1YLJOURNEYX‑ AM01 CX99 CY3451236789 C420011021 C51 CAXLMNQ CBNGHYE C700‑ AM07 EM1 D21941780 E103 D716252051501 E714000 D30 D57 D61 D80 DE20061123 DF99 DK9 C80 28EA‑ AM11 D958G DC20{ DQ78G DU78G‑ AM03 EZ01 DB1346262193

TIA,
Mike
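(Editor's aside: given the structure described above, one workable approach is to skip positional parsing in SPSS entirely and pre-process each record in Python, splitting first on the segment markers and then on the field identifiers. The sketch below only illustrates that idea; the 68-character header slice, the segment separator, and the assumption that fields are space-separated with two-character identifiers are all inferred from the sample records, not from the NCPDP specification.)

BEGIN PROGRAM.
# Illustrative only: split one record into {segment_id: {field_id: value}}.
# Assumptions (guessed from the sample data):
#   * the first 68 characters are the fixed transaction header
#   * segments are separated by a marker character followed by a space
#   * fields within a segment are space-separated and start with a
#     two-character field identifier (C4, DB, EM, ...)
SEGMENT_SEP = "- "   # placeholder; use whatever character the
                     # control-character conversion actually leaves here

def parse_record(record):
    header = record[:68]
    body = record[68:]
    segments = {}
    for chunk in body.split(SEGMENT_SEP):
        parts = chunk.split()
        if not parts:
            continue
        seg_id = parts[0]                           # e.g. 'AM04'
        fields = dict((f[:2], f[2:]) for f in parts[1:])
        segments[seg_id] = fields
    return header, segments
END PROGRAM.

With something along these lines, each record could then be flattened to one wide row (one column per field identifier) and written to a delimited file for GET DATA to read.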
Robert,
You've been working on this a long time and I'd like to try. First, I'd like to understand the exact data structure. So:

1) Are these data currently in a text file, so that you will need to use either a get data /type=txt or a data list statement to read them?
2) Are there, in fact, underscore characters between cases?
3) You show four lines per case. Are there really just two lines per case (the 68-character transaction header and another record that becomes three lines in the email because of line wrapping)? Or is there really just one line per case? I copied your sample data to a text editor and I suddenly got one line per case.
4) Assuming a case consists of two records, is the length of the second record always the same or is it variable? Or, if a case is one line, is the length of a case (record) always the same? Your example data shows that to be the case.
5) Also, when I copied the data to my text editor, the dash-spaces ('- ') between groups of characters collapsed to a single character (a '?' in my editor), the spaces (' ') disappeared, and the character groups ran together. Please clarify this.
6) Also, I don't understand where the 68-character header ends. Do you mean that, using case 1 as an example, the header record is
[1]G1001050865501335251B1M02501335210123456789AB!!!!!20060523070109DR01‑!
(I've filled spaces with '!'), which is 71 characters long? Or are you saying the header is actually
G1001050865501335251B1M02501335210123456789AB!!!!!20060523070109DR01
where 'G1' is the added characters and '[1]' is something else but is really there?

Gene Maguin
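(Editor's aside: a quick, hypothetical way to settle points 3 and 4 before writing any SPSS syntax is to read the raw file in Python and report how many physical lines it has and how long each one is. The path below is a placeholder, and the code follows the Python 2 style of the BEGIN PROGRAM block later in the thread.)

BEGIN PROGRAM.
# Hypothetical diagnostic: count physical lines and their lengths in the raw
# file, to see whether a case is one line or two and whether record length
# is constant. The file path is a placeholder.
RAW_FILE = "C:/Documents and Settings/robertsm/Desktop/temp.txt"

lengths = []
for line in open(RAW_FILE, "rb"):
    lengths.append(len(line.rstrip("\r\n")))

print "number of lines:", len(lengths)
print "min/max line length:", min(lengths), max(lengths)
END PROGRAM.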
Gene,
Using the code from Albert-Jan (below), I converted the non-printing characters in the data file by their hex codes; specifically, I changed ETX (^C) to CR, FS (^\) to 0x22 ("), and all other control characters to blank (0x20). Then, using the GET DATA command with the delimiter specified as '"' and reading from the second case, I am able to bring in the data without problems (thus far!). Will let you know if I run into any difficulty :) My next attempt will be to automate this conversion and read these text files using Python - I had done something like this a while back, so I am quite rusty :)

BEGIN PROGRAM.
import string, random

def create_testfile():
    # create a test file on the desktop (opened for writing; left empty here)
    f = open("C:/Documents and Settings/robertsm/Desktop/atest.txt", "wb")

create_testfile()

def ditch_punctuation(infile, outfile):
    # Map control characters to printable ones:
    #   STX (2) -> space, ETX (3) -> CR, FS (28) -> '"',
    #   GS (29), RS (30), LF (10) -> space
    translation_table = {2: 32, 3: 13, 28: 34, 29: 32, 30: 32, 32: 32, 10: 32}
    old = "".join([chr(char) for char in translation_table.keys()])
    new = "".join([chr(char) for char in translation_table.values()])
    trans = string.maketrans(old, new)
    f_out = open(outfile, "wb")
    for line in open(infile, "rb"):
        out = string.translate(line, trans)
        f_out.write(out)
    f_out.close()

ditch_punctuation(infile="C:/Documents and Settings/robertsm/Desktop/temp.txt",
                  outfile="C:/Documents and Settings/robertsm/Desktop/atest_converted.txt")
END PROGRAM.

Thank you for taking a look at this problem!
Mike
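(One portability note on the BEGIN PROGRAM block above: string.maketrans and string.translate are Python 2 APIs and are not available in Python 3. If the same conversion ever needs to run under Python 3, an equivalent sketch, using the same placeholder paths, could look like this.)

BEGIN PROGRAM.
# Python 3 equivalent of ditch_punctuation: build a translation table with
# bytes.maketrans and apply it with bytes.translate.
translation_table = {2: 32, 3: 13, 28: 34, 29: 32, 30: 32, 32: 32, 10: 32}
old = bytes(translation_table.keys())
new = bytes(translation_table.values())
trans = bytes.maketrans(old, new)

with open("C:/Documents and Settings/robertsm/Desktop/temp.txt", "rb") as f_in, \
     open("C:/Documents and Settings/robertsm/Desktop/atest_converted.txt", "wb") as f_out:
    for line in f_in:
        f_out.write(line.translate(trans))
END PROGRAM.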
Hi Albert-Jan,

Happy 2010 to you too! And to everyone else on this list!!!

Thank you again for looking at this data extraction problem. The data files are individually not big (about 1-2 MB at most), while my RAM is 4 GB. I am trying to understand what looks like a pretty cool program you wrote; it looks like it parses the data on '-'. I am not sure what the backslash characters ('\r', '\n', '\t') in the 'write' statement do, however; are these commands to insert new lines, tabs, and some other usable identifier in the output file? Will let you all know how this works out.

Thanking you sincerely,
Mike
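(On the backslash-character question: yes, those are escape sequences rather than data. In Python, '\r' is a carriage return, '\n' a line feed, and '\t' a tab, so writing them puts real line breaks and tab stops into the output file. A tiny illustration, with a throwaway file name:)

BEGIN PROGRAM.
# '\r', '\n' and '\t' are single control characters (carriage return, line
# feed, tab), so this writes two physical lines with a tab between fields.
with open("escape_demo.txt", "w") as f:
    f.write("field1\tfield2\r\n")
    f.write("value1\tvalue2\r\n")
END PROGRAM.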
