Reading NCPDP formatted data into SPSS

classic Classic list List threaded Threaded
5 messages Options
Reply | Threaded
Open this post in threaded view
|

Reading NCPDP formatted data into SPSS

Roberts, Michael

Hi List,

 

The code from Albert-Jan helped me deal with the non-printing characters in the data file. 

 

I am still struggling with how to read in the variables for each case, and would dearly like some help with how to approach reading in these data from a text file. 

 

The Programming and Data Management for SPSS 17.0 guide by R. Levesque shows how to read in data for simple delimited/fixed, and complex delimited/fixed, but I think the data files I am working with don’t fit any of these groups J  It looks “mixed” since not all variables are included in all records, and all records are all on one loooong line!

 

I think an Input Program approach should work, but the documentation seems to suggest that I identify the positions of the variables, which is impossible given the data(?)  How can I get around this?

 

The following is what I am working with –

Each record starts and ends with an with an ASCII character (converted from a non-printing character at the beginning using Albert-Jan’s approach) – in the sample data below the record begins with “G1” and ends before the next “G1”;

There is a 68-character transaction header – the only consistent thing about these records, with information about the record;

Next there are segment identifiers which identify related components of the record followed by variables which are also preceded by variable identifiers (Ex, Gx, Dx, Mx, etc);

Not all variables are required, so the following record may have a variable in a different position from the previous record;

There are a set of core variables that every record must include;

Some variables are required, depending upon a couple of other variables.

 

Here is an example of 3 records of generated data (no real personal identifiers), but similar to what the text data files look like; the variable indicators are genuine and so are the segment identifiers (AMxx).

 


[1]G1001050865501335251B1M02501335210123456789AB     20060523070109DR01AM04C23163749020C1YLJOURNEYXAM01CX99CY9876543210C4185001212C51CAIOMPQCBVLVYNC700AM07EM1D20126565E103D700378180301E730000D30D530D61D80DE20060420DF99DK9C8028EAAM11D962DDC20{DQ82DDU82DAM03EZ01DB1234000009
[1]G1001050876201224251B1M0250133521012468013579     20030325070109DR01
AM04C27495343567C1YLJOURNEYXAM01CX99CY7468357123C428460621C52CAYYZDBCACBAXBEYMHVC700AM07EM1D20430007E103D745802004064E7120000D30D515D61D80DE20040417DF99DK9C8028MLAM11D985EDC20{DQ105EDU105EAM03EZ01DB1750355699
[1]G1001050897501335251B1M0250133521011117771234     20021223070109DR01
AM04C28800437556C1YLJOURNEYXAM01CX99CY3451236789C420011021C51CAXLMNQCBNGHYEC700AM07EM1D21941780E103D716252051501E714000D30D57D61D80DE20061123DF99DK9C8028EAAM11D958GDC20{DQ78GDU78GAM03EZ01DB1346262193
[1]

 

TIA

Mike

 

 

From: Albert-Jan Roskam [mailto:[hidden email]]
Sent: Thursday, December 17, 2009 4:34 PM
To: Roberts, Michael
Subject: RE: [SPSSX-L] Reading NCPDP formatted data into SPSS

 

Hi Mike,

The code consist of to functions: create_testfile() and ditch_punctuation(). The first function was only for demo purposes. It generates a dummy file. You don't need that entire block, because you have a real file.

The second function does all the work. There are 128 ascii characters, ranging from 0-127. What 'chr'  does is that is converts an ordinal (0-127) to a character. On www.asciitable.com you can see, for example, that the decimal 32 stands for "space".If i wanted to replace all spaces by exclamation marks (decimal 33), my translation would look like this:
translation_table = {32: 33}. The program loops through each line of the input file, translates each line using the table, and writes the translated lines to a new file.

Cheers!!
Albert-Jan

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In the face of ambiguity, refuse the temptation to guess.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

--- On Thu, 12/17/09, Roberts, Michael <[hidden email]> wrote:


From: Roberts, Michael <[hidden email]>
Subject: RE: [SPSSX-L] Reading NCPDP formatted data into SPSS
To: "Albert-Jan Roskam" <[hidden email]>
Date: Thursday, December 17, 2009, 3:54 PM

From: Albert-Jan Roskam [mailto:[hidden email]]
Sent: Tuesday, December 15, 2009 5:51 AM
To: [hidden email]; Roberts, Michael
Subject: Re: [SPSSX-L] Reading NCPDP formatted data into SPSS

 

Hi Mike,

 

I was doing something similar the other day. Basically, you want to convert certain ascii signs to other ascii signs, right? For instance "end of record" to "newline". Doesn't the NCPDP organization have some utility for this? I'd try something like the following (success not guaranteed ;-). It's up to you what the translation table looks like:

 

 

 

Hi Albert-Jan,

 

I was going through this code you sent and am not familiar with some of the bits.  Would you be able to explain what some of them do?  I tinkered a bit with your code, but got back strange stuff :)  My Python skills are very rudimentary, so several of the functions (“chr”,”trans”)you have used are unclear to me.

 

Thanking You

Sincerely

 

Michael

 

 

BEGIN PROGRAM.

import string, random

def create_testfile():
    f = open("d:/temp/ncpdp.txt", "wb")
    for i in range(10**5):
        char = random.randint(0, 127)  // Not sure what this bit does???
        f.write(chr(char))
create_testfile()                 //Looks like this definition reads in the ncpdp data file and creates a test file.

 

def ditch_punctuation(infile, outfile): //looks like this bit does the switch from one character to another(?)
    translation_table = {1:10, 2:10, 30:10, 32: 9} # specify decimal number as 'from: to'. check e.g. www.asciitable.com  // does this instruct the program to substitute the ‘10’,’9’, etc. for the ‘1’, ‘2’, 30’(?)
    old = "".join([chr(char) for char in translation_table.keys()])
    new = "".join([chr(char) for char in translation_table.values()])
    trans  = string.maketrans(old, new)
    f_out = open(outfile, "wb")
    for line in open(infile, "rb"):
        out = string.translate(line, trans)
        f_out.write(out)

ditch_punctuation(infile="d:/temp/ncpdp.txt", outfile="d:/temp/ncpdp_converted.txt")
END PROGRAM.




~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In the face of ambiguity, refuse the temptation to guess.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

--- On Mon, 12/14/09, Roberts, Michael <[hidden email]> wrote:


From: Roberts, Michael <[hidden email]>
Subject: Re: [SPSSX-L] Reading NCPDP formatted data into SPSS
To: [hidden email]
Date: Monday, December 14, 2009, 6:01 PM

Hi Albert-Jan,

 

Thank you for the suggestion; I did change the setting to the Unicode= on, but it was not too much help.  I ended up changing the various characters in a Hex editor (changed the HEX values to correspond to ASCII printing characters) , saved the file as text, and read into SPSS.  Using my process there are altogether four steps (would like a more elegant method, however!) to reading in the NCPDP data file:

 

1.      Using a Hex editor, change the control characters (non-printing) to two distinct ASCII printing ones, and save it as an ASCII file – result is a long record with 1 case per record, but all variables are identifiable by the second distinct separator;

2.       Read in the text file using the record identifier to separate the records/cases as defined in the Hex editor;

3.      Save the file as an ASCII file;

4.      Read in the saved file in step 3, using variable separators to identify the different variables.

 

I know this is a clunky way to read in these data, but I can’t think of a better  way to do it!  There are about 6 non-printing characters identifying the various elements of a record in this type (NCPDP) of data file: ^B, ^C, ^], ^\, ^^, and Blank. 

 

We will be working with potentially hundreds of these files, so any timesaving tips would be very much appreciated!

 

TIA

 

Mike

 

 

 

From: Albert-Jan Roskam [mailto:[hidden email]]
Sent: Saturday, December 12, 2009 4:26 PM
To: Roberts, Michael
Subject: Re: [SPSSX-L] Reading NCPDP formatted data into SPSS

 

Hi,

Did you try running the command SET UNICODE = ON before opening the file? That might help deal with funny characters.

Cheers!!
Albert-Jan

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In the face of ambiguity, refuse the temptation to guess.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

--- On Fri, 12/11/09, Roberts, Michael <[hidden email]> wrote:


From: Roberts, Michael <[hidden email]>
Subject: [SPSSX-L] Reading NCPDP formatted data into SPSS
To: [hidden email]
Date: Friday, December 11, 2009, 6:00 PM

Hi list,

Just wanted to know whether anyone on this list has tried reading any NCPDP (National Council for Prescription Drug Programs) formatted data files into SPSS.  If you have, you know the file is loaded with non-printing characters, non-printing delimiters, and sundry other booby traps.  If anyone had some experience, I would appreciate any helpful pointers you may care to suggest.

TIA

Mike

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD

 

 

 

Reply | Threaded
Open this post in threaded view
|

Re: Reading NCPDP formatted data into SPSS

Maguin, Eugene
Robert,

You've been working on this a long time and I'd like to try. First, I'd like to understand the exact data structure. So,
1) are these data currently in a text file? So that you will need to use either a get data /type text or a data list statement to read them?

2) are there, in fact, underscore characters between cases?

3) you show four lines per case. Are there really just two lines per case--the 68-character transaction header and another record that becomes three lines in the email because of line wrapping? Or iss there really just one line per case? I copied your sample data to a text editor and I suddenly got one line per case.

4) assuming a case consists of two records, is the length of the second record always the same or is it variable? Or, if a case is one line, is the length of a case (record) always the same? Your example data shows that to be the case.

5) Also when I copied the data to my text editor the dash-spaces ('- ') between groups of characters collapsed to a single character, a '?' on my editor and the spaces (' ') disappeared and the character groups ran together. Clarify this please.

6) Also, I don't understand where the 68 charater header ends. Do you mean that, using case 1 as an example, the header record is

[1]G1001050865501335251B1M02501335210123456789AB!!!!!20060523070109DR01‑!

(I've filled spaces with '!') Which is 71 characters long. Or, are you saying the header is actually

G1001050865501335251B1M02501335210123456789AB!!!!!20060523070109DR01

Where 'G1' is the added characters and '[1]' is something else but is really there?


Gene Maguin



>>The code from Albert-Jan helped me deal with the non-printing characters in the data file. I am still struggling with how to read in the variables for each case, and would dearly like some help with how to approach reading in these data from a text file.

The Programming and Data Management for SPSS 17.0 guide by R. Levesque shows how to read in data for simple delimited/fixed, and complex delimited/fixed, but I think the data files I am working with don’t fit any of these groups J  It looks “mixed” since not all variables are included in all records, and all records are all on one loooong line!

I think an Input Program approach should work, but the documentation seems to suggest that I identify the positions of the variables, which is impossible given the data(?)  How can I get around this?


The following is what I am working with –
Each record starts and ends with an with an ASCII character (converted from a non-printing character at the beginning using Albert-Jan’s approach) – in the sample data below the record begins with “G1” and ends before the next “G1”;

There is a 68-character transaction header – the only consistent thing about these records, with information about the record;
Next there are segment identifiers which identify related components of the record followed by variables which are also preceded by variable identifiers (Ex, Gx, Dx, Mx, etc);
Not all variables are required, so the following record may have a variable in a different position from the previous record; There are a set of core variables that every record must include; Some variables are required, depending upon a couple of other variables.

Here is an example of 3 records of generated data (no real personal identifiers), but similar to what the text data files look like; the variable indicators are genuine and so are the segment identifiers (AMxx).

________________________________

[1]G1001050865501335251B1M02501335210123456789AB     20060523070109DR01‑ AM04 C23163749020 C1YLJOURNEYX‑ AM01 CX99 CY9876543210 C4185001212 C51 CAIOMPQ CBVLVYN C700‑ AM07 EM1 D20126565 E103 D700378180301 E730000 D30 D530 D61 D80 DE20060420 DF99 DK9 C80 28EA‑ AM11 D962D DC20{ DQ82D DU82D‑ AM03 EZ01 DB1234000009
________________________________

[1]G1001050876201224251B1M0250133521012468013579     20030325070109DR01‑ AM04 C27495343567 C1YLJOURNEYX‑ AM01 CX99 CY7468357123 C428460621 C52 CAYYZDBCA CBAXBEYMHV C700‑ AM07 EM1 D20430007 E103 D745802004064 E7120000 D30 D515 D61 D80 DE20040417 DF99 DK9 C80 28ML‑ AM11 D985E DC20{ DQ105E DU105E‑ AM03 EZ01 DB1750355699
________________________________

[1]G1001050897501335251B1M0250133521011117771234     20021223070109DR01‑ AM04 C28800437556 C1YLJOURNEYX‑ AM01 CX99 CY3451236789 C420011021 C51 CAXLMNQ CBNGHYE C700‑ AM07 EM1 D21941780 E103 D716252051501 E714000 D30 D57 D61 D80 DE20061123 DF99 DK9 C80 28EA‑ AM11 D958G DC20{ DQ78G DU78G‑ AM03 EZ01 DB1346262193

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD
Reply | Threaded
Open this post in threaded view
|

Re: Reading NCPDP formatted data into SPSS

Roberts, Michael
Gene,

Using the code from Albert-Jan (below), I converted the hex characters in the data file; specifically I changed the ETX (^C) to CR, the FS (^\) to 22 ("), and all other control characters to blank (20), then using the "get data" command with delimiter specified as '"', and reading from the second case, I am able to bring in the data without problems (thus far!).  Will let you know if I run into any difficulty :) My next attempt will be to automate this conversion and reading these text files using Python - I had done something like this a while back, so I am quite rusty :)


BEGIN PROGRAM.
import string, random
def create_testfile():
    f = open("C:/Documents and Settings/robertsm/Desktop/atest.txt", "wb")
    create_testfile()
def ditch_punctuation(infile, outfile):
    translation_table = {2:32, 3:13, 28:34, 29:32, 30:32, 32:32,10:32 }
    old = "".join([chr(char) for char in translation_table.keys()])
    new = "".join([chr(char) for char in translation_table.values()])
    trans  = string.maketrans(old, new)
    f_out = open(outfile, "wb")
    for line in open(infile, "rb"):
        out = string.translate(line, trans)
        f_out.write(out)
ditch_punctuation(infile="C:/Documents and Settings/robertsm/Desktop/temp.txt", outfile="C:/Documents and Settings/robertsm/Desktop/atest_converted.txt")
END PROGRAM.

Thank you for taking a look at this problem!

Mike

-----Original Message-----
From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of Gene Maguin
Sent: Thursday, December 31, 2009 10:38 AM
To: [hidden email]
Subject: Re: Reading NCPDP formatted data into SPSS

Robert,

You've been working on this a long time and I'd like to try. First, I'd like to understand the exact data structure. So,
1) are these data currently in a text file? So that you will need to use either a get data /type text or a data list statement to read them?

2) are there, in fact, underscore characters between cases?

3) you show four lines per case. Are there really just two lines per case--the 68-character transaction header and another record that becomes three lines in the email because of line wrapping? Or iss there really just one line per case? I copied your sample data to a text editor and I suddenly got one line per case.

4) assuming a case consists of two records, is the length of the second record always the same or is it variable? Or, if a case is one line, is the length of a case (record) always the same? Your example data shows that to be the case.

5) Also when I copied the data to my text editor the dash-spaces ('- ') between groups of characters collapsed to a single character, a '?' on my editor and the spaces (' ') disappeared and the character groups ran together. Clarify this please.

6) Also, I don't understand where the 68 charater header ends. Do you mean that, using case 1 as an example, the header record is

[1]G1001050865501335251B1M02501335210123456789AB!!!!!20060523070109DR01‑!

(I've filled spaces with '!') Which is 71 characters long. Or, are you saying the header is actually

G1001050865501335251B1M02501335210123456789AB!!!!!20060523070109DR01

Where 'G1' is the added characters and '[1]' is something else but is really there?


Gene Maguin



>>The code from Albert-Jan helped me deal with the non-printing characters in the data file. I am still struggling with how to read in the variables for each case, and would dearly like some help with how to approach reading in these data from a text file.

The Programming and Data Management for SPSS 17.0 guide by R. Levesque shows how to read in data for simple delimited/fixed, and complex delimited/fixed, but I think the data files I am working with don’t fit any of these groups J  It looks “mixed” since not all variables are included in all records, and all records are all on one loooong line!

I think an Input Program approach should work, but the documentation seems to suggest that I identify the positions of the variables, which is impossible given the data(?)  How can I get around this?


The following is what I am working with –
Each record starts and ends with an with an ASCII character (converted from a non-printing character at the beginning using Albert-Jan’s approach) – in the sample data below the record begins with “G1” and ends before the next “G1”;

There is a 68-character transaction header – the only consistent thing about these records, with information about the record;
Next there are segment identifiers which identify related components of the record followed by variables which are also preceded by variable identifiers (Ex, Gx, Dx, Mx, etc);
Not all variables are required, so the following record may have a variable in a different position from the previous record; There are a set of core variables that every record must include; Some variables are required, depending upon a couple of other variables.

Here is an example of 3 records of generated data (no real personal identifiers), but similar to what the text data files look like; the variable indicators are genuine and so are the segment identifiers (AMxx).

________________________________

[1]G1001050865501335251B1M02501335210123456789AB     20060523070109DR01‑ AM04 C23163749020 C1YLJOURNEYX‑ AM01 CX99 CY9876543210 C4185001212 C51 CAIOMPQ CBVLVYN C700‑ AM07 EM1 D20126565 E103 D700378180301 E730000 D30 D530 D61 D80 DE20060420 DF99 DK9 C80 28EA‑ AM11 D962D DC20{ DQ82D DU82D‑ AM03 EZ01 DB1234000009
________________________________

[1]G1001050876201224251B1M0250133521012468013579     20030325070109DR01‑ AM04 C27495343567 C1YLJOURNEYX‑ AM01 CX99 CY7468357123 C428460621 C52 CAYYZDBCA CBAXBEYMHV C700‑ AM07 EM1 D20430007 E103 D745802004064 E7120000 D30 D515 D61 D80 DE20040417 DF99 DK9 C80 28ML‑ AM11 D985E DC20{ DQ105E DU105E‑ AM03 EZ01 DB1750355699
________________________________

[1]G1001050897501335251B1M0250133521011117771234     20021223070109DR01‑ AM04 C28800437556 C1YLJOURNEYX‑ AM01 CX99 CY3451236789 C420011021 C51 CAXLMNQ CBNGHYE C700‑ AM07 EM1 D21941780 E103 D716252051501 E714000 D30 D57 D61 D80 DE20061123 DF99 DK9 C80 28EA‑ AM11 D958G DC20{ DQ78G DU78G‑ AM03 EZ01 DB1346262193

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD
Reply | Threaded
Open this post in threaded view
|

Re: Reading NCPDP formatted data into SPSS

Albert-Jan Roskam
Hi Micheal,

Happy 2010! The code below just came out of my still somewhat champaign-soaked brain. Just like Gene, I had a hard time understanding the record structure. The hyphen-like sign seems useful to split the file up. Your output file is now tab-separated, but some fields have a space as a separator. That's not a problem in spss. How large are your input files? Larger or smaller than the size of your RAM memory? The current program and the previous one could probably be streamlined into one little program, but let's first see if this works.

# -*- coding: utf-8 -*-
import re
def process_file(infile, outfile):
    casecount = 0
    for element in infile.split("‑"):
        m = re.search("G1.{66}", element) # skip the last two blanks?
        if m and not casecount:
            casecount += 1
            outfile.write(m.group(0).strip())
        elif m:
            outfile.write("\r\n" + m.group(0).strip())
        else:
            outfile.write("\t" + element.strip())
    outfile.close()
    print "Done! (%s)" % outfile.name

# testcode
infile = """G1001050865501335251B1M02501335210123456789AB     20060523070109DR01‑ AM04 C23163749020 C1YLJOURNEYX‑ AM01 CX99 CY9876543210 C4185001212 C51 CAIOMPQ CBVLVYN C700‑ AM07 EM1 D20126565 E103 D700378180301 E730000 D30 D530 D61 D80 DE20060420 DF99 DK9 C80 28EA‑ AM11 D962D DC20{ DQ82D DU82D‑ AM03 EZ01 DB1234000009
G1001050876201224251B1M0250133521012468013579     20030325070109DR01‑ AM04 C27495343567 C1YLJOURNEYX‑ AM01 CX99 CY7468357123 C428460621 C52 CAYYZDBCA CBAXBEYMHV C700‑ AM07 EM1 D20430007 E103 D745802004064 E7120000 D30 D515 D61 D80 DE20040417 DF99 DK9 C80 28ML‑ AM11 D985E DC20{ DQ105E DU105E‑ AM03 EZ01 DB1750355699
G1001050897501335251B1M0250133521011117771234     20021223070109DR01‑ AM04 C28800437556 C1YLJOURNEYX‑ AM01 CX99 CY3451236789 C420011021 C51 CAXLMNQ CBNGHYE C700‑ AM07 EM1 D21941780 E103 D716252051501 E714000 D30 D57 D61 D80 DE20061123 DF99 DK9 C80 28EA‑ AM11 D958G DC20{ DQ78G DU78G‑ AM03 EZ01 DB1346262193"""
process_file(infile, outfile = open("d:/temp/out.txt", "wb"))

# your code
process_file(infile = open("C:/Documents and Settings/robertsm/Desktop/atest.txt", "rb"),\
             outfile = open("d:/temp/out.txt", "wb"))

Cheers!!
Albert-Jan

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In the face of ambiguity, refuse the temptation to guess.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

--- On Thu, 12/31/09, Roberts, Michael <[hidden email]> wrote:

From: Roberts, Michael <[hidden email]>
Subject: Re: [SPSSX-L] Reading NCPDP formatted data into SPSS
To: [hidden email]
Date: Thursday, December 31, 2009, 8:48 PM

Gene,

Using the code from Albert-Jan (below), I converted the hex characters in the data file; specifically I changed the ETX (^C) to CR, the FS (^\) to 22 ("), and all other control characters to blank (20), then using the "get data" command with delimiter specified as '"', and reading from the second case, I am able to bring in the data without problems (thus far!).  Will let you know if I run into any difficulty :) My next attempt will be to automate this conversion and reading these text files using Python - I had done something like this a while back, so I am quite rusty :)


BEGIN PROGRAM.
import string, random
def create_testfile():
    f = open("C:/Documents and Settings/robertsm/Desktop/atest.txt", "wb")
    create_testfile()
def ditch_punctuation(infile, outfile):
    translation_table = {2:32, 3:13, 28:34, 29:32, 30:32, 32:32,10:32 }
    old = "".join([chr(char) for char in translation_table.keys()])
    new = "".join([chr(char) for char in translation_table.values()])
    trans  = string.maketrans(old, new)
    f_out = open(outfile, "wb")
    for line in open(infile, "rb"):
        out = string.translate(line, trans)
        f_out.write(out)
ditch_punctuation(infile="C:/Documents and Settings/robertsm/Desktop/temp.txt", outfile="C:/Documents and Settings/robertsm/Desktop/atest_converted.txt")
END PROGRAM.

Thank you for taking a look at this problem!

Mike

-----Original Message-----
From: SPSSX(r) Discussion [mailto:SPSSX-L@...] On Behalf Of Gene Maguin
Sent: Thursday, December 31, 2009 10:38 AM
To: SPSSX-L@...
Subject: Re: Reading NCPDP formatted data into SPSS

Robert,

You've been working on this a long time and I'd like to try. First, I'd like to understand the exact data structure. So,
1) are these data currently in a text file? So that you will need to use either a get data /type text or a data list statement to read them?

2) are there, in fact, underscore characters between cases?

3) you show four lines per case. Are there really just two lines per case--the 68-character transaction header and another record that becomes three lines in the email because of line wrapping? Or is there really just one line per case? I copied your sample data to a text editor and I suddenly got one line per case.

4) assuming a case consists of two records, is the length of the second record always the same or is it variable? Or, if a case is one line, is the length of a case (record) always the same? Your example data shows that to be the case.

5) Also when I copied the data to my text editor the dash-spaces ('- ') between groups of characters collapsed to a single character, a '?' on my editor and the spaces (' ') disappeared and the character groups ran together. Clarify this please.

6) Also, I don't understand where the 68 charater header ends. Do you mean that, using case 1 as an example, the header record is

[1]G1001050865501335251B1M02501335210123456789AB!!!!!20060523070109DR01‑!

(I've filled spaces with '!') Which is 71 characters long. Or, are you saying the header is actually

G1001050865501335251B1M02501335210123456789AB!!!!!20060523070109DR01

Where 'G1' is the added characters and '[1]' is something else but is really there?


Gene Maguin



>>The code from Albert-Jan helped me deal with the non-printing characters in the data file. I am still struggling with how to read in the variables for each case, and would dearly like some help with how to approach reading in these data from a text file.

The Programming and Data Management for SPSS 17.0 guide by R. Levesque shows how to read in data for simple delimited/fixed, and complex delimited/fixed, but I think the data files I am working with don’t fit any of these groups J  It looks “mixed” since not all variables are included in all records, and all records are all on one loooong line!

I think an Input Program approach should work, but the documentation seems to suggest that I identify the positions of the variables, which is impossible given the data(?)  How can I get around this?


The following is what I am working with –
Each record starts and ends with an with an ASCII character (converted from a non-printing character at the beginning using Albert-Jan’s approach) – in the sample data below the record begins with “G1” and ends before the next “G1”;

There is a 68-character transaction header – the only consistent thing about these records, with information about the record;
Next there are segment identifiers which identify related components of the record followed by variables which are also preceded by variable identifiers (Ex, Gx, Dx, Mx, etc);
Not all variables are required, so the following record may have a variable in a different position from the previous record; There are a set of core variables that every record must include; Some variables are required, depending upon a couple of other variables.

Here is an example of 3 records of generated data (no real personal identifiers), but similar to what the text data files look like; the variable indicators are genuine and so are the segment identifiers (AMxx).

________________________________

[1]G1001050865501335251B1M02501335210123456789AB     20060523070109DR01‑ AM04 C23163749020 C1YLJOURNEYX‑ AM01 CX99 CY9876543210 C4185001212 C51 CAIOMPQ CBVLVYN C700‑ AM07 EM1 D20126565 E103 D700378180301 E730000 D30 D530 D61 D80 DE20060420 DF99 DK9 C80 28EA‑ AM11 D962D DC20{ DQ82D DU82D‑ AM03 EZ01 DB1234000009
________________________________

[1]G1001050876201224251B1M0250133521012468013579     20030325070109DR01‑ AM04 C27495343567 C1YLJOURNEYX‑ AM01 CX99 CY7468357123 C428460621 C52 CAYYZDBCA CBAXBEYMHV C700‑ AM07 EM1 D20430007 E103 D745802004064 E7120000 D30 D515 D61 D80 DE20040417 DF99 DK9 C80 28ML‑ AM11 D985E DC20{ DQ105E DU105E‑ AM03 EZ01 DB1750355699
________________________________

[1]G1001050897501335251B1M0250133521011117771234     20021223070109DR01‑ AM04 C28800437556 C1YLJOURNEYX‑ AM01 CX99 CY3451236789 C420011021 C51 CAXLMNQ CBNGHYE C700‑ AM07 EM1 D21941780 E103 D716252051501 E714000 D30 D57 D61 D80 DE20061123 DF99 DK9 C80 28EA‑ AM11 D958G DC20{ DQ78G DU78G‑ AM03 EZ01 DB1346262193

=====================
To manage your subscription to SPSSX-L, send a message to
LISTSERV@... (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD

=====================
To manage your subscription to SPSSX-L, send a message to
LISTSERV@... (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD

Reply | Threaded
Open this post in threaded view
|

Re: Reading NCPDP formatted data into SPSS

Roberts, Michael

Hi Albert-Jan,

 

Happy 2010 to you too! And to everyone else on this list!!!

 

Next, thank you again for looking at this data extraction problem.  The data files individually are not big – about 1-2 mb at most, while my RAM = 4gb.  I am trying to understand what looks like a pretty cool program you wrote; it looks like it parses the data with ‘-‘.  I am not sure what the backslash characters (‘\r, \n, \t) in the ‘write’ statement do, however(?); are these commands to insert new lines, tabs, and some other usable identifier in the output file?

 

Will let you all know how this works out.

 

Thanking You

Sincerely

 

Mike

 

 

 

From: Albert-Jan Roskam [mailto:[hidden email]]
Sent: Friday, January 01, 2010 9:48 AM
To: [hidden email]; Roberts, Michael
Subject: Re: [SPSSX-L] Reading NCPDP formatted data into SPSS

 

Hi Micheal,

Happy 2010! The code below just came out of my still somewhat champaign-soaked brain. Just like Gene, I had a hard time understanding the record structure. The hyphen-like sign seems useful to split the file up. Your output file is now tab-separated, but some fields have a space as a separator. That's not a problem in spss. How large are your input files? Larger or smaller than the size of your RAM memory? The current program and the previous one could probably be streamlined into one little program, but let's first see if this works.

# -*- coding: utf-8 -*-
import re
def process_file(infile, outfile):
    casecount = 0
    for element in infile.split("‑"):
        m = re.search("G1.{66}", element) # skip the last two blanks?
        if m and not casecount:
            casecount += 1
            outfile.write(m.group(0).strip())
        elif m:
            outfile.write("\r\n" + m.group(0).strip())
        else:
            outfile.write("\t" + element.strip())
    outfile.close()
    print "Done! (%s)" % outfile.name

# testcode
infile = """G1001050865501335251B1M02501335210123456789AB     20060523070109DR01‑ AM04 C23163749020 C1YLJOURNEYX‑ AM01 CX99 CY9876543210 C4185001212 C51 CAIOMPQ CBVLVYN C700‑ AM07 EM1 D20126565 E103 D700378180301 E730000 D30 D530 D61 D80 DE20060420 DF99 DK9 C80 28EA‑ AM11 D962D DC20{ DQ82D DU82D‑ AM03 EZ01 DB1234000009
G1001050876201224251B1M0250133521012468013579     20030325070109DR01‑ AM04 C27495343567 C1YLJOURNEYX‑ AM01 CX99 CY7468357123 C428460621 C52 CAYYZDBCA CBAXBEYMHV C700‑ AM07 EM1 D20430007 E103 D745802004064 E7120000 D30 D515 D61 D80 DE20040417 DF99 DK9 C80 28ML‑ AM11 D985E DC20{ DQ105E DU105E‑ AM03 EZ01 DB1750355699
G1001050897501335251B1M0250133521011117771234     20021223070109DR01‑ AM04 C28800437556 C1YLJOURNEYX‑ AM01 CX99 CY3451236789 C420011021 C51 CAXLMNQ CBNGHYE C700‑ AM07 EM1 D21941780 E103 D716252051501 E714000 D30 D57 D61 D80 DE20061123 DF99 DK9 C80 28EA‑ AM11 D958G DC20{ DQ78G DU78G‑ AM03 EZ01 DB1346262193"""
process_file(infile, outfile = open("d:/temp/out.txt", "wb"))

# your code
process_file(infile = open("C:/Documents and Settings/robertsm/Desktop/atest.txt", "rb"),\
             outfile = open("d:/temp/out.txt", "wb"))

Cheers!!
Albert-Jan

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In the face of ambiguity, refuse the temptation to guess.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

--- On Thu, 12/31/09, Roberts, Michael <[hidden email]> wrote:


From: Roberts, Michael <[hidden email]>
Subject: Re: [SPSSX-L] Reading NCPDP formatted data into SPSS
To: [hidden email]
Date: Thursday, December 31, 2009, 8:48 PM

Gene,

Using the code from Albert-Jan (below), I converted the hex characters in the data file; specifically I changed the ETX (^C) to CR, the FS (^\) to 22 ("), and all other control characters to blank (20), then using the "get data" command with delimiter specified as '"', and reading from the second case, I am able to bring in the data without problems (thus far!).  Will let you know if I run into any difficulty :) My next attempt will be to automate this conversion and reading these text files using Python - I had done something like this a while back, so I am quite rusty :)


BEGIN PROGRAM.
import string, random
def create_testfile():
    f = open("C:/Documents and Settings/robertsm/Desktop/atest.txt", "wb")
    create_testfile()
def ditch_punctuation(infile, outfile):
    translation_table = {2:32, 3:13, 28:34, 29:32, 30:32, 32:32,10:32 }
    old = "".join([chr(char) for char in translation_table.keys()])
    new = "".join([chr(char) for char in translation_table.values()])
    trans  = string.maketrans(old, new)
    f_out = open(outfile, "wb")
    for line in open(infile, "rb"):
        out = string.translate(line, trans)
        f_out.write(out)
ditch_punctuation(infile="C:/Documents and Settings/robertsm/Desktop/temp.txt", outfile="C:/Documents and Settings/robertsm/Desktop/atest_converted.txt")
END PROGRAM.

Thank you for taking a look at this problem!

Mike

-----Original Message-----
From: SPSSX(r) Discussion [mailto:SPSSX-L@...] On Behalf Of Gene Maguin
Sent: Thursday, December 31, 2009 10:38 AM
To: SPSSX-L@...
Subject: Re: Reading NCPDP formatted data into SPSS

Robert,

You've been working on this a long time and I'd like to try. First, I'd like to understand the exact data structure. So,
1) are these data currently in a text file? So that you will need to use either a get data /type text or a data list statement to read them?

2) are there, in fact, underscore characters between cases?

3) you show four lines per case. Are there really just two lines per case--the 68-character transaction header and another record that becomes three lines in the email because of line wrapping? Or is there really just one line per case? I copied your sample data to a text editor and I suddenly got one line per case.

4) assuming a case consists of two records, is the length of the second record always the same or is it variable? Or, if a case is one line, is the length of a case (record) always the same? Your example data shows that to be the case.

5) Also when I copied the data to my text editor the dash-spaces ('- ') between groups of characters collapsed to a single character, a '?' on my editor and the spaces (' ') disappeared and the character groups ran together. Clarify this please.

6) Also, I don't understand where the 68 charater header ends. Do you mean that, using case 1 as an example, the header record is

[1]G1001050865501335251B1M02501335210123456789AB!!!!!20060523070109DR01‑!

(I've filled spaces with '!') Which is 71 characters long. Or, are you saying the header is actually

G1001050865501335251B1M02501335210123456789AB!!!!!20060523070109DR01

Where 'G1' is the added characters and '[1]' is something else but is really there?


Gene Maguin



>>The code from Albert-Jan helped me deal with the non-printing characters in the data file. I am still struggling with how to read in the variables for each case, and would dearly like some help with how to approach reading in these data from a text file.

The Programming and Data Management for SPSS 17.0 guide by R. Levesque shows how to read in data for simple delimited/fixed, and complex delimited/fixed, but I think the data files I am working with don’t fit any of these groups J  It looks “mixed” since not all variables are included in all records, and all records are all on one loooong line!

I think an Input Program approach should work, but the documentation seems to suggest that I identify the positions of the variables, which is impossible given the data(?)  How can I get around this?


The following is what I am working with –
Each record starts and ends with an with an ASCII character (converted from a non-printing character at the beginning using Albert-Jan’s approach) – in the sample data below the record begins with “G1” and ends before the next “G1”;

There is a 68-character transaction header – the only consistent thing about these records, with information about the record;
Next there are segment identifiers which identify related components of the record followed by variables which are also preceded by variable identifiers (Ex, Gx, Dx, Mx, etc);
Not all variables are required, so the following record may have a variable in a different position from the previous record; There are a set of core variables that every record must include; Some variables are required, depending upon a couple of other variables.

Here is an example of 3 records of generated data (no real personal identifiers), but similar to what the text data files look like; the variable indicators are genuine and so are the segment identifiers (AMxx).

________________________________

[1]G1001050865501335251B1M02501335210123456789AB     20060523070109DR01‑ AM04 C23163749020 C1YLJOURNEYX‑ AM01 CX99 CY9876543210 C4185001212 C51 CAIOMPQ CBVLVYN C700‑ AM07 EM1 D20126565 E103 D700378180301 E730000 D30 D530 D61 D80 DE20060420 DF99 DK9 C80 28EA‑ AM11 D962D DC20{ DQ82D DU82D‑ AM03 EZ01 DB1234000009
________________________________

[1]G1001050876201224251B1M0250133521012468013579     20030325070109DR01‑ AM04 C27495343567 C1YLJOURNEYX‑ AM01 CX99 CY7468357123 C428460621 C52 CAYYZDBCA CBAXBEYMHV C700‑ AM07 EM1 D20430007 E103 D745802004064 E7120000 D30 D515 D61 D80 DE20040417 DF99 DK9 C80 28ML‑ AM11 D985E DC20{ DQ105E DU105E‑ AM03 EZ01 DB1750355699
________________________________

[1]G1001050897501335251B1M0250133521011117771234     20021223070109DR01‑ AM04 C28800437556 C1YLJOURNEYX‑ AM01 CX99 CY3451236789 C420011021 C51 CAXLMNQ CBNGHYE C700‑ AM07 EM1 D21941780 E103 D716252051501 E714000 D30 D57 D61 D80 DE20061123 DF99 DK9 C80 28EA‑ AM11 D958G DC20{ DQ78G DU78G‑ AM03 EZ01 DB1346262193

=====================
To manage your subscription to SPSSX-L, send a message to
LISTSERV@... (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD

=====================
To manage your subscription to SPSSX-L, send a message to
LISTSERV@... (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD