|
(sorry, I forgot to change the subject line in my previous mail)
Hi,

I want to randomly round numbers to multiples of a thousand. Is the syntax below the way to do this? I don't want the numbers to be consistently rounded up or down.

Thanks in advance!

Cheers!!
Albert-Jan

data list free / x (f6).
begin data
121321 121011 128777
end data.
set seed = 12139.
numeric y (f12).
compute y = ( rnd ( (x + rv.uniform(-10**3, 10**3)) * 10**-3) ) * 10**3.

=====================
To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD |
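One common reading of "not consistently rounded up or down" is stochastic rounding: a value goes up to the next multiple with a probability equal to how far along it sits between the two neighbouring multiples. The sketch below shows that idea in Python rather than SPSS syntax; the function name `random_round` and its parameters are made up for illustration. By comparison, noise drawn from (-1000, 1000) as in the syntax above can carry a value past its two nearest multiples (121321 could end up at 120000).

```python
import math
import random


def random_round(x, base=1000, rng=random):
    """Round x to a multiple of base, up or down at random.

    The chance of rounding up equals the fraction of the way x sits
    between the lower and the upper multiple, so the result is unbiased
    on average. (Hypothetical helper, not an SPSS function.)
    """
    lower = math.floor(x / base) * base
    fraction = (x - lower) / base  # 0.0 <= fraction < 1.0
    return lower + base if rng.random() < fraction else lower


if __name__ == "__main__":
    rng = random.Random(12139)  # fixed seed so a run can be repeated
    for value in (121321, 121011, 128777):
        print(value, "->", random_round(value, rng=rng))
```

A value that is already a multiple of the base is returned unchanged, because its fraction is zero.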
|
Hello all,
I have a large pdf file that has data printed on each page with the same variables always located in the same location on each page. The pages are not simple tables. They are formatted as paragraphs with data presented after colons. The pages also have large graphical headers and footers. Each page has data for multiple individuals on it. Here is an oversimplified example of what each page looks like:

Headers

Patient name: XXX XXXXX     Member ID: XXXXXXXXXXXX
Admit date: XXX XXXXX       Diagnosis: XXX XXXXX
                            Procedure XXX XXXXX

Patient name: XXX XXXXX     Member ID: XXX XXXXX
Admit date: XXX XXXXX       Diagnosis: XXX XXXXX
                            Procedure XXX XXXXX

Patient name: XXX XXXXX     Member ID: XXX XXXXX
Admit date: XXX XXXXX       Diagnosis: XXX XXXXX
                            Procedure XXX XXXXX

Footers

There is a lot more information but I think you get the idea. Being that the information is always in the same place on each of over 1900 pages it seems like there should be a way to grab that. I do have acrobat professional so I think I can convert the pages to an editable form, probably read into excel or save as text. Let's just assume I get that problem solved.

Any idea how I could maybe use python or some other program to read this information in?

Any ideas would be much appreciated!

Thanks
Matt

Matthew Pirritano, Ph.D.
Research Analyst IV
Medical Services Initiative (MSI)
Orange County Health Care Agency
(714) 568-5648 |
|
Matthew,
OMG, that's a PROBLEM. I had something like that recently where I had paper pages of data in columns. I scanned it to pdf and then converted it to text using acrobat professional. Columns were not preserved! And, neither were lines! You may have the same problem.

But, let's say that God smiles, warmly and broadly, on you and every data element is preserved in EXACTLY the same line and column on every page.

The easiest way to read the data is through a DATA LIST command. It'll be lengthy. But, so what. I've done this before and once the data elements are in exactly the same position on every line and column on every page, well, it's tediously simple.

So, maybe it was only a half-smile. If you can get the same data elements on the same line on every page, you can read the page as a fixed number of 80 or 100 character text strings and then process each string to extract the data elements for that line. This will be serious work. The REREAD command in an INPUT PROGRAM, which re-reads a line once you have determined what it contains, can be useful, but I think it will depend on the exact structure of your data file.

Good luck!
Gene Maguin |
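The fixed-position idea can be sketched in a few lines of Python, purely as an illustration of the logic and not as a replacement for a DATA LIST command. Everything specific here is an assumption: the layout table, the column numbers, the four-line page and the name `read_pages` are invented, and a real page would hold several patients and far more fields.

```python
# Hypothetical layout: field name -> (line offset within the page,
# start column, end column). These positions are invented for the
# example; real ones have to be measured in the converted text file.
LAYOUT = {
    "patient_name": (1, 14, 40),
    "member_id": (2, 11, 30),
}
LINES_PER_PAGE = 4  # assumed: every page has exactly this many lines


def read_pages(lines, layout=LAYOUT, lines_per_page=LINES_PER_PAGE):
    """Yield one dict per page by slicing fixed line/column positions."""
    for start in range(0, len(lines) - lines_per_page + 1, lines_per_page):
        page = lines[start:start + lines_per_page]
        yield {
            name: page[row][first:last].strip()
            for name, (row, first, last) in layout.items()
        }


if __name__ == "__main__":
    sample = ["Headers", "Patient name: Jane Doe", "Member ID: A123", "Footers"] * 2
    for record in read_pages(sample):
        print(record)
```

This only works if the conversion really does keep every element on the same line and in the same columns; a single shifted line on one page silently puts the wrong text into a field.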
|
Thanks for the idea. Unfortunately I think I just need to have the data sent in a different format. There are headers, data for multiple individuals per page, the text file splits up what were the 2 columns in the pdf and puts the second column after the data for three people, and on and on.

It looks like I may be sol on this one.

Thanks
Matt

Matthew Pirritano, Ph.D.
Research Analyst IV
Medical Services Initiative (MSI)
Orange County Health Care Agency
(714) 568-5648 |
|
In reply to this post by Maguin, Eugene
Listers,
I have searched high and low and I cannot find the answer to this. I want to format a variable as a date. I already have the data read in; I'm just creating a new variable and want it to be a date. It is easy enough to do this using point and click. But what about syntax? I have used numeric variable(date), but this limits me to the day-month-year format. What if I want it to be month-year?

This has gotta be super simple, but the answer eludes me.

Thanks
Matt

Matthew Pirritano, Ph.D.
Research Analyst IV
Medical Services Initiative (MSI)
Orange County Health Care Agency
(714) 568-5648 |
|
Go to help and search for date format. The first hit is the table of all date formats. Does the MOYRw format do what you want? Note that this is just a format and doesn't actually squash day and time out of the underlying value, which is a whole bunch of seconds ;-) |
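The point about the format being display-only can be illustrated outside SPSS. SPSS stores a date as the number of seconds since midnight, October 14, 1582; the Python sketch below mimics that scale and shows that a month-year rendering leaves the day and time in the value. The helper names are hypothetical, and the "MON YYYY" layout is only an approximation of what a MOYR format displays.

```python
from datetime import datetime, timedelta

SPSS_EPOCH = datetime(1582, 10, 14)  # zero point of the SPSS date scale
MONTHS = ("JAN", "FEB", "MAR", "APR", "MAY", "JUN",
          "JUL", "AUG", "SEP", "OCT", "NOV", "DEC")


def to_spss_seconds(moment):
    """Seconds between the SPSS epoch and the given datetime."""
    return (moment - SPSS_EPOCH).total_seconds()


def show_month_year(spss_seconds):
    """Render a value as 'MON YYYY'; the value itself is not changed."""
    moment = SPSS_EPOCH + timedelta(seconds=spss_seconds)
    return "%s %d" % (MONTHS[moment.month - 1], moment.year)


if __name__ == "__main__":
    value = to_spss_seconds(datetime(2009, 4, 20, 15, 35))
    print(show_month_year(value))                 # month and year only
    print(SPSS_EPOCH + timedelta(seconds=value))  # day and time are still there
```

Two values from different days of the same month therefore look identical under a month-year format but still compare as unequal.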
|
In reply to this post by mpirritano
Hi Matthew!
I'd save it to .txt using acrobat distiller, then run something like the Python code below. It reads all data and writes it to a tab-separated file. It uses regular expressions (regex) to match the patterns you're looking for. You could extend it for all your vars. The regexes are quite crude; you can improve them (if needed) depending on your data characteristics. For example, you could make a regex for a specific date pattern. I'm not sure if it handles missing values well. Maybe if you replace:

    if regex_name.findall(line):
        names.extend(regex_name.findall(line))

with simply:

    names.extend(regex_name.findall(line))

It'll simply write blanks. It worked well on a small test file. If the file is really huuuge the program may consume a lot of memory.

Cheers!!
Albert-Jan

import re

# Crude patterns for each tagged field; refine them to fit your data.
regex_name = re.compile(r"Patient name: (\w+ \w+)?\s+")
regex_id = re.compile(r"Member ID:\s+(\w+)?\s+")
regex_adm = re.compile(r"Admit date:\s+(\w+ \w+)?\s+")

names = []
ids = []
adms = []

# Collect every match, line by line.
with open("d:/temp/waspdf.txt", "r") as f_in:
    for line in f_in:
        if regex_name.findall(line):
            names.extend(regex_name.findall(line))
        if regex_id.findall(line):
            ids.extend(regex_id.findall(line))
        if regex_adm.findall(line):
            adms.extend(regex_adm.findall(line))

# Write one tab-separated record per patient.
with open("d:/temp/isgood.txt", "w", newline="") as f_out:
    for recnum, (namex, idx, admx) in enumerate(zip(names, ids, adms)):
        wrtstr = "\t".join([namex.strip(), idx.strip(), admx.strip()]) + "\r\n"
        if recnum % 100 == 0:
            print("--> Writing record", recnum, ":", wrtstr)
        f_out.write(wrtstr)

print("--> Done! See d:/temp for results")

# Post-hoc check to verify if the number of names, ids and admission dates is equal
lens = [len(names), len(ids), len(adms)]
if min(lens) != max(lens):
    print("--> The sh*t has hit the fan: check your results carefully")
del names, ids, adms |
|
In reply to this post by mpirritano
At 12:08 PM 4/20/2009, Pirritano, Matthew wrote:
>I [need to read] a large pdf file that has data printed on each page
>with the same variables always located in the same location on each
>page. They are formatted as paragraphs with data presented after
>colons. The pages also have large graphical headers and footers.
>Each page has data for multiple individuals on it. Here is an
>oversimplified example:
>
>Headers
>
>Patient name: XXX XXXXX     Member ID: XXXXXXXXXXXX
>Admit date: XXX XXXXX       Diagnosis: XXX XXXXX
>                            Procedure XXX XXXXX
>
>Footers

There are at least two ways to think about this. One is as Gene Maguin suggested:

>[If] God smiles, warmly and broadly, and every data element is
>preserved in EXACTLY the same line and column on every page. The
>easiest way to read the data is through a [complicated] DATA LIST command.

Right. You'll probably read a whole page with one DATA LIST; that'll be several patients. I think you can organize the variables within an INPUT PROGRAM so that VARSTOCASES can unroll the file to one record per patient. (Straight DATA LIST can't do it, that I can see.)

That puts the burden of dealing with any format irregularities on the step before you read into SPSS. There will be format irregularities.

Or, you can explicitly parse the incoming data, using string commands to extract the fields as they come. You've got the advantage that many of your fields are identified by tags ending with colons. I've done this a few times. It's work, but the result is pretty robust.

Albert-jan Roskam recommends doing this in Python. I'd stick with SPSS transformation language, myself. An INPUT PROGRAM is appropriate for this.

As Albert-jan says, you'll have to get it into text-only format, remove the graphics, before SPSS can read it at all.

-Best of luck,
 Richard |
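The tag-driven parsing described above can be sketched as follows. It is written in Python only to show the logic; the same steps could be done with string functions in an INPUT PROGRAM. The tag list comes from the example page, while the function name `parse_records` and the rule "a new record starts at each Patient name tag" are assumptions. "Procedure" is left out because it has no colon in the example.

```python
import re

TAGS = ("Patient name", "Member ID", "Admit date", "Diagnosis")
_ALTS = "|".join(TAGS)
# Matches "Tag: value", where the value runs up to the next tag
# on the same line or to the end of the line.
TAG_RE = re.compile(r"(%s):\s*(.*?)\s*(?=(?:%s):|$)" % (_ALTS, _ALTS))


def parse_records(lines):
    """Group tagged fields into one dict per patient.

    A new record starts at each "Patient name:" tag, so a missing
    value leaves a blank in that record instead of shifting the
    fields of later patients.
    """
    records = []
    for line in lines:
        for tag, value in TAG_RE.findall(line):
            if tag == "Patient name" or not records:
                records.append({})
            records[-1][tag] = value
    return records


if __name__ == "__main__":
    sample = [
        "Headers\n",
        "Patient name: Jane Doe     Member ID: A123\n",
        "Admit date: 01 Apr         Diagnosis: XXX\n",
        "Footers\n",
    ]
    for record in parse_records(sample):
        print(record)
```

Keeping each patient's fields together in one record, rather than in separate lists that are matched up afterwards, is what makes this kind of parsing tolerant of the occasional missing field.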
|
In reply to this post by ViAnn Beadle
Dear All,
I need help with this. I am not sure if this is possible, but I am looking at the following date calculation:

date1      date2      Proportionofyears  Proportionofmonths  proportionofdays
20081102   20091102   1
20080101   20090601   1.5

Is there a way to express proportion of dates in SPSS? In simple thinking I am looking at date2/date1=Proportionofyears. If possible I would like to know the proportionofmonths and proportionofdays calculation too.

Any answers are greatly appreciated.

Thanks in advance,
Dorraj |
|
Hi Dorraj
It's a bit of a "hack", because the trick comes with looking at days, since the number of days in a month is always changing (unlike months, fixed at 12 in a year). So this "hack" will work it out as you have asked (specifying your "proportion"). Also, from 1-Jan to 1-Jun is 5 months, not 6 :)

You can use this syntax:

* Split the yyyymmdd numbers into day, month and year.
COMPUTE day1=MOD(date1,100).
COMPUTE month1=MOD(TRUNC(date1/100),100).
COMPUTE year1=TRUNC(date1/10000).
COMPUTE day2=MOD(date2,100).
COMPUTE month2=MOD(TRUNC(date2/100),100).
COMPUTE year2=TRUNC(date2/10000).
EXECUTE.

* Date and Time Wizard: date1_new.
COMPUTE date1_new = DATE.DMY(day1, month1, year1).
VARIABLE LABEL date1_new.
VARIABLE LEVEL date1_new (SCALE).
FORMATS date1_new (DATE11).
VARIABLE WIDTH date1_new(11).
EXECUTE.

* Date and Time Wizard: date2_new.
COMPUTE date2_new = DATE.DMY(day2, month2, year2).
VARIABLE LABEL date2_new.
VARIABLE LEVEL date2_new (SCALE).
FORMATS date2_new (DATE11).
VARIABLE WIDTH date2_new(11).
EXECUTE.

* Difference in months, expressed as a proportion of a year.
COMPUTE date_diff = (DATEDIFF(date2_new,date1_new,"months"))/12 .
EXECUTE .

* Remove the helper variables.
DELETE VARIABLES day1 month1 year1 day2 month2 year2 date1_new date2_new.

Regards
Adam
-- Cell: +27 84 777 1801 Website: http://www.sigmasurveys.co.za Blog: http://www.sigmasurveys.co.za/resources |
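The arithmetic behind that syntax can be checked with a short Python sketch. It splits the yyyymmdd numbers the same way (integer division and remainder) and counts completed months, which is how I understand DATEDIFF with "months" to behave; that reading, and the helper names, are assumptions rather than something taken from the SPSS documentation.

```python
from datetime import date


def parse_yyyymmdd(number):
    """Turn a number such as 20080101 into a date."""
    year, rest = divmod(number, 10000)
    month, day = divmod(rest, 100)
    return date(year, month, day)


def whole_months_between(start, end):
    """Completed months from start to end (a partial month is dropped)."""
    months = (end.year - start.year) * 12 + (end.month - start.month)
    if end.day < start.day:
        months -= 1
    return months


if __name__ == "__main__":
    for d1, d2 in ((20081102, 20091102), (20080101, 20090601)):
        start, end = parse_yyyymmdd(d1), parse_yyyymmdd(d2)
        months = whole_months_between(start, end)
        days = (end - start).days
        # proportion of years, number of months, number of days
        print(d1, d2, round(months / 12, 4), months, days)
```

For the second pair this gives 17 months, so the proportion of years is 17/12 (about 1.42) rather than 1.5, in line with the remark that 1-Jan to 1-Jun is 5 months.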
