SPSSX Discussion

random rounding

Albert-Jan Roskam

random rounding

(sorry, I forgot to change the subject line in my previous mail)

Hi,

I want to randomly round numbers to multiples of a thousand. Is the syntax below the way to do this? I don't want the numbers to be consistently rounded up or down. Thanks in advance!

Cheers!!
Albert-Jan

data list free / x (f6).
begin data
121321
121011
128777
end data.

set seed = 12139.
numeric y (f12).
compute y = ( rnd ( (x + rv.uniform(-10**3, 10**3)) * 10**-3) ) * 10**3.
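
For comparison, here is a minimal Python sketch of a different technique from the syntax above: round each value down or up to one of its two neighbouring multiples, choosing "up" with probability equal to the remainder divided by the base. The function name `random_round` and its parameters are made up for this illustration. Because the posted COMPUTE adds noise of up to a thousand in either direction before rounding, it can occasionally land on a multiple that is not one of the two nearest; the sketch below never does.

```python
import random


def random_round(x, base=1000, rng=random):
    """Round x to a neighbouring multiple of base, up or down at random.

    The chance of rounding up equals remainder / base, so the rounded
    value is unbiased on average. (Hypothetical helper for illustration.)
    """
    lower = (x // base) * base      # nearest multiple at or below x
    remainder = x - lower
    if remainder == 0:
        return lower                # already a multiple: leave unchanged
    if rng.random() < remainder / base:
        return lower + base         # round up
    return lower                    # round down


rng = random.Random(12139)          # fixed seed so the run is repeatable
for x in (121321, 121011, 128777):
    print(x, random_round(x, rng=rng))
```

With this scheme 121011 is rounded down to 121000 most of the time, but not always, which is the "not consistently up or down" behaviour asked for.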

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD

mpirritano

reading in formatted data

Hello all,

I have a large pdf file that has data printed on each page with the same
variables always located in the same location on each page. The pages
are not simple tables. They are formatted as paragraphs with data
presented after colons. The pages also have large graphical headers and
footers. Each page has data for multiple individuals on it. Here is an
oversimplified example of what each page looks like:

Headers

Patient name: XXX XXXXX Member ID: XXXXXXXXXXXX
Admit date: XXX XXXXX Diagnosis: XXX
XXXXX
Procedure XXX
XXXXX

Patient name: XXX XXXXX Member ID: XXX XXXXX
Admit date: XXX XXXXX Diagnosis: XXX
XXXXX
Procedure XXX
XXXXX

Patient name: XXX XXXXX Member ID: XXX XXXXX
Admit date: XXX XXXXX Diagnosis: XXX XXXXX
Procedure XXX
XXXXX

Footers

There is a lot more information but I think you get the idea. Being that
the information is always in the same place on each of over 1900 pages
it seems like there should be a way to grab that. I do have acrobat
professional so I think I can convert the pages to an editable form,
probably read into excel or save as text. Let's just assume I get that
problem solved.

Any idea how I could maybe use python or some other program to read this
information in?

Any ideas would be much appreciated!

Thanks
Matt

Matthew Pirritano, Ph.D.
Research Analyst IV
Medical Services Initiative (MSI)
Orange County Health Care Agency
(714) 568-5648

Maguin, Eugene

Re: reading in formatted data

Matthew,

OMG, that's a PROBLEM. I had something like that recently where I had paper
pages of data in columns. I scanned it to pdf and then converted it to text
using acrobat professional. Columns were not preserved! And, neither were
lines! You may have the same problem.

But, let's say that God smiles, warmly and broadly, on you and every data
element is preserved in EXACTLY the same line and column on every page.

The easiest way to read the data is through a data list command. It'll be
lengthy. But, so what. I've done this before and once the data elements are
in exactly the same position on every line and column on every page, well,
it's tediously simple.

So, maybe it was only a half-smile. If you can get the same data elements on
the same line on every page, you can read each page as a fixed number of
80- or 100-character text strings and then process each string to extract the
data elements for that line. This will be serious work. The REREAD command
in an INPUT PROGRAM, which re-reads a line once you have determined
what it contains, can be useful, but I think it will depend on the exact
structure of your datafile.

Good luck!
Gene Maguin

mpirritano

Re: reading in formatted data

Thanks for the idea. Unfortunately I think I just need to have the data
sent in a different format. There are headers, data for multiple
individuals per page, the text file splits up what were the 2 columns in
the pdf and puts the second column after the data for three people, and
on and on.

It looks like I may be sol on this one.

Thanks
Matt

Matthew Pirritano, Ph.D.
Research Analyst IV
Medical Services Initiative (MSI)
Orange County Health Care Agency
(714) 568-5648

mpirritano

date format

In reply to this post by Maguin, Eugene

Listers,

I have searched high and low and I cannot find the answer to this.

I want to format a variable as a date. I already have the data read
in. I'm just creating a new variable and want it to be a date. It is
easy enough to do this using point and click. But what about syntax?

I have used numeric variable(date). But this limits me to the
day-month-year format. What if I want it to be month-year?

This has gotta be super simple but the answer eludes me.

Thanks
Matt

Matthew Pirritano, Ph.D.
Research Analyst IV
Medical Services Initiative (MSI)
Orange County Health Care Agency
(714) 568-5648

ViAnn Beadle

Re: date format

Go to help and search for date format. The first hit is the table of all
date formats. Does the MOYRw format do what you want? Note that this is just a
format and doesn't actually squash day and time out of the underlying value,
which is a whole bunch of seconds ;-)
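
To illustrate the point about the underlying value, here is a small Python sketch (not SPSS syntax). It relies on SPSS storing dates as seconds counted from midnight on 14 October 1582; the helper names are invented for this example. Displaying the value as month-year changes only the rendering, so the day and time are still there afterwards.

```python
from datetime import datetime, timedelta

# SPSS date values count seconds from this instant.
SPSS_EPOCH = datetime(1582, 10, 14)


def datetime_to_spss(dt):
    """Seconds since the SPSS epoch (hypothetical helper)."""
    return (dt - SPSS_EPOCH).total_seconds()


def spss_to_datetime(seconds):
    """Inverse of datetime_to_spss (hypothetical helper)."""
    return SPSS_EPOCH + timedelta(seconds=seconds)


value = datetime_to_spss(datetime(2009, 4, 20, 15, 35))
as_dt = spss_to_datetime(value)

print(value)                         # the stored number of seconds
print(as_dt.strftime("%d-%b-%Y"))    # day-month-year rendering
print(as_dt.strftime("%b %y"))       # month-year rendering of the same value
print(as_dt.day, as_dt.hour)         # day and time are still in the value
```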

Albert-Jan Roskam

Re: reading in formatted data

In reply to this post by mpirritano

Hi Matthew!

I'd save it to .txt using acrobat distiller, then run something like the Python code below. It reads all data and writes it to a tab-separated file. It uses regular expressions (regex) to match the patterns you're looking for. You could extend it for all your vars. The regexes are quite crude; you can improve them (if needed) depending on your data characteristics. For example, you could make a regex for a specific date pattern.

I'm not sure if it handles missing values well. Maybe if you replace:
    if regex_name.findall(line):
        names.extend(regex_name.findall(line))
with simply:
    names.extend(regex_name.findall(line))
It'll simply write blanks.

It worked well on a small test file. If the file is really huuuge the program may consume a lot of memory.

Cheers!!
Albert-Jan

import re

# Input: the text exported from the PDF. Output: a tab-separated file.
f_in = open("d:/temp/waspdf.txt", "r")
f_out = open("d:/temp/isgood.txt", "w")

# Crude patterns; the optional group yields "" when nothing follows the label.
regex_name = re.compile(r"Patient name: (\w+ \w+)?\s+")
regex_id = re.compile(r"Member ID:\s+(\w+)?\s+")
regex_adm = re.compile(r"Admit date:\s+(\w+ \w+)?\s+")

names = []
ids = []
adms = []

# Collect every match in the file, line by line.
for line in f_in:
    if regex_name.findall(line):
        names.extend(regex_name.findall(line))
    if regex_id.findall(line):
        ids.extend(regex_id.findall(line))
    if regex_adm.findall(line):
        adms.extend(regex_adm.findall(line))
f_in.close()

# Write one tab-separated row per patient (text mode handles the line ending).
for recnum, (namex, idx, admx) in enumerate(zip(names, ids, adms)):
    wrtstr = "\t".join([namex.strip(), idx.strip(), admx.strip()]) + "\n"
    if recnum % 100 == 0:
        print("--> Writing record", recnum, ":", wrtstr)
    f_out.write(wrtstr)
f_out.close()
print("--> Done! See d:/temp for results")

# Post-hoc check to verify if the number of names, ids and admission dates is equal
lens = [len(names), len(ids), len(adms)]
if min(lens) != max(lens):
    print("--> The sh*t has hit the fan: check your results carefully")
del names, ids, adms

Richard Ristow

Re: reading in formatted data

In reply to this post by mpirritano

At 12:08 PM 4/20/2009, Pirritano, Matthew wrote:

>I [need to read] a large pdf file that has data printed on each page
>with the same variables always located in the same location on each
>page. [They] are formatted as paragraphs with data presented after
>colons. The pages also have large graphical headers and footers.
>Each page has data for multiple individuals on it. Here is an
>oversimplified example:
>
>Headers
>
>Patient name: XXX XXXXX Member ID: XXXXXXXXXXXX
>Admit date: XXX XXXXX Diagnosis: XXX
>XXXXX
> Procedure XXX
>XXXXX
>
>
>Footers

There are at least two ways to think about this. One is as Gene
Maguin suggested:

>[If] God smiles, warmly and broadly, and every data element is
>preserved in EXACTLY the same line and column on every page. The
>easiest way to read the data is through a [complicated] DATA LIST command.

Right. You'll probably read a whole page with one DATA LIST; that'll
be several patients. I think you can organize the variables within
an INPUT PROGRAM so that VARSTOCASES can unroll the file to one
record per patient. (Straight DATA LIST can't do it, that I can see.)

That puts the burden of dealing with any format irregularities on you,
before you read the file into SPSS. There will be format irregularities.

Or, you can explicitly parse the incoming data, using string commands
to extract the fields as they come. You've got the advantage that
many of your fields are identified by tags ending with colons.

I've done this a few times. It's work, but the result is pretty robust.
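
As a rough illustration of this tag-based parsing (sketched in Python rather than SPSS transformation language), the snippet below splits each line on known labels ending in a colon and starts a new record whenever it meets "Patient name". The label list is taken from the sample page earlier in the thread, the function name `parse_records` is invented here, and the sample values are made up. Continuation lines and fields without a colon (such as "Procedure") are ignored in this sketch and would need their own handling.

```python
import re

# Labels assumed from the sample page; extend for the real file.
LABELS = ["Patient name", "Member ID", "Admit date", "Diagnosis"]
TAG = re.compile(r"(%s):" % "|".join(re.escape(label) for label in LABELS))


def parse_records(lines):
    """Return one dict per patient, keyed by label (hypothetical helper)."""
    records = []
    current = None
    for line in lines:
        # With one capture group, split() returns
        # [text_before, label1, value1, label2, value2, ...].
        parts = TAG.split(line)
        for label, value in zip(parts[1::2], parts[2::2]):
            if label == "Patient name":
                current = {}            # a new patient block begins
                records.append(current)
            if current is not None:
                current[label] = value.strip()
    return records


sample = [
    "Headers",
    "Patient name: Jane Doe Member ID: A123",
    "Admit date: 20 April Diagnosis: flu",
    "Procedure none",
    "Patient name: John Roe Member ID: B456",
    "Admit date: 21 April Diagnosis: cold",
    "Footers",
]
for rec in parse_records(sample):
    print(rec)
```

Header and footer lines drop out on their own because they contain none of the labels.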

Albert-jan Roskam recommends doing this in Python. I'd stick with
SPSS transformation language, myself.

An INPUT PROGRAM is appropriate for this.

As Albert-jan says, you'll have to get it into text-only format,
with the graphics removed, before SPSS can read it at all.

-Best of luck,
Richard

Jarrod Teo-2

date Calculation

In reply to this post by ViAnn Beadle

Dear All,

I need help in this.

I am not sure if this is possible but I am looking at the following date calculation

date1        date2          Proportionofyears Proportionofmonths proportionofdays
20081102    20091102      1
20080101    20090601      1.5

Is there a way to express proportion of dates in SPSS? In simple thinking I am looking at date2/date1=Proportionofyears.

If possible I would like to know the proportionofmonths and proportionofdays calculation too.

Any answers are greatly appreciated.

Thanks in advance,
Dorraj

Adam-237

Re: date Calculation

Hi Dorraj

It's a bit of a "hack", because the tricky part is the days: the number of days in a month varies, unlike the months in a year, which are fixed at 12. So this "hack" will work it out as you have asked (specifying your "proportion").

Also, from 1-Jan to 1-Jun is 5 months, not 6 :)

You can use this syntax:

COMPUTE day1=MOD(date1,100).
COMPUTE month1=MOD(TRUNC(date1/100),100).
COMPUTE year1=TRUNC(date1/10000).
COMPUTE day2=MOD(date2,100).
COMPUTE month2=MOD(TRUNC(date2/100),100).
COMPUTE year2=TRUNC(date2/10000).
EXECUTE.

* Date and Time Wizard: date1_new.
COMPUTE date1_new = DATE.DMY(day1, month1, year1).
VARIABLE LABEL date1_new.
VARIABLE LEVEL date1_new (SCALE).
FORMATS date1_new (DATE11).
VARIABLE WIDTH date1_new(11).
EXECUTE.
* Date and Time Wizard: date2_new.
COMPUTE date2_new = DATE.DMY(day2, month2, year2).
VARIABLE LABEL date2_new.
VARIABLE LEVEL date2_new (SCALE).
FORMATS date2_new (DATE11).
VARIABLE WIDTH date2_new(11).
EXECUTE.

COMPUTE date_diff = (DATEDIFF(date2_new,date1_new,"months"))/12 .
EXECUTE .

DELETE VARIABLES day1 month1 year1 day2 month2 year2 date1_new date2_new.
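
The same arithmetic can be sketched in Python as a cross-check. This is only an illustration, not a replacement for the syntax above: the helper names are invented, date2 is assumed to be on or after date1, and the month count is meant to mirror "completed months" but has not been checked against DATEDIFF.

```python
from datetime import date


def parse_yyyymmdd(n):
    """Turn a number such as 20081102 into a date (hypothetical helper)."""
    return date(n // 10000, (n // 100) % 100, n % 100)


def whole_months_between(d1, d2):
    """Completed months from d1 to d2, assuming d2 >= d1."""
    months = (d2.year - d1.year) * 12 + (d2.month - d1.month)
    if d2.day < d1.day:
        months -= 1                 # the last month is not yet complete
    return months


for n1, n2 in [(20081102, 20091102), (20080101, 20090601)]:
    d1, d2 = parse_yyyymmdd(n1), parse_yyyymmdd(n2)
    months = whole_months_between(d1, d2)
    days = (d2 - d1).days
    print(n1, n2, "years:", months / 12, "months:", months, "days:", days)
```

For the second row this gives 17 months rather than 18, in line with the remark above that 1-Jan to 1-Jun is 5 months, so the proportion of years comes out a little under 1.5.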

Regards
Adam

--
Cell: +27 84 777 1801
Website: http://www.sigmasurveys.co.za
Blog: http://www.sigmasurveys.co.za/resources