Importing Excel file -- unknown number of rows

classic Classic list List threaded Threaded
4 messages Options
Reply | Threaded
Open this post in threaded view
|

Importing Excel file -- unknown number of rows

Warren Chrusciel
I am trying to write syntax to import a worksheet of an Excel file
starting at row 30.  The Excel file currently is a template that will
ultimately have an unknown number of rows.

GET DATA
  /TYPE=XLS
  /FILE='D:\Template file.xls'
  /SHEET=name 'Worksheet 1'
  /CELLRANGE=range 'A30:BO3000'
  /READNAMES=on
  /ASSUMEDSTRWIDTH=32767.

I noticed that when I choose the second entry on the CELLRANGE (currently
at BO3000) I get an error message if there are fewer rows with data than
included in the CELLRANGE.  For instance, I will get an error message if
the data end in row 2999.  However, the data are imported properly if row
3000 is added to the data set.

Is there any way to write syntax that can accomodate an unknown number of
rows in an Excel worksheet assuming I don't want to start importing at row
1?

Thank you,

Warren Chrusciel
Municipal Property Assessment Corporation
1420 Blair Place, Suite 300
Ottawa, Ontario K1J 9L8

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD
Reply | Threaded
Open this post in threaded view
|

Re: Importing Excel file -- unknown number of rows

Albert-Jan Roskam
Hi Warren!
 
It may be possible with native spss syntax, but here's how to do it in Python. You need the xlrd library for that. This may not work properly with files newer versions of excel. Also, I didn't test if date fields are processed correctly.
 
import xlrd, csv
 
def write_csv(sheet, writer, STARTROW):
    for rx in range(sheet.nrows):
        for cx in range(sheet.ncols):
            if rx + 1 >= STARTROW:
                writer.writerow( (sheet.name, rx+1, sheet.cell_value(rx, cx)) )
 
def main(xls_file, csv_file, STARTROW = 30):
    book = xlrd.open_workbook(xls_file)
    f = open(csv_file, "wb")
    writer = csv.writer(f)
    for sheetno in range(book.nsheets):
        sheet = book.sheet_by_index(sheetno)
        nrows = sheet.nrows
        if nrows >= STARTROW:
            write_csv(sheet, writer, STARTROW)
    f.close()
if __name__ == "__main__":
    main(xls_file = "d:/temp/test.xls", csv_file = "d:/temp/test.csv")


Cheers!!
Albert-Jan

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
All right, but apart from the sanitation, the medicine, education, wine, public order, irrigation, roads, a fresh water system, and public health, what have the Romans ever done for us?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

--- On Thu, 5/13/10, Warren Chrusciel <[hidden email]> wrote:

From: Warren Chrusciel <[hidden email]>
Subject: [SPSSX-L] Importing Excel file -- unknown number of rows
To: [hidden email]
Date: Thursday, May 13, 2010, 2:26 PM

I am trying to write syntax to import a worksheet of an Excel file
starting at row 30.  The Excel file currently is a template that will
ultimately have an unknown number of rows.

GET DATA
  /TYPE=XLS
  /FILE='D:\Template file.xls'
  /SHEET=name 'Worksheet 1'
  /CELLRANGE=range 'A30:BO3000'
  /READNAMES=on
  /ASSUMEDSTRWIDTH=32767.

I noticed that when I choose the second entry on the CELLRANGE (currently
at BO3000) I get an error message if there are fewer rows with data than
included in the CELLRANGE.  For instance, I will get an error message if
the data end in row 2999.  However, the data are imported properly if row
3000 is added to the data set.

Is there any way to write syntax that can accomodate an unknown number of
rows in an Excel worksheet assuming I don't want to start importing at row
1?

Thank you,

Warren Chrusciel
Municipal Property Assessment Corporation
1420 Blair Place, Suite 300
Ottawa, Ontario K1J 9L8

=====================
To manage your subscription to SPSSX-L, send a message to
LISTSERV@... (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD

Reply | Threaded
Open this post in threaded view
|

Re: Importing Excel file -- unknown number of rows

Langston, Eric
In reply to this post by Warren Chrusciel
Warren-

I believe that the only way to deal with an unknown number of rows is to
use /CELLRANGE=FULL -- but that starts that data import at row 1 of
course.

What version of SPSS are you using?  One idea, though not elegant, is to
brute force it together and set the formats using the ALTER TYPE syntax
-- which was introduced in version 16.

Two additional pieces to help the 'durability' of your code would be to
use the FILE HANDLE command and to change the SHEET subcommand.  By
using the FILE HANDLE command, the GET DATA syntax is more generalized.
By using the /SHEET=INDEX subcommand, it refers to the first worksheet
in the Excel file rather than the worksheet named "Worksheet 1" -- that
way you don't get an error if someone edits the worksheet name.  So
here's the syntax:

FILE HANDLE TemplateFile /NAME='D:\Template file.xls'.
GET DATA
  /TYPE=XLS
  /FILE= TemplateFile
  /SHEET=index 1.


That will bring *everything* into Excel.  And here's where the brute
force begins ... If we brought everything in, then we have a couple of
things to fix:
1) The first 29 rows need to be deleted.
2) The variable names need to be set.
3) The variable formats need to be set.



1) The first 29 rows need to be deleted.
Here's the syntax that will delete the first 29 rows:

COMPUTE case=$casenum.
EXECUTE.
SELECT IF case>=30.
EXECUTE.



2) The variable names need to be set.
The variable names will default to V1, V2, etc if there isn't any text
in the first row of the Excel file.  Suppose that the third column had
"STUFF" in the first row and the rest were blank, then the variable
names would be V1, V2, STUFF, V4 -- it skips V3 because it found a name.
I'm crossing my fingers for you that the first row will have consistent
values due to the template.  You'll need to customize the syntax below
to fit your variables and I've written it a couple different ways to
give you examples:

* If you just want to paste a column of variables from SPSS and then
paste a column of variables from Excel, this is the best bet.  The first
variable in the list before the equal sign to be renamed to the first
variable in the list after the equal sign.  Not much effort for you to
write, but if someone else looks at the syntax it might be hard to see
what the 14th variable would be renamed to.
RENAME VARIABLES (V1 V2 = Var1 Var2).

* If you want to quickly see that V1 will be Var1, this is the best bet.
This makes it easy to see what the 14th variable would be named to (per
my above example).
RENAME VARIABLES (v1=Var1) (v2=Var2).

* If either of the above give you errors (because it applies all the
name changes at once), then you use one (it applies the name changes one
at a time).
RENAME VARIABLES (V1=Var1).
RENAME VARIABLES (V2=Var2).




3) The variable formats need to be set.
Normally this would be set, but since we're brute forcing it we have to
do this manually.  And it had the potential to get a bit thorny --
specifically, if you're dealing with dates and times.  I am guessing
that the first 29 rows certainly have text in them and the only way to
syntactically change a string variable to a numeric variable is with the
ALTER TYPE command.  I've included the syntax that will convert an Excel
serial number to an SPSS serial number.  For example, if you type
"1/1/2010" into Excel and then format it as a number, it would be 40179.
I've also included the syntax that will convert an Excel serial time to
an SPSS serial time.  If the dates or times get imported as text rather
than serial numbers ... then it gets thorny because the date conversion
is a bit more labor intensive.  I've given one example of how to change
a text value of "1/1/2010" (mm/dd/yyyy) into an SPSS date -- it doesn't
require a leading zero on the month or day.

* If you just want to change something to a number.
ALTER TYPE Var1 (F8.2).

* If the variable has values with a % sign.
ALTER TYPE Var1 (PCT8.2).

* This will convert an Excel date serial number to an SPSS date serial
number, which can then be formatted as a date.
compute SPSS_Date_Time = date.mdy(01,01,1900) + Time.days(
Excel_Serial_Value - 2 ).
formats SPSS_Date_Time (date11).
execute.

*This will convert an Excel time serial number to an SPSS time serial
number, which can then be formatted as time.
compute SPSS_Time = Excel_Serial_Value * 24 * 60 * 60.
formats SPSS_Time (time5).
execute.

* Convert a string date (mm/dd/yyyy) to SPSS date.
compute mo =
number(substr(STRING_DATE,1,index(STRING_DATE,"/")-1),F2.0).
compute day =
number(substr(STRING_DATE,index(STRING_DATE,"/")+1,index(substr(STRING_D
ATE,index(STRING_DATE,"/")+1,50),"/")-1),F2.0).
compute year =
number(substr(STRING_DATE,char.rindex(STRING_DATE,"/")+1,5),F4.0).
compute SPSS_Date = date.mdy(mo, day, year).
formats SPSS_Date (date11).
exe.



Since I'm not sure exactly what your data will look like when it's
brought into SPSS, I've tried to cover the road bumps that I typically
run into, but that's far from exhaustive.  Please let me know if you
have any issues and I'll try to help.

Good luck!
-Eric






Background on the date conversion:
Excel measures dates as the number of days after Jan 1, 1900 and counts
1900 as if it were a leap year (which it wasn't).  That was the easiest
way for Microsoft to make Excel compatible with Lotus 1-2-3.
SPSS measures dates as the number of seconds since midnight, October 14,
1582 - the first day of the Gregorian Calendar.

The first part of the compute generates the number of seconds from Oct
14, 1582 to Jan 1, 1900.  The second part generates the number of
seconds after Jan 1, 1900 - if Excel correctly assumed that 1900 was not
a leap year, then it would only be necessary to subtract 1.

For reference, if you have an SPSS serial value that you want to convert
to an Excel serial value -- if you copy and paste output from SPSS to
Excel.  Here's an Excel formula that will give convert it -- you'll
still need to format the result as a date (Keyboard shortcut Ctrl +
Shift + 3)
=(B3/86400)-115859




-----Original Message-----
From: Warren Chrusciel [mailto:[hidden email]]
Sent: Thursday, May 13, 2010 7:26 AM
Subject: Importing Excel file -- unknown number of rows

I am trying to write syntax to import a worksheet of an Excel file
starting at row 30.  The Excel file currently is a template that will
ultimately have an unknown number of rows.

GET DATA
  /TYPE=XLS
  /FILE='D:\Template file.xls'
  /SHEET=name 'Worksheet 1'
  /CELLRANGE=range 'A30:BO3000'
  /READNAMES=on
  /ASSUMEDSTRWIDTH=32767.

I noticed that when I choose the second entry on the CELLRANGE
(currently
at BO3000) I get an error message if there are fewer rows with data than
included in the CELLRANGE.  For instance, I will get an error message if
the data end in row 2999.  However, the data are imported properly if
row
3000 is added to the data set.

Is there any way to write syntax that can accomodate an unknown number
of
rows in an Excel worksheet assuming I don't want to start importing at
row
1?

Thank you,

Warren Chrusciel
Municipal Property Assessment Corporation
1420 Blair Place, Suite 300
Ottawa, Ontario K1J 9L8

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD
Reply | Threaded
Open this post in threaded view
|

Re: Importing Excel file -- unknown number of rows

Warren Chrusciel
In reply to this post by Warren Chrusciel
We had come up with a “brute force” syntax already.    Yours is actually
less “brute force” than ours so I intend to borrow liberally from yours.

In particular, I like:

 1) The way you deleted the first 29 rows.
 2) The systematic way you renamed the variables.
 3) The suggestion to use the “ALTER TYPE” command.
 4) Your detailed explanation of the way to convert dates from Excel to
SPSS.
 5) The suggestion to use the /SHEET Index subcommand.

Even though our syntax is not yet elegant it is much improved with the
changes.

Thanks for your assistance.

Warren Chrusciel

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD