import and merge multiple Excel files

classic Classic list List threaded Threaded
12 messages Options
Reply | Threaded
Open this post in threaded view
|

import and merge multiple Excel files

Kylie

Hi all,

 

I have about 120 Excel files that I need to merge together (add cases) into one SPSS file. The files all have the same column headings, contain only one sheet each (with the same sheet name), and are saved in the same Windows directory. They have a consistent file name structure starting with a unique subject ID (ie, DD016_Baseline_Excel_Raw_Data.xlsx, DD033_Baseline Excel_Raw_Data.xlsx, etc).

 

Can anyone suggest the best general process for this? Are there Python extension commands that are relevant? Or scripting? Any similar examples? Any pointers would be appreciated.

 

Thanks,

Kylie.

 

Reply | Threaded
Open this post in threaded view
|

Re: import and merge multiple Excel files

Ruben Geert van den Berg
Dear Kylie,

Please try the code below. The first block should generate test data and the second block should

-read the variable names from the first row of the first sheet of the first workbook
-read all data from all lines in all sheets in all workbooks (from line 2)
-output an active DataSet containing the source_file, source_sheet and all data
-string lengths in SPSS should be exactly as long as required given the data contained in the work books

This hasn't been thoroughly tested yet so there may be complications but it seems to work on the test data provided. Please keep us informed on how things are going, OK?

Kind regards,

Ruben

*Create test data.
begin program.
rdir=r"d:\temp" # Please specify a folder in which test files can be created.
import xlwt,random
for year in range(2004,2014):
    wb=xlwt.Workbook()
    ws=wb.add_sheet("data")
    for col,cont in enumerate(['EmployeeID','JobTitle','YearSalary','DaysAbsent']):
        ws.write(0,col,cont)
    for row,id in enumerate([104,21,60,2,1030]):
                ws.write(row+1,0,id)
    for row in range(5):
                ws.write(row+1,1,random.choice(['Developer','Tester','Manager']))
    for row in range(5):
                ws.write(row+1,2,random.randrange(40,80)*1000)
    for row in range(5):
                ws.write(row+1,3,random.choice(range(20)))
    wb.save(os.path.join(rdir,'data_%d.xls'%year))
end program.

*2. Read and merge all xls workbooks.
begin program.
rdir=r"d:\temp" # Please specify folder holding .xls files
import xlrd,spss
fils=[fil for fil in os.listdir(rdir) if fil.endswith(".xls")] # Should probably be "xlsx" in your case.
allData=[]
for cnt,fil in enumerate(fils):
    wb=xlrd.open_workbook(os.path.join(rdir,fil))
    for ws in wb.sheets():
        for row in range(1,ws.nrows):
            allData.append([fil]+[ws.name]+[val for val in ws.row_values(row)])
    if cnt==0:
        Names=["source_file"]+["source_sheet"]+ws.row_values(0)
mxLens=[0]*len(vNames)
for line in allData:
    for cnt in range(len(line)):
        if isinstance(line[cnt],basestring) and len(line[cnt])>mxLens[cnt]:
            mxLens[cnt]=len(line[cnt])
with spss.DataStep():
    nds = spss.Dataset('*') ### nds = "New Data Set"
    for vrbl in zip(vNames,mxLens):
        nds.varlist.append(vrbl[0],vrbl[1])
    for line in allData:
        nds.cases.append(line)
end program.

Some notes:

-Make sure you have no open DataSet when you run this
-A crucial assumption is that the structure (column orders) are identical over sheets over workbooks
-The first rows of all sheets in all workbooks should hold (identical) variable names
-You need to have 1) SPSS, 2) SPSS Python essentials and 3) the Python xlrd module properly installed
-You may need to replace ".xls" with ".xlsx" in the second block
-Date variables should be no problem but will look weird in SPSS. To convert a date called "date_1" to a normal date, try

compute date_new=datesum(date.dmy(3,1,1900),date_1,"days").
format date_new(datetime22).

This should work although there seems to be some kind of bug somewhere so please check carefully.
Reply | Threaded
Open this post in threaded view
|

Re: import and merge multiple Excel files

Andy W
In reply to this post by Kylie
I have not used it - but (I believe Jon Peck) wrote an extension command that would fit the bill here without having to roll your own Python - the `SPSSINC PROCESS FILES` command (it appears as of now to be bundled in the download section with the split dataset command as well). See here on the developerworks site (hopefully that link continues to work).

Here is an example in the developerworks forum of the tool in action with a near synonymous situation.

Also with a real consistent naming structure and no missing ID's it would be pretty simple to write this up in a macro, see this other answer I gave recently. All that would need to be changed is (what I have done anyways) is to have the first pass of the loop create a basefile, and then successively add files concatenate all of the new files to that basefile.





Andy W
apwheele@gmail.com
http://andrewpwheeler.wordpress.com/
Reply | Threaded
Open this post in threaded view
|

Re: import and merge multiple Excel files

Jon K Peck
SPSSINC PROCESS FILES  can do this pretty easily.  You will wind up doing an ADD FILES for each Excel file (after the first) even though ADD FILES can handle 50 files at a time, but that's probably not going to be an issue unless you have to do this several times per second.

A few tips beyond the example that Andy pointed out.

Getting this process started is a little bit tricky, since ADD FILES requires that you already have a data file open.
Move one of your Excel files to a different directory and open it with GET DATA /TYPE=XLSX or interactively.  Give it a dataset name, say, ACTIVE, so that it will remain open and referenceable  as other files are read.

Your syntax file to be applied to each dataset by PROCESS FILES would just have statements like
GET DATA /TYPE XLS .../FILE="JOB_INPUTFILE" ...
DATASET NAME=FRED.
ADD FILES /FILE=ACTIVE /FILE="JOB_INPUTFILE".
DATASET CLOSE FRED.
JOB_INPUTFILE is defined by PROCESS FILES as a file handle for the name of the current input.  It will be redefined each time another file is processed.

You can then construct the PROCESS FILE command from the menus via Utilities > Process Data Files.
The input filespec would be something like
c:\mydata\*.xlsx

After process files is run, you can save the constructed file in the usual way.

You can, of course, do this with Python or even Basic scripting more directly, but it probably isn't worth the trouble to do that.

HTH,

Jon Peck (no "h") aka Kim
Senior Software Engineer, IBM
[hidden email]
new phone: 720-342-5621




From:        Andy W <[hidden email]>
To:        [hidden email],
Date:        12/05/2012 05:20 AM
Subject:        Re: [SPSSX-L] import and merge multiple Excel files
Sent by:        "SPSSX(r) Discussion" <[hidden email]>




I have not used it - but (I believe Jon Peck) wrote an extension command that
would fit the bill here without having to roll your own Python - the
`SPSSINC PROCESS FILES` command (it appears as of now to be bundled in the
download section with the split dataset command as well). See  here
<
https://www.ibm.com/developerworks/mydeveloperworks/files/app?lang=en#/person/270002VCWN/file/1ebc8c4d-05df-4a2f-8b2b-617ba3695583>
on the developerworks site (hopefully that link continues to work).

Here is an example in the developerworks forum
<
http://www.ibm.com/developerworks/forums/thread.jspa?messageID=14573567>
of the tool in action with a near synonymous situation.

Also with a real consistent naming structure and no missing ID's it would be
pretty simple to write this up in a macro, see  this other answer I gave
recently
<
http://spssx-discussion.1045642.n5.nabble.com/Looping-td5716527.html>  .
All that would need to be changed is (what I have done anyways) is to have
the first pass of the loop create a basefile, and then successively add
files concatenate all of the new files to that basefile.









-----
Andy W
[hidden email]
http://andrewpwheeler.wordpress.com/
--
View this message in context:
http://spssx-discussion.1045642.n5.nabble.com/import-and-merge-multiple-Excel-files-tp5716664p5716669.html
Sent from the SPSSX Discussion mailing list archive at Nabble.com.

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD


Reply | Threaded
Open this post in threaded view
|

Re: import and merge multiple Excel files

Kylie

Many thanks to Ruben, Andy and Jon for their replies.

 

I have been trying to use the framework that Jon outlines below and is illustrated in the example that Andy linked to. I am having a problem with naming/activating the correct datasets at the right time, which I hope someone can help me with.

 

I have imported the first Excel file into SPSS and named it ‘active’ using <DATASET NAME active>. I have created a syntax file (ImportFromExcelAndMerge.sps) that contains the following:

 

GET DATA

  /TYPE=XLSX

  /FILE="JOB_INPUTFILE"

  /SHEET=name 'SPSS'

  /ASSUMEDSTRWIDTH=32767.

DATASET NAME incoming.

 

ADD FILES /FILE='active' /FILE='incoming'.

EXECUTE.

DATASET CLOSE incoming.

 

And I run this by the following syntax generated by the PROCESS FILES custom dialogue box:

 

DATASET ACTIVATE active.

SPSSINC PROCESS FILES INPUTDATA="U:\.AU Work\Client Files\XXXXX\CGMS data processing\*.xlsx" 

SYNTAX="U:\.AU Work\Client Files\XXXXX\CGMS data processing\ImportFromExcelAndMerge.sps"

CONTINUEONERROR=YES

VIEWERDIR= "U:\.AU Work\Client Files\XXXXX\CGMS data processing"

CLOSEDATA=YES

MACRONAME="!JOB"

LOGFILEMODE=APPEND

/MACRODEFS ITEMS.

 

When I run the above PROCESS FILES command I get this error for each file it tries to loop over: <Dataset Activate: Unknown dataset active>. The add files works when I substitute specific file names into the ImportFromExcelAndMerge syntax. I’ve tried various other combinations of naming/activating the datasets but this is the closest I have got. Any suggestions would be appreciated.

 

Thanks,

Kylie.

 

 

From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of Jon K Peck
Sent: Thursday, 6 December 2012 3:10 AM
To: [hidden email]
Subject: Re: import and merge multiple Excel files

 

SPSSINC PROCESS FILES  can do this pretty easily.  You will wind up doing an ADD FILES for each Excel file (after the first) even though ADD FILES can handle 50 files at a time, but that's probably not going to be an issue unless you have to do this several times per second.

A few tips beyond the example that Andy pointed out.

Getting this process started is a little bit tricky, since ADD FILES requires that you already have a data file open.
Move one of your Excel files to a different directory and open it with GET DATA /TYPE=XLSX or interactively.  Give it a dataset name, say, ACTIVE, so that it will remain open and referenceable  as other files are read.

Your syntax file to be applied to each dataset by PROCESS FILES would just have statements like
GET DATA /TYPE XLS .../FILE="JOB_INPUTFILE" ...
DATASET NAME=FRED.
ADD FILES /FILE=ACTIVE /FILE="JOB_INPUTFILE".
DATASET CLOSE FRED.
JOB_INPUTFILE is defined by PROCESS FILES as a file handle for the name of the current input.  It will be redefined each time another file is processed.

You can then construct the PROCESS FILE command from the menus via Utilities > Process Data Files.
The input filespec would be something like
c:\mydata\*.xlsx

After process files is run, you can save the constructed file in the usual way.

You can, of course, do this with Python or even Basic scripting more directly, but it probably isn't worth the trouble to do that.

HTH,

Jon Peck (no "h") aka Kim
Senior Software Engineer, IBM
[hidden email]
new phone: 720-342-5621




From:        Andy W <[hidden email]>
To:        [hidden email],
Date:        12/05/2012 05:20 AM
Subject:        Re: [SPSSX-L] import and merge multiple Excel files
Sent by:        "SPSSX(r) Discussion" <[hidden email]>





I have not used it - but (I believe Jon Peck) wrote an extension command that
would fit the bill here without having to roll your own Python - the
`SPSSINC PROCESS FILES` command (it appears as of now to be bundled in the
download section with the split dataset command as well). See  here
<
https://www.ibm.com/developerworks/mydeveloperworks/files/app?lang=en#/person/270002VCWN/file/1ebc8c4d-05df-4a2f-8b2b-617ba3695583>
on the developerworks site (hopefully that link continues to work).

Here is an example in the developerworks forum
<
http://www.ibm.com/developerworks/forums/thread.jspa?messageID=14573567>
of the tool in action with a near synonymous situation.

Also with a real consistent naming structure and no missing ID's it would be
pretty simple to write this up in a macro, see  this other answer I gave
recently
<
http://spssx-discussion.1045642.n5.nabble.com/Looping-td5716527.html>  .
All that would need to be changed is (what I have done anyways) is to have
the first pass of the loop create a basefile, and then successively add
files concatenate all of the new files to that basefile.









-----
Andy W
[hidden email]
http://andrewpwheeler.wordpress.com/
--
View this message in context:
http://spssx-discussion.1045642.n5.nabble.com/import-and-merge-multiple-Excel-files-tp5716664p5716669.html
Sent from the SPSSX Discussion mailing list archive at Nabble.com.

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD

Reply | Threaded
Open this post in threaded view
|

Re: import and merge multiple Excel files

Jon K Peck
It appears that at the time you run the PROCESS FILES command there is no dataset named active.  You specified CLOSEDATA=YES to PROCESS FILES, so it closes ALL the datasets after each iteration, including active.  You are already closing the newly merged dataset, so you should use CLOSEDATA=NO to keep active open throughout the job.

HTH,


Jon Peck (no "h") aka Kim
Senior Software Engineer, IBM
[hidden email]
new phone: 720-342-5621




From:        Kylie Lange <[hidden email]>
To:        [hidden email],
Date:        12/06/2012 11:40 PM
Subject:        Re: [SPSSX-L] import and merge multiple Excel files
Sent by:        "SPSSX(r) Discussion" <[hidden email]>




Many thanks to Ruben, Andy and Jon for their replies.
 
I have been trying to use the framework that Jon outlines below and is illustrated in the example that Andy linked to. I am having a problem with naming/activating the correct datasets at the right time, which I hope someone can help me with.
 
I have imported the first Excel file into SPSS and named it ‘active’ using <DATASET NAME active>. I have created a syntax file (ImportFromExcelAndMerge.sps) that contains the following:
 
GET DATA
  /TYPE=XLSX
  /FILE="JOB_INPUTFILE"
  /SHEET=name 'SPSS'
  /ASSUMEDSTRWIDTH=32767.
DATASET NAME incoming.
 
ADD FILES /FILE='active' /FILE='incoming'.
EXECUTE.
DATASET CLOSE incoming.
 
And I run this by the following syntax generated by the PROCESS FILES custom dialogue box:
 
DATASET ACTIVATE active.
SPSSINC PROCESS FILES INPUTDATA="U:\.AU Work\Client Files\XXXXX\CGMS data processing\*.xlsx"  
SYNTAX="U:\.AU Work\Client Files\XXXXX\CGMS data processing\ImportFromExcelAndMerge.sps"
CONTINUEONERROR=YES
VIEWERDIR= "U:\.AU Work\Client Files\XXXXX\CGMS data processing"
CLOSEDATA=YES
MACRONAME="!JOB"
LOGFILEMODE=APPEND
/MACRODEFS ITEMS.
 
When I run the above PROCESS FILES command I get this error for each file it tries to loop over: <Dataset Activate: Unknown dataset active>. The add files works when I substitute specific file names into the ImportFromExcelAndMerge syntax. I’ve tried various other combinations of naming/activating the datasets but this is the closest I have got. Any suggestions would be appreciated.
 
Thanks,
Kylie.
 
 
From: SPSSX(r) Discussion [[hidden email]] On Behalf Of Jon K Peck
Sent:
Thursday, 6 December 2012 3:10 AM
To:
[hidden email]
Subject:
Re: import and merge multiple Excel files

 
SPSSINC PROCESS FILES  can do this pretty easily.  You will wind up doing an ADD FILES for each Excel file (after the first) even though ADD FILES can handle 50 files at a time, but that's probably not going to be an issue unless you have to do this several times per second.

A few tips beyond the example that Andy pointed out.


Getting this process started is a little bit tricky, since ADD FILES requires that you already have a data file open.

Move one of your Excel files to a different directory and open it with GET DATA /TYPE=XLSX or interactively.  Give it a dataset name, say, ACTIVE, so that it will remain open and referenceable  as other files are read.


Your syntax file to be applied to each dataset by PROCESS FILES would just have statements like

GET DATA /TYPE XLS .../FILE="JOB_INPUTFILE" ...

DATASET NAME=FRED.

ADD FILES /FILE=ACTIVE /FILE="JOB_INPUTFILE".

DATASET CLOSE FRED.

JOB_INPUTFILE is defined by PROCESS FILES as a file handle for the name of the current input.  It will be redefined each time another file is processed.


You can then construct the PROCESS FILE command from the menus via Utilities > Process Data Files.

The input filespec would be something like

c:\mydata\*.xlsx


After process files is run, you can save the constructed file in the usual way.


You can, of course, do this with Python or even Basic scripting more directly, but it probably isn't worth the trouble to do that.


HTH,

Jon Peck (no "h") aka Kim
Senior Software Engineer, IBM

peck@...
new phone: 720-342-5621





From:        
Andy W <apwheele@...>
To:        
[hidden email],
Date:        
12/05/2012 05:20 AM
Subject:        
Re: [SPSSX-L] import and merge multiple Excel files
Sent by:        
"SPSSX(r) Discussion" <[hidden email]>





I have not used it - but (I believe Jon Peck) wrote an extension command that
would fit the bill here without having to roll your own Python - the
`SPSSINC PROCESS FILES` command (it appears as of now to be bundled in the
download section with the split dataset command as well). See  here
<
https://www.ibm.com/developerworks/mydeveloperworks/files/app?lang=en#/person/270002VCWN/file/1ebc8c4d-05df-4a2f-8b2b-617ba3695583>
on the developerworks site (hopefully that link continues to work).

Here is an example in the developerworks forum
<
http://www.ibm.com/developerworks/forums/thread.jspa?messageID=14573567>
of the tool in action with a near synonymous situation.

Also with a real consistent naming structure and no missing ID's it would be
pretty simple to write this up in a macro, see  this other answer I gave
recently
<
http://spssx-discussion.1045642.n5.nabble.com/Looping-td5716527.html>  .
All that would need to be changed is (what I have done anyways) is to have
the first pass of the loop create a basefile, and then successively add
files concatenate all of the new files to that basefile.









-----
Andy W

apwheele@...
http://andrewpwheeler.wordpress.com/
--
View this message in context:
http://spssx-discussion.1045642.n5.nabble.com/import-and-merge-multiple-Excel-files-tp5716664p5716669.html
Sent from the SPSSX Discussion mailing list archive at Nabble.com.

=====================
To manage your subscription to SPSSX-L, send a message to

LISTSERV@... (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD

Reply | Threaded
Open this post in threaded view
|

Re: import and merge multiple Excel files

Richard Ristow
In reply to this post by Jon K Peck
At 11:40 AM 12/5/2012, Jon K Peck wrote:

>SPSSINC PROCESS FILES  can do this pretty easily.  You will wind up
>doing an ADD FILES for each Excel file (after the first) ...
>
>Getting this process started is a little bit tricky, since ADD FILES
>requires that you already have a data file open.
>Move one of your Excel files to a different directory and open it
>with GET DATA /TYPE=XLSX or interactively.  Give it a dataset name,
>say, ACTIVE, so that it will remain open and referenceable  as other
>files are read.

It may be easier, though perhaps slightly less efficient, to use ADD
FILES for *all* the Excel files? To do this, create ACTIVE with one
case and one variable that doesn't occur in the Excel sheets, then
just run the ADD FILES for all Excel sheets. You'll want to delete
the extraneous variable and case at the end.

>Your syntax file to be applied to each dataset by PROCESS FILES
>would just have statements like
>GET DATA /TYPE XLS .../FILE="JOB_INPUTFILE" ...
>DATASET NAME=FRED.
>ADD FILES /FILE=ACTIVE /FILE="JOB_INPUTFILE".
>DATASET CLOSE FRED.
>JOB_INPUTFILE is defined by PROCESS FILES as a file handle for the
>name of the current input.  It will be redefined each time another
>file is processed.

Does SPSSINC PROCESS FILES do anything about the really annoying part
of this process? Namely: When you read an Excel spreadsheet, many of
the columns come in as string variables. The string variables are
assigned the length of the longest value IN THAT SHEET; when you read
multiple sheets with the same variables, the lengths are not likely
to come out the same in all of them.

But then, you can't ADD FILES, because SPSS won't add files if
variables are 'incompatible' -- which includes, strings that aren't
the same length. Does SPSS PROCESS FILES adjust string lengths to
help with this?

(I've protested repeatedly that this behavior is an unnecessary
glitch in SPSS. It would be far better to take strings of different
lengths as compatible, giving the result the longest length from any
input file.)

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD
Reply | Threaded
Open this post in threaded view
|

Re: import and merge multiple Excel files

Jon K Peck
Richard,

You make a good point about possible string width variation.  My approach assumes that the variables will all come in the same way.  If there are string variables and this does not  happen for whatever reason, the solution is to add an ALTER TYPE command to the iterated syntax that sets the widths as needed before running the ADD FILES command.

PROCESS FILES does not know anything about what syntax is running over the datasets.  It just invokes the syntax file repeatedly, providing it with appropriate file handles and macro definitions.  It doesn't even open the files, since they could be of various types and might need to be read with various options.

As for ADD FILES, no doubt it would be nice to be more tolerant of width variations.  The problem is that the variable definitions have to be established when the variable is first encountered, and string widths are immutable except for what ALTER TYPE can do.  The new STAR JOIN command tolerates variation in key widths, but it is not appropriate for this application.


Jon Peck (no "h") aka Kim
Senior Software Engineer, IBM
[hidden email]
new phone: 720-342-5621




From:        Richard Ristow <[hidden email]>
To:        Jon K Peck/Chicago/IBM@IBMUS, [hidden email],
Date:        12/07/2012 12:25 PM
Subject:        Re: import and merge multiple Excel files




At 11:40 AM 12/5/2012, Jon K Peck wrote:

>SPSSINC PROCESS FILES  can do this pretty easily.  You will wind up
>doing an ADD FILES for each Excel file (after the first) ...
>
>Getting this process started is a little bit tricky, since ADD FILES
>requires that you already have a data file open.
>Move one of your Excel files to a different directory and open it
>with GET DATA /TYPE=XLSX or interactively.  Give it a dataset name,
>say, ACTIVE, so that it will remain open and referenceable  as other
>files are read.

It may be easier, though perhaps slightly less efficient, to use ADD
FILES for *all* the Excel files? To do this, create ACTIVE with one
case and one variable that doesn't occur in the Excel sheets, then
just run the ADD FILES for all Excel sheets. You'll want to delete
the extraneous variable and case at the end.

>Your syntax file to be applied to each dataset by PROCESS FILES
>would just have statements like
>GET DATA /TYPE XLS .../FILE="JOB_INPUTFILE" ...
>DATASET NAME=FRED.
>ADD FILES /FILE=ACTIVE /FILE="JOB_INPUTFILE".
>DATASET CLOSE FRED.
>JOB_INPUTFILE is defined by PROCESS FILES as a file handle for the
>name of the current input.  It will be redefined each time another
>file is processed.

Does SPSSINC PROCESS FILES do anything about the really annoying part
of this process? Namely: When you read an Excel spreadsheet, many of
the columns come in as string variables. The string variables are
assigned the length of the longest value IN THAT SHEET; when you read
multiple sheets with the same variables, the lengths are not likely
to come out the same in all of them.

But then, you can't ADD FILES, because SPSS won't add files if
variables are 'incompatible' -- which includes, strings that aren't
the same length. Does SPSS PROCESS FILES adjust string lengths to
help with this?

(I've protested repeatedly that this behavior is an unnecessary
glitch in SPSS. It would be far better to take strings of different
lengths as compatible, giving the result the longest length from any
input file.)


Reply | Threaded
Open this post in threaded view
|

Re: import and merge multiple Excel files

Bruce Weaver
Administrator
In reply to this post by Richard Ristow
Richard Ristow wrote
--- snip ---

But then, you can't ADD FILES, because SPSS won't add files if
variables are 'incompatible' -- which includes, strings that aren't
the same length. Does SPSS PROCESS FILES adjust string lengths to
help with this?

(I've protested repeatedly that this behavior is an unnecessary
glitch in SPSS. It would be far better to take strings of different
lengths as compatible, giving the result the longest length from any
input file.)
AMEN, Brother!!!
--
Bruce Weaver
bweaver@lakeheadu.ca
http://sites.google.com/a/lakeheadu.ca/bweaver/

"When all else fails, RTFM."

PLEASE NOTE THE FOLLOWING: 
1. My Hotmail account is not monitored regularly. To send me an e-mail, please use the address shown above.
2. The SPSSX Discussion forum on Nabble is no longer linked to the SPSSX-L listserv administered by UGA (https://listserv.uga.edu/).
Reply | Threaded
Open this post in threaded view
|

Re: import and merge multiple Excel files

Andy W
In reply to this post by Jon K Peck
I've had similar problems to this before as well, and ALTER TYPE is really handy (I am missing it at my current job with SPSS v15 :(. Most of my code that updates files with newly given data on a regular basis then has a call along the lines of;

ALTER TYPE ALL (A = A100).

Which changes all of the string variables within the active dataset to a width of 100. Another problem I have had with Excel files is that when a field is empty it defaults to a numeric variable. Fortunately these are fewer in number and I typically explicitly set these fields to a string value. I still get a bunch of error messages saying the field is defaulting to numeric unfortunately though...

To all those who are listening - please don't transfer data via Excel files (just use csv). It would circumvent many of such nuisances given excels weird way of intermingling data formats with the actual data.

Andy

 
Andy W
apwheele@gmail.com
http://andrewpwheeler.wordpress.com/
Reply | Threaded
Open this post in threaded view
|

Re: import and merge multiple Excel files

Kylie
In reply to this post by Jon K Peck

Thanks again to everyone for your contributions.

 

Jon’s posting made perfect sense and I have made the change to the CLOSEDATA parameter, and now have the following as the called syntax file (it now merges to the currently active dataset rather than the named active dataset, as I was ending up with a new unnamed dataset):

 

============

GET DATA

  /TYPE=XLSX

  /FILE="JOB_INPUTFILE"

  /SHEET=name 'SPSS'

  /ASSUMEDSTRWIDTH=32767.

DATASET NAME incoming.

 

DATASET ACTIVATE active.

ADD FILES /FILE=* /FILE='incoming'.

EXECUTE.

DATASET CLOSE incoming.

============

 

When I run this, it successfully merges two files to the initially open file but then stops, with an Output Save warning: Output document NAME=’^1’ does not exist. The syntax in the Log for the Output Save is:

 

OUTPUT SAVE OUTFILE="U:/.AU Work/Client Files/XXXXX/CGMS data processing/DD016_Baseline_Excel_Raw Data.spv"

 

where DD016_Baseline_Excel_Raw Data.xlsx is the last file successfully added to the merged file. The previous (and only) file merged in has an output file successfully saved that contains the PROCESS FILES syntax with no errors or warnings. Any suggestions as to what is causing the job to stop?

 

Thanks,

Kylie.

 

PS – Regarding the string length issues, yes I had considered that and all of the test files that I am currently working with get imported with the same string widths. They SHOULD only contain certain string entries and so there shouldn’t be a problem – but I’ve grabbed the ALTER TYPE suggestions just in case.

 

 

From: Jon K Peck [mailto:[hidden email]]
Sent: Saturday, 8 December 2012 12:37 AM
To: Kylie Lange
Cc: [hidden email]
Subject: Re: [SPSSX-L] import and merge multiple Excel files

 

It appears that at the time you run the PROCESS FILES command there is no dataset named active.  You specified CLOSEDATA=YES to PROCESS FILES, so it closes ALL the datasets after each iteration, including active.  You are already closing the newly merged dataset, so you should use CLOSEDATA=NO to keep active open throughout the job.

HTH,


Jon Peck (no "h") aka Kim
Senior Software Engineer, IBM
[hidden email]
new phone: 720-342-5621




From:        Kylie Lange <[hidden email]>
To:        [hidden email],
Date:        12/06/2012 11:40 PM
Subject:        Re: [SPSSX-L] import and merge multiple Excel files
Sent by:        "SPSSX(r) Discussion" <[hidden email]>





Many thanks to Ruben, Andy and Jon for their replies.
 
I have been trying to use the framework that Jon outlines below and is illustrated in the example that Andy linked to. I am having a problem with naming/activating the correct datasets at the right time, which I hope someone can help me with.
 
I have imported the first Excel file into SPSS and named it ‘active’ using <DATASET NAME active>. I have created a syntax file (ImportFromExcelAndMerge.sps) that contains the following:
 
GET DATA
  /TYPE=XLSX
  /FILE="JOB_INPUTFILE"
  /SHEET=name 'SPSS'
  /ASSUMEDSTRWIDTH=32767.
DATASET NAME incoming.
 
ADD FILES /FILE='active' /FILE='incoming'.
EXECUTE.
DATASET CLOSE incoming.
 
And I run this by the following syntax generated by the PROCESS FILES custom dialogue box:
 
DATASET ACTIVATE active.
SPSSINC PROCESS FILES INPUTDATA="U:\.AU Work\Client Files\XXXXX\CGMS data processing\*.xlsx"  
SYNTAX="U:\.AU Work\Client Files\XXXXX\CGMS data processing\ImportFromExcelAndMerge.sps"
CONTINUEONERROR=YES
VIEWERDIR= "U:\.AU Work\Client Files\XXXXX\CGMS data processing"
CLOSEDATA=YES
MACRONAME="!JOB"
LOGFILEMODE=APPEND
/MACRODEFS ITEMS.
 
When I run the above PROCESS FILES command I get this error for each file it tries to loop over: <Dataset Activate: Unknown dataset active>. The add files works when I substitute specific file names into the ImportFromExcelAndMerge syntax. I’ve tried various other combinations of naming/activating the datasets but this is the closest I have got. Any suggestions would be appreciated.
 
Thanks,
Kylie.
 
 
From: SPSSX(r) Discussion [[hidden email]] On Behalf Of Jon K Peck
Sent:
Thursday, 6 December 2012 3:10 AM
To:
[hidden email]
Subject:
Re: import and merge multiple Excel files

 
SPSSINC PROCESS FILES  can do this pretty easily.  You will wind up doing an ADD FILES for each Excel file (after the first) even though ADD FILES can handle 50 files at a time, but that's probably not going to be an issue unless you have to do this several times per second.

A few tips beyond the example that Andy pointed out.


Getting this process started is a little bit tricky, since ADD FILES requires that you already have a data file open.

Move one of your Excel files to a different directory and open it with GET DATA /TYPE=XLSX or interactively.  Give it a dataset name, say, ACTIVE, so that it will remain open and referenceable  as other files are read.


Your syntax file to be applied to each dataset by PROCESS FILES would just have statements like

GET DATA /TYPE XLS .../FILE="JOB_INPUTFILE" ...

DATASET NAME=FRED.

ADD FILES /FILE=ACTIVE /FILE="JOB_INPUTFILE".

DATASET CLOSE FRED.

JOB_INPUTFILE is defined by PROCESS FILES as a file handle for the name of the current input.  It will be redefined each time another file is processed.


You can then construct the PROCESS FILE command from the menus via Utilities > Process Data Files.

The input filespec would be something like

c:\mydata\*.xlsx


After process files is run, you can save the constructed file in the usual way.


You can, of course, do this with Python or even Basic scripting more directly, but it probably isn't worth the trouble to do that.


HTH,

Jon Peck (no "h") aka Kim
Senior Software Engineer, IBM
[hidden email]
new phone: 720-342-5621





From:        
Andy W <[hidden email]>
To:        
[hidden email],
Date:        
12/05/2012 05:20 AM

Subject:        
Re: [SPSSX-L] import and merge multiple Excel files
Sent by:        
"SPSSX(r) Discussion" <[hidden email]>






I have not used it - but (I believe Jon Peck) wrote an extension command that
would fit the bill here without having to roll your own Python - the
`SPSSINC PROCESS FILES` command (it appears as of now to be bundled in the
download section with the split dataset command as well). See  here
<
https://www.ibm.com/developerworks/mydeveloperworks/files/app?lang=en#/person/270002VCWN/file/1ebc8c4d-05df-4a2f-8b2b-617ba3695583>
on the developerworks site (hopefully that link continues to work).

Here is an example in the developerworks forum
<
http://www.ibm.com/developerworks/forums/thread.jspa?messageID=14573567>
of the tool in action with a near synonymous situation.

Also with a real consistent naming structure and no missing ID's it would be
pretty simple to write this up in a macro, see  this other answer I gave
recently
<
http://spssx-discussion.1045642.n5.nabble.com/Looping-td5716527.html>  .
All that would need to be changed is (what I have done anyways) is to have
the first pass of the loop create a basefile, and then successively add
files concatenate all of the new files to that basefile.









-----
Andy W
[hidden email]
http://andrewpwheeler.wordpress.com/
--
View this message in context:
http://spssx-discussion.1045642.n5.nabble.com/import-and-merge-multiple-Excel-files-tp5716664p5716669.html
Sent from the SPSSX Discussion mailing list archive at Nabble.com.

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD

Reply | Threaded
Open this post in threaded view
|

Re: import and merge multiple Excel files

Kylie

Thanks to an off-list response to Jon I have resolved the problem and the job now runs successfully (well, at least over the 10 files I’ve got to test on so far!). I had to make a change to the viewer settings in PROCESS FILES. The code below now saves just one single output file at the end of the job, rather than one after each pass of PROCESS FILES. This resolved the problem.

 

So for posterity, the final code is:

 

============

* Open the first Excel file from its own separate directory and give it the dataset name 'active'.

GET DATA /TYPE=XLSX

  /FILE='U:\.AU Work\Client Files\XXXXX\CGMS data processing\first file\DD001_Baseline_Excel_Raw Data.xlsx'

  /SHEET=name 'SPSS'

  /CELLRANGE=full

  /READNAMES=on

  /ASSUMEDSTRWIDTH=32767.

EXECUTE.

DATASET NAME active.

 

* Call the PROCESS FILES command to loop over all other Excel files.

DATASET ACTIVATE active.

SPSSINC PROCESS FILES INPUTDATA="U:\.AU Work\Client Files\XXXXX\CGMS data processing\*.xlsx" 

SYNTAX="U:\.AU Work\Client Files\XXXXX\CGMS data processing\ImportFromExcelAndMerge.sps"

CONTINUEONERROR=YES

VIEWERFILE= "U:\.AU Work\Client Files\XXXXX\CGMS data processing\final output.spv"

CLOSEDATA=NO

MACRONAME="!JOB"

LOGFILEMODE=APPEND

/MACRODEFS ITEMS.

============

 

And the called syntax file, ImportFromExcelAndMerge.sps:

 

GET DATA

  /TYPE=XLSX

  /FILE="JOB_INPUTFILE"

  /SHEET=name 'SPSS'

  /ASSUMEDSTRWIDTH=32767.

DATASET NAME incoming.

 

DATASET ACTIVATE active.

ADD FILES /FILE=* /FILE='incoming'.

EXECUTE.

DATASET CLOSE incoming.

 

============

 

Many thanks to everyone, especially Jon P.

 

Cheers,

Kylie.

 

 

From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of Kylie Lange
Sent: Monday, 10 December 2012 3:29 PM
To: [hidden email]
Subject: Re: import and merge multiple Excel files

 

Thanks again to everyone for your contributions.

 

Jon’s posting made perfect sense and I have made the change to the CLOSEDATA parameter, and now have the following as the called syntax file (it now merges to the currently active dataset rather than the named active dataset, as I was ending up with a new unnamed dataset):

 

============

GET DATA

  /TYPE=XLSX

  /FILE="JOB_INPUTFILE"

  /SHEET=name 'SPSS'

  /ASSUMEDSTRWIDTH=32767.

DATASET NAME incoming.

 

DATASET ACTIVATE active.

ADD FILES /FILE=* /FILE='incoming'.

EXECUTE.

DATASET CLOSE incoming.

============

 

When I run this, it successfully merges two files to the initially open file but then stops, with an Output Save warning: Output document NAME=’^1’ does not exist. The syntax in the Log for the Output Save is:

 

OUTPUT SAVE OUTFILE="U:/.AU Work/Client Files/XXXXX/CGMS data processing/DD016_Baseline_Excel_Raw Data.spv"

 

where DD016_Baseline_Excel_Raw Data.xlsx is the last file successfully added to the merged file. The previous (and only) file merged in has an output file successfully saved that contains the PROCESS FILES syntax with no errors or warnings. Any suggestions as to what is causing the job to stop?

 

Thanks,

Kylie.

 

PS – Regarding the string length issues, yes I had considered that and all of the test files that I am currently working with get imported with the same string widths. They SHOULD only contain certain string entries and so there shouldn’t be a problem – but I’ve grabbed the ALTER TYPE suggestions just in case.

 

 

From: Jon K Peck [[hidden email]]
Sent: Saturday, 8 December 2012 12:37 AM
To: Kylie Lange
Cc: [hidden email]
Subject: Re: [SPSSX-L] import and merge multiple Excel files

 

It appears that at the time you run the PROCESS FILES command there is no dataset named active.  You specified CLOSEDATA=YES to PROCESS FILES, so it closes ALL the datasets after each iteration, including active.  You are already closing the newly merged dataset, so you should use CLOSEDATA=NO to keep active open throughout the job.

HTH,


Jon Peck (no "h") aka Kim
Senior Software Engineer, IBM
[hidden email]
new phone: 720-342-5621




From:        Kylie Lange <[hidden email]>
To:        [hidden email],
Date:        12/06/2012 11:40 PM
Subject:        Re: [SPSSX-L] import and merge multiple Excel files
Sent by:        "SPSSX(r) Discussion" <[hidden email]>





Many thanks to Ruben, Andy and Jon for their replies.
 
I have been trying to use the framework that Jon outlines below and is illustrated in the example that Andy linked to. I am having a problem with naming/activating the correct datasets at the right time, which I hope someone can help me with.
 
I have imported the first Excel file into SPSS and named it ‘active’ using <DATASET NAME active>. I have created a syntax file (ImportFromExcelAndMerge.sps) that contains the following:
 
GET DATA
  /TYPE=XLSX
  /FILE="JOB_INPUTFILE"
  /SHEET=name 'SPSS'
  /ASSUMEDSTRWIDTH=32767.
DATASET NAME incoming.
 
ADD FILES /FILE='active' /FILE='incoming'.
EXECUTE.
DATASET CLOSE incoming.
 
And I run this by the following syntax generated by the PROCESS FILES custom dialogue box:
 
DATASET ACTIVATE active.
SPSSINC PROCESS FILES INPUTDATA="U:\.AU Work\Client Files\XXXXX\CGMS data processing\*.xlsx"  
SYNTAX="U:\.AU Work\Client Files\XXXXX\CGMS data processing\ImportFromExcelAndMerge.sps"
CONTINUEONERROR=YES
VIEWERDIR= "U:\.AU Work\Client Files\XXXXX\CGMS data processing"
CLOSEDATA=YES
MACRONAME="!JOB"
LOGFILEMODE=APPEND
/MACRODEFS ITEMS.
 
When I run the above PROCESS FILES command I get this error for each file it tries to loop over: <Dataset Activate: Unknown dataset active>. The add files works when I substitute specific file names into the ImportFromExcelAndMerge syntax. I’ve tried various other combinations of naming/activating the datasets but this is the closest I have got. Any suggestions would be appreciated.
 
Thanks,
Kylie.
 
 
From: SPSSX(r) Discussion [[hidden email]] On Behalf Of Jon K Peck
Sent:
Thursday, 6 December 2012 3:10 AM
To:
[hidden email]
Subject:
Re: import and merge multiple Excel files

 
SPSSINC PROCESS FILES  can do this pretty easily.  You will wind up doing an ADD FILES for each Excel file (after the first) even though ADD FILES can handle 50 files at a time, but that's probably not going to be an issue unless you have to do this several times per second.

A few tips beyond the example that Andy pointed out.


Getting this process started is a little bit tricky, since ADD FILES requires that you already have a data file open.

Move one of your Excel files to a different directory and open it with GET DATA /TYPE=XLSX or interactively.  Give it a dataset name, say, ACTIVE, so that it will remain open and referenceable  as other files are read.


Your syntax file to be applied to each dataset by PROCESS FILES would just have statements like

GET DATA /TYPE XLS .../FILE="JOB_INPUTFILE" ...

DATASET NAME=FRED.

ADD FILES /FILE=ACTIVE /FILE="JOB_INPUTFILE".

DATASET CLOSE FRED.

JOB_INPUTFILE is defined by PROCESS FILES as a file handle for the name of the current input.  It will be redefined each time another file is processed.


You can then construct the PROCESS FILE command from the menus via Utilities > Process Data Files.

The input filespec would be something like

c:\mydata\*.xlsx


After process files is run, you can save the constructed file in the usual way.


You can, of course, do this with Python or even Basic scripting more directly, but it probably isn't worth the trouble to do that.


HTH,

Jon Peck (no "h") aka Kim
Senior Software Engineer, IBM
[hidden email]
new phone: 720-342-5621





From:        
Andy W <[hidden email]>
To:        
[hidden email],
Date:        
12/05/2012 05:20 AM

Subject:        
Re: [SPSSX-L] import and merge multiple Excel files
Sent by:        
"SPSSX(r) Discussion" <[hidden email]>






I have not used it - but (I believe Jon Peck) wrote an extension command that
would fit the bill here without having to roll your own Python - the
`SPSSINC PROCESS FILES` command (it appears as of now to be bundled in the
download section with the split dataset command as well). See  here
<
https://www.ibm.com/developerworks/mydeveloperworks/files/app?lang=en#/person/270002VCWN/file/1ebc8c4d-05df-4a2f-8b2b-617ba3695583>
on the developerworks site (hopefully that link continues to work).

Here is an example in the developerworks forum
<
http://www.ibm.com/developerworks/forums/thread.jspa?messageID=14573567>
of the tool in action with a near synonymous situation.

Also with a real consistent naming structure and no missing ID's it would be
pretty simple to write this up in a macro, see  this other answer I gave
recently
<
http://spssx-discussion.1045642.n5.nabble.com/Looping-td5716527.html>  .
All that would need to be changed is (what I have done anyways) is to have
the first pass of the loop create a basefile, and then successively add
files concatenate all of the new files to that basefile.









-----
Andy W
[hidden email]
http://andrewpwheeler.wordpress.com/
--
View this message in context:
http://spssx-discussion.1045642.n5.nabble.com/import-and-merge-multiple-Excel-files-tp5716664p5716669.html
Sent from the SPSSX Discussion mailing list archive at Nabble.com.

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD