Hello all,
I have a data collection effort and I'm not sure how to approach this, but I suspect Python might be involved ;-) I need to compile staffing data which would tabulate when each of about 50 different id numbers appear on daily Excel spreadsheets over a period of five years (hence about 1,800 files). There is a separate file for each day, and they are gathered in folders by month, and months in turn by years. For example, on 12/15/08, the data appear in a file named "12-15-08.xls" in the "12-2008" subfolder which is in the 2008 folder. Some are in Excel 2003 format, some in 2007. The format appears to be fairly consistent with the four-digit id numbers ("number stored as text") appearing in a column with several blank cells between groups of id's. Just looking for some ideas before I charge off into a blind alley. Many thanks! --Bud Meredith (Bud) Bliss Crime Analyst - Beaverton Police Dept. 4755 SW Griffith Drive Beaverton, OR 97005 503-526-2294 - Fax: 503-526-2541 ====================================================================== PUBLIC RECORDS LAW DISCLOSURE This e-mail is a public record of the City of Beaverton and is subject to public disclosure unless exempt from disclosure under Oregon Public Records Law. This email is subject to the State Retention Schedule. ===================== To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD |
I'm not clear on just what you need to
do, but there seem to be two parts.
1. Loop over a large number of Excel data files, reading each in. Probably merge these together 2. Extract some information. Possibly you need to use the file name as a variable in order to capture the date. Python is indeed likely to be involved, but you can perhaps avoid writing any Python code of your own. There is an extension command, SPSSINC PROCESS FILES that can loop over wildcard-selected files in a directory tree. It generates macros and file handles that can be used in the syntax, so you could use that to keep doing merges and even to create variable values from the file names. You can get this from the new SPSS Community site at www.ibm.com/developerworks/spssdevcentral. It requires, of course, the Python Essentials for IBM SPSS Statistics. You could also write Python code directly to iterate over the directory contents and generate the requisite syntax. HTH, Jon Peck Senior Software Engineer, IBM [hidden email] 312-651-3435 From: Bud Bliss <[hidden email]> To: [hidden email] Date: 01/03/2011 04:44 PM Subject: [SPSSX-L] Suggestions needed for data collection Sent by: "SPSSX(r) Discussion" <[hidden email]> Hello all, I have a data collection effort and I'm not sure how to approach this, but I suspect Python might be involved ;-) I need to compile staffing data which would tabulate when each of about 50 different id numbers appear on daily Excel spreadsheets over a period of five years (hence about 1,800 files). There is a separate file for each day, and they are gathered in folders by month, and months in turn by years. For example, on 12/15/08, the data appear in a file named "12-15-08.xls" in the "12-2008" subfolder which is in the 2008 folder. Some are in Excel 2003 format, some in 2007. The format appears to be fairly consistent with the four-digit id numbers ("number stored as text") appearing in a column with several blank cells between groups of id's. Just looking for some ideas before I charge off into a blind alley. Many thanks! --Bud Meredith (Bud) Bliss Crime Analyst - Beaverton Police Dept. 4755 SW Griffith Drive Beaverton, OR 97005 503-526-2294 - Fax: 503-526-2541 ====================================================================== PUBLIC RECORDS LAW DISCLOSURE This e-mail is a public record of the City of Beaverton and is subject to public disclosure unless exempt from disclosure under Oregon Public Records Law. This email is subject to the State Retention Schedule. ===================== To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD |
In reply to this post by Bud Bliss
Dear Bud
In addition to John Peck's suggestions, there are a couple of other possibilities. 1. Use an Excel macro to merge the spreadsheets before importing into SPSS The macro you need is available here: http://www.mrexcel.com/forum/showthread.php?t=49155 It takes all workbooks in one folder & combines them into one. ======================== 2. Use Bulk File Merger. This is specialised software for merging multiple Excel files. Not free, but cheap. http://www.essexredevelopment.com/ I've not tried either myself, but probably worth a look. Garry Gelade Business Analytic -----Original Message----- From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of Bud Bliss Sent: 03 January 2011 23:38 To: [hidden email] Subject: Suggestions needed for data collection Hello all, I have a data collection effort and I'm not sure how to approach this, but I suspect Python might be involved ;-) I need to compile staffing data which would tabulate when each of about 50 different id numbers appear on daily Excel spreadsheets over a period of five years (hence about 1,800 files). There is a separate file for each day, and they are gathered in folders by month, and months in turn by years. For example, on 12/15/08, the data appear in a file named "12-15-08.xls" in the "12-2008" subfolder which is in the 2008 folder. Some are in Excel 2003 format, some in 2007. The format appears to be fairly consistent with the four-digit id numbers ("number stored as text") appearing in a column with several blank cells between groups of id's. Just looking for some ideas before I charge off into a blind alley. Many thanks! --Bud Meredith (Bud) Bliss Crime Analyst - Beaverton Police Dept. 4755 SW Griffith Drive Beaverton, OR 97005 503-526-2294 - Fax: 503-526-2541 ====================================================================== PUBLIC RECORDS LAW DISCLOSURE This e-mail is a public record of the City of Beaverton and is subject to public disclosure unless exempt from disclosure under Oregon Public Records Law. This email is subject to the State Retention Schedule. ===================== To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD ===================== To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD |
I will be out of the office until Monday, Jan. 10th. If you need immediate assistance, please call 812-856-5824. I will respond to your e-mail
as soon as possible. Thank you,
Shimon Sarraf
Center for Postsecondary Research Indiana University at Bloomington |
In reply to this post by Garry Gelade
Hi Bud,
I dug up two Python scripts that may be of help. Not sure if it's exactly what you want, but you may be able to alter it. They use the xlrd module (not builtin). In addition, you might want to look at os.walk, which allows you to process directory trees. What the scripts do is described in the """docstrings""".
script 1 ##################
"""
Merge all xls files in a given directory into one multisheet xls file. The sheets get the orginal file name, without the extension. Files with names that exceed 29 characters are skipped """ import glob, os, time
import xlrd, xlwt class TooManyXLSsheetsError(Exception):
pass def merge_xls(in_dir="d:/temp", out_file="d:/temp/merged_output.xls", ask_for_prefix=True):
""" Main function: merge xls sheets """ xls_files = glob.glob(in_dir + os.sep + "*.xls") xls_files.sort() merged_book = xlwt.Workbook() osheet_names = [os.path.basename(xls_file)[:-4] for xls_file in xls_files] if xls_files: for xls_no, xls_file in enumerate(xls_files): print "---> Processing file %s" % (xls_file) book = xlrd.open_workbook(xls_file) isheet_names = xlrd.Book.sheet_names(book) check_xls(book, merged_book, isheet_names, osheet_names, xls_no, xls_files) stamped_outfile = save_xls(merged_book, out_file, ask_for_prefix) print_msg(xls_files, osheet_names, stamped_outfile) else: print "NOTE *** No xls files in %s. Nothing to do" % (in_dir) def check_xls(book, merged_book, isheet_names, osheet_names, xls_no, xls_files):
""" Check existence and file names of input xls files """ if len(osheet_names[xls_no]) <= 29: write_sheets(book, merged_book, isheet_names, osheet_names, xls_no, xls_files) else: print "WARNING *** File name too long: <%s.xls> (maximum is 31 chars) " % (osheet_names[xls_no]) print "WARNING *** File <%s.xls> was skipped." % (osheet_names[xls_no]) def write_sheets(book, merged_book, isheet_names, osheet_names, xls_no, xls_files):
""" Write sheets, and add sheet numbering in case of multisheet xls input """ osheet_name = osheet_names[xls_no] xls_file = xls_files[xls_no] MAX_NO_OF_SHEETS = 99 # len(str(MAX_NO_OF_SHEETS) == 2; max(len(osheet_names)) 29; 2 + 29 = 31 = max sheetname length if book.nsheets == 1: ws = merged_book.add_sheet(osheet_name) isheet_name = isheet_names[0] sheet = book.sheet_by_index(0) write_cells(sheet, book, ws, isheet_name, osheet_name, xls_file) elif book.nsheets in range(1, MAX_NO_OF_SHEETS+1): for sheetx in range(book.nsheets): isheet_name = isheet_names[sheetx] ws = merged_book.add_sheet(osheet_name+str(sheetx+1).zfill(2)) sheet = book.sheet_by_index(sheetx) write_cells(sheet, book, ws, isheet_name, osheet_name, xls_file) elif book.nsheets > MAX_NO_OF_SHEETS: print "ERROR *** File %s has %s sheets (maximum is 99)" % (xls_file, book.nsheets) raise Exception (TooManyXLSsheetsError) def write_cells(sheet, book, ws, isheet_name, osheet_name, xls_file, format_cell=True, suppress=False): """ Write cells, and apply formatting if needed. If format_cells=True, bold font and horizontal split is applied. If suppress=True, cell values <= 5 are replaced by 'x' """ MISSINGVALUES = ("#LEEG!", "#NULL!") SUPPRESSCELLS = range(1, 5+1) # suppress *all* values <= 5; including percentages! rx = 0 # initialize to zero in case of empty input xls file. style1 = format_cells(ws) style2 = xlwt.easyxf('font: name Arial, bold True;') for rx in range(sheet.nrows): for cx in range(sheet.ncols): cell_value = sheet.cell_value(rx, cx) if format_cell and rx == 0: format_cells(ws) ws.write(rx, cx, cell_value, style1) # zeroes are represented as a float 0.0, null values as int 0. if cell_value in MISSINGVALUES or (isinstance(cell_value, int) and cell_value == 0): ws.write(rx, cx, "-") elif suppress and cell_value in SUPPRESSCELLS: ws.write(rx, cx, "x") else: ws.write(rx, cx, cell_value) footer = "source tab: " + isheet_name + " || source file: " + os.path.basename(xls_file) if format_cell: ws.write(rx+2, 0, footer.upper(), style2) # print bold source tab & file name below the table else: ws.write(rx+2, 0, footer.upper()) def format_cells(ws, font='Arial', boldrow=True, panes_frozen=True):
""" Custom-style: Add horizontal split pane and bold font at first row """ ws.panes_frozen = panes_frozen ws.horz_split_pos = 1 borders = xlwt.Borders()
borders.left = 0 borders.right = 0 borders.top = 6 borders.bottom = 6 font0 = xlwt.Font() # waarom geen boldrow???
font0.name = font font0.bold = boldrow style0 = xlwt.XFStyle() style0.borders = borders style0.font = font0 return style0 def check_prefix(stamped_outfile): """ Generate user prompt for prefix entry and check prefix for illegal signs""" while True: prefix = raw_input("---> Give a file prefix [ENTER = None]: ") + "_" if prefix == "_": prefix = "" illegal_signs = set(prefix).intersection(set(r"""\/:*?"<>|""")) if illegal_signs: print "---> Illegal sign(s) in prefix (%s). Try again." % " ".join(illegal_signs) continue else: stamped_outfile = os.path.join(os.path.dirname(stamped_outfile), \ prefix + os.path.basename(stamped_outfile)) return stamped_outfile break def save_xls(merged_book, out_file, ask_for_prefix):
stamped_outfile = out_file[:-4] + "_" + time.strftime("%Y-%m-%d") + ".xls" if ask_for_prefix: stamped_outfile = check_prefix(stamped_outfile) merged_book.save(stamped_outfile) return stamped_outfile def print_msg(xls_files, osheet_names, stamped_outfile):
""" Print status messages """ print "\n---> Merged xls file written to %s using the following source files: " % (stamped_outfile) MAXSHEETNAMELEN = 29 for n_sheet, osheet_name in enumerate(osheet_names): if len(osheet_name) <= MAXSHEETNAMELEN: print "\t", str(n_sheet+1).zfill(3), "%s.xls" % (osheet_name) excl_sheets = [os.path.basename(xls_file)[:-4] for xls_file in xls_files if \ len(os.path.basename(xls_file)[:-4]) > MAXSHEETNAMELEN] if excl_sheets: print "\n--> The following files were skipped because the file name exceeds 29 characters: " for n_sheet, excl_sheet in enumerate(excl_sheets): print "\t", str(n_sheet+1).zfill(3), excl_sheet if __name__ == "__main__":
merge_xls() script 2 ######################################## # -*- coding: cp1252 -*-
import xlrd, os.path, codecs, spss ##def replace_chars():
## """ This is not used in the current .py file """ ## trans = {} ## funnychars = u"éèêëóòôöáàâäúùüûÉÈÊËÓÒÔÖÁÀÂÄÚÙÜÛ" ## asciichars = "eeeeooooaaaauuuuEEEEOOOOAAAAUUUU" ## for f, a in zip(funnychars, asciichars): ## trans[ord(f)] = ord(a) ## return trans ##s = s.translate(replace_chars()) """ Process all sheets of an Excel file and write the result into one sav file.
Up to 50 sheets can be processed (ADD FILES maximum = 50 sheets). """ def get_data(xls, sheet_name):
command = """get data /type=xls /file='%s' /sheet=name '%s' /cellrange=full /readnames=on /assumedstrwidth=32767.\ndataset name %s.\n""" % (xls, sheet_name, sheet_name) spss.Submit(codecs.encode(command, "cp1252")) def merge_sheets2sav(xls):
try: wb = xlrd.open_workbook(xls) addfilescmd = ["add files"] if wb.nsheets > 50: raise "--> xls file contains too many sheets (%s), maximum = 50" % wb.nsheets for sheet in wb.sheets(): get_data(xls, sheet.name) addfilescmd.append(" /file = " + sheet.name + " /in = in_" + sheet.name + "\n") spss.Submit(codecs.encode(" ".join(addfilescmd) + ".", "cp1252")) out = os.path.dirname(xls) + "/out.sav" spss.Submit("save outfile = '%s'.\ndataset close all." % out) print "--> Merged sav file written:", out except IOError: raise "--> Input file does not exist (%s)" % xls except spss.SpssError: raise "--> Spss error. Incompatible sheets?" except: raise "--> Unknown error." if __name__ == "__main__":
merge_sheets2sav(xls="d:/temp/in.xls") Albert-Jan ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ All right, but apart from the sanitation, the medicine, education, wine, public order, irrigation, roads, a fresh water system, and public health, what have the Romans ever done for us? ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ From: Garry Gelade <[hidden email]> To: [hidden email] Sent: Tue, January 4, 2011 10:12:47 AM Subject: Re: [SPSSX-L] Suggestions needed for data collection Dear Bud In addition to John Peck's suggestions, there are a couple of other possibilities. 1. Use an Excel macro to merge the spreadsheets before importing into SPSS The macro you need is available here: http://www.mrexcel.com/forum/showthread.php?t=49155 It takes all workbooks in one folder & combines them into one. ======================== 2. Use Bulk File Merger. This is specialised software for merging multiple Excel files. Not free, but cheap. http://www.essexredevelopment.com/ I've not tried either myself, but probably worth a look. Garry Gelade Business Analytic -----Original Message----- From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of Bud Bliss Sent: 03 January 2011 23:38 To: [hidden email] Subject: Suggestions needed for data collection Hello all, I have a data collection effort and I'm not sure how to approach this, but I suspect Python might be involved ;-) I need to compile staffing data which would tabulate when each of about 50 different id numbers appear on daily Excel spreadsheets over a period of five years (hence about 1,800 files). There is a separate file for each day, and they are gathered in folders by month, and months in turn by years. For example, on 12/15/08, the data appear in a file named "12-15-08.xls" in the "12-2008" subfolder which is in the 2008 folder. Some are in Excel 2003 format, some in 2007. The format appears to be fairly consistent with the four-digit id numbers ("number stored as text") appearing in a column with several blank cells between groups of id's. Just looking for some ideas before I charge off into a blind alley. Many thanks! --Bud Meredith (Bud) Bliss Crime Analyst - Beaverton Police Dept. 4755 SW Griffith Drive Beaverton, OR 97005 503-526-2294 - Fax: 503-526-2541 ====================================================================== PUBLIC RECORDS LAW DISCLOSURE This e-mail is a public record of the City of Beaverton and is subject to public disclosure unless exempt from disclosure under Oregon Public Records Law. This email is subject to the State Retention Schedule. ===================== To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD ===================== To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD |
Free forum by Nabble | Edit this page |