Suggestions needed for data collection

classic Classic list List threaded Threaded
5 messages Options
Reply | Threaded
Open this post in threaded view
|

Suggestions needed for data collection

Bud Bliss
Hello all,
I have a data collection effort and I'm not sure how to approach this, but I suspect Python might be involved ;-)
I need to compile staffing data which would tabulate when each of about 50 different id numbers appear on daily Excel spreadsheets over a period of five years (hence about 1,800 files). There is a separate file for each day, and they are gathered in folders by month, and months in turn by years. For example, on 12/15/08, the data appear in a file named "12-15-08.xls" in the "12-2008" subfolder which is in the 2008 folder. Some are in Excel 2003 format, some in 2007. The format appears to be fairly consistent with the four-digit id numbers ("number stored as text") appearing in a column with several blank cells between groups of id's.
Just looking for some ideas before I charge off into a blind alley.
Many thanks!
        --Bud

Meredith (Bud) Bliss
Crime Analyst - Beaverton Police Dept.
4755 SW Griffith Drive
Beaverton, OR 97005
503-526-2294 - Fax: 503-526-2541

======================================================================
PUBLIC RECORDS LAW DISCLOSURE

This e-mail is a public record of the City of Beaverton and is subject to public disclosure unless exempt from disclosure under Oregon Public Records Law.  This email is subject to the State Retention Schedule.

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD
Reply | Threaded
Open this post in threaded view
|

Re: Suggestions needed for data collection

Jon K Peck
I'm not clear on just what you need to do, but there seem to be two parts.

1. Loop over a large number of Excel data files, reading each in.  Probably merge these together

2. Extract some information.  Possibly you need to use the file name as a variable in order to capture the date.

Python is indeed likely to be involved, but you can perhaps avoid writing any Python code of your own.  There is an extension command, SPSSINC PROCESS FILES that can loop over wildcard-selected files in a directory tree.  It generates macros and file handles that can be used in the syntax, so you could use that to keep doing merges and even to create variable values from the file names.

You can get this from the new SPSS Community site at www.ibm.com/developerworks/spssdevcentral.  It requires, of course, the Python Essentials for IBM SPSS Statistics.

You could also write Python code directly to iterate over the directory contents and generate the requisite syntax.

HTH,

Jon Peck
Senior Software Engineer, IBM
[hidden email]
312-651-3435




From:        Bud Bliss <[hidden email]>
To:        [hidden email]
Date:        01/03/2011 04:44 PM
Subject:        [SPSSX-L] Suggestions needed for data collection
Sent by:        "SPSSX(r) Discussion" <[hidden email]>




Hello all,
I have a data collection effort and I'm not sure how to approach this, but I suspect Python might be involved ;-)
I need to compile staffing data which would tabulate when each of about 50 different id numbers appear on daily Excel spreadsheets over a period of five years (hence about 1,800 files). There is a separate file for each day, and they are gathered in folders by month, and months in turn by years. For example, on 12/15/08, the data appear in a file named "12-15-08.xls" in the "12-2008" subfolder which is in the 2008 folder. Some are in Excel 2003 format, some in 2007. The format appears to be fairly consistent with the four-digit id numbers ("number stored as text") appearing in a column with several blank cells between groups of id's.
Just looking for some ideas before I charge off into a blind alley.
Many thanks!
       --Bud

Meredith (Bud) Bliss
Crime Analyst - Beaverton Police Dept.
4755 SW Griffith Drive
Beaverton, OR 97005
503-526-2294 - Fax: 503-526-2541

======================================================================
PUBLIC RECORDS LAW DISCLOSURE

This e-mail is a public record of the City of Beaverton and is subject to public disclosure unless exempt from disclosure under Oregon Public Records Law.  This email is subject to the State Retention Schedule.

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD

Reply | Threaded
Open this post in threaded view
|

Re: Suggestions needed for data collection

Garry Gelade
In reply to this post by Bud Bliss
Dear Bud

In addition to John Peck's suggestions, there are a couple of other
possibilities.

1. Use an Excel macro to merge the spreadsheets before importing into SPSS
The macro you need is available here:

http://www.mrexcel.com/forum/showthread.php?t=49155


It takes all workbooks in one folder & combines them into one.

========================
2. Use Bulk File Merger.
This is specialised software for merging multiple Excel files. Not free, but
cheap.
http://www.essexredevelopment.com/



I've not tried either myself, but probably worth a look.

Garry Gelade
Business Analytic

-----Original Message-----
From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of Bud
Bliss
Sent: 03 January 2011 23:38
To: [hidden email]
Subject: Suggestions needed for data collection

Hello all,
I have a data collection effort and I'm not sure how to approach this, but I
suspect Python might be involved ;-)
I need to compile staffing data which would tabulate when each of about 50
different id numbers appear on daily Excel spreadsheets over a period of
five years (hence about 1,800 files). There is a separate file for each day,
and they are gathered in folders by month, and months in turn by years. For
example, on 12/15/08, the data appear in a file named "12-15-08.xls" in the
"12-2008" subfolder which is in the 2008 folder. Some are in Excel 2003
format, some in 2007. The format appears to be fairly consistent with the
four-digit id numbers ("number stored as text") appearing in a column with
several blank cells between groups of id's.
Just looking for some ideas before I charge off into a blind alley.
Many thanks!
        --Bud

Meredith (Bud) Bliss
Crime Analyst - Beaverton Police Dept.
4755 SW Griffith Drive
Beaverton, OR 97005
503-526-2294 - Fax: 503-526-2541

======================================================================
PUBLIC RECORDS LAW DISCLOSURE

This e-mail is a public record of the City of Beaverton and is subject to
public disclosure unless exempt from disclosure under Oregon Public Records
Law.  This email is subject to the State Retention Schedule.

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD
Reply | Threaded
Open this post in threaded view
|

Automatic reply: Suggestions needed for data collection

Sarraf, Shimon Aaron

I will be out of the office until Monday, Jan. 10th. If you need immediate assistance, please call 812-856-5824. I will respond to your e-mail as soon as possible.

 

Thank you,

Shimon Sarraf

Center for Postsecondary Research

Indiana University at Bloomington

 

Reply | Threaded
Open this post in threaded view
|

Re: Suggestions needed for data collection

Albert-Jan Roskam
In reply to this post by Garry Gelade
Hi Bud,
 
I dug up two Python scripts that may be of help. Not sure if it's exactly what you want, but you may be able to alter it. They use the xlrd module (not builtin). In addition, you might want to look at os.walk, which allows you to process directory trees. What the scripts do is described in the """docstrings""".
 
script 1 ##################
"""
Merge all xls files in a given directory into one multisheet xls file.
The sheets get the orginal file name, without the extension.
Files with names that exceed 29 characters are skipped
"""
import glob, os, time
import xlrd, xlwt
class TooManyXLSsheetsError(Exception):
    pass
def merge_xls(in_dir="d:/temp", out_file="d:/temp/merged_output.xls", ask_for_prefix=True):
    """ Main function: merge xls sheets """
    xls_files = glob.glob(in_dir + os.sep + "*.xls")
    xls_files.sort()
    merged_book = xlwt.Workbook()
    osheet_names = [os.path.basename(xls_file)[:-4] for xls_file in xls_files]
    if xls_files:
        for xls_no, xls_file in enumerate(xls_files):
            print "---> Processing file %s" % (xls_file)
            book = xlrd.open_workbook(xls_file)
            isheet_names = xlrd.Book.sheet_names(book)
            check_xls(book, merged_book, isheet_names, osheet_names, xls_no, xls_files)
        stamped_outfile = save_xls(merged_book, out_file, ask_for_prefix)
        print_msg(xls_files, osheet_names, stamped_outfile)
    else:
        print "NOTE *** No xls files in %s. Nothing to do" % (in_dir)
def check_xls(book, merged_book, isheet_names, osheet_names, xls_no, xls_files):
    """ Check existence and file names of input xls files """
    if len(osheet_names[xls_no]) <= 29:
        write_sheets(book, merged_book, isheet_names, osheet_names, xls_no, xls_files)
    else:
        print "WARNING *** File name too long: <%s.xls> (maximum is 31 chars) " % (osheet_names[xls_no])
        print "WARNING *** File <%s.xls> was skipped." % (osheet_names[xls_no])
def write_sheets(book, merged_book, isheet_names, osheet_names, xls_no, xls_files):
    """ Write sheets, and add sheet numbering in case of multisheet xls input """
    osheet_name = osheet_names[xls_no]
    xls_file = xls_files[xls_no]
    MAX_NO_OF_SHEETS = 99 # len(str(MAX_NO_OF_SHEETS) == 2; max(len(osheet_names)) 29; 2 + 29 = 31 = max sheetname length
    if book.nsheets == 1:
        ws = merged_book.add_sheet(osheet_name)
        isheet_name = isheet_names[0]
        sheet = book.sheet_by_index(0)
        write_cells(sheet, book, ws, isheet_name, osheet_name, xls_file)
    elif book.nsheets in range(1, MAX_NO_OF_SHEETS+1):
        for sheetx in range(book.nsheets):
            isheet_name = isheet_names[sheetx]
            ws = merged_book.add_sheet(osheet_name+str(sheetx+1).zfill(2))
            sheet = book.sheet_by_index(sheetx)
            write_cells(sheet, book, ws, isheet_name, osheet_name, xls_file)
    elif book.nsheets > MAX_NO_OF_SHEETS:
        print "ERROR *** File %s has %s sheets (maximum is 99)" % (xls_file, book.nsheets)
        raise Exception (TooManyXLSsheetsError)

def write_cells(sheet, book, ws, isheet_name, osheet_name, xls_file, format_cell=True, suppress=False):
    """ Write cells, and apply formatting if needed. If format_cells=True, bold font and
    horizontal split is applied. If suppress=True, cell values <= 5 are replaced by 'x' """
    MISSINGVALUES = ("#LEEG!", "#NULL!")
    SUPPRESSCELLS = range(1, 5+1) # suppress *all* values <= 5; including percentages!
    rx = 0 # initialize to zero in case of empty input xls file.
    style1 = format_cells(ws)
    style2 = xlwt.easyxf('font: name Arial, bold True;')
    for rx in range(sheet.nrows):
        for cx in range(sheet.ncols):
            cell_value = sheet.cell_value(rx, cx)
            if format_cell and rx == 0:
                format_cells(ws)
                ws.write(rx, cx, cell_value, style1)
            # zeroes are represented as a float 0.0, null values as int 0.
            if cell_value in MISSINGVALUES or (isinstance(cell_value, int) and cell_value == 0):
                ws.write(rx, cx, "-")
            elif suppress and cell_value in SUPPRESSCELLS:
                ws.write(rx, cx, "x")
            else:
                ws.write(rx, cx, cell_value)
    footer = "source tab: " + isheet_name + " || source file: " + os.path.basename(xls_file)
    if format_cell:
        ws.write(rx+2, 0, footer.upper(), style2)  # print bold source tab & file name below the table
    else:
        ws.write(rx+2, 0, footer.upper())
def format_cells(ws, font='Arial', boldrow=True, panes_frozen=True):
    """ Custom-style: Add horizontal split pane and bold font at first row """
    ws.panes_frozen = panes_frozen
    ws.horz_split_pos = 1
    borders = xlwt.Borders()
    borders.left = 0
    borders.right = 0
    borders.top = 6
    borders.bottom = 6
    font0 = xlwt.Font() # waarom geen boldrow???
    font0.name = font
    font0.bold = boldrow
   
    style0 = xlwt.XFStyle()
    style0.borders = borders
    style0.font = font0
    return style0
 

def check_prefix(stamped_outfile):
    """ Generate user prompt for prefix entry and check prefix for illegal signs"""
    while True:
        prefix = raw_input("---> Give a file prefix [ENTER = None]: ") + "_"
        if prefix == "_":
            prefix = ""
        illegal_signs = set(prefix).intersection(set(r"""\/:*?"<>|"""))
        if illegal_signs:
            print "---> Illegal sign(s) in prefix (%s). Try again." % " ".join(illegal_signs)
            continue
        else:
            stamped_outfile = os.path.join(os.path.dirname(stamped_outfile), \
                                           prefix + os.path.basename(stamped_outfile))
            return stamped_outfile
            break
def save_xls(merged_book, out_file, ask_for_prefix):
        stamped_outfile = out_file[:-4] + "_" + time.strftime("%Y-%m-%d") + ".xls"
        if ask_for_prefix:
            stamped_outfile = check_prefix(stamped_outfile)
        merged_book.save(stamped_outfile)
        return stamped_outfile
def print_msg(xls_files, osheet_names, stamped_outfile):
    """ Print status messages """
    print "\n---> Merged xls file written to %s using the following source files: " % (stamped_outfile)
    MAXSHEETNAMELEN = 29
    for n_sheet, osheet_name in enumerate(osheet_names):
        if len(osheet_name) <= MAXSHEETNAMELEN:
            print "\t", str(n_sheet+1).zfill(3), "%s.xls" % (osheet_name)
    excl_sheets = [os.path.basename(xls_file)[:-4] for xls_file in xls_files if \
                   len(os.path.basename(xls_file)[:-4]) > MAXSHEETNAMELEN]   
    if excl_sheets:
        print "\n--> The following files were skipped because the file name exceeds 29 characters: "
        for n_sheet, excl_sheet in enumerate(excl_sheets):
            print "\t", str(n_sheet+1).zfill(3), excl_sheet
if __name__ == "__main__":
    merge_xls()

script 2 ########################################
 
# -*- coding: cp1252 -*-
import xlrd, os.path, codecs, spss
 
##def replace_chars():
##    """ This is not used in the current .py file """
##    trans = {}
##    funnychars = u"éèêëóòôöáàâäúùüûÉÈÊËÓÒÔÖÁÀÂÄÚÙÜÛ"
##    asciichars = "eeeeooooaaaauuuuEEEEOOOOAAAAUUUU"
##    for f, a in zip(funnychars, asciichars):
##        trans[ord(f)] = ord(a)
##    return trans
##s = s.translate(replace_chars())
 
""" Process all sheets of an Excel file and write the result into one sav file.
Up to 50 sheets can be processed (ADD FILES maximum = 50 sheets). """
def get_data(xls, sheet_name):
    command = """get data /type=xls /file='%s' /sheet=name '%s' /cellrange=full
    /readnames=on /assumedstrwidth=32767.\ndataset name %s.\n""" % (xls, sheet_name, sheet_name)
    spss.Submit(codecs.encode(command, "cp1252"))
def merge_sheets2sav(xls):
    try:
        wb = xlrd.open_workbook(xls)
        addfilescmd = ["add files"]
        if wb.nsheets > 50:
            raise "--> xls file contains too many sheets (%s), maximum = 50" % wb.nsheets
        for sheet in wb.sheets():
            get_data(xls, sheet.name)
            addfilescmd.append(" /file = " + sheet.name + " /in = in_" + sheet.name + "\n")
        spss.Submit(codecs.encode(" ".join(addfilescmd) + ".", "cp1252"))
        out = os.path.dirname(xls) + "/out.sav"
        spss.Submit("save outfile = '%s'.\ndataset close all." % out)
        print "--> Merged sav file written:", out
    except IOError:
        raise "--> Input file does not exist (%s)" % xls
    except spss.SpssError:
        raise "--> Spss error. Incompatible sheets?"
    except:
        raise "--> Unknown error."
if __name__ == "__main__":
    merge_sheets2sav(xls="d:/temp/in.xls")
 
Cheers!!
Albert-Jan


~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
All right, but apart from the sanitation, the medicine, education, wine, public order, irrigation, roads, a fresh water system, and public health, what have the Romans ever done for us?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~



From: Garry Gelade <[hidden email]>
To: [hidden email]
Sent: Tue, January 4, 2011 10:12:47 AM
Subject: Re: [SPSSX-L] Suggestions needed for data collection

Dear Bud

In addition to John Peck's suggestions, there are a couple of other
possibilities.

1. Use an Excel macro to merge the spreadsheets before importing into SPSS
The macro you need is available here:

http://www.mrexcel.com/forum/showthread.php?t=49155


It takes all workbooks in one folder & combines them into one.

========================
2. Use Bulk File Merger.
This is specialised software for merging multiple Excel files. Not free, but
cheap.
http://www.essexredevelopment.com/



I've not tried either myself, but probably worth a look.

Garry Gelade
Business Analytic

-----Original Message-----
From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of Bud
Bliss
Sent: 03 January 2011 23:38
To: [hidden email]
Subject: Suggestions needed for data collection

Hello all,
I have a data collection effort and I'm not sure how to approach this, but I
suspect Python might be involved ;-)
I need to compile staffing data which would tabulate when each of about 50
different id numbers appear on daily Excel spreadsheets over a period of
five years (hence about 1,800 files). There is a separate file for each day,
and they are gathered in folders by month, and months in turn by years. For
example, on 12/15/08, the data appear in a file named "12-15-08.xls" in the
"12-2008" subfolder which is in the 2008 folder. Some are in Excel 2003
format, some in 2007. The format appears to be fairly consistent with the
four-digit id numbers ("number stored as text") appearing in a column with
several blank cells between groups of id's.
Just looking for some ideas before I charge off into a blind alley.
Many thanks!
        --Bud

Meredith (Bud) Bliss
Crime Analyst - Beaverton Police Dept.
4755 SW Griffith Drive
Beaverton, OR 97005
503-526-2294 - Fax: 503-526-2541

======================================================================
PUBLIC RECORDS LAW DISCLOSURE

This e-mail is a public record of the City of Beaverton and is subject to
public disclosure unless exempt from disclosure under Oregon Public Records
Law.  This email is subject to the State Retention Schedule.

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD