Extracting data from a messy file.

classic Classic list List threaded Threaded
8 messages Options
Reply | Threaded
Open this post in threaded view
|

Extracting data from a messy file.

Ntonghanwah Forcheh
I am trying to reorganise data in a messy file where the variable names are listed followed by a few lines of data values for different cases, then another row of variable names and then one or more rows of corresponding data values.
Been trying to use vector-Loop-DO IF and Lag(vectorname, n) combination of commands but just cant get them to work.

Any hints please?
example is as shown below.
Thanks
Forcheh

INDIVIDUAL_BARCODE    ID_CONSENT    Q101    Q102    Q103    Q104A    Q104_OTHER    Q105    Q106    AGE_DIFFERENCE    Q201
BAISIV 12397-01    1    2    52    1    10    Working at namola leuba    6    2        2
BAISIV 12397-02    1    1    13    3    9        1    2        2
BAISIV 12397-03    1    1    13    3    9        1    2        2
BAISIV 12397-05    1    2    17    4    9        1    2        2
BAISIV 12397-06    1    1    15    4    9        1    2        2
INDIVIDUAL_BARCODE    ID_CONSENT    Q101    Q102    Q103    Q104A    Q104B    Q105    Q106    AGE_DIFFERENCE    Q201
BAISIV 11886-03    1    1    24    5    2    Clothing retailer    1    2        2
BAISIV 11886-04    1    1    18    5    9        1    2        2
BAISIV 11886-05    1    1    13    3    9        1    2        2
BAISIV 11886-07    2                                   
INDIVIDUAL_BARCODE    ID_CONSENT    Q101    Q102    Q103    Q104A    Q105    Q106    AGE_DIFFERENCE    Q201    Q203aa
BAISIV 12398-01    1    2    29    4    5    1    2        2   
BAISIV 12398-02    1    2    19    5    9    1    2        2   
BAISIV 12398-07    2                                   
BAISIV 12399-01    1    1    57    4    1    Veterenary officer    6    6    34    23
INDIVIDUAL_BARCODE    ID_CONSENT    Q101    Q102    Q103    Q104A    Q105    Q106    AGE_DIFFERENCE    Q201    Q202A
BAISIV 12396-01    1    1    24    6    5    1    2        1    1
BAISIV 12396-03    1    2    20    5    10    1    2        1    2
BAISIV 12396-03    1    2    20    5    10    1    2        1    2
                                       
                                       
BAISIV 12326-01    1    2    54    4    6    1    2        2   

--
Professor Ntonghanwah Forcheh
Department of Statistics,
University of Botswana
Private Bag UB00705, Gaborone, Botswana.
Office: +267 355 2696,
 Mobile:  Orange +267 75 26 2963,    Bmobile:  73181378:    Mascom  754 21238
fax: +267 3185099;
Alternative Email: [hidden email]
*@Honesty is a Virtue, Freedom of the Mind is Power.
Motto: Never be afraid to be honest, Never lie to yourself, Trust in the
Truth and you will be forever free.*


Reply | Threaded
Open this post in threaded view
|

Re: Extracting data from a messy file.

Rich Ulrich
First, there is the question of what the organization is for these lines of data.

I think: This is like comma-delimited data, where "aaa" is a regular delimiter.
When a line ends with aaa, there is a Missing value following that aaa.

I think:  There is a full set of possible questions, with a specific order, and each
"set"  of lines following "INDIVIDUAL... " has a subset of those.
I think:  I don't know what would be intended by a line that is shown already as
all "aaa" markers, except for a useless line of all Missing.

For processing:  I would start by adding to the start of each line a couple of
counters, one for Set and one for Line number.  The INDIVIDUAL ...  would
be numbered as 0.  The rest of the line is processed as text, and partitioned
(A25 or larger) into a the total number of variables, with the 'aaa' removed.
Most of these appear to have small integers, but you don't have to predict
where the long String answers occur.

Then I would pull out the INDIVIDUAL...  lines to a separate file called INDEX,
leaving behind DATAONLY.  Using a master list of Questions I would match to
convert every VARNAME  into its proper, sequential number among the full list of
questions, in file INDEX#.

That list can be MATCHed  with  /table= INDEX# ...    by set-number back to DATAONLY.

DATAONLY + INDEX#  has variables, plus the columns where the questions belong in
the full set.  It is straightforward from there to use LOOP and Vectors to assign each
to its proper slot; and finally, to convert Strings to arithmetic, after checking that
the file was created with proper consistency. 


Does this handle the problem?

--
Rich Ulrich



Date: Sun, 22 Sep 2013 23:11:05 +0200
From: [hidden email]
Subject: Extracting data from a messy file.
To: [hidden email]

I am trying to reorganise data in a messy file where the variable names are listed followed by a few lines of data values for different cases, then another row of variable names and then one or more rows of corresponding data values.
Been trying to use vector-Loop-DO IF and Lag(vectorname, n) combination of commands but just cant get them to work.

Any hints please?
example is as shown below.
Thanks
Forcheh

INDIVIDUAL_BARCODEааа ID_CONSENTааа Q101ааа Q102ааа Q103ааа Q104Aааа Q104_OTHERааа Q105ааа Q106ааа AGE_DIFFERENCEааа Q201
BAISIV 12397-01ааа 1ааа 2ааа 52ааа 1ааа 10ааа Working at namola leubaааа 6ааа 2ааа ааа 2
BAISIV 12397-02ааа 1ааа 1ааа 13ааа 3ааа 9ааа ааа 1ааа 2ааа ааа 2
BAISIV 12397-03ааа 1ааа 1ааа 13ааа 3ааа 9ааа ааа 1ааа 2ааа ааа 2
BAISIV 12397-05ааа 1ааа 2ааа 17ааа 4ааа 9ааа ааа 1ааа 2ааа ааа 2
BAISIV 12397-06ааа 1ааа 1ааа 15ааа 4ааа 9ааа ааа 1ааа 2ааа ааа 2
INDIVIDUAL_BARCODEааа ID_CONSENTааа Q101ааа Q102ааа Q103ааа Q104Aааа Q104Bааа Q105ааа Q106ааа AGE_DIFFERENCEааа Q201
BAISIV 11886-03ааа 1ааа 1ааа 24ааа 5ааа 2ааа Clothing retailerааа 1ааа 2ааа ааа 2
BAISIV 11886-04ааа 1ааа 1ааа 18ааа 5ааа 9ааа ааа 1ааа 2ааа ааа 2
BAISIV 11886-05ааа 1ааа 1ааа 13ааа 3ааа 9ааа ааа 1ааа 2ааа ааа 2
BAISIV 11886-07ааа 2ааа ааа ааа ааа ааа ааа ааа ааа ааа
INDIVIDUAL_BARCODEааа ID_CONSENTааа Q101ааа Q102ааа Q103ааа Q104Aааа Q105ааа Q106ааа AGE_DIFFERENCEааа Q201ааа Q203aa
BAISIV 12398-01ааа 1ааа 2ааа 29ааа 4ааа 5ааа 1ааа 2ааа ааа 2ааа
BAISIV 12398-02ааа 1ааа 2ааа 19ааа 5ааа 9ааа 1ааа 2ааа ааа 2ааа
BAISIV 12398-07ааа 2ааа ааа ааа ааа ааа ааа ааа ааа ааа
BAISIV 12399-01ааа 1ааа 1ааа 57ааа 4ааа 1ааа Veterenary officerааа 6ааа 6ааа 34ааа 23
INDIVIDUAL_BARCODEааа ID_CONSENTааа Q101ааа Q102ааа Q103ааа Q104Aааа Q105ааа Q106ааа AGE_DIFFERENCEааа Q201ааа Q202A
BAISIV 12396-01ааа 1ааа 1ааа 24ааа 6ааа 5ааа 1ааа 2ааа ааа 1ааа 1
BAISIV 12396-03ааа 1ааа 2ааа 20ааа 5ааа 10ааа 1ааа 2ааа ааа 1ааа 2
BAISIV 12396-03ааа 1ааа 2ааа 20ааа 5ааа 10ааа 1ааа 2ааа ааа 1ааа 2
ааа ааа ааа ааа ааа ааа ааа ааа ааа ааа
ааа ааа ааа ааа ааа ааа ааа ааа ааа ааа
BAISIV 12326-01ааа 1ааа 2ааа 54ааа 4ааа 6ааа 1ааа 2ааа ааа 2ааа

--
Professor Ntonghanwah Forcheh
Department of Statistics,
University of Botswana
Private Bag UB00705, Gaborone, Botswana.
Office: +267 355 2696,
аMobile:а Orange +267 75 26 2963,а а Bmobile:а 73181378:а а Mascomа 754 21238
fax: +267 3185099;
Alternative Email: [hidden email]
*@Honesty is a Virtue, Freedom of the Mind is Power.
Motto: Never be afraid to be honest, Never lie to yourself, Trust in the
Truth and you will be forever free.*


Reply | Threaded
Open this post in threaded view
|

Re: Extracting data from a messy file.

Albert-Jan Roskam
In reply to this post by Ntonghanwah Forcheh
Hi,

I hacked this together and it seems to do the job. Output is in a csv file, see here if the indentation gets messed up: http://pastebin.com/WvHxF44m

import os
import csv
import pprint

# assumes that nobody has filled in a question using 4 consecutive spaces
def read(infile):
    with open(infile, "rb") as f:
        records = {}
        allvarnames = set()
        for line in f.readlines():
            values = None
            if line.startswith("INDIVIDUAL_BARCODE"):
                varnames = line.rstrip("\n").split(" " * 4)
                allvarnames.update(varnames)
            else:
                values =  line.rstrip("\n").split(" " * 4)
                id_value = values[0]
            if values:
                record = dict(zip(varnames, values))
                records[id_value] = record
    #pprint.pprint(records)
    return allvarnames, records

def write(data, outfile, missing=""):
    allvarnames, records = data
    with open(outfile, "wb") as w:
        writer = csv.writer(w)
        allvarnames = sorted(list(allvarnames))
        writer.writerow(allvarnames)
        for id_value, record in records.items():
            merged = []
            for varname in allvarnames:
                merged.append(record.get(varname, missing))
            writer.writerow(merged)

if __name__ == "__main__":
    infile = os.path.expanduser("~/Desktop/forcheh.txt")
    outfile = os.path.expanduser("~/Desktop/forcheh_out.csv")
    write(read(infile), outfile)
 
Regards,
Albert-Jan


~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
All right, but apart from the sanitation, the medicine, education, wine, public order, irrigation, roads, a
fresh water system, and public health, what have the Romans ever done for us?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 


From: Ntonghanwah Forcheh <[hidden email]>
To: [hidden email]
Sent: Sunday, September 22, 2013 11:11 PM
Subject: [SPSSX-L] Extracting data from a messy file.

I am trying to reorganise data in a messy file where the variable names are listed followed by a few lines of data values for different cases, then another row of variable names and then one or more rows of corresponding data values.
Been trying to use vector-Loop-DO IF and Lag(vectorname, n) combination of commands but just cant get them to work.

Any hints please?
example is as shown below.
Thanks
Forcheh

INDIVIDUAL_BARCODE    ID_CONSENT    Q101    Q102    Q103    Q104A    Q104_OTHER    Q105    Q106    AGE_DIFFERENCE    Q201
BAISIV 12397-01    1    2    52    1    10    Working at namola leuba    6    2        2
BAISIV 12397-02    1    1    13    3    9        1    2        2
BAISIV 12397-03    1    1    13    3    9        1    2        2
BAISIV 12397-05    1    2    17    4    9        1    2        2
BAISIV 12397-06    1    1    15    4    9        1    2        2
INDIVIDUAL_BARCODE    ID_CONSENT    Q101    Q102    Q103    Q104A    Q104B    Q105    Q106    AGE_DIFFERENCE    Q201
BAISIV 11886-03    1    1    24    5    2    Clothing retailer    1    2        2
BAISIV 11886-04    1    1    18    5    9        1    2        2
BAISIV 11886-05    1    1    13    3    9        1    2        2
BAISIV 11886-07    2                                   
INDIVIDUAL_BARCODE    ID_CONSENT    Q101    Q102    Q103    Q104A    Q105    Q106    AGE_DIFFERENCE    Q201    Q203aa
BAISIV 12398-01    1    2    29    4    5    1    2        2   
BAISIV 12398-02    1    2    19    5    9    1    2        2   
BAISIV 12398-07    2                                   
BAISIV 12399-01    1    1    57    4    1    Veterenary officer    6    6    34    23
INDIVIDUAL_BARCODE    ID_CONSENT    Q101    Q102    Q103    Q104A    Q105    Q106    AGE_DIFFERENCE    Q201    Q202A
BAISIV 12396-01    1    1    24    6    5    1    2        1    1
BAISIV 12396-03    1    2    20    5    10    1    2        1    2
BAISIV 12396-03    1    2    20    5    10    1    2        1    2
                                       
                                       
BAISIV 12326-01    1    2    54    4    6    1    2        2   

--
Professor Ntonghanwah Forcheh
Department of Statistics,
University of Botswana
Private Bag UB00705, Gaborone, Botswana.
Office: +267 355 2696,
 Mobile:  Orange +267 75 26 2963,    Bmobile:  73181378:    Mascom  754 21238
fax: +267 3185099;
Alternative Email: [hidden email]
*@Honesty is a Virtue, Freedom of the Mind is Power.
Motto: Never be afraid to be honest, Never lie to yourself, Trust in the
Truth and you will be forever free.*




Reply | Threaded
Open this post in threaded view
|

Re: Extracting data from a messy file.

hillel vardi
In reply to this post by Ntonghanwah Forcheh
Shalom

Here is an example of reading the "messy file "  using SPSS syntax

* >>>  use text editor to change the DELIMITERS ( 4 spaces ) to comma .

GET DATA
  /TYPE=TXT
  /FILE="L:\spss\messy_file.txt"
  /DELCASE=LINE
  /DELIMITERS=","
  /ARRANGEMENT=DELIMITED
  /FIRSTCASE=1
  /IMPORTCASE=ALL
  /VARIABLES=
  V1 A18
  V2 A10
  V3 A4
  V4 A4
  V5 A4
  V6 A5
  V7 A23
  V8 A4
  V9 A14
  V10 A14
  V11 A6.
CACHE.
EXECUTE.

* >>>>   if there are lines with more then 11 variable  add them  .


DATASET NAME messy.

ALTER TYPE v1 to v11(a30).
string  #v1 to #v11   id vn1 vn2(a30)  .
do if index(v1,'INDIVIDUAL_BARCODE') gt 0 .
do repeat vvv=v1 to v11/
                    vv_vv=#v1 to #v11 .
compute  vv_vv=vvv.
END REPEAT .
ELSE.
VECTOR  vv2=v1 to v11.
VECTOR  vv1=#v1 to #v11.
loop   i=1 to 11.
compute id=vv2(1) .
compute vn1=vv1(i) .
compute vn2=vv2(i) .
xsave OUTFILE="L:\spss\messy_file2.sav" / keep= id vn1 vn2 .
END LOOP.
END IF .
execute .

get   FILE="L:\spss\messy_file2.sav"  .
SORT CASES BY id vn1 vn2 .
add files file= */ BY id vn1 vn2 / first=first.
* >>>>>   there where duplicate values in the example I deleted them .
select if first eq 1.
CASESTOVARS
  /ID=id
  /INDEX=vn1
  /GROUPBY=VARIABLE.

* >>>> use add files file=* / keep=  var name          to order variables to the original order .

>>  Here is file I used

INDIVIDUAL_BARCODE,ID_CONSENT,Q101,Q102,Q103,Q104A,Q104_OTHER,Q105,Q106,AGE_DIFFERENCE,Q201
BAISIV 12397-01,1,2,52,1,10,Working at namola leuba,6,2,,2
BAISIV 12397-02,1,1,13,3,9,,1,2,,2
BAISIV 12397-03,1,1,13,3,9,,1,2,,2
BAISIV 12397-05,1,2,17,4,9,,1,2,,2
BAISIV 12397-06,1,1,15,4,9,,1,2,,2
INDIVIDUAL_BARCODE,ID_CONSENT,Q101,Q102,Q103,Q104A,Q104B,Q105,Q106,AGE_DIFFERENCE,Q201
BAISIV 11886-03,1,1,24,5,2,Clothing retailer,1,2,,2
BAISIV 11886-04,1,1,18,5,9,,1,2,,2
BAISIV 11886-05,1,1,13,3,9,,1,2,,2
BAISIV 11886-07,2,,,,,,,,  
INDIVIDUAL_BARCODE,ID_CONSENT,Q101,Q102,Q103,Q104A,Q105,Q106,AGE_DIFFERENCE,Q201,Q203aa
BAISIV 12398-01,1,2,29,4,5,1,2,,2  
BAISIV 12398-02,1,2,19,5,9,1,2,,2  
BAISIV 12398-07,2,,,,,,,,  
BAISIV 12399-01,1,1,57,4,1,Veterenary officer,6,6,34,23
INDIVIDUAL_BARCODE,ID_CONSENT,Q101,Q102,Q103,Q104A,Q105,Q106,AGE_DIFFERENCE,Q201,Q202A
BAISIV 12396-01,1,1,24,6,5,1,2,,1,1
BAISIV 12396-03,1,2,20,5,10,1,2,,1,2
BAISIV 12396-03,1,2,20,5,10,1,2,,1,2
BAISIV 12326-01,1,2,54,4,6,1,2,,2


 
Hillel Vardi
BGU


On 22/09/2013 23:11, Ntonghanwah Forcheh wrote:
I am trying to reorganise data in a messy file where the variable names are listed followed by a few lines of data values for different cases, then another row of variable names and then one or more rows of corresponding data values.
Been trying to use vector-Loop-DO IF and Lag(vectorname, n) combination of commands but just cant get them to work.

Any hints please?
example is as shown below.
Thanks
Forcheh

INDIVIDUAL_BARCODE    ID_CONSENT    Q101    Q102    Q103    Q104A    Q104_OTHER    Q105    Q106    AGE_DIFFERENCE    Q201
BAISIV 12397-01    1    2    52    1    10    Working at namola leuba    6    2        2
BAISIV 12397-02    1    1    13    3    9        1    2        2
BAISIV 12397-03    1    1    13    3    9        1    2        2
BAISIV 12397-05    1    2    17    4    9        1    2        2
BAISIV 12397-06    1    1    15    4    9        1    2        2
INDIVIDUAL_BARCODE    ID_CONSENT    Q101    Q102    Q103    Q104A    Q104B    Q105    Q106    AGE_DIFFERENCE    Q201
BAISIV 11886-03    1    1    24    5    2    Clothing retailer    1    2        2
BAISIV 11886-04    1    1    18    5    9        1    2        2
BAISIV 11886-05    1    1    13    3    9        1    2        2
BAISIV 11886-07    2                                   
INDIVIDUAL_BARCODE    ID_CONSENT    Q101    Q102    Q103    Q104A    Q105    Q106    AGE_DIFFERENCE    Q201    Q203aa
BAISIV 12398-01    1    2    29    4    5    1    2        2   
BAISIV 12398-02    1    2    19    5    9    1    2        2   
BAISIV 12398-07    2                                   
BAISIV 12399-01    1    1    57    4    1    Veterenary officer    6    6    34    23
INDIVIDUAL_BARCODE    ID_CONSENT    Q101    Q102    Q103    Q104A    Q105    Q106    AGE_DIFFERENCE    Q201    Q202A
BAISIV 12396-01    1    1    24    6    5    1    2        1    1
BAISIV 12396-03    1    2    20    5    10    1    2        1    2
BAISIV 12396-03    1    2    20    5    10    1    2        1    2
                                       
                                       
BAISIV 12326-01    1    2    54    4    6    1    2        2   

--
Professor Ntonghanwah Forcheh
Department of Statistics,
University of Botswana
Private Bag UB00705, Gaborone, Botswana.
Office: +267 355 2696,
 Mobile:  Orange +267 75 26 2963,    Bmobile:  73181378:    Mascom  754 21238
fax: +267 3185099;
Alternative Email: [hidden email]
*@Honesty is a Virtue, Freedom of the Mind is Power.
Motto: Never be afraid to be honest, Never lie to yourself, Trust in the
Truth and you will be forever free.*



Reply | Threaded
Open this post in threaded view
|

Re: Extracting data from a messy file.

Hillel Vardi-2
In reply to this post by Ntonghanwah Forcheh
Shalom

Here is an example of reading the "messy file "  using SPSS syntax

* >>>  use text editor to change the DELIMITERS ( 4 spaces ) to comma .

GET DATA
  /TYPE=TXT
  /FILE="L:\spss\messy_file.txt"
  /DELCASE=LINE
  /DELIMITERS=","
  /ARRANGEMENT=DELIMITED
  /FIRSTCASE=1
  /IMPORTCASE=ALL
  /VARIABLES=
  V1 A18
  V2 A10
  V3 A4
  V4 A4
  V5 A4
  V6 A5
  V7 A23
  V8 A4
  V9 A14
  V10 A14
  V11 A6.
CACHE.
EXECUTE.

* >>>>   if there are lines with more then 11 variable  add them  .


DATASET NAME messy.

ALTER TYPE v1 to v11(a30).
string  #v1 to #v11   id vn1 vn2(a30)  .
do if index(v1,'INDIVIDUAL_BARCODE') gt 0 .
do repeat vvv=v1 to v11/
                    vv_vv=#v1 to #v11 .
compute  vv_vv=vvv.
END REPEAT .
ELSE.
VECTOR  vv2=v1 to v11.
VECTOR  vv1=#v1 to #v11.
loop   i=1 to 11.
compute id=vv2(1) .
compute vn1=vv1(i) .
compute vn2=vv2(i) .
xsave OUTFILE="L:\spss\messy_file2.sav" / keep= id vn1 vn2 .
END LOOP.
END IF .
execute .

get   FILE="L:\spss\messy_file2.sav"  .
SORT CASES BY id vn1 vn2 .
add files file= */ BY id vn1 vn2 / first=first.
* >>>>>   there where duplicate values in the example I deleted them .
select if first eq 1.
CASESTOVARS
  /ID=id
  /INDEX=vn1
  /GROUPBY=VARIABLE.

* >>>> use add files file=* / keep=  var name          to order variables to the original order .

>>  Here is file I used

INDIVIDUAL_BARCODE,ID_CONSENT,Q101,Q102,Q103,Q104A,Q104_OTHER,Q105,Q106,AGE_DIFFERENCE,Q201
BAISIV 12397-01,1,2,52,1,10,Working at namola leuba,6,2,,2
BAISIV 12397-02,1,1,13,3,9,,1,2,,2
BAISIV 12397-03,1,1,13,3,9,,1,2,,2
BAISIV 12397-05,1,2,17,4,9,,1,2,,2
BAISIV 12397-06,1,1,15,4,9,,1,2,,2
INDIVIDUAL_BARCODE,ID_CONSENT,Q101,Q102,Q103,Q104A,Q104B,Q105,Q106,AGE_DIFFERENCE,Q201
BAISIV 11886-03,1,1,24,5,2,Clothing retailer,1,2,,2
BAISIV 11886-04,1,1,18,5,9,,1,2,,2
BAISIV 11886-05,1,1,13,3,9,,1,2,,2
BAISIV 11886-07,2,,,,,,,,  
INDIVIDUAL_BARCODE,ID_CONSENT,Q101,Q102,Q103,Q104A,Q105,Q106,AGE_DIFFERENCE,Q201,Q203aa
BAISIV 12398-01,1,2,29,4,5,1,2,,2  
BAISIV 12398-02,1,2,19,5,9,1,2,,2  
BAISIV 12398-07,2,,,,,,,,  
BAISIV 12399-01,1,1,57,4,1,Veterenary officer,6,6,34,23
INDIVIDUAL_BARCODE,ID_CONSENT,Q101,Q102,Q103,Q104A,Q105,Q106,AGE_DIFFERENCE,Q201,Q202A
BAISIV 12396-01,1,1,24,6,5,1,2,,1,1
BAISIV 12396-03,1,2,20,5,10,1,2,,1,2
BAISIV 12396-03,1,2,20,5,10,1,2,,1,2
BAISIV 12326-01,1,2,54,4,6,1,2,,2


 
Hillel Vardi
BGU


On 22/09/2013 23:11, Ntonghanwah Forcheh wrote:
I am trying to reorganise data in a messy file where the variable names are listed followed by a few lines of data values for different cases, then another row of variable names and then one or more rows of corresponding data values.
Been trying to use vector-Loop-DO IF and Lag(vectorname, n) combination of commands but just cant get them to work.

Any hints please?
example is as shown below.
Thanks
Forcheh

INDIVIDUAL_BARCODE    ID_CONSENT    Q101    Q102    Q103    Q104A    Q104_OTHER    Q105    Q106    AGE_DIFFERENCE    Q201
BAISIV 12397-01    1    2    52    1    10    Working at namola leuba    6    2        2
BAISIV 12397-02    1    1    13    3    9        1    2        2
BAISIV 12397-03    1    1    13    3    9        1    2        2
BAISIV 12397-05    1    2    17    4    9        1    2        2
BAISIV 12397-06    1    1    15    4    9        1    2        2
INDIVIDUAL_BARCODE    ID_CONSENT    Q101    Q102    Q103    Q104A    Q104B    Q105    Q106    AGE_DIFFERENCE    Q201
BAISIV 11886-03    1    1    24    5    2    Clothing retailer    1    2        2
BAISIV 11886-04    1    1    18    5    9        1    2        2
BAISIV 11886-05    1    1    13    3    9        1    2        2
BAISIV 11886-07    2                                   
INDIVIDUAL_BARCODE    ID_CONSENT    Q101    Q102    Q103    Q104A    Q105    Q106    AGE_DIFFERENCE    Q201    Q203aa
BAISIV 12398-01    1    2    29    4    5    1    2        2   
BAISIV 12398-02    1    2    19    5    9    1    2        2   
BAISIV 12398-07    2                                   
BAISIV 12399-01    1    1    57    4    1    Veterenary officer    6    6    34    23
INDIVIDUAL_BARCODE    ID_CONSENT    Q101    Q102    Q103    Q104A    Q105    Q106    AGE_DIFFERENCE    Q201    Q202A
BAISIV 12396-01    1    1    24    6    5    1    2        1    1
BAISIV 12396-03    1    2    20    5    10    1    2        1    2
BAISIV 12396-03    1    2    20    5    10    1    2        1    2
                                       
                                       
BAISIV 12326-01    1    2    54    4    6    1    2        2   

--
Professor Ntonghanwah Forcheh
Department of Statistics,
University of Botswana
Private Bag UB00705, Gaborone, Botswana.
Office: +267 355 2696,
 Mobile:  Orange +267 75 26 2963,    Bmobile:  73181378:    Mascom  754 21238
fax: +267 3185099;
Alternative Email: [hidden email]
*@Honesty is a Virtue, Freedom of the Mind is Power.
Motto: Never be afraid to be honest, Never lie to yourself, Trust in the
Truth and you will be forever free.*



Reply | Threaded
Open this post in threaded view
|

Re: Extracting data from a messy file.

Ntonghanwah Forcheh
In reply to this post by hillel vardi
Dear all,

Many thanks to the responses to my question. After posting, I "disappeared" and only returned to office.
I shall attempt the suggestions provided and give feed back very soon.

I notice that when the tabular data were posted, the table lines were converted to string of aaa. these were not part of the data. My apologies.

forcheh 


On Tue, Sep 24, 2013 at 12:04 AM, hillel vardi <[hidden email]> wrote:
Shalom

Here is an example of reading the "messy file "  using SPSS syntax

* >>>  use text editor to change the DELIMITERS ( 4 spaces ) to comma .

GET DATA
  /TYPE=TXT
  /FILE="L:\spss\messy_file.txt"
  /DELCASE=LINE
  /DELIMITERS=","
  /ARRANGEMENT=DELIMITED
  /FIRSTCASE=1
  /IMPORTCASE=ALL
  /VARIABLES=
  V1 A18
  V2 A10
  V3 A4
  V4 A4
  V5 A4
  V6 A5
  V7 A23
  V8 A4
  V9 A14
  V10 A14
  V11 A6.
CACHE.
EXECUTE.

* >>>>   if there are lines with more then 11 variable  add them  .


DATASET NAME messy.

ALTER TYPE v1 to v11(a30).
string  #v1 to #v11   id vn1 vn2(a30)  .
do if index(v1,'INDIVIDUAL_BARCODE') gt 0 .
do repeat vvv=v1 to v11/
                    vv_vv=#v1 to #v11 .
compute  vv_vv=vvv.
END REPEAT .
ELSE.
VECTOR  vv2=v1 to v11.
VECTOR  vv1=#v1 to #v11.
loop   i=1 to 11.
compute id=vv2(1) .
compute vn1=vv1(i) .
compute vn2=vv2(i) .
xsave OUTFILE="L:\spss\messy_file2.sav" / keep= id vn1 vn2 .
END LOOP.
END IF .
execute .

get   FILE="L:\spss\messy_file2.sav"  .
SORT CASES BY id vn1 vn2 .
add files file= */ BY id vn1 vn2 / first=first.
* >>>>>   there where duplicate values in the example I deleted them .
select if first eq 1.
CASESTOVARS
  /ID=id
  /INDEX=vn1
  /GROUPBY=VARIABLE.

* >>>> use add files file=* / keep=  var name          to order variables to the original order .

>>  Here is file I used

INDIVIDUAL_BARCODE,ID_CONSENT,Q101,Q102,Q103,Q104A,Q104_OTHER,Q105,Q106,AGE_DIFFERENCE,Q201
BAISIV 12397-01,1,2,52,1,10,Working at namola leuba,6,2,,2
BAISIV 12397-02,1,1,13,3,9,,1,2,,2
BAISIV 12397-03,1,1,13,3,9,,1,2,,2
BAISIV 12397-05,1,2,17,4,9,,1,2,,2
BAISIV 12397-06,1,1,15,4,9,,1,2,,2
INDIVIDUAL_BARCODE,ID_CONSENT,Q101,Q102,Q103,Q104A,Q104B,Q105,Q106,AGE_DIFFERENCE,Q201
BAISIV 11886-03,1,1,24,5,2,Clothing retailer,1,2,,2
BAISIV 11886-04,1,1,18,5,9,,1,2,,2
BAISIV 11886-05,1,1,13,3,9,,1,2,,2
BAISIV 11886-07,2,,,,,,,,  
INDIVIDUAL_BARCODE,ID_CONSENT,Q101,Q102,Q103,Q104A,Q105,Q106,AGE_DIFFERENCE,Q201,Q203aa
BAISIV 12398-01,1,2,29,4,5,1,2,,2  
BAISIV 12398-02,1,2,19,5,9,1,2,,2  
BAISIV 12398-07,2,,,,,,,,  
BAISIV 12399-01,1,1,57,4,1,Veterenary officer,6,6,34,23
INDIVIDUAL_BARCODE,ID_CONSENT,Q101,Q102,Q103,Q104A,Q105,Q106,AGE_DIFFERENCE,Q201,Q202A
BAISIV 12396-01,1,1,24,6,5,1,2,,1,1
BAISIV 12396-03,1,2,20,5,10,1,2,,1,2
BAISIV 12396-03,1,2,20,5,10,1,2,,1,2
BAISIV 12326-01,1,2,54,4,6,1,2,,2


 
Hillel Vardi
BGU



On 22/09/2013 23:11, Ntonghanwah Forcheh wrote:
I am trying to reorganise data in a messy file where the variable names are listed followed by a few lines of data values for different cases, then another row of variable names and then one or more rows of corresponding data values.
Been trying to use vector-Loop-DO IF and Lag(vectorname, n) combination of commands but just cant get them to work.

Any hints please?
example is as shown below.
Thanks
Forcheh

INDIVIDUAL_BARCODE    ID_CONSENT    Q101    Q102    Q103    Q104A    Q104_OTHER    Q105    Q106    AGE_DIFFERENCE    Q201
BAISIV 12397-01    1    2    52    1    10    Working at namola leuba    6    2        2
BAISIV 12397-02    1    1    13    3    9        1    2        2
BAISIV 12397-03    1    1    13    3    9        1    2        2
BAISIV 12397-05    1    2    17    4    9        1    2        2
BAISIV 12397-06    1    1    15    4    9        1    2        2
INDIVIDUAL_BARCODE    ID_CONSENT    Q101    Q102    Q103    Q104A    Q104B    Q105    Q106    AGE_DIFFERENCE    Q201
BAISIV 11886-03    1    1    24    5    2    Clothing retailer    1    2        2
BAISIV 11886-04    1    1    18    5    9        1    2        2
BAISIV 11886-05    1    1    13    3    9        1    2        2
BAISIV 11886-07    2                                   
INDIVIDUAL_BARCODE    ID_CONSENT    Q101    Q102    Q103    Q104A    Q105    Q106    AGE_DIFFERENCE    Q201    Q203aa
BAISIV 12398-01    1    2    29    4    5    1    2        2   
BAISIV 12398-02    1    2    19    5    9    1    2        2   
BAISIV 12398-07    2                                   
BAISIV 12399-01    1    1    57    4    1    Veterenary officer    6    6    34    23
INDIVIDUAL_BARCODE    ID_CONSENT    Q101    Q102    Q103    Q104A    Q105    Q106    AGE_DIFFERENCE    Q201    Q202A
BAISIV 12396-01    1    1    24    6    5    1    2        1    1
BAISIV 12396-03    1    2    20    5    10    1    2        1    2
BAISIV 12396-03    1    2    20    5    10    1    2        1    2
                                       
                                       
BAISIV 12326-01    1    2    54    4    6    1    2        2   

--
Professor Ntonghanwah Forcheh
Department of Statistics,
University of Botswana
Private Bag UB00705, Gaborone, Botswana.
Office: <a href="tel:%2B267%20355%202696" value="+2673552696" target="_blank">+267 355 2696,
 Mobile:  Orange <a href="tel:%2B267%2075%2026%202963" value="+26775262963" target="_blank">+267 75 26 2963,    Bmobile:  73181378:    Mascom  754 21238
fax: <a href="tel:%2B267%203185099" value="+2673185099" target="_blank">+267 3185099;
Alternative Email: [hidden email]
*@Honesty is a Virtue, Freedom of the Mind is Power.
Motto: Never be afraid to be honest, Never lie to yourself, Trust in the
Truth and you will be forever free.*






--
Professor Ntonghanwah Forcheh
Department of Statistics,
University of Botswana
Private Bag UB00705, Gaborone, Botswana.
Office: +267 355 2696,
 Mobile:  Orange +267 75 26 2963,    Bmobile:  73181378:    Mascom  754 21238
fax: +267 3185099;
Alternative Email: [hidden email]
*@Honesty is a Virtue, Freedom of the Mind is Power.
Motto: Never be afraid to be honest, Never lie to yourself, Trust in the
Truth and you will be forever free.*


Reply | Threaded
Open this post in threaded view
|

Re: Extracting data from a messy file.

Rich Ulrich
The presence of the "aaa" strings was useful, in this way:

It showed that there is *some* delimiter between fields, not merely
some blanks.  It is a delimiter that you will conveniently make use of
when reading the input.  It identifies the text field, and it marks
off some included Missing fields.

--
Rich Ulrich


Date: Thu, 26 Sep 2013 15:31:04 +0200
From: [hidden email]
Subject: Re: Extracting data from a messy file.
To: [hidden email]

Dear all,

Many thanks to the responses to my question. After posting, I "disappeared" and only returned to office.
I shall attempt the suggestions provided and give feed back very soon.

I notice that when the tabular data were posted, the table lines were converted to string of aaa. these were not part of the data. My apologies.

forchehа
[snip, previous]
Reply | Threaded
Open this post in threaded view
|

Re: Extracting data from a messy file.

Albert-Jan Roskam
________________________________

> From: Rich Ulrich <[hidden email]>
>To: [hidden email]
>Sent: Thursday, September 26, 2013 6:33 PM
>Subject: Re: [SPSSX-L] Extracting data from a messy file.
>
>The presence of the "aaa" strings was useful, in this way:
>
>It showed that there is *some* delimiter between fields, not merely
>some blanks.  It is a delimiter that you will conveniently make use of
>when reading the input.  It identifies the text field, and it marks
>off some included Missing fields.

There was no aaa in the plain text version I received. I concluded that 4 spaces were the delimiter and the resulting file looked right.

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD