I am trying to reorganise data in a messy file where the variable names are listed, followed by a few lines of data values for different cases, then another row of variable names and then one or more rows of corresponding data values. I have been trying to use a VECTOR-LOOP-DO IF and LAG(vectorname, n) combination of commands but just can't get them to work. Any hints, please? An example is shown below.

INDIVIDUAL_BARCODE ID_CONSENT Q101 Q102 Q103 Q104A Q104_OTHER Q105 Q106 AGE_DIFFERENCE Q201
BAISIV 12397-01 1 2 52 1 10 Working at namola leuba 6 2 2
BAISIV 12397-02 1 1 13 3 9 1 2 2
BAISIV 12397-03 1 1 13 3 9 1 2 2
BAISIV 12397-05 1 2 17 4 9 1 2 2
BAISIV 12397-06 1 1 15 4 9 1 2 2
INDIVIDUAL_BARCODE ID_CONSENT Q101 Q102 Q103 Q104A Q104B Q105 Q106 AGE_DIFFERENCE Q201
BAISIV 11886-03 1 1 24 5 2 Clothing retailer 1 2 2
BAISIV 11886-04 1 1 18 5 9 1 2 2
BAISIV 11886-05 1 1 13 3 9 1 2 2
BAISIV 11886-07 2
INDIVIDUAL_BARCODE ID_CONSENT Q101 Q102 Q103 Q104A Q105 Q106 AGE_DIFFERENCE Q201 Q203aa
BAISIV 12398-01 1 2 29 4 5 1 2 2
BAISIV 12398-02 1 2 19 5 9 1 2 2
BAISIV 12398-07 2
BAISIV 12399-01 1 1 57 4 1 Veterenary officer 6 6 34 23
INDIVIDUAL_BARCODE ID_CONSENT Q101 Q102 Q103 Q104A Q105 Q106 AGE_DIFFERENCE Q201 Q202A
BAISIV 12396-01 1 1 24 6 5 1 2 1 1
BAISIV 12396-03 1 2 20 5 10 1 2 1 2
BAISIV 12396-03 1 2 20 5 10 1 2 1 2
BAISIV 12326-01 1 2 54 4 6 1 2 2

--
Professor Ntonghanwah Forcheh
Department of Statistics, University of Botswana
Private Bag UB00705, Gaborone, Botswana.
Office: +267 355 2696, Mobile: Orange +267 75 26 2963, Bmobile: 73181378, Mascom 754 21238; fax: +267 3185099; Alternative Email: [hidden email]
*Honesty is a Virtue, Freedom of the Mind is Power. Motto: Never be afraid to be honest, Never lie to yourself, Trust in the Truth and you will be forever free.* |
First, there is the question of how these lines of data are organized.
I think: this is like comma-delimited data, where "aaa" is the regular delimiter. When a line ends with "aaa", there is a missing value following that "aaa".

I think: there is a full set of possible questions, in a specific order, and each "set" of lines following "INDIVIDUAL..." has a subset of those.

I think: I don't know what would be intended by a line shown as all "aaa" markers, except a useless line of all missing values.

For processing, I would start by adding a couple of counters to the start of each line, one for Set and one for Line number; the "INDIVIDUAL..." lines would be numbered as 0. The rest of the line is processed as text and partitioned (A25 or larger) into the total number of variables, with the "aaa" removed. Most of these appear to hold small integers, but this way you don't have to predict where the long string answers occur.

Then I would pull out the "INDIVIDUAL..." lines to a separate file called INDEX, leaving behind DATAONLY. Using a master list of questions, I would match against INDEX to convert every VARNAME into its proper sequential number among the full list of questions, giving file INDEX#. That list can be MATCHed with /TABLE=INDEX# ... by set number back to DATAONLY. DATAONLY + INDEX# then has the variables, plus the columns where the questions belong in the full set. From there it is straightforward to use LOOP and VECTOR to assign each value to its proper slot, and finally to convert strings to numbers, after checking that the file was created with proper consistency.

Does this handle the problem?

--
Rich Ulrich

Date: Sun, 22 Sep 2013 23:11:05 +0200
From: [hidden email]
Subject: Extracting data from a messy file.
To: [hidden email]

I am trying to reorganise data in a messy file where the variable names are listed, followed by a few lines of data values for different cases, then another row of variable names and then one or more rows of corresponding data values.
Been trying to use a vector-LOOP-DO IF and LAG(vectorname, n) combination of commands but just can't get them to work. Any hints please? An example is shown below.

INDIVIDUAL_BARCODEaaa ID_CONSENTaaa Q101aaa Q102aaa Q103aaa Q104Aaaa Q104_OTHERaaa Q105aaa Q106aaa AGE_DIFFERENCEaaa Q201
BAISIV 12397-01aaa 1aaa 2aaa 52aaa 1aaa 10aaa Working at namola leubaaaa 6aaa 2aaa aaa 2
BAISIV 12397-02aaa 1aaa 1aaa 13aaa 3aaa 9aaa aaa 1aaa 2aaa aaa 2
BAISIV 12397-03aaa 1aaa 1aaa 13aaa 3aaa 9aaa aaa 1aaa 2aaa aaa 2
BAISIV 12397-05aaa 1aaa 2aaa 17aaa 4aaa 9aaa aaa 1aaa 2aaa aaa 2
BAISIV 12397-06aaa 1aaa 1aaa 15aaa 4aaa 9aaa aaa 1aaa 2aaa aaa 2
INDIVIDUAL_BARCODEaaa ID_CONSENTaaa Q101aaa Q102aaa Q103aaa Q104Aaaa Q104Baaa Q105aaa Q106aaa AGE_DIFFERENCEaaa Q201
BAISIV 11886-03aaa 1aaa 1aaa 24aaa 5aaa 2aaa Clothing retaileraaa 1aaa 2aaa aaa 2
BAISIV 11886-04aaa 1aaa 1aaa 18aaa 5aaa 9aaa aaa 1aaa 2aaa aaa 2
BAISIV 11886-05aaa 1aaa 1aaa 13aaa 3aaa 9aaa aaa 1aaa 2aaa aaa 2
BAISIV 11886-07aaa 2aaa aaa aaa aaa aaa aaa aaa aaa aaa
INDIVIDUAL_BARCODEaaa ID_CONSENTaaa Q101aaa Q102aaa Q103aaa Q104Aaaa Q105aaa Q106aaa AGE_DIFFERENCEaaa Q201aaa Q203aa
BAISIV 12398-01aaa 1aaa 2aaa 29aaa 4aaa 5aaa 1aaa 2aaa aaa 2aaa
BAISIV 12398-02aaa 1aaa 2aaa 19aaa 5aaa 9aaa 1aaa 2aaa aaa 2aaa
BAISIV 12398-07aaa 2aaa aaa aaa aaa aaa aaa aaa aaa aaa
BAISIV 12399-01aaa 1aaa 1aaa 57aaa 4aaa 1aaa Veterenary officeraaa 6aaa 6aaa 34aaa 23
INDIVIDUAL_BARCODEaaa ID_CONSENTaaa Q101aaa Q102aaa Q103aaa Q104Aaaa Q105aaa Q106aaa AGE_DIFFERENCEaaa Q201aaa Q202A
BAISIV 12396-01aaa 1aaa 1aaa 24aaa 6aaa 5aaa 1aaa 2aaa aaa 1aaa 1
BAISIV 12396-03aaa 1aaa 2aaa 20aaa 5aaa 10aaa 1aaa 2aaa aaa 1aaa 2
BAISIV 12396-03aaa 1aaa 2aaa 20aaa 5aaa 10aaa 1aaa 2aaa aaa 1aaa 2
aaa aaa aaa aaa aaa aaa aaa aaa aaa aaa aaa aaa aaa aaa aaa aaa aaa aaa aaa aaa
BAISIV 12326-01aaa 1aaa 2aaa 54aaa 4aaa 6aaa 1aaa 2aaa aaa 2aaa
|
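Rich's Set/Line-counter numbering step can be sketched outside SPSS. This hypothetical Python fragment (the function name and sample lines are invented for illustration; only the "aaa" delimiter comes from the posted data) shows how each header line opens a new set and each data line is counted within it:

```python
# Hypothetical sketch of the Set / Line-number counters described above.
# The "aaa" delimiter comes from the posted data; the function and the
# sample lines are invented for illustration.

def number_lines(lines):
    """Tag each raw line with (set_number, line_number, fields).

    Header lines (starting with INDIVIDUAL_BARCODE) open a new set and
    get line number 0; data lines count up from 1 within their set.
    """
    numbered = []
    set_no = 0
    line_no = 0
    for line in lines:
        fields = [f.strip() for f in line.split("aaa")]
        if fields and fields[0] == "INDIVIDUAL_BARCODE":
            set_no += 1
            line_no = 0
        else:
            line_no += 1
        numbered.append((set_no, line_no, fields))
    return numbered

sample = [
    "INDIVIDUAL_BARCODEaaaID_CONSENTaaaQ101",
    "BAISIV 12397-01aaa1aaa2",
    "INDIVIDUAL_BARCODEaaaID_CONSENTaaaQ102",
    "BAISIV 11886-03aaa1aaa24",
]
numbered = number_lines(sample)
```

From the numbered tuples, the header lines (line number 0) can be pulled into the INDEX file and the rest left behind as DATAONLY, as the reply describes.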
In reply to this post by Ntonghanwah Forcheh
Hi,

I hacked this together and it seems to do the job. Output is in a csv file; see here if the indentation gets messed up: http://pastebin.com/WvHxF44m

import os
import csv
import pprint

# assumes that nobody has filled in a question using 4 consecutive spaces
def read(infile):
    with open(infile, "rb") as f:
        records = {}
        allvarnames = set()
        for line in f.readlines():
            values = None
            if line.startswith("INDIVIDUAL_BARCODE"):
                varnames = line.rstrip("\n").split(" " * 4)
                allvarnames.update(varnames)
            else:
                values = line.rstrip("\n").split(" " * 4)
                id_value = values[0]
            if values:
                record = dict(zip(varnames, values))
                records[id_value] = record
        #pprint.pprint(records)
        return allvarnames, records

def write(data, outfile, missing=""):
    allvarnames, records = data
    with open(outfile, "wb") as w:
        writer = csv.writer(w)
        allvarnames = sorted(list(allvarnames))
        writer.writerow(allvarnames)
        for id_value, record in records.items():
            merged = []
            for varname in allvarnames:
                merged.append(record.get(varname, missing))
            writer.writerow(merged)

if __name__ == "__main__":
    infile = os.path.expanduser("~/Desktop/forcheh.txt")
    outfile = os.path.expanduser("~/Desktop/forcheh_out.csv")
    write(read(infile), outfile)

Regards,
Albert-Jan

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
All right, but apart from the sanitation, the medicine, education, wine, public order, irrigation, roads, a fresh water system, and public health, what have the Romans ever done for us?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
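The core move in that script — zip each header with its data row into a dict, then write every record against the sorted union of all variable names — can be demonstrated in isolation. The two sample records below are made-up fragments of the posted data:

```python
# Isolated demonstration of the merge step in the script above:
# each record is a dict keyed by the header it was read under, and the
# output rows are built against the sorted union of all variable names.
# The two sample records are made-up fragments of the posted data.

records = {
    "BAISIV 12397-01": {"ID": "BAISIV 12397-01", "Q101": "2", "Q104_OTHER": "Working"},
    "BAISIV 11886-03": {"ID": "BAISIV 11886-03", "Q101": "1", "Q104B": "Clothing"},
}

allvarnames = sorted({name for rec in records.values() for name in rec})

rows = []
for rec in records.values():
    # .get() fills variables this record never saw with "" (missing)
    rows.append([rec.get(name, "") for name in allvarnames])
```

Because each record only knows the variables from its own header block, `dict.get` with a missing-value default is what lines every case up under one common set of columns.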
|
In reply to this post by Ntonghanwah Forcheh
Shalom
Here is an example of reading the "messy file" using SPSS syntax.

* >>> use a text editor to change the delimiters ( 4 spaces ) to commas .
GET DATA
  /TYPE=TXT
  /FILE="L:\spss\messy_file.txt"
  /DELCASE=LINE
  /DELIMITERS=","
  /ARRANGEMENT=DELIMITED
  /FIRSTCASE=1
  /IMPORTCASE=ALL
  /VARIABLES=
    V1 A18 V2 A10 V3 A4 V4 A4 V5 A4 V6 A5 V7 A23 V8 A4 V9 A14 V10 A14 V11 A6.
CACHE.
EXECUTE.
* >>>> if there are lines with more than 11 variables, add them .
DATASET NAME messy.
ALTER TYPE v1 to v11 (a30).
STRING #v1 to #v11 id vn1 vn2 (a30).
DO IF index(v1,'INDIVIDUAL_BARCODE') gt 0.
  DO REPEAT vvv=v1 to v11 / vv_vv=#v1 to #v11.
    COMPUTE vv_vv=vvv.
  END REPEAT.
ELSE.
  VECTOR vv2=v1 to v11.
  VECTOR vv1=#v1 to #v11.
  LOOP i=1 to 11.
    COMPUTE id=vv2(1).
    COMPUTE vn1=vv1(i).
    COMPUTE vn2=vv2(i).
    XSAVE OUTFILE="L:\spss\messy_file2.sav" /KEEP=id vn1 vn2.
  END LOOP.
END IF.
EXECUTE.
GET FILE="L:\spss\messy_file2.sav".
SORT CASES BY id vn1 vn2.
ADD FILES FILE=* /BY id vn1 vn2 /FIRST=first.
* >>>>> there were duplicate values in the example; I deleted them .
SELECT IF first eq 1.
CASESTOVARS
  /ID=id
  /INDEX=vn1
  /GROUPBY=VARIABLE.
* >>>> use ADD FILES FILE=* /KEEP=<variable names> to restore the original variable order .
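For readers who don't run SPSS, the reshaping that the XSAVE loop and CASESTOVARS perform here — long (id, varname, value) records pivoted into one wide case per id — can be sketched in plain Python (the sample ids and values are invented for illustration):

```python
# Sketch of the long-to-wide pivot that CASESTOVARS performs above:
# each (id, varname, value) triple lands in one wide row per id.
# Sample ids and values are invented for illustration.

long_rows = [
    ("BAISIV 12397-01", "Q101", "2"),
    ("BAISIV 12397-01", "Q102", "52"),
    ("BAISIV 12398-01", "Q101", "2"),
]

wide = {}
for case_id, varname, value in long_rows:
    # setdefault creates the case's row on first sight of its id,
    # then each varname becomes a column in that row
    wide.setdefault(case_id, {})[varname] = value
```

Going through a long intermediate file, as the syntax does, is what makes the varying headers harmless: every value carries its own variable name, so the pivot can align columns regardless of which block the value came from.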
Here is the file I used:

INDIVIDUAL_BARCODE,ID_CONSENT,Q101,Q102,Q103,Q104A,Q104_OTHER,Q105,Q106,AGE_DIFFERENCE,Q201
BAISIV 12397-01,1,2,52,1,10,Working at namola leuba,6,2,,2
BAISIV 12397-02,1,1,13,3,9,,1,2,,2
BAISIV 12397-03,1,1,13,3,9,,1,2,,2
BAISIV 12397-05,1,2,17,4,9,,1,2,,2
BAISIV 12397-06,1,1,15,4,9,,1,2,,2
INDIVIDUAL_BARCODE,ID_CONSENT,Q101,Q102,Q103,Q104A,Q104B,Q105,Q106,AGE_DIFFERENCE,Q201
BAISIV 11886-03,1,1,24,5,2,Clothing retailer,1,2,,2
BAISIV 11886-04,1,1,18,5,9,,1,2,,2
BAISIV 11886-05,1,1,13,3,9,,1,2,,2
BAISIV 11886-07,2,,,,,,,,
INDIVIDUAL_BARCODE,ID_CONSENT,Q101,Q102,Q103,Q104A,Q105,Q106,AGE_DIFFERENCE,Q201,Q203aa
BAISIV 12398-01,1,2,29,4,5,1,2,,2
BAISIV 12398-02,1,2,19,5,9,1,2,,2
BAISIV 12398-07,2,,,,,,,,
BAISIV 12399-01,1,1,57,4,1,Veterenary officer,6,6,34,23
INDIVIDUAL_BARCODE,ID_CONSENT,Q101,Q102,Q103,Q104A,Q105,Q106,AGE_DIFFERENCE,Q201,Q202A
BAISIV 12396-01,1,1,24,6,5,1,2,,1,1
BAISIV 12396-03,1,2,20,5,10,1,2,,1,2
BAISIV 12396-03,1,2,20,5,10,1,2,,1,2
BAISIV 12326-01,1,2,54,4,6,1,2,,2

Hillel Vardi
BGU

On 22/09/2013 23:11, Ntonghanwah Forcheh wrote:
|
In reply to this post by hillel vardi
Dear all, many thanks for the responses to my question. After posting, I "disappeared" and have only just returned to the office. I shall attempt the suggestions provided and give feedback very soon.
I notice that when the tabular data were posted, the table lines were converted to strings of aaa. These were not part of the data. My apologies.

forcheh

On Tue, Sep 24, 2013 at 12:04 AM, hillel vardi <[hidden email]> wrote:
|
The presence of the "aaa" strings was useful, in this way:
It showed that there is *some* delimiter between fields, not merely some blanks. It is a delimiter that you will conveniently make use of when reading the input. It identifies the text field, and it marks off some included missing fields.

--
Rich Ulrich

Date: Thu, 26 Sep 2013 15:31:04 +0200
From: [hidden email]
Subject: Re: Extracting data from a messy file.
To: [hidden email]

Dear all, many thanks for the responses to my question. After posting, I "disappeared" and have only just returned to the office. I shall attempt the suggestions provided and give feedback very soon.
I notice that when the tabular data were posted, the table lines were converted to strings of aaa. These were not part of the data. My apologies. forcheh
[snip, previous] |
________________________________
> From: Rich Ulrich <[hidden email]>
> To: [hidden email]
> Sent: Thursday, September 26, 2013 6:33 PM
> Subject: Re: [SPSSX-L] Extracting data from a messy file.
>
> The presence of the "aaa" strings was useful, in this way:
>
> It showed that there is *some* delimiter between fields, not merely
> some blanks. It is a delimiter that you will conveniently make use of
> when reading the input. It identifies the text field, and it marks
> off some included Missing fields.

There was no aaa in the plain text version I received. I concluded that 4 spaces were the delimiter, and the resulting file looked right. |
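The distinction the two readings rest on is easy to check: splitting on an explicit delimiter preserves empty (missing) fields, while a bare whitespace split collapses them and breaks multi-word values. The sample record below is invented for illustration:

```python
# An explicit 4-space delimiter keeps empty (missing) fields intact;
# a plain whitespace split loses them and breaks the two-word ID apart.
# The sample record is invented for illustration.

fields = ["BAISIV 11886-04", "1", "1", "18", "", "2"]
line = "    ".join(fields)       # 4-space delimited; one field is empty

by_delim = line.split(" " * 4)   # round-trips: missing field preserved
by_space = line.split()          # missing field gone, ID split in two
```

This is why both the Python script and the SPSS approach treat the four spaces (or the commas substituted for them) as a real delimiter rather than reading the file as freefield whitespace-separated data.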