As promised in my latest posting for the Outliers thread, I'd like to share my experiences with numerals in a survey originally taken in Arabic script in the field and then entered into an English-language database. Errors arising from numerical transliteration caused a lot of anomalous values, revealed later as outliers in the broader sense (they may have caused other errors that remain unrevealed because they happened to fall within normal or acceptable ranges).

To be precise about vocabulary in this message: our Western numerals are technically called 'Arabic' (to distinguish them from Roman numerals like XVII), and are in fact a stylized version of the actual numerals used today in the Arabic language (or other languages using Arabic script), so to avoid confusion I will refer to 'Arab' numerals and 'Western' numerals in this posting. That we need to use these roundabout designations is itself a sort of homage to, and recognition of, our debt to the enlightened Muslim scholars who a millennium ago invented the modern numerical system we all use nowadays, including the important step of introducing the zero, of which the Greek and Roman mathematicians (let alone Sumerians, Babylonians, Chinese and Egyptians) never had a clue (though there are doubts about some other cultures such as the Maya, which may have had the zero after all).

After the survey was applied to thousands of peasants in their own language, data entry was done in an Access database. The data entry form contained the questions in both English and the local language, and the alternative qualitative responses also in both languages. Regarding numbers, the data entry clerks were expected to transcribe the Arab numerals into Western numerals. They were all local people with some computer literacy, working ability in English, and of course full familiarity with the Latin alphabet and the Western numerals. No problem was expected. That was wrong, alas. There were no significant problems with text answers, but many problems with numbers.
The problems stemmed mainly from the almost automatic manner in which our brains process counting and number-processing operations (bilinguals and polyglots, for instance, can seldom count money or other objects in anything but their own native language, except at painstakingly slow speed). This was compounded by the requirement that data entry proceed quickly. The main causes of the problems were the following:

a. The Arab zero is represented by a dot or period, unlike the oval zero of the Western system. Instead, an oval symbol is used in Arab numerals to represent the number 5. The dot, of course, is used in the West (or rather in Northern Europe and its former colonies) to mark the split between integers and decimals, a function accomplished by a comma in the Arab system (and also in some Western countries more influenced in the past by the Arabs, like Spain or Italy).

b. There are similarities, inversions or small differences between other numbers in both systems (say between the 2, the 3, the 4 and the 7).

c. Arabic (and other languages using that script) is written, like Hebrew, from right to left, unlike Western languages, which are written from left to right. But there is a catch: even if written the other way, numbers in both systems LOOK THE SAME. For the figure 24 we write first the 2 and then the 4, while in Arabic it is first the 4 and then the 2, but in the end both figures look like 24. It is similar to saying 'twenty-four' nowadays and 'four-and-twenty' in centuries past. This means that when you translate from Arabic into some Western language, while all the rest of a text is reversed, numbers should be kept in the same un-reversed order.
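As a rough illustration of points (a) and (c), here is a minimal Python sketch of what a mechanical conversion could look like. It was not part of the survey workflow. The function name `to_western` is made up for this example, and it assumes the numbers are already available as Unicode text using the Arabic-Indic digits (U+0660 to U+0669) or the Extended Arabic-Indic digits (U+06F0 to U+06F9, the variant used for Persian and related languages), with U+066B as the decimal separator.

```python
# Hypothetical helper for illustration only; not the software used in the survey.
# Digit tables are built from Unicode code points to avoid typing errors.
ARABIC_INDIC = "".join(chr(0x0660 + i) for i in range(10))
EXTENDED_ARABIC_INDIC = "".join(chr(0x06F0 + i) for i in range(10))
WESTERN = "0123456789"

TABLE = {}
for source_digits in (ARABIC_INDIC, EXTENDED_ARABIC_INDIC):
    for source_digit, western_digit in zip(source_digits, WESTERN):
        TABLE[ord(source_digit)] = western_digit

TABLE[0x066B] = "."   # Arabic decimal separator -> Western decimal point
TABLE[0x066C] = None  # Arabic thousands separator is dropped


def to_western(text: str) -> str:
    """Replace each Arab digit by its Western equivalent, keeping the order."""
    return text.translate(TABLE)


# The number 10.55 written with Arabic-Indic digits and the Arabic separator.
sample = chr(0x0661) + chr(0x0660) + chr(0x066B) + chr(0x0665) + chr(0x0665)
print(to_western(sample))  # prints 10.55
```

Note that the digits are stored most-significant first in both systems, so the conversion is a digit-by-digit substitution with no reversal, which is exactly the rule in (c). A table like this never mistakes the dot-shaped zero for a decimal point or the oval 5 for a zero.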
The un-reversed way of our "Arabic" numerical system is revealed by the fact that we do everything from left to right, but we add numbers beginning at the right (the units) and progressing towards the left with the tens, the hundreds and so on, and do the same with subtraction or multiplication, following Arabic algebra ("algebra" is an Arabic word, by the way).

As a consequence of (a), data entry clerks, or surveyors in the field, are often confused. One of the most usual errors is a duplication of zeroes when originally there was one zero and one decimal point, like the original number 10.55 (where a careless surveyor put a Western decimal point instead of an Arabic decimal comma) transliterated as 10055. Now imagine this is the acreage of a farm and think of the consequences for your analysis of land tenure, seed rates or whatever. If the oval 5s are further transliterated as zeroes, the original 10.55 may be transmogrified into 10000 (though it is unlikely that the same clerk makes the two errors together; more likely, one may be taking the decimal dots for zeroes, rendering 10.55 as 10055, while another takes the 5s for zeroes and renders 1455 as 1400).

In some cases, a survey-taker in the field may find his pencil is failing, and tries to mark more clearly the dot for the zero in a number 10, written as 1. As he is writing standing up on rough ground in some remote farm, his effort to mark his dot a second or third time may end up marking two or three adjacent dots, and his 1. may become 1.. or 1... that may later be read as 100 or 1000 by some frantic data-entry clerk working in a hurry. In related situations, a short decimal comma may be taken for a dot, or an emphatically long comma may be taken for an Arabic-script numeral 1, which looks suspiciously similar to an elongated comma.

As a consequence of (b), some numbers are easily confused, and a farmer aged 75 may be recorded as 25 or something similar.
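For the dot-versus-zero confusion described under (a), a cleaning step might list the alternative readings of a suspicious value so that someone can check them against the paper form. The following Python sketch is only an illustration under that assumption. The function names and the acceptable range in the usage note are hypothetical, and this is not the procedure actually used in the survey.

```python
# Illustrative sketch only: function names and ranges are hypothetical.
from __future__ import annotations


def decimal_point_candidates(entered: str) -> list[str]:
    """List readings in which one interior '0' was really a decimal point."""
    candidates = []
    for i, ch in enumerate(entered):
        # Only an interior zero can stand for a mis-keyed decimal point.
        if ch == "0" and 0 < i < len(entered) - 1:
            candidates.append(entered[:i] + "." + entered[i + 1:])
    return candidates


def within_range(candidates: list[str], low: float, high: float) -> list[str]:
    """Keep only the candidate readings that fall inside [low, high]."""
    return [c for c in candidates if low <= float(c) <= high]


print(decimal_point_candidates("10055"))  # prints ['1.055', '10.55']
```

With a made-up acceptable range of 5 to 500 for the variable in question, `within_range(decimal_point_candidates("10055"), 5, 500)` leaves only `'10.55'`. The sketch proposes readings and does not decide between them. The original questionnaire remains the only authority.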
Factor (c) produces, by automatic and thoughtless application of the text-reversal rule in numerical data entry, an inversion of the numbers, so that for instance 48 becomes 84.

Most of this (except the faulty pencil problem in the field and similar troubles) could have been avoided if the entire process had been done in the local language, and the figures then translated into the Western numerical system directly by computer. There are in fact versions of the most common database management software in Arabic, but not in other languages using the same script. The above story occurred in Afghanistan, where some of the farmers in the survey were interviewed in Dari (one of the official languages of Afghanistan, actually a form of Persian) and others in Pashtu (the other official language, an ancient tongue spoken by the largest ethnic group, the Pashtun, not similar to anything else), both using Arabic script but completely different from Arabic. No standard software existed at the time for either of them. That was the reason alleged for using English in data entry, requiring data entry clerks with some knowledge of English to perform the feat (difficult even for accomplished polyglots) of transliterating the numbers at full speed in their minds.

There was of course some quality control, but not full double data entry, due to budgetary and time limitations. The errors in the database caused a significant number of anomalous figures in many questions of the survey, requiring extensive cleaning. Even after the official cleaning phase ended and preliminary reports were released for practical use, more cases were still revealed during more in-depth analyses, as new ratios appeared that were out of the acceptable range or had somewhat unreasonable values.

Morals of the story:

1. Beware of cultural and linguistic factors in statistics, number systems and data entry.
2. Use local languages and numerical systems as far as possible, and reliable transliteration.
3. Double check, triple check, quadruple check the quality of data entry.
4. Provide ample supplies of good pencils and pencil sharpeners to your field staff, and (in Arabic script countries) tell them to take special care with dots, commas and all the other problems pointed out here.
5. Remember how much we owe to the Arabs.

Hector