SPSSX Discussion

Numeric transliteration, anomalies and outliers

Hector Maletta

As promised in my latest posting for the Outliers thread, I'd like to share
my experiences with numerals in a survey originally taken in Arabic script
in the field and then entered into an English-language database. Errors
arising from numerical transliteration caused a lot of anomalous values
revealed later as outliers in the broader sense (they may have caused other
errors that remain unrevealed because they happened to fall within normal or
acceptable ranges).

To be precise about vocabulary in this message: our Western numerals are
technically called 'Arabic' (to distinguish them from Roman numerals
like XVII), and are in fact a stylized version of the actual numerals used
today in the Arabic language (or other languages using Arabic script). To
avoid confusion, I will refer to 'Arab' numerals and 'Western' numerals in
this posting. That we need to use these roundabout designations is itself a
sort of homage and recognition of our debt to the enlightened Muslim
scholars who a millennium ago invented the modern numerical system we all
use nowadays, including the important step of introducing the zero, of which
the Greek and Roman mathematicians (let alone Sumerians, Babylonians,
Chinese and Egyptians) never had a clue (though there are doubts about some
other cultures such as the Maya, which may have had the zero after all).

After the survey was applied to thousands of peasants in their own language,
data entry was done in an Access database. The data entry form contained the
questions in both English and local language and the alternative qualitative
responses also in both languages. Regarding numbers, the data entry clerks
were expected to transcribe the Arab numerals into Western numerals. They
were all local people with some computer literacy, working ability in
English, and of course a full familiarity with the Latin alphabet and the
Western numerals. No problem was expected.

That was wrong, alas. There were no significant problems with text answers,
but many problems with numbers. The problems stemmed mainly from the almost
automatic manner in which our brains process counting and number-processing
operations (bilinguals and polyglots, for instance, can seldom count money
or other objects in anything but their own native language, except at
painstakingly slow speed). This was compounded by the requirement that data
entry proceed quickly. The main causes of the problems were the following:

a. The Arab zero is represented by a dot or period, unlike the oval
zero of the Western system. Instead, an oval symbol is used in Arab numerals
to represent the number 5. The dot, of course, is used in the West (or
rather in Northern Europe and their former colonies) to represent the split
between integers and decimals, a function accomplished by a comma in the
Arab system (and also in some Western countries more influenced in the past
by the Arabs, like Spain or Italy).
b. There are similarities, inversions or small differences between
other numbers in both systems (say between the 2, the 3, the 4 and the 7).
c. Arabic (and other languages using that script) is written like
Hebrew from right to left, unlike Western languages which are written from
left to right. But there is a trick: even if written the other way, numbers
in both systems LOOK THE SAME. For the figure 24 we write first the 2 and
then the 4, while in Arabic it is first the 4 and then the 2, but in the end
both figures look like 24. It is similar to saying 'twenty-four' nowadays and
'four-and-twenty' in centuries past. This means that when you translate from
Arabic into some Western language, while all the rest of a text is reversed,
numbers should be kept in the same un-reversed order. The un-reversed way of
our "Arabic" numerical system is revealed by the fact that we do everything
from left to right, but we add numbers beginning at the right (the units)
and progressing towards the left with the tens, the hundreds and so on, and
do the same with subtraction or multiplication, following Arabic algebra
("algebra" is an Arabic word, by the way).

As a consequence of (a), data entry clerks, or surveyors in the field, are
often confused. One of the most usual errors is a duplication of zeroes when
originally it was one zero and one decimal point, like the original number
10.55 (where a careless surveyor put a Western decimal separator point
instead of an Arabic decimal comma) transliterated as 10055. Now imagine
this is the acreage of a farm and think of the consequences for your analysis
of land tenure, seed rates or whatever. If the oval 5s are further
transliterated as zeroes, the original 10.55 may be transmogrified into
10000 (though it is unlikely that the same clerk makes the two errors
together; more likely, one may be taking the decimal dots for zeroes,
rendering 10.55 as 10055, while another takes the 5s for zeroes and renders
1455 as 1400).

In some cases, a survey-taker in the field may find his pencil is failing,
and tries to mark his dot better for the zero in a number 10, marked as 1.
As he is writing standing up on rough ground in some remote farm, his effort
to mark his dot for a second or third time may end up marking two adjacent
dots, and his 1. may become 1.. or 1... that may be later read as 100 or
1000 by some frantic data-entry clerk working in a hurry. In related
situations, a short decimal comma may be taken for a dot, or an emphatically
long comma may be taken for an Arabic-script numeral 1, which looks
suspiciously similar to an elongated comma.

As a consequence of (b) some numbers are easily confused, and a farmer of 75
may be marked as 25 or something similar.

Factor (c) produces, through automatic and thoughtless application of the
text-reversal rule in numerical data entry, an inversion of the digits, so
that for instance 48 becomes 84.
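
A reversal of this kind can sometimes be spotted during cleaning: if the
entered value is outside the plausible range for the question but its
mirror image is inside, the mirror image is a candidate for the original.
The sketch below assumes non-negative whole numbers, and the range limits
in the example are invented for illustration. Note that it cannot detect
reversals where both values are plausible (48 and 84 as ages, say).

```python
from typing import Optional


def reversed_digits(value: int) -> int:
    """Return the number read with its digits in the opposite order."""
    return int(str(value)[::-1])


def suggest_unreversed(value: int, low: int, high: int) -> Optional[int]:
    """Suggest the mirrored value when only the mirror is in [low, high].

    Returns None when the entered value is already plausible, or when the
    mirrored value is not plausible either.
    """
    if low <= value <= high:
        return None
    candidate = reversed_digits(value)
    return candidate if low <= candidate <= high else None


# Hypothetical question: household size, assumed plausible range 1 to 40.
print(suggest_unreversed(51, 1, 40))   # 15
# Age 84 is itself plausible, so nothing is suggested.
print(suggest_unreversed(84, 15, 99))  # None
```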

Most of this (except the faulty pencil problem in the field and similar
troubles) could have been avoided if the entire process had been done in the
local language, and the figures then translated into the Western numerical
system directly by computer. There are in fact versions of the most common
database management software in Arabic, but not in other languages using the
same script. The above story occurred in Afghanistan, where some of the
farmers in the survey were interviewed in Dari (one of the official
languages of Afghanistan, actually a form of Persian) and others in Pashtu
(the other official language, an ancient tongue spoken by the largest ethnic
group, the Pashtun, not similar to anything else), both using Arabic script
but completely different from Arabic. No standard software existed at the
time for either of them. That was the reason alleged for using English in
data entry, requiring DE clerks with some knowledge of English to perform
the feat (difficult even for accomplished polyglots) of transliterating the
numbers at full speed in their minds.
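
For what it is worth, the computer side of that conversion is simple
wherever the figures can be keyed in the original script. The sketch below
is a minimal illustration in Python, not a description of any software
available for the survey. It assumes the text is Unicode and uses either the
Arabic-Indic digits (U+0660 to U+0669) or the Extended Arabic-Indic digits
(U+06F0 to U+06F9, the set generally used for Persian-family text); the
function name is made up. Unicode stores a number with its most significant
digit first even inside right-to-left text, so no reversal is needed.

```python
# Map both digit sets to Western digits.
_TABLE = {}
for zero_code_point in (0x0660, 0x06F0):
    for value in range(10):
        _TABLE[zero_code_point + value] = str(value)
_TABLE[0x066B] = "."   # ARABIC DECIMAL SEPARATOR becomes a decimal point
_TABLE[0x066C] = None  # ARABIC THOUSANDS SEPARATOR is dropped


def to_western(text: str) -> str:
    """Replace Arabic-script digits and separators with Western ones."""
    return text.translate(_TABLE)


# 10.55 written with Arabic-Indic digits and the Arabic decimal separator.
print(float(to_western("\u0661\u0660\u066b\u0665\u0665")))  # 10.55
# 48 written with Extended Arabic-Indic digits.
print(int(to_western("\u06f4\u06f8")))                       # 48
```

Because the mapping is one character to one character, none of the
confusions listed under (a), (b) and (c) can arise at this step.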

There was of course some quality control, but not full double data entry due
to budgetary and time limitations. The errors in the database caused a
significant number of anomalous figures in many questions of the survey,
requiring extensive cleaning. Even after the official cleaning phase ended,
and preliminary reports were released for practical use, some more cases
were still revealed during more in-depth analyses, as new ratios appeared
that were out of the acceptable range or had somewhat unreasonable values.
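
The kind of ratio check that revealed those late cases can be sketched as
follows. Everything specific here is an assumption made for illustration:
the field names (id, seed_kg, area_ha), the sample records and the
acceptable range of 50 to 300 kg per hectare are invented, not taken from
the survey.

```python
def flag_ratio_outliers(records, numerator, denominator, low, high):
    """Return (id, ratio) pairs whose ratio falls outside [low, high].

    A zero denominator is flagged with a ratio of None.
    """
    flagged = []
    for rec in records:
        if rec[denominator] == 0:
            flagged.append((rec["id"], None))
            continue
        ratio = rec[numerator] / rec[denominator]
        if not (low <= ratio <= high):
            flagged.append((rec["id"], ratio))
    return flagged


# Hypothetical records: farm 2 has 10.55 keyed as 10055.
farms = [
    {"id": 1, "seed_kg": 1500, "area_ha": 10.55},
    {"id": 2, "seed_kg": 1500, "area_ha": 10055},
]

for farm_id, ratio in flag_ratio_outliers(farms, "seed_kg", "area_ha", 50, 300):
    print(farm_id, ratio)
```

Neither figure for farm 2 need look impossible on its own; it is the ratio
between them that gives the error away.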

Morals of the story:

1. Beware of cultural and linguistic factors in statistics, number
systems and data entry.
2. Use local languages and numerical systems as far as possible, and
reliable transliteration.
3. Double check, triple check, quadruple check the quality of data
entry.
4. Provide ample supplies of good pencils and pencil sharpeners to your
field staff, and (in Arabic script countries) tell them to take special care
with dots, commas and all the other problems pointed out here.
5. Remember how much we owe to the Arabs.

Hector