Calculating Gini coefficients for each subset (villages) of large data set

Mattias
Dear all,

I am working on a household dataset from India (n=21331) and am trying to calculate a Gini coefficient for income for each village (n=1451). I understand the formula and syntax for calculating Gini coefficients and am using the syntax below.

My problem is that I can’t seem to figure out the proper way to calculate one for each village (variable name IDPSU). I have tried DO IF loops, SELECT IF and FILTER, but can’t get it right. I may very well be missing something very simple. I would be very grateful for any help in solving this.

Thanks in advance
Mattias

* Step 1.
SORT CASES BY INCOME.
* Step 2.
AGGREGATE OUTFILE = *
/PRESORTED
/BREAK = INCOME
/persons = N .
WEIGHT BY persons.
* Step 3.
COMPUTE brk = 1.
AGGREGATE OUTFILE = incagg.sav
/BREAK = brk
/suminc = SUM(INCOME).


MATCH FILES / FILE = * / TABLE = incagg.sav / BY brk .
EXECUTE.


* Step 4 .
DO IF ($CASENUM = 1).
+ COMPUTE cincome = persons * income .
ELSE.
+ COMPUTE cincome = LAG(cincome) + persons * income .
END IF.
* Step 5 .
COMPUTE pcinc = cincome/suminc .
EXECUTE.


* Step 6.
RANK VARIABLES=income (A)
/RFRACTION into cdfinc
/PRINT=YES
/TIES=HIGH .


* Step 7.
COMPUTE d1 = ($casenum = 1).
COMPUTE d2 = ($casenum = 1).
* Note that it doesn't matter whether D1 or D2 is the Y variable
* in the D1-D2 pair.
* D1 and D2 are identical and are created to allow you to draw a
* diagonal line on the graph.
GRAPH
/SCATTERPLOT(OVERLAY)=cdfinc d2 WITH pcinc d1 (PAIR)
/MISSING=LISTWISE
/TITLE= 'Lorenz Curve for Income'.


* Step 8.
* Calculate and print the Gini coefficient.
* For last case, LAREA is area under the Lorenz curve.
DO IF ($casenum = 1) .
+ COMPUTE larea = 0.
ELSE.
+ COMPUTE larea = LAG(larea) +
(cdfinc - LAG(cdfinc)) * (pcinc + LAG(pcinc))/2 .
END IF.
IF (cdfinc = 1) gini = (.5 - larea)/.5 .
REPORT
/VARIABLES gini (VALUES)
/BREAK (TOTAL) '' (SKIP(1))
/SUMMARY MAX( gini) SKIP(1) '' .

Re: Calculating Gini coefficients for each subset (villages) of large data set

David Marso
Administrator
Maybe you need to include IDPSU in the code? Say, in the AGGREGATE as a BREAK variable and in the LAG logic?
You don't provide the formula for Gini or a usable reference, and I don't feel like looking it up and reinventing code.
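
One way to read this suggestion, sketched in Python rather than SPSS syntax: group the households by village first, then apply the Lorenz-curve (trapezoid) calculation within each group. The sample rows and the function names are illustrative assumptions, not part of the original syntax.

```python
from collections import defaultdict


def gini(incomes):
    """Gini coefficient as 1 - 2 * (area under the Lorenz curve)."""
    values = sorted(incomes)
    n = len(values)
    total = sum(values)
    if n == 0 or total == 0:
        return float("nan")  # undefined without any income
    area = 0.0
    cum_prev = 0.0  # cumulative income share before the current household
    for x in values:
        cum = cum_prev + x / total
        # Trapezoid of width 1/n between consecutive Lorenz points.
        area += (cum_prev + cum) / 2 / n
        cum_prev = cum
    return 1 - 2 * area


def gini_by_group(rows):
    """rows: iterable of (village_id, income) pairs."""
    groups = defaultdict(list)
    for village, income in rows:
        groups[village].append(income)
    return {village: gini(values) for village, values in groups.items()}


# Hypothetical data: village 1 is perfectly equal, village 2 is not.
rows = [(1, 10), (1, 10), (1, 10), (2, 0), (2, 0), (2, 30)]
result = gini_by_group(rows)
for village, g in sorted(result.items()):
    print(village, round(g, 4))
```

The key point is that sorting, cumulating and the area calculation all restart within each village, which is what adding IDPSU to the BREAK and LAG logic would achieve in SPSS.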
Please reply to the list and not to my personal email.
Those desiring my consulting or training services please feel free to email me.
---
"Nolite dare sanctum canibus neque mittatis margaritas vestras ante porcos ne forte conculcent eas pedibus suis."
Cum es damnatorum possederunt porcos iens ut salire off sanguinum cliff in abyssum?"

Re: Calculating Gini coefficients for each subset (villages) of large data set

jkpeck
In reply to this post by Mattias
Here is an example of calculating Gini coefficients across split files.  It requires the R Essentials.  If you are interested in comparing these villages, some notion of a confidence interval for the Gini coefficients might be helpful, so this code produces bootstrapped CIs as well.

This example uses the employee data.sav file shipped with Statistics and computes the Gini statistics for salary for each job category.

sort cases by jobcat.
split file by jobcat.
begin program r.
tryCatch(library(DescTools), error=function(e) {install.packages("DescTools",
    repos="http://cran.us.r-project.org")
    library(DescTools)}
)
while (!spssdata.IsLastSplit()) {
    dta = spssdata.GetSplitDataFromSPSS("jobcat salary")
    g = Gini(dta[[2]], conf.level=.95)
    print(sprintf("jobcat = %s %s %s %s", dta[[1,1]], g[1],g[2],g[3]))
}
end program.

The first statement in the R code, tryCatch..., tries to load the DescTools package and, if that fails, installs it from the R repository and loads it again.  Once the package is installed, you could replace that code with a plain library(DescTools) call.
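
For readers without a working R plugin, the idea of a bootstrapped confidence interval can be sketched in plain Python. This uses a simple percentile bootstrap, which is an assumption and not necessarily the method DescTools applies internally; the incomes and function names are made up for illustration.

```python
import random


def gini(values):
    """Gini coefficient from the rank-weighted formula on sorted data."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return float("nan")
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n


def bootstrap_ci(values, n_boot=2000, conf_level=0.95, seed=0):
    """Percentile bootstrap interval for the Gini coefficient."""
    rng = random.Random(seed)
    n = len(values)
    stats = []
    for _ in range(n_boot):
        resample = [rng.choice(values) for _ in range(n)]
        g = gini(resample)
        if g == g:  # skip undefined (NaN) replicates
            stats.append(g)
    stats.sort()
    alpha = 1 - conf_level
    lo_idx = int(alpha / 2 * len(stats))
    hi_idx = min(len(stats) - 1, int((1 - alpha / 2) * len(stats)))
    return stats[lo_idx], stats[hi_idx]


# Hypothetical incomes for one small village.
incomes = [12, 15, 18, 20, 22, 25, 30, 45, 60, 150]
estimate = gini(incomes)
lower, upper = bootstrap_ci(incomes)
print(round(estimate, 3), round(lower, 3), round(upper, 3))
```

With only a handful of households per group the interval will typically be wide, which is the reason for reporting it alongside the point estimate.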


Re: Calculating Gini coefficients for each subset (villages) of large data set

Mattias
Hi,
Thank you for helping out. I tried running the syntax you suggested but get the following error message:

Error # 6887.  Command name: begin program
External program failed during initialization.
Execution of this command stops.
Additional error message: create startx process is failure.

Could it be that the syntax  is no longer up to date?
Thanks
Mattias

Re: Calculating Gini coefficients for each subset (villages) of large data set

David Marso
Administrator
More likely that Python or R is absent or not properly installed?

Re: Calculating Gini coefficients for each subset (villages) of large data set

Jon K Peck
In reply to this post by Mattias
This indicates a problem with the R plugin installation.  Did you install the appropriate version of R for your version of Statistics and the corresponding R Essentials (including matching the bit size)?

Run this to see if your R plugin connection works at all.
begin program r.
print(sessionInfo())
end program.


Jon Peck (no "h") aka Kim
Senior Software Engineer, IBM
[hidden email]
phone: 720-342-5621





=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD



Re: Calculating Gini coefficients for each subset (villages) of large data set

Mattias
I have never used R before and installed it only for this. When I run the last syntax you suggested, Jon, I get the following error message:

begin program r.
print(sessionInfo())
end program.

>Warning # 6894.  Command name: begin program
>The external program exit unexpectedly and lost its content, a new exteranl
>program will startup to execute the rest of job.

>Error # 6887.  Command name: begin program
>External program failed during initialization.
>Execution of this command stops.
Additional error message: create startx process is failure.

Re: Calculating Gini coefficients for each subset (villages) of large data set

Mattias
In reply to this post by Jon K Peck
I am running SPSS 20 on Windows 7 Professional 64-bit and have installed R 3.0.2 (32- and 64-bit) and now also the Essentials for R for SPSS 20. I keep getting the same error message. I then tried installing R 2.12.1, but the same problem remains.

I also read here (https://www.ibm.com/developerworks/community/forums/html/topic?id=77777777-0000-0000-0000-000014729480)
that I am not the only person with the same problem.

What should I do?

Re: Calculating Gini coefficients for each subset (villages) of large data set

Jon K Peck
Statistics 20 uses only R 2.12.x.  Any other R versions are irrelevant.  Make sure that you install the 32- or 64-bit version of the R Essentials that matches the 32- or 64-bit version of Statistics.  Just because you are running 64-bit Win7 doesn't mean that you are running 64-bit Statistics.



Re: Calculating Gini coefficients for each subset (villages) of large data set

Richard Ristow
In reply to this post by Mattias
A side note.  At 08:19 AM 2/14/2014, Mattias wrote:

>I am working on a household dataset from India (n=21331) and am
>trying to calculate a Gini coefficient for income for each village (n=1451).

That's a mean of 21331/1451=14.7 households per village. Isn't that a
little small for calculating a Gini coefficient? Should villages,
especially smaller ones, be combined into, say, regional groups
before calculating the Gini coefficient?


Re: Calculating Gini coefficients for each subset (villages) of large data set

Jon K Peck
And the population is probably not evenly distributed.  Using the Gini code I suggested, though, will include confidence intervals in the output.



Re: Calculating Gini coefficients for each subset (villages) of large data set

Mattias
I finally managed to get the Essentials for R working (at least I get no error messages…)

First - Richard wrote:

>That's a mean of 21331/1451=14.7 households per village. Isn't that a
>little small for calculating a Gini coefficient?

Yes it is, which is why I became interested in the syntax Jon suggested: it provides CIs.

I have data on household income from the variable INCOME and village id from the variable IDPSU. Following Jon’s original suggestion I ran the following syntax:

sort cases by IDPSU.
split file by IDPSU.
begin program r.
tryCatch(library(DescTools), error=function(e) {install.packages("DescTools",
    repos="http://cran.us.r-project.org")
    library(DescTools)}
)
while (!spssdata.IsLastSplit()) {
    dta = spssdata.GetSplitDataFromSPSS("IDPSU INCOME")
    g = Gini(dta[[2]], conf.level=.95)
    print(sprintf("IDPSU = %s %s %s %s", dta[[1,1]], g[1],g[2],g[3]))
}
end program.
split file off.

I get no error messages but also no Gini coefficients computed… What is not working?

Re: Calculating Gini coefficients for each subset (villages) of large data set

Richard Ristow
In reply to this post by Jon K Peck
I'd written,
>>[You have] a mean of 21331/1451=14.7 households per village. Isn't
>>that a little small for calculating a Gini coefficient?

At 02:25 PM 2/20/2014, Jon K Peck wrote:
>And the population is probably not evenly distributed.

Most definitely. If the mean is 15 households per village, you'll
probably have many fewer for some villages, making the problem worse.

>Using the Gini code I suggested, though, will include confidence
>intervals in the output.

However -- and this isn't about SPSS, or R, or the package -- Gini
coefficients derived from small samples, and their calculated
confidence intervals, should be viewed skeptically.

If the distribution of income (or whatever) is highly concentrated, a
small sample may underestimate the Gini coefficient badly, through
failing to include any of the small, very-high-income subgroup in the
sample. I doubt that any statistical methodology can compensate for this.


Re: Calculating Gini coefficients for each subset (villages) of large data set

Garry Gelade
Hi Richard

I was interested in your comment because it so happens I am calculating a
lot of Ginis myself, (though I'm using Python not R).

You said "a small sample may underestimate the Gini coefficient badly,
through failing to include any of the small, very-high-income subgroup in
the sample."

I'm afraid I don't get your point.  Of course the Gini coefficient
calculated on a small sample will be subject to sampling error, and
unrepresentative of the population at large.  But it seems to me Mattias is
not trying to estimate the Gini coefficient for the population at large; he
is interested in the village level of analysis, and the distribution of
village Ginis.  If the high income group in the population is small, few
villages will have high income members, and the distribution of village
Ginis will (correctly) reflect that.

Regards

Garry


Re: Calculating Gini coefficients for each subset (villages) of large data set

Richard Ristow
At 03:54 AM 2/24/2014, Garry Gelade wrote:

>You said "a small sample may underestimate the Gini coefficient
>badly, through failing to include any of the small, very-high-income
>subgroup in the sample."
>
>Of course the Gini coefficient calculated on a small sample will be
>subject to sampling error, and unrepresentative of the population at
>large,  But it seems to me Mattias is not trying to estimate the
>Gini coefficient for the population at large, he is interested in
>the village level of analysis, and the distribution of village
>Ginis. If the high income group in the population is small, few
>villages will have high income members, and the distribution of
>village Ginis will (correctly) reflect that.

Calculating a Gini coefficient is a problem in numerical integration;
and numerical integration gets less reliable as the function being
integrated gets more 'peaked'. If Mattias has everybody in each
village, his village Gini coefficients will be right, period. But if
he's sampling -- here's a concrete example, to illustrate the problem.

Suppose there's a population of (large) size P, with mean income I.
Suppose, also, that one half the income is equally distributed among
99% of the population, and that the remaining half is evenly
distributed among the remaining 1%. (So, total income is P*I; total
income for the lower 99% and the upper 1% are each P*I/2.) Then, the
Gini coefficient can be calculated exactly (see APPENDIX, below); it is 0.49.

Now, suppose you draw a sample of size 10. With probability
0.99**10=~0.904, the sample will include no high-income individuals;
everybody in the sample will have income I/1.98, and the empirical
Gini coefficient is 0.
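
The probability quoted above is easy to check, and the same calculation shows how it changes with the sample size. A small Python sketch; the function name and the chosen sample sizes are illustrative, and it assumes independent draws from a large population:

```python
def prob_no_high_income(sample_size, high_share=0.01):
    """Probability that a random sample from a large population
    contains nobody from the high-income group."""
    return (1 - high_share) ** sample_size


# Sample sizes chosen for illustration only.
for n in (5, 10, 15, 20, 55):
    print(n, round(prob_no_high_income(n), 3))
```

For a sample of 10 this reproduces the 0.904 given above; larger samples lower the probability, but only slowly.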
===============================================================
APPENDIX: Gini coefficient, for the hypothetical case:

In the lower 99%, mean individual income (and, by hypothesis, actual income) is

(P*I/2)/(0.99*P) = I/1.98 =~ 0.505*I;

in the top 1% mean (and actual) individual income is

(P*I/2)/(0.01*P) = 50*I.

If p is the cumulative proportion of the population and i the
cumulative proportion of income, the Lorenz curve becomes

(A) i = p/1.98                  p<= 0.99
(B) i = 0.5 + 50*(p-0.99)       p>= 0.99

where, if i=0.5, p=0.99.

The Gini coefficient is 1-2*(area under Lorenz curve). Here, that
area is (by area of triangles and trapezoids)

area (A): 0.99*0.5/2    = 0.25*0.99 =   0.2475
area (B): .01*(1+0.5)/2 = 0.75*0.01 =   0.0075
                                         ------
                                         0.0255

So the Gini coefficient is 1-2*.0255=   0.49
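
The appendix arithmetic can be checked numerically. The sketch below (Python, with an illustrative function name) takes the three vertices of the piecewise-linear Lorenz curve defined by (A) and (B) and sums the trapezoids; the two areas add up to 0.2475 + 0.0075 = 0.2550, which gives the stated Gini coefficient of 0.49.

```python
def lorenz_area(vertices):
    """Area under a piecewise-linear Lorenz curve given as (p, i) vertices."""
    area = 0.0
    for (p0, i0), (p1, i1) in zip(vertices, vertices[1:]):
        area += (p1 - p0) * (i0 + i1) / 2  # trapezoid between two vertices
    return area


# (0, 0) -> (0.99, 0.5) is segment (A); (0.99, 0.5) -> (1, 1) is segment (B).
vertices = [(0.0, 0.0), (0.99, 0.5), (1.0, 1.0)]
area = lorenz_area(vertices)
gini = 1 - 2 * area
print(round(area, 4), round(gini, 4))
```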


Re: Calculating Gini coefficients for each subset (villages) of large data set

Richard Ristow
Correcting an apparent transcription error in the example, at 01:35
PM 2/24/2014, I wrote:

>The Gini coefficient is 1-2*(area under Lorenz curve). [In the
>example distribution], that area is (by area of triangles and trapezoids)

Below, for "0.0255" read "0.2550". The Gini-coefficient calculation
uses the correct value.

>area (A): 0.99*0.5/2    = 0.25*0.99 =   0.2475
>area (B): .01*(1+0.5)/2 = 0.75*0.01 =   0.0075
>                                         ------
>                                         0.0255
>
>So the Gini coefficient is 1-2*.0255=   0.49


Re: Calculating Gini coefficients for each subset (villages) of large data set

Mattias
What alternative measure of inequality would you suggest I use?

I am running multilevel models and am, among other things, interested in the association between village inequality and household behavior. Households are distributed into villages like this:

N_IDPSU
            Frequency Percent Valid Percent Cumulative Percent
Valid     1         6      ,0            ,0                 ,0
          2        44      ,2            ,2                 ,2
          3        99      ,5            ,5                 ,7
          4       144      ,7            ,7                1,4
          5       160      ,8            ,8                2,1
          6       276     1,3           1,3                3,4
          7       322     1,5           1,5                4,9
          8       432     2,0           2,0                7,0
          9       369     1,7           1,7                8,7
         10       560     2,6           2,6               11,3
         11       759     3,6           3,6               14,9
         12       768     3,6           3,6               18,5
         13       936     4,4           4,4               22,9
         14      1358     6,4           6,4               29,2
         15      1650     7,7           7,7               37,0
         16      2416    11,3          11,3               48,3
         17      2618    12,3          12,3               60,6
         18      2304    10,8          10,8               71,4
         19      1254     5,9           5,9               77,2
         20      1140     5,3           5,3               82,6
         21       315     1,5           1,5               84,1
         22       396     1,9           1,9               85,9
         23       483     2,3           2,3               88,2
         24       432     2,0           2,0               90,2
         25       225     1,1           1,1               91,3
         26       234     1,1           1,1               92,4
         27       297     1,4           1,4               93,7
         28       252     1,2           1,2               94,9
         29       232     1,1           1,1               96,0
         30       120      ,6            ,6               96,6
         31       155      ,7            ,7               97,3
         32        96      ,5            ,5               97,8
         33        33      ,2            ,2               97,9
         34        68      ,3            ,3               98,2
         35       105      ,5            ,5               98,7
         36       144      ,7            ,7               99,4
         37        74      ,3            ,3               99,7
         55        55      ,3            ,3              100,0
      Total     21331   100,0         100,0
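
The table counts households, not villages. Assuming N_IDPSU is the number of sampled households in a household's own village (an assumption, though every frequency is an exact multiple of its N_IDPSU value), the number of villages of each size follows by division. A short Python sketch with illustrative variable names:

```python
# Households per value of N_IDPSU, copied from the frequency table.
households_by_size = {
    1: 6, 2: 44, 3: 99, 4: 144, 5: 160, 6: 276, 7: 322, 8: 432, 9: 369,
    10: 560, 11: 759, 12: 768, 13: 936, 14: 1358, 15: 1650, 16: 2416,
    17: 2618, 18: 2304, 19: 1254, 20: 1140, 21: 315, 22: 396, 23: 483,
    24: 432, 25: 225, 26: 234, 27: 297, 28: 252, 29: 232, 30: 120,
    31: 155, 32: 96, 33: 33, 34: 68, 35: 105, 36: 144, 37: 74, 55: 55,
}

# Each village of size s contributes s households to row s.
villages_by_size = {s: h // s for s, h in households_by_size.items()}

total_households = sum(households_by_size.values())
small_villages = sum(v for s, v in villages_by_size.items() if s < 10)
small_households = sum(h for s, h in households_by_size.items() if s < 10)
share_small = small_households / total_households

print(small_villages, round(100 * share_small, 1))
```

With the figures in the table this gives 316 villages with fewer than 10 sampled households, holding about 8.7% of all households.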

Re: Calculating Gini coefficients for each subset (villages) of large data set

Richard Ristow
After I wrote about the problems inherent in
calculating a Gini coefficient based on a small
sample, at 04:10 AM 2/25/2014, Mattias wrote:

>What alternative measure of inequality would you suggest I use?

I'm out of my depth on this; I'm far below
professional level in the relevant social
sciences, and I don't know the village societies
you're studying at all. You need to talk with
subject specialists -- people who are familiar
with village societies of the kind you're
studying; with people who've done similar
investigations; and, if possible, with someone
who has both kinds of experience.

You don't say how you got your samples, or what
fraction of the village population you usually
sampled. If you have the whole population, your
Gini coefficient is right, but not necessarily
relevant; especially for small villages, you have
to consider whether a group of villages is the
relevant economic unit. (To repeat myself: I
cannot judge this. I know nothing about the
villages you are studying; you know a good deal;
and you should have access to people who know more.)

You may have to exclude some villages from
analysis because your sample is too small. From
the table you sent, I see that about 9% of your
households are in villages where your sample is
fewer than 10 households; you may have to exclude
those, unless
you can pool them with others. But, again, talk
with subject specialists about these issues.

The Wikipedia article on the Gini coefficient
seems pretty good, but you can read it as well as
I can. The most helpful lead I found in it is,

>"Small sample bias – sparsely populated regions
>more likely to have low Gini coefficient":  Gini
>index has a downward-bias for small populations.[56] ...

That cites the article: George Deltas (February
2003). "The Small-Sample Bias of the Gini
Coefficient: Results and Implications for
Empirical Research". The Review of Economics and
Statistics 85 (1): 226–234.
doi:10.1162/rest.2003.85.1.226. Reviewing that
article would be a place to start. Here, you will
benefit from the help of methodologists more than subject specialists.

(Is it clear what I mean by "methodologists" and
"subject specialists"? The former are familiar
with analytic techniques; the latter, with the
people you are studying --though they may well
know techniques as well. The two can help you in different ways.)

The second-most helpful hint I found in the
Wikipedia article was a caution: the Gini index
measures inequality, not prosperity. It may be
that people in all income quantiles of a more
prosperous village have higher incomes than those
in a less prosperous village, even if the more
prosperous village has higher inequality.
Relative prosperity surely also affects the behaviors you're studying.

Finally, a hint I found in Webbing around about
the Gini coefficient, but can't find again: If,
by any chance, you have household mean income for
your villages, compare it with mean income from
your sample. If the mean income of the sample is
notably lower, that suggests there's a
high-income group whom your sample missed.

-Go forth in peace, and best of success to you,
  Richard Ristow


Re: Calculating Gini coefficients for each subset (villages) of large data set

Garry Gelade
In reply to this post by Richard Ristow
Hi Richard

Yes, your calculation is correct, but it has nothing to do with Gini or
problems of numerical integration. It shows that drawing a small sample
(N=10) from a *large* population containing a small proportion of high
income individuals is likely to give a misleading Gini for the population at
large.  As I see it,  that is likely to happen whatever index of inequality
you use. It's simply that when taking a small sample from a large
population, the high income individuals are likely to be excluded from the
sample, and the inequality measure will be biased.

A similar sort of thing happens when we calculate an SD (sometimes used as a
measure of inequality).  If we assume your example population, and give it a
mean income of 1, then 1% have an income of 50, and 99% have an income of
.505.

The SD of income for the whole population is SQRT(  .01*(50 - 1)^2 +
.99*(.505 - 1)^2 ) = 4.9.

But on your (correct) logic, a sample of 10 is highly likely to consist of
individuals with identical income, giving a sample SD of zero. So I wouldn't
say Gini is particularly susceptible to bias in this sort of situation.
Using a different measure of inequality would lead to similar problems.
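
The SD figure can be reproduced in a few lines of Python. The sketch treats the example population as two groups with population shares and incomes as stated above (mean income set to 1); the variable names are illustrative.

```python
import math

# (population share, income) for the two groups in the example.
groups = [(0.99, 0.5 / 0.99), (0.01, 50.0)]

mean = sum(share * income for share, income in groups)
variance = sum(share * (income - mean) ** 2 for share, income in groups)
sd = math.sqrt(variance)

# A sample drawn entirely from the lower group has identical incomes, so SD = 0.
sample = [0.5 / 0.99] * 10
sample_mean = sum(sample) / len(sample)
sample_sd = math.sqrt(sum((x - sample_mean) ** 2 for x in sample) / (len(sample) - 1))

print(round(mean, 3), round(sd, 2), sample_sd)
```

The population SD comes out at roughly 4.9, while the all-low-income sample has an SD of zero, mirroring what happens to the Gini coefficient.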

Best Regards

Garry

-----Original Message-----
From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of
Richard Ristow
Sent: 24 February 2014 18:35
To: [hidden email]
Subject: Re: Calculating Gini coefficients for each subset (villages) of
large data set

At 03:54 AM 2/24/2014, Garry Gelade wrote:

>You said "a small sample may underestimate the Gini coefficient badly,
>through failing to include any of the small, very-high-income subgroup
>in the sample."
>
>Of course the Gini coefficient calculated on a small sample will be
>subject to sampling error, and unrepresentative of the population at
>large,  But it seems to me Mattias is not trying to estimate the Gini
>coefficient for the population at large, he is interested in the
>village level of analysis, and the distribution of village Ginis. If
>the high income group in the population is small, few villages will
>have high income members, and the distribution of village Ginis will
>(correctly) reflect that.

Calculating a Gini coefficient is a problem in numerical integration; and
numerical integration gets less reliable as the function being integrated
gets more 'peaked'. If Matthias has everybody in each village, his village
Gini coefficients will be right, period. But if he's sampling -- here's a
concrete example, to illustrate the problem.

Suppose there's a population of (large) size P, with mean income I.
Suppose, also, that one half the income is equally distributed among 99% of
the population, and that the remaining half is evenly distributed among the
remaining 1%. (So, total income is P*I; total income for the lower 99% and
the upper 1% are each P*I/2.) Then, the Gini coefficient can be calculated
exactly (see APPENDIX, below); it is 0.49.

Now, suppose you draw a sample of size 10. With probability 0.99**10=~0.904,
the sample will include no high-income individuals; everybody in the sample
will have income I/1.98, and the empirical Gini coefficient is 0.
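
The sampling argument above can be checked numerically. The following is a minimal Python sketch, not part of the original message: the `gini` helper and the variable names are illustrative assumptions, and mean income I is set to 1 in arbitrary units. It computes the chance that a sample of 10 misses the top 1%, and the empirical Gini for a sample with no high-income member and for one with exactly one.

```python
def gini(incomes):
    """Empirical Gini coefficient via the rank-based formula on sorted incomes."""
    xs = sorted(incomes)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # G = sum_i (2*i - n - 1) * x_i / (n * sum(x)), with ranks i = 1..n
    weighted = sum((2 * i - n - 1) * x for i, x in enumerate(xs, start=1))
    return weighted / (n * total)


mean_income = 1.0          # I, in arbitrary units (assumption for illustration)
low = mean_income / 1.98   # income of each person in the lower 99%
high = 50 * mean_income    # income of each person in the top 1%

p_no_rich = 0.99 ** 10                    # chance a sample of 10 misses the top 1%
gini_all_low = gini([low] * 10)           # every sampled income identical
gini_one_rich = gini([low] * 9 + [high])  # sample that catches one high earner

print(round(p_no_rich, 3))
print(gini_all_low)
print(round(gini_one_rich, 3))
```

In this toy setup the all-low sample gives a Gini of zero (up to floating-point rounding), while a sample that happens to include one high earner gives a value well above the population figure, so individual small-sample estimates can miss in either direction.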
===============================================================
APPENDIX: Gini coefficient, for the hypothetical case:

In the lower 99%, mean individual income (and, by hypothesis, actual income)
is

(P*I/2)/(0.99*P) = I/1.98 =~ 0.505*I;

in the top 1% mean (and actual) individual income is

(P*I/2)/(0.01*P) = 50*I.

If p is the cumulative proportion of the population and i the cumulative
proportion of income, the Lorenz curve becomes

(A) i = p/1.98                  p<= 0.99
(B) i = 0.5 + 50*(p-0.99)       p>= 0.99

where, if i=0.5, p=0.99.

The Gini coefficient is 1-2*(area under Lorenz curve). Here, that area is
(by area of triangles and trapezoids)

area (A): 0.99*0.5/2    = 0.25*0.99 =   0.2475
area (B): .01*(1+0.5)/2 = 0.75*0.01 =   0.0075
                                         ------
                                         0.2550

So the Gini coefficient is 1-2*.2550 =   0.49
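
The trapezoid arithmetic in this appendix can be reproduced mechanically. This is a minimal Python sketch under the appendix's own assumptions (a piecewise-linear Lorenz curve through (0, 0), (0.99, 0.5) and (1, 1)); the function name `gini_from_lorenz` is made up for illustration.

```python
def gini_from_lorenz(points):
    """Gini = 1 - 2 * (area under a piecewise-linear Lorenz curve).

    `points` is a list of (cumulative population share, cumulative income
    share) pairs, starting at (0, 0) and ending at (1, 1).
    """
    area = 0.0
    for (p0, i0), (p1, i1) in zip(points, points[1:]):
        area += (p1 - p0) * (i0 + i1) / 2  # trapezoid between adjacent points
    return 1 - 2 * area


# Segment (A) runs from (0, 0) to (0.99, 0.5); segment (B) from there to (1, 1).
lorenz = [(0.0, 0.0), (0.99, 0.5), (1.0, 1.0)]
gini_exact = gini_from_lorenz(lorenz)
print(round(gini_exact, 4))
```

The two trapezoid areas sum to 0.2475 + 0.0075 = 0.255, which is what gives the Gini coefficient of 0.49.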

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command SIGNOFF SPSSX-L For a list of
commands to manage subscriptions, send the command INFO REFCARD


Re: Calculating Gini coefficients for each subset (villages) of large data set

Mattias
Hi Richard and Garry,

I think you both make important points and, as I understand your arguments, the one does not exclude the other. Garry’s comment makes a lot of sense to me: it is essentially a question of how representative the (small) sample is. Richard refers to work (e.g. the Deltas (2003) article) showing that Gini coefficients are biased downward when samples are small, as they are in my data. I asked about alternative measures that would be more appropriate for small samples because I had also found the Deltas (2003) article you mention, which also argues that “The small sample bias is especially relevant when [..] the Gini is used to compare income inequality across sub-populations, some of which may have very small sample sizes”. Deltas suggests using a ‘small sample adjusted’ Gini in that situation.

At the same time, the Wiki on income inequality measures describes population independence as one of four properties that any measure of inequality should fulfil, as follows: “the income inequality metric should not depend on whether an economy has a large or small population. An economy with only a few people should not be automatically judged by the metric as being more equal than a large economy with lots of people. This means that the metric should be independent of the level of population”.

I have spent most of my time being what you call a subject specialist, for example by conducting surveys and doing field work in village India. It is indeed of particular substantive interest to investigate village-level inequality, because the village is a social unit of specific importance in India. In this project, however, I am working on secondary data from a large-scale survey, and this is the first time I have used an inequality measure. I am therefore trying to find my way: I am anxious to understand the limitations of, for example, the Gini, and also reluctant to use unconventional “adjusted” Gini measures such as the one suggested by Deltas (2003).

As for how to actually calculate Gini coefficients for villages - which was my original question - I am still looking for a way to do this in terms of syntax. Jon's suggestion to use an R plug-in is not an option for now, since it appears I will need special technical assistance to install things properly.

I am now at a stage where I am considering cutting and pasting a syntax section for each village id since I cannot find a correct way of looping. Any suggestions for syntax in SPSS?

Mattias
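
The per-village logic is independent of the software used: group the households by village, sort incomes within each group, and apply the Gini formula once per group. As a cross-check outside SPSS, a minimal Python sketch of that idea might look like the following. It is not SPSS syntax and not a method proposed in the thread; the field names `IDPSU` and `INCOME` follow the original post, while the function names and the toy data are made up for illustration.

```python
from collections import defaultdict


def gini(incomes):
    """Empirical Gini coefficient via the rank-based formula on sorted incomes."""
    xs = sorted(incomes)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # G = sum_i (2*i - n - 1) * x_i / (n * sum(x)), with ranks i = 1..n
    weighted = sum((2 * i - n - 1) * x for i, x in enumerate(xs, start=1))
    return weighted / (n * total)


def gini_by_group(rows, group_key="IDPSU", value_key="INCOME"):
    """Return {group id: Gini}, computed separately within each group."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_key]].append(row[value_key])
    return {group: gini(values) for group, values in groups.items()}


# Toy data (hypothetical): three villages with different income spreads.
households = [
    {"IDPSU": 1, "INCOME": 100}, {"IDPSU": 1, "INCOME": 100},
    {"IDPSU": 1, "INCOME": 100},
    {"IDPSU": 2, "INCOME": 0}, {"IDPSU": 2, "INCOME": 0},
    {"IDPSU": 2, "INCOME": 300},
    {"IDPSU": 3, "INCOME": 10}, {"IDPSU": 3, "INCOME": 20},
    {"IDPSU": 3, "INCOME": 30}, {"IDPSU": 3, "INCOME": 40},
]

village_ginis = gini_by_group(households)
for village, g in sorted(village_ginis.items()):
    print(village, round(g, 4))
```

Village 2 in the toy data, where one of three households holds all the income, comes out at 2/3 rather than 1. With this formula the maximum for a group of n households is (n - 1)/n, which is one concrete way to see the small-sample issue discussed above.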