Calculating Gini coefficients for each subset (villages) of large data set


Re: Calculating Gini coefficients for each subset (villages) of large data set

David Marso
Administrator
"I am now at a stage where I am considering cutting and pasting a syntax section for each village id since I cannot find a correct way of looping. Any suggestions for syntax in SPSS?"

Yes!  I made such suggestions almost TWO weeks ago!
Reread the thread and teach yourself about AGGREGATE!

Mattias wrote
Hi Richard and Garry,

I think that you both make important points, and that they do not exclude one another as I understand them. Garry’s comment makes a lot of sense to me: it is essentially a question of how representative the (small) sample is. Richard refers to work (e.g. the Deltas (2003) article) which shows that Gini coefficients are biased downward when samples are small, as they are in my data. I asked about alternative measures that would be more appropriate for small samples because I had also found the Deltas (2003) article, which argues that “The small sample bias is especially relevant when [..] the Gini is used to compare income inequality across sub-populations, some of which may have very small sample sizes”. Deltas suggests a ‘small sample adjusted’ Gini instead when samples are small.

At the same time, the Wiki on income inequality measures also describes the property of population independence as one of four properties any measure on inequality should fulfil in the following way: “the income inequality metric should not depend on whether an economy has a large or small population. An economy with only a few people should not be automatically judged by the metric as being more equal than a large economy with lots of people. This means that the metric should be independent of the level of population”

I have spent most of my time being what you call a subject specialist, for example through conducting surveys and by doing field work in village India. It is indeed of particular substantive interest to investigate village level inequality because the village is a social unit of specific importance in India. In this project, however, I am working on secondary data from a large scale survey and this is the first time I use an inequality measure so I am trying to find my way and as such I am anxious to understand the limitations of for example the Gini and also reluctant to use unconventional “adjusted” Gini measures such as that suggested by Deltas (2003).

As for how to actually calculate Gini coefficients for villages - which was my original question - I am still looking for a way to do this in SPSS syntax. Jon's suggestion to use an R plug-in is off the table for now, since it appears I will need special technical assistance to install things properly.

I am now at a stage where I am considering cutting and pasting a syntax section for each village id since I cannot find a correct way of looping. Any suggestions for syntax in SPSS?

Mattias
Please reply to the list and not to my personal email.
Those desiring my consulting or training services please feel free to email me.
---
"Nolite dare sanctum canibus neque mittatis margaritas vestras ante porcos ne forte conculcent eas pedibus suis."
Cum es damnatorum possederunt porcos iens ut salire off sanguinum cliff in abyssum?"

Re: Calculating Gini coefficients for each subset (villages) of large data set

David Marso
Administrator
/** UNTESTED CODE use at your own risk after testing and verification          **/.
/** Simply modified the code you originally pasted to allow a 'break' variable **/.
/** Note BREAK on AGGREGATE, SPLIT FILE, BY on RANK...                         **/.
/** These are the "looping" constructs you will want to read up on.            **/.

* Step 1.
SORT CASES BY IDPSU INCOME.
* Step 2.
AGGREGATE OUTFILE = *
  / PRESORTED
  / BREAK = IDPSU INCOME
  / persons = N .
WEIGHT BY persons.

* Step 3.  
AGGREGATE OUTFILE = * MODE ADDVARIABLES
  / BREAK = IDPSU
  / suminc = SUM(INCOME).
MATCH FILES
  / FILE *
  / BY IDPSU
  / FIRST=@TOP@.
* Step 4 .
COMPUTE pincome=persons * income.

SPLIT FILE BY IDPSU.
CREATE cincome=CSUM(pincome).
SPLIT FILE OFF.

* Step 5 .
COMPUTE pcinc = cincome/suminc .

* Step 6.
RANK VARIABLES=income (A) BY IDPSU
  / RFRACTION into cdfinc
  / PRINT=YES
  / TIES=HIGH .

* Step 7.
COMPUTE d1 = @TOP@.
COMPUTE d2 = @TOP@.
* Note that it doesn't matter whether D1 or D2 is the Y variable
* in the D1-D2 pair.
SPLIT FILE BY IDPSU.
* D1 and D2 are identical and are created to allow you to draw a
* diagonal line on the graph.
GRAPH
  /SCATTERPLOT(OVERLAY)=cdfinc d2 WITH pcinc d1 (PAIR)
  /MISSING=LISTWISE
  /TITLE= 'Lorenz Curve for Income'.


* Step 8.
* Calculate and print the Gini coefficient(s).
* For last case, LAREA is area under the Lorenz curve.
* The first case in each IDPSU contributes the segment from the origin (0,0).
DO IF (@TOP@) .
+ COMPUTE larea = cdfinc * pcinc / 2.
ELSE.
+ COMPUTE larea = LAG(larea) + (cdfinc - LAG(cdfinc)) * (pcinc + LAG(pcinc))/2 .
END IF.
IF (cdfinc = 1) gini = (.5 - larea)/.5 .
REPORT
  /VARIABLES gini (VALUES)
  /BREAK (IDPSU) '' (SKIP(1))
  /SUMMARY MAX( gini) SKIP(1) '' .


Re: Calculating Gini coefficients for each subset (villages) of large data set

Garry Gelade
In reply to this post by Mattias
Dear Mattias [+ Richard]

I think I see your problem a bit more clearly now.  If your villages are of different sizes, then I can see that the Gini would tend to be biased downwards for smaller villages to an extent which depends on the distribution of income in the population.

So any effect of Gini on your outcome variable would be partially confounded  by village size. Perhaps you could estimate the extent of the bias by taking a large number of random samples of sizes say 5, 10, 15, or 20 from your population, and calculating Ginis.  If sample size has a substantial effect on Gini, you could then correct the empirical Gini for each village by a size factor somehow.
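
The resampling check described above could be sketched roughly as follows. This is a Python illustration rather than SPSS syntax, and everything in it is an assumption made for the example: the lognormal population stands in for the real survey incomes, and the function names, the seeds and the number of draws are hypothetical.

```python
import random


def sample_gini(incomes):
    """Gini coefficient from the empirical Lorenz curve (trapezoid rule)."""
    values = sorted(incomes)
    n, total = len(values), sum(values)
    area = cum_income = prev_share = 0.0
    for value in values:
        cum_income += value
        share = cum_income / total
        area += (prev_share + share) / (2 * n)  # one trapezoid of width 1/n
        prev_share = share
    return 1 - 2 * area


def mean_gini_by_sample_size(population, sizes, draws, seed=0):
    """Average Gini over `draws` random samples of each size in `sizes`."""
    rng = random.Random(seed)
    return {
        size: sum(sample_gini(rng.sample(population, size))
                  for _ in range(draws)) / draws
        for size in sizes
    }


# Hypothetical population: 5,000 lognormal incomes (NOT the survey data).
pop_rng = random.Random(42)
population = [pop_rng.lognormvariate(0, 0.75) for _ in range(5000)]

full_gini = sample_gini(population)
means = mean_gini_by_sample_size(population, sizes=(5, 10, 15, 20), draws=2000)
for size, mean in means.items():
    # full_gini / mean is one candidate "size factor" for villages of that size.
    print(f"n={size:2d}  mean Gini={mean:.3f}  "
          f"population Gini={full_gini:.3f}  ratio={full_gini / mean:.2f}")
```

The ratio column is only one possible size factor; whether a multiplicative correction is appropriate depends on the income distribution, so it would need checking against the actual data.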

Regards

Garry


Re: Re: Calculating Gini coefficients for each subset (villages) of large data set

Mattias
In reply to this post by David Marso
Got it working after a few minor adjustments and have calculated the Gini coefficients. Will verify now.
Learned something important about @TOP@ in the process.
 

Re: Re: Re: Calculating Gini coefficients for each subset (villages) of large data set

David Marso
Administrator
There is nothing special about @TOP@.
It is simply a variable (0/1) created on the MATCH (FIRST function) so we know to reinitialize the accumulation.
I tend to flank such variables with @ to avoid collisions.
Please post the new code with 'minor adjustments' so the list has it available as a future resource.
You can see there are numerous techniques for 'looping' (SPLIT FILE, BY on AGGREGATE and RANK).
MACRO also has the !DO, the transformation language has DO REPEAT and LOOP.

Re: Calculating Gini coefficients for each subset (villages) of large data set

Mattias
In reply to this post by Mattias
Here is suggested syntax for computing Gini coefficients for each subunit (IDPSU) based on INCOME provided by David Marso and slightly adjusted by me. Be aware that, as discussed above, these Gini coefficients are subject to downward bias when samples are small and that this problem is not addressed in this syntax.

* Step 1.
SORT CASES BY IDPSU INCOME.
* Step 2.
AGGREGATE OUTFILE = *
  / PRESORTED
  / BREAK = IDPSU INCOME
  / persons = N .
WEIGHT BY persons.

* Step 3.  
AGGREGATE OUTFILE = * MODE ADDVARIABLES
  / BREAK = IDPSU
  / suminc = SUM(INCOME).
MATCH FILES
  / FILE *
  / BY IDPSU
  / FIRST=@TOP@.
* Step 4 .
COMPUTE pincome=persons * income.

SPLIT FILE BY IDPSU.
CREATE cincome=CSUM(pincome).
SPLIT FILE OFF.

* Step 5 .
COMPUTE pcinc = cincome/suminc .

* Step 6.
RANK VARIABLES=income (A) BY IDPSU
  / RFRACTION into cdfinc
  / PRINT=YES
  / TIES=HIGH .

* Step 7.
COMPUTE d1 = @TOP@.
COMPUTE d2 = @TOP@.
* Note that it doesn't matter whether D1 or D2 is the Y variable
* in the D1-D2 pair.
SPLIT FILE BY IDPSU.
* D1 and D2 are identical and are created to allow you to draw a
* diagonal line on the graph.
GRAPH
  /SCATTERPLOT(OVERLAY)=cdfinc d2 WITH pcinc d1 (PAIR)
  /MISSING=LISTWISE
  /TITLE= 'Lorenz Curve for Income'.

* Step 8.
* Calculate and print the Gini coefficient(s).
* For last case, LAREA is area under the Lorenz curve.
* The first case in each IDPSU contributes the segment from the origin (0,0).
DO IF (@TOP@) .
+ COMPUTE larea = cdfinc * pcinc / 2.
ELSE.
+ COMPUTE larea = LAG(larea) + (cdfinc - LAG(cdfinc)) * (pcinc + LAG(pcinc))/2 .
END IF.
IF (cdfinc = 1) gini = (.5 - larea)/.5 .
REPORT
  /VARIABLES gini (VALUE)
  /BREAK = IDPSU (SKIP(1)) ''
  /SUMMARY MAX( gini) SKIP(1) '' .
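
For anyone who wants to verify the SPSS results outside SPSS, here is a minimal cross-check in Python. It is a sketch, not a translation of the syntax: it works on raw (village, income) pairs, counts the first Lorenz segment from the origin (0, 0), and does no weighting. The function names and the sample rows are made up for illustration; only the variable names larea, cdf and pcinc echo the syntax above.

```python
from collections import defaultdict


def lorenz_gini(incomes):
    """Gini coefficient as (0.5 - area under the Lorenz curve) / 0.5."""
    values = sorted(incomes)
    n, total = len(values), sum(values)
    if n == 0 or total <= 0:
        return None  # Gini is undefined without positive total income
    larea = 0.0                     # running area under the Lorenz curve
    prev_cdf = prev_pcinc = 0.0     # the curve starts at the origin (0, 0)
    cum_income = 0.0
    for rank, income in enumerate(values, start=1):
        cum_income += income
        cdf = rank / n              # cumulative population share
        pcinc = cum_income / total  # cumulative income share
        larea += (cdf - prev_cdf) * (pcinc + prev_pcinc) / 2
        prev_cdf, prev_pcinc = cdf, pcinc
    return (0.5 - larea) / 0.5


def gini_by_village(rows):
    """rows: iterable of (village_id, income) pairs -> {village_id: Gini}."""
    by_village = defaultdict(list)
    for village_id, income in rows:
        by_village[village_id].append(income)
    return {vid: lorenz_gini(vals) for vid, vals in by_village.items()}


# Made-up rows for illustration; the IDPSU values are hypothetical.
rows = [("V01", 400), ("V01", 800), ("V01", 1200), ("V01", 1600),
        ("V02", 500), ("V02", 500), ("V02", 500)]
print(gini_by_village(rows))
```

Grouping into a dictionary plays the role that BREAK, SPLIT FILE and BY play in the syntax: one pass over the data, one result per village.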

Re: Calculating Gini coefficients for each subset (villages) of large data set

Richard Ristow
At 05:21 AM 2/28/2014, Mattias wrote:

>Here is suggested syntax for computing Gini coefficients for each
>subunit (IDPSU) based on INCOME provided by David Marso and slightly
>adjusted by me.

I did a little numerical experimentation, with three constructed distributions:
A. Four quartiles, with income (from low to high) 0.4, 0.8, 1.2, and
1.6 times the mean; true Gini coefficient 0.250
B. Half the income distributed evenly among 10% of the population,
the other half evenly among the other 90%; true Gini = 0.40
C. Half the income distributed evenly among 1% of the population, the
other half evenly among the other 99%; true Gini = 0.49.
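
The three true values quoted above can be checked directly from the piecewise-linear Lorenz curve of each distribution. The short Python sketch below does that; the function name and the (population share, income share) encoding are illustrative choices for this check, not part of the SPSS test code.

```python
def gini_from_groups(groups):
    """groups: (population_share, income_share) pairs, poorest group first."""
    area = 0.0
    cum_income = 0.0
    for pop_share, income_share in groups:
        next_income = cum_income + income_share
        # Trapezoid under the Lorenz curve for this group.
        area += pop_share * (cum_income + next_income) / 2
        cum_income = next_income
    return 1 - 2 * area


# Quartiles at 0.4, 0.8, 1.2 and 1.6 times the mean hold
# 10%, 20%, 30% and 40% of total income.
quartiles = [(0.25, 0.10), (0.25, 0.20), (0.25, 0.30), (0.25, 0.40)]
top_10_percent = [(0.90, 0.50), (0.10, 0.50)]   # half the income to 10%
top_1_percent = [(0.99, 0.50), (0.01, 0.50)]    # half the income to 1%

for name, groups in [("quartiles", quartiles),
                     ("top 10%", top_10_percent),
                     ("top 1%", top_1_percent)]:
    print(f"{name:10s} true Gini = {gini_from_groups(groups):.3f}")
```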

I randomly drew samples of 5, 10, and 20, with 20 'villages' for each
distribution and sampling level -- 180 'villages' all told. Here's
what it looked like; this is the result of AGGREGATE over the
combinations of distributions and per-village sample sizes:
|-----------------------------|---------------------------|
|Output Created               |03-MAR-2014 21:14:50       |
|-----------------------------|---------------------------|
  [Summary]
Distrib PerVillg TrueGini Villages GiniMean GiniSD GiniMin GiniMax

A            5      .250        20    .202    .076    .042    .300
A           10      .250        20    .238    .033    .192    .291
A           20      .250        20    .234    .040    .146    .292
B            5      .400        20    .210    .241    .000    .492
B           10      .400        20    .223    .232    .000    .494
B           20      .400        20    .364    .119    .000    .500
C            5      .490        20    .000    .000    .000    .000
C           10      .490        20    .041    .183    .000    .817
C           20      .490        20    .120    .292    .000    .817

Number of cases read:  9    Number of cases listed:  9

Roughly as expected: the mean Gini coefficient was low in every
condition, although for distributions A and B, the observed range of
estimated Gini coefficients always spanned the true value.  (I wrote
in an earlier post about distribution C, with half the income
concentrated in 1% of the population. That is a very intractable
distribution for estimating the Gini coefficient from empirical data,
and that shows, above.) And larger sample sizes matter: it looks like
a Gini estimated from a sample of size 5 can't be trusted; Ginis from
samples of sizes 10-20 are at least broadly reasonable estimates.
However, if the true distribution of income shows a high
concentration in a small proportion of the population, empirical Gini
coefficients can be wildly unreliable.
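
As a rough illustration of what a small-sample adjustment would do to the mean values in the table, here is a Python sketch. It assumes the adjustment is the n/(n-1) rescaling usually associated with Deltas (2003); that form is an assumption here, since the thread does not spell it out, and the function name is hypothetical. The numbers are copied from the summary table.

```python
def small_sample_adjusted(gini, n):
    """Rescale a sample Gini by n / (n - 1) (assumed form of the adjustment)."""
    if n < 2:
        raise ValueError("need at least two observations")
    return gini * n / (n - 1)


# (distribution, per-village sample size, true Gini, mean estimated Gini),
# copied from the summary table.
summary = [
    ("A", 5, 0.250, 0.202), ("A", 10, 0.250, 0.238), ("A", 20, 0.250, 0.234),
    ("B", 5, 0.400, 0.210), ("B", 10, 0.400, 0.223), ("B", 20, 0.400, 0.364),
    ("C", 5, 0.490, 0.000), ("C", 10, 0.490, 0.041), ("C", 20, 0.490, 0.120),
]
for dist, n, true_gini, mean_gini in summary:
    adjusted = small_sample_adjusted(mean_gini, n)
    print(f"{dist} n={n:2d}  true={true_gini:.3f}  "
          f"mean={mean_gini:.3f}  adjusted={adjusted:.3f}")
```

Because the factor is the same for every village of a given size, adjusting the mean gives the same number as averaging the adjusted village values. Under this assumption the rescaling brings distribution A close to 0.250, but it leaves B at sizes 5 and 10, and C at every size, well below the true value.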

I append the full code, which should allow anyone to reproduce or
extend these results. It divides roughly into three sections: a long
INPUT PROGRAM to generate the data; code to calculate the empirical
Gini coefficients; and a section to summarize and report the results.

The code to calculate empirical Gini coefficients is essentially a
streamlined version of that in the post I'm responding to; it's by
David Marso, Mattias, and an unknown earlier author. In streamlining it, I've

. Taken out code to plot each empirical Lorenz curve, and to print
the final values
. Taken out an initial AGGREGATE to collapse all cases with the same
income into a single record -- it's not necessary, probably costs
more than it saves in most cases, and isn't likely to be of use with
empirical data
. Rather than using CREATE and RANK to calculate cumulative values
for population and income, I've included code to do that in the
transformation program that calculates the Lorenz-curve integral and
the Gini coefficient.

-Live long and prosper (in the higher percentiles!),
  Richard Ristow
===================
APPENDIX: Test code
===================
*  C:\Documents and Settings\Richard\My Documents          .
*    \Technical\spssx-l\Z-2014\                            .
*    2014-02-28 Mattias-                                   .
*    Re Calculating Gini coefficients for each subset.SPS  .

*  In response to posting                                           .
*  Date:         Fri, 28 Feb 2014 02:21:49 -0800                    .
*  From: Mattias <[hidden email]>                             .
*  Subject:      Re: Calculating Gini coefficients for each subset  .
*                (villages) of large data set                       .
*  To: [hidden email]                                     .

*  ................................................................ .
*  .................   Test data               .................... .
SET RNG = MT       /* 'Mersenne twister' random number generator  */ .
SET MTINDEX = 4718 /*  Boston, MA telephone book                  */ .

NEW FILE.
INPUT PROGRAM.
.  STRING    IDPSU   (A6).
.  NUMERIC   TrueGini(F6.3).
.  NUMERIC   SampSize(F3).
.  LEAVE     IDPSU TrueGini SampSize.
.  NUMERIC   Person  (F3).
.  NUMERIC   Income  (DOLLAR9.2).

.  VAR WIDTH IDPSU (6).
.  VAR WIDTH TrueGini SampSize Person Income (8).

*  The number of 'villages' to generate for each combination of     .
*  distribution (A-C) and sample size:                              .
.  NUMERIC   #VlgPss (F3).
.  COMPUTE   #VlgPss= 20.

*  Distribution A:                                                  .
*  Mean income is 1,000.                                            .
*  Income in 1st quartile  =    400                                 .
*  Income in 2nd quartile  =    800                                 .
*  Income in 3rd quartile  =  1,200                                 .
*  Income in 4th quartile  =  1,600                                 .
*  Actual Gini coefficient =  0.25                                  .
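*  Check, by trapezoids under the Lorenz curve:                     .
*  Cumulative income shares by quartile: 0.10 0.30 0.60 1.00        .
*  Area = 0.25*(0.05 + 0.20 + 0.45 + 0.80) = 0.375                  .
*  Gini = 1 - 2*0.375      =  0.25                                  .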

.  COMPUTE TrueGini = 0.25.
.  LOOP   #Stype    = 1 TO  3.
.  RECODE #Stype (1 =  5)
                  (2 = 10)
                  (3 = 20) INTO SampSize.
.     LOOP #Village  = 1 TO #VlgPss.
.        COMPUTE IDPSU = CONCAT('A',
                                 STRING(SampSize,N2),
                                 '.',
                                 STRING(#Village,N2)).
.        LOOP Person = 1 TO SampSize.
.           COMPUTE #IncRank = RV.UNIFORM(0,1).
.           RECODE  #IncRank
                    (0.00 THRU 0.25 =  400)
                    (0.25 THRU 0.50 =  800)
                    (0.50 THRU 0.75 = 1200)
                    (0.75 THRU 1.00 = 1600) INTO Income.
.           END CASE.
.        END LOOP.
.     END LOOP.
.  END LOOP.

*  Distribution B:                                                  .
*  Mean income is 1,000                                             .
*  1/2 of total income is to top    10%, evenly:                    .
*  I = (1,000/2)/0.1       =  5,000                                 .
*  1/2 of total income is to bottom 90%, evenly:                    .
*  I = (1,000/2)/0.9       =    555.56                              .
*  Actual Gini coefficient =  0.40                                  .
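*  Check: the Lorenz curve runs (0,0) - (0.9,0.5) - (1,1), so       .
*  Area = 0.9*0.5/2 + 0.1*(0.5+1)/2 = 0.225 + 0.075 = 0.300         .
*  Gini = 1 - 2*0.300      =  0.40                                  .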

.  COMPUTE TrueGini = 0.40.
.  LOOP   #Stype    = 1 TO  3.
.  RECODE #Stype (1 =  5)
                  (2 = 10)
                  (3 = 20) INTO SampSize.
.     LOOP #Village  = 1 TO #VlgPss.
.        COMPUTE IDPSU = CONCAT('B',
                                 STRING(SampSize,N2),
                                 '.',
                                 STRING(#Village,N2)).
.        LOOP Person = 1 TO SampSize.
.           COMPUTE #IncRank = RV.UNIFORM(0,1).
.           RECODE  #IncRank
                    (0.00 THRU 0.90 =  555.56)
                    (0.90 THRU 1.00 = 5000.00)
                    INTO Income.
.           END CASE.
.        END LOOP.
.     END LOOP.
.  END LOOP.

*  Distribution C:                                                  .
*  Mean income is 1,000                                             .
*  1/2 of total income is to top     1%, evenly:                    .
*  I = (1,000/2)/0.01      = 50,000                                 .
*  1/2 of total income is to bottom 99%, evenly:                    .
*  I = (1,000/2)/0.99      =   505.05                               .
*  Actual Gini coefficient =  0.49                                  .
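*  Check: the Lorenz curve runs (0,0) - (0.99,0.5) - (1,1), so      .
*  Area = 0.99*0.5/2 + 0.01*(0.5+1)/2 = 0.2475 + 0.0075 = 0.255     .
*  Gini = 1 - 2*0.255      =  0.49                                  .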

.  COMPUTE TrueGini = 0.49.
.  LOOP   #Stype    = 1 TO  3.
.  RECODE #Stype (1 =  5)
                  (2 = 10)
                  (3 = 20) INTO SampSize.
.     LOOP #Village  = 1 TO #VlgPss.
.        COMPUTE IDPSU = CONCAT('C',
                                 STRING(SampSize,N2),
                                 '.',
                                 STRING(#Village,N2)).
.        LOOP Person = 1 TO SampSize.
.           COMPUTE #IncRank = RV.UNIFORM(0,1).
.           RECODE  #IncRank
                    (0.00 THRU 0.99 =   505.05)
                    (0.99 THRU 1.00 = 50000.00)
                    INTO Income.
.           END CASE.
.        END LOOP.
.     END LOOP.
.  END LOOP.

END FILE.
END INPUT PROGRAM.
EXECUTE /* to avoid some DATASET glitches */.

DATASET NAME     TestData WINDOW=FRONT.

DATASET COPY     Process.
DATASET ACTIVATE Process  WINDOW=FRONT.

*--For testing, select a small subset of the data .
*--SELECT IF    SUBSTR(IDPSU,1,1) EQ 'A'
             AND SampSize          EQ 20
             AND NUMBER(SUBSTR(IDPSU,5,2),F2) LE 5.

*  ................................................................ .
*  .................   Test code               .................... .


* Step 1.
SORT CASES BY IDPSU INCOME.

* Step 2,    combining all subjects with the same income into   ... .
*            a single summary record, omitted in this version   ... .

* Step 3:    Total income (and population) for each group       ... .

* ... If there is a variable giving the number of persons at    ... .
* ... each income level, replace "Persons" by that variable,    ... .
* ... globally, and delete the following declaration and        ... .
* ... COMPUTE.                                                  ... .
.     NUMERIC   Persons (F4).
.     VAR LABEL Persons 'How many persons the record represents'.
.     COMPUTE   Persons = 1.

COMPUTE GrpPop = Persons.
COMPUTE GrpInc = Persons * income.

AGGREGATE OUTFILE = *
     MODE     =ADDVARIABLES
     OVERWRITE=YES
   / BREAK  = IDPSU
   / GrpPop 'Total persons in sub-sample or group' = SUM(GrpPop)
   / GrpInc 'Total income  in sub-sample or group' = SUM(GrpInc).

FORMATS GrpPop (F5)
         GrpInc (DOLLAR11.2).

* Steps 4-8: Cumulative population & income; cumulative         ... .
*            fractional population & income (X and Y values on  ... .
*            the Lorenz curve), and cumulative Lorenz integral. ... .

MATCH FILES
   / FILE *
   / BY IDPSU
   / FIRST=@TOP@.

NUMERIC   /* Cumulative, at or below this income level: */
         #CPersons (F5)     /* Total persons             */
         #CIncome  (F11.2)  /* Total income              */.

NUMERIC CDFpers    CDFinc     Gini
         #PvCDFpers #PvCDFinc  (F6.3)
         @LArea                (F8.4).

VAR LABELS
     CDFpers    'Fraction of population at or below this income level'
     CDFinc     'Fraction of income to persons at or below this level'
     Gini       'Gini coefficient'.
*   Descriptions for scratch variables:                      .
*   #PvCDFpers  Previous value of "CDFpers"                  .
*   #PvCDFinc   Previous value of "CDFinc"                   .
*   @LArea      Area under the Lorenz curve, to this level   .

LEAVE   @LArea.

DO IF @TOP@.
.  COMPUTE #CPersons  = 0.
.  COMPUTE #CIncome   = 0.
.  COMPUTE #PvCDFpers = 0.
.  COMPUTE #PvCDFinc  = 0.
.  COMPUTE @LArea     = 0.
END IF.

*  Steps 4-6:                                                   ... .
*  Cumulative total        population and income:               ... .
COMPUTE    #CPersons = #CPersons + Persons.
COMPUTE    #CIncome  = #CIncome  + Persons*Income.

*  Cumulative FRACTIONS of population and income:               ... .
COMPUTE    CDFpers  = #CPersons / GrpPop.
COMPUTE    CDFinc   = #CIncome  / GrpInc.


*  Step 7 (graph of Lorenz curve) omitted from this version     ... .

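*  Step 8:  Cumulative area under the Lorenz curve, adding one  ... .
*           trapezoid per record (width: step in CDFpers;       ... .
*           height: mean of previous and current CDFinc),       ... .
COMPUTE @LArea   = @LArea
                  + (CDFinc + #PvCDFinc)*(CDFpers-#PvCDFpers)/2.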

*           and Gini coefficient, set on the last record of     ... .
*           each group, when the cumulation is complete:        ... .
IF  #CPersons EQ GrpPop  Gini = 1 - 2*@LArea   .

*  Save current cumulative fractions of population and income,  ... .
*  for the next step in the Lorenz integration:                 ... .
COMPUTE    #PvCDFpers = CDFpers.
COMPUTE    #PvCDFinc  = CDFinc.

*  ................................................................ .
*  .................   Output code             .................... .
EXECUTE /* to avoid some DATASET glitches */.

DATASET COPY     Output.
DATASET ACTIVATE Output WINDOW=FRONT.

SELECT IF NOT MISSING(Gini).

NUMERIC   MeanInc  (DOLLAR9.2).
VAR LABEL MeanInc  'Sample mean income'.
COMPUTE   MeanInc = GrpInc / GrpPop.

LIST CASES=5
     /VAR=IDPSU TrueGini SampSize MeanInc Gini.

ADD FILES
     /FILE=*
     /RENAME=(SampSize=PerVillg)
     /KEEP=IDPSU TrueGini PerVillg MeanInc Gini.

STRING    Distrib (A1).
COMPUTE   Distrib = SUBSTR(IDPSU,1,1).
VAR LABEL Distrib 'Which of the constructed underlying distributions'.
VAL LABEL Distrib
           'A' 'Tractable, four quartiles'
           'B' 'Half of income to top 10%'
           'C' 'Half of income to top  1%'.

DATASET   DECLARE Summary.
AGGREGATE OUTFILE=Summary
     /BREAK = Distrib PerVillg
     /TrueGini 'Gini for constructed income distribution'
    = MAX(TrueGini)
     /Villages 'Number of "villages" drawn'
    = NU
     /GiniMean 'Mean of sample Ginis'
    = MEAN(Gini)
     /GiniSD   'Std deviation of sample Ginis'
    = SD(Gini)
     /GiniMin  'Smallest Gini coefficient in sample'
    = MIN(Gini)
     /GiniMax  'Largest  Gini coefficient in sample'
    = MAX(Gini).

DATASET  ACTIVATE Summary WINDOW=FRONT.

FORMATS GiniMean GiniSD (F6.3).

LIST.
