By WIKTORIA KORYGA (Predictive Solutions)
The Gini index is a measure of the concentration of a variable’s distribution. In statistics it is commonly used to describe the concentration (unevenness) of the distribution of a random variable, while its most popular use in economics is as a measure of the degree of income inequality.
GINI INDEX AS A MEASURE OF VARIATION IN QUALITATIVE VARIABLES
The Gini index is also used as a measure of variation for qualitative, categorical variables. We encounter categorical data in many types of analysis, very often in scientific fields such as sociology, economics or biostatistics. One of the measures used to analyse variation is precisely the Gini index expressed by the formula:
$$G = 1 - \sum_{i=1}^{k} p_i^2$$
where:
k – number of categories of the variable,
$p_i$ – probability of belonging to the $i$-th category.
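As a minimal sketch, the index can be computed directly from a list of observed category labels, assuming the standard form G = 1 − Σ p_i² with each p_i estimated as the share of observations in category i. The function name `gini_index` is an illustrative choice, not part of any particular package.

```python
from collections import Counter


def gini_index(values):
    """Gini index G = 1 - sum(p_i^2) for a sequence of category labels.

    Each p_i is estimated as the share of observations in category i.
    """
    counts = Counter(values)
    n = sum(counts.values())
    if n == 0:
        raise ValueError("at least one observation is required")
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


if __name__ == "__main__":
    # Two equally frequent categories: the most variable two-category case.
    print(gini_index(["female", "male", "female", "male"]))
    # A single category: no variability at all.
    print(gini_index(["female", "female", "female", "female"]))
```

With two equally frequent categories the function returns 0.5, and with a single category it returns 0.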
The value of the Gini index indicates how much variability there is in the qualitative variable under study. It can be compared to the variance and standard deviation calculated for quantitative variables.
The Gini index for a qualitative variable is never less than zero, and its upper limit is not a single fixed number: the maximum depends on the number of categories and is reached when the observations are spread evenly across them. For a variable with k categories this maximum is 1 − 1/k. With two categories, each holding 50% of the observations, the maximum variability is 0.5; with four categories, each holding 25% of the observations, it is 0.75. Note that the number of categories affects only the maximum variability that can be achieved for a given variable. The minimum is always zero and represents the absence of variability, which gives us certainty in decision-making. This is the situation when all observations belong to a single category: if we used such a distribution to predict which category an observation belongs to, we would be right 100% of the time.
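The dependence of the maximum on the number of categories can be checked numerically. The sketch below assumes the form G = 1 − Σ p_i² and compares the closed-form maximum 1 − 1/k with the index of an evenly spread distribution; both function names are illustrative.

```python
def gini_from_probabilities(probs):
    """Gini index G = 1 - sum(p_i^2) for a list of category probabilities."""
    return 1.0 - sum(p * p for p in probs)


def max_gini(k):
    """Largest Gini index attainable with k categories (even spread)."""
    if k < 1:
        raise ValueError("k must be a positive integer")
    return 1.0 - 1.0 / k


if __name__ == "__main__":
    for k in (2, 4, 10):
        uniform = [1.0 / k] * k  # every category holds the same share
        print(k, round(max_gini(k), 4), round(gini_from_probabilities(uniform), 4))
```

For k = 2 and k = 4 both columns give 0.5 and 0.75 respectively, matching the values quoted in the text.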
To explain, we will use the example of the variable gender having two categories – female and male. When analysing the variability, we will use the percentage of people in each category.
Table 1. Analysis of variation for a variable with two categories
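A two-category analysis of this kind could be sketched as follows. The percentage splits used here are illustrative assumptions, not necessarily the rows of Table 1, and the function name is hypothetical.

```python
def gini_two_categories(share_female):
    """Gini index for a two-category variable, given the share of one category."""
    if not 0.0 <= share_female <= 1.0:
        raise ValueError("share must be between 0 and 1")
    share_male = 1.0 - share_female
    return 1.0 - (share_female ** 2 + share_male ** 2)


if __name__ == "__main__":
    # Illustrative splits, from perfectly balanced to fully concentrated.
    for pct in (50, 60, 70, 80, 90, 100):
        gini = gini_two_categories(pct / 100)
        print(f"{pct}% / {100 - pct}%: Gini = {gini:.2f}")
```

The index falls from 0.50 for a 50/50 split to 0 when every observation is in one category, so the more unequal the split, the lower the variability.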
GINI INDEX FOR QUALITATIVE VARIABLES IN PS IMAGO PRO
Let us look at an example of the Gini index as calculated by the Data Audit procedure in PS IMAGO PRO. The procedure reports the value of the Gini index and the percentage of the maximum possible Gini index that this value represents for the analysed variable (Gini versus maximum value – Table 3). Keep in mind that the maximum is not constant: it depends on the number of categories of the analysed characteristic. Let us look at the distribution of a variable recording the completed field of study of the respondents to a certain survey.
Table 2. Distribution of the variable field of study
The variable has four categories, so the maximum value of the Gini index is 1 − 1/4 = 0.75.
Recall that the minimum value of the index is 0, which occurs when there is no variability, e.g., when all respondents state that they graduated from the Faculty of Law.
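The two quantities discussed here, the Gini index and its value relative to the maximum, can be sketched from category counts as below. This is only an illustration of the arithmetic, not PS IMAGO PRO's implementation; the function name, the category labels other than Law, and the counts are all hypothetical and are not the data from Table 2.

```python
def gini_vs_max(counts):
    """Return (gini, gini as a percentage of the maximum for k categories).

    `counts` maps each category label to its number of observations.
    Categories with a zero count still count toward k.
    """
    k = len(counts)
    n = sum(counts.values())
    if k < 2 or n == 0:
        raise ValueError("need at least two categories and one observation")
    gini = 1.0 - sum((c / n) ** 2 for c in counts.values())
    maximum = 1.0 - 1.0 / k
    return gini, 100.0 * gini / maximum


if __name__ == "__main__":
    # Hypothetical counts for a four-category field-of-study variable.
    counts = {"Law": 40, "Field B": 30, "Field C": 20, "Field D": 10}
    gini, pct_of_max = gini_vs_max(counts)
    print(f"Gini = {gini:.2f}, {pct_of_max:.1f}% of the maximum")
```

For these hypothetical counts the index is about 0.70, or roughly 93% of the 0.75 maximum for four categories. If every respondent fell into a single category, both values would be 0.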
The table below shows the value of the Gini index and the Gini compared to the maximum value – that is, the percentage of maximum variability possible for this variable.