Modeling literacy—Analytics

Obtaining and preparing data
Workflow using ArcGIS Desktop
Workflow using ArcGIS Pro
References and resources for learning more

Improving literacy, for both boys and girls, is one of the most important objectives we can have as a society because of its positive impacts on health, wealth, productivity, and quality of life. Identifying the factors that contribute to low literacy rates is an important first step toward remediation. Poverty and poor health prevent children from attending school, of course, but access to schools, and cultural biases relating to gender, are also important components of literacy. Recognizing that each of these factors has a geography is also essential; child health might be the key factor explaining literacy in one country, but not the most important factor in another country.

Let's look more closely at literacy rates across the continent of Africa. A workflow for obtaining and analyzing country-level statistical data is given below.

What data is needed?

We begin our analysis by downloading the United Nations MDG data, selecting as many variables as possible that might relate to literacy including child hunger, mortality rates, gender parity patterns, and primary school enrollment rates.

The map below shows data availability for ten of those variables, all collected between 1990 and 2015. Notice that complete data for these ten variables would constitute 260 pieces of information for each country (ten variables for each of 26 years). Not surprisingly, no country has a complete set of data; the map shows where data is rich, sparse, and missing altogether.

Having data for each country is important, of course, but so is having current data. The maps below focus on the literacy rate variable only, showing where data is scarce and also where it is out-of-date. Notice that both Algeria and Sudan have very little data and, unfortunately, the data they do have is old. Other countries, like Angola, have very little data, but at least the data is current.

Before beginning analysis, it is very important to make every effort to fill in missing data with current and reliable data. When adding or updating data, be sure to keep track of where each piece of data came from.

Our efforts to find current and reliable data were not entirely successful. For some countries we couldn't find data across key variables and, consequently, we had to exclude Mayotte, Reunion Island, Sudan, and Western Sahara from the analyses below. In addition, incomplete data across many countries for particular variables—several poverty indicators, for example—made it necessary to exclude those variables.

For demonstration purposes we have limited the variables included in the data package provided above to the literacy rate variable and several other variables that were consistently significant for modeling literacy rates:


Variable	Description
Literacy rate	The percentage of people age 15 to 24 who can read and write
Adolescent birth rate	The number of girls age 15 to 19, per 1,000 girls in that age group, who give birth
Child hunger	The percentage of children under the age of 5 who are moderately to severely underweight
Primary school enrollment	The percentage of children, primary school age, enrolled in primary education; this is the reported net ratio, rather than the gross ratio
Child mortality	The number of children, per 1,000 live births, who die before the age of 5
Gender parity index	The ratio of girls to boys enrolled in primary school education; 1.0 reflects perfect balance; indices smaller than 1.0 indicate higher enrollment for boys than for girls
Women in government	The percentage of seats in the national parliament held by women

Where is literacy lowest and highest?

Based on the most up-to-date literacy rate data available for each country, the map below shows the spatial distribution of literacy across Africa. For the countries shown with the lightest green color, less than 45 percent of the people age 15 to 24 can read and write. In Somalia, according to 2015 data provided by Afrol News, more than 80 percent of the population is illiterate. In contrast, almost 100 percent of the people age 15 to 24 are literate in Libya and South Africa.

Where are literacy rates rising?

By subtracting the earliest literacy rate values (occurring sometime between 1990 and 2015) from the most up-to-date literacy rate values, we get a sense for how literacy rates are changing. In the map below, countries symbolized using pink have seen literacy rate declines. The Central African Republic has experienced the sharpest declines (a 24 percent decrease). The largest literacy rate increases are found in Burundi and Chad (35 and 32 percent increase, respectively).
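The subtraction described above can be sketched in a few lines of Python. This is only an illustration: the function name `literacy_change` and the sample values are hypothetical, and the sketch assumes missing years are represented as `None`.

```python
def literacy_change(yearly_rates):
    """Return the latest reported value minus the earliest reported value.

    yearly_rates is a sequence ordered by year; None marks a missing year.
    Returns None when fewer than two values were reported.
    """
    reported = [rate for rate in yearly_rates if rate is not None]
    if len(reported) < 2:
        return None  # a change cannot be computed from zero or one observation
    return reported[-1] - reported[0]


# Hypothetical series for one country (illustrative values, not from the dataset).
example = [None, 52.0, None, None, 60.5, None, 71.0]
print(literacy_change(example))  # 19.0
```

A negative result would correspond to the countries symbolized in pink.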

It is almost impossible to look at a map like the one above and not wonder what is going on. Why are there locations where literacy rates are declining? Why is literacy high in some countries but low in others? Are low literacy rates the result of health problems, economic issues, or cultural practices?

What factors contribute to low literacy rates?

To get a better understanding of the factors that are contributing to high or low literacy rates, let's look at how the literacy rate values correlate with other variables, such as school enrollment and health. We will use the Spearman Rank-Order Correlation statistic to measure relationship strength (r_s) and significance (p-value).

There isn't presently a tool in ArcGIS for running the Spearman Rank-Order Correlation statistic. There are several options for accessing analytical functionality outside of ArcGIS, however, including the following:

Use the Export Feature Attributes to ASCII tool to create a comma-delimited text file for use with statistical packages like SAS or SPSS. Most software packages will read comma-delimited text files.
Tap into R functionality. R has a large number of statistical functions. You can create custom R tools and execute them in ArcGIS. For more information on this, see the examples posted on the R - ArcGIS Community page.
The easiest option is to access SciPy analytical methods from the ArcGIS Python window. This option is detailed below in the ArcMap and ArcGIS Pro workflows.

The table below shows the results of the Spearman Rank-Order Correlation analysis. They indicate literacy rates are most strongly correlated with child health (child mortality rate), child hunger (percentage of underweight children), gender parity, and primary school enrollment.


Variable	Relationship Strength (r_s)	Relationship Direction	Statistical Significance
Child mortality rate	-0.74	negative	Very high, p < 0.001
Child hunger (underweight)	-0.72	negative	Very high, p < 0.001
Gender parity	0.67	positive	Very high, p < 0.001
Primary school enrollment	0.67	positive	Very high, p < 0.001
Adolescent birth rate	-0.58	negative	Very high, p < 0.001
Women in government seats (percent)	0.26	positive	Significant, p < 0.1

The Spearman r_s values indicate how strong each relationship (correlation) is; r_s ranges from -1.00 to +1.00, and its magnitude, from 0.00 to 1.00, measures strength. The sign (+/-) associated with each r_s value indicates whether the relationship is positive or negative. The p-value associated with each r_s value indicates the probability of seeing a correlation at least this strong if there were actually no relationship. A very small p-value means the observed correlation is very unlikely to be the product of chance alone. Notice that the relationship between literacy rates and primary school enrollment is strong and positive (r_s = +0.67, p < 0.001). As we would expect, locations with higher primary school enrollment rates also have higher literacy rates. The gender parity variable also has a strong, positive relationship with literacy rates (r_s = +0.67, p < 0.001). As the number of boys and girls in primary schools becomes more balanced, literacy rates increase.
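To make the statistic less of a black box, here is a minimal plain-Python sketch of how r_s can be computed: rank each variable (tied values share the average of their positions), then take the Pearson correlation of the ranks. The function names and sample values are hypothetical, and the sketch does not compute a p-value; in the workflows below, SciPy does that work.

```python
import math


def average_ranks(values):
    """Rank values from 1 to n; tied values receive the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2.0 + 1.0
        for position in range(start, end + 1):
            ranks[order[position]] = mean_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman_rs(x, y):
    """Spearman rank-order correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical values for five countries (illustrative only).
enrollment = [35.0, 50.0, 62.0, 80.0, 91.0]
literacy = [30.0, 48.0, 45.0, 77.0, 95.0]
print(round(spearman_rs(enrollment, literacy), 2))  # 0.9
```

Because only the ranks matter, the statistic measures how consistently one variable rises or falls with the other, not whether the relationship is a straight line.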

Positive and negative data relationships

Some variables have a negative correlation with literacy rates. As child hunger (the proportion of underweight children) increases, for example, literacy rates decrease. Similarly, as adolescent birth rates increase, literacy rates decrease.

Simple correlations, like these, are a good place to start when you want to understand variable relationships. A quick web search, however, using spurious correlations for your search term, will reveal the problem with relying on correlations alone to make important decisions (like how and where to allocate scarce resources aimed at improving literacy rates across Africa). We can really only trust these relationships if we find a properly specified model for the literacy rate values.

How can you find a properly specified model?

We will try using Ordinary Least Squares regression (OLS). Regression analysis is a statistical method that allows you to estimate relationships among variables. We will see if it can identify key factors contributing to high and low literacy rates. If it can, we will use what we learn to recommend programs for improving literacy.

We begin with Exploratory Regression. Exploratory regression tries every possible combination of the variables you provide, looking for properly specified models that satisfy all of the requirements of the OLS regression method. A properly specified OLS model has several characteristics:

The coefficients associated with each of the explanatory variables are statistically significant (meaning all of the variables in the model are truly helping to explain the literacy rate values).
The relationship between each explanatory variable and the dependent variable is justifiable. If the coefficient for the child mortality variable were positive, for example, it would indicate both child mortality and literacy increase together, which does not make sense; we are expecting a negative relationship here.
The under- and overpredictions (called residuals) from the model are normally distributed and random, indicating your model is free from bias and isn't missing any key explanatory variables.

There is good news: the Exploratory Regression tool does, in fact, find a couple of models that meet all of the requirements of the OLS method. The best properly specified model comprises three explanatory variables: adolescent birth rates, primary school enrollment, and primary school gender parity. This model explains almost 68 percent of the variation in literacy rate values across Africa (Adj R² = 0.677).
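For readers unfamiliar with the adjusted R² diagnostic, the sketch below shows the standard textbook adjustment, which penalizes R² for each explanatory variable added. It is assumed here, not confirmed by this case study, that the tool reports the same quantity; the function name and the input values are hypothetical.

```python
def adjusted_r2(r2, n_observations, n_explanatory):
    """Textbook adjusted R-squared: 1 - (1 - R2) * (n - 1) / (n - k - 1)."""
    n = n_observations
    k = n_explanatory
    if n - k - 1 <= 0:
        raise ValueError("need more observations than explanatory variables plus one")
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)


# Hypothetical inputs: 50 countries, 3 explanatory variables, R2 of 0.70.
print(adjusted_r2(0.70, 50, 3))
```

The adjusted value is never larger than the unadjusted one when at least one explanatory variable is present, which is why it is the fairer number for comparing models of different sizes.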

Finding a properly specified model gives us confidence that any investments made to reduce adolescent birth rates, or to increase primary school enrollment, particularly for girls, will have a positive impact on literacy.

Where should we invest in these programs?

We could invest in all of these programs equally across the entire continent. This would be costly, however. Alternatively, we could identify where the need for each type of program is greatest and focus our investments accordingly. This is an excellent plan, of course, but to ensure investments will positively impact literacy rates as well, we can examine the relationships between each explanatory variable (adolescent birth rates, primary school enrollment, and gender parity) and literacy.

We will use the Grouping Analysis tool to show us where the low literacy rates correspond to the factors contributing to low literacy. Grouping Analysis uses a K-means clustering algorithm to partition countries into groups so the countries in the same group are as similar as possible and the groups themselves are as different as possible. We will use Grouping Analysis on each of our model explanatory variables separately.
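The partitioning idea can be sketched in a few lines of plain Python. This is a minimal K-means illustration, not the Grouping Analysis implementation: the function name `kmeans`, the use of the first k points as starting centers, and the sample values are all assumptions made for the example, and the variables are not standardized here, which you would normally do when they are measured on different scales.

```python
def kmeans(points, k, max_iter=100):
    """Partition points (tuples of floats) into k groups; return one label per point."""
    # Assumption: the first k points serve as the starting centers.
    centers = [tuple(p) for p in points[:k]]
    labels = [None] * len(points)
    for _ in range(max_iter):
        # Assign every point to its nearest center (squared Euclidean distance).
        new_labels = [
            min(range(k), key=lambda g: sum((a - b) ** 2 for a, b in zip(p, centers[g])))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so the solution has converged
        labels = new_labels
        # Move each center to the mean of the points assigned to it.
        for g in range(k):
            members = [p for p, lab in zip(points, labels) if lab == g]
            if members:  # keep the old center if a group ends up empty
                centers[g] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels


# Hypothetical (literacy rate, adolescent birth rate) pairs for six countries.
countries = [(30.0, 190.0), (95.0, 40.0), (35.0, 170.0),
             (90.0, 55.0), (40.0, 180.0), (88.0, 35.0)]
print(kmeans(countries, 2))  # [0, 1, 0, 1, 0, 1]
```

In this toy data, group 0 collects the low-literacy, high-birth-rate countries, which mirrors the kind of group the analysis is looking for.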

Adolescent birth rates

High adolescent birth rates are associated with a number of problems beyond low literacy, including poverty, high maternal death rates, HIV, dependency, abuse, and violence. The Grouping Analysis tool finds three distinct groups for the Adolescent Birth Rate and Literacy Rate variables across the continent of Africa. Countries in the green group (see the map below) have both low literacy rates and high adolescent birth rates. Niger, Mali, Chad, and the other countries in the green group would be good locations to begin a rollout of programs designed to protect young girls from having children while they are still children themselves. Encouraging families and communities to allow girls to attend school and to stay in school is an important remediation strategy that also relates to the next explanatory variable: Gender Parity.

Gender parity

Differences in the proportion of boys versus girls enrolled in primary school education are also an important factor impacting literacy rates. Grouping Analysis finds two distinct groups for the gender parity and literacy rate variables. The countries in the red group have the fewest girls attending school and the lowest literacy rates. Investing in programs that address gender bias and result in more girls attending school, especially in Somalia, Chad, Guinea, and Ethiopia, where the proportion of girls enrolled in school is lowest, will likely have the biggest impacts on literacy. Programs that encourage girls to attend and complete school will also boost primary school enrollment rates, the next explanatory variable in our model.

Primary school enrollment rates

Adolescent pregnancy and gender bias will certainly have a negative impact on primary school enrollment rates. Physical access to school facilities, family economics, and a country's political will, however, are also factors that impact whether or not a child attends school. Grouping Analysis finds two distinct groups for the primary school enrollment and literacy rate variables; in the red group, both literacy rates and school enrollment rates are low. Countries in the red group, including Somalia and Niger, where enrollment rates are especially low, will likely benefit most from programs to increase primary school enrollment rates.

Final thoughts

This case study focused on literacy rates across Africa, but similar workflows could be used to examine poverty, infant mortality, and other country-level statistical indicators. Exploratory Regression was used to discover the key explanatory variables contributing to literacy rates. Grouping Analysis was used to suggest a variety of targeted remediation programs directed at countries where these programs are likely to see their greatest successes.

Obtaining and preparing data

This section outlines the steps needed to download and prepare data for analysis. Because the prepared data is provided in the data package at the beginning of this case study, you do not need to perform the steps in this section in order to complete the workflows below. You may still want to read or work through these instructions, however, to get an understanding of what is required in case you ever need to work with CSV-formatted data.

Download the data as CSV files from the United Nations Millennium Development Goals data portal. It seems to work best to create a separate CSV file for every indicator you are interested in.
You will notice a number of things that need to be corrected in Excel before you can bring the CSV tables into ArcMap.
1. Delete the fields that are not needed for analysis (SeriesCode, MDG, all the fields named Footnotes, and all fields named Type, for example).
2. Field names cannot begin with a number, so change all the year fields, such as 1991, 1992, 1993, and so on, to Y1991, Y1992, Y1993, and up to Y2015.
3. Delete all the footnotes at the bottom of the table (you will find these following the last country record).
4. When ArcGIS brings in a table, it uses the first few records to determine if the data is text or numeric. To ensure the year fields are all created as numbers (especially when there are so many null values), insert a row at the top of each table with data reflecting the appropriate data type. You may delete this row after the table is in ArcGIS.
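If you would rather script steps 1 through 3 than edit each file in Excel, the following Python sketch shows one way to do it with the standard `csv` module. It rests on several assumptions: the `Country` column name and the sample rows are hypothetical, fields to drop are recognized by the names listed in step 1, and any line with fewer values than the header is treated as a footnote or blank line. Step 4 (the temporary row that forces numeric field types) is not covered.

```python
import csv
import io

DROP_NAMES = {"SeriesCode", "MDG"}        # exact field names to drop (step 1)
DROP_PREFIXES = ("Footnotes", "Type")     # repeated field names to drop (step 1)


def clean_header(name):
    """Prefix all-digit field names such as 1991 with Y (step 2)."""
    name = name.strip()
    return "Y" + name if name.isdigit() else name


def clean_rows(reader):
    """Yield the cleaned header row followed by the cleaned data rows."""
    header = next(reader)
    keep = [i for i, name in enumerate(header)
            if name.strip() not in DROP_NAMES
            and not name.strip().startswith(DROP_PREFIXES)]
    yield [clean_header(header[i]) for i in keep]
    for row in reader:
        if len(row) < len(header):
            continue  # assumed to be a blank line or a footnote (step 3)
        yield [row[i] for i in keep]


# Hypothetical excerpt of a downloaded file (values are illustrative only).
sample = io.StringIO(
    "CountryCode,Country,SeriesCode,MDG,1990,Footnotes,Type,1991,Footnotes,Type\n"
    "12,Algeria,656,Y,,,,73.9,1,C\n"
    "\n"
    "Footnotes\n"
    "1,Illustrative footnote text\n"
)
cleaned = list(clean_rows(csv.reader(sample)))
print(cleaned[0])  # ['CountryCode', 'Country', 'Y1990', 'Y1991']
```

To process a real file, pass `csv.reader` an opened file instead of the `io.StringIO` sample and write the yielded rows out with `csv.writer`.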

Once the CSV files are in ArcGIS, use the Copy Rows tool to convert them to ArcGIS tables.

Workflow using ArcGIS Desktop

Determine the most current literacy rate for each country

If you haven't done so already, download and unzip the data package provided at the top of this case study.
Double-click the ModelingLiteracy.mpk map package to open it.
Right-click the Literacy layer and open the table. Notice the many Null values.
You will be looking for a model that explains literacy rates. The variable you will be modeling will be the most up-to-date literacy rate available for each country. To pull this value out from all of the null values, you will use the Calculate Field tool with some Python statements.
Begin by using the Search pane to locate and open the Add Field tool.
Add a new field to hold the most current literacy rate value using the parameters below:
- Input Table: Literacy
- Field Name: LastLiteracy
- Field Type: FLOAT
- Field Alias: Most current literacy rate
- Field is Nullable: Yes
Set the value of this new field to be the last non-Null literacy rate for each country. To do this, find and open the Calculate Field tool and run it with the following parameters:
- Input Table: Literacy
- Field Name: LastLiteracy
- Expression: getLast( !Y1990!, !Y1991!, !Y1992!, !Y1993!, !Y1994!, !Y1995!, !Y1996!, !Y1997!, !Y1998!, !Y1999!, !Y2000!, !Y2001!, !Y2002!, !Y2003!, !Y2004!, !Y2005!, !Y2006!, !Y2007!, !Y2008!, !Y2009!, !Y2010!, !Y2011!, !Y2012!, !Y2013!, !Y2014!, !Y2015!)
- Code Block:
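```
def getLast(*allYears):
    # Keep only the values Python treats as true; this drops Nulls (None).
    notNull = list(filter(None,allYears))
    # The -1 index returns the last remaining value: the most recent year with data.
    return notNull[-1]
```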

Python statements with the Calculate Field tool

The Expression parameter tells the Calculate Field tool to get all the literacy rate values from the fields Y1990, Y1991, Y1992, and so on, and to pass those values to a function called getLast.

The Code Block defines the getLast function. Let's look at the Python code line by line:


Python Statement	What it does
def getLast(*allYears):	Indicates you want to define (def) a new function called getLast that will use the sequence of values passed to it from the Expression parameter.
notNull = list(filter(None,allYears))	This line, indented four spaces, tells Python to put all of the non-Null values into a list called notNull. Because filter(None, ...) discards every value Python treats as false, a value of 0 would be dropped along with the Nulls.
return notNull[-1]	This line, indented four spaces, instructs Python to set the LastLiteracy field value to be the final value in the notNull list. The -1 index refers to the last item in the list.
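The getLast function fails with an error if every year is Null for a record, and it would skip a legitimate value of 0. Should either case matter for your data, a more defensive variant might look like the sketch below. The name `get_last` is hypothetical, and the sketch assumes that returning None from the code block leaves a nullable field empty.

```python
def get_last(*all_years):
    """Return the last value that is not None, or None if every year is missing."""
    # Test explicitly for None so that a real value of 0 is kept.
    not_null = [value for value in all_years if value is not None]
    # An empty list means no year had data; return None instead of raising an error.
    return not_null[-1] if not_null else None


print(get_last(None, 61.2, None, 70.4, None))  # 70.4
print(get_last(None, None))                    # None
```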

Compute mean values for each variable

You now have the most current literacy rate value for each country. What are the factors that might impact or explain those literacy rates? We suspect they are a function of several things including child health, hunger, school enrollment, and gender biases. Later in this workflow you will use these variables as your candidate explanatory variables to see if you can find a properly specified regression model for the literacy rate values.

The data you have for the candidate explanatory variables were collected between 1990 and 2015. Using all of the data available, you will compute a mean value for each variable.

Begin by adding fields to hold the computed mean values. Find and open the Add Field tool. In this and the next step, you will add a mean value field to each of the explanatory variable datasets. Start with the AdolescentBirthRate feature class by running the Add Field tool using the following parameters:
- Input Table: AdolescentBirthRate
- Field Name: MeanAdBirthRt
- Field Type: FLOAT
- Field Alias: Mean Adolescent Birth Rate
- Field is Nullable: Yes
  Tip:
  Every time you run a tool, it is recorded in the Results window. Double-clicking on a tool entry in the Results window opens the dialog with the parameters filled out. If you need to run a tool several times with slightly different parameters, it is usually quicker to access the tool from the Results window and modify the parameters as needed.
Open the Add Field tool from the Results window and use it to add a mean value field to the remaining explanatory variable datasets. Use the parameters shown below:


Input Table	Field Name	Field Type	Field Alias
ChildHunger	MeanHunger	FLOAT	Mean Child Hunger Rate
SchoolEnrollment	MeanSchEnroll	FLOAT	Mean Primary School Enrollment Rate
ChildMortality	MeanMortality	FLOAT	Mean Child Mortality Rate
GenderParity	MeanParity	FLOAT	Mean Gender Parity Index
WomenInGovSeats	MeanWmGovSeats	FLOAT	Mean % Gov Seats Held by Women

Once the new fields have been created, you will use the Calculate Field tool to compute the mean values for each dataset. (Opening the Calculate Field tool from the Results window will save you some typing.) To compute the mean value for the adolescent birth rate dataset, for example, use the parameters below:
- Input Table: AdolescentBirthRate
- Field Name: MeanAdBirthRt
- Expression: getMean( !Y1990!, !Y1991!, !Y1992!, !Y1993!, !Y1994!, !Y1995!, !Y1996!, !Y1997!, !Y1998!, !Y1999!, !Y2000!, !Y2001!, !Y2002!, !Y2003!, !Y2004!, !Y2005!, !Y2006!, !Y2007!, !Y2008!, !Y2009!, !Y2010!, !Y2011!, !Y2012!, !Y2013!, !Y2014!, !Y2015!)
- Code Block:
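```
def getMean(*allYears):
    # Keep only the values Python treats as true; this drops Nulls (None).
    notNull = list(filter(None,allYears))
    # Sum and count the remaining values.
    theSum = sum(notNull)
    theCnt = len(notNull)
    # The mean is the sum divided by the count.
    return theSum/theCnt
```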

The Code Block defines the getMean function. This is what each Python statement does:


Python Statement	What it does
def getMean(*allYears):	Indicates you want to define a new function called getMean that will use the sequence of values passed to it from the Expression parameter.
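notNull = list(filter(None,allYears))	This line, indented four spaces, tells Python to put all of the non-Null values into a list called notNull. Because filter(None, ...) discards every value Python treats as false, a value of 0 would be dropped along with the Nulls.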
theSum = sum(notNull)	This line, indented four spaces, instructs Python to sum the non-Null values.
theCnt = len(notNull)	This line, indented four spaces, instructs Python to count the non-Null values.
return theSum/theCnt	This line, indented four spaces, sets the value of the Field Name provided to be the mean (the sum divided by the count).
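For variables where 0 is a meaningful value (a parliament with no seats held by women, for example), or where a country might have no data at all, a more defensive variant of getMean could look like the following sketch. The name `get_mean` is hypothetical, and the sketch assumes that returning None from the code block leaves a nullable field empty.

```python
def get_mean(*all_years):
    """Return the mean of the values that are not None, or None if there are none."""
    # Test explicitly for None so zeros count toward the mean; float() also avoids
    # integer division under Python 2, which ArcMap uses.
    not_null = [float(value) for value in all_years if value is not None]
    if not not_null:
        return None  # no data for any year, so there is no mean to report
    return sum(not_null) / len(not_null)


print(get_mean(None, 0.0, 10.0, None))  # 5.0
print(get_mean(None, None))             # None
```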

Open the Calculate Field tool from the Results window and use it to compute the mean values for the remaining explanatory variable datasets (ChildHunger, GenderParity, ChildMortality, SchoolEnrollment, and WomenInGovSeats). When you open the Calculate Field tool from the Results window, you will only need to change the Input Table and Field Name parameter values. The Expression and Code Block parameters remain the same.

Consolidate all of the data into a single feature class

The ArcGIS modeling tools require the literacy rate field (your dependent variable) and all of the mean values (your candidate explanatory variables) to be in the same feature class. You will use the Join Field tool to consolidate.

Find and open the Join Field tool and use it to add the last literacy rate and mean value fields to the AfricaData feature class. To add the last literacy rate field, for example, use the following parameters:
- Input Table: AfricaData
- Input Join Field: CountryCode
- Join Table: Literacy
- Output Join Field: CountryCode
- Join Fields: LastLiteracy
Similarly, to add the MeanAdBirthRt field, use the following parameters:
- Input Table: AfricaData
- Input Join Field: CountryCode
- Join Table: AdolescentBirthRate
- Output Join Field: CountryCode
- Join Fields: MeanAdBirthRt
Continue to use the Join Field tool until all of the Mean value fields are in the AfricaData feature class.

Calculate correlations

You can get a sense for the relationship between the dependent variable (the most up-to-date literacy rate for each country) and each mean explanatory variable using the Spearman Rank-Order Correlation statistic. This tool is not presently in ArcGIS, but you can easily run it from the Python command window by importing SciPy.

In ArcMap, open the Python command window and type the following:

>>> import scipy.stats as stat
>>> dataArray = arcpy.da.FeatureClassToNumPyArray("AfricaData",("LastLiteracy", "MeanAdBirthRt","MeanHunger","MeanSchEnroll","MeanMortality","MeanParity","MeanWmGovSeats"))
>>> rs, pval = stat.spearmanr(dataArray["LastLiteracy"],dataArray["MeanAdBirthRt"])
>>> print ("rs = %f, p = %f" % (rs,pval))

The result will be displayed as rs = -0.583488, p = 0.000017.

The result indicates a negative and significant relationship. The magnitude of r_s from the Spearman Rank-Order Correlation statistic ranges from 0.0 to 1.0, so 0.58 is not an exceptionally strong correlation, but it is statistically significant (a correlation this strong would be very unlikely if there were no relationship, p = 0.000017). The negative sign (-0.58) indicates a negative relationship between adolescent birth rates and the literacy rates; as adolescent birth rates go up, literacy goes down.

To see the correlations for the other explanatory variables, type the following in the Python command window.

>>> rs, pval = stat.spearmanr(dataArray["LastLiteracy"],dataArray["MeanHunger"])
>>> print ("rs = %f, p = %f" % (rs,pval))
rs = -0.715657, p = 0.000000
>>> rs, pval = stat.spearmanr(dataArray["LastLiteracy"],dataArray["MeanSchEnroll"])
>>> print ("rs = %f, p = %f" % (rs,pval))
rs = 0.674144, p = 0.000000
>>> rs, pval = stat.spearmanr(dataArray["LastLiteracy"],dataArray["MeanMortality"])
>>> print ("rs = %f, p = %f" % (rs,pval))
rs = -0.741906, p = 0.000000
>>> rs, pval = stat.spearmanr(dataArray["LastLiteracy"],dataArray["MeanParity"])
>>> print ("rs = %f, p = %f" % (rs,pval))
rs = 0.672988, p = 0.000000
>>> rs, pval = stat.spearmanr(dataArray["LastLiteracy"],dataArray["MeanWmGovSeats"])
>>> print ("rs = %f, p = %f" % (rs,pval))
rs = 0.257401, p = 0.080690

Find a properly specified regression model

Looking at correlations is helpful, but you cannot fully trust these relationships unless you find a properly specified model. You will use the Exploratory Regression tool to see if you have any properly specified models among all of the candidate explanatory variables.

So that you will see all of the messages displayed by the Exploratory Regression tool, disable background processing. Do this by clicking on the Geoprocessing menu and selecting Geoprocessing Options. Uncheck the Enable box for Background Processing.

Find and open the Exploratory Regression tool. Run the tool with the following parameters (accept the default values for all other parameters):
- Input Features: AfricaData
- Dependent Variable: LastLiteracy
- Candidate Explanatory Variables: MeanAdBirthRt; MeanHunger; MeanSchEnroll; MeanMortality; MeanParity; MeanWmGovSeats

Tool documentation provides a full explanation of each section of the Exploratory Regression Analysis report displayed during tool execution. Let's focus on the first part of the report. Here, Exploratory Regression tries all possible combinations of one-, two-, three-, four-, and five-variable models.

The tool reports models with the highest adjusted R² values first. R² values range from 0.0 to 1.0 and this diagnostic tells you how much of the variation in the literacy rate values has been explained by the model. Notice that the MeanParity variable explains 54 percent of the literacy rate variation. Any passing models found are listed after the models with the highest adjusted R² values.

Notice that a model with only the MeanParity variable does pass; in other words, it meets all of the requirements of the OLS method. Since it only tells 54 percent of the literacy rate story, however, we will look for better results among the models using two or more explanatory variables.

Notice that the tool does, in fact, find a number of two-variable and three-variable passing models. In addition, notice that there are no passing four- or five-variable models.

The best model of literacy rates is a function of adolescent birth rates (MEANADBIRTHRT), school enrollment (MEANSCHENROLL), and mean gender parity (MEANPARITY). This model is best because it has the highest adjusted R² value and the lowest AICc value.

Identify where remediation might be most effective

While you could use the Select Layer by Attribute tool to find the countries where low literacy overlaps with each of the explanatory variables in your model, you would need to make a decision about what constitutes low literacy. (Is 35.4 percent low? What about 36.1 percent? Where is the threshold?) Similarly, you would need to identify threshold values for each of the explanatory variables (adolescent birth rates, school enrollment, and the gender parity index). Certainly, this is a reasonable thing to do. An alternative, however, is to use the Grouping Analysis tool to identify these threshold values for you. Grouping Analysis will optimize the within-group similarity and the between-group differences.

Find and open the Grouping Analysis tool. Run the tool with the following parameters. The first time you run the tool, you will let it identify the optimal number of groups. The second time you will create the report file.
- Input Features: AfricaData
- Unique ID Field: CountryCode
- Output Feature Class: the name of your output feature class such as FindGroups
- Number of Groups: 2
- Analysis Fields: LastLiteracy; MeanAdBirthRt
- Spatial Constraints: NO_SPATIAL_CONSTRAINT
- Initialization Method: FIND_SEED_LOCATIONS
- Evaluate Optimal Number of Groups: Yes

Grouping Analysis will first try partitioning the countries into two groups, then three, then four, up to fifteen groups. It will calculate the Calinski-Harabasz pseudo F-statistic to measure the effectiveness of each solution. For the analysis above, Grouping Analysis finds optimal homogeneity within each group and maximum differentiation among the groups when there is a total of three groups. There is a random component in how Grouping Analysis works, so your output may not be identical to the output below.
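The pseudo F-statistic is a ratio of between-group separation to within-group spread, so larger values indicate a better partition. The plain-Python sketch below follows the usual definition of the Calinski-Harabasz index; the function name and the sample values are hypothetical, and it is not the tool's own code.

```python
def calinski_harabasz(points, labels):
    """Pseudo F-statistic: (between-group SS / (k - 1)) / (within-group SS / (n - k))."""
    n = len(points)
    groups = sorted(set(labels))
    k = len(groups)
    dims = len(points[0])
    overall = [sum(p[d] for p in points) / n for d in range(dims)]
    between = 0.0
    within = 0.0
    for g in groups:
        members = [p for p, lab in zip(points, labels) if lab == g]
        center = [sum(p[d] for p in members) / len(members) for d in range(dims)]
        # Between-group: squared distance from group center to overall center, weighted by size.
        between += len(members) * sum((c - o) ** 2 for c, o in zip(center, overall))
        # Within-group: squared distance from each member to its own group center.
        within += sum(sum((p[d] - center[d]) ** 2 for d in range(dims)) for p in members)
    return (between / (k - 1)) / (within / (n - k))


# Hypothetical (literacy rate, adolescent birth rate) pairs for six countries.
countries = [(30.0, 190.0), (95.0, 40.0), (35.0, 170.0),
             (90.0, 55.0), (40.0, 180.0), (88.0, 35.0)]
sensible = [0, 1, 0, 1, 0, 1]    # similar countries grouped together
arbitrary = [0, 0, 0, 1, 1, 1]   # groups that mix dissimilar countries
print(calinski_harabasz(countries, sensible) > calinski_harabasz(countries, arbitrary))  # True
```

The sketch requires at least two groups, more points than groups, and some variation within the groups; otherwise the ratio is undefined.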

Each component of the tool output is explained in the tool documentation. The R² values, for example, indicate how effective each variable is at differentiating countries. A variable with an even distribution of values is not as effective as a variable with natural breaks.

Run Grouping Analysis again. This time, specify three groups (since the first run of the tool indicated three groups was optimal), create a report, and turn off the option to evaluate the optimal number of groups. With the overhead of creating the report, the tool may take several minutes to complete. Use the following parameters:
- Input Features: AfricaData
- Unique ID Field: CountryCode
- Output Feature Class: the name of your output feature class such as Grp3LiteracyAndAdBirthRt
- Number of Groups: 3
- Analysis Fields: LastLiteracy; MeanAdBirthRt
- Spatial Constraints: NO_SPATIAL_CONSTRAINT
- Initialization Method: FIND_SEED_LOCATIONS
- Output Report File: the name of your report file such as Grp3LitAdBirthRt.pdf
- Evaluate Optimal Number of Groups: No

Your map output will look similar to the map shown below. The groups will be the same, but the colors used to represent each group may be different. In other words, the green group below might be colored blue or red, but the same features will likely be together in each group.

Groups based on literacy and adolescent birth rates

To interpret the characteristics of each group, open the report file. You may either browse to the report on your hard disk or double-click the PDF in the Results window.

Each element of the report is explained in the tool documentation. Let's focus on the parallel box plot, which summarizes each group across all of the variables. Notice that the green group (green in the map above; the colors in your results may differ) is associated with the highest adolescent birth rates and lowest literacy rates. If programs to reduce adolescent birth rates cannot be implemented across the entire continent, it might make sense to begin in Niger, Mali, Chad, and the other countries in this group.

Run Grouping Analysis again, this time partitioning countries based on mean primary school enrollment and literacy rates. As before, you will run the tool once to identify the optimal number of groups and again to create the report. Note: creating the report will add several minutes to tool execution, so be sure to remove the entry for the Output Report File parameter until you are ready for it.
- Input Features: AfricaData
- Unique ID Field: CountryCode
- Output Feature Class: the name of your output feature class such as FindGroups
- Number of Groups: 2
- Analysis Fields: LastLiteracy; MeanSchEnroll
- Spatial Constraints: NO_SPATIAL_CONSTRAINT
- Initialization Method: FIND_SEED_LOCATIONS
- Evaluate Optimal Number of Groups: Yes

While there is a random component to the Grouping Analysis algorithm, if you run the Grouping Analysis tool repeatedly, you will see that the results are fairly consistent in indicating two groups are optimal.

Run the Grouping Analysis tool again to create the report file.
- Input Features: AfricaData
- Unique ID Field: CountryCode
- Output Feature Class: the name of your output feature class such as Grp2LiteracyAndSchEnroll
- Number of Groups: 2
- Analysis Fields: LastLiteracy; MeanSchEnroll
- Spatial Constraints: NO_SPATIAL_CONSTRAINT
- Initialization Method: FIND_SEED_LOCATIONS
- Output Report File: the name of your report file such as Grp2LitSchEnroll.pdf
- Evaluate Optimal Number of Groups: No

The colors for your results might be reversed, but the countries in each group should be the same.

Groups based on literacy and primary school enrollment

Open the report file and find the parallel box plot graph. Notice the clear distinction between the blue and red groups with regard to both primary school enrollment rates and literacy rates.
Open the table associated with the output result layer and sort, smallest to largest, on the LastLiteracy field to see that Somalia and Niger have the lowest literacy rates and very low primary school enrollment. Consequently, all of the countries in the red group, starting with Somalia and Niger, would benefit from programs aimed at increasing primary school enrollment rates.

For many countries in Africa, encouraging girls to go to school, and to stay in school, will also help to reduce adolescent birth rates and may provide remediation for the final variable in our model: gender bias. Let's use Grouping Analysis to examine differences in gender parity (the balance between boys and girls attending primary school) across Africa.

Run Grouping Analysis again for literacy and the gender parity index, without specifying a report file. Check the Evaluate Optimal Number of Groups parameter. Your output should look similar to the map below.

Again, the optimal number of groups is two. If you open the output table and sort, smallest to largest, on the MeanParity field, you will notice that Somalia and Chad are associated with the smallest indices; in these countries there are many more boys than girls attending primary school. Guinea and the Central African Republic also have low indices for gender parity in conjunction with very low literacy rates. Programs encouraging families to educate their daughters will likely have the biggest impacts in these countries.

Summary

You found a properly specified regression model indicating literacy rates are a function of adolescent birth rates, primary school enrollment rates, and gender parity. You also created maps showing which countries were associated with both low literacy and each of the contributing variables. You can use this information to suggest targeted remediation strategies aimed at increasing literacy across Africa.

Workflow using ArcGIS Pro

Determine the most current literacy rate for each country

If you haven't done so already, download and unzip the data package provided at the top of this case study.
Open ArcGIS Pro and browse to the ModelingLiteracy.ppkx project package.
Once the project opens, right-click the Literacy layer in the Contents pane and select Attribute Table. Notice the many Null values.

You will be looking for a model that explains literacy rates. The variable you will be modeling will be the most up-to-date literacy rate available for each country. To pull this value out from all of the null values, you will use the Calculate Field tool with some Python statements.

Begin by searching for the Add Field tool in the Geoprocessing pane.
Add a new field to hold the most current literacy rate value using the parameters below:
- Input Table: Literacy
- Field Name: LastLiteracy
- Field Type: Float
- Field Alias: Most current literacy rate
- Field is Nullable: Yes
Set the value of this new field to be the last non-Null literacy rate for each country. To do this, find and open the Calculate Field tool and run it with the following parameters:
- Input Table: Literacy
- Field Name: Most current literacy rate
- Expression: getLast( !Y1990!, !Y1991!, !Y1992!, !Y1993!, !Y1994!, !Y1995!, !Y1996!, !Y1997!, !Y1998!, !Y1999!, !Y2000!, !Y2001!, !Y2002!, !Y2003!, !Y2004!, !Y2005!, !Y2006!, !Y2007!, !Y2008!, !Y2009!, !Y2010!, !Y2011!, !Y2012!, !Y2013!, !Y2014!, !Y2015!)
- Code Block:
```
def getLast(*allYears):
    notNull = list(filter(None,allYears))
    return notNull[-1]
```

Using Python with the Calculate Field tool

The Expression parameter tells the Calculate Field tool to get all the literacy rate values from the fields Y1990, Y1991, Y1992, and so on, and to pass those values to a function called getLast.
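Typing 26 field names by hand invites typos. As a minimal sketch, assuming the year fields are named Y1990 through Y2015 exactly as shown above, the Expression string could be generated and then pasted into the tool. The helper name build_expression is hypothetical and is not part of ArcGIS.

```python
# Hypothetical helper (not part of ArcGIS): builds the Expression string
# for Calculate Field from a range of year fields named Y1990 ... Y2015.
def build_expression(func_name, first_year=1990, last_year=2015):
    # Each field is wrapped in exclamation points, as Calculate Field expects.
    fields = ", ".join("!Y{}!".format(year)
                       for year in range(first_year, last_year + 1))
    return "{}({})".format(func_name, fields)

expression = build_expression("getLast")
print(expression)
```

The same helper could produce the getMean expression by passing "getMean" as the function name.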

The Code Block defines the getLast function. Let's look at the Python code line by line:


Python Statement	What it does
def getLast(*allYears):	Indicates you want to define (def) a new function called getLast that will use the sequence of values passed to it from the Expression parameter.
notNull = list(filter(None,allYears))	This line, indented four spaces, indicates Python should put all of the non-Null values into a list called notNull. Note that filter(None, ...) drops every value Python treats as false, so a value of exactly 0 would also be dropped.
return notNull[-1]	This line, indented four spaces, instructs Python to set the LastLiteracy field value to be the final value in the notNull list. The -1 index tells Python to find the end of the list.
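The function above raises an IndexError for a country with no literacy values at all, because notNull is then empty. The following is a more defensive sketch, not the version used in this case study. It assumes that returning None from the code block leaves the (nullable) field as Null.

```python
def getLast(*allYears):
    # Keep every value that is not None. Unlike filter(None, ...),
    # this also keeps a legitimate value of 0.
    notNull = [value for value in allYears if value is not None]
    # Return None when the row has no data at all, instead of
    # failing with an IndexError on an empty list.
    return notNull[-1] if notNull else None
```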

Compute mean values for each variable

The data you have for the candidate explanatory variables were collected between 1990 and 2015. Using all of the data available, you will compute a mean value for each variable.

Begin by adding fields to hold the computed mean values. Find and open the Add Field tool. In this and the next step, you will add a mean value field to each of the explanatory variable datasets. Start with the AdolescentBirthRate feature class by running the Add Field tool using the following parameters:
- Input Table: AdolescentBirthRate
- Field Name: MeanAdBirthRt
- Field Type: Float
- Field Alias: Mean Adolescent Birth Rate
- Field is Nullable: Yes
Use the Add Field tool to add a mean value field to the remaining explanatory variable datasets, using the parameters shown below:


Input Table	Field Name	Field Type	Field Alias
ChildHunger	MeanHunger	FLOAT	Mean Child Hunger Rate
SchoolEnrollment	MeanSchEnroll	FLOAT	Mean Primary School Enrollment Rate
ChildMortality	MeanMortality	FLOAT	Mean Child Mortality Rate
GenderParity	MeanParity	FLOAT	Mean Gender Parity Index
WomenInGovSeats	MeanWmGovSeats	FLOAT	Mean % Gov Seats Held by Women

Once the new fields have been created, you will use the Calculate Field tool to compute the mean values for each dataset. To compute the mean value for the Adolescent Birth Rate dataset, for example, use the Calculate Field tool with the parameters below:
- Input Table: AdolescentBirthRate
- Field Name: Mean Adolescent Birth Rate
- Expression: getMean( !Y1990!, !Y1991!, !Y1992!, !Y1993!, !Y1994!, !Y1995!, !Y1996!, !Y1997!, !Y1998!, !Y1999!, !Y2000!, !Y2001!, !Y2002!, !Y2003!, !Y2004!, !Y2005!, !Y2006!, !Y2007!, !Y2008!, !Y2009!, !Y2010!, !Y2011!, !Y2012!, !Y2013!, !Y2014!, !Y2015!)
- Code Block:
```
def getMean(*allYears):
    notNull = list(filter(None,allYears))
    theSum = sum(notNull)
    theCnt = len(notNull)
    return theSum/theCnt
```

The Code Block defines the getMean function. This is what each Python statement does:


Python Statement	What it does
def getMean(*allYears):	Indicates you want to define (def) a new function called getMean that will use the sequence of values passed to it from the Expression parameter.
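notNull = list(filter(None,allYears))	This line, indented four spaces, indicates Python should put all of the non-Null values into a list called notNull. Because filter(None, ...) drops every value Python treats as false, a value of exactly 0 would also be left out of the mean.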
theSum = sum(notNull)	This line, indented four spaces, instructs Python to sum the non-Null values.
theCnt = len(notNull)	This line, indented four spaces, instructs Python to count the non-Null values.
return theSum/theCnt	This line, indented four spaces, sets the value of the Field Name provided to be the mean (the sum divided by the count).
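If a country has no values for a variable, theCnt is 0 and the division fails with a ZeroDivisionError. A hedged variant that guards against this is sketched below. It is an illustration rather than the code used to produce the results in this case study, and it assumes that returning None leaves the field as Null.

```python
def getMean(*allYears):
    # Keep every value that is not None, including zeros.
    notNull = [value for value in allYears if value is not None]
    # No data for this row: return None rather than dividing by zero.
    if not notNull:
        return None
    # float() keeps the division from truncating if the inputs are integers
    # and the code runs under Python 2.
    return sum(notNull) / float(len(notNull))
```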

Use the Calculate Field tool to compute the mean values for the remaining explanatory variable datasets (ChildHunger, GenderParity, ChildMortality, SchoolEnrollment, and WomenInGovSeats).

Consolidate all of the data into a single feature class

Find and open the Join Field tool and use it to add the last literacy rate and mean value fields to the AfricaData feature class. To add the last literacy rate field, for example, use the following parameters:
- Input Table: AfricaData
- Input Join Field: CountryCode
- Join Table: Literacy
- Output Join Field: CountryCode
- Join Fields: Most current literacy rate
Similarly, to add the MeanAdBirthRt field, use the Join Field tool with the following parameters:
- Input Table: AfricaData
- Input Join Field: CountryCode
- Join Table: AdolescentBirthRate
- Output Join Field: CountryCode
- Join Fields: Mean Adolescent Birth Rate
Continue to run the Join Field tool until all of the mean value fields are in the AfricaData feature class (Mean Adolescent Birth Rate, Mean Child Hunger Rate, Mean Gender Parity Index, Mean Child Mortality Rate, Mean Primary School Enrollment Rate, Mean % Gov Seats Held by Women).
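The repeated Join Field runs could also be scripted. The sketch below lists the join table and field for each run, using the table and field names given in the steps above, and wraps the calls in a function so that arcpy is only imported inside an ArcGIS Python environment. The names JOINS and run_joins are hypothetical.

```python
# (join table, field to copy) pairs, taken from the steps above.
JOINS = [
    ("Literacy", "LastLiteracy"),
    ("AdolescentBirthRate", "MeanAdBirthRt"),
    ("ChildHunger", "MeanHunger"),
    ("SchoolEnrollment", "MeanSchEnroll"),
    ("ChildMortality", "MeanMortality"),
    ("GenderParity", "MeanParity"),
    ("WomenInGovSeats", "MeanWmGovSeats"),
]

def run_joins(target="AfricaData", key="CountryCode"):
    # arcpy is only available inside an ArcGIS Python environment,
    # so it is imported here rather than at the top of the script.
    import arcpy
    for table, field in JOINS:
        # Copy one field from each table into the target, matching on the key.
        arcpy.management.JoinField(target, key, table, key, [field])
```

Calling run_joins() from the Python window would perform all seven joins in order, assuming the layers are present in the current map.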

Calculate correlations

You can get a sense for the relationship between the dependent variable (the most up-to-date literacy rate for each country) and each mean explanatory variable using the Spearman Rank-Order Correlation. This tool is not presently in ArcGIS, but you can easily run it from the Python command window by importing SciPy.

In ArcGIS Pro, open the Python command window and type the following:

import scipy.stats as stat
dataArray = arcpy.da.FeatureClassToNumPyArray('AfricaData',("LastLiteracy","MeanAdBirthRt", "MeanHunger","MeanSchEnroll","MeanMortality","MeanParity","MeanWmGovSeats"))
rs, pval = stat.spearmanr(dataArray["LastLiteracy"],dataArray["MeanAdBirthRt"])
print ("rs = %f, p = %f" % (rs,pval))

The result will be displayed as rs = -0.583488, p = 0.000017.

The result indicates a negative and significant relationship. The r_s values from the Spearman Rank-Order Correlation statistic range from -1.0 to 1.0, so a magnitude of 0.58 is not an exceptionally strong correlation, but it is statistically significant (if there were truly no relationship, a correlation this strong would be very unlikely to arise by chance, p = 0.000017). The negative sign (-0.58) indicates a negative relationship: countries with higher adolescent birth rates tend to have lower literacy rates.
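To make the statistic less of a black box: Spearman's r_s is the Pearson correlation computed on the ranks of the values rather than on the values themselves. A perfectly decreasing relationship gives -1 and a perfectly increasing one gives +1. The pure-Python sketch below illustrates this with invented toy numbers, not the Africa data. It does not compute a p-value, and SciPy remains the practical choice for the real analysis.

```python
from math import sqrt

def ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j across a run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def spearman(x, y):
    """Pearson correlation of the ranks of x and y (undefined if either is constant)."""
    rx, ry = ranks(x), ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in rx))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in ry))
    return cov / (sd_x * sd_y)

# Invented toy data: y falls steadily as x rises.
print(spearman([1, 2, 3, 4], [8, 6, 4, 1]))
```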

To see the correlations for the other explanatory variables, type the following in the Python command window.

rs, pval = stat.spearmanr(dataArray["LastLiteracy"],dataArray["MeanHunger"])
print ("rs = %f, p = %f" % (rs,pval))
rs = -0.715657, p = 0.000000
rs, pval = stat.spearmanr(dataArray["LastLiteracy"],dataArray["MeanParity"])
print ("rs = %f, p = %f" % (rs,pval))
rs = 0.672988, p = 0.000000
rs, pval = stat.spearmanr(dataArray["LastLiteracy"],dataArray["MeanMortality"])
print ("rs = %f, p = %f" % (rs,pval))
rs = -0.741906, p = 0.000000
rs, pval = stat.spearmanr(dataArray["LastLiteracy"],dataArray["MeanSchEnroll"])
print ("rs = %f, p = %f" % (rs,pval))
rs = 0.674144, p = 0.000000
rs, pval = stat.spearmanr(dataArray["LastLiteracy"],dataArray["MeanWmGovSeats"])
print ("rs = %f, p = %f" % (rs,pval))
rs = 0.257401, p = 0.080690
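Rather than typing each call separately, the correlations could be computed in a loop. The sketch below takes the data and the correlation function as arguments so that it can be read on its own. The names correlate_all and report are hypothetical; in the Python window you would pass in the dataArray and stat.spearmanr objects created earlier.

```python
EXPLANATORY = ["MeanAdBirthRt", "MeanHunger", "MeanSchEnroll",
               "MeanMortality", "MeanParity", "MeanWmGovSeats"]

def correlate_all(data, corr, dependent="LastLiteracy", fields=EXPLANATORY):
    """Return {field: (rs, pval)} using the supplied correlation function."""
    results = {}
    for field in fields:
        # corr is expected to return a pair, as stat.spearmanr does above.
        rs, pval = corr(data[dependent], data[field])
        results[field] = (rs, pval)
    return results

def report(results):
    # Print one line per explanatory variable.
    for field, (rs, pval) in results.items():
        print("%s: rs = %f, p = %f" % (field, rs, pval))

# In the Python window, with dataArray and stat defined as above:
#     report(correlate_all(dataArray, stat.spearmanr))
```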

Find a properly specified regression model

Find and open the Exploratory Regression tool. Run the tool with the following parameters (accept the default values for all other parameters):
- Input Features: AfricaData
- Dependent Variable: LastLiteracy
- Candidate Explanatory Variables: MeanAdBirthRt; MeanHunger; MeanSchEnroll; MeanMortality; MeanParity; MeanWmGovSeats
To see the report, hover over the progress bar at the bottom of the Geoprocessing pane and click the icon to open the full Exploratory Regression analysis report.

When the report opens, you may resize the window by using the cursor to grab the lower left corner of the message window.

The tool documentation provides a full explanation of each section of the Exploratory Regression analysis report. Let's focus on the first part of the report. Here, Exploratory Regression tries all possible combinations of one-, two-, three-, four-, and five-variable models.

Notice that the tool does, in fact, find a number of two-variable and three-variable passing models. In addition, notice that there are no passing four- or five-variable models.
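To get a feel for how many models the tool evaluates, the sketch below enumerates every combination of one to five variables drawn from the six candidates listed above. It only counts the candidate models and does not fit any regressions. The function name candidate_models is hypothetical.

```python
from itertools import combinations

CANDIDATES = ["MeanAdBirthRt", "MeanHunger", "MeanSchEnroll",
              "MeanMortality", "MeanParity", "MeanWmGovSeats"]

def candidate_models(variables, max_size=5):
    """List every combination of 1..max_size explanatory variables."""
    models = []
    for size in range(1, max_size + 1):
        models.extend(combinations(variables, size))
    return models

models = candidate_models(CANDIDATES)
print(len(models))  # 6 + 15 + 20 + 15 + 6 = 62
```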

Identify where remediation might be most effective

While you could use the Select Layer By Attribute tool to find the countries where low literacy overlaps with each of the explanatory variables in our model, you would need to make a decision about what constitutes low literacy (Is 35.4 percent low? What about 36.1 percent? Where is the threshold?). Similarly, you would need to identify threshold values for each of the explanatory variables (adolescent birth rates, school enrollment, and the gender parity index). Certainly, this is a reasonable thing to do. An alternative, however, is to use the Grouping Analysis tool to identify these threshold values for you. Grouping Analysis will optimize the within-group similarity and the between-group differences.

Find and open the Grouping Analysis tool. Run the tool with the following parameters. The first time you run the tool, let it identify the optimal number of groups.
- Input Features: AfricaData
- Unique ID Field: CountryCode
- Output Feature Class: the name of your output feature class such as FindGroups
- Number of Groups: 2
- Analysis Fields: Most current literacy rate; Mean Adolescent Birth Rate
- Spatial Constraints: No spatial constraint
- Initialization Method: Find seed locations
- Evaluate Optimal Number of Groups: Yes

Each component of the output is explained in the tool documentation. The R² values, for example, indicate how effective each variable is at differentiating countries. A variable with an even distribution of values will not be as effective as a variable with natural breaks.
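As a rough illustration of what such an R² value measures, the sketch below assumes it is the share of a variable's total variation that remains after subtracting the variation within each group: (total sum of squares minus within-group sum of squares) divided by total sum of squares. Check the tool documentation for the exact definition. The numbers are invented toy values, not the Africa data.

```python
def r_squared(values, groups):
    """Share of total variation in values that is explained by the grouping."""
    n = len(values)
    overall_mean = sum(values) / float(n)
    # Total variation: squared deviations from the overall mean.
    total_ss = sum((v - overall_mean) ** 2 for v in values)
    # Collect the values belonging to each group.
    members = {}
    for value, group in zip(values, groups):
        members.setdefault(group, []).append(value)
    # Within-group variation: squared deviations from each group's own mean.
    within_ss = 0.0
    for group_values in members.values():
        group_mean = sum(group_values) / float(len(group_values))
        within_ss += sum((v - group_mean) ** 2 for v in group_values)
    return (total_ss - within_ss) / total_ss

toy_values = [1, 2, 3, 11, 12, 13]          # a variable with a natural break
clear = r_squared(toy_values, [0, 0, 0, 1, 1, 1])
mixed = r_squared(toy_values, [0, 1, 0, 1, 0, 1])
print(clear > mixed)
```

A grouping that follows the natural break in the values yields an R² close to 1, while a grouping that ignores it yields a value close to 0.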

Run Grouping Analysis again, this time specifying three groups (since the first run indicated three groups would be optimal), creating a report, and turning off the option to evaluate the optimal number of groups. With the overhead of creating the report, the tool may take several minutes to complete. Use the following parameters:
- Input Features: AfricaData
- Unique ID Field: CountryCode
- Output Feature Class: the name of your output feature class such as Grp3LiteracyAndAdBirthRt
- Number of Groups: 3
- Analysis Fields: Most current literacy rate; Mean Adolescent Birth Rate
- Spatial Constraints: No spatial constraint
- Initialization Method: Find seed locations
- Output Report File: the name of your report file such as Grp3LitAdBirRt.pdf
- Evaluate Optimal Number of Groups: No

Your map output will look similar to the map shown below. The groups should be the same, but the colors used to represent each group may be different. In other words, the green group below might be colored blue or red, but the same features will likely be together in each group.

To interpret the characteristics of each group, open the report file by either browsing to the report on your hard disk or hovering over the progress bar at the bottom of the Geoprocessing pane and clicking on the report name.

Each element of the report is explained in the tool documentation. Let's focus on the parallel box plot, which summarizes each group across all of the variables. Notice that the green group (for the map above, the colors associated with each group for your results may be different) is associated with the highest adolescent birth rates and lowest literacy rates. If programs to reduce adolescent birth rates cannot be implemented across the entire continent, it might make sense to begin in Niger, Mali, Chad, and the other countries in this group.

Parallel box plot for literacy and adolescent birth rates

Run Grouping Analysis again, this time partitioning countries based on mean primary school enrollment and literacy rates. As before, you will run the tool once to identify the optimal number of groups and again to create the report. Note that creating the report will add several minutes to tool execution, so be sure to remove the entry for the Output Report File parameter until you are ready for it.
- Input Features: AfricaData
- Unique ID Field: CountryCode
- Output Feature Class: the name of your output feature class such as FindGroups
- Number of Groups: 2
- Analysis Fields: Most current literacy rate; Mean Primary School Enrollment Rate
- Spatial Constraints: No spatial constraint
- Initialization Method: Find seed locations
- Evaluate Optimal Number of Groups: Yes

Run the Grouping Analysis tool again to create the report file:
- Input Features: AfricaData
- Unique ID Field: CountryCode
- Output Feature Class: the name of your output feature class such as Grp2LiteracyAndSchEnroll
- Number of Groups: 2
- Analysis Fields: Most current literacy rate; Mean Primary School Enrollment Rate
- Spatial Constraints: No spatial constraint
- Initialization Method: Find seed locations
- Output Report File: the name of your report file such as Grp2LitSchEnroll.pdf
- Evaluate Optimal Number of Groups: No

The colors for your results might be reversed, but the countries in each group should be the same.

Groups based on literacy rates and primary school enrollment

Open the report file by hovering over the progress bar at the bottom of the Geoprocessing pane and clicking on the icon to open the messages. Within the report, find the parallel box plot graph. Notice the clear distinction between the blue and red groups with regard to both primary school enrollment rates and literacy rates.
Open the table associated with the output result layer and sort, smallest to largest, on the LastLiteracy field to see that Somalia and Niger have the lowest literacy rates and very low primary school enrollment. Consequently, all of the countries in the red group, starting with Somalia and Niger, would benefit from programs aimed at increasing primary school enrollment rates.

Run Grouping Analysis again for literacy and the mean gender parity index, without specifying a report file. Check the Evaluate Optimal Number of Groups parameter. Your output should look similar to the map below.

Summary

You found a properly specified regression model indicating literacy rates are a function of adolescent birth rates, primary school enrollment rates, and gender parity. You also created maps showing which countries were associated with both low literacy and each of the contributing variables. You can use this information to suggest targeted remediation strategies aimed at increasing literacy across Africa.

References and resources for learning more

Afrol News, 2015. Some 80% of Somalis now illiterate. Afrol News, 23 January.

Hillman, A.L. and Jenkner, E. 2014. Educating Children in Poor Countries. International Monetary Fund, Economic Issues No. 33. www.imf.org/external/pubs/ft/issues/issues33/

Loaiza, E. and Liang, M. 2013. Adolescent Pregnancy: A Review of the Evidence. UNFPA. New York. www.unfpa.org/publications/adolescent-pregnancy

Madamombe, Itai. 2007. Food keeps African children in school. Africa Renewal Online. January 2007, page 10. www.un.org/africarenewal/magazine/january-2007/food-keeps-african-children-school

The World Bank, 2014. Girls' Education. The World Bank, Dec 3, 2014. www.worldbank.org/en/topic/education/brief/girls-education

UNESCO, 2011. Education for all. Regional Overview, Sub-Saharan Africa. UNESCO Global Monitoring Report. https://en.unesco.org/gem-report/

UNFPA, 2013. Motherhood in Childhood, Facing the challenge of adolescent pregnancy. UNFPA Publication. www.unfpa.org/publications/state-world-population-2013-0

United Nations. 2015. The Millennium Development Goals Report 2015. United Nations, New York. http://mdgs.un.org/unsd/mdg/Resources/Static/Products/Progress2015/English2015.pdf

Watkins, Kevin. 2013. Too Little Access, Not Enough Learning: Africa's Twin Deficit in Education. Brookings, January 16, 2013. The Brookings Institute Press. www.brookings.edu/research/opinions/2013/01/16-africa-learning-watkins

This case study demonstrates a number of analytical methods that can be adapted to many different application areas, allowing you to answer a variety of questions.


Method	Generic Question	Examples
Spearman Rank-Order Correlation	How does this relate to that?	Am I more likely to be robbed in a rich neighborhood or a poor neighborhood? Are test scores higher when teacher-to-student ratios are lower? How strong is the correlation between access to clean drinking water and literacy rates?
Exploratory Regression and Ordinary Least-Squares regression	What are the factors that contribute to or promote the thing I'm interested in?	What are the key variables that explain high forest fire frequency? What demographic characteristics contribute to high rates of public transportation usage? What factors are strong predictors of traffic accidents? Why are cancer rates so high in particular locations?
Grouping Analysis	Which features are most alike?	Which countries face the same challenges with regard to vulnerability? How should we divide the region into homogeneous sales territories?

You also used data manipulation and management functions, including Add Field, Calculate Field, and Join Field.

A number of resources are available to help you learn more about the analyses demonstrated in this case study:

Spatial Statistics resources

Regression Analysis Basics

Answering Why Questions: An introduction to regression analysis with spatial data

What they don't tell you about regression analysis

Spatial Data Mining I: Essentials of Cluster Analysis

Spatial Data Mining II: A Deep Dive Into Space-Time Analysis