Modeling literacy

  • Obtaining and preparing data
  • Workflow using ArcGIS Desktop
  • Workflow using ArcGIS Pro
  • References and resources for learning more
Download the data packages

Note:

While both the data and the premise for the analyses outlined below are real, the conclusions and recommendations described are illustrative only.

The primary objective of this case study is to demonstrate a number of spatial analysis methods including ordinary least-squares regression and K-means clustering. This case study also discusses spatial analysis topics such as data preparation and data quality. In addition, if you complete either of the workflows below, you will get a taste for how easy it is to use simple Python statements to reformat and summarize your data. You will also learn how to tap into SciPy to get access to analytical functions not available in ArcGIS.

Improving literacy, for both boys and girls, is one of the most important objectives we can have as a society because of its positive impacts on health, wealth, productivity, and quality of life. Identifying the factors that contribute to low literacy rates is an important first step toward remediation. Poverty and poor health prevent children from attending school, of course, but access to schools, and cultural biases relating to gender, are also important components of literacy. Recognizing that each of these factors has a geography is also essential; child health might be the key factor explaining literacy in one country, but not the most important factor in another country.

Let's look more closely at literacy rates across the continent of Africa. A workflow for obtaining and analyzing country-level statistical data is given below.

Carrying burdens in Africa

What data is needed?

We begin our analysis by downloading the United Nations MDG data, selecting as many variables as possible that might relate to literacy including child hunger, mortality rates, gender parity patterns, and primary school enrollment rates.

The map below shows data availability for ten of those variables, all collected between 1990 and 2015. Notice that complete data for these ten variables would constitute 260 pieces of information for each country (ten variables for each of the 26 years). Not surprisingly, no country has a complete set of data; the map shows where data is rich, sparse, and missing altogether.

Data availability across Africa

Having data for each country is important, of course, but so is having current data. The maps below focus on the literacy rate variable only, showing where data is scarce and also where it is out-of-date. Notice that both Algeria and Sudan have very little data and, unfortunately, the data they do have is old. Other countries, like Angola, have very little data, but at least the data is current.

Data currency
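Both ideas, how much data a country has and how current it is, are easy to compute once the yearly values for a country are in a list. The sketch below uses plain Python and made-up values (not the UN data); None stands in for a missing year, and the function name is hypothetical.

```python
# Hypothetical yearly literacy values for 1990-2015; None marks a missing year.
FIRST_YEAR = 1990

def availability(values):
    """Return (number of years with data, most recent year with data or None)."""
    years_with_data = [FIRST_YEAR + i for i, v in enumerate(values) if v is not None]
    latest = years_with_data[-1] if years_with_data else None
    return len(years_with_data), latest

sparse_but_current = [None] * 24 + [71.2, None]   # a single value, from 2014
sparse_and_old = [None, 49.6] + [None] * 24       # a single value, from 1991

print(availability(sparse_but_current))  # (1, 2014)
print(availability(sparse_and_old))      # (1, 1991)
```

Both example countries have the same amount of data; only the second number separates current data from out-of-date data.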

Before beginning analysis, it is very important to make every effort to fill in missing data with current and reliable data. When adding or updating data, be sure to keep track of where each piece of data came from.

Our efforts to find current and reliable data were not entirely successful. For some countries we couldn't find data across key variables and, consequently, we had to exclude Mayotte, Reunion Island, Sudan, and Western Sahara from the analyses below. In addition, incomplete data across many countries for particular variables—several poverty indicators, for example—made it necessary to exclude those variables.

For demonstration purposes we have limited the variables included in the data package provided above to the literacy rate variable and several other variables that were consistently significant for modeling literacy rates:

Variable | Description
Literacy rate | The percentage of people age 15 to 24 who can read and write
Adolescent birth rate | The number of girls, per 1,000 girls, age 15 to 19 who deliver babies
Child hunger | The percentage of children under the age of 5 who are moderately to severely underweight
Primary school enrollment | The percentage of children, primary school age, enrolled in primary education; this is the reported net ratio, rather than the gross ratio
Child mortality | The number of children, per 1,000 live births, who die before the age of 5
Gender parity index | The ratio of girls to boys enrolled in primary school education; 1.0 reflects perfect balance; indices smaller than 1.0 indicate higher enrollment for boys than for girls
Women in government | The percentage of seats in the national parliament held by women

Where is literacy lowest and highest?

Based on the most up-to-date literacy rate data available for each country, the map below shows the spatial distribution of literacy across Africa. For the countries shown with the lightest green color, less than 45 percent of the people age 15 to 24 can read and write. In Somalia, according to 2015 data provided by Afrol News, more than 80 percent of the population is illiterate. In contrast, almost 100 percent of the people age 15 to 24 are literate in Libya and South Africa.

Literacy rates

Where are literacy rates rising?

By subtracting the earliest literacy rate values (occurring sometime between 1990 and 2015) from the most up-to-date literacy rate values, we get a sense for how literacy rates are changing. In the map below, countries symbolized using pink have seen literacy rate declines. The Central African Republic has experienced the sharpest declines (a 24 percent decrease). The largest literacy rate increases are found in Burundi and Chad (35 and 32 percent increase, respectively).

Changes in literacy rates
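The change calculation described above can be sketched in a few lines of plain Python. The values are hypothetical, None marks a missing year, and returning None when a country has fewer than two observations is a choice made for this sketch rather than something the case study specifies.

```python
def literacy_change(values):
    """Most recent non-missing value minus the earliest non-missing value."""
    known = [v for v in values if v is not None]
    if len(known) < 2:
        return None  # change is undefined with fewer than two observations
    return known[-1] - known[0]

# Hypothetical values, ordered by year; None marks a missing year.
rising = [None, 40.0, None, 55.5, None, 72.0]
falling = [60.0, None, None, 48.0]

print(literacy_change(rising))   # 32.0
print(literacy_change(falling))  # -12.0
```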

It is almost impossible to look at a map like the one above and not wonder what is going on. Why are there locations where literacy rates are declining? Why is literacy high in some countries but low in others? Are low literacy rates the result of health problems, economic issues, or cultural practices?

What factors contribute to low literacy rates?

To get a better understanding of the factors that are contributing to high or low literacy rates, let's look at how the literacy rate values correlate with other variables, such as school enrollment and health. We will use the Spearman Rank-Order Correlation statistic to measure relationship strength (rs) and significance (p-value).

There isn't presently a tool in ArcGIS for running the Spearman Rank-Order Correlation statistic. There are several options for accessing analytical functionality outside of ArcGIS, however, including the following:

  1. Use the Export Feature Attributes to ASCII tool to create a comma-delimited text file for use with statistical packages like SAS or SPSS. Most software packages will read comma-delimited text files.
  2. Tap into R functionality. R has a large number of statistical functions. You can create custom R tools and execute them in ArcGIS. For more information on this, see the examples posted on the R - ArcGIS Community page.
  3. The easiest option is to access SciPy analytical methods from the ArcGIS Python window. This option is detailed below in the ArcMap and ArcGIS Pro workflows.

The table below shows the results of the Spearman Rank-Order Correlation analysis. They indicate literacy rates are most strongly correlated with child health (child mortality rate), child hunger (percentage of underweight children), gender parity, and primary school enrollment.

Variable | Relationship Strength (rs) | Relationship Direction | Statistical Significance
Child mortality rate | -0.74 | negative | Very high, p < 0.001
Child hunger (underweight) | -0.72 | negative | Very high, p < 0.001
Gender parity | 0.67 | positive | Very high, p < 0.001
Primary school enrollment | 0.67 | positive | Very high, p < 0.001
Adolescent birth rate | -0.58 | negative | Very high, p < 0.001
Women in government seats (percent) | 0.26 | positive | Significant, p < 0.1

The Spearman rs values indicate how strong each relationship (correlation) is; rs ranges from -1.00 to +1.00, and its magnitude (0.00 to 1.00) measures the strength of the relationship. The sign (+/-) associated with each rs value indicates whether the relationship is positive or negative. The p-value associated with each rs value is the probability of seeing a correlation at least this strong if there were, in fact, no relationship. A very small p-value means a correlation this strong would be very unlikely to arise by chance alone. Notice that the relationship between literacy rates and primary school enrollment is strong and positive (rs = +0.67, p < 0.001). As we would expect, locations with higher primary school enrollment rates also have higher literacy rates. The gender parity variable also has a strong, positive relationship with literacy rates (rs = +0.67, p < 0.001). As the number of boys and girls in primary schools becomes more balanced, literacy rates increase.

Positive and negative data relationships

Some variables have a negative correlation with literacy rates. As child hunger (the proportion of underweight children) increases, for example, literacy rates decrease. Similarly, as adolescent birth rates increase, literacy rates decrease.

Note:

Having a negative correlation is not a bad thing; this terminology simply specifies whether two variables move in the same direction in relation to each other (positive) or in opposite directions (negative).
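To see where a rank-order correlation comes from, it can help to compute one by hand. The sketch below is plain Python and is not the SciPy routine used in the workflows: it ranks each variable (tied values share the mean of their ranks) and then takes the ordinary Pearson correlation of the two sets of ranks. The function names and the six data values are hypothetical.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2.0 + 1.0
        for i in order[start:end + 1]:
            ranks[i] = mean_rank
        start = end + 1
    return ranks

def spearman_rs(x, y):
    """Spearman rs: the Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# Hypothetical values for six countries (not the case-study data).
enrollment = [55.0, 62.0, 70.0, 81.0, 90.0, 97.0]
literacy = [40.0, 52.0, 49.0, 75.0, 88.0, 95.0]
mortality = [180.0, 150.0, 160.0, 90.0, 60.0, 30.0]

print(round(spearman_rs(enrollment, literacy), 3))  # strong and positive
print(round(spearman_rs(mortality, literacy), 3))   # orderings exactly reversed, so rs is -1
```

Because only the ranks matter, the statistic measures whether two variables rise and fall together, not whether the relationship is a straight line.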

Simple correlations, like these, are a good place to start when you want to understand variable relationships. A quick web search, however, using "spurious correlations" as your search term, will reveal the problem with relying on correlations alone to make important decisions (like how and where to allocate scarce resources aimed at improving literacy rates across Africa). We can really only trust these relationships if we find a properly specified model for the literacy rate values.

How can you find a properly specified model?

We will try using Ordinary Least Squares regression (OLS). Regression analysis is a statistical method that allows you to estimate relationships among variables. We will see if it can identify key factors contributing to high and low literacy rates. If it can, we will use what we learn to recommend programs for improving literacy.

Note:

If you are new to regression analysis, don't worry at all! There are lots of great resources listed below and at the end of this document to help you quickly become proficient with this statistical method.

We begin with Exploratory Regression. Exploratory regression tries every possible combination of the variables you provide, looking for properly specified models that satisfy all of the requirements of the OLS regression method. A properly specified OLS model has several characteristics:

  • The coefficients associated with each of the explanatory variables are statistically significant (meaning all of the variables in the model are truly helping to explain the literacy rate values).
  • The relationship between each explanatory variable and the dependent variable is justifiable. If the coefficient for the child mortality variable were positive, for example, it would indicate that child mortality and literacy increase together, which does not make sense; we are expecting a negative relationship here.
  • The under- and overpredictions (called residuals) from the model should be normally distributed and random to indicate your model is free from bias and isn't missing any key explanatory variables.

There is good news: the Exploratory Regression tool does, in fact, find a couple of models that meet all of the requirements of the OLS method. The best properly specified model comprises three explanatory variables: adolescent birth rates, primary school enrollment, and primary school gender parity. This model explains almost 68 percent of the variation in literacy rate values across Africa (adjusted R² = 0.677).
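The adjusted R² value differs from the raw R² because it is penalized for the number of explanatory variables in the model. The sketch below shows the standard formula; the function name and the input values are hypothetical and are not taken from the tool's report.

```python
def adjusted_r2(r2, n_obs, n_explanatory):
    """Adjusted R-squared: R-squared penalized for the number of explanatory variables."""
    if n_obs - n_explanatory - 1 <= 0:
        raise ValueError("need more observations than model terms")
    return 1.0 - (1.0 - r2) * (n_obs - 1) / (n_obs - n_explanatory - 1)

# Hypothetical inputs: the same raw R-squared with 3 and with 5 explanatory variables.
print(round(adjusted_r2(0.70, 47, 3), 3))
print(round(adjusted_r2(0.70, 47, 5), 3))
```

For the same raw R², the model with more explanatory variables gets the lower adjusted value, which is why the adjusted figure is the fairer one for comparing models of different sizes.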

Finding a properly specified model gives us confidence that any investments made to reduce adolescent birth rates, or to increase primary school enrollment, particularly for girls, will have a positive impact on literacy.

Where should we invest in these programs?

We could invest in all of these programs equally across the entire continent. This would be costly, however. Alternatively, we could identify where the need for each type of program is greatest and focus our investments accordingly. This is an excellent plan, of course, but to ensure investments will positively impact literacy rates as well, we can examine the relationships between each explanatory variable (adolescent birth rates, primary school enrollment, and gender parity) and literacy.

We will use the Grouping Analysis tool to show us where the low literacy rates correspond to the factors contributing to low literacy. Grouping Analysis uses a K-means clustering algorithm to partition countries into groups so the countries in the same group are as similar as possible and the groups themselves are as different as possible. We will use Grouping Analysis on each of our model explanatory variables separately.
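For readers new to K-means, the idea can be sketched in plain Python. This is a minimal textbook version, not the Grouping Analysis implementation: the function name, the seed choice, and the six data pairs are all hypothetical, and the values are used in their raw units (in practice, variables measured on different scales are usually standardized first).

```python
def kmeans(points, seeds, max_iter=100):
    """Plain K-means: assign each point to its nearest center, recompute centers, repeat."""
    centers = [tuple(s) for s in seeds]
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(len(centers)),
                key=lambda c: sum((p - q) ** 2 for p, q in zip(pt, centers[c])))
            for pt in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for c in range(len(centers)):
            members = [pt for pt, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a group becomes empty
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels, centers

# Hypothetical (literacy rate, adolescent birth rate) pairs, not the case-study data.
countries = [(35.0, 190.0), (42.0, 170.0), (38.0, 180.0),
             (90.0, 40.0), (95.0, 30.0), (88.0, 55.0)]
labels, centers = kmeans(countries, seeds=[countries[0], countries[3]])
print(labels)  # [0, 0, 0, 1, 1, 1]
```

The first group collects the low-literacy, high-birth-rate pairs, which is the kind of group the analysis uses to suggest where to start a program.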

Note:

The most powerful tool in our spatial statistics arsenal for showing where each explanatory variable has the strongest impact on literacy is Geographically Weighted Regression (GWR). With only 47 countries in our dataset, however, we don't have enough features to use this powerful method. More information about GWR is provided in the resources section at the end of this case study.

Adolescent birth rates

High adolescent birth rates are associated with a number of problems beyond low literacy, including poverty, high maternal death rates, HIV, dependency, abuse, and violence. The Grouping Analysis tool finds three distinct groups for the Adolescent Birth Rate and Literacy Rate variables across the continent of Africa. Countries in the green group (see the map below) have both low literacy rates and high adolescent birth rates. Niger, Mali, Chad, and the other countries in the green group would be good locations to begin a rollout of programs designed to protect young girls from having children while they are still children themselves. Encouraging families and communities to allow girls to attend school and to stay in school is an important remediation strategy that also relates to the next explanatory variable: Gender Parity.

Literacy and adolescent birth rates

Gender parity

Differences in the proportion of boys versus girls enrolled in primary school education are also an important factor impacting literacy rates. Grouping Analysis finds two distinct groups for the gender parity and literacy rate variables. The countries in the red group have the fewest girls attending school and the lowest literacy rates. Investing in programs that address gender bias and result in more girls attending school (especially in Somalia, Chad, Guinea, and Ethiopia, where the proportion of girls enrolled in school is lowest) will likely have the biggest impacts on literacy. Programs that encourage girls to attend and complete school will also boost primary school enrollment rates, the next explanatory variable in our model.

Literacy and gender parity

Primary school enrollment rates

Adolescent pregnancy and gender bias will certainly have a negative impact on primary school enrollment rates. Physical access to school facilities, family economics, and a country's political will, however, are also factors that impact whether or not a child attends school. Grouping Analysis finds two distinct groups where both literacy rates and school enrollment rates are low. Countries in the red group, including Somalia and Niger, where enrollment rates are especially low, will likely benefit most from programs to increase primary school enrollment rates.

Literacy and primary school enrollment

Final thoughts

This case study focused on literacy rates across Africa, but similar workflows could be used to examine poverty, infant mortality, and other country-level statistical indicators. Exploratory Regression was used to discover the key explanatory variables contributing to literacy rates. Grouping Analysis was used to suggest a variety of targeted remediation programs directed at countries where these programs are likely to see their greatest successes.

Education in Africa

Obtaining and preparing data

This section outlines the steps needed to download and prepare data for analysis. Because the prepared data is provided in the data package at the beginning of this case study, you do not need to perform the steps in this section in order to complete the workflows below. You may still want to read or work through these instructions, however, to get an understanding of what is required in case you ever need to work with CSV-formatted data.

  1. Download the data as CSV files from the United Nations Millennium Development Goals data portal. It seems to work best to create a separate CSV file for every indicator you are interested in.
  2. You will notice a number of things that need to be corrected in Excel before you can bring the CSV tables into ArcMap.
    1. Delete the fields that are not needed for analysis (SeriesCode, MDG, all the fields named Footnotes, and all fields named Type, for example).
    2. Field names cannot begin with a number, so change all the year fields, such as 1991, 1992, 1993, and so on, to Y1991, Y1992, Y1993, and up to Y2015.
    3. Delete all the footnotes at the bottom of the table (you will find these following the last country record).
    4. When ArcGIS brings in a table, it uses the first few records to determine if the data is text or numeric. To ensure the year fields are all created as numbers (especially when there are so many null values), insert a row at the top of each table with data reflecting the appropriate data type. You may delete this row after the table is in ArcGIS.
      The first row defines data format
      Tip:
      To save yourself some typing, create the template row in the first table and copy it into each of the remaining CSV tables.
  3. Once the CSV files are in ArcGIS, use the Copy Rows tool to convert them to ArcGIS tables.
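If you have many indicator files, the spreadsheet cleanup described above can also be scripted. The sketch below uses only Python's csv module on a tiny in-memory stand-in for one file. The column layout, the country codes, and the values are assumptions made for illustration, and treating the first blank or short line as the start of the footnotes is a simple heuristic rather than a rule of the download format.

```python
import csv
import io

# A tiny stand-in for one downloaded indicator file; the column layout is an assumption.
raw = io.StringIO(
    "CountryCode,Country,SeriesCode,MDG,1990,Footnotes,Type,1991,Footnotes,Type\n"
    "12,Algeria,656,Y,,,,74.3,1,E\n"
    "24,Angola,656,Y,,,,,,\n"
    "\n"
    "Footnotes\n"
    "1,Estimate based on survey data\n"
)

DROP = {"SeriesCode", "MDG", "Footnotes", "Type"}  # fields not needed for analysis

reader = csv.reader(raw)
header = next(reader)
keep = [i for i, name in enumerate(header) if name not in DROP]
# Field names cannot begin with a number, so 1990 becomes Y1990.
new_header = ["Y" + header[i] if header[i].isdigit() else header[i] for i in keep]

rows = []
for row in reader:
    if len(row) != len(header):
        break  # the first blank or short line marks the start of the footnotes
    rows.append([row[i] for i in keep])

out = io.StringIO()
writer = csv.writer(out, lineterminator="\n")
writer.writerow(new_header)
writer.writerows(rows)
print(out.getvalue())
```

The template row that forces numeric field types is not added here; it would still need to be inserted before the table is brought into ArcGIS.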

Workflow using ArcGIS Desktop

Note:

The steps below are based on ArcGIS 10.4 for Desktop but should work fine for later software releases as well. To follow the steps below, you may download the Millennium Development Goals data and prepare it as outlined above, download and unzip the data in the data package provided (already cleaned up for you), or improvise using your own data. The steps below assume you are using the data package provided.

Determine the most current literacy rate for each country

  1. If you haven't done so already, download and unzip the data package provided at the top of this case study.
  2. Double-click the ModelingLiteracy.mpk map package to open it.
  3. Right-click the Literacy layer and open the table. Notice the many Null values.
    The table includes a lot of null values

    You will be looking for a model that explains literacy rates. The variable you will be modeling will be the most up-to-date literacy rate available for each country. To pull this value out from all of the null values, you will use the Calculate Field tool with some Python statements.

  4. Begin by using the Search pane to locate and open the Add Field tool.
    Search pane
  5. Add a new field to hold the most current literacy rate value using the parameters below:
    • Input Table: Literacy
    • Field Name: LastLiteracy
    • Field Type: FLOAT
    • Field Alias: Most current literacy rate
    • Field is Nullable: Yes
  6. Set the value of this new field to be the last non-Null literacy rate for each country. To do this, find and open the Calculate Field tool and run it with the following parameters:
    • Input Table: Literacy
    • Field Name: LastLiteracy
    • Expression: getLast( !Y1990!, !Y1991!, !Y1992!, !Y1993!, !Y1994!, !Y1995!, !Y1996!, !Y1997!, !Y1998!, !Y1999!, !Y2000!, !Y2001!, !Y2002!, !Y2003!, !Y2004!, !Y2005!, !Y2006!, !Y2007!, !Y2008!, !Y2009!, !Y2010!, !Y2011!, !Y2012!, !Y2013!, !Y2014!, !Y2015!)
    • Code Block:

      def getLast(*allYears):
          # Keep only the years that have a value (None marks a Null cell)
          notNull = [v for v in allYears if v is not None]
          # Return the most recent value, or Null if every year is Null
          return notNull[-1] if notNull else None

    Note:
    Python is particular about indentation and case. You will get an error if you do not indent the lines below the def statement in the code block, for example. In addition, if you define your function to be getLast, but refer to it in the Expression parameter as GetLast or getlast, Python will not be able to find it. Also, if the Literacy table is open when you run the Calculate Field tool, you may need to close and reopen it in order to see the updated values.
    Python statements with the Calculate Field tool
    The Expression parameter tells the Calculate Field tool to get all the literacy rate values from the fields Y1990, Y1991, Y1992, and so on, and to pass those values to a function called getLast.
    The Code Block defines the getLast function. Let's look at the Python code statement by statement:

    Python Statement | What it does
    def getLast(*allYears): | Indicates you want to define (def) a new function called getLast that will use the sequence of values passed to it from the Expression parameter.
    notNull = [v for v in allYears if v is not None] | This line, indented four spaces, indicates Python should put all of the non-Null values into a list called notNull. Testing for None explicitly keeps legitimate values of 0.
    return notNull[-1] if notNull else None | This line, indented four spaces, instructs Python to set the LastLiteracy field value to be the final value in the notNull list. The -1 index tells Python to take the last item in the list. If every year is Null, the list is empty and the field is left Null.
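A function like getLast can be tried outside ArcGIS before you paste it into the Code Block. In the sketch below, None stands in for a Null cell and the sample values are hypothetical. This variant tests for None explicitly, so a value of 0 is kept, and it returns None when every year is Null; both behaviors are choices made for this sketch.

```python
def getLast(*allYears):
    # None stands in for a Null cell; an explicit None test keeps values of 0
    notNull = [v for v in allYears if v is not None]
    return notNull[-1] if notNull else None

# Hypothetical rows: values for consecutive years, oldest first.
print(getLast(None, 61.5, None, 70.2, None))  # 70.2
print(getLast(None, None, None))              # None
print(getLast(12.0, 0.0, None))               # 0.0
```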

Compute mean values for each variable

You now have the most current literacy rate value for each country. What are the factors that might impact or explain those literacy rates? We suspect they are a function of several things including child health, hunger, school enrollment, and gender biases. Later in this workflow you will use these variables as your candidate explanatory variables to see if you can find a properly specified regression model for the literacy rate values.

The data you have for the candidate explanatory variables were collected between 1990 and 2015. Using all of the data available, you will compute a mean value for each variable.

  1. Begin by adding fields to hold the computed mean values. Find and open the Add Field tool. In this and the next step, you will add a mean value field to each of the explanatory variable datasets. Start with the AdolescentBirthRate feature class by running the Add Field tool using the following parameters:
    • Input Table: AdolescentBirthRate
    • Field Name: MeanAdBirthRt
    • Field Type: FLOAT
    • Field Alias: Mean Adolescent Birth Rate
    • Field is Nullable: Yes
      Add Field dialog box
      Tip:

      Every time you run a tool, it is recorded in the Results window. Double-clicking on a tool entry in the Results window opens the dialog with the parameters filled out. If you need to run a tool several times with slightly different parameters, it is usually quicker to access the tool from the Results window and modify the parameters as needed.

      Results window
  2. Open the Add Field tool from the Results window and use it to add a mean value field to the remaining explanatory variable datasets. Use the parameters shown below:

    Input Table | Field Name | Field Type | Field Alias
    ChildHunger | MeanHunger | FLOAT | Mean Child Hunger Rate
    SchoolEnrollment | MeanSchEnroll | FLOAT | Mean Primary School Enrollment Rate
    ChildMortality | MeanMortality | FLOAT | Mean Child Mortality Rate
    GenderParity | MeanParity | FLOAT | Mean Gender Parity Index
    WomenInGovSeats | MeanWmGovSeats | FLOAT | Mean % Gov Seats Held by Women

  3. Once the new fields have been created, you will use the Calculate Field tool to compute the mean values for each dataset. (Opening the Calculate Field tool from the Results window will save you some typing.) To compute the mean value for the adolescent birth rate dataset, for example, use the parameters below:
    • Input Table: AdolescentBirthRate
    • Field Name: MeanAdBirthRt
    • Expression: getMean( !Y1990!, !Y1991!, !Y1992!, !Y1993!, !Y1994!, !Y1995!, !Y1996!, !Y1997!, !Y1998!, !Y1999!, !Y2000!, !Y2001!, !Y2002!, !Y2003!, !Y2004!, !Y2005!, !Y2006!, !Y2007!, !Y2008!, !Y2009!, !Y2010!, !Y2011!, !Y2012!, !Y2013!, !Y2014!, !Y2015!)
    • Code Block:

      def getMean(*allYears):
          # Keep only the years that have a value (None marks a Null cell)
          notNull = [v for v in allYears if v is not None]
          if not notNull:
              return None  # every year is Null, so the mean stays Null
          theSum = sum(notNull)
          theCnt = len(notNull)
          return theSum / float(theCnt)

    The Code Block defines the getMean function. This is what each Python statement does:

    Python Statement | What it does
    def getMean(*allYears): | Indicates you want to define a new function called getMean that will use the sequence of values passed to it from the Expression parameter.
    notNull = [v for v in allYears if v is not None] | This line, indented four spaces, indicates Python should put all of the non-Null values into a list called notNull. Testing for None explicitly keeps legitimate values of 0, which would otherwise be left out of the mean.
    if not notNull: return None | These lines leave the field Null when a country has no values at all, rather than dividing by zero.
    theSum = sum(notNull) | This line, indented four spaces, instructs Python to sum the non-Null values.
    theCnt = len(notNull) | This line, indented four spaces, instructs Python to count the non-Null values.
    return theSum / float(theCnt) | This line, indented four spaces, sets the value of the Field Name provided to be the mean (the sum divided by the count). Converting the count to a float guards against integer division.

  4. Open the Calculate Field tool from the Results window and use it to compute the mean values for the remaining explanatory variable datasets (ChildHunger, GenderParity, ChildMortality, SchoolEnrollment, and WomenInGovSeats). When you open the Calculate Field tool from the Results window, you will only need to change the Input Table and Field Name parameter values. The Expression and Code Block parameters remain the same.
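The long Expression value used in these steps does not have to be typed by hand. As a small sketch, the plain Python below builds the string from the range of years and pairs each input table with its mean field; the variable names (years, fields, expression, jobs) are hypothetical, while the table and field names are the ones used in the steps above.

```python
# Build the Expression string instead of typing 26 field names by hand.
years = range(1990, 2016)  # 1990 through 2015 inclusive
fields = ", ".join("!Y{0}!".format(y) for y in years)
expression = "getMean({0})".format(fields)

# Input table and mean field for each explanatory variable dataset.
jobs = [
    ("AdolescentBirthRate", "MeanAdBirthRt"),
    ("ChildHunger", "MeanHunger"),
    ("SchoolEnrollment", "MeanSchEnroll"),
    ("ChildMortality", "MeanMortality"),
    ("GenderParity", "MeanParity"),
    ("WomenInGovSeats", "MeanWmGovSeats"),
]

print(expression[:40] + " ...")
for table, field in jobs:
    print("{0}: {1} = getMean(...)".format(table, field))
```

Each pair, together with the expression and the code block, is what the Calculate Field tool needs for one run, whether you enter the values in the dialog box or pass them from a script.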

Consolidate all of the data into a single feature class

The ArcGIS modeling tools require the literacy rate field (your dependent variable) and all of the mean values (your candidate explanatory variables) to be in the same feature class. You will use the Join Field tool to consolidate.

  1. Find and open the Join Field tool and use it to add the last literacy rate and mean value fields to the AfricaData feature class. To add the last literacy rate field, for example, use the following parameters:
    • Input Table: AfricaData
    • Input Join Field: CountryCode
    • Join Table: Literacy
    • Output Join Field: CountryCode
    • Join Fields: LastLiteracy
      Join Field tool dialog box
  2. Similarly, to add the MeanAdBirthRt field, use the following parameters:
    • Input Table: AfricaData
    • Input Join Field: CountryCode
    • Join Table: AdolescentBirthRate
    • Output Join Field: CountryCode
    • Join Fields: MeanAdBirthRt
  3. Continue to use the Join Field tool until all of the Mean value fields are in the AfricaData feature class.
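Conceptually, each run of Join Field looks up a value by CountryCode and copies it onto the matching row of AfricaData. The plain Python sketch below mimics that behavior with dictionaries. It is an illustration of the idea only, not how the tool is implemented, and the country codes, names, and values are made up.

```python
# Hypothetical attribute rows keyed by CountryCode (codes and values are made up).
africa_data = [
    {"CountryCode": "C01", "Country": "Country A"},
    {"CountryCode": "C02", "Country": "Country B"},
    {"CountryCode": "C03", "Country": "Country C"},
]
literacy = [
    {"CountryCode": "C01", "LastLiteracy": 24.0},
    {"CountryCode": "C03", "LastLiteracy": 99.0},
]

def join_field(target_rows, join_rows, key, field):
    """Copy one field from join_rows onto target_rows where the key values match."""
    lookup = {row[key]: row[field] for row in join_rows}
    for row in target_rows:
        row[field] = lookup.get(row[key])  # None (Null) when there is no match
    return target_rows

join_field(africa_data, literacy, "CountryCode", "LastLiteracy")
for row in africa_data:
    print(row["Country"], row["LastLiteracy"])
```

Rows without a match end up with a Null value, so it is worth scanning the joined fields for unexpected Nulls before moving on to the modeling steps.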

Calculate correlations

You can get a sense for the relationship between the dependent variable (the most up-to-date literacy rate for each country) and each mean explanatory variable using the Spearman Rank-Order Correlation statistic. This tool is not presently in ArcGIS, but you can easily run it from the Python command window by importing SciPy.

  1. In ArcMap, open the Python command window and type the following:

    >>> import scipy.stats as stat
    >>> dataArray = arcpy.da.FeatureClassToNumPyArray("AfricaData",("LastLiteracy", "MeanAdBirthRt","MeanHunger","MeanSchEnroll","MeanMortality","MeanParity","MeanWmGovSeats"))
    >>> rs, pval = stat.spearmanr(dataArray["LastLiteracy"],dataArray["MeanAdBirthRt"])
    >>> print ("rs = %f, p = %f" % (rs,pval))

    The result will be displayed as rs = -0.583488, p = 0.000017.
    Python window
    The result indicates a negative and significant relationship. The magnitude of rs ranges from 0.0 to 1.0, so 0.58 is not an exceptionally strong correlation, but it is statistically significant (a correlation this strong would be very unlikely if there were no relationship, p = 0.000017). The negative sign (-0.58) indicates a negative relationship between adolescent birth rates and the literacy rates; as adolescent birth rates go up, literacy goes down.
  2. To see the correlations for the other explanatory variables, type the following in the Python command window.

    Tip:
    If you press the up arrow key while in the Python window, it brings back what you previously typed. Rather than retyping statements, you can press the up arrow key twice, modify the second variable name, place your cursor at the end of the statement, and reexecute the statement; press the up arrow key twice again to reexecute the print statement.

    >>> rs, pval = stat.spearmanr(dataArray["LastLiteracy"],dataArray["MeanHunger"])
    >>> print ("rs = %f, p = %f" % (rs,pval))
    rs = -0.715657, p = 0.000000
    >>> rs, pval = stat.spearmanr(dataArray["LastLiteracy"],dataArray["MeanSchEnroll"])
    >>> print ("rs = %f, p = %f" % (rs,pval))
    rs = 0.674144, p = 0.000000
    >>> rs, pval = stat.spearmanr(dataArray["LastLiteracy"],dataArray["MeanMortality"])
    >>> print ("rs = %f, p = %f" % (rs,pval))
    rs = -0.741906, p = 0.000000
    >>> rs, pval = stat.spearmanr(dataArray["LastLiteracy"],dataArray["MeanParity"])
    >>> print ("rs = %f, p = %f" % (rs,pval))
    rs = 0.672988, p = 0.000000
    >>> rs, pval = stat.spearmanr(dataArray["LastLiteracy"],dataArray["MeanWmGovSeats"])
    >>> print ("rs = %f, p = %f" % (rs,pval))
    rs = 0.257401, p = 0.080690
    

Find a properly specified regression model

Looking at correlations is helpful, but you cannot fully trust these relationships unless you find a properly specified model. You will use the Exploratory Regression tool to see if you have any properly specified models among all of the candidate explanatory variables.

So that you will see all of the messages displayed by the Exploratory Regression tool, disable background processing. Do this by clicking on the Geoprocessing menu and selecting Geoprocessing Options. Uncheck the Enable box for Background Processing.

Disable background processing
  1. Find and open the Exploratory Regression tool. Run the tool with the following parameters (accept the default values for all other parameters):
    • Input Features: AfricaData
    • Dependent Variable: LastLiteracy
    • Candidate Explanatory Variables: MeanAdBirthRt; MeanHunger; MeanSchEnroll; MeanMortality; MeanParity; MeanWmGovSeats
      Exploratory Regression tool UI
  2. The tool documentation provides a full explanation of each section of the Exploratory Regression Analysis report displayed during tool execution. Let's focus on the first part of the report. Here, Exploratory Regression tries all possible combinations of one-, two-, three-, four-, and five-variable models.
    Exploratory Regression report

    The tool reports models with the highest adjusted R² values first. R² values range from 0.0 to 1.0 and this diagnostic tells you how much of the variation in the literacy rate values has been explained by the model. Notice that the MeanParity variable alone explains 54 percent of the literacy rate variation. Any passing models found are listed after the models with the highest adjusted R² values.

    Notice that a model with only the MeanParity variable does pass; in other words, it meets all of the requirements of the OLS method. Since it only tells 54 percent of the literacy rate story, however, we will look for better results among the models using two or more explanatory variables.

  3. Notice that the tool does, in fact, find a number of two-variable and three-variable passing models. In addition, notice that there are no passing four- or five-variable models.
  4. The best model of literacy rates is a function of adolescent birth rates (MEANADBIRTHRT), school enrollment (MEANSCHENROLL), and mean gender parity (MEANPARITY). This model is best because it has the highest adjusted R2 value and the lowest AICc value.
    Several passing models
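To get a feel for how many models the tool evaluates with these settings, you can count the combinations yourself. The sketch below assumes Python 3.8 or later (for math.comb) and assumes models with one to five explanatory variables are tried, as described above. It is an illustration only and does not call the Exploratory Regression tool.

```python
from math import comb  # available in Python 3.8 and later

# The six candidate explanatory variables used in this workflow.
candidates = [
    "MeanAdBirthRt", "MeanHunger", "MeanSchEnroll",
    "MeanMortality", "MeanParity", "MeanWmGovSeats",
]

# Number of distinct models for each model size (1 to 5 variables).
counts = {size: comb(len(candidates), size) for size in range(1, 6)}
total_models = sum(counts.values())

for size, count in counts.items():
    print("%d-variable models: %d" % (size, count))
print("Total models evaluated: %d" % total_models)
```

Because the number of combinations grows quickly as you add candidates, it helps to keep the candidate list focused on variables you have a reason to suspect.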

Identify where remediation might be most effective

While you could use the Select Layer by Attribute tool to find the countries where low literacy overlaps with each of the explanatory variables in your model, you would need to make a decision about what constitutes low literacy. (Is 35.4 percent low? What about 36.1 percent? Where is the threshold?) Similarly, you would need to identify threshold values for each of the explanatory variables (adolescent birth rates, school enrollment, and the gender parity index). Certainly, this is a reasonable thing to do. An alternative, however, is to use the Grouping Analysis tool to identify these threshold values for you. Grouping Analysis will optimize the within-group similarity and the between-group differences.

  1. Find and open the Grouping Analysis tool. Run the tool with the following parameters. The first time you run the tool, you will let it identify the optimal number of groups. The second time you will create the report file.
    • Input Features: AfricaData
    • Unique ID Field: CountryCode
    • Output Feature Class: the name of your output feature class such as FindGroups
    • Number of Groups: 2
    • Analysis Fields: LastLiteracy; MeanAdBirthRt
    • Spatial Constraints: NO_SPATIAL_CONSTRAINT
    • Initialization Method: FIND_SEED_LOCATIONS
    • Evaluate Optimal Number of Groups: Yes
      Grouping Analysis tool parameters
  2. Grouping Analysis will first try partitioning the countries into two groups, then three, then four, up to fifteen groups. It will calculate the Calinski Harabasz pseudo F-statistic to measure the effectiveness of each solution. For the analysis above, Grouping Analysis finds optimal homogeneity within each group and maximum differentiation among the groups when there is a total of three groups. There is a random component in how grouping analysis works, so your output may not be identical to the output below.
    Three groups is optimal
    Each component of the tool output is explained in the tool documentation. The R2 values, for example, indicate how effective each variable is at differentiating countries. A variable with an even distribution of values is not as effective as a variable with natural breaks.
  3. Run Grouping Analysis again, this time specifying three groups (since the first run of the tool indicated three groups was optimal), creating a report, and turning off the option to evaluate the optimal number of groups. With the overhead of creating the report, the tool may take several minutes to complete. Use the following parameters:
    • Input Features: AfricaData
    • Unique ID Field: CountryCode
    • Output Feature Class: the name of your output feature class such as Grp3LiteracyAndAdBirthRt
    • Number of Groups: 3
    • Analysis Fields: LastLiteracy; MeanAdBirthRt
    • Spatial Constraints: NO_SPATIAL_CONSTRAINT
    • Initialization Method: FIND_SEED_LOCATIONS
    • Output Report File: the name of your report file such as Grp3LitAdBirthRt.pdf
    • Evaluate Optimal Number of Groups: No
  4. Your map output will look similar to the map shown below. The groups will be the same, but the colors used to represent each group may be different. In other words, the green group below might be colored blue or red, but the same features will likely be together in each group.
    Groups based on literacy and adolescent birth rates
  5. To interpret the characteristics of each group, open the report file. You may either browse to the report on your hard disk or double-click the PDF in the Results window.
    Accessing the report from the Results window
  6. Each element of the report is explained in the tool documentation. Let's focus on the parallel box plot which summarizes each group across all of the variables. Notice that the green group (for the map above, the colors associated with each group for your results may be different) is associated with the highest adolescent birth rates and lowest literacy rates. If programs to reduce adolescent birth rates cannot be implemented across the entire continent, it might make sense to begin in Niger, Mali, Chad, and the other countries in this group.
    Parallel box plot for three groups
  7. Run Grouping Analysis again, this time partitioning countries based on mean primary school enrollment and literacy rates. As before, you will run the tool once to identify the optimal number of groups and again to create the report. Note: creating the report will add several minutes to tool execution, so be sure to remove the entry for the Output Report File parameter until you are ready for it.
    • Input Features: AfricaData
    • Unique ID Field: CountryCode
    • Output Feature Class: the name of your output feature class such as FindGroups
    • Number of Groups: 2
    • Analysis Fields: LastLiteracy; MeanSchEnroll
    • Spatial Constraints: NO_SPATIAL_CONSTRAINT
    • Initialization Method: FIND_SEED_LOCATIONS
    • Evaluate Optimal Number of Groups: Yes
  8. While there is a random component to the Grouping Analysis algorithm, if you run the Grouping Analysis tool repeatedly, you will see that the results are fairly consistent in indicating two groups are optimal.
  9. Run the Grouping Analysis tool again to create the report file.
    • Input Features: AfricaData
    • Unique ID Field: CountryCode
    • Output Feature Class: the name of your output feature class such as Grp2LiteracyAndSchEnroll
    • Number of Groups: 2
    • Analysis Fields: LastLiteracy; MeanSchEnroll
    • Spatial Constraints: NO_SPATIAL_CONSTRAINT
    • Initialization Method: FIND_SEED_LOCATIONS
    • Output Report File: the name of your report file such as Grp2LitSchEnroll.pdf
    • Evaluate Optimal Number of Groups: No
  10. The colors for your results might be reversed, but the countries in each group should be the same.
    Groups based on literacy and primary school enrollment
  11. Open the report file and find the parallel box plot graph. Notice the clear distinction between the blue and red groups with regard to both primary school enrollment rates and literacy rates.
    Parallel box plot for two groups
  12. Open the table associated with the output result layer and sort, smallest to largest, on the LastLiteracy field to see that Somalia and Niger have the lowest literacy rates and very low primary school enrollment. Consequently, all of the countries in the red group, starting with Somalia and Niger, would benefit from programs aimed at increasing primary school enrollment rates.
  13. For many countries in Africa, encouraging girls to go to school, and to stay in school, will also help to reduce adolescent birth rates and may provide remediation for the final variable in our model: gender bias. Let's use Grouping Analysis to examine differences in gender parity (the balance between boys and girls attending primary school) across Africa.
  14. Run Grouping Analysis again for literacy and the gender parity index, without specifying a report file. Check the Evaluate Optimal Number of Groups parameter. Your output should look similar to the map below.
    Groups based on literacy and gender parity
  15. Again, the optimal number of groups is two. If you open the output table and sort, smallest to largest, on the MeanParity field, you will notice that Somalia and Chad are associated with the smallest indices; in these countries there are many more boys than girls attending primary school. Guinea and the Central African Republic also have low indices for gender parity in conjunction with very low literacy rates. Programs encouraging families to educate their daughters will likely have the biggest impacts in these countries.
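
This case study describes the grouping method as K-means clustering. The following is a minimal sketch of that idea in plain Python, not the tool's actual implementation. The data values, the seed choice, and the function names (sq_dist, kmeans) are made up for illustration, and the sketch skips any standardization of the variables that the tool may apply.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, seeds, max_iter=100):
    """Assign each point to its nearest center, then move each center
    to the mean of its members; repeat until assignments stop changing."""
    centers = [tuple(s) for s in seeds]
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(len(centers)), key=lambda j: sq_dist(p, centers[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # converged
        labels = new_labels
        for j in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels, centers

# Hypothetical (literacy rate, adolescent birth rate) pairs, not real data.
points = [(25, 190), (30, 180), (28, 200),
          (60, 120), (65, 110), (62, 115),
          (90, 50), (88, 60), (92, 55)]

# Seeds chosen by hand here; different seeds can give different groups,
# which is one reason repeated runs may not match exactly.
labels, centers = kmeans(points, seeds=[points[0], points[3], points[6]])
print(labels)
print(centers)
```

The group numbers themselves carry no meaning; only which features end up together matters, which is why the colors on your map may differ from the ones shown.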

Summary

You found a properly specified regression model indicating literacy rates are a function of adolescent birth rates, primary school enrollment rates, and gender parity. You also created maps showing which countries were associated with both low literacy and each of the contributing variables. You can use this information to suggest targeted remediation strategies aimed at increasing literacy across Africa.

Workflow using ArcGIS Pro

ArcGIS Pro icon

Note:

The steps below are based on the 1.1 release of ArcGIS Pro, but they should work fine for later software releases as well. To follow the steps below, you may download the Millennium Development Goals data and prepare it as outlined above, download and unzip the data in the data package provided (already cleaned up for you), or improvise using your own data. The steps below assume you are using the data package provided.

Determine the most current literacy rate for each country

  1. If you haven't done so already, download and unzip the data package provided at the top of this case study.
  2. Open ArcGIS Pro and browse to the ModelingLiteracy.ppkx project package.
  3. Once the project opens, right-click the Literacy layer in the Contents pane and select Attribute Table. Notice the many Null values.
    The table includes a lot of null values
  4. You will be looking for a model that explains literacy rates. The variable you will be modeling will be the most up-to-date literacy rate available for each country. To pull this value out from all of the null values, you will use the Calculate Field tool with some Python statements.
  5. Begin by searching for the Add Field tool in the Geoprocessing pane.
  6. Add a new field to hold the most current literacy rate value using the parameters below:
    • Input Table: Literacy
    • Field Name: LastLiteracy
    • Field Type: Float
    • Field Alias: Most current literacy rate
    • Field is Nullable: Yes
      Add Field tool parameters
  7. Set the value of this new field to be the last non-Null literacy rate for each country. To do this, find and open the Calculate Field tool and run it with the following parameters:
    • Input Table: Literacy
    • Field Name: Most current literacy rate
    • Expression: getLast( !Y1990!, !Y1991!, !Y1992!, !Y1993!, !Y1994!, !Y1995!, !Y1996!, !Y1997!, !Y1998!, !Y1999!, !Y2000!, !Y2001!, !Y2002!, !Y2003!, !Y2004!, !Y2005!, !Y2006!, !Y2007!, !Y2008!, !Y2009!, !Y2010!, !Y2011!, !Y2012!, !Y2013!, !Y2014!, !Y2015!)
    • Code Block:

      def getLast(*allYears):
          notNull = list(filter(None,allYears))
          return notNull[-1]
      
    Note:
    Python is particular about indentations and case. You will get an error if you do not indent the second and third lines above in the code block, for example. In addition, if you define your function to be getLast, but refer to it in the Expression parameter as GetLast or getlast, Python will not be able to find it.
    Using Python with the Calculate Field tool
    The Expression parameter tells the Calculate Field tool to get all the literacy rate values from the fields Y1990, Y1991, Y1992, and so on, and to pass those values to a function called getLast.
    The Code Block defines the getLast function. Let's look at the Python code line by line:

    Python Statement | What it does
    def getLast(*allYears): | Defines (def) a new function called getLast that receives the sequence of values passed to it from the Expression parameter.
    notNull = list(filter(None,allYears)) | This line, indented four spaces, puts all of the non-Null values into a list called notNull.
    return notNull[-1] | This line, indented four spaces, sets the LastLiteracy field value to the final value in the notNull list. The -1 index selects the last element of the list.
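
If you want to see what getLast does before running it on the table, you can try it on a made-up row outside of the Calculate Field tool. The sketch below assumes Null values arrive in Python as None and uses invented numbers. The second function, get_last_or_none, is a hypothetical variant (not part of the case study) that returns None instead of raising an error when every year is Null.

```python
def getLast(*allYears):
    # filter(None, ...) keeps only values that are "truthy", dropping None
    notNull = list(filter(None, allYears))
    return notNull[-1]

def get_last_or_none(*all_years):
    # Hypothetical variant: keep every value that is not None and
    # return None when there is nothing to return.
    values = [v for v in all_years if v is not None]
    return values[-1] if values else None

# A made-up row: most years are Null, two years have a literacy rate.
row = (None, 41.5, None, None, 48.2, None)

print(list(filter(None, row)))   # the non-Null values, in year order
print(getLast(*row))             # the most recent non-Null value
print(get_last_or_none(None, None, None))
```

With the original function, a row in which every year is Null produces an empty list, and notNull[-1] then raises an IndexError; the variant is one way to guard against that.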

Compute mean values for each variable

You now have the most current literacy rate value for each country. What are the factors that might impact or explain those literacy rates? We suspect they are a function of several things including child health, hunger, school enrollment, and gender biases. Later in this workflow you will use these variables as your candidate explanatory variables to see if you can find a properly specified regression model for the literacy rate values.

The data you have for the candidate explanatory variables were collected between 1990 and 2015. Using all of the data available, you will compute a mean value for each variable.

  1. Begin by adding fields to hold the computed mean values. Find and open the Add Field tool. In this and the next step, you will add a mean value field to each of the explanatory variable datasets. Start with the AdolescentBirthRate feature class by running the Add Field tool using the following parameters:
    • Input Table: AdolescentBirthRate
    • Field Name: MeanAdBirthRt
    • Field Type: Float
    • Field Alias: Mean Adolescent Birth Rate
    • Field is Nullable: Yes
      Add Field tool parameters
  2. Use the Add Field tool to add a mean value field to the remaining explanatory variable datasets, using the parameters shown below:
  3. Input Table | Field Name | Field Type | Field Alias
    ChildHunger | MeanHunger | FLOAT | Mean Child Hunger Rate
    SchoolEnrollment | MeanSchEnroll | FLOAT | Mean Primary School Enrollment Rate
    ChildMortality | MeanMortality | FLOAT | Mean Child Mortality Rate
    GenderParity | MeanParity | FLOAT | Mean Gender Parity Index
    WomenInGovSeats | MeanWmGovSeats | FLOAT | Mean % Gov Seats Held by Women
  4. Once the new fields have been created, you will use the Calculate Field tool to compute the mean values for each dataset. To compute the mean value for the Adolescent Birth Rate dataset, for example, use the Calculate Field tool with the parameters below:
    • Input Table: AdolescentBirthRate
    • Field Name: Mean Adolescent Birth Rate
    • Expression: getMean( !Y1990!, !Y1991!, !Y1992!, !Y1993!, !Y1994!, !Y1995!, !Y1996!, !Y1997!, !Y1998!, !Y1999!, !Y2000!, !Y2001!, !Y2002!, !Y2003!, !Y2004!, !Y2005!, !Y2006!, !Y2007!, !Y2008!, !Y2009!, !Y2010!, !Y2011!, !Y2012!, !Y2013!, !Y2014!, !Y2015!)
    • Code Block:

      def getMean(*allYears):
          notNull = list(filter(None,allYears))
          theSum = sum(notNull)
          theCnt = len(notNull)
          return theSum/theCnt
      
  5. The Code Block defines the getMean function. This is what each Python statement does:

    Python Statement | What it does
    def getMean(*allYears): | Indicates you want to define (def) a new function called getMean that will use the sequence of values passed to it from the Expression parameter.
    notNull = list(filter(None,allYears)) | This line, indented four spaces, indicates Python should put all of the non-Null values into a list called notNull.
    theSum = sum(notNull) | This line, indented four spaces, instructs Python to sum the non-Null values.
    theCnt = len(notNull) | This line, indented four spaces, instructs Python to count the non-Null values.
    return theSum/theCnt | This line, indented four spaces, sets the value of the Field Name provided to be the mean (the sum divided by the count).

  6. Use the Calculate Field tool to compute the mean values for the remaining explanatory variable datasets (ChildHunger, GenderParity, ChildMortality, SchoolEnrollment, and WomenInGovSeats).
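
One detail worth knowing about filter(None, ...) is that it drops zeros as well as Null values, because Python treats 0 as false. Whether that matters depends on your data. The sketch below uses made-up values and a hypothetical variant, get_mean_keep_zeros, to show the difference; it assumes Null values arrive as None and is not part of the case study.

```python
def getMean(*allYears):
    # Same logic as the code block above: filter(None, ...) removes
    # None values, but it also removes any value equal to 0.
    notNull = list(filter(None, allYears))
    return sum(notNull) / len(notNull)

def get_mean_keep_zeros(*all_years):
    # Hypothetical variant: only None is treated as missing, and an
    # all-Null row returns None instead of dividing by zero.
    values = [v for v in all_years if v is not None]
    if not values:
        return None
    return sum(values) / len(values)

# A made-up row that contains a legitimate zero.
row = (0.0, None, 10.0, 20.0)

print(getMean(*row))              # mean of 10.0 and 20.0
print(get_mean_keep_zeros(*row))  # mean of 0.0, 10.0, and 20.0
```

If zero is a valid measurement for one of your variables, the second form is the safer choice; if zeros in your data really mean "no data", the original behavior may be what you want.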

Consolidate all of the data into a single feature class

The ArcGIS modeling tools require the literacy rate field (your dependent variable) and all of the mean values (your candidate explanatory variables) to be in the same feature class. You will use the Join Field tool to consolidate.

  1. Find and open the Join Field tool and use it to add the last literacy rate and mean value fields to the AfricaData feature class. To add the last literacy rate field, for example, use the following parameters:
    • Input Table: AfricaData
    • Input Join Field: CountryCode
    • Join Table: Literacy
    • Output Join Field: CountryCode
    • Join Fields: Most current literacy rate
      Join Field tool parameters
  2. Similarly, to add the MeanAdBirthRt field, use the Join Field tool with the following parameters:
    • Input Table: AfricaData
    • Input Join Field: CountryCode
    • Join Table: AdolescentBirthRate
    • Output Join Field: CountryCode
    • Join Fields: Mean Adolescent Birth Rate
  3. Continue to run the Join Field tool until all of the Mean value fields are in the AfricaData feature class (Mean Adolescent Birth Rate, Mean Child Hunger Rate, Mean Gender Parity Index, Mean Child Mortality Rate, Mean Primary School Enrollment Rate, Mean % Gov Seats Held by Women).
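
Conceptually, each Join Field run looks up a matching record by CountryCode and copies the requested fields onto the AfricaData record. The plain Python sketch below illustrates that lookup with placeholder country codes and invented values; the join_field function is hypothetical and only mimics the idea, including the assumption that records without a match are left with no value.

```python
# Placeholder rows; the codes and values are invented for illustration.
africa_data = [
    {"CountryCode": "AAA"},
    {"CountryCode": "BBB"},
    {"CountryCode": "CCC"},
]
literacy = [
    {"CountryCode": "AAA", "LastLiteracy": 28.7},
    {"CountryCode": "CCC", "LastLiteracy": 71.3},
]

def join_field(target_rows, key, join_rows, fields):
    """Copy the listed fields from join_rows onto target_rows,
    matching records on the shared key field."""
    lookup = {row[key]: row for row in join_rows}
    for row in target_rows:
        match = lookup.get(row[key])
        for field in fields:
            row[field] = match[field] if match is not None else None
    return target_rows

join_field(africa_data, "CountryCode", literacy, ["LastLiteracy"])
for row in africa_data:
    print(row)
```

Checking for records that did not pick up a value after each join is a quick way to catch mismatched country codes before you start modeling.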

Calculate correlations

You can get a sense for the relationship between the dependent variable (the most up-to-date literacy rate for each country) and each mean explanatory variable using the Spearman Rank-Order Correlation. This tool is not presently in ArcGIS, but you can easily run it from the Python command window by importing SciPy.

Opening the Python window

  1. In ArcGIS Pro, open the Python command window and type the following:
    import scipy.stats as stat
    dataArray = arcpy.da.FeatureClassToNumPyArray('AfricaData',("LastLiteracy","MeanAdBirthRt", "MeanHunger","MeanSchEnroll","MeanMortality","MeanParity","MeanWmGovSeats"))
    rs, pval = stat.spearmanr(dataArray["LastLiteracy"],dataArray["MeanAdBirthRt"])
    print ("rs = %f, p = %f" % (rs,pval))
    
    The result will be displayed as rs = -0.583488, p = 0.000017.
    Python window
    The result indicates a negative and significant relationship. The rs values from the Spearman Rank-Order Correlation statistic range from -1.0 to 1.0, with values near zero indicating little or no monotonic relationship, so a magnitude of 0.58 is not an exceptionally strong correlation. It is statistically significant, however: if there were no relationship, a correlation this strong would be very unlikely (p = 0.000017). The negative sign (-0.58) indicates a negative relationship: literacy rates tend to be lower where adolescent birth rates are higher.
  3. To see the correlations for the other explanatory variables, type the following in the Python command window.
    Tip:
    If you press the up arrow key while in the Python window, it brings back what you typed previously. Rather than retyping the statements above, you can press the up arrow key twice, modify the second variable name, place your cursor at the end of the statement, and reexecute the statement; then press the up arrow key twice again to reexecute the print statement.
    rs, pval = stat.spearmanr(dataArray["LastLiteracy"],dataArray["MeanHunger"])
    print ("rs = %f, p = %f" % (rs,pval))
    rs = -0.715657, p = 0.000000
    rs, pval = stat.spearmanr(dataArray["LastLiteracy"],dataArray["MeanParity"])
    print ("rs = %f, p = %f" % (rs,pval))
    rs = 0.672988, p = 0.000000
    rs, pval = stat.spearmanr(dataArray["LastLiteracy"],dataArray["MeanMortality"])
    print ("rs = %f, p = %f" % (rs,pval))
    rs = -0.741906, p = 0.000000
    rs, pval = stat.spearmanr(dataArray["LastLiteracy"],dataArray["MeanSchEnroll"])
    print ("rs = %f, p = %f" % (rs,pval))
    rs = 0.674144, p = 0.000000
    rs, pval = stat.spearmanr(dataArray["LastLiteracy"],dataArray["MeanWmGovSeats"])
    print ("rs = %f, p = %f" % (rs,pval))
    rs = 0.257401, p = 0.080690
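
If you are curious what the Spearman statistic measures, it can be described as an ordinary (Pearson) correlation computed on the ranks of the values rather than on the values themselves. The sketch below shows that idea in plain Python with made-up numbers. The function names are hypothetical, no p-value is computed, and for real work stat.spearmanr remains the better choice.

```python
def average_ranks(values):
    """Return 1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def spearman_rs(x, y):
    """Spearman rank-order correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Made-up values: y_down always falls as x rises; y_mixed mostly rises.
x = [10, 20, 30, 40, 50]
y_down = [5, 4, 3, 2, 1]
y_mixed = [1, 3, 2, 5, 4]

print(spearman_rs(x, y_down))
print(spearman_rs(x, y_mixed))
```

A perfectly decreasing relationship gives a value at the negative end of the range and a perfectly increasing one gives a value at the positive end, which is why both the sign and the magnitude of rs matter when you read the results above.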
    

Find a properly specified regression model

Looking at correlations is helpful, but you cannot fully trust these relationships unless you find a properly specified model. You will use the Exploratory Regression tool to see if you have any properly specified models among all of the candidate explanatory variables.

  1. Find and open the Exploratory Regression tool. Run the tool with the following parameters (accept the default values for all other parameters):
    • Input Features: AfricaData
    • Dependent Variable: LastLiteracy
    • Candidate Explanatory Variables: MeanAdBirthRt; MeanHunger; MeanSchEnroll; MeanMortality; MeanParity; MeanWmGovSeats
      Exploratory Regression tool parameters
  2. To see the report, hover over the progress bar at the bottom of the Geoprocessing pane and click the icon to open the full Exploratory Regression analysis report.
    Opening the tool execution messages
  3. When the report opens, you may resize the window by using the cursor to grab the lower left corner of the message window.
    The tool documentation provides a full explanation of each section of the Exploratory Regression analysis report. Let's focus on the first part of the report. Here, Exploratory Regression tries all possible combinations of one-, two-, three-, four-, and five-variable models.
    Exploratory Regression tool output

    The tool reports models with the highest adjusted R2 values first. R2 values range from 0.0 to 1.0 and this diagnostic tells you how much of the variation in the literacy rate values has been explained by the model. Notice that the MeanParity variable explains 54 percent of the literacy rate variation. Any passing models found are listed after the models with the highest adjusted R2 values.

    Notice that a model with only the MeanParity variable does pass; in other words, it meets all of the requirements of the OLS method. Since it only tells 54 percent of the literacy rate story, however, we will look for better results among the models using two or more explanatory variables.

  4. Notice that the tool does, in fact, find a number of two-variable and three-variable passing models. In addition, notice that there are no passing four- or five-variable models.
  5. The best model of literacy rates is a function of adolescent birth rates (MEANADBIRTHRT), school enrollment (MEANSCHENROLL), and mean gender parity (MEANPARITY). This model is best because it has the highest adjusted R2 value and the lowest AICc value.
    Several passing models
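
The reason the report ranks models by adjusted R2 rather than plain R2 is that plain R2 never goes down when you add a variable, while the adjusted value penalizes extra variables. The sketch below uses the standard textbook formula with invented numbers (50 observations and two hypothetical models); the tool documentation is the authority on exactly how the reported values are computed.

```python
def adjusted_r2(r2, n_obs, n_vars):
    """Standard adjusted R-squared for a model with n_vars explanatory
    variables fitted to n_obs observations."""
    return 1.0 - (1.0 - r2) * (n_obs - 1) / (n_obs - n_vars - 1)

# Hypothetical models: the five-variable model has a slightly higher
# plain R2, but it pays a larger penalty for its extra variables.
three_var = adjusted_r2(0.800, n_obs=50, n_vars=3)
five_var = adjusted_r2(0.805, n_obs=50, n_vars=5)

print("three-variable model: %.4f" % three_var)
print("five-variable model:  %.4f" % five_var)
```

In this made-up comparison the smaller model comes out ahead after adjustment, which mirrors the preference for the three-variable model above.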

Identify where remediation might be most effective

While you could use the Select Layer By Attribute tool to find the countries where low literacy overlaps with each of the explanatory variables in our model, you would need to make a decision about what constitutes low literacy (Is 35.4 percent low? What about 36.1 percent? Where is the threshold?). Similarly, you would need to identify threshold values for each of the explanatory variables (adolescent birth rates, school enrollment, and the gender parity index). Certainly, this is a reasonable thing to do. An alternative, however, is to use the Grouping Analysis tool to identify these threshold values for you. Grouping Analysis will optimize the within-group similarity and the between-group differences.

  1. Find and open the Grouping Analysis tool. Run the tool with the following parameters. The first time you run the tool, let it identify the optimal number of groups.
    • Input Features: AfricaData
    • Unique ID Field: CountryCode
    • Output Feature Class: the name of your output feature class such as FindGroups
    • Number of Groups: 2
    • Analysis Fields: Most current literacy rate; Mean Adolescent Birth Rate
    • Spatial Constraints: No spatial constraint
    • Initialization Method: Find seed locations
    • Evaluate Optimal Number of Groups: Yes
      Grouping Analysis tool parameters
  2. Grouping Analysis will first try partitioning the countries into two groups, then three, then four, up to fifteen groups. It will calculate the Calinski Harabasz pseudo F-statistic to measure the effectiveness of each solution. For the analysis above, Grouping Analysis finds optimal homogeneity within each group and maximum differentiation among the groups when there is a total of three groups. There is a random component in how grouping analysis works, so your output may not be identical to the output below.
    Three groups is optimal
    Each component of the output is explained in the tool documentation. The R2 values, for example, indicate how effective each variable is at differentiating countries. A variable with an even distribution of values will not be as effective as a variable with natural breaks.
  3. Run Grouping Analysis again, this time specifying three groups (since the first run indicated three groups would be optimal), creating a report, and turning off the option to evaluate the optimal number of groups. With the overhead of creating the report, the tool may take several minutes to complete. Use the following parameters:
    • Input Features: AfricaData
    • Unique ID Field: CountryCode
    • Output Feature Class: the name of your output feature class such as Grp3LiteracyAndAdBirthRt
    • Number of Groups: 3
    • Analysis Fields: Most current literacy rate; Mean Adolescent Birth Rate
    • Spatial Constraints: No spatial constraint
    • Initialization Method: Find seed locations
    • Output Report File: the name of your report file such as Grp3LitAdBirRt.pdf
    • Evaluate Optimal Number of Groups: No
  4. Your map output will look similar to the map shown below. The groups should be the same, but the colors used to represent each group may be different. In other words, the green group below might be colored blue or red, but the same features will likely be together in each group.
    Groups based on literacy and adolescent birth rates
  5. To interpret the characteristics of each group, open the report file by either browsing to the report on your hard disk or hovering over the progress bar at the bottom of the Geoprocessing pane and clicking on the report name.
    Accessing the Grouping Analysis report
  6. Each element of the report is explained in the tool documentation. Let's focus on the parallel box plot, which summarizes each group across all of the variables. Notice that the green group (for the map above, the colors associated with each group for your results may be different) is associated with the highest adolescent birth rates and lowest literacy rates. If programs to reduce adolescent birth rates cannot be implemented across the entire continent, it might make sense to begin in Niger, Mali, Chad, and the other countries in this group.
    Parallel box plot for literacy and adolescent birth rates
  7. Run Grouping Analysis again, this time partitioning countries based on mean primary school enrollment and literacy rates. As before, you will run the tool once to identify the optimal number of groups and again to create the report. Note that creating the report will add several minutes to tool execution, so be sure to remove the entry for the Output Report File parameter until you are ready for it.
    • Input Features: AfricaData
    • Unique ID Field: CountryCode
    • Output Feature Class: the name of your output feature class such as FindGroups
    • Number of Groups: 2
    • Analysis Fields: Most current literacy rate; Mean Primary School Enrollment Rate
    • Spatial Constraints: No spatial constraint
    • Initialization Method: Find seed locations
    • Evaluate Optimal Number of Groups: Yes
  8. While there is a random component to the Grouping Analysis algorithm, if you run the Grouping Analysis tool repeatedly, you will see that the results are fairly consistent in indicating two groups is optimal.
  9. Run the Grouping Analysis tool again to create the report file:
    • Input Features: AfricaData
    • Unique ID Field: CountryCode
    • Output Feature Class: the name of your output feature class such as Grp2LiteracyAndSchEnroll
    • Number of Groups: 2
    • Analysis Fields: Most current literacy rate; Mean Primary School Enrollment Rate
    • Spatial Constraints: No spatial constraint
    • Initialization Method: Find seed locations
    • Output Report File: the name of your report file such as Grp2LitSchEnroll.pdf
    • Evaluate Optimal Number of Groups: No
  10. The colors for your results might be reversed, but the countries in each group should be the same.
    Groups based on literacy rates and primary school enrollment
  11. Open the report file by hovering over the progress bar at the bottom of the Geoprocessing pane and clicking on the icon to open the messages. Within the report, find the parallel box plot graph. Notice the clear distinction between the blue and red groups with regard to both primary school enrollment rates and literacy rates.
    Parallel box plot for literacy and school enrollment
  12. Open the table associated with the output result layer and sort, smallest to largest, on the LastLiteracy field to see that Somalia and Niger have the lowest literacy rates and very low primary school enrollment. Consequently, all of the countries in the red group, starting with Somalia and Niger, would benefit from programs aimed at increasing primary school enrollment rates.
  13. For many countries in Africa, encouraging girls to go to school, and to stay in school, will also help to reduce adolescent birth rates and may provide remediation for the final variable in our model: gender bias. Let's use Grouping Analysis to examine differences in gender parity (the balance between boys and girls attending primary school) across Africa.
  14. Run Grouping Analysis again for literacy and the mean gender parity index, without specifying a report file. Check the Evaluate Optimal Number of Groups parameter. Your output should look similar to the map below.
    Groups based on literacy and gender parity
  15. Again, the optimal number of groups is two. If you open the output table and sort, smallest to largest, on the MeanParity field, you will notice that Somalia and Chad are associated with the smallest indices; in these countries there are many more boys than girls attending primary school. Guinea and the Central African Republic also have low indices for gender parity in conjunction with very low literacy rates. Programs encouraging families to educate their daughters will likely have their biggest impacts in these countries.
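
The Calinski Harabasz pseudo F-statistic mentioned above compares how far apart the groups are with how tightly each group is packed. The sketch below computes it in plain Python for two candidate groupings of the same made-up points. The pseudo_f function is a hypothetical illustration of the usual formula, (between-group sum of squares / (k - 1)) divided by (within-group sum of squares / (n - k)), and is not the tool's own code.

```python
def pseudo_f(points, labels):
    """Calinski-Harabasz pseudo F-statistic for a given grouping."""
    n = len(points)
    groups = sorted(set(labels))
    k = len(groups)
    dims = range(len(points[0]))
    overall = [sum(p[d] for p in points) / n for d in dims]
    between = 0.0
    within = 0.0
    for g in groups:
        members = [p for p, lab in zip(points, labels) if lab == g]
        center = [sum(p[d] for p in members) / len(members) for d in dims]
        # Spread of the group centers around the overall center.
        between += len(members) * sum((c - o) ** 2 for c, o in zip(center, overall))
        # Spread of the members around their own group center.
        within += sum(sum((p[d] - center[d]) ** 2 for d in dims) for p in members)
    return (between / (k - 1)) / (within / (n - k))

# Made-up points forming three tight clumps.
points = [(0, 0), (1, 0), (0, 1),
          (10, 10), (11, 10), (10, 11),
          (20, 0), (21, 0), (20, 1)]

three_groups = [0, 0, 0, 1, 1, 1, 2, 2, 2]
two_groups = [0, 0, 0, 1, 1, 1, 1, 1, 1]  # merges two of the clumps

f_three = pseudo_f(points, three_groups)
f_two = pseudo_f(points, two_groups)
print("three groups: %.2f" % f_three)
print("two groups:   %.2f" % f_two)
```

The grouping with the larger statistic separates the data more cleanly, which is the comparison the tool makes when you ask it to evaluate the optimal number of groups.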

Summary

You found a properly specified regression model indicating literacy rates are a function of adolescent birth rates, primary school enrollment rates, and gender parity. You also created maps showing which countries were associated with both low literacy and each of the contributing variables. You can use this information to suggest targeted remediation strategies aimed at increasing literacy across Africa.

References and resources for learning more

Afrol News, 2015. Some 80% of Somalis now illiterate. Afrol News, 23 January.

Hillman, A.L. and Jenkner, E. 2014. Educating Children in Poor Countries. International Monetary Fund, Economic Issues No. 33. www.imf.org/external/pubs/ft/issues/issues33/

Loaiza, E. and Liang, M. 2013. Adolescent Pregnancy: A Review of the Evidence. UNFPA. New York. www.unfpa.org/publications/adolescent-pregnancy

Madamombe, Itai. 2007. Food keeps African children in school. Africa Renewal Online. January 2007, page 10. www.un.org/africarenewal/magazine/january-2007/food-keeps-african-children-school

The World Bank, 2014. Girls' Education. The World Bank, Dec 3, 2014. www.worldbank.org/en/topic/education/brief/girls-education

UNESCO, 2011. Education for all. Regional Overview, Sub-Saharan Africa. UNESCO Global Monitoring Report. https://en.unesco.org/gem-report/

UNFPA, 2013. Motherhood in Childhood, Facing the challenge of adolescent pregnancy. UNFPA Publication. www.unfpa.org/publications/state-world-population-2013-0

United Nations. 2015. The Millennium Development Goals Report 2015. United Nations, New York. http://mdgs.un.org/unsd/mdg/Resources/Static/Products/Progress2015/English2015.pdf

Watkins, Kevin. 2013. Too Little Access, Not Enough Learning: Africa's Twin Deficit in Education. Brookings, January 16, 2013. The Brookings Institution. www.brookings.edu/research/opinions/2013/01/16-africa-learning-watkins

This case study demonstrates a number of analytical methods that can be adapted to many different application areas, allowing you to answer a variety of questions.

Method | Generic Question | Examples
Spearman Rank-Order Correlation | How does this relate to that? | Am I more likely to be robbed in a rich neighborhood or a poor neighborhood? Are test scores higher when teacher-to-student ratios are lower? How strong is the correlation between access to clean drinking water and literacy rates?
Exploratory Regression and Ordinary Least-Squares regression | What are the factors that contribute to or promote the thing I'm interested in? | What are the key variables that explain high forest fire frequency? What demographic characteristics contribute to high rates of public transportation usage? What factors are strong predictors of traffic accidents? Why are cancer rates so high in particular locations?
Grouping Analysis | Which features are most alike? | Which countries face the same challenges with regard to vulnerability? How should we divide the region into homogeneous sales territories?

You also used data manipulation and management functions, including Add Field, Calculate Field, and Join Field.

A number of resources are available to help you learn more about the analyses demonstrated in this case study:

Spatial Statistics resources

Regression Analysis Basics

Answering Why Questions: An introduction to regression analysis with spatial data

What they don't tell you about regression analysis

Spatial Data Mining I: Essentials of Cluster Analysis

Spatial Data Mining II: A Deep Dive Into Space-Time Analysis

Copyright © 2021 Esri.