This post continues “Wrangling Humanities Data,” a series that walks through the design of a data curation process using publicly-available grant data provided by the National Endowment for the Humanities (NEH). In this installment, I draw on some of the files, with which I previously demonstrated the creation of an archival data package, and explore the data, check on the quality and reliability of the data, and then use the data to create a geospatial dataset. The geospatial data will be the basis for a future post that will undertake more mapping activities based on the data. To do this analysis, I use the pandas data library, which is supported in a Python environment and Jupyter notebook.

At a high level, the notebook (code and descriptions below) takes these steps:

  1. Set up the programming environment (import useful Python modules, then input the original NEH grant data information).
  2. Explore the data to check data quality for inconsistencies, missing or inaccurate data, and other useful information about the information.
  3. Clean the data by removing incorrect values, and enchance and transform the data by adding or making the data more consistent. In this case, this involves providing the geographic coordinates for the awards.
  4. Finally, export the data to geojson, a portable and lightweight format that can be used for more advanced mapping projects in later steps.

As in the previous post, you can also find a Jupyter Notebook version of this post, which can be downloaded from the GitHub repository along with all of the data discussed here. File references discussed below indicate files included in the same neh-grant-data-project repository.

Cleaning and Transforming to Create a Geospatial Dataset

This notebook demonstrates some of the steps involved in cleaning the NEH grant data, checking on quality and consistency, and then transforming the data into geospatial information that can be the basis of a map. While most of the data is contained within the steps and cells of this notebook, at the end there is a script to use to export the final information as geojson, which can be used elsewhere or for other purposes. It is more consistent and portable than the original data.

Setup

For this activity, we will use some Python modules that may not be in the standard JupyterLab configuration. If you do not have them, you may need to install some of these modules: geojson (here), geopandas (here), geopy (here), descartes (here), and shapely (here).

# if you do not have the modules, uncomment the following line(s) to install
#!pip install geojson
#!pip install geopandas
#!pip install descartes
#!pip install geopy
# modules for data cleaning & transformation
import pandas as pd
import geopandas as gpd

# basic visualization
import matplotlib.pyplot as plt
%matplotlib inline
# modules for mapping
from shapely.geometry import Point

# may use for additional visualization
import seaborn as sns

# modules for geocoding
from geopy.geocoders import Nominatim
from geopy.extra.rate_limiter import RateLimiter
from time import sleep

Clean and filter the grant data using pandas

First, let’s explore the grant data and clean it up so that we can define the points that we will map - there should be one coordinate location for each grant that we want to display. For this task, we will begin using the pandas modules for working with data.

df_grants_1960s = pd.read_csv('neh-grants-data-202012/data/NEH_Grants1960s.csv')

df_grants_1960s.head()
AppNumber ApplicantType Institution OrganizationType InstCity InstState InstPostalCode InstCountry CongressionalDistrict Latitude ... EndGrant ProjectDesc ToSupport PrimaryDiscipline SupplementCount Supplements ParticipantCount Participants DisciplineCount Disciplines
0 FB-10007-68 2 Regents of the University of California, Berkeley Four-Year College Berkeley CA 94704-5940 USA 13 37.87029 ... 1969-12-31 No description No to support statement English 0 NaN 1 John Elliot [Project Director] 1 English
1 FB-10009-68 2 Pitzer College Four-Year College Claremont CA 91711-6101 USA 27 34.10373 ... 1969-12-31 No description No to support statement History of Religion 0 NaN 1 Steven Matthysse [Project Director] 1 History of Religion
2 FB-10015-68 2 University of California, Riverside University Riverside CA 92521-0001 USA 41 33.97561 ... 1969-12-31 No description No to support statement History, General 0 NaN 1 John Staude [Project Director] 1 History, General
3 FB-10019-68 2 Northeastern University Four-Year College Boston MA 02115-5005 USA 7 42.3395 ... 1969-12-31 No description No to support statement History, General 0 NaN 1 Thomas Havens [Project Director] 1 History, General
4 FB-10023-68 2 University of Pennsylvania University Philadelphia PA 19104-6205 USA 3 39.95298 ... 1969-12-31 No description No to support statement Psychology 0 NaN 1 Gresham Riley [Project Director] 1 Psychology

5 rows × 33 columns

It’s important to explore the dataset a bit first, so we can understand it better, do a sanity check, and identify the information that is needed to create a new set of mappable point data. The pandas library allows to easily do some of this basic work, including characterizing the shape of the data, seeing what datatypes it includes, and whether it is missing values.

df_grants_1960s.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1010 entries, 0 to 1009
Data columns (total 33 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   AppNumber              1010 non-null   object 
 1   ApplicantType          1010 non-null   int64  
 2   Institution            1010 non-null   object 
 3   OrganizationType       1010 non-null   object 
 4   InstCity               1010 non-null   object 
 5   InstState              1010 non-null   object 
 6   InstPostalCode         1010 non-null   object 
 7   InstCountry            1010 non-null   object 
 8   CongressionalDistrict  1010 non-null   int64  
 9   Latitude               1010 non-null   object 
 10  Longitude              1010 non-null   object 
 11  CouncilDate            1010 non-null   object 
 12  YearAwarded            1010 non-null   int64  
 13  ProjectTitle           1010 non-null   object 
 14  Program                1010 non-null   object 
 15  Division               1010 non-null   object 
 16  ApprovedOutright       1010 non-null   float64
 17  ApprovedMatching       1010 non-null   float64
 18  AwardOutright          1010 non-null   float64
 19  AwardMatching          1010 non-null   float64
 20  OriginalAmount         1010 non-null   float64
 21  SupplementAmount       1010 non-null   float64
 22  BeginGrant             1010 non-null   object 
 23  EndGrant               1010 non-null   object 
 24  ProjectDesc            1010 non-null   object 
 25  ToSupport              1010 non-null   object 
 26  PrimaryDiscipline      1010 non-null   object 
 27  SupplementCount        1010 non-null   int64  
 28  Supplements            0 non-null      float64
 29  ParticipantCount       1010 non-null   int64  
 30  Participants           960 non-null    object 
 31  DisciplineCount        1010 non-null   int64  
 32  Disciplines            1010 non-null   object 
dtypes: float64(7), int64(6), object(20)
memory usage: 260.5+ KB

Latitude and Longitude series are listed as data objects, not numeric data. I will convert these into numeric data later since it is necessary to process them into geospatial coordinates as numbers, not strings. On closer inspection, many of these fields include the strings “Unknown”, “unknown”, or “Un”, which I also want to change to null or None values. First, though, I will create a new data frame so this original data can be recovered later, if necessary, and also drop some of the information that I won’t need to map the grants.

Data quality checking and cleaning

There’s a lot of information here that won’t help to map the data. Aside from the information for location of the points (that is, Latitude and Longitude), I will keep some for use in a popup that can be displayed when hovering over a point on the map.

I regularly use panda’s isnull() and info() functions below to look for blank cells or missing information.

mappable_grant_info = df_grants_1960s.drop(['ApplicantType','OrganizationType','CouncilDate','ApprovedOutright','ApprovedMatching','OriginalAmount','SupplementAmount','BeginGrant','EndGrant','PrimaryDiscipline','SupplementCount','Supplements','ParticipantCount','DisciplineCount'], axis=1)

mappable_grant_info.head()
AppNumber Institution InstCity InstState InstPostalCode InstCountry CongressionalDistrict Latitude Longitude YearAwarded ProjectTitle Program Division AwardOutright AwardMatching ProjectDesc ToSupport Participants Disciplines
0 FB-10007-68 Regents of the University of California, Berkeley Berkeley CA 94704-5940 USA 13 37.87029 -122.26813 1967 Title not available Fellowships for Younger Scholars Fellowships and Seminars 8387.0 0.0 No description No to support statement John Elliot [Project Director] English
1 FB-10009-68 Pitzer College Claremont CA 91711-6101 USA 27 34.10373 -117.70701 1967 Title not available Fellowships for Younger Scholars Fellowships and Seminars 8387.0 0.0 No description No to support statement Steven Matthysse [Project Director] History of Religion
2 FB-10015-68 University of California, Riverside Riverside CA 92521-0001 USA 41 33.97561 -117.33113 1967 Title not available Fellowships for Younger Scholars Fellowships and Seminars 8387.0 0.0 No description No to support statement John Staude [Project Director] History, General
3 FB-10019-68 Northeastern University Boston MA 02115-5005 USA 7 42.3395 -71.09048 1967 Title not available Fellowships for Younger Scholars Fellowships and Seminars 8387.0 0.0 No description No to support statement Thomas Havens [Project Director] History, General
4 FB-10023-68 University of Pennsylvania Philadelphia PA 19104-6205 USA 3 39.95298 -75.19276 1967 Title not available Fellowships for Younger Scholars Fellowships and Seminars 8387.0 0.0 No description No to support statement Gresham Riley [Project Director] Psychology
mappable_grant_info.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1010 entries, 0 to 1009
Data columns (total 19 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   AppNumber              1010 non-null   object 
 1   Institution            1010 non-null   object 
 2   InstCity               1010 non-null   object 
 3   InstState              1010 non-null   object 
 4   InstPostalCode         1010 non-null   object 
 5   InstCountry            1010 non-null   object 
 6   CongressionalDistrict  1010 non-null   int64  
 7   Latitude               1010 non-null   object 
 8   Longitude              1010 non-null   object 
 9   YearAwarded            1010 non-null   int64  
 10  ProjectTitle           1010 non-null   object 
 11  Program                1010 non-null   object 
 12  Division               1010 non-null   object 
 13  AwardOutright          1010 non-null   float64
 14  AwardMatching          1010 non-null   float64
 15  ProjectDesc            1010 non-null   object 
 16  ToSupport              1010 non-null   object 
 17  Participants           960 non-null    object 
 18  Disciplines            1010 non-null   object 
dtypes: float64(2), int64(2), object(15)
memory usage: 150.0+ KB

While there are entries for each grant, on closer inspection it is clear that there are some data quality issues. For example, a lot of grants seem to be missing Participants information, and the latitude and longitude information, which will be the basis for our mapping goal, is not numerical but listed as an object type of data. As I can see in the above cell, much of the information is described as “object,” which in these case are mostly string data, which will not be possible to convert into coordinate points.

This means that there is more data cleaning to do before the data is ready to map. I used the following methods to get more information about specific data fields, including AppNumber (the data element that is most like a unique identifier), Latitude (a creator-generated field, which is presumably generated by a geocoding algorithm), and InstCountry.

mappable_grant_info['AppNumber'].value_counts()
RO-10389       1
EO-10051-68    1
RO-10259-68    1
FB-10225-68    1
FT-10147-67    1
              ..
EO-10302-69    1
EO-10085-69    1
EO-10242-69    1
FT-10294-68    1
FT-10157-67    1
Name: AppNumber, Length: 1010, dtype: int64
mappable_grant_info['Latitude'].value_counts()
Unknown     90
41.31003    17
40.81835    16
34.07516    16
37.87029    14
            ..
40.771       1
42.82649     1
40.85069     1
42.29017     1
42.35675     1
Name: Latitude, Length: 382, dtype: int64

In looking closer at InstCounty, it is clear that not all of the grants were given to recipients within the United States. As a scoping exericse, I want to remove the locations that are not within the US.

mappable_grant_info['InstCountry'].unique()
array(['USA', 'Unknown', 'Canada', 'United Kingdom'], dtype=object)
mappable_grant_info['InstCountry'].value_counts()
USA               993
Unknown            14
Canada              2
United Kingdom      1
Name: InstCountry, dtype: int64

Many of the grant entries list Unknown or unknown in the fields with information about country, institution, and even latitude or longitude. This challenges our mapping project since if we cannot locate a grant, we cannot create a point to place it on the map. To address these challenges, we need first to identify grants withough location information, then we need to get the locations of the places without latitude and longitude. For the latter step, we will use a geocoding module.

For scoping, let’s drop the UK and Canada grants, and then see whether we can geolocate the unknowns.

#filter out Canada & UK
mappable_grant_info = mappable_grant_info[mappable_grant_info['InstCountry'].isin(['Canada','United Kingdom']) == False]
mappable_grant_info['InstCountry'].value_counts()
USA        993
Unknown     14
Name: InstCountry, dtype: int64

My goal is to use information about the city and state of the award to get a geographic coordinate for the 14 listing “Unknown” countries. There remain, however, a few entries that don’t appear to have enough information to geocode.

These can be identified by the value Un in the InstState field:

mappable_grant_info[mappable_grant_info['InstState'] == 'Un']
AppNumber Institution InstCity InstState InstPostalCode InstCountry CongressionalDistrict Latitude Longitude YearAwarded ProjectTitle Program Division AwardOutright AwardMatching ProjectDesc ToSupport Participants Disciplines
221 FB-10243-68 Unknown Unknown Un Unknown Unknown 0 Unknown unknown 1967 Title not available Fellowships for College Teachers and Independe... Research Programs 8387.0 0.0 No description No to support statement NaN Comparative Literature
337 FT-10440-68 Unknown Unknown Un Unknown Unknown 0 Unknown unknown 1967 Title not available Summer Stipends Research Programs 1500.0 0.0 No description No to support statement NaN English
428 RO-10048-67 Unknown Unknown Un Unknown Unknown 0 Unknown unknown 1967 Biography of Theodore Roethke Basic Research Research Programs 5000.0 0.0 No description No to support statement NaN Art History and Criticism
429 RO-10050-67 Unknown Unknown Un Unknown Unknown 0 Unknown unknown 1967 Symbolic Landscape in Modern Poetry Basic Research Research Programs 2000.0 0.0 No description No to support statement NaN American Literature
454 RO-10166-67 Unknown Unknown Un Unknown Unknown 0 Unknown unknown 1967 History of Book Publishing in America Basic Research Research Programs 25000.0 0.0 No description No to support statement NaN History, General
459 RO-10211-68 Unknown Unknown Un Unknown Unknown 0 Unknown unknown 1968 Biography of Richard Wright Basic Research Research Programs 8000.0 0.0 No description No to support statement NaN Literature, General
480 RO-10317-69 Unknown Unknown Un Unknown Unknown 0 Unknown unknown 1968 Biography of Richard Wright Basic Research Research Programs 8000.0 0.0 No description No to support statement NaN Literature, General; Social Sciences, General
757 FT-10257-67 McMaster University Ontario, Canada Un 00000-0000 USA 1 Unknown unknown 1967 Title not available Summer Stipends Research Programs 2000.0 0.0 No description No to support statement Chauncey Wood [Project Director] British Literature
774 FT-10310-68 Simon Fraser University B.C., Canada Un 00000-0000 USA 0 Unknown unknown 1967 Title not available Summer Stipends Research Programs 1500.0 0.0 No description No to support statement Jared Curtis [Project Director] English; Literature, General

It is clear that most of these do not have enough information to map. The last two are in Canada, and so I don’t want them anyway. So I’ll remove them from the data:

#filter out unknown state rows
mappable_grant_info = mappable_grant_info[mappable_grant_info['InstState'].isin(['Un']) == False]
mappable_grant_info.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 998 entries, 0 to 1009
Data columns (total 19 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   AppNumber              998 non-null    object 
 1   Institution            998 non-null    object 
 2   InstCity               998 non-null    object 
 3   InstState              998 non-null    object 
 4   InstPostalCode         998 non-null    object 
 5   InstCountry            998 non-null    object 
 6   CongressionalDistrict  998 non-null    int64  
 7   Latitude               998 non-null    object 
 8   Longitude              998 non-null    object 
 9   YearAwarded            998 non-null    int64  
 10  ProjectTitle           998 non-null    object 
 11  Program                998 non-null    object 
 12  Division               998 non-null    object 
 13  AwardOutright          998 non-null    float64
 14  AwardMatching          998 non-null    float64
 15  ProjectDesc            998 non-null    object 
 16  ToSupport              998 non-null    object 
 17  Participants           955 non-null    object 
 18  Disciplines            998 non-null    object 
dtypes: float64(2), int64(2), object(15)
memory usage: 155.9+ KB

Let’s take a closer look at the latitude and longitude information, which will be the basis for determining each point to map later on. This reveals that there are still at least 78 entries without location information:

mappable_grant_info['Latitude'].describe()
count         998
unique        382
top       Unknown
freq           78
Name: Latitude, dtype: object
mappable_grant_info['Longitude'].describe()
count         998
unique        383
top       unknown
freq           78
Name: Longitude, dtype: object

It will not be possible to map these grants without further information. My goal is to use data from the city and state location fields, which I can send to a geocoding tool in order to provide a location coordinate. To begin, I will remove the string values “unknown” or “Unknown” in the dataset and then replace them with null (blank) values, which I can filter more easily. Then, I will use Nominatim, a geocoding tool that will search for coordinate information based on the available place information.

# replace strings with null values
nonvals = ['unknown','Unknown']

mappable_grant_info = mappable_grant_info.replace(nonvals, [None, None])

latlons = mappable_grant_info.loc[:,'Latitude':'Longitude']

print(latlons['Latitude'],'\n')
print('Is it a null value?',type(mappable_grant_info.loc[1009,'Latitude']))
0       37.87029
1       34.10373
2       33.97561
3        42.3395
4       39.95298
          ...   
1005    40.74955
1006    32.88647
1007    41.02476
1008    29.94888
1009        None
Name: Latitude, Length: 998, dtype: object 

Is it a null value? <class 'NoneType'>
mappable_grant_info.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 998 entries, 0 to 1009
Data columns (total 19 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   AppNumber              998 non-null    object 
 1   Institution            970 non-null    object 
 2   InstCity               997 non-null    object 
 3   InstState              998 non-null    object 
 4   InstPostalCode         970 non-null    object 
 5   InstCountry            991 non-null    object 
 6   CongressionalDistrict  998 non-null    int64  
 7   Latitude               920 non-null    object 
 8   Longitude              920 non-null    object 
 9   YearAwarded            998 non-null    int64  
 10  ProjectTitle           998 non-null    object 
 11  Program                998 non-null    object 
 12  Division               998 non-null    object 
 13  AwardOutright          998 non-null    float64
 14  AwardMatching          998 non-null    float64
 15  ProjectDesc            998 non-null    object 
 16  ToSupport              998 non-null    object 
 17  Participants           955 non-null    object 
 18  Disciplines            998 non-null    object 
dtypes: float64(2), int64(2), object(15)
memory usage: 195.9+ KB

Since there are only 920 Lat & Lon entries, that leaves the 78 unknown (now null) entries wihtout information. While it would be possible to gather this information with a tool like Get Lat+Lon, that would take while. So I will use the information about the recipient’s city and state to get a general idea of where the award was given.

# create a geoquery string, which will allow to at least map the city
mappable_grant_info['geoquery'] = mappable_grant_info['InstCity'] + ' ' + mappable_grant_info['InstState']

mappable_grant_info[mappable_grant_info['geoquery'].isnull()]
AppNumber Institution InstCity InstState InstPostalCode InstCountry CongressionalDistrict Latitude Longitude YearAwarded ProjectTitle Program Division AwardOutright AwardMatching ProjectDesc ToSupport Participants Disciplines geoquery
168 FA-10157-68 None None NY None USA 0 None None 1967 Title not available Fellowships for University Teachers Research Programs 15520.0 0.0 No description No to support statement NaN U.S. History NaN

Now there’s only one left that lacks enough information. To check, look at that grant:

mappable_grant_info.loc[168, :]
AppNumber                                        FA-10157-68
Institution                                             None
InstCity                                                None
InstState                                                 NY
InstPostalCode                                          None
InstCountry                                              USA
CongressionalDistrict                                      0
Latitude                                                None
Longitude                                               None
YearAwarded                                             1967
ProjectTitle                             Title not available
Program                  Fellowships for University Teachers
Division                                   Research Programs
AwardOutright                                          15520
AwardMatching                                              0
ProjectDesc                                   No description
ToSupport                            No to support statement
Participants                                             NaN
Disciplines                                     U.S. History
geoquery                                                 NaN
Name: 168, dtype: object

There is just not enough information here - the record does not include information about the city or institution that received this award, so remove that one, too.

mappable_grant_info = mappable_grant_info.drop([168])

mappable_grant_info.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 997 entries, 0 to 1009
Data columns (total 20 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   AppNumber              997 non-null    object 
 1   Institution            970 non-null    object 
 2   InstCity               997 non-null    object 
 3   InstState              997 non-null    object 
 4   InstPostalCode         970 non-null    object 
 5   InstCountry            990 non-null    object 
 6   CongressionalDistrict  997 non-null    int64  
 7   Latitude               920 non-null    object 
 8   Longitude              920 non-null    object 
 9   YearAwarded            997 non-null    int64  
 10  ProjectTitle           997 non-null    object 
 11  Program                997 non-null    object 
 12  Division               997 non-null    object 
 13  AwardOutright          997 non-null    float64
 14  AwardMatching          997 non-null    float64
 15  ProjectDesc            997 non-null    object 
 16  ToSupport              997 non-null    object 
 17  Participants           955 non-null    object 
 18  Disciplines            997 non-null    object 
 19  geoquery               997 non-null    object 
dtypes: float64(2), int64(2), object(16)
memory usage: 163.6+ KB

Geocode grants missing location coordinates

For the remaining 77 grants without location information, try to geocode by city and state information.

# create a subset of the information for geoquerying
geoQuerySet = mappable_grant_info.loc[:, ['AppNumber','InstCity','InstState','Latitude','Longitude','geoquery']]

#pull out the entries without location coordinates
geoQuerySet = geoQuerySet[geoQuerySet['Latitude'].isnull()]

geoQuerySet.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 77 entries, 24 to 1009
Data columns (total 6 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   AppNumber  77 non-null     object
 1   InstCity   77 non-null     object
 2   InstState  77 non-null     object
 3   Latitude   0 non-null      object
 4   Longitude  0 non-null      object
 5   geoquery   77 non-null     object
dtypes: object(6)
memory usage: 4.2+ KB
# set up geolocator
# increase timeout to reduce errors 
geolocator = Nominatim(user_agent='neh-grant-points', timeout=10)

# limit to comply with rate limiting requirements
geocode = RateLimiter(geolocator.geocode, min_delay_seconds=1)

# initiate geolocation for the 77 grants
geoQuerySet['location'] = geoQuerySet['geoquery'].apply(geocode)
geoQuerySet.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 77 entries, 24 to 1009
Data columns (total 7 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   AppNumber  77 non-null     object
 1   InstCity   77 non-null     object
 2   InstState  77 non-null     object
 3   Latitude   0 non-null      object
 4   Longitude  0 non-null      object
 5   geoquery   77 non-null     object
 6   location   76 non-null     object
dtypes: object(7)
memory usage: 4.8+ KB

All but one of the geoquery requests returned a location value. To see what’s missing, check for the empty values using isnull():

geoQuerySet[geoQuerySet['location'].isnull()]
AppNumber InstCity InstState Latitude Longitude geoquery location
998 RP-10003-68 Charlottsvlle VA None None Charlottsvlle VA None

Looks like that one has a typo! I adapted the fix & replace process that @cduvallet outlined (link to full process at the end of this post).

fixgeolocate = {
    'Charlottsvlle VA' : 'Charlottesville VA'
}

new_locations = {}
for key, val in fixgeolocate.items():
    loc = geocode(val)
    new_locations[key] = loc
    sleep(1)
print(new_locations)
{'Charlottsvlle VA': Location(The V, 201-213, 15th Street Northwest, Venable, Starr Hill, Charlottesville, Virginia, 22903, United States, (38.0360726, -78.49973472559668, 0.0))}

While here, I will also correct the typo in the dataset.

# update to correct spelling of Charlottesville in cleaned_grants_for_mappable = mappable_grant_info i=998
mappable_grant_info = mappable_grant_info.replace('Charlottsvlle','Charlottesville')
# update the geoquery set 
# get indices of rows with no location
null_locs = geoQuerySet[geoQuerySet['location'].isnull()].index

# sub in the new_locs
geoQuerySet.loc[null_locs, 'location'] = geoQuerySet.loc[null_locs, 'geoquery'].map(new_locations)
geoQuerySet.head()
AppNumber InstCity InstState Latitude Longitude geoquery location
24 AO-10025-69 Westport CT None None Westport CT (Westport Court, Westport, Columbia County, Ge...
40 EH-10038-68 Evanston IL None None Evanston IL (Evanston, Evanston Township, Cook County, Ill...
47 EH-10058-66 New York NY None None New York NY (New York, United States, (40.7127281, -74.006...
51 EO-10001-67 Dover DE None None Dover DE (Dover, Kent County, Delaware, United States, ...
61 EO-10051-68 University Park PA None None University Park PA (University Park, College Township, Centre Cou...

Now we have all information to provide a coordinate location for those grants that didn’t have the information. What remains is to pull the Lat & Lon information back into those columns.

# pull the lats & lons from the locations - see 
geoQuerySet['Latitude'] = geoQuerySet['location'].apply(lambda x: x.latitude)
geoQuerySet['Longitude'] = geoQuerySet['location'].apply(lambda x: x.longitude)

geoQuerySet.head()
AppNumber InstCity InstState Latitude Longitude geoquery location
24 AO-10025-69 Westport CT 33.554852 -82.066172 Westport CT (Westport Court, Westport, Columbia County, Ge...
40 EH-10038-68 Evanston IL 42.044739 -87.693046 Evanston IL (Evanston, Evanston Township, Cook County, Ill...
47 EH-10058-66 New York NY 40.712728 -74.006015 New York NY (New York, United States, (40.7127281, -74.006...
51 EO-10001-67 Dover DE 39.158168 -75.524368 Dover DE (Dover, Kent County, Delaware, United States, ...
61 EO-10051-68 University Park PA 40.808749 -77.858566 University Park PA (University Park, College Township, Centre Cou...
geoQuerySet.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 77 entries, 24 to 1009
Data columns (total 7 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   AppNumber  77 non-null     object 
 1   InstCity   77 non-null     object 
 2   InstState  77 non-null     object 
 3   Latitude   77 non-null     float64
 4   Longitude  77 non-null     float64
 5   geoquery   77 non-null     object 
 6   location   77 non-null     object 
dtypes: float64(2), object(5)
memory usage: 7.3+ KB

Merge the data

I can see using info() that the main dataframe still lacks the latitude and longitude information for the newly points geocoded points. So now, let’s merge the locations back into the dataframe for mapping (cleaned_grants_for_mappable).

# create a clean dataframe, just in case something happens to remove data
cleaned_grants_for_mappable = mappable_grant_info
cleaned_grants_for_mappable.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 997 entries, 0 to 1009
Data columns (total 20 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   AppNumber              997 non-null    object 
 1   Institution            970 non-null    object 
 2   InstCity               997 non-null    object 
 3   InstState              997 non-null    object 
 4   InstPostalCode         970 non-null    object 
 5   InstCountry            990 non-null    object 
 6   CongressionalDistrict  997 non-null    int64  
 7   Latitude               920 non-null    object 
 8   Longitude              920 non-null    object 
 9   YearAwarded            997 non-null    int64  
 10  ProjectTitle           997 non-null    object 
 11  Program                997 non-null    object 
 12  Division               997 non-null    object 
 13  AwardOutright          997 non-null    float64
 14  AwardMatching          997 non-null    float64
 15  ProjectDesc            997 non-null    object 
 16  ToSupport              997 non-null    object 
 17  Participants           955 non-null    object 
 18  Disciplines            997 non-null    object 
 19  geoquery               997 non-null    object 
dtypes: float64(2), int64(2), object(16)
memory usage: 163.6+ KB
# create a list of the indices for matching rows between the data frames
geoInfotoReplaceIndices = geoQuerySet.loc[:,'AppNumber'].index
geoInfotoReplaceIndices
Int64Index([  24,   40,   47,   51,   61,   74,   83,  103,  106,  107,  110,
             111,  117,  120,  123,  124,  148,  151,  154,  174,  188,  198,
             209,  341,  359,  378,  386,  387,  388,  398,  402,  412,  430,
             437,  477,  482,  483,  484,  511,  522,  553,  597,  608,  624,
             637,  669,  685,  689,  702,  708,  746,  750,  779,  791,  795,
             802,  855,  856,  857,  859,  860,  861,  866,  871,  872,  885,
             890,  902,  927,  936,  938,  940,  961,  970,  989,  998, 1009],
           dtype='int64')
# use the row indexes to assign data from the geoQuerySet back to the cleaned map info for Latitude
cleaned_grants_for_mappable.loc[geoInfotoReplaceIndices,['Latitude']] = geoQuerySet.loc[geoInfotoReplaceIndices,['Latitude']]

# use the row indexes to assign data from the geoQuerySet back to the cleaned map info for Longitude
cleaned_grants_for_mappable.loc[geoInfotoReplaceIndices,['Longitude']] = geoQuerySet.loc[geoInfotoReplaceIndices,['Longitude']]
cleaned_grants_for_mappable.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 997 entries, 0 to 1009
Data columns (total 20 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   AppNumber              997 non-null    object 
 1   Institution            970 non-null    object 
 2   InstCity               997 non-null    object 
 3   InstState              997 non-null    object 
 4   InstPostalCode         970 non-null    object 
 5   InstCountry            990 non-null    object 
 6   CongressionalDistrict  997 non-null    int64  
 7   Latitude               997 non-null    object 
 8   Longitude              997 non-null    object 
 9   YearAwarded            997 non-null    int64  
 10  ProjectTitle           997 non-null    object 
 11  Program                997 non-null    object 
 12  Division               997 non-null    object 
 13  AwardOutright          997 non-null    float64
 14  AwardMatching          997 non-null    float64
 15  ProjectDesc            997 non-null    object 
 16  ToSupport              997 non-null    object 
 17  Participants           955 non-null    object 
 18  Disciplines            997 non-null    object 
 19  geoquery               997 non-null    object 
dtypes: float64(2), int64(2), object(16)
memory usage: 203.6+ KB
# reassign our complete information to the mappable_grant_info dataframe
mappable_grant_info = cleaned_grants_for_mappable

At this point, there are two ‘complete’ datasets: cleaned_grants_for_mappable & mappable_grant_info. Below, mappable_grant_info is used the main set.

Add coordinates and map

Using the filled in latitude and longitude, we can now add in coordinate information to prepare to map each grant as a point on a map.

# convert lat & lon into numeric values
mappable_grant_info.astype({'Latitude':'float64','Longitude':'float64'}).dtypes
AppNumber                 object
Institution               object
InstCity                  object
InstState                 object
InstPostalCode            object
InstCountry               object
CongressionalDistrict      int64
Latitude                 float64
Longitude                float64
YearAwarded                int64
ProjectTitle              object
Program                   object
Division                  object
AwardOutright            float64
AwardMatching            float64
ProjectDesc               object
ToSupport                 object
Participants              object
Disciplines               object
geoquery                  object
dtype: object
# combine latitute and longitude to create coordinate information for mapping
mappable_grant_info['coordinates'] = mappable_grant_info[['Longitude','Latitude']].values.tolist()

mappable_grant_info.tail()
AppNumber Institution InstCity InstState InstPostalCode InstCountry CongressionalDistrict Latitude Longitude YearAwarded ... Program Division AwardOutright AwardMatching ProjectDesc ToSupport Participants Disciplines geoquery coordinates
1005 RX-10010-69 American Council of Learned Societies New York NY 10017-6706 USA 12 40.74955 -73.97462 1969 ... Conferences Research Programs 25000.0 0.0 No description No to support statement Gordon Turner [Project Director] Interdisciplinary Studies, General New York NY [-73.97462, 40.74955]
1006 EO-10125 Salk Institute for Biological Studies La Jolla CA 92037-1002 USA 49 32.88647 -117.24392 1969 ... Institutional Planning and Development Education Programs 7434.0 0.0 To teach a course based on American texts and ... No to support statement Christine Tracy [Project Director] History and Philosophy of Science, Technology,... La Jolla CA [-117.24392, 32.88647]
1007 EO-10225 Manhattanville College Purchase NY 10577-2132 USA 17 41.02476 -73.71567 1969 ... Institutional Planning and Development Education Programs 28320.0 0.0 No description No to support statement Marcus Lawson [Project Director] Interdisciplinary Studies, General Purchase NY [-73.71567, 41.02476]
1008 EO-10231 Louisiana Endowment for the Humanities New Orleans LA 70113-1027 USA 2 29.94888 -90.07392 1969 ... Institutional Planning and Development Education Programs 10000.0 0.0 This is one of seven "summer workshops on Negr... No to support statement Elton Harrison [Project Director] Interdisciplinary Studies, General New Orleans LA [-90.07392, 29.94888]
1009 RO-10389 California State University, Long Beach Long Beach CA 90840-0004 USA 48 33.769 -118.192 1969 ... Basic Research Research Programs 9990.0 0.0 No description No to support statement Nizan Shaked [Project Director] Social Sciences, General Long Beach CA [-118.191604, 33.7690164]

5 rows × 21 columns

# create shapely geoPoints for a geometry column
mappable_grant_info['geometry'] = gpd.points_from_xy(mappable_grant_info['Longitude'],mappable_grant_info['Latitude'])
mappable_grant_info.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 997 entries, 0 to 1009
Data columns (total 22 columns):
 #   Column                 Non-Null Count  Dtype   
---  ------                 --------------  -----   
 0   AppNumber              997 non-null    object  
 1   Institution            970 non-null    object  
 2   InstCity               997 non-null    object  
 3   InstState              997 non-null    object  
 4   InstPostalCode         970 non-null    object  
 5   InstCountry            990 non-null    object  
 6   CongressionalDistrict  997 non-null    int64   
 7   Latitude               997 non-null    object  
 8   Longitude              997 non-null    object  
 9   YearAwarded            997 non-null    int64   
 10  ProjectTitle           997 non-null    object  
 11  Program                997 non-null    object  
 12  Division               997 non-null    object  
 13  AwardOutright          997 non-null    float64 
 14  AwardMatching          997 non-null    float64 
 15  ProjectDesc            997 non-null    object  
 16  ToSupport              997 non-null    object  
 17  Participants           955 non-null    object  
 18  Disciplines            997 non-null    object  
 19  geoquery               997 non-null    object  
 20  coordinates            997 non-null    object  
 21  geometry               997 non-null    geometry
dtypes: float64(2), geometry(1), int64(2), object(17)
memory usage: 219.1+ KB
mappable_grant_info['geometry'].describe
<bound method NDFrame.describe of 0       POINT (-122.26813 37.87029)
1       POINT (-117.70701 34.10373)
2       POINT (-117.33113 33.97561)
3        POINT (-71.09048 42.33950)
4        POINT (-75.19276 39.95298)
                   ...             
1005     POINT (-73.97462 40.74955)
1006    POINT (-117.24392 32.88647)
1007     POINT (-73.71567 41.02476)
1008     POINT (-90.07392 29.94888)
1009    POINT (-118.19160 33.76902)
Name: geometry, Length: 997, dtype: geometry>

The geometry column now contains POINTs, a special datatype that will be necessary for the geospatial dataset.

For the final steps of cleaning and transforming this data to a geospatial dataset, the geopandas module provides more options, so I will convert the data to geopandas (aka “gpd”) now:

# convert the information to a geopandas dataframe
mappable_gdf = gpd.GeoDataFrame(mappable_grant_info)

Using the type() function, I can see that this is now a “GeoDataFrame” with “GeoSeries” in some of the fields.

type(mappable_gdf)
geopandas.geodataframe.GeoDataFrame
type(mappable_gdf['geometry'])
geopandas.geoseries.GeoSeries

At this point, the mappable_gdf data is ready to plot on a map. For these purposes, the data is cleaned and consistent and contains the necessary geospatial coordinates to create a point for each of the 997 grants made during the 1960s decade for which NEH maintained information about the location of the award, and that were made to the U.S. states.

Data cleanup by geography

Now I have relatively clean data, and I can use basic mapping functions of geopandas to get an idea of any egregious outliers. For this task, geopandas supports some basic mapping tools (some of the built-in mapping features included in geopandas) to get an idea of additional data errors. Once I have taken these additional steps, I can export this data for use later on to create more detailed maps.

# create a mapspace for the geodataframe
world = gpd.read_file(gpd.datasets.get_path('naturalearth_lowres'))

usa = world.query('name == "United States of America"')

ax = usa.plot(
    color='white', edgecolor='black'
)

mappable_gdf.plot(ax=ax, color='red')

plt.show()

png

The map is small, but this offers an initial data quality check, and I can see that there are a few points that don’t look right. There may be some in Europe (or Canada?) and one that appears to be in the Pacific. It’s possible that this is a grant to American Samoa or another US territory in the Pacific, but I want to find out.

# find the distant outlier... 

mappable_gdf.sort_values(by=['Latitude','Longitude']).head(2)
AppNumber Institution InstCity InstState InstPostalCode InstCountry CongressionalDistrict Latitude Longitude YearAwarded ... Division AwardOutright AwardMatching ProjectDesc ToSupport Participants Disciplines geoquery coordinates geometry
151 FA-10103-67 None Richmond VA None None 0 -32.8652 151.501 1967 ... Research Programs 13270.0 0.0 No description No to support statement NaN U.S. History Richmond VA [151.5012, -32.8652] POINT (151.50120 -32.86520)
702 FT-10051-67 Elementary Shakespeare Corp. Fort Pierce FL 34982 USA 18 27.4467 -80.3256 1967 ... Research Programs 2000.0 0.0 No description No to support statement S. Len Weingart [Project Director] British Literature Fort Pierce FL [-80.3256056, 27.4467056] POINT (-80.32561 27.44671)

2 rows × 22 columns

That one appears to have been geocoded as a location in Richmond Vale, New South Wales, Australia, which was an error that I should’ve corrected in the geocoding step. But let’s correct that now: Get Lat+Lon suggests 37.5385087,-77.43428 would be acceptable, and so does Nominatim:

geocode('Richmond Virginia')
Location(Richmond, Richmond City, Virginia, 23298, United States, (37.5385087, -77.43428, 0.0))
# update the coordinate values
mappable_gdf.at[151,'Latitude'] = 37.5385087
mappable_gdf.at[151,'Longitude'] =  -77.43428

mappable_gdf.iloc[151]
AppNumber                                        FA-10103-67
Institution                                             None
InstCity                                            Richmond
InstState                                                 VA
InstPostalCode                                          None
InstCountry                                             None
CongressionalDistrict                                      0
Latitude                                             37.5385
Longitude                                           -77.4343
YearAwarded                                             1967
ProjectTitle                             Title not available
Program                  Fellowships for University Teachers
Division                                   Research Programs
AwardOutright                                          13270
AwardMatching                                              0
ProjectDesc                                   No description
ToSupport                            No to support statement
Participants                                             NaN
Disciplines                                     U.S. History
geoquery                                         Richmond VA
coordinates                             [151.5012, -32.8652]
geometry                           POINT (151.5012 -32.8652)
Name: 151, dtype: object
# create a Point for the geometry column
mappable_gdf.loc[151,'geometry'] = Point(mappable_gdf.loc[151,'Longitude'], mappable_gdf.loc[151,'Latitude'])

Map again…

# create a mapspace for the geodataframe
world = gpd.read_file(gpd.datasets.get_path('naturalearth_lowres'))

usa = world.query('name == "United States of America"')

ax = usa.plot(
    color='white', edgecolor='black'
)

mappable_gdf.plot(ax=ax, color='red')

plt.show()

png

There’s still appear to be 3 very far to the east of North America… I can identify thes points by sorting the Longitude in descending order:

# fix a datatype problem for sorting... 
mappable_gdf = mappable_gdf.astype({'Longitude':'float64','Latitude':'float64'})

# find the farthest east... 
mappable_gdf.sort_values(by=['Longitude'],ascending=False).head(5)
AppNumber Institution InstCity InstState InstPostalCode InstCountry CongressionalDistrict Latitude Longitude YearAwarded ... Division AwardOutright AwardMatching ProjectDesc ToSupport Participants Disciplines geoquery coordinates geometry
341 FT-10462-68 University of Southern Florida Orlando DE 32816 USA 7 52.524225 13.409345 1967 ... Research Programs 1500.0 0.0 No description No to support statement Richard Lukas [Project Director] History, General Orlando DE [13.4093448, 52.5242253] POINT (13.40934 52.52423)
597 FA-10060-67 Unaffiliated Independent Scholar Belmont MA 02178-0000 USA 0 48.408657 7.238676 1967 ... Research Programs 13270.0 0.0 No description No to support statement Einar Haugen [Project Director] Literature, Other Belmont MA [7.2386756, 48.4086571] POINT (7.23868 48.40866)
483 RO-10335-69 None Dennis MA None USA 0 51.519807 -0.137680 1968 ... Research Programs 4800.0 0.0 No description No to support statement NaN Art History and Criticism Dennis MA [-0.13767995578500702, 51.519807150000005] POINT (-0.13768 51.51981)
645 FB-10135-67 University of Maine System Orono ME 04473-1513 USA 2 44.887320 -68.680500 1967 ... Research Programs 8140.0 0.0 No description No to support statement Jerome Nadelhaft [Project Director] U.S. History Orono ME [-68.6805, 44.88732] POINT (-68.68050 44.88732)
182 FB-10085-67 Colby College Waterville ME 04901-8840 USA 1 44.541900 -69.639730 1967 ... Research Programs 8140.0 0.0 No description No to support statement Dorothy Koonce [Project Director] Classics Waterville ME [-69.63973, 44.5419] POINT (-69.63973 44.54190)

5 rows × 22 columns

These were geocoded in Europe.

#row 341
geocode('Orlando DE')
Location(Orlando, 2, Münzstraße, Scheunenviertel, Mitte, Berlin, 10178, Deutschland, (52.5242253, 13.4093448, 0.0))
# this one appears to be a mistake in the data: Florida, not Delaware
mappable_gdf.loc[341]
AppNumber                                     FT-10462-68
Institution                University of Southern Florida
InstCity                                          Orlando
InstState                                              DE
InstPostalCode                                      32816
InstCountry                                           USA
CongressionalDistrict                                   7
Latitude                                          52.5242
Longitude                                         13.4093
YearAwarded                                          1967
ProjectTitle                          Title not available
Program                                   Summer Stipends
Division                                Research Programs
AwardOutright                                        1500
AwardMatching                                           0
ProjectDesc                                No description
ToSupport                         No to support statement
Participants             Richard Lukas [Project Director]
Disciplines                              History, General
geoquery                                       Orlando DE
coordinates                      [13.4093448, 52.5242253]
geometry                    POINT (13.4093448 52.5242253)
Name: 341, dtype: object
#row 597
geocode('Belmont MA')
Location(Belmont, Molsheim, Bas-Rhin, Grand Est, France métropolitaine, 67130, France, (48.4086571, 7.2386756, 0.0))
#row 483
geocode('Dennis MA')
Location(Dennis, 30, Cleveland Street, Cavendish Square & Oxford Market (South), Fitzrovia, City of Westminster, London, Greater London, England, W1T 4NG, United Kingdom, (51.519807150000005, -0.13767995578500702, 0.0))

I could’ve avoided this by telling the geocoder to prefer locations in North America. Nominatim, for example, provides options for bounded or country_codes to limit results. If I am geocoding a longer listin future, I would consider this as an option. For now, these can be corrected by providing a more specific country string with the request:

#this is better... 
geocode('Dennis MA USA')
Location(Dennis, Barnstable County, Massachusetts, United States, (41.7353872, -70.1939087, 0.0))

Following @cduvallet’s example, I’m creating a function to update location records. The function takes a specified dataframe, row index, and Location() information.

def update_location(gdf, rowIndex, location):
    '''
    The function updates a record's latitude, longitude, and adds a geo Point entry
    The To specify the changes: 
        gdf is a specific dataframe
        rowIndex is the numerical index or Name of the row to update
        location is a new geoquery string to call for a Lcoation() object from the (Nominatim) geocoder
    The function requires the geocoder to be already implemented as geocode() and for Point to be imported from shapely. 
    '''
    newloc = geocode(location)
    gdf.at[rowIndex, 'Latitude'] = newloc.latitude
    gdf.at[rowIndex, 'Longitude'] = newloc.longitude
    # create a Point
    gdf.at[rowIndex, 'geometry'] = Point(newloc.longitude,newloc.latitude)

I’ll test this on item 341. This one had a typo: the state was listed as Delaware, not Florida, so I’ll provide the correct geoquery to the function, which will query Nominatim and update the data.

# test the function ... 
#row 341
#geocode('Orlando FL')
update_location(mappable_gdf, 341, 'Orlando FL')

#while we're at it, fix the data typo in the InstState field, also in the previous dataframes
mappable_gdf.at[341,'InstState'] = 'FL'
mappable_grant_info.at[341,'InstState'] = 'FL'
cleaned_grants_for_mappable.at[341,'InstState'] = 'FL'

mappable_gdf.loc[341]
AppNumber                                          FT-10462-68
Institution                     University of Southern Florida
InstCity                                               Orlando
InstState                                                   FL
InstPostalCode                                           32816
InstCountry                                                USA
CongressionalDistrict                                        7
Latitude                                                28.548
Longitude                                             -81.4128
YearAwarded                                               1967
ProjectTitle                               Title not available
Program                                        Summer Stipends
Division                                     Research Programs
AwardOutright                                             1500
AwardMatching                                                0
ProjectDesc                                     No description
ToSupport                              No to support statement
Participants                  Richard Lukas [Project Director]
Disciplines                                   History, General
geoquery                                            Orlando DE
coordinates                           [13.4093448, 52.5242253]
geometry                 POINT (-81.41278418563017 28.5479786)
Name: 341, dtype: object

That looks good, so now I will fix the remaining two.

# update the other two points
update_location(mappable_gdf, 483, 'Dennis MA USA')
update_location(mappable_gdf, 597, 'Belmont MA USA')
# remap... 
ax = usa.plot(
    color='white', edgecolor='black'
)
mappable_gdf.plot(ax=ax, color='red')

plt.show()

png

At least all of the points appear to be in logical places now, even if there may be some underlying noise. This brings the data to a point that is clean enough to map! But first, what other cool things can we do with the data using gepandas?

Geopandas supports various data filtering

Now that the data is more or less in a shape that I want, it is ready for more analysis and visualization. We can count the number of awards by state, see all of the awards in a given state, or tally the amount of money awarded to a particular geographic area. And of course, we can make some more detailed maps! First, try some of the filtering and analyzing of the data that pandas supports.

For example, how may grants were made to each state?

# use groupby to count by state
mappable_gdf.groupby('InstState').InstState.count()
InstState
AK      6
AL      5
AR      1
AZ     13
CA    111
CO     10
CT     28
DC     44
DE     12
FL     18
GA     11
HI      8
IA     12
ID      1
IL     36
IN     23
KS      7
KY      7
LA      8
MA     76
MD     29
ME      3
MI     31
MN     14
MO      9
MS      7
MT      3
NC     33
ND      6
NE      7
NH      1
NJ     36
NM      2
NV      2
NY    146
OH     32
OK     10
OR      6
PA     53
RI      7
SC      9
SD      4
TN     20
TX     26
UT      3
VA     22
VT      3
WA     11
WI     16
WV      3
WY      6
Name: InstState, dtype: int64

Or I can ask how what grants were made to a specific state:

# show grants from Montana
mappable_gdf[mappable_gdf['InstState']=='MT']
AppNumber Institution InstCity InstState InstPostalCode InstCountry CongressionalDistrict Latitude Longitude YearAwarded ... Division AwardOutright AwardMatching ProjectDesc ToSupport Participants Disciplines geoquery coordinates geometry
326 FT-10374-68 University of Montana Missoula MT 59801-4494 USA 1 46.86494 -113.98401 1967 ... Research Programs 1500.0 0.0 No description No to support statement Duane Hampton [Project Director] U.S. History Missoula MT [-113.98401, 46.86494] POINT (-113.98401 46.86494)
348 FT-10738-69 University of Montana Missoula MT 59801-4494 USA 1 46.86494 -113.98401 1969 ... Research Programs 1500.0 0.0 Study of the concept of family in the politica... No to support statement Richard Chapman [Project Director] Political Science, General Missoula MT [-113.98401, 46.86494] POINT (-113.98401 46.86494)
921 RO-10179-67 University of Montana Missoula MT 59801-4494 USA 1 46.86494 -113.98401 1967 ... Research Programs 1370.0 0.0 No description No to support statement Joseph Mussulman [Project Director] History, General Missoula MT [-113.98401, 46.86494] POINT (-113.98401 46.86494)

3 rows × 22 columns

Minnesota_1960s_grants = mappable_gdf[mappable_gdf['InstState'] == 'MN']
Minnesota_1960s_grants.head()
AppNumber Institution InstCity InstState InstPostalCode InstCountry CongressionalDistrict Latitude Longitude YearAwarded ... Division AwardOutright AwardMatching ProjectDesc ToSupport Participants Disciplines geoquery coordinates geometry
174 FA-10546-68 None Minneapolis MN None USA 0 44.97730 -93.265469 1967 ... Research Programs 15520.0 0.0 No description No to support statement NaN American Literature Minneapolis MN [-93.2654692, 44.9772995] POINT (-93.26547 44.97730)
188 FB-10105-67 None Minneapolis MN None None 0 44.97730 -93.265469 1967 ... Research Programs 8140.0 0.0 No description No to support statement NaN U.S. History Minneapolis MN [-93.2654692, 44.9772995] POINT (-93.26547 44.97730)
426 RO-10036-67 University of Minnesota, Twin Cities Minneapolis MN 55455-0433 USA 5 44.97779 -93.236240 1967 ... Research Programs 18728.0 0.0 No description No to support statement Harrold Alllen [Project Director] Linguistics Minneapolis MN [-93.23624, 44.97779] POINT (-93.23624 44.97779)
477 RO-10293-68 None Minneapolis MN None None 0 44.97730 -93.265469 1968 ... Research Programs 18730.0 0.0 No description No to support statement NaN Linguistics Minneapolis MN [-93.2654692, 44.9772995] POINT (-93.26547 44.97730)
650 FB-10159-67 University of Minnesota, Twin Cities Minneapolis MN 55455-0433 USA 5 44.97779 -93.236240 1967 ... Research Programs 8140.0 0.0 No description No to support statement Jasper Hopkins [Project Director] Philosophy, General Minneapolis MN [-93.23624, 44.97779] POINT (-93.23624 44.97779)

5 rows × 22 columns

And I can map grants in a particular state:

# re map... 
world = gpd.read_file(gpd.datasets.get_path('naturalearth_lowres'))

usa = world.query('name == "United States of America"')

ax = usa.plot(
    color='white', edgecolor='black'
)
Minnesota_1960s_grants.plot(ax=ax, color='blue')

plt.show()

png

Exporting a clean dataset

Output the data as geojson

Before moving on, I want to conclude this section by creating outputting the data as geojson. This allows us to reuse or share the cleaned data with others, or port it into future visualization tools. For example, the data file can be ported into tools that have increased dynamic mapping capabilities, such as leaflet.js or geojson.io.

Geopandas uses the fiona library for this functions, so you may need to install and import it.

#!pip install fiona
#import fiona
# get rid of those useless coordinate & geoquery fields
mappable_gdf = mappable_gdf.drop(['coordinates'], axis=1)
mappable_gdf = mappable_gdf.drop(['geoquery'], axis=1)
mappable_gdf = mappable_gdf.drop(['AwardMatching'], axis=1)
# output cleaned data to geojson

mappable_gdf.to_file('neh_1960s_grants.geojson', driver='GeoJSON')

This notebook used many of the analysis and transformation features in python to analyze, clean, and transform the source dataset of NEH’s grants from the 1960s. Now I have a dataset, which I will reuse in the next steps as a basic data for mapping. Stay tuned for the next notebooks to follow this process.

Reference list

Credit to the examples in these tutorials (as of January 2021), which were highly informative to the exploratory work outlined above:

See these sites for US state shapefile information:

  • Eric Celeste http://eric.clst.org/tech/usgeojson
  • US Census provides various geographic files for US states and other entities: https://www.census.gov/geographies/mapping-files/time-series/geo/carto-boundary-file.html

Comments