This is a post in my occasional humanities data curation series, which outlines humanities data curation activities using publicly-available data on grants and awards made by the National Endowment for the Humanities (NEH). A subset of the series focuses on mapping the data and creating geospatial data visualizations.
For reference, here are the other essays in the series:
- Metadata: Origins and Metaphors (10 September 2024)
- Encoding Reparative Description: Preliminary Thoughts (17 September 2023)
- Find and replace data in the shell (11 September 2022)
- Wrangling Humanities Data: An Interactive Map of NEH Awards (17 July 2022)
- Wrangling Humanities Data: Using Regex to Clean a CSV (27 February 2021)
- Wrangling Humanities Data: Exploratory Maps of NEH Awards by State (22 January 2021)
- Wrangling Humanities Data: Cleaning and Transforming Data (19 January 2021)
- Wrangling Humanities Data: Finding and Describing Data (20 December 2020)
This installment draws on some of the files with which I previously demonstrated the creation of an archival data package. It explores the data, checks its quality and reliability, and then uses it to create a geospatial dataset. That geospatial data will be the basis for a future post that undertakes more mapping activities. For this analysis, I use the pandas data library in a Python environment within a Jupyter notebook.
At a high level, the notebook (code and descriptions below) takes these steps:
- Set up the programming environment (import useful Python modules, then read in the original NEH grant data).
- Explore the data to check its quality, looking for inconsistencies, missing or inaccurate values, and other useful characteristics of the dataset.
- Clean the data by removing incorrect values, and enhance and transform it to make it more consistent. In this case, this involves providing geographic coordinates for the awards.
- Finally, export the data to GeoJSON, a portable and lightweight format that can be used for more advanced mapping projects in later steps.
As in the previous post, a Jupyter Notebook version of this post can be downloaded from the GitHub repository along with all of the data discussed here. File references below indicate files included in the same neh-grant-data-project repository.
Cleaning and Transforming to Create a Geospatial Dataset
This notebook demonstrates some of the steps involved in cleaning the NEH grant data, checking its quality and consistency, and then transforming it into geospatial information that can be the basis of a map. While most of the work is contained within the steps and cells of this notebook, at the end there is a script to export the final information as GeoJSON, which can be used elsewhere or for other purposes. It is more consistent and portable than the original data.
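Although the export script itself appears at the end of the notebook, the GeoJSON step is worth previewing: geopandas can write a GeoDataFrame straight to that format. A minimal sketch, where gdf and the output filename are placeholders of my own rather than names from the notebook:
# minimal sketch: write a finished GeoDataFrame out as GeoJSON
# (gdf and the filename are placeholder names for illustration)
gdf.to_file('neh_grants_1960s.geojson', driver='GeoJSON')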
Setup
For this activity, we will use some Python modules that may not be part of a standard JupyterLab configuration. If you do not already have them, you may need to install: geojson (here), geopandas (here), geopy (here), descartes (here), and shapely (here).
# if you do not have the modules, uncomment the following line(s) to install
#!pip install geojson
#!pip install geopandas
#!pip install descartes
#!pip install geopy
#!pip install shapely
# modules for data cleaning & transformation
import pandas as pd
import geopandas as gpd
# basic visualization
import matplotlib.pyplot as plt
%matplotlib inline
# modules for mapping
from shapely.geometry import Point
# may use for additional visualization
import seaborn as sns
# modules for geocoding
from geopy.geocoders import Nominatim
from geopy.extra.rate_limiter import RateLimiter
from time import sleep
Clean and filter the grant data using pandas
First, let’s explore the grant data and clean it up so that we can define the points that we will map; there should be one coordinate location for each grant that we want to display. For this task, we will begin by using the pandas library for working with data.
df_grants_1960s = pd.read_csv('neh-grants-data-202012/data/NEH_Grants1960s.csv')
df_grants_1960s.head()
| AppNumber | ApplicantType | Institution | OrganizationType | InstCity | InstState | InstPostalCode | InstCountry | CongressionalDistrict | Latitude | ... | EndGrant | ProjectDesc | ToSupport | PrimaryDiscipline | SupplementCount | Supplements | ParticipantCount | Participants | DisciplineCount | Disciplines | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | FB-10007-68 | 2 | Regents of the University of California, Berkeley | Four-Year College | Berkeley | CA | 94704-5940 | USA | 13 | 37.87029 | ... | 1969-12-31 | No description | No to support statement | English | 0 | NaN | 1 | John Elliot [Project Director] | 1 | English | 
| 1 | FB-10009-68 | 2 | Pitzer College | Four-Year College | Claremont | CA | 91711-6101 | USA | 27 | 34.10373 | ... | 1969-12-31 | No description | No to support statement | History of Religion | 0 | NaN | 1 | Steven Matthysse [Project Director] | 1 | History of Religion | 
| 2 | FB-10015-68 | 2 | University of California, Riverside | University | Riverside | CA | 92521-0001 | USA | 41 | 33.97561 | ... | 1969-12-31 | No description | No to support statement | History, General | 0 | NaN | 1 | John Staude [Project Director] | 1 | History, General | 
| 3 | FB-10019-68 | 2 | Northeastern University | Four-Year College | Boston | MA | 02115-5005 | USA | 7 | 42.3395 | ... | 1969-12-31 | No description | No to support statement | History, General | 0 | NaN | 1 | Thomas Havens [Project Director] | 1 | History, General | 
| 4 | FB-10023-68 | 2 | University of Pennsylvania | University | Philadelphia | PA | 19104-6205 | USA | 3 | 39.95298 | ... | 1969-12-31 | No description | No to support statement | Psychology | 0 | NaN | 1 | Gresham Riley [Project Director] | 1 | Psychology | 
5 rows × 33 columns
It’s important to explore the dataset a bit first, so we can understand it better, do a sanity check, and identify the information needed to create a new set of mappable point data. The pandas library makes some of this basic work easy, including characterizing the shape of the data, seeing what datatypes it includes, and checking whether it is missing values.
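For instance, a couple of quick checks along those lines (the specific calls here are just my suggestions; the info() call below covers similar ground):
# how many rows and columns the CSV produced
print(df_grants_1960s.shape)
# how many columns of each pandas dtype were inferred
print(df_grants_1960s.dtypes.value_counts())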
df_grants_1960s.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1010 entries, 0 to 1009
Data columns (total 33 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   AppNumber              1010 non-null   object 
 1   ApplicantType          1010 non-null   int64  
 2   Institution            1010 non-null   object 
 3   OrganizationType       1010 non-null   object 
 4   InstCity               1010 non-null   object 
 5   InstState              1010 non-null   object 
 6   InstPostalCode         1010 non-null   object 
 7   InstCountry            1010 non-null   object 
 8   CongressionalDistrict  1010 non-null   int64  
 9   Latitude               1010 non-null   object 
 10  Longitude              1010 non-null   object 
 11  CouncilDate            1010 non-null   object 
 12  YearAwarded            1010 non-null   int64  
 13  ProjectTitle           1010 non-null   object 
 14  Program                1010 non-null   object 
 15  Division               1010 non-null   object 
 16  ApprovedOutright       1010 non-null   float64
 17  ApprovedMatching       1010 non-null   float64
 18  AwardOutright          1010 non-null   float64
 19  AwardMatching          1010 non-null   float64
 20  OriginalAmount         1010 non-null   float64
 21  SupplementAmount       1010 non-null   float64
 22  BeginGrant             1010 non-null   object 
 23  EndGrant               1010 non-null   object 
 24  ProjectDesc            1010 non-null   object 
 25  ToSupport              1010 non-null   object 
 26  PrimaryDiscipline      1010 non-null   object 
 27  SupplementCount        1010 non-null   int64  
 28  Supplements            0 non-null      float64
 29  ParticipantCount       1010 non-null   int64  
 30  Participants           960 non-null    object 
 31  DisciplineCount        1010 non-null   int64  
 32  Disciplines            1010 non-null   object 
dtypes: float64(7), int64(6), object(20)
memory usage: 260.5+ KB
The Latitude and Longitude series are listed as object columns, not numeric data. I will convert these into numeric data later, since they must be processed into geospatial coordinates as numbers, not strings. On closer inspection, many of these fields include the strings “Unknown”, “unknown”, or “Un”, which I also want to change to null (None) values. First, though, I will create a new data frame so this original data can be recovered later, if necessary, and also drop some of the information that I won’t need to map the grants.
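As an aside, one alternative to the replace-then-convert approach I use below is pandas’ to_numeric with errors='coerce', which turns unparseable strings such as “Unknown” into NaN in a single step. This is only a sketch of that option, not the route the notebook takes:
# sketch only: coerce non-numeric strings in Latitude to NaN, leaving the original column untouched
lat_numeric = pd.to_numeric(df_grants_1960s['Latitude'], errors='coerce')
lat_numeric.isnull().sum()   # the number of values that could not be parsed as numbers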
Data quality checking and cleaning
There’s a lot of information here that won’t help to map the data. Aside from the information for location of the points (that is, Latitude and Longitude), I will keep some for use in a popup that can be displayed when hovering over a point on the map.
I regularly use the pandas isnull() and info() functions below to look for blank cells or missing information.
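For example, a quick way to count the blank cells in each column of the original data before dropping anything:
# count missing values per column, largest first
df_grants_1960s.isnull().sum().sort_values(ascending=False).head()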
mappable_grant_info = df_grants_1960s.drop(['ApplicantType','OrganizationType','CouncilDate','ApprovedOutright','ApprovedMatching','OriginalAmount','SupplementAmount','BeginGrant','EndGrant','PrimaryDiscipline','SupplementCount','Supplements','ParticipantCount','DisciplineCount'], axis=1)
mappable_grant_info.head()
| AppNumber | Institution | InstCity | InstState | InstPostalCode | InstCountry | CongressionalDistrict | Latitude | Longitude | YearAwarded | ProjectTitle | Program | Division | AwardOutright | AwardMatching | ProjectDesc | ToSupport | Participants | Disciplines | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | FB-10007-68 | Regents of the University of California, Berkeley | Berkeley | CA | 94704-5940 | USA | 13 | 37.87029 | -122.26813 | 1967 | Title not available | Fellowships for Younger Scholars | Fellowships and Seminars | 8387.0 | 0.0 | No description | No to support statement | John Elliot [Project Director] | English | 
| 1 | FB-10009-68 | Pitzer College | Claremont | CA | 91711-6101 | USA | 27 | 34.10373 | -117.70701 | 1967 | Title not available | Fellowships for Younger Scholars | Fellowships and Seminars | 8387.0 | 0.0 | No description | No to support statement | Steven Matthysse [Project Director] | History of Religion | 
| 2 | FB-10015-68 | University of California, Riverside | Riverside | CA | 92521-0001 | USA | 41 | 33.97561 | -117.33113 | 1967 | Title not available | Fellowships for Younger Scholars | Fellowships and Seminars | 8387.0 | 0.0 | No description | No to support statement | John Staude [Project Director] | History, General | 
| 3 | FB-10019-68 | Northeastern University | Boston | MA | 02115-5005 | USA | 7 | 42.3395 | -71.09048 | 1967 | Title not available | Fellowships for Younger Scholars | Fellowships and Seminars | 8387.0 | 0.0 | No description | No to support statement | Thomas Havens [Project Director] | History, General | 
| 4 | FB-10023-68 | University of Pennsylvania | Philadelphia | PA | 19104-6205 | USA | 3 | 39.95298 | -75.19276 | 1967 | Title not available | Fellowships for Younger Scholars | Fellowships and Seminars | 8387.0 | 0.0 | No description | No to support statement | Gresham Riley [Project Director] | Psychology | 
mappable_grant_info.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1010 entries, 0 to 1009
Data columns (total 19 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   AppNumber              1010 non-null   object 
 1   Institution            1010 non-null   object 
 2   InstCity               1010 non-null   object 
 3   InstState              1010 non-null   object 
 4   InstPostalCode         1010 non-null   object 
 5   InstCountry            1010 non-null   object 
 6   CongressionalDistrict  1010 non-null   int64  
 7   Latitude               1010 non-null   object 
 8   Longitude              1010 non-null   object 
 9   YearAwarded            1010 non-null   int64  
 10  ProjectTitle           1010 non-null   object 
 11  Program                1010 non-null   object 
 12  Division               1010 non-null   object 
 13  AwardOutright          1010 non-null   float64
 14  AwardMatching          1010 non-null   float64
 15  ProjectDesc            1010 non-null   object 
 16  ToSupport              1010 non-null   object 
 17  Participants           960 non-null    object 
 18  Disciplines            1010 non-null   object 
dtypes: float64(2), int64(2), object(15)
memory usage: 150.0+ KB
While there are entries for each grant, on closer inspection it is clear that there are some data quality issues. For example, some grants are missing Participants information, and the latitude and longitude information, which will be the basis for our mapping goal, is not numerical but listed as object data. As the cell above shows, much of the information is typed as “object,” which in this case mostly means string data, and strings cannot be converted directly into coordinate points.
This means that there is more data cleaning to do before the data is ready to map. I used the following methods to get more information about specific data fields, including AppNumber (the data element that most resembles a unique identifier), Latitude (a field presumably generated by a geocoding algorithm when the data was created), and InstCountry.
mappable_grant_info['AppNumber'].value_counts()
RO-10389       1
EO-10051-68    1
RO-10259-68    1
FB-10225-68    1
FT-10147-67    1
              ..
EO-10302-69    1
EO-10085-69    1
EO-10242-69    1
FT-10294-68    1
FT-10157-67    1
Name: AppNumber, Length: 1010, dtype: int64
mappable_grant_info['Latitude'].value_counts()
Unknown     90
41.31003    17
40.81835    16
34.07516    16
37.87029    14
            ..
40.771       1
42.82649     1
40.85069     1
42.29017     1
42.35675     1
Name: Latitude, Length: 382, dtype: int64
Looking closer at InstCountry, it is clear that not all of the grants were given to recipients within the United States. As a scoping exercise, I want to remove the locations that are not within the US.
mappable_grant_info['InstCountry'].unique()
array(['USA', 'Unknown', 'Canada', 'United Kingdom'], dtype=object)
mappable_grant_info['InstCountry'].value_counts()
USA               993
Unknown            14
Canada              2
United Kingdom      1
Name: InstCountry, dtype: int64
Many of the grant entries list Unknown or unknown in the fields with information about country, institution, and even latitude or longitude. This is a challenge for our mapping project: if we cannot locate a grant, we cannot create a point to place it on the map. To address this, we first need to identify grants without location information, and then we need to determine the locations of the places without latitude and longitude. For the latter step, we will use a geocoding module.
For scoping, let’s drop the UK and Canada grants, and then see whether we can geolocate the unknowns.
#filter out Canada & UK
mappable_grant_info = mappable_grant_info[mappable_grant_info['InstCountry'].isin(['Canada','United Kingdom']) == False]
mappable_grant_info['InstCountry'].value_counts()
USA        993
Unknown     14
Name: InstCountry, dtype: int64
My goal is to use information about the city and state of the award to get a geographic coordinate for the 14 entries that list “Unknown” as the country. There remain, however, a few entries that don’t appear to have enough information to geocode.
These can be identified by the value Un in the InstState field:
mappable_grant_info[mappable_grant_info['InstState'] == 'Un']
| AppNumber | Institution | InstCity | InstState | InstPostalCode | InstCountry | CongressionalDistrict | Latitude | Longitude | YearAwarded | ProjectTitle | Program | Division | AwardOutright | AwardMatching | ProjectDesc | ToSupport | Participants | Disciplines | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 221 | FB-10243-68 | Unknown | Unknown | Un | Unknown | Unknown | 0 | Unknown | unknown | 1967 | Title not available | Fellowships for College Teachers and Independe... | Research Programs | 8387.0 | 0.0 | No description | No to support statement | NaN | Comparative Literature | 
| 337 | FT-10440-68 | Unknown | Unknown | Un | Unknown | Unknown | 0 | Unknown | unknown | 1967 | Title not available | Summer Stipends | Research Programs | 1500.0 | 0.0 | No description | No to support statement | NaN | English | 
| 428 | RO-10048-67 | Unknown | Unknown | Un | Unknown | Unknown | 0 | Unknown | unknown | 1967 | Biography of Theodore Roethke | Basic Research | Research Programs | 5000.0 | 0.0 | No description | No to support statement | NaN | Art History and Criticism | 
| 429 | RO-10050-67 | Unknown | Unknown | Un | Unknown | Unknown | 0 | Unknown | unknown | 1967 | Symbolic Landscape in Modern Poetry | Basic Research | Research Programs | 2000.0 | 0.0 | No description | No to support statement | NaN | American Literature | 
| 454 | RO-10166-67 | Unknown | Unknown | Un | Unknown | Unknown | 0 | Unknown | unknown | 1967 | History of Book Publishing in America | Basic Research | Research Programs | 25000.0 | 0.0 | No description | No to support statement | NaN | History, General | 
| 459 | RO-10211-68 | Unknown | Unknown | Un | Unknown | Unknown | 0 | Unknown | unknown | 1968 | Biography of Richard Wright | Basic Research | Research Programs | 8000.0 | 0.0 | No description | No to support statement | NaN | Literature, General | 
| 480 | RO-10317-69 | Unknown | Unknown | Un | Unknown | Unknown | 0 | Unknown | unknown | 1968 | Biography of Richard Wright | Basic Research | Research Programs | 8000.0 | 0.0 | No description | No to support statement | NaN | Literature, General; Social Sciences, General | 
| 757 | FT-10257-67 | McMaster University | Ontario, Canada | Un | 00000-0000 | USA | 1 | Unknown | unknown | 1967 | Title not available | Summer Stipends | Research Programs | 2000.0 | 0.0 | No description | No to support statement | Chauncey Wood [Project Director] | British Literature | 
| 774 | FT-10310-68 | Simon Fraser University | B.C., Canada | Un | 00000-0000 | USA | 0 | Unknown | unknown | 1967 | Title not available | Summer Stipends | Research Programs | 1500.0 | 0.0 | No description | No to support statement | Jared Curtis [Project Director] | English; Literature, General | 
It is clear that most of these do not have enough information to map. The last two are in Canada, and so I don’t want them anyway. So I’ll remove them from the data:
#filter out unknown state rows
mappable_grant_info = mappable_grant_info[mappable_grant_info['InstState'].isin(['Un']) == False]
mappable_grant_info.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 998 entries, 0 to 1009
Data columns (total 19 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   AppNumber              998 non-null    object 
 1   Institution            998 non-null    object 
 2   InstCity               998 non-null    object 
 3   InstState              998 non-null    object 
 4   InstPostalCode         998 non-null    object 
 5   InstCountry            998 non-null    object 
 6   CongressionalDistrict  998 non-null    int64  
 7   Latitude               998 non-null    object 
 8   Longitude              998 non-null    object 
 9   YearAwarded            998 non-null    int64  
 10  ProjectTitle           998 non-null    object 
 11  Program                998 non-null    object 
 12  Division               998 non-null    object 
 13  AwardOutright          998 non-null    float64
 14  AwardMatching          998 non-null    float64
 15  ProjectDesc            998 non-null    object 
 16  ToSupport              998 non-null    object 
 17  Participants           955 non-null    object 
 18  Disciplines            998 non-null    object 
dtypes: float64(2), int64(2), object(15)
memory usage: 155.9+ KB
Let’s take a closer look at the latitude and longitude information, which will be the basis for determining each point to map later on. This reveals that there are still at least 78 entries without location information:
mappable_grant_info['Latitude'].describe()
count         998
unique        382
top       Unknown
freq           78
Name: Latitude, dtype: object
mappable_grant_info['Longitude'].describe()
count         998
unique        383
top       unknown
freq           78
Name: Longitude, dtype: object
It will not be possible to map these grants without further information. My goal is to use data from the city and state location fields, which I can send to a geocoding tool to obtain a coordinate. To begin, I will replace the string values “unknown” and “Unknown” in the dataset with null (None) values, which I can filter more easily. Then, I will use Nominatim, a geocoding tool that will search for coordinate information based on the available place information.
# replace strings with null values
nonvals = ['unknown','Unknown']
mappable_grant_info = mappable_grant_info.replace(nonvals, [None, None])
latlons = mappable_grant_info.loc[:,'Latitude':'Longitude']
print(latlons['Latitude'],'\n')
print('Is it a null value?',type(mappable_grant_info.loc[1009,'Latitude']))
0       37.87029
1       34.10373
2       33.97561
3        42.3395
4       39.95298
          ...   
1005    40.74955
1006    32.88647
1007    41.02476
1008    29.94888
1009        None
Name: Latitude, Length: 998, dtype: object 
Is it a null value? <class 'NoneType'>
mappable_grant_info.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 998 entries, 0 to 1009
Data columns (total 19 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   AppNumber              998 non-null    object 
 1   Institution            970 non-null    object 
 2   InstCity               997 non-null    object 
 3   InstState              998 non-null    object 
 4   InstPostalCode         970 non-null    object 
 5   InstCountry            991 non-null    object 
 6   CongressionalDistrict  998 non-null    int64  
 7   Latitude               920 non-null    object 
 8   Longitude              920 non-null    object 
 9   YearAwarded            998 non-null    int64  
 10  ProjectTitle           998 non-null    object 
 11  Program                998 non-null    object 
 12  Division               998 non-null    object 
 13  AwardOutright          998 non-null    float64
 14  AwardMatching          998 non-null    float64
 15  ProjectDesc            998 non-null    object 
 16  ToSupport              998 non-null    object 
 17  Participants           955 non-null    object 
 18  Disciplines            998 non-null    object 
dtypes: float64(2), int64(2), object(15)
memory usage: 195.9+ KB
Since there are only 920 Latitude and Longitude entries, that leaves the 78 unknown (now null) entries without information. While it would be possible to gather this information manually with a tool like Get Lat+Lon, that would take a while. So I will use the information about the recipient’s city and state to get a general idea of where the award was given.
# create a geoquery string, which will allow to at least map the city
mappable_grant_info['geoquery'] = mappable_grant_info['InstCity'] + ' ' + mappable_grant_info['InstState']
mappable_grant_info[mappable_grant_info['geoquery'].isnull()]
| AppNumber | Institution | InstCity | InstState | InstPostalCode | InstCountry | CongressionalDistrict | Latitude | Longitude | YearAwarded | ProjectTitle | Program | Division | AwardOutright | AwardMatching | ProjectDesc | ToSupport | Participants | Disciplines | geoquery | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 168 | FA-10157-68 | None | None | NY | None | USA | 0 | None | None | 1967 | Title not available | Fellowships for University Teachers | Research Programs | 15520.0 | 0.0 | No description | No to support statement | NaN | U.S. History | NaN | 
Now there’s only one left that lacks enough information. To check, look at that grant:
mappable_grant_info.loc[168, :]
AppNumber                                        FA-10157-68
Institution                                             None
InstCity                                                None
InstState                                                 NY
InstPostalCode                                          None
InstCountry                                              USA
CongressionalDistrict                                      0
Latitude                                                None
Longitude                                               None
YearAwarded                                             1967
ProjectTitle                             Title not available
Program                  Fellowships for University Teachers
Division                                   Research Programs
AwardOutright                                          15520
AwardMatching                                              0
ProjectDesc                                   No description
ToSupport                            No to support statement
Participants                                             NaN
Disciplines                                     U.S. History
geoquery                                                 NaN
Name: 168, dtype: object
There is just not enough information here - the record does not include information about the city or institution that received this award, so remove that one, too.
mappable_grant_info = mappable_grant_info.drop([168])
mappable_grant_info.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 997 entries, 0 to 1009
Data columns (total 20 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   AppNumber              997 non-null    object 
 1   Institution            970 non-null    object 
 2   InstCity               997 non-null    object 
 3   InstState              997 non-null    object 
 4   InstPostalCode         970 non-null    object 
 5   InstCountry            990 non-null    object 
 6   CongressionalDistrict  997 non-null    int64  
 7   Latitude               920 non-null    object 
 8   Longitude              920 non-null    object 
 9   YearAwarded            997 non-null    int64  
 10  ProjectTitle           997 non-null    object 
 11  Program                997 non-null    object 
 12  Division               997 non-null    object 
 13  AwardOutright          997 non-null    float64
 14  AwardMatching          997 non-null    float64
 15  ProjectDesc            997 non-null    object 
 16  ToSupport              997 non-null    object 
 17  Participants           955 non-null    object 
 18  Disciplines            997 non-null    object 
 19  geoquery               997 non-null    object 
dtypes: float64(2), int64(2), object(16)
memory usage: 163.6+ KB
Geocode grants missing location coordinates
For the remaining 77 grants without location information, try to geocode by city and state information.
# create a subset of the information for geoquerying
geoQuerySet = mappable_grant_info.loc[:, ['AppNumber','InstCity','InstState','Latitude','Longitude','geoquery']]
#pull out the entries without location coordinates
geoQuerySet = geoQuerySet[geoQuerySet['Latitude'].isnull()]
geoQuerySet.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 77 entries, 24 to 1009
Data columns (total 6 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   AppNumber  77 non-null     object
 1   InstCity   77 non-null     object
 2   InstState  77 non-null     object
 3   Latitude   0 non-null      object
 4   Longitude  0 non-null      object
 5   geoquery   77 non-null     object
dtypes: object(6)
memory usage: 4.2+ KB
# set up geolocator
# increase timeout to reduce errors 
geolocator = Nominatim(user_agent='neh-grant-points', timeout=10)
# limit to comply with rate limiting requirements
geocode = RateLimiter(geolocator.geocode, min_delay_seconds=1)
# initiate geolocation for the 77 grants
geoQuerySet['location'] = geoQuerySet['geoquery'].apply(geocode)
geoQuerySet.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 77 entries, 24 to 1009
Data columns (total 7 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   AppNumber  77 non-null     object
 1   InstCity   77 non-null     object
 2   InstState  77 non-null     object
 3   Latitude   0 non-null      object
 4   Longitude  0 non-null      object
 5   geoquery   77 non-null     object
 6   location   76 non-null     object
dtypes: object(7)
memory usage: 4.8+ KB
All but one of the geoquery requests returned a location value. To see what’s missing, check for the empty values using isnull():
geoQuerySet[geoQuerySet['location'].isnull()]
| AppNumber | InstCity | InstState | Latitude | Longitude | geoquery | location | |
|---|---|---|---|---|---|---|---|
| 998 | RP-10003-68 | Charlottsvlle | VA | None | None | Charlottsvlle VA | None | 
Looks like that one has a typo! I adapted the fix & replace process that @cduvallet outlined (link to full process at the end of this post).
fixgeolocate = {
    'Charlottsvlle VA' : 'Charlottesville VA'
}
new_locations = {}
for key, val in fixgeolocate.items():
    loc = geocode(val)
    new_locations[key] = loc
    sleep(1)
print(new_locations)
{'Charlottsvlle VA': Location(The V, 201-213, 15th Street Northwest, Venable, Starr Hill, Charlottesville, Virginia, 22903, United States, (38.0360726, -78.49973472559668, 0.0))}
While here, I will also correct the typo in the dataset.
# correct the spelling of Charlottesville in mappable_grant_info (row 998)
mappable_grant_info = mappable_grant_info.replace('Charlottsvlle','Charlottesville')
# update the geoquery set 
# get indices of rows with no location
null_locs = geoQuerySet[geoQuerySet['location'].isnull()].index
# sub in the new_locs
geoQuerySet.loc[null_locs, 'location'] = geoQuerySet.loc[null_locs, 'geoquery'].map(new_locations)
geoQuerySet.head()
| AppNumber | InstCity | InstState | Latitude | Longitude | geoquery | location | |
|---|---|---|---|---|---|---|---|
| 24 | AO-10025-69 | Westport | CT | None | None | Westport CT | (Westport Court, Westport, Columbia County, Ge... | 
| 40 | EH-10038-68 | Evanston | IL | None | None | Evanston IL | (Evanston, Evanston Township, Cook County, Ill... | 
| 47 | EH-10058-66 | New York | NY | None | None | New York NY | (New York, United States, (40.7127281, -74.006... | 
| 51 | EO-10001-67 | Dover | DE | None | None | Dover DE | (Dover, Kent County, Delaware, United States, ... | 
| 61 | EO-10051-68 | University Park | PA | None | None | University Park PA | (University Park, College Township, Centre Cou... | 
Now we have all the information needed to provide a coordinate location for the grants that didn’t have one. What remains is to pull the latitude and longitude values back into those columns.
# pull the lats & lons from the Location objects
geoQuerySet['Latitude'] = geoQuerySet['location'].apply(lambda x: x.latitude)
geoQuerySet['Longitude'] = geoQuerySet['location'].apply(lambda x: x.longitude)
geoQuerySet.head()
| AppNumber | InstCity | InstState | Latitude | Longitude | geoquery | location | |
|---|---|---|---|---|---|---|---|
| 24 | AO-10025-69 | Westport | CT | 33.554852 | -82.066172 | Westport CT | (Westport Court, Westport, Columbia County, Ge... | 
| 40 | EH-10038-68 | Evanston | IL | 42.044739 | -87.693046 | Evanston IL | (Evanston, Evanston Township, Cook County, Ill... | 
| 47 | EH-10058-66 | New York | NY | 40.712728 | -74.006015 | New York NY | (New York, United States, (40.7127281, -74.006... | 
| 51 | EO-10001-67 | Dover | DE | 39.158168 | -75.524368 | Dover DE | (Dover, Kent County, Delaware, United States, ... | 
| 61 | EO-10051-68 | University Park | PA | 40.808749 | -77.858566 | University Park PA | (University Park, College Township, Centre Cou... | 
geoQuerySet.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 77 entries, 24 to 1009
Data columns (total 7 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   AppNumber  77 non-null     object 
 1   InstCity   77 non-null     object 
 2   InstState  77 non-null     object 
 3   Latitude   77 non-null     float64
 4   Longitude  77 non-null     float64
 5   geoquery   77 non-null     object 
 6   location   77 non-null     object 
dtypes: float64(2), object(5)
memory usage: 7.3+ KB
Merge the data
I can see using info() that the main dataframe still lacks the latitude and longitude information for the newly geocoded points. So now, let’s merge the locations back into the dataframe for mapping (cleaned_grants_for_mappable).
# create a copy of the dataframe, just in case something happens to remove data
cleaned_grants_for_mappable = mappable_grant_info.copy()
cleaned_grants_for_mappable.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 997 entries, 0 to 1009
Data columns (total 20 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   AppNumber              997 non-null    object 
 1   Institution            970 non-null    object 
 2   InstCity               997 non-null    object 
 3   InstState              997 non-null    object 
 4   InstPostalCode         970 non-null    object 
 5   InstCountry            990 non-null    object 
 6   CongressionalDistrict  997 non-null    int64  
 7   Latitude               920 non-null    object 
 8   Longitude              920 non-null    object 
 9   YearAwarded            997 non-null    int64  
 10  ProjectTitle           997 non-null    object 
 11  Program                997 non-null    object 
 12  Division               997 non-null    object 
 13  AwardOutright          997 non-null    float64
 14  AwardMatching          997 non-null    float64
 15  ProjectDesc            997 non-null    object 
 16  ToSupport              997 non-null    object 
 17  Participants           955 non-null    object 
 18  Disciplines            997 non-null    object 
 19  geoquery               997 non-null    object 
dtypes: float64(2), int64(2), object(16)
memory usage: 163.6+ KB
# create a list of the indices for matching rows between the data frames
geoInfotoReplaceIndices = geoQuerySet.loc[:,'AppNumber'].index
geoInfotoReplaceIndices
Int64Index([  24,   40,   47,   51,   61,   74,   83,  103,  106,  107,  110,
             111,  117,  120,  123,  124,  148,  151,  154,  174,  188,  198,
             209,  341,  359,  378,  386,  387,  388,  398,  402,  412,  430,
             437,  477,  482,  483,  484,  511,  522,  553,  597,  608,  624,
             637,  669,  685,  689,  702,  708,  746,  750,  779,  791,  795,
             802,  855,  856,  857,  859,  860,  861,  866,  871,  872,  885,
             890,  902,  927,  936,  938,  940,  961,  970,  989,  998, 1009],
           dtype='int64')
# use the row indexes to assign data from the geoQuerySet back to the cleaned map info for Latitude
cleaned_grants_for_mappable.loc[geoInfotoReplaceIndices,['Latitude']] = geoQuerySet.loc[geoInfotoReplaceIndices,['Latitude']]
# use the row indexes to assign data from the geoQuerySet back to the cleaned map info for Longitude
cleaned_grants_for_mappable.loc[geoInfotoReplaceIndices,['Longitude']] = geoQuerySet.loc[geoInfotoReplaceIndices,['Longitude']]
cleaned_grants_for_mappable.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 997 entries, 0 to 1009
Data columns (total 20 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   AppNumber              997 non-null    object 
 1   Institution            970 non-null    object 
 2   InstCity               997 non-null    object 
 3   InstState              997 non-null    object 
 4   InstPostalCode         970 non-null    object 
 5   InstCountry            990 non-null    object 
 6   CongressionalDistrict  997 non-null    int64  
 7   Latitude               997 non-null    object 
 8   Longitude              997 non-null    object 
 9   YearAwarded            997 non-null    int64  
 10  ProjectTitle           997 non-null    object 
 11  Program                997 non-null    object 
 12  Division               997 non-null    object 
 13  AwardOutright          997 non-null    float64
 14  AwardMatching          997 non-null    float64
 15  ProjectDesc            997 non-null    object 
 16  ToSupport              997 non-null    object 
 17  Participants           955 non-null    object 
 18  Disciplines            997 non-null    object 
 19  geoquery               997 non-null    object 
dtypes: float64(2), int64(2), object(16)
memory usage: 203.6+ KB
# reassign our complete information to the mappable_grant_info dataframe
mappable_grant_info = cleaned_grants_for_mappable
At this point, both cleaned_grants_for_mappable and mappable_grant_info hold the ‘complete’ dataset. Below, mappable_grant_info is used as the main set.
Add coordinates and map
Using the filled-in latitude and longitude, we can now add coordinate information so that each grant can be mapped as a point.
# check that lat & lon can be cast to numeric values (not assigned, so the change is not persisted here)
mappable_grant_info.astype({'Latitude':'float64','Longitude':'float64'}).dtypes
AppNumber                 object
Institution               object
InstCity                  object
InstState                 object
InstPostalCode            object
InstCountry               object
CongressionalDistrict      int64
Latitude                 float64
Longitude                float64
YearAwarded                int64
ProjectTitle              object
Program                   object
Division                  object
AwardOutright            float64
AwardMatching            float64
ProjectDesc               object
ToSupport                 object
Participants              object
Disciplines               object
geoquery                  object
dtype: object
# combine latitude and longitude to create coordinate information for mapping
mappable_grant_info['coordinates'] = mappable_grant_info[['Longitude','Latitude']].values.tolist()
mappable_grant_info.tail()
| AppNumber | Institution | InstCity | InstState | InstPostalCode | InstCountry | CongressionalDistrict | Latitude | Longitude | YearAwarded | ... | Program | Division | AwardOutright | AwardMatching | ProjectDesc | ToSupport | Participants | Disciplines | geoquery | coordinates | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1005 | RX-10010-69 | American Council of Learned Societies | New York | NY | 10017-6706 | USA | 12 | 40.74955 | -73.97462 | 1969 | ... | Conferences | Research Programs | 25000.0 | 0.0 | No description | No to support statement | Gordon Turner [Project Director] | Interdisciplinary Studies, General | New York NY | [-73.97462, 40.74955] | 
| 1006 | EO-10125 | Salk Institute for Biological Studies | La Jolla | CA | 92037-1002 | USA | 49 | 32.88647 | -117.24392 | 1969 | ... | Institutional Planning and Development | Education Programs | 7434.0 | 0.0 | To teach a course based on American texts and ... | No to support statement | Christine Tracy [Project Director] | History and Philosophy of Science, Technology,... | La Jolla CA | [-117.24392, 32.88647] | 
| 1007 | EO-10225 | Manhattanville College | Purchase | NY | 10577-2132 | USA | 17 | 41.02476 | -73.71567 | 1969 | ... | Institutional Planning and Development | Education Programs | 28320.0 | 0.0 | No description | No to support statement | Marcus Lawson [Project Director] | Interdisciplinary Studies, General | Purchase NY | [-73.71567, 41.02476] | 
| 1008 | EO-10231 | Louisiana Endowment for the Humanities | New Orleans | LA | 70113-1027 | USA | 2 | 29.94888 | -90.07392 | 1969 | ... | Institutional Planning and Development | Education Programs | 10000.0 | 0.0 | This is one of seven "summer workshops on Negr... | No to support statement | Elton Harrison [Project Director] | Interdisciplinary Studies, General | New Orleans LA | [-90.07392, 29.94888] | 
| 1009 | RO-10389 | California State University, Long Beach | Long Beach | CA | 90840-0004 | USA | 48 | 33.769 | -118.192 | 1969 | ... | Basic Research | Research Programs | 9990.0 | 0.0 | No description | No to support statement | Nizan Shaked [Project Director] | Social Sciences, General | Long Beach CA | [-118.191604, 33.7690164] | 
5 rows × 21 columns
# create shapely geoPoints for a geometry column
mappable_grant_info['geometry'] = gpd.points_from_xy(mappable_grant_info['Longitude'],mappable_grant_info['Latitude'])
mappable_grant_info.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 997 entries, 0 to 1009
Data columns (total 22 columns):
 #   Column                 Non-Null Count  Dtype   
---  ------                 --------------  -----   
 0   AppNumber              997 non-null    object  
 1   Institution            970 non-null    object  
 2   InstCity               997 non-null    object  
 3   InstState              997 non-null    object  
 4   InstPostalCode         970 non-null    object  
 5   InstCountry            990 non-null    object  
 6   CongressionalDistrict  997 non-null    int64   
 7   Latitude               997 non-null    object  
 8   Longitude              997 non-null    object  
 9   YearAwarded            997 non-null    int64   
 10  ProjectTitle           997 non-null    object  
 11  Program                997 non-null    object  
 12  Division               997 non-null    object  
 13  AwardOutright          997 non-null    float64 
 14  AwardMatching          997 non-null    float64 
 15  ProjectDesc            997 non-null    object  
 16  ToSupport              997 non-null    object  
 17  Participants           955 non-null    object  
 18  Disciplines            997 non-null    object  
 19  geoquery               997 non-null    object  
 20  coordinates            997 non-null    object  
 21  geometry               997 non-null    geometry
dtypes: float64(2), geometry(1), int64(2), object(17)
memory usage: 219.1+ KB
mappable_grant_info['geometry'].describe
<bound method NDFrame.describe of 0       POINT (-122.26813 37.87029)
1       POINT (-117.70701 34.10373)
2       POINT (-117.33113 33.97561)
3        POINT (-71.09048 42.33950)
4        POINT (-75.19276 39.95298)
                   ...             
1005     POINT (-73.97462 40.74955)
1006    POINT (-117.24392 32.88647)
1007     POINT (-73.71567 41.02476)
1008     POINT (-90.07392 29.94888)
1009    POINT (-118.19160 33.76902)
Name: geometry, Length: 997, dtype: geometry>
The geometry column now contains POINTs, a special datatype that will be necessary for the geospatial dataset.
For the final steps of cleaning and transforming this data to a geospatial dataset, the geopandas module provides more options, so I will convert the data to geopandas (aka “gpd”) now:
# convert the information to a geopandas dataframe
mappable_gdf = gpd.GeoDataFrame(mappable_grant_info)
Using the type() function, I can see that this is now a “GeoDataFrame” with “GeoSeries” in some of the fields.
type(mappable_gdf)
geopandas.geodataframe.GeoDataFrame
type(mappable_gdf['geometry'])
geopandas.geoseries.GeoSeries
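One thing the conversion does not do is declare a coordinate reference system. Assuming the NEH coordinates are ordinary WGS84 longitude/latitude (an assumption on my part, not something stated in the data documentation), the CRS can be set explicitly, which some mapping tools expect:
# assumes the coordinates are WGS84 longitude/latitude (EPSG:4326)
mappable_gdf = mappable_gdf.set_crs('EPSG:4326')
mappable_gdf.crs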
At this point, the mappable_gdf data is ready to plot on a map. For these purposes, the data is cleaned and consistent and contains the geospatial coordinates needed to create a point for each of the 997 grants made during the 1960s for which NEH has location information and which went to recipients in the United States.
Data cleanup by geography
Now I have relatively clean data, and I can use the basic mapping functions built into geopandas to get an idea of any egregious outliers and additional data errors. Once I have taken these additional steps, I can export the data for use later on to create more detailed maps.
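Before plotting, a quick numeric check is also possible: a GeoDataFrame’s total_bounds attribute reports the bounding box of all the points as [minx, miny, maxx, maxy], so wildly out-of-range values show up immediately.
# bounding box of all grant points: [min longitude, min latitude, max longitude, max latitude]
print(mappable_gdf.total_bounds)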
# create a mapspace for the geodataframe
world = gpd.read_file(gpd.datasets.get_path('naturalearth_lowres'))
usa = world.query('name == "United States of America"')
ax = usa.plot(
    color='white', edgecolor='black'
)
mappable_gdf.plot(ax=ax, color='red')
plt.show()

The map is small, but this offers an initial data quality check, and I can see that there are a few points that don’t look right. There may be some in Europe (or Canada?) and one that appears to be in the Pacific. It’s possible that this is a grant to American Samoa or another US territory in the Pacific, but I want to find out.
# find the distant outlier... 
mappable_gdf.sort_values(by=['Latitude','Longitude']).head(2)
| AppNumber | Institution | InstCity | InstState | InstPostalCode | InstCountry | CongressionalDistrict | Latitude | Longitude | YearAwarded | ... | Division | AwardOutright | AwardMatching | ProjectDesc | ToSupport | Participants | Disciplines | geoquery | coordinates | geometry | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 151 | FA-10103-67 | None | Richmond | VA | None | None | 0 | -32.8652 | 151.501 | 1967 | ... | Research Programs | 13270.0 | 0.0 | No description | No to support statement | NaN | U.S. History | Richmond VA | [151.5012, -32.8652] | POINT (151.50120 -32.86520) | 
| 702 | FT-10051-67 | Elementary Shakespeare Corp. | Fort Pierce | FL | 34982 | USA | 18 | 27.4467 | -80.3256 | 1967 | ... | Research Programs | 2000.0 | 0.0 | No description | No to support statement | S. Len Weingart [Project Director] | British Literature | Fort Pierce FL | [-80.3256056, 27.4467056] | POINT (-80.32561 27.44671) | 
2 rows × 22 columns
That one appears to have been geocoded as a location in Richmond Vale, New South Wales, Australia, which was an error that I should’ve corrected in the geocoding step. But let’s correct that now: Get Lat+Lon suggests 37.5385087,-77.43428 would be acceptable, and so does Nominatim:
geocode('Richmond Virginia')
Location(Richmond, Richmond City, Virginia, 23298, United States, (37.5385087, -77.43428, 0.0))
# update the coordinate values
mappable_gdf.at[151,'Latitude'] = 37.5385087
mappable_gdf.at[151,'Longitude'] =  -77.43428
mappable_gdf.iloc[151]
AppNumber                                        FA-10103-67
Institution                                             None
InstCity                                            Richmond
InstState                                                 VA
InstPostalCode                                          None
InstCountry                                             None
CongressionalDistrict                                      0
Latitude                                             37.5385
Longitude                                           -77.4343
YearAwarded                                             1967
ProjectTitle                             Title not available
Program                  Fellowships for University Teachers
Division                                   Research Programs
AwardOutright                                          13270
AwardMatching                                              0
ProjectDesc                                   No description
ToSupport                            No to support statement
Participants                                             NaN
Disciplines                                     U.S. History
geoquery                                         Richmond VA
coordinates                             [151.5012, -32.8652]
geometry                           POINT (151.5012 -32.8652)
Name: 151, dtype: object
# create a Point for the geometry column
mappable_gdf.loc[151,'geometry'] = Point(mappable_gdf.loc[151,'Longitude'], mappable_gdf.loc[151,'Latitude'])
Map again…
# create a mapspace for the geodataframe
world = gpd.read_file(gpd.datasets.get_path('naturalearth_lowres'))
usa = world.query('name == "United States of America"')
ax = usa.plot(
    color='white', edgecolor='black'
)
mappable_gdf.plot(ax=ax, color='red')
plt.show()

There still appear to be three points very far to the east of North America. I can identify these points by sorting the Longitude column in descending order:
# fix a datatype problem for sorting... 
mappable_gdf = mappable_gdf.astype({'Longitude':'float64','Latitude':'float64'})
# find the farthest east... 
mappable_gdf.sort_values(by=['Longitude'],ascending=False).head(5)
| AppNumber | Institution | InstCity | InstState | InstPostalCode | InstCountry | CongressionalDistrict | Latitude | Longitude | YearAwarded | ... | Division | AwardOutright | AwardMatching | ProjectDesc | ToSupport | Participants | Disciplines | geoquery | coordinates | geometry | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 341 | FT-10462-68 | University of Southern Florida | Orlando | DE | 32816 | USA | 7 | 52.524225 | 13.409345 | 1967 | ... | Research Programs | 1500.0 | 0.0 | No description | No to support statement | Richard Lukas [Project Director] | History, General | Orlando DE | [13.4093448, 52.5242253] | POINT (13.40934 52.52423) | 
| 597 | FA-10060-67 | Unaffiliated Independent Scholar | Belmont | MA | 02178-0000 | USA | 0 | 48.408657 | 7.238676 | 1967 | ... | Research Programs | 13270.0 | 0.0 | No description | No to support statement | Einar Haugen [Project Director] | Literature, Other | Belmont MA | [7.2386756, 48.4086571] | POINT (7.23868 48.40866) | 
| 483 | RO-10335-69 | None | Dennis | MA | None | USA | 0 | 51.519807 | -0.137680 | 1968 | ... | Research Programs | 4800.0 | 0.0 | No description | No to support statement | NaN | Art History and Criticism | Dennis MA | [-0.13767995578500702, 51.519807150000005] | POINT (-0.13768 51.51981) | 
| 645 | FB-10135-67 | University of Maine System | Orono | ME | 04473-1513 | USA | 2 | 44.887320 | -68.680500 | 1967 | ... | Research Programs | 8140.0 | 0.0 | No description | No to support statement | Jerome Nadelhaft [Project Director] | U.S. History | Orono ME | [-68.6805, 44.88732] | POINT (-68.68050 44.88732) | 
| 182 | FB-10085-67 | Colby College | Waterville | ME | 04901-8840 | USA | 1 | 44.541900 | -69.639730 | 1967 | ... | Research Programs | 8140.0 | 0.0 | No description | No to support statement | Dorothy Koonce [Project Director] | Classics | Waterville ME | [-69.63973, 44.5419] | POINT (-69.63973 44.54190) | 
5 rows × 22 columns
These were geocoded in Europe.
#row 341
geocode('Orlando DE')
Location(Orlando, 2, Münzstraße, Scheunenviertel, Mitte, Berlin, 10178, Deutschland, (52.5242253, 13.4093448, 0.0))
# this one appears to be a mistake in the data: Florida, not Delaware
mappable_gdf.loc[341]
AppNumber                                     FT-10462-68
Institution                University of Southern Florida
InstCity                                          Orlando
InstState                                              DE
InstPostalCode                                      32816
InstCountry                                           USA
CongressionalDistrict                                   7
Latitude                                          52.5242
Longitude                                         13.4093
YearAwarded                                          1967
ProjectTitle                          Title not available
Program                                   Summer Stipends
Division                                Research Programs
AwardOutright                                        1500
AwardMatching                                           0
ProjectDesc                                No description
ToSupport                         No to support statement
Participants             Richard Lukas [Project Director]
Disciplines                              History, General
geoquery                                       Orlando DE
coordinates                      [13.4093448, 52.5242253]
geometry                    POINT (13.4093448 52.5242253)
Name: 341, dtype: object
#row 597
geocode('Belmont MA')
Location(Belmont, Molsheim, Bas-Rhin, Grand Est, France métropolitaine, 67130, France, (48.4086571, 7.2386756, 0.0))
#row 483
geocode('Dennis MA')
Location(Dennis, 30, Cleveland Street, Cavendish Square & Oxford Market (South), Fitzrovia, City of Westminster, London, Greater London, England, W1T 4NG, United Kingdom, (51.519807150000005, -0.13767995578500702, 0.0))
I could have avoided this by telling the geocoder to prefer locations in North America. Nominatim, for example, provides bounded and country_codes options to limit results. If I am geocoding a longer list in the future, I will consider this. For now, these can be corrected by providing a more specific country string with the request:
#this is better... 
geocode('Dennis MA USA')
Location(Dennis, Barnstable County, Massachusetts, United States, (41.7353872, -70.1939087, 0.0))
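For a longer list, the same restriction could be passed as a parameter instead of appending the country to every query string. A sketch using geopy’s country_codes option with the rate-limited geocoder set up earlier (viewbox plus bounded=True would work similarly for a bounding box):
# sketch: restrict Nominatim results to the United States via country_codes
# (the keyword is passed through the RateLimiter to geolocator.geocode)
geocode('Dennis MA', country_codes='us')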
Following @cduvallet’s example, I’m creating a function to update location records. The function takes a dataframe, a row index, and a geoquery string.
def update_location(gdf, rowIndex, location):
    '''
    The function updates a record's latitude, longitude, and adds a geo Point entry
    To specify the changes:
        gdf is a specific dataframe
        rowIndex is the numerical index or Name of the row to update
        location is a new geoquery string used to request a Location() object from the (Nominatim) geocoder
    The function requires the geocoder to be already implemented as geocode() and for Point to be imported from shapely. 
    '''
    newloc = geocode(location)
    gdf.at[rowIndex, 'Latitude'] = newloc.latitude
    gdf.at[rowIndex, 'Longitude'] = newloc.longitude
    # create a Point
    gdf.at[rowIndex, 'geometry'] = Point(newloc.longitude,newloc.latitude)
I’ll test this on item 341. This one had a typo: the state was listed as Delaware, not Florida, so I’ll provide the correct geoquery to the function, which will query Nominatim and update the data.
# test the function ... 
#row 341
#geocode('Orlando FL')
update_location(mappable_gdf, 341, 'Orlando FL')
#while we're at it, fix the data typo in the InstState field, also in the previous dataframes
mappable_gdf.at[341,'InstState'] = 'FL'
mappable_grant_info.at[341,'InstState'] = 'FL'
cleaned_grants_for_mappable.at[341,'InstState'] = 'FL'
mappable_gdf.loc[341]
AppNumber                                          FT-10462-68
Institution                     University of Southern Florida
InstCity                                               Orlando
InstState                                                   FL
InstPostalCode                                           32816
InstCountry                                                USA
CongressionalDistrict                                        7
Latitude                                                28.548
Longitude                                             -81.4128
YearAwarded                                               1967
ProjectTitle                               Title not available
Program                                        Summer Stipends
Division                                     Research Programs
AwardOutright                                             1500
AwardMatching                                                0
ProjectDesc                                     No description
ToSupport                              No to support statement
Participants                  Richard Lukas [Project Director]
Disciplines                                   History, General
geoquery                                            Orlando DE
coordinates                           [13.4093448, 52.5242253]
geometry                 POINT (-81.41278418563017 28.5479786)
Name: 341, dtype: object
That looks good, so now I will fix the remaining two.
# update the other two points
update_location(mappable_gdf, 483, 'Dennis MA USA')
update_location(mappable_gdf, 597, 'Belmont MA USA')
# remap... 
ax = usa.plot(
    color='white', edgecolor='black'
)
mappable_gdf.plot(ax=ax, color='red')
plt.show()

At least all of the points appear to be in logical places now, even if there may be some underlying noise. This brings the data to a point where it is clean enough to map! But first, what other cool things can we do with the data using geopandas?
Geopandas supports various data filtering
Now that the data is more or less in the shape I want, it is ready for more analysis and visualization. We can count the number of awards by state, see all of the awards in a given state, or tally the amount of money awarded to a particular geographic area (see the sketch after the state counts below). And of course, we can make some more detailed maps! First, let’s try some of the filtering and analysis that pandas supports.
For example, how may grants were made to each state?
# use groupby to count by state
mappable_gdf.groupby('InstState').InstState.count()
InstState
AK      6
AL      5
AR      1
AZ     13
CA    111
CO     10
CT     28
DC     44
DE     12
FL     18
GA     11
HI      8
IA     12
ID      1
IL     36
IN     23
KS      7
KY      7
LA      8
MA     76
MD     29
ME      3
MI     31
MN     14
MO      9
MS      7
MT      3
NC     33
ND      6
NE      7
NH      1
NJ     36
NM      2
NV      2
NY    146
OH     32
OK     10
OR      6
PA     53
RI      7
SC      9
SD      4
TN     20
TX     26
UT      3
VA     22
VT      3
WA     11
WI     16
WV      3
WY      6
Name: InstState, dtype: int64
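The same groupby approach can also tally the award money by state. Here is a minimal sketch, assuming the AwardOutright column (visible in the outputs above) holds the outright award amount for each grant:
# sum the outright award amounts by state, largest totals first
mappable_gdf.groupby('InstState').AwardOutright.sum().sort_values(ascending=False)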
Or I can ask what grants were made to a specific state:
# show grants from Montana
mappable_gdf[mappable_gdf['InstState']=='MT']
| | AppNumber | Institution | InstCity | InstState | InstPostalCode | InstCountry | CongressionalDistrict | Latitude | Longitude | YearAwarded | ... | Division | AwardOutright | AwardMatching | ProjectDesc | ToSupport | Participants | Disciplines | geoquery | coordinates | geometry |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 326 | FT-10374-68 | University of Montana | Missoula | MT | 59801-4494 | USA | 1 | 46.86494 | -113.98401 | 1967 | ... | Research Programs | 1500.0 | 0.0 | No description | No to support statement | Duane Hampton [Project Director] | U.S. History | Missoula MT | [-113.98401, 46.86494] | POINT (-113.98401 46.86494) | 
| 348 | FT-10738-69 | University of Montana | Missoula | MT | 59801-4494 | USA | 1 | 46.86494 | -113.98401 | 1969 | ... | Research Programs | 1500.0 | 0.0 | Study of the concept of family in the politica... | No to support statement | Richard Chapman [Project Director] | Political Science, General | Missoula MT | [-113.98401, 46.86494] | POINT (-113.98401 46.86494) | 
| 921 | RO-10179-67 | University of Montana | Missoula | MT | 59801-4494 | USA | 1 | 46.86494 | -113.98401 | 1967 | ... | Research Programs | 1370.0 | 0.0 | No description | No to support statement | Joseph Mussulman [Project Director] | History, General | Missoula MT | [-113.98401, 46.86494] | POINT (-113.98401 46.86494) | 
3 rows × 22 columns
# filter to a new dataframe containing only the Minnesota grants
Minnesota_1960s_grants = mappable_gdf[mappable_gdf['InstState'] == 'MN']
Minnesota_1960s_grants.head()
| | AppNumber | Institution | InstCity | InstState | InstPostalCode | InstCountry | CongressionalDistrict | Latitude | Longitude | YearAwarded | ... | Division | AwardOutright | AwardMatching | ProjectDesc | ToSupport | Participants | Disciplines | geoquery | coordinates | geometry |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 174 | FA-10546-68 | None | Minneapolis | MN | None | USA | 0 | 44.97730 | -93.265469 | 1967 | ... | Research Programs | 15520.0 | 0.0 | No description | No to support statement | NaN | American Literature | Minneapolis MN | [-93.2654692, 44.9772995] | POINT (-93.26547 44.97730) | 
| 188 | FB-10105-67 | None | Minneapolis | MN | None | None | 0 | 44.97730 | -93.265469 | 1967 | ... | Research Programs | 8140.0 | 0.0 | No description | No to support statement | NaN | U.S. History | Minneapolis MN | [-93.2654692, 44.9772995] | POINT (-93.26547 44.97730) | 
| 426 | RO-10036-67 | University of Minnesota, Twin Cities | Minneapolis | MN | 55455-0433 | USA | 5 | 44.97779 | -93.236240 | 1967 | ... | Research Programs | 18728.0 | 0.0 | No description | No to support statement | Harrold Alllen [Project Director] | Linguistics | Minneapolis MN | [-93.23624, 44.97779] | POINT (-93.23624 44.97779) | 
| 477 | RO-10293-68 | None | Minneapolis | MN | None | None | 0 | 44.97730 | -93.265469 | 1968 | ... | Research Programs | 18730.0 | 0.0 | No description | No to support statement | NaN | Linguistics | Minneapolis MN | [-93.2654692, 44.9772995] | POINT (-93.26547 44.97730) | 
| 650 | FB-10159-67 | University of Minnesota, Twin Cities | Minneapolis | MN | 55455-0433 | USA | 5 | 44.97779 | -93.236240 | 1967 | ... | Research Programs | 8140.0 | 0.0 | No description | No to support statement | Jasper Hopkins [Project Director] | Philosophy, General | Minneapolis MN | [-93.23624, 44.97779] | POINT (-93.23624 44.97779) | 
5 rows × 22 columns
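To tally the amount of money awarded to a particular geographic area, as mentioned above, the filtered dataframe can be summed directly. A quick sketch using the Minnesota subset:
# total outright funding for the Minnesota grants
Minnesota_1960s_grants['AwardOutright'].sum()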
And I can map grants in a particular state:
# re map... 
world = gpd.read_file(gpd.datasets.get_path('naturalearth_lowres'))
usa = world.query('name == "United States of America"')
ax = usa.plot(
    color='white', edgecolor='black'
)
Minnesota_1960s_grants.plot(ax=ax, color='blue')
plt.show()

Exporting a clean dataset
Output the data as geojson
Before moving on, I want to conclude this section by outputting the data as geojson. This allows us to reuse or share the cleaned data with others, or port it into future visualization tools. For example, the data file can be loaded into tools that have more dynamic mapping capabilities, such as leaflet.js or geojson.io.
Geopandas uses the fiona library for this function, so you may need to install and import it.
#!pip install fiona
#import fiona
# drop the working geoquery and coordinate fields (and the AwardMatching column) before export
mappable_gdf = mappable_gdf.drop(['coordinates', 'geoquery', 'AwardMatching'], axis=1)
# output cleaned data to geojson
mappable_gdf.to_file('neh_1960s_grants.geojson', driver='GeoJSON')
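As a quick check (not part of the original workflow), the exported file can be read back in with geopandas to confirm that the geometries and attributes survived the round trip:
# reload the exported geojson and inspect the first few rows
check_gdf = gpd.read_file('neh_1960s_grants.geojson')
check_gdf.head()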
This notebook used many of the analysis and transformation features available in Python to analyze, clean, and transform the source dataset of NEH’s grants from the 1960s. I now have a clean dataset, which I will reuse in the next steps as the basis for mapping. Stay tuned for the notebooks that follow this process.
Reference list
Credit to the examples in these tutorials (as of January 2021), which were highly informative to the exploratory work outlined above:
- Duong Vu, “Introduction to Geospatial Data in Python” at Datacamp, published 24 October 2018.
- “Using GeoJson with Leaflet” at leaflet.js
- Lesley Cordero, “Getting Started on Geospatial Analysis with Python, GeoJSON and GeoPandas” at Twilio, published 14 August 2017.
- Claire Duvallet, “Mapping my cross-country road trip with Python”, published 9 February 2020.
- Dani Arribas-Bel, “Mapping in Python with geopandas”, from Geographic Data Science ‘15. Provides a lot of useful information about creating more complex plots.
- @carsonfarmer offers a great overview of using Python for geospatial data wrangling, including an overview of pandas for data analysis, geopandas, and many other geospatial tools available in the Python and Jupyter environment; see https://github.com/carsonfarmer/python_geospatial/tree/master/notebooks
See these sites for US state shapefile information:
- Eric Celeste http://eric.clst.org/tech/usgeojson
- US Census provides various geographic files for US states and other entities: https://www.census.gov/geographies/mapping-files/time-series/geo/carto-boundary-file.html