NARA Datasets on the AWS Registry of Open Data

Today, NARA released two datasets to the Amazon Web Services (AWS) Registry of Open Data: the National Archives Catalog dataset and the 1940 Census dataset. The AWS Registry of Open Data is a service provided by AWS to store open, public datasets for free so that they can be accessed and analyzed on AWS. With these datasets, users can now access the data in bulk instead of searching or browsing the user interfaces for the National Archives Catalog and the 1940 Census website.

The National Archives Catalog dataset, over 225 gigabytes of data, contains the archival descriptions and authority records from the National Archives Catalog (as of November 20, 2020), along with the URLs for over 127 million digital copies and data from citizen archivist contributions. We plan to update the dataset on the Registry of Open Data regularly.

Person authority record in the Catalog for Christina M. Tchen, First Lady Michelle Obama’s Chief of Staff from 2011 to 2017. National Archives Identifier 200295583
Series description in the Catalog for Christina Tchen’s General Files. National Archives Identifier 77165408
Photograph of First Lady Michelle Obama receiving a briefing from Christina Tchen, an example of a digital copy in the Catalog. National Archives Identifier 183897928

The 1940 Census dataset, over 15 terabytes of data, includes the metadata index, the population schedules, the enumeration district maps, and the enumeration district descriptions for the 1940 Census records. The 1940 Census population schedules were created by the Bureau of the Census in an attempt to enumerate every person living in the United States on April 1, 1940, and were digitized and released publicly on April 2, 2012. The 1940 Census enumeration district maps contain maps of counties, cities, and other minor civil divisions that show enumeration districts, census tracts, and related boundaries and numbers used for each census. The 1940 Census enumeration district descriptions contain written descriptions of census districts, subdivisions, and enumeration districts.

The metadata index for the 1940 Census dataset is 251 megabytes, and all of the 3.7 million images from the population schedules, the enumeration district maps, and the enumeration district descriptions total over 15 terabytes. This dataset reflects the 1940 Census records that are also available on NARA’s 1940 Census website and in the National Archives Catalog.

1940 Census population schedule for Alaska’s First Judicial Division, Enumeration district 1-1. National Archives Identifier 124063344
1940 Census enumeration district map for Alaska’s First Judicial Division, Enumeration districts 1-2 through 1-4. National Archives Identifier 5822485
1940 Census enumeration district description for Alaska’s First Judicial Division, Enumeration district 1-1. National Archives Identifier 5823241

In addition to the Registry of Open Data entries for these datasets, NARA published detailed documentation to guide users on how to access both the full datasets and specific subsets of the data. Users can access both datasets using their respective Amazon Resource Names (ARNs), which uniquely identify resources on AWS so that users can locate the datasets, or with the AWS Command Line Interface (CLI), an open source tool that lets users interact with AWS services using commands in a command-line shell. For the Catalog dataset, we have also provided zip files for easy download of the full dataset.
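
As a rough illustration of how an ARN relates to an AWS CLI command, the Python sketch below extracts the bucket name from an Amazon S3 bucket ARN and builds a listing command from it. The ARN shown is a hypothetical placeholder, not the real identifier for either dataset; the actual ARNs are given in the NARA documentation. The helper names are also illustrative only. The `--no-sign-request` flag assumes the bucket permits anonymous access, which may or may not match how these datasets are configured.

```python
import shlex

# Hypothetical placeholder ARN; substitute the ARN from the NARA documentation.
EXAMPLE_ARN = "arn:aws:s3:::example-open-data-bucket"


def bucket_from_arn(arn: str) -> str:
    """Return the bucket name from an S3 ARN of the form arn:aws:s3:::bucket."""
    parts = arn.split(":", 5)
    if len(parts) != 6 or parts[0] != "arn" or parts[2] != "s3":
        raise ValueError(f"not an S3 ARN: {arn!r}")
    # The resource part may be "bucket" or "bucket/key"; keep only the bucket.
    bucket = parts[5].split("/", 1)[0]
    if not bucket:
        raise ValueError(f"ARN has no bucket name: {arn!r}")
    return bucket


def list_command(arn: str, prefix: str = "") -> list[str]:
    """Build an `aws s3 ls` command for the bucket, optionally under a prefix."""
    bucket = bucket_from_arn(arn)
    # Assumption: the bucket allows unauthenticated reads (--no-sign-request).
    return ["aws", "s3", "ls", f"s3://{bucket}/{prefix}", "--no-sign-request"]


# prints: aws s3 ls s3://example-open-data-bucket/ --no-sign-request
print(shlex.join(list_command(EXAMPLE_ARN)))
```

Passing a prefix, for example `list_command(EXAMPLE_ARN, "some/folder/")`, narrows the listing to a subset of the data. The sketch only builds the command; running it requires the AWS CLI to be installed.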

NARA Documentation
National Archives Catalog on the AWS Registry of Open Data
1940 Census on the AWS Registry of Open Data

The release of these datasets supports NARA’s commitment to our strategic goals to Make Access Happen and to Maximize NARA’s Value to the Nation. With the release of this data, we expect to reach segments of the public beyond our traditional researchers. Universities, private industry, and other agencies are interested in accessing the data in this format and mining it to support new kinds of research and reuse on their own platforms.