Today, NARA released two datasets to the Amazon Web Services (AWS) Registry of Open Data: the National Archives Catalog dataset and the 1940 Census dataset. The AWS Registry of Open Data is a service provided by AWS that hosts open, public datasets for free so that they can be accessed and analyzed on AWS. With these datasets, users can now access the data in bulk rather than searching or browsing through the user interfaces of the National Archives Catalog and the 1940 Census website.
The National Archives Catalog dataset (over 225 gigabytes of data) includes the archival descriptions and authority records from the National Archives Catalog as of November 20, 2020, along with the URLs for over 127 million digital copies and data from citizen archivist contributions. We plan to update the dataset on the Registry of Open Data regularly.
The 1940 Census dataset (over 15 terabytes of data) includes the metadata index, the population schedules, the enumeration district maps, and the enumeration district descriptions for the 1940 Census records. The 1940 Census population schedules were created by the Bureau of the Census in an attempt to enumerate every person living in the United States on April 1, 1940, and were digitized and released publicly on April 2, 2012. The 1940 Census enumeration district maps contain maps of counties, cities, and other minor civil divisions that show enumeration districts, census tracts, and related boundaries and numbers used for each census. The 1940 Census enumeration district descriptions contain written descriptions of census districts, subdivisions, and enumeration districts.
The metadata index for the 1940 Census dataset is 251 megabytes, and all of the 3.7 million images from the population schedules, the enumeration district maps, and the enumeration district descriptions total over 15 terabytes. This dataset reflects the 1940 Census records that are also available on NARA’s 1940 Census website and in the National Archives Catalog.
In addition to the Registry of Open Data entries for these datasets, NARA published detailed documentation to guide users on how to access both the full datasets and specific subsets of the data. Users can access both datasets using their respective Amazon Resource Names (ARNs), which uniquely identify resources on AWS so that users can locate each dataset, or with the AWS Command Line Interface (CLI), an open source tool that lets users interact with AWS services from the command line. For the Catalog dataset, we have also provided zip files for easy download of the full dataset.
NARA Documentation
National Archives Catalog on the AWS Registry of Open Data
1940 Census on the AWS Registry of Open Data
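As a rough illustration of how an ARN relates to an AWS CLI call, the Python sketch below turns an S3 bucket ARN into an `s3://` location and builds the matching `aws s3 ls` command. The bucket name `example-open-data-bucket` is a placeholder, not one of NARA's buckets; the actual ARNs are listed in the NARA documentation and the Registry of Open Data entries. The helper names are hypothetical, and the `--no-sign-request` flag assumes the bucket allows anonymous reads.

```python
import shlex


def bucket_from_arn(arn: str) -> str:
    """Extract the bucket name from an S3 ARN of the form arn:aws:s3:::bucket-name."""
    parts = arn.split(":", 5)
    if len(parts) != 6 or parts[0] != "arn" or parts[2] != "s3":
        raise ValueError(f"not an S3 ARN: {arn!r}")
    # The resource part may be "bucket" or "bucket/key"; keep only the bucket.
    return parts[5].split("/", 1)[0]


def list_command(arn: str, prefix: str = "") -> str:
    """Build an AWS CLI command that lists objects under an optional prefix."""
    bucket = bucket_from_arn(arn)
    argv = ["aws", "s3", "ls", "--no-sign-request", f"s3://{bucket}/{prefix}"]
    return shlex.join(argv)


# Placeholder ARN for illustration only; substitute the ARN from the documentation.
EXAMPLE_ARN = "arn:aws:s3:::example-open-data-bucket"

if __name__ == "__main__":
    print(list_command(EXAMPLE_ARN))
```

Running the printed command in a terminal with the AWS CLI installed would list the top level of the bucket. Swapping `ls` for `sync` with a local destination directory is the usual way to download a subset in bulk, though a full copy of a multi-terabyte dataset needs matching local storage.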
The release of these datasets supports NARA’s commitment to our strategic goals to Make Access Happen and to Maximize NARA’s Value to the Nation. With the release of this data, we expect to reach segments of the public beyond our traditional researchers. Universities, private industry, and other agencies are interested in accessing the data in this format and mining it to support new kinds of research and reuse on their own platforms.