Public Use Microdata Sample (PUMS) for data engineers

Years ago I was studying how to use Power Pivot and wanted a large dataset to test with. I discovered the Public Use Microdata Sample (PUMS). This data provides curated results from millions of questionnaires sent out each year by the US Census Bureau.

These files are generated from the results of the American Community Survey (ACS) (census.gov), which is administered by the US Census Bureau. Unlike the decennial (10-year) census, the ACS surveys a sample of the American population. It is intended to help illuminate the changes in each community.

The Census Bureau is careful to avoid sharing information that might allow a user to identify specific respondents. It does this in various ways:

  • Adding noise to the data to hide specific individuals
  • Capping certain attributes to hide individuals (e.g. incomes are stated to a certain maximum)
  • Identifying data only down to a certain level of geographic detail

About the last point: there are different geographical layers available (from Geographic Areas Covered in the ACS (census.gov)):

  • Nation
  • Regions
  • Divisions
  • States
  • Counties
  • Census Tract
  • Block groups

Some ACS data products report at the state level and no more detailed level. Others report down to the block group level. In a future post, I will explore these and other geographical levels. The PUMS data identifies the geography by region, division, state and Public Use Microdata Area (PUMA, a level above the census tract). It does not identify the census tract or the block group.

This data represents the curated answers from the surveys. Only about two thirds of the responses are published. Also, some of the answers are limited to a maximum or minimum value (e.g. salary).
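To make the capping idea concrete, here is a minimal sketch of a simple top-coding rule. The cap value is hypothetical and chosen only for illustration, and the rule itself is an assumption: the Census Bureau's actual procedure is not described in this post and may well differ.

```python
def top_code(value, cap):
    """Return the value, limited to the cap (a simple top-coding rule)."""
    return min(value, cap)


# Hypothetical cap, for illustration only; not an actual PUMS threshold.
HYPOTHETICAL_CAP = 500_000

# Made-up reported incomes.
reported = [42_000, 130_000, 2_750_000]
published = [top_code(v, HYPOTHETICAL_CAP) for v in reported]
print(published)
```

The practical consequence for anyone querying the data: a value sitting exactly at the maximum should be read as "at least this much", so averages over a capped column will tend to understate the true figure.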

The PUMS comes in two flavors: a 1-year survey file and a 5-year survey file. Within each flavor, one set of files has information about each household and another set has information about the people in each household.
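Because households and people live in separate files, most analyses need to link the two. The sketch below shows one way to do that in plain Python. It assumes that both files share a household identifier column named SERIALNO; that column name, along with SPORDER, AGEP, ST and NP, is an assumption not stated in this post, and the rows are made up for illustration.

```python
import csv
import io
from collections import defaultdict

# Made-up sample rows; the real files have many more columns.
# Column names are assumptions, not taken from this post.
HOUSING_CSV = """SERIALNO,ST,NP
H001,02,2
H002,02,1
"""
PERSON_CSV = """SERIALNO,SPORDER,AGEP
H001,1,34
H001,2,36
H002,1,71
"""


def load_households(text):
    """Index housing rows by their household identifier."""
    return {row["SERIALNO"]: row for row in csv.DictReader(io.StringIO(text))}


def group_persons(text):
    """Group person rows under the household they belong to."""
    persons = defaultdict(list)
    for row in csv.DictReader(io.StringIO(text)):
        persons[row["SERIALNO"]].append(row)
    return persons


households = load_households(HOUSING_CSV)
persons = group_persons(PERSON_CSV)

# One line per household: identifier, state code, ages of its members.
for serial, house in households.items():
    ages = [int(p["AGEP"]) for p in persons.get(serial, [])]
    print(serial, house["ST"], ages)
```

In a database or Spark this would simply be a join on the shared identifier; the dictionary version is only meant to show the one-to-many shape of the relationship.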

For data engineers, the main attraction of the PUMS data is its sheer size. The 2021 file has 1.5 million housing records and 3.3 million person records. The 2017 – 2021 file (5 years) has 7.6 million housing records and 15.7 million person records. How easy is it to find a dataset that large that isn’t made-up data?

You can download the data from the Census Bureau website (Index of /programs-surveys/acs/data/pums/2021/1-Year (census.gov)). The web page is structured as a standard directory listing. The name of each file provides information about the file format, type (housing versus persons) and geographical scope.

Figures: listings of files from the ACS FTP site (top and middle of the directory).
  • csv_hak.zip: Comma-delimited file, housing type, for Alaska (AK is the state abbreviation). And of course all of this is zipped.
  • csv_pak.zip: Comma-delimited file, persons type, for Alaska. This one is zipped as well.

The csv_*us.zip files are comma-delimited files for the whole of the USA. Within these zipped files are comma-delimited files as well as a PDF giving basic information about the file.
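The naming pattern described above can be decoded mechanically. Here is a small sketch that splits a file name into its format, record type and geography. The function and class names are hypothetical, and it assumes that the "sas" and "unix" files follow the same prefix_typegeography.zip pattern as the csv files, which I have not verified.

```python
import re
from typing import NamedTuple


class PumsFile(NamedTuple):
    file_format: str   # "csv", "sas" or "unix"
    record_type: str   # "housing" or "persons"
    geography: str     # two-letter state abbreviation, or "us"


# Pattern: <format>_<h|p><two-letter geography>.zip, e.g. csv_hak.zip
_NAME = re.compile(r"^(csv|sas|unix)_([hp])([a-z]{2})\.zip$")
_TYPES = {"h": "housing", "p": "persons"}


def parse_pums_name(name: str) -> PumsFile:
    """Decode a PUMS zip file name; raise ValueError if it does not match."""
    match = _NAME.match(name.lower())
    if match is None:
        raise ValueError(f"unrecognized PUMS file name: {name!r}")
    fmt, kind, geo = match.groups()
    return PumsFile(fmt, _TYPES[kind], geo)


print(parse_pums_name("csv_hak.zip"))
print(parse_pums_name("csv_pus.zip"))
```

A helper like this is handy when looping over a directory listing, because it lets you pick out, say, only the persons files for a handful of states without opening anything.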

“sas” is an additional format that I have not explored yet. Prior to 2021, this file format had “unix” as its prefix. The files in these zip files are not in a text format. Their extension, sas7bdat, is associated with the SAS program.

I have used these files to try out Power Query: how does Power Query handle large datasets? I then experimented with U-SQL: I copied all of the state files to a folder and then used file pruning in my queries. I have also used these datasets with Azure Synapse Serverless SQL as well as Spark.
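The file pruning idea carries over to other tools. The sketch below is a plain Python analogue, not U-SQL: it decides from the file name alone which state files to open and skips the rest. The function name and folder layout are hypothetical, and it assumes the downloaded zip files sit unmodified in one folder and that the CSV members are UTF-8 compatible text with a header row.

```python
import csv
import io
import zipfile
from pathlib import Path


def count_rows(folder, record_type, states):
    """Count data rows per state, opening only zip files whose names match.

    record_type is "h" (housing) or "p" (persons); states is an iterable
    of two-letter abbreviations such as ["AK", "HI"].
    """
    wanted = {s.lower() for s in states}
    counts = {}
    for path in sorted(Path(folder).glob(f"csv_{record_type}??.zip")):
        state = path.stem[-2:]           # e.g. "ak" from "csv_hak"
        if state not in wanted:
            continue                     # pruned: this file is never opened
        total = 0
        with zipfile.ZipFile(path) as archive:
            for member in archive.namelist():
                if not member.lower().endswith(".csv"):
                    continue             # skip the accompanying PDF
                with archive.open(member) as raw:
                    text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
                    reader = csv.reader(text)
                    next(reader, None)   # skip the header row
                    total += sum(1 for _ in reader)
        counts[state] = total
    return counts
```

A call such as count_rows("pums/2021", "h", ["AK", "HI"]) (with a hypothetical folder name) would touch only two of the state archives. With more than fifty state files in the folder, that is most of the saving.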

Let me know in the comments if you have additional uses for the PUMS data.
