Why should a data engineer be interested in government data?

In a previous post, I introduced my readers to Data.gov, a catalog of U.S. government data. Why should a data engineer be interested in government data? Our plates are full with our own company’s data. We also have so much technology that we need to learn. Where is the time for such a side excursion?

[Image: a person in front of a computer, surrounded by all kinds of distractions]

First of all, public government data can provide context for data your company is gathering. You have records of sales. But understanding what might have caused changes in sales requires data that you might not have. Grocery store sales might be affected by weather. A spike in influenza might be the cause of a change in pharmacy sales. This kind of data is publicly available.
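As a rough illustration of adding that kind of context, here is a minimal Python sketch that attaches a weather reading to each day's sales by matching on the date. Everything in it is an assumption made for the example: the field names (`day`, `store`, `sales`, `precip_in`), the helper `add_weather_context`, and the numbers, which are invented placeholders rather than real sales or weather records.

```python
from datetime import date

# Placeholder rows for illustration only. Real sales would come from your own
# systems, and real weather from a public source.
daily_sales = [
    {"day": date(2024, 1, 1), "store": "A", "sales": 1200.0},
    {"day": date(2024, 1, 2), "store": "A", "sales": 700.0},
    {"day": date(2024, 1, 3), "store": "A", "sales": 1150.0},
]

# Weather keyed by day; January 3 is deliberately missing.
daily_weather = {
    date(2024, 1, 1): {"precip_in": 0.0},
    date(2024, 1, 2): {"precip_in": 1.4},
}


def add_weather_context(sales_rows, weather_by_day):
    """Return a copy of each sales row with that day's precipitation attached.

    Days with no weather record get None instead of being dropped, so gaps
    in the public data stay visible.
    """
    enriched = []
    for row in sales_rows:
        weather = weather_by_day.get(row["day"], {})
        enriched.append({**row, "precip_in": weather.get("precip_in")})
    return enriched


enriched_sales = add_weather_context(daily_sales, daily_weather)

if __name__ == "__main__":
    for row in enriched_sales:
        print(row["day"], row["sales"], row["precip_in"])
```

Keeping the unmatched day as `None` is a deliberate choice here: it shows where the public data does not cover your own records.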

Second, government data can aid in planning. If you want to build a physical business that targets people 12 to 18 years old, wouldn’t it help to know how many younger children live in the area, and what that tells you about the number of 12-to-18-year-olds when the business goes live? Or a health care institution that wants to serve the broader community might want to know how well insured the population is.
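The age arithmetic behind that first question can be sketched in a few lines of Python. This is a simplified illustration, not a demographic model: it assumes you already have counts by single year of age for the area, it ignores migration and mortality, and both the function name `project_age_band` and the counts are invented for the example.

```python
def project_age_band(counts_by_age, years_ahead, low=12, high=18):
    """Count the people who will be between `low` and `high` years old
    (inclusive) after `years_ahead` years.

    Simplifying assumption: nobody moves in or out and nobody dies, so each
    person simply ages by `years_ahead`.
    """
    total = 0
    for age, count in counts_by_age.items():
        future_age = age + years_ahead
        if low <= future_age <= high:
            total += count
    return total


# Invented counts for ages 0 through 18: 150 newborns, 5 fewer per year of age.
counts_by_age = {age: 150 - 5 * age for age in range(19)}

teens_today = project_age_band(counts_by_age, years_ahead=0)
teens_in_five_years = project_age_band(counts_by_age, years_ahead=5)

if __name__ == "__main__":
    print("12 to 18 today:", teens_today)
    print("12 to 18 in five years:", teens_in_five_years)
```

With these made-up counts, the children who are 7 to 13 today become the 12-to-18 group in five years, which is why the younger age counts matter for the plan.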

These are the business reasons for using government data. But for data engineers, there are additional benefits. You can learn a lot about how to handle different data issues using government data.

When learning new concepts, we usually start with a small amount of data. A couple of decades ago I was working on a tool for breaking down words into grammatical parts (root, suffix, etc.). We had a list of 100 English words that we used for parsing. It worked great. Then I found a dictionary listing 20K words. The process started crawling.

What will Power Query do with a dataset of several million rows in Power BI? Would a query against that same file written using Azure Synapse Serverless SQL run fast enough? How can I really know before I try it with real data at scales near what I have in production?
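One way to approach these questions is to run the same piece of work at several data sizes and watch how the elapsed time grows. The sketch below shows the idea in plain Python rather than Power Query or Synapse. The row generator `make_rows`, the stand-in workload `total_amount`, and the helper `time_at_scales` are all hypothetical names for this example, and the synthetic rows are only a substitute for a large public dataset.

```python
import random
import time


def make_rows(n, seed=0):
    """Build n synthetic rows; a stand-in for loading a large public file."""
    rng = random.Random(seed)
    return [{"id": i, "amount": rng.random()} for i in range(n)]


def total_amount(rows):
    """Stand-in workload: sum one column."""
    return sum(row["amount"] for row in rows)


def time_at_scales(work, sizes):
    """Run `work` on datasets of each size and return (size, seconds) pairs.

    Only the call to `work` is timed, not the data generation.
    """
    results = []
    for n in sizes:
        rows = make_rows(n)
        start = time.perf_counter()
        work(rows)
        results.append((n, time.perf_counter() - start))
    return results


if __name__ == "__main__":
    for n, seconds in time_at_scales(total_amount, [1_000, 10_000, 100_000]):
        print(f"{n:>8} rows: {seconds:.4f} s")
```

If the time grows much faster than the row count, that is a warning sign worth investigating before the workload reaches production volumes.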

[Image: map of the USA with faux statistics]

A decade ago I discovered the American Community Survey microdata site on census.gov. This site provides access to curated survey results from a couple of million people. I used these large files to test Power Query in Excel. Later I experimented with U-SQL and, more recently, Azure Synapse Analytics as well as Synapse Spark. My early experiments with Azure Data Factory data flows also used this data.

I am now experimenting with APIs in Power Query and Synapse Spark, using the site's API. Several times I have seen job postings requiring experience using an API as a Power BI data source. I had no idea how to do that. I could find articles online describing the process, but I didn’t have access to an API I could experiment with. I found that several US government departments publish APIs, and I have successfully connected to them in Power BI. Now I’m not shy about responding to those job postings.

Have you found a public data set that you think others would be interested in? Post it in the comments.
