NAPP census microdata

January 17, 2014

Historians often try not to fall in love with our sources, but sometimes I just can’t help it. For me, aside from the chatty personal journal (what historian can resist?) and the underground mine maps I’ve studied for years, my greatest fondness may be for big compilations of small bits of data, called microdata.

What’s microdata? It’s small bits of information that, by themselves, might be virtually useless, but when aggregated and analyzed can show bigger trends.

The classic use of microdata is the census. Whether you filled out the form yourself or talked to the canvasser (“enumerator”) who visited your house, small bits of information about you and your family were recorded. By itself, this isn’t much – half a dozen or more websites probably know more about you than the US Census Bureau does. But when placed together with similar information from other people, we can see broader trends – this neighborhood is slipping into poverty, or that one has an emerging immigrant business community, or this county will need to plan for more school capacity because of the number of young children.

What if we could use that same kind of explanatory power to help understand the past? This question, and the use of microdata that it implies, has motivated historians since the advent of computers in the 1960s. The raw records of the census, which in the United States has been conducted every ten years since 1790, are a good source for this microdata. The raw records (termed “manuscript census” records) are released to researchers 72 years after they were created. One challenge is that they are handwritten (hence “manuscript”), meaning that any researcher would have to carefully transcribe the handwritten documents before being able to use them as microdata. But that transcription only needs to be done once, if researchers do a good job and are willing to share.

Thank goodness those people exist! Some of the best are at the Minnesota Population Center (MPC) at the University of Minnesota. They compile and make available microdata to researchers through several different projects. Their most famous one, in the research world, is IPUMS, the Integrated Public Use Microdata Series, but they also have important lesser-known microdata compilations.

The one I’ve used most is the North Atlantic Population Project (NAPP). The MPC collaborated with several other institutions on both sides of the Atlantic to make historical census microdata available from multiple countries. NAPP converts the data so that variables can be compared directly between countries. They also add additional variables, derived from information in each census, that extends the sorts of questions researchers can ask. Best of all, NAPP shares their microdata collections with researchers for free, provided you promise to cite it appropriately,¹ not redistribute the data, and not use it for genealogical purposes. In exchange for complying with these very moderate restrictions, NAPP makes available samples of census microdata covering the US, Canada, Great Britain, Sweden, Norway, and parts of present-day Germany during the 19th and early 20th century. Some of these samples are complete transcriptions of all of the manuscript census records, plus all the extra NAPP bells and whistles.

Most researchers use statistical software packages such as SPSS or STATA to browse and manipulate NAPP data. I took a different tack, creating a set of scripts that will manipulate a NAPP data file and load it into a database, where I can explore it with SQL, the standard language used to query databases. Learn more about the gory technical details. Over time, I will post some sample SQL statements here on this blog.

Microdata census records combined with NAPP’s additional variables and a powerful search tool make it easy to dive deeply into historical patterns large and small. As long as the limitations of the original census sources are kept in mind, an extraordinary range of questions – about race, work, geography, gender, age, the family, and more – can be asked that would be difficult to answer any other way. Can you blame me for having a soft spot in my heart for historical microdata?

Minnesota Population Center. North Atlantic Population Project: Complete Count Microdata. Version 2.0 [Machine-readable database]. Minneapolis: Minnesota Population Center, 2008. Additionally, each data set offered by NAPP has its own citation. ↩︎