December 8, 2017

For more than four years, I've been working on a digital legal history project called 9CHRIS -- the 9th Circuit Historical Records Index System. The potential for historical analysis that comes from having some 40,000 briefs and transcripts available is what keeps me perpetually interested in continuing to improve this project. Each time I open one of the documents, I think of the potential that such rich detail can offer to historians and others studying the West, especially the relationship between western places, western residents, and the law.

Lately, I have realized that there's a simpler factor that helps keep me improving 9CHRIS: the variety of work required keeps upgrades from becoming dull. From the tedious escapism of hand-correcting the records, to the puzzle-solving of deriving meaning from ugly text, to learning about servers, HTML, Markdown, SQLite, and git, to the inspirational feeling of having a senior scholar get in touch about how they have been using it to inform their work, each element of the project requires a different kind of skill and delivers a different sort of reward. I can "play around" with something new, and if it seems to work and shows promise, I can formalize it as part of the project; if it ends up being just a curiosity or a dead end, the exploration was satisfying in its own right.

Over the last few months, these two threads have come together in a new effort at improving the data. The identification of documents in the first version of 9CHRIS was mostly done by computers, which were largely but not entirely correct. I hand-corrected some of the most egregious examples where the computer guessed wrong, and the system was generally usable as a result. Recently, however, I realized that if the whole dataset could be hand-corrected, it would open a number of possibilities for future use. Other information could be layered atop the corrected data, and this might, in turn, open up prospects for much more interesting analysis. Consequently, I began hand-correcting each of the 3,359 volumes in 9CHRIS, inaugurating a new phase of the project. Beginning with volume 0001 and moving sequentially, I flip through each volume page by page, noting the beginning of each document it contains and updating the 9CHRIS site with the corrected information.

Today I passed a milestone, having corrected volume 1000. With each volume I correct, searches improve, errors are removed from the dataset, and I get new ideas about what can be done with all this data once every document has been correctly identified. Once this hand-correcting phase is complete, future project initiatives will construct layers of additional metadata on this foundation. Even before all the corrections are complete, however, the existing system will be gradually improving with each corrected document.

Posted Friday afternoon, December 8th, 2017 Tags:

This is a short post about teaching.

In a lower-level general education history course I am teaching this semester at ASU, I encourage students to make a small handwritten study sheet (a.k.a. "crib sheet" or "cheat sheet") to use in class while taking each exam, to jog their memories, help them better marshal specific evidence to use in their answers, and reduce exam anxiety.1 I also hoped the process of preparing the handwritten sheet would be a valuable study exercise, encouraging them to comprehensively scan the material as they looked for facts to include, and reinforcing key concepts by writing them on the sheet.2

When the day of the midterm exam arrived, I was surprised to see that not all students had made study sheets. In fact, only 15 out of 24 students (62.5%) made them. I collected the sheets with the exams, and later, when recording my grades, I noted whether or not each student had made a study sheet, so I could see if creating a sheet made a difference in their performance on the midterm.

Sheet     Students   Mean (of 40)   Std dev   Mean (pct)
With            15          32.93      2.62        82.3%
Without          9          28.50      6.47        71.3%

Perhaps unsurprisingly, each group showed a range of outcomes. In general, those who made the sheets did better on the exam than those who did not -- the average score was about 11 percentage points higher for those with study sheets (82.3% versus 71.3%). No study sheet user received a grade below a mid-C. For those students without study sheets, the range was greater, spanning failing grades all the way to a couple of As. (The exam was worth 40 points; the final table column adjusts the mean scores to a percentage basis.)
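For readers who want to check the arithmetic, here is a minimal sketch converting the point means from the table above to percentages. It uses only the summary figures already reported, not the raw scores.

```python
# Convert the raw exam means (out of 40 points) to percentages and
# compare the two groups. Figures are the summary statistics from the
# table above, not individual student scores.
EXAM_POINTS = 40

with_sheet_mean = 32.93     # n = 15
without_sheet_mean = 28.50  # n = 9

with_pct = with_sheet_mean / EXAM_POINTS * 100
without_pct = without_sheet_mean / EXAM_POINTS * 100
diff = with_pct - without_pct

print(f"With sheet:    {with_pct:.1f}%")
print(f"Without sheet: {without_pct:.1f}%")
print(f"Difference:    {diff:.1f} percentage points")
```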

This is of course a very small set of results, doubtlessly showing the influence of more factors than the single issue of having a crib sheet or not which I've examined here. It seems quite possible, for instance, that some students who didn't make a study sheet also didn't study comprehensively for the exam, which would mean use of a study sheet might simply be a marker of students with better study habits. The data is not fine-grained enough to see if the improvement for crib sheet users was seen in the multiple choice/short answer questions, the essay questions, or both. But in any case the outcome is clear enough to me to continue to encourage students to make a study sheet when they have an opportunity to do so.

  1. The literature on the effectiveness of study sheets is generally positive, though few researchers attribute major performance gains to them. See for example Brigitte Erbe, "Reducing Test Anxiety While Increasing Learning: The Cheat Sheet," College Teaching 55, no. 3 (2007): 96-98, DOI: 10.3200/CTCH.55.3.96-98; and Afshin Gharib and William Phillips, "Test Anxiety, Student Preferences and Performance on Different Exam Types in Introductory Psychology," International Journal of e-Education, e-Business, e-Management and e-Learning 3, no. 1 (February 2013): 1-6, DOI: 10.7763/IJEEEE.2013.V3.183.  ↩

  2. Some research suggests that crib sheets do not effectively help students learn the material; see Thomas N. Dorsel and Gary W. Cundiff, "The Cheat-Sheet: Efficient Coding Device or Indispensable Crutch?" The Journal of Experimental Education 48, no. 1 (Fall 1979): 39-42. These authors, both psychologists, determined this by asking students to prepare crib sheets before a short, voluntary exam. The professors then took the sheets away from some students. The students without cheat sheets then did not do as well on the test as those who got to use the sheet they had prepared, showing that the crib sheets functioned as a "crutch" but not a device that had helped students internalize information during pre-test studying. The applicability of this study to longer, mandatory exams is not clear; also unreported is whether any student wished to bop the researchers on the nose. ↩

Posted at noon on Saturday, November 14th, 2015 Tags:

In working with manuscript census materials, modern data derived from them, and published documents from the Census Bureau, I found myself coming back to particular resources time and again. In a hope this might be of use to someone else, I've put together this list of those I use most frequently. If you have suggestions or corrections, please contact me or leave a note in the comments, below.

Manuscript materials

  • Main page for manuscript population schedules, 1790--1930, digitized from NARA microfilm, hosted on the Internet Archive:

  • The 1940 manuscript census, with full access provided at no charge by NARA.

  • The Unified Enumeration District Finder -- choose the desired year in the dropdown box in the title. Helps you find Enumeration Districts from addresses for the 1880-1940 censuses. In other words, it helps you find the key piece of information needed to look up a particular place in the census. A quirky but amazing site.

  • Modern re-creations of census forms, helpful for deciphering questions (i.e. these are readable, but double-check each against an original) or for taking notes on a limited number of people:

Data sources

U.S. Census publications

  • Census Bureau: Published books from the census, in annoying linked PDF format: (or via FTP here). The Dubester catalog is helpful in figuring out what's what, and what is missing. There is data available in tabular form in these publications that hasn't made its way to the databases, so it can still be helpful to access them. These particular scans are not high quality, unfortunately, but far better than no access at all.

  • Census Bureau Library. Contains mostly mid- to late-twentieth-century reports. Difficult to search, as may be imagined.

  • Census Bureau FTP site, includes many historic publications (note, mirrored on IA in a huge tarball):

  • Henry J. Dubester, Catalog of United States Census Publications, 1790-1945 (Washington: GPO, 1950). The standard, though the Census Bureau has stopped listing the Dubester numbers for early publications on their site, so it's not quite as important to use it to navigate. Still helpful to make sure you have seen what there is to see, and to help decipher similarly-named documents. Available as one half of this huge PDF, or at HathiTrust. (Note: The first link above has a second piece that picks up where Dubester left off and continues to 1972.) Kevin Cook published a revised version in 1996 that provides SuDoc numbers for the publications listed by Dubester, making them much easier to find in a modern Government Publications depository library.

  • Henry J. Dubester, State Censuses: An Annotated Bibliography of Census of Population Taken After the Year 1790 by States and Territories of the United States (Washington, DC: Government Printing Office, 1948), 73 pages. Google Books

  • Yearly Catalogs of Census publications (more recent era):

  • Jason G. Gauthier, Measuring America: The Decennial Censuses from 1790 to 2000 ([Washington]: U.S. Census Bureau, 2002). This is a very detailed chronology of census information that can be quite helpful in figuring out exactly what was asked when, and what survived. Includes images of the population schedules.

Commercial services

I try to avoid these, but especially if you are trying to trace the history of a particular individual, they are powerful and can save you some legwork. Note that if you happen to be in the vicinity of a National Archives facility, you can use them (and others) for free on-site, as part of a deal struck when NARA began permitting the companies to digitize NARA records in huge batches.

Posted mid-morning Monday, March 16th, 2015 Tags:

As I've noted before, manuscript records collected by the census can be fascinating and informative windows to the past. They can be used to learn more about groups of people that appear only occasionally in the historical record, and since they are generally well-structured, they can be used (with care) to ask data-driven historical questions. When most historians think of using historical census sources, it's the forms from the decennial census of population that come immediately to mind. These are well-known sources, but there are always fresh nuances to discover. I stumbled over one of these just the other day.

This was news to me: the 1910 federal census used different forms to record American Indians. These filled-out forms were combined on microfilm together with those manuscript schedules that had been used to record the rest of the population, but the Indian-specific forms reflected and reinforced their non-equal place in American society.

But then one find led to another: in the course of trying to find out more about these forms, I quickly became aware of how little I knew about finding American Indians in the census. In the 1800s, American Indians were only rarely counted by enumerators in the way the non-Indian population was. On the other hand, the U.S. government occasionally tried to record American Indians specifically. So policies of separation and difference ended up having an unintended outcome, leaving behind primary sources that help us know more about these groups than we otherwise might.

The "Indian Census Rolls"

Though information about American Indians can be found in a variety of census publications, one of the largest is the set of microfilmed copies of the Indian Census Rolls, 1885-1940. These forms were the result of instructions to federal Indian agents to tally all of the American Indians living on reservations under their jurisdiction. As this detailed article from the National Archives makes clear, despite the wide scope of coverage, the forms did not cover every recognized group of American Indians, nor did they list non-affiliated members.

Image from Indian Census, 1933

But these documents do have certain advantages over traditional census forms. Unlike the regular census, the "Indian Census" rolls were supposed to be recorded or updated every year. Many of the records were typed instead of handwritten (hooray!), and for some years, records of individuals included a direct reference to the same person on the previous year's form, greatly easing the work of tracing a person through time.

The Indian Census Rolls were long available only on microfilm, which limited access to them. Recently they have been digitized and made available for searching through genealogy websites, where they are available to paid subscribers. If you are just looking for the name of a particular ancestor, the paid sites are undoubtedly the easiest way to find that needle in a haystack. But the records can also be accessed for free with a little extra work, as the microfilm reels published by the National Archives have been digitized by the Internet Archive. Usage patterns for 9CHRIS, a site I built to help make historic court records accessible, suggest that there is a lot of demand for greater access to federal records about American Indians, so I thought it might be useful to describe how to use the freely-available Indian Census.

Using the Finding Aid and the Internet Archive

Here's how to find the rolls for a particular American Indian tribe or group in the freely-available Indian Census sources:

  1. First look them up in the finding aid to find out what "agency" was responsible for reporting about them. Agencies were units of geography, which sometimes (but not always) reflected federally-designated reservations. Sometimes a single agency might report about several tribal groups, and the opposite was also sometimes the case, where a particular tribe might fall under the jurisdiction of several agencies.

    For example if I were looking for the Washoe (also spelled "Washo"), I would see that the Bishop, Carson, and Walker River agencies each had jurisdiction.

  2. Next, use the second list, located later in the same finding aid document, to find out what reels of microfilm contain the records for that agency. (The agencies are listed in alphabetical order.) Some agencies share space on the same reel of microfilm, and in other cases the agency's records are spread across multiple reels, with a few years to each reel. Note the reel number you are interested in, in the left column.

    To continue the example, the Carson agency has records on reels 18, 19, 20, and 21. If I wanted to see records from 1933, I'd choose reel 20.

  3. Now we can go to the Internet Archive and look for the reel we want. I searched for "Indians of North America" AND census AND "Reel 020" (note the reel number is always three digits) and it returned precisely the result I wanted.

  4. The reel can be read online, or downloaded as a (very large) PDF. Within each agency, the records are typically divided by year, and then by the "reservation" or administrative unit within the broader agency, which are organized alphabetically within each year.

    In this case, the records from 1933 were at the beginning of the reel. I skipped over several administrative units (Austin, Battle Mountain, Beowawe) before coming to the Carson Valley subunit, where the people I was looking up lived.

  5. To save a copy of a page in the online viewer, right-click on the image and choose "Save Image As..." to download it. The Internet Archive also built their online viewer so that each specific page in every document has a unique URL, so you can just copy and paste the URL from your browser bar.
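The lookup in step 3 can also be scripted. The sketch below builds a query URL against the Internet Archive's advancedsearch endpoint; treat the endpoint and its parameters as my assumption here -- the query string itself simply mirrors the one used above.

```python
# Build an Internet Archive search URL for a given microfilm reel of the
# Indian Census Rolls. The "advancedsearch.php" endpoint and parameter
# names are assumptions; the query mirrors the one in step 3.
from urllib.parse import urlencode

def reel_search_url(reel_number: int) -> str:
    # Reel numbers in the scanned collection are always three digits.
    query = f'"Indians of North America" AND census AND "Reel {reel_number:03d}"'
    params = {"q": query, "output": "json", "rows": 10}
    return "https://archive.org/advancedsearch.php?" + urlencode(params)

url = reel_search_url(20)
print(url)
```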


Historians, social scientists, independent researchers, and genealogists all make extensive use of the manuscript historical census. The discrimination and unequal treatment faced by many ethnic and racial minorities in the past is sometimes reflected in their absence or unequal treatment in the census. But sometimes, as in the case of some Native Americans, these social attitudes led to policies that left behind, as a side effect, documentation that helps us understand their lives and history better than we would otherwise.

Posted early Thursday morning, September 11th, 2014 Tags:

Census microdata, such as that produced by the NAPP project, can help illuminate interesting issues surrounding work and labor.

Investigating the history of work using microdata must begin with the questions asked by the census enumerators about employment. These varied depending on the country and year, but were generally quite simple. For example, the U.S. 1880 census form had two questions pertaining to work, and a third that hinted toward labor as well. Question 13 recorded the "Profession, Occupation, or Trade of each person, male or female," and Question 14 asked the "Number of months this person has been unemployed during the Census year." These two questions were not supposed to be asked of any individual "under 10 years of age," according to the instructions.1 Question 15 also implied work, asking if the person had been sick or disabled, "so as to be unable to attend to ordinary business or duties." The recorded occupation, Question 13 in the case of the US 1880 census, was written free-form in the blank.

Transcribing the occupation

In the NAPP dataset, the occupation as it was transcribed by volunteers is found in the field OCCSTRNG. The small space on the form and the frequently creative spelling and abbreviating style of the enumerators can lead to some unreliable results if taken alone.

Let's look at mining engineers in the 1880 US microdata as produced by NAPP. We have 417 entries where OCCSTRNG is MINING ENGINEER. Good! But we also have other variations, such as:


Some of these might be small variations on the overall category, worth noting and investigating. (What's the difference, in 1880, between a Mining Expert and a Mining Engineer?) But many are clearly mining engineers, just spelled differently.

While this variable by itself is valuable, it can be difficult even for simple operations -- say, counting the number of mining engineers in a particular state -- because of the spelling differences. To address this limitation, the NAPP team created many additional variables to allow researchers to compare occupational information more broadly.
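To make the counting problem concrete, here is a small sketch using an in-memory SQLite database. The table and column names (data, occstrng) echo the setup described in this post, but the rows are invented for illustration; a LIKE pattern catches variant spellings that an exact match misses.

```python
# Toy demonstration of why raw OCCSTRNG strings are hard to count:
# exact matching misses spelling variants that a LIKE pattern catches.
# Table/column names echo the NAPP setup; the rows are invented.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE data (occstrng TEXT)")
conn.executemany(
    "INSERT INTO data VALUES (?)",
    [("MINING ENGINEER",), ("MINEING ENGINEER",),
     ("MINING ENGR",), ("MINING EXPERT",), ("FARMER",)],
)

exact, = conn.execute(
    "SELECT count(*) FROM data WHERE occstrng = 'MINING ENGINEER'"
).fetchone()
fuzzy, = conn.execute(
    "SELECT count(*) FROM data WHERE occstrng LIKE 'MIN%ENG%'"
).fetchone()
print(f"Exact matches: {exact}, LIKE-pattern matches: {fuzzy}")
```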

Constructing variables about work

From the tiny bits of inconsistent information recorded about each person's occupation, NAPP adds tremendous value by creating new derived variables. These are "constructed variables."

As you can see in the full list, not all variables are available for each sample, in part because the census questions asked about labor could vary depending on country and year. Additionally, some of the variables have essentially similar information, but NAPP offers a variety in order to make it easier to connect NAPP data with other data sets.

Some of these constructed variables are simple and intended to help with other comparisons. For example, the LABFORCE variable simply records if a person participated in the work force or not (or if it was impossible to tell).2 By itself, this might not seem very useful, but it could be helpful in conjunction with other variables. For example, you might want to compare people in one occupational sub-group -- farmers, for example -- with all people who were a part of the labor force rather than the public as a whole.
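That kind of comparison can be sketched with an in-memory SQLite table. The LABFORCE coding used here (2 = yes, 1 = no) follows IPUMS-style conventions but should be treated as an assumption, and the rows and the farmer code are invented for illustration.

```python
# Compare a subgroup (a hypothetical farmer code) against labor-force
# participants rather than the whole sample. LABFORCE coding assumed
# (2 = yes, 1 = no); all rows are invented for illustration.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE data (labforce INTEGER, occhisco INTEGER)")
conn.executemany(
    "INSERT INTO data VALUES (?, ?)",
    [(2, 61110), (2, 61110), (2, 2700), (2, 71120), (1, 0), (1, 0)],
)

total_pop, = conn.execute("SELECT count(*) FROM data").fetchone()
in_labforce, = conn.execute(
    "SELECT count(*) FROM data WHERE labforce = 2"
).fetchone()
farmers, = conn.execute(
    "SELECT count(*) FROM data WHERE labforce = 2 AND occhisco = 61110"
).fetchone()

print(f"Farmers as share of population:  {farmers / total_pop:.0%}")
print(f"Farmers as share of labor force: {farmers / in_labforce:.0%}")
```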

Other constructed variables group together workers by their occupation. This helps solve the sort of problem we faced above with misspellings of "mining engineer." OCCHISCO and OCC50US are two of these variables. Each uses a mid-twentieth century list of occupations as a starting point, with adaptations to better represent historical occupations. It is important to remember that not all occupations would have fit neatly into one of these later occupational categories, and conversely, sometimes very different occupations get lumped together inadvertently.3 Even so, this can be an important way to identify relatively fine-grained occupational information.

Some occupations have a built-in hierarchy of status that might be difficult to capture using the OCCHISCO codes alone. A "Mine Laborer" and a "Miner" are different things, but both might reasonably belong in OCCHISCO category 71120. The OCSTATUS variable records any known hierarchical information from the occupation field. So while a miner and a mine laborer would both be in the same OCCHISCO category, a "Laborer" would have OCSTATUS 33, where the miner might have an OCSTATUS of 32 ("Worker or works in"), 00 ("None"), or 99 ("Unknown").
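The miner/laborer distinction can be recovered by grouping within a single OCCHISCO category. In this sketch the category (71120) and the status codes come from the paragraph above, while the rows themselves are invented for illustration.

```python
# Split one OCCHISCO category by OCSTATUS hierarchy codes. Category
# 71120 and codes 33/32/00/99 come from the text; rows are invented.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE data (occhisco INTEGER, ocstatus INTEGER)")
conn.executemany(
    "INSERT INTO data VALUES (?, ?)",
    [(71120, 33), (71120, 33), (71120, 32), (71120, 0), (71120, 99)],
)

rows = conn.execute(
    """SELECT ocstatus, count(*)
       FROM data
       WHERE occhisco = 71120
       GROUP BY ocstatus
       ORDER BY ocstatus"""
).fetchall()
for status, n in rows:
    print(f"OCSTATUS {status:02d}: {n}")
```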

NAPP also includes constructed variables that suggest, in a relative way, the wealth or status associated with an occupation. SEIUS uses the "Duncan Socioeconomic Index," which was developed in the 1950s, to create an occupational rank that considers income, education, and prestige. These scores are tied to the way those factors were perceived in the 1950s -- this means that they can be compared across decades, as the scores will always be the same for a given occupation; but it also anachronistically frames prestige in 1950s terms. For example, a "miner" receives a SEIUS score of 10, which is fairly low.4 But perhaps mining carried more prestige in 1880? For obvious reasons, there is considerable debate about the usefulness of this measure, but like all of these constructed variables, it may be helpful if used carefully.5 The NAPP variable OCSCORUS provides a related measurement, classifying each occupation according to its economic status in 1950. Unlike the SEIUS score, which factored the prestige or status of an occupation into its calculations, OCSCORUS is based only on the earning power of that job classification in 1950. As with SEIUS, there are obvious problems with anachronistically comparing 1880 job types based on what those job types earned 70 years later. However, if used carefully, OCSCORUS, like SEIUS, can place all workers somewhere on a universal scale in order to compare them.
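A sketch of the kind of comparison described in note 4, computing one group's mean SEIUS against the labor force as a whole. The column names follow the NAPP setup, the engineer and miner scores echo figures mentioned in this post, but the farmer's score and the rows themselves are invented for illustration.

```python
# Compare mean SEIUS for one occupational group against the labor force
# overall. Column names follow NAPP; the engineer (85) and miner (10)
# scores echo the text, the farmer score and all rows are invented.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE data (occhisco INTEGER, labforce INTEGER, seius REAL)"
)
conn.executemany(
    "INSERT INTO data VALUES (?, ?, ?)",
    [(2700, 2, 85.0), (2700, 2, 85.0),    # mining engineers
     (71120, 2, 10.0), (71120, 2, 10.0),  # miners
     (61110, 2, 14.0)],                   # a farmer (score invented)
)

overall, = conn.execute(
    "SELECT avg(seius) FROM data WHERE labforce = 2"
).fetchone()
engineers, = conn.execute(
    "SELECT avg(seius) FROM data WHERE occhisco = 2700"
).fetchone()
print(f"Labor force mean SEIUS:     {overall:.2f}")
print(f"Mining engineer mean SEIUS: {engineers:.2f}")
```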

A simple example

Let's take a look at a simple example and the SQL code needed to calculate it. (Here's how I set up a SQLite database with NAPP data.)

Where did most mining engineers live?

Let's begin with a straightforward question: Where did mining engineers live in 1880? We will use the OCCHISCO variable to look for them, noticing that the value "02700" is "Mining engineers." Let's group them by state, but notice that NAPP provides a variety of geographic levels that could be used here, from simple measures of urbanity (URBAN), to small divisions such as enumeration district (ENUMDIST) and county (COUNTYUS), up to regional groupings of states (REGIONNA).

This SQL code will produce the table below, using a JOIN to grab each state's name from the auxiliary table. (Note: I have manually folded the table to take up less vertical space.)

SELECT stateus.desc AS State
     , count(*) AS Engineers
FROM data
JOIN stateus ON data.stateus = stateus.code  -- key column name in the lookup table is assumed
WHERE occhisco = 2700
GROUP BY stateus.desc
ORDER BY Engineers DESC;
State                  Engineers   State                  Engineers
California                   146   Colorado                      83
Pennsylvania                  66   New York                      57
Michigan                      24   West Virginia                 23
Arizona Territory             22   Illinois                      20
Nevada                        20   Massachusetts                 17
Utah Territory                17   Missouri                      14
New Jersey                    14   Ohio                          10
New Mexico                     9   Virginia                       8
Idaho Territory                7   Montana Territory              7
North Carolina                 7   Iowa                           6
Georgia                        5   District of Columbia           4
New Hampshire                  4   Tennessee                      4
Arkansas                       3   Connecticut                    3
Kentucky                       3   Maine                          3
South Dakota                   3   Alabama                        2
Indiana                        2   Maryland                       2
Rhode Island                   2   South Carolina                 2
Delaware                       1   Nebraska                       1
North Dakota                   1   Oregon                         1
Wyoming Territory              1

Unsurprisingly, many mining engineers were found in the American West, where mining was booming. The strong numbers in Pennsylvania reflect the importance of the anthracite and bituminous coal industry there. But these numbers can also help remind us of the close association of engineering expertise with capital, as in the case of those located in New York, New Jersey, and Massachusetts. Similarly, they might help remind a researcher who focuses on Western mining of the growing importance of coal production in the Midwest, in states such as Ohio, Illinois, and Iowa. Unexpectedly high or low numbers can help prompt deeper investigation. For instance, are a mere two engineers in Alabama a sign that the state's major coal industry had yet to reach substantial levels of development, or that the mines were worked without significant engineering oversight?

Caveats about microdata and labor history

As with any data derived from the historical manuscript census, there are sometimes problems with NAPP's occupational data. Some of these problems arise from the recording, transcription, and coding phases. If the enumerator heard the person incorrectly (or could not spell well), or if a volunteer could not make out the handwriting or assumed the word was a different one, or if the occupation did not clearly fit any one category and a "best guess" had to be made in classification, errors might be introduced. It is also particularly problematic to assume that job categories, and their relative status, remained consistent over the decades.

Other issues stem from the nature of the census itself. Most census forms only permitted one occupation to be listed.6 Occasionally enumerators tried to squeeze two occupations in the space, such as "MINING & CIVIL ENGINEER." But most frequently the other work was simply not counted. What if a person was a farmer during the summer months and worked at mining during the winter? Only one occupation could be recorded.

Census workers were supposed to record a person where they were found on a particular day or month, such as June 1880. This specificity could contribute to errors in recording people whose work was seasonal or took them far from home, such as a mining engineer on a summer-long consulting trip in the mountains of the American West.

Similarly, the census takers did a poor job understanding and accounting for the work of women and children. A woman who took in boarders or washing made a tangible contribution to her household's economic prosperity, but enumerators often overlooked this, simply recording "Keeping House."

Job insecurity is difficult to determine in the NAPP data. It was not well recorded by the census, especially in older censuses, which provide only crude measures of unemployment and no data at all about underemployment. In the 1880 US census, for example, there is a column to mark how many months a person has been unemployed, but this could at best unevenly reflect the cycle of on-again, off-again work that typified many labor categories, such as anthracite miners in Pennsylvania. Compounding the issue, most of the NAPP data sets do not include this information, even if it had originally been recorded on the census. (Perhaps a potentially-sensitive issue such as unemployment had been deemed unnecessary to record by genealogist volunteers who, in some cases, originally compiled the data sets that were further extended by NAPP.)


NAPP microdata derived from the census can offer important information about historic patterns of work and labor. The data is by no means a perfect representation of work activity, and it can contain noteworthy errors. Even so, when used judiciously, this microdata can shed light on important questions about work that were central to life in the past.

  1. Some quick work with the database shows that this rule was hardly observed universally. While sometimes enumerators filled this field with age-appropriate information such as "ATTENDING SCHOOL," or crossed it out with an "X," (and typos in the age field may also have occurred), it is clear that children under 10 years of age worked in small numbers in a wide variety of occupations in 1880. ↩

  2. As usual, caveats apply and the documentation for each variable must be read carefully. For example, the LABFORCE variable is designed so as to report that people who are listed as having an occupation, but are 15 years old or younger, are automatically reported as not having an occupation. This would make it impossible to use LABFORCE to pursue certain kinds of occupational questions about child labor, for example. ↩

  3. One example of inadvertent lumping of dissimilar work in the same category can be found in OCCHISCO value 03030, "Mine surveyors." In the 1880a US data set, only 15 individuals are placed in this category. With occupations such as "MINE SURVEYOR" (3), "MINING SURVEYOR" (2), and "SURVEYOR AND MINER" (1), some of these appear to be people who work for mining companies conducting underground surveying, which was often done by beginning-level mining engineers. Others in this category, such as "U.S. MIN'L SURVEYOR" (1) and "U.S. DEPUTY MINER SURV." (1) are quite different. These were experienced land surveyors appointed by the federal government to carefully survey the surface boundaries of any mining claim staked on public land. They would create reports and plat maps, swearing under oath as to their accuracy. US Mineral Surveyors, then, would seem to be a very different type of occupation than a mine surveyor, but because of the need to classify occupations they ended up in the same OCCHISCO category. ↩

  4. The average SEIUS value for all members of the labor force in 1880 is 19.70. By way of comparison, mining engineers (OCCHISCO = 02700) have an SEIUS score of 85. ↩

  5. See this cautionary note from the IPUMS documentation. Note: NAPP's SEIUS is called SEI in the IPUMS-USA data and documentation. ↩

  6. Among NAPP data sets, Norway is an exception, and allowed census takers to record two occupations. ↩

Posted late Saturday morning, July 26th, 2014 Tags:

Cover image of Seeing Underground

It's finally here! Seeing Underground: Maps, Models, and Mining Engineering in America, published by the University of Nevada Press, is available in hardcover and Kindle editions. I examine how maps and models created a visual culture of mining engineering, helping American mining engineers fashion a professional identity and occupational opportunities for themselves.

I'm grateful for all the support and assistance I've received as I've chased this fascinating history.

Here's the description from the press:

Digging mineral wealth from the ground dates to prehistoric times, and Europeans pursued mining in the Americas from the earliest colonial days. Prior to the Civil War, little mining was deep enough to require maps. However, the major finds of the mid-nineteenth century, such as the Comstock Lode, were vastly larger than any before in America. In Seeing Underground, Nystrom argues that, as industrial mining came of age in the United States, the development of maps and models gave power to a new visual culture and allowed mining engineers to advance their profession, gaining authority over mining operations from the miners themselves.

Starting in the late nineteenth century, mining engineers developed a new set of practices, artifacts, and discourses to visualize complex, pitch-dark three-dimensional spaces. These maps and models became necessary tools in creating and controlling those spaces. They made mining more understandable, predictable, and profitable. Nystrom shows that this new visual culture was crucial to specific developments in American mining, such as implementing new safety regulations after the Avondale, Pennsylvania, fire of 1869 killed 110 men and boys; understanding complex geology, as in the rich ores of Butte, Montana; and settling high-stakes litigation, such as the Tonopah, Nevada, Jim Butler v. West End lawsuit, which reached the US Supreme Court.

Nystrom demonstrates that these neglected artifacts of the nineteenth and early twentieth centuries have much to teach us today. The development of a visual culture helped create a new professional class of mining engineers and changed how mining was done.

Posted late Wednesday afternoon, May 7th, 2014 Tags:

Historians often try not to fall in love with our sources, but sometimes I just can't help it. For me, aside from the chatty personal journal (what historian can resist?) and the underground mine maps I've studied for years, my greatest fondness may be for big compilations of small bits of data, called microdata.

What's microdata? It's small bits of information that, by themselves, might be virtually useless, but when aggregated and analyzed can show bigger trends.

The classic use of microdata is the census. Whether you filled out the form yourself or talked to the canvasser ("enumerator") who visited your house, small bits of information about you and your family were recorded. By itself, this isn't much -- half a dozen or more websites probably know more about you than the US Census Bureau does. But when placed together with similar information from other people, we can see broader trends -- this neighborhood is slipping into poverty, or that one has an emerging immigrant business community, or this county will need to plan for more school capacity because of the number of young children.

What if we could use that same kind of explanatory power to help understand the past? This question, and the use of microdata that it implies, has motivated historians since the advent of computers in the 1960s. The raw records of the census, which in the United States has been conducted every ten years since 1790, are a good source for this microdata. The raw records (termed "manuscript census" records) are released to researchers 72 years after they were created. One challenge is that they are handwritten (hence "manuscript"), meaning that any researcher would have to carefully transcribe the handwritten documents before being able to use them as microdata. But that transcription only needs to be done once, if researchers do a good job and are willing to share.

Thank goodness those people exist! Some of the best are at the Minnesota Population Center (MPC) at the University of Minnesota. They compile and make available microdata to researchers through several different projects. Their most famous one, in the research world, is IPUMS, the Integrated Public Use Microdata Series, but they also have important lesser-known microdata compilations.

The one I've used most is the North Atlantic Population Project (NAPP). The MPC collaborated with several other institutions on both sides of the Atlantic to make historical census microdata available from multiple countries. NAPP converts the data so that variables can be compared directly between countries. They also add additional variables, derived from information in each census, that extend the sorts of questions researchers can ask. Best of all, NAPP shares their microdata collections with researchers for free, provided you promise to cite it appropriately,1 not redistribute the data, and not use it for genealogical purposes. In exchange for complying with these very moderate restrictions, NAPP makes available samples of census microdata covering the US, Canada, Great Britain, Sweden, Norway, and parts of present-day Germany during the 19th and early 20th centuries. Some of these samples are complete transcriptions of all of the manuscript census records, plus all the extra NAPP bells and whistles.

Most researchers use statistical software packages such as SPSS or STATA to browse and manipulate NAPP data. I took a different tack, creating a set of scripts that will manipulate a NAPP data file and load it into a database, where I can explore it with SQL, the standard language used to query databases. Learn more about the gory technical details. Over time, I will post some sample SQL statements here on this blog.

Microdata census records combined with NAPP's additional variables and a powerful search tool make it easy to dive deeply into historical patterns large and small. As long as the limitations of the original census sources are kept in mind, an extraordinary range of questions -- about race, work, geography, gender, age, the family, and more -- can be asked that would be difficult to answer any other way. Can you blame me for having a soft spot in my heart for historical microdata?

  1. Minnesota Population Center. North Atlantic Population Project: Complete Count Microdata. Version 2.0 [Machine-readable database]. Minneapolis: Minnesota Population Center, 2008. Additionally, each data set offered by NAPP has its own citation. ↩

Posted early Friday morning, January 17th, 2014 Tags:

This post is an explanation of the napptools scripts, including how they transform NAPP download files into a SQLite database.


The tools

napptools consists of three script programs:

  • A Bash script that uses the traditional unix tools cut, sed, and tr to chop a NAPP data file into its respective columns, guided by a SAS-format command file. This script also creates secondary tables in .CSV format from the variable descriptions in the SAS file.

  • csv2sqlite: a public-domain AWK program written by Lorance Stinson that converts .CSV files into a series of SQL statements to load the data into a database. Eric Nystrom made two small changes to Stinson's original code to fix a bug and to better fit the output to SQLite's capabilities by specifying non-typed columns.

  • A wrapper Bash script that employs the other two tools to create .CSV and .SQL files, then loads them into a SQLite database.
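
In miniature, the chopping that the first script delegates to cut amounts to slicing each line at fixed character positions. Here is the same idea sketched in Python; the record layout below is invented for illustration, and real column positions come from the SAS command file.

```python
# Hypothetical fixed-width record: census year in columns 1-4,
# serial number in 5-10, surname in 11-20 (space-padded).
line = "1850000123Smith     "
year = line[0:4]                 # what `cut -c1-4` would extract
serial = line[4:10]              # `cut -c5-10`
namelast = line[10:20].rstrip()  # `cut -c11-20`, with padding trimmed
print(",".join([year, serial, namelast]))  # 1850,000123,Smith
```

Applied to every line of the decompressed .dat file, output like this becomes the .CSV that csv2sqlite then turns into SQL INSERT statements.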

Usage

  1. Get an account at the NAPP website, receive access approval, and select your desired variables. Download the fixed-width text data file itself, which will end in a .dat.gz extension, as well as a command file in SAS format.
  2. Ensure all dependencies are met. On most Linux systems the only one you may need to install will be sqlite3, the command-line client for the SQLite database package.
  3. Run the main script in the directory containing your data file and your command file, passing the name of the SQLite database you wish to create.
    • If your data file is napp_00001.dat.gz and you have the matching SAS command file, run like so:
    • -i -n napp_00001 -d MyNAPPData.db
  4. From there, you can use your database from the SQLite command shell sqlite3 or your favorite programming language.

Database structure

For databases created with napptools, most of the NAPP data ends up in a single large table, called data. Each of the columns in data is named for the NAPP field, such as SERIAL, PERWT, NAMELAST, etc. (Since SQLite's column names are not case-sensitive, lower case works fine too.)

Some of these columns have self-contained information, such as NAMELAST or OCCSTRNG, but others contain a numeric code that will typically need to be translated into human-readable values. Translations for these codes were offered by NAPP in the command file. The napptools suite breaks those translations out of the command file, into separate tables loaded into the SQLite database. These secondary tables are named for the NAPP variable, and always contain two columns, id and desc. With this information, it is easy to use a SQL JOIN command to bring the translations into your results, or you can refer to the codes directly if desired.

For example, to show the number of people listed in each category of the race variable in the state of New Mexico (stateus value of 35):

SELECT race.desc, count(*)
FROM data
JOIN race ON data.race = race.id
WHERE stateus = 35
GROUP BY race.desc
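
To see a join like this behave end to end, here is a small self-contained sketch using Python's built-in sqlite3 module. The rows and codes below are invented for illustration; a real napptools database is generated from a NAPP download and is far larger.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Mimic the napptools layout: one big `data` table, plus a
# two-column secondary table (id, desc) per coded variable.
# Columns are left untyped, as csv2sqlite's output specifies.
cur.execute("CREATE TABLE data (serial, namelast, race, stateus)")
cur.execute("CREATE TABLE race (id, desc)")
cur.executemany("INSERT INTO race VALUES (?, ?)",
                [(100, "White"), (200, "Black")])
cur.executemany("INSERT INTO data VALUES (?, ?, ?, ?)",
                [(1, "Smith", 100, 35),   # stateus 35 = New Mexico
                 (2, "Jones", 200, 35),
                 (3, "Brown", 100, 6)])   # a different state

rows = cur.execute("""
    SELECT race.desc, count(*)
    FROM data
    JOIN race ON data.race = race.id
    WHERE stateus = 35
    GROUP BY race.desc
""").fetchall()
print(sorted(rows))  # [('Black', 1), ('White', 1)]
```

Only the two New Mexico rows are counted, with the numeric race codes translated into human-readable values through the secondary table.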

Some caveats

  • The SQLite index generation routine is rather crude: it makes an index for every column in the data table and on the id column in all secondary tables. This is likely overkill, but indexes on at least some of the columns no doubt help many queries.
  • This was designed and used on a Debian Linux system. It seems likely that it will be portable to similar unix-based systems as long as the dependencies are all met, but YMMV.
Posted early Friday morning, January 17th, 2014 Tags:

Last semester, several of my students ran afoul of a perennial problem with PowerPoint. They had created their slide decks on a large screen, but when they connected to the room's projector, it forced a lower screen resolution.

Blammo! -- ugly slides. Text too big, images cut off, broken layouts everywhere. The wreckage was so horrifying that two design students in the audience were forced to avert their eyes.

There are several ways to solve this problem, but here's the trick I use all the time: Instead of showing your slides in PowerPoint, make your presentation deck into a PDF, and show that. PDFs will scale, unlike PowerPoint, and will capture your presentation faithfully, no matter how different the resolution might be.1

Here's how you do it:


1. Install software that allows you to create a PDF

Several free software packages allow you to create a PDF.

  • LibreOffice is a free, open source office suite (essentially a replacement for Microsoft Office), which has a PDF export feature. There are versions for Windows, Macs, and Linux. (An earlier version was called OpenOffice.) You can use LibreOffice Impress (a PowerPoint work-alike) to create your presentation, or just open your PowerPoint file in Impress to export it to PDF. This is what I usually use.
  • PDFCreator is a free and open source program to help you create PDFs in Windows. When you install it, PDFCreator makes a new printer device. Then you print your document -- in this case, your presentation slides -- to this special printer, and it saves the output as a PDF. This works from any program that can print, not just PowerPoint.
  • On a Mac, it's even easier -- the capability to print to a PDF is built right into the system. This tutorial from MIT shows how to do it (with pictures).
  • Microsoft has even created a downloadable add-in to allow you to generate PDFs from within Microsoft Office.
  • If you happen to have paid for Adobe Acrobat (not just Acrobat Reader), you can generate PDFs with Acrobat.

2. Create your presentation in PowerPoint in the normal way

3. Generate a PDF of your final presentation

Create your PDF by printing to the PDF printer from within PowerPoint. If you're using LibreOffice, just click the PDF button in the toolbar to create it.

4. Deliver your presentation from your PDF instead of your PowerPoint

To do this, just open the PDF in the normal way with Acrobat Reader or other PDF-viewing software.

The trick is to make it display full-screen. In Acrobat Reader, hit Ctrl-L to make it full-screen. In full-screen mode, use the arrow-keys or click the mouse to make the slides advance. (Left mouse button, down-arrow, or right-arrow all advance one slide; right mouse button, up-arrow, or left-arrow all go back one slide.) When you want to exit full-screen mode, hit the Esc key. (Mac users can use Adobe Reader, or see Apple's instructions for using Preview in full-screen mode.)

By delivering your presentation from a PDF, instead of a PowerPoint file, each slide will retain the layout you created for it originally. If the resolution is not the same once the projector is plugged in, a PDF will scale down gracefully, instead of scaling some elements and leaving others full-size as PowerPoint does.

Then you will be free to concentrate on delivering an excellent presentation -- not worrying about the visual mess your carefully-crafted slides could become.

  1. Well, I'm not sure if animation or sound will work. Probably not, actually. But I see that as a feature, not a bug. ↩

Posted late Wednesday morning, January 8th, 2014 Tags:

In an earlier post I discussed some of the limitations of full-text searching in digitized copies of historic mining engineering literature, and suggested several historic index publications specific to mining engineering that could be used to augment your search by looking for information the old-fashioned way.

Mining engineering information also appeared in historic indexes that covered engineering as a whole. Third-party indexes that cover engineering topics generally include at least some of the more popular mining engineering journals in their coverage. These indexes can help you cover more ground, and can also help unearth the occasional mining-related article in a general engineering periodical that might not have gotten picked up by one of the more specialized indexes. Finally, general indexes can help reveal how technological change in the mining industry is related to the engineering industry as a whole.


For work published before 1893, the two index volumes compiled by Francis E. Galloupe are valuable.

The most important index for very late 19th century and 20th century engineering is the Engineering Index. It began publishing its compendium volumes in 1892, with coverage back to 1884. The first two volumes were published by the Association of Engineering Societies, but beginning with vol. 3 (whose coverage starts in 1896), the work was done by The Engineering Magazine. Beginning with volume 5, published in 1906, the index covered one year per volume. Ownership has changed hands several times, but the Engineering Index, now known as the COMPENDEX, is still being published and updated today by Elsevier. A paid subscription is necessary to access it, but the index is very valuable, as the back files of the Engineering Index have all been digitized and are searchable via the database.

  • The Engineering Index
    v.1: 1884-1891
    v.2: 1892-1895
    v.3: 1896-1900
    v.4: 1901-1905
    v.5: 1906
    v.6: 1907
    etc.

Another useful third-party index is The Industrial Arts Index. Compared with the Engineering Index, the Industrial Arts Index had more application-oriented (and business-oriented) items. Coverage began for periodicals published in 1913. After the 1957 issue, the publication split, becoming the Business Periodicals Index and the Applied Science & Technology Index. The latter still exists as an EBSCO product. The main version covers from 1984 to the present, and an additional "Applied Science & Technology Index Retrospective" package contains complete data back to the beginning of the Industrial Arts Index in 1913. These are also paid subscription services.

  • The Industrial Arts Index (1913-1957)
  • Applied Science & Technology Index (1958--)
Posted early Saturday morning, November 16th, 2013 Tags: