December 6, 2019

Sometimes one just wants to query some data, right? I recently found myself again wanting to query the excellent work of the Supreme Court Database (SCDB for short), which is an important resource for legal historians and political scientists of all stripes. Several years ago, I pulled together a quick SQLite database and have used it since then. It's now time to generate a fresh database with updated SCDB data, so I figured I would document how I do it, in case it might be useful to others.

Broadly, the challenge is that the CSVs of data available from SCDB do not contain the descriptive labels for some variables, so this information has to be sought and loaded into the database separately from the main CSV file of data.


Assumptions

This documentation presumes you are operating on a Linux machine at a command-line terminal, or in a somewhat similar setup (e.g., a Mac with the appropriate tools installed via Homebrew), and that sqlite3, awk, unzip, and related tools are installed and functioning properly.

Download the data

The first step, of course, is to download the SCDB data. SCDB offers two broad sets of information, each sliced in a number of ways, then presented in several formats. The "Legacy Database" contains information about cases from 1791 to 1945, and the "Modern Database" contains cases from 1946 to the near-present. Each of these has a release number, so as of this writing "Legacy 05" is the most recent version of the early data, and "2019 Release 01" is the most recent modern data, covering through the court's 2018 term. Go to http://scdb.wustl.edu/data.php to see what's available.

SCDB offers both the Modern and Legacy databases in "case centered" or "justice centered" form. In the "case centered" data, each case appears as one row of information. This is what I needed for my own purposes. If one wanted the "justice centered" data instead, the procedure below would be much the same, though a few additional fields would be added.

To overcome the variable label problem, we will need to use a CSV for the data and extract the labels from an SPSS .sav file. Download the zipped .csv and .sav files for the "case centered" data. I need both the Modern and Legacy databases, but the procedure is the same if only one or the other is needed.

Unpack and combine the data

Go to your working directory, and unzip the files that were downloaded (2 .csv and 2 .sav if you are using both the Legacy and Modern data). Thankfully, the good people of the SCDB put the same fields in both Legacy and Modern data, so they can be combined without issue, so long as the header is removed from the second one.

   cat SCDB_Legacy_05_caseCentered_Citation.csv > scdb-combined_leg05-201901.csv
   sed 1d SCDB_2019_01_caseCentered_Citation.csv >> scdb-combined_leg05-201901.csv

You can check the lengths of the input .csv files and the resulting combined file using wc -l, to ensure that the combined file is exactly one line shorter than the two inputs together. (That is, it is missing only the second file's header.)

Create a database and load the data

For this (and other) databases, I prefer to use SQLite. It is free and open source, cross platform, stable, easy to set up and use, straightforward to back up, and generally excellent.

There are several ways to approach loading the data. For instance, a number of tools promise to convert a CSV to SQL statements, or in some cases to load it directly into a SQLite database. If the CSV file is simple, sqlite3 has a command, .import, which will read the CSV directly, creating table fields from the header line and loading the data. However, the quote marks in the SCDB data confuse SQLite, so this simple method can't be used here.

A second approach would be to use an appropriate tool to convert the SCDB .csv file into something without these troublesome properties, then load using SQLite's .import. For example, the utility csvtool can convert a CSV nicely to a TSV, handling the weird quoting. This TSV can then be imported into the database by specifying the correct separator to the .import function.

    csvtool -t COMMA -u TAB cat scdb-combined_leg05-201901.csv > scdb-combined_leg05-201901.tsv
    sqlite3 scdb.db -cmd ".separator \t" ".import scdb-combined_leg05-201901.tsv data"

It can be fun at this point to open the database and look around. You'll find all the SCDB data in a single table, named "data." SQLite is a "flexibly typed" database, so even though .import created text-type fields for every column, including the numeric ones, it does the right thing with numbers as needed.
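
For instance, a couple of quick sanity checks from inside the sqlite3 shell (nothing fancy -- just enough to confirm the import behaved):

    -- one row per case in the case-centered data
    SELECT COUNT(*) FROM data;

    -- rough check on coverage; term was imported as text, but four-digit
    -- years still sort correctly
    SELECT MIN(term), MAX(term) FROM data;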

Create the field label tables

Now comes the trickier part. In many instances, the SCDB represents descriptive text values with numeric codes in its data, to save space and speed processing. However, the CSV download does not tell you what these codes actually mean. So, we need to extract them from somewhere else and load them as additional tables in the SQLite database.

Again, there may be other ways to do this, but I used the tool spssread.pl (a Perl program, available from https://czep.net/data/spssread/) to extract the necessary information from an SPSS .sav file provided by SCDB. This helpful tool provides a list of fields and their meanings (along with a single-line header) when invoked with the -r switch, and the values of the fields when invoked with -l. This information is separated by TABs.

There are a couple of problems with this information that need to be accounted for, however. One challenge is that the labels inside SPSS are slightly different from those in the CSV (they are shorter). Also note that labels and descriptions read by spssread.pl have a NUL (^@ or \x0) at the end of each string -- this is a C thing -- so those should be stripped.

But since the same fields appear, in the same order, in both the SPSS file and the CSV file, we can do some fancy footwork with a few Unix utilities: get the list of fields from our TSV data file, then match those up (positionally) with the fields and descriptions from the SPSS file. We also filter out the characters noted above and remove an extraneous field noting whether the data is numeric or string. We save the result to a TSV, which you can inspect to confirm that the fields mapped correctly to each other.

    paste <(head -n1 scdb-combined_leg05-201901.tsv | tr '\t' '\n') \
       <(spssread.pl -r SCDB_2019_01_caseCentered_Citation.sav | \
       sed -e 's/\x0//g' -e 1d) | cut -f1,2,4 > fields.tsv         

Next, we can extract the values of these fields. I looked at the values from both the Legacy and the Modern SPSS files, and they were almost identical -- the one difference being that a few values (311-316 and 379-380) of the lawsupp variable had the last section notation lopped off in the Legacy SPSS file. Based on this, the values from the Modern file will work correctly for our combined database. We then extract the field values, removing the header and the trailing NULs, to make a TSV:

    spssread.pl -l SCDB_2019_01_caseCentered_Citation.sav | sed 's/\x0//g' | sed 1d > field-values.tsv

Then we use the field name translations in fields.tsv to translate the old SPSS label names to the correct ones from the CSV, which are also used in the database. Here I use one of my favorite text processing tools, awk. We then escape any single quotes for SQL (doubling them, specified in octal) and save the resulting TSV, field-values-corrected.tsv.

    awk 'BEGIN { FS = "\t" ; OFS = "\t" }
         NR == FNR { label[$2]=$1 }
         NR > FNR { print label[$1], $2, $3 }
         ' fields.tsv field-values.tsv |
         sed 's/\o047/\o047\o047/g' > field-values-corrected.tsv

Now that we have the field value information extracted, we can use it to create SQL statements, saved in a file that we will feed to SQLite. We wrap everything in a single transaction to speed up loading it into the database. First, create the label tables, by reading the field names from the corrected list. (We use the corrected field values list here, instead of the proper list of fields, because we don't want to create tables for labels that do not actually have any data.) When this is done, we'll add a blank line for clarity too.

    echo "BEGIN TRANSACTION;" > label-tables.sql
    while read line ; do echo "CREATE TABLE $line (id, desc);" ; done < <(cat field-values-corrected.tsv | cut -f1 | sort -u) >> label-tables.sql
    echo "" >> label-tables.sql

Then, in similar fashion, read the corrected field values TSV and create a SQL statement to insert each row into the relevant table. Note the use of octal to specify single quotes in the output, which is preferable to hex because octal escapes are always fixed-length.

    awk 'BEGIN { FS = "\t" }
    { print "INSERT INTO", $1 " VALUES (\047" $2 "\047,\047" $3 "\047);" }
    ' field-values-corrected.tsv >> label-tables.sql

Finally, we add the statement to end the transaction, and tell SQLite to read in this file, to create the new tables and add the value data.

    echo "COMMIT;" >> label-tables.sql
    echo ".read label-tables.sql" | sqlite3 scdb-combo-L05-201901.db

If any of the statements fail to load correctly, SQLite will print errors.
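
A couple of quick queries (a sketch -- majOpinWriter is just one of the coded variables, and is used again in the example below) can confirm that the label tables exist and hold sensible values:

    -- list the tables that now exist: the label tables plus the main data table
    SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name;

    -- spot-check one label table
    SELECT * FROM majOpinWriter LIMIT 5;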

Querying the database

Chances are pretty good that if you've embarked on this journey to begin with, you've done so because you know some SQL and want to use that knowledge to ask questions of the SCDB data.

In the database we've built here, the data itself is housed in the data table, and the labels for numeric codes are kept in other tables. Those other tables are named for the field in the main data itself -- in other words, the table holding label information for the "majOpinWriter" variable is named majOpinWriter. Each of the label tables has two fields -- id is the numeric part, and desc is the text label part. Use a SQL JOIN to connect these tables to the main data table. Bear in mind that sometimes fields do not have values, so a LEFT JOIN is usually safest.

For example, let's imagine we want to know how many majority opinions each of the justices wrote during the 1944 term -- the term that produced the infamous Korematsu case. This can be done right from the sqlite3 command line interface.

    .header on

    SELECT majopinwriter.desc AS Writer
      , COUNT(*) AS Opinions
      FROM data
      LEFT JOIN majopinwriter ON data.majopinwriter == majopinwriter.id
      WHERE term = '1944'
      GROUP BY Writer
      ORDER BY Opinions DESC;

We get the following results:

Writer         Opinions
HFStone        21
WODouglas      21
HLBlack        20
OJRoberts      20
FMurphy        18
FFrankfurter   16
SFReed         16
RHJackson      15
WBRutledge      9
(blank)         8

Conclusion

As is hopefully evident, SQL (generally) and SQLite (specifically) are powerful tools for analyzing datasets such as the Supreme Court Database. If your favorite dataset is not offered in exactly the format you might want, it is often straightforward to use open source tools to transform it into the format that makes it easiest for you to use.

Posted Friday afternoon, December 6th, 2019 Tags:

January 15, 2019

My colleague Dr. Brian Leech, a history professor at Augustana College, and I are seeking the opinions of mining historians about the best books in mining history. The survey is anonymous. The results of the survey will be publicized at a future conference of the Mining History Association (and perhaps other academic conferences), and a journal article exploring the results in greater detail will be prepared for consideration by an academic journal in the field. Finally, upon submission of the article, the de-identified data will be made publicly available through this website, hosted on servers of the Arizona State University library.

If you read mining history books, please take our survey. Your opinion is important, and helps us better understand the field! It consists of just six short questions and takes 8-12 minutes to complete. Please visit http://bit.ly/MiningBooksSurvey to take the survey, which is available until March 31, 2019.

Thank you for your participation!

Posted early Tuesday morning, January 15th, 2019 Tags:

December 8, 2017

For more than four years, I've been working on a digital legal history project called 9CHRIS -- the 9th Circuit Historical Records Index System. The potential for historical analysis that comes from having some 40,000 briefs and transcripts available is what keeps me perpetually interested in continuing to improve this project. Each time I open one of the documents, I think of the potential that such rich detail can offer to historians and others studying the West, especially the relationship between western places, western residents, and the law.

Lately, I have realized that there's a more simplistic factor that helps keep me improving 9CHRIS: the variety of work that is required helps keep upgrades from becoming dull. From the tedious escapism of hand-correcting the records, to the puzzle-solving of deriving meaning from ugly text, to learning about servers, HTML, Markdown, SQLite, and git, to the inspirational feeling of having a senior scholar get in touch about how they have been using it to inform their work, each element of the project requires a different kind of skill and delivers a different sort of reward. I can "play around" with something new, and if it seems to work and there might be promise in it, I can then formalize it as part of the project; but if it ends up being just a curiosity or a dead-end, there was satisfaction in the exploration on its own.

Over the last few months, these two threads have come together in a new effort at improving the data. The identification of documents in the first version of 9CHRIS was mostly done by computers, which were largely but not entirely correct. I hand-corrected some of the most egregious examples where the computer guessed wrong, and the system was generally usable as a result. Recently, however, I realized that if the whole dataset could be hand-corrected, it would open a number of possibilities for future use. Other information could be layered atop the corrected data, and this might, in turn, open up prospects for much more interesting analysis. Consequently, I began hand-correcting each of the 3,359 volumes in 9CHRIS, inaugurating a new phase of the project. Beginning with volume 0001, and moving sequentially, I flip through each volume page by page, noting the beginning of each of the documents it contains, and updating the 9CHRIS site with the correct finds.

Today I passed a milestone, having corrected volume 1000. With each volume I correct, searches improve, errors are removed from the dataset, and I get new ideas about what can be done with all this data once every document has been correctly identified. Once this hand-correcting phase is complete, future project initiatives will construct layers of additional metadata on this foundation. Even before all the corrections are complete, however, the existing system will be gradually improving with each corrected document.

Posted Friday afternoon, December 8th, 2017 Tags:

This is a short post about teaching.

In a lower-level general education history course I am teaching this semester at ASU, I encourage students to make a small handwritten study sheet (a.k.a. "crib sheet" or "cheat sheet") to use in class while taking each exam, to jog their memories, help them better marshal specific evidence to use in their answers, and reduce exam anxiety.[^cribsheets] I also hoped the process of preparing the handwritten sheet would be a valuable study exercise, encouraging them to comprehensively scan the material as they looked for facts to include, and reinforcing key concepts by writing them on the sheet.[^notcoding]

[^cribsheets]: The literature on the effectiveness of study sheets is generally positive, though few researchers attribute major performance gains to them. See for example Brigitte Erbe, "Reducing Test Anxiety While Increasing Learning: The Cheat Sheet," College Teaching 55, no. 3 (2007): 96-98, DOI: 10.3200/CTCH.55.3.96-98; and Afshin Gharib and William Phillips, "Test Anxiety, Student Preferences and Performance on Different Exam Types in Introductory Psychology," International Journal of e-Education, e-Business, e-Management and e-Learning 3, no. 1 (February 2013): 1-6, DOI: 10.7763/IJEEEE.2013.V3.183.

[^notcoding]: Some research suggests that crib sheets do not effectively help students learn the material; see Thomas N. Dorsel and Gary W. Cundiff, "The Cheat-Sheet: Efficient Coding Device or Indispensable Crutch?" The Journal of Experimental Education 48, no. 1 (Fall 1979): 39-42. These authors, both psychologists, determined this by asking students to prepare crib sheets before a short, voluntary exam. The professors then took the sheets away from some students. The students without cheat sheets then did not do as well on the test as those who got to use the sheet they had prepared, showing that the crib sheets functioned as a "crutch" but not a device that had helped students internalize information during pre-test studying. The applicability of this study to longer, mandatory exams is not clear; also unreported is whether any student wished to bop the researchers on the nose.

When the day of the midterm exam arrived, I was surprised to see that not all students had made study sheets. In fact, only 15 out of 24 students (62.5%) made them. I collected the sheets with the exams, and later, when recording my grades, I noted whether or not each student had made a study sheet, so I could see if creating a sheet made a difference in their performance on the midterm.

Sheet     Students   Mean (of 40)   Std dev   Mean (pct)
With      15         32.93          2.62      82.3%
Without    9         28.50          6.47      71.3%

Perhaps unsurprisingly, each group showed a range of outcomes. In general, those who made the sheets did better on the exam than those who did not -- the average score was about 11 percentage points higher for those with study sheets. No study sheet user received a grade below a mid-C. For those students without study sheets, the range was greater, spanning failing grades all the way to a couple of As. (The exam was worth 40 points; the final table column adjusts the mean scores to a percentage basis.)

This is of course a very small set of results, doubtless showing the influence of more factors than the single one examined here -- whether or not a student made a crib sheet. It seems quite possible, for instance, that some students who didn't make a study sheet also didn't study comprehensively for the exam, which would mean use of a study sheet might simply be a marker of students with better study habits. The data is not fine-grained enough to see whether the improvement for crib sheet users came in the multiple choice/short answer questions, the essay questions, or both. But in any case the outcome is clear enough for me to continue encouraging students to make a study sheet when they have an opportunity to do so.

Posted at noon on Saturday, November 14th, 2015 Tags:

In working with manuscript census materials, modern data derived from them, and published documents from the Census Bureau, I found myself coming back to particular resources time and again. In a hope this might be of use to someone else, I've put together this list of those I use most frequently. If you have suggestions or corrections, please contact me or leave a note in the comments, below.

Manuscript materials

  • Main page for manuscript population schedules, 1790--1930, digitized from NARA microfilm, hosted on the Internet Archive: https://archive.org/details/us_census

  • The 1940 manuscript census, with full access provided at no charge by NARA.

  • The Unified Enumeration District Finder -- choose the desired year in the dropdown box in the title. Helps you find Enumeration Districts from addresses for the 1880-1940 censuses. In other words, it helps you find the key piece of information needed to look up a particular place in the census. A quirky but amazing site.

  • Modern re-creations of census forms, helpful for deciphering questions (i.e. these are readable, but double-check each against an original) or for taking notes on a limited number of people: http://www.cyndislist.com/us/census/forms/

Data sources

U.S. Census publications

  • Census Bureau: Published books from the census, in annoying linked PDF format: http://www.census.gov/prod/www/decennial.html (or via FTP here). The Dubester catalog is helpful in figuring out what's what, and what is missing. There is data available in tabular form in these publications that hasn't made its way to the databases, so it can still be helpful to access them. These particular scans are not high quality, unfortunately, but far better than no access at all.

  • Census Bureau Library, on archive.org. Contains mostly mid-late 20th century reports. Difficult to search, as may be imagined. https://archive.org/details/CensusBureauLibrary

  • Census Bureau FTP site, includes many historic publications (note, mirrored on IA in a huge tarball): ftp://ftp.census.gov

  • Henry J. Dubester, Catalog of United States Census Publications, 1790-1945 (Washington: GPO, 1950). The standard, though the Census Bureau has stopped listing the Dubester numbers for early publications on their site, so it's not quite as important to use it to navigate. Still helpful to make sure you have seen what there is to see, and to help decipher similarly-named documents. Available as one half of this huge PDF, or at HathiTrust (Note: The first link above has a second piece that picks up where Dubester left off and continues to 1972.) Kevin Cook published a revised version in 1996 that provides SuDoc numbers for the publications listed by Dubester, making them much easier to find in a modern Government Publications depository library.

  • Henry J. Dubester, State Censuses: An Annotated Bibliography of Census of Population Taken After the Year 1790 by States and Territories of the United States (Washington, DC: Government Printing Office, 1948), 73 pages. Google Books

  • Yearly Catalogs of Census publications (more recent era): https://www.census.gov/prod/www/catalogs.html

  • Jason G. Gauthier, Measuring America: The Decennial Censuses from 1790 to 2000 ([Washington]: U.S. Census Bureau, 2002), http://www.census.gov/prod/2002pubs/pol02-ma.pdf This is a very detailed chronology of census information that can be quite helpful in figuring out exactly what was asked when, and what survived. Includes images of the population schedules.

Commercial services

I try to avoid these, but especially if you are trying to trace the history of a particular individual, they are powerful and can save you some legwork. Note that if you happen to be in the vicinity of a National Archives facility, you can use them (and others) for free on-site, as part of a deal struck when NARA began permitting the companies to digitize NARA records in huge batches.

Posted mid-morning Monday, March 16th, 2015 Tags:

As I've noted before, manuscript records collected by the census can be fascinating and informative windows to the past. They can be used to learn more about groups of people that appear only occasionally in the historical record, and since they are generally well-structured, they can be used (with care) to ask data-driven historical questions. When most historians think of using historical census sources, it's the forms from the decennial census of population that come immediately to mind. These are well-known sources, but there are always fresh nuances to discover. I stumbled over one of these just the other day.

This was news to me: the 1910 federal census used different forms to record American Indians. These filled-out forms were combined on microfilm with the manuscript schedules used to record the rest of the population, but the Indian-specific forms reflected and reinforced American Indians' unequal place in American society.

But then one find led to another: in the course of trying to find out more about these forms, I quickly became aware of how little I knew about finding American Indians in the census. In the 1800s, American Indians were only rarely counted by enumerators in the way the non-Indian population was. On the other hand, the U.S. government occasionally tried to record American Indians specifically. So policies of separation and difference ended up having an unintended outcome, leaving behind primary sources that help us know more about these groups than we otherwise might.

The "Indian Census Rolls"

Though information about American Indians can be found in a variety of census publications, one of the largest is the set of microfilmed copies of the Indian Census Rolls, 1885-1940. These forms were the result of instructions to federal Indian agents to tally all of the American Indians living on reservations under their jurisdiction. As this detailed article from the National Archives makes clear, despite the wide scope of coverage, the forms did not cover every recognized group of American Indians, nor did they list non-affiliated members.

Image from Indian Census, 1933

But these documents do have certain advantages over traditional census forms. Unlike the regular census, the "Indian Census" rolls were supposed to be recorded or updated every year. Many of the records were typed instead of handwritten (hooray!), and for some years, records of individuals included a direct reference to the same person on the previous year's form, greatly easing the work of tracing a person through time.

The Indian Census Rolls were long available only on microfilm, which limited access to them. Recently they have been digitized and made available for searching through genealogy websites ancestry.com and fold3.com, where they are available to paid subscribers. If you are just looking for the name of a particular ancestor, the paid sites are undoubtedly the easiest way to find that needle in a haystack. But the records can also be accessed for free with a little extra work, as the microfilm reels published by the National Archives have been digitized by the Internet Archive. Usage patterns for 9CHRIS, a site I built to help make historic court records accessible, suggest that there is a lot of demand for greater access to federal records about American Indians, so I thought it might be useful to describe how to use the freely-available Indian Census.

Using the Finding Aid and the Internet Archive

Here's how to find the rolls for a particular American Indian tribe or group in the freely-available Indian Census sources:

  1. First look them up in the finding aid to find out what "agency" was responsible for reporting about them. Agencies were units of geography, which sometimes (but not always) reflected federally-designated reservations. Sometimes a single agency might report about several tribal groups, and the opposite was also sometimes the case, where a particular tribe might fall under the jurisdiction of several agencies.

    For example if I were looking for the Washoe (also spelled "Washo"), I would see that the Bishop, Carson, and Walker River agencies each had jurisdiction.

  2. Next, use the second list, located later in the same finding aid document, to find out what reels of microfilm contain the records for that agency. (The agencies are listed in alphabetical order.) Some agencies share space on the same reel of microfilm, and in other cases the agency's records are spread across multiple reels, with a few years to each reel. Note the reel number you are interested in, in the left column.

    To continue the example, the Carson agency has records on reels 18, 19, 20, and 21. If I wanted to see records from 1933, I'd choose reel 20.

  3. Now we can go to the Internet Archive and look for the reel we want. I searched for "Indians of North America" AND census AND "Reel 020" (note the reel number is always three digits) and it returned precisely the result I wanted.

  4. The reel can be read online, or downloaded as a (very large) PDF. Within each agency, the records are typically divided by year, and then by the "reservation" or administrative unit within the broader agency, which are organized alphabetically within each year.

    In this case, the records from 1933 were at the beginning of the reel. I skipped over several administrative units (Austin, Battle Mountain, Beowawe) before coming to the Carson Valley subunit, where the people I was looking up lived.

  5. To save a copy of a page in the online viewer, right-click on the image and choose "Save Image As..." to download it. The Internet Archive also built their online viewer so that each specific page in every document has a unique URL, so you can just copy and paste the URL from your browser bar.

Conclusion

Historians, social scientists, independent researchers, and genealogists all make extensive use of the manuscript historical census. The discrimination and unequal treatment faced by many ethnic and racial minorities in the past is sometimes reflected in their absence or unequal treatment in the census. But sometimes, as in the case of some Native Americans, these social attitudes led to policies that left behind, as a side effect, documentation that helps us understand their lives and history better than we would otherwise.

Posted early Thursday morning, September 11th, 2014 Tags:

Census microdata, such as that produced by the NAPP project, can help illuminate interesting issues surrounding work and labor.

Investigating the history of work using microdata must begin with the questions asked by the census enumerators about employment. These varied depending on the country and year, but were generally quite simple. For example, the U.S. 1880 census form had two questions pertaining to work, and a third that hinted toward labor as well. Question 13 recorded the "Profession, Occupation, or Trade of each person, male or female," and Question 14 asked the "Number of months this person has been unemployed during the Census year." These two questions were not supposed to be asked of any individual "under 10 years of age," according to the instructions.[^fnten] Question 15 also implied work, asking if the person had been sick or disabled, "so as to be unable to attend to ordinary business or duties." The recorded occupation, Question 13 in the case of the US 1880 census, was written free-form in the blank.

Transcribing the occupation

In the NAPP dataset, the occupation as it was transcribed by volunteers is found in the field OCCSTRNG. The small space on the form and the frequently creative spelling and abbreviating style of the enumerators can lead to some unreliable results if taken alone.

Let's look at mining engineers in the 1880 US microdata as produced by NAPP. We have 417 entries where OCCSTRNG is MINING ENGINEER. Good! But we also have other variations, such as:

MINING ENGINEAR
MINING ENGERNEER
MINING ENG.
MINGING ENGINEER
MG. ENGINEER
ENGINEER MINING
MINING ENGR
MINING EXPERT
MININING ENGINEER

Some of these might be small variations on the overall category, worth noting and investigating. (What's the difference, in 1880, between a Mining Expert and a Mining Engineer?) But many are clearly mining engineers, just spelled differently.

While this variable by itself is valuable, it can be difficult even for simple operations -- say, counting the number of mining engineers in a particular state -- because of the spelling differences. To address this limitation, the NAPP team created many additional variables to allow researchers to compare occupational information more broadly.
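
To see why the raw string alone is awkward to work with -- and why those constructed variables matter -- here is a rough sketch of the kind of query the raw string forces; the LIKE patterns below are only guesses, still miss variants such as MG. ENGINEER, and may sweep in unrelated occupations:

-- an exact match misses every variant spelling
SELECT COUNT(*) AS exact_matches
FROM data
WHERE occstrng = 'MINING ENGINEER';

-- looser patterns catch more variants, but remain incomplete
SELECT occstrng, COUNT(*) AS people
FROM data
WHERE occstrng LIKE '%MINING%ENG%'
   OR occstrng LIKE '%ENGINEER%MINING%'
GROUP BY occstrng
ORDER BY people DESC;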

Constructing variables about work

From these tiny bits of inconsistent information about each person's occupation, NAPP adds tremendous value by deriving new variables. These are "constructed variables."

As you can see in the full list, not all variables are available for each sample, in part because the census questions asked about labor could vary depending on country and year. Additionally, some of the variables have essentially similar information, but NAPP offers a variety in order to make it easier to connect NAPP data with other data sets.

Some of these constructed variables are simple and intended to help with other comparisons. For example, the LABFORCE variable simply records if a person participated in the work force or not (or if it was impossible to tell).[^fnlabforce] By itself, this might not seem very useful, but it could be helpful in conjunction with other variables. For example, you might want to compare people in one occupational sub-group -- farmers, for example -- with all people who were a part of the labor force rather than the public as a whole.
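
As a rough sketch of that idea, using the SQLite setup described below (this assumes LABFORCE was included in the data extract so that a labforce label table exists, and it borrows the mining engineer code from the example later in this post in place of a farmer code):

-- people in and out of the labor force, and how many in each group
-- were coded as mining engineers (OCCHISCO 2700)
SELECT labforce.desc AS LaborForceStatus
, COUNT(*) AS People
, SUM(CASE WHEN occhisco = 2700 THEN 1 ELSE 0 END) AS MiningEngineers
FROM data
JOIN labforce ON data.labforce = labforce.id
GROUP BY LaborForceStatus;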

Other constructed variables group together workers by their occupation. This helps solve the sort of problem we faced above with misspellings of "mining engineer." OCCHISCO and OCC50US are two of these variables. Each uses a mid-twentieth century list of occupations as a starting point, with adaptations to better represent historical occupations. It is important to remember that not all occupations would have fit neatly into one of these later occupational categories, and conversely, sometimes very different occupations get lumped together inadvertently.[^fnsurveyor] Even so, this can be an important way to identify relatively fine-grained occupational information.

Some occupations have a built-in hierarchy of status that might be difficult to capture using the OCCHISCO codes alone. A "Mine Laborer" and a "Miner" are different things, but both might reasonably belong in OCCHISCO category 71120. The OCSTATUS variable records any known hierarchical information from the occupation field. So while a miner and a mine laborer would both be in the same OCCHISCO category, a "Laborer" would have OCSTATUS 33, where the miner might have an OCSTATUS of 32 ("Worker or works in"), 00 ("None"), or 99 ("Unknown").
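
A sketch of how that distinction can be pulled apart, again assuming an ocstatus label table was built alongside the data table in the database setup described below:

-- break down OCCHISCO category 71120 by occupational status label,
-- separating "Laborer" entries from the rest
SELECT ocstatus.desc AS Status
, COUNT(*) AS Workers
FROM data
JOIN ocstatus ON data.ocstatus = ocstatus.id
WHERE occhisco = 71120
GROUP BY Status
ORDER BY Workers DESC;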

NAPP also includes constructed variables that suggest, in a relative way, the wealth or status associated with an occupation. SEIUS uses the "Duncan Socioeconomic Index," which was developed in the 1950s, to create an occupational rank that considers income, education, and prestige. These scores are tied to the way those factors were perceived in the 1950s -- this means that they can be compared across decades, as the scores will always be the same for a given occupation; but it also anachronistically frames prestige in 1950 terms. For example, a "miner" receives a SEIUS score of 10, which is fairly low.[^fnsei] But perhaps mining carried more prestige in 1880? For obvious reasons, there is considerable debate about the usefulness of this measure, but like all of these constructed variables, it may be helpful if used carefully.[^fnseicaveat]

The NAPP variable OCSCORUS provides a related measurement, classifying each occupation according to its economic status in 1950. Unlike the SEIUS score, which factored prestige or status of an occupation into its calculations, the OCSCORUS is based only on the earning power of that job classification in 1950. As with SEIUS, there are obvious problems with anachronistically comparing 1880 job types based on what those job types earned 70 years later. However, if used carefully, OCSCORUS, like SEIUS, can put all workers somewhere on a universal scale in order to compare them.

A simple example

Let's take a look at a simple example and the SQL code needed to calculate it. (Here's how I set up a SQLite database with NAPP data.)

Where did most mining engineers live?

Let's begin with a straightforward question: Where did mining engineers live in 1880? We will use the OCCHISCO variable to look for them, noticing that the value "02700" is "Mining engineers." Let's group them by state, but notice that NAPP provides a variety of geographic levels that could be used here, from simple measures of urbanity (URBAN), to small divisions such as enumeration district (ENUMDIST) and county (COUNTYUS), up to regional groupings of states (REGIONNA).

This SQL code will produce the table below, using a JOIN to grab each state's name from the auxiliary table. (Note: I have manually folded the table to take up less vertical space.)

SELECT stateus.desc AS State
, count(*) AS Engineers
FROM data
JOIN stateus ON data.stateus = stateus.id
WHERE occhisco = 2700
GROUP BY State
ORDER BY Engineers DESC
;
State Engineers State Engineers
California 146 Colorado 83
Pennsylvania 66 New York 57
Michigan 24 West Virginia 23
Arizona Territory 22 Illinois 20
Nevada 20 Massachusetts 17
Utah Territory 17 Missouri 14
New Jersey 14 Ohio 10
New Mexico 9 Virginia 8
Idaho Territory 7 Montana Territory 7
North Carolina 7 Iowa 6
Georgia 5 District of Columbia 4
New Hampshire 4 Tennessee 4
Arkansas 3 Connecticut 3
Kentucky 3 Maine 3
South Dakota 3 Alabama 2
Indiana 2 Maryland 2
Rhode Island 2 South Carolina 2
Delaware 1 Nebraska 1
North Dakota 1 Oregon 1
Wyoming Territory 1

Unsurprisingly, many mining engineers were found in the American West, where mining was booming. The strong numbers in Pennsylvania reflect the importance of the anthracite and bituminous coal industry there. But these numbers can also help remind us of the close association of engineering expertise with capital, as in the case of those located in New York, New Jersey, and Massachusetts. Similarly, they might help remind a researcher who focuses on Western mining of the growing importance of coal production in the Midwest, in states such as Ohio, Illinois, and Iowa. Unexpectedly high or low numbers can help prompt deeper investigation. For instance, are a mere two engineers in Alabama a sign that the state's major coal industry had yet to reach substantial levels of development, or that its mines were worked without significant engineering oversight?

Caveats about microdata and labor history

As with any data derived from the historical manuscript census, there are sometimes problems with NAPP's occupational data. Some of these problems arise from the recording, transcription, and coding phases. If the enumerator heard the person incorrectly (or could not spell well), or if a volunteer could not make out the handwriting or assumed the word was a different one, or if the occupation did not clearly fit any one category and a "best guess" had to be made in classification, errors might be introduced. It is also particularly risky to assume that job categories, and their relative status, remained consistent over the decades.

Other issues stem from the nature of the census itself. Most census forms only permitted one occupation to be listed.[^fnnorway] Occasionally enumerators tried to squeeze two occupations in the space, such as "MINING & CIVIL ENGINEER." But most frequently the other work was simply not counted. What if a person was a farmer during the summer months and worked at mining during the winter? Only one occupation could be recorded.

Census workers were supposed to record a person where they were found on a particular day or month, such as June 1880. This specificity could contribute to errors in recording people whose work was seasonal or took them far from home, such as a mining engineer on a summer-long consulting trip in the mountains of the American West.

Similarly, the census takers did a poor job understanding and accounting for the work of women and children. A woman who took in boarders or washing made a tangible contribution to her household's economic prosperity, but this was often overlooked by enumerators who would frequently simply record "Keeping House."

Job insecurity is difficult to determine in the NAPP data. It was not well recorded by the census, especially in older censuses, which provide only crude measures of unemployment and no data at all about underemployment. In the 1880 US census, for example, there is a column to mark how many months a person has been unemployed, but this could at best unevenly reflect the cycle of on-again, off-again work that typified many labor categories, such as anthracite miners in Pennsylvania. Compounding the issue, most of the NAPP data sets do not include this information, even if it had originally been recorded on the census. (Perhaps a potentially-sensitive issue such as unemployment had been deemed unnecessary to record by genealogist volunteers who, in some cases, originally compiled the data sets that were further extended by NAPP.)

Conclusion

NAPP microdata derived from the census can offer important information about historic patterns of work and labor. The data is by no means a perfect representation of work activity, and it can contain noteworthy errors. Even so, when used judiciously, this microdata can shed light on important questions about work that were central to life in the past.

[^fnten]: Some quick work with the database shows that this rule was hardly observed universally. While sometimes enumerators filled this field with age-appropriate information such as "ATTENDING SCHOOL," or crossed it out with an "X," (and typos in the age field may also have occurred), it is clear that children under 10 years of age worked in small numbers in a wide variety of occupations in 1880.

[^fnlabforce]: As usual, caveats apply and the documentation for each variable must be read carefully. For example, the LABFORCE variable is designed so as to report that people who are listed as having an occupation, but are 15 years old or younger, are automatically reported as not having an occupation. This would make it impossible to use LABFORCE to pursue certain kinds of occupational questions about child labor, for example.

[^fnsurveyor]: One example of inadvertent lumping of dissimilar work in the same category can be found in OCCHISCO value 03030, "Mine surveyors." In the 1880a US data set, only 15 individuals are placed in this category. With occupations such as "MINE SURVEYOR" (3), "MINING SURVEYOR" (2), and "SURVEYOR AND MINER" (1), some of these appear to be people who work for mining companies conducting underground surveying, which was often done by beginning-level mining engineers. Others in this category, such as "U.S. MIN'L SURVEYOR" (1) and "U.S. DEPUTY MINER SURV." (1) are quite different. These were experienced land surveyors appointed by the federal government to carefully survey the surface boundaries of any mining claim staked on public land. They would create reports and plat maps, swearing under oath as to their accuracy. US Mineral Surveyors, then, would seem to be a very different type of occupation than a mine surveyor, but because of the need to classify occupations they ended up in the same OCCHISCO category.

[^fnnorway]: Among NAPP data sets, Norway is an exception, and allowed census takers to record two occupations.

[^fnsei]: The average SEIUS value for all members of the labor force in 1880 is 19.70. By way of comparison, mining engineers (OCCHISCO = 02700) have an SEIUS score of 85.

[^fnseicaveat]: See this cautionary note from the IPUMS documentation. Note: NAPP's SEIUS is called SEI in the IPUMS-USA data and documentation.

Posted late Saturday morning, July 26th, 2014 Tags:

Cover image of Seeing Underground

It's finally here! Seeing Underground: Maps, Models, and Mining Engineering in America, published by the University of Nevada Press, is available in hardcover and kindle editions. I examine how maps and models created a visual culture of mining engineering, helping American mining engineers fashion a professional identity and occupational opportunities for themselves.

I'm grateful for all the support and assistance I've received as I've chased this fascinating history.

Here's the description from the press:

Digging mineral wealth from the ground dates to prehistoric times, and Europeans pursued mining in the Americas from the earliest colonial days. Prior to the Civil War, little mining was deep enough to require maps. However, the major finds of the mid-nineteenth century, such as the Comstock Lode, were vastly larger than any before in America. In Seeing Underground, Nystrom argues that, as industrial mining came of age in the United States, the development of maps and models gave power to a new visual culture and allowed mining engineers to advance their profession, gaining authority over mining operations from the miners themselves.

Starting in the late nineteenth century, mining engineers developed a new set of practices, artifacts, and discourses to visualize complex, pitch-dark three-dimensional spaces. These maps and models became necessary tools in creating and controlling those spaces. They made mining more understandable, predictable, and profitable. Nystrom shows that this new visual culture was crucial to specific developments in American mining, such as implementing new safety regulations after the Avondale, Pennsylvania, fire of 1869 killed 110 men and boys; understanding complex geology, as in the rich ores of Butte, Montana; and settling high-stakes litigation, such as the Tonopah, Nevada, Jim Butler v. West End lawsuit, which reached the US Supreme Court.

Nystrom demonstrates that these neglected artifacts of the nineteenth and early twentieth centuries have much to teach us today. The development of a visual culture helped create a new professional class of mining engineers and changed how mining was done.

Posted late Wednesday afternoon, May 7th, 2014 Tags:

Historians often try not to fall in love with our sources, but sometimes I just can't help it. For me, aside from the chatty personal journal (what historian can resist?) and the underground mine maps I've studied for years, my greatest fondness may be for big compilations of small bits of data, called microdata.

What's microdata? It's small bits of information that, by themselves, might be virtually useless, but when aggregated and analyzed can show bigger trends.

The classic use of microdata is the census. Whether you filled out the form yourself or talked to the canvasser ("enumerator") who visited your house, small bits of information about you and your family were recorded. By itself, this isn't much -- half a dozen or more websites probably know more about you than the US Census Bureau does. But when placed together with similar information from other people, we can see broader trends -- this neighborhood is slipping into poverty, or that one has an emerging immigrant business community, or this county will need to plan for more school capacity because of the number of young children.

What if we could use that same kind of explanatory power to help understand the past? This question, and the use of microdata that it implies, has motivated historians since the advent of computers in the 1960s. The raw records of the census, which in the United States has been conducted every ten years since 1790, are a good source for this microdata. The raw records (termed "manuscript census" records) are released to researchers 72 years after they were created. One challenge is that they are handwritten (hence "manuscript"), meaning that any researcher would have to carefully transcribe the handwritten documents before being able to use them as microdata. But that transcription only needs to be done once, if researchers do a good job and are willing to share.

Thank goodness those people exist! Some of the best are at the Minnesota Population Center (MPC) at the University of Minnesota. They compile and make available microdata to researchers through several different projects. Their most famous one, in the research world, is IPUMS, the Integrated Public Use Microdata Series, but they also have important lesser-known microdata compilations.

The one I've used most is the North Atlantic Population Project (NAPP). The MPC collaborated with several other institutions on both sides of the Atlantic to make historical census microdata available from multiple countries. NAPP converts the data so that variables can be compared directly between countries. They also add additional variables, derived from information in each census, that extends the sorts of questions researchers can ask. Best of all, NAPP shares their microdata collections with researchers for free, provided you promise to cite it appropriately,[^nappcite] not redistribute the data, and not use it for genealogical purposes. In exchange for complying with these very moderate restrictions, NAPP makes available samples of census microdata covering the US, Canada, Great Britain, Sweden, Norway, and parts of present-day Germany during the 19th and early 20th century. Some of these samples are complete transcriptions of all of the manuscript census records, plus all the extra NAPP bells and whistles.

Most researchers use statistical software packages such as SPSS or STATA to browse and manipulate NAPP data. I took a different tack, creating a set of scripts that will manipulate a NAPP data file and load it into a database, where I can explore it with SQL, the standard language used to query databases. Learn more about the gory technical details. Over time, I will post some sample SQL statements here on this blog.

Microdata census records combined with NAPP's additional variables and a powerful search tool make it easy to dive deeply into historical patterns large and small. As long as the limitations of the original census sources are kept in mind, an extraordinary range of questions -- about race, work, geography, gender, age, the family, and more -- can be asked that would be difficult to answer any other way. Can you blame me for having a soft spot in my heart for historical microdata?

[^nappcite]: Minnesota Population Center. North Atlantic Population Project: Complete Count Microdata. Version 2.0 [Machine-readable database]. Minneapolis: Minnesota Population Center, 2008. Additionally, each data set offered by NAPP has its own citation.

Posted early Friday morning, January 17th, 2014 Tags:

This post is an explanation of the napptools scripts, including how they transform NAPP download files into a SQLite database.


The tools

napptools consists of three script programs:

  • napp2csv.sh: A Bash script that uses traditional unix tools cut, sed, and tr to chop a NAPP data file into its respective columns, guided by a SAS-format command file. This also creates secondary tables in .CSV format from the variable descriptions in the SAS file.

  • csv2sqlite: A public-domain AWK program written by Lorance Stinson (available from http://lorance.freeshell.org/csvutils/) to convert .CSV files into a series of SQL statements that load the data into a database. Two small changes were made by Eric Nystrom to Stinson's original code to fix a bug and better fit the output to SQLite's capabilities by specifying non-typed columns.

  • nappbuild.sh: A Bash script to employ napp2csv.sh and csv2sqlite to create .CSV and .SQL files, then load them into a SQLite database.

Usage

  1. Get an account at http://nappdata.org, receive access approval, and select your desired variables. Download the fixed-width text data file itself, which will end in a .dat.gz extension, as well as a command file in SAS format.
  2. Ensure all dependencies are met. On most Linux systems the only one you may need to install will be sqlite3, the command-line client for the SQLite database package.
  3. Run nappbuild.sh in the directory containing your data file and your command file, passing the name of the SQLite database you wish to create.
    • If your data file is napp_00001.dat.gz and your command file is napp_00001.sas then run nappbuild.sh like so:
    • nappbuild.sh -i -n napp_00001 -d MyNAPPData.db
  4. From there, you can use your database from the SQLite command shell sqlite3 or your favorite programming language.

Database structure

For databases created with napptools, most of the NAPP data ends up in a single large table, called data. Each of the columns in data is named for the NAPP field, such as SERIAL, PERWT, NAMELAST, etc. (Since SQLite's column names are not case-sensitive, lower case works fine too.)
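
For a first look, a few columns can be read straight out of the data table (a minimal example; it assumes NAMELAST and OCCSTRNG were among the variables selected for the extract):

-- peek at a couple of self-contained columns in the main table
SELECT namelast, occstrng
FROM data
LIMIT 10;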

Some of these columns have self-contained information, such as NAMELAST or OCCSTRNG, but others contain a numeric code that will typically need to be translated into human-readable values. Translations for these codes were offered by NAPP in the command file. The napptools suite breaks those translations out of the command file, into separate tables loaded into the SQLite database. These secondary tables are named for the NAPP variable, and always contain two columns, id and desc. With this information, it is easy to use a SQL JOIN command to bring the translations into your results, or you can refer to the codes directly if desired.

For example, to show the number of people listed in each category of the race variable in the state of New Mexico (stateus value of 35):

SELECT race.desc, count(*)
FROM data
JOIN race ON data.race = race.id
WHERE stateus = 35
GROUP BY race.desc;

Some caveats

  • The SQLite index generation routine is rather crude, as it makes an index for every column in the data table and the id column in all secondary tables. This is likely overkill, but there's no doubt that column indexes on at least some of the columns help many queries.
  • This was designed and used on a Debian Linux system. It seems likely that it will be portable to similar unix-based systems as long as the dependencies are all met, but YMMV.
Posted early Friday morning, January 17th, 2014 Tags: