December 6, 2019

Sometimes one just wants to query some data, right? I recently found myself again wanting to query the excellent work of the Supreme Court Database (SCDB for short), which is an important resource for legal historians and political scientists of all stripes. Several years ago, I pulled together a quick SQLite database and have used it since then. It's now time to generate a fresh database with updated SCDB data, so I figured I would document how I do it, in case it might be useful to others.

Broadly, the challenge is that the CSVs of data available from SCDB do not contain the descriptive labels for some variables, so this information has to be sought and loaded into the database separately from the main CSV file of data.

Assumptions

This documentation presumes you are operating on a Linux machine at a command line terminal, or in a comparable setup (e.g., a Mac with the appropriate tools installed via Homebrew), and that sqlite3, awk, sed, unzip, csvtool, and related tools are installed and functioning properly.
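
If you'd like to verify the tooling up front, a quick check along these lines (listing the utilities used later in this walkthrough; spssread.pl is covered separately below) will do:

    # report any of the needed command line tools that are not installed
    for tool in sqlite3 awk sed unzip csvtool; do
        command -v "$tool" >/dev/null || echo "missing: $tool"
    done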

Download the data

The first step, of course, is to download the SCDB data. SCDB offers two broad sets of information, each sliced in a number of ways, then presented in several formats. The "Legacy Database" contains information about cases from 1791 to 1945, and the "Modern Database" contains cases from 1946 to the near-present. Each of these has a release number, so as of this writing "Legacy 05" is the most recent version of the early data, and "2019 Release 01" is the most recent modern data, covering through the court's 2018 term. Go to http://scdb.wustl.edu/data.php to see what's available.

SCDB offers both the Modern and Legacy databases in "case centered" or "justice centered" form. In the "case centered" data, each case appears as one row of information. This is what I needed for my own purposes. If one wanted the "justice centered" data instead, the procedure below would be much the same, though a few additional fields would be added.

To overcome the variable label problem, we will need to use a CSV for the data and extract the labels from an SPSS .sav file. Download the zipped .csv and .sav files for the "case centered" data. I need both the Modern and Legacy databases, but the procedure is the same if only one or the other is needed.

Unpack and combine the data

Go to your working directory, and unzip the files that were downloaded (2 .csv and 2 .sav if you are using both the Legacy and Modern data). Thankfully, the good people of the SCDB put the same fields in both Legacy and Modern data, so they can be combined without issue, so long as the header is removed from the second one.
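
Unzipping might look something like this, assuming the downloaded archives carry the same base names as the files they contain:

    # unpack the case-centered CSV and SPSS files for both databases
    unzip SCDB_Legacy_05_caseCentered_Citation.csv.zip
    unzip SCDB_Legacy_05_caseCentered_Citation.sav.zip
    unzip SCDB_2019_01_caseCentered_Citation.csv.zip
    unzip SCDB_2019_01_caseCentered_Citation.sav.zip

With the CSVs unpacked, concatenate them, keeping only the first file's header: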

   cat SCDB_Legacy_05_caseCentered_Citation.csv > scdb-combined_leg05-201901.csv
   sed 1d SCDB_2019_01_caseCentered_Citation.csv >> scdb-combined_leg05-201901.csv

You can check the lengths of the input .csv files and the resulting combined file using wc -l to confirm that the combined file is one line shorter than the two input files together. (That is, it's missing the second file's header.)
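
For example (file names as above):

    # line counts for the two inputs and the combined file
    wc -l SCDB_Legacy_05_caseCentered_Citation.csv \
          SCDB_2019_01_caseCentered_Citation.csv \
          scdb-combined_leg05-201901.csv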

Create a database and load the data

For this (and other) databases, I prefer to use SQLite. It is free and open source, cross platform, stable, easy to set up and use, straightforward to back up, and generally excellent.

There are several ways to approach loading the data. For instance, there are a number of tools that promise to convert a CSV either to SQL statements or, in some cases, directly into loaded data in a SQLite database. If the CSV file is simple, sqlite3 has a command, .import, which will read the CSV directly, creating table fields from the header line and loading the data. However, the quote marks in the SCDB data confuse SQLite, so this simple method can't be used here.
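
For reference, the simple path would have looked something like the sketch below; as noted, the quoting in the SCDB CSV derails it, so it is shown only for illustration:

    # the naive approach: let .import build the table straight from the CSV
    # (the embedded quote marks in the SCDB data trip this up)
    sqlite3 scdb.db ".import scdb-combined_leg05-201901.csv data"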

A second approach would be to use an appropriate tool to convert the SCDB .csv file into something without these troublesome properties, then load using SQLite's .import. For example, the utility csvtool can convert a CSV nicely to a TSV, handling the weird quoting. This TSV can then be imported into the database by specifying the correct separator to the .import function.

    csvtool -t COMMA -u TAB cat scdb-combined_leg05-201901.csv > scdb-combined_leg05-201901.tsv
    sqlite3 scdb.db -cmd ".separator \t" ".import scdb-combined_leg05-201901.tsv data"

It can be fun at this point to open the database and look around. You'll find all the SCDB data in a single table, named "data." SQLite is a "flexibly typed" database, so even though it created "text"-type fields for every column, including the numeric ones, it does the right thing with numbers as needed.
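
A minimal look around might go something like this:

    # how many cases were loaded, and what columns were created?
    sqlite3 scdb.db "SELECT COUNT(*) FROM data;"
    sqlite3 scdb.db ".schema data"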

Create the field label tables

Now comes the trickier part. In many instances, the SCDB represents descriptive text values with numeric codes in its data, to save space and speed processing. However, the CSV download does not tell you what these codes actually mean. So, we need to extract them from somewhere else and load them as additional tables in the SQLite database.

Again, there may be other ways to do this, but I used the tool spssread.pl (a Perl program, available from https://czep.net/data/spssread/) to extract the necessary information from an SPSS .sav file provided by SCDB. This helpful tool provides a list of fields and their meanings (along with a single-line header) when invoked with the -r switch, and the values of the fields when invoked with -l. This information is separated by TABs.
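
To get a feel for the output before wiring it into anything, you can peek at the first few lines of each listing (file name as used below):

    # fields and their descriptions, then the value labels
    spssread.pl -r SCDB_2019_01_caseCentered_Citation.sav | head -n 5
    spssread.pl -l SCDB_2019_01_caseCentered_Citation.sav | head -n 5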

There are a couple of problems with this information that need to be accounted for, however. One challenge is that the labels inside SPSS are slightly different from those in the CSV (they are shorter). Also note that labels and descriptions read using spssread.pl have a NUL (^@ or \x0) at the end of each string -- this is a C thing -- so those should be stripped.

But since the same fields appear, in the same order, in both the SPSS file and the CSV file, we can do some fancy footwork with a few Unix utilities to get the list of fields from our TSV data file, then match those up (positionally) with the fields and descriptions from the SPSS file. We also filter out the NUL characters noted above and remove an extraneous field noting whether the data is numeric or string. We save the result to a TSV, which you can inspect to verify that the fields mapped correctly to each other.

    paste <(head -n1 scdb-combined_leg05-201901.tsv | tr '\t' '\n') \
       <(spssread.pl -r SCDB_2019_01_caseCentered_Citation.sav | \
       sed -e 's/\x0//g' -e 1d) | cut -f1,2,4 > fields.tsv         

Next, we can extract the values of these fields. I looked at the values from both the Legacy and the Modern SPSS files, and they were almost identical -- the one difference being that a few values of the lawSupp variable (311-316 and 379-380) had the last section notation lopped off in the Legacy SPSS file. Because of this, we can use the Modern file's values for our combined database. We then extract the field values, removing the header and the trailing NULs, to make a TSV:

    spssread.pl -l SCDB_2019_01_caseCentered_Citation.sav | sed 's/\x0//g' | sed 1d > field-values.tsv

Then we use the field name translations in fields.tsv to translate the old SPSS label names to the correct ones from the CSV, which are also used in the database. Here I use one of my favorite text processing tools, awk. We also escape any single quotes for SQL by doubling them (specified in octal in the sed command) and save the resulting TSV, field-values-corrected.tsv.

    awk 'BEGIN { FS = "\t" ; OFS = "\t" }
         NR == FNR { label[$2]=$1 }
         NR > FNR { print label[$1], $2, $3 }
         ' fields.tsv field-values.tsv |
         sed 's/\o047/\o047\o047/g' > field-values-corrected.tsv

Now that we have the field value information extracted, we can use it to create SQL statements, saved in a file that we will feed to SQLite. We wrap everything in a single transaction to speed up loading it into the database. First, create the label tables, by reading the field names from the corrected list. (We use the corrected field values list here, instead of the proper list of fields, because we don't want to create tables for labels that do not actually have any data.) When this is done, we'll add a blank line for clarity too.

    echo "BEGIN TRANSACTION;" > label-tables.sql
    while read line ; do echo "CREATE TABLE $line (id, desc);" ; done < <(cat field-values-corrected.tsv | cut -f1 | sort -u) >> label-tables.sql
    echo "" >> label-tables.sql

Then, in similar fashion, read the corrected field values TSV and create a SQL statement to insert that data into the relevant table. Note the use of octal to specify single quotes in the output, which is preferable to hex because it is always fixed-length.

    awk 'BEGIN { FS = "\t" }
    { print "INSERT INTO", $1 " VALUES (\047" $2 "\047,\047" $3 "\047);" }
    ' field-values-corrected.tsv >> label-tables.sql

Finally, we add the statement to end the transaction, and tell SQLite to read in this file, to create the new tables and add the value data.

    echo "COMMIT;" >> label-tables.sql
    echo ".read label-tables.sql" | sqlite3 scdb-combo-L05-201901.db

It should throw errors if something was not read correctly.
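
A quick spot check of one of the new label tables (for example, majOpinWriter, which appears in the query below) is also reassuring:

    # list the tables, then sample a few rows from one label table
    sqlite3 scdb.db ".tables"
    sqlite3 scdb.db "SELECT * FROM majOpinWriter LIMIT 5;"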

Querying the database

Chances are pretty good that if you've embarked on this journey to begin with, you've done so because you know some SQL and want to use that knowledge to ask questions of the SCDB data.

In the database we've built here, the data itself is housed in the data table, and the labels for numeric data are kept in other tables. Those other tables are named for the corresponding field in the main data -- in other words, the table holding label information for the "majOpinWriter" variable is named majOpinWriter. Each of the label tables has two fields -- id is the numeric part, and desc is the text label part. Use a SQL JOIN to connect these tables to the main data table. Bear in mind that sometimes fields do not have values, so a LEFT JOIN is usually safest.

For example, let's imagine we want to know how many majority opinions each of the justices wrote during the 1944 term -- the term that produced the infamous Korematsu case. This can be done right from the sqlite3 command line interface.

    .header on

    SELECT majopinwriter.desc AS Writer
      , COUNT(*) AS Opinions
      FROM data
      LEFT JOIN majopinwriter ON data.majopinwriter == majopinwriter.id
      WHERE term = '1944'
      GROUP BY Writer
      ORDER BY Opinions DESC;

We get the following results:

    Writer        Opinions
    HFStone       21
    WODouglas     21
    HLBlack       20
    OJRoberts     20
    FMurphy       18
    FFrankfurter  16
    SFReed        16
    RHJackson     15
    WBRutledge    9
    (blank)       8

Conclusion

As is hopefully evident, SQL (generally) and SQLite (specifically) are powerful tools for analyzing datasets such as the Supreme Court Database. If your favorite dataset is not offered in exactly the format you might want, it is often straightforward to use open source tools to transform it into the format that makes it easiest for you to use.