Supreme Court Database for the SQL-minded
Sometimes one just wants to query some data, right? I recently found myself again wanting to query the excellent work of the Supreme Court Database (SCDB for short), which is an important resource for legal historians and political scientists of all stripes. Several years ago, I pulled together a quick SQLite database and have used it since then. It’s now time to generate a fresh database with updated SCDB data, so I figured I would document how I do it, in case it might be useful to others.
Broadly, the challenge is that the CSVs of data available from SCDB do not contain the descriptive labels for some variables, so this information has to be sought and loaded into the database separately from the main CSV file of data.
Assumptions
This documentation presumes you are operating on a Linux machine at a command line terminal, or in a setup that is somewhat similar (e.g. a Mac with appropriate tools installed via Homebrew), and that sqlite3, awk, unzip, and related tools are installed and functioning properly.
Download the data
The first step, of course, is to download the SCDB data. SCDB offers two broad sets of information, each sliced in a number of ways, then presented in several formats. The “Legacy Database” contains information about cases from 1791 to 1945, and the “Modern Database” contains cases from 1946 to the near-present. Each of these has a release number, so as of this writing “Legacy 05” is the most recent version of the early data, and “2019 Release 01” is the most recent modern data, covering through the court’s 2018 term. Go to http://scdb.wustl.edu/data.php to see what’s available.
SCDB offers both the Modern and Legacy databases in “case centered” or “justice centered” form. In the “case centered” data, each case appears as one row of information. This is what I needed for my own purposes. If one wanted the “justice centered” data instead, the procedure below would be much the same, though a few additional fields would be added.
To overcome the variable label problem, we will need to use a CSV for the data, and extract the labels from an SPSS .sav file. Download the .csv and .sav zipped files for the “case centered” data. I need both the Modern and Legacy databases, but the procedure is the same if only one or the other is needed.
Unpack and combine the data
Go to your working directory, and unzip the files that were downloaded (2 .csv and 2 .sav if you are using both the Legacy and Modern data). Thankfully, the good people of the SCDB put the same fields in both the Legacy and Modern data, so they can be combined without issue, so long as the header is removed from the second one.
cat SCDB_Legacy_05_caseCentered_Citation.csv > scdb-combined_leg05-201901.csv
sed 1d SCDB_2019_01_caseCentered_Citation.csv >> scdb-combined_leg05-201901.csv
You can check the lengths of the input .csv files and the resulting combined file using wc -l to ensure that the combined file is one line shorter than the two input files together. (That is, it’s missing the second file’s header.)
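For example, with the filenames used above:
wc -l SCDB_Legacy_05_caseCentered_Citation.csv SCDB_2019_01_caseCentered_Citation.csv scdb-combined_leg05-201901.csv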
Create a database and load the data
For this (and other) databases, I prefer to use SQLite. It is free and open source, cross platform, stable, easy to set up and use, straightforward to back up, and generally excellent.
There are several ways to approach loading the data. For instance, there are a number of tools that promise to convert a CSV to either SQL statements, or in some cases directly into loaded data in a SQLite database. If the CSV file is simple, sqlite3 has a command, .import, which will read the CSV directly, creating table fields from the header line and loading the data. However, the quote marks in the SCDB data confuse SQLite, so this simple method can’t be used here.
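For reference, with a well-behaved CSV the direct route would look something like the following (shown only for illustration – as noted, the SCDB’s quoting trips it up here):
sqlite3 scdb.db -cmd ".mode csv" ".import scdb-combined_leg05-201901.csv data"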
A second approach would be to use an appropriate tool to convert the SCDB .csv file into something without these troublesome properties, then load using SQLite’s .import. For example, the utility csvtool can convert a CSV nicely to a TSV, handling the weird quoting. This TSV can then be imported into the database by specifying the correct separator to the .import function.
csvtool -t COMMA -u TAB cat scdb-combined_leg05-201901.csv > scdb-combined_leg05-201901.tsv
sqlite3 scdb.db -cmd ".separator \t" ".import scdb-combined_leg05-201901.tsv data"
It can be fun at this point to open the database and look around. You’ll find all the SCDB data in a single table, named “data.” SQLite is a “flexibly typed” database, so even though the import created “text”-type fields for every column, including the numeric ones, it does the right thing with numbers as needed.
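For instance, a first look might go something like this (caseId, caseName, and term should be among the column names picked up from the CSV header):
sqlite3 scdb.db
.tables
.schema data
SELECT caseId, caseName, term FROM data LIMIT 5;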
Create the field label tables
Now comes the trickier part. In many instances the SCDB has boiled text values down to numeric codes in its data, to save space and speed processing. However, the CSV download does not tell you what these codes actually mean. So, we need to extract them from somewhere else and load them as additional tables in the SQLite database.
Again, there may be other ways to do this, but I used the tool spssread.pl (a Perl program, available from https://czep.net/data/spssread/) to extract the necessary information from an SPSS .sav file provided by SCDB. This helpful tool provides a list of fields and their meanings (along with a single-line header) when invoked with the -r switch, and the values of the fields when invoked with -l. This information is separated by TABs.
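If you want to peek at that output before wiring it into the steps below, invocations like these will show it:
# field names and their descriptions
spssread.pl -r SCDB_2019_01_caseCentered_Citation.sav | head
# field values and their labels
spssread.pl -l SCDB_2019_01_caseCentered_Citation.sav | head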
There are a couple of problems with this information that need to be accounted for, however. One challenge is that the labels inside SPSS are slightly different from those in the CSV (they are shorter). Also note that labels and descriptions read using spssread.pl have a NUL character (^@ or \x0) at the end of each string – this is a C thing – so those should be stripped.
But since the same fields appear, in the same order, in both the SPSS file and the CSV file, we can do some fancy footwork with some Unix utilities to get the list of fields from our TSV data file, then match those up (positionally) with the fields and descriptions from the SPSS file. We also filter out the NUL characters noted above, and remove an extraneous field noting whether the data is numeric or string. We save the result to a TSV, which you can inspect to confirm that the fields mapped correctly to each other.
paste <(head -n1 scdb-combined_leg05-201901.tsv | tr '\t' '\n') \
<(spssread.pl -r SCDB_2019_01_caseCentered_Citation.sav | \
sed -e 's/\x0//g' -e 1d) | cut -f1,2,4 > fields.tsv
Next, we can extract the values of these fields. I looked at the values from both the Legacy and the Modern SPSS files, and they were almost identical – the one difference being that a few values for the lawsupp variable (311-316 and 379-380) had the last section notation lopped off in the Legacy SPSS file. Based on this, the values from the Modern SPSS file will work correctly for our combined database. We then extract the field values, removing the header and the trailing NULs, to make a TSV:
spssread.pl -l SCDB_2019_01_caseCentered_Citation.sav | sed 's/\x0//g' | sed 1d > field-values.tsv
Then we use the field name translations in fields.tsv to translate the old SPSS label names to the correct ones from the CSV, which are also used in the database. Here I use one of my favorite text processing tools, awk. Then we can add SQL escape sequences for single quotes (in octal) to the output and save the resulting TSV, field-values-corrected.tsv.
awk 'BEGIN { FS = "\t" ; OFS = "\t" }
NR == FNR { label[$2]=$1 }
NR > FNR { print label[$1], $2, $3 }
' fields.tsv field-values.tsv |
sed 's/\o047/\o047\o047/g' > field-values-corrected.tsv
Now that we have the field value information extracted, we can use it to create SQL statements, saved in a file that we will feed to SQLite. We wrap everything in a single transaction to speed up loading it into the database. First, create the label tables, by reading the field names from the corrected list. (We use the corrected field values list here, instead of the proper list of fields, because we don’t want to create tables for labels that do not actually have any data.) When this is done, we’ll add a blank line for clarity too.
echo "BEGIN TRANSACTION;" > label-tables.sql
while read line ; do echo "CREATE TABLE $line (id, desc);" ; done < <(cat field-values-corrected.tsv | cut -f1 | sort -u) >> label-tables.sql
echo "" >> label-tables.sql
Then in similar fashion, read the corrected field values TSV and create a SQL statement to insert each row into the relevant table. Note the use of octal to specify single quotes in the output, which is preferable to hex because an octal escape is always fixed-length.
awk 'BEGIN { FS = "\t" }
{ print "INSERT INTO", $1 " VALUES (\047" $2 "\047,\047" $3 "\047);" }
' field-values-corrected.tsv >> label-tables.sql
Finally, we add the statement to end the transaction, and tell SQLite to read in this file, to create the new tables and add the value data.
echo "COMMIT;" >> label-tables.sql
echo ".read label-tables.sql" | sqlite3 scdb.db
It should throw errors if something was not read correctly.
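A quick spot-check against one of the new tables (majOpinWriter is used in the example query below) can confirm that the labels landed where expected:
# peek at one label table
sqlite3 scdb.db "SELECT * FROM majOpinWriter LIMIT 5;"
# count the tables created (the data table plus one per labeled variable)
sqlite3 scdb.db "SELECT count(*) FROM sqlite_master WHERE type = 'table';"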
Querying the database
Chances are pretty good that if you’ve embarked on this journey to begin with, you’ve done so because you know some SQL and want to use that knowledge to ask questions of the SCDB data.
In the database we’ve built here, the data itself is housed in the data table, and the labels for numeric data are kept in other tables. Those other tables are named for the field in the main data itself – in other words, the table holding label information for the “majOpinWriter” variable is named majOpinWriter. Each of the label tables has two fields – id is the numeric part, and desc is the text label part. Use a SQL JOIN to connect these tables to the main data table. Bear in mind that sometimes fields do not have values, so a LEFT JOIN is usually safest.
For example, let’s imagine we want to know how many majority opinions each of the justices wrote during the 1944 term – the term that produced the infamous Korematsu case. This can be done right from the sqlite3 command line interface.
.header on
SELECT majopinwriter.desc AS Writer
, COUNT(*) AS Opinions
FROM data
LEFT JOIN majopinwriter ON data.majopinwriter == majopinwriter.id
WHERE term = '1944'
GROUP BY Writer
ORDER BY Opinions DESC;
We get the following results:
Writer       | Opinions
-------------|---------
HFStone      | 21
WODouglas    | 21
HLBlack      | 20
OJRoberts    | 20
FMurphy      | 18
FFrankfurter | 16
SFReed       | 16
RHJackson    | 15
WBRutledge   | 9
(blank)      | 8
Conclusion
As is hopefully evident, SQL (generally) and SQLite (specifically) are powerful tools for analyzing datasets such as the Supreme Court Database. If your favorite dataset is not offered in exactly the format you might want, it is often straightforward to use open source tools to transform it into the format that makes it easiest for you to use.