napptools: the gory details
This post is an explanation of the napptools scripts, including how they transform NAPP download files into a SQLite database.
The tools
napptools
consists of three script programs:
-
napp2csv.sh
: A Bash script that uses traditional unix toolscut
,sed
, andtr
to chop a NAPP data file into its respective columns, guided by a SAS-format command file. This also creates secondary tables in .CSV format from the variable descriptions in the SAS file. -
csv2sqlite
: A public-domain AWK program written by Lorance Stinson (available from http://lorance.freeshell.org/csvutils/) to convert .CSV files into a series of SQL statements that load the data into a database. Two small changes were made by Eric Nystrom to Stinson’s original code to fix a bug and better fit the output to SQLite’s capabilities by specifying non-typed columns. -
nappbuild.sh
: A Bash script to employnapp2csv.sh
andcsv2sqlite
to create .CSV and .SQL files, then load them into a SQLite database.
Usage
- Get an account at http://nappdata.org, receive access approval, and
select your desired variables. Download the fixed-width text data file
itself, which will end in a
.dat.gz
extension, as well as a command file in SAS format. - Ensure all dependencies are met. On most Linux systems the only one
you may need to install will be
sqlite3
, the command-line client for the SQLite database package. - Run
nappbuild.sh
in the directory containing your data file and your command file, passing the name of the SQLite database you wish to create.- If your data file is
napp_00001.dat.gz
and your command file isnapp_00001.sas
then runnappbuild.sh
like so: nappbuild.sh -i -n napp_00001 -d MyNAPPData.db
- If your data file is
- From there, you can use your database from the SQLite command shell
sqlite3
or your favorite programming language.
Database structure
For databases created with napptools
, most of the NAPP data ends up
in a single large table, called data
. Each of the columns in data
is named for the NAPP field, such as SERIAL
, PERWT
, NAMELAST
,
etc. (Since SQLite’s column names are not case-sensitive, lower case
works fine too.)
Some of these columns have self-contained information, such as
NAMELAST
or OCCSTRNG
, but others contain a numeric code that will
typically need to be translated into human-readable
values. Translations for these codes were offered by NAPP in the
command file. The napptools
suite breaks those translations out of
the command file, into separate tables loaded into the SQLite
database. These secondary tables are named for the NAPP variable, and
always contain two columns, id
and desc
. With this information,
it is easy to use a SQL JOIN command to bring the translations into
your results, or you can refer to the codes directly if desired.
For example, to show the number of people listed in each category of
the race
variable in the state of New Mexico (stateus
value of
35):
SELECT race.desc, count(*)
FROM data
JOIN race ON data.race = race.id
WHERE stateus = 35
GROUP BY race.desc
Some caveats
- The SQLite index generation routine is rather crude, as it makes an
index for every column in the
data
table and theid
column in all secondary tables. This is likely overkill, but there’s no doubt column indexes on at least some of the columns helps many queries. - This was designed and used on a Debian Linux system. It seems likely that it will be portable to similar unix-based systems as long as the dependencies are all met, but YMMV.