napptools: the gory details
This post is an explanation of the napptools scripts, including how they transform NAPP download files into a SQLite database.
napptools consists of three script programs:
napp2csv.sh: A Bash script that uses traditional unix tools
trto chop a NAPP data file into its respective columns, guided by a SAS-format command file. This also creates secondary tables in .CSV format from the variable descriptions in the SAS file.
csv2sqlite: A public-domain AWK program written by Lorance Stinson (available from http://lorance.freeshell.org/csvutils/) to convert .CSV files into a series of SQL statements that load the data into a database. Two small changes were made by Eric Nystrom to Stinson’s original code to fix a bug and better fit the output to SQLite’s capabilities by specifying non-typed columns.
nappbuild.sh: A Bash script to employ
csv2sqliteto create .CSV and .SQL files, then load them into a SQLite database.
- Get an account at http://nappdata.org, receive access approval, and
select your desired variables. Download the fixed-width text data file
itself, which will end in a
.dat.gzextension, as well as a command file in SAS format.
- Ensure all dependencies are met. On most Linux systems the only one
you may need to install will be
sqlite3, the command-line client for the SQLite database package.
nappbuild.shin the directory containing your data file and your command file, passing the name of the SQLite database you wish to create.
- If your data file is
napp_00001.dat.gzand your command file is
nappbuild.sh -i -n napp_00001 -d MyNAPPData.db
- If your data file is
- From there, you can use your database from the SQLite command shell
sqlite3or your favorite programming language.
For databases created with
napptools, most of the NAPP data ends up
in a single large table, called
data. Each of the columns in
is named for the NAPP field, such as
etc. (Since SQLite’s column names are not case-sensitive, lower case
works fine too.)
Some of these columns have self-contained information, such as
OCCSTRNG, but others contain a numeric code that will
typically need to be translated into human-readable
values. Translations for these codes were offered by NAPP in the
command file. The
napptools suite breaks those translations out of
the command file, into separate tables loaded into the SQLite
database. These secondary tables are named for the NAPP variable, and
always contain two columns,
desc. With this information,
it is easy to use a SQL JOIN command to bring the translations into
your results, or you can refer to the codes directly if desired.
For example, to show the number of people listed in each category of
race variable in the state of New Mexico (
stateus value of
SELECT race.desc, count(*) FROM data JOIN race ON data.race = race.id WHERE stateus = 35 GROUP BY race.desc
- The SQLite index generation routine is rather crude, as it makes an
index for every column in the
datatable and the
idcolumn in all secondary tables. This is likely overkill, but there’s no doubt column indexes on at least some of the columns helps many queries.
- This was designed and used on a Debian Linux system. It seems likely that it will be portable to similar unix-based systems as long as the dependencies are all met, but YMMV.