csvkit - a brief introduction
csvkit is a suite of command-line tools for converting to and working with CSV, the king of tabular file formats.
If the data analysis task is not that extensive and the data is relatively small, csvkit can be used to do some exploration and basic analysis of the data you have.
This may be all that is needed before loading the data into the powerhouse that is SQLite.
Converting data into csv format
From excel into csv
# convert data from a sheet in data.xls file into data.csv
in2csv data.xls > data.csv
From json into csv
# convert json data from data.json into a csv file called data.csv
in2csv data.json > data.csv
Exploring the csv data
What columns are in the data - csvcut
# show what columns are in the iris.csv file
csvcut -n iris.csv
Peeking at the data - csvlook
Commands from csvkit can be piped into other csvkit commands and also bash commands.
# look at the first 10 rows of the iris.csv file in a pretty format
csvlook iris.csv | head -n 10
Peeking at a select columns of data - csvcut -c
# isolating certain columns can be achieved using `awk`
cat iris.csv | awk -F ',' '{ print $2,$5 }' | csvlook | head -n 10
# csvkit has a command called `csvcut` which can be used to isolate columns also
cat iris.csv | csvcut -c 2,5 | csvlook | head -n 10
Summary statistics of the csv data - csvstat
# show a summary of each column in the iris.csv file.
csvstat iris.csv
Summary statistics of a subset of data - csvgrep
# Show a summary of the rows in the data that have the string "Iris-versicolor" in them
# ... using awk to filter the data
cat iris.csv | awk -F "," 'NR==1; /Iris-versicolor/ { print }' | csvstat
# ... using csvgrep to filter the data
cat iris.csv | csvgrep -c 5 -m Iris-versicolor| csvstat