Getting a feel for the data
What columns are in the data?
%%bash
csvcut -n iris.csv
  1: sepal_length
  2: sepal_width
  3: petal_length
  4: petal_width
  5: species
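For comparison, a similar numbered listing can be sketched with standard tools only. This is a rough equivalent rather than what csvcut does internally: `list_columns` is a hypothetical helper name, and the sketch assumes a comma-delimited file whose header contains no quoted commas.

```shell
# Print the header fields of a CSV file, one per line, numbered from 1.
# Assumption: plain comma-separated header, no quoted fields containing commas.
list_columns() {
  head -n 1 "$1" | tr ',' '\n' | awk '{ printf "%d: %s\n", NR, $0 }'
}
```

In a `%%bash` cell, the function would be defined and then called as `list_columns iris.csv`.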
How many rows in the data?
%%bash
# number of lines in the file (the header line is counted too)
wc -l iris.csv
151 iris.csv
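The 151 lines reported by `wc -l` include the header line, so the number of data rows is one less. A minimal sketch of counting only the data rows with standard tools; `count_data_rows` is a hypothetical helper name, and the sketch assumes a single header line and no quoted fields with embedded newlines.

```shell
# Count the data rows of a CSV file by skipping the first (header) line.
# Assumptions: exactly one header line, no newlines inside quoted fields.
count_data_rows() {
  # tail -n +2 starts output at line 2; tr strips padding some wc versions add.
  tail -n +2 "$1" | wc -l | tr -d ' '
}
```

Called as `count_data_rows iris.csv`, this should print 150 for the file used here, since 151 lines minus the header leaves 150.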
%%bash
cat iris.csv | csvsql --query "SELECT COUNT(*) as rows FROM stdin;"
rows
150
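Alternatively, pass the file name to csvsql directly; the table is then named after the file (iris) instead of stdin: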
%%bash
csvsql iris.csv --query "SELECT COUNT(*) as rows FROM iris;"
rows
150
%%bash
# Stats for all columns
cat iris.csv | csvstat
  1. "sepal_length"

        Type of data:          Number
        Contains null values:  False
        Unique values:         35
        Smallest value:        4.3
        Largest value:         7.9
        Sum:                   876.5
        Mean:                  5.843
        Median:                5.8
        StDev:                 0.828
        Most common values:    5 (10x)
                               5.1 (9x)
                               6.3 (9x)
                               5.7 (8x)
                               6.7 (8x)

  2. "sepal_width"

        Type of data:          Number
        Contains null values:  False
        Unique values:         23
        Smallest value:        2
        Largest value:         4.4
        Sum:                   458.1
        Mean:                  3.054
        Median:                3
        StDev:                 0.434
        Most common values:    3 (26x)
                               2.8 (14x)
                               3.2 (13x)
                               3.1 (12x)
                               3.4 (12x)

  3. "petal_length"

        Type of data:          Number
        Contains null values:  False
        Unique values:         43
        Smallest value:        1
        Largest value:         6.9
        Sum:                   563.8
        Mean:                  3.759
        Median:                4.35
        StDev:                 1.764
        Most common values:    1.5 (14x)
                               1.4 (12x)
                               4.5 (8x)
                               5.1 (8x)
                               1.3 (7x)

  4. "petal_width"

        Type of data:          Number
        Contains null values:  False
        Unique values:         22
        Smallest value:        0.1
        Largest value:         2.5
        Sum:                   179.8
        Mean:                  1.199
        Median:                1.3
        StDev:                 0.763
        Most common values:    0.2 (28x)
                               1.3 (13x)
                               1.5 (12x)
                               1.8 (12x)
                               1.4 (8x)

  5. "species"

        Type of data:          Text
        Contains null values:  False
        Unique values:         3
        Longest value:         15 characters
        Most common values:    Iris-setosa (50x)
                               Iris-versicolor (50x)
                               Iris-virginica (50x)

Row count: 150
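One way to sanity-check a figure from csvstat is to recompute it independently. The sketch below recomputes a column mean with awk; `column_mean` is a hypothetical helper name, and the sketch assumes a comma-delimited file with one header line, no quoted commas, and a numeric column addressed by its 1-based position.

```shell
# Print the mean of one numeric CSV column, rounded to three decimals.
# Usage: column_mean FILE COLUMN_NUMBER
# Assumptions: one header line, plain comma delimiter, numeric values only.
column_mean() {
  awk -F, -v col="$2" '
    NR > 1 { sum += $col; n++ }               # skip the header, accumulate
    END    { if (n > 0) printf "%.3f\n", sum / n }
  ' "$1"
}
```

For example, `column_mean iris.csv 1` should agree with the Mean that csvstat reports for sepal_length, since the reported Sum of 876.5 divided by 150 rows is about 5.843.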
A subset of columns using csvcut
%%bash
csvcut -c sepal_length,sepal_width iris.csv | csvstat
  1. "sepal_length"

        Type of data:          Number
        Contains null values:  False
        Unique values:         35
        Smallest value:        4.3
        Largest value:         7.9
        Sum:                   876.5
        Mean:                  5.843
        Median:                5.8
        StDev:                 0.828
        Most common values:    5 (10x)
                               5.1 (9x)
                               6.3 (9x)
                               5.7 (8x)
                               6.7 (8x)

  2. "sepal_width"

        Type of data:          Number
        Contains null values:  False
        Unique values:         23
        Smallest value:        2
        Largest value:         4.4
        Sum:                   458.1
        Mean:                  3.054
        Median:                3
        StDev:                 0.434
        Most common values:    3 (26x)
                               2.8 (14x)
                               3.2 (13x)
                               3.1 (12x)
                               3.4 (12x)

Row count: 150
A subset of columns using csvsql
%%bash
csvsql iris.csv --query "SELECT sepal_width, species FROM iris" | csvstat
  1. "sepal_width"

        Type of data:          Number
        Contains null values:  False
        Unique values:         23
        Smallest value:        2
        Largest value:         4.4
        Sum:                   458.1
        Mean:                  3.054
        Median:                3
        StDev:                 0.434
        Most common values:    3 (26x)
                               2.8 (14x)
                               3.2 (13x)
                               3.1 (12x)
                               3.4 (12x)

  2. "species"

        Type of data:          Text
        Contains null values:  False
        Unique values:         3
        Longest value:         15 characters
        Most common values:    Iris-setosa (50x)
                               Iris-versicolor (50x)
                               Iris-virginica (50x)

Row count: 150
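The "Most common values" tally for a text column such as species can also be cross-checked without csvkit. A minimal sketch, assuming a comma-delimited file with one header line and no quoted commas; `value_counts` is a hypothetical helper name.

```shell
# Count how often each distinct value occurs in one CSV column.
# Usage: value_counts FILE COLUMN_NUMBER
# Output: one "COUNT VALUE" line per distinct value, sorted by value.
# Assumptions: one header line, plain comma delimiter.
value_counts() {
  awk -F, -v col="$2" '
    NR > 1 { count[$col]++ }                  # skip the header, tally values
    END    { for (k in count) print count[k], k }
  ' "$1" | sort -k2
}
```

With the column order shown by `csvcut -n`, species is column 5, so the call would be `value_counts iris.csv 5`.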