Introduction and Motivation
Data analysis is a broad and varied field.
What is the 20% of applied knowledge that can drive 80% of results in data analysis?
A significant portion of any data analysis is cleaning and preparing data. This guide aims to provide effective tools for cleaning and preparing data, with the added bonus that the same tools can be leveraged further to perform some of the analysis itself.
Since the vast majority of data you will encounter comes in tabular form (rows and columns), tabular data is the focus of this guide.
Who is this for?
The aim of this guide is to take someone with nearly no experience working at the terminal to the point of being able to perform a rigorous and repeatable data analysis, including all the steps taken to clean the data.
This guide will leverage the Bash shell and SQLite.
Why Bash and SQLite?
- Capacity to handle millions of rows, well beyond the limits of a typical spreadsheet
- Scriptable - for repeatable data processing and analysis (see the short sketch after this list)
- Flexibility to handle and clean data in a wide variety of formats
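As a taste of what "scriptable" means in practice, here is a minimal sketch of a repeatable pipeline: a Bash script that loads a CSV into SQLite and queries it. The file sales.csv, the analysis.db database, and the sales table are hypothetical placeholders, not data from this guide.

```bash
#!/usr/bin/env bash
# A minimal sketch of a repeatable pipeline.
# sales.csv, analysis.db, and the sales table are hypothetical placeholders.
set -euo pipefail

# Start from a clean slate so the script produces the same result every run.
rm -f analysis.db

# Load the raw CSV into a SQLite table.
# In CSV mode, .import uses the first row as column names for a new table.
sqlite3 analysis.db <<'SQL'
.mode csv
.import sales.csv sales
SQL

# Query the imported table; the whole script can be re-run at any time.
sqlite3 analysis.db "SELECT COUNT(*) AS row_count FROM sales;"
```

Because the entire process lives in one script, re-running it next week on an updated sales.csv reproduces the analysis exactly, with no manual steps to get wrong.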
Bash
The Bash shell has been around for over 30 years and is the default shell on almost every Linux distribution. This means it can be found on most computers you will come across, including macOS (where it still ships) and Windows (via WSL or Git Bash).
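To confirm Bash is present on a machine, you can simply ask it for its version:

```bash
# Print the installed Bash version; any system with Bash will respond.
bash --version
```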
SQLite
SQLite is effectively a full relational database contained in a single file. It brings the power of a relational database to your fingertips without the server or authentication headaches.
This is a great article on some of the capabilities in SQLite: https://antonz.org/sqlite-is-not-a-toy-database
The SQL language has been around since 1974 and is arguably the most common programming language for working with data. This means that whatever the data-oriented task, there is likely a well-documented, best-practice way of achieving it.
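To make the "database in a file" idea concrete, here is a hedged sketch: sqlite3 creates the file on first use, and plain SQL does the rest. The file mydata.db and the people table are made up for illustration.

```bash
# No server, no credentials: the whole database lives in one file,
# which sqlite3 creates on first use. mydata.db is a hypothetical name.
sqlite3 mydata.db "CREATE TABLE IF NOT EXISTS people (name TEXT, age INTEGER);"
sqlite3 mydata.db "INSERT INTO people VALUES ('Ada', 36), ('Grace', 45);"
sqlite3 mydata.db "SELECT name FROM people WHERE age > 40;"
```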
What this guide is not
This guide assumes that a data file has already been made available to the analyst, e.g. a CSV file sent via email or downloaded from somewhere online. As such, this guide does not cover aspects of data collection such as web scraping.
This guide does not cover data visualisation. What it does cover is the process from raw data to summarised / pivoted data in a visualisation-ready format.
The hope is that the data, once it is ready to be visualised, can be taken to any tool, platform, or program for that purpose.
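As a rough sketch of that hand-off point, the command below summarises a raw table into a CSV ready for any charting tool. The database, table, and column names are hypothetical placeholders.

```bash
# Group and total the raw rows, then write a visualisation-ready CSV.
# analysis.db, sales, region, and amount are hypothetical placeholders.
sqlite3 -header -csv analysis.db "
  SELECT region, SUM(amount) AS total_sales
  FROM sales
  GROUP BY region
  ORDER BY total_sales DESC;
" > summary.csv
```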
Personal
In my very first data job, there was a weekly report I was responsible for producing every Monday. Producing it was a repetitive process that involved recreating the same updated charts each week.
All the work for this report was done entirely in Excel and took me 3 hours to complete on a good day.
In a way, I am writing the guide I wish I had as a newbie data analyst. The tools in this guide would have cut the weekly report down from ~3 hours to under 1 hour, not to mention eliminating the errors associated with doing the analysis manually.