Chapter 1: What is data engineering?
What data engineers do
Typical tasks include extracting, loading, and transforming data.
- Query from a source (extract)
- Perform some modifications (transform)
- Put the data where others can access it, now that it is production quality (load)
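The three steps above can be sketched as a toy pipeline. The data, field names, and in-memory "warehouse" here are illustrative stand-ins, not a real system:

```python
# Minimal ETL sketch: extract rows, transform them, load into a destination.

def extract():
    # Query from a source; a hard-coded list stands in for a database query.
    return [{"item": "shirt", "price": "19.99"}, {"item": "hat", "price": "9.50"}]

def transform(rows):
    # Perform some modifications: convert prices from strings to floats.
    return [{**row, "price": float(row["price"])} for row in rows]

def load(rows, destination):
    # Put the data where others can access it.
    destination.extend(rows)

warehouse = []
load(transform(extract()), warehouse)
```

In a real pipeline, `extract` would query a source database and `load` would write to a warehouse, but the shape of the code is the same.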
Consider a retailer that has a transactional database in each region it operates.
To answer questions about total sales, the data from all of these databases would be required in a single place,
but not before some transformations are applied, to standardise the timestamps across each region.
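The timestamp standardisation step might look like the following sketch, which converts region-local timestamps to UTC with the standard library (the region names and order times are made up for illustration):

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo  # Python 3.9+ standard library

# Hypothetical order timestamps, each recorded in its region's local time.
orders = [
    ("us-east", datetime(2023, 6, 1, 9, 30, tzinfo=ZoneInfo("America/New_York"))),
    ("eu-west", datetime(2023, 6, 1, 9, 30, tzinfo=ZoneInfo("Europe/London"))),
]

# Standardise every timestamp to UTC before combining the regions.
standardised = [(region, ts.astimezone(timezone.utc)) for region, ts in orders]
for region, ts in standardised:
    print(region, ts.isoformat())
```

Once every row carries a UTC timestamp, sales from all regions can be compared and summed directly.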
The combination of extracting, loading, and transforming data is accomplished by the creation of a data pipeline.
The data comes into the pipeline raw, or dirty, in the sense that it may contain missing values or typos; it is then cleaned as it flows through the pipe.
After that, it comes out the other side into a data warehouse, where it can be queried. The following diagram shows the pipeline required to accomplish the task.
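A cleaning stage like the one described can be sketched as below. The categories, typo corrections, and rules are invented for illustration:

```python
# Toy cleaning step: raw rows may have missing fields or typos.
raw = [
    {"id": 1, "category": "shoes"},
    {"id": 2, "category": "shose"},   # typo
    {"id": 3, "category": None},      # missing value
]

# Hypothetical lookup of known misspellings.
CORRECTIONS = {"shose": "shoes"}

def clean(rows):
    cleaned = []
    for row in rows:
        if row["category"] is None:
            continue  # drop rows with missing data
        category = CORRECTIONS.get(row["category"], row["category"])
        cleaned.append({**row, "category": category})
    return cleaned
```

Real pipelines may instead impute missing values or quarantine bad rows for review; dropping them is just the simplest policy to show.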
Required skills and knowledge to be a data engineer
Data engineers typically need to know multiple programming languages such as Python and SQL.
Knowledge of data modeling and data structures is important at the transformation stage, as is data warehouse design.
Data engineering scope may also include the infrastructure on which the data pipelines run, e.g. Linux servers, or the corresponding tools on cloud platforms.
Data engineering is the development, operation, and maintenance of data infrastructure, either on-premises or in the cloud (or hybrid or multi-cloud), comprising databases and pipelines to extract, transform, and load data.
SQL is the main language of data engineering.
Java and Scala are widely used in data engineering tools.
Our focus will be on Python.
Most data in production systems is stored in relational databases.
Examples include: Oracle, Microsoft SQL Server, MySQL, and PostgreSQL.
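Querying a relational database with SQL from Python looks roughly like this. SQLite (built into Python) stands in here for a production database such as PostgreSQL or MySQL, and the table and rows are illustrative:

```python
import sqlite3

# An in-memory SQLite database stands in for a production relational database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("east", 100.0), ("west", 250.0)],
)

# SQL is the main language of data engineering: aggregate with a query.
total = conn.execute("SELECT SUM(amount) FROM sales").fetchone()[0]
conn.close()
```

Against PostgreSQL or MySQL, only the connection line would change (via a driver such as `psycopg2` or `mysqlclient`); the SQL itself stays much the same.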
Popular choices for data warehouses include:
- Amazon Redshift
- Google BigQuery
- Apache Cassandra
Data Processing Engines
Once a data engineer extracts data from a database, they will need to transform or process it. With big data, it helps to use a data processing engine.
The most popular engine is Apache Spark.
ETL processes will likely need to be run on a schedule.
Crontab was the original scheduler. However, additional complexities, such as monitoring successes and failures and tracking what ran and what didn't, lead to the need for a better framework.
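A crontab entry for a nightly ETL run might look like the following (the script path and log file are hypothetical). Note that cron itself records nothing about success or failure, which is why the output is redirected to a log by hand:

```shell
# Hypothetical crontab entry: run an ETL script every day at 02:00.
0 2 * * * /usr/bin/python3 /opt/etl/run_pipeline.py >> /var/log/etl.log 2>&1
```

Everything beyond "run it at this time", such as retries, alerting, and backfills, has to be bolted on manually, which is the gap frameworks like Airflow fill.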
Apache Airflow is the most popular data pipeline framework in Python.
Airflow has the following built in:
- web server
- queueing system
Airflow uses Directed Acyclic Graphs (DAGs) to define pipelines.
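The DAG idea itself can be shown with the standard library alone. This sketch is not Airflow's API; it only demonstrates how a dependency graph determines task order, using the hypothetical task names extract, transform, and load:

```python
from graphlib import TopologicalSorter  # Python 3.9+ standard library

# A DAG of pipeline tasks: each key maps to the set of tasks it depends on.
dag = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
}

# A topological sort yields an order in which every task runs after its
# dependencies, which is exactly what a scheduler like Airflow computes.
order = list(TopologicalSorter(dag).static_order())
print(order)
```

The "acyclic" part matters: if `extract` also depended on `load`, no valid order would exist, and `TopologicalSorter` would raise a `CycleError`.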
Airflow can be run on a single machine, or distributed across a cluster of nodes.
Apache NiFi is another framework for building data engineering pipelines.