BONUS - Common text file cleaning operations
Remove the first n rows in a text file
Remove first n rows using tail
# print the file with the first 5 rows removed (tail -n +6 starts output at line 6)
cat iris.csv | tail -n +6
# remove first 5 rows and save to a file
cat iris.csv | tail -n +6 > iris_without_first_five_rows.csv
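The offset passed to `tail -n +K` is the line number where output starts, so dropping n rows means K = n + 1. A minimal sketch of that arithmetic, using a hypothetical 8-line sample file and an illustrative variable `n` (neither is from the original), with `sed` shown as an equivalent:

```shell
# Build a small 8-line sample file in a temporary directory (hypothetical data).
tmpdir=$(mktemp -d)
seq 1 8 > "$tmpdir/sample.txt"

# Number of leading rows to drop.
n=5

# tail -n +K starts printing at line K, so use K = n + 1.
tail -n +"$((n + 1))" "$tmpdir/sample.txt" > "$tmpdir/tail_result.txt"

# sed deletes lines 1 through n and prints everything else.
sed "1,${n}d" "$tmpdir/sample.txt" > "$tmpdir/sed_result.txt"

# Both files now hold lines 6, 7 and 8.
cat "$tmpdir/tail_result.txt"
```

Either form works; the `sed` version states the number of rows to delete directly, which avoids the off-by-one.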
Remove the last n rows in a text file
# Remove the last 5 rows of a text file
# (tail -r is BSD/macOS only; on GNU/Linux use tac in place of tail -r)
cat iris.csv | tail -r | tail -n +6 | tail -r
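Since `tail -r` is not available everywhere, one possible alternative is to count the lines first and keep only the leading portion with `head`. This is a sketch under the assumption that the file ends with a newline; the sample file and the variable names `n`, `total` and `keep` are illustrative:

```shell
# Build a small 8-line sample file in a temporary directory (hypothetical data).
tmpdir=$(mktemp -d)
seq 1 8 > "$tmpdir/sample.txt"

# Number of trailing rows to drop.
n=5

# Count the lines, then work out how many leading lines to keep.
total=$(wc -l < "$tmpdir/sample.txt")
keep=$((total - n))

if [ "$keep" -gt 0 ]; then
  head -n "$keep" "$tmpdir/sample.txt" > "$tmpdir/trimmed.txt"
else
  # Nothing left to keep: write an empty file instead of calling head -n 0.
  : > "$tmpdir/trimmed.txt"
fi

# trimmed.txt now holds lines 1, 2 and 3.
cat "$tmpdir/trimmed.txt"
```

This reads the file twice, which is usually acceptable for files on disk but does not work on a stream.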
Only keep rows with n columns
# print all rows which have 5 columns
cat iris.csv | awk -F ',' 'NF == 5 { print } '
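Before filtering on a column count, it can help to see how many rows have each count. The sketch below uses a hypothetical ragged file to tally `NF` per row and then keep only the 5-column rows. Note that splitting on `,` does not understand quoted fields, so a value such as `"a,b"` would be counted as two columns.

```shell
# Hypothetical sample: two complete rows, one short row, one blank line.
tmpdir=$(mktemp -d)
printf '5.1,3.5,1.4,0.2,setosa\n4.9,3.0,1.4\n6.3,3.3,6.0,2.5,virginica\n\n' > "$tmpdir/ragged.csv"

# Tally how many rows have each field count (a blank line has 0 fields).
awk -F ',' '{ print NF }' "$tmpdir/ragged.csv" | sort -n | uniq -c > "$tmpdir/field_counts.txt"
cat "$tmpdir/field_counts.txt"

# Keep only the rows with exactly 5 fields; a bare pattern prints matching lines.
awk -F ',' 'NF == 5' "$tmpdir/ragged.csv" > "$tmpdir/clean.csv"
cat "$tmpdir/clean.csv"
```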
Change the field separator in the file
# Using `sed` to replace pipes in a file with commas
sed 's/|/,/g' some_pipe_separated_data.tsv
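When the separator is a single character, `tr` is a possible alternative to `sed`. The sketch below runs both on a hypothetical pipe-separated file and writes the result to a new file rather than editing in place. Neither command checks whether the data already contains commas, so that is worth verifying first.

```shell
# Hypothetical pipe-separated sample file.
tmpdir=$(mktemp -d)
printf 'id|name|score\n1|ada|90\n2|alan|85\n' > "$tmpdir/data.psv"

# sed: replace every pipe on every line with a comma.
sed 's/|/,/g' "$tmpdir/data.psv" > "$tmpdir/data_sed.csv"

# tr: translate each pipe character to a comma (reads from standard input).
tr '|' ',' < "$tmpdir/data.psv" > "$tmpdir/data_tr.csv"

# Both outputs are identical comma-separated files.
cat "$tmpdir/data_sed.csv"
```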
Only keep specific columns
Keep the 1st, 3rd and 5th columns of a comma-separated file
# print 1st, 3rd, and 5th columns to screen (OFS keeps the output comma-separated)
cat iris.csv | awk -F ',' -v OFS=',' '{ print $1,$3,$5 }'
# write to file
cat iris.csv | awk -F ',' -v OFS=',' '{ print $1,$3,$5 }' > new_iris_file.csv
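For plain column selection, `cut` is a possible alternative to `awk`, and it keeps the input delimiter in its output. A minimal sketch on a hypothetical five-column file, with the `awk` form alongside for comparison (here `OFS` is set so that `awk` also joins the fields with commas rather than spaces):

```shell
# Hypothetical five-column sample file.
tmpdir=$(mktemp -d)
printf 'a,b,c,d,e\n1,2,3,4,5\n' > "$tmpdir/cols.csv"

# cut: select fields 1, 3 and 5 using a comma as the delimiter.
cut -d ',' -f 1,3,5 "$tmpdir/cols.csv" > "$tmpdir/cut_result.csv"

# awk: same selection; OFS sets the separator used between printed fields.
awk -F ',' -v OFS=',' '{ print $1, $3, $5 }' "$tmpdir/cols.csv" > "$tmpdir/awk_result.csv"

# Both files contain the same two rows.
cat "$tmpdir/cut_result.csv"
```

`cut` always emits fields in file order, so `awk` is still the tool to reach for when the columns need to be reordered.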
Combine multiple csv files into a single file
Sometimes data is split over multiple files when extracted from a system.
In these cases it may be desirable to recombine all the data into a single file.
Combine all lines of all csv files into a single file.
cat file01.csv file02.csv file03.csv > combined.csv
cat file*.csv > combined.csv
If all of the csv files have a header, keep only the header from the first file
awk '(NR == 1) || (FNR > 1)' file*.csv > combined.csv
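In that command, `NR` counts lines across all input files and `FNR` restarts at 1 for each file, so the condition keeps the very first line overall plus every non-header line. A small sketch with two hypothetical files shows the effect. It assumes every file ends with a newline and that the output name does not match the `file*.csv` pattern.

```shell
# Two hypothetical input files that share the same header row.
tmpdir=$(mktemp -d)
printf 'id,value\n1,a\n2,b\n' > "$tmpdir/file01.csv"
printf 'id,value\n3,c\n' > "$tmpdir/file02.csv"

# Keep line 1 of the first file (NR == 1) and skip line 1 of every file after it.
awk '(NR == 1) || (FNR > 1)' "$tmpdir"/file*.csv > "$tmpdir/combined.csv"

# combined.csv has one header followed by the three data rows.
cat "$tmpdir/combined.csv"
```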