BONUS - Common text file cleaning operations

Remove the first n rows in a text file

Remove first n rows using tail

# print the file with the first 5 rows removed
# (tail -n +K starts output at line K, so +6 skips exactly 5 rows)
cat iris.csv | tail -n +6

# remove the first 5 rows and save to a file
cat iris.csv | tail -n +6 > iris_without_first_five_rows.csv
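The off-by-one in `tail -n +K` is easy to get wrong: output starts at line K, so dropping n rows needs K = n + 1. A minimal sketch on a small made-up file (the file name and contents are hypothetical, for illustration only):

```shell
# Build an 8-line sample file in a temporary directory (hypothetical data)
tmpdir=$(mktemp -d)
printf '%s\n' 1 2 3 4 5 6 7 8 > "$tmpdir/sample.txt"

n=5  # number of leading rows to drop

# tail -n +K starts output AT line K, so use n + 1 to skip exactly n rows
result=$(tail -n +"$((n + 1))" "$tmpdir/sample.txt")
echo "$result"

rm -rf "$tmpdir"
```

Only lines 6, 7 and 8 of the sample remain.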

Remove the last n rows in a text file

# Remove the last 5 rows of a text file
# (tail -r is BSD/macOS only; with GNU coreutils, `head -n -5 iris.csv` does the same)
cat iris.csv | tail -r | tail -n +6 | tail -r
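Since `tail -r` is not available in every `tail` implementation, one alternative is to count the lines with `wc -l` and keep the first `total - n` of them with `head`. This is a sketch under the assumption that the file ends with a newline; the sample file and variable names are hypothetical:

```shell
# Build an 8-line sample file in a temporary directory (hypothetical data)
tmpdir=$(mktemp -d)
printf '%s\n' 1 2 3 4 5 6 7 8 > "$tmpdir/sample.txt"

n=5  # number of trailing rows to drop

# Count the lines, then keep only the first (total - n) of them
total=$(wc -l < "$tmpdir/sample.txt")
keep=$((total - n))

# Guard against files with n or fewer lines, where nothing should be kept
if [ "$keep" -gt 0 ]; then
  result=$(head -n "$keep" "$tmpdir/sample.txt")
else
  result=""
fi
echo "$result"

rm -rf "$tmpdir"
```

The file is read twice, which is usually acceptable for the file sizes handled on the command line.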

Only keep rows with n columns

# print all rows which have 5 columns
cat iris.csv | awk -F ',' 'NF == 5 { print } '
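Before filtering, it can help to see how many rows have each column count. The sketch below tallies `NF` per row on a made-up file (hypothetical data). It assumes plain comma splitting, so quoted fields that contain commas would be counted as extra columns:

```shell
# Build a sample file where one row is missing a column (hypothetical data)
tmpdir=$(mktemp -d)
printf '%s\n' 'a,b,c' 'd,e' 'f,g,h' > "$tmpdir/sample.csv"

# Tally rows by field count; each output line is "<columns> <rows>"
counts=$(awk -F ',' '{ n[NF]++ } END { for (k in n) print k, n[k] }' \
  "$tmpdir/sample.csv" | sort -n)
echo "$counts"

rm -rf "$tmpdir"
```

Here one row has 2 columns and two rows have 3, so the row with 2 columns is the one that `NF == 3` would drop.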

Change the field separator in the file

# Use `sed` to replace pipes in a file with commas
sed 's/|/,/g' some_pipe_separated_data.tsv
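Replacing the separator is only safe when the new separator does not already appear in the data. A minimal sketch that checks for existing commas first and then writes the converted output to a new file (file names and contents are hypothetical):

```shell
# Build a small pipe-separated sample file (hypothetical data)
tmpdir=$(mktemp -d)
in="$tmpdir/sample_pipes.txt"
out="$tmpdir/sample_commas.csv"
printf '%s\n' 'a|b|c' '1|2|3' > "$in"

# Warn if commas already exist, since the converted fields would be ambiguous
if grep -q ',' "$in"; then
  echo "warning: input already contains commas" >&2
fi

# Replace every pipe with a comma and write to a new file
sed 's/|/,/g' "$in" > "$out"
result=$(cat "$out")
echo "$result"

rm -rf "$tmpdir"
```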

Only keep specific columns

Keep the 1st, 3rd and 5th columns of a comma-separated file

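# print 1st, 3rd, and 5th columns to screen
# (OFS=',' keeps the output comma-separated; awk's default output separator is a space)
cat iris.csv | awk -F ',' -v OFS=',' '{ print $1,$3,$5 }'

# write to file
cat iris.csv | awk -F ',' -v OFS=',' '{ print $1,$3,$5 }' > new_iris_file.csv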
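When the columns are only being selected, not reordered, `cut` is a possible alternative: it keeps the input delimiter in the output, whereas `awk`'s `print $1,$3,$5` joins fields with the output field separator, which defaults to a space. A sketch on a made-up file (hypothetical data):

```shell
# Build a 5-column sample file (hypothetical data)
tmpdir=$(mktemp -d)
printf '%s\n' 'a,b,c,d,e' '1,2,3,4,5' > "$tmpdir/sample.csv"

# Keep fields 1, 3 and 5; the comma delimiter is preserved in the output
result=$(cut -d ',' -f 1,3,5 "$tmpdir/sample.csv")
echo "$result"

rm -rf "$tmpdir"
```

Note that `cut` always emits fields in their original order, so reordering columns still needs `awk`.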

Combine multiple csv files into a single file

Sometimes data is split over multiple files when extracted from a system.

In these cases it may be desirable to recombine all the data into a single file.

Combine all lines of all csv files into a single file.

cat file01.csv file02.csv file03.csv > combined.csv

cat file*.csv > combined.csv

If all of the csv files have a header, keep only the header from the first file:

awk '(NR == 1) || (FNR > 1)' file*.csv > combined.csv
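The `awk` condition works because `NR` counts lines across all input files while `FNR` restarts at 1 for each file: `NR == 1` keeps the very first header, and `FNR > 1` keeps every non-header line. A small sketch with two made-up files (hypothetical names and data):

```shell
# Build two sample files that share the same header (hypothetical data)
tmpdir=$(mktemp -d)
printf '%s\n' 'id,name' '1,a' > "$tmpdir/file01.csv"
printf '%s\n' 'id,name' '2,b' > "$tmpdir/file02.csv"

# Keep the first line overall (the header) plus all non-header lines
awk '(NR == 1) || (FNR > 1)' "$tmpdir"/file*.csv > "$tmpdir/combined.csv"
result=$(cat "$tmpdir/combined.csv")
echo "$result"

rm -rf "$tmpdir"
```

The output file name should not match the `file*.csv` pattern, otherwise it could be picked up as an input on a later run.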