
BONUS - Common text file cleaning operations

Remove the first n rows in a text file

Remove first n rows using tail

# print the file with the first 5 rows removed
# (tail -n +K starts printing at line K, so skipping 5 rows means K = 6)
cat iris.csv | tail -n +6

# remove the first 5 rows and save to a file
cat iris.csv | tail -n +6 > iris_without_first_five_rows.csv
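
For an arbitrary n, the offset passed to tail has to be n + 1, because `tail -n +K` starts output at line K. A minimal sketch of a helper that does the arithmetic; the function name `drop_first_rows` is hypothetical and not part of any tool:

```shell
# Hypothetical helper: print FILE without its first N rows.
# Usage: drop_first_rows N FILE
drop_first_rows() {
  # tail -n +K begins at line K, so start at N + 1 to skip N rows
  tail -n "+$(( $1 + 1 ))" "$2"
}
```

For example, `drop_first_rows 1 iris.csv` would print the file without its header row.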


Remove the last n rows in a text file

# Remove the last 5 rows of a text file
# tail -r reverses lines (BSD/macOS only); with GNU coreutils use `head -n -5` instead
cat iris.csv | tail -r | tail -n +6 | tail -r
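
Because `tail -r` is not available everywhere, a portable alternative may be useful. The sketch below uses awk with a small ring buffer that delays each line by n rows, so the final n rows are never printed. The function name `drop_last_rows` is hypothetical, and the sketch assumes n is at least 1:

```shell
# Hypothetical helper: print FILE without its last N rows (N >= 1).
# Usage: drop_last_rows N FILE
drop_last_rows() {
  awk -v n="$1" '
    NR > n { print buf[NR % n] }   # the line read n rows ago can now be printed
           { buf[NR % n] = $0 }    # store the current line in its slot
  ' "$2"
}
```

If the file has n rows or fewer, nothing is printed.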


Only keep rows with n columns


# print all rows which have 5 columns
cat iris.csv | awk -F ',' 'NF == 5 { print } '
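
Before filtering, it can help to see how many rows have each column count, so you know which value of NF to keep. A minimal sketch; the function name `field_count_summary` is hypothetical, and like the command above it splits on every comma, so commas inside quoted fields are counted as separators:

```shell
# Hypothetical helper: print "<number of columns> <number of rows>" pairs.
# Usage: field_count_summary FILE
field_count_summary() {
  awk -F ',' '
    { count[NF]++ }                                  # tally rows by field count
    END { for (nf in count) print nf, count[nf] }
  ' "$1" | sort -n                                   # awk array order is unspecified
}
```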


Change the field separator in the file


# Using `sed` to replace pipes in a file with commas
sed 's/|/,/g' some_pipe_separated_data.tsv
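
The same conversion can also be expressed in awk by setting the input and output field separators, which fits well when you are already using awk to filter rows or columns. A minimal sketch; the function name `pipes_to_commas` is hypothetical, and like the sed version it does not handle quoted fields that contain the separator:

```shell
# Hypothetical helper: rewrite a pipe separated file as comma separated.
# Usage: pipes_to_commas FILE
pipes_to_commas() {
  # Assigning $1 to itself makes awk rebuild the record using OFS
  awk -F '|' -v OFS=',' '{ $1 = $1; print }' "$1"
}
```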


Only keep specific columns

Keep the 1st, 3rd and 5th columns of a comma separated file


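# print 1st, 3rd, and 5th columns to screen
# (OFS=',' keeps the output comma separated; awk's default output separator is a space)
cat iris.csv | awk -F ',' -v OFS=',' '{ print $1,$3,$5 }'

# write to file
cat iris.csv | awk -F ',' -v OFS=',' '{ print $1,$3,$5 }' > new_iris_file.csv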
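
When the columns are only being selected, not reordered or transformed, `cut` is a shorter alternative that keeps the original delimiter in its output. A minimal sketch; the function name `keep_columns` is hypothetical:

```shell
# Hypothetical helper: keep only the listed columns of a comma separated file.
# Usage: keep_columns COLUMN_LIST FILE   (for example: keep_columns 1,3,5 iris.csv)
keep_columns() {
  cut -d ',' -f "$1" "$2"
}
```

Note that `cut` always prints columns in their original order, so `keep_columns 5,1 iris.csv` gives the same result as `keep_columns 1,5 iris.csv`. Use awk if the columns need to be reordered.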

Combine multiple csv files into a single file

Sometimes data is split over multiple files when extracted from a system.

In these cases it may be desirable to recombine all the data into a single file.

Combine all lines of all csv files into a single file.


# list the files explicitly
cat file01.csv file02.csv file03.csv > combined.csv

# or match them all with a glob
cat file*.csv > combined.csv

If all of the csv files have a header, keep the header from the first file only and skip it in the rest


awk '(NR == 1) || (FNR > 1)' file*.csv > combined.csv
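
The awk command keeps line 1 of the combined input (NR == 1) and, within each file, every line after the first (FNR > 1). It assumes every file really does start with the same header. A minimal sketch that checks this before combining; the function name `combine_csv_with_header` is hypothetical:

```shell
# Hypothetical helper: combine csv files that share a header, keeping it once.
# Usage: combine_csv_with_header FILE... > combined.csv
combine_csv_with_header() {
  first_header="$(head -n 1 "$1")"
  for f in "$@"; do
    # Refuse to combine if any file starts with a different first line
    if [ "$(head -n 1 "$f")" != "$first_header" ]; then
      echo "header mismatch in $f" >&2
      return 1
    fi
  done
  # NR counts lines across all files, FNR restarts at 1 for each file
  awk '(NR == 1) || (FNR > 1)' "$@"
}
```

Write the output to a name that the input glob does not match (for example `combined.csv` with `file*.csv`), so the result is not read back in as one of the inputs.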