Working with and manipulating text files
Throughout this section we will be working with the Alice in Wonderland book as a text file.
Download Alice in Wonderland
# download Alice in Wonderland using wget
wget --output-document=alice.txt https://www.gutenberg.org/files/11/11-0.txt
# download Alice in Wonderland using curl
curl -o "alice.txt" 'https://www.gutenberg.org/files/11/11-0.txt'
sed
sed stands for Stream Editor. It reads text line by line from a file or a pipe and applies editing commands, such as substitutions, to each line.
Basic usage of sed to replace the first "t" in each line with a capital "T"
# By default sed will output to standard output (print to screen)
# and leave the original file unchanged
sed 's/t/T/' alice.txt
# It is also possible to pipe data into sed
cat alice.txt | sed 's/t/T/'
To replace every lowercase "t" with an uppercase "T", add the g (global) flag
sed 's/t/T/g' alice.txt
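The difference between the two substitutions is easier to see on a tiny input. The sketch below is illustrative only: it writes two made-up lines to a temporary file (the contents and the variable names are assumptions, not part of alice.txt) and runs both commands on it.

```shell
# Create a throwaway sample file (hypothetical contents, for illustration only)
sample=$(mktemp)
printf '%s\n' 'the cat sat' 'at the table' > "$sample"

# Without the g flag only the first "t" on each line is replaced
first_only=$(sed 's/t/T/' "$sample")
printf '%s\n' "$first_only"

# With the g flag every "t" on each line is replaced
every_match=$(sed 's/t/T/g' "$sample")
printf '%s\n' "$every_match"

rm -f "$sample"
```

The first command should print "The cat sat" and "aT the table"; the second should print "The caT saT" and "aT The Table".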
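To modify the file in place, use the -i option. This overwrites alice.txt, so work on a copy if you want to keep the original text for the grep examples below.
# GNU sed syntax; BSD/macOS sed expects a backup suffix after -i (for example -i '')
sed -i 's/t/T/g' alice.txt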
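Because -i overwrites the file, it can help to keep a backup. One commonly used form, sketched here on a throwaway file rather than on alice.txt, attaches a suffix directly to -i. The .bak suffix, the file name and its contents are arbitrary choices for this example. This spelling is accepted by both GNU sed and BSD/macOS sed.

```shell
# Work in a temporary directory so no real file is touched
workdir=$(mktemp -d)
printf '%s\n' 'a little text' > "$workdir/sample.txt"

# Edit in place, keeping the original as sample.txt.bak
sed -i.bak 's/t/T/g' "$workdir/sample.txt"

edited=$(cat "$workdir/sample.txt")
backup=$(cat "$workdir/sample.txt.bak")
printf 'edited: %s\nbackup: %s\n' "$edited" "$backup"

rm -rf "$workdir"
```

After the command runs, sample.txt holds the edited text ("a liTTle TexT") and sample.txt.bak holds the untouched original.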
grep
grep stands for Global Regular Expression Print. grep allows us to search for text within files.
Simple usage of grep to find all the lines in a text file that contain a string
# what lines in alice in wonderland contain the string 'watch'
# by default grep will print to standard output
grep watch alice.txt
# text files can also be piped into grep
cat alice.txt | grep watch
# the output of grep can also be written to a file in the usual way
cat alice.txt | grep watch > 'lines_of_alice_that_contain_watch.txt'
Using grep to search for lines that contain a specific word. The -w option matches whole words only, so a line that contains only 'watching' is not reported.
# what lines in alice in wonderland contain the word 'watch'
grep -w watch alice.txt
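To make the effect of -w concrete, here is a small sketch on three made-up lines (the sample text and variable names are assumptions for illustration). It also uses -c, a standard grep option that prints the number of matching lines instead of the lines themselves, purely to keep the output short.

```shell
# Throwaway sample file with one whole-word match and two substring-only matches
sample=$(mktemp)
printf '%s\n' 'a watch ticked' 'she was watching' 'the stopwatch broke' > "$sample"

# Plain search: every line that contains the string "watch"
substring_count=$(grep -c watch "$sample")

# Whole-word search: "watching" and "stopwatch" no longer count
word_count=$(grep -c -w watch "$sample")
word_lines=$(grep -w watch "$sample")

printf 'substring: %s, whole word: %s\n' "$substring_count" "$word_count"
printf '%s\n' "$word_lines"

rm -f "$sample"
```

On this input the plain search matches 3 lines, while -w matches only "a watch ticked".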
Search for 'chapter' in a few different ways.
# Search for lines that contain the string 'chapter' (case-sensitive by default)
grep chapter alice.txt
# Search for lines that contain 'chapter', ignoring case
grep -i chapter alice.txt
# Search for lines that begin with 'chapter', ignoring case
# (the pattern is quoted so the shell passes ^ to grep unchanged)
grep -i '^chapter' alice.txt
The -n option prints the line number of each match
# print the line numbers
grep -i -n '^chapter' alice.txt
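The interaction of case sensitivity, the ^ anchor and -n can be checked on a few made-up lines. This is only a sketch; the sample text and variable names are assumptions and are not taken from alice.txt.

```shell
# Throwaway sample file with headings in two capitalisations
sample=$(mktemp)
printf '%s\n' 'CHAPTER I.' 'Down the Rabbit-Hole' 'this chapter is long' 'Chapter II.' > "$sample"

# Case-sensitive search only finds the lowercase occurrence
case_sensitive=$(grep chapter "$sample")

# -i ignores case; -c counts the matching lines
any_case=$(grep -i -c chapter "$sample")

# ^ anchors the match to the start of the line; -n adds line numbers
headings=$(grep -i -n '^chapter' "$sample")

printf '%s\n' "$case_sensitive" "$any_case" "$headings"

rm -f "$sample"
```

Here the anchored search reports "1:CHAPTER I." and "4:Chapter II.", skipping line 3 because 'chapter' does not start that line.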
Lines before and after a match can be printed in addition to the line that is matched
# print 4 lines after each line that contains the string watch
grep -A 4 watch alice.txt
# print the 3 lines preceding each line that contains 'White Rabbit'
grep -B 3 'White Rabbit' alice.txt
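The context options are easiest to compare on a short numbered input. The sketch below uses five made-up lines (an assumption for illustration) and also shows -C, which is not used above: in GNU and BSD grep it prints context on both sides of the match.

```shell
# Throwaway sample file with one line per word
sample=$(mktemp)
printf '%s\n' one two three four five > "$sample"

# One line of context after, before, and on both sides of the match
after=$(grep -A 1 three "$sample")
before=$(grep -B 1 three "$sample")
around=$(grep -C 1 three "$sample")

printf 'after:\n%s\nbefore:\n%s\naround:\n%s\n' "$after" "$before" "$around"

rm -f "$sample"
```

With a single match, -A 1 prints "three, four", -B 1 prints "two, three", and -C 1 prints "two, three, four", each on its own line.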
It is also possible to search for text in every file within a directory.
# -R searches recursively: print lines that begin with 'chapter' (ignoring case) in any file under the current directory
grep -Ri '^chapter' .
# print lines that begin with 'chapter' (ignoring case) in any file under the 'linux_data_analysis' folder
grep -Ri '^chapter' ~/Documents/Projects/linux_data_analysis
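A recursive search can be tried safely on a temporary directory tree. In this sketch the directory layout, file names and contents are all made up for illustration. It also uses -l, a standard grep option that lists only the names of files containing a match.

```shell
# Build a small throwaway directory tree
root=$(mktemp -d)
mkdir -p "$root/sub"
printf '%s\n' 'Chapter I' 'some text' > "$root/a.txt"
printf '%s\n' 'intro' 'chapter two' > "$root/sub/b.txt"
printf '%s\n' 'no heading, only a chapter mention' > "$root/c.txt"

# -l lists only the names of files with at least one matching line
matching_files=$(grep -Ril '^chapter' "$root" | sort)

# Without -l each matching line is prefixed with the file it came from
matching_lines=$(grep -Ri '^chapter' "$root" | sort)

printf '%s\n' "$matching_files" "$matching_lines"

rm -rf "$root"
```

Only a.txt and sub/b.txt are reported. c.txt mentions 'chapter', but not at the start of a line.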
awk
awk is a convenient and expressive programming language that can be used for pattern scanning and data manipulation tasks.
Basic usage
# Awk can be used to print the contents of a file (here iris.csv, a comma-separated data file)
awk '{ print }' iris.csv
# data can be piped into awk for it to work on
cat iris.csv | awk '{ print }'
By default awk splits each line into fields on whitespace (spaces and tabs). Using the -F option we can tell awk to use a different character as the field separator.
# get the first column of a comma separated file
cat iris.csv | awk -F ',' '{ print $1 }'
# get the third column of a comma separated file
cat iris.csv | awk -F ',' '{ print $3 }'
# get the 2nd and 4th columns of a comma-separated file (print separates them with a space)
cat iris.csv | awk -F ',' '{ print $2,$4 }'
# the last column can be referenced with "$NF"
# get the last column of a comma separated file
cat iris.csv | awk -F ',' '{ print $NF }'
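The built-in variables are worth a quick experiment. The sketch below runs on two made-up rows in the style of iris.csv (the values and variable names are assumptions, not the real dataset). It shows NR (the current line number) alongside NF, and the OFS variable, which controls what print puts between its arguments.

```shell
# Throwaway comma-separated sample (hypothetical rows)
sample=$(mktemp)
printf '%s\n' '5.1,3.5,1.4,0.2,Iris-setosa' '7.0,3.2,4.7,1.4,Iris-versicolor' > "$sample"

# NR is the current line number, NF the number of fields on that line
summary=$(awk -F ',' '{ print NR, NF, $NF }' "$sample")

# print joins its arguments with OFS, a single space unless changed
comma_joined=$(awk -F ',' -v OFS=',' '{ print $2, $4 }' "$sample")

printf '%s\n' "$summary" "$comma_joined"

rm -f "$sample"
```

The first command prints "1 5 Iris-setosa" and "2 5 Iris-versicolor"; the second prints "3.5,0.2" and "3.2,1.4", keeping the output comma-separated.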
Awk can filter lines that contain a string (more precisely, lines that match a regular expression).
# print lines containing "Iris-versicolor"
cat iris.csv | awk '/Iris-versicolor/ { print }'
Awk can also filter lines based on the numerical value of a field.
# print lines where the second column is greater than or equal to 3
cat iris.csv | awk -F ',' '$2 >= 3 { print }'
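Conditions can be combined with && and ||. The sketch below assumes a file with a header row, which may or may not be true of your copy of iris.csv; the column names and values are made up for illustration. A non-numeric header field is compared as a string, so it can slip through a numeric test, and the NR > 1 guard is one way to skip it.

```shell
# Throwaway sample with a hypothetical header row
sample=$(mktemp)
printf '%s\n' \
  'sepal_length,sepal_width,petal_length,petal_width,species' \
  '5.1,3.5,1.4,0.2,Iris-setosa' \
  '7.0,3.2,4.7,1.4,Iris-versicolor' \
  '5.5,2.3,4.0,1.3,Iris-versicolor' \
  '6.3,3.3,6.0,2.5,Iris-virginica' > "$sample"

# Skip the header (NR > 1), then keep rows whose second column is at least 3
wide=$(awk -F ',' 'NR > 1 && $2 >= 3 { print $2, $NF }' "$sample")

# Add a string test on the last column to narrow the result further
wide_versicolor=$(awk -F ',' 'NR > 1 && $2 >= 3 && $NF == "Iris-versicolor" { print }' "$sample")

printf '%s\n' "$wide" "$wide_versicolor"

rm -f "$sample"
```

On this sample the first filter keeps three rows (3.5, 3.2 and 3.3), and the combined filter keeps only the "7.0,3.2,4.7,1.4,Iris-versicolor" row.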