Working with and manipulating text files

Throughout this section we will be working with the Alice in Wonderland book as a text file.

Download Alice in Wonderland


# download Alice in Wonderland using wget
wget --output-document=alice.txt https://www.gutenberg.org/files/11/11-0.txt

# download Alice in Wonderland using curl
curl -o "alice.txt" 'https://www.gutenberg.org/files/11/11-0.txt'

`sed`

sed stands for stream editor. It reads text line by line, applies editing commands to it, and writes the result to standard output.

Basic usage of sed to replace the first "t" in each line with a capital "T"

# By default sed will output to standard output (print to screen)
# and leave the original file unchanged

sed 's/t/T/' alice.txt

# It is also possible to pipe data into sed
cat alice.txt | sed 's/t/T/'
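To confirm that sed leaves its input file untouched, the same substitution can be tried on a small scratch file. This is a minimal sketch: `sample.txt` and its two lines are invented for illustration and are not part of alice.txt.

```shell
# Work in a scratch directory so no real file is touched
workdir=$(mktemp -d)
printf 'the cat sat on the mat\na big dog\n' > "$workdir/sample.txt"

# sed writes the edited text to standard output...
edited=$(sed 's/t/T/' "$workdir/sample.txt")
printf '%s\n' "$edited"
# The cat sat on the mat
# a big dog

# ...while the file on disk still holds the original text
original=$(cat "$workdir/sample.txt")
printf '%s\n' "$original"
# the cat sat on the mat
# a big dog
```

The second line contains no "t", so sed passes it through unchanged.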

To replace every lowercase "t" with an uppercase "T", add the g (global) flag


sed 's/t/T/g' alice.txt
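The difference the g flag makes is easiest to see side by side on a single line. The sample line below is invented for illustration.

```shell
# One invented sample line (not taken from alice.txt)
line='the cat sat on the mat'

# Without g: only the first "t" on the line is replaced
first_only=$(printf '%s\n' "$line" | sed 's/t/T/')
printf '%s\n' "$first_only"   # The cat sat on the mat

# With g: every "t" on the line is replaced
every=$(printf '%s\n' "$line" | sed 's/t/T/g')
printf '%s\n' "$every"        # The caT saT on The maT
```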

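To modify the file in place, use the -i option. The form below works with GNU sed; BSD and macOS sed require a backup suffix after -i, which may be empty (-i '').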


sed -i 's/t/T/g' alice.txt
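Because an in-place edit overwrites the file, it can be worth keeping a backup. A minimal sketch, assuming a scratch file named `sample.txt` (invented for illustration): attaching a suffix directly to -i, as in `-i.bak`, is accepted by both GNU and BSD sed and saves the original under that suffix.

```shell
# Work on a scratch copy rather than on alice.txt itself
workdir=$(mktemp -d)
printf 'the cat sat on the mat\n' > "$workdir/sample.txt"

# -i.bak edits sample.txt in place and keeps the original as sample.txt.bak
sed -i.bak 's/t/T/g' "$workdir/sample.txt"

cat "$workdir/sample.txt"       # The caT saT on The maT
cat "$workdir/sample.txt.bak"   # the cat sat on the mat
```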

`grep`

grep stands for Global Regular Expression Print. grep allows us to search for text within files.

Simple usage of grep to find all the lines in a text file that contain a string


# which lines in Alice in Wonderland contain the string 'watch'
# by default grep will print to standard output
grep watch alice.txt

# text files can also be piped into grep
cat alice.txt | grep watch

# the output of grep can also be written to a file in the usual way
cat alice.txt | grep watch > 'lines_of_alice_that_contain_watch.txt'

Using grep with the -w option to search for lines that contain a specific whole word


# which lines in Alice in Wonderland contain the word 'watch'
grep -w watch alice.txt
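The difference between a plain string search and a whole-word search shows up when the string is part of a longer word. The three sample lines below are invented for illustration.

```shell
# Invented sample lines (not taken from alice.txt)
sample='she looked at her watch
the watchman nodded
keep watching'

# Plain search: matches "watch" anywhere, including inside longer words
substring=$(printf '%s\n' "$sample" | grep watch)
printf '%s\n' "$substring"
# she looked at her watch
# the watchman nodded
# keep watching

# -w: matches only when "watch" stands alone as a word
whole_word=$(printf '%s\n' "$sample" | grep -w watch)
printf '%s\n' "$whole_word"
# she looked at her watch
```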

Search for 'chapter' in a few different ways.


# Search for lines that contain the string 'chapter' (case-sensitive)
grep chapter alice.txt

# Search for lines that contain the string 'chapter', ignoring case
grep -i chapter alice.txt

# Search for lines that begin with the string 'chapter', ignoring case
# (the pattern is quoted so the shell passes ^ to grep unchanged)
grep -i '^chapter' alice.txt
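The three variants can be compared on a few invented lines that mix upper and lower case and put the string at different positions.

```shell
# Invented sample lines (not taken from alice.txt)
sample='CHAPTER one
in the next chapter she fell
Chapter two'

# Case-sensitive: only the all-lowercase occurrence matches
case_sensitive=$(printf '%s\n' "$sample" | grep chapter)
printf '%s\n' "$case_sensitive"
# in the next chapter she fell

# -i: any capitalisation matches, anywhere on the line
any_case=$(printf '%s\n' "$sample" | grep -i chapter)
printf '%s\n' "$any_case"
# CHAPTER one
# in the next chapter she fell
# Chapter two

# -i with ^: any capitalisation, but only at the start of the line
line_start=$(printf '%s\n' "$sample" | grep -i '^chapter')
printf '%s\n' "$line_start"
# CHAPTER one
# Chapter two
```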

Using the -n option prints the line number of each match


# print the line number of each match
grep -i -n '^chapter' alice.txt
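With -n, each match is printed as `number:line`. If only the numbers are wanted, one possible approach (a sketch, not the only way) is to cut the output at the first colon. The sample lines are invented for illustration.

```shell
# Invented sample lines (not taken from alice.txt)
sample='CHAPTER one
in the next chapter she fell
Chapter two'

# Each match is prefixed with its line number and a colon
numbered=$(printf '%s\n' "$sample" | grep -i -n '^chapter')
printf '%s\n' "$numbered"
# 1:CHAPTER one
# 3:Chapter two

# Keep only the text before the first colon, i.e. the line numbers
only_numbers=$(printf '%s\n' "$numbered" | cut -d ':' -f 1)
printf '%s\n' "$only_numbers"
# 1
# 3
```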

Lines before and after a match can be printed in addition to the line that is matched


# print 4 lines after each line that contains the string watch
grep -A 4 watch alice.txt

# print the 3 lines preceding each line that contains 'White Rabbit'
grep -B 3 'White Rabbit' alice.txt
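The context options are easier to follow on a short numbered sample. The lines below are invented for illustration; the sketch also uses -C, which GNU and BSD grep provide for printing context on both sides of a match.

```shell
# Invented sample lines (not taken from alice.txt)
sample='line 1
line 2
she took out her watch
line 4
line 5'

# -A 1: the match plus 1 line after it
after=$(printf '%s\n' "$sample" | grep -A 1 watch)
printf '%s\n' "$after"
# she took out her watch
# line 4

# -B 1: the match plus 1 line before it
before=$(printf '%s\n' "$sample" | grep -B 1 watch)
printf '%s\n' "$before"
# line 2
# she took out her watch

# -C 1: the match plus 1 line on each side
around=$(printf '%s\n' "$sample" | grep -C 1 watch)
printf '%s\n' "$around"
# line 2
# she took out her watch
# line 4
```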

It is also possible to search every file in a directory, including its subdirectories, with the -R option.


# print lines that begin with 'chapter' (ignoring case) in any file under the current directory
grep -Ri '^chapter' .

# print lines that begin with 'chapter' (ignoring case) in any file under the 'linux_data_analysis' folder
grep -Ri '^chapter' ~/Documents/Projects/linux_data_analysis
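A recursive search can be tried safely in a scratch directory. In this sketch the directory layout and the file names `one.txt` and `nested/two.txt` are invented for illustration.

```shell
# Build a small scratch directory tree with two files
workdir=$(mktemp -d)
mkdir -p "$workdir/nested"
printf 'Chapter one\nsome text\n' > "$workdir/one.txt"
printf 'CHAPTER two\nmore text\n' > "$workdir/nested/two.txt"

# Each match is prefixed with the path of the file it was found in;
# the order in which files are visited is not guaranteed
matches=$(cd "$workdir" && grep -Ri '^chapter' .)
printf '%s\n' "$matches"
```

When more than one file is searched, grep prints the file path before each matching line so the matches can be traced back to their source.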

`awk`

awk is a convenient and expressive programming language that can be used for pattern scanning and data manipulation tasks.

Basic usage


# Awk can be used to print the contents of a file
awk '{ print }' iris.csv

# data can be piped into awk for it to work on
cat iris.csv | awk '{ print }'
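The awk examples in this section read a file named iris.csv. If it is not at hand, a tiny stand-in is enough to follow along. The sketch below assumes the file has five comma-separated columns (four measurements and a species name) and no header row; the three rows are illustrative only.

```shell
# Create a small stand-in for iris.csv in a scratch directory
workdir=$(mktemp -d)
cat > "$workdir/iris.csv" <<'EOF'
5.1,3.5,1.4,0.2,Iris-setosa
4.9,2.4,3.3,1.0,Iris-versicolor
7.0,3.2,4.7,1.4,Iris-versicolor
EOF

# With no pattern, { print } runs for every line, so the output matches cat
printed=$(awk '{ print }' "$workdir/iris.csv")
printf '%s\n' "$printed"
```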

By default awk splits each line into fields on runs of whitespace (spaces and tabs).

Using the -F option we can tell awk to use a different field separator, such as a comma.


# get the first column of a comma separated file
cat iris.csv | awk -F ',' '{ print $1 }'

# get the third column of a comma separated file
cat iris.csv | awk -F ',' '{ print $3 }'

# get the 2nd and 4th column of a comma separated file
cat iris.csv | awk -F ',' '{ print $2,$4 }'

# the last column can be referenced with "$NF"
# get the last column of a comma separated file
cat iris.csv | awk -F ',' '{ print $NF }'
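Note that `print $2,$4` joins the fields with awk's output field separator, which is a space by default, so the commas are lost. The sketch below shows this on two illustrative rows (assumed to follow the layout of iris.csv) and uses `-v OFS=','` to keep the output comma-separated.

```shell
# Two illustrative rows in the assumed iris.csv layout
rows='5.1,3.5,1.4,0.2,Iris-setosa
7.0,3.2,4.7,1.4,Iris-versicolor'

# The comma in print inserts the output field separator (a space by default)
second_and_fourth=$(printf '%s\n' "$rows" | awk -F ',' '{ print $2,$4 }')
printf '%s\n' "$second_and_fourth"
# 3.5 0.2
# 3.2 1.4

# Setting OFS keeps the output comma-separated
with_commas=$(printf '%s\n' "$rows" | awk -F ',' -v OFS=',' '{ print $2,$4 }')
printf '%s\n' "$with_commas"
# 3.5,0.2
# 3.2,1.4

# NF is the number of fields, so $NF is the last field on each line
last=$(printf '%s\n' "$rows" | awk -F ',' '{ print $NF }')
printf '%s\n' "$last"
# Iris-setosa
# Iris-versicolor
```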

Awk can be used to filter lines based on the line containing a string.


# print lines containing "Iris-versicolor"
cat iris.csv | awk '/Iris-versicolor/ { print }'
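A `/pattern/` test is applied to the whole line. When the string should be matched in one particular column, a field comparison is an alternative. The sketch below assumes the species name is the fifth column and uses illustrative rows.

```shell
# Illustrative rows in the assumed iris.csv layout
rows='5.1,3.5,1.4,0.2,Iris-setosa
4.9,2.4,3.3,1.0,Iris-versicolor
7.0,3.2,4.7,1.4,Iris-versicolor'

# Regular expression tested against the whole line
by_line=$(printf '%s\n' "$rows" | awk '/Iris-versicolor/ { print }')
printf '%s\n' "$by_line"
# 4.9,2.4,3.3,1.0,Iris-versicolor
# 7.0,3.2,4.7,1.4,Iris-versicolor

# Exact string comparison against the fifth field only
by_field=$(printf '%s\n' "$rows" | awk -F ',' '$5 == "Iris-versicolor" { print }')
printf '%s\n' "$by_field"
# 4.9,2.4,3.3,1.0,Iris-versicolor
# 7.0,3.2,4.7,1.4,Iris-versicolor
```

On this sample both forms select the same rows; they differ only when the string also appears in another column.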

Awk can be used to filter lines based on the numerical value of a field


# print lines where the second column is greater than or equal to 3
cat iris.csv | awk -F ',' '$2 >= 3 { print }'
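If the CSV file starts with a header row, the header text is not a number, so awk compares it as a string and it may slip through a numeric filter. One way to guard against this is to skip the first record with `NR > 1`. This is a sketch under assumptions: the header names below are hypothetical and the rows are illustrative.

```shell
# Illustrative data with a hypothetical header row
rows='sepal_length,sepal_width,petal_length,petal_width,species
5.1,3.5,1.4,0.2,Iris-setosa
4.9,2.4,3.3,1.0,Iris-versicolor
7.0,3.2,4.7,1.4,Iris-versicolor'

# NR is the current record (line) number; NR > 1 skips the header
filtered=$(printf '%s\n' "$rows" | awk -F ',' 'NR > 1 && $2 >= 3 { print }')
printf '%s\n' "$filtered"
# 5.1,3.5,1.4,0.2,Iris-setosa
# 7.0,3.2,4.7,1.4,Iris-versicolor
```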