In this short tutorial we will learn how to retrieve the contents of a file as a list of words with Python.
Define sample file contents
Let’s start by creating a new file with some sample content, so that you can follow along with this tutorial.
from pathlib import Path
f_path = Path(r"C:\WorkDir\word_file.txt")
f_path.write_text('This is our log file that we will parse using Python code.')
The write_text() call returns the integer 58, which is the number of characters written to the file.
Read the words from a file into a list
To create a list of words from the file we’ll use two methods:
- read(): reads the entire file contents into a string object.
- split(): divides a string on whitespace and returns the pieces as a list.
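To see what split() does on its own, here is a minimal sketch that works on an in-memory string instead of a file. The sample strings are made up for this illustration. Called without arguments, split() treats any run of whitespace (spaces, tabs, newlines) as a single separator, whereas an explicit separator such as " " can produce empty strings.

```python
# Hypothetical sample text, not read from a file
sample = "alpha  beta\tgamma \n delta"

# No separator: any run of whitespace counts as a single delimiter
words = sample.split()
print(words)

# Explicit separator: two consecutive spaces leave an empty string behind
pieces = "alpha  beta".split(" ")
print(pieces)
```

This is why the no-argument form is usually the convenient choice when reading words from a file.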
We’ll also use the with block, which takes care of the file handling and saves us the need to explicitly close the file after reading it.
with open(f_path, 'r') as f_object:
    word_lst = f_object.read().split()
If we print word_lst, we’ll get a word-by-word split of our file.
print(word_lst)
['This', 'is', 'our', 'log', 'file', 'that', 'we', 'will', 'parse', 'using', 'Python', 'code.']
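The example above assumes that a C:\WorkDir folder exists on a Windows machine. As a portable variation, the sketch below writes the same sentence into a temporary directory and wraps the read-and-split step in a small helper. The helper name read_words and the explicit UTF-8 encoding are assumptions made for this illustration and are not part of the original example.

```python
import tempfile
from pathlib import Path


def read_words(path, encoding="utf-8"):
    """Return the whitespace-separated words of a text file as a list."""
    with open(path, "r", encoding=encoding) as f_object:
        return f_object.read().split()


# Create the sample file in a temporary directory that is removed afterwards
with tempfile.TemporaryDirectory() as tmp_dir:
    f_path = Path(tmp_dir) / "word_file.txt"
    f_path.write_text(
        "This is our log file that we will parse using Python code.",
        encoding="utf-8",
    )
    word_lst = read_words(f_path)

print(len(word_lst))  # number of words found in the file
```

Passing the encoding explicitly avoids depending on the platform default, which can differ between operating systems.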
Read multiple line file word by word
We can also use the method outlined above to split multi-line files into words. But what if we want a separate word-by-word list for every line in our file?
Let’s define a multi-line file.
from pathlib import Path
# define content for a multi line file
f_path = Path(r"C:\WorkDir\multi_line_file.txt")
f_path.write_text('This is our log file that we will parse using Python code. \n This is the second line.')
# open the file and read line by line
with open(f_path, 'r') as f_object:
    line_lst = f_object.readlines()
# use a list comprehension to split the line strings into words
multi_word_lst = [line.split() for line in line_lst]
# output the list
print(multi_word_lst)
This prints a list of lists, in which each inner list holds the words of one line:
[['This', 'is', 'our', 'log', 'file', 'that', 'we', 'will', 'parse', 'using', 'Python', 'code.'], ['This', 'is', 'the', 'second', 'line.']]
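As a variation on the approach above, here is a minimal sketch that iterates over the file object directly, which yields one line at a time without calling readlines(). It also skips blank lines, which would otherwise show up as empty lists. The helper name read_words_per_line, the sample text, and the temporary file are assumptions made for this example.

```python
import tempfile
from pathlib import Path


def read_words_per_line(path, encoding="utf-8"):
    """Return one list of words per non-blank line of a text file."""
    with open(path, "r", encoding=encoding) as f_object:
        # line.strip() is empty for blank lines, so those are filtered out
        return [line.split() for line in f_object if line.strip()]


# Write a small sample file that contains a blank line in the middle
with tempfile.TemporaryDirectory() as tmp_dir:
    f_path = Path(tmp_dir) / "multi_line_file.txt"
    f_path.write_text("first line here\n\n second line\n", encoding="utf-8")
    multi_word_lst = read_words_per_line(f_path)

print(multi_word_lst)
```

Iterating over the file object reads lazily, which can matter for large files, although the resulting list of lists is still held in memory.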