How to remove trailing punctuation in Python
If you have ever worked on processing a large amount of textual data, you know the pain of finding and removing irrelevant words or characters from the text.
Removing punctuation is a common preprocessing step in many data analysis and machine learning tasks.

Using replace method

Python strings come with many useful
methods. One such method is the replace method.

    s = "Hello World, Welcome to my blog."
    print(s)
    s1 = s.replace('W', 'V')
    print(s1)

This method, by default, replaces all occurrences of a given character or substring in the given string. An optional third argument, count, limits how many occurrences are replaced. Here’s an example where we first use the default value of count (-1) and then pass a custom value for it.

    s = "Hello world, Welcome to my blog."
    print(s)
    s1 = s.replace('o', 'a')
    print(f"After replacing all o's with a's: {s1}")
    # replace only first 2 o's
    s2 = s.replace('o', 'a', 2)
    print(f"After replacing first two o's: {s2}")

It is important
to note that in all our usages of the replace method, we’ve stored the result string in a new variable. Strings in Python are immutable, so replace returns a new string instead of modifying the original.

Now let’s figure out how to use this method to remove all occurrences of punctuation from a string. We must first define a list of all the punctuation that we are not interested in and want to get rid of.

    user_comment = "NGL, i just loved the moviee...... excellent work !!!"
    print(f"input string: {user_comment}")
    clean_comment = user_comment  # copy the string into a new variable; the result is stored here
    # define list of punctuation to be removed
    punctuation = [',', '.', '!']
    # iteratively remove all occurrences of each punctuation in the input
    for p in punctuation:
        clean_comment = clean_comment.replace(p, '')  # not specifying 3rd param, since we want to remove all occurrences
    print(f"clean string: {clean_comment}")

Since it was a short text, we could anticipate what kind of punctuation we would encounter. For larger inputs, listing every punctuation character by hand is impractical. Python’s string module provides the ASCII punctuation characters as the constant string.punctuation.

    import string
    all_punctuation = string.punctuation
    print(f"All punctuation: {all_punctuation}")

Once we have all the punctuation as a sequence of characters, we can run the previous for loop on any text input, however large, and the output will be free of these punctuation characters.

Using maketrans and translate

There is another way in Python to replace all occurrences of a set of characters in a string with their desired equivalents: the str.maketrans and str.translate methods. Let’s understand this through a simple example. We will replace all occurrences of ‘a’ with ‘e’, ‘o’ with ‘u’, and ‘i’ with ‘y’.

    tr_table = str.maketrans('aoi', 'euy')  # defining the translation table: a=>e, o=>u, i=>y
    s = "i absolutely love the american ice-cream!"
    print(f"Original string: {s}")
    s1 = s.translate(tr_table)  # or str.translate(s, tr_table)
    print(f"Translated string: {s1}")

In the maketrans method, the first two strings need to be of equal length, as each character in the 1st string corresponds to its replacement/translation in the 2nd string.
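The equal-length requirement is easy to check for yourself. The following is a minimal sketch rather than part of the original examples; the sample word and the variable names are made up for illustration. It builds the same table as above, then shows that two strings of different lengths make str.maketrans raise a ValueError.

```python
# Same mapping as in the example above: a=>e, o=>u, i=>y
tr_table = str.maketrans("aoi", "euy")
translated = "radio".translate(tr_table)  # "radio" is an arbitrary sample word
print(f"Translated: {translated}")

# Two-string form with unequal lengths: Python rejects the call
try:
    str.maketrans("aoi", "eu")
    length_error = None
except ValueError as exc:
    length_error = exc
print(f"maketrans with unequal lengths failed: {length_error}")
```

If a one-to-one mapping is not enough, for example when some characters should simply be deleted, the two-string form is not the right tool.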
We can also create the translation table using a dictionary of mappings instead of the two string parameters. This additionally allows us to create character-to-string mappings, which let us replace a single character with a multi-character string (which is
impossible with string parameters). Let us use the previous example and create the mapping using a dictionary.

    mappings = {
        'a': 'e',
        'o': 'u',
        'i': 'eye',
        '!': None
    }
    tr_table = str.maketrans(mappings)
    s = "i absolutely love the american ice-cream!"
    print(f"Original string: {s}")
    print(f"translation table: {tr_table}")
    s1 = s.translate(tr_table)  # or str.translate(s, tr_table)
    print(f"Translated string: {s1}")

Note that when we print the translation table, the keys are integers instead of characters. These are the Unicode code points of the characters we defined when creating the table. A character mapped to None, such as ‘!’ here, is deleted from the string.

Finally, let’s use this approach to remove all punctuation occurrences from a given input text. When maketrans is called with three arguments, every character in the third string is mapped to None, so passing string.punctuation there deletes all punctuation.

    import string
    s = """I reached at the front of the billing queue. The cashier started scanning my items, one after the other. Off went from my cart the almonds, the butter, the sugar, the coffee.... when suddenly I heard an old lady, the 3rd in queue behind me, scream at me, "What y'all taking all day for ! are you hoarding for the whole year !". The cashier looked tensed, she dashed all the remaining products as fast as she could, and then squeaked in a nervous tone, "That would be 298.5, sir !"."""
    print(f"input string:\n{s}\n")
    tr_table = str.maketrans("", "", string.punctuation)
    s1 = s.translate(tr_table)
    print(f"translated string:\n{s1}\n")

Using RegEx

RegEx, or Regular Expression, is a sequence of characters representing a string pattern. Python’s built-in re module lets us search a string for substrings that match such a pattern.

    import re
    # define regex pattern for 3-lettered country codes.
    c_pattern = re.compile("[A-Z]{3}")
    s = "At the Olympics, the code for Japan is JPN, and that of Brazil is BRA. RSA stands for the 'Republic of South Africa' while ARG for Argentina."
    print(f"Input: {s}")
    # find all substrings matching the above regex
    countries = re.findall(c_pattern, s)
    print(f"Countries fetched: {countries}")

All occurrences of 3-lettered uppercase codes have been identified with the help of the regex we defined.
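One caveat with the pattern above: [A-Z]{3} also matches three consecutive uppercase letters inside a longer uppercase word. The sketch below is an illustration, not part of the original example. It assumes that only standalone three-letter codes are wanted and adds word boundaries (\b) to the pattern; the sample sentence and variable names are made up.

```python
import re

# Pattern from the example above vs. a stricter variant with word boundaries
loose_pattern = re.compile(r"[A-Z]{3}")
strict_pattern = re.compile(r"\b[A-Z]{3}\b")

# Made-up sentence containing a longer uppercase word (NASA)
s = "NASA sent the JPN and BRA teams a note."

loose_matches = loose_pattern.findall(s)    # also picks up the "NAS" inside "NASA"
strict_matches = strict_pattern.findall(s)  # only standalone 3-letter codes

print(f"Without word boundaries: {loose_matches}")
print(f"With word boundaries: {strict_matches}")
```

Whether the stricter pattern is preferable depends on the input; for text that contains no longer uppercase words, both patterns return the same matches.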
If we want to replace every substring that matches the pattern with something else, we can do so using the re.sub
method.

    import re
    c_pattern = re.compile("[A-Z]{3}")
    s = "At the Olympics, the code for Japan is JPN, and that of Brazil is BRA. RSA stands for the 'Republic of South Africa' while ARG for Argentina.\n"
    print(f"Input:\n{s}")
    new_s = re.sub(c_pattern, "DEF", s)
    print(f"After replacement:\n{new_s}")

We can use the same method to
replace all occurrences of the punctuation with an empty string. This would effectively remove all the punctuation from the input string. We can define the regex for punctuation in two ways: by listing every punctuation character explicitly, or by excluding the characters we want to keep. For example, if we know that we can expect only the English alphabet, digits, and whitespace, then we can exclude them all in our regex using the caret symbol ^. Let’s define it both ways.

    import string, re
    # explicit way: a character class of all punctuation, escaped so that ], \, ^ and - are taken literally
    p_punct1 = re.compile(f"[{re.escape(string.punctuation)}]")
    print(f"regex 1 for punctuation: {p_punct1}")
    # definition by exclusion: anything that is neither a word character nor whitespace
    p_punct2 = re.compile(r"[^\w\s]")
    print(f"regex 2 for punctuation: {p_punct2}")

Now let us use both of them to remove all the punctuation from a sentence. We’ll use an earlier sentence that contains various punctuation.

    s = """I reached at the front of the billing queue. The cashier started scanning my items, one after the other. Off went from my cart the almonds, the butter, the sugar, the coffee.... when suddenly I heard an old lady, the 3rd in queue behind me, scream at me, "What y'all taking all day for ! are you hoarding for the whole year !". The cashier looked tensed, she dashed all the remaining products as fast as she could, and then squeaked in a nervous tone, "That would be 298.5, sir !"."""
    print(f"input string:\n{s}\n")
    s1 = re.sub(p_punct1, "", s)
    print(f"after removing punctuation using 1st regex:\n{s1}\n")
    s2 = re.sub(p_punct2, "", s)
    print(f"after removing punctuation using 2nd regex:\n{s2}\n")

Both of them produced results identical to each other and to the maketrans method we used earlier. Keep in mind that \w also matches the underscore, so the second pattern leaves underscores in place.

Using nltk

Python’s nltk is a popular, open-source NLP library. It offers a large range of language datasets, text-processing modules, and a host of other features required in NLP. Let’s first split a sentence into tokens with nltk’s word_tokenize function.

    import nltk
    # word_tokenize needs NLTK's tokenizer data, which can be fetched with nltk.download
    s = "We can't lose this game so easily, not without putting up a fight!"
    tokens = nltk.word_tokenize(s)
    print(f"input: {s}")
    print(f"tokens: {tokens}")

The default tokenizer used by nltk retains punctuation and splits the tokens based on whitespace and punctuation. We can use nltk’s RegexpTokenizer to specify token patterns using regex.

    from nltk.tokenize import RegexpTokenizer
    tokenizer = RegexpTokenizer(r"\w+")  # \w+ matches runs of word characters: a-z, A-Z, 0-9 and _
    s = "We can't lose this game so easily, not without putting up a fight!"
    tokens = tokenizer.tokenize(s)
    print(f"input: {s}\n")
    print(f"tokens: {tokens}\n")
    new_s = " ".join(tokens)
    print(f"New string: {new_s}\n")

Note that this pattern also splits contractions: can't becomes the two tokens can and t.

Remove punctuation from start and end only

If we want to remove the punctuation only from the start and end of the sentence, and not the punctuation in between, we can define a regex representing such a pattern and use it to remove the leading and the trailing punctuation. Let’s first use one such regular expression in an example, and then we will dive deeper into that regex.

    import re
    pattern = re.compile(r"(^[^\w\s]+)|([^\w\s]+$)")
    sentence = '"I am going to be the best player in history!"'
    print(sentence)
    print(re.sub(pattern, "", sentence))

The output shows that the quotes (") at the beginning and end, as well as the exclamation mark (!) at the second-to-last position, have been removed. The regex being used to achieve this is (^[^\w\s]+)|([^\w\s]+$)

There are two different patterns in this regex, each enclosed in parentheses and separated by an OR sign (|). That means, if either of the two patterns exists in the string, it will be identified by the given regex. The first component, ^[^\w\s]+, matches one or more characters that are neither word characters nor whitespace AT THE START of the string. This is denoted by the leading character ^. The second component is almost identical to the first one, except that it matches the specified set of characters occurring AT THE END of the string. This is denoted by the trailing character $.

Remove punctuation and extra spaces

In addition to removing punctuation, removing extra spaces is a common preprocessing step.

    s = " I have an idea! \t "
    print(f"input string with white spaces = {s}, length = {len(s)}\n")
    s1 = s.strip()
    print(f"after removing spaces from both ends: {s1}, length = {len(s1)}")

The strip method removes white spaces only at the beginning and end of the string. To also collapse extra whitespace between words, we can split the string on whitespace and join the pieces with a single space. Let us combine the removal of punctuation and extra spaces in an example.

    import string
    tr_table = str.maketrans("", "", string.punctuation)  # for removing punctuation
    s = ' " I am going to be the best,\t the most-loved, and... the richest player in history! " '
    print(f"Original string:\n{s},length = {len(s)}\n")
    s = s.translate(tr_table)
    print(f"After removing punctuation:\n{s},length = {len(s)}\n")
    s = " ".join(s.split())  # split on any whitespace, then rejoin with single spaces
    print(f"After removing extra spaces:\n{s},length = {len(s)}")

Remove punctuation from a text file

So far, we have been working on short strings that were stored in variables
of type str and were no longer than 2-3 sentences. Real-world text, however, is often stored in files on disk. First, let’s read the whole content of the file into a string variable and use one of our earlier methods to remove the punctuation from this content string before writing it into a new file.

    import re
    punct = re.compile(r"[^\w\s]")
    input_file = "short_sample.txt"
    output_file = "short_sample_processed.txt"
    with open(input_file) as f:
        file_content = f.read()  # reading entire file content as string
    print(f"File content: {file_content}\n")
    new_file_content = re.sub(punct, "", file_content)
    print(f"New file content: {new_file_content}\n")
    # writing it to new file
    with open(output_file, "w") as fw:
        fw.write(new_file_content)

We read the entire file at once in the above example. The text file, however, may also span content up to millions of lines, amounting to a few hundred MBs or a few
GBs. Loading such a file into memory all at once is wasteful and may not even be possible. So, we will read the text file one line at a time, process it, and write it to the new file. In the following example, we will remove punctuation from a text file, which is a story about ‘The Devil With Three Golden Hairs’!

    import re
    punct = re.compile(r"[^\w\s]")
    input_file = "the devil with three golden hairs.txt"
    output_file = "the devil with three golden hairs_processed.txt"
    # read line by line and write each processed line to the new file
    with open(input_file) as f_reader, open(output_file, "w") as f_writer:
        for line in f_reader:
            line = line.strip()  # removing whitespace at ends
            line = re.sub(punct, "", line)  # removing punctuation
            line += "\n"
            f_writer.write(line)
    print("First 10 lines of original file:")
    with open(input_file) as f:
        i = 0
        for line in f:
            print(line, end="")
            i += 1
            if i == 10:
                break
    print("\nFirst 10 lines of output file:")
    with open(output_file) as f:
        i = 0
        for line in f:
            print(line, end="")
            i += 1
            if i == 10:
                break

As seen from the first 10 lines, the punctuation has been removed from the input file, and the result is stored in the output file.

Remove all punctuation except apostrophe

Apostrophes, in the English language, carry semantic meanings. They are used to show possessive nouns, to shorten words by the omission of letters (e.g. cannot = can’t, will not = won’t), etc. So it becomes important to retain the apostrophe characters while processing texts to avoid losing these semantic meanings. Let us remove all the punctuation but the apostrophes from a text.

    import string
    s=""""I should like to have three golden hairs from the devil's head", answered he, "else I cannot keep my wife". No sooner had he entered than he noticed that the air was not pure. "I smell man's flesh", said he, "all is not right here".
    The queen, when she had received the letter and read it, did as was written in it, and had a splendid wedding-feast prepared, and the king's daughter was married to the child of good fortune, and as the youth was handsome and friendly she lived with him in joy and contentment."""
    print(f"Input text:\n{s}\n")
    tr_table = str.maketrans("", "", string.punctuation)
    del tr_table[ord("'")]  # deleting ' from translation table
    print(f"Removing punctuation except apostrophe:\n{s.translate(tr_table)}\n")

A translation table is a dictionary whose keys are integer values. They are the Unicode code points of the characters, which is why we delete the entry for the apostrophe using ord("'").

Performance Comparison

Now that we have seen so many different ways of removing punctuation in Python, let us compare them in terms of their time consumption. We will compare the performances of replace, maketrans, regex, and nltk. We will use the tqdm module to measure the performance of each method.

The str.maketrans method, in combination with str.translate, is the fastest method of all: it took 26 seconds to finish 100000 iterations.

Conclusion

In this tutorial, we looked at and analyzed various methods of removing punctuation from text data. We began by looking at the str.replace method. Then, we saw the use of translation tables to replace certain characters with other characters or None. We then used the powerful regex
expressions to match all punctuation in the string and remove it. We also saw how we can remove punctuation only from the start and end of the string. We saw how we can remove punctuation from text of any length stored in an external text file, and write the processed text to another text file. Finally, we compared the performances of the 4 prominent methods we saw for removing punctuation from a string.

Mokhtar is the founder of LikeGeeks.com. He has worked as a Linux system administrator since 2010. He is responsible for maintaining, securing, and troubleshooting Linux servers for multiple clients around the world. He loves writing shell and Python scripts to automate his work.