Converting HTML to Text in Python in 12 Lines of Code

I wrote this article because the problem took me about 2 hours to solve. It seems like most sources are outdated in their methods. Many people were talking about the html2text module, which appeared to be deprecated when I looked, and many others recommended nltk. Well, the final result is that BeautifulSoup now does a better job than either of them. Buuuut…

All the resources pointed to a method called get_text(), which did not work for me. With the old BeautifulSoup 3 import used in the code below, the method is getText(), in camel case. (get_text() is the name used by the newer bs4 package.)

Why would I need to do this? Well, especially when some networks encode their content with strange HTML characters that make scraping a nightmare. Here is the chunk of code (Python 2), explained in the comments.

 

from BeautifulSoup import BeautifulSoup as Soup
import urllib2

url = 'http://google.com'
html = urllib2.urlopen(url).read()  # make the request to the URL
soup = Soup(html)  # parse the response body with Soup
for script in soup(["script", "style"]):  # find every <script> and <style> tag
    script.extract()  # strip it out of the tree
text = soup.getText()  # this is the method that cost me about 40 minutes
text = text.encode('utf-8')  # encode once; encoding the result a second time fails on non-ASCII text
# nltk.clean_html(html) was the older approach; NLTK 3 no longer implements it
print(text)
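
The snippet above depends on Python 2 and BeautifulSoup 3. As a rough point of comparison, here is a minimal Python 3 sketch of the same idea (drop everything inside script and style tags, keep the remaining text) using only the standard library's html.parser instead of BeautifulSoup. This is a swapped-in technique, not the method from this article: the names TextExtractor and html_to_text and the inline sample_html are made up for illustration, and the whitespace handling is deliberately simple.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Hypothetical helper: collects text nodes outside <script>/<style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; into characters
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0  # greater than 0 while inside a skipped tag
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is not inside a skipped tag
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def get_text(self):
        return " ".join(self._chunks)


def html_to_text(html):
    """Return the visible text of an HTML string (simplified)."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.get_text()


# Inline sample so the sketch runs without a network request
sample_html = (
    "<html><head><style>p { color: red; }</style>"
    "<script>var x = 1;</script></head>"
    "<body><h1>Hello</h1><p>Caf&eacute; &amp; bar</p></body></html>"
)
print(html_to_text(sample_html))
```

On real pages BeautifulSoup is far more forgiving of broken markup, so treat this only as a dependency-free fallback.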

So now you know how to get the text from a response, and from here it is easy to pull data out of it using regular expressions 🙂
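
To give an idea of that last step, here is a small sketch (Python 3) of running regular expressions over already-extracted text. The sample string and both patterns are assumptions made up for illustration; the email pattern in particular is a simplification and will not cover every valid address.

```python
import re

# Hypothetical text, standing in for the output of the HTML-to-text step
text = "Contact sales@example.com or support@example.org before 2024-01-31."

# Simplified patterns, good enough for a demo but not for production use
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
DATE_PATTERN = re.compile(r"\d{4}-\d{2}-\d{2}")

emails = EMAIL_PATTERN.findall(text)
dates = DATE_PATTERN.findall(text)

print(emails)
print(dates)
```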
