Converting HTML to Text on Python in 12 lines of code

html2textI made an article about this because it took me about 2 hours to solve it. Seems like all sources are outdated with their methods. Many people were talking about using module html2text which is now deprecated, then many others recommended nltk and well.. The final result is that now BeautifulSoup does a better job that them. Buuuut…..

All resources were saying about this function called get_text() which is completely incorrect, it must be getText() camel case in order to make this conversion work.

Why I would need to do this? Well, specially when some networks try to encode with strange HTML DOM characters that makes your scraping a nightmare. But here is the chunk of code explained.


So you now know how to get text from a response, it will be now easy to get some data using Regular Expressions 🙂

Leave a Reply