I wrote this article because it took me about two hours to solve this. It seems like all the sources out there are outdated in their methods: many people were talking about the html2text module, which is now deprecated, and many others recommended nltk. The end result is that BeautifulSoup now does a better job than either of them. Buuuut…..
All the resources kept mentioning a function called get_text(), which simply isn't there in the old BeautifulSoup 3 module this post imports; with that version it has to be getText(), in camelCase, for the conversion to work. (In the newer bs4 package, both spellings exist.)
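Here is a quick way to check both spellings yourself, assuming you have the newer bs4 package installed (pip install beautifulsoup4):

from bs4 import BeautifulSoup

soup = BeautifulSoup('<p>Hello &amp; goodbye</p>', 'html.parser')
print(soup.get_text())   # prints: Hello & goodbye
print(soup.getText())    # same output; bs4 keeps the camelCase name as an alias

Notice that bs4 also decodes HTML entities like &amp; for you, which is exactly the nightmare scenario below.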
Why would I need to do this? Well, especially because some sites encode their content with strange HTML entities that make scraping a nightmare. Here is the chunk of code, explained.
from BeautifulSoup import BeautifulSoup as Soup
import re, urllib2

url = 'http://google.com'
html = urllib2.urlopen(url).read()        # make the request to the url
soup = Soup(html)                         # run Soup on the response we read
for script in soup(["script", "style"]):  # you need to find the <script> and <style> tags
    script.extract()                      # and strip them off the tree
text = soup.getText()                     # this is the method that cost me about 40 minutes
text = text.encode('utf-8')               # encode the text so it prints safely
# raw = nltk.clean_html(html)             # nltk's old helper; removed in newer nltk versions
print(text)                               # already utf-8 bytes, don't encode it a second time
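The snippet above is Python 2 (urllib2 and the old BeautifulSoup 3 module). If you are on Python 3, a minimal sketch of the same idea, assuming bs4 is installed, looks like this:

from urllib.request import urlopen
from bs4 import BeautifulSoup

url = 'http://google.com'
html = urlopen(url).read()               # fetch the page
soup = BeautifulSoup(html, 'html.parser')
for tag in soup(["script", "style"]):    # drop the <script> and <style> blocks
    tag.extract()
text = soup.get_text()                   # snake_case works fine in bs4
print(text)                              # Python 3 strings print without manual encoding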
So now you know how to get plain text out of a response; from here it's easy to pull data out of it with Regular Expressions 🙂
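For example, once you have the plain text, something like this pulls out email-looking strings (the pattern here is just a toy one I made up for illustration, not a robust email matcher):

import re

text = 'Contact us at hello@example.com or sales@example.com for details.'
emails = re.findall(r'[\w.+-]+@[\w-]+\.[\w.]+', text)  # toy pattern, tighten it for real use
print(emails)  # ['hello@example.com', 'sales@example.com']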