{"id":39,"date":"2016-07-16T01:04:48","date_gmt":"2016-07-16T01:04:48","guid":{"rendered":"http:\/\/wizardofbots.com\/network\/?p=39"},"modified":"2016-07-16T01:09:48","modified_gmt":"2016-07-16T01:09:48","slug":"converting-html-to-text-on-python-in-12-lines-of-code","status":"publish","type":"post","link":"http:\/\/wizardofbots.com\/network\/converting-html-to-text-on-python-in-12-lines-of-code\/","title":{"rendered":"Converting HTML to Text on Python in 12 lines of code"},"content":{"rendered":"<p><img loading=\"lazy\" decoding=\"async\" class=\"alignleft wp-image-40\" src=\"http:\/\/wizardofbots.com\/network\/wp-content\/uploads\/2016\/07\/html2text.jpg\" alt=\"html2text\" width=\"200\" height=\"200\" srcset=\"http:\/\/wizardofbots.com\/network\/wp-content\/uploads\/2016\/07\/html2text.jpg 300w, http:\/\/wizardofbots.com\/network\/wp-content\/uploads\/2016\/07\/html2text-100x100.jpg 100w, http:\/\/wizardofbots.com\/network\/wp-content\/uploads\/2016\/07\/html2text-150x150.jpg 150w\" sizes=\"(max-width: 200px) 100vw, 200px\" \/>I wrote this article because it took me about 2 hours to solve this problem. Most of the sources I found describe outdated methods. Many people recommended the <strong>html2text<\/strong> module, which seemed to be deprecated, and many others recommended <strong>nltk<\/strong>. In the end, BeautifulSoup does a better job than either of them. Buuuut&#8230;<\/p>\n<p>All the resources I found mention a method called get_text(), which does not work with the BeautifulSoup 3 import used here (get_text() is the bs4 spelling). With BeautifulSoup 3 it must be the camel-case <strong>getText()<\/strong> for this conversion to work.<\/p>\n<p>Why would I need to do this? Well, especially when some networks encode their pages with strange HTML DOM characters that make your scraping a nightmare. 
Here is the chunk of code, explained line by line.<\/p>\n<p>&nbsp;<\/p>\n<pre class=\"lang:python decode:true \">from BeautifulSoup import BeautifulSoup as Soup\r\nimport urllib2\r\n \r\nurl = 'http:\/\/google.com'\r\nhtml = urllib2.urlopen(url).read() #make the request to the url and read the response\r\nsoup = Soup(html) #parse the response with BeautifulSoup 3\r\nfor script in soup([\"script\", \"style\"]): #find every &lt;script&gt; and &lt;style&gt; tag\r\n    script.extract() #strip them off so their contents don't end up in the text\r\ntext = soup.getText() #camel case in BeautifulSoup 3; this is the method that cost me about 40 minutes\r\ntext = text.encode('utf-8') #encode the unicode text to UTF-8 bytes, once\r\n#raw = nltk.clean_html(html) #the old nltk way, kept only for reference\r\nprint(text) #text is already encoded; encoding it again can raise UnicodeDecodeError<\/pre>\n<p>Now that you know how to get the text from a response, it will be easy to pull data out of it using regular expressions \ud83d\ude42<\/p>\n","protected":false},"excerpt":{"rendered":"<p>I made\u00a0an article about this because it took me about 2 hours to solve it. Seems like all sources are outdated with their\u00a0methods. Many people were talking about using module html2text which is now deprecated, then many others recommended nltk and well.. The final result is that now BeautifulSoup does a better job that them. 
[&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_bbp_topic_count":0,"_bbp_reply_count":0,"_bbp_total_topic_count":0,"_bbp_total_reply_count":0,"_bbp_voice_count":0,"_bbp_anonymous_reply_count":0,"_bbp_topic_count_hidden":0,"_bbp_reply_count_hidden":0,"_bbp_forum_subforum_count":0,"footnotes":""},"categories":[23],"tags":[25,26,28,27,24],"class_list":["post-39","post","type-post","status-publish","format-standard","hentry","category-python","tag-beautifulsoup","tag-code","tag-gettext","tag-html-to-text","tag-python"],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v23.7 - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>Converting HTML to Text on Python in 12 lines of code - Wizard Of Bots<\/title>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"http:\/\/wizardofbots.com\/network\/converting-html-to-text-on-python-in-12-lines-of-code\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Converting HTML to Text on Python in 12 lines of code - Wizard Of Bots\" \/>\n<meta property=\"og:description\" content=\"I made\u00a0an article about this because it took me about 2 hours to solve it. Seems like all sources are outdated with their\u00a0methods. Many people were talking about using module html2text which is now deprecated, then many others recommended nltk and well.. The final result is that now BeautifulSoup does a better job that them. 
[&hellip;]\" \/>\n<meta property=\"og:url\" content=\"http:\/\/wizardofbots.com\/network\/converting-html-to-text-on-python-in-12-lines-of-code\/\" \/>\n<meta property=\"og:site_name\" content=\"Wizard Of Bots\" \/>\n<meta property=\"article:published_time\" content=\"2016-07-16T01:04:48+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2016-07-16T01:09:48+00:00\" \/>\n<meta property=\"og:image\" content=\"http:\/\/wizardofbots.com\/network\/wp-content\/uploads\/2016\/07\/html2text.jpg\" \/>\n<meta name=\"author\" content=\"wizardofbots\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"wizardofbots\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"1 minute\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"WebPage\",\"@id\":\"http:\/\/wizardofbots.com\/network\/converting-html-to-text-on-python-in-12-lines-of-code\/\",\"url\":\"http:\/\/wizardofbots.com\/network\/converting-html-to-text-on-python-in-12-lines-of-code\/\",\"name\":\"Converting HTML to Text on Python in 12 lines of code - Wizard Of 
Bots\",\"isPartOf\":{\"@id\":\"https:\/\/wizardofbots.com\/network\/#website\"},\"primaryImageOfPage\":{\"@id\":\"http:\/\/wizardofbots.com\/network\/converting-html-to-text-on-python-in-12-lines-of-code\/#primaryimage\"},\"image\":{\"@id\":\"http:\/\/wizardofbots.com\/network\/converting-html-to-text-on-python-in-12-lines-of-code\/#primaryimage\"},\"thumbnailUrl\":\"http:\/\/wizardofbots.com\/network\/wp-content\/uploads\/2016\/07\/html2text.jpg\",\"datePublished\":\"2016-07-16T01:04:48+00:00\",\"dateModified\":\"2016-07-16T01:09:48+00:00\",\"author\":{\"@id\":\"https:\/\/wizardofbots.com\/network\/#\/schema\/person\/31f9e486da1c11791d94a861854a2a9f\"},\"breadcrumb\":{\"@id\":\"http:\/\/wizardofbots.com\/network\/converting-html-to-text-on-python-in-12-lines-of-code\/#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"http:\/\/wizardofbots.com\/network\/converting-html-to-text-on-python-in-12-lines-of-code\/\"]}]},{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"http:\/\/wizardofbots.com\/network\/converting-html-to-text-on-python-in-12-lines-of-code\/#primaryimage\",\"url\":\"http:\/\/wizardofbots.com\/network\/wp-content\/uploads\/2016\/07\/html2text.jpg\",\"contentUrl\":\"http:\/\/wizardofbots.com\/network\/wp-content\/uploads\/2016\/07\/html2text.jpg\",\"width\":300,\"height\":300},{\"@type\":\"BreadcrumbList\",\"@id\":\"http:\/\/wizardofbots.com\/network\/converting-html-to-text-on-python-in-12-lines-of-code\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\/\/wizardofbots.com\/network\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"Converting HTML to Text on Python in 12 lines of code\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/wizardofbots.com\/network\/#website\",\"url\":\"https:\/\/wizardofbots.com\/network\/\",\"name\":\"Wizard Of Bots\",\"description\":\"Botting and AI 
community\",\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/wizardofbots.com\/network\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"en-US\"},{\"@type\":\"Person\",\"@id\":\"https:\/\/wizardofbots.com\/network\/#\/schema\/person\/31f9e486da1c11791d94a861854a2a9f\",\"name\":\"wizardofbots\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/wizardofbots.com\/network\/#\/schema\/person\/image\/\",\"url\":\"http:\/\/2.gravatar.com\/avatar\/584eebc303f64610559ab9f305f6928d?s=96&d=mm&r=g\",\"contentUrl\":\"http:\/\/2.gravatar.com\/avatar\/584eebc303f64610559ab9f305f6928d?s=96&d=mm&r=g\",\"caption\":\"wizardofbots\"},\"url\":\"http:\/\/wizardofbots.com\/network\/author\/wizardofbots\/\"}]}<\/script>\n<!-- \/ Yoast SEO plugin. -->","yoast_head_json":{"title":"Converting HTML to Text on Python in 12 lines of code - Wizard Of Bots","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"http:\/\/wizardofbots.com\/network\/converting-html-to-text-on-python-in-12-lines-of-code\/","og_locale":"en_US","og_type":"article","og_title":"Converting HTML to Text on Python in 12 lines of code - Wizard Of Bots","og_description":"I made\u00a0an article about this because it took me about 2 hours to solve it. Seems like all sources are outdated with their\u00a0methods. Many people were talking about using module html2text which is now deprecated, then many others recommended nltk and well.. The final result is that now BeautifulSoup does a better job that them. 
[&hellip;]","og_url":"http:\/\/wizardofbots.com\/network\/converting-html-to-text-on-python-in-12-lines-of-code\/","og_site_name":"Wizard Of Bots","article_published_time":"2016-07-16T01:04:48+00:00","article_modified_time":"2016-07-16T01:09:48+00:00","og_image":[{"url":"http:\/\/wizardofbots.com\/network\/wp-content\/uploads\/2016\/07\/html2text.jpg"}],"author":"wizardofbots","twitter_card":"summary_large_image","twitter_misc":{"Written by":"wizardofbots","Est. reading time":"1 minute"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"WebPage","@id":"http:\/\/wizardofbots.com\/network\/converting-html-to-text-on-python-in-12-lines-of-code\/","url":"http:\/\/wizardofbots.com\/network\/converting-html-to-text-on-python-in-12-lines-of-code\/","name":"Converting HTML to Text on Python in 12 lines of code - Wizard Of Bots","isPartOf":{"@id":"https:\/\/wizardofbots.com\/network\/#website"},"primaryImageOfPage":{"@id":"http:\/\/wizardofbots.com\/network\/converting-html-to-text-on-python-in-12-lines-of-code\/#primaryimage"},"image":{"@id":"http:\/\/wizardofbots.com\/network\/converting-html-to-text-on-python-in-12-lines-of-code\/#primaryimage"},"thumbnailUrl":"http:\/\/wizardofbots.com\/network\/wp-content\/uploads\/2016\/07\/html2text.jpg","datePublished":"2016-07-16T01:04:48+00:00","dateModified":"2016-07-16T01:09:48+00:00","author":{"@id":"https:\/\/wizardofbots.com\/network\/#\/schema\/person\/31f9e486da1c11791d94a861854a2a9f"},"breadcrumb":{"@id":"http:\/\/wizardofbots.com\/network\/converting-html-to-text-on-python-in-12-lines-of-code\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["http:\/\/wizardofbots.com\/network\/converting-html-to-text-on-python-in-12-lines-of-code\/"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"http:\/\/wizardofbots.com\/network\/converting-html-to-text-on-python-in-12-lines-of-code\/#primaryimage","url":"http:\/\/wizardofbots.com\/network\/wp-content\/uploads\/2016\/07\/html2tex
t.jpg","contentUrl":"http:\/\/wizardofbots.com\/network\/wp-content\/uploads\/2016\/07\/html2text.jpg","width":300,"height":300},{"@type":"BreadcrumbList","@id":"http:\/\/wizardofbots.com\/network\/converting-html-to-text-on-python-in-12-lines-of-code\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/wizardofbots.com\/network\/"},{"@type":"ListItem","position":2,"name":"Converting HTML to Text on Python in 12 lines of code"}]},{"@type":"WebSite","@id":"https:\/\/wizardofbots.com\/network\/#website","url":"https:\/\/wizardofbots.com\/network\/","name":"Wizard Of Bots","description":"Botting and AI community","potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/wizardofbots.com\/network\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Person","@id":"https:\/\/wizardofbots.com\/network\/#\/schema\/person\/31f9e486da1c11791d94a861854a2a9f","name":"wizardofbots","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/wizardofbots.com\/network\/#\/schema\/person\/image\/","url":"http:\/\/2.gravatar.com\/avatar\/584eebc303f64610559ab9f305f6928d?s=96&d=mm&r=g","contentUrl":"http:\/\/2.gravatar.com\/avatar\/584eebc303f64610559ab9f305f6928d?s=96&d=mm&r=g","caption":"wizardofbots"},"url":"http:\/\/wizardofbots.com\/network\/author\/wizardofbots\/"}]}},"_links":{"self":[{"href":"http:\/\/wizardofbots.com\/network\/wp-json\/wp\/v2\/posts\/39"}],"collection":[{"href":"http:\/\/wizardofbots.com\/network\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"http:\/\/wizardofbots.com\/network\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"http:\/\/wizardofbots.com\/network\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"http:\/\/wizardofbots.com\/network\/wp-json\/wp\/v2\/comments?post=39"}],"version-history":[{"count":3,"href"
:"http:\/\/wizardofbots.com\/network\/wp-json\/wp\/v2\/posts\/39\/revisions"}],"predecessor-version":[{"id":43,"href":"http:\/\/wizardofbots.com\/network\/wp-json\/wp\/v2\/posts\/39\/revisions\/43"}],"wp:attachment":[{"href":"http:\/\/wizardofbots.com\/network\/wp-json\/wp\/v2\/media?parent=39"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"http:\/\/wizardofbots.com\/network\/wp-json\/wp\/v2\/categories?post=39"},{"taxonomy":"post_tag","embeddable":true,"href":"http:\/\/wizardofbots.com\/network\/wp-json\/wp\/v2\/tags?post=39"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}