Tonight I was watching Mr. Robot, season 2, episode 4, and it reminded me of the good old days when IRC was above any other social network. People met there in tons of channels to chat and have discussions. There were plenty of groups talking about all kinds of stuff, and the best crews were the ones with coders.

So, what do I like about IRC? There were plenty of cool things back in 1995. There were the amazing eggdrops, bots that you programmed to respond to different messages. They also had TCLs, meaning Tcl scripts that worked as add-ons/plugins you could adapt to your bot; many of them were cool group games. There were also PsyBNCs to keep you always online from your shell.

To install BitchX, the classic IRC client, on Ubuntu, just do the following:

mkdir bitchx
cd bitchx
nano install_bitchx.sh

And then just paste this code:

#!/bin/sh
####################################################################################
#
# Download Compile and Install BitchX on Ubuntu
#
####################################################################################

# download bitchx source

# @todo make smarter, i.e. regexp, though now uses _always_ available commands (sic)
DOWNLOAD_URL=$(curl -s http://bitchx.sourceforge.net |\
        grep "http://sourceforge.net" |\
         sed -e "s|.*href=\"||g" |\
         sed -e "s|\".*||g" |\
         grep "/download" | uniq) # should only be one

if [ "${DOWNLOAD_URL}" = "" ]; then
  echo "ERROR: Could not find DOWNLOAD_URL from http://bitchx.sourceforge.net"
  exit 255;
fi

# @todo make smarter, i.e. regexp, though now uses _always_ available commands (sic)
VERSION=$(echo "${DOWNLOAD_URL}" | sed -e "s|.*ircii-pana/bitchx-||g" | sed -e "s|\/.*||g")

if [ "${VERSION}" = "" ]; then
  echo "ERROR: Could not find VERSION from ${DOWNLOAD_URL}"
  exit 255;
fi

echo "Will try to download and install version ${VERSION}";

DOWNLOAD_URL=http://downloads.sourceforge.net/project/bitchx/ircii-pana/bitchx-${VERSION}/bitchx-${VERSION}.tar.gz

echo "Downloading: ${DOWNLOAD_URL}"
curl -f -L -s "${DOWNLOAD_URL}" -o "bitchx-${VERSION}.tar.gz" || exit 1

# install required dev libraries
sudo apt-get install libssl-dev ncurses-dev

# unpack source
tar -xzf "bitchx-${VERSION}.tar.gz"

# go to source dir (stop here if unpacking failed)
cd "bitchx-${VERSION}" || exit 1

# configure
./configure --prefix=/usr --with-ssl --with-plugins --enable-ipv6

# build
make

# install (change to "make install_local" for local installation; in your own $HOME)
sudo make install

# remove src + build
cd "$OLDPWD" && rm -rf "bitchx-${VERSION}"*

# done use "BitchX" to run...

Then you just have to:

chmod +x install_bitchx.sh
./install_bitchx.sh

This will install everything you need to run BitchX. When it finishes, start the client by typing:

BitchX

By the way, did you know you can connect to Elliot’s IRC session using this page? http://irc.colo-solutions.net/

I wrote this article about converting HTML to text (html2text) because it took me about 2 hours to solve. It seems like all the sources are outdated in their methods. Many people were talking about using the html2text module, which is now deprecated, and many others recommended nltk. The final result is that BeautifulSoup now does a better job than either of them. Buuuut…..

All the resources I found mention a method called get_text(), which did not work for me. With the BeautifulSoup 3 module imported below, the method is getText(), in camel case. The name get_text() belongs to the newer bs4 package.

Why would I need to do this? Well, especially because some sites encode their pages with strange HTML entities and markup that make scraping a nightmare. Here is the chunk of code, explained.

from BeautifulSoup import BeautifulSoup as Soup  # BeautifulSoup 3, Python 2
import urllib2

url = 'http://google.com'
html = urllib2.urlopen(url).read()  # make the request and read the response body
soup = Soup(html)  # parse the response
for tag in soup(["script", "style"]):  # find every <script> and <style> tag
    tag.extract()  # strip them out of the tree
text = soup.getText()  # this is the method that cost me about 40 minutes
# raw = nltk.clean_html(html)  # the older nltk alternative, not needed here
print(text.encode('utf-8'))  # encode once, only when printing
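If you are on Python 3, where urllib2 and the BeautifulSoup 3 module are not available, here is a rough sketch of the same idea using only the standard library's html.parser. This is a swapped-in technique, not the BeautifulSoup approach above. The names TextExtractor and html_to_text, and the sample markup, are made up for illustration.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text while skipping <script> and <style> contents."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0  # greater than 0 while inside a skipped tag

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text found outside <script>/<style>.
        if self.skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def html_to_text(html):
    """Return the text of an HTML string, joined with single spaces."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


# Hypothetical sample page, standing in for a downloaded response.
sample = (
    "<html><head><style>p {color: red}</style>"
    "<script>var x = 1;</script></head>"
    "<body><p>Hello &amp; welcome</p><p>IRC rocks</p></body></html>"
)
print(html_to_text(sample))
```

It is much less forgiving than BeautifulSoup with broken markup, so treat it as a starting point only.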

So now you know how to get the text from a response, and from here it is easy to pull out data using regular expressions 🙂
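As a quick sketch of that last step, here is how a couple of regular expressions could pull data out of the extracted text. The sample string and both patterns are made up for illustration, and the email pattern is deliberately simple, so it will not cover every valid address.

```python
import re

# Hypothetical sample, standing in for the text extracted from a page.
text = "Contact admin@example.com or support@example.org. Updated 2016-08-03."

# Simple patterns for illustration only.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
DATE_RE = re.compile(r"\b(\d{4})-(\d{2})-(\d{2})\b")

emails = EMAIL_RE.findall(text)  # list of matched addresses
dates = DATE_RE.findall(text)    # list of (year, month, day) tuples

print(emails)
print(dates)
```

Compiling the patterns once is handy when you run them over many pages in a loop.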