This is a small program. I was tasked with retrieving a small set of data from a webpage by screenscraping using python.
This is the data I should retrieve. I never learnt Python properly but I could do it. That is the beauty of Python. Python has an interface to sqlite - the smallest db engine - called "Pysqlite". The library to import is called "sqlite3"http://money.livemint.com/IID64/F132540/Financial/Ratios/Company.aspx
dbconn = sqlite3.connect('db/tcs_eps.db')
dbc = dbconn.cursor()
dbc.execute("CREATE TABLE IF NOT EXISTS EPSTABLE(DATE TEXT,EPS TEXT)")
dbc.execute("DELETE FROM EPSTABLE")
This part of the code creates a connection to a database file. If the file doesn't exist, it will be created. But the directory path should exist otherwise it'll throw an error. Sqlite is a file based database system unlike the server-client based ones like mysql, db2 etc.. Then it'll create a table if it doesn't exist already (i.e. during first run). It'll truncate the data from the table otherwise next time, same set of data will repeat in my tables.
host = "money.livemint.com"
epspage = "/IID64/F132540/Financial/Ratios/Company.aspx"
print "Please wait.. It will take some time depending on your connection speed.."
con = httplib.HTTPConnection(host)
con.connect()
con.request("GET", epspage)
resp = con.getresponse()
data = resp.read()
This part of the code opens up a connection to the host and tries to GET the page and then it reads the "data" part of the response. (Response object has so many other stuffs like status, headers etc..)
This data is the code of our html page. Analysing the content, I figured out that I should remove the unnecessary data above and below our interested part. For that I need to find a unique string above and below our part so that I can index them and take the substring out. Those indices were "InnerTable" and "R5". They were unique and appear only once and above and below our part in the file.
start=string.index(data,"InnerTable")
stop=string.index(data,"R5")
data = data[start:stop]
There are some html entities in that page which I removed using
data = re.sub(r'[ amltg]*;','',data)
This is not necessary though but I did it for my sake.
months = ['Date']+re.findall(r'[a-zA-Z]{1,3}\d\d\d\d',data)
earnings = ['Earnings/share (Rs)']+re.findall(r'\d\d\.\d\d',data)
Using this regex, I find out the date pattern and earnings per share pattern and store them in two lists. (The regex in second line should be changed to match any number of digits. I'm such a poor regex coder.)
for month, earning in zip(months, earnings):
dbc.execute('INSERT INTO EPSTABLE VALUES (?,?)' , (month,earning))
dbconn.commit()
for month,earning in dbc.execute('SELECT * FROM EPSTABLE'):
print '%s\t\t\t\t\t%s' % (month, earning)
dbconn.close()
Traversing elements simultaneously through two lists, they are inserted into the table and committed. Then traversing through each row in table and printing them. This program is just for learning coding in python with sqlite.
Full Program:
import httplib,string,re,sqlite3
dbconn = sqlite3.connect('db/tcs_eps.db')
dbc = dbconn.cursor()
dbc.execute("CREATE TABLE IF NOT EXISTS EPSTABLE(DATE TEXT,EPS TEXT)")
dbc.execute("DELETE FROM EPSTABLE")
host = "money.livemint.com"
epspage = "/IID64/F132540/Financial/Ratios/Company.aspx"
print "Please wait.. It will take some time depending on your connection speed.."
con = httplib.HTTPConnection(host)
con.connect()
con.request("GET", epspage)
resp = con.getresponse()
data = resp.read()
print "=================================================="
start=string.index(data,"InnerTable")
stop=string.index(data,"R5")
data = data[start:stop]
data = re.sub(r'[ amltg]*;','',data)
months = ['Date']+re.findall(r'[a-zA-Z]{1,3}\d\d\d\d',data)
earnings = ['Earnings/share (Rs)']+re.findall(r'\d\d\.\d\d',data)
for month, earning in zip(months, earnings):
#print '%s\t\t\t\t\t%s' % (month, earning)
dbc.execute('INSERT INTO EPSTABLE VALUES (?,?)' , (month,earning))
dbconn.commit()
for month,earning in dbc.execute('SELECT * FROM EPSTABLE'):
print '%s\t\t\t\t\t%s' % (month, earning)
dbconn.close()