Simple Python Search Spider, Page Ranker, and Visualizer

This is a set of programs that emulate some of the functions of a
search engine.  They store their data in a SQLite3 database named
'content.db'.  This file can be removed at any time to restart the
process.

We will start with an empty (just removed) content.db.

spider.py

This program crawls a web site and pulls a series of pages into the
database, recording the links between pages.

rm content.db
python spider.py 
Enter new web or enter: http://www.dr-chuck.com/
Webs: ['http://www.dr-chuck.com/']
How many pages:2
1 http://www.dr-chuck.com/ 12
2 http://www.dr-chuck.com/csev-blog/ 57
How many pages:

In this sample run, we told it to crawl a website and retrieve two
pages.  If you restart the program and tell it to crawl more pages,
it will not re-crawl any pages already in the database.  Upon
restart it goes to the top non-crawled page and starts there.  So
each successive run of spider.py is additive.

python spider.py 
Enter web url or enter:
Webs: ['http://www.dr-chuck.com/']
How many pages:3
3 http://www.dr-chuck.com/csev-blog 57
4 http://www.dr-chuck.com/dr-chuck/resume/speaking.htm 1
5 http://www.dr-chuck.com/dr-chuck/resume/index.htm 13
How many pages:
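
The resume behavior shown in the two runs above can be sketched as
follows.  This is only an illustration, not the code in spider.py: the
table layout (Pages, Links), the column names, and the helper names
add_page, record_crawl and next_uncrawled are all assumptions made for
the example, and "top non-crawled page" is taken to mean the lowest id
that has no stored HTML yet.

```python
import sqlite3

def open_db(path=":memory:"):
    # Hypothetical schema: a page with html IS NULL has been seen but not crawled.
    conn = sqlite3.connect(path)
    conn.execute("""CREATE TABLE IF NOT EXISTS Pages (
        id INTEGER PRIMARY KEY, url TEXT UNIQUE, html TEXT)""")
    conn.execute("""CREATE TABLE IF NOT EXISTS Links (
        from_id INTEGER, to_id INTEGER, UNIQUE(from_id, to_id))""")
    return conn

def add_page(conn, url):
    # Insert the URL if it is new, then return its id either way.
    conn.execute("INSERT OR IGNORE INTO Pages (url, html) VALUES (?, NULL)", (url,))
    return conn.execute("SELECT id FROM Pages WHERE url = ?", (url,)).fetchone()[0]

def record_crawl(conn, url, html, links):
    # Store the retrieved page and one Links row per outgoing link.
    from_id = add_page(conn, url)
    conn.execute("UPDATE Pages SET html = ? WHERE id = ?", (html, from_id))
    for target in links:
        to_id = add_page(conn, target)
        conn.execute("INSERT OR IGNORE INTO Links VALUES (?, ?)", (from_id, to_id))
    conn.commit()

def next_uncrawled(conn):
    # Pages already holding HTML are skipped, so repeated runs are additive.
    return conn.execute(
        "SELECT id, url FROM Pages WHERE html IS NULL ORDER BY id LIMIT 1"
    ).fetchone()
```

Because the crawl state lives in the database rather than in memory, a
later run only has to call next_uncrawled() to pick up where the
previous run stopped.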

If you want to dump the contents of the content.db file, you can 
run spdump.py as follows:

python spdump.py 
(5, None, 1.0, 3, u'http://www.dr-chuck.com/csev-blog')
(3, None, 1.0, 4, u'http://www.dr-chuck.com/dr-chuck/resume/speaking.htm')
(1, None, 1.0, 2, u'http://www.dr-chuck.com/csev-blog/')
(1, None, 1.0, 5, u'http://www.dr-chuck.com/dr-chuck/resume/index.htm')
4 rows.

This shows the number of incoming links, the old page rank, the new page
rank, the id of the page, and the url of the page.  The spdump.py program
only shows pages that have at least one incoming link to them.
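
A query of roughly the following shape would produce rows in that
column order.  It is a sketch under assumptions, not the query in
spdump.py: the Pages and Links tables, their column names, and the
helper names make_demo_db and dump_rows are hypothetical.  The inner
join is what leaves out pages that have no incoming links.

```python
import sqlite3

def make_demo_db():
    # Tiny in-memory database using a hypothetical schema.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE Pages (id INTEGER PRIMARY KEY, url TEXT, "
                 "old_rank REAL, new_rank REAL)")
    conn.execute("CREATE TABLE Links (from_id INTEGER, to_id INTEGER)")
    conn.executemany(
        "INSERT INTO Pages VALUES (?, ?, NULL, 1.0)",
        [(1, "http://example.com/"),
         (2, "http://example.com/a"),
         (3, "http://example.com/b")])
    # Page 2 has two incoming links, page 3 has one, page 1 has none.
    conn.executemany("INSERT INTO Links VALUES (?, ?)", [(1, 2), (3, 2), (2, 3)])
    return conn

def dump_rows(conn):
    # (incoming links, old rank, new rank, id, url), most linked-to first.
    return conn.execute("""
        SELECT COUNT(Links.from_id) AS inbound,
               Pages.old_rank, Pages.new_rank, Pages.id, Pages.url
        FROM Pages JOIN Links ON Pages.id = Links.to_id
        GROUP BY Pages.id
        ORDER BY inbound DESC, Pages.id""").fetchall()

if __name__ == "__main__":
    for row in dump_rows(make_demo_db()):
        print(row)
```

Under Python 3 the URLs print without the u'' prefix seen in the
sample output.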

Once you have a few pages in the database, you can run Page Rank on the
pages using the sprank.py program.  You simply tell it how many Page
Rank iterations to run.

python sprank.py 
How many iterations:2
1 0.0
2 0.0
[(1, 0.559), (2, 0.659), (3, 0.985), (4, 2.135), (5, 0.659)]
Creating JSON output on content.json...

You can dump the database again to see that page rank has been updated:

python spdump.py 
(5, 1.0, 0.985, 3, u'http://www.dr-chuck.com/csev-blog')
(3, 1.0, 2.135, 4, u'http://www.dr-chuck.com/dr-chuck/resume/speaking.htm')
(1, 1.0, 0.659, 2, u'http://www.dr-chuck.com/csev-blog/')
(1, 1.0, 0.659, 5, u'http://www.dr-chuck.com/dr-chuck/resume/index.htm')
4 rows.

You can run sprank.py as many times as you like; each run further refines
the page ranks.  You can even run sprank.py a few times, spider a few more
pages with spider.py, and then run sprank.py again to converge the page
ranks.
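
To make the idea of an iteration concrete, here is a simplified,
generic rank step.  It is not taken from sprank.py and may differ from
it in detail: the function name rank_step, the dictionary and list
inputs, and the choice to spread the rank of pages without outgoing
links evenly over all pages are assumptions made for the example.

```python
def rank_step(ranks, links):
    """Run one iteration.

    ranks: {page_id: rank}
    links: list of (from_id, to_id) pairs, both ids present in ranks
    """
    # Group the link targets by the page they come from.
    outgoing = {}
    for from_id, to_id in links:
        outgoing.setdefault(from_id, []).append(to_id)

    # Each page splits its current rank equally among its outgoing links.
    new_ranks = {page: 0.0 for page in ranks}
    for page, rank in ranks.items():
        targets = outgoing.get(page, [])
        if not targets:
            continue
        share = rank / len(targets)
        for target in targets:
            new_ranks[target] += share

    # Rank held by pages with no outgoing links would otherwise vanish,
    # so hand it back evenly and keep the total constant.
    lost = sum(ranks.values()) - sum(new_ranks.values())
    bonus = lost / len(new_ranks)
    return {page: rank + bonus for page, rank in new_ranks.items()}


if __name__ == "__main__":
    ranks = {1: 1.0, 2: 1.0, 3: 1.0}
    links = [(1, 2), (2, 1), (2, 3), (3, 1)]
    for _ in range(2):
        ranks = rank_step(ranks, links)
    print(ranks)
```

Feeding the output of one step into the next is what "more iterations"
means here; the ranks change less and less as they settle.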

If you want to restart the Page Rank calculations without re-retrieving the
web pages, you can use spreset.py:

python spreset.py 
All pages set to a rank of 1.0
python sprank.py 
How many iterations:2
1 0.0
2 0.0
[(1, 0.559), (2, 0.659), (3, 0.985), (4, 2.135), (5, 0.659)]
Creating JSON output on content.json...

Whenever the sprank.py program finishes, it dumps the contents of the
database (strongly connected component only) into the file content.json. You
can view this data by opening the file force.html in your web browser.  This
shows an automatic layout of the nodes and links.  You can click and
drag any node, and you can double-click a node to find the URL that it
represents.

This visualization is provided using the force layout from:

http://mbostock.github.com/d3/
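
As a rough picture of the kind of data a force layout consumes, the
sketch below builds a nodes-and-links structure and serializes it with
the standard json module.  The exact layout of content.json is not
described here, so the key names ("nodes", "links", "source",
"target", "rank", "url") and the function name to_graph are
assumptions, as is the use of list positions rather than page ids to
identify link endpoints.

```python
import json

def to_graph(ranks, urls, links):
    """Build a nodes/links dictionary.

    ranks: {page_id: rank}, urls: {page_id: url},
    links: list of (from_id, to_id) pairs
    """
    ids = sorted(ranks)
    # Map each page id to its position in the nodes list.
    index = {page_id: pos for pos, page_id in enumerate(ids)}
    nodes = [{"id": p, "url": urls[p], "rank": ranks[p]} for p in ids]
    # Keep only links whose two endpoints are both in the node list.
    edges = [{"source": index[f], "target": index[t]}
             for f, t in links if f in index and t in index]
    return {"nodes": nodes, "links": edges}

if __name__ == "__main__":
    graph = to_graph(
        {4: 2.0, 7: 1.0},
        {4: "http://example.com/", 7: "http://example.com/a"},
        [(4, 7), (7, 4), (7, 9)])
    print(json.dumps(graph, indent=2))
```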

If you rerun the other utilities and then re-run sprank.py, you merely
have to press refresh in the browser to get the new data from content.json.

