
Python's HTML parsers, both HTMLParser and htmllib, are very flexible: you provide your own implementations for handling start tags, end tags and data elements. But they have limitations. For example, if I wanted to preserve the input formatting of the HTML but change just a few tags, it would be hard to do.
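To make the handler style concrete, here is a minimal sketch using the standard library parser. It assumes Python 3, where the module is named html.parser (in Python 2 it is HTMLParser); the LinkCollector class and the sample markup are made up for illustration.

```python
from html.parser import HTMLParser  # Python 3 name; the Python 2 module is HTMLParser


class LinkCollector(HTMLParser):
    """Hypothetical handler subclass that records href values and text."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.text_chunks = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)

    def handle_data(self, data):
        # Text between tags arrives here, detached from its markup
        if data.strip():
            self.text_chunks.append(data.strip())


# Made-up sample markup, not taken from any real site
sample = '<p>Read <a href="/world">World</a> and <a href="http://example.com/">more</a>.</p>'
parser = LinkCollector()
parser.feed(sample)
parser.close()
print(parser.links)
```

The callbacks only see a stream of events, not a tree. To write the document back out with a few tags changed, you would have to re-emit every event yourself, which is the limitation described above.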
I found a better solution.
Beautiful Soup looks like a good HTML parser from my initial impressions. I tried it on many public sites such as cnn.com and news.com, and was able to parse each into a tree and access the elements very easily.
It provides very easy functions that search the entire tree and return references to the matching elements.
Let's say you want to get all hyperlink (a) tags. The code is as simple as this:
from BeautifulSoup import BeautifulSoup
import urllib2

data = urllib2.urlopen("http://www.cnn.com")
soup = BeautifulSoup(data.read())
resultset = soup.findAll("a")
for i in range(len(resultset)):
    print resultset[i]
Now say you want to make all the links absolute instead of relative. A simple function that takes the result set does the trick:
from BeautifulSoup import BeautifulSoup
from urlparse import urljoin
import urllib2

def relativetoabsolute(resultset, tag, url):
    # Rewrite the given attribute of every tag in resultset as an absolute URL
    for i in range(len(resultset)):
        try:
            link = str(resultset[i][tag])
            if not link.lower().startswith("http"):
                resultset[i][tag] = urljoin(url, link)
        except Exception:
            # Skip tags that lack the attribute or whose value str() cannot handle
            pass

data = urllib2.urlopen("http://www.cnn.com")
soup = BeautifulSoup(data.read())
resultset = soup.findAll("a")
relativetoabsolute(resultset, 'href', 'http://www.cnn.com')
print soup
The output HTML will have all absolute URLs while preserving the input formatting. The cool thing is that you can prettify the output with just:
print soup.prettify()
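A side note on the relative-to-absolute step: the work is really done by urljoin. The sketch below shows how it resolves a few kinds of href values. It assumes Python 3, where urljoin lives in urllib.parse (in Python 2 it is urlparse.urljoin), and the href values are made up for illustration.

```python
from urllib.parse import urljoin  # Python 3 location; Python 2 has urlparse.urljoin

base = "http://www.cnn.com"

# Hypothetical href values of the kinds a page might contain
hrefs = [
    "/world/index.html",           # relative to the site root
    "http://example.com/page",     # already absolute
    "mailto:someone@example.com",  # different scheme
]

absolute = [urljoin(base, href) for href in hrefs]
for before, after in zip(hrefs, absolute):
    print(before, "->", after)
```

Since urljoin returns already-absolute URLs and other schemes unchanged, the startswith("http") check in the function above mostly just avoids reassigning links that need no change.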
2 Responses
You can make your code more readable and abstract like so:
from urllib import urlopen
from BeautifulSoup import BeautifulSoup
site = urlopen("http://www.host.com")
soup = BeautifulSoup(site)
for i in soup('element'):
    print i
As you can see, this code is a lot more abstract.
By the way, Python's for loop is a bit different than in most other languages (though you can use this method in most other languages). If you can iterate over an object, you don't have to use the range(len(obj)) method; write the object itself instead. The loop variable moves to the next element of the object on each iteration.
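A small side-by-side sketch of the two loop styles, using a made-up list in place of a Beautiful Soup result set (written for Python 3, though the loops behave the same way in Python 2):

```python
links = ["/world", "/sports", "/tech"]  # hypothetical data standing in for a result set

# Index-based loop, as in the original post
by_index = []
for i in range(len(links)):
    by_index.append(links[i])

# Direct iteration: the loop variable is the element itself
direct = []
for link in links:
    direct.append(link)

# When the position is also needed, enumerate supplies it
numbered = [(i, link) for i, link in enumerate(links)]

print(by_index == direct)
```

Both loops visit the same elements in the same order; the direct form just skips the index bookkeeping.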
Thank you for the suggestion. I was exploring Python at that time, and looking back now, the code looks awful.