Fetching the Google top page
>>> import urllib2
>>> url = "http://www.google.co.jp"
>>> result = urllib2.urlopen(url)
>>> lines = result.readlines()
>>> lines[0]
'<html><head><meta http-equiv="content-type" content="text/html; charset=Shift_JIS"><title>Google</title><script>window.google={kEI:"tirjSpiUK4egwgPJttycDQ",kEXPI:"17259,21766,22107,22217",kCSIE:"17259,21766,22107,22217",kCSI:{e:"17259,21766,22107,22217",ei:"tirjSpiUK4egwgPJttycDQ"},kHL:"ja"};\n'
>>> lines[1]
'\n'
>>> lines[2]
'window.google.sn="webhp";window.google.timers={load:{t:{start:(new Date).getTime()}}};try{}catch(b){}window.google.jsrt_kill=1;\n'
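The session above is Python 2; urllib2 no longer exists in Python 3, where the same functionality lives in urllib.request. A minimal sketch of a Python 3 equivalent follows. The helper names decode_lines and fetch_lines are made up for this illustration, and falling back to UTF-8 when the server declares no charset is an assumption, not something the original session does.

```python
from urllib.request import urlopen


def decode_lines(raw, charset="utf-8"):
    """Decode a response body and split it into lines, keeping line endings."""
    # errors="replace" keeps going if a byte does not fit the declared charset.
    return raw.decode(charset, errors="replace").splitlines(keepends=True)


def fetch_lines(url, default_charset="utf-8"):
    """Fetch url and return its body as a list of text lines."""
    with urlopen(url) as response:
        # Use the charset from the Content-Type header when the server sends one;
        # the UTF-8 fallback is an assumption made for this sketch.
        charset = response.headers.get_content_charset() or default_charset
        return decode_lines(response.read(), charset)
```

Calling fetch_lines("http://www.google.co.jp") and looking at the first element of the result corresponds to lines[0] in the session above, except that the lines are decoded text rather than raw bytes.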
I tried extracting the URLs from this blog.
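import re
import urllib
# Download the blog's HTML and print every substring that looks like a URL.
data = urllib.urlopen('http://python25.blogspot.com').read()
for url in re.findall(r"https?://[-_.!~*'()a-zA-Z0-9/?:@&=+$,%#]+", data):
    print url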
>>> for url in re.findall(r"https?://[-_.!~*'()a-zA-Z0-9/?:@&=+$,%#]+", data):
...     print url
...
http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd
http://www.w3.org/1999/xhtml
http://www.google.com/2005/gml/b
http://www.google.com/2005/gml/data
http://www.google.com/2005/gml/expr
http://www.blogger.com/favicon.ico
http://python25.blogspot.com/
http://python25.blogspot.com/feeds/posts/default
http://python25.blogspot.com/feeds/posts/default?alt=rss
http://www.blogger.com/feeds/6658540373074530849/posts/default
http://www.blogger.com/rsd.g?blogID=6658540373074530849
http://www.blogger.com/profile/17765477399828427893
http://www.blogger.com/openid-server.g
http://www.blogger.com/static/v1/widgets/1550194411-widget_css_bundle.css
http://www.blogger.com/static/v1/v-css/3727950723-blog_controls.css
http://www.blogger.com/dyn-css/authorization.css?targetBlogID=6658540373074530849&zx=885dfb85-549b-4231-a3a3-178f9be1dbeb
http://www1.blogblog.com/dots/bg_dots.gif
http://www.blogblog.com/dots/bg_3dots.gif
http://www1.blogblog.com/dots/bg_dots2.gif
http://www1.blogblog.com/dots/bg_dots2.gif
http://www1.blogblog.com/dots/bg_post_title_left.gif
http://www.blogblog.com/dots/icon_comment_left.gif
http://www.blogblog.com/dots/icon_comment_left.gif
http://www.blogblog.com/dots/icon_comment_left.gif
http://www1.blogblog.com/dots/bullet.gif
http://www.blogger.com/navbar.g?targetBlogID=6658540373074530849&
http://python25.blogspot.com/2009/10/blog-post_23.html
http://python25.blogspot.com/2009/10/blog-post_23.html
http://python25.blogspot.com/2009/10/blog-post_23.html#comments
http://www.blogger.com/post-edit.g?blogID=6658540373074530849&postID=8238006605508980249
http://www.blogger.com/img/icon18_edit_allbkg.gif
http://python25.blogspot.com/2009/10/blog-post_22.html
http://python25.blogspot.com/2009/10/blog-post_22.html
http://python25.blogspot.com/2009/10/blog-post_22.html#comments
http://www.blogger.com/post-edit.g?blogID=6658540373074530849&postID=5311009012827742139
http://www.blogger.com/img/icon18_edit_allbkg.gif
http://python25.blogspot.com/2009/10/blog-post_21.html
http://python25.blogspot.com/2009/10/blog-post_21.html
http://python25.blogspot.com/2009/10/blog-post_21.html#comments
http://www.blogger.com/post-edit.g?blogID=6658540373074530849&postID=9007033006594304086
http://www.blogger.com/img/icon18_edit_allbkg.gif
http://python25.blogspot.com/2009/10/2.html
http://python25.blogspot.com/2009/10/2.html
http://python25.blogspot.com/2009/10/2.html#comments
http://www.blogger.com/post-edit.g?blogID=6658540373074530849&postID=2181106016910876218
http://www.blogger.com/img/icon18_edit_allbkg.gif
http://python25.blogspot.com/2009/10/mapfilter.html
http://python25.blogspot.com/2009/10/mapfilter.html
http://python25.blogspot.com/2009/10/mapfilter.html#comments
http://www.blogger.com/post-edit.g?blogID=6658540373074530849&postID=2507538441610339075
http://www.blogger.com/img/icon18_edit_allbkg.gif
http://python25.blogspot.com/2009/10/importreload.html
http://www.python.org/dev/peps/pep-0263/
http://python25.blogspot.com/2009/10/importreload.html
http://python25.blogspot.com/2009/10/importreload.html#comments
http://www.blogger.com/post-edit.g?blogID=6658540373074530849&postID=8577865623867738151
http://www.blogger.com/img/icon18_edit_allbkg.gif
http://python25.blogspot.com/2009/10/blog-post_15.html
http://python25.blogspot.com/2009/10/blog-post_15.html
http://python25.blogspot.com/2009/10/blog-post_15.html#comments
http://www.blogger.com/post-edit.g?blogID=6658540373074530849&postID=1242385033646658704
http://www.blogger.com/img/icon18_edit_allbkg.gif
http://python25.blogspot.com/search?updated-max=2009-10-15T06%3A00%3A00%2B09%3A00&max-results=7
http://python25.blogspot.com/feeds/posts/default
http://dame.livedoor.biz
http://www.blogger.com/rearrange?blogID=6658540373074530849&widgetType=LinkList&widgetId=LinkList1&action=editWidget
http://img1.blogblog.com/img/icon18_wrench_allbkg.png
http://python25.blogspot.com/search?updated-min=2009-01-01T00%3A00%3A00%2B09%3A00&updated-max=2010-01-01T00%3A00%3A00%2B09%3A00&max-results=17
http://python25.blogspot.com/2009_10_01_archive.html
http://python25.blogspot.com/2009/10/blog-post_23.html
http://python25.blogspot.com/2009/10/blog-post_22.html
http://python25.blogspot.com/2009/10/blog-post_21.html
http://python25.blogspot.com/2009/10/2.html
http://python25.blogspot.com/2009/10/mapfilter.html
http://python25.blogspot.com/2009/10/importreload.html
http://python25.blogspot.com/2009/10/blog-post_15.html
http://python25.blogspot.com/2009/10/blog-post_14.html
http://python25.blogspot.com/2009/10/blog-post_13.html
http://python25.blogspot.com/2009/10/blog-post_12.html
http://python25.blogspot.com/2009/10/1.html
http://python25.blogspot.com/2009/10/blog-post_08.html
http://python25.blogspot.com/2009/10/blog-post_07.html
http://python25.blogspot.com/2009/10/blog-post_06.html
http://python25.blogspot.com/2009/10/blog-post.html
http://python25.blogspot.com/2009/10/python-1010-10000000000l-101032python.html
http://python25.blogspot.com/2009/10/python.html
http://www.blogger.com/rearrange?blogID=6658540373074530849&widgetType=BlogArchive&widgetId=BlogArchive1&action=editWidget
http://img1.blogblog.com/img/icon18_wrench_allbkg.png
http://www.blogger.com/profile/17765477399828427893
https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhsqz1XWFjtQG3T6GIQs3uAZBszmJGcBVnTONKZhCFDk0hR3Co-GGw8EmBLk_28VrzcDcIaSDoGqU1IskC4H3wuNSxqfo1yGaiNq__yarzrMnd9dmztzyhA8xV1gueaWaUlyubGcI7bH5s/s220/yoichiro_pic1.jpg
http://www.blogger.com/profile/17765477399828427893
http://www.blogger.com/rearrange?blogID=6658540373074530849&widgetType=Profile&widgetId=Profile1&action=editWidget
http://img1.blogblog.com/img/icon18_wrench_allbkg.png
https://ssl.
http://www.
http://www.blogger.com/static/v1/widgets/4222249892-widgets.js
http://www.blogger.com/rearrange?blogID=6658540373074530849
http://python25.blogspot.com/,6658540373074530849)
http://www.blogger.com/display?blogID=6658540373074530849
http://python25.blogspot.com/
http://python25.blogspot.com/
http://python25.blogspot.com/feeds/posts/default
http://python25.blogspot.com/feeds/posts/default?alt
http://www.blogger.com/feeds/6658540373074530849/posts/default
http://www.blogger.com/rsd.g?blogID
http://www.blogger.com/profile/17765477399828427893
http://www.blogger.com/openid-server.g
http://img1.blogblog.com/img/icon18_wrench_allbkg.png
http://www.blogger.com/favicon.ico
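The result is not 100% accurate, but by my estimate more than 90% of the URLs were extracted correctly. The misses are visible in the output: "https://ssl." and "http://www." are fragments cut off after the first part of the host name, "http://www.blogger.com/rsd.g?blogID" stops before the ID, and "http://python25.blogspot.com/,6658540373074530849)" picked up trailing characters that are not part of the URL.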
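Some of the noise can be filtered out after matching. The following is a rough sketch in Python 3, not a complete fix: the function name extract_urls and the two heuristics (strip trailing punctuation, require a dot in the host name) are my own assumptions for illustration. It also removes duplicates while keeping first-seen order.

```python
import re
from urllib.parse import urlsplit

# Same pattern as in the session above.
URL_PATTERN = re.compile(r"https?://[-_.!~*'()a-zA-Z0-9/?:@&=+$,%#]+")


def extract_urls(text):
    """Return unique, plausibly complete URLs from text in first-seen order."""
    seen = set()
    urls = []
    for match in URL_PATTERN.findall(text):
        # Heuristic: trailing punctuation usually belongs to the surrounding text.
        url = match.rstrip(".,)'")
        host = urlsplit(url).hostname or ""
        # Heuristic: skip fragments such as "http://www" whose host has no dot.
        if "." not in host:
            continue
        if url not in seen:
            seen.add(url)
            urls.append(url)
    return urls
```

This drops fragments like "https://ssl." and the repeated entries, but it does not repair a match such as "http://python25.blogspot.com/,6658540373074530849)": only the closing parenthesis is stripped, because a comma in the middle of a URL is valid. Stripping a trailing ")" can also damage URLs that legitimately end with one.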