Getting Started with Python: Scraping the App Store Charts with a Spider (Part 1)

Views: 206 / Date: June 11, 2015

Straight to the practical part!

Environment: Python 2.7.5 on Windows.

Open http://www.apple.com/cn/itunes/charts/free-apps/

As the screenshot above shows, the page is encoded in UTF-8.
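The encoding can also be checked in code rather than by eye. Here is a minimal sketch, assuming the page declares its charset in a `<meta>` tag. The helper name `detect_charset` and the sample HTML are made up for illustration and are not part of the script in this post. It should run on both Python 2.7 and Python 3.

```python
import re

def detect_charset(html_bytes, default="utf-8"):
    """Return the charset named in a <meta> tag, or `default` if none is found."""
    # The declaration lives in <head>, so scanning the first bytes is enough.
    head = html_bytes[:2048]
    match = re.search(br'<meta[^>]*charset=["\']?([A-Za-z0-9_\-]+)', head, re.IGNORECASE)
    if match:
        return match.group(1).decode("ascii").lower()
    return default

# Hypothetical sample standing in for the downloaded page.
sample = b'<html><head><meta charset="UTF-8"><title>Charts</title></head></html>'
print(detect_charset(sample))
```

The same pattern also picks up the older `http-equiv="Content-Type"` style of declaration, since it only looks for `charset=` inside a `<meta>` tag.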

After some back-and-forth, I ended up with the code below (criticism welcome, just go easy on me):

#coding=utf-8
import urllib2
import urllib
import re
import thread
import time


#----------- App Store charts -----------
class Spider_Model:

    def __init__(self):
        self.page = 1
        self.pages = []
        self.enable = False

    def GetCon(self):
        myUrl = "http://www.apple.com/cn/itunes/charts/free-apps/"
        user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
        headers = { 'User-Agent' : user_agent }
        req = urllib2.Request(myUrl, headers = headers)
        myResponse = urllib2.urlopen(req)
        myPage = myResponse.read()
        # encode converts a unicode string into a byte string in another encoding
        # decode converts a byte string in another encoding into a unicode string
        # Here the raw bytes from the server are printed without any conversion
        print myPage

print ' '
myModel = Spider_Model()
myModel.GetCon()

The scraped page and the Python source file both use UTF-8 (so 贫蛋哥 figured there would be no problem).

Printed output:

Time for the secret weapon: www.baidu.com

Found the cause:

http://blog.csdn.net/lf8289/article/details/2465196

http://www.crifan.com/unicodeencodeerror_gbk_codec_can_not_encode_character_in_position_illegal_multibyte_sequence/
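My reading of the problem (an assumption, since the console output is not reproduced here): a Chinese-locale Windows console expects GBK, so UTF-8 bytes printed as-is get interpreted with the wrong codec. The sketch below reproduces that mismatch offline with a three-character sample string in place of the real page. The variable names are illustrative only, and it should run on both Python 2.7 and Python 3.

```python
from __future__ import print_function

text = u"\u6392\u884c\u699c"        # Chinese for "charts", written with escapes
utf8_bytes = text.encode("utf-8")   # stands in for what myResponse.read() returns

# A GBK console effectively reads the same bytes with the wrong codec.
garbled = utf8_bytes.decode("gbk", "replace")

# Decoding with the codec the bytes were written in recovers the text,
# which can then be re-encoded for the console.
recovered = utf8_bytes.decode("utf-8")
gbk_bytes = recovered.encode("gbk")

print(garbled == text)     # False: the wrong codec scrambles the text
print(recovered == text)   # True: the right codec round-trips
print(len(utf8_bytes), len(gbk_bytes))
```

The two byte strings have different lengths because UTF-8 uses three bytes for each of these characters while GBK uses two, which is why the bytes cannot simply be passed from one to the other unchanged.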

After a flurry of edits...

#coding=gbk   (source file encoding changed to gbk)
import urllib2
import urllib
import re
import thread
import time


#----------- App Store charts -----------
class Spider_Model:

    def __init__(self):
        self.page = 1
        self.pages = []
        self.enable = False

    def GetCon(self):
        myUrl = "http://www.apple.com/cn/itunes/charts/free-apps/"
        user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
        headers = { 'User-Agent' : user_agent }
        req = urllib2.Request(myUrl, headers = headers)
        myResponse = urllib2.urlopen(req)
        myPage = myResponse.read()
        # encode converts a unicode string into a byte string in another encoding
        # decode converts a byte string in another encoding into a unicode string
        # The page is UTF-8: decode it, then re-encode as GBK ('ignore' drops
        # characters GBK cannot represent). Despite its name, unicodePage
        # ends up holding GBK-encoded bytes.
        unicodePage = myPage.decode('utf-8').encode('gbk','ignore')
        print unicodePage

print ' '
myModel = Spider_Model()
myModel.GetCon()

Output of the run:
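The `'ignore'` argument in the script matters because web pages often contain characters that GBK cannot represent, the non-breaking space (U+00A0, `&nbsp;` in HTML) being a common one. The sketch below compares the error handlers on a made-up sample string. It is an illustration of the standard codec options, not output from the real page, and it should run on both Python 2.7 and Python 3.

```python
from __future__ import print_function

# Hypothetical snippet of page text containing a non-breaking space.
page = u"Top\xa0Free Apps"

# Strict encoding (the default) raises on the first unencodable character.
try:
    page.encode("gbk")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

ignored = page.encode("gbk", "ignore")     # silently drops U+00A0
replaced = page.encode("gbk", "replace")   # substitutes "?" for it

print(strict_ok)
print(ignored.decode("gbk"))
print(replaced.decode("gbk"))
```

`'ignore'` keeps the script running but loses data without any trace, so `'replace'` may be the safer choice while debugging, since the question marks show where characters were dropped.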
