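Practical Python: Data Processing (Part 2), Converting Data to a Specific Format
The current situation is:
1. One of my folders contains many data files with names like these: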
part-m-0000
part-m-0001
part-m-0002
part-m-0003
...
2. The data in each file has the following format:
"460030730101160","3","0","0","0","2013/8/31 0:21:42"
"460036745672363","3","0","0","0","2013/8/31 0:21:31"
"460030250931114","3","1307","1","0","2013/8/31 0:21:40"
"460030250942643","3","0","0","0","2013/8/31 0:21:40"
"460036650411006","3","1021","1","0","2013/8/31 0:21:39"
"000000000009674","8","0","0","0","2013/8/31 0:12:28"
"000000000005661","8","0","0","0","2013/8/31 0:12:29"
"460030731390121","3","0","0","0","2013/8/31 21:54:00"
"460030256111396","3","0","0","0","2013/8/31 21:54:00"
"460030207447762","3","0","0","0","2013/8/31 21:53:58"
"460030250939916","3","0","0","0","2013/8/31 21:53:58"
"460030957972011","3","1613","0","0","2013/8/31 21:53:51"
"460030237206739","3","0","0","0","2013/8/31 21:53:59"
...
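The goal is to strip the quotes from every field and to reduce the timestamp in the last column to its hour. For example, the line "460030250931114","3","1307","1","0","2013/8/31 0:21:40" becomes 460030250931114,3,1307,1,0,0. Here is how I handled it in Python:
1. Go through all files in the current folder whose names start with 'part';
2. For each file, read it line by line and split each line on ",";
3. For every field, keep only the part between the quotes; for the last field (the timestamp), keep only the hour, which requires checking whether the hour has one digit or two;
4. Write each output line as soon as the input line has been read.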
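Steps 2 to 4 can be sketched for a single line as a small function. This is only an illustration under stated assumptions: convert_line is a hypothetical helper name, not part of the script in this post, and it takes the hour by splitting on the space and the first colon rather than by checking whether the hour has one or two digits. The "-1" fallback for an unreadable timestamp mirrors the value the script uses.

```python
def convert_line(line):
    """Strip the quotes from every field and reduce the timestamp to its hour.

    Hypothetical helper for illustration; assumes fields contain no commas.
    """
    # Split on commas, then drop surrounding whitespace and double quotes.
    fields = [f.strip().strip('"') for f in line.strip().split(',')]
    timestamp = fields[-1]                 # e.g. 2013/8/31 0:21:42
    parts = timestamp.split(' ')
    if len(parts) == 2 and ':' in parts[1]:
        hour = parts[1].split(':')[0]      # text before the first colon
    else:
        hour = "-1"                        # fallback for a malformed timestamp
    return ','.join(fields[:-1] + [hour])


print(convert_line('"460030731390121","3","0","0","0","2013/8/31 21:54:00"'))
```

Because the hour is whatever precedes the first colon, one-digit and two-digit hours need no separate branch in this variant.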
Here is the actual code:
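# coding: utf-8
import os

out_dir = "./data_handled/"
if not os.path.isdir(out_dir):
    os.makedirs(out_dir)  # the output folder must exist before writing

# Only the current folder is scanned; subfolders are not visited.
for name in os.listdir("./"):
    if name.startswith("part"):
        filepath = "./" + name  # this is the current file path
        print(filepath)
        newfilepath = out_dir + name[7:]  # output file, named by the numeric suffix
        infile = open(filepath)
        newfile = open(newfilepath, 'w')
        for line in infile:
            string = ""
            line_ = line.split(',')
            for i in range(len(line_) - 1):
                j = line_[i][1:len(line_[i]) - 1]  # delete the surrounding " "
                string += j
                string += ','
            last = line_[len(line_) - 1]  # e.g. "2013/8/31 0:21:42" plus newline
            if len(last) > 12:
                # Index 0 is the opening quote and the date takes 9 characters,
                # so the hour starts at index 11.
                if last[12] == ':':
                    k = last[11:12]  # one-digit hour
                else:
                    k = last[11:13]  # two-digit hour
            else:
                k = "-1"  # timestamp missing or too short
            string += k
            newfile.write(string + "\n")
        infile.close()
        newfile.close()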
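The fixed indices 11 to 13 only work while the date part is exactly nine characters long, as in 2013/8/31. As a hedged alternative, the sketch below lets the standard csv module remove the quotes and datetime.strptime parse the timestamp. The function name convert_stream is hypothetical, the format string "%Y/%m/%d %H:%M:%S" is inferred from the sample rows, and skipping blank lines is a choice of this sketch, not the behaviour of the script in this post.

```python
import csv
import io
from datetime import datetime


def convert_stream(src, dst):
    """Copy CSV rows from src to dst, replacing the last column by its hour.

    Hypothetical helper for illustration; src and dst are text file objects.
    """
    reader = csv.reader(src)                       # handles the quoted fields
    writer = csv.writer(dst, lineterminator="\n")  # writes unquoted plain fields
    for row in reader:
        if not row:
            continue                               # skip blank lines (assumption)
        try:
            hour = datetime.strptime(row[-1], "%Y/%m/%d %H:%M:%S").hour
        except ValueError:
            hour = -1                              # unparseable timestamp
        writer.writerow(row[:-1] + [hour])


# Small in-memory demonstration with two rows from the sample data.
sample = io.StringIO(
    '"460030250931114","3","1307","1","0","2013/8/31 0:21:40"\n'
    '"460030731390121","3","0","0","0","2013/8/31 21:54:00"\n'
)
result = io.StringIO()
convert_stream(sample, result)
print(result.getvalue())
```

In a real run, src and dst would be the opened part-m-* input file and the matching output file instead of the in-memory buffers used here.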