Shell script: cleaning up tampered HTML files

Date: June 9, 2015

Disclaimer: this only cleans HTML files where the tampering was appended at the end of the file.

------------------------------------------------------------------------------

The tampered HTML file:

[root@CHM-DD-00-E5-07 sndapk]# cat -A problem.html
<html>^M$
<body>^M$
<h1>It works !</h1>^M$
</body>^M$
</html>^M$
<div style="position:absolute;left:expression(386-4635);top:expression(528-9313);">^M$
<a href="http://www.fuckit.com/">Chanel handbags</a>[root@CHM-DD-00-E5-07 sndapk]#

Note: the following two lines are the maliciously added content.

<div style="position:absolute;left:expression(386-4635);top:expression(528-9313);">^M$
<a href="http://www.fuckit.com/">Chanel handbags</a>
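
To check whether a file shows this pattern before cleaning it, one option is to test for any non-whitespace text after the first </html>. The following is a minimal sketch, not part of the original scripts. The function name has_trailing_content is hypothetical, and the sketch assumes the closing tag is written in lowercase.

```shell
# Hypothetical helper (not from the original post): exit status 0 if any
# non-whitespace text follows the first </html> in file $1.
has_trailing_content() {
    # Keep the text from the first line containing </html> to the end of the
    # file, with spaces, tabs, CRs, and newlines deleted.
    tail_text=$(sed -n '/<\/html>/,$p' "$1" | tr -d ' \t\r\n')
    # No </html> at all: this check has nothing to say about the file.
    [ -n "$tail_text" ] || return 1
    # Drop everything up to and including the first </html>.
    tail_text=${tail_text#*</html>}
    # Anything left over was placed after the closing tag.
    [ -n "$tail_text" ]
}
```

It could be called as: has_trailing_content problem.html && echo "problem.html has content after </html>"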

Cleanup:

1. Clean a single HTML file with a sed script

#vim callby_htmlclean.sed

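#!/bin/sed -f
# Besides being called from a loop to batch-process HTML files, this script can also
# be run directly on a single HTML file. Only one file may be given per run, e.g.:
# ./callby_htmlclean.sed -ni index.html
:nothtmlend
/<\/html>/! {
        ${
                # Reached the last line without finding </html>: print the file
                # unchanged. Otherwise "n" would exit with nothing written and
                # -i would leave the file empty.
                1!{
                        H
                        x
                }
                p
                b
        }
        1{
                # Copy the first line into the hold space (h, not H) so that the
                # output does not start with an empty line.
                h
                n
                # Read the next line and keep looking for </html>.
                b nothtmlend
        }
        1!{
                # Append every later line to the hold space.
                H
                n
                # Read the next line and keep looking for </html>.
                b nothtmlend
        }
}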
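# Reaching this point means </html> was found: the end of the original, untampered content.
:findhtmlend
# Collect all remaining lines into the pattern space.
$!{
        N
        b findhtmlend
}
s/[\n\t ]//g
# With tabs, newlines, and spaces removed, a normal HTML file should end with </html>.
/<\/html>$/{
        H
        x
        p
}
# Handle content injected after </html>.
/<\/html>$/!{
        # Cut everything after </html> from the pattern space.
        s:\(<\/html>\).*:\1:
        H
        # Swap the collected original content back into the pattern space.
        x
        p
}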

How to run:

1. Add execute permission: chmod +x callby_htmlclean.sed

2. ./callby_htmlclean.sed -ni problem.html

The HTML file after cleanup:

[root@CHM-DD-00-E5-07 sndapk]# cat -A problem.html
<html>^M$
<body>^M$
<h1>It works !</h1>^M$
</body>^M$
</html>[root@CHM-DD-00-E5-07 sndapk]#

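Notes:

1. Compared with cleaning the file by hand in vim, the last line is missing its trailing "^M", but this has no effect.

2. The script is rather verbose, but in testing it handled injected content in several layouts: on the same line as </html>, on the following line, and spread over multiple lines.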
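
For comparison, the core idea (keep everything up to and including </html>, drop the rest) can be sketched far more compactly. This is a simpler, different technique from the script above, shown only for illustration. It writes to standard output instead of editing in place, cuts at the first </html> on whichever line it appears, and always ends its output with a newline. The name truncate_after_html is hypothetical.

```shell
# Hypothetical helper (not from the original post): print file $1 up to and
# including the first </html>, discarding whatever follows it.
truncate_after_html() {
    # Lines before the match are auto-printed. On the first matching line,
    # cut the text after the tag, then quit so later lines are never output.
    sed '/<\/html>/ {
        s/\(<\/html>\).*/\1/
        q
    }' "$1"
}
```

A file that contains no </html> at all passes through unchanged, since the quit command is never reached.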

2. Batch-clean HTML files with a companion shell script

#vim htmlclean.sh

#!/bin/bash
IFS='
'
PATH=/bin:/usr/bin:/sbin:/usr/sbin
export IFS PATH
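# The sed script to call must be in the same directory as this program. If the sed script is renamed, update the name here as well.
sed_script="$(dirname "$0")/callby_htmlclean.sed"
if [ ! -f "$sed_script" ];then
        echo "ERROR: sed script \"$sed_script\" not found!"
        exit 1
fi
if [ $# -ne 1 ];then
        echo "Usage: $0 <full_path | listfile>"
        # If a directory is given, every file ending in .htm or .html in it and its subdirectories is processed.
        echo "Example 1: $0 /web/bbs/static/"
        # Give the absolute path of a list file; each line in it names one file, also as an absolute path.
        echo "Example 2: $0 /tmp/problem_html_list.txt"
        exit 1
fi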
if [ -d "$1" ];then
        # The parentheses make -type f apply to both name patterns.
        html_list="$(find "$1" -type f \( -name '*.htm' -o -name '*.html' \))"
elif [ -f "$1" ];then
        html_list="$(cat "$1")"
else
        echo "ERROR: invalid path or listfile."
        exit 1
fi
for html_f in $html_list
do
        if [[ "$html_f" =~ \.html?$ ]];then
                if [ -f "$html_f" ];then
                        sed -f "$sed_script" -n -i "$html_f"
                else
                        echo "ERROR: \"$html_f\" does not exist."
                fi
        else
                echo "ERROR: \"$html_f\" is not a valid html file."
        fi
done
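
Because the loop rewrites each file in place, it may be worth keeping a copy of every file before it is modified. A minimal sketch, assuming GNU sed, whose -i option accepts a backup suffix: clean_in_place_with_backup is a hypothetical wrapper and is not part of the original scripts.

```shell
# Hypothetical wrapper (not from the original post).
# $1 = sed script file, $2 = HTML file to clean.
clean_in_place_with_backup() {
    # -i.bak saves the untouched original as "$2.bak" before the file is rewritten.
    sed -i.bak -n -f "$1" "$2"
}
```

Inside the loop, the sed call would then become: clean_in_place_with_backup "$sed_script" "$html_f"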

How to run:

1. Recursively clean the HTML files (.htm or .html suffix) under a given directory:

#./htmlclean.sh /www/bbs

2. Clean the HTML files named in a list file:

#./htmlclean.sh /tmp/htmllist.txt

The contents of htmllist.txt should use the following format:

cat /tmp/htmllist.txt
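
According to the usage comments in htmlclean.sh, the list file holds one absolute file path per line. One possible way to produce such a list is sketched below. build_html_list is a hypothetical helper, not part of the original scripts.

```shell
# Hypothetical helper (not from the original post): print the absolute path of
# every .htm/.html regular file under directory $1, one per line.
build_html_list() {
    # Resolve the directory to an absolute path so every listed file is absolute.
    abs_dir=$(cd "$1" && pwd) || return 1
    # The parentheses make -type f apply to both name patterns, so a directory
    # that happens to be named like "x.html" is not listed.
    find "$abs_dir" -type f \( -name '*.htm' -o -name '*.html' \) | sort
}
```

For example: build_html_list /www/bbs > /tmp/htmllist.txt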