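How to Determine Chinese Character Encoding from Byte Stream Content, and Fix Garbled Text That the Same App Shows Only in Some Provinces
Recently, telecom users in certain provinces reported that some of the data fetched by our Android client app over a 3G SIM connection was displayed as garbled text, while the same data displayed correctly over Wi-Fi. Initial investigation showed that the data was encoded differently before gzip compression: GBK in some provinces, UTF-8 in others. After decompression, the bytes do not always match the GBK encoding the client assumes, so the text comes out garbled. The client therefore needs to probe the encoding of the network stream and pick the decoder based on the result.
One simple approach is to read the character encoding from the Content-Type of the HttpEntity: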
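To make the failure mode concrete, here is a minimal sketch (JDK only) of what happens when bytes written in one encoding are decoded with another. The class name `MojibakeDemo` and the method `transcode` are made up for this illustration, and the sample text is the two characters 汉字 written as Unicode escapes so the source compiles regardless of file encoding.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {
    private static final Charset GBK = Charset.forName("GBK");

    /** Encodes text with one charset, then decodes the resulting bytes with another. */
    public static String transcode(String text, Charset encodeWith, Charset decodeWith) {
        byte[] wire = text.getBytes(encodeWith);
        return new String(wire, decodeWith);
    }

    public static void main(String[] args) {
        String original = "\u6c49\u5b57"; // the two characters "汉字"
        // Server and client agree on GBK: the text survives the round trip.
        System.out.println(transcode(original, GBK, GBK).equals(original));
        // Server sends UTF-8, client decodes as GBK: the text is garbled.
        System.out.println(transcode(original, StandardCharsets.UTF_8, GBK).equals(original));
    }
}
```

The first line printed is `true` and the second is `false`, which mirrors the situation described above: the same client code works for one province's gateway and breaks for another's.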
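import org.apache.http.HeaderElement;
import org.apache.http.HttpEntity;
import org.apache.http.NameValuePair;
import org.apache.http.ParseException;

// Returns the charset parameter of the entity's Content-Type header,
// or null when the header or the parameter is absent.
public static String getContentCharSet(final HttpEntity entity) throws ParseException {
    if (entity == null) {
        throw new IllegalArgumentException("HTTP entity may not be null");
    }
    String charset = null;
    if (entity.getContentType() != null) {
        HeaderElement[] values = entity.getContentType().getElements();
        if (values.length > 0) {
            NameValuePair param = values[0].getParameterByName("charset");
            if (param != null) {
                charset = param.getValue();
            }
        }
    }
    return charset;
}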
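This does work in testing, but it is hard to trust completely: if the HTTP headers lack a Content-Type, how do you determine the character encoding?
Plenty of blog posts cover character-encoding detection, but most of them deal with files, and their methods fail on network streams. I recommend the open-source component cpdetector (494 KB in total), which can detect the encoding of both files and byte streams.
The code of the EncodingDetector utility class follows. cpdetector is statistics-based: the more bytes it examines, the more accurate the result. For a file stream the byte count is known, so the number of bytes probed is the file length minus 1, capped at 2000.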
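The missing-header case can be handled explicitly by supplying a fallback. The following is a minimal sketch that works on the raw header value with plain JDK string handling rather than the HttpClient classes; the class name `ContentTypeCharset` and the method `charsetOrDefault` are hypothetical, and the parsing is deliberately simplified (it does not handle a `;` inside a quoted parameter value).

```java
import java.util.Locale;

public class ContentTypeCharset {
    /**
     * Returns the charset parameter of a Content-Type header value,
     * or the given fallback when the header or the parameter is missing.
     */
    public static String charsetOrDefault(String contentType, String fallback) {
        if (contentType == null) {
            return fallback;
        }
        String[] parts = contentType.split(";");
        // parts[0] is the media type itself; parameters start at index 1.
        for (int i = 1; i < parts.length; i++) {
            String part = parts[i].trim();
            int eq = part.indexOf('=');
            if (eq <= 0) {
                continue;
            }
            String name = part.substring(0, eq).trim().toLowerCase(Locale.ROOT);
            if (!name.equals("charset")) {
                continue;
            }
            String value = part.substring(eq + 1).trim();
            // Strip optional surrounding double quotes.
            if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
                value = value.substring(1, value.length() - 1);
            }
            return value.isEmpty() ? fallback : value;
        }
        return fallback;
    }
}
```

For example, `charsetOrDefault("text/html; charset=GBK", "UTF-8")` yields `GBK`, while `charsetOrDefault("application/json", "UTF-8")` and `charsetOrDefault(null, "UTF-8")` both yield the fallback. A fallback only hides the problem, though: when the header is absent or wrong, the bytes themselves have to be inspected.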
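import info.monitorenter.cpdetector.io.ASCIIDetector;
import info.monitorenter.cpdetector.io.CodepageDetectorProxy;
import info.monitorenter.cpdetector.io.JChardetFacade;
import info.monitorenter.cpdetector.io.ParsingDetector;
import info.monitorenter.cpdetector.io.UnicodeDetector;
import org.apache.http.protocol.HTTP;

import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;

public class EncodingDetector {
    private static final CodepageDetectorProxy detector = CodepageDetectorProxy.getInstance();

    static {
        detector.add(new ParsingDetector(false));
        detector.add(JChardetFacade.getInstance());
        detector.add(ASCIIDetector.getInstance());
        detector.add(UnicodeDetector.getInstance());
    }

    public static String getCharset(InputStream is, boolean useAvailable) {
        Charset charset = null;
        // The more bytes probed, the more accurate the result;
        // the count must not exceed the length of the stream.
        int detectCharNum = 2000;
        try {
            if (useAvailable) {
                int available = is.available();
                if (available <= 1) {
                    // Some streams cannot report a byte count (a network stream,
                    // for example, cannot know how much data has yet to arrive).
                    return HTTP.UTF_8;
                }
                if (detectCharNum > available) {
                    detectCharNum = available - 1;
                }
            }
            BufferedInputStream bufferedInputStream = new BufferedInputStream(is);
            charset = detector.detectCodepage(bufferedInputStream, detectCharNum);
            // Attempt to rewind; a failure here is swallowed by the catch below
            // and does not affect the detected charset.
            bufferedInputStream.reset();
        } catch (Exception e) {
            // Ignored: fall through and return whatever was detected (possibly null).
        }
        return null != charset ? charset.name() : null;
    }

    public static String getCharset(ByteArrayOutputStream bos) {
        String charset = null;
        try {
            ByteArrayInputStream is = new ByteArrayInputStream(bos.toByteArray());
            charset = getCharset(is, true); // the byte count of bos is known
            is.close();
        } catch (IOException e) {
            // Ignored: closing an in-memory stream does not fail in practice.
        }
        return charset;
    }
}
For a network stream, you cannot know exactly how much data has yet to arrive, so read and buffer the whole byte stream first, then detect the encoding of the buffered bytes.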
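When the only two candidates are UTF-8 and GBK, as in the scenario above, a much cruder check than statistical detection is sometimes enough. The sketch below is not what cpdetector does; it is a swapped-in, dependency-free technique that strictly validates a fully buffered body as UTF-8 and falls back to GBK otherwise. The class name `Utf8OrGbk` and both method names are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class Utf8OrGbk {
    /** Returns true when the whole array decodes as UTF-8 without any error. */
    public static boolean isValidUtf8(byte[] data) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)       // fail instead of substituting U+FFFD
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    /** Assumption: the body can only be UTF-8 or GBK, so anything that is not valid UTF-8 is treated as GBK. */
    public static String guessCharset(byte[] data) {
        return isValidUtf8(data) ? "UTF-8" : "GBK";
    }
}
```

This relies on the fact that typical GBK-encoded Chinese text is almost never well-formed UTF-8, but short or unusual GBK byte sequences can pass the check by coincidence, so it is a heuristic rather than a guarantee. It also needs the complete body: a buffer cut off in the middle of a multi-byte character would be reported as invalid UTF-8.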
HttpGet httpGet = null;
HttpResponse httpResponse;
InputStream is = null;
BufferedReader in = null;
ByteArrayOutputStream bos = null;
try {
    httpGet = new HttpGet(url);
    //httpGet.addHeader();
    httpResponse = httpClient.execute(httpGet);
    int statusCode = httpResponse.getStatusLine().getStatusCode();
    HttpEntity httpEntity = httpResponse.getEntity();
    String json = "";
    if (httpEntity != null) {
        is = httpEntity.getContent();
        Header val = httpEntity.getContentEncoding();
        if (val != null && val.getValue() != null && val.getValue().contains("gzip")) {
            is = new GZIPInputStream(is);
        } else {
            BufferedInputStream bis = new BufferedInputStream(is);
            bis.mark(2);
            // Read the first two bytes
            byte[] header = new byte[2];
            int result = bis.read(header);
            // Reset the input stream to its starting position
            bis.reset();
            // Check whether the body is in GZIP format
            if (result != -1 && Utils.toInt(header, 0) == GZip_Value) {
                is = new GZIPInputStream(bis);
            } else {
                is = bis;
            }
        }
        if (encoding != null) {
            // Fix for the garbled text seen in some provinces
            boolean mustUseDefault = false;
            if (needDetectEncoding) {
                /*String charsetFromHttpEntity = EntityUtils.getContentCharSet(httpEntity);
                if (!TextUtils.isEmpty(charsetFromHttpEntity)) {
                    charsetFromHttpEntity = charsetFromHttpEntity.toUpperCase();
                    mustUseDefault = charsetFromHttpEntity.contains("UTF");
                }*/
                bos = new ByteArrayOutputStream();
                byte[] buff = new byte[100]; // temporary buffer for each read in the loop
                int rc = 0;
                while ((rc = is.read(buff, 0, 100)) > 0) {
                    bos.write(buff, 0, rc);
                }
                String charsetFromInputStream = EncodingDetector.getCharset(bos);
                if (!TextUtils.isEmpty(charsetFromInputStream)) {
                    charsetFromInputStream = charsetFromInputStream.toUpperCase();
                    mustUseDefault = charsetFromInputStream.contains("UTF");
                }
                //android.util.Log.e("httpGetWithZip", charsetFromHttpEntity + charsetFromInputStream);
                is.close();
                // Re-read from the buffered bytes now that the original stream is consumed
                is = new ByteArrayInputStream(bos.toByteArray());
            }
            if (mustUseDefault) {
                // Platform default charset, which is UTF-8 on Android
                in = new BufferedReader(new InputStreamReader(is));
            } else {
                in = new BufferedReader(new InputStreamReader(is, "GBK"));
            }
        } else {
            in = new BufferedReader(new InputStreamReader(is));
        }
        String line = "";
        while ((line = in.readLine()) != null) {
            json += line;
        }
    }
    if (statusCode != HttpStatus.SC_OK) {
    }
} catch (ClientProtocolException e) {
    if (httpGet != null) {
        httpGet.abort();
    }
} catch (IllegalArgumentException e) {
    if (httpGet != null) {
        httpGet.abort();
    }
} catch (OutOfMemoryError e) {
    if (httpGet != null) {
        httpGet.abort();
    }
} catch (IOException e) {
    rspInfo.setStatusCode(NetError);
    if (httpGet != null) {
        httpGet.abort();
    }
} finally {
    if (is != null) {
        try {
            is.close();
        } catch (IOException e) {
        }
    }
    if (in != null) {
        try {
            in.close();
        } catch (IOException e) {
        }
    }
    if (bos != null) {
        try {
            bos.close();
        } catch (IOException e) {
        }
    }
}
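The two stream-handling steps in the request code, sniffing for gzip and buffering the body before detection, can be isolated into small helpers. This is a minimal JDK-only sketch; the class name `BodyBuffer` and its method names are hypothetical, and the magic-byte comparison against `0x1f 0x8b` stands in for the `Utils.toInt(header, 0) == GZip_Value` check, whose definition the original does not show.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.GZIPInputStream;

public class BodyBuffer {
    /** Wraps the stream in a GZIPInputStream when it starts with the gzip magic bytes 0x1f 0x8b. */
    public static InputStream unwrapGzip(InputStream raw) throws IOException {
        BufferedInputStream in = new BufferedInputStream(raw);
        in.mark(2);            // remember the start so the two probe bytes can be re-read
        int b0 = in.read();
        int b1 = in.read();
        in.reset();            // rewind to the marked position
        boolean gzip = (b0 == 0x1f && b1 == 0x8b);
        return gzip ? new GZIPInputStream(in) : in;
    }

    /** Reads the stream to its end and returns every byte, so its length is known before detection. */
    public static byte[] readFully(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] chunk = new byte[4096];
        int n;
        while ((n = in.read(chunk)) != -1) {
            out.write(chunk, 0, n);
        }
        return out.toByteArray();
    }
}
```

With the body held in a byte array, the detector can be given an exact length, and the same bytes can then be decoded with `new String(bytes, charsetName)` once the charset is chosen.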
For detector.add(JChardetFacade.getInstance()); to work correctly, place cpdetector_1.0.10.jar in the \libs\ directory, together with antlr-2.7.4.jar, chardet-1.0.jar, and jargs-1.0.jar.