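Checking whether a byte array is UTF-8 encoded in Java
1. Use juniversalchardet (a Java port of Mozilla's universal charset detector, hosted on Google Code) and add the Maven dependency: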
<!-- https://mvnrepository.com/artifact/com.googlecode.juniversalchardet/juniversalchardet -->
<dependency>
    <groupId>com.googlecode.juniversalchardet</groupId>
    <artifactId>juniversalchardet</artifactId>
    <version>1.0.3</version>
</dependency>
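2. Define a shared helper method:
import org.mozilla.universalchardet.UniversalDetector;

// Returns the charset name guessed by juniversalchardet (for example "UTF-8"),
// or null when the detector cannot decide. To test for UTF-8, compare the
// result with "UTF-8" using a null-safe check such as "UTF-8".equals(result).
public static String guessEncoding(byte[] bytes) {
    UniversalDetector detector = new UniversalDetector(null);
    // Feed the whole array to the detector, then signal end of input
    detector.handleData(bytes, 0, bytes.length);
    detector.dataEnd();
    String encoding = detector.getDetectedCharset();
    // Reset so the detector holds no state from this call
    detector.reset();
    return encoding;
}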
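The detector above makes a statistical guess. When the only question is whether the bytes can be UTF-8 at all, a strict decode with the JDK is a useful cross-check. The following is a minimal sketch of that alternative technique, not part of juniversalchardet; the class name `Utf8Check` and method name `isValidUtf8` are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: strict UTF-8 validation using only the JDK.
final class Utf8Check {
    private Utf8Check() {}

    // Returns true if the bytes form a well-formed UTF-8 sequence.
    static boolean isValidUtf8(byte[] bytes) {
        // REPORT makes the decoder throw instead of silently
        // substituting U+FFFD for malformed input.
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }
}
```

Well-formedness is a necessary condition rather than proof: pure ASCII always passes, and short byte sequences from another encoding can occasionally be valid UTF-8 by coincidence, so this check works best alongside the detector rather than in place of it.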
import java.io.IOException;
import java.nio.charset.Charset;
import org.apache.commons.lang3.StringUtils; // assumed: Apache Commons Lang 3
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public abstract class CharsetUtils {

    private static final Logger logger = LoggerFactory.getLogger(CharsetUtils.class);

    public static String detectCharset(String contentType, byte[] contentBytes) throws IOException {
        // 1. Charset declared in the HTTP Content-Type header
        String charset = StringUtils.isNotBlank(contentType) ? UrlUtils.getCharset(contentType) : null;
        if (StringUtils.isNotBlank(charset)) {
            logger.debug("Auto get charset: {}", charset);
            return charset;
        }
        // Decode once with the platform default charset so the markup can be parsed
        String content = new String(contentBytes, Charset.defaultCharset());
        // 2. Charset declared in a <meta> tag
        if (StringUtils.isNotEmpty(content)) {
            Document document = Jsoup.parse(content);
            Elements metas = document.select("meta");
            for (Element meta : metas) {
                String metaContent = meta.attr("content");
                String metaCharset = meta.attr("charset");
                // 2.1 HTML 4.01: <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
                if (metaContent.indexOf("charset") != -1) {
                    String[] parts = metaContent.substring(metaContent.indexOf("charset")).split("=");
                    if (parts.length > 1) {
                        charset = parts[1].trim();
                        break;
                    }
                }
                // 2.2 HTML5: <meta charset="UTF-8" />
                else if (StringUtils.isNotEmpty(metaCharset)) {
                    charset = metaCharset;
                    break;
                }
            }
        }
        // 3. Fall back to byte-level detection only when nothing was declared
        if (StringUtils.isBlank(charset)) {
            charset = guessEncoding(contentBytes);
        }
        logger.debug("Auto get charset: {}", charset);
        return charset;
    }
}
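The name returned by `detectCharset` may be null (the detector found nothing) or a string the JVM does not recognize (a `<meta>` tag can contain anything), so it is worth resolving it defensively before decoding. A minimal sketch, assuming UTF-8 is an acceptable fallback; `CharsetFallback`, `resolve`, and `decode` are hypothetical names, not part of the code above.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;
import java.nio.charset.StandardCharsets;
import java.nio.charset.UnsupportedCharsetException;

// Hypothetical helper: turn a detected charset name into a usable Charset.
final class CharsetFallback {
    private CharsetFallback() {}

    // Returns the named charset, or the fallback when the name is
    // null, blank, syntactically illegal, or unsupported by this JVM.
    static Charset resolve(String name, Charset fallback) {
        if (name == null || name.trim().isEmpty()) {
            return fallback;
        }
        try {
            return Charset.forName(name.trim());
        } catch (IllegalCharsetNameException | UnsupportedCharsetException e) {
            return fallback;
        }
    }

    // Decodes the bytes with the detected charset, assuming UTF-8 otherwise.
    static String decode(byte[] bytes, String detectedName) {
        return new String(bytes, resolve(detectedName, StandardCharsets.UTF_8));
    }
}
```

A caller could then write something like `CharsetFallback.decode(contentBytes, CharsetUtils.detectCharset(contentType, contentBytes))` without a separate null check.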
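// In UrlUtils (imports: java.nio.charset.Charset, java.nio.charset.IllegalCharsetNameException,
// java.util.regex.Matcher, java.util.regex.Pattern)
private static final Pattern patternForCharset =
        Pattern.compile("charset\\s*=\\s*['\"]*([^\\s;'\"]*)", Pattern.CASE_INSENSITIVE);

// Extracts the charset parameter from a Content-Type value,
// or returns null when it is missing or not supported by the JVM.
public static String getCharset(String contentType) {
    if (contentType == null) {
        return null;
    }
    Matcher matcher = patternForCharset.matcher(contentType);
    if (matcher.find()) {
        String charset = matcher.group(1);
        try {
            if (Charset.isSupported(charset)) {
                return charset;
            }
        } catch (IllegalCharsetNameException e) {
            // Empty or malformed name: treat as "no charset declared"
        }
    }
    return null;
}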
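To see what the pattern accepts, here is a small standalone sketch that reuses the same regular expression on a few typical header values. The class name `ContentTypeCharsetDemo` and method name `extract` are hypothetical, and unlike `getCharset` this sketch does not check whether the JVM supports the charset.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo of the Content-Type charset pattern.
final class ContentTypeCharsetDemo {
    private ContentTypeCharsetDemo() {}

    // Same expression as patternForCharset: optional spaces around '=',
    // optional quotes, then everything up to whitespace, ';' or a quote.
    private static final Pattern CHARSET =
            Pattern.compile("charset\\s*=\\s*['\"]*([^\\s;'\"]*)", Pattern.CASE_INSENSITIVE);

    // Returns the raw charset token, or null if none is present.
    static String extract(String contentType) {
        if (contentType == null) {
            return null;
        }
        Matcher m = CHARSET.matcher(contentType);
        if (m.find() && !m.group(1).isEmpty()) {
            return m.group(1);
        }
        return null;
    }
}
```

With this sketch, `extract("text/html; charset=UTF-8")` yields `UTF-8`, a quoted and differently cased value such as `text/html; Charset = "gbk"` yields `gbk`, and `text/html` alone yields null.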