MySQL Character Sets: utf8, utf8mb3, and utf8mb4

2024-08-23 21:48


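To understand MySQL character sets, the best place to start is the official documentation. If your English is reasonably good, the reference manual is easy to follow, and a quick search there turns up the relevant explanations. I have collected the key points here for later reference. Character sets are covered in the following chapter of the manual: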
Chapter 10 Character Sets, Collations, Unicode

https://dev.mysql.com/doc/refman/5.6/en/charset.html



1. Character Set Settings


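MySQL can:
1. Store strings using a variety of character sets.
2. Compare strings using a variety of collations.
3. Mix strings with different character sets or collations in the same server, the same database, or even the same table.
4. Enable specification of character sets and collations at any level.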

MySQL supports the following 40 character sets (the exact list depends on the server version):
mysql> SHOW CHARACTER SET;
+----------+-----------------------------+---------------------+--------+
| Charset  | Description                 | Default collation   | Maxlen |
+----------+-----------------------------+---------------------+--------+
| big5     | Big5 Traditional Chinese    | big5_chinese_ci     |      2 |
| dec8     | DEC West European           | dec8_swedish_ci     |      1 |
| cp850    | DOS West European           | cp850_general_ci    |      1 |
| hp8      | HP West European            | hp8_english_ci      |      1 |
| koi8r    | KOI8-R Relcom Russian       | koi8r_general_ci    |      1 |
| latin1   | cp1252 West European        | latin1_swedish_ci   |      1 |
| latin2   | ISO 8859-2 Central European | latin2_general_ci   |      1 |
| swe7     | 7bit Swedish                | swe7_swedish_ci     |      1 |
| ascii    | US ASCII                    | ascii_general_ci    |      1 |
| ujis     | EUC-JP Japanese             | ujis_japanese_ci    |      3 |
| sjis     | Shift-JIS Japanese          | sjis_japanese_ci    |      2 |
| hebrew   | ISO 8859-8 Hebrew           | hebrew_general_ci   |      1 |
| tis620   | TIS620 Thai                 | tis620_thai_ci      |      1 |
| euckr    | EUC-KR Korean               | euckr_korean_ci     |      2 |
| koi8u    | KOI8-U Ukrainian            | koi8u_general_ci    |      1 |
| gb2312   | GB2312 Simplified Chinese   | gb2312_chinese_ci   |      2 |
| greek    | ISO 8859-7 Greek            | greek_general_ci    |      1 |
| cp1250   | Windows Central European    | cp1250_general_ci   |      1 |
| gbk      | GBK Simplified Chinese      | gbk_chinese_ci      |      2 |
| latin5   | ISO 8859-9 Turkish          | latin5_turkish_ci   |      1 |
| armscii8 | ARMSCII-8 Armenian          | armscii8_general_ci |      1 |
| utf8     | UTF-8 Unicode               | utf8_general_ci     |      3 |
| ucs2     | UCS-2 Unicode               | ucs2_general_ci     |      2 |
| cp866    | DOS Russian                 | cp866_general_ci    |      1 |
| keybcs2  | DOS Kamenicky Czech-Slovak  | keybcs2_general_ci  |      1 |
| macce    | Mac Central European        | macce_general_ci    |      1 |
| macroman | Mac West European           | macroman_general_ci |      1 |
| cp852    | DOS Central European        | cp852_general_ci    |      1 |
| latin7   | ISO 8859-13 Baltic          | latin7_general_ci   |      1 |
| utf8mb4  | UTF-8 Unicode               | utf8mb4_general_ci  |      4 |
| cp1251   | Windows Cyrillic            | cp1251_general_ci   |      1 |
| utf16    | UTF-16 Unicode              | utf16_general_ci    |      4 |
| utf16le  | UTF-16LE Unicode            | utf16le_general_ci  |      4 |
| cp1256   | Windows Arabic              | cp1256_general_ci   |      1 |
| cp1257   | Windows Baltic              | cp1257_general_ci   |      1 |
| utf32    | UTF-32 Unicode              | utf32_general_ci    |      4 |
| binary   | Binary pseudo charset       | binary              |      1 |
| geostd8  | GEOSTD8 Georgian            | geostd8_general_ci  |      1 |
| cp932    | SJIS for Windows Japanese   | cp932_japanese_ci   |      2 |
| eucjpms  | UJIS for Windows Japanese   | eucjpms_japanese_ci |      3 |
+----------+-----------------------------+---------------------+--------+

40 rows in set (0.00 sec)


String expressions have a repertoire attribute, which can have two values:

  • ASCII: The expression can contain only characters in the Unicode range U+0000 to U+007F.

  • UNICODE: The expression can contain characters in the Unicode range U+0000 to U+10FFFF. This includes characters in the Basic Multilingual Plane (BMP) range (U+0000 to U+FFFF) and supplementary characters outside the BMP range (U+10000 to U+10FFFF).

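The passage above mentions the Basic Multilingual Plane (BMP) and supplementary characters:
  Basic Multilingual Plane (BMP): plane 0, U+0000 to U+FFFF.
  Supplementary characters: planes 1 through 16, U+10000 to U+10FFFF. The first of these is the Supplementary Multilingual Plane (SMP), plane 1.
  The BMP already contains the characters in common use. The supplementary planes hold less common code points, such as emoji and playing-card symbols (both in the SMP).
  For details on the BMP and the supplementary planes, see the Wikipedia article: https://en.wikipedia.org/wiki/Plane_(Unicode)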
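To make the BMP/supplementary split concrete, here is a minimal Python sketch. It relies only on the fact that each Unicode plane spans 0x10000 code points. The helper names plane_of and is_supplementary are made up for illustration and have nothing to do with MySQL itself.

```python
def plane_of(ch: str) -> int:
    """Return the Unicode plane number of one character (0 = BMP, 1 = SMP, ...)."""
    return ord(ch) // 0x10000


def is_supplementary(ch: str) -> bool:
    """True if the character lies outside the BMP (code point above U+FFFF)."""
    return ord(ch) > 0xFFFF


samples = [
    "A",           # U+0041, Basic Latin
    "\u4e2d",      # U+4E2D, a common CJK ideograph
    "\U0001F600",  # U+1F600, an emoji
    "\U0001F0A1",  # U+1F0A1, a playing-card symbol
]

for ch in samples:
    print(f"U+{ord(ch):04X} plane={plane_of(ch)} supplementary={is_supplementary(ch)}")
```

Characters for which is_supplementary returns True are the ones a BMP-only character set cannot represent.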


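  The server stores metadata in utf8 by default; the character_set_system system variable reports this character set. The character_set_results variable, which also defaults to utf8 in the server version documented here, controls the character set of the result data the server returns to the client when a query runs. If you want the server to pass metadata results back in a different character set, use the SET NAMES statement to force the server to perform character set conversion. Alternatively, the client program can convert the results after receiving them from the server. Converting on the client is more efficient, but this option is not available to every client.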


SET NAMES 'utf8';

There are default settings for character sets and collations at four levels: server, database, table, and column.

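Collation names use the following suffixes:

  _ai    Accent insensitive
  _as    Accent sensitive
  _ci    Case insensitive
  _cs    Case sensitive
  _bin   Binary

When a collation name contains neither _ai nor _as, _ci explicitly means case insensitive and implicitly means accent insensitive.

Likewise, _cs explicitly means case sensitive and implicitly includes _as, that is, accent sensitive.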
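As a rough illustration of what case and accent sensitivity mean for string comparison, here is a Python sketch. It is only an approximation built on the standard unicodedata module, and it is not how MySQL implements its collations; the function names fold and equal are hypothetical.

```python
import unicodedata


def fold(text: str, case_insensitive: bool, accent_insensitive: bool) -> str:
    """Build a comparison key that ignores case and/or accents."""
    if accent_insensitive:
        # Decompose accented letters, then drop the combining accent marks.
        decomposed = unicodedata.normalize("NFD", text)
        text = "".join(c for c in decomposed if not unicodedata.combining(c))
    if case_insensitive:
        text = text.casefold()
    return text


def equal(a: str, b: str, *, ci: bool, ai: bool) -> bool:
    """Compare two strings under the chosen sensitivity settings."""
    return fold(a, ci, ai) == fold(b, ci, ai)


resume_accented = "R\u00e9sum\u00e9"  # "Résumé"

print(equal(resume_accented, "resume", ci=True, ai=True))    # like an _ai_ci collation
print(equal(resume_accented, "resume", ci=True, ai=False))   # like an _as_ci collation
print(equal(resume_accented, "r\u00e9sum\u00e9", ci=False, ai=False))  # like an _as_cs collation
```

Only the first comparison treats the two strings as equal, because it ignores both the accents and the capital letter.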

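The server character set is controlled by the character-set-server option (and, optionally, collation-server):

character-set-server

Method 1: pass the options when starting the server. Each line below is a separate way to start mysqld:

    mysqld
    mysqld --character-set-server=latin1
    mysqld --character-set-server=latin1 \
      --collation-server=latin1_swedish_ci

Method 2: set the compiled-in defaults when building MySQL from source:

    cmake . -DDEFAULT_CHARSET=latin1
or
    cmake . -DDEFAULT_CHARSET=latin1 \
      -DDEFAULT_COLLATION=latin1_german1_ci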

The current server character set and collation can be determined from the values of the character_set_server and collation_server system variables. These variables can be changed at runtime.

2. Database Character Set and Collation

CREATE DATABASE db_name [[DEFAULT] CHARACTER SET charset_name] [[DEFAULT] COLLATE collation_name]
ALTER DATABASE db_name [[DEFAULT] CHARACTER SET charset_name] [[DEFAULT] COLLATE collation_name]

The keyword SCHEMA can be used instead of DATABASE.

All database options are stored in a text file named db.opt that can be found in the database directory. (This applies to MySQL 5.7 and earlier; MySQL 8.0 stores these options in its data dictionary instead.)

The CHARACTER SET and COLLATE clauses make it possible to create databases with different character sets and collations on the same MySQL server.

To check these two settings for a given database:
USE db_name;
SELECT @@character_set_database, @@collation_database;


3. Table Character Set and Collation

The CREATE TABLE and ALTER TABLE statements have optional clauses for specifying the table character set and collation:
CREATE TABLE tbl_name (column_list) [[DEFAULT] CHARACTER SET charset_name] [COLLATE collation_name]
ALTER TABLE tbl_name [[DEFAULT] CHARACTER SET charset_name] [COLLATE collation_name]

4. Column Character Set and Collation

Every “character” column (that is, a column of type CHAR, VARCHAR, or TEXT) has a column character set and a column collation. Column definition syntax for CREATE TABLE and ALTER TABLE has optional clauses for specifying the column character set and collation:

col_name {CHAR | VARCHAR | TEXT} (col_length) [CHARACTER SET charset_name] [COLLATE collation_name]
col_name {ENUM | SET} (val_list) [CHARACTER SET charset_name] [COLLATE collation_name]

5. Character String Literal Character Set and Collation

For the simple statement SELECT 'string', the string has the connection default character set and collation defined by the character_set_connection and collation_connection system variables.

A character string literal may have an optional character set introducer and COLLATE clause, to designate it as a string that uses a particular character set and collation:
[_charset_name]'string' [COLLATE collation_name]

Examples:
SELECT 'abc'; 
SELECT _latin1'abc'; 
SELECT _binary'abc'; 
SELECT _utf8'abc' COLLATE utf8_danish_ci;


6. The National Character Set

Standard SQL defines NCHAR or NATIONAL CHAR as a way to indicate that a CHAR column should use some predefined character set. MySQL uses utf8 as this predefined character set. For example, these data type declarations are equivalent:

CHAR(10) CHARACTER SET utf8 
NATIONAL CHARACTER(10) 
NCHAR(10)

As are these:
VARCHAR(10) CHARACTER SET utf8 
NATIONAL VARCHAR(10) 
NVARCHAR(10) 
NCHAR VARCHAR(10) 
NATIONAL CHARACTER VARYING(10) 
NATIONAL CHAR VARYING(10)


7. Character Set Introducers


A character string literal, hexadecimal literal, or bit-value literal may have an optional character set introducer and COLLATE clause, to designate it as a string that uses a particular character set and collation:

[_charset_name] literal [COLLATE collation_name]

Character set introducers and the COLLATE clause are implemented according to standard SQL specifications.

Examples:
SELECT 'abc'; 
SELECT _latin1'abc'; 
SELECT _binary'abc'; 
SELECT _utf8'abc' COLLATE utf8_danish_ci; 

SELECT _latin1 X'4D7953514C';          -- hexadecimal literal
SELECT _utf8 0x4D7953514C COLLATE utf8_danish_ci;

SELECT _latin1 b'1000001';             -- bit-value (binary) literal
SELECT _utf8 0b1000001 COLLATE utf8_danish_ci;
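
To see which strings the hexadecimal and bit-value literals in these examples stand for, the bytes can be decoded by hand. The following Python sketch does that with the standard library only; it shows how the bytes map to characters and is not a model of how the server parses introducers.

```python
# X'4D7953514C' with the _latin1 introducer: five bytes read as latin1 text.
hex_literal = bytes.fromhex("4D7953514C")
print(hex_literal.decode("latin-1"))

# b'1000001' is a bit-value literal: binary 1000001 is 65, i.e. byte 0x41.
bit_literal = int("1000001", 2).to_bytes(1, "big")
print(bit_literal.decode("latin-1"))
```

The first literal spells out the string MySQL and the second is the single letter A.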


8. Unicode Support


BMP characters have these characteristics:

  • Their code point values are between 0 and 65535 (or U+0000 and U+FFFF).

  • They can be encoded in a variable-length encoding using 8, 16, or 24 bits (1 to 3 bytes).

  • They can be encoded in a fixed-length encoding using 16 bits (2 bytes).

  • They are sufficient for almost all characters in major languages.

Supplementary characters lie outside the BMP:

  • Their code point values are between U+10000 and U+10FFFF.

  • Unicode support for supplementary characters requires character sets that have a range outside BMP characters and therefore take more space than BMP characters (up to 4 bytes per character).


The UTF-8 (Unicode Transformation Format with 8-bit units) method for encoding Unicode data is implemented according to RFC 3629, which describes encoding sequences that take from one to four bytes. The idea of UTF-8 is that various Unicode characters are encoded using byte sequences of different lengths:

  • Basic Latin letters, digits, and punctuation signs use one byte.

  • Most European and Middle East script letters fit into a 2-byte sequence: extended Latin letters (with tilde, macron, acute, grave and other accents), Cyrillic, Greek, Armenian, Hebrew, Arabic, Syriac, and others.

  • Korean, Chinese, and Japanese ideographs use 3-byte or 4-byte sequences.
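
The byte lengths in the list above can be checked with a few lines of Python, using the standard UTF-8 codec. The sample characters are my own picks for illustration.

```python
samples = {
    "Basic Latin letter": "A",
    "Latin letter with acute accent": "\u00e9",  # é
    "Cyrillic letter": "\u042f",                 # Я
    "Hebrew letter": "\u05d0",                   # א
    "CJK ideograph in the BMP": "\u4e2d",        # 中
    "CJK ideograph outside the BMP": "\U00020000",
}

# Number of bytes each sample character occupies when encoded as UTF-8.
utf8_lengths = {label: len(ch.encode("utf-8")) for label, ch in samples.items()}

for label, length in utf8_lengths.items():
    print(f"{label}: {length} byte(s)")
```

Only the last sample needs four bytes, which is why a character set limited to three bytes per character cannot store it.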

MySQL supports these Unicode character sets:

  • utf8mb4: A UTF-8 encoding of the Unicode character set using one to four bytes per character.

  • utf8mb3: A UTF-8 encoding of the Unicode character set using one to three bytes per character.

  • utf8: An alias for utf8mb3.

  • ucs2: The UCS-2 encoding of the Unicode character set using two bytes per character.

  • utf16: The UTF-16 encoding for the Unicode character set using two or four bytes per character. Like ucs2 but with an extension for supplementary characters.

  • utf16le: The UTF-16LE encoding for the Unicode character set. Like utf16 but little-endian rather than big-endian.

  • utf32: The UTF-32 encoding for the Unicode character set using four bytes per character.


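The following table summarizes the storage each Unicode character set requires per character:

+---------------+-----------------------+--------------------------------+
| Character Set | Supported Characters  | Required Storage Per Character |
+---------------+-----------------------+--------------------------------+
| utf8mb3, utf8 | BMP only              | 1, 2, or 3 bytes               |
| ucs2          | BMP only              | 2 bytes                        |
| utf8mb4       | BMP and supplementary | 1, 2, 3, or 4 bytes            |
| utf16         | BMP and supplementary | 2 or 4 bytes                   |
| utf16le       | BMP and supplementary | 2 or 4 bytes                   |
| utf32         | BMP and supplementary | 4 bytes                        |
+---------------+-----------------------+--------------------------------+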
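
The per-character storage figures can be reproduced in Python. This sketch assumes that Python's utf-8, utf-16-be, utf-16-le, and utf-32-be codecs are reasonable stand-ins for MySQL's utf8mb4, utf16, utf16le, and utf32. That mapping is my assumption for measuring lengths only, and storage_bytes is a hypothetical helper name.

```python
# Assumed mapping from MySQL character set names to Python codecs (byte lengths only).
CODECS = {
    "utf8mb4": "utf-8",
    "utf16": "utf-16-be",
    "utf16le": "utf-16-le",
    "utf32": "utf-32-be",
}


def storage_bytes(ch: str) -> dict:
    """Return the encoded length of one character in each Unicode encoding."""
    return {name: len(ch.encode(codec)) for name, codec in CODECS.items()}


bmp_char = "\u4e2d"                # a BMP character
supplementary_char = "\U0001F600"  # a supplementary character

print("BMP:", storage_bytes(bmp_char))
print("supplementary:", storage_bytes(supplementary_char))
```

A BMP character takes 3, 2, 2, and 4 bytes respectively, while a supplementary character takes 4 bytes in every one of these encodings.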


9. Converting Between utf8 (utf8mb3) and utf8mb4


10.9.8 Converting Between 3-Byte and 4-Byte Unicode Character Sets

The utf8mb3 and utf8mb4 character sets differ as follows:

utf8mb3 supports only characters in the Basic Multilingual Plane (BMP). utf8mb4 additionally supports supplementary characters that lie outside the BMP.

Note

This discussion refers to the utf8mb3 and utf8mb4 character set names to be explicit about referring to 3-byte and 4-byte UTF-8 character set data. The exception is that in table definitions, utf8 is used because MySQL converts instances of utf8mb3 specified in such definitions to utf8, which is an alias for utf8mb3.


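Converting from utf8 (utf8mb3) to utf8mb4 is straightforward. Keep the following points in mind:

1. After converting from utf8 (utf8mb3) to utf8mb4, columns can store supplementary characters.

2. Converting from utf8 (utf8mb3) to utf8mb4 may increase the storage space required, since a character can now take up to 4 bytes instead of 3.

3. For BMP characters, utf8 (utf8mb3) and utf8mb4 use the same code values, the same encoding, and the same length, so nothing changes.

4. Supplementary characters are stored in utf8mb4 using 4 bytes. Because utf8mb3 cannot store supplementary characters at all, existing utf8mb3 data contains none, so there is no need to worry about characters that cannot be converted.

5. Table definitions may need adjusting during conversion. For variable-length character types (VARCHAR and the TEXT types), a column of a given byte capacity holds fewer characters in utf8mb4 than in utf8 (utf8mb3). For all character types (CHAR, VARCHAR, and the TEXT types), the maximum number of characters that can be indexed is smaller for utf8mb4 columns than for utf8mb3 columns. Before converting, check the column types so that the converted table and its indexes do not exceed the maximum number of bytes the column type or index can hold. For InnoDB tables that use the COMPACT or REDUNDANT row format, the maximum index key prefix length is 767 bytes, which means 255 characters can be indexed with utf8mb3 and 191 characters with utf8mb4. If an index no longer fits after conversion, index a shorter prefix or a different column. The following note describes how the COMPRESSED or DYNAMIC row format allows longer index prefixes.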

Note:

For InnoDB tables that use COMPRESSED or DYNAMIC row format, you can enable the innodb_large_prefix option to permit index key prefixes longer than 767 bytes (up to 3072 bytes). (Creating such tables also requires the option values innodb_file_format=barracuda and innodb_file_per_table=true.) In this case, enabling the innodb_large_prefix option enables you to index a maximum of 1024 or 768 characters for utf8mb3 or utf8mb4 columns, respectively. For related information, see Section 14.8.1.7, “Limits on InnoDB Tables”.


The preceding types of changes are most likely to be required only if you have very long columns or indexes. Otherwise, you should be able to convert your tables from utf8mb3 to utf8mb4 without problems, using ALTER TABLE as described previously.
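
The character limits quoted in this section follow from integer division of the byte limit by the maximum bytes per character. Here is a short Python sketch of that arithmetic; the helper names max_chars and fits_in_index are hypothetical, and the 767 and 3072 byte limits are the ones stated in the text.

```python
MAX_BYTES_PER_CHAR = {"utf8mb3": 3, "utf8mb4": 4}


def max_chars(byte_limit: int, charset: str) -> int:
    """Largest number of characters guaranteed to fit in byte_limit bytes."""
    return byte_limit // MAX_BYTES_PER_CHAR[charset]


def fits_in_index(column_chars: int, charset: str, byte_limit: int = 767) -> bool:
    """Check whether a fully indexed column of column_chars characters stays within the limit."""
    return column_chars * MAX_BYTES_PER_CHAR[charset] <= byte_limit


for limit in (767, 3072):
    for charset in MAX_BYTES_PER_CHAR:
        print(f"{limit} bytes, {charset}: up to {max_chars(limit, charset)} characters")

# A VARCHAR(255) column indexed in full, under the 767-byte limit:
print(fits_in_index(255, "utf8mb3"))
print(fits_in_index(255, "utf8mb4"))
```

The same division explains why a column that could be indexed in full under utf8mb3 may need a prefix index, or a shorter declared length, after conversion to utf8mb4.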


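6. The character sets used by applications and by the MySQL server must be changed to match each other.

7. If the master instance changes its character set, the slaves must be changed accordingly.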






http://www.chinasem.cn/article/1100581
