-
ASCII
- Every character can be stored in 7 bits (codes 0-127)
- Everyone used codes 128-255 for their own purposes.
-
ANSI standard
- everybody agreed on what to do below 128, which was the same as ASCII
-
From 128 on up: code pages.
-
Samples
- 936: GB2312 (Microsoft's code page 936 actually implements GBK, a superset of GB2312)
- 950: Big5
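A small Python sketch of what a code page does to bytes from 128 up. The byte values and the choice of code pages 437, 866 and 1252 are illustrative assumptions, not taken from the notes above.

```python
# One byte below 128 and one byte above it.
low = b"\x41"
high = b"\xe9"

# Code pages chosen only for illustration.
code_pages = ["cp437", "cp866", "cp1252"]

low_decoded = {cp: low.decode(cp) for cp in code_pages}
high_decoded = {cp: high.decode(cp) for cp in code_pages}

# Below 128 every code page agrees with ASCII.
print(low_decoded)
# From 128 up, the same byte is a different character in each code page.
print(high_decoded)
```

The bytes on disk never change; only the table used to read them does.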
-
A byte was a character and a character was 8 bits?
- NO!!!
- In Asia there was DBCS, the "double byte character set": some characters take 1 byte, others take 2
-
DBCS
- GB2312
- 1. ASCII characters: 1 char = 1 byte
- 2. Chinese characters: 1 char = 2 bytes
- e.g. "张博华123" = 3*2 + 3 = 9 bytes
- GBK
- GB18030: uses 1-byte, 2-byte and 4-byte encoding schemes.
- From GB2312 to GB18030
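The byte arithmetic above can be checked with Python's built-in codecs. This is only a sketch; the character U+20000 is picked here as an assumed example of something that needs GB18030's 4-byte form.

```python
text = "张博华123"

gb2312_bytes = text.encode("gb2312")
gbk_bytes = text.encode("gbk")
gb18030_bytes = text.encode("gb18030")

# 6 characters, but 3*2 + 3 = 9 bytes.
print(len(text), len(gb2312_bytes))

# GBK and GB18030 are backward compatible: same bytes for GB2312 text.
print(gb2312_bytes == gbk_bytes == gb18030_bytes)

# A character outside the Basic Multilingual Plane (example choice).
rare = "\U00020000"
rare_bytes = rare.encode("gb18030")
print(len(rare_bytes))  # GB18030 uses its 4-byte form here
```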
-
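"ANSI" when creating a new text file
- "ANSI" on its own is meaningless!!!
- You must say which code page is meant (here "code page" effectively means "encoding")
- The file's encoding is whatever the machine's current default code page points to
- On a Simplified Chinese machine: GB2312 (more precisely code page 936, i.e. GBK)
- On a Traditional Chinese machine: Big5 (code page 950)
- On an English machine: Windows-1252 (close to, but not identical to, ISO 8859-1, aka Latin-1)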
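To see why "ANSI" needs a code page, here is a sketch that reads the same bytes under three default code pages. It assumes the file was written on a Simplified Chinese machine (cp936); the sample character is an arbitrary choice.

```python
# Bytes as an "ANSI" save would produce them on a Simplified Chinese machine.
saved = "张".encode("cp936")

# The same bytes opened on machines with different default code pages.
as_simplified = saved.decode("cp936")
as_traditional = saved.decode("cp950", errors="replace")
as_english = saved.decode("cp1252", errors="replace")

print(as_simplified)   # the original text
print(as_traditional)  # something else entirely
print(as_english)      # Western European mojibake
```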
-
Unicode
- A letter/character maps to a code point
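- Unicode assigns a code (a code point) to each character; an encoding is how that code is stored.
- "Encoding" can be a noun or a verb
- just as the Chinese word "编码" can be a noun or a verb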
- Encoding: How that code point is represented in memory or on disk
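A short Python sketch of the distinction between a code point and its encodings. The sample character and the list of encodings are arbitrary choices for illustration.

```python
ch = "张"

# The code point is a number assigned by Unicode.
code_point = ord(ch)
print(f"U+{code_point:04X}")

# Each encoding stores that one code point as a different byte sequence.
encodings = ["utf-8", "utf-16-le", "utf-16-be", "utf-32-le", "gbk"]
stored = {name: ch.encode(name) for name in encodings}
for name, raw in stored.items():
    print(f"{name:10} {raw.hex(' ')}")
```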
-
Common Encodings
- UTF-7
-
UTF-8
- English text looks exactly the same in UTF-8 as it did in ASCII
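- Every code point is stored in 1 to 4 bytes (the original design allowed up to 6, but RFC 3629 limits UTF-8 to 4).
- Common Chinese characters (those in the BMP) take 3 bytes each, never 2; rare characters outside the BMP take 4.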
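The byte counts can be confirmed with a few sample characters. The samples below are assumptions chosen to hit each length once.

```python
# One sample per UTF-8 length: ASCII, Latin, CJK (BMP), CJK (outside the BMP).
samples = ["A", "é", "张", "\U00020000"]
utf8_lengths = {s: len(s.encode("utf-8")) for s in samples}
print(utf8_lengths)

# Pure English text is byte-for-byte identical in ASCII and UTF-8.
ascii_same = "Hello".encode("utf-8") == "Hello".encode("ascii")
print(ascii_same)
```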
-
UTF-16/UCS-2
- UCS-2 (because it has two bytes) / UTF-16 (because it has 16 bits)
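- UTF-16 is not the same as UCS-2; it is the successor to UCS-2 and adds surrogate pairs for code points above U+FFFF
- store-it-in-two-bytes
- Big-endian (high-endian) UCS-2 or little-endian (low-endian) UCS-2?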
- With or without BOM?
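A sketch of where UCS-2 and UTF-16 part ways: a BMP character fits in one 16-bit unit, while a character above U+FFFF needs a surrogate pair. The helper `utf16_units` is a hypothetical name introduced here, and the sample characters are arbitrary.

```python
import struct

bmp = "张"              # U+5F20, inside the BMP
astral = "\U0001F600"   # U+1F600, outside the BMP

def utf16_units(s: str) -> list[int]:
    """Return the 16-bit code units of s when encoded as UTF-16."""
    raw = s.encode("utf-16-be")
    return list(struct.unpack(f">{len(raw) // 2}H", raw))

bmp_units = utf16_units(bmp)
astral_units = utf16_units(astral)

print([hex(u) for u in bmp_units])     # one unit, same as UCS-2
print([hex(u) for u in astral_units])  # two units: a surrogate pair
```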
-
UTF-32/UCS-4
- store-it-in-four-bytes
-
BOM
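- A BOM is not required, but without one the reader can only guess
- In UTF-8 the BOM is not required, and omitting it is recommended: it can cause problems when files are concatenated
- LE vs. BE example for U+0089: LE is 89 00; BE is 00 89!!! BE is the "natural" reading order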
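The byte orders and the concatenation problem can be reproduced in Python. This is a sketch using the standard `codecs` constants; the string "abc" stands in for a file's contents.

```python
import codecs

ch = "\u0089"
le = ch.encode("utf-16-le")  # low byte first
be = ch.encode("utf-16-be")  # high byte first
print(le.hex(" "), "|", be.hex(" "))

# The generic "utf-16" codec prepends a BOM in the machine's native order.
with_bom = ch.encode("utf-16")

# UTF-8: "utf-8" writes no BOM, "utf-8-sig" writes one.
plain = "abc".encode("utf-8")
signed = "abc".encode("utf-8-sig")

# Concatenating two BOM-prefixed files leaves a stray U+FEFF in the middle.
joined = (signed + signed).decode("utf-8-sig")
print(repr(joined))
```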
-
Unicode in Windows
-
Encoding choices when saving a text file with Save As
- "Unicode" there means UTF-16 LE
- "Unicode Big Endian" there means UTF-16 BE
- Windows' internal encoding is Unicode (UTF-16)
-
Unicode in .NET
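- 1. In .NET, strings are Unicode; "Unicode" here means UTF-16 LE
- 2. File read/write flow
- 3. Use the GetBytes() method of the matching Encoding to get a string's byte[]
- 4. Going from string to byte[] requires specifying an Encoding
- 5. Likewise, going from byte[] back to string requires knowing the Encoding that was used
- 6. For the Unicode encodings (UTF8/UTF16/UTF32), GetBytes() never returns a BOM
- 7. String.Length is the number of characters (more precisely, UTF-16 code units), not the length of the byte[]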
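The same round trip, sketched in Python rather than C# (so `encode`/`decode` stand in for `GetBytes()`/`GetString()`; this is an analogy, not the .NET API). The sample string is an assumption chosen to include one character outside the BMP.

```python
import codecs

s = "张博华123\U0001F600"

# string -> bytes: the encoding must be named explicitly.
utf8 = s.encode("utf-8")
utf16 = s.encode("utf-16-le")

# bytes -> string: decoding with the same encoding restores the text,
# decoding with a different one silently produces garbage.
round_trip = utf8.decode("utf-8")
wrong = utf8.decode("cp1252", errors="replace")

# Python's len() counts code points; .NET's String.Length counts
# UTF-16 code units, which is what len(utf16) // 2 gives here.
char_count = len(s)
utf16_unit_count = len(utf16) // 2

# Like GetBytes(), the explicit-endian codecs emit no BOM.
has_bom = utf16.startswith(codecs.BOM_UTF16_LE)

print(char_count, utf16_unit_count, len(utf8), len(utf16), has_bom)
```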
-
Unicode in SQL Server
-
1. varchar and nvarchar
- varchar: Variable-length, non-Unicode character data. The database collation determines which code page the data is stored using.
- nvarchar: Variable-length Unicode character data. Dependent on the database collation for comparisons.
-
Storage required
- varchar(n): actual data length in bytes (at most n) + 2 bytes
- nvarchar(n): 2 bytes per character stored (at most 2n) + 2 bytes
-
Number of characters that can be stored
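- varchar(n): n is a size in bytes, so up to n single-byte characters, fewer when the code page uses double-byte characters
- nvarchar(n): n is a size in byte-pairs, so up to n BMP characters; a character stored as a surrogate pair uses two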
-
2. varchar(max) and nvarchar(max)
- From the SO answer linked below: nvarchar(max) can store up to (and sometimes beyond) 2 GB of data (about 1 billion double-byte characters).
- nvarchar(max) is a replacement for ntext which is deprecated
- SO: http://stackoverflow.com/a/12639972
-
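3. In T-SQL, how is a variable's collation determined?
- What a collation does: a) determines the code page; b) determines the sort order; c) determines the comparison rules
- Once a variable is declared, can its code page be changed later during execution?
- See this article: https://the.agilesql.club/Blogs/Ed-Elliott/What-collation-variables-take-on-inT-SQL
- Code may involve both columns and variables; how is the collation decided then? See these two MSDN articles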
-
4. Examples of collation handling in T-SQL
-
String concatenation
-
varchar + nvarchar
- A type conversion happens first: the varchar data becomes nvarchar
- Then the two nvarchar values are concatenated
-
nvarchar + nvarchar
- The result is naturally nvarchar
-
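Concatenating two varchars with different collations
- Column A + column B: error
- Variable A + column B:
- Column B's collation is used
- If variable A is cast with an explicitly specified collation, that new collation is used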
-
Testing with substring as the example
-
How varchars with different collations are handled
- SQL Server handles the collation automatically and cuts by character
-
How nvarchar is handled
- SQL Server handles the collation automatically and cuts by character
-
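Data storage and conversion
-
Storing a string with collation A into a column with collation B
- SQL Server converts it; data may be lost
- Because a character in code page A may have no counterpart in code page B
-
Converting a varchar with collation A to a varchar with collation B
- Same as above
-
Storing an nvarchar string into a varchar column with collation B
- SQL Server converts it; data may be lost
- A Unicode code point may have no corresponding character in code page B
-
Converting nvarchar to varchar
- Same as above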
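The data loss described above can be imitated in Python (an analogy only; this is not T-SQL, and SQL Server's exact substitution rules may differ). The helper `convert` is a hypothetical name, and the code pages gbk and cp1252 are assumed stand-ins for "code page A" and "code page B".

```python
def convert(raw: bytes, src: str, dst: str) -> bytes:
    """Re-encode raw from code page src to code page dst.

    Characters with no counterpart in dst are replaced by '?'.
    """
    return raw.decode(src).encode(dst, errors="replace")

# varchar (code page A) -> varchar (code page B)
source = "A张".encode("gbk")
converted = convert(source, "gbk", "cp1252")
print(converted)  # the Chinese character did not survive

# nvarchar (Unicode) -> varchar (code page B)
unicode_text = "é张"
narrowed = unicode_text.encode("cp1252", errors="replace")
print(narrowed)  # é has a cp1252 byte, 张 does not
```

Once the '?' has been written, the original character cannot be recovered from the stored bytes.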
-
What does a COLLATE clause after an equals sign do?
- The equality comparison is performed using that collation
-
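5. Cases involving both a client and a server
- For example, connecting to SQL Server with SSMS and running queries
- For example, querying another SQL Server through a Linked Server
- See this post on SO
- Which driver does ADO.NET use? Can AutoTranslate be turned off?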
- 6. How should the statement "SQL Server's internal encoding is Unicode" be understood?
-
Unicode in JavaScript
- JavaScript doesn't even have a concept of string encodings! Strings are all stored in UTF-16 or UCS-2 format, and so there's no way to manually force a string to be interpreted as a certain encoding.
- JavaScript engines are free to use UCS-2 or UTF-16 internally. Most engines that I know of use UTF-16, but whatever choice they made, it’s just an implementation detail that won’t affect the language’s characteristics.
- The ECMAScript/JavaScript language itself, however, exposes characters according to UCS-2, not UTF-16. https://mathiasbynens.be/notes/javascript-encoding
-
Converting between encodings in JS?
- In Chrome and Firefox, the newer TextDecoder API can be used
- Otherwise the only option is a JS library
- http://www.ruanyifeng.com/blog/2014/12/unicode.html
-
Using Notepad++
- "ANSI as UTF-8" means UTF-8 without BOM
-
Is the selection count shown in the status bar a character count? NO
- UTF-8 file: number of bytes
- Big-5 file: ?
- gb2312 file: number of bytes
- UTF-16 file: ?
- Code