Unicode Inside

ASCII
1. Every character can be stored in 7 bits (codes 0-127).
2. Codes 128-255 were left free, so everyone used them for their own purposes.
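A minimal Python sketch of the two points above. The choice of `cp437` and `latin-1` as examples of two competing 8-bit character sets is an assumption made for illustration; the notes themselves do not name any.

```python
# ASCII: every code fits in 7 bits, so the high bit of each byte is 0.
ascii_bytes = "Hello".encode("ascii")
fits_in_7_bits = all(b < 0x80 for b in ascii_bytes)

# The same byte above 127 means different things in different 8-bit sets.
high_byte = bytes([0xE9])
as_cp437 = high_byte.decode("cp437")     # original IBM PC character set
as_latin1 = high_byte.decode("latin-1")  # ISO 8859-1

print(fits_in_7_bits, repr(as_cp437), repr(as_latin1))
```

The byte is identical in both cases; only the table used to interpret it changes.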
ANSI standard
1. Everybody agreed on what to do below 128, which was the same as ASCII.
2. From 128 on up: code pages.
  1. Samples
    1. 936: GB2312
    2. 950: Big5
  2. A byte was a character and a character was 8 bits?
    1. NO!!!
    2. In Asia, DBCS, the "double byte character set"
    3. DBCS
      1. GB2312
      2. ASCII characters: 1 char = 1 byte
      3. Chinese characters: 1 char = 2 bytes
      4. e.g. "张博华123" = 3*2 + 3 = 9 bytes
      5. GBK
      6. GB18030: uses a mix of 1-byte, 2-byte, and 4-byte codes.
      7. From GB2312 to GB18030
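3. "ANSI" when creating a new text file
  1. Saying just "ANSI" is meaningless!!!
  2. You must state which code page is meant (here "code page" effectively means "encoding").
  3. The file's encoding is whatever the computer's current default code page points to.
  4. On a Simplified Chinese machine: code page 936, i.e. GBK (a superset of GB2312)
  5. On a Traditional Chinese machine: Big5 (code page 950)
  6. On an English machine: Windows-1252 (often loosely called ISO 8859-1, aka Latin-1; the two differ in the 0x80-0x9F range)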
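The byte arithmetic for DBCS text, and why "ANSI" means nothing without a code page, can be sketched in Python. This is only an illustration using Python's built-in `gb2312`, `gbk`, `big5`, and `cp1252` codecs; the variable names are made up for the example.

```python
text = "张博华123"

# 3 Chinese characters * 2 bytes + 3 ASCII characters * 1 byte
gb = text.encode("gb2312")
gb_len = len(gb)

# GBK is a superset of GB2312, so these characters get the same bytes.
same_in_gbk = text.encode("gbk") == gb

# The Traditional Chinese spelling has the same byte count in Big5.
big5_len = len("張博華123".encode("big5"))

# Reading the GB2312 bytes with the wrong code page does not fail for a
# single-byte code page: it silently yields 9 unrelated characters.
wrong = gb.decode("cp1252", errors="replace")

# Reading them with the right code page restores the text.
right = gb.decode("gb2312")

print(gb_len, big5_len, repr(wrong), right)
```

The bytes on disk never change; the reader has to know which code page to apply to them.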
Unicode
1. A letter/character maps to a code point
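2. Unicode is the numbering of characters; an encoding is the way those numbers are stored.
3. "Encoding" can be a noun or a verb,
4. just as the Chinese word “编码” can be either a noun or a verb.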
5. Encoding: How that code point is represented in memory or on disk
6. Common Encodings
  1. UTF-7
  2. UTF-8
    1. English text looks exactly the same in UTF-8 as it did in ASCII
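    2. Every code point is stored in 1 to 4 bytes. (The original design allowed up to 6, but RFC 3629 restricts UTF-8 to 4.)
    3. Each Chinese character in the BMP is stored as 3 bytes (none take 2 bytes); rarer characters outside the BMP take 4 bytes.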
  3. UTF-16/UCS-2
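    1. UCS-2 (because it uses two bytes) / UTF-16 (because it uses 16-bit units)
    2. UTF-16 is not the same as UCS-2; it is UCS-2's successor and adds surrogate pairs for code points above U+FFFF.
    3. The store-it-in-two-bytes idea
    4. Big-endian (high byte first) or little-endian (low byte first) UCS-2?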
    5. With or without BOM?
  4. UTF-32/UCS-4
    1. The store-it-in-four-bytes idea
7. BOM
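  1. A BOM is not required, but without one the reader can only guess the encoding.
  2. In UTF-8 a BOM is not required, and writing one is discouraged: it can cause problems when files are concatenated.
  3. LE vs. BE example for the 16-bit value 0x0089: LE is 89 00; BE is 00 89!!! BE is the natural reading order.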
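The encoding notes above can be checked with a short Python sketch. The helper `byte_lengths` and the sample characters are assumptions chosen for illustration; the codec names are Python's standard ones.

```python
import codecs

def byte_lengths(ch):
    """Bytes needed for one character in each BOM-less encoding."""
    encodings = ("utf-8", "utf-16-le", "utf-32-le")
    return {enc: len(ch.encode(enc)) for enc in encodings}

lengths_ascii = byte_lengths("A")             # ASCII letter
lengths_hanzi = byte_lengths("张")            # Chinese character in the BMP
lengths_astral = byte_lengths("\U00020000")   # CJK character outside the BMP

# Byte order for the 16-bit value 0x0089.
le = "\u0089".encode("utf-16-le")
be = "\u0089".encode("utf-16-be")

# "utf-8-sig" writes the UTF-8 BOM; plain "utf-8" does not.
with_bom = "hi".encode("utf-8-sig")
without_bom = "hi".encode("utf-8")

# Concatenating two BOM-prefixed files leaves a stray U+FEFF mid-text,
# because a reader strips only the leading BOM.
joined = (with_bom + with_bom).decode("utf-8-sig")

print(lengths_ascii, lengths_hanzi, lengths_astral)
print(le.hex(), be.hex(), repr(joined))
```

The stray U+FEFF in `joined` is the file-concatenation problem mentioned in the BOM notes.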
Unicode in Windows
1. In the encoding choices when saving a text file
  1. "Unicode" means UTF-16 LE
  2. "Unicode Big Endian" means UTF-16 BE
2. Windows' internal encoding is Unicode
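A reader that wants to tell these file flavors apart can look at the first bytes. The following is a rough Python sketch, not a description of what any Windows program actually does; `sniff_bom` is a hypothetical helper name.

```python
import codecs

# Longest BOMs first: the UTF-32 LE BOM begins with the UTF-16 LE BOM.
# (A UTF-16 LE file whose first character is U+0000 would be misread
# as UTF-32 LE; this sketch accepts that limitation.)
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

def sniff_bom(data):
    """Return (encoding, bom_length), or (None, 0) if no BOM is present."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0

# A file saved as "Unicode" would start with FF FE, then UTF-16 LE data.
sample = codecs.BOM_UTF16_LE + "张".encode("utf-16-le")
encoding, skip = sniff_bom(sample)
decoded = sample[skip:].decode(encoding)

print(encoding, decoded)
```

When `sniff_bom` returns `(None, 0)`, the reader is back to guessing the encoding.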
Unicode in .NET
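1. In .NET, strings are Unicode-encoded. "Unicode" here means UTF-16 LE.
2. File read/write workflow
3. Use the GetBytes() method of the corresponding Encoding to get a string's byte[]
4. Going from string to byte[] requires specifying an Encoding
5. Likewise, going from byte[] back to string requires knowing the corresponding Encoding
6. For the Unicode encodings (UTF8/UTF16/UTF32), GetBytes() never returns a BOM
7. String.Length returns the number of characters (strictly, UTF-16 code units), not the length of the byte[]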
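The same string-to-bytes discipline can be illustrated in Python. This is an analogy rather than .NET code: `str.encode` and `bytes.decode` stand in for `GetBytes()` and `GetString()`, and the sample text is arbitrary.

```python
text = "张博华123"

# string -> bytes: the encoding must be named explicitly.
utf8_bytes = text.encode("utf-8")
utf16_bytes = text.encode("utf-16-le")  # no BOM is added, as with GetBytes()

# bytes -> string: the same encoding is needed to get the text back.
round_trip = utf16_bytes.decode("utf-16-le")

# Character count and byte count are different things.
char_count = len(text)
utf8_byte_count = len(utf8_bytes)
utf16_byte_count = len(utf16_bytes)

# A character above U+FFFF is one code point but two UTF-16 code units,
# which is why a UTF-16 string length can exceed the character count.
emoji_units = len("\U0001F600".encode("utf-16-le")) // 2

print(char_count, utf8_byte_count, utf16_byte_count, emoji_units)
```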
Unicode in SQL Server
1. varchar and nvarchar
  1. varchar: Variable-length, non-Unicode character data. The database collation determines which code page the data is stored using.
  2. nvarchar: Variable-length Unicode character data. Dependent on the database collation for comparisons.
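  3. Storage space required
    1. varchar(n): up to n + 2 bytes (actual data length plus 2 bytes)
    2. nvarchar(n): up to 2n + 2 bytes (two bytes per byte-pair stored, plus 2 bytes)
  4. Number of characters that can be stored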
    1. varchar(n)
    2. nvarchar(n)
2. varchar(max) and nvarchar(max)
  1. Your understanding is wrong. nvarchar(max) can store up to (and beyond sometimes) 2GB of data (1 billion double byte characters).
  2. nvarchar(max) is a replacement for ntext which is deprecated
  3. SO: http://stackoverflow.com/a/12639972
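3. How is a variable's collation determined in T-SQL?
  1. What a collation does: a) determines the code page; b) determines the sort order; c) determines the comparison rules
  2. Once a variable is declared, can its code page be changed later during execution?
  3. See this article: https://the.agilesql.club/Blogs/Ed-Elliott/What-collation-variables-take-on-inT-SQL
  4. When code involves both columns and variables, how is the collation decided? See these two MSDN articles
4. Collation handling examples in T-SQL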
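  1. String concatenation
    1. varchar + nvarchar
      1. A type conversion happens first: the varchar data becomes nvarchar
      2. Then the two nvarchar values are concatenated
    2. nvarchar + nvarchar
      1. The result is naturally nvarchar
    3. Adding two varchar values with different collations
      1. Column A + column B: error
      2. Variable A + column B:
      3. Column B's collation is used
      4. If variable A is cast with an explicitly specified collation, the new collation is used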
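  2. Testing with substring as an example
    1. How varchar values with different collations are handled
      1. SQL Server handles the collation automatically and cuts by character
    2. How nvarchar is handled
      1. SQL Server handles the collation automatically and cuts by character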
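  3. Data storage and conversion
    1. Storing a string with collation A into a column with collation B
      1. SQL Server converts it; data may be lost
      2. because a character in code page A may have no counterpart in code page B
    2. Converting a varchar with collation A to a varchar with collation B
      1. Same as above
    3. Storing an nvarchar string into a varchar column with collation B
      1. SQL Server converts it; data may be lost
      2. because a Unicode code point may have no counterpart in code page B
    4. Converting nvarchar to varchar
      1. Same as above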
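  4. What does a collation after the equals sign do?
    1. The equality comparison is done using that collation
5. Cases involving both client and server
  1. e.g. querying SQL Server from SSMS
  2. e.g. using a Linked Server to query another SQL Server
  3. See this thread on SO
  4. Which driver does ADO.NET use? Can AutoTranslate be turned off?
6. What does it mean that SQL Server's internal encoding is Unicode?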
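The data-loss cases in the conversion notes can be imitated outside the database. The Python sketch below is only an analogy: it assumes code-page conversion behaves like Python's `errors="replace"` (unmappable characters become `?`), and `to_code_page` is a hypothetical helper, not a SQL Server function.

```python
def to_code_page(text, code_page):
    """Encode for one code page; characters it lacks are replaced with b'?'."""
    return text.encode(code_page, errors="replace")

name = "张博华"

# "nvarchar to varchar": Unicode text into a code page with no Chinese.
lost_in_cp1252 = to_code_page(name, "cp1252")

# "nvarchar to varchar" with a suitable code page: nothing is lost.
kept_in_gbk = to_code_page(name, "gbk").decode("gbk")

# "Collation A to collation B": GBK bytes -> Unicode -> Big5.
# Simplified-only characters have no counterpart in Big5.
gbk_bytes = name.encode("gbk")
via_big5 = to_code_page(gbk_bytes.decode("gbk"), "big5").decode("big5")

# ASCII-only text: one byte per character in a code page,
# two bytes per character in UTF-16.
ascii_sizes = (len("abc".encode("gbk")), len("abc".encode("utf-16-le")))

print(lost_in_cp1252, kept_in_gbk, via_big5, ascii_sizes)
```

In every lossy case the conversion succeeds without an error, which is what makes this kind of data loss easy to miss.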
Unicode in JavaScript
1. JavaScript doesn't even have a concept of string encodings! Strings are all stored in UTF-16 or UCS-2 format, and so there's no way to manually force a string to be interpreted as a certain encoding.
2. JavaScript engines are free to use UCS-2 or UTF-16 internally. Most engines that I know of use UTF-16, but whatever choice they made, it’s just an implementation detail that won’t affect the language’s characteristics. The ECMAScript/JavaScript language itself, however, exposes characters according to UCS-2, not UTF-16. https://mathiasbynens.be/notes/javascript-encoding
3. Converting in JS?
  1. In Chrome and Firefox, the new TextDecoder API can be used
  2. Otherwise you have to rely on some JS library
4. http://www.ruanyifeng.com/blog/2014/12/unicode.html
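The practical difference between UCS-2 and UTF-16 is the surrogate pair. The arithmetic can be sketched in Python (used here instead of JavaScript to keep one language throughout); both function names are made up for the example.

```python
def to_surrogate_pair(code_point):
    """Split a code point above U+FFFF into UTF-16 high and low surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF need surrogates")
    offset = code_point - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)

def utf16_length(s):
    """Length in UTF-16 code units, i.e. what a UCS-2 view of the string counts."""
    return sum(2 if ord(ch) > 0xFFFF else 1 for ch in s)

high, low = to_surrogate_pair(0x1F4A9)

# Cross-check against Python's own UTF-16 encoder.
encoded = "\U0001F4A9".encode("utf-16-be")

print(hex(high), hex(low), encoded.hex(), utf16_length("a\U0001F4A9"))
```

A language that exposes strings as UCS-2 reports such a character as two units, each of them a lone surrogate.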
Using Notepad++
1. "ANSI as UTF-8" means UTF-8 without BOM
2. Does the selection count shown in the status bar count characters? NO
  1. UTF-8 file: number of bytes
  2. Big-5 file: ?
  3. gb2312 file: number of bytes
  4. UTF-16 file: ?
Code