chcp
The current Code Page will be displayed:
locale
Check the value of LANG or LC_CTYPE, for example:
LANG=zh_TW.UTF-8
---
#include <clocale>
#include <iostream>
int main() {
    std::cout << "Current locale: " << std::setlocale(LC_ALL, nullptr) << std::endl;
}
On Windows, this usually prints something like C or Chinese (Traditional)_Taiwan.950.
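For comparison, the same kind of query can be made from Python. This is a minimal sketch using only the standard `locale` and `sys` modules; the helper name `describe_encodings` is hypothetical, and the values it reports depend entirely on the machine it runs on.

```python
import locale
import sys


def describe_encodings():
    """Collect the locale and encoding values the current process sees."""
    return {
        # Passing None queries the current locale without changing it,
        # much like std::setlocale(LC_ALL, nullptr) in C++.
        "locale": locale.setlocale(locale.LC_ALL, None),
        # Encoding Python prefers for text under the current locale settings.
        "preferred_encoding": locale.getpreferredencoding(False),
        # Encoding used when writing to stdout (None if stdout has none).
        "stdout_encoding": getattr(sys.stdout, "encoding", None),
    }


if __name__ == "__main__":
    for name, value in describe_encodings().items():
        print(f"{name}: {value}")
```

Running it before and after `chcp 65001` is one way to see which of these values the code page switch actually affects.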
chcp 65001
→ Switches the command line to UTF-8.
$OutputEncoding = [Console]::OutputEncoding = [Text.Encoding]::UTF8
#include <clocale>
int main() {
    std::setlocale(LC_ALL, "zh_TW.UTF-8"); // Set to UTF-8
}
std::setlocale(LC_ALL, "Chinese_Taiwan.950");
---
SetConsoleOutputCP(65001); // Set the output to UTF-8
SetConsoleCP(65001); // Set the input to UTF-8
Using chcp 65001 only temporarily changes the character encoding of the current Command Prompt (cmd) window. Once the window is closed or reopened, the default value is restored (e.g. 950, Big5). If you want the entire system and all applications to use UTF-8, you need to change the "Regional Settings" at the Windows system level.
After restarting, the default encoding used by the Windows console, C++, .NET, Python, and other programs will be UTF-8.
---
chcp
If the output is:
Active code page: 65001
This means UTF-8 has become the default.
#include <clocale>
#include <iostream>
int main() {
    std::cout << "Current locale: " << std::setlocale(LC_ALL, nullptr) << std::endl;
}
---
If you do not want the entire system to be converted to UTF-8, you can set startup parameters or in-program settings for certain applications:
cmd /K chcp 65001
Or call within the program:
SetConsoleOutputCP(65001);
SetConsoleCP(65001);
A Unicode escape sequence represents a Unicode character using only ASCII characters. It is commonly used in programming language source code, JSON, string constants, and cross-platform data exchange, typically when the environment cannot directly enter or display a specific character.
The most common format is \uXXXX, where XXXX is a 4-digit hexadecimal number representing a Unicode code point.
\u0041 → A
\u00E9 → é
\u4E2D → 中
Some languages (such as Python) also support \UXXXXXXXX, which uses 8 hexadecimal digits and can represent any Unicode code point directly.
\U0001F600 → 😀
In environments whose escapes only cover 16-bit values (such as the legacy JavaScript specification), characters above U+FFFF must be written as a surrogate pair.
\uD83D\uDE00 → 😀
JavaScript
const s = "\u4E2D\u6587";
Python
s = "\u4E2D\u6587"
s2 = "\U0001F600"
JSON
{
  "text": "\u4E2D\u6587"
}
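The surrogate pair rule can be checked with a little arithmetic. The following is a minimal sketch, not a library API: the function names `to_surrogate_pair` and `to_escape` are hypothetical, and the code assumes the standard UTF-16 split (subtract 0x10000, then divide the remaining 20 bits into two 10-bit halves).

```python
def to_surrogate_pair(code_point):
    """Split a code point above U+FFFF into a UTF-16 surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only U+10000..U+10FFFF need a surrogate pair")
    offset = code_point - 0x10000        # 20-bit value
    high = 0xD800 + (offset >> 10)       # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits
    return high, low


def to_escape(text):
    """Render text as \\uXXXX escapes, using surrogate pairs where needed."""
    parts = []
    for ch in text:
        cp = ord(ch)
        if cp > 0xFFFF:
            parts.extend(f"\\u{unit:04X}" for unit in to_surrogate_pair(cp))
        else:
            parts.append(f"\\u{cp:04X}")
    return "".join(parts)


if __name__ == "__main__":
    print(to_escape("\u4E2D\u6587"))   # escapes for the two BMP characters
    print(to_escape("\U0001F600"))     # escapes for the surrogate pair
```

For U+1F600 the split yields D83D and DE00, which matches the pair shown in the example above.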
URL encoding (also known as percent-encoding) converts characters into a representation that is safe to use in URLs. URLs allow only certain ASCII characters; everything else must be converted to a percent sign followed by hexadecimal digits.
The encoding format is %HH, where HH is the hexadecimal representation of one byte of the character.
A character that occupies multiple bytes in UTF-8 is encoded one byte at a time.
space → %20
! → %21
中 → %E4%B8%AD
Some characters have special meaning in URLs and are called reserved characters. Whether they need to be encoded depends on where they are used.
: / ? & = #
Besides letters and digits, the following characters can be used directly in URLs without encoding.
- _ . ~
JavaScript
encodeURIComponent("中文 test")
decodeURIComponent("%E4%B8%AD%E6%96%87%20test")
Python
from urllib.parse import quote, unquote
quote("中文 test")
unquote("%E4%B8%AD%E6%96%87%20test")
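To make the byte-by-byte rule concrete, here is a hand-written sketch of percent-encoding. The helper `percent_encode` and the `UNRESERVED` set are hypothetical names introduced for illustration; in real code `urllib.parse.quote` is the function to use, and the sketch only compares itself against it.

```python
from urllib.parse import quote

# Characters that may appear unencoded: letters, digits, and - _ . ~
UNRESERVED = set(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789-_.~"
)


def percent_encode(text):
    """Percent-encode every UTF-8 byte that is not an unreserved character."""
    out = []
    for byte in text.encode("utf-8"):
        ch = chr(byte)
        if ch in UNRESERVED:
            out.append(ch)                # safe ASCII character, keep as-is
        else:
            out.append(f"%{byte:02X}")    # one %HH group per byte
    return "".join(out)


if __name__ == "__main__":
    sample = "\u4E2D\u6587 test"
    print(percent_encode(sample))
    # safe="" makes quote() encode "/" as well, like this sketch does.
    print(percent_encode(sample) == quote(sample, safe=""))
```

Each of the two Chinese characters in the sample takes three bytes in UTF-8, so each one becomes three %HH groups.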
In the application/x-www-form-urlencoded format, space characters are encoded as + rather than %20.
General URL paths still use %20.
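The difference between the two space encodings can be seen side by side in Python's standard `urllib.parse` module. This is a small sketch; the variable names and the query key `q` are illustrative only.

```python
from urllib.parse import quote, quote_plus, unquote, unquote_plus, urlencode

text = "\u4E2D\u6587 test"

path_form = quote(text)           # path style: space becomes %20
form_form = quote_plus(text)      # form style: space becomes +
query = urlencode({"q": text})    # urlencode() uses the form style by default

print(path_form)
print(form_form)
print(query)

# Decoding must use the matching function: unquote() leaves "+" untouched,
# while unquote_plus() turns it back into a space.
print(unquote(form_form) == text)
print(unquote_plus(form_form) == text)
```

Mixing the two conventions is a common source of stray + characters in decoded text, so it helps to decode with the function that matches how the value was encoded.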
Hexadecimal escapes represent characters using hexadecimal numbers. They are often used in string constants in programming languages to represent specific bytes or ASCII characters.
The most common format is \xHH, where HH is a 2-digit hexadecimal number representing one byte value, usually an ASCII or other single-byte character.
\x41 → A
\x61 → a
\x0A → line feed
Hexadecimal escapes mostly operate on single bytes. To write a multi-byte UTF-8 character, split it into multiple \xHH escapes.
\xE4\xB8\xAD → 中 (as UTF-8 bytes)
C / C++
char c = '\x41';
JavaScript
const s = "\x48\x65\x6C\x6C\x6F";
Python
s = "\x48\x65\x6C\x6C\x6F"
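One point that is easy to miss: in Python, \xHH means a byte only inside a bytes literal. Inside a str literal it means the code point U+00HH. The sketch below contrasts the two; the variable names are illustrative.

```python
# In a bytes literal, each \xHH escape is exactly one byte.
utf8_bytes = b"\xE4\xB8\xAD"
decoded = utf8_bytes.decode("utf-8")   # three bytes form one character

# In a str literal, each \xHH escape is a code point in U+0000..U+00FF.
as_text = "\xE4\xB8\xAD"               # three separate characters

print(len(utf8_bytes), len(decoded))   # byte count vs. character count
print(len(as_text))
print(decoded == "\u4E2D")
# Encoding the str form as Latin-1 recovers the original byte values.
print(as_text.encode("latin-1") == utf8_bytes)
```

So the multi-byte form shown above only produces the intended character when the escapes are interpreted as bytes and then decoded as UTF-8.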
| | 0x0 | 0x1 | 0x2 | 0x3 | 0x4 | 0x5 | 0x6 | 0x7 | 0x8 | 0x9 | 0xA | 0xB | 0xC | 0xD | 0xE | 0xF |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0x00 | NUL | SOH | STX | ETX | EOT | ENQ | ACK | BEL | BS | HT | LF | VT | FF | CR | SO | SI |
| 0x10 | DLE | DC1 | DC2 | DC3 | DC4 | NAK | SYN | ETB | CAN | EM | SUB | ESC | FS | GS | RS | US |
| 0x20 | ␣ | ! | " | # | $ | % | & | ' | ( | ) | * | + | , | - | . | / |
| 0x30 | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | : | ; | < | = | > | ? |
| 0x40 | @ | A | B | C | D | E | F | G | H | I | J | K | L | M | N | O |
| 0x50 | P | Q | R | S | T | U | V | W | X | Y | Z | [ | \ | ] | ^ | _ |
| 0x60 | ` | a | b | c | d | e | f | g | h | i | j | k | l | m | n | o |
| 0x70 | p | q | r | s | t | u | v | w | x | y | z | { | \| | } | ~ | DEL |
| 0x80 | Ç | ü | é | â | ä | à | å | ç | ê | ë | è | ï | î | ì | Ä | Å |
| 0x90 | É | æ | Æ | ô | ö | ò | û | ù | ÿ | Ö | Ü | ¢ | £ | ¥ | ₧ | ƒ |
| 0xA0 | á | í | ó | ú | ñ | Ñ | ª | º | ¿ | ⌐ | ¬ | ½ | ¼ | ¡ | « | » |
| 0xB0 | ░ | ▒ | ▓ | │ | ┤ | ╡ | ╢ | ╖ | ╕ | ╣ | ║ | ╗ | ╝ | ╜ | ╛ | ┐ |
| 0xC0 | └ | ┴ | ┬ | ├ | ─ | ┼ | ╞ | ╟ | ╚ | ╔ | ╩ | ╦ | ╠ | ═ | ╬ | ╧ |
| 0xD0 | ╨ | ╤ | ╥ | ╙ | ╘ | ╒ | ╓ | ╫ | ╪ | ┘ | ┌ | █ | ▄ | ▌ | ▐ | ▀ |
| 0xE0 | α | ß | Γ | π | Σ | σ | µ | τ | Φ | Θ | Ω | δ | ∞ | φ | ε | ∩ |
| 0xF0 | ≡ | ± | ≥ | ≤ | ⌠ | ⌡ | ÷ | ≈ | ° | ∙ | · | √ | ⁿ | ² | ■ | NBSP |
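The upper half of this table (0x80 to 0xFF) matches code page 437. That identification is an observation, not something the table itself states. Under that assumption, individual cells can be looked up with Python's built-in `cp437` codec; the helper name `cp437_char` is hypothetical.

```python
def cp437_char(byte_value):
    """Return the character that code page 437 assigns to one byte value."""
    if not 0 <= byte_value <= 0xFF:
        raise ValueError("byte value must be in the range 0x00..0xFF")
    return bytes([byte_value]).decode("cp437")


if __name__ == "__main__":
    # Row 0x40, column 0x1 of the table.
    print(cp437_char(0x41))
    # Print code points rather than glyphs so the output stays ASCII-only.
    for value in (0x80, 0xE0, 0xFF):
        print(f"0x{value:02X} -> U+{ord(cp437_char(value)):04X}")
```

The row label plus the column label gives the byte value, so the cell at row 0x80, column 0x0 corresponds to `cp437_char(0x80)`.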
Chinese characters (Hanzi) in Unicode are distributed mainly across the following blocks. The table lists the range of each block along with a short description.
| Block name | Unicode range | Description |
|---|---|---|
| CJK Unified Ideographs | 4E00–9FFF | The basic block of Chinese, Japanese, and Korean ideographs; the most commonly used Chinese character range. |
| CJK Unified Ideographs Extension A | 3400–4DBF | Extension A, containing less commonly used Chinese characters. |
| CJK Unified Ideographs Extension B | 20000–2A6DF | Extension B, mainly ancient characters and some rare Chinese characters. |
| CJK Unified Ideographs Extension C | 2A700–2B73F | Extension C, adding further ancient and rare characters. |
| CJK Unified Ideographs Extension D | 2B740–2B81F | Extension D, containing rarely used Chinese characters. |
| CJK Unified Ideographs Extension E | 2B820–2CEAF | Extension E, mainly adding more rare Chinese characters. |
| CJK Unified Ideographs Extension F | 2CEB0–2EBEF | Extension F, including even rarer ancient characters. |
| CJK Unified Ideographs Extension G | 30000–3134F | Extension G, added in Unicode 13.0; later Unicode versions define further extension blocks. |
| CJK Compatibility Ideographs | F900–FAFF | Compatibility block kept for interoperability with older character sets, for example variant glyph forms used in Japanese. |
The ranges listed above cover most Chinese characters. They are spread across several blocks that serve different needs, including modern characters, ancient characters, and compatibility characters. For Chinese font design or character analysis, these ranges indicate which code points need to be supported.
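For character analysis, the table can be turned into a simple lookup. This is a minimal sketch that copies only the ranges listed above; `CJK_BLOCKS` and `cjk_block` are hypothetical names, and blocks not in the table are reported as `None`.

```python
# (block name, first code point, last code point), taken from the table above.
CJK_BLOCKS = [
    ("CJK Unified Ideographs", 0x4E00, 0x9FFF),
    ("CJK Unified Ideographs Extension A", 0x3400, 0x4DBF),
    ("CJK Unified Ideographs Extension B", 0x20000, 0x2A6DF),
    ("CJK Unified Ideographs Extension C", 0x2A700, 0x2B73F),
    ("CJK Unified Ideographs Extension D", 0x2B740, 0x2B81F),
    ("CJK Unified Ideographs Extension E", 0x2B820, 0x2CEAF),
    ("CJK Unified Ideographs Extension F", 0x2CEB0, 0x2EBEF),
    ("CJK Unified Ideographs Extension G", 0x30000, 0x3134F),
    ("CJK Compatibility Ideographs", 0xF900, 0xFAFF),
]


def cjk_block(ch):
    """Return the name of the listed block containing ch, or None."""
    cp = ord(ch)
    for name, start, end in CJK_BLOCKS:
        if start <= cp <= end:
            return name
    return None


def count_hanzi(text):
    """Count the characters in text that fall inside any listed block."""
    return sum(1 for ch in text if cjk_block(ch) is not None)
```

A linear scan is enough for nine ranges; a sorted list with binary search would only matter if many more blocks were added.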