Unicode Escape and Unescape
Applications can Unicode-escape and unescape text in a variety of styles and formats using the Chilkat API. See the examples and explanations below.
Unicode Escape Formats
Chilkat supports the following types of Unicode escaping.
JSON-style (JavaScript-style) Unicode escape sequences
\uXXXX
- Represents a single 16-bit code unit in hexadecimal.
- Originates from JavaScript string literal syntax and is also used in JSON.
- Example: \u2705 = ✅ (White Heavy Check Mark).
- Surrogate pairs
  - Code points above U+FFFF (such as most emoji) are encoded as two \uXXXX sequences: a high surrogate (\uD800–\uDBFF) followed by a low surrogate (\uDC00–\uDFFF).
  - Example: \ud83e\udde0 → 🧠 (Brain).
    - \uD83E = high surrogate
    - \uDDE0 = low surrogate
Where you see it
- JSON data containing Unicode characters.
- JavaScript source strings.
- Many web APIs that serialize Unicode characters as escaped code units.
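The surrogate-pair arithmetic described above can be sketched in a few lines of Python. This is only an illustration using the standard library, not the Chilkat API; the function names escape_json_style and unescape_json_style are hypothetical, and the decoder assumes its input contains no unescaped double quotes or control characters.

```python
import json


def escape_json_style(text: str) -> str:
    """Escape non-ASCII characters as 4-digit hex code units (hypothetical helper)."""
    out = []
    for ch in text:
        cp = ord(ch)
        if cp < 0x80:
            out.append(ch)  # plain ASCII passes through unchanged
        elif cp <= 0xFFFF:
            out.append(f"\\u{cp:04x}")  # one 16-bit code unit
        else:
            # Above U+FFFF: split into a UTF-16 high/low surrogate pair.
            v = cp - 0x10000
            high = 0xD800 + (v >> 10)
            low = 0xDC00 + (v & 0x3FF)
            out.append(f"\\u{high:04x}\\u{low:04x}")
    return "".join(out)


def unescape_json_style(escaped: str) -> str:
    """Decode by wrapping the text in quotes and letting the JSON parser pair surrogates."""
    return json.loads('"' + escaped + '"')


print(escape_json_style("ok ✅ 🧠"))
print(unescape_json_style("\\ud83e\\udde0"))
```

Delegating the decode step to json.loads avoids re-implementing surrogate pairing by hand, at the cost of requiring input that is valid inside a JSON string.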
ECMAScript (JavaScript) “code point escape” syntax
Also called Unicode code point escapes.
- Written as \u{<hex>} instead of \uXXXX.
- Introduced in ES6 (ECMAScript 2015).
- Can directly represent:
  - Any Unicode code point from U+0000 to U+10FFFF.
  - Characters above U+FFFF without needing surrogate pairs.
- Example:
  - \u{1F9E0} → 🧠 (Brain)
  - \u{2705} → ✅ (White Heavy Check Mark)
Difference from the older \uXXXX form
- \uXXXX → always exactly 4 hex digits; represents a 16-bit UTF-16 code unit (may require two sequences for characters above U+FFFF).
- \u{...} → variable-length hex inside braces; represents a full Unicode code point directly.
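As a rough illustration of the braces form, the following Python sketch writes each non-ASCII character as one code point escape and reads such escapes back. It uses only the standard library and is not the Chilkat API; escape_code_points and unescape_code_points are hypothetical names, and the decoder assumes every value is at most 10FFFF (chr raises ValueError otherwise).

```python
import re

# Matches a backslash, "u", and 1 to 6 hex digits inside braces.
_CODE_POINT_ESCAPE = re.compile(r"\\u\{([0-9A-Fa-f]{1,6})\}")


def escape_code_points(text: str) -> str:
    """Write each non-ASCII character as a single braces-style escape."""
    return "".join(
        ch if ord(ch) < 0x80 else f"\\u{{{ord(ch):X}}}" for ch in text
    )


def unescape_code_points(escaped: str) -> str:
    """Replace each braces-style escape with the character it names."""
    return _CODE_POINT_ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), escaped)


print(escape_code_points("ok 🧠"))
print(unescape_code_points("\\u{1F9E0}\\u{2705}"))
```

No surrogate arithmetic is needed here because each escape carries the full code point.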
Unicode code point notation or U+ notation
- Written as U+ (or lowercase u+) followed by the hexadecimal code point value.
  - Example: U+1F9E0 → 🧠 (Brain)
- Commonly used in:
  - Unicode charts and specifications.
  - Documentation and character tables.
  - Fonts and internationalization tools.
- It’s not a programming-language escape sequence — it’s a human-readable way to specify a Unicode character’s code point.
- Can list multiple code points for characters that are composed of more than one code point (e.g., U+26A0 U+FE0F = ⚠️ warning sign with emoji presentation).
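To make the notation concrete, here is a small Python sketch that lists the code points of a string in U+ form and converts such a list back to text. It is an illustration with the standard library only, not the Chilkat API; to_u_plus and from_u_plus are hypothetical names, and the parser assumes whitespace-separated tokens that each start with U+ or u+.

```python
def to_u_plus(text: str) -> str:
    """List every code point in the string, padded to at least 4 hex digits."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)


def from_u_plus(notation: str) -> str:
    """Turn whitespace-separated U+ tokens back into characters."""
    # tok[2:] drops the two-character "U+" / "u+" prefix before parsing the hex.
    return "".join(chr(int(tok[2:], 16)) for tok in notation.split())


print(to_u_plus("\u26a0\ufe0f"))      # warning sign plus variation selector
print(from_u_plus("U+1F9E0 U+2705"))
```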
HTML hexadecimal character reference
Also known as an HTML numeric character reference in hexadecimal form.
- Syntax: &#x<hex>;
  - &# → starts a numeric reference
  - x → indicates hexadecimal
  - <hex> → the Unicode code point in hex
  - ; → terminates the reference
- Example:
  - &#x1F9E0; → 🧠 (Brain)
  - &#x2705; → ✅ (White Heavy Check Mark)
- Used in HTML and XML to represent characters that might otherwise cause encoding issues or be hard to type directly.
- Decimal form is also valid: &#129504; = 🧠.
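A minimal Python sketch of this format follows, using the standard library's html.unescape for decoding. It is not the Chilkat API; to_html_hex is a hypothetical helper, and it deliberately leaves ASCII characters (including &, < and >) untouched, so it is not a substitute for full HTML escaping.

```python
import html


def to_html_hex(text: str) -> str:
    """Replace each non-ASCII character with a hexadecimal character reference."""
    return "".join(ch if ord(ch) < 0x80 else f"&#x{ord(ch):X};" for ch in text)


encoded = to_html_hex("Done ✅ 🧠")
print(encoded)
print(html.unescape(encoded))  # html.unescape resolves numeric references
```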
HTML decimal character reference
Also known as an HTML numeric character reference in decimal form.
- Syntax: &#<decimal>;
  - &# → starts a numeric reference
  - <decimal> → the Unicode code point in decimal
  - ; → terminates the reference
- Example:
  - &#129504; → 🧠 (Brain)
  - &#9989; → ✅ (White Heavy Check Mark)
- Decimal form is just the base-10 equivalent of the hexadecimal form (&#x...;).
- Works in HTML and XML for representing characters without directly embedding them.
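The base-10 equivalence can be checked with a short Python sketch that rewrites hexadecimal references as decimal ones. This uses only the standard library, not the Chilkat API, and hex_refs_to_decimal is a hypothetical name chosen for this example.

```python
import html
import re

# Matches a hexadecimal numeric character reference and captures its digits.
_HEX_REF = re.compile(r"&#[xX]([0-9A-Fa-f]+);")


def hex_refs_to_decimal(markup: str) -> str:
    """Rewrite each hexadecimal reference as the equivalent decimal reference."""
    return _HEX_REF.sub(lambda m: f"&#{int(m.group(1), 16)};", markup)


decimal_form = hex_refs_to_decimal("&#x1F9E0; &#x2705;")
print(decimal_form)
# Both spellings decode to the same characters.
print(html.unescape(decimal_form) == html.unescape("&#x1F9E0; &#x2705;"))
```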
Hex in Angle Brackets
This style isn’t a standard programming or HTML escape — it’s essentially a custom or ad-hoc code point notation where each Unicode code point is written in hexadecimal inside angle brackets.
- <1f9e0> is clearly intended to mean U+1F9E0 🧠 (Brain).
- The values are plain hexadecimal Unicode code points.
- Angle brackets aren’t part of any official Unicode escape syntax — they’re just a delimiter someone chose (often used in internal documentation, markup, or proprietary text formats).
- It’s not HTML (<tag> syntax): there’s no meaning assigned to these “tags” by browsers unless custom parsing code handles them.
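Because this style has no standard, any parser for it is custom code. The Python sketch below shows one possible approach with a regular expression; it is an assumption-laden illustration, not the Chilkat API. The names encode_angle_hex and decode_angle_hex are hypothetical, and the decoder treats any bracketed run of 1 to 6 hex digits as a code point, so text containing real markup such as <a> or <b> would be misread.

```python
import re

# 1 to 6 hex digits between angle brackets, e.g. <1f9e0>.
_ANGLE_HEX = re.compile(r"<([0-9A-Fa-f]{1,6})>")


def encode_angle_hex(text: str) -> str:
    """Write each non-ASCII character as lowercase hex inside angle brackets."""
    return "".join(ch if ord(ch) < 0x80 else f"<{ord(ch):x}>" for ch in text)


def decode_angle_hex(text: str) -> str:
    """Replace bracketed hex code points with the characters they name."""
    def repl(match: re.Match) -> str:
        cp = int(match.group(1), 16)
        # Leave the token untouched if it is beyond the Unicode range.
        return chr(cp) if cp <= 0x10FFFF else match.group(0)
    return _ANGLE_HEX.sub(repl, text)


print(encode_angle_hex("idea 🧠"))
print(decode_angle_hex("<1f9e0> <2705>"))
```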