Unicode Escape and Unescape

Applications can Unicode-escape and unescape text in a variety of styles and formats using the Chilkat API. See the examples and explanations below.

Unicode Escape Formats

Chilkat supports the following types of Unicode escaping.

JSON-style (JavaScript-style) Unicode escape sequences

\uXXXX
- Represents a single 16-bit UTF-16 code unit, written as exactly four hexadecimal digits.
- Originates from JavaScript string literal syntax and is also used in JSON.
- Example: \u2705 = ✅ (White Heavy Check Mark).
Surrogate pairs
- Code points above U+FFFF (such as most emoji) are encoded as two \uXXXX sequences — a high surrogate (\uD800–\uDBFF) followed by a low surrogate (\uDC00–\uDFFF).
- Example: \ud83e\udde0 → 🧠 (Brain).
  - \uD83E = high surrogate
  - \uDDE0 = low surrogate
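
The surrogate-pair arithmetic can be sketched in a few lines of Python. This is only an illustration using the standard library, not the Chilkat API; the function names to_json_escapes and from_json_escapes are hypothetical. The sketch escapes only non-ASCII characters and leaves quotes and backslashes untouched.

```python
import json


def to_json_escapes(text: str) -> str:
    # Hypothetical helper: escape each non-ASCII character as a
    # backslash-u sequence, using a surrogate pair above U+FFFF.
    out = []
    for ch in text:
        cp = ord(ch)
        if cp < 0x80:
            out.append(ch)                      # plain ASCII passes through
        elif cp <= 0xFFFF:
            out.append(f"\\u{cp:04x}")          # one 16-bit code unit
        else:
            v = cp - 0x10000                    # 20-bit offset into the supplementary planes
            high = 0xD800 + (v >> 10)           # top 10 bits -> high surrogate
            low = 0xDC00 + (v & 0x3FF)          # bottom 10 bits -> low surrogate
            out.append(f"\\u{high:04x}\\u{low:04x}")
    return "".join(out)


def from_json_escapes(escaped: str) -> str:
    # Let the JSON parser decode the escapes, including surrogate pairs.
    # Assumes the input contains no unescaped double quotes or control characters.
    return json.loads(f'"{escaped}"')
```

For U+1F9E0 the offset is 0xF9E0, which splits into 0x3E and 0x1E0, giving the pair D83E DDE0 shown above.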

Where you see it

JSON data containing Unicode characters.
JavaScript source strings.
Many web APIs that serialize Unicode characters as escaped code units.

ECMAScript (JavaScript) “code point escape” syntax

Also called Unicode code point escapes.

Written as \u{<hex>} instead of \uXXXX.
Introduced in ES6 (ECMAScript 2015).
Can directly represent:
- Any Unicode code point from U+0000 to U+10FFFF.
- Characters above U+FFFF without needing surrogate pairs.
Example:
- \u{1F9E0} → 🧠 (Brain)
- \u{2705} → ✅ (White Heavy Check Mark)

Difference from the older `\uXXXX` form

\uXXXX → always exactly 4 hex digits; represents a 16-bit UTF-16 code unit (may require two sequences for characters above U+FFFF).
\u{...} → variable-length hex inside braces; represents a full Unicode code point directly.
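
As a rough illustration of the braces form, the following Python sketch converts between text and code point escapes. It uses only the standard library and is not the Chilkat API; the function names are hypothetical, and the choice to leave out-of-range values untouched is an assumption made for this example.

```python
import re

# Backslash, "u", then 1 to 6 hex digits inside braces.
_CODE_POINT_ESCAPE = re.compile(r"\\u\{([0-9A-Fa-f]{1,6})\}")


def to_code_point_escapes(text: str) -> str:
    # Hypothetical helper: write each non-ASCII character as one braces
    # escape holding the full code point, so no surrogate pairs are needed.
    return "".join(
        ch if ord(ch) < 0x80 else f"\\u{{{ord(ch):X}}}" for ch in text
    )


def from_code_point_escapes(text: str) -> str:
    # Replace each braces escape with the character it names.
    def replace(match: re.Match) -> str:
        cp = int(match.group(1), 16)
        if cp > 0x10FFFF:
            return match.group(0)   # outside the Unicode range: leave as-is
        return chr(cp)

    return _CODE_POINT_ESCAPE.sub(replace, text)
```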

Unicode code point notation or U+ notation

Written as U+ (or lowercase u+) followed by the hexadecimal code point value.
- Example: U+1F9E0 → 🧠 (Brain)
Commonly used in:
- Unicode charts and specifications.
- Documentation and character tables.
- Fonts and internationalization tools.
It’s not a programming-language escape sequence — it’s a human-readable way to specify a Unicode character’s code point.
Can list multiple code points for characters that are composed of more than one code point (e.g., U+26A0 U+FE0F = ⚠️ warning sign with emoji presentation).
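
A small Python sketch shows how a string maps to a space-separated list of U+ values and back. It is illustrative only and does not use the Chilkat API; the function names are hypothetical, and padding to at least four hex digits follows the usual convention in Unicode charts.

```python
def to_u_plus(text: str) -> str:
    # Hypothetical helper: one U+XXXX token per code point, space-separated.
    return " ".join(f"U+{ord(ch):04X}" for ch in text)


def from_u_plus(notation: str) -> str:
    # Accepts "U+" or lowercase "u+" prefixes, separated by whitespace.
    chars = []
    for token in notation.split():
        if token[:2].upper() != "U+":
            raise ValueError(f"not a U+ token: {token!r}")
        chars.append(chr(int(token[2:], 16)))
    return "".join(chars)
```

The warning sign with emoji presentation therefore round-trips as two tokens, U+26A0 U+FE0F, even though it displays as a single character.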

HTML hexadecimal character reference

Also known as an HTML numeric character reference in hexadecimal form.

Syntax: &#x<hex>;
- &# → starts a numeric reference
- x → indicates hexadecimal
- <hex> → the Unicode code point in hex
- ; → terminates the reference
Example:
- &#x1F9E0; → 🧠 (Brain)
- &#x2705; → ✅ (White Heavy Check Mark)
Used in HTML and XML to represent characters that might otherwise cause encoding issues or be hard to type directly.
Decimal form is also valid: &#129504; = 🧠.
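
For illustration, hexadecimal character references can be produced with string formatting and decoded with Python's standard html module. This sketch is not the Chilkat API, and the function name to_html_hex_refs is hypothetical.

```python
import html


def to_html_hex_refs(text: str) -> str:
    # Hypothetical helper: write each non-ASCII character as a
    # hexadecimal numeric character reference.
    return "".join(
        ch if ord(ch) < 0x80 else f"&#x{ord(ch):X};" for ch in text
    )


def from_html_refs(text: str) -> str:
    # html.unescape decodes hexadecimal, decimal, and named references.
    return html.unescape(text)
```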

HTML decimal character reference

Also known as an HTML numeric character reference in decimal form.

Syntax: &#<decimal>;
- &# → starts a numeric reference
- <decimal> → the Unicode code point in decimal
- ; → terminates the reference
Example:
- &#129504; → 🧠 (Brain)
- &#9989; → ✅ (White Heavy Check Mark)
Decimal form is just the base-10 equivalent of the hexadecimal form (&#x...;).
Works in HTML and XML for representing characters without directly embedding them.
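
The base-10 relationship is easy to check in Python. The sketch below is illustrative only and is not the Chilkat API; the function name to_html_decimal_refs is hypothetical.

```python
import html


def to_html_decimal_refs(text: str) -> str:
    # Hypothetical helper: write each non-ASCII character as a
    # decimal numeric character reference.
    return "".join(
        ch if ord(ch) < 0x80 else f"&#{ord(ch)};" for ch in text
    )


# The decimal value is the same code point as the hexadecimal one.
BRAIN_DECIMAL = int("1F9E0", 16)   # 129504
CHECK_DECIMAL = int("2705", 16)    # 9989

# The standard library decodes decimal references directly.
decoded = html.unescape("&#129504; &#9989;")
```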

Hex in Angle Brackets

This style isn’t a standard programming or HTML escape — it’s essentially a custom or ad-hoc code point notation where each Unicode code point is written in hexadecimal inside angle brackets.

<1f9e0> is intended to mean U+1F9E0 🧠 (Brain).
The values are plain hexadecimal Unicode code points.
Angle brackets aren’t part of any official Unicode escape syntax — they’re just a delimiter someone chose (often used in internal documentation, markup, or proprietary text formats).
It’s not HTML (<tag> syntax) — there’s no meaning assigned to these “tags” by browsers unless custom parsing code handles them.
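
Because this notation has no standard, any parser has to pick its own rules. The Python sketch below is one possible reading, not the Chilkat API: the function names are hypothetical, and requiring 4 to 6 hex digits is an assumption made here so that short HTML tags such as <a> or <b> are not mistaken for code points.

```python
import re

# Assumption for this sketch: 4 to 6 hex digits between angle brackets.
_ANGLE_HEX = re.compile(r"<([0-9A-Fa-f]{4,6})>")


def to_angle_hex(text: str) -> str:
    # Hypothetical helper: write each non-ASCII character as <hex>.
    return "".join(
        ch if ord(ch) < 0x80 else f"<{ord(ch):04x}>" for ch in text
    )


def from_angle_hex(text: str) -> str:
    # Replace each <hex> token with the character it names.
    def replace(match: re.Match) -> str:
        cp = int(match.group(1), 16)
        if cp > 0x10FFFF:
            return match.group(0)   # outside the Unicode range: leave as-is
        return chr(cp)

    return _ANGLE_HEX.sub(replace, text)
```

Even with the digit-count rule, text that happens to contain something like <dead> would still be decoded, so this style is only safe when the producer and consumer agree on the format.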

Chilkat Articles