Unicode Escape and Unescape

Applications can use the Chilkat API to Unicode-escape and unescape text in a variety of styles and formats. Each supported format is explained below with examples.

Unicode Escape Formats

Chilkat supports the following types of Unicode escaping.


JSON-style (JavaScript-style) Unicode escape sequences

  • \uXXXX
    • Represents a single 16-bit code unit in hexadecimal.
    • Originates from JavaScript string literal syntax and is also used in JSON.
    • Example: \u2705 = ✅ (White Heavy Check Mark).
  • Surrogate pairs
    • Code points above U+FFFF (such as most emoji) are encoded as two \uXXXX sequences: a high surrogate (\uD800 to \uDBFF) followed by a low surrogate (\uDC00 to \uDFFF).
    • Example: \ud83e\udde0 → 🧠 (Brain).
      • \uD83E = high surrogate
      • \uDDE0 = low surrogate
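
The surrogate-pair rule above can be sketched in plain Python. This is only an illustration of the format, not the Chilkat API; the function names escape_json_style and unescape_json_style are hypothetical, and the choice to leave ASCII characters unescaped is an assumption.

```python
import re

# Matches one \uXXXX sequence (exactly four hex digits).
_U_ESCAPE = re.compile(r"\\u([0-9a-fA-F]{4})")


def escape_json_style(text: str) -> str:
    # Escape every non-ASCII character as \uXXXX (hypothetical helper).
    out = []
    for ch in text:
        cp = ord(ch)
        if cp < 0x80:
            out.append(ch)                      # ASCII passes through unchanged
        elif cp <= 0xFFFF:
            out.append(f"\\u{cp:04x}")          # one 16-bit code unit
        else:
            v = cp - 0x10000                    # 20-bit value split across the pair
            high = 0xD800 + (v >> 10)           # top 10 bits
            low = 0xDC00 + (v & 0x3FF)          # bottom 10 bits
            out.append(f"\\u{high:04x}\\u{low:04x}")
    return "".join(out)


def unescape_json_style(text: str) -> str:
    # Turn each \uXXXX into a code unit, then let the UTF-16 codec
    # merge high/low surrogate pairs into single characters.
    # An unpaired surrogate raises UnicodeDecodeError here.
    units = _U_ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)
    return units.encode("utf-16-le", "surrogatepass").decode("utf-16-le")


print(escape_json_style("done ✅ 🧠"))          # done \u2705 \ud83e\udde0
print(unescape_json_style("\\uD83E\\uDDE0"))    # 🧠
```

The sketch does not handle other JSON escapes such as \n or \", and it does not treat a doubled backslash specially.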

Where you see it

  • JSON data containing Unicode characters.
  • JavaScript source strings.
  • Many web APIs that serialize Unicode characters as escaped code units.

ECMAScript (JavaScript) “code point escape” syntax

Also called Unicode code point escapes.

  • Written as \u{<hex>} instead of \uXXXX.
  • Introduced in ES6 (ECMAScript 2015).
  • Can directly represent:
    • Any Unicode code point from U+0000 to U+10FFFF.
    • Characters above U+FFFF without needing surrogate pairs.
  • Example:
    • \u{1F9E0} → 🧠 (Brain)
    • \u{2705} → ✅ (White Heavy Check Mark)
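
As a rough sketch of the code point escape form, the following plain Python (not the Chilkat API) writes and reads \u{...} sequences. The function names are hypothetical, and leaving ASCII unescaped is an assumption.

```python
import re

# Matches \u{...} with one to six hex digits inside the braces.
_BRACED = re.compile(r"\\u\{([0-9a-fA-F]{1,6})\}")


def escape_code_points(text: str) -> str:
    # Each non-ASCII character becomes exactly one \u{...} sequence,
    # so no surrogate pairs are needed.
    return "".join(
        ch if ord(ch) < 0x80 else f"\\u{{{ord(ch):X}}}" for ch in text
    )


def unescape_code_points(text: str) -> str:
    # chr() raises ValueError for values above 0x10FFFF.
    return _BRACED.sub(lambda m: chr(int(m.group(1), 16)), text)


print(escape_code_points("🧠 ✅"))            # \u{1F9E0} \u{2705}
print(unescape_code_points("\\u{1F9E0}"))     # 🧠
```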

Difference from the older \uXXXX form

  • \uXXXX → always exactly 4 hex digits; represents a 16-bit UTF-16 code unit (may require two sequences for characters above U+FFFF).
  • \u{...} → variable-length hex inside braces; represents a full Unicode code point directly.
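
The relationship between the two forms is simple arithmetic. The sketch below (plain Python with hypothetical helper names, not Chilkat code) combines a surrogate pair into the single code point that the braced form writes directly.

```python
def combine_surrogates(high: int, low: int) -> int:
    # Each surrogate carries 10 bits of the value (code point - 0x10000).
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid high/low surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


def to_braced(code_point: int) -> str:
    # Format a code point in the \u{...} style.
    return f"\\u{{{code_point:X}}}"


cp = combine_surrogates(0xD83E, 0xDDE0)
print(hex(cp))         # 0x1f9e0
print(to_braced(cp))   # \u{1F9E0}
```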

Unicode code point notation or U+ notation

  • Written as U+ (or lowercase u+) followed by the hexadecimal code point value.
    • Example: U+1F9E0 → 🧠 (Brain)
  • Commonly used in:
    • Unicode charts and specifications.
    • Documentation and character tables.
    • Fonts and internationalization tools.
  • It’s not a programming-language escape sequence — it’s a human-readable way to specify a Unicode character’s code point.
  • Can list multiple code points for characters that are composed of more than one code point (e.g., U+26A0 U+FE0F = ⚠️ warning sign with emoji presentation).
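
A minimal Python sketch of converting between text and U+ notation follows. It is not the Chilkat API; the function names are hypothetical, and separating code points with spaces and padding to at least four hex digits are assumptions based on the common convention.

```python
def to_u_plus(text: str) -> str:
    # One "U+XXXX" token per code point, separated by spaces.
    return " ".join(f"U+{ord(ch):04X}" for ch in text)


def from_u_plus(notation: str) -> str:
    chars = []
    for token in notation.split():
        # Accept both "U+" and lowercase "u+" prefixes.
        if token[:2].upper() != "U+":
            raise ValueError(f"not U+ notation: {token!r}")
        chars.append(chr(int(token[2:], 16)))
    return "".join(chars)


print(to_u_plus("🧠"))                  # U+1F9E0
print(to_u_plus("\u26a0\ufe0f"))        # U+26A0 U+FE0F
print(from_u_plus("u+2705"))            # ✅
```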

HTML hexadecimal character reference

Also known as an HTML numeric character reference in hexadecimal form.

  • Syntax: &#x<hex>;
    • &# → starts a numeric reference
    • x → indicates hexadecimal
    • <hex> → the Unicode code point in hex
    • ; → terminates the reference
  • Example:
    • &#x1F9E0; → 🧠 (Brain)
    • &#x2705; → ✅ (White Heavy Check Mark)
  • Used in HTML and XML to represent characters that might otherwise cause encoding issues or be hard to type directly.
  • Decimal form is also valid: &#129504; = 🧠.
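
For illustration, Python's standard html module decodes these references, and producing them is a one-line format. This is a sketch in plain Python rather than the Chilkat API; escape_html_hex is a hypothetical name, and leaving ASCII (including &, < and >) untouched is an assumption made to keep the example short.

```python
import html


def escape_html_hex(text: str) -> str:
    # Replace each non-ASCII character with &#x<hex>;
    # Markup-significant ASCII such as & and < is NOT escaped here.
    return "".join(
        ch if ord(ch) < 0x80 else f"&#x{ord(ch):X};" for ch in text
    )


print(escape_html_hex("ok ✅"))          # ok &#x2705;
# html.unescape also decodes decimal and named references.
print(html.unescape("&#x1F9E0;"))        # 🧠
```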

HTML decimal character reference

Also known as an HTML numeric character reference in decimal form.

  • Syntax: &#<decimal>;
    • &# → starts a numeric reference
    • <decimal> → the Unicode code point in decimal
    • ; → terminates the reference
  • Example:
    • &#129504; → 🧠 (Brain)
    • &#9989; → ✅ (White Heavy Check Mark)
  • Decimal form is just the base-10 equivalent of the hexadecimal form (&#x...;).
  • Works in HTML and XML for representing characters without directly embedding them.
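
The base-10 equivalence can be checked with a short Python sketch that rewrites hexadecimal references as decimal ones. The function names are hypothetical and this is not Chilkat code.

```python
import re

# Matches a hexadecimal numeric character reference such as &#x1F9E0;
_HEX_REF = re.compile(r"&#[xX]([0-9a-fA-F]+);")


def hex_refs_to_decimal(markup: str) -> str:
    # Same code point, written in base 10 instead of base 16.
    return _HEX_REF.sub(lambda m: f"&#{int(m.group(1), 16)};", markup)


def escape_html_decimal(text: str) -> str:
    # Replace each non-ASCII character with &#<decimal>;
    return "".join(
        ch if ord(ch) < 0x80 else f"&#{ord(ch)};" for ch in text
    )


print(hex_refs_to_decimal("&#x1F9E0; &#x2705;"))   # &#129504; &#9989;
print(escape_html_decimal("🧠"))                    # &#129504;
```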

Hex in Angle Brackets

This style isn’t a standard programming or HTML escape — it’s essentially a custom or ad-hoc code point notation where each Unicode code point is written in hexadecimal inside angle brackets.

  • <1f9e0> is clearly intended to mean U+1F9E0 🧠 (Brain).
  • The values are plain hexadecimal Unicode code points.
  • Angle brackets aren’t part of any official Unicode escape syntax — they’re just a delimiter someone chose (often used in internal documentation, markup, or proprietary text formats).
  • It’s not HTML (<tag> syntax) — there’s no meaning assigned to these “tags” by browsers unless custom parsing code handles them.
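
Because this notation is ad hoc, any parser for it has to pick its own rules. The Python sketch below shows one possible interpretation, not the Chilkat behavior: the function names are hypothetical, and the assumptions are that one to six hex digits are allowed between the brackets and that values outside the valid Unicode range are left as-is.

```python
import re

# One to six hex digits between angle brackets (an assumed rule).
_ANGLE = re.compile(r"<([0-9a-fA-F]{1,6})>")


def _decode_match(m: re.Match) -> str:
    cp = int(m.group(1), 16)
    # Leave the text untouched if the value is out of range or a surrogate.
    if cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        return m.group(0)
    return chr(cp)


def unescape_angle(text: str) -> str:
    return _ANGLE.sub(_decode_match, text)


def escape_angle(text: str) -> str:
    # Non-ASCII characters become <hex>; ASCII passes through.
    return "".join(ch if ord(ch) < 0x80 else f"<{ord(ch):x}>" for ch in text)


print(unescape_angle("think <1f9e0>"))   # think 🧠
print(escape_angle("✅"))                 # <2705>
```

One caveat with this interpretation: short HTML tags made only of hex letters, such as <a> or <b>, also match the pattern, so this kind of decoding is only safe on input known to use the notation.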
