# Appendix C. Unicode UTF-8 encoding

UTF-8 is an efficient encoding of [Unicode](https://docs.snomed.org/snomed-ct-specifications/snomed-ct-release-file-specification/appendices/appendix-b.-specification-reference-information/u/unicode) character strings that recognizes the fact that the majority of text-based communications are in ASCII. It therefore optimizes the encoding of these characters.

Unicode is preferred to ASCII because it permits the inclusion of accents, scientific symbols and characters used in languages other than English. The UTF-8 format is a standard encoding that provides the most efficient means of encoding 16-bit Unicode characters in cases where the majority of characters are in the ASCII range. Both UTF-8 and the alternative UTF-16 encoding are supported by all widely used operating systems and major applications. UTF-8 is an [IETF Internet Standard](https://tools.ietf.org/html/rfc3629) (it was initially adopted by the IETF in 1996, with revisions in 1998 and 2003 that restricted some code values). In 2008 UTF-8 became the most widely used form of encoding in web pages.

SNOMED CT uses the UTF-8 representation of characters in terms and other text fields.

{% hint style="info" %}
Note that SNOMED CT does not use, or require use of, the [Byte Order Mark (BOM)](https://en.wikipedia.org/wiki/Byte_order_mark) specified by the Unicode standard because all SNOMED CT release files use UTF-8.
{% endhint %}
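Files that have passed through other tools may still carry a UTF-8 BOM (the bytes `EF BB BF`) at the start. The following is a minimal sketch of how a reader could tolerate that; the function name `strip_utf8_bom` and the sample term are illustrative assumptions, not part of the SNOMED CT specification.

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 BOM (EF BB BF), if one is present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


# Hypothetical input: a term that some other tool prefixed with a BOM.
raw = codecs.BOM_UTF8 + "Myocardial infarction".encode("utf-8")
text = strip_utf8_bom(raw).decode("utf-8")
```

Python's built-in `utf-8-sig` codec performs the same removal when decoding, so `raw.decode("utf-8-sig")` is an alternative.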

## Summary of Unicode Encoding Rules

Character encoding

* ASCII characters (in the range 0-127) are encoded as a single byte.
* Greek, Hebrew, Arabic and most accented European characters are encoded as two bytes.
* Other characters are encoded as three bytes.

The individual characters are encoded according to the following rules.
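As a rough illustration of the size summary above, the sketch below maps a code point to the number of bytes used. It covers only the ranges described on this page (up to U+FFFF); the function name `utf8_length` is a hypothetical helper chosen for this example.

```python
def utf8_length(code_point: int) -> int:
    """Number of UTF-8 bytes for a code point, per the ranges on this page."""
    if not 0 <= code_point <= 0xFFFF:
        raise ValueError("outside the ranges described on this page")
    if code_point <= 0x7F:
        return 1  # ASCII range
    if code_point <= 0x7FF:
        return 2  # e.g. Greek, Hebrew, Arabic, most accented European letters
    return 3      # remaining characters up to U+FFFF


# 'A' is ASCII, 'é' is an accented letter, '€' lies above U+07FF.
lengths = [utf8_length(ord(ch)) for ch in "Aé€"]
```

Here `lengths` matches what Python's own encoder produces for the same characters, for example `len("é".encode("utf-8"))`.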

### Single byte encoding

Characters in the range U+0000 to U+007F are encoded as a single byte.

**UTF-8 Single Byte Encoding**

<div align="left"><figure><img src="https://1769943195-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FirKbJsZG57nSWZA4GT0M%2Fuploads%2FzyhoLah4FmP2bwwROE0n%2FImage%2012-08-2025%20at%2010.40.jpg?alt=media&#x26;token=b8197a09-b8af-4e69-ad8e-88deccecfc45" alt=""><figcaption></figcaption></figure></div>

### Two byte encoding

Characters in the range U+0080 to U+07FF are encoded as two bytes.

**UTF-8 Two Byte Encoding**

<div align="left"><figure><img src="https://1769943195-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FirKbJsZG57nSWZA4GT0M%2Fuploads%2FD74hPtQ3LbTKErOwDBWS%2FImage%2012-08-2025%20at%2010.42.jpg?alt=media&#x26;token=293bd87e-1361-426a-a859-1731d26d50bf" alt=""><figcaption></figcaption></figure></div>
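The two-byte layout can be sketched with a few bit operations: the first byte carries the marker `110` plus the top five bits of the code point, and the second carries the marker `10` plus the low six bits. The function name `encode_two_bytes` is an assumption made for this example.

```python
def encode_two_bytes(code_point: int) -> bytes:
    """Encode a code point in U+0080..U+07FF as two UTF-8 bytes."""
    if not 0x80 <= code_point <= 0x7FF:
        raise ValueError("code point is outside the two-byte range")
    first = 0b11000000 | (code_point >> 6)           # 110xxxxx: top 5 bits
    second = 0b10000000 | (code_point & 0b00111111)  # 10xxxxxx: low 6 bits
    return bytes([first, second])


# U+00E9 (é) becomes the bytes C3 A9.
encoded = encode_two_bytes(0x00E9)
```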

### Three byte encoding

Characters in the range U+0800 to U+FFFF are encoded as three bytes.

**UTF-8 Three Byte Encoding**

<div align="left"><figure><img src="https://1769943195-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FirKbJsZG57nSWZA4GT0M%2Fuploads%2FJGtllBbxaIhE8hiTa8rM%2FImage%2012-08-2025%20at%2010.43.jpg?alt=media&#x26;token=e6cb38d9-2729-43fd-9a70-902353186719" alt=""><figcaption></figcaption></figure></div>
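The three-byte layout follows the same pattern, with the marker `1110` on the first byte and `10` on each of the two continuation bytes. This is a sketch only; the function name `encode_three_bytes` is hypothetical, and the check covers just the numeric range given above.

```python
def encode_three_bytes(code_point: int) -> bytes:
    """Encode a code point in U+0800..U+FFFF as three UTF-8 bytes."""
    if not 0x800 <= code_point <= 0xFFFF:
        raise ValueError("code point is outside the three-byte range")
    first = 0b11100000 | (code_point >> 12)                  # 1110xxxx: top 4 bits
    second = 0b10000000 | ((code_point >> 6) & 0b00111111)   # 10xxxxxx: middle 6 bits
    third = 0b10000000 | (code_point & 0b00111111)           # 10xxxxxx: low 6 bits
    return bytes([first, second, third])


# U+20AC (€) becomes the bytes E2 82 AC.
encoded = encode_three_bytes(0x20AC)
```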

## Notes on encoding rules

The first bits of each byte indicate the role of the byte. A zero bit terminates this role information. Thus possible byte values are:

**UTF-8 Encoding Rules**

<table><thead><tr><th width="148.08856201171875">Bits</th><th width="130.87933349609375">Byte value</th><th>Role</th></tr></thead><tbody><tr><td>0???????</td><td>000-127</td><td>Single byte encoding of a character</td></tr><tr><td>10??????</td><td>128-191</td><td>Continuation of a multi-byte encoding</td></tr><tr><td>110?????</td><td>192-223</td><td>First byte of a two byte character encoding</td></tr><tr><td>1110????</td><td>224-239</td><td>First byte of a three byte character encoding</td></tr><tr><td>1111????</td><td>240-255</td><td>Invalid</td></tr></tbody></table>
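The table above translates directly into a range check on each byte value. The sketch below follows the table as given on this page, including its treatment of values 240-255 as invalid; the function name `byte_role` and the returned labels are assumptions made for illustration.

```python
def byte_role(value: int) -> str:
    """Classify a byte value (0-255) using the leading-bit table on this page."""
    if not 0 <= value <= 255:
        raise ValueError("not a byte value")
    if value < 0x80:   # 0??????? (0-127)
        return "single byte"
    if value < 0xC0:   # 10?????? (128-191)
        return "continuation"
    if value < 0xE0:   # 110????? (192-223)
        return "first of two"
    if value < 0xF0:   # 1110???? (224-239)
        return "first of three"
    return "invalid"   # 1111???? (240-255), as listed in the table


# Roles for the three bytes of U+20AC (€): E2 82 AC.
roles = [byte_role(b) for b in "€".encode("utf-8")]
```

For this input, `roles` holds one "first of three" entry followed by two "continuation" entries.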

## Example encoding

**UTF-8 Encoding Example**

<div align="left"><figure><img src="https://1769943195-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FirKbJsZG57nSWZA4GT0M%2Fuploads%2FVKHc3ItZuk7iMLaYRpyg%2FImage%2012-08-2025%20at%2010.47.jpeg?alt=media&#x26;token=e8a9741e-5d76-4cd4-bf60-2dd920e439ab" alt=""><figcaption></figcaption></figure></div>



