White Blue • 3/12/2019

When "Zoë" !== "Zoë". Or why you need to normalize Unicode strings

This article covers a common bug: Unicode strings that look identical on screen, such as 'Zoë', can compare as unequal in code. The cause is that a character like 'ë' has more than one valid representation. It can be a single precomposed code point (U+00EB), or the base letter 'e' (U+0065) followed by a combining diaeresis (U+0308), and a comparison that works code unit by code unit treats the two as different strings. The article traces the history of character encoding from ASCII to Unicode, covers UTF-8 and UTF-16, and emphasizes converting strings to a single Unicode normalization form (e.g., NFC or NFD) before comparing them. It is relevant to developers working on text processing, data deduplication, or internationalization.
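As a minimal sketch of the mismatch described in the summary, the snippet below builds both spellings of the name with explicit escapes and compares them before and after normalization. `String.prototype.normalize` is standard JavaScript; the helper name `sameText` is hypothetical and chosen only for illustration, not taken from the article.

```javascript
// Two spellings of the same visible name, written with escapes so the
// difference survives copy and paste.
const precomposed = "Zo\u00EB"; // U+00EB LATIN SMALL LETTER E WITH DIAERESIS
const decomposed = "Zoe\u0308"; // "e" followed by U+0308 COMBINING DIAERESIS

// Hypothetical helper: compare two strings after normalizing both to NFC.
function sameText(a, b) {
  return a.normalize("NFC") === b.normalize("NFC");
}

console.log(precomposed === decomposed);            // false
console.log(precomposed.length, decomposed.length); // 3 4
console.log(sameText(precomposed, decomposed));     // true
```

NFC is used here because it yields the shorter, precomposed form. Normalizing both sides to NFD would make the comparison succeed as well, as long as the same form is applied to both strings.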


#character encoding #unicode #Normalization