confusables.txt and NFKC disagree on 31 characters

Hacker News
February 23, 2026
AI-Generated Deep Dive Summary
Tech startups and developers working on login systems or identifier validation face a critical challenge: homoglyph attacks where visually identical characters cause impersonation risks. The Unicode Consortium addresses this with confusables.txt, a security mechanism mapping characters to their visual equivalents for detection purposes. However, a lesser-known conflict arises when combining confusables.txt with NFKC (Normalization Form Compatibility Composition) normalization. NFKC transforms characters into their canonical forms, such as converting fullwidth letters to ASCII or breaking ligatures into their components. While this is vital for standardizing data storage and comparison, it creates a potential issue: 31 characters in confusables.txt map to different targets than NFKC. For example, the Long S (ſ) is mapped to 'f' by confusables.txt but normalized to 's' by NFKC. This means that if NFKC normalization occurs before confusable checks, these specific entries become redundant. The practical implication for developers is significant. If using a filtered confusable map after NFKC normalization, those 31 conflicting characters should be removed to avoid dead code and ensure accurate security detection. This highlights the importance of understanding how and when to apply these tools. While confusables.txt is essential for detecting visually similar characters, NFKC serves a different purpose by standardizing forms for storage. The broader lesson underscores the need for developers to carefully consider their normalization and detection processes. By aligning these steps with specific use cases—whether prioritizing security or data consistency—they can build more robust systems. This issue is particularly relevant for tech startups focused on user authentication, where even a single missed homoglyph attack could compromise system integrity. Ultimately, the conflict between confusables.txt and NFKC normalization highlights the interconnected yet distinct roles of security detection and canonical form standardization in modern computing. Developers must choose their toolchain wisely, ensuring that their chosen methods align with their application's requirements to maintain both usability and security.
Verticals
techstartups
Originally published on Hacker News on 2/23/2026