Every now and then I am challenged with modifying Unicode strings. Whether it is converting a non-ASCII script to ASCII or handling differently normalized strings, all of these actions are called “Transliteration”.
I first encountered this when I built an application that created PDF files on a Linux server. Those files would then be overwritten by an application running on a Mac that had the folder mounted via CIFS. Everything was working great, until one of the users thought it would be a great idea to enter a filename with a German umlaut. So the application created the file “example_ä.pdf” on the server. After some time we realized that there was a second file in that folder with the name “example_ä.pdf”.
Yeah. It looked pretty interesting having two files with literally the same name in one folder.
Solving an impossible situation
So I had to dig deeper into it. To be honest, far too deep for my liking. I ended up in the depths of Unicode, string normalization and the different ways of storing characters that are combined from several code points.
In this specific case the “ä” can be encoded in two different ways: the composed form, where the “ä” is stored as a single code point (U+00E4), and the decomposed form, where the “ä” is split into the base letter “a” (U+0061) followed by the combining diaeresis “¨” (U+0308). We call the first one “Normalization Form C” (NFC) and the latter “Normalization Form D” (NFD). There are also NFKC and NFKD, but let’s not overcomplicate things.
And this was the culprit here. macOS stored the filename in the decomposed normalization form, whereas the application I built used the composed normalization form. So even though the two filenames looked the same, they were stored differently. Mystery solved.
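To make the difference visible, here is a minimal sketch that writes both spellings of “ä” explicitly and compares them. It only uses core PHP functions and the `\u{...}` escape (PHP 7 and later); the variable names are just for illustration.

```php
<?php
// Two spellings of the same visible character "ä".
$composed   = "\u{00E4}";  // NFC: a single code point, U+00E4
$decomposed = "a\u{0308}"; // NFD: "a" followed by U+0308 COMBINING DIAERESIS

// Both usually render identically, but the bytes are different.
var_dump($composed === $decomposed); // bool(false)

echo bin2hex($composed), PHP_EOL;    // c3a4
echo bin2hex($decomposed), PHP_EOL;  // 61cc88

// Even the byte length differs: 2 bytes versus 3 bytes in UTF-8.
echo strlen($composed), ' vs ', strlen($decomposed), PHP_EOL; // 2 vs 3
```

A filesystem that compares names byte by byte will therefore treat the two filenames as two different files.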
But I still had to make sure that the file that the Mac wrote would overwrite my file. So how to do that?
And here the Transliterator class from PHP's intl extension came in handy, like this:
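<?php
$fileName = "example_ä.pdf";

// Convert whatever form the name is in to the decomposed form (NFD).
$transliterator = Transliterator::create('Any-NFD');
$normalizedFileName = $transliterator->transliterate($fileName);

// Hex dumps below are shown with spaces added around the "ä" for readability.
echo bin2hex($fileName);
// 6578616d706c655f c3a4 2e706466
echo bin2hex($normalizedFileName);
// 6578616d706c655f 61cc88 2e706466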
With that little piece of code to normalize the filename we never had any issues again with people thinking of funny filenames.
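In real code it is worth wrapping this in a small function, because `Transliterator::create()` returns `null` for an ID it cannot resolve and `transliterate()` returns `false` on failure. The following is only a sketch: the function name `toDecomposedFileName` and the decision to throw exceptions are my assumptions, not something the intl extension prescribes.

```php
<?php
/**
 * Hypothetical helper: returns the given file name in NFD,
 * the form macOS used in the scenario described above.
 */
function toDecomposedFileName(string $fileName): string
{
    $transliterator = Transliterator::create('Any-NFD');
    if ($transliterator === null) {
        // create() failed, e.g. because of an unknown transliterator ID.
        throw new RuntimeException(
            'Could not create transliterator: ' . intl_get_error_message()
        );
    }

    $result = $transliterator->transliterate($fileName);
    if ($result === false) {
        throw new RuntimeException(
            'Transliteration failed: ' . $transliterator->getErrorMessage()
        );
    }

    return $result;
}

// Usage: normalize the name before building the path that gets written.
echo bin2hex(toDecomposedFileName("example_\u{00E4}.pdf")), PHP_EOL;
```

Calling the helper on a name that is already decomposed leaves it unchanged, so it should be safe to apply to every filename regardless of where it came from.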
But changing normalizations is not everything one can do with the Transliterator class.
Another thing we encountered was a requirement where we should be able to send the names of Korean recipients to an API that only accepts ASCII characters.
Nothing easier than that!
Transliterator to the rescue!
<?php
$name = "도현";

// Convert from any script to Latin characters.
$transliterator = Transliterator::create('Any-Latin');
$normalizedName = $transliterator->transliterate($name);

echo $normalizedName;
// dohyeon
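One caveat: `Any-Latin` converts to Latin script, which is not necessarily plain ASCII, since the result can still carry diacritics for some inputs. ICU lets you chain transforms with a semicolon, so a sketch for an ASCII-only API could append `Latin-ASCII`. The function name `toAsciiName` is hypothetical, and I would still verify the result before sending it anywhere, because I cannot promise the chain covers every possible character.

```php
<?php
/**
 * Hypothetical helper: transliterates a name to Latin script and then
 * folds remaining Latin characters with diacritics down to ASCII.
 */
function toAsciiName(string $name): string
{
    // Compound ID: the transforms are applied from left to right.
    $transliterator = Transliterator::create('Any-Latin; Latin-ASCII');
    if ($transliterator === null) {
        throw new RuntimeException(
            'Could not create transliterator: ' . intl_get_error_message()
        );
    }

    $result = $transliterator->transliterate($name);
    if ($result === false) {
        throw new RuntimeException(
            'Transliteration failed: ' . $transliterator->getErrorMessage()
        );
    }

    return $result;
}

foreach (["도현", "Müller"] as $name) {
    $ascii = toAsciiName($name);
    // Defensive check: only bytes in the ASCII range are allowed.
    $isAscii = preg_match('/^[\x00-\x7F]*$/', $ascii) === 1;
    echo $name, ' => ', $ascii, $isAscii ? '' : ' (still not ASCII)', PHP_EOL;
}
```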
There are lots more things one can do with the Transliterator class. The main drawback is the limited documentation on PHP's side as well as on ICU's side. I had to search a lot through the old ICU site to get a grasp of what is possible, followed by a lot of trial and error to get the stuff working that I wanted to achieve.
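One thing that helps a little with the trial and error is asking the extension itself which transliterator IDs it knows. A small sketch using `Transliterator::listIDs()`; the exact list depends on the ICU version PHP was built against, so I am not showing any expected output here.

```php
<?php
// Ask ICU for all registered transliterator IDs.
$ids = Transliterator::listIDs();
if ($ids === false) {
    throw new RuntimeException(
        'Could not list transliterator IDs: ' . intl_get_error_message()
    );
}

sort($ids);
echo count($ids), ' transliterator IDs available', PHP_EOL;

// Narrow the list down, here to everything that converts into Latin script.
$toLatin = array_filter($ids, static function (string $id): bool {
    return substr($id, -6) === '-Latin';
});

foreach ($toLatin as $id) {
    echo $id, PHP_EOL;
}
```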