Handling encoded characters

jOasis · April 22, 2018, 6:53am

I have a string containing many umlauts (ä, ö, ü) and the euro symbol (€). Is there a way to transform them to (a, o, u) and "Euro" (or its equivalent) respectively in Scala?

I am aware of similar libraries in Python that do the job, but I can’t seem to find one for Scala.

Consider this example: val str = "Köln and München are great cities. The average bus ticket costs €4.5"

I want it to be converted as follows: "Koln and Munchen are great cities. The average bus ticket costs Euros 4.5"

martijnhoekstra · April 22, 2018, 8:49am

First off, what you’re looking to do is a big yellow flag. You want to take the perfectly good word Köln and turn it into the non-word Koln.

That’s making things worse, not better, and if you have ways to fix the other side of whatever it is you’re doing, you should do that.

But sometimes what needs to be done just needs to be done, even if it shouldn’t.

I’m not personally aware of any libraries that will do this for you. The strategy I would employ is the following. First, convert your string to normal form D: https://docs.oracle.com/javase/tutorial/i18n/text/normalizerapi.html

That decomposes every precomposed accented character into its base letter followed by separate combining characters for the diacritics.
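As a rough illustration of what normal form D does, here is a minimal sketch. The object and method names (NfdDemo, decompose, codeUnits) are made up for this example; only java.text.Normalizer comes from the JDK.

```scala
import java.text.Normalizer

object NfdDemo {
  // Decompose precomposed characters into base letter + combining marks.
  def decompose(s: String): String =
    Normalizer.normalize(s, Normalizer.Form.NFD)

  // Render each UTF-16 code unit as U+XXXX so combining marks become visible.
  def codeUnits(s: String): List[String] =
    s.map(ch => f"U+${ch.toInt}%04X").toList

  def main(args: Array[String]): Unit = {
    // "\u00F6" is the precomposed ö.
    println(codeUnits("\u00F6"))            // List(U+00F6)
    println(codeUnits(decompose("\u00F6"))) // List(U+006F, U+0308)
  }
}
```

After decomposition the ö is a plain o (U+006F) followed by the combining diaeresis (U+0308), and the latter is what gets filtered out in the next step.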

After that you can filter them out. You will catch most of them in the Unicode block COMBINING_DIACRITICAL_MARKS. Some other blocks contain rarer diacritical marks: COMBINING_DIACRITICAL_MARKS_EXTENDED, COMBINING_DIACRITICAL_MARKS_SUPPLEMENT, COMBINING_MARKS_FOR_SYMBOLS (the Unicode block "Combining Diacritical Marks for Symbols") and COMBINING_HALF_MARKS.

Filtering out all characters in those blocks should take care of everything. Fortunately, all of them are in the BMP, so you don’t need to handle surrogate pairs. Unfortunately, the extended marks block doesn’t seem to exist in the JDK I’m using, so I’m leaving it out here. That means you won’t handle the diacritics used in https://en.wikipedia.org/wiki/Teuthonista. If you need those, handle that character range separately.

Putting it all together, it would look something like this:

def removeDiacritics(str: String): String = {
  import java.text.Normalizer
  import Character.UnicodeBlock._

  // Unicode blocks that hold combining diacritical marks.
  val diacriticBlocks = List(COMBINING_DIACRITICAL_MARKS,
                             // COMBINING_DIACRITICAL_MARKS_EXTENDED, // for some reason not in Java
                             COMBINING_DIACRITICAL_MARKS_SUPPLEMENT,
                             COMBINING_MARKS_FOR_SYMBOLS,
                             COMBINING_HALF_MARKS)

  // NFD splits each accented character into a base letter plus combining marks.
  val normalized = Normalizer.normalize(str, Normalizer.Form.NFD)
  // Drop every character that belongs to one of the blocks above.
  normalized.filterNot(ch => diacriticBlocks.contains(Character.UnicodeBlock.of(ch)))
}
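If the extended block does matter to you, one way to handle that range separately is to filter on the code points directly. This is only a sketch: the names ExtendedMarks, isExtendedMark and removeExtendedMarks are hypothetical, and the range U+1AB0 to U+1AFF is my reading of where the "Combining Diacritical Marks Extended" block lives, so check it against the Unicode charts.

```scala
object ExtendedMarks {
  // Assumption: code point range of the Unicode block
  // "Combining Diacritical Marks Extended".
  private val ExtendedStart = 0x1AB0
  private val ExtendedEnd   = 0x1AFF

  // True if the character falls inside the assumed extended block range.
  def isExtendedMark(ch: Char): Boolean =
    ch >= ExtendedStart && ch <= ExtendedEnd

  // Drop every character in that range; all other characters are kept as-is.
  def removeExtendedMarks(str: String): String =
    str.filterNot(isExtendedMark)
}
```

You would apply this to the result of removeDiacritics, so the string is already in normal form D when the extra filter runs.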

That leaves the euro sign. First off, I don’t know how you’d know that the character € should be converted to the string "Euros ". You could get the character’s name, but that wouldn’t have the trailing space you want, and the name is Euro Sign, not Euros. On top of that, replacing the euro sign with the text “Euros” will only give you the right result for languages where the correct name is Euros, and in contexts where the euro sign is used to denote some amount of currency. If the text you’re processing were "The sign to denote the Euro is €.", you would translate that to "The sign to denote the Euro is Euros ."

In Greek, for example, the original string would be "Το σύμβολο που υποδηλώνει το Ευρώ είναι €." and you’d translate it to "Το σύμβολο που υποδηλώνει το Ευρώ είναι Euros ." which is clearly wrong. How to handle this depends on comprehension of the context and the meaning of the text, something that can’t be automated.
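To make the problem concrete, here is a small sketch of the blind replacement and of the name lookup. EuroDemo and naiveEuro are hypothetical names for this example; Character.getName is the JDK call for looking up a character’s Unicode name.

```scala
object EuroDemo {
  // Hypothetical helper: blindly replaces every euro sign (U+20AC)
  // with the text "Euros ", regardless of context.
  def naiveEuro(s: String): String = s.replace("\u20AC", "Euros ")

  // The Unicode name of the character as reported by the JDK.
  def euroName: String = Character.getName(0x20AC)

  def main(args: Array[String]): Unit = {
    println(euroName)                                            // EURO SIGN
    println(naiveEuro("The average bus ticket costs \u20AC4.5")) // The average bus ticket costs Euros 4.5
    println(naiveEuro("The sign to denote the Euro is \u20AC.")) // The sign to denote the Euro is Euros .
  }
}
```

The first sentence happens to come out the way the question asks; the second shows the same rule producing nonsense as soon as the sign is not followed by an amount.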

Also, from your description I can’t figure out exactly which characters you want to translate to their names.

I can’t help you much further with that.

All in all, this is a bad idea, and you shouldn’t do it. If you have to, use the diacritics removal above. You don’t describe exactly which other characters need handling, so that part needs more information. In the end, fixing the reason why you want to do this in the first place will be better, and probably also much easier.

curoli · April 23, 2018, 3:30pm

Why do you want to do this? Do you need US-ASCII code, and what for?

The problem is that transliteration of umlauts depends on the language. For example, when writing Turkish words using only English letters, ö and ü are written as o and u, but for German words, the umlauts ä, ö and ü are written as ae, oe and ue.

So it should be Koeln and Muenchen (not Koln and Munchen), since these are German names (apart from the fact, of course, that these cities are already known in English as Cologne and Munich).
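For German text, a minimal sketch of that convention could look like the following. The object and method names are made up for illustration, the uppercase entries are my own addition, and the input is normalized to NFC first on the assumption that umlauts may arrive in decomposed form.

```scala
import java.text.Normalizer

object GermanTransliteration {
  // Lowercase mappings follow the German convention described above;
  // the uppercase entries are an added assumption.
  private val umlauts: Map[Char, String] = Map(
    '\u00E4' -> "ae", '\u00F6' -> "oe", '\u00FC' -> "ue", // ä ö ü
    '\u00C4' -> "Ae", '\u00D6' -> "Oe", '\u00DC' -> "Ue"  // Ä Ö Ü
  )

  def transliterate(s: String): String = {
    // Compose first so that "o" + combining diaeresis becomes a single ö.
    val composed = Normalizer.normalize(s, Normalizer.Form.NFC)
    // Replace mapped umlauts, keep every other character unchanged.
    composed.flatMap(ch => umlauts.getOrElse(ch, ch.toString))
  }
}
```

With this, "Köln and München" becomes "Koeln and Muenchen". It is only appropriate when you already know the text is German; applied to Turkish names it would give the wrong spelling.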

Could you say more about your use case?

Best, Oliver