A simple code snippet for removing diacritics (accents, dots, and other marks that can be attached to letters) from strings. For example, “Tĥìŝ Šťŕĭńġ ĩs ñȯt åççeptàblé” becomes “This String is not acceptable”.

The goal could be accomplished this way:

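import java.text.Normalizer;

String str = "Tĥìŝ Šťŕĭńġ ĩs ñȯt åççeptàblé";
// Decompose each accented character into a base letter plus combining marks
str = Normalizer.normalize(str, Normalizer.Form.NFD);
// Remove every combining mark (Unicode category M)
str = str.replaceAll("\\p{M}", "");
System.out.println(str); // This String is not acceptable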

The call to Normalizer.normalize with Form.NFD (canonical decomposition) splits each character that represents a letter with diacritics into separate characters: the base letter on its own, followed by one combining mark per diacritic. Some lists of diacritics and combining marks can be found here:

  • Wikipedia: https://en.wikipedia.org/wiki/Diacritic
  • &what;: http://www.amp-what.com/unicode/search/diacritic
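
To make the decomposition step visible, here is a small sketch that prints the code points a string consists of after NFD normalization. The class name DecompositionDemo and the method describeNfd are made up for this illustration; only Normalizer comes from the snippet above.

```java
import java.text.Normalizer;

class DecompositionDemo {
    // Hypothetical helper: returns the code points of the NFD form of the
    // input, formatted as "U+XXXX" and separated by single spaces.
    static String describeNfd(String input) {
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        StringBuilder sb = new StringBuilder();
        decomposed.codePoints().forEach(cp -> {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        // e with acute accent (U+00E9) becomes e + combining acute accent
        System.out.println(describeNfd("\u00E9")); // U+0065 U+0301
        // n with tilde (U+00F1) becomes n + combining tilde
        System.out.println(describeNfd("\u00F1")); // U+006E U+0303
    }
}
```

The input characters are written as Unicode escapes so the example does not depend on the encoding of the source file.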

The replaceAll call then removes the combining marks, using a regular expression with the Unicode character class for marks (\p{M}).
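
If the snippet is used in more than one place, both steps could be wrapped in a small helper. The following is only a sketch: the class DiacriticStripper and its strip method are names chosen for this example, and precompiling the pattern is an optional design choice so that the regular expression is not recompiled on every call.

```java
import java.text.Normalizer;
import java.util.regex.Pattern;

class DiacriticStripper {
    // \p{M} matches any code point in the Unicode "Mark" general category.
    private static final Pattern MARKS = Pattern.compile("\\p{M}+");

    // Hypothetical helper: decomposes the input, then drops the marks.
    static String strip(String input) {
        if (input == null) {
            return null;
        }
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        return MARKS.matcher(decomposed).replaceAll("");
    }

    public static void main(String[] args) {
        // "acceptable" written with a ring, cedillas, and accents
        System.out.println(strip("\u00E5\u00E7\u00E7ept\u00E0bl\u00E9")); // acceptable
    }
}
```

Returning null for a null input is an assumption of this sketch; throwing an exception instead would be just as reasonable.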

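Keep in mind that not every variant of a letter is encoded as a base letter plus a combining mark. Letters with a stroke, for instance, have no canonical decomposition, so NFD leaves them as they are. For example, the previous code does not modify the following characters: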

Ⱦ ø ᵽ
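
One possible workaround for such characters is a hand-written replacement table applied after the marks are removed. The sketch below is not part of the original snippet: the class StrokeLetterFallback and the FALLBACK table are hypothetical, and the chosen replacements (T, o, p) are an assumption, since some applications would rather transliterate the o with stroke as "oe".

```java
import java.text.Normalizer;
import java.util.HashMap;
import java.util.Map;

class StrokeLetterFallback {
    // Hypothetical, hand-picked replacements for letters that NFD does not
    // decompose. Extend the table for the characters you expect.
    private static final Map<Integer, String> FALLBACK = new HashMap<>();
    static {
        FALLBACK.put(0x023E, "T"); // capital T with diagonal stroke
        FALLBACK.put(0x00F8, "o"); // small o with stroke
        FALLBACK.put(0x1D7D, "p"); // small p with stroke
    }

    static String strip(String input) {
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        StringBuilder sb = new StringBuilder();
        decomposed.codePoints().forEach(cp -> {
            int type = Character.getType(cp);
            boolean isMark = type == Character.NON_SPACING_MARK
                    || type == Character.COMBINING_SPACING_MARK
                    || type == Character.ENCLOSING_MARK;
            if (isMark) {
                return; // drop combining marks, as \p{M} does
            }
            String replacement = FALLBACK.get(cp);
            if (replacement != null) {
                sb.append(replacement);
            } else {
                sb.appendCodePoint(cp);
            }
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        // The three stroke letters, written as Unicode escapes
        System.out.println(strip("\u023E\u00F8\u1D7D")); // Top
    }
}
```

Characters that are not in the table and have no decomposition still pass through unchanged, so the table only needs entries for the letters that actually occur in your data.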