Class CharUtilities

java.lang.Object
org.docx4j.fonts.fop.util.CharUtilities

public class CharUtilities
extends java.lang.Object
This class provides utilities to distinguish various kinds of Unicode whitespace and to get character widths in a given FontState.
  • Field Summary

    Fields
    Modifier and Type Field Description
    static char CARRIAGE_RETURN
    carriage return
    static char CODE_EOT
    Character code used to signal a character boundary in inline content, such as an inline with borders and padding or a nested block object.
    static int EOT
    Character class: Boundary between text runs
    static char IDEOGRAPHIC_SPACE
    Ideographic space
    static char LINE_SEPARATOR
    line-separator
    static int LINEFEED
    Character class: Line feed
    static char LINEFEED_CHAR
    linefeed character
    static char LRE
    left-to-right embedding
    static char LRM
    left-to-right mark
    static char LRO
    left-to-right override
    static char MISSING_IDEOGRAPH
    missing ideograph
    static char NBSPACE
    non-breaking space
    static char NEXT_LINE
    next line control character
    static int NONWHITESPACE
    Character class: non-whitespace
    static char NOT_A_CHARACTER
    Unicode value indicating that the character is "not a character".
    static char NULL_CHAR
    null char
    static char OBJECT_REPLACEMENT_CHARACTER
    Object replacement character
    static char PARAGRAPH_SEPARATOR
    paragraph-separator
    static char PDF
    pop directional formatting
    static char RLE
    right-to-left embedding
    static char RLM
    right-to-left mark
    static char RLO
    right-to-left override
    static char SOFT_HYPHEN
    soft hyphen
    static char SPACE
    normal space
    static char TAB
    normal tab
    static int UCWHITESPACE
    Character class: Unicode white space
    static char WORD_JOINER
    word joiner
    static int XMLWHITESPACE
    Character class: XML whitespace
    static char ZERO_WIDTH_JOINER
    zero-width joiner
    static char ZERO_WIDTH_NOBREAK_SPACE
    zero-width no-break space (= byte order mark)
    static char ZERO_WIDTH_SPACE
    zero-width space
  • Constructor Summary

    Constructors
    Modifier Constructor Description
    protected CharUtilities()
    Utility class: Constructor prevents instantiating when subclassed.
  • Method Summary

    Modifier and Type Method Description
    static java.lang.String charToNCRef​(int c)
    Convert a single Unicode scalar value to an XML numeric character reference.
    static int classOf​(int c)
    Return the appropriate CharClass constant for the type of the passed character.
    static java.lang.Iterable<java.lang.Integer> codepointsIter​(java.lang.CharSequence s)
    Creates an iterable over the code points of a CharSequence.
    static java.lang.Iterable<java.lang.Integer> codepointsIter​(java.lang.CharSequence s, int beginIndex, int endIndex)
    Creates an iterable over the code points of a sub-range of a CharSequence.
    static boolean containsSurrogatePairAt​(java.lang.CharSequence chars, int index)
    Tells whether there is a surrogate pair starting from the given index in the CharSequence.
    static java.lang.String format​(int c)
    Formats a character for debugging output: the result is prefixed with "0x" and padded left with '0' to a width of either 4 or 6 hex characters, according to whether the character is in the BMP or not.
    static int incrementIfNonBMP​(int codePoint)
    Returns 1 if codePoint is not in the BMP.
    static boolean isAdjustableSpace​(int c)
    Method to determine if the character is an adjustable space.
    static boolean isAlphabetic​(int c)
    Indicates whether a character is classified as "Alphabetic" by the Unicode standard.
    static boolean isAnySpace​(int c)
    Determines if the character represents any kind of space.
    static boolean isBmpCodePoint​(int codePoint)
    Determine whether the specified character (Unicode code point) is in the Basic Multilingual Plane (BMP).
    static boolean isBreakableSpace​(int c)
    Helper method to determine if the character is a space with normal behavior.
    static boolean isExplicitBreak​(int c)
    Indicates whether the given character is an explicit break character.
    static boolean isFixedWidthSpace​(int c)
    Method to determine if the character is a (breakable) fixed-width space.
    static boolean isNonBreakableSpace​(int c)
    Method to determine if the character is a nonbreaking space.
    static boolean isSameSequence​(java.lang.CharSequence cs1, java.lang.CharSequence cs2)
    Determine if two character sequences contain the same characters.
    static boolean isSurrogatePair​(char ch)
    Determine if the given character is part of a surrogate pair.
    static boolean isZeroWidthSpace​(int c)
    Method to determine if the character is a zero-width space.
    static java.lang.String padLeft​(java.lang.String s, int width, char pad)
    Pads the string s on the left to the given width using the padding character pad.
    static java.lang.String toNCRefs​(java.lang.String s)
    Convert a string to a sequence of ASCII or XML numeric character references.

    Methods inherited from class java.lang.Object

    clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
  • Field Details

    • CODE_EOT

      public static final char CODE_EOT
      Character code used to signal a character boundary in inline content, such as an inline with borders and padding or a nested block object.
      See Also:
      Constant Field Values
    • UCWHITESPACE

      public static final int UCWHITESPACE
      Character class: Unicode white space
      See Also:
      Constant Field Values
    • LINEFEED

      public static final int LINEFEED
      Character class: Line feed
      See Also:
      Constant Field Values
    • EOT

      public static final int EOT
      Character class: Boundary between text runs
      See Also:
      Constant Field Values
    • NONWHITESPACE

      public static final int NONWHITESPACE
      Character class: non-whitespace
      See Also:
      Constant Field Values
    • XMLWHITESPACE

      public static final int XMLWHITESPACE
      Character class: XML whitespace
      See Also:
      Constant Field Values
    • NULL_CHAR

      public static final char NULL_CHAR
      null char
      See Also:
      Constant Field Values
    • LINEFEED_CHAR

      public static final char LINEFEED_CHAR
      linefeed character
      See Also:
      Constant Field Values
    • CARRIAGE_RETURN

      public static final char CARRIAGE_RETURN
      carriage return
      See Also:
      Constant Field Values
    • TAB

      public static final char TAB
      normal tab
      See Also:
      Constant Field Values
    • SPACE

      public static final char SPACE
      normal space
      See Also:
      Constant Field Values
    • NBSPACE

      public static final char NBSPACE
      non-breaking space
      See Also:
      Constant Field Values
    • NEXT_LINE

      public static final char NEXT_LINE
      next line control character
      See Also:
      Constant Field Values
    • ZERO_WIDTH_SPACE

      public static final char ZERO_WIDTH_SPACE
      zero-width space
      See Also:
      Constant Field Values
    • WORD_JOINER

      public static final char WORD_JOINER
      word joiner
      See Also:
      Constant Field Values
    • ZERO_WIDTH_JOINER

      public static final char ZERO_WIDTH_JOINER
      zero-width joiner
      See Also:
      Constant Field Values
    • LRM

      public static final char LRM
      left-to-right mark
      See Also:
      Constant Field Values
    • RLM

      public static final char RLM
      right-to-left mark
      See Also:
      Constant Field Values
    • LRE

      public static final char LRE
      left-to-right embedding
      See Also:
      Constant Field Values
    • RLE

      public static final char RLE
      right-to-left embedding
      See Also:
      Constant Field Values
    • PDF

      public static final char PDF
      pop directional formatting
      See Also:
      Constant Field Values
    • LRO

      public static final char LRO
      left-to-right override
      See Also:
      Constant Field Values
    • RLO

      public static final char RLO
      right-to-left override
      See Also:
      Constant Field Values
    • ZERO_WIDTH_NOBREAK_SPACE

      public static final char ZERO_WIDTH_NOBREAK_SPACE
      zero-width no-break space (= byte order mark)
      See Also:
      Constant Field Values
    • SOFT_HYPHEN

      public static final char SOFT_HYPHEN
      soft hyphen
      See Also:
      Constant Field Values
    • LINE_SEPARATOR

      public static final char LINE_SEPARATOR
      line-separator
      See Also:
      Constant Field Values
    • PARAGRAPH_SEPARATOR

      public static final char PARAGRAPH_SEPARATOR
      paragraph-separator
      See Also:
      Constant Field Values
    • MISSING_IDEOGRAPH

      public static final char MISSING_IDEOGRAPH
      missing ideograph
      See Also:
      Constant Field Values
    • IDEOGRAPHIC_SPACE

      public static final char IDEOGRAPHIC_SPACE
      Ideographic space
      See Also:
      Constant Field Values
    • OBJECT_REPLACEMENT_CHARACTER

      public static final char OBJECT_REPLACEMENT_CHARACTER
      Object replacement character
      See Also:
      Constant Field Values
    • NOT_A_CHARACTER

      public static final char NOT_A_CHARACTER
      Unicode value indicating that the character is "not a character".
      See Also:
      Constant Field Values
  • Constructor Details

    • CharUtilities

      protected CharUtilities()
      Utility class: Constructor prevents instantiating when subclassed.
  • Method Details

    • classOf

      public static int classOf​(int c)
      Return the appropriate CharClass constant for the type of the passed character.
      Parameters:
      c - character to inspect
      Returns:
      the determined character class
    • isBreakableSpace

      public static boolean isBreakableSpace​(int c)
      Helper method to determine if the character is a space with normal behavior. Normal behavior means that it's not non-breaking.
      Parameters:
      c - character to inspect
      Returns:
      True if the character is a normal space
    • isZeroWidthSpace

      public static boolean isZeroWidthSpace​(int c)
      Method to determine if the character is a zero-width space.
      Parameters:
      c - the character to check
      Returns:
      true if the character is a zero-width space
    • isFixedWidthSpace

      public static boolean isFixedWidthSpace​(int c)
      Method to determine if the character is a (breakable) fixed-width space.
      Parameters:
      c - the character to check
      Returns:
      true if the character has a fixed width
    • isNonBreakableSpace

      public static boolean isNonBreakableSpace​(int c)
      Method to determine if the character is a nonbreaking space.
      Parameters:
      c - character to check
      Returns:
      True if the character is a non-breaking space
    • isAdjustableSpace

      public static boolean isAdjustableSpace​(int c)
      Method to determine if the character is an adjustable space.
      Parameters:
      c - character to check
      Returns:
      True if the character is adjustable
    • isAnySpace

      public static boolean isAnySpace​(int c)
      Determines if the character represents any kind of space.
      Parameters:
      c - character to check
      Returns:
      True if the character represents any kind of space
    • isAlphabetic

      public static boolean isAlphabetic​(int c)
      Indicates whether a character is classified as "Alphabetic" by the Unicode standard.
      Parameters:
      c - the character
      Returns:
      true if the character is "Alphabetic"
    • isExplicitBreak

      public static boolean isExplicitBreak​(int c)
      Indicates whether the given character is an explicit break character.
      Parameters:
      c - the character to check
      Returns:
      true if the character represents an explicit break
    • charToNCRef

      public static java.lang.String charToNCRef​(int c)
      Convert a single Unicode scalar value to an XML numeric character reference. If the value is in the BMP, four digits are used; otherwise, six digits are used.
      Parameters:
      c - a Unicode scalar value
      Returns:
      a string representing a numeric character reference
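      To illustrate the four-versus-six digit rule described above, here is a minimal, JDK-only sketch. The class name NcrSketch, the method name toNcr, and the exact output form (an "&#x" prefix with uppercase hex digits) are assumptions made for this example; they are not taken from the CharUtilities implementation.

```java
// Hypothetical helper, not part of docx4j: builds a hex numeric character reference.
final class NcrSketch {
    private NcrSketch() { }

    /** Uses 4 hex digits for BMP code points and 6 for supplementary ones (assumed format). */
    static String toNcr(int codePoint) {
        int digits = (codePoint > 0xFFFF) ? 6 : 4;
        return String.format("&#x%0" + digits + "X;", codePoint);
    }

    public static void main(String[] args) {
        System.out.println(toNcr(0x00E9));   // a BMP code point
        System.out.println(toNcr(0x1F600));  // a supplementary code point
    }
}
```

      With this sketch, U+00E9 yields a four-digit reference and U+1F600 a six-digit one.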
    • toNCRefs

      public static java.lang.String toNCRefs​(java.lang.String s)
      Convert a string to a sequence of ASCII or XML numeric character references.
      Parameters:
      s - a Java string (encoded in UTF-16)
      Returns:
      a string representing a sequence of numeric character references or ASCII characters
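      The following sketch shows one way such a conversion could work: printable ASCII is copied through and every other code point becomes a numeric character reference. The names NcrStringSketch and toNcrs are hypothetical, the choice to walk the string by code point rather than by char is an assumption, and this sketch does not escape XML special characters such as '&' or '<'.

```java
// Hypothetical helper, not part of docx4j.
final class NcrStringSketch {
    private NcrStringSketch() { }

    /** Keeps printable ASCII as-is; replaces every other code point with a hex reference. */
    static String toNcrs(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> {
            if (cp >= 0x20 && cp < 0x7F) {
                sb.append((char) cp);  // printable ASCII passes through unchanged
            } else {
                // assumed format: 4 hex digits in the BMP, 6 otherwise
                sb.append(String.format(cp > 0xFFFF ? "&#x%06X;" : "&#x%04X;", cp));
            }
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(toNcrs("caf\u00E9"));
    }
}
```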
    • padLeft

      public static java.lang.String padLeft​(java.lang.String s, int width, char pad)
      Pads the string s on the left to the given width using the padding character pad.
      Parameters:
      s - string to pad
      width - width of the field to pad to
      pad - character to use for padding
      Returns:
      padded string
    • format

      public static java.lang.String format​(int c)
      Formats a character for debugging output: the result is prefixed with "0x" and padded left with '0' to a width of either 4 or 6 hex characters, according to whether the character is in the BMP or not.
      Parameters:
      c - character code
      Returns:
      formatted character string
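      The padLeft and format descriptions above fit together naturally, so here is a small sketch of both using only the JDK. The class name HexFormatSketch is hypothetical, and two behaviours are assumptions: hex digits are emitted in lowercase (as Integer.toHexString does), and strings already longer than the requested width are returned unchanged.

```java
// Hypothetical helper, not part of docx4j.
final class HexFormatSketch {
    private HexFormatSketch() { }

    /** Left-pads s with pad until it is width characters long; never truncates. */
    static String padLeft(String s, int width, char pad) {
        StringBuilder sb = new StringBuilder();
        for (int i = s.length(); i < width; i++) {
            sb.append(pad);
        }
        return sb.append(s).toString();
    }

    /** "0x" followed by 4 hex digits for BMP values, 6 for values above 0xFFFF. */
    static String format(int c) {
        int width = (c > 0xFFFF) ? 6 : 4;
        return "0x" + padLeft(Integer.toHexString(c), width, '0');
    }

    public static void main(String[] args) {
        System.out.println(format('A'));
        System.out.println(format(0x1F600));
    }
}
```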
    • isSameSequence

      public static boolean isSameSequence​(java.lang.CharSequence cs1, java.lang.CharSequence cs2)
      Determine if two character sequences contain the same characters.
      Parameters:
      cs1 - first character sequence
      cs2 - second character sequence
      Returns:
      true if both sequences have same length and same character sequence
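      A comparison like this is useful because equals() on two different CharSequence implementations (for example a String and a StringBuilder) does not compare content. The sketch below, with the hypothetical names SequenceCompareSketch and sameSequence, shows the length-then-characters check described in the Returns text; it assumes non-null arguments.

```java
// Hypothetical helper, not part of docx4j.
final class SequenceCompareSketch {
    private SequenceCompareSketch() { }

    /** True if both sequences have the same length and the same chars at every index. */
    static boolean sameSequence(CharSequence a, CharSequence b) {
        if (a.length() != b.length()) {
            return false;
        }
        for (int i = 0; i < a.length(); i++) {
            if (a.charAt(i) != b.charAt(i)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        CharSequence builder = new StringBuilder("abc");
        System.out.println("abc".equals(builder));         // String.equals ignores other CharSequence types
        System.out.println(sameSequence("abc", builder));  // content comparison
    }
}
```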
    • isBmpCodePoint

      public static boolean isBmpCodePoint​(int codePoint)
      Determine whether the specified character (Unicode code point) is in the Basic Multilingual Plane (BMP). Such code points can be represented using a single char.
      Parameters:
      codePoint - the character (Unicode code point) to be tested
      Returns:
      true if the specified code point is between Character.MIN_VALUE and Character.MAX_VALUE inclusive; false otherwise
      See Also:
      Character.isBmpCodePoint(int), available from Java 1.7
    • incrementIfNonBMP

      public static int incrementIfNonBMP​(int codePoint)
      Returns 1 if codePoint is not in the BMP. This function is particularly useful in for loops over strings where, in the presence of a surrogate pair, one extra iteration must be skipped.
      Parameters:
      codePoint - the Unicode code point to test
      Returns:
      1 if codePoint > 0xFFFF, 0 otherwise
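      The loop pattern mentioned above can be sketched as follows. The class NonBmpLoopSketch and its local incrementIfNonBmp method are stand-ins written for this example so that it compiles without docx4j; the stand-in simply mirrors the documented return values.

```java
// Hypothetical helper, not part of docx4j.
final class NonBmpLoopSketch {
    private NonBmpLoopSketch() { }

    /** Local stand-in with the documented contract: 1 if codePoint > 0xFFFF, 0 otherwise. */
    static int incrementIfNonBmp(int codePoint) {
        return (codePoint > 0xFFFF) ? 1 : 0;
    }

    /** Counts code points by walking char indices and skipping low surrogates. */
    static int countCodePoints(String s) {
        int count = 0;
        for (int i = 0; i < s.length(); i++) {
            int cp = s.codePointAt(i);
            i += incrementIfNonBmp(cp);  // a non-BMP code point occupies two chars
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String s = "a\uD83D\uDE00b";  // 'a', U+1F600, 'b'
        System.out.println(s.length() + " chars, " + countCodePoints(s) + " code points");
    }
}
```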
    • isSurrogatePair

      public static boolean isSurrogatePair​(char ch)
      Determine if the given character is part of a surrogate pair.
      Parameters:
      ch - character to be checked
      Returns:
      true if ch is a high surrogate or a low surrogate
    • containsSurrogatePairAt

      public static boolean containsSurrogatePairAt​(java.lang.CharSequence chars, int index)
      Tells whether there is a surrogate pair starting from the given index in the CharSequence. If the character at index is a high surrogate, then the character at index+1 is checked to be a low surrogate. If a malformed surrogate pair is encountered, then an IllegalArgumentException is thrown.
       high surrogate [0xD800 - 0xDBFF]
       low surrogate [0xDC00 - 0xDFFF]
       
      Parameters:
      chars - CharSequence to check
      index - index in the CharSequence where to start the check
      Returns:
      true if there is a well-formed surrogate pair at index
      Throws:
      java.lang.IllegalArgumentException - if a malformed surrogate pair is encountered
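      A check of this kind can be sketched with the JDK's own surrogate tests. The names SurrogateCheckSketch and hasSurrogatePairAt are hypothetical, and treating an isolated low surrogate at index as malformed (rather than returning false) is an assumption of this sketch.

```java
// Hypothetical helper, not part of docx4j.
final class SurrogateCheckSketch {
    private SurrogateCheckSketch() { }

    /** True for a well-formed pair at index, false for an ordinary char, exception if malformed. */
    static boolean hasSurrogatePairAt(CharSequence chars, int index) {
        char ch = chars.charAt(index);
        if (Character.isHighSurrogate(ch)) {
            boolean hasNext = index + 1 < chars.length();
            if (hasNext && Character.isLowSurrogate(chars.charAt(index + 1))) {
                return true;
            }
            throw new IllegalArgumentException(
                    "high surrogate at index " + index + " is not followed by a low surrogate");
        }
        if (Character.isLowSurrogate(ch)) {
            // assumption: a low surrogate without a preceding high surrogate is an error
            throw new IllegalArgumentException("isolated low surrogate at index " + index);
        }
        return false;
    }

    public static void main(String[] args) {
        String s = "a\uD83D\uDE00";  // 'a' followed by U+1F600
        System.out.println(hasSurrogatePairAt(s, 0));
        System.out.println(hasSurrogatePairAt(s, 1));
    }
}
```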
    • codepointsIter

      public static java.lang.Iterable<java.lang.Integer> codepointsIter​(java.lang.CharSequence s)
      Creates an iterable over the code points of a CharSequence.
      Parameters:
      s - CharSequence to iterate over
      Returns:
      codepoint iterator for the given CharSequence.
      See Also:
      codepointsIter(CharSequence, int, int)
    • codepointsIter

      public static java.lang.Iterable<java.lang.Integer> codepointsIter​(java.lang.CharSequence s, int beginIndex, int endIndex)
      Creates an iterable over the code points of a sub-range of a CharSequence.
      Parameters:
      s - CharSequence to iterate over
      beginIndex - lower range
      endIndex - upper range
      Returns:
      codepoint iterator for the given sub-CharSequence.
      See Also:
      Bug JDK-5003547
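      Because both overloads return an Iterable<Integer>, they can be used directly in a for-each loop. The sketch below shows how such an iterable could be built on Character.codePointAt. The class and method names are hypothetical, and treating beginIndex as inclusive and endIndex as exclusive is an assumption; the sketch also does not guard against a range that ends in the middle of a surrogate pair.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

// Hypothetical helper, not part of docx4j.
final class CodePointIterSketch {
    private CodePointIterSketch() { }

    /** Iterates code points of s in [beginIndex, endIndex) (assumed range semantics). */
    static Iterable<Integer> codePoints(CharSequence s, int beginIndex, int endIndex) {
        return () -> new Iterator<Integer>() {
            private int pos = beginIndex;

            @Override
            public boolean hasNext() {
                return pos < endIndex;
            }

            @Override
            public Integer next() {
                if (!hasNext()) {
                    throw new NoSuchElementException();
                }
                int cp = Character.codePointAt(s, pos);
                pos += Character.charCount(cp);  // advance by 1 or 2 chars
                return cp;
            }
        };
    }

    /** Convenience overload covering the whole sequence. */
    static Iterable<Integer> codePoints(CharSequence s) {
        return codePoints(s, 0, s.length());
    }

    public static void main(String[] args) {
        for (int cp : codePoints("a\uD83D\uDE00b")) {
            System.out.println(Integer.toHexString(cp));
        }
    }
}
```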