Class CodePageUtil

java.lang.Object
org.docx4j.org.apache.poi.util.CodePageUtil

public class CodePageUtil
extends java.lang.Object
Utilities for working with Microsoft CodePages.

Provides constants for understanding numeric codepages, along with utilities to translate these into Java Character Sets.

  • Field Summary

    Fields
    Modifier and Type Field Description
    static int CP_037
    Codepage 037, a special case
    static int CP_EUC_JP
    Codepage for EUC-JP
    static int CP_EUC_KR
    Codepage for EUC-KR
    static int CP_GB18030
    Codepage for GB18030
    static int CP_GB2312
    Codepage for GB2312
    static int CP_GBK
    Codepage for GBK, aka MS936
    static int CP_ISO_2022_JP1
    Codepage for ISO-2022-JP
    static int CP_ISO_2022_JP2
    Another codepage for ISO-2022-JP
    static int CP_ISO_2022_JP3
    Yet another codepage for ISO-2022-JP
    static int CP_ISO_2022_KR
    Codepage for ISO-2022-KR
    static int CP_ISO_8859_1
    Codepage for ISO-8859-1
    static int CP_ISO_8859_2
    Codepage for ISO-8859-2
    static int CP_ISO_8859_3
    Codepage for ISO-8859-3
    static int CP_ISO_8859_4
    Codepage for ISO-8859-4
    static int CP_ISO_8859_5
    Codepage for ISO-8859-5
    static int CP_ISO_8859_6
    Codepage for ISO-8859-6
    static int CP_ISO_8859_7
    Codepage for ISO-8859-7
    static int CP_ISO_8859_8
    Codepage for ISO-8859-8
    static int CP_ISO_8859_9
    Codepage for ISO-8859-9
    static int CP_JOHAB
    Codepage for Johab
    static int CP_KOI8_R
    Codepage for KOI8-R
    static int CP_MAC_ARABIC
    Codepage for Macintosh Arabic (Java: MacArabic)
    static int CP_MAC_CENTRAL_EUROPE
    Codepage for Macintosh Central Europe (Latin-2) (Java: MacCentralEurope)
    static int CP_MAC_CHINESE_SIMPLE
    Codepage for Macintosh Chinese Simplified (Java: unknown - use EUC_CN, ISO2022_CN_GB, MS936 or cp935)
    static int CP_MAC_CHINESE_TRADITIONAL
    Codepage for Macintosh Chinese Traditional (Java: unknown - use Big5, MS950, or cp937)
    static int CP_MAC_CROATIAN
    Codepage for Macintosh Croatian (Java: MacCroatian)
    static int CP_MAC_CYRILLIC
    Codepage for Macintosh Cyrillic (Java: MacCyrillic)
    static int CP_MAC_GREEK
    Codepage for Macintosh Greek (Java: MacGreek)
    static int CP_MAC_HEBREW
    Codepage for Macintosh Hebrew (Java: MacHebrew)
    static int CP_MAC_ICELAND
    Codepage for Macintosh Iceland (Java: MacIceland)
    static int CP_MAC_JAPAN
    Codepage for Macintosh Japan (Java: unknown - use SJIS, cp942 or cp943)
    static int CP_MAC_KOREAN
    Codepage for Macintosh Korean (Java: unknown - use EUC_KR or cp949)
    static int CP_MAC_ROMAN
    Codepage for Macintosh Roman (Java: MacRoman)
    static int CP_MAC_ROMAN_BIFF23  
    static int CP_MAC_ROMANIA
    Codepage for Macintosh Romanian (Java: MacRomania)
    static int CP_MAC_THAI
    Codepage for Macintosh Thai (Java: MacThai)
    static int CP_MAC_TURKISH
    Codepage for Macintosh Turkish (Java: MacTurkish)
    static int CP_MAC_UKRAINE
    Codepage for Macintosh Ukrainian (Java: MacUkraine)
    static int CP_MS949
    Codepage for MS949
    static int CP_SJIS
    Codepage for SJIS
    static int CP_UNICODE
    Codepage for Unicode
    static int CP_US_ACSII
    Codepage for US-ASCII
    static int CP_US_ASCII2
    Another codepage for US-ASCII
    static int CP_UTF16
    Codepage for UTF-16
    static int CP_UTF16_BE
    Codepage for UTF-16 big-endian
    static int CP_UTF8
    Codepage for UTF-8
    static int CP_WINDOWS_1250
    Codepage for Windows 1250
    static int CP_WINDOWS_1251
    Codepage for Windows 1251
    static int CP_WINDOWS_1252
    Codepage for Windows 1252
    static int CP_WINDOWS_1252_BIFF23  
    static int CP_WINDOWS_1253
    Codepage for Windows 1253
    static int CP_WINDOWS_1254
    Codepage for Windows 1254
    static int CP_WINDOWS_1255
    Codepage for Windows 1255
    static int CP_WINDOWS_1256
    Codepage for Windows 1256
    static int CP_WINDOWS_1257
    Codepage for Windows 1257
    static int CP_WINDOWS_1258
    Codepage for Windows 1258
  • Constructor Summary

    Constructors
    Constructor Description
    CodePageUtil()  
  • Method Summary

    Modifier and Type Method Description
    static java.lang.String codepageToEncoding(int codepage)
    Turns a codepage number into the equivalent character encoding's name (in Java NIO canonical naming format).
    static java.lang.String codepageToEncoding(int codepage, boolean javaLangFormat)
    Turns a codepage number into the equivalent character encoding's name, in either Java NIO or Java Lang canonical naming.
    static byte[] getBytesInCodePage(java.lang.String string, int codepage)
    Converts a string into bytes, in the equivalent character encoding to the supplied codepage number.
    static java.lang.String getStringFromCodePage(byte[] string, int codepage)
    Converts the bytes into a String, based on the equivalent character encoding to the supplied codepage number.
    static java.lang.String getStringFromCodePage(byte[] string, int offset, int length, int codepage)
    Converts the bytes into a String, based on the equivalent character encoding to the supplied codepage number.

    Methods inherited from class java.lang.Object

    clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
  • Field Details

    • CP_037

      public static final int CP_037

      Codepage 037, a special case

      See Also:
      Constant Field Values
    • CP_SJIS

      public static final int CP_SJIS

      Codepage for SJIS

      See Also:
      Constant Field Values
    • CP_GBK

      public static final int CP_GBK

      Codepage for GBK, aka MS936

      See Also:
      Constant Field Values
    • CP_MS949

      public static final int CP_MS949

      Codepage for MS949

      See Also:
      Constant Field Values
    • CP_UTF16

      public static final int CP_UTF16

      Codepage for UTF-16

      See Also:
      Constant Field Values
    • CP_UTF16_BE

      public static final int CP_UTF16_BE

      Codepage for UTF-16 big-endian

      See Also:
      Constant Field Values
    • CP_WINDOWS_1250

      public static final int CP_WINDOWS_1250

      Codepage for Windows 1250

      See Also:
      Constant Field Values
    • CP_WINDOWS_1251

      public static final int CP_WINDOWS_1251

      Codepage for Windows 1251

      See Also:
      Constant Field Values
    • CP_WINDOWS_1252

      public static final int CP_WINDOWS_1252

      Codepage for Windows 1252

      See Also:
      Constant Field Values
    • CP_WINDOWS_1252_BIFF23

      public static final int CP_WINDOWS_1252_BIFF23
      See Also:
      Constant Field Values
    • CP_WINDOWS_1253

      public static final int CP_WINDOWS_1253

      Codepage for Windows 1253

      See Also:
      Constant Field Values
    • CP_WINDOWS_1254

      public static final int CP_WINDOWS_1254

      Codepage for Windows 1254

      See Also:
      Constant Field Values
    • CP_WINDOWS_1255

      public static final int CP_WINDOWS_1255

      Codepage for Windows 1255

      See Also:
      Constant Field Values
    • CP_WINDOWS_1256

      public static final int CP_WINDOWS_1256

      Codepage for Windows 1256

      See Also:
      Constant Field Values
    • CP_WINDOWS_1257

      public static final int CP_WINDOWS_1257

      Codepage for Windows 1257

      See Also:
      Constant Field Values
    • CP_WINDOWS_1258

      public static final int CP_WINDOWS_1258

      Codepage for Windows 1258

      See Also:
      Constant Field Values
    • CP_JOHAB

      public static final int CP_JOHAB

      Codepage for Johab

      See Also:
      Constant Field Values
    • CP_MAC_ROMAN

      public static final int CP_MAC_ROMAN

      Codepage for Macintosh Roman (Java: MacRoman)

      See Also:
      Constant Field Values
    • CP_MAC_ROMAN_BIFF23

      public static final int CP_MAC_ROMAN_BIFF23
      See Also:
      Constant Field Values
    • CP_MAC_JAPAN

      public static final int CP_MAC_JAPAN

      Codepage for Macintosh Japan (Java: unknown - use SJIS, cp942 or cp943)

      See Also:
      Constant Field Values
    • CP_MAC_CHINESE_TRADITIONAL

      public static final int CP_MAC_CHINESE_TRADITIONAL

      Codepage for Macintosh Chinese Traditional (Java: unknown - use Big5, MS950, or cp937)

      See Also:
      Constant Field Values
    • CP_MAC_KOREAN

      public static final int CP_MAC_KOREAN

      Codepage for Macintosh Korean (Java: unknown - use EUC_KR or cp949)

      See Also:
      Constant Field Values
    • CP_MAC_ARABIC

      public static final int CP_MAC_ARABIC

      Codepage for Macintosh Arabic (Java: MacArabic)

      See Also:
      Constant Field Values
    • CP_MAC_HEBREW

      public static final int CP_MAC_HEBREW

      Codepage for Macintosh Hebrew (Java: MacHebrew)

      See Also:
      Constant Field Values
    • CP_MAC_GREEK

      public static final int CP_MAC_GREEK

      Codepage for Macintosh Greek (Java: MacGreek)

      See Also:
      Constant Field Values
    • CP_MAC_CYRILLIC

      public static final int CP_MAC_CYRILLIC

      Codepage for Macintosh Cyrillic (Java: MacCyrillic)

      See Also:
      Constant Field Values
    • CP_MAC_CHINESE_SIMPLE

      public static final int CP_MAC_CHINESE_SIMPLE

      Codepage for Macintosh Chinese Simplified (Java: unknown - use EUC_CN, ISO2022_CN_GB, MS936 or cp935)

      See Also:
      Constant Field Values
    • CP_MAC_ROMANIA

      public static final int CP_MAC_ROMANIA

      Codepage for Macintosh Romanian (Java: MacRomania)

      See Also:
      Constant Field Values
    • CP_MAC_UKRAINE

      public static final int CP_MAC_UKRAINE

      Codepage for Macintosh Ukrainian (Java: MacUkraine)

      See Also:
      Constant Field Values
    • CP_MAC_THAI

      public static final int CP_MAC_THAI

      Codepage for Macintosh Thai (Java: MacThai)

      See Also:
      Constant Field Values
    • CP_MAC_CENTRAL_EUROPE

      public static final int CP_MAC_CENTRAL_EUROPE

      Codepage for Macintosh Central Europe (Latin-2) (Java: MacCentralEurope)

      See Also:
      Constant Field Values
    • CP_MAC_ICELAND

      public static final int CP_MAC_ICELAND

      Codepage for Macintosh Iceland (Java: MacIceland)

      See Also:
      Constant Field Values
    • CP_MAC_TURKISH

      public static final int CP_MAC_TURKISH

      Codepage for Macintosh Turkish (Java: MacTurkish)

      See Also:
      Constant Field Values
    • CP_MAC_CROATIAN

      public static final int CP_MAC_CROATIAN

      Codepage for Macintosh Croatian (Java: MacCroatian)

      See Also:
      Constant Field Values
    • CP_US_ACSII

      public static final int CP_US_ACSII

      Codepage for US-ASCII

      See Also:
      Constant Field Values
    • CP_KOI8_R

      public static final int CP_KOI8_R

      Codepage for KOI8-R

      See Also:
      Constant Field Values
    • CP_ISO_8859_1

      public static final int CP_ISO_8859_1

      Codepage for ISO-8859-1

      See Also:
      Constant Field Values
    • CP_ISO_8859_2

      public static final int CP_ISO_8859_2

      Codepage for ISO-8859-2

      See Also:
      Constant Field Values
    • CP_ISO_8859_3

      public static final int CP_ISO_8859_3

      Codepage for ISO-8859-3

      See Also:
      Constant Field Values
    • CP_ISO_8859_4

      public static final int CP_ISO_8859_4

      Codepage for ISO-8859-4

      See Also:
      Constant Field Values
    • CP_ISO_8859_5

      public static final int CP_ISO_8859_5

      Codepage for ISO-8859-5

      See Also:
      Constant Field Values
    • CP_ISO_8859_6

      public static final int CP_ISO_8859_6

      Codepage for ISO-8859-6

      See Also:
      Constant Field Values
    • CP_ISO_8859_7

      public static final int CP_ISO_8859_7

      Codepage for ISO-8859-7

      See Also:
      Constant Field Values
    • CP_ISO_8859_8

      public static final int CP_ISO_8859_8

      Codepage for ISO-8859-8

      See Also:
      Constant Field Values
    • CP_ISO_8859_9

      public static final int CP_ISO_8859_9

      Codepage for ISO-8859-9

      See Also:
      Constant Field Values
    • CP_ISO_2022_JP1

      public static final int CP_ISO_2022_JP1

      Codepage for ISO-2022-JP

      See Also:
      Constant Field Values
    • CP_ISO_2022_JP2

      public static final int CP_ISO_2022_JP2

      Another codepage for ISO-2022-JP

      See Also:
      Constant Field Values
    • CP_ISO_2022_JP3

      public static final int CP_ISO_2022_JP3

      Yet another codepage for ISO-2022-JP

      See Also:
      Constant Field Values
    • CP_ISO_2022_KR

      public static final int CP_ISO_2022_KR

      Codepage for ISO-2022-KR

      See Also:
      Constant Field Values
    • CP_EUC_JP

      public static final int CP_EUC_JP

      Codepage for EUC-JP

      See Also:
      Constant Field Values
    • CP_EUC_KR

      public static final int CP_EUC_KR

      Codepage for EUC-KR

      See Also:
      Constant Field Values
    • CP_GB2312

      public static final int CP_GB2312

      Codepage for GB2312

      See Also:
      Constant Field Values
    • CP_GB18030

      public static final int CP_GB18030

      Codepage for GB18030

      See Also:
      Constant Field Values
    • CP_US_ASCII2

      public static final int CP_US_ASCII2

      Another codepage for US-ASCII

      See Also:
      Constant Field Values
    • CP_UTF8

      public static final int CP_UTF8

      Codepage for UTF-8

      See Also:
      Constant Field Values
    • CP_UNICODE

      public static final int CP_UNICODE

      Codepage for Unicode

      See Also:
      Constant Field Values
  • Constructor Details

    • CodePageUtil

      public CodePageUtil()
  • Method Details

    • getBytesInCodePage

      public static byte[] getBytesInCodePage(java.lang.String string, int codepage) throws java.io.UnsupportedEncodingException
      Converts a string into bytes, in the equivalent character encoding to the supplied codepage number.
      Parameters:
      string - The string to convert
      codepage - The codepage number
      Throws:
      java.io.UnsupportedEncodingException
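
      To illustrate the conversion described above, here is a minimal JDK-only sketch: look up a charset name for the codepage number, then encode the string with it. The class `EncodeSketch`, its helper names, and the two-entry lookup are hypothetical stand-ins, not this library's code. The number 28591 for ISO-8859-1 is an assumption taken from Microsoft's codepage numbering and is not stated on this page.

```java
import java.io.UnsupportedEncodingException;

class EncodeSketch {
    // Hypothetical stand-in for the codepage lookup; only two codepages are covered.
    static String encodingName(int codepage) throws UnsupportedEncodingException {
        switch (codepage) {
            case 65001: return "UTF-8";      // stated on this page
            case 28591: return "ISO-8859-1"; // assumption: Microsoft's number for ISO-8859-1
            default: throw new UnsupportedEncodingException("Unsupported codepage: " + codepage);
        }
    }

    // Encode the text with the charset that corresponds to the codepage number.
    static byte[] bytesInCodePage(String text, int codepage) throws UnsupportedEncodingException {
        return text.getBytes(encodingName(codepage));
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        String text = "caf\u00e9"; // "cafe" with an acute accent on the final letter
        System.out.println(bytesInCodePage(text, 65001).length); // prints 5
        System.out.println(bytesInCodePage(text, 28591).length); // prints 4
    }
}
```

      The same text yields different byte counts because the accented letter takes two bytes in UTF-8 and one byte in ISO-8859-1, which is why the codepage number has to travel with the bytes.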
    • getStringFromCodePage

      public static java.lang.String getStringFromCodePage(byte[] string, int codepage) throws java.io.UnsupportedEncodingException
      Converts the bytes into a String, based on the equivalent character encoding to the supplied codepage number.
      Parameters:
      string - The bytes of the string to convert
      codepage - The codepage number
      Throws:
      java.io.UnsupportedEncodingException
    • getStringFromCodePage

      public static java.lang.String getStringFromCodePage(byte[] string, int offset, int length, int codepage) throws java.io.UnsupportedEncodingException
      Converts the bytes into a String, based on the equivalent character encoding to the supplied codepage number.
      Parameters:
      string - The bytes of the string to convert
      offset - The index of the first byte to convert
      length - The number of bytes to convert
      codepage - The codepage number
      Throws:
      java.io.UnsupportedEncodingException
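
      The two decoding overloads can be pictured with the following JDK-only sketch, which decodes either a whole array or a slice of it. `DecodeSketch` and its methods are hypothetical and only mirror the documented signatures. The two-entry codepage lookup is a stand-in, and 28591 for ISO-8859-1 is an assumption based on Microsoft's codepage numbering.

```java
import java.io.UnsupportedEncodingException;

class DecodeSketch {
    // Hypothetical stand-in for the codepage lookup; only two codepages are covered.
    static String encodingName(int codepage) throws UnsupportedEncodingException {
        switch (codepage) {
            case 65001: return "UTF-8";      // stated on this page
            case 28591: return "ISO-8859-1"; // assumption: Microsoft's number for ISO-8859-1
            default: throw new UnsupportedEncodingException("Unsupported codepage: " + codepage);
        }
    }

    // Decode `length` bytes starting at `offset`.
    static String stringFromCodePage(byte[] data, int offset, int length, int codepage)
            throws UnsupportedEncodingException {
        return new String(data, offset, length, encodingName(codepage));
    }

    // Decode the whole array by delegating to the slice variant.
    static String stringFromCodePage(byte[] data, int codepage) throws UnsupportedEncodingException {
        return stringFromCodePage(data, 0, data.length, codepage);
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        // A made-up record: two header bytes followed by the text "Hi".
        byte[] record = {0x01, 0x02, 'H', 'i'};
        System.out.println(stringFromCodePage(record, 2, 2, 65001));          // prints Hi
        System.out.println(stringFromCodePage(new byte[] {'O', 'K'}, 28591)); // prints OK
    }
}
```

      The slice variant is convenient when the text is embedded in a larger buffer, since no intermediate copy of the bytes is needed.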
    • codepageToEncoding

      public static java.lang.String codepageToEncoding(int codepage) throws java.io.UnsupportedEncodingException

      Turns a codepage number into the equivalent character encoding's name (in Java NIO canonical naming format).

      Parameters:
      codepage - The codepage number
      Returns:
      The character encoding's name. If the codepage number is 65001, the encoding name is "UTF-8". All other positive numbers are mapped to their Java NIO names, normally either "windows-" followed by the number, e.g. "windows-1251", or "cp" followed by the number, e.g. "cp1252" for codepage number 1252.
      Throws:
      java.io.UnsupportedEncodingException - if the specified codepage is less than zero.
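
      The naming rule in the Returns text can be sketched as follows. This is a deliberately simplified, hypothetical re-statement of only what the text above says (65001, the "windows-" example, the "cp" fallback, and the exception for negative numbers). It is not the library's lookup table, and `NioNameSketch` and `toNioName` are invented names.

```java
import java.io.UnsupportedEncodingException;
import java.nio.charset.Charset;

class NioNameSketch {
    // Hypothetical, simplified mapping that follows the documented rule only.
    static String toNioName(int codepage) throws UnsupportedEncodingException {
        if (codepage < 0) {
            throw new UnsupportedEncodingException("Codepage number may not be negative: " + codepage);
        }
        switch (codepage) {
            case 65001: return "UTF-8";        // special case named in the text
            case 1251:  return "windows-1251"; // the "windows-" example from the text
            default:    return "cp" + codepage; // the "cp" + number form
        }
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        System.out.println(toNioName(65001)); // prints UTF-8
        System.out.println(toNioName(1251));  // prints windows-1251
        System.out.println(toNioName(1252));  // prints cp1252
        // A returned name can be checked against the running JVM before it is used.
        System.out.println(Charset.isSupported(toNioName(65001))); // prints true
    }
}
```

      Checking the name with `Charset.isSupported` before decoding is one way to fail early, because not every JVM ships every charset.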
    • codepageToEncoding

      public static java.lang.String codepageToEncoding(int codepage, boolean javaLangFormat) throws java.io.UnsupportedEncodingException

      Turns a codepage number into the equivalent character encoding's name, in either Java NIO or Java Lang canonical naming.

      Parameters:
      codepage - The codepage number
      javaLangFormat - true to use Java Lang naming, false to use Java NIO naming
      Returns:
      The character encoding's name, in either Java Lang format (e.g. Cp1251, ISO8859_5) or Java NIO format (e.g. windows-1252, ISO-8859-9)
      Throws:
      java.io.UnsupportedEncodingException - if the specified codepage is less than zero.
      See Also:
      Supported Encodings
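
      The difference between the two naming styles can be shown with a small JDK-only sketch for the ISO-8859 family. `NamingStyleSketch` and `isoName` are hypothetical. The codepage numbers 28591 to 28599 for ISO-8859-1 to ISO-8859-9 are an assumption taken from Microsoft's codepage numbering, not from this page. Only the name formats (ISO8859_5, ISO-8859-9) come from the text above.

```java
import java.io.UnsupportedEncodingException;
import java.nio.charset.Charset;

class NamingStyleSketch {
    // Hypothetical helper covering only ISO-8859-1 .. ISO-8859-9.
    static String isoName(int codepage, boolean javaLangFormat) throws UnsupportedEncodingException {
        if (codepage < 28591 || codepage > 28599) {
            throw new UnsupportedEncodingException("Not an ISO-8859 codepage in this sketch: " + codepage);
        }
        int part = codepage - 28590; // 28591 -> part 1, 28599 -> part 9
        return javaLangFormat ? "ISO8859_" + part : "ISO-8859-" + part;
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        System.out.println(isoName(28595, true));  // prints ISO8859_5
        System.out.println(isoName(28599, false)); // prints ISO-8859-9

        // The JDK accepts the Java Lang style name as an alias of the NIO canonical name.
        Charset legacy = Charset.forName(isoName(28591, true));
        Charset nio = Charset.forName(isoName(28591, false));
        System.out.println(legacy.equals(nio)); // prints true
        System.out.println(legacy.name());      // prints ISO-8859-1
    }
}
```

      Either style usually resolves to the same charset, so the flag mainly matters when the name itself is stored or compared as a string.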