java.lang.Object

org.opensextant.util.TextUtils

public class TextUtils extends Object

Author: ubaldino

Field Summary

Fields

Modifier and Type

Field

Description

static final int

ABBREV_MAX_LEN

static final String

arabicLang

static final String

bahasaLang

static final int

CASE_LOWER

static final int

CASE_UPPER

static final String

chineseLang

static final String

chineseTradLang

static final char

CR

static final char

DEL

static final String

englishLang

static final String

farsiLang

static final String

frenchLang

static final String

germanLang

static final Pattern

hashtagPattern1

Find any pattern "ABC#[ABC 123]" -- a hashtag with whitespace.

static final Pattern

hashtagPattern2

Find any pattern "#ABC123" -- normal hashtag, Java Regex note: UNICODE flags are important, otherwise "\w" and other classes match only ASCII.

static final String

italianLang

static final String

japaneseLang

static final String

koreanLang

static final char

NL

static final String

portugueseLang

static final String

romanianLang

static final String

russianLang

static final char

SP

static final String

spanishLang

static final char

TAB

static final String

thaiLang

static final String

turkishLang

static final String

vietnameseLang
Constructor Summary

Constructors

Constructor

Description

TextUtils()
Method Summary

Modifier and Type

Method

Description

static void

addLanguage(Language lg)

static void

addLanguage(Language lg, boolean override)

Extend the basic language dictionary.

static String

b2hex(byte[] barr)

static boolean

checkCase(String text, int textcase)

Detects if the alphabetic chars of a string are purely lower case or purely upper case, per the given textcase.

static byte[]

compress(String buf)

Compress bytes from a Unicode string.

static byte[]

compress(String buf, String charset)

static int

count_digits(String txt)

Counts all digits in text.

static int

count_ws(String txt)

Counts all whitespace in text.

static int

countASCIIChars(byte[] data)

count the number of ASCII bytes

static int

countCJKChars(char[] chars)

Counts the CJK characters in the buffer. Inspiration: http://stackoverflow.com/questions/1499804/how-can-i-detect-japanese-text-in-a-java-string Assumption is that the char array is Unicode characters.

static int

countDigits(String txt)

static int

countFormattingSpace(String txt)

Count formatting whitespace.

static int

countIrregularPunctuation(String t)

static int

countNonText(String t)

Count the number of non-alphanumeric chars present.

static String

delete_controls(String t)

Delete control chars from text data; leaving text and whitespace only.

static String

delete_eol(String t)

Replace line endings with SPACE

static String

delete_whitespace(String t)

Delete whitespace of any sort.

static String

fast_replace(String buf, String replace, String substitution)

Given a string S and a list of characters to replace with a substitute, return the new string, S'.

static int[]

get_text_window(int offset, int textsize, int width)

Get a single text window around the offset.

static int[]

get_text_window(int offset, int matchlen, int textsize, int width)

Find the text window(s) around a match.

static Language

getLanguage(String code)

ISO2 and ISO3 char codes for languages are unique.

static String

getLanguageCode(String code)

ISO2 and ISO3 char codes for languages are unique.

static Map<String,Language>

getLanguageMap()

If caller wants to add language they can.

static String

getLanguageName(String code)

Given an ISO2 char code (least common denominator) retrieve Language Name.

static boolean

hasCJKText(String buf)

A simple test to see if text has any CJK characters at all.

static final boolean

hasDiacritics(String s)

If a string has extended latin diacritics.

static boolean

hasDigits(String txt)

static boolean

hasIrregularPunctuation(String t)

Simple triage of punctuation.

static final boolean

hasMiddleEasternText(String data)

Detects the first Arabic or Hebrew character for now -- will be more comprehensive in scoping "Middle Eastern" scripts in text.

static void

initLanguageData()

Initialize language codes and metadata.

static void

initLOCLanguageData()

This is Library of Congress data for language IDs.

static boolean

isAbbreviation(String txt)

static boolean

isAbbreviation(String orig, boolean useCase)

Define what an acronym is: A.B.

static boolean

isASCII(byte[] data)

static final boolean

isASCII(char c)

static boolean

isASCII(String t)

Early exit test -- return false on first non-ASCII character found.

static final boolean

isASCIILetter(char c)

static boolean

isChinese(Character.UnicodeBlock blk)

static boolean

isChinese(String x)

Utility method to check if lang ID is Chinese(Traditional or Simplified)...

static boolean

isCJK(Character.UnicodeBlock blk)

static boolean

isCJK(String x)

Utility method to check if lang ID is Chinese, Korean, or Japanese

static boolean

isEnglish(String x)

Utility method to check if lang ID is English...

static boolean

isEuroLanguage(String l)

European languages = Romance + GER + ENG. Extend definition as needed.

static boolean

isJapanese(Character.UnicodeBlock blk)

Checks if char block is uniquely Japanese.

static boolean

isKorean(Character.UnicodeBlock blk)

Likely to be uniquely Korean if the character block is in Hangul.

static final boolean

isLatin(String data)

Checks if non-ASCII and non-LATIN characters are present.

static boolean

isLower(String dat)

static boolean

isLowerCaseDocument(int[] counts)

Measures the amount of lower case; see isUpperCaseDocument(int[]) for the upper case counterpart.

static final boolean

isNumeric(String v)

Determine if a string is numeric in nature, not necessarily a parsable number.

static boolean

isRomanceLanguage(String l)

Romance languages = SPA + POR + ITA + FRA + ROM. Extend definition as needed.

static boolean

isUpper(String dat)

For measuring the upper-case-ness of short texts.

static boolean

isUpperCaseDocument(int[] counts)

First measureCase(Text) to acquire counts, then call this routine for a heuristic that suggests the text is mainly upper case.

static String

md5_id(byte[] digest)

Deprecated.
not MD5 specific.

static int[]

measureCase(String text)

Measure character count, upper, lower, non-Character, whitespace

static double

measureCJKText(String buf)

Returns a ratio of Chinese/Japanese/Korean characters: CJK chars / ALL. TODO: needs testing; not sure if this is sustainable if block; or if it is comprehensive.

static String

normalizeAbbreviation(String word)

Intended only as a filter for punctuation within a word.

static String

normalizeTextEntity(String str)

Normalization: Clean the ends, Remove Line-endings from middle of entity.

static String

normalizeUnicode(String str)

Normalize to "Normalization Form Canonical Decomposition" (NFD). REF: http://stackoverflow.com/questions/3610013/file-listfiles-mangles-unicode-names-with-jdk-6-unicode-normalization-issues This supports proper file name retrieval from file system, among other things.

static final Date

parseDate(String dt)

A limited-scope date parsing: Parse properly formatted strings for example, ISO date/time strings stored in one of our Solr indices.

static Set<String>

parseHashTags(String tweetText)

Parse the typical Twitter hashtag variants.

static Set<String>

parseHashTags(String tweetText, boolean normalize)

Takes a string and returns all the hashtags in it.

static String

parseNaturalLanguage(String raw)

see default implementation below

static String

parseNaturalLanguage(String raw, boolean unescapeHtml, boolean remURLs, boolean remTags, boolean remEntities)

Given tweet text or any [social media] text, remove entities or other markers:
- URLs are removed
- entities are stripped of "@"
- hashtags are stripped of "#"
- HTML: &amp; is converted to an ampersand
- HTML: escaped angle brackets are replaced with { and } for gt and lt, respectively
- HTML: remaining special chars are converted back to unicode; remaining ampersand is replaced with "+"
Whitespaces (space, newlines, tabs, etc.) are reduced.

static String

phoneticReduction(String t)

Create a non-diacritic, ASCII version of the input string.

static String

phoneticReduction(String t, boolean isAscii)

static String

reduce_line_breaks(String t)

Replaces all 3 or more blank lines with a single paragraph break (\n\n)

static String

removeAny(String buf, String remove)

Remove instances of any char in the remove string from buf

static String

removeAnyLeft(String buf, String remove)

Compare to trim( string, chars ), but you can trim any chars. Example: given "- a b c", remove "-" from the left of the string.

static String

removeDiacritics(String word)

Supports Phoneticizer utility from OpenSextant v1.x. Remove diacritics from a phrase.

static String

removeEmoticons(String t)

replace Emoticons with something less nefarious -- UTF-16 characters do not play well with some I/O routines.

static String

removePunctuation(String word)

Remove any leading and trailing punctuation and some internal punctuation.

static String

removeSymbols(String t)

Replace symbology

static String

replaceAny(String buf, String remove, String sub)

Replace any of the removal chars with the sub.

static final String

replaceDiacritics(String s)

A thorough replacement of diacritics and Unicode chars to their ASCII equivalents.

static String

replaceDiacriticsOriginal(String s)

Deprecated.
See replaceDiacritics as the replacement.

static String

squeeze_whitespace(String t)

Minimize whitespace.

static List<String>

string2list(String s, String delim)

Get a list of values into a nice, scrubbed array of values, no whitespace.

static String

text_id(String text)

Static method -- use only if you are sure of thread-safety.

static String[]

tokens(String str)

Return just white-space delimited tokens.

static final String[]

tokensLeft(String str)

See tokensRight()

static final String[]

tokensRight(String str)

Return tokens on the right most part of a buffer.

static String

uncompress(byte[] gzData)

static String

uncompress(byte[] gzData, String charset)

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Field Details
- NL
  
  public static final char NL
  See Also:
  
  Constant Field Values
- CR
  
  public static final char CR
  See Also:
  
  Constant Field Values
- SP
  
  public static final char SP
  See Also:
  
  Constant Field Values
- TAB
  
  public static final char TAB
  See Also:
  
  Constant Field Values
- DEL
  
  public static final char DEL
  See Also:
  
  Constant Field Values
- CASE_LOWER
  
  public static final int CASE_LOWER
  See Also:
  
  Constant Field Values
- CASE_UPPER
  
  public static final int CASE_UPPER
  See Also:
  
  Constant Field Values
- ABBREV_MAX_LEN
  
  public static final int ABBREV_MAX_LEN
  See Also:
  
  Constant Field Values
- arabicLang
  
  public static final String arabicLang
  See Also:
  
  Constant Field Values
- bahasaLang
  
  public static final String bahasaLang
  See Also:
  
  Constant Field Values
- chineseLang
  
  public static final String chineseLang
  See Also:
  
  Constant Field Values
- chineseTradLang
  
  public static final String chineseTradLang
  See Also:
  
  Constant Field Values
- englishLang
  
  public static final String englishLang
  See Also:
  
  Constant Field Values
- farsiLang
  
  public static final String farsiLang
  See Also:
  
  Constant Field Values
- frenchLang
  
  public static final String frenchLang
  See Also:
  
  Constant Field Values
- germanLang
  
  public static final String germanLang
  See Also:
  
  Constant Field Values
- italianLang
  
  public static final String italianLang
  See Also:
  
  Constant Field Values
- japaneseLang
  
  public static final String japaneseLang
  See Also:
  
  Constant Field Values
- koreanLang
  
  public static final String koreanLang
  See Also:
  
  Constant Field Values
- portugueseLang
  
  public static final String portugueseLang
  See Also:
  
  Constant Field Values
- russianLang
  
  public static final String russianLang
  See Also:
  
  Constant Field Values
- spanishLang
  
  public static final String spanishLang
  See Also:
  
  Constant Field Values
- turkishLang
  
  public static final String turkishLang
  See Also:
  
  Constant Field Values
- thaiLang
  
  public static final String thaiLang
  See Also:
  
  Constant Field Values
- vietnameseLang
  
  public static final String vietnameseLang
  See Also:
  
  Constant Field Values
- romanianLang
  
  public static final String romanianLang
  See Also:
  
  Constant Field Values
- hashtagPattern1
  
  public static final Pattern hashtagPattern1
  
  Find any pattern "ABC#[ABC 123]" -- a hashtag with whitespace. Java Regex note: UNICODE flags are important, otherwise "\w" and other classes match only ASCII. NOTE: These are Twitter hashtags primarily
- hashtagPattern2
  
  public static final Pattern hashtagPattern2
  
  Find any pattern "#ABC123" -- normal hashtag, Java Regex note: UNICODE flags are important, otherwise "\w" and other classes match only ASCII. NOTE: These are Twitter hashtags primarily
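The Unicode-flag note above can be checked with a small, self-contained sketch. The pattern `#(\w+)` and the class `HashtagFlagSketch` are hypothetical stand-ins, not the expressions compiled in TextUtils; the sketch only shows how `Pattern.UNICODE_CHARACTER_CLASS` changes what `\w` matches.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class HashtagFlagSketch {
    // Hypothetical patterns for illustration; TextUtils defines its own.
    static final Pattern ASCII_ONLY = Pattern.compile("#(\\w+)");
    static final Pattern UNICODE_AWARE =
            Pattern.compile("#(\\w+)", Pattern.UNICODE_CHARACTER_CLASS);

    // Collect the text following each '#' that the pattern matches.
    static List<String> find(Pattern p, String text) {
        List<String> tags = new ArrayList<>();
        Matcher m = p.matcher(text);
        while (m.find()) {
            tags.add(m.group(1));
        }
        return tags;
    }

    public static void main(String[] args) {
        String text = "Visiting #M\u00fcnchen and #Paris";
        System.out.println(find(ASCII_ONLY, text));     // tag is cut at the first non-ASCII letter
        System.out.println(find(UNICODE_AWARE, text));  // full tag is captured
    }
}
```

Without the flag, the first tag is truncated at the non-ASCII letter, which is the failure mode the note warns about.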
Constructor Details
- TextUtils
  
  public TextUtils()
Method Details
- hasIrregularPunctuation
  
  public static boolean hasIrregularPunctuation(String t)
  Simple triage of punctuation. Rationale: OpenSextant taggers maximize RECALL in favor of not missing a possible match. The problem there is we often encounter substantial noise with tagger output, so a trivial test is to see if we have overmatched: Allowed Punctuation: , . - _ ` ' ( ) ## Diacritics, Parenthetics, periods/dashes.
  Given phrase "A B C" we may have matched: "A|B+C", "A; B; C", "A <B> C" etc... where common punctuation separates valid tokens that appear in the reference phrase.
  Parameters:
  
  t -
  
  Returns:
- countIrregularPunctuation
  
  public static int countIrregularPunctuation(String t)
- isLatin
  
  public static final boolean isLatin(String data)
  
  Checks if non-ASCII and non-LATIN characters are present.
  
  Parameters:
  
  data - any textual data
  
  Returns:
  
  true if content is strictly ASCII or Latin1 extended.
- hasMiddleEasternText
  
  public static final boolean hasMiddleEasternText(String data)
  
  Detects the first Arabic or Hebrew character for now -- will be more comprehensive in scoping "Middle Eastern" scripts in text.
  
  Parameters:
  
  data -
  
  Returns:
- hasDiacritics
  
  public static final boolean hasDiacritics(String s)
  
  If a string has extended latin diacritics.
  
  Parameters:
  
  s - string to test
  
  Returns:
  
  true if a single diacritic is found.
- phoneticReduction
  
  public static String phoneticReduction(String t)
  
  Create a non-diacritic, ASCII version of the input string. This will also have original whitespace, but will have removed non-character markings, e.g. "Za'tut" => "Zatut" not "Za tut"
  
  Parameters:
  
  t -
  
  Returns:
- phoneticReduction
  
  public static String phoneticReduction(String t, boolean isAscii)
- replaceDiacritics
  
  public static final String replaceDiacritics(String s)
  
  A thorough replacement of diacritics and Unicode chars to their ASCII equivalents.
  
  Parameters:
  
  s - the string
  
  Returns:
  
  converted string
- replaceDiacriticsOriginal
  
  @Deprecated public static String replaceDiacriticsOriginal(String s)
  
  Deprecated.
  See replaceDiacritics as the replacement.
  
  Remove accents from a string and replace with ASCII equivalent. Reference: http://www.rgagnon.com/javadetails/java-0456.html Caveat: This implementation is not exhaustive.
  Parameters:
  
  s -
  
  Returns:
  
  See Also:
  
  replaceDiacritics(String)
- isASCII
  
  public static final boolean isASCII(char c)
  
  Parameters:
  
  c - a character
  
  Returns:
  
  true if c is ASCII
- isASCIILetter
  
  public static final boolean isASCIILetter(char c)
  
  Parameters:
  
  c - character
  
  Returns:
  
  true if c is ASCII a-z or A-Z
- isASCII
  
  public static boolean isASCII(byte[] data)
  
  Parameters:
  
  data - bytes to test
  
  Returns:
  
  true if data is ASCII, false otherwise
- isASCII
  
  public static boolean isASCII(String t)
  
  Early exit test -- return false on first non-ASCII character found.
  
  Parameters:
  
  t - buffer of text
  
  Returns:
  
  true only if every char is in ASCII table.
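A minimal sketch of the early-exit test described above, assuming "ASCII" means code units up to 0x7F. The class and method names are hypothetical, and how the library treats null or empty input is not stated in this page.

```java
final class AsciiCheckSketch {
    // Returns false at the first char outside the 7-bit ASCII range.
    static boolean isAscii(String text) {
        for (int i = 0; i < text.length(); i++) {
            if (text.charAt(i) > 0x7F) {
                return false; // early exit: no need to scan the rest
            }
        }
        return true; // assumption: an empty string counts as ASCII
    }

    public static void main(String[] args) {
        System.out.println(isAscii("plain text"));
        System.out.println(isAscii("caf\u00e9"));
    }
}
```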
- countASCIIChars
  
  public static int countASCIIChars(byte[] data)
  
  count the number of ASCII bytes
  
  Parameters:
  
  data - bytes to count
  
  Returns:
  
  count of ASCII bytes
- reduce_line_breaks
  
  public static String reduce_line_breaks(String t)
  
  Replaces all 3 or more blank lines with a single paragraph break (\n\n)
  
  Parameters:
  
  t - text
  
  Returns:
  
  A string with fewer line breaks;
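One way to read the description is sketched below: any run of three or more consecutive line endings collapses to `\n\n`. That reading, the regular expression, and the class name are assumptions; the library may define "blank line" differently (for example, allowing spaces on the blank lines).

```java
final class LineBreakSketch {
    // Collapse runs of 3+ line endings (LF or CRLF) into one paragraph break.
    static String reduce(String t) {
        return t.replaceAll("(?:\\r?\\n){3,}", "\n\n");
    }

    public static void main(String[] args) {
        System.out.println(reduce("para one\n\n\n\npara two"));
    }
}
```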
- delete_whitespace
  
  public static String delete_whitespace(String t)
  
  Delete whitespace of any sort.
  
  Parameters:
  
  t - text
  
  Returns:
  
  String, without whitespace.
- squeeze_whitespace
  
  public static String squeeze_whitespace(String t)
  
  Minimize whitespace.
  
  Parameters:
  
  t - text
  
  Returns:
  
  scrubbed string
- delete_eol
  
  public static String delete_eol(String t)
  
  Replace line endings with SPACE
  
  Parameters:
  
  t - text
  
  Returns:
  
  scrubbed string
- delete_controls
  
  public static String delete_controls(String t)
  
  Delete control chars from text data; leaving text and whitespace only. Delete char (^?) is also removed. Length may differ if ctl chars are removed.
  
  Parameters:
  
  t - text
  
  Returns:
  
  scrubbed buffer
- hasDigits
  
  public static boolean hasDigits(String txt)
- countDigits
  
  public static int countDigits(String txt)
- count_digits
  
  public static int count_digits(String txt)
  
  Counts all digits in text.
  
  Parameters:
  
  txt - text to count
  
  Returns:
  
  count of digits
- isNumeric
  
  public static final boolean isNumeric(String v)
  
  Determine if a string is numeric in nature, not necessarily a parsable number. 0-9 or "-+.E" are valid symbols. Example -- 11111E.00003333 is Numeric; by contrast, Apache Commons StringUtils.isNumeric only detects digits.
  
  Parameters:
  
  v - val to parse
  
  Returns:
  
  true if val is a numeric sequence, symbols allowed.
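The rule above (digits plus the symbols `-+.E`) can be sketched as a character filter. This is an illustration under stated assumptions, not the library's code: the name `isNumericLike` is hypothetical, and returning false for null or empty input is a guess.

```java
final class NumericLikeSketch {
    private static final String SYMBOLS = "-+.E";

    // True when every char is 0-9 or one of "-+.E"; the value need not parse.
    static boolean isNumericLike(String v) {
        if (v == null || v.isEmpty()) {
            return false; // assumption: nothing to test means not numeric
        }
        for (int i = 0; i < v.length(); i++) {
            char c = v.charAt(i);
            boolean digit = c >= '0' && c <= '9';
            if (!digit && SYMBOLS.indexOf(c) < 0) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isNumericLike("11111E.00003333"));
        System.out.println(isNumericLike("12a"));
    }
}
```

Note that such a filter accepts strings like `11111E.00003333` that `Double.parseDouble` would reject, which matches the "not necessarily a parsable number" wording.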
- count_ws
  
  public static int count_ws(String txt)
  
  Counts all whitespace in text.
  
  Parameters:
  
  txt - text
  
  Returns:
  
  whitespace count
- countFormattingSpace
  
  public static int countFormattingSpace(String txt)
  
  Count formatting whitespace. This is helpful in determining if text spans are phrases with multiple TAB or EOL characters. For that matter, any control character contributes to formatting in some way. DEL, VT, HT, etc. So all control characters ( c < ' ') are counted.
  
  Parameters:
  
  txt - input string
  
  Returns:
  
  count of format chars
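Taking the stated condition `c < ' '` literally, the count can be sketched as follows. The class name is hypothetical; note that under this literal condition a plain space is not counted, and neither is DEL (0x7F), even though the description mentions it.

```java
final class FormattingSpaceSketch {
    // Count chars below the space character: TAB, LF, CR, VT and other controls.
    static int count(String txt) {
        int n = 0;
        for (int i = 0; i < txt.length(); i++) {
            if (txt.charAt(i) < ' ') {
                n++;
            }
        }
        return n;
    }

    public static void main(String[] args) {
        System.out.println(count("col1\tcol2\nrow two"));
    }
}
```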
- isUpper
  
  public static boolean isUpper(String dat)
  
  For measuring the upper-case-ness of short texts. Returns true if ALL letters in text are UPPERCASE. Allows for non-letters in text.
  
  Parameters:
  
  dat - text or data
  
  Returns:
  
  true if text is Upper
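A rough sketch of the "all letters upper case, non-letters allowed" rule, using `Character.isLetter` and `Character.isUpperCase`. Requiring at least one letter is an assumption made here, and the class name is hypothetical.

```java
final class UpperCaseSketch {
    // True if every letter is upper case; digits, spaces and punctuation are ignored.
    static boolean isUpper(String dat) {
        boolean sawLetter = false;
        for (int i = 0; i < dat.length(); i++) {
            char c = dat.charAt(i);
            if (Character.isLetter(c)) {
                if (!Character.isUpperCase(c)) {
                    return false;
                }
                sawLetter = true;
            }
        }
        return sawLetter; // assumption: text without letters is not "upper"
    }

    public static void main(String[] args) {
        System.out.println(isUpper("ABC 123!"));
        System.out.println(isUpper("ABc"));
    }
}
```

Letters from scripts without case (CJK, for example) are neither upper nor lower case in Java, so this sketch reports false for them.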
- isLower
  
  public static boolean isLower(String dat)
- checkCase
  
  public static boolean checkCase(String text, int textcase)
  
  Detects if the alphabetic chars of a string are purely lower case or purely upper case, per the given textcase.
  
  Parameters:
  
  text - text
  
  textcase - 1 lower, 2 upper
  
  Returns:
  
  if case matches given textcase param
- measureCase
  
  public static int[] measureCase(String text)
  
  Measure character count, upper, lower, non-Character, whitespace
  
  Parameters:
  
  text - text
  
  Returns:
  
  int array with counts.
- isUpperCaseDocument
  
  public static boolean isUpperCaseDocument(int[] counts)
  
  First measureCase(Text) to acquire counts, then call this routine for a heuristic that suggests the text is mainly upper case. These routines may not work well on languages that are not Latin-alphabet.
  
  Parameters:
  
  counts - word stats from measureCase()
  
  Returns:
  
  true if counts represent text that exceeds the "UPPER CASE" threshold
- isLowerCaseDocument
  
  public static boolean isLowerCaseDocument(int[] counts)
  
  Measures the amount of lower case; see isUpperCaseDocument(int[]) for the upper case counterpart. Two methods to measure: lower case count compared to all content (char + non-char), or compared to just char content.
  
  Parameters:
  
  counts - word stats from measureCase()
  
  Returns:
  
  true if counts represent text that exceeds the "lower case" threshold
- get_text_window
  
  public static int[] get_text_window(int offset, int matchlen, int textsize, int width)
  Find the text window(s) around a match. Given the size of a buffer, the match and desired width, return the offsets marked below:
    prepreprepre MATCH postpostpost
    ^            ^    ^            ^
    l-width      l    l+len        l+len+width
    left1        left2 right1      right2
  Parameters:
  
  offset - offset of match
  
  width - width of window left and right of match
  
  textsize - size of buffer containing match; used for boundary conditions
  
  matchlen - length of match
  
  Returns:
  
  window offsets left of match, right of match: [ l1, l2, r1, r2 ]
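The four offsets in the diagram can be computed as sketched below. Clamping to `0` and `textsize` is an assumption based on the "boundary conditions" remark; the class and method names are hypothetical.

```java
final class TextWindowSketch {
    // Returns [left1, left2, right1, right2] around a match at offset.
    static int[] window(int offset, int matchlen, int textsize, int width) {
        int left1 = Math.max(0, offset - width);          // do not run before the buffer
        int left2 = offset;                               // start of match
        int right1 = offset + matchlen;                   // end of match
        int right2 = Math.min(textsize, right1 + width);  // do not run past the buffer
        return new int[] { left1, left2, right1, right2 };
    }

    public static void main(String[] args) {
        String text = "prepreprepre MATCH postpostpost";
        int[] w = window(text.indexOf("MATCH"), "MATCH".length(), text.length(), 12);
        System.out.println(text.substring(w[0], w[1]) + "|" + text.substring(w[2], w[3]));
    }
}
```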
- get_text_window
  
  public static int[] get_text_window(int offset, int textsize, int width)
  
  Get a single text window around the offset.
  
  Parameters:
  
  offset - offset of match
  
  width - width of window left and right of match
  
  textsize - size of buffer containing match; used for boundary conditions
  
  Returns:
  
  window offsets of a text span containing match [ left, right ]
- text_id
  
  public static String text_id(String text) throws NoSuchAlgorithmException, UnsupportedEncodingException
  
  Static method -- use only if you are sure of thread-safety.
  
  Parameters:
  
  text - text or data
  
  Returns:
  
  identifier for the text, an MD5 hash
  
  Throws:
  
  NoSuchAlgorithmException - on err
  
  UnsupportedEncodingException - on err
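A sketch of deriving an MD5-based identifier with the JDK alone. UTF-8 encoding and lower-case hex output are assumptions, as are the names used. Creating a new `MessageDigest` per call, as done here, sidesteps shared-state concerns; whether the library does the same is not stated in this page.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

final class TextIdSketch {
    // Hex-encode a byte array, two lower-case hex digits per byte.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    // MD5 of the UTF-8 bytes of the text, as a hex string.
    static String md5Hex(String text) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            return toHex(md.digest(text.getBytes(StandardCharsets.UTF_8)));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(md5Hex("hello"));
    }
}
```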
- b2hex
  
  public static String b2hex(byte[] barr)
- md5_id
  
  public static String md5_id(byte[] digest)
  
  Deprecated.
  Not MD5 specific. Use b2hex(byte[]) instead.
  
  Parameters:
  
  digest - byte array
  
  Returns:
  
  hash for the data
- string2list
  
  public static List<String> string2list(String s, String delim)
  
  Get a list of values into a nice, scrubbed array of values, no whitespace. Example: "a, b, c d e, f" => [ "a", "b", "c d e", "f" ]
  
  Parameters:
  
  s - string to split
  
  delim - delimiter, no default.
  
  Returns:
  
  list of split strings, which are also whitespace trimmed
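The split-and-trim behavior in the example can be sketched as follows. Treating the delimiter literally (not as a regex) and dropping empty values are assumptions; the class name is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

final class SplitTrimSketch {
    // Split on a literal delimiter and trim whitespace around each value.
    static List<String> split(String s, String delim) {
        List<String> values = new ArrayList<>();
        for (String part : s.split(Pattern.quote(delim))) {
            String v = part.trim();
            if (!v.isEmpty()) {   // assumption: empty values are skipped
                values.add(v);
            }
        }
        return values;
    }

    public static void main(String[] args) {
        System.out.println(split("a, b, c d e, f", ","));
    }
}
```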
- fast_replace
  
  public static String fast_replace(String buf, String replace, String substitution)
  
  Given a string S and a list of characters to replace with a substitute, return the new string, S'.
  "-name-with.invalid characters;" // replace "-. ;" with "_"
  "_name_with_invalid_characters_"
  
  Parameters:
  
  buf - buffer
  
  replace - string of characters to replace with the one substitute char
  
  substitution - string to insert in place of chars
  
  Returns:
  
  scrubbed text
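The many-to-one replacement in the example can be sketched with a single pass over the buffer. This is an illustration, not the library's implementation; the class and method names are hypothetical.

```java
final class CharReplaceSketch {
    // Replace every char of buf that appears in 'replace' with 'substitution'.
    static String replaceChars(String buf, String replace, String substitution) {
        StringBuilder sb = new StringBuilder(buf.length());
        for (int i = 0; i < buf.length(); i++) {
            char c = buf.charAt(i);
            if (replace.indexOf(c) >= 0) {
                sb.append(substitution);
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(replaceChars("-name-with.invalid characters;", "-. ;", "_"));
    }
}
```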
- removeAny
  
  public static String removeAny(String buf, String remove)
  
  Remove instances of any char in the remove string from buf
  
  Parameters:
  
  buf - text
  
  remove - string to remove
  
  Returns:
  
  scrubbed text
- replaceAny
  
  public static String replaceAny(String buf, String remove, String sub)
  
  Replace any of the removal chars with the sub. A many to one replacement. alt: use regex String.replace(//, '')
  
  Parameters:
  
  buf - text
  
  remove - string to replace
  
  sub - the replacement string
  
  Returns:
  
  scrubbed text
- removeAnyLeft
  
  public static String removeAnyLeft(String buf, String remove)
  
  Compare to trim( string, chars ), but you can trim any chars. Example: given "- a b c", remove "-" from the left of the string.
  
  Parameters:
  
  buf - text
  
  remove - string to remove
  
  Returns:
  
  scrubbed text
- normalizeTextEntity
  
  public static String normalizeTextEntity(String str)
  Normalization: Clean the ends, Remove Line-endings from middle of entity.
  Example:
  TEXT: **The Daily Newsletter of \n\rBarbara, So.**
  CLEAN: __The Daily Newsletter of __Barbara, So___
  Where "__" represents omitted characters.
  Parameters:
  
  str - text
  
  Returns:
  
  scrubbed text
- tokens
  
  public static String[] tokens(String str)
  
  Return just white-space delimited tokens.
  
  Parameters:
  
  str - text
  
  Returns:
  
  tokens
- tokensRight
  
  public static final String[] tokensRight(String str)
  
  Return tokens on the right most part of a buffer. If a para break occurs, \n\n or \r\n\r\n, then return the part on the right of the break.
  
  Parameters:
  
  str - text
  
  Returns:
  
  whitespace delimited tokens
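A sketch of the right-hand tokenization described above. Using the last paragraph break in the buffer is an assumption (the description does not say which break wins when there are several), and the class name is hypothetical.

```java
final class TokensRightSketch {
    // Index just past the last occurrence of brk, or 0 if brk is absent.
    private static int endOfLastBreak(String s, String brk) {
        int i = s.lastIndexOf(brk);
        return i < 0 ? 0 : i + brk.length();
    }

    // Whitespace-delimited tokens of the text after the last paragraph break.
    static String[] tokensRight(String str) {
        int cut = Math.max(endOfLastBreak(str, "\n\n"), endOfLastBreak(str, "\r\n\r\n"));
        String tail = str.substring(cut).trim();
        return tail.isEmpty() ? new String[0] : tail.split("\\s+");
    }

    public static void main(String[] args) {
        for (String tok : tokensRight("first para\n\nsecond para here")) {
            System.out.println(tok);
        }
    }
}
```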
- tokensLeft
  
  public static final String[] tokensLeft(String str)
  
  See tokensRight()
  
  Parameters:
  
  str - text
  
  Returns:
  
  whitespace delimited tokens
- normalizeAbbreviation
  
  public static String normalizeAbbreviation(String word)
  
  Intended only as a filter for punctuation within a word. Text of the form A.T.T. or U.S. becomes ATT and US. A text such as Mr.Pibbs incorrectly becomes MrPibbs but for the purposes of normalizing tokens this should be fine. Use appropriate tokenization prior to using this as a filter.
  
  Parameters:
  
  word - phrase with periods denoting some abbreviation.
  
  Returns:
  
  scrubbed text
- isAbbreviation
  
  public static boolean isAbbreviation(String txt)
  Parameters:
  
  txt -
  
  Returns:
  
  See Also:
  
  isAbbreviation(String, boolean)
- isAbbreviation
  
  public static boolean isAbbreviation(String orig, boolean useCase)
  
  Define what an acronym is:
  A.B. (at minimum)
  A.b. okay
  A. b. okay
  A.b not okay
  A.9. not okay
  Starts with Alpha
  Period is required
  Ends with a period
  One upper case letter required -- optional arg for case sensitivity
  Digits allowed.
  Spaces allowed - length no longer than 15 non-whitespace chars
- removeDiacritics
  
  public static String removeDiacritics(String word)
  
  Supports Phoneticizer utility from OpenSextant v1.x. Remove diacritics from a phrase.
  
  Parameters:
  
  word - text
  
  Returns:
  
  scrubbed text
- normalizeUnicode
  
  public static String normalizeUnicode(String str)
  
  Normalize to "Normalization Form Canonical Decomposition" (NFD). REF: http://stackoverflow.com/questions/3610013/file-listfiles-mangles-unicode-names-with-jdk-6-unicode-normalization-issues This supports proper file name retrieval from file system, among other things. In many situations we see unicode file names -- Java can list them, but in using the Java-provided version of the filename the OS/FS may not be able to find the file by the name given in a particular normalized form.
  
  Parameters:
  
  str - text
  
  Returns:
  
  normalized string, encoded with NFD bytes
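NFD decomposition is available in the JDK through `java.text.Normalizer`. The sketch below shows the decomposition and, as a related illustration, stripping the combining marks it exposes. The class and method names are hypothetical, and the mark-stripping step is a common technique, not necessarily how removeDiacritics or replaceDiacritics work.

```java
import java.text.Normalizer;

final class UnicodeNormalizeSketch {
    // Canonical decomposition: "é" becomes "e" followed by a combining acute accent.
    static String toNfd(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD);
    }

    // Decompose, then drop all combining marks (Unicode category M).
    static String stripMarks(String s) {
        return toNfd(s).replaceAll("\\p{M}+", "");
    }

    public static void main(String[] args) {
        String word = "Caf\u00e9";
        System.out.println(word.length() + " -> " + toNfd(word).length());
        System.out.println(stripMarks(word));
    }
}
```

Comparing two file names after passing both through the same normalization form avoids the mismatch described above, where the file system and Java disagree on the composed form.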
- removePunctuation
  
  public static String removePunctuation(String word)
  
  Remove any leading and trailing punctuation and some internal punctuation. Internal punctuation which indicates conjunction of two tokens, e.g. a hyphen, should have caused a split into separate tokens at the tokenization stage. Phoneticizer utility from OpenSextant v1.x. Remove punctuation from a phrase.
  
  Parameters:
  
  word - text
  
  Returns:
  
  scrubbed text
- getLanguageMap
  
  public static Map<String,Language> getLanguageMap()
  
  If caller wants to add language they can.
  
  Returns:
  
  map of lang ID to language obj
- initLanguageData
  
  public static void initLanguageData()
  Initialize language codes and metadata. This establishes a map for the most common language codes/names that exist in at least ISO-639-1 and have a non-zero 2-char ID.
  Based on: http://stackoverflow.com/questions/674041/is-there-an-elegant-way-to-convert-iso-639-2-3-letter-language-codes-to-java-lo
  Actual code mappings:
  en => eng
  eng => en
  cel => '' // Celtic; Avoid this.
  tr => tur
  tur => tr
  Names:
  tr => turkish
  tur => turkish
  turkish => tr // ISO2 only
- initLOCLanguageData
  
  public static void initLOCLanguageData() throws IOException
  
  This is Library of Congress data for language IDs. This is offered as a tool to help downstream language ID and enrich metadata when tagging data from particular countries. Reference: http://www.loc.gov/standards/iso639-2/ISO-639-2_utf-8.txt
  
  Throws:
  
  IOException - if resource file is not found
- addLanguage
  
  public static void addLanguage(Language lg)
- addLanguage
  
  public static void addLanguage(Language lg, boolean override)
  
  Extend the basic language dictionary. Note: a language is first listed in the language map by name, and that entry is not overwritten. Language objects may be overwritten in the map using lang codes. For example, fre = French(fre), fra = French(fra), and french = French(fra); the last one, 'french', could have been French(fre) or French(fra). Similarly, 'ger' and 'deu' are both valid ISO 3-alpha codes for German. What to do? TODO: Create a language object that lists both language biblio/terminology codes.
  
  Parameters:
  
  lg - language object
  
  override - if this value should overwrite an existing one.
- getLanguageName
  
  public static String getLanguageName(String code)
  
  Given an ISO2 char code (least common denominator) retrieve Language Name. This is best effort, so if your code finds nothing, this returns code normalized to lowercase.
  
  Parameters:
  
  code - lang ID
  
  Returns:
  
  name of language
- getLanguage
  
  public static Language getLanguage(String code)
  
  ISO2 and ISO3 char codes for languages are unique.
  
  Parameters:
  
  code - iso2 or iso3 code
  
  Returns:
  
  the Language object for the given code
- getLanguageCode
  
  public static String getLanguageCode(String code)
  
  ISO2 and ISO3 char codes for languages are unique.
  
  Parameters:
  
  code - iso2 or iso3 code
  
  Returns:
  
  the other code.
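The "other code" lookup can be pictured as a two-way map between ISO2 and ISO3 codes. The sketch below uses only the `en`/`eng` and `tr`/`tur` pairs listed under initLanguageData; lower-casing the input and returning null for unknown codes are assumptions, and the names are hypothetical.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

final class LangCodeSketch {
    private static final Map<String, String> OTHER_CODE = new HashMap<>();

    static {
        link("en", "eng");
        link("tr", "tur");
    }

    // Register the pair in both directions.
    private static void link(String iso2, String iso3) {
        OTHER_CODE.put(iso2, iso3);
        OTHER_CODE.put(iso3, iso2);
    }

    // ISO2 in, ISO3 out, and vice versa; null when the code is unknown.
    static String otherCode(String code) {
        return code == null ? null : OTHER_CODE.get(code.toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        System.out.println(otherCode("en") + " " + otherCode("TUR"));
    }
}
```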
- isEuroLanguage
  
  public static boolean isEuroLanguage(String l)
  
  European languages = Romance + GER + ENG. Extend definition as needed.
  
  Parameters:
  
  l - language ID
  
  Returns:
  
  true if language is European in nature
- isRomanceLanguage
  
  public static boolean isRomanceLanguage(String l)
  
  Romance languages = SPA + POR + ITA + FRA + ROM. Extend definition as needed.
  
  Parameters:
  
  l - lang ID
  
  Returns:
  
  true if language is a Romance language
- isEnglish
  
  public static boolean isEnglish(String x)
  
  Utility method to check if lang ID is English...
  
  Parameters:
  
  x - a langcode
  
  Returns:
  
  whether langcode is english
- isChinese
  
  public static boolean isChinese(String x)
  
  Utility method to check if lang ID is Chinese(Traditional or Simplified)...
  
  Parameters:
  
  x - a langcode
  
  Returns:
  
  whether langcode is chinese
- isCJK
  
  public static boolean isCJK(String x)
  
  Utility method to check if lang ID is Chinese, Korean, or Japanese
  
  Parameters:
  
  x - a langcode
  
  Returns:
  
  whether langcode is a CJK language
- measureCJKText
  
  public static double measureCJKText(String buf)
  
  Returns a ratio of Chinese/Japanese/Korean characters: CJK chars / ALL. TODO: needs testing; not sure if this is sustainable if block; or if it is comprehensive. TODO: for performance reasons the internal chain of comparisons is embedded in the method; otherwise, for each char, an external method invocation is required.
  
  Parameters:
  
  buf - the text to be measured
  
  Returns:
  
  ratio of CJK characters to all characters in buf
- countCJKChars
  
  public static int countCJKChars(char[] chars)
  
  Counts the CJK characters in the buffer. Inspiration: http://stackoverflow.com/questions/1499804/how-can-i-detect-japanese-text-in-a-java-string Assumption is that the char array is Unicode characters.
  
  Parameters:
  
  chars - char array for the text in question.
  
  Returns:
  
  count of CJK characters
- hasCJKText
  
  public static boolean hasCJKText(String buf)
  
  A simple test to see if text has any CJK characters at all. It returns after the first such character.
  
  Parameters:
  
  buf - text
  
  Returns:
  
  true if buf has at least one CJK char
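Counting and the early-exit test differ only in when the scan stops. The sketch below swaps in `Character.UnicodeScript` over code points instead of the Unicode-block checks this class documents; `CjkScanSketch` and its method names are hypothetical.

```java
final class CjkScanSketch {
    // Script-based test: Han, Hiragana, Katakana and Hangul are treated as CJK (an assumption).
    static boolean isCjkCodePoint(int cp) {
        Character.UnicodeScript s = Character.UnicodeScript.of(cp);
        return s == Character.UnicodeScript.HAN
            || s == Character.UnicodeScript.HIRAGANA
            || s == Character.UnicodeScript.KATAKANA
            || s == Character.UnicodeScript.HANGUL;
    }

    // Scans the whole text and counts every CJK code point.
    static int count(String text) {
        return (int) text.codePoints().filter(CjkScanSketch::isCjkCodePoint).count();
    }

    // anyMatch short-circuits, so the scan stops at the first CJK code point.
    static boolean hasAny(String text) {
        return text.codePoints().anyMatch(CjkScanSketch::isCjkCodePoint);
    }
}
```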
- isCJK
  
  public static boolean isCJK(Character.UnicodeBlock blk)
- isChinese
  
  public static boolean isChinese(Character.UnicodeBlock blk)
- isKorean
  
  public static boolean isKorean(Character.UnicodeBlock blk)
  
  Likely to be uniquely Korean if the character block is in Hangul. But also, it may be Korean if block is part of the CJK ideographs at large. User must check if text in its entirety is part of CJK & Hangul, independently. This method only detects if character block is uniquely Hangul or not.
  
  Parameters:
  
  blk - a Java Unicode block
  
  Returns:
  
  true if char block is Hangul
- isJapanese
  
  public static boolean isJapanese(Character.UnicodeBlock blk)
  
  Checks if char block is uniquely Japanese. Check other chars with isChinese.
  
  Parameters:
  
  blk - a Java Unicode block
  
  Returns:
  
  true if char block is Hiragana or Katakana
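The "uniquely Korean" and "uniquely Japanese" caveats can be seen in a small sketch: a shared Han ideograph falls in neither category, so whole-text checks are still needed. The class `ScriptBlockSketch` and the exact Hangul block list are assumptions, not the library's definitions.

```java
final class ScriptBlockSketch {
    // Hangul blocks only; shared CJK ideographs are deliberately excluded.
    static boolean isHangul(Character.UnicodeBlock blk) {
        return blk == Character.UnicodeBlock.HANGUL_SYLLABLES
            || blk == Character.UnicodeBlock.HANGUL_JAMO
            || blk == Character.UnicodeBlock.HANGUL_COMPATIBILITY_JAMO;
    }

    // Kana blocks only; Kanji live in the shared CJK ideograph blocks.
    static boolean isKana(Character.UnicodeBlock blk) {
        return blk == Character.UnicodeBlock.HIRAGANA
            || blk == Character.UnicodeBlock.KATAKANA;
    }
}
```

A call such as `ScriptBlockSketch.isKana(Character.UnicodeBlock.of('\u65E5'))` returns false, even though U+65E5 occurs in Japanese text.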
- compress
  
  public static byte[] compress(String buf) throws IOException
  
  Compress bytes from a Unicode string. The string is converted to bytes first to avoid Unicode or platform-dependent IO issues.
  
  Parameters:
  
  buf - UTF-8 encoded text
  
  Returns:
  
  byte array
  
  Throws:
  
  IOException - on error with compression or text encoding
- compress
  
  public static byte[] compress(String buf, String charset) throws IOException
  
  Parameters:
  
  buf - text
  
  charset - character set encoding for text
  
  Returns:
  
  byte array for the compressed result
  
  Throws:
  
  IOException - on error with compression or text encoding
- uncompress
  
  public static String uncompress(byte[] gzData) throws IOException
  
  Parameters:
  
  gzData - byte array containing gzipped buffer
  
  Returns:
  
  buffer UTF-8 decoded string
  
  Throws:
  
  IOException - on error with decompression or text encoding
- uncompress
  
  public static String uncompress(byte[] gzData, String charset) throws IOException
  
  Parameters:
  
  gzData - byte array containing gzipped buffer
  
  charset - character set decoding for text
  
  Returns:
  
  buffer of uncompressed, decoded string
  
  Throws:
  
  IOException - on error with decompression or text encoding
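A compress/uncompress round trip of this shape can be sketched with the JDK's GZIP streams. It assumes the compressed form is plain gzip, as the `gzData` parameter description suggests; `GzipTextSketch` is a hypothetical stand-in and not the library's implementation.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

final class GzipTextSketch {
    // Encode the text with an explicit charset first, then gzip the bytes.
    static byte[] compress(String text, String charset) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(text.getBytes(charset));
        }
        return out.toByteArray();
    }

    // Gunzip the bytes, then decode them with the same charset.
    static String uncompress(byte[] gzData, String charset) throws IOException {
        try (GZIPInputStream gz = new GZIPInputStream(new ByteArrayInputStream(gzData))) {
            return new String(gz.readAllBytes(), charset);
        }
    }
}
```

An unknown charset name surfaces as `UnsupportedEncodingException`, which is an `IOException`, consistent with the Throws clauses above.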
- removeEmoticons
  
  public static String removeEmoticons(String t)
  
  Replaces emoticons with something less nefarious, since UTF-16 characters do not play well with some I/O routines.
  
  Parameters:
  
  t - text
  
  Returns:
  
  scrubbed text
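One way to scrub emoticons is to walk the text by code point and substitute anything in the Unicode Emoticons block. This is only a sketch: the class `EmoticonScrubSketch`, the caller-supplied placeholder, and the restriction to the Emoticons block (U+1F600 to U+1F64F) are assumptions; the replacement the library actually uses is not stated here.

```java
final class EmoticonScrubSketch {
    // Replaces each code point in the Emoticons block with the given placeholder.
    static String scrub(String text, String placeholder) {
        StringBuilder sb = new StringBuilder(text.length());
        text.codePoints().forEach(cp -> {
            if (Character.UnicodeBlock.of(cp) == Character.UnicodeBlock.EMOTICONS) {
                sb.append(placeholder);
            } else {
                sb.appendCodePoint(cp);
            }
        });
        return sb.toString();
    }
}
```

Iterating by code point rather than by char keeps each surrogate pair together, so an emoticon is replaced as a whole.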
- removeSymbols
  
  public static String removeSymbols(String t)
  
  Replace symbology
  
  Parameters:
  
  t - text
  
  Returns:
  
  scrubbed text
- countNonText
  
  public static int countNonText(String t)
  
  Counts the number of non-alphanumeric chars present.
  
  Parameters:
  
  t - text
  
  Returns:
  
  count of chars
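A literal reading of "non-alphanumeric" can be sketched as below. It assumes whitespace and punctuation both count and that letters and digits are judged by Unicode properties; the library may draw the line differently, and `NonTextCountSketch` is a hypothetical name.

```java
final class NonTextCountSketch {
    // Counts code points that are neither letters nor digits (whitespace included, by assumption).
    static int count(String text) {
        return (int) text.codePoints()
            .filter(cp -> !Character.isLetterOrDigit(cp))
            .count();
    }
}
```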
- parseHashTags
  
  public static Set<String> parseHashTags(String tweetText)
  
  Parse the typical Twitter hashtag variants.
  
  Parameters:
  
  tweetText -
  
  Returns:
- parseHashTags
  
  public static Set<String> parseHashTags(String tweetText, boolean normalize)
  
  Takes a string and returns all the hashtags in it. Normalized tags are just lowercased and deduplicated. See https://developer.twitter.com/en/docs/tweets/data-dictionary/overview/intro-to-tweet-json
  
  Parameters:
  
  tweetText - text
  
  normalize - whether to normalize tags by lowercasing them, etc.
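Hashtag extraction with optional normalization can be sketched with a single regex. The pattern below is a simplified stand-in for hashtagPattern2, not its actual definition, and returning tags without the leading "#" is an assumption. `Pattern.UNICODE_CHARACTER_CLASS` is set so that "\w" matches non-ASCII letters.

```java
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class HashtagSketch {
    // Simplified "#word" pattern; Unicode-aware so accented and CJK tags match.
    static final Pattern TAG = Pattern.compile("#(\\w+)", Pattern.UNICODE_CHARACTER_CLASS);

    static Set<String> parse(String text, boolean normalize) {
        // A set deduplicates; LinkedHashSet keeps first-seen order.
        Set<String> tags = new LinkedHashSet<>();
        Matcher m = TAG.matcher(text);
        while (m.find()) {
            String tag = m.group(1);
            tags.add(normalize ? tag.toLowerCase(Locale.ROOT) : tag);
        }
        return tags;
    }
}
```

With normalization on, "#Java" and "#java" collapse into the single tag "java"; with it off, both spellings are kept.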
- parseNaturalLanguage
  
  public static String parseNaturalLanguage(String raw)
  
  See the default implementation below. With the defaults, HTML is replaced, URLs are removed, and tag and entity markers (@ and #) are stripped; the tags and entities themselves are left in place.
  
  Parameters:
  
  raw - raw text
  
  Returns:
  
  cleaner looking text
- parseNaturalLanguage
  
  public static String parseNaturalLanguage(String raw, boolean unescapeHtml, boolean remURLs, boolean remTags, boolean remEntities)
  
  Given tweet text or any [social media] text, remove entities or other markers:
  
  - URLs are removed
  - entities are stripped of "@"
  - hashtags are stripped of "#"
  - HTML: &amp; is converted to an ampersand
  - HTML: escaped angle brackets are replaced with { and } for gt and lt, respectively
  - HTML: remaining special chars are converted back to unicode; remaining ampersand is replaced with "+"
  
  Whitespaces (space, newlines, tabs, etc.) are reduced.
  
  DEPRECATED: the use of the tags=true flag to replace hashtags with blank is not supported. #tag<unicode text> is a problem: it is hard to tell in some cases where the hashtag ends. In Weibo, #tag#<unicode text> is used to denote that the tag has a start and an end, but in Twitter the tag format is "#tag" or "#[phrase here]", etc. So there is no generic hashtag replacement.
  
  Parameters:
  
  raw - original text
  
  unescapeHtml - unescape HTML
  
  remURLs - remove URLs
  
  remTags - remove hash tags
  
  remEntities - remove other entities
  
  Returns:
  
  text less entities.
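A subset of the clean-up steps listed above (URL removal, stripping of "@" and "#" markers, &amp;amp; conversion, whitespace reduction) can be sketched as a few regex passes. The patterns and the class `SocialTextSketch` are illustrative assumptions; the angle-bracket and remaining-entity handling is left out.

```java
import java.util.regex.Pattern;

final class SocialTextSketch {
    // Assumed URL shape: http or https followed by non-whitespace.
    static final Pattern URL = Pattern.compile("https?://\\S+");
    // An "@" or "#" directly followed by a word character; only the marker is removed.
    static final Pattern MARKER = Pattern.compile("[@#](?=\\w)", Pattern.UNICODE_CHARACTER_CLASS);
    static final Pattern WS = Pattern.compile("\\s+");

    static String clean(String raw) {
        String t = URL.matcher(raw).replaceAll(" ");
        t = MARKER.matcher(t).replaceAll("");
        t = t.replace("&amp;", "&");
        // Collapse runs of spaces, tabs and newlines into single spaces.
        return WS.matcher(t).replaceAll(" ").trim();
    }
}
```

For example, `clean("RT @user: see #Java  at http://t.co/abc now")` yields "RT user: see Java at now".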
- parseDate
  
  public static final Date parseDate(String dt)
  
  Limited-scope date parsing: parses properly formatted strings, for example ISO date/time strings stored in one of our Solr indices.
  
  Parameters:
  
  dt - ISO date/time string.
  
  Returns:
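Parsing an ISO date/time string of the kind Solr stores (UTC with a trailing "Z", for example 2012-03-04T05:06:07Z) can be sketched with `java.time`. Returning null on unparseable input is an assumption, since the return behavior is not documented here; `IsoDateSketch` is a hypothetical name.

```java
import java.time.Instant;
import java.time.format.DateTimeParseException;
import java.util.Date;

final class IsoDateSketch {
    // Parses an ISO-8601 instant; returns null for null or malformed input (assumed behavior).
    static Date parse(String dt) {
        if (dt == null) {
            return null;
        }
        try {
            return Date.from(Instant.parse(dt));
        } catch (DateTimeParseException e) {
            return null;
        }
    }
}
```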
