Class TextUtils

java.lang.Object
org.opensextant.util.TextUtils

public class TextUtils extends Object
Author:
ubaldino
  • Field Details

  • Constructor Details

    • TextUtils

      public TextUtils()
  • Method Details

    • hasIrregularPunctuation

      public static boolean hasIrregularPunctuation(String t)
      Simple triage of punctuation. Rationale: OpenSextant taggers maximize RECALL in favor of not missing a possible match. The problem is that we often encounter substantial noise in tagger output, so a trivial test is to see if we have overmatched. Allowed punctuation: , . - _ ` ' ( ) (diacritics, parenthetics, periods/dashes).
           Given phrase "A B C"  we may have matched:
                        "A|B+C", "A; B; C", "A <B> C" etc...  where common punctuation separates valid tokens
                        that appear in the reference phrase.
       
      Parameters:
      t -
      Returns:
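      The triage described above can be sketched roughly as follows. This is an illustration only, not the library's implementation: the class and method names are hypothetical, and treating a single irregular mark as enough to flag a phrase is an assumption.

```java
final class IrregularPunctuationSketch {
    // Allowed punctuation as listed in the description: , . - _ ` ' ( )
    private static final String ALLOWED = ",.-_`'()";

    /** Counts chars that are not letters, digits, whitespace, combining marks, or allowed punctuation. */
    static int countIrregular(String t) {
        int count = 0;
        for (int i = 0; i < t.length(); i++) {
            char c = t.charAt(i);
            if (Character.isLetterOrDigit(c) || Character.isWhitespace(c)) {
                continue;
            }
            if (Character.getType(c) == Character.NON_SPACING_MARK) {
                continue; // combining diacritics are treated as regular
            }
            if (ALLOWED.indexOf(c) < 0) {
                count++;
            }
        }
        return count;
    }

    /** Assumption: one irregular mark is enough to flag the phrase. */
    static boolean hasIrregular(String t) {
        return countIrregular(t) > 0;
    }

    public static void main(String[] args) {
        System.out.println(hasIrregular("A|B+C"));
        System.out.println(hasIrregular("St. John's (East)"));
    }
}
```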
    • countIrregularPunctuation

      public static int countIrregularPunctuation(String t)
    • isLatin

      public static final boolean isLatin(String data)
      Checks if non-ASCII and non-LATIN characters are present.
      Parameters:
      data - any textual data
      Returns:
      true if content is strictly ASCII or Latin1 extended.
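      One way to approximate such a check with the JDK alone is to test the Unicode block of each char. This is a sketch under assumptions: the class name is hypothetical, and the exact set of blocks the library accepts as "Latin" is not stated in this page.

```java
import java.lang.Character.UnicodeBlock;

final class LatinCheckSketch {
    /** Returns true only if every char falls in an ASCII or extended Latin block (assumed set). */
    static boolean isLatinOnly(String data) {
        for (int i = 0; i < data.length(); i++) {
            UnicodeBlock blk = UnicodeBlock.of(data.charAt(i));
            if (blk != UnicodeBlock.BASIC_LATIN
                    && blk != UnicodeBlock.LATIN_1_SUPPLEMENT
                    && blk != UnicodeBlock.LATIN_EXTENDED_A
                    && blk != UnicodeBlock.LATIN_EXTENDED_B) {
                return false; // early exit on first non-Latin char
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isLatinOnly("Caf\u00e9"));
        System.out.println(isLatinOnly("\u0645"));
    }
}
```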
    • hasMiddleEasternText

      public static final boolean hasMiddleEasternText(String data)
      Detects the first Arabic or Hebrew character for now -- will be more comprehensive in scoping "Middle Eastern" scripts in text.
      Parameters:
      data -
      Returns:
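      A first-character detection like the one described might look like the sketch below. The class name is hypothetical, and limiting the test to the basic ARABIC and HEBREW Unicode blocks is an assumption.

```java
import java.lang.Character.UnicodeBlock;

final class MiddleEasternSketch {
    /** Returns true at the first char found in the Arabic or Hebrew block. */
    static boolean hasArabicOrHebrew(String data) {
        for (int i = 0; i < data.length(); i++) {
            UnicodeBlock blk = UnicodeBlock.of(data.charAt(i));
            if (blk == UnicodeBlock.ARABIC || blk == UnicodeBlock.HEBREW) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(hasArabicOrHebrew("abc \u05d0"));
        System.out.println(hasArabicOrHebrew("abc"));
    }
}
```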
    • hasDiacritics

      public static final boolean hasDiacritics(String s)
      Checks if a string has extended Latin diacritics.
      Parameters:
      s - string to test
      Returns:
      true if a single diacritic is found.
    • phoneticReduction

      public static String phoneticReduction(String t)
      Create a non-diacritic, ASCII version of the input string. This will also have original whitespace, but will have removed non-character markings, e.g. "Za'tut" => "Zatut" not "Za tut"
      Parameters:
      t -
      Returns:
    • phoneticReduction

      public static String phoneticReduction(String t, boolean isAscii)
    • replaceDiacritics

      public static final String replaceDiacritics(String s)
      A thorough replacement of diacritics and Unicode chars to their ASCII equivalents.
      Parameters:
      s - the string
      Returns:
      converted string
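      For comparison, a common JDK-only approach is to decompose the string and drop the combining marks. This sketch is not the library's method and is likely less thorough: letters without a canonical decomposition (for example a slashed o) pass through unchanged. The class name is hypothetical.

```java
import java.text.Normalizer;

final class DiacriticStripSketch {
    /** Decomposes to NFD, then removes all combining marks (Unicode category M). */
    static String stripDiacritics(String s) {
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "");
    }

    public static void main(String[] args) {
        System.out.println(stripDiacritics("r\u00e9sum\u00e9"));
    }
}
```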
    • replaceDiacriticsOriginal

      @Deprecated public static String replaceDiacriticsOriginal(String s)
      Deprecated.
      See replaceDiacritics as the replacement.
      Removes accents from a string and replaces them with ASCII equivalents. Reference: http://www.rgagnon.com/javadetails/java-0456.html Caveat: this implementation is not exhaustive.
      Parameters:
      s -
      Returns:
      See Also:
    • isASCII

      public static final boolean isASCII(char c)
      Parameters:
      c - a character
      Returns:
      true if c is ASCII
    • isASCIILetter

      public static final boolean isASCIILetter(char c)
      Parameters:
      c - character
      Returns:
      true if c is ASCII a-z or A-Z
    • isASCII

      public static boolean isASCII(byte[] data)
      Parameters:
      data - bytes to test
      Returns:
      true if data is ASCII, false otherwise
    • isASCII

      public static boolean isASCII(String t)
      Early exit test -- return false on first non-ASCII character found.
      Parameters:
      t - buffer of text
      Returns:
      true only if every char is in ASCII table.
    • countASCIIChars

      public static int countASCIIChars(byte[] data)
      count the number of ASCII bytes
      Parameters:
      data - bytes to count
      Returns:
      count of ASCII bytes
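      The ASCII tests above reduce to simple range checks. The following sketch shows the early-exit string test and the byte count; the class and method names are hypothetical, and it relies on Java bytes being signed, so any byte of 0x80 or above appears negative.

```java
final class AsciiSketch {
    /** Early exit: returns false on the first char outside the 7-bit ASCII range. */
    static boolean isAscii(String t) {
        for (int i = 0; i < t.length(); i++) {
            if (t.charAt(i) > 127) {
                return false;
            }
        }
        return true;
    }

    /** Counts bytes in the range 0..127; bytes >= 0x80 are negative in Java. */
    static int countAsciiBytes(byte[] data) {
        int count = 0;
        for (byte b : data) {
            if (b >= 0) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(isAscii("hello"));
        System.out.println(countAsciiBytes(new byte[] {104, (byte) 0xC3, (byte) 0xA9}));
    }
}
```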
    • reduce_line_breaks

      public static String reduce_line_breaks(String t)
      Replaces every run of 3 or more blank lines with a single paragraph break (\n\n)
      Parameters:
      t - text
      Returns:
      A string with fewer line breaks
    • delete_whitespace

      public static String delete_whitespace(String t)
      Delete whitespace of any sort.
      Parameters:
      t - text
      Returns:
      String, without whitespace.
    • squeeze_whitespace

      public static String squeeze_whitespace(String t)
      Minimize whitespace.
      Parameters:
      t - text
      Returns:
      scrubbed string
    • delete_eol

      public static String delete_eol(String t)
      Replace line endings with SPACE
      Parameters:
      t - text
      Returns:
      scrubbed string
    • delete_controls

      public static String delete_controls(String t)
      Delete control chars from text data; leaving text and whitespace only. Delete char (^?) is also removed. Length may differ if ctl chars are removed.
      Parameters:
      t - text
      Returns:
      scrubbed buffer
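      The whitespace and control-char scrubbers above can be approximated with a few lines of JDK code. This is a sketch under assumptions, not the library's code: the names are hypothetical, "three or more consecutive line breaks" is assumed as the trigger for a paragraph break, and TAB, LF and CR are assumed to be the whitespace that survives control-char removal.

```java
final class WhitespaceScrubSketch {
    /** Collapses runs of three or more line breaks into one paragraph break. */
    static String reduceLineBreaks(String t) {
        return t.replaceAll("(\\r?\\n){3,}", "\n\n");
    }

    /** Trims the ends and collapses every whitespace run to a single space. */
    static String squeezeWhitespace(String t) {
        return t.trim().replaceAll("\\s+", " ");
    }

    /** Drops control chars below 0x20 and DEL (0x7F), keeping TAB, LF and CR. */
    static String deleteControls(String t) {
        StringBuilder sb = new StringBuilder(t.length());
        for (int i = 0; i < t.length(); i++) {
            char c = t.charAt(i);
            boolean keptWhitespace = c == '\t' || c == '\n' || c == '\r';
            boolean control = c < 0x20 || c == 0x7F;
            if (!control || keptWhitespace) {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(squeezeWhitespace("  a \n\t b  "));
    }
}
```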
    • hasDigits

      public static boolean hasDigits(String txt)
    • countDigits

      public static int countDigits(String txt)
    • count_digits

      public static int count_digits(String txt)
      Counts all digits in text.
      Parameters:
      txt - text to count
      Returns:
      count of digits
    • isNumeric

      public static final boolean isNumeric(String v)
      Determine if a string is numeric in nature, not necessarily a parsable number. 0-9 or "-+.E" are valid symbols. Example: 11111E.00003333 is numeric, whereas Commons StringUtils.isNumeric only detects digits.
      Parameters:
      v - val to parse
      Returns:
      true if val is a numeric sequence, symbols allowed.
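      The rule stated above (digits plus the symbols "-+.E") can be sketched as a single pass over the string. The names are hypothetical, and returning false for null or empty input is an assumption, since this page does not say how those cases are handled.

```java
final class NumericSketch {
    private static final String SYMBOLS = "-+.E";

    /** True if every char is 0-9 or one of the allowed symbols; not a parse check. */
    static boolean isNumericLike(String v) {
        if (v == null || v.isEmpty()) {
            return false; // assumption: empty input is not numeric
        }
        for (int i = 0; i < v.length(); i++) {
            char c = v.charAt(i);
            boolean digit = c >= '0' && c <= '9';
            if (!digit && SYMBOLS.indexOf(c) < 0) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isNumericLike("11111E.00003333"));
        System.out.println(isNumericLike("12a"));
    }
}
```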
    • count_ws

      public static int count_ws(String txt)
      Counts all whitespace in text.
      Parameters:
      txt - text
      Returns:
      whitespace count
    • countFormattingSpace

      public static int countFormattingSpace(String txt)
      Count formatting whitespace. This is helpful in determining if text spans are phrases with multiple TAB or EOL characters. For that matter, any control character contributes to formatting in some way. DEL, VT, HT, etc. So all control characters ( c < ' ') are counted.
      Parameters:
      txt - input string
      Returns:
      count of format chars
    • isUpper

      public static boolean isUpper(String dat)
      For measuring the upper-case-ness of short texts. Returns true if ALL letters in text are UPPERCASE. Allows for non-letters in text.
      Parameters:
      dat - text or data
      Returns:
      true if text is Upper
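      The "all letters are upper case, non-letters allowed" rule might be sketched as below. The names are hypothetical, and two details are assumptions: text without any cased letter yields false, and caseless letters (such as CJK ideographs) are ignored.

```java
final class CaseCheckSketch {
    /** True if at least one upper case letter exists and no lower case letter does. */
    static boolean isAllUpper(String dat) {
        boolean sawUpper = false;
        for (int i = 0; i < dat.length(); i++) {
            char c = dat.charAt(i);
            if (Character.isLowerCase(c)) {
                return false;
            }
            if (Character.isUpperCase(c)) {
                sawUpper = true;
            }
        }
        return sawUpper; // assumption: digits or punctuation alone are not "upper"
    }

    public static void main(String[] args) {
        System.out.println(isAllUpper("NEW YORK, N.Y."));
        System.out.println(isAllUpper("New York"));
    }
}
```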
    • isLower

      public static boolean isLower(String dat)
    • checkCase

      public static boolean checkCase(String text, int textcase)
      Detects if the alphabetic chars of a string are purely of the given case, lower or upper.
      Parameters:
      text - text
      textcase - 1 lower, 2 upper
      Returns:
      if case matches given textcase param
    • measureCase

      public static int[] measureCase(String text)
      Measure character count, upper, lower, non-Character, whitespace
      Parameters:
      text - text
      Returns:
      int array with counts.
    • isUpperCaseDocument

      public static boolean isUpperCaseDocument(int[] counts)
      First call measureCase(text) to acquire counts, then call this routine for a heuristic that suggests the text is mainly upper case. These routines may not work well on languages that are not Latin-alphabet.
      Parameters:
      counts - word stats from measureCase()
      Returns:
      true if counts represent text that exceeds the "UPPER CASE" threshold
    • isLowerCaseDocument

      public static boolean isLowerCaseDocument(int[] counts)
      This measures the amount of lower case; see isUpperCaseDocument for the upper case counterpart. There are two ways to measure: lower case count compared to all content (char + non-char), or compared to just char content.
      Parameters:
      counts - word stats from measureCase()
      Returns:
      true if counts represent text that exceeds the "lower case" threshold
    • get_text_window

      public static int[] get_text_window(int offset, int matchlen, int textsize, int width)
      Find the text window(s) around a match. Given the size of a buffer, the match and desired width return
       prepreprepre      MATCH        postpostpost
       ^           ^                  ^          ^
       l-width     l                 l+len   l+len+width
       left1     left2              right1    right2
       
      Parameters:
      offset - offset of match
      width - width of window left and right of match
      textsize - size of buffer containing match; used for boundary conditions
      matchlen - length of match
      Returns:
      window offsets left of match, right of match: [ l1, l2, r1, r2 ]
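      The offsets in the diagram can be computed with a little clamped arithmetic. This sketch assumes that boundary conditions are handled by clamping to 0 and to textsize; the actual library behavior at the edges is not described here, and the names are hypothetical.

```java
import java.util.Arrays;

final class TextWindowSketch {
    /** Returns [left1, left2, right1, right2], clamped to the buffer bounds. */
    static int[] window(int offset, int matchlen, int textsize, int width) {
        int left1 = Math.max(0, offset - width);           // start of left context
        int left2 = offset;                                // start of match
        int right1 = Math.min(textsize, offset + matchlen); // end of match
        int right2 = Math.min(textsize, right1 + width);   // end of right context
        return new int[] {left1, left2, right1, right2};
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(window(10, 5, 100, 8)));
    }
}
```

      With such offsets, text.substring(left1, left2) gives the left context and text.substring(right1, right2) the right context.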
    • get_text_window

      public static int[] get_text_window(int offset, int textsize, int width)
      Get a single text window around the offset.
      Parameters:
      offset - offset of match
      width - width of window left and right of match
      textsize - size of buffer containing match; used for boundary conditions
      Returns:
      window offsets of a text span containing match [ left, right ]
    • text_id

      Static method -- use only if you are sure of thread-safety.
      Parameters:
      text - text or data
      Returns:
      identifier for the text, an MD5 hash
      Throws:
      NoSuchAlgorithmException - on err
      UnsupportedEncodingException - on err
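      An MD5-based text identifier can be sketched with the JDK's MessageDigest. This is not the library's code: the method names are hypothetical, UTF-8 encoding and lower case hex output are assumptions, and the sketch creates a new digest per call so that no state is shared between threads. Checked exceptions are wrapped to keep the example short.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

final class TextIdSketch {
    /** Renders bytes as lower case hex, two digits per byte. */
    static String toHex(byte[] barr) {
        StringBuilder sb = new StringBuilder(barr.length * 2);
        for (byte b : barr) {
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    /** Hashes the UTF-8 bytes of the text with MD5 and returns the hex digest. */
    static String md5Hex(String text) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            return toHex(md.digest(text.getBytes(StandardCharsets.UTF_8)));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(md5Hex("hello"));
    }
}
```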
    • b2hex

      public static String b2hex(byte[] barr)
    • md5_id

      public static String md5_id(byte[] digest)
      Deprecated.
      Not MD5-specific; use b2hex() instead.
      Parameters:
      digest - byte array
      Returns:
      hash for the data
    • string2list

      public static List<String> string2list(String s, String delim)
      Split a delimited string into a nice, scrubbed list of values with no surrounding whitespace. Example: a, b, c d e, f => [ "a", "b", "c d e", "f" ]
      Parameters:
      s - string to split
      delim - delimiter, no default.
      Returns:
      list of split strings, which are also whitespace trimmed
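      A split-and-trim of this kind can be sketched as follows. The names are hypothetical, the delimiter is treated as a literal string rather than a regex, and dropping empty values is an assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

final class StringListSketch {
    /** Splits on a literal delimiter and trims each value. */
    static List<String> split(String s, String delim) {
        List<String> out = new ArrayList<>();
        for (String part : s.split(Pattern.quote(delim))) {
            String value = part.trim();
            if (!value.isEmpty()) { // assumption: empty values are skipped
                out.add(value);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(split("a, b, c d e, f", ","));
    }
}
```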
    • fast_replace

      public static String fast_replace(String buf, String replace, String substitution)
      Given a string S and a list of characters to replace with a substitute, return the new string, S'.
           "-name-with.invalid characters;"   // replace "-. ;" with "_"
           "_name_with_invalid_characters_"   // result
      Parameters:
      buf - buffer
      replace - string of characters to replace with the one substitute char
      substitution - string to insert in place of chars
      Returns:
      scrubbed text
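      The many-to-one replacement shown in the example can be sketched as a single pass over the buffer. The names are hypothetical and this is not necessarily how the library achieves its speed.

```java
final class CharReplaceSketch {
    /** Replaces every char of buf that occurs in 'replace' with 'substitution'. */
    static String replaceChars(String buf, String replace, String substitution) {
        StringBuilder sb = new StringBuilder(buf.length());
        for (int i = 0; i < buf.length(); i++) {
            char c = buf.charAt(i);
            if (replace.indexOf(c) >= 0) {
                sb.append(substitution);
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(replaceChars("-name-with.invalid characters;", "-. ;", "_"));
    }
}
```

      Passing an empty substitution turns the same loop into a removal of the listed characters.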
    • removeAny

      public static String removeAny(String buf, String remove)
      Remove instances of any char in the remove string from buf
      Parameters:
      buf - text
      remove - string to remove
      Returns:
      scrubbed text
    • replaceAny

      public static String replaceAny(String buf, String remove, String sub)
      Replace any of the removal chars with the sub. A many to one replacement. alt: use regex String.replace(//, '')
      Parameters:
      buf - text
      remove - string to replace
      sub - the replacement string
      Returns:
      scrubbed text
    • removeAnyLeft

      public static String removeAnyLeft(String buf, String remove)
      Compare to trim( string, chars ), but you can trim any chars from the left side. Example: given "- a b c", remove "-" from the start of the string.
      Parameters:
      buf - text
      remove - string to remove
      Returns:
      scrubbed text
    • normalizeTextEntity

      public static String normalizeTextEntity(String str)
      Normalization: Clean the ends, Remove Line-endings from middle of entity.
        Example:
              TEXT: **The Daily Newsletter of \n\rBarbara, So.**
             CLEAN: __The Daily Newsletter of __Barbara, So___
      
       Where "__" represents omitted characters.
       
      Parameters:
      str - text
      Returns:
      scrubbed text
    • tokens

      public static String[] tokens(String str)
      Return just white-space delimited tokens.
      Parameters:
      str - text
      Returns:
      tokens
    • tokensRight

      public static final String[] tokensRight(String str)
      Return tokens on the right most part of a buffer. If a para break occurs, \n\n or \r\n\r\n, then return the part on the right of the break.
      Parameters:
      str - text
      Returns:
      whitespace delimited tokens
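      The behavior described for the right-most part of a buffer might be sketched like this. The names are hypothetical, and returning an empty array for blank text is an assumption.

```java
import java.util.Arrays;

final class RightTokensSketch {
    /** Returns whitespace-delimited tokens found after the last paragraph break, if any. */
    static String[] tokensRight(String str) {
        int cut = Math.max(endOfLast(str, "\n\n"), endOfLast(str, "\r\n\r\n"));
        String tail = (cut >= 0 ? str.substring(cut) : str).trim();
        return tail.isEmpty() ? new String[0] : tail.split("\\s+");
    }

    /** Index just past the last occurrence of brk, or -1 if absent. */
    private static int endOfLast(String s, String brk) {
        int i = s.lastIndexOf(brk);
        return i < 0 ? -1 : i + brk.length();
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(tokensRight("first para\n\nsecond one here")));
    }
}
```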
    • tokensLeft

      public static final String[] tokensLeft(String str)
      See tokensRight()
      Parameters:
      str - text
      Returns:
      whitespace delimited tokens
    • normalizeAbbreviation

      public static String normalizeAbbreviation(String word)
      Intended only as a filter for punctuation within a word. Text of the form A.T.T. or U.S. becomes ATT and US. A text such as Mr.Pibbs incorrectly becomes MrPibbs but for the purposes of normalizing tokens this should be fine. Use appropriate tokenization prior to using this as a filter.
      Parameters:
      word - phrase with periods denoting some abbreviation.
      Returns:
      scrubbed text
    • isAbbreviation

      public static boolean isAbbreviation(String txt)
      Parameters:
      txt -
      Returns:
      See Also:
    • isAbbreviation

      public static boolean isAbbreviation(String orig, boolean useCase)
      Define what an acronym is:
           A.B.   (at minimum)
           A.b.   okay
           A. b.  okay
           A.b    not okay
           A.9.   not okay
      Rules:
           - Starts with Alpha
           - Period is required
           - Ends with a period
           - One upper case letter required -- optional arg for case sensitivity
           - Digits allowed
           - Spaces allowed
           - Length no longer than 15 non-whitespace chars
    • removeDiacritics

      public static String removeDiacritics(String word)
      Supports the Phoneticizer utility from OpenSextant v1.x. Removes diacritics from a phrase.
      Parameters:
      word - text
      Returns:
      scrubbed text
    • normalizeUnicode

      public static String normalizeUnicode(String str)
      Normalize to "Normalization Form Canonical Decomposition" (NFD). REF: http://stackoverflow.com/questions/3610013/file-listfiles-mangles-unicode-names-with-jdk-6-unicode-normalization-issues This supports proper file name retrieval from the file system, among other things. In many situations we see Unicode file names: Java can list them, but when using the Java-provided version of the filename the OS/FS may not be able to find the file by the name given in a particular normalized form.
      Parameters:
      str - text
      Returns:
      normalized string, encoded with NFD bytes
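      NFD normalization is available in the JDK through java.text.Normalizer. The sketch below shows the effect on a precomposed character; the class and method names are hypothetical.

```java
import java.text.Normalizer;

final class NfdSketch {
    /** Converts text to Normalization Form Canonical Decomposition (NFD). */
    static String toNfd(String str) {
        return Normalizer.normalize(str, Normalizer.Form.NFD);
    }

    public static void main(String[] args) {
        String composed = "\u00e9";           // single precomposed code point
        String decomposed = toNfd(composed);  // base letter plus combining acute accent
        System.out.println(composed.length() + " -> " + decomposed.length());
    }
}
```

      The two forms render the same but compare as different strings, which is why a file name in one form may not match a lookup in the other.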
    • removePunctuation

      public static String removePunctuation(String word)
      Remove any leading and trailing punctuation and some internal punctuation. Internal punctuation which indicates conjunction of two tokens, e.g. a hyphen, should have caused a split into separate tokens at the tokenization stage. Supports the Phoneticizer utility from OpenSextant v1.x. Removes punctuation from a phrase.
      Parameters:
      word - text
      Returns:
      scrubbed text
    • getLanguageMap

      public static Map<String,Language> getLanguageMap()
      If caller wants to add language they can.
      Returns:
      map of lang ID to language obj
    • initLanguageData

      public static void initLanguageData()
      Initialize language codes and metadata. This establishes a map for the most common language codes/names that exist in at least ISO-639-1 and have a non-zero 2-char ID.
       Based on:
       http://stackoverflow.com/questions/674041/is-there-an-elegant-way-to-convert-iso-639-2-3-letter-language-codes-to-java-lo
      
       Actual code mappings:
           en => eng
           eng => en
      
           cel => '' // Celtic; Avoid this.
      
           tr => tur
           tur => tr
      
       Names:
           tr => turkish
           tur => turkish
           turkish => tr // ISO2 only
       
    • initLOCLanguageData

      public static void initLOCLanguageData() throws IOException
      This is Library of Congress data for language IDs. This is offered as a tool to help downstream language ID and enrich metadata when tagging data from particular countries. Reference: http://www.loc.gov/standards/iso639-2/ISO-639-2_utf-8.txt
      Throws:
      IOException - if resource file is not found
    • addLanguage

      public static void addLanguage(Language lg)
    • addLanguage

      public static void addLanguage(Language lg, boolean override)
      Extend the basic language dictionary. Note: the first language listed in the language map under a given name is kept and is not overwritten, whereas Language objects keyed by language code may be overwritten. For example, fre = French(fre), fra = French(fra), and french = French(fra); the last one, 'french', could have been either French(fre) or French(fra). Similarly, 'ger' and 'deu' are both valid ISO 3-alpha codes for German. TODO: Create a language object that lists both language biblio/terminology codes.
      Parameters:
      lg - language object
      override - if this value should overwrite an existing one.
    • getLanguageName

      public static String getLanguageName(String code)
      Given an ISO2 char code (least common denominator), retrieve the language name. This is best effort, so if the code is not found, this returns the code normalized to lowercase.
      Parameters:
      code - lang ID
      Returns:
      name of language
    • getLanguage

      public static Language getLanguage(String code)
      ISO2 and ISO3 char codes for languages are unique.
      Parameters:
      code - iso2 or iso3 code
      Returns:
      the Language object for the given code.
    • getLanguageCode

      public static String getLanguageCode(String code)
      ISO2 and ISO3 char codes for languages are unique.
      Parameters:
      code - iso2 or iso3 code
      Returns:
      the other code.
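      A two-way code lookup of the kind described (en => eng, eng => en) can be approximated with the JDK's Locale data. This sketch is an assumption-laden stand-in, not the library's language map: the names are hypothetical, and the 3-letter codes are whatever the running JDK reports.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.MissingResourceException;

final class LanguageCodeSketch {
    /** Builds a map holding both directions: ISO2 to ISO3 and ISO3 to ISO2. */
    static Map<String, String> buildCodeMap() {
        Map<String, String> codes = new HashMap<>();
        for (String iso2 : Locale.getISOLanguages()) {
            try {
                String iso3 = Locale.forLanguageTag(iso2).getISO3Language();
                if (!iso3.isEmpty()) {
                    codes.put(iso2, iso3);
                    codes.putIfAbsent(iso3, iso2); // keep the first ISO2 code seen
                }
            } catch (MissingResourceException e) {
                // no 3-letter code known for this language; skip it
            }
        }
        return codes;
    }

    public static void main(String[] args) {
        Map<String, String> codes = buildCodeMap();
        System.out.println(codes.get("tr") + " " + codes.get("tur"));
    }
}
```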
    • isEuroLanguage

      public static boolean isEuroLanguage(String l)
      European languages = Romance + GER + ENG. Extend definition as needed.
      Parameters:
      l - language ID
      Returns:
      true if language is European in nature
    • isRomanceLanguage

      public static boolean isRomanceLanguage(String l)
      Romance languages = SPA + POR + ITA + FRA + ROM. Extend definition as needed.
      Parameters:
      l - lang ID
      Returns:
      true if language is a Romance language
    • isEnglish

      public static boolean isEnglish(String x)
      Utility method to check if lang ID is English...
      Parameters:
      x - a langcode
      Returns:
      whether langcode is english
    • isChinese

      public static boolean isChinese(String x)
      Utility method to check if lang ID is Chinese(Traditional or Simplified)...
      Parameters:
      x - a langcode
      Returns:
      whether langcode is chinese
    • isCJK

      public static boolean isCJK(String x)
      Utility method to check if lang ID is Chinese, Korean, or Japanese
      Parameters:
      x - a langcode
      Returns:
      whether langcode is a CJK language
    • measureCJKText

      public static double measureCJKText(String buf)
      Returns a ratio of Chinese/Japanese/Korean characters: CJK chars / ALL chars. TODO: needs testing; not sure if this is a sustainable if-block, or if it is comprehensive. TODO: for performance reasons the internal chain of comparisons is embedded in the method; otherwise an external method invocation is required for each char.
      Parameters:
      buf - the text to be measured
      Returns:
      ratio of CJK characters to all characters in buf
    • countCJKChars

      public static int countCJKChars(char[] chars)
      Counts the CJK characters in a buffer of chars. Inspiration: http://stackoverflow.com/questions/1499804/how-can-i-detect-japanese-text-in-a-java-string The assumption is that the char array holds Unicode characters.
      Parameters:
      chars - char array for the text in question.
      Returns:
      count of CJK characters
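      Counting CJK characters and deriving the ratio can be sketched with Character.UnicodeBlock. The block list below is an assumption and may be narrower or wider than what the library checks; the names are hypothetical.

```java
import java.lang.Character.UnicodeBlock;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

final class CjkCountSketch {
    // Assumed set of blocks treated as CJK for this illustration.
    private static final Set<UnicodeBlock> CJK_BLOCKS = new HashSet<>(Arrays.asList(
            UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS,
            UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS_EXTENSION_A,
            UnicodeBlock.CJK_COMPATIBILITY_IDEOGRAPHS,
            UnicodeBlock.HIRAGANA,
            UnicodeBlock.KATAKANA,
            UnicodeBlock.HANGUL_SYLLABLES,
            UnicodeBlock.HANGUL_JAMO));

    /** Counts chars whose Unicode block is in the assumed CJK set. */
    static int countCjk(char[] chars) {
        int count = 0;
        for (char c : chars) {
            if (CJK_BLOCKS.contains(UnicodeBlock.of(c))) {
                count++;
            }
        }
        return count;
    }

    /** Ratio of CJK chars to all chars; 0 for empty text. */
    static double ratio(String buf) {
        if (buf.isEmpty()) {
            return 0.0;
        }
        return (double) countCjk(buf.toCharArray()) / buf.length();
    }

    public static void main(String[] args) {
        System.out.println(ratio("\u4e2d\u6587ab"));
    }
}
```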
    • hasCJKText

      public static boolean hasCJKText(String buf)
      A simple test to see if text has any CJK characters at all. It returns after the first such character.
      Parameters:
      buf - text
      Returns:
      if buf has at least one CJK char.
    • isCJK

      public static boolean isCJK(Character.UnicodeBlock blk)
    • isChinese

      public static boolean isChinese(Character.UnicodeBlock blk)
    • isKorean

      public static boolean isKorean(Character.UnicodeBlock blk)
      Likely to be uniquely Korean if the character block is in Hangul. But also, it may be Korean if block is part of the CJK ideographs at large. User must check if text in its entirety is part of CJK & Hangul, independently. This method only detects if character block is uniquely Hangul or not.
      Parameters:
      blk - a Java Unicode block
      Returns:
      true if char block is Hangul
    • isJapanese

      public static boolean isJapanese(Character.UnicodeBlock blk)
      Checks if char block is uniquely Japanese. For other characters, check isChinese.
      Parameters:
      blk - a Java Unicode block
      Returns:
      true if char block is Hiragana or Katakana
    • compress

      public static byte[] compress(String buf) throws IOException
      Compress bytes from a Unicode string. Conversion to bytes first to avoid unicode or platform-dependent IO issues.
      Parameters:
      buf - UTF-8 encoded text
      Returns:
      byte array
      Throws:
      IOException - on error with compression or text encoding
    • compress

      public static byte[] compress(String buf, String charset) throws IOException
      Parameters:
      buf - text
      charset - character set encoding for text
      Returns:
      byte array for the compressed result
      Throws:
      IOException - on error with compression or text encoding
    • uncompress

      public static String uncompress(byte[] gzData) throws IOException
      Parameters:
      gzData - byte array containing gzipped buffer
      Returns:
      buffer UTF-8 decoded string
      Throws:
      IOException - on error with decompression or text encoding
    • uncompress

      public static String uncompress(byte[] gzData, String charset) throws IOException
      Parameters:
      gzData - byte array containing gzipped buffer
      charset - character set decoding for text
      Returns:
      buffer of uncompressed, decoded string
      Throws:
      IOException - on error with decompression or text encoding
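      A gzip round trip for text, as the compress and uncompress entries above describe, can be sketched with java.util.zip. The names are hypothetical, UTF-8 is assumed as the default charset, and IOException is wrapped in UncheckedIOException only to keep the example compact.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

final class GzipTextSketch {
    /** Encodes text as UTF-8 bytes first, then gzips the bytes. */
    static byte[] compress(String buf) {
        try {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
                gz.write(buf.getBytes(StandardCharsets.UTF_8));
            }
            return out.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Gunzips the bytes, then decodes them as UTF-8 text. */
    static String uncompress(byte[] gzData) {
        try (GZIPInputStream gz = new GZIPInputStream(new ByteArrayInputStream(gzData))) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] chunk = new byte[4096];
            int n;
            while ((n = gz.read(chunk)) != -1) {
                out.write(chunk, 0, n);
            }
            return new String(out.toByteArray(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String text = "Caf\u00e9 \u4e2d\u6587";
        System.out.println(text.equals(uncompress(compress(text))));
    }
}
```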
    • removeEmoticons

      public static String removeEmoticons(String t)
      Replace emoticons with something less nefarious -- UTF-16 characters do not play well with some I/O routines.
      Parameters:
      t - text
      Returns:
      scrubbed text
    • removeSymbols

      public static String removeSymbols(String t)
      Replace symbology
      Parameters:
      t - text
      Returns:
      scrubbed text
    • countNonText

      public static int countNonText(String t)
      Count the number of non-alphanumeric chars present.
      Parameters:
      t - text
      Returns:
      count of chars
    • parseHashTags

      public static Set<String> parseHashTags(String tweetText)
      Parse the typical Twitter hashtag variants.
      Parameters:
      tweetText -
      Returns:
    • parseHashTags

      public static Set<String> parseHashTags(String tweetText, boolean normalize)
      Takes a string and returns all the hashtags in it. Normalized tags are just lowercased and deduplicated. https://developer.twitter.com/en/docs/tweets/data-dictionary/overview/intro-to-tweet-json
      Parameters:
      tweetText - text
      normalize - whether to normalize text by lowercasing tags, etc.
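      Hashtag extraction with optional normalization might be sketched with a regular expression. This is an illustration only: the pattern, the choice to return tags without the leading "#", and the names are all assumptions, and real Twitter hashtag rules are more involved.

```java
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class HashTagSketch {
    // '#' followed by one or more word chars, with Unicode-aware \w.
    private static final Pattern TAG =
            Pattern.compile("#(\\w+)", Pattern.UNICODE_CHARACTER_CLASS);

    /** Collects tags in order of appearance; the Set removes duplicates. */
    static Set<String> parse(String text, boolean normalize) {
        Set<String> tags = new LinkedHashSet<>();
        Matcher m = TAG.matcher(text);
        while (m.find()) {
            String tag = m.group(1);
            tags.add(normalize ? tag.toLowerCase(Locale.ROOT) : tag);
        }
        return tags;
    }

    public static void main(String[] args) {
        System.out.println(parse("Go #Java and #java, also #OpenSextant", true));
    }
}
```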
    • parseNaturalLanguage

      public static String parseNaturalLanguage(String raw)
      see default implementation below
      Parameters:
      raw - raw text
      Returns:
      cleaner looking text
      See Also:
    • parseNaturalLanguage

      public static String parseNaturalLanguage(String raw, boolean unescapeHtml, boolean remURLs, boolean remTags, boolean remEntities)
      Given tweet text or any [social media] text, remove entities or other markers:
           - URLs are removed
           - entities are stripped of "@"
           - hashtags are stripped of "#"
           - HTML: &amp; is converted to an ampersand
           - HTML: escaped angle brackets are replaced with { and } for gt and lt, respectively
           - HTML: remaining special chars are converted back to unicode; remaining ampersand is replaced with "+"
      Whitespaces (space, newlines, tabs, etc.) are reduced.
      DEPRECATED: the use of the tags=true flag to replace hashtags with blank is not supported. #tag<unicode text> is a problem: it is hard to tell in some cases where the hashtag ends. In Weibo, #tag#<unicode text> is used to denote that a tag has a start and an end, but in Twitter the tag format is "#tag" or "#[phrase here]" etc. So there is no generic hashtag replacement.
      Parameters:
      raw - original text
      unescapeHtml - unescape HTML
      remURLs - remove URLs
      remTags - remove hash tags
      remEntities - remove other entities
      Returns:
      text less entities.
    • parseDate

      public static final Date parseDate(String dt)
      A limited-scope date parsing: Parse properly formatted strings for example, ISO date/time strings stored in one of our Solr indices.
      Parameters:
      dt - ISO date/time string.
      Returns: