Class TextUtils


  • public class TextUtils
    extends java.lang.Object
    Author:
    ubaldino
    • Constructor Summary

      Constructors 
      Constructor Description
      TextUtils()  
    • Method Summary

      Modifier and Type Method Description
      static void addLanguage​(Language lg)  
      static void addLanguage​(Language lg, boolean override)
      Extend the basic language dictionary.
      static java.lang.String b2hex​(byte[] barr)  
      static boolean checkCase​(java.lang.String text, int textcase)
      Detects whether the alphabetic chars of a string are purely lower case or purely upper case, as selected by textcase.
      static byte[] compress​(java.lang.String buf)
      Compress bytes from a Unicode string.
      static byte[] compress​(java.lang.String buf, java.lang.String charset)  
      static int count_digits​(java.lang.String txt)
      Counts all digits in text.
      static int count_ws​(java.lang.String txt)
      Counts all whitespace in text.
      static int countASCIIChars​(byte[] data)
      Counts the number of ASCII bytes.
      static int countCJKChars​(char[] chars)
      Counts the CJK characters in a char buffer. Inspiration: http://stackoverflow.com/questions/1499804/how-can-i-detect-japanese-text-in-a-java-string The char array is assumed to hold Unicode characters.
      static int countDigits​(java.lang.String txt)  
      static int countFormattingSpace​(java.lang.String txt)
      Count formatting whitespace.
      static int countIrregularPunctuation​(java.lang.String t)  
      static int countNonText​(java.lang.String t)
      Counts the non-alphanumeric chars present.
      static java.lang.String delete_controls​(java.lang.String t)
      Delete control chars from text data; leaving text and whitespace only.
      static java.lang.String delete_eol​(java.lang.String t)
      Replace line endings with SPACE
      static java.lang.String delete_whitespace​(java.lang.String t)
      Delete whitespace of any sort.
      static java.lang.String fast_replace​(java.lang.String buf, java.lang.String replace, java.lang.String substitution)
      Given a string S and a list of characters to replace with a substitute, return the new string, S'.
      static int[] get_text_window​(int offset, int textsize, int width)
      Get a single text window around the offset.
      static int[] get_text_window​(int offset, int matchlen, int textsize, int width)
      Find the text window(s) around a match.
      static Language getLanguage​(java.lang.String code)
      ISO2 and ISO3 char codes for languages are unique.
      static java.lang.String getLanguageCode​(java.lang.String code)
      ISO2 and ISO3 char codes for languages are unique.
      static java.util.Map<java.lang.String,​Language> getLanguageMap()
      If caller wants to add language they can.
      static java.lang.String getLanguageName​(java.lang.String code)
      Given an ISO2 char code (least common denominator) retrieve Language Name.
      static boolean hasCJKText​(java.lang.String buf)
      A simple test to see if text has any CJK characters at all.
      static boolean hasDiacritics​(java.lang.String s)
      If a string has extended latin diacritics.
      static boolean hasDigits​(java.lang.String txt)  
      static boolean hasIrregularPunctuation​(java.lang.String t)
      Simple triage of punctuation.
      static void initLanguageData()
      Initialize language codes and metadata.
      static void initLOCLanguageData()
      This is Library of Congress data for language IDs.
      static boolean isASCII​(byte[] data)  
      static boolean isASCII​(char c)  
      static boolean isASCII​(java.lang.String t)
      Early exit test -- return false on first non-ASCII character found.
      static boolean isASCIILetter​(char c)  
      static boolean isChinese​(java.lang.Character.UnicodeBlock blk)  
      static boolean isChinese​(java.lang.String x)
      Utility method to check if lang ID is Chinese(Traditional or Simplified)...
      static boolean isCJK​(java.lang.Character.UnicodeBlock blk)  
      static boolean isCJK​(java.lang.String x)
      Utility method to check if lang ID is Chinese, Korean, or Japanese
      static boolean isEnglish​(java.lang.String x)
      Utility method to check if lang ID is English...
      static boolean isEuroLanguage​(java.lang.String l)
      European languages = Romance + GER + ENG Extend definition as needed.
      static boolean isJapanese​(java.lang.Character.UnicodeBlock blk)
      Checks if char block is uniquely Japanese.
      static boolean isKorean​(java.lang.Character.UnicodeBlock blk)
      Likely to be uniquely Korean if the character block is in Hangul.
      static boolean isLatin​(java.lang.String data)
      Checks if non-ASCII and non-LATIN characters are present.
      static boolean isLower​(java.lang.String dat)  
      static boolean isLowerCaseDocument​(int[] counts)
      Measures the amount of lower case in text. See isUpperCaseDocument(int[]).
      static boolean isNumeric​(java.lang.String v)
      StringUtils in commons isNumeric("1.234") is NOT numeric.
      static boolean isRomanceLanguage​(java.lang.String l)
      Romance languages = SPA + POR + ITA + FRA + ROM Extend definition as needed.
      static boolean isUpper​(java.lang.String dat)
      For measuring the upper-case-ness of short texts.
      static boolean isUpperCaseDocument​(int[] counts)
      First call measureCase(text) to acquire counts, then call this routine for a heuristic that suggests the text is mainly upper case.
      static java.lang.String md5_id​(byte[] digest)
      Deprecated.
      not MD5 specific.
      static int[] measureCase​(java.lang.String text)
      Measure character count, upper, lower, non-Character, whitespace
      static double measureCJKText​(java.lang.String buf)
      Returns the ratio of Chinese/Japanese/Korean characters: CJK chars / ALL chars. TODO: needs testing; not sure if the if-block is sustainable, or if it is comprehensive.
      static java.lang.String normalizeAbbreviation​(java.lang.String word)
      Intended only as a filter for punctuation within a word.
      static java.lang.String normalizeTextEntity​(java.lang.String str)
      Normalization: Clean the ends, Remove Line-endings from middle of entity.
      static java.lang.String normalizeUnicode​(java.lang.String str)
      Normalize to "Normalization Form Canonical Decomposition" (NFD). REF: http://stackoverflow.com/questions/3610013/file-listfiles-mangles-unicode-names-with-jdk-6-unicode-normalization-issues This supports proper file name retrieval from the file system, among other things.
      static java.util.Date parseDate​(java.lang.String dt)
      Limited-scope date parsing: parses properly formatted strings, for example ISO date/time strings stored in one of our Solr indices.
      static java.util.Set<java.lang.String> parseHashTags​(java.lang.String tweetText)
      Parse the typical Twitter hashtag variants.
      static java.util.Set<java.lang.String> parseHashTags​(java.lang.String tweetText, boolean normalize)
      Takes a string and returns all the hashtags in it.
      static java.lang.String parseNaturalLanguage​(java.lang.String raw)
      see default implementation below
      static java.lang.String parseNaturalLanguage​(java.lang.String raw, boolean unescapeHtml, boolean remURLs, boolean remTags, boolean remEntities)
      Given tweet text or any [social media] text, remove entities or other markers: URLs are removed; entities are stripped of "@"; hashtags are stripped of "#"; HTML: &amp; is converted to an ampersand; HTML: escaped angle brackets are replaced with { and } for gt and lt, respectively; HTML: remaining special chars are converted back to Unicode, and any remaining ampersand is replaced with "+". Whitespace (spaces, newlines, tabs, etc.) is reduced.
      static java.lang.String phoneticReduction​(java.lang.String t)
      Create a non-diacritic, ASCII version of the input string.
      static java.lang.String phoneticReduction​(java.lang.String t, boolean isAscii)  
      static java.lang.String reduce_line_breaks​(java.lang.String t)
      Replaces all 3 or more blank lines with a single paragraph break (\n\n)
      static java.lang.String removeAny​(java.lang.String buf, java.lang.String remove)
      Remove instances of any char in the remove string from buf
      static java.lang.String removeAnyLeft​(java.lang.String buf, java.lang.String remove)
      Compare to trim( string, chars ), except that any chars can be trimmed. Example: given "- a b c", remove "-" from the left of the string.
      static java.lang.String removeDiacritics​(java.lang.String word)
      Supports the Phoneticizer utility from OpenSextant v1.x. Removes diacritics from a phrase.
      static java.lang.String removeEmoticons​(java.lang.String t)
      Replace emoticons with something less nefarious; UTF-16 characters do not play well with some I/O routines.
      static java.lang.String removePunctuation​(java.lang.String word)
      Remove any leading and trailing punctuation and some internal punctuation.
      static java.lang.String removeSymbols​(java.lang.String t)
      Replace symbology
      static java.lang.String replaceAny​(java.lang.String buf, java.lang.String remove, java.lang.String sub)
      Replace any of the removal chars with the sub.
      static java.lang.String replaceDiacritics​(java.lang.String s)
      A thorough replacement of diacritics and Unicode chars to their ASCII equivalents.
      static java.lang.String replaceDiacriticsOriginal​(java.lang.String s)
      Deprecated.
      See replaceDiacritics as the replacement.
      static java.lang.String squeeze_whitespace​(java.lang.String t)
      Minimize whitespace.
      static java.util.List<java.lang.String> string2list​(java.lang.String s, java.lang.String delim)
      Get a list of values into a nice, scrubbed array of values, no whitespace.
      static java.lang.String text_id​(java.lang.String text)
      Static method -- use only if you are sure of thread-safety.
      static java.lang.String[] tokens​(java.lang.String str)
      Return just white-space delimited tokens.
      static java.lang.String[] tokensLeft​(java.lang.String str)
      See tokensRight()
      static java.lang.String[] tokensRight​(java.lang.String str)
      Return tokens on the right most part of a buffer.
      static java.lang.String uncompress​(byte[] gzData)  
      static java.lang.String uncompress​(byte[] gzData, java.lang.String charset)  
      • Methods inherited from class java.lang.Object

        clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
    • Constructor Detail

      • TextUtils

        public TextUtils()
    • Method Detail

      • hasIrregularPunctuation

        public static boolean hasIrregularPunctuation​(java.lang.String t)
        Simple triage of punctuation. Rationale: OpenSextant taggers maximize RECALL in favor of not missing a possible match. The problem is that we often encounter substantial noise in tagger output, so a trivial test is to see if we have overmatched. Allowed punctuation: , . - _ ` ' ( ) ## Diacritics, parenthetics, periods/dashes.
             Given phrase "A B C"  we may have matched:
                          "A|B+C", "A; B; C", "A <B> C" etc...  where common punctuation separates valid tokens
                          that appear in the reference phrase.
         
        Parameters:
        t -
        Returns:
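        The triage described above can be sketched with the JDK alone. This is a minimal illustration, not OpenSextant's implementation: the class and method names are hypothetical, and the sketch assumes that any character that is not a letter, digit, whitespace, or one of the allowed marks counts as irregular.

```java
final class PunctuationTriageSketch {
    // Allowed marks as listed in the description: , . - _ ` ' ( )
    private static final String ALLOWED = ",.-_`'()";

    /** Counts characters that are neither text, whitespace, nor allowed punctuation. */
    static int countIrregular(String t) {
        int count = 0;
        for (int i = 0; i < t.length(); i++) {
            char c = t.charAt(i);
            if (Character.isLetterOrDigit(c) || Character.isWhitespace(c)) {
                continue; // letters (accented ones included), digits, spaces
            }
            if (ALLOWED.indexOf(c) < 0) {
                count++; // e.g. | + ; < >
            }
        }
        return count;
    }

    /** Assumption: a single irregular mark is enough to flag the phrase. */
    static boolean hasIrregular(String t) {
        return countIrregular(t) > 0;
    }
}
```

        With this sketch, a clean phrase such as "A B C" passes, while overmatched spans such as "A|B+C" or "A <B> C" are flagged.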
      • countIrregularPunctuation

        public static int countIrregularPunctuation​(java.lang.String t)
      • isLatin

        public static final boolean isLatin​(java.lang.String data)
        Checks if non-ASCII and non-LATIN characters are present.
        Parameters:
        data - any textual data
        Returns:
        true if content is strictly ASCII or Latin1 extended.
      • hasDiacritics

        public static final boolean hasDiacritics​(java.lang.String s)
        If a string has extended latin diacritics.
        Parameters:
        s - string to test
        Returns:
        true if a single diacritic is found.
      • phoneticReduction

        public static java.lang.String phoneticReduction​(java.lang.String t)
        Create a non-diacritic, ASCII version of the input string. The result keeps the original whitespace but drops non-character markings, e.g. "Za'tut" => "Zatut", not "Za tut".
        Parameters:
        t -
        Returns:
      • phoneticReduction

        public static java.lang.String phoneticReduction​(java.lang.String t,
                                                         boolean isAscii)
      • replaceDiacritics

        public static final java.lang.String replaceDiacritics​(java.lang.String s)
        A thorough replacement of diacritics and Unicode chars to their ASCII equivalents.
        Parameters:
        s - the string
        Returns:
        converted string
      • replaceDiacriticsOriginal

        @Deprecated
        public static java.lang.String replaceDiacriticsOriginal​(java.lang.String s)
        Deprecated.
        See replaceDiacritics as the replacement.
        Removes accents from a string and replaces them with ASCII equivalents. Reference: http://www.rgagnon.com/javadetails/java-0456.html Caveat: this implementation is not exhaustive.
        Parameters:
        s -
        Returns:
        See Also:
        replaceDiacritics(String)
      • isASCII

        public static final boolean isASCII​(char c)
        Parameters:
        c - a character
        Returns:
        true if c is ASCII
      • isASCIILetter

        public static final boolean isASCIILetter​(char c)
        Parameters:
        c - character
        Returns:
        true if c is ASCII a-z or A-Z
      • isASCII

        public static boolean isASCII​(byte[] data)
        Parameters:
        data - bytes to test
        Returns:
        true if data is ASCII, false otherwise
      • isASCII

        public static boolean isASCII​(java.lang.String t)
        Early exit test -- return false on first non-ASCII character found.
        Parameters:
        t - buffer of text
        Returns:
        true only if every char is in ASCII table.
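        The early-exit test can be sketched as a single loop. This is an illustration under assumptions, not the library source: the class name is hypothetical, ASCII is taken to mean code points 0x00 to 0x7F, and an empty string is treated as ASCII.

```java
final class AsciiCheckSketch {
    /** Returns false at the first char above 0x7F; never scans past it. */
    static boolean isAscii(String t) {
        for (int i = 0; i < t.length(); i++) {
            if (t.charAt(i) > 0x7F) {
                return false; // early exit on the first non-ASCII char
            }
        }
        return true; // every char was in the ASCII table
    }
}
```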
      • countASCIIChars

        public static int countASCIIChars​(byte[] data)
        Counts the number of ASCII bytes.
        Parameters:
        data - bytes to count
        Returns:
        count of ASCII bytes
      • reduce_line_breaks

        public static java.lang.String reduce_line_breaks​(java.lang.String t)
        Replaces all 3 or more blank lines with a single paragraph break (\n\n)
        Parameters:
        t - text
        Returns:
        A string with fewer line breaks;
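        One way to get this effect is a regular expression over runs of line endings. The sketch below is an assumption about the behavior, not the library code: it treats three or more consecutive line endings (LF or CRLF) as the run to collapse, and the class name is hypothetical.

```java
import java.util.regex.Pattern;

final class LineBreakSketch {
    // Three or more consecutive line endings, either \n or \r\n.
    private static final Pattern MULTI_EOL = Pattern.compile("(?:\\r?\\n){3,}");

    /** Collapses each long run of line endings into one paragraph break. */
    static String reduceLineBreaks(String t) {
        return MULTI_EOL.matcher(t).replaceAll("\n\n");
    }
}
```

        Single line breaks and existing paragraph breaks are left untouched, so only the excess vertical space is removed.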
      • delete_whitespace

        public static java.lang.String delete_whitespace​(java.lang.String t)
        Delete whitespace of any sort.
        Parameters:
        t - text
        Returns:
        String, without whitespace.
      • squeeze_whitespace

        public static java.lang.String squeeze_whitespace​(java.lang.String t)
        Minimize whitespace.
        Parameters:
        t - text
        Returns:
        scrubbed string
      • delete_eol

        public static java.lang.String delete_eol​(java.lang.String t)
        Replace line endings with SPACE
        Parameters:
        t - text
        Returns:
        scrubbed string
      • delete_controls

        public static java.lang.String delete_controls​(java.lang.String t)
        Delete control chars from text data; leaving text and whitespace only. Delete char (^?) is also removed. Length may differ if ctl chars are removed.
        Parameters:
        t - text
        Returns:
        scrubbed buffer
      • hasDigits

        public static boolean hasDigits​(java.lang.String txt)
      • countDigits

        public static int countDigits​(java.lang.String txt)
      • count_digits

        public static int count_digits​(java.lang.String txt)
        Counts all digits in text.
        Parameters:
        txt - text to count
        Returns:
        count of digits
      • isNumeric

        public static final boolean isNumeric​(java.lang.String v)
        StringUtils in commons isNumeric("1.234") is NOT numeric. Here "1.234" is numeric.
        Parameters:
        v - val to parse
        Returns:
        true if val is a number
      • count_ws

        public static int count_ws​(java.lang.String txt)
        Counts all whitespace in text.
        Parameters:
        txt - text
        Returns:
        whitespace count
      • countFormattingSpace

        public static int countFormattingSpace​(java.lang.String txt)
        Count formatting whitespace. This is helpful in determining if text spans are phrases with multiple TAB or EOL characters. For that matter, any control character contributes to formatting in some way. DEL, VT, HT, etc. So all control characters ( c < ' ') are counted.
        Parameters:
        txt - input string
        Returns:
        count of format chars
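        The counting rule stated above, c < ' ', is small enough to sketch directly. The class name is hypothetical and this is not the library source; it only applies the comparison given in the description.

```java
final class FormattingSpaceSketch {
    /** Counts chars below the space character (0x20), e.g. TAB, LF, CR, VT. */
    static int countFormatting(String txt) {
        int count = 0;
        for (int i = 0; i < txt.length(); i++) {
            // Note: the comparison c < ' ' does not cover DEL (0x7F).
            if (txt.charAt(i) < ' ') {
                count++;
            }
        }
        return count;
    }
}
```

        A span with a high count relative to its length is likely to cross table cells or line boundaries rather than be a plain phrase.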
      • isUpper

        public static boolean isUpper​(java.lang.String dat)
        For measuring the upper-case-ness of short texts. Returns true if ALL letters in text are UPPERCASE. Allows for non-letters in text.
        Parameters:
        dat - text or data
        Returns:
        true if text is Upper
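        A sketch of the rule "all letters are upper case, non-letters allowed" follows. It is illustrative only: the class name is hypothetical, and returning false for text that contains no letters at all is an assumption, since the description does not cover that case.

```java
final class UpperCaseSketch {
    /** True if the text has at least one letter and every letter is upper case. */
    static boolean isUpper(String dat) {
        boolean sawLetter = false;
        for (int i = 0; i < dat.length(); i++) {
            char c = dat.charAt(i);
            if (!Character.isLetter(c)) {
                continue; // digits, punctuation and spaces are ignored
            }
            sawLetter = true;
            if (!Character.isUpperCase(c)) {
                return false; // one lower-case letter is enough to fail
            }
        }
        return sawLetter;
    }
}
```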
      • isLower

        public static boolean isLower​(java.lang.String dat)
      • checkCase

        public static boolean checkCase​(java.lang.String text,
                                        int textcase)
        Detects whether the alphabetic chars of a string are purely lower case or purely upper case, as selected by textcase.
        Parameters:
        text - text
        textcase - 1 lower, 2 upper
        Returns:
        if case matches given textcase param
      • measureCase

        public static int[] measureCase​(java.lang.String text)
        Measure character count, upper, lower, non-Character, whitespace
        Parameters:
        text - text
        Returns:
        int array with counts.
      • isUpperCaseDocument

        public static boolean isUpperCaseDocument​(int[] counts)
        First call measureCase(text) to acquire counts, then call this routine for a heuristic that suggests the text is mainly upper case. These routines may not work well on languages that do not use the Latin alphabet.
        Parameters:
        counts - word stats from measureCase()
        Returns:
        true if counts represent text that exceeds the "UPPER CASE" threshold
      • isLowerCaseDocument

        public static boolean isLowerCaseDocument​(int[] counts)
        Measures the amount of lower case in text. See isUpperCaseDocument(int[]). There are two ways to measure: lower case count compared to all content (char+non-char), or compared to just char content.
        Parameters:
        counts - word stats from measureCase()
        Returns:
        true if counts represent text that exceeds the "lower case" threshold
      • get_text_window

        public static int[] get_text_window​(int offset,
                                            int matchlen,
                                            int textsize,
                                            int width)
        Find the text window(s) around a match. Given the size of a buffer, the match and desired width return
         prepreprepre      MATCH        postpostpost
         ^           ^                  ^          ^
         l-width     l                 l+len   l+len+width
         left1     left2              right1    right2
         
        Parameters:
        offset - offset of match
        width - width of window left and right of match
        textsize - size of buffer containing match; used for boundary conditions
        matchlen - length of match
        Returns:
        window offsets left of match, right of match: [ l1, l2, r1, r2 ]
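        The diagram above maps onto four offsets that can be computed with simple clamping. The sketch below is an assumption about the boundary handling (clamping to 0 and to textsize), not the library implementation, and the class and method names are hypothetical.

```java
final class TextWindowSketch {
    /** Returns [left1, left2, right1, right2] as in the diagram, clamped to the buffer. */
    static int[] window(int offset, int matchlen, int textsize, int width) {
        int left1 = Math.max(0, offset - width);           // l - width, not below 0
        int left2 = offset;                                // l, start of match
        int right1 = Math.min(textsize, offset + matchlen); // l + len, end of match
        int right2 = Math.min(textsize, right1 + width);    // l + len + width
        return new int[] {left1, left2, right1, right2};
    }
}
```

        For a 5-char match at offset 10 in a 100-char buffer with width 20, this yields [0, 10, 15, 35]: the left window is cut short by the start of the buffer.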
      • get_text_window

        public static int[] get_text_window​(int offset,
                                            int textsize,
                                            int width)
        Get a single text window around the offset.
        Parameters:
        offset - offset of match
        width - width of window left and right of match
        textsize - size of buffer containing match; used for boundary conditions
        Returns:
        window offsets of a text span containing match [ left, right ]
      • text_id

        public static java.lang.String text_id​(java.lang.String text)
                                        throws java.security.NoSuchAlgorithmException,
                                               java.io.UnsupportedEncodingException
        Static method -- use only if you are sure of thread-safety.
        Parameters:
        text - text or data
        Returns:
        identifier for the text, an MD5 hash
        Throws:
        java.security.NoSuchAlgorithmException - on err
        java.io.UnsupportedEncodingException - on err
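        An MD5-based text identifier can be sketched with java.security.MessageDigest. This is not the library code: the class name is hypothetical, UTF-8 encoding and lower-case hex output are assumptions, and the checked exception is wrapped for brevity. Creating a new MessageDigest per call sidesteps the thread-safety concern, since a digest instance must not be shared between threads.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

final class TextIdSketch {
    /** Renders bytes as lower-case hexadecimal, two digits per byte. */
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    /** MD5 of the UTF-8 bytes of the text, as a hex string. */
    static String textId(String text) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5"); // fresh instance per call
            return toHex(md.digest(text.getBytes(StandardCharsets.UTF_8)));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }
}
```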
      • b2hex

        public static java.lang.String b2hex​(byte[] barr)
      • md5_id

        public static java.lang.String md5_id​(byte[] digest)
        Deprecated.
        Not MD5 specific. Use b2hex(byte[]) instead.
        Parameters:
        digest - byte array
        Returns:
        hash for the data
      • string2list

        public static java.util.List<java.lang.String> string2list​(java.lang.String s,
                                                                   java.lang.String delim)
        Get a list of values into a nice, scrubbed array of values, no whitespace. Example: "a, b, c d e, f" => [ "a", "b", "c d e", "f" ]
        Parameters:
        s - string to split
        delim - delimiter, no default.
        Returns:
        list of split strings, which are also whitespace trimmed
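        The split-and-trim behavior in the example can be sketched as follows. This is an illustration, not the library source: the class name is hypothetical, the delimiter is treated as a literal string rather than a regex, and how the real method handles empty items is not specified here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

final class StringListSketch {
    /** Splits on a literal delimiter and trims whitespace around each item. */
    static List<String> toList(String s, String delim) {
        List<String> items = new ArrayList<>();
        for (String part : s.split(Pattern.quote(delim))) {
            items.add(part.trim()); // inner spaces such as "c d e" are kept
        }
        return items;
    }
}
```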
      • fast_replace

        public static java.lang.String fast_replace​(java.lang.String buf,
                                                    java.lang.String replace,
                                                    java.lang.String substitution)
        Given a string S and a list of characters to replace with a substitute, return the new string, S'. Example: in "-name-with.invalid characters;", replacing any of "-. ;" with "_" yields "_name_with_invalid_characters_".
        Parameters:
        buf - buffer
        replace - string of characters to replace with the one substitute char
        substitution - string to insert in place of chars
        Returns:
        scrubbed text
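        The many-to-one replacement in the example above can be sketched with a single pass over the buffer. The class and method names are hypothetical and the loop is an illustration of the idea, not the library's implementation.

```java
final class FastReplaceSketch {
    /** Replaces every char of buf that occurs in 'replace' with 'substitution'. */
    static String replaceChars(String buf, String replace, String substitution) {
        StringBuilder out = new StringBuilder(buf.length());
        for (int i = 0; i < buf.length(); i++) {
            char c = buf.charAt(i);
            if (replace.indexOf(c) >= 0) {
                out.append(substitution); // char is in the replace set
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }
}
```

        Avoiding a regex here means characters such as "." or "-" in the replace set need no escaping.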
      • removeAny

        public static java.lang.String removeAny​(java.lang.String buf,
                                                 java.lang.String remove)
        Remove instances of any char in the remove string from buf
        Parameters:
        buf - text
        remove - string to remove
        Returns:
        scrubbed text
      • replaceAny

        public static java.lang.String replaceAny​(java.lang.String buf,
                                                  java.lang.String remove,
                                                  java.lang.String sub)
        Replace any of the removal chars with the sub. A many-to-one replacement. Alternative: use a regex character class with String.replaceAll().
        Parameters:
        buf - text
        remove - string to replace
        sub - the replacement string
        Returns:
        scrubbed text
      • removeAnyLeft

        public static java.lang.String removeAnyLeft​(java.lang.String buf,
                                                     java.lang.String remove)
        Compare to trim( string, chars ), except that any chars can be trimmed. Example: given "- a b c", remove "-" from the left of the string.
        Parameters:
        buf - text
        remove - string to remove
        Returns:
        scrubbed text
      • normalizeTextEntity

        public static java.lang.String normalizeTextEntity​(java.lang.String str)
        Normalization: Clean the ends, Remove Line-endings from middle of entity.
          Example:
                TEXT: **The Daily Newsletter of \n\rBarbara, So.**
               CLEAN: __The Daily Newsletter of __Barbara, So___
        
         Where "__" represents omitted characters.
         
        Parameters:
        str - text
        Returns:
        scrubbed text
      • tokens

        public static java.lang.String[] tokens​(java.lang.String str)
        Return just white-space delimited tokens.
        Parameters:
        str - text
        Returns:
        tokens
      • tokensRight

        public static final java.lang.String[] tokensRight​(java.lang.String str)
        Return tokens on the right most part of a buffer. If a para break occurs, \n\n or \r\n\r\n, then return the part on the right of the break.
        Parameters:
        str - text
        Returns:
        whitespace delimited tokens
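        The paragraph-break rule can be sketched as follows. This is an assumption-laden illustration, not the library source: the class name is hypothetical, and when several paragraph breaks occur the sketch keeps only the text after the last one.

```java
final class TokensRightSketch {
    /** Tokens of the text that follows the last paragraph break, if any. */
    static String[] tokensRight(String str) {
        int lf = str.lastIndexOf("\n\n");
        int crlf = str.lastIndexOf("\r\n\r\n");
        // Start just past whichever break ends later; 0 when there is no break.
        int cut = Math.max(lf >= 0 ? lf + 2 : 0, crlf >= 0 ? crlf + 4 : 0);
        String tail = str.substring(cut).trim();
        return tail.isEmpty() ? new String[0] : tail.split("\\s+");
    }
}
```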
      • tokensLeft

        public static final java.lang.String[] tokensLeft​(java.lang.String str)
        See tokensRight()
        Parameters:
        str - text
        Returns:
        whitespace delimited tokens
      • normalizeAbbreviation

        public static java.lang.String normalizeAbbreviation​(java.lang.String word)
        Intended only as a filter for punctuation within a word. Text of the form A.T.T. or U.S. becomes ATT and US. A text such as Mr.Pibbs incorrectly becomes MrPibbs but for the purposes of normalizing tokens this should be fine. Use appropriate tokenization prior to using this as a filter.
        Parameters:
        word - phrase with periods denoting some abbreviation.
        Returns:
        scrubbed text
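        The examples above amount to dropping periods inside a token. A minimal sketch, assuming periods are the only punctuation removed (the real filter may handle more) and using a hypothetical class name:

```java
final class AbbreviationSketch {
    /** Removes periods from a single token, e.g. an abbreviation. */
    static String normalize(String word) {
        return word.replace(".", ""); // literal replace, not a regex
    }
}
```

        As the description warns, a poorly tokenized input such as "Mr.Pibbs" also collapses to "MrPibbs", so tokenization should come first.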
      • removeDiacritics

        public static java.lang.String removeDiacritics​(java.lang.String word)
        Supports the Phoneticizer utility from OpenSextant v1.x. Removes diacritics from a phrase.
        Parameters:
        word - text
        Returns:
        scrubbed text
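        A common JDK-only way to strip diacritics is to decompose the text and then drop the combining marks. The sketch below shows that technique; it is not necessarily how this method is implemented, and the class name is hypothetical. Letters with no decomposition (for example "ø") are left unchanged by this approach.

```java
import java.text.Normalizer;

final class DiacriticSketch {
    /** Decomposes to NFD, then removes Unicode combining marks (category M). */
    static String stripMarks(String word) {
        String decomposed = Normalizer.normalize(word, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "");
    }
}
```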
      • normalizeUnicode

        public static java.lang.String normalizeUnicode​(java.lang.String str)
        Normalize to "Normalization Form Canonical Decomposition" (NFD). REF: http://stackoverflow.com/questions/3610013/file-listfiles-mangles-unicode-names-with-jdk-6-unicode-normalization-issues This supports proper file name retrieval from the file system, among other things. In many situations we see Unicode file names: Java can list them, but when the Java-provided version of the filename is used, the OS/FS may not be able to find the file by the name given in a particular normalized form.
        Parameters:
        str - text
        Returns:
        normalized string, encoded with NFD bytes
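        NFD normalization is available in the JDK through java.text.Normalizer. The sketch below shows the effect on a precomposed character; the class name is hypothetical and this is an illustration rather than the library source.

```java
import java.text.Normalizer;

final class NfdSketch {
    /** Returns the canonical decomposition (NFD) of the input. */
    static String toNfd(String str) {
        return Normalizer.normalize(str, Normalizer.Form.NFD);
    }
}
```

        After decomposition, the single char U+00E9 becomes the two chars "e" plus U+0301 (combining acute accent), so two visually identical file names can differ in length and in bytes.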
      • removePunctuation

        public static java.lang.String removePunctuation​(java.lang.String word)
        Remove any leading and trailing punctuation and some internal punctuation. Internal punctuation which indicates conjunction of two tokens, e.g. a hyphen, should have caused a split into separate tokens at the tokenization stage. Supports the Phoneticizer utility from OpenSextant v1.x. Removes punctuation from a phrase.
        Parameters:
        word - text
        Returns:
        scrubbed text
      • getLanguageMap

        public static java.util.Map<java.lang.String,​Language> getLanguageMap()
        If caller wants to add language they can.
        Returns:
        map of lang ID to language obj
      • initLanguageData

        public static void initLanguageData()
        Initialize language codes and metadata. This establishes a map for the most common language codes/names that exist in at least ISO-639-1 and have a non-zero 2-char ID.
         Based on:
         http://stackoverflow.com/questions/674041/is-there-an-elegant-way-to-convert-iso-639-2-3-letter-language-codes-to-java-lo
        
         Actual code mappings:
           en => eng
           eng => en
        
           cel => '' // Celtic; Avoid this.
        
           tr => tur
           tur => tr
        
         Names:
           tr => turkish
           tur => turkish
           turkish => tr // ISO2 only
         
      • initLOCLanguageData

        public static void initLOCLanguageData()
                                        throws java.io.IOException
        This is Library of Congress data for language IDs. This is offered as a tool to help downstream language ID and enrich metadata when tagging data from particular countries. Reference: http://www.loc.gov/standards/iso639-2/ISO-639-2_utf-8.txt
        Throws:
        java.io.IOException - if resource file is not found
      • addLanguage

        public static void addLanguage​(Language lg)
      • addLanguage

        public static void addLanguage​(Language lg,
                                       boolean override)
        Extend the basic language dictionary. Note: the first language registered under a given name is listed in the language map by name and is not overwritten. Language objects may be overwritten in the map using lang codes. For example, fre = French(fre), fra = French(fra), and french = French(fra); the last one, 'french', could have been French(fre) or French(fra). Similarly, 'ger' and 'deu' are both valid ISO 3-alpha codes for German. What to do? TODO: Create a language object that lists both language biblio/terminology codes.
        Parameters:
        lg - language object
        override - if this value should overwrite an existing one.
      • getLanguageName

        public static java.lang.String getLanguageName​(java.lang.String code)
        Given an ISO2 char code (least common denominator) retrieve Language Name. This is best effort: if the code is not found, this returns the code normalized to lowercase.
        Parameters:
        code - lang ID
        Returns:
        name of language
      • getLanguage

        public static Language getLanguage​(java.lang.String code)
        ISO2 and ISO3 char codes for languages are unique.
        Parameters:
        code - iso2 or iso3 code
        Returns:
        the Language for the given code.
      • getLanguageCode

        public static java.lang.String getLanguageCode​(java.lang.String code)
        ISO2 and ISO3 char codes for languages are unique.
        Parameters:
        code - iso2 or iso3 code
        Returns:
        the other code.
      • isEuroLanguage

        public static boolean isEuroLanguage​(java.lang.String l)
        European languages = Romance + GER + ENG Extend definition as needed.
        Parameters:
        l - language ID
        Returns:
        true if language is European in nature
      • isRomanceLanguage

        public static boolean isRomanceLanguage​(java.lang.String l)
        Romance languages = SPA + POR + ITA + FRA + ROM Extend definition as needed.
        Parameters:
        l - lang ID
        Returns:
        true if language is a Romance language
      • isEnglish

        public static boolean isEnglish​(java.lang.String x)
        Utility method to check if lang ID is English...
        Parameters:
        x - a langcode
        Returns:
        whether langcode is english
      • isChinese

        public static boolean isChinese​(java.lang.String x)
        Utility method to check if lang ID is Chinese(Traditional or Simplified)...
        Parameters:
        x - a langcode
        Returns:
        whether langcode is chinese
      • isCJK

        public static boolean isCJK​(java.lang.String x)
        Utility method to check if lang ID is Chinese, Korean, or Japanese
        Parameters:
        x - a langcode
        Returns:
        whether langcode is a CJK language
      • measureCJKText

        public static double measureCJKText​(java.lang.String buf)
        Returns the ratio of Chinese/Japanese/Korean characters: CJK chars / ALL chars. TODO: needs testing; not sure if the if-block is sustainable, or if it is comprehensive. TODO: for performance reasons the internal chain of comparisons is embedded in the method; otherwise an external method invocation is required for each char.
        Parameters:
        buf - the text to be measured
        Returns:
        ratio of CJK characters to all characters in buf
      • countCJKChars

        public static int countCJKChars​(char[] chars)
        Counts the CJK characters in a char buffer. Inspiration: http://stackoverflow.com/questions/1499804/how-can-i-detect-japanese-text-in-a-java-string The char array is assumed to hold Unicode characters.
        Parameters:
        chars - char array for the text in question.
        Returns:
        count of CJK characters
      • hasCJKText

        public static boolean hasCJKText​(java.lang.String buf)
        A simple test to see if text has any CJK characters at all. It returns after the first such character.
        Parameters:
        buf - text
        Returns:
        true if buf has at least one CJK char.
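        The counting, ratio, and early-exit behaviours described for countCJKChars, measureCJKText, and hasCJKText can be sketched with java.lang.Character.UnicodeBlock. This is an illustration only: the class CjkSketch and its short block list are assumptions, and the library's own list of blocks may differ.

```java
import java.lang.Character.UnicodeBlock;

final class CjkSketch {
    // Assumed, deliberately short block list; not the library's actual list.
    static boolean isCjkBlock(UnicodeBlock b) {
        return b == UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS
            || b == UnicodeBlock.CJK_SYMBOLS_AND_PUNCTUATION
            || b == UnicodeBlock.HIRAGANA
            || b == UnicodeBlock.KATAKANA
            || b == UnicodeBlock.HANGUL_SYLLABLES
            || b == UnicodeBlock.HANGUL_JAMO;
    }

    /** Counts chars whose Unicode block is in the assumed CJK list. */
    static int countCjk(char[] chars) {
        int n = 0;
        for (char c : chars) {
            if (isCjkBlock(UnicodeBlock.of(c))) {
                n++;
            }
        }
        return n;
    }

    /** CJK chars / all chars; 0.0 for null or empty input. */
    static double ratio(String buf) {
        if (buf == null || buf.isEmpty()) {
            return 0.0;
        }
        return (double) countCjk(buf.toCharArray()) / buf.length();
    }

    /** Returns as soon as the first CJK char is seen. */
    static boolean hasCjk(String buf) {
        for (int i = 0; i < buf.length(); i++) {
            if (isCjkBlock(UnicodeBlock.of(buf.charAt(i)))) {
                return true;
            }
        }
        return false;
    }
}
```

        In this sketch the ratio divides by the full string length, whitespace included; whether the library excludes whitespace from the denominator is not stated above.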
      • isCJK

        public static boolean isCJK​(java.lang.Character.UnicodeBlock blk)
      • isChinese

        public static boolean isChinese​(java.lang.Character.UnicodeBlock blk)
      • isKorean

        public static boolean isKorean​(java.lang.Character.UnicodeBlock blk)
        Likely to be uniquely Korean if the character block is in Hangul. But also, it may be Korean if block is part of the CJK ideographs at large. User must check if text in its entirety is part of CJK & Hangul, independently. This method only detects if character block is uniquely Hangul or not.
        Parameters:
        blk - a Java Unicode block
        Returns:
        true if char block is Hangul
      • isJapanese

        public static boolean isJapanese​(java.lang.Character.UnicodeBlock blk)
        Checks if char block is uniquely Japanese. Check other chars with isChinese.
        Parameters:
        blk - a Java Unicode block
        Returns:
        true if char block is Hiragana or Katakana
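        The block-level checks can be illustrated as below. This is a sketch under assumptions: the class CjkBlocks is hypothetical, and the Hangul blocks listed are a guess at what "uniquely Hangul" covers. Shared Han ideographs are deliberately reported as neither Korean nor Japanese, which is why callers have to inspect the text as a whole.

```java
import java.lang.Character.UnicodeBlock;

final class CjkBlocks {
    /** True only for Hangul blocks (assumed list). */
    static boolean isKorean(UnicodeBlock blk) {
        return blk == UnicodeBlock.HANGUL_SYLLABLES
            || blk == UnicodeBlock.HANGUL_JAMO
            || blk == UnicodeBlock.HANGUL_COMPATIBILITY_JAMO;
    }

    /** True only for Hiragana or Katakana, as the Returns text above states. */
    static boolean isJapanese(UnicodeBlock blk) {
        return blk == UnicodeBlock.HIRAGANA || blk == UnicodeBlock.KATAKANA;
    }

    /** Example of a whole-text check: does any char sit in a kana block? */
    static boolean hasKana(String text) {
        for (int i = 0; i < text.length(); i++) {
            if (isJapanese(UnicodeBlock.of(text.charAt(i)))) {
                return true;
            }
        }
        return false;
    }
}
```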
      • compress

        public static byte[] compress​(java.lang.String buf)
                               throws java.io.IOException
        Compress bytes from a Unicode string. The string is converted to bytes first to avoid Unicode or platform-dependent I/O issues.
        Parameters:
        buf - text, encoded as UTF-8 before compression
        Returns:
        byte array
        Throws:
        java.io.IOException - on error with compression or text encoding
      • compress

        public static byte[] compress​(java.lang.String buf,
                                      java.lang.String charset)
                               throws java.io.IOException
        Parameters:
        buf - text
        charset - character set encoding for text
        Returns:
        byte array for the compressed result
        Throws:
        java.io.IOException - on error with compression or text encoding
      • uncompress

        public static java.lang.String uncompress​(byte[] gzData)
                                           throws java.io.IOException
        Parameters:
        gzData - byte array containing gzipped buffer
        Returns:
        UTF-8 decoded string
        Throws:
        java.io.IOException - on error with decompression or text encoding
      • uncompress

        public static java.lang.String uncompress​(byte[] gzData,
                                                  java.lang.String charset)
                                           throws java.io.IOException
        Parameters:
        gzData - byte array containing gzipped buffer
        charset - character set decoding for text
        Returns:
        uncompressed, decoded string
        Throws:
        java.io.IOException - on error with decompression or text encoding
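        A round trip through the two-argument compress and uncompress methods could look roughly like the following. It is a sketch built on java.util.zip, not the library's code; the class name GzipText is hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

final class GzipText {
    /** Encodes the text with the named charset, then gzips the bytes. */
    static byte[] compress(String buf, String charset) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            // An unknown charset name raises UnsupportedEncodingException, an IOException.
            gz.write(buf.getBytes(charset));
        }
        return out.toByteArray();
    }

    /** Gunzips the bytes, then decodes them with the named charset. */
    static String uncompress(byte[] gzData, String charset) throws IOException {
        try (GZIPInputStream gz = new GZIPInputStream(new ByteArrayInputStream(gzData))) {
            return new String(gz.readAllBytes(), charset);
        }
    }
}
```

        The same charset must be used in both directions; decoding with a different one will generally not reproduce the original text.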
      • removeEmoticons

        public static java.lang.String removeEmoticons​(java.lang.String t)
        Replace emoticons with something less nefarious -- UTF-16 characters do not play well with some I/O routines.
        Parameters:
        t - text
        Returns:
        scrubbed text
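        One way to scrub emoticons is to walk the text by code point and substitute anything in the Unicode Emoticons block. This is only a sketch: the class EmoticonSketch, the placeholder argument, and the restriction to the Emoticons block are all assumptions, since the replacement used by removeEmoticons is not documented here.

```java
import java.lang.Character.UnicodeBlock;

final class EmoticonSketch {
    /** Replaces each code point in the Emoticons block with the given placeholder. */
    static String replaceEmoticons(String t, String placeholder) {
        StringBuilder sb = new StringBuilder(t.length());
        // Iterate by code point so surrogate pairs are handled as single characters.
        t.codePoints().forEach(cp -> {
            if (UnicodeBlock.of(cp) == UnicodeBlock.EMOTICONS) {
                sb.append(placeholder);
            } else {
                sb.appendCodePoint(cp);
            }
        });
        return sb.toString();
    }
}
```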
      • removeSymbols

        public static java.lang.String removeSymbols​(java.lang.String t)
        Replace symbology
        Parameters:
        t - text
        Returns:
        scrubbed text
      • countNonText

        public static int countNonText​(java.lang.String t)
        Count the number of non-alphanumeric chars present.
        Parameters:
        t - text
        Returns:
        count of chars
      • parseHashTags

        public static java.util.Set<java.lang.String> parseHashTags​(java.lang.String tweetText)
        Parse the typical Twitter hashtag variants.
        Parameters:
        tweetText - text
        Returns:
        set of hashtags found in the text
      • parseHashTags

        public static java.util.Set<java.lang.String> parseHashTags​(java.lang.String tweetText,
                                                                    boolean normalize)
        Takes a string and returns all the hashtags in it. Normalized tags are just lowercased and deduplicated. See https://developer.twitter.com/en/docs/tweets/data-dictionary/overview/intro-to-tweet-json
        Parameters:
        tweetText - text
        normalize - whether to normalize tags by lowercasing them, etc.
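        Hashtag extraction with optional normalization can be sketched with a regular expression. The pattern, the class name HashTagSketch, and the choice to return tags without the leading "#" are assumptions made for illustration; the variants the library actually recognizes are not listed above.

```java
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class HashTagSketch {
    // Assumed tag syntax: '#' followed by letters, digits, or underscores.
    private static final Pattern TAG = Pattern.compile("#([\\p{L}\\p{N}_]+)");

    /** Collects tags in order of first appearance; the Set removes duplicates. */
    static Set<String> parse(String text, boolean normalize) {
        Set<String> tags = new LinkedHashSet<>();
        Matcher m = TAG.matcher(text);
        while (m.find()) {
            String tag = m.group(1);
            tags.add(normalize ? tag.toLowerCase(Locale.ROOT) : tag);
        }
        return tags;
    }
}
```

        With normalization on, "#Java" and "#java" collapse into one entry; with it off, both are kept because they differ in case.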
      • parseNaturalLanguage

        public static java.lang.String parseNaturalLanguage​(java.lang.String raw,
                                                            boolean unescapeHtml,
                                                            boolean remURLs,
                                                            boolean remTags,
                                                            boolean remEntities)
        Given tweet text or any [social media] text, remove entities or other markers:
        - URLs are removed
        - entities are stripped of "@"
        - hashtags are stripped of "#"
        - HTML: &amp; is converted to an ampersand
        - HTML: escaped angle brackets are replaced with { and } for gt and lt, respectively
        - HTML: remaining special chars are converted back to Unicode; remaining ampersand is replaced with "+"
        Whitespaces (space, newlines, tabs, etc.) are reduced.
        DEPRECATED: the use of the tags=true flag to replace hashtags with blank is not supported. #tag<unicode text> is a problem: it is hard to tell in some cases where the hashtag ends. In Weibo, #tag#<unicode text> is used to denote that the tag has a start and end, but in Twitter the tag format is "#tag" or "#[phrase here]", etc. So there is no generic hashtag replacement.
        Parameters:
        raw - original text
        unescapeHtml - unescape HTML
        remURLs - remove URLs
        remTags - remove hash tags
        remEntities - remove other entities
        Returns:
        text less entities.
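        The non-HTML part of this cleanup can be sketched as a few regex passes. This is an illustration under assumptions: the class SocialTextSketch and the URL pattern are hypothetical, HTML unescaping is left out entirely, and the flags are taken to strip only the "@" and "#" markers, in line with the description above.

```java
import java.util.regex.Pattern;

final class SocialTextSketch {
    // Assumed URL shape: http(s) scheme followed by non-whitespace.
    private static final Pattern URL = Pattern.compile("https?://\\S+");
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    static String clean(String raw, boolean remURLs, boolean remTags, boolean remEntities) {
        String t = raw;
        if (remURLs) {
            t = URL.matcher(t).replaceAll(" ");
        }
        if (remEntities) {
            t = t.replace("@", ""); // keep the name, drop the marker
        }
        if (remTags) {
            t = t.replace("#", ""); // keep the tag text, drop the marker
        }
        // Reduce runs of spaces, tabs, and newlines to a single space.
        return WHITESPACE.matcher(t).replaceAll(" ").trim();
    }
}
```

        Keeping the tag text and dropping only the "#" sidesteps the problem noted above of deciding where a hashtag ends.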
      • parseDate

        public static final java.util.Date parseDate​(java.lang.String dt)
        Limited-scope date parsing: parses properly formatted strings, for example ISO date/time strings stored in one of our Solr indices.
        Parameters:
        dt - ISO date/time string.
        Returns: