Package org.opensextant.util
Class TextUtils
- java.lang.Object
-
- org.opensextant.util.TextUtils
-
public class TextUtils extends java.lang.Object
- Author:
- ubaldino
-
-
Field Summary
Fields
Modifier and Type Field Description
static java.lang.String
arabicLang
static java.lang.String
bahasaLang
static int
CASE_LOWER
static int
CASE_UPPER
static java.lang.String
chineseLang
static java.lang.String
chineseTradLang
static char
CR
static char
DEL
static java.lang.String
englishLang
static java.lang.String
farsiLang
static java.lang.String
frenchLang
static java.lang.String
germanLang
static java.util.regex.Pattern
hashtagPattern1
Find any pattern "ABC#[ABC 123]" -- a hashtag with whitespace.
static java.util.regex.Pattern
hashtagPattern2
Find any pattern "#ABC123" -- normal hashtag. Java Regex note: UNICODE flags are important, otherwise "\w" and other classes match only ASCII.
static java.lang.String
italianLang
static java.lang.String
japaneseLang
static java.lang.String
koreanLang
static char
NL
static java.lang.String
portugueseLang
static java.lang.String
romanianLang
static java.lang.String
russianLang
static char
SP
static java.lang.String
spanishLang
static char
TAB
static java.lang.String
thaiLang
static java.lang.String
turkishLang
static java.lang.String
vietnameseLang
-
Constructor Summary
Constructors
Constructor Description
TextUtils()
-
Method Summary
Modifier and Type Method Description
static void
addLanguage(Language lg)
static void
addLanguage(Language lg, boolean override)
Extend the basic language dictionary.
static java.lang.String
b2hex(byte[] barr)
static boolean
checkCase(java.lang.String text, int textcase)
Detects if the alphabetic chars of a string are purely of the given case (lower or upper).
static byte[]
compress(java.lang.String buf)
Compress bytes from a Unicode string.
static byte[]
compress(java.lang.String buf, java.lang.String charset)
static int
count_digits(java.lang.String txt)
Counts all digits in text.
static int
count_ws(java.lang.String txt)
Counts all whitespace in text.
static int
countASCIIChars(byte[] data)
Count the number of ASCII bytes.
static int
countCJKChars(char[] chars)
Counts the CJK characters in a char buffer. Inspiration: http://stackoverflow.com/questions/1499804/how-can-i-detect-japanese-text-in-a-java-string Assumption is that the char array is Unicode characters.
static int
countDigits(java.lang.String txt)
static int
countFormattingSpace(java.lang.String txt)
Count formatting whitespace.
static int
countIrregularPunctuation(java.lang.String t)
static int
countNonText(java.lang.String t)
Count the number of non-alphanumeric chars present.
static java.lang.String
delete_controls(java.lang.String t)
Delete control chars from text data; leaving text and whitespace only.
static java.lang.String
delete_eol(java.lang.String t)
Replace line endings with SPACE.
static java.lang.String
delete_whitespace(java.lang.String t)
Delete whitespace of any sort.
static java.lang.String
fast_replace(java.lang.String buf, java.lang.String replace, java.lang.String substitution)
Given a string S and a list of characters to replace with a substitute, return the new string, S'.
static int[]
get_text_window(int offset, int textsize, int width)
Get a single text window around the offset.
static int[]
get_text_window(int offset, int matchlen, int textsize, int width)
Find the text window(s) around a match.
static Language
getLanguage(java.lang.String code)
ISO2 and ISO3 char codes for languages are unique.
static java.lang.String
getLanguageCode(java.lang.String code)
ISO2 and ISO3 char codes for languages are unique.
static java.util.Map<java.lang.String,Language>
getLanguageMap()
If caller wants to add language they can.
static java.lang.String
getLanguageName(java.lang.String code)
Given an ISO2 char code (least common denominator) retrieve Language Name.
static boolean
hasCJKText(java.lang.String buf)
A simple test to see if text has any CJK characters at all.
static boolean
hasDiacritics(java.lang.String s)
If a string has extended latin diacritics.
static boolean
hasDigits(java.lang.String txt)
static boolean
hasIrregularPunctuation(java.lang.String t)
Simple triage of punctuation.
static void
initLanguageData()
Initialize language codes and metadata.
static void
initLOCLanguageData()
This is Library of Congress data for language IDs.
static boolean
isASCII(byte[] data)
static boolean
isASCII(char c)
static boolean
isASCII(java.lang.String t)
Early exit test -- return false on first non-ASCII character found.
static boolean
isASCIILetter(char c)
static boolean
isChinese(java.lang.Character.UnicodeBlock blk)
static boolean
isChinese(java.lang.String x)
Utility method to check if lang ID is Chinese (Traditional or Simplified)...
static boolean
isCJK(java.lang.Character.UnicodeBlock blk)
static boolean
isCJK(java.lang.String x)
Utility method to check if lang ID is Chinese, Korean, or Japanese.
static boolean
isEnglish(java.lang.String x)
Utility method to check if lang ID is English...
static boolean
isEuroLanguage(java.lang.String l)
European languages = Romance + GER + ENG. Extend definition as needed.
static boolean
isJapanese(java.lang.Character.UnicodeBlock blk)
Checks if char block is uniquely Japanese.
static boolean
isKorean(java.lang.Character.UnicodeBlock blk)
Likely to be uniquely Korean if the character block is in Hangul.
static boolean
isLatin(java.lang.String data)
Checks if non-ASCII and non-LATIN characters are present.
static boolean
isLower(java.lang.String dat)
static boolean
isLowerCaseDocument(int[] counts)
Measures the amount of lower case; see isUpperCaseDocument(int[]).
static boolean
isNumeric(java.lang.String v)
StringUtils in commons isNumeric("1.234") is NOT numeric.
static boolean
isRomanceLanguage(java.lang.String l)
Romance languages = SPA + POR + ITA + FRA + ROM. Extend definition as needed.
static boolean
isUpper(java.lang.String dat)
For measuring the upper-case-ness of short texts.
static boolean
isUpperCaseDocument(int[] counts)
First measureCase(Text) to acquire counts, then call this routine for a heuristic that suggests the text is mainly upper case.
static java.lang.String
md5_id(byte[] digest)
Deprecated. Not MD5 specific.
static int[]
measureCase(java.lang.String text)
Measure character count, upper, lower, non-Character, whitespace.
static double
measureCJKText(java.lang.String buf)
Returns a ratio of Chinese/Japanese/Korean characters: CJK chars / ALL. TODO: needs testing; not sure if this is sustainable if block; or if it is comprehensive.
static java.lang.String
normalizeAbbreviation(java.lang.String word)
Intended only as a filter for punctuation within a word.
static java.lang.String
normalizeTextEntity(java.lang.String str)
Normalization: Clean the ends, Remove Line-endings from middle of entity.
static java.lang.String
normalizeUnicode(java.lang.String str)
Normalize to "Normalization Form Canonical Decomposition" (NFD). REF: http://stackoverflow.com/questions/3610013/file-listfiles-mangles-unicode-names-with-jdk-6-unicode-normalization-issues This supports proper file name retrieval from file system, among other things.
static java.util.Date
parseDate(java.lang.String dt)
A limited-scope date parsing: Parse properly formatted strings, for example, ISO date/time strings stored in one of our Solr indices.
static java.util.Set<java.lang.String>
parseHashTags(java.lang.String tweetText)
Parse the typical Twitter hashtag variants.
static java.util.Set<java.lang.String>
parseHashTags(java.lang.String tweetText, boolean normalize)
Takes a string and returns all the hashtags in it.
static java.lang.String
parseNaturalLanguage(java.lang.String raw)
See the five-argument parseNaturalLanguage for the implementation.
static java.lang.String
parseNaturalLanguage(java.lang.String raw, boolean unescapeHtml, boolean remURLs, boolean remTags, boolean remEntities)
Given tweet text or any [social media] text remove entities or other markers: - URLs are removed - entities are stripped of "@" - hashtags are stripped of "#" - HTML: &amp; is converted to an ampersand - HTML: escaped angle brackets are replaced with { and } for gt and lt, respectively - HTML: remaining special chars are converted back to unicode; remaining ampersand is replaced with "+" Whitespaces (space, newlines, tabs, etc.) are reduced.
static java.lang.String
phoneticReduction(java.lang.String t)
Create a non-diacritic, ASCII version of the input string.
static java.lang.String
phoneticReduction(java.lang.String t, boolean isAscii)
static java.lang.String
reduce_line_breaks(java.lang.String t)
Replaces all 3 or more blank lines with a single paragraph break (\n\n).
static java.lang.String
removeAny(java.lang.String buf, java.lang.String remove)
Remove instances of any char in the remove string from buf.
static java.lang.String
removeAnyLeft(java.lang.String buf, java.lang.String remove)
Compare to trim( string, chars ), but you can trim any chars. Example: given "- a b c", remove "-" from the string.
static java.lang.String
removeDiacritics(java.lang.String word)
Supports Phoneticizer utility from OpenSextant v1.x. Remove diacritics from a phrase.
static java.lang.String
removeEmoticons(java.lang.String t)
Replace emoticons with something less nefarious -- UTF-16 characters do not play well with some I/O routines.
static java.lang.String
removePunctuation(java.lang.String word)
Remove any leading and trailing punctuation and some internal punctuation.
static java.lang.String
removeSymbols(java.lang.String t)
Replace symbology.
static java.lang.String
replaceAny(java.lang.String buf, java.lang.String remove, java.lang.String sub)
Replace any of the removal chars with the sub.
static java.lang.String
replaceDiacritics(java.lang.String s)
A thorough replacement of diacritics and Unicode chars to their ASCII equivalents.
static java.lang.String
replaceDiacriticsOriginal(java.lang.String s)
Deprecated. See replaceDiacritics as the replacement.
static java.lang.String
squeeze_whitespace(java.lang.String t)
Minimize whitespace.
static java.util.List<java.lang.String>
string2list(java.lang.String s, java.lang.String delim)
Get a list of values into a nice, scrubbed array of values, no whitespace.
static java.lang.String
text_id(java.lang.String text)
Static method -- use only if you are sure of thread-safety.
static java.lang.String[]
tokens(java.lang.String str)
Return just white-space delimited tokens.
static java.lang.String[]
tokensLeft(java.lang.String str)
See tokensRight()
static java.lang.String[]
tokensRight(java.lang.String str)
Return tokens on the right most part of a buffer.
static java.lang.String
uncompress(byte[] gzData)
static java.lang.String
uncompress(byte[] gzData, java.lang.String charset)
-
-
-
Field Detail
-
NL
public static final char NL
- See Also:
- Constant Field Values
-
CR
public static final char CR
- See Also:
- Constant Field Values
-
SP
public static final char SP
- See Also:
- Constant Field Values
-
TAB
public static final char TAB
- See Also:
- Constant Field Values
-
DEL
public static final char DEL
- See Also:
- Constant Field Values
-
CASE_LOWER
public static final int CASE_LOWER
- See Also:
- Constant Field Values
-
CASE_UPPER
public static final int CASE_UPPER
- See Also:
- Constant Field Values
-
arabicLang
public static final java.lang.String arabicLang
- See Also:
- Constant Field Values
-
bahasaLang
public static final java.lang.String bahasaLang
- See Also:
- Constant Field Values
-
chineseLang
public static final java.lang.String chineseLang
- See Also:
- Constant Field Values
-
chineseTradLang
public static final java.lang.String chineseTradLang
- See Also:
- Constant Field Values
-
englishLang
public static final java.lang.String englishLang
- See Also:
- Constant Field Values
-
farsiLang
public static final java.lang.String farsiLang
- See Also:
- Constant Field Values
-
frenchLang
public static final java.lang.String frenchLang
- See Also:
- Constant Field Values
-
germanLang
public static final java.lang.String germanLang
- See Also:
- Constant Field Values
-
italianLang
public static final java.lang.String italianLang
- See Also:
- Constant Field Values
-
japaneseLang
public static final java.lang.String japaneseLang
- See Also:
- Constant Field Values
-
koreanLang
public static final java.lang.String koreanLang
- See Also:
- Constant Field Values
-
portugueseLang
public static final java.lang.String portugueseLang
- See Also:
- Constant Field Values
-
russianLang
public static final java.lang.String russianLang
- See Also:
- Constant Field Values
-
spanishLang
public static final java.lang.String spanishLang
- See Also:
- Constant Field Values
-
turkishLang
public static final java.lang.String turkishLang
- See Also:
- Constant Field Values
-
thaiLang
public static final java.lang.String thaiLang
- See Also:
- Constant Field Values
-
vietnameseLang
public static final java.lang.String vietnameseLang
- See Also:
- Constant Field Values
-
romanianLang
public static final java.lang.String romanianLang
- See Also:
- Constant Field Values
-
hashtagPattern1
public static final java.util.regex.Pattern hashtagPattern1
Find any pattern "ABC#[ABC 123]" -- a hashtag with whitespace. Java Regex note: UNICODE flags are important, otherwise "\w" and other classes match only ASCII. NOTE: These are Twitter hashtags primarily
-
hashtagPattern2
public static final java.util.regex.Pattern hashtagPattern2
Find any pattern "#ABC123" -- normal hashtag, Java Regex note: UNICODE flags are important, otherwise "\w" and other classes match only ASCII. NOTE: These are Twitter hashtags primarily
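The Unicode-flag note above can be illustrated with a small sketch. The two patterns here are hypothetical stand-ins, not the actual regular expressions behind hashtagPattern1 or hashtagPattern2; the only point is how Pattern.UNICODE_CHARACTER_CLASS changes what "\w" matches.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UnicodeHashtagDemo {
    // Hypothetical patterns for illustration only; not the library's regex.
    static final Pattern ASCII_ONLY = Pattern.compile("#\\w+");
    static final Pattern UNICODE_AWARE =
            Pattern.compile("#\\w+", Pattern.UNICODE_CHARACTER_CLASS);

    /** Returns the first match of the pattern in text, or null if none. */
    static String firstMatch(Pattern p, String text) {
        Matcher m = p.matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String text = "Visiting #caf\u00e9 today";
        System.out.println(firstMatch(ASCII_ONLY, text));    // prints "#caf"
        System.out.println(firstMatch(UNICODE_AWARE, text)); // prints "#café"
    }
}
```

Without the flag the match stops at the first non-ASCII letter, which would truncate hashtags written in most non-English scripts.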
-
-
Method Detail
-
hasIrregularPunctuation
public static boolean hasIrregularPunctuation(java.lang.String t)
Simple triage of punctuation. Rationale: OpenSextant taggers maximize RECALL in favor of not missing a possible match. The problem there is that we often encounter substantial noise in tagger output, so a trivial test is to see if we have overmatched. Allowed Punctuation: , . - _ ` ' ( ) ## Diacritics, Parenthetics, periods/dashes. Given phrase "A B C" we may have matched: "A|B+C", "A; B; C", "A <B> C" etc., where common punctuation separates valid tokens that appear in the reference phrase.
- Parameters:
t
-- Returns:
-
countIrregularPunctuation
public static int countIrregularPunctuation(java.lang.String t)
-
isLatin
public static final boolean isLatin(java.lang.String data)
Checks if non-ASCII and non-LATIN characters are present.
- Parameters:
data
- any textual data
- Returns:
- true if content is strictly ASCII or Latin1 extended.
-
hasDiacritics
public static final boolean hasDiacritics(java.lang.String s)
Tests if a string has extended Latin diacritics.
- Parameters:
s
- string to test
- Returns:
- true if at least one diacritic is found.
-
phoneticReduction
public static java.lang.String phoneticReduction(java.lang.String t)
Create a non-diacritic, ASCII version of the input string. The result keeps the original whitespace but drops non-character markings, e.g. "Za'tut" => "Zatut", not "Za tut".
- Parameters:
t
-- Returns:
-
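A minimal sketch of the kind of reduction described above, assuming a JDK-only approach (NFD decomposition, then stripping combining marks and non-alphanumeric symbols). This is not the library's implementation; the class and method names are hypothetical.

```java
import java.text.Normalizer;

class PhoneticReductionSketch {
    /** Strip diacritics and non-alphanumeric marks, keeping whitespace. */
    static String reduce(String t) {
        // Decompose accented letters into base letter + combining mark.
        String decomposed = Normalizer.normalize(t, Normalizer.Form.NFD);
        // Drop the combining marks (Unicode category M).
        String noMarks = decomposed.replaceAll("\\p{M}+", "");
        // Drop anything that is not an ASCII letter, digit, or whitespace.
        return noMarks.replaceAll("[^\\p{Alnum}\\s]", "");
    }

    public static void main(String[] args) {
        System.out.println(reduce("Za'tut"));             // prints "Zatut"
        System.out.println(reduce("S\u00e3o Tom\u00e9")); // prints "Sao Tome"
    }
}
```

Letters with no canonical decomposition (for example "ø") are simply dropped by this sketch; a thorough replacement would need an explicit mapping table.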
phoneticReduction
public static java.lang.String phoneticReduction(java.lang.String t, boolean isAscii)
-
replaceDiacritics
public static final java.lang.String replaceDiacritics(java.lang.String s)
A thorough replacement of diacritics and Unicode chars to their ASCII equivalents.
- Parameters:
s
- the string
- Returns:
- converted string
-
replaceDiacriticsOriginal
@Deprecated public static java.lang.String replaceDiacriticsOriginal(java.lang.String s)
Deprecated. See replaceDiacritics as the replacement.
Remove accents from a string and replace with ASCII equivalent. Reference: http://www.rgagnon.com/javadetails/java-0456.html Caveat: This implementation is not exhaustive.
- Parameters:
s
-- Returns:
- See Also:
replaceDiacritics(String)
-
isASCII
public static final boolean isASCII(char c)
- Parameters:
c
- a character
- Returns:
- true if c is ASCII
-
isASCIILetter
public static final boolean isASCIILetter(char c)
- Parameters:
c
- character
- Returns:
- true if c is ASCII a-z or A-Z
-
isASCII
public static boolean isASCII(byte[] data)
- Parameters:
data
- bytes to test
- Returns:
- true if data is ASCII, false otherwise
-
isASCII
public static boolean isASCII(java.lang.String t)
Early exit test -- return false on first non-ASCII character found.
- Parameters:
t
- buffer of text
- Returns:
- true only if every char is in ASCII table.
-
countASCIIChars
public static int countASCIIChars(byte[] data)
Count the number of ASCII bytes.
- Parameters:
data
- bytes to count
- Returns:
- count of ASCII bytes
-
reduce_line_breaks
public static java.lang.String reduce_line_breaks(java.lang.String t)
Replaces all 3 or more blank lines with a single paragraph break (\n\n).
- Parameters:
t
- text
- Returns:
- A string with fewer line breaks
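One way to read the description above is as a single regex replacement. The sketch below assumes that "3 or more blank lines" means a run of three or more consecutive line endings (LF or CRLF); the actual method may treat the boundary differently.

```java
class LineBreakSketch {
    /** Collapse runs of three or more line endings into one paragraph break. */
    static String reduceLineBreaks(String t) {
        return t.replaceAll("(?:\\r?\\n){3,}", "\n\n");
    }

    public static void main(String[] args) {
        String reduced = reduceLineBreaks("first\n\n\n\nsecond");
        System.out.println(reduced.equals("first\n\nsecond")); // prints "true"
    }
}
```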
-
delete_whitespace
public static java.lang.String delete_whitespace(java.lang.String t)
Delete whitespace of any sort.
- Parameters:
t
- text
- Returns:
- String, without whitespace.
-
squeeze_whitespace
public static java.lang.String squeeze_whitespace(java.lang.String t)
Minimize whitespace.
- Parameters:
t
- text
- Returns:
- scrubbed string
-
delete_eol
public static java.lang.String delete_eol(java.lang.String t)
Replace line endings with SPACE.
- Parameters:
t
- text
- Returns:
- scrubbed string
-
delete_controls
public static java.lang.String delete_controls(java.lang.String t)
Delete control chars from text data; leaving text and whitespace only. Delete char (^?) is also removed. Length may differ if ctl chars are removed.
- Parameters:
t
- text
- Returns:
- scrubbed buffer
-
hasDigits
public static boolean hasDigits(java.lang.String txt)
-
countDigits
public static int countDigits(java.lang.String txt)
-
count_digits
public static int count_digits(java.lang.String txt)
Counts all digits in text.
- Parameters:
txt
- text to count
- Returns:
- count of digits
-
isNumeric
public static final boolean isNumeric(java.lang.String v)
Unlike StringUtils.isNumeric in commons, which reports "1.234" as NOT numeric, here "1.234" is numeric.
- Parameters:
v
- val to parse
- Returns:
- true if val is a number
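The difference from a digits-only check can be sketched with a small regex test. The accepted grammar here (optional sign, digits, optional fraction) is an assumption made for illustration; the source does not say which number formats the real method accepts.

```java
import java.util.regex.Pattern;

class NumericSketch {
    // Assumed grammar: optional sign, digits, optional decimal fraction.
    private static final Pattern DECIMAL = Pattern.compile("[-+]?\\d+(\\.\\d+)?");

    static boolean isNumeric(String v) {
        return v != null && DECIMAL.matcher(v.trim()).matches();
    }

    public static void main(String[] args) {
        System.out.println(isNumeric("1.234")); // prints "true"
        System.out.println(isNumeric("1.2.3")); // prints "false"
    }
}
```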
-
count_ws
public static int count_ws(java.lang.String txt)
Counts all whitespace in text.
- Parameters:
txt
- text
- Returns:
- whitespace count
-
countFormattingSpace
public static int countFormattingSpace(java.lang.String txt)
Count formatting whitespace. This is helpful in determining if text spans are phrases with multiple TAB or EOL characters. For that matter, any control character contributes to formatting in some way: DEL, VT, HT, etc. So all control characters ( c < ' ') are counted.
- Parameters:
txt
- input string
- Returns:
- count of format chars
-
isUpper
public static boolean isUpper(java.lang.String dat)
For measuring the upper-case-ness of short texts. Returns true if ALL letters in text are UPPERCASE. Allows for non-letters in text.
- Parameters:
dat
- text or data
- Returns:
- true if text is Upper
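A minimal sketch of the rule described above: every letter must be upper case, and non-letters are ignored. Returning false for text with no letters at all is an assumption of this sketch, since the source does not specify that case.

```java
class UpperCaseSketch {
    static boolean isUpper(String dat) {
        boolean sawLetter = false;
        for (int i = 0; i < dat.length(); i++) {
            char c = dat.charAt(i);
            if (!Character.isLetter(c)) {
                continue; // digits, punctuation, and whitespace are allowed
            }
            if (!Character.isUpperCase(c)) {
                return false; // one lower-case letter is enough to fail
            }
            sawLetter = true;
        }
        return sawLetter; // assumption: text without letters is not "upper"
    }

    public static void main(String[] args) {
        System.out.println(isUpper("ABC 123!")); // prints "true"
        System.out.println(isUpper("ABc"));      // prints "false"
    }
}
```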
-
isLower
public static boolean isLower(java.lang.String dat)
-
checkCase
public static boolean checkCase(java.lang.String text, int textcase)
Detects if the alphabetic chars of a string are purely of the given case.
- Parameters:
text
- text
textcase
- 1 lower, 2 upper
- Returns:
- true if the case matches the given textcase param
-
measureCase
public static int[] measureCase(java.lang.String text)
Measure character count, upper, lower, non-Character, whitespace.
- Parameters:
text
- text
- Returns:
- int array with counts.
-
isUpperCaseDocument
public static boolean isUpperCaseDocument(int[] counts)
First measureCase(Text) to acquire counts, then call this routine for a heuristic that suggests the text is mainly upper case. These routines may not work well on languages that are not Latin-alphabet.
- Parameters:
counts
- word stats from measureCase()
- Returns:
- true if counts represent text that exceeds the "UPPER CASE" threshold
-
isLowerCaseDocument
public static boolean isLowerCaseDocument(int[] counts)
This measures the amount of lower case; see isUpperCaseDocument(int[]) for the upper case counterpart. Two methods to measure -- lower case count compared to all content (char+non-char) or compared to just char content.
- Parameters:
counts
- word stats from measureCase()
- Returns:
- true if counts represent text that exceeds the "lower case" threshold
-
get_text_window
public static int[] get_text_window(int offset, int matchlen, int textsize, int width)
Find the text window(s) around a match. Given the size of a buffer, the match and desired width, return:
  prepreprepre MATCH postpostpost
  ^            ^    ^            ^
  l-width      l    l+len        l+len+width
where left1 = l-width, left2 = l, right1 = l+len, right2 = l+len+width.
- Parameters:
offset
- offset of match
width
- width of window left and right of match
textsize
- size of buffer containing match; used for boundary conditions
matchlen
- length of match
- Returns:
- window offsets left of match, right of match: [ l1, l2, r1, r2 ]
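The window arithmetic above can be sketched as follows. Clamping to 0 and to textsize is how this sketch interprets "used for boundary conditions"; the real method may handle out-of-range input differently, and the class and method names are hypothetical.

```java
import java.util.Arrays;

class TextWindowSketch {
    /** Returns [left1, left2, right1, right2] around a match, clamped to the buffer. */
    static int[] window(int offset, int matchlen, int textsize, int width) {
        int left2 = offset;
        int left1 = Math.max(0, left2 - width);              // l - width
        int right1 = Math.min(textsize, offset + matchlen);  // l + len
        int right2 = Math.min(textsize, right1 + width);     // l + len + width
        return new int[] { left1, left2, right1, right2 };
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(window(10, 5, 100, 4))); // prints "[6, 10, 15, 19]"
        System.out.println(Arrays.toString(window(2, 5, 10, 4)));   // prints "[0, 2, 7, 10]"
    }
}
```

The second call shows both boundaries being hit: the left window is cut at 0 and the right window at the buffer size.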
-
get_text_window
public static int[] get_text_window(int offset, int textsize, int width)
Get a single text window around the offset.
- Parameters:
offset
- offset of match
width
- width of window left and right of match
textsize
- size of buffer containing match; used for boundary conditions
- Returns:
- window offsets of a text span containing match [ left, right ]
-
text_id
public static java.lang.String text_id(java.lang.String text) throws java.security.NoSuchAlgorithmException, java.io.UnsupportedEncodingException
Static method -- use only if you are sure of thread-safety.
- Parameters:
text
- text or data
- Returns:
- identifier for the text, an MD5 hash
- Throws:
java.security.NoSuchAlgorithmException
- on error
java.io.UnsupportedEncodingException
- on error
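A sketch of an MD5-based text identifier using only the JDK. It assumes UTF-8 encoding and lower-case hex output, neither of which is stated in the source, and it wraps the checked exception instead of declaring it. Creating a fresh MessageDigest per call sidesteps the thread-safety concern, because MessageDigest instances are not safe to share between threads.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

class TextIdSketch {
    /** Hex-encode a byte array, two lower-case hex digits per byte. */
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    /** MD5 of the UTF-8 bytes of text, as a hex string. */
    static String textId(String text) {
        try {
            // New instance per call: nothing shared across threads.
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            return toHex(md5.digest(text.getBytes(StandardCharsets.UTF_8)));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(textId("hello")); // prints "5d41402abc4b2a76b9719d911017c592"
    }
}
```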
-
b2hex
public static java.lang.String b2hex(byte[] barr)
-
md5_id
public static java.lang.String md5_id(byte[] digest)
Deprecated. Not MD5 specific; use b2hex(byte[]) instead.
- Parameters:
digest
- byte array
- Returns:
- hash for the data
-
string2list
public static java.util.List<java.lang.String> string2list(java.lang.String s, java.lang.String delim)
Get a list of values into a nice, scrubbed array of values, no whitespace.
  a, b, c d e, f => [ "a", "b", "c d e", "f" ]
- Parameters:
s
- string to split
delim
- delimiter, no default.
- Returns:
- list of split strings, which are also whitespace trimmed
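The split-and-trim behavior in the example above might look like this. Treating the delimiter literally (not as a regex) and dropping empty values are assumptions of the sketch, not documented behavior.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class StringListSketch {
    static List<String> split(String s, String delim) {
        List<String> values = new ArrayList<>();
        // Pattern.quote makes the delimiter literal, so "." or "|" are safe.
        for (String part : s.split(Pattern.quote(delim))) {
            String trimmed = part.trim();
            if (!trimmed.isEmpty()) { // assumption: empty values are dropped
                values.add(trimmed);
            }
        }
        return values;
    }

    public static void main(String[] args) {
        System.out.println(split("a, b, c d e, f", ",")); // prints "[a, b, c d e, f]"
    }
}
```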
-
fast_replace
public static java.lang.String fast_replace(java.lang.String buf, java.lang.String replace, java.lang.String substitution)
Given a string S and a list of characters to replace with a substitute, return the new string, S'.
  "-name-with.invalid characters;" // replace "-. ;" with "_"
  "_name_with_invalid_characters_" //
- Parameters:
buf
- buffer
replace
- string of characters to replace with the one substitute char
substitution
- string to insert in place of chars
- Returns:
- scrubbed text
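The many-to-one replacement in the example above can be sketched with a single pass over the buffer. This is an illustrative implementation with hypothetical names, not the library's code.

```java
class CharReplaceSketch {
    /** Replace every char of buf that appears in replace with substitution. */
    static String replaceAnyOf(String buf, String replace, String substitution) {
        StringBuilder out = new StringBuilder(buf.length());
        for (int i = 0; i < buf.length(); i++) {
            char c = buf.charAt(i);
            if (replace.indexOf(c) >= 0) {
                out.append(substitution);
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // prints "_name_with_invalid_characters_"
        System.out.println(replaceAnyOf("-name-with.invalid characters;", "-. ;", "_"));
    }
}
```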
-
removeAny
public static java.lang.String removeAny(java.lang.String buf, java.lang.String remove)
Remove instances of any char in the remove string from buf.
- Parameters:
buf
- text
remove
- string to remove
- Returns:
- scrubbed text
-
replaceAny
public static java.lang.String replaceAny(java.lang.String buf, java.lang.String remove, java.lang.String sub)
Replace any of the removal chars with the sub. A many to one replacement. alt: use regex String.replace(//, '')
- Parameters:
buf
- text
remove
- string to replace
sub
- the replacement string
- Returns:
- scrubbed text
-
removeAnyLeft
public static java.lang.String removeAnyLeft(java.lang.String buf, java.lang.String remove)
Compare to trim( string, chars ), but you can trim any chars. Example: given "- a b c", remove "-" from the string.
- Parameters:
buf
- text
remove
- string to remove
- Returns:
- scrubbed text
-
normalizeTextEntity
public static java.lang.String normalizeTextEntity(java.lang.String str)
Normalization: Clean the ends, Remove Line-endings from middle of entity. Example:
  TEXT: **The Daily Newsletter of \n\rBarbara, So.**
  CLEAN: __The Daily Newsletter of __Barbara, So___
Where "__" represents omitted characters.
- Parameters:
str
- text
- Returns:
- scrubbed text
-
tokens
public static java.lang.String[] tokens(java.lang.String str)
Return just white-space delimited tokens.
- Parameters:
str
- text
- Returns:
- tokens
-
tokensRight
public static final java.lang.String[] tokensRight(java.lang.String str)
Return tokens on the right most part of a buffer. If a para break occurs, \n\n or \r\n\r\n, then return the part on the right of the break.
- Parameters:
str
- text
- Returns:
- whitespace delimited tokens
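A sketch of the "right of the last paragraph break" rule described above. Returning an empty array when nothing follows the break is an assumption of this sketch.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TokensRightSketch {
    // A paragraph break: two consecutive line endings, LF or CRLF.
    private static final Pattern PARA_BREAK = Pattern.compile("\\r?\\n\\r?\\n");

    static String[] tokensRight(String str) {
        // Find where the last paragraph break ends, if there is one.
        Matcher m = PARA_BREAK.matcher(str);
        int start = 0;
        while (m.find()) {
            start = m.end();
        }
        String right = str.substring(start).trim();
        return right.isEmpty() ? new String[0] : right.split("\\s+");
    }

    public static void main(String[] args) {
        // prints "[baz, qux]"
        System.out.println(Arrays.toString(tokensRight("foo bar\n\nbaz qux")));
    }
}
```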
-
tokensLeft
public static final java.lang.String[] tokensLeft(java.lang.String str)
See tokensRight()
- Parameters:
str
- text
- Returns:
- whitespace delimited tokens
-
normalizeAbbreviation
public static java.lang.String normalizeAbbreviation(java.lang.String word)
Intended only as a filter for punctuation within a word. Text of the form A.T.T. or U.S. becomes ATT and US. A text such as Mr.Pibbs incorrectly becomes MrPibbs but for the purposes of normalizing tokens this should be fine. Use appropriate tokenization prior to using this as a filter.
- Parameters:
word
- phrase with periods denoting some abbreviation.
- Returns:
- scrubbed text
-
removeDiacritics
public static java.lang.String removeDiacritics(java.lang.String word)
Supports Phoneticizer utility from OpenSextant v1.x. Remove diacritics from a phrase.
- Parameters:
word
- text
- Returns:
- scrubbed text
-
normalizeUnicode
public static java.lang.String normalizeUnicode(java.lang.String str)
Normalize to "Normalization Form Canonical Decomposition" (NFD). REF: http://stackoverflow.com/questions/3610013/file-listfiles-mangles-unicode-names-with-jdk-6-unicode-normalization-issues This supports proper file name retrieval from file system, among other things. In many situations we see unicode file names -- Java can list them, but in using the Java-provided version of the filename the OS/FS may not be able to find the file by the name given in a particular normalized form.
- Parameters:
str
- text
- Returns:
- normalized string, encoded with NFD bytes
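NFD normalization is available in the JDK through java.text.Normalizer. The sketch below shows what decomposition does to a single precomposed character; it is assumed, not confirmed by the source, that the method wraps this same call.

```java
import java.text.Normalizer;

class NfdSketch {
    static String toNfd(String str) {
        return Normalizer.normalize(str, Normalizer.Form.NFD);
    }

    public static void main(String[] args) {
        String composed = "\u00e9";           // "é" as one code point
        String decomposed = toNfd(composed);  // "e" followed by U+0301
        System.out.println(composed.length());   // prints "1"
        System.out.println(decomposed.length()); // prints "2"
    }
}
```

Two strings that look identical on screen can therefore differ in length and fail equals(), which is the file-name lookup problem the description refers to.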
-
removePunctuation
public static java.lang.String removePunctuation(java.lang.String word)
Remove any leading and trailing punctuation and some internal punctuation. Internal punctuation which indicates conjunction of two tokens, e.g. a hyphen, should have caused a split into separate tokens at the tokenization stage. Supports Phoneticizer utility from OpenSextant v1.x. Remove punctuation from a phrase.
- Parameters:
word
- text
- Returns:
- scrubbed text
-
getLanguageMap
public static java.util.Map<java.lang.String,Language> getLanguageMap()
Exposes the language map so that callers can add languages.
- Returns:
- map of lang ID to language obj
-
initLanguageData
public static void initLanguageData()
Initialize language codes and metadata. This establishes a map for the most common language codes/names that exist in at least ISO-639-1 and have a non-zero 2-char ID.
Based on: http://stackoverflow.com/questions/674041/is-there-an-elegant-way-to-convert-iso-639-2-3-letter-language-codes-to-java-lo
Actual code mappings:
  en => eng
  eng => en
  cel => '' // Celtic; Avoid this.
  tr => tur
  tur => tr
Names:
  tr => turkish
  tur => turkish
  turkish => tr // ISO2 only
-
initLOCLanguageData
public static void initLOCLanguageData() throws java.io.IOException
This is Library of Congress data for language IDs. This is offered as a tool to help downstream language ID and enrich metadata when tagging data from particular countries. Reference: http://www.loc.gov/standards/iso639-2/ISO-639-2_utf-8.txt
- Throws:
java.io.IOException
- if resource file is not found
-
addLanguage
public static void addLanguage(Language lg)
-
addLanguage
public static void addLanguage(Language lg, boolean override)
Extend the basic language dictionary. Note -- First language is listed in language map by Name, and is not overwritten. Language objects may be overwritten in map using lang codes. For example, fre = French(fre), fra = French(fra), and french = French(fra); the last one, 'french', could have been French(fre) or French(fra). Similarly, 'ger' and 'deu' are both valid ISO 3-alpha codes for German. What to do? TODO: Create a language object that lists both language biblio/terminology codes.
- Parameters:
lg
- language object
override
- if this value should overwrite an existing one.
-
getLanguageName
public static java.lang.String getLanguageName(java.lang.String code)
Given an ISO2 char code (least common denominator) retrieve Language Name. This is best effort, so if your code finds nothing, this returns code normalized to lowercase.
- Parameters:
code
- lang ID
- Returns:
- name of language
-
getLanguage
public static Language getLanguage(java.lang.String code)
Look up a language by its ISO2 or ISO3 char code; these codes are unique per language.
- Parameters:
code
- iso2 or iso3 code
- Returns:
- the Language for the code
-
getLanguageCode
public static java.lang.String getLanguageCode(java.lang.String code)
ISO2 and ISO3 char codes for languages are unique.
- Parameters:
code
- iso2 or iso3 code
- Returns:
- the other code.
-
isEuroLanguage
public static boolean isEuroLanguage(java.lang.String l)
European languages = Romance + GER + ENG. Extend definition as needed.
- Parameters:
l
- language ID
- Returns:
- true if language is European in nature
-
isRomanceLanguage
public static boolean isRomanceLanguage(java.lang.String l)
Romance languages = SPA + POR + ITA + FRA + ROM. Extend definition as needed.
- Parameters:
l
- lang ID
- Returns:
- true if language is a Romance language
-
isEnglish
public static boolean isEnglish(java.lang.String x)
Utility method to check if lang ID is English...
- Parameters:
x
- a langcode
- Returns:
- whether langcode is english
-
isChinese
public static boolean isChinese(java.lang.String x)
Utility method to check if lang ID is Chinese (Traditional or Simplified)...
- Parameters:
x
- a langcode
- Returns:
- whether langcode is chinese
-
isCJK
public static boolean isCJK(java.lang.String x)
Utility method to check if lang ID is Chinese, Korean, or Japanese.
- Parameters:
x
- a langcode
- Returns:
- whether langcode is a CJK language
-
measureCJKText
public static double measureCJKText(java.lang.String buf)
Returns a ratio of Chinese/Japanese/Korean characters: CJK chars / ALL. TODO: needs testing; not sure if this is sustainable if block; or if it is comprehensive. TODO: for performance reasons the internal chain of comparisons is embedded in the method; otherwise for each char, an external method invocation is required.
- Parameters:
buf
- the text to be measured
- Returns:
- ratio of CJK characters to all characters
-
countCJKChars
public static int countCJKChars(char[] chars)
Counts the CJK characters in a char buffer. Inspiration: http://stackoverflow.com/questions/1499804/how-can-i-detect-japanese-text-in-a-java-string Assumption is that the char array is Unicode characters.
- Parameters:
chars
- char array for the text in question.
- Returns:
- count of CJK characters
-
hasCJKText
public static boolean hasCJKText(java.lang.String buf)
A simple test to see if text has any CJK characters at all. It returns after the first such character.
- Parameters:
buf
- text
- Returns:
- true if buf has at least one CJK char.
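The CJK counting, ratio, and early-exit tests described above can be sketched with Character.UnicodeBlock. The set of blocks treated as CJK here is an assumption for illustration and is likely narrower than what the library checks.

```java
import java.lang.Character.UnicodeBlock;

class CjkSketch {
    // Assumed block list; the real method may include more blocks.
    static boolean isCjkBlock(UnicodeBlock blk) {
        return blk == UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS
                || blk == UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS_EXTENSION_A
                || blk == UnicodeBlock.CJK_COMPATIBILITY_IDEOGRAPHS
                || blk == UnicodeBlock.HIRAGANA
                || blk == UnicodeBlock.KATAKANA
                || blk == UnicodeBlock.HANGUL_SYLLABLES
                || blk == UnicodeBlock.HANGUL_JAMO;
    }

    static int countCjk(char[] chars) {
        int count = 0;
        for (char c : chars) {
            if (isCjkBlock(UnicodeBlock.of(c))) {
                count++;
            }
        }
        return count;
    }

    /** Early exit: stops at the first CJK character. */
    static boolean hasCjk(String buf) {
        for (int i = 0; i < buf.length(); i++) {
            if (isCjkBlock(UnicodeBlock.of(buf.charAt(i)))) {
                return true;
            }
        }
        return false;
    }

    /** CJK chars divided by all chars; 0.0 for empty input. */
    static double ratio(String buf) {
        if (buf.isEmpty()) {
            return 0.0;
        }
        return (double) countCjk(buf.toCharArray()) / buf.length();
    }

    public static void main(String[] args) {
        System.out.println(countCjk("abc\u4e2d\u6587".toCharArray())); // prints "2"
        System.out.println(ratio("ab\u4e2d\u6587"));                   // prints "0.5"
    }
}
```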
-
isCJK
public static boolean isCJK(java.lang.Character.UnicodeBlock blk)
-
isChinese
public static boolean isChinese(java.lang.Character.UnicodeBlock blk)
-
isKorean
public static boolean isKorean(java.lang.Character.UnicodeBlock blk)
Likely to be uniquely Korean if the character block is in Hangul. But also, it may be Korean if block is part of the CJK ideographs at large. User must check if text in its entirety is part of CJK & Hangul, independently. This method only detects if character block is uniquely Hangul or not.
- Parameters:
blk
- a Java Unicode block
- Returns:
- true if char block is Hangul
-
isJapanese
public static boolean isJapanese(java.lang.Character.UnicodeBlock blk)
Checks if char block is uniquely Japanese. For other chars, check isChinese.
- Parameters:
blk
- a Java Unicode block
- Returns:
- true if char block is Hiragana or Katakana
-
compress
public static byte[] compress(java.lang.String buf) throws java.io.IOException
Compress bytes from a Unicode string. Conversion to bytes first to avoid unicode or platform-dependent IO issues.
- Parameters:
buf
- UTF-8 encoded text
- Returns:
- byte array
- Throws:
java.io.IOException
- on error with compression or text encoding
-
compress
public static byte[] compress(java.lang.String buf, java.lang.String charset) throws java.io.IOException
- Parameters:
buf
- text
charset
- character set encoding for text
- Returns:
- byte array for the compressed result
- Throws:
java.io.IOException
- on error with compression or text encoding
-
uncompress
public static java.lang.String uncompress(byte[] gzData) throws java.io.IOException
- Parameters:
gzData
- byte array containing gzipped buffer
- Returns:
- buffer UTF-8 decoded string
- Throws:
java.io.IOException
- on error with decompression or text encoding
-
uncompress
public static java.lang.String uncompress(byte[] gzData, java.lang.String charset) throws java.io.IOException
- Parameters:
gzData
- byte array containing gzipped buffer
charset
- character set decoding for text
- Returns:
- buffer of uncompressed, decoded string
- Throws:
java.io.IOException
- on error with decompression or text encoding
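A round-trip sketch of gzip compression for text, using java.util.zip. It fixes the charset to UTF-8 and wraps IOException in UncheckedIOException to keep the example short, whereas the methods documented above take an optional charset and declare IOException.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

class GzipTextSketch {
    static byte[] compress(String buf) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        // Encode to bytes explicitly so the result does not depend on the platform default.
        try (GZIPOutputStream gz = new GZIPOutputStream(bytes)) {
            gz.write(buf.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    static String uncompress(byte[] gzData) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPInputStream gz = new GZIPInputStream(new ByteArrayInputStream(gzData))) {
            byte[] chunk = new byte[4096];
            int n;
            while ((n = gz.read(chunk)) != -1) {
                out.write(chunk, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String text = "caf\u00e9 \u4e2d\u6587";
        System.out.println(uncompress(compress(text)).equals(text)); // prints "true"
    }
}
```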
-
removeEmoticons
public static java.lang.String removeEmoticons(java.lang.String t)
Replace emoticons with something less nefarious -- UTF-16 characters do not play well with some I/O routines.
- Parameters:
t
- text
- Returns:
- scrubbed text
-
removeSymbols
public static java.lang.String removeSymbols(java.lang.String t)
Replace symbology.
- Parameters:
t
- text
- Returns:
- scrubbed text
-
countNonText
public static int countNonText(java.lang.String t)
Count the number of non-alphanumeric chars present.
- Parameters:
t
- text
- Returns:
- count of chars
-
parseHashTags
public static java.util.Set<java.lang.String> parseHashTags(java.lang.String tweetText)
Parse the typical Twitter hashtag variants.
- Parameters:
tweetText
-- Returns:
-
parseHashTags
public static java.util.Set<java.lang.String> parseHashTags(java.lang.String tweetText, boolean normalize)
Takes a string and returns all the hashtags in it. Normalized tags are just lowercased and deduplicated. Reference: https://developer.twitter.com/en/docs/tweets/data-dictionary/overview/intro-to-tweet-json
- Parameters:
tweetText
- text
normalize
- whether to normalize text by lowercasing tags, etc.
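A sketch of hashtag extraction with optional normalization. It covers only the plain "#ABC123" form, returns tags without the leading "#", and uses a hypothetical pattern; all three are assumptions, since the source does not show the regular expressions or the exact output format.

```java
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HashtagSketch {
    // Hypothetical pattern; Unicode-aware so non-ASCII tags are not truncated.
    private static final Pattern TAG =
            Pattern.compile("#(\\w+)", Pattern.UNICODE_CHARACTER_CLASS);

    static Set<String> parse(String text, boolean normalize) {
        Set<String> tags = new LinkedHashSet<>(); // a set deduplicates, keeps first-seen order
        Matcher m = TAG.matcher(text);
        while (m.find()) {
            String tag = m.group(1);
            tags.add(normalize ? tag.toLowerCase(Locale.ROOT) : tag);
        }
        return tags;
    }

    public static void main(String[] args) {
        String tweet = "#Java and #java and #R\u00e9sum\u00e9";
        System.out.println(parse(tweet, true).size());  // prints "2"
        System.out.println(parse(tweet, false).size()); // prints "3"
    }
}
```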
-
parseNaturalLanguage
public static java.lang.String parseNaturalLanguage(java.lang.String raw)
See the five-argument parseNaturalLanguage for the implementation.
- Parameters:
raw
- raw text
- Returns:
- cleaner looking text
- See Also:
replace HTML, URLs removed, Tags and entity markers (@ and #) stripped; Tags and entities left in place.
-
parseNaturalLanguage
public static java.lang.String parseNaturalLanguage(java.lang.String raw, boolean unescapeHtml, boolean remURLs, boolean remTags, boolean remEntities)
Given tweet text or any [social media] text remove entities or other markers:
  - URLs are removed
  - entities are stripped of "@"
  - hashtags are stripped of "#"
  - HTML: &amp; is converted to an ampersand
  - HTML: escaped angle brackets are replaced with { and } for gt and lt, respectively
  - HTML: remaining special chars are converted back to unicode; remaining ampersand is replaced with "+"
Whitespaces (space, newlines, tabs, etc.) are reduced.
DEPRECATED: the use of the tags=true flag to replace hashtags with blank is not supported. #tag<unicode text> is a problem. It is hard to tell in some cases where the hashtag ends. In Weibo, #tag#<unicode text> is used to denote that tag has a start/end. But in Twitter, tag format is "#tag" or "#[phrase here]" etc. So there is no generic hashtag replacement.
- Parameters:
raw
- original text
unescapeHtml
- unescape HTML
remURLs
- remove URLs
remTags
- remove hash tags
remEntities
- remove other entities
- Returns:
- text less entities.
-
parseDate
public static final java.util.Date parseDate(java.lang.String dt)
A limited-scope date parsing: Parse properly formatted strings, for example, ISO date/time strings stored in one of our Solr indices.
- Parameters:
dt
- ISO date/time string.
- Returns:
-
-
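For ISO date/time strings of the kind Solr stores (for example "2011-12-03T10:15:30Z"), a JDK-only parse might look like the sketch below. Using java.time.Instant and returning null on malformed input are assumptions of this sketch; the source does not say which parser parseDate uses or how it reports failures.

```java
import java.time.Instant;
import java.time.format.DateTimeParseException;
import java.util.Date;

class IsoDateSketch {
    /** Parse an ISO-8601 UTC instant; returns null if the string is malformed. */
    static Date parse(String dt) {
        try {
            return Date.from(Instant.parse(dt));
        } catch (DateTimeParseException e) {
            return null; // assumption: failure is signalled with null
        }
    }

    public static void main(String[] args) {
        System.out.println(parse("1970-01-01T00:00:01Z").getTime()); // prints "1000"
        System.out.println(parse("not a date"));                     // prints "null"
    }
}
```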