Package org.opensextant.util
Class TextUtils
java.lang.Object
org.opensextant.util.TextUtils
- Author:
- ubaldino
-
Field Summary
Modifier and Type / Field / Description
static final int ABBREV_MAX_LEN
static final String arabicLang
static final String bahasaLang
static final int CASE_LOWER
static final int CASE_UPPER
static final String chineseLang
static final String chineseTradLang
static final char CR
static final char DEL
static final String englishLang
static final String farsiLang
static final String frenchLang
static final String germanLang
static final Pattern hashtagPattern1 - Find any pattern "ABC#[ABC 123]" -- a hashtag with whitespace.
static final Pattern hashtagPattern2 - Find any pattern "#ABC123" -- normal hashtag. Java Regex note: UNICODE flags are important, otherwise "\w" and other classes match only ASCII.
static final String italianLang
static final String japaneseLang
static final String koreanLang
static final char NL
static final String portugueseLang
static final String romanianLang
static final String russianLang
static final char SP
static final String spanishLang
static final char TAB
static final String thaiLang
static final String turkishLang
static final String vietnameseLang
-
Constructor Summary
-
Method Summary
Modifier and Type / Method / Description
static void addLanguage(Language lg)
static void addLanguage(Language lg, boolean override) - Extend the basic language dictionary.
static String b2hex(byte[] barr)
static boolean checkCase - Detects if string alpha chars are purely lower case.
static byte[] compress - Compress bytes from a Unicode string.
static byte[] compress
static int count_digits(String txt) - Counts all digits in text.
static int count_ws - Counts all whitespace in text.
static int countASCIIChars(byte[] data) - Count the number of ASCII bytes.
static int countCJKChars(char[] chars) - Counts the CJK characters in a buffer; assumes the char array is Unicode characters.
static int countDigits(String txt)
static int countFormattingSpace - Count formatting whitespace.
static int countIrregularPunctuation
static int countNonText - Count the number of non-alphanumeric chars present.
static String delete_controls - Delete control chars from text data, leaving text and whitespace only.
static String delete_eol(String t) - Replace line endings with SPACE.
static String delete_whitespace - Delete whitespace of any sort.
static String fast_replace(String buf, String replace, String substitution) - Given a string S and a list of characters to replace with a substitute, return the new string, S'.
static int[] get_text_window(int offset, int textsize, int width) - Get a single text window around the offset.
static int[] get_text_window(int offset, int matchlen, int textsize, int width) - Find the text window(s) around a match.
static Language getLanguage(String code) - ISO2 and ISO3 char codes for languages are unique.
static String getLanguageCode(String code) - ISO2 and ISO3 char codes for languages are unique.
getLanguageMap - If callers want to add a language, they can.
static String getLanguageName(String code) - Given an ISO2 char code (least common denominator), retrieve the Language Name.
static boolean hasCJKText(String buf) - A simple test to see if text has any CJK characters at all.
static final boolean hasDiacritics - If a string has extended Latin diacritics.
static boolean hasDigits
static boolean hasIrregularPunctuation - Simple triage of punctuation.
static final boolean hasMiddleEasternText(String data) - Detects the first Arabic or Hebrew character for now -- will be more comprehensive in scoping "Middle Eastern" scripts in text.
static void initLanguageData - Initialize language codes and metadata.
static void initLOCLanguageData - This is Library of Congress data for language IDs.
static boolean isAbbreviation(String txt)
static boolean isAbbreviation(String orig, boolean useCase) - Define what an acronym is: A.B.
static boolean isASCII(byte[] data)
static final boolean isASCII(char c)
static boolean isASCII - Early exit test -- return false on first non-ASCII character found.
static final boolean isASCIILetter(char c)
static boolean isChinese
static boolean isChinese - Utility method to check if lang ID is Chinese (Traditional or Simplified).
static boolean isCJK
static boolean isCJK - Utility method to check if lang ID is Chinese, Korean, or Japanese.
static boolean isEnglish - Utility method to check if lang ID is English.
static boolean isEuroLanguage - European languages = Romance + GER + ENG; extend definition as needed.
static boolean isJapanese - Checks if char block is uniquely Japanese.
static boolean isKorean - Likely to be uniquely Korean if the character block is in Hangul.
static final boolean isLatin - Checks if non-ASCII and non-LATIN characters are present.
static boolean isLower
static boolean isLowerCaseDocument(int[] counts) - This measures the amount of upper case; see isUpperCaseDocument.
static final boolean isNumeric - Determine if a string is numeric in nature, not necessarily a parsable number.
static boolean isRomanceLanguage - Romance languages = SPA + POR + ITA + FRA + ROM; extend definition as needed.
static boolean isUpper - For measuring the upper-case-ness of short texts.
static boolean isUpperCaseDocument(int[] counts) - First measureCase(Text) to acquire counts, then call this routine for a heuristic that suggests the text is mainly upper case.
static String md5_id(byte[] digest) - Deprecated. Not MD5 specific.
static int[] measureCase(String text) - Measure character count, upper, lower, non-Character, whitespace.
static double measureCJKText(String buf) - Returns a ratio of Chinese/Japanese/Korean characters: CJK chars / ALL.
static String normalizeAbbreviation(String word) - Intended only as a filter for punctuation within a word.
static String normalizeTextEntity - Normalization: clean the ends, remove line-endings from the middle of an entity.
static String normalizeUnicode(String str) - Normalize to "Normalization Form Canonical Decomposition" (NFD); supports proper file name retrieval from the file system, among other things.
static final Date parseDate - A limited-scope date parsing: parse properly formatted strings, for example ISO date/time strings stored in one of our Solr indices.
parseHashTags(String tweetText) - Parse the typical Twitter hashtag variants.
parseHashTags(String tweetText, boolean normalize) - Takes a string and returns all the hashtags in it.
static String parseNaturalLanguage - See default implementation below.
static String parseNaturalLanguage(String raw, boolean unescapeHtml, boolean remURLs, boolean remTags, boolean remEntities) - Given tweet text or any [social media] text, remove entities or other markers (URLs, @-entities, hashtags, HTML escapes); whitespace is reduced.
static String phoneticReduction - Create a non-diacritic, ASCII version of the input string.
static String phoneticReduction(String t, boolean isAscii)
static String reduce_line_breaks - Replaces all 3 or more blank lines with a single paragraph break (\n\n).
static String removeAny - Remove instances of any char in the remove string from buf.
static String removeAnyLeft(String buf, String remove) - Compare to trim(string, chars), but you can trim any chars.
static String removeDiacritics(String word) - Supports the Phoneticizer utility from OpenSextant v1.x; removes diacritics from a phrase.
static String removeEmoticons - Replace emoticons with something less nefarious -- UTF-16 characters do not play well with some I/O routines.
static String removePunctuation(String word) - Remove any leading and trailing punctuation and some internal punctuation.
static String removeSymbols - Replace symbology.
static String replaceAny(String buf, String remove, String sub) - Replace any of the removal chars with the sub.
static final String replaceDiacritics - A thorough replacement of diacritics and Unicode chars to their ASCII equivalents.
static String replaceDiacriticsOriginal - Deprecated. See replaceDiacritics as the replacement.
static String squeeze_whitespace - Minimize whitespace.
string2list(String s, String delim) - Get a list of values into a nice, scrubbed array of values, no whitespace.
static String text_id - Static method -- use only if you are sure of thread-safety.
static String[] tokens - Return just white-space delimited tokens.
static final String[] tokensLeft(String str) - See tokensRight().
static final String[] tokensRight(String str) - Return tokens on the right most part of a buffer.
static String uncompress(byte[] gzData)
static String uncompress(byte[] gzData, String charset)
-
Field Details
-
NL
public static final char NL- See Also:
-
CR
public static final char CR- See Also:
-
SP
public static final char SP- See Also:
-
TAB
public static final char TAB- See Also:
-
DEL
public static final char DEL- See Also:
-
CASE_LOWER
public static final int CASE_LOWER- See Also:
-
CASE_UPPER
public static final int CASE_UPPER- See Also:
-
ABBREV_MAX_LEN
public static final int ABBREV_MAX_LEN- See Also:
-
arabicLang
- See Also:
-
bahasaLang
- See Also:
-
chineseLang
- See Also:
-
chineseTradLang
- See Also:
-
englishLang
- See Also:
-
farsiLang
- See Also:
-
frenchLang
- See Also:
-
germanLang
- See Also:
-
italianLang
- See Also:
-
japaneseLang
- See Also:
-
koreanLang
- See Also:
-
portugueseLang
- See Also:
-
russianLang
- See Also:
-
spanishLang
- See Also:
-
turkishLang
- See Also:
-
thaiLang
- See Also:
-
vietnameseLang
- See Also:
-
romanianLang
- See Also:
-
hashtagPattern1
Find any pattern "ABC#[ABC 123]" -- a hashtag with whitespace. Java Regex note: UNICODE flags are important, otherwise "\w" and other classes match only ASCII. NOTE: These are Twitter hashtags primarily -
hashtagPattern2
Find any pattern "#ABC123" -- normal hashtag, Java Regex note: UNICODE flags are important, otherwise "\w" and other classes match only ASCII. NOTE: These are Twitter hashtags primarily
-
-
Constructor Details
-
TextUtils
public TextUtils()
-
-
Method Details
-
hasIrregularPunctuation
Simple triage of punctuation. Rationale: OpenSextant taggers maximize RECALL in favor of not missing a possible match. The problem there is that we often encounter substantial noise in tagger output, so a trivial test is to see if we have overmatched. Allowed punctuation: , . - _ ` ' ( ) ## Diacritics, parenthetics, periods/dashes. Given the phrase "A B C" we may have matched "A|B+C", "A; B; C", "A <B> C", etc., where common punctuation separates valid tokens that appear in the reference phrase.
- Parameters:
t
- Returns:
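A minimal usage sketch (the demo class name and sample phrases are hypothetical; expected outputs are inferred from the description above):
    import org.opensextant.util.TextUtils;

    public class PunctuationTriageDemo {
        public static void main(String[] args) {
            // Only allowed punctuation: likely not flagged.
            System.out.println(TextUtils.hasIrregularPunctuation("St. Mary's Hospital"));
            // Tagger overmatch: pipes and plus signs separate otherwise valid tokens -- likely flagged.
            System.out.println(TextUtils.hasIrregularPunctuation("A|B+C"));
        }
    }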
-
countIrregularPunctuation
-
isLatin
Checks if non-ASCII and non-LATIN characters are present.
- Parameters:
data - any textual data
- Returns:
true if content is strictly ASCII or Latin1 extended.
-
hasMiddleEasternText
Detects the first Arabic or Hebrew character for now -- will be more comprehensive in scoping "Middle Eastern" scripts in text.
- Parameters:
data
- Returns:
-
hasDiacritics
If a string has extended Latin diacritics.
- Parameters:
s - string to test
- Returns:
true if a single diacritic is found.
-
phoneticReduction
Create a non-diacritic, ASCII version of the input string. This preserves the original whitespace, but removes non-character markings, e.g. "Za'tut" => "Zatut" not "Za tut".
- Parameters:
t
- Returns:
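A short usage sketch of the single-argument form, based on the example above (demo class is hypothetical; TextUtils assumed on the classpath):
    import org.opensextant.util.TextUtils;

    public class PhoneticReductionDemo {
        public static void main(String[] args) {
            // The apostrophe is dropped rather than replaced with a space, per the description above.
            String reduced = TextUtils.phoneticReduction("Za'tut");
            System.out.println(reduced); // expected "Zatut", not "Za tut"
        }
    }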
-
phoneticReduction
-
replaceDiacritics
A thorough replacement of diacritics and Unicode chars to their ASCII equivalents.
- Parameters:
s - the string
- Returns:
converted string
-
replaceDiacriticsOriginal
Deprecated. See replaceDiacritics as the replacement. Removes accents from a string and replaces them with their ASCII equivalents. Reference: http://www.rgagnon.com/javadetails/java-0456.html Caveat: This implementation is not exhaustive.
- Parameters:
s
- Returns:
- See Also:
-
isASCII
public static final boolean isASCII(char c) - Parameters:
c
- a character- Returns:
- true if c is ASCII
-
isASCIILetter
public static final boolean isASCIILetter(char c) - Parameters:
c
- character- Returns:
- true if c is ASCII a-z or A-Z
-
isASCII
public static boolean isASCII(byte[] data) - Parameters:
data
- bytes to test- Returns:
- boolean if data is ASCII or not
-
isASCII
Early exit test -- return false on first non-ASCII character found.- Parameters:
t
- buffer of text- Returns:
- true only if every char is in ASCII table.
-
countASCIIChars
public static int countASCIIChars(byte[] data) count the number of ASCII bytes- Parameters:
data
- bytes to count- Returns:
- count of ASCII bytes
-
reduce_line_breaks
Replaces all 3 or more blank lines with a single paragraph break (\n\n)- Parameters:
t
- text- Returns:
- A string with fewer line breaks;
-
delete_whitespace
Delete whitespace of any sort.- Parameters:
t
- text- Returns:
- String, without whitespace.
-
squeeze_whitespace
Minimize whitespace.- Parameters:
t
- text- Returns:
- scrubbed string
-
delete_eol
Replace line endings with SPACE- Parameters:
t
- text- Returns:
- scrubbed string
-
delete_controls
Delete control chars from text data; leaving text and whitespace only. Delete char (^?) is also removed. Length may differ if ctl chars are removed.- Parameters:
t
- text- Returns:
- scrubbed buffer
-
hasDigits
-
countDigits
-
count_digits
Counts all digits in text.- Parameters:
txt
- text to count- Returns:
- count of digits
-
isNumeric
Determine if a string is numeric in nature, not necessarily a parsable number. 0-9 or "-+.E" are valid symbols. Example: "11111E.00003333" is numeric; commons StringUtils.isNumeric only detects digits.
- Parameters:
v - val to parse
- Returns:
true if val is a numeric sequence, symbols allowed.
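A brief sketch contrasting this test with a plain digit check (class name and inputs are hypothetical):
    import org.opensextant.util.TextUtils;

    public class NumericCheckDemo {
        public static void main(String[] args) {
            // Symbols -+.E are allowed, so this should be considered numeric.
            System.out.println(TextUtils.isNumeric("11111E.00003333")); // likely true
            // A plain word is not numeric in nature.
            System.out.println(TextUtils.isNumeric("eleven"));          // likely false
        }
    }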
-
count_ws
Counts all whitespace in text.- Parameters:
txt
- text- Returns:
- whitespace count
-
countFormattingSpace
Count formatting whitespace. This is helpful in determining if text spans are phrases with multiple TAB or EOL characters. For that matter, any control character contributes to formatting in some way. DEL, VT, HT, etc. So all control characters ( c < ' ') are counted.- Parameters:
txt
- input string- Returns:
- count of format chars
-
isUpper
For measuring the upper-case-ness of short texts. Returns true if ALL letters in text are UPPERCASE. Allows for non-letters in text.- Parameters:
dat
- text or data- Returns:
- true if text is Upper
-
isLower
-
checkCase
Detects if string alpha chars are purely lower case.
- Parameters:
text - text
textcase - 1 lower, 2 upper
- Returns:
if case matches given textcase param
-
measureCase
Measure character count, upper, lower, non-Character, whitespace.
- Parameters:
text - text
- Returns:
int array with counts.
-
isUpperCaseDocument
public static boolean isUpperCaseDocument(int[] counts)
First measureCase(Text) to acquire counts, then call this routine for a heuristic that suggests the text is mainly upper case. These routines may not work well on languages that are not Latin-alphabet.
- Parameters:
counts - word stats from measureCase()
- Returns:
true if counts represent text that exceeds the "UPPER CASE" threshold
-
isLowerCaseDocument
public static boolean isLowerCaseDocument(int[] counts)
This measures the amount of upper case; see isUpperCaseDocument. Two ways to measure: lower case count compared to all content (char + non-char), or compared to just char content.
- Parameters:
counts - word stats from measureCase()
- Returns:
true if counts represent text that exceeds the "lower case" threshold
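A sketch of the intended two-step flow: measureCase() produces the counts array that the two document-case heuristics consume (sample text and class name are hypothetical):
    import org.opensextant.util.TextUtils;

    public class CaseHeuristicsDemo {
        public static void main(String[] args) {
            String text = "BREAKING NEWS: ALL CAPS HEADLINE FROM THE WIRE";
            // Step 1: gather character counts (upper, lower, non-character, whitespace).
            int[] counts = TextUtils.measureCase(text);
            // Step 2: apply the case thresholds to the counts.
            boolean mostlyUpper = TextUtils.isUpperCaseDocument(counts);
            boolean mostlyLower = TextUtils.isLowerCaseDocument(counts);
            System.out.println("upper=" + mostlyUpper + ", lower=" + mostlyLower);
        }
    }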
-
get_text_window
public static int[] get_text_window(int offset, int matchlen, int textsize, int width)
Find the text window(s) around a match. Given the size of a buffer, the match, and the desired width, return:
prepreprepre MATCH postpostpost
^            ^     ^           ^
l-width      l     l+len       l+len+width
left1        left2 right1      right2
- Parameters:
offset - offset of match
width - width of window left and right of match
textsize - size of buffer containing match; used for boundary conditions
matchlen - length of match
- Returns:
window offsets left of match, right of match: [ l1, l2, r1, r2 ]
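A worked sketch of the offsets: for a match at offset 50 of length 5 in a 200-char buffer with width 20, the returned array should correspond to [l-width, l, l+len, l+len+width]; clamping at the buffer bounds is assumed but not shown here.
    import org.opensextant.util.TextUtils;

    public class TextWindowDemo {
        public static void main(String[] args) {
            int offset = 50, matchlen = 5, textsize = 200, width = 20;
            int[] w = TextUtils.get_text_window(offset, matchlen, textsize, width);
            // Expected roughly: left1=30, left2=50, right1=55, right2=75
            System.out.println("[" + w[0] + ", " + w[1] + ", " + w[2] + ", " + w[3] + "]");
        }
    }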
-
get_text_window
public static int[] get_text_window(int offset, int textsize, int width)
Get a single text window around the offset.
- Parameters:
offset - offset of match
width - width of window left and right of match
textsize - size of buffer containing match; used for boundary conditions
- Returns:
window offsets of a text span containing match [ left, right ]
-
text_id
public static String text_id(String text) throws NoSuchAlgorithmException, UnsupportedEncodingException
Static method -- use only if you are sure of thread-safety.
- Parameters:
text - text or data
- Returns:
identifier for the text, an MD5 hash
- Throws:
NoSuchAlgorithmException - on err
UnsupportedEncodingException - on err
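A minimal sketch; the checked exceptions come from the signature above, so the demo simply declares them (class name is hypothetical):
    import org.opensextant.util.TextUtils;

    public class TextIdDemo {
        public static void main(String[] args) throws Exception {
            // Identical text should produce an identical MD5-based identifier.
            String id1 = TextUtils.text_id("OpenSextant");
            String id2 = TextUtils.text_id("OpenSextant");
            System.out.println(id1.equals(id2)); // expected true
        }
    }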
-
b2hex
-
md5_id
Deprecated. Not MD5 specific; use b2hex() instead.
- Parameters:
digest - byte array
- Returns:
hash for the data
-
string2list
Get a list of values into a nice, scrubbed array of values, no whitespace.
a, b, c d e, f => [ "a", "b", "c d e", "f" ]
- Parameters:
s - string to split
delim - delimiter, no default.
- Returns:
list of split strings, which are also whitespace trimmed
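A usage sketch based on the example above. The concrete return type is not shown in this extract (likely a list of strings), so the sample uses var:
    import org.opensextant.util.TextUtils;

    public class String2ListDemo {
        public static void main(String[] args) {
            // Splits on the given delimiter and trims whitespace from each value.
            var values = TextUtils.string2list("a, b, c d e, f", ",");
            System.out.println(values); // expected [a, b, c d e, f]
        }
    }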
-
fast_replace
Given a string S and a list of characters to replace with a substitute, return the new string, S'.
"-name-with.invalid characters;"   // replace "-. ;" with "_"
"_name_with_invalid_characters_"   // result
- Parameters:
buf - buffer
replace - string of characters to replace with the one substitute char
substitution - string to insert in place of chars
- Returns:
scrubbed text
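A sketch reproducing the filename-scrubbing example above (demo class is hypothetical):
    import org.opensextant.util.TextUtils;

    public class FastReplaceDemo {
        public static void main(String[] args) {
            // Replace any of '-', '.', ' ', ';' with an underscore.
            String cleaned = TextUtils.fast_replace("-name-with.invalid characters;", "-. ;", "_");
            System.out.println(cleaned); // expected "_name_with_invalid_characters_"
        }
    }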
-
removeAny
Remove instances of any char in the remove string from buf.
- Parameters:
buf - text
remove - string to remove
- Returns:
scrubbed text
-
replaceAny
Replace any of the removal chars with the sub. A many-to-one replacement. Alternative: use regex String.replace(//, '').
- Parameters:
buf - text
remove - string to replace
sub - the replacement string
- Returns:
scrubbed text
-
removeAnyLeft
Compare to trim(string, chars), but you can trim any chars.
Example: given the string "- a b c", remove the leading "-" from it.
- Parameters:
buf - text
remove - string to remove
- Returns:
scrubbed text
-
normalizeTextEntity
Normalization: clean the ends, remove line-endings from the middle of an entity.
Example:
TEXT: **The Daily Newsletter of \n\rBarbara, So.**
CLEAN: __The Daily Newsletter of __Barbara, So___
where "__" represents omitted characters.
- Parameters:
str - text
- Returns:
scrubbed text
-
tokens
Return just white-space delimited tokens.
- Parameters:
str - text
- Returns:
tokens
-
tokensRight
Return tokens on the right-most part of a buffer. If a para break occurs, \n\n or \r\n\r\n, then return the part on the right of the break.
- Parameters:
str - text
- Returns:
whitespace delimited tokens
-
tokensLeft
See tokensRight()- Parameters:
str
- text- Returns:
- whitespace delimited tokens
-
normalizeAbbreviation
Intended only as a filter for punctuation within a word. Text of the form A.T.T. or U.S. becomes ATT and US. A text such as Mr.Pibbs incorrectly becomes MrPibbs, but for the purposes of normalizing tokens this should be fine. Use appropriate tokenization prior to using this as a filter.
- Parameters:
word - phrase with periods denoting some abbreviation.
- Returns:
scrubbed text
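A short sketch of the intended filtering, including the caveat case from the description (expected outputs are inferred from the examples above):
    import org.opensextant.util.TextUtils;

    public class AbbreviationFilterDemo {
        public static void main(String[] args) {
            System.out.println(TextUtils.normalizeAbbreviation("U.S."));     // expected "US"
            System.out.println(TextUtils.normalizeAbbreviation("A.T.T."));   // expected "ATT"
            // Known caveat: untokenized input collapses incorrectly.
            System.out.println(TextUtils.normalizeAbbreviation("Mr.Pibbs")); // expected "MrPibbs"
        }
    }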
-
isAbbreviation
- Parameters:
txt
-- Returns:
- See Also:
-
isAbbreviation
Define what an acronym is:
A.B. (at minimum)
A.b. okay
A. b. okay
A.b not okay
A.9. not okay
Starts with Alpha. Period is required. Ends with a period. One upper case letter required -- optional arg for case sensitivity. Digits allowed. Spaces allowed. Length no longer than 15 non-whitespace chars.
-
removeDiacritics
Supports Phoneticizer utility from OpenSextant v1.x Remove diacritics from a phrase- Parameters:
word
- text- Returns:
- scrubbed text
-
normalizeUnicode
Normalize to "Normalization Form Canonical Decomposition" (NFD) REF: http: //stackoverflow.com/questions/3610013/file-listfiles-mangles-unicode- names-with-jdk-6-unicode-normalization-issues This supports proper file name retrieval from file system, among other things. In many situations we see unicode file names -- Java can list them, but in using the Java-provided version of the filename the OS/FS may not be able to find the file by the name given in a particular normalized form.- Parameters:
str
- text- Returns:
- normalized string, encoded with NFD bytes
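A sketch of the normalization use case; the comparison against java.text.Normalizer is for illustration only and assumes the method is equivalent to plain NFD normalization:
    import java.text.Normalizer;
    import org.opensextant.util.TextUtils;

    public class NfdDemo {
        public static void main(String[] args) {
            String precomposed = "r\u00e9sum\u00e9"; // "résumé" using precomposed é characters
            String nfd = TextUtils.normalizeUnicode(precomposed);
            // Reference NFD form, for comparison only (assumption about the underlying implementation).
            String reference = Normalizer.normalize(precomposed, Normalizer.Form.NFD);
            System.out.println(nfd.equals(reference)); // likely true
            System.out.println(nfd.length());          // decomposed form is longer (8 vs 6 chars)
        }
    }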
-
removePunctuation
Remove any leading and trailing punctuation and some internal punctuation. Internal punctuation which indicates conjunction of two tokens, e.g. a hyphen, should have caused a split into separate tokens at the tokenization stage. Supports the Phoneticizer utility from OpenSextant v1.x; removes punctuation from a phrase.
- Parameters:
word - text
- Returns:
scrubbed text
-
getLanguageMap
If callers want to add a language, they can.
- Returns:
map of lang ID to language obj
-
initLanguageData
public static void initLanguageData()
Initialize language codes and metadata. This establishes a map for the most common language codes/names that exist in at least ISO-639-1 and have a non-zero 2-char ID.
Based on: http://stackoverflow.com/questions/674041/is-there-an-elegant-way-to-convert-iso-639-2-3-letter-language-codes-to-java-lo
Actual code mappings:
en => eng
eng => en
cel => ''  // Celtic; avoid this.
tr => tur
tur => tr
Names:
tr => turkish
tur => turkish
turkish => tr  // ISO2 only
-
initLOCLanguageData
This is Library of Congress data for language IDs. It is offered as a tool to help downstream language ID and enrich metadata when tagging data from particular countries. Reference: http://www.loc.gov/standards/iso639-2/ISO-639-2_utf-8.txt
- Throws:
IOException - if resource file is not found
-
addLanguage
-
addLanguage
Extend the basic language dictionary. Note -- the first language is listed in the language map by Name and is not overwritten. Language objects may be overwritten in the map using lang codes. For example, fre = French(fre), fra = French(fra), and french = French(fra); the last one, 'french', could have been either French(fre) or French(fra). Similarly, 'ger' and 'deu' are both valid ISO 3-alpha codes for German. What to do? TODO: Create a language object that lists both language biblio/terminology codes.
- Parameters:
lg - language object
override - if this value should overwrite an existing one.
-
getLanguageName
Given an ISO2 char code (least common denominator), retrieve the Language Name. This is best effort, so if your code finds nothing, this returns the code normalized to lowercase.
- Parameters:
code - lang ID
- Returns:
name of language
-
getLanguage
ISO2 and ISO3 char codes for languages are unique.
- Parameters:
code - iso2 or iso3 code
- Returns:
the Language object for the given code.
-
getLanguageCode
ISO2 and ISO3 char codes for languages are unique.
- Parameters:
code - iso2 or iso3 code
- Returns:
the other code.
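A sketch of the lookup helpers; the printed values are illustrative guesses based on the initLanguageData() mappings above and depend on the loaded language data:
    import org.opensextant.util.TextUtils;

    public class LanguageLookupDemo {
        public static void main(String[] args) {
            // ISO2 <-> ISO3: "the other code" is returned.
            System.out.println(TextUtils.getLanguageCode("en"));  // likely "eng"
            System.out.println(TextUtils.getLanguageCode("tur")); // likely "tr"
            // Best-effort name lookup from an ISO2 code.
            System.out.println(TextUtils.getLanguageName("tr"));  // likely "turkish"
        }
    }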
-
isEuroLanguage
European languages = Romance + GER + ENG Extend definition as needed.- Parameters:
l
- language ID- Returns:
- true if language is European in nature
-
isRomanceLanguage
Romance languages = SPA + POR + ITA + FRA + ROM Extend definition as needed.- Parameters:
l
- lang ID- Returns:
- true if language is a Romance language
-
isEnglish
Utility method to check if lang ID is English...- Parameters:
x
- a langcode- Returns:
- whether langcode is english
-
isChinese
Utility method to check if lang ID is Chinese(Traditional or Simplified)...- Parameters:
x
- a langcode- Returns:
- whether langcode is chinese
-
isCJK
Utility method to check if lang ID is Chinese, Korean, or Japanese- Parameters:
x
- a langcode- Returns:
- whether langcode is a CJK language
-
measureCJKText
Returns a ratio of Chinese/Japanese/Korean characters: CJK chars / ALL. TODO: needs testing; not sure if this is a sustainable if-block, or if it is comprehensive. TODO: for performance reasons the internal chain of comparisons is embedded in the method; otherwise, for each char an external method invocation would be required.
- Parameters:
buf - the text to be measured
- Returns:
ratio of CJK characters to all characters
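A sketch of the ratio measurement (sample string is hypothetical; the exact ratio depends on how whitespace and punctuation are counted):
    import org.opensextant.util.TextUtils;

    public class CjkMeasureDemo {
        public static void main(String[] args) {
            String mixed = "Tokyo 東京 station";
            // Quick presence test first, then the ratio of CJK chars to all chars.
            if (TextUtils.hasCJKText(mixed)) {
                double ratio = TextUtils.measureCJKText(mixed);
                System.out.println("CJK ratio: " + ratio); // a small fraction for this string
            }
        }
    }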
-
countCJKChars
public static int countCJKChars(char[] chars)
Counts the CJK characters in the buffer. Inspiration: http://stackoverflow.com/questions/1499804/how-can-i-detect-japanese-text-in-a-java-string Assumption is that the char array is Unicode characters.
- Parameters:
chars - char array for the text in question.
- Returns:
count of CJK characters
-
hasCJKText
A simple test to see if text has any CJK characters at all. It returns after the first such character.- Parameters:
buf
- text- Returns:
- if buf has at least one CJK char.
-
isCJK
-
isChinese
-
isKorean
Likely to be uniquely Korean if the character block is in Hangul. But also, it may be Korean if block is part of the CJK ideographs at large. User must check if text in its entirety is part of CJK & Hangul, independently. This method only detects if character block is uniquely Hangul or not.- Parameters:
blk
- a Java Unicode block- Returns:
- true if char block is Hangul
-
isJapanese
Checks if char block is uniquely Japanese. Check other chars isChinese- Parameters:
blk
- a Java Unicode block- Returns:
- true if char block is Hiragana or Katakana
-
compress
Compress bytes from a Unicode string. Conversion to bytes first to avoid unicode or platform-dependent IO issues.
- Parameters:
buf - UTF-8 encoded text
- Returns:
byte array
- Throws:
IOException - on error with compression or text encoding
-
compress
- Parameters:
buf - text
charset - character set encoding for text
- Returns:
byte array for the compressed result
- Throws:
IOException - on error with compression or text encoding
-
uncompress
- Parameters:
gzData - byte array containing gzipped buffer
- Returns:
buffer UTF-8 decoded string
- Throws:
IOException - on error with decompression or text encoding
-
uncompress
- Parameters:
gzData - byte array containing gzipped buffer
charset - character set decoding for text
- Returns:
buffer of uncompressed, decoded string
- Throws:
IOException - on error with decompression or text encoding
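A round-trip sketch; both methods throw IOException per the signatures above, so the demo declares it (sample text is hypothetical):
    import java.io.IOException;
    import org.opensextant.util.TextUtils;

    public class GzipRoundTripDemo {
        public static void main(String[] args) throws IOException {
            String original = "Some UTF-8 text, e.g. Zürich 東京";
            // Compress to gzipped bytes, then restore the UTF-8 string.
            byte[] gz = TextUtils.compress(original);
            String restored = TextUtils.uncompress(gz);
            System.out.println(original.equals(restored)); // expected true
        }
    }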
-
removeEmoticons
replace Emoticons with something less nefarious -- UTF-16 characters do not play well with some I/O routines.- Parameters:
t
- text- Returns:
- scrubbed text
-
removeSymbols
Replace symbology- Parameters:
t
- text- Returns:
- scrubbed text
-
countNonText
Count the number of non-alphanumeric chars present.
- Parameters:
t - text
- Returns:
count of chars
-
parseHashTags
Parse the typical Twitter hashtag variants.- Parameters:
tweetText
-- Returns:
-
parseHashTags
Takes a string and returns all the hashtags in it. Normalized tags are just lowercased and deduplicated. https://developer.twitter.com/en/docs/tweets/data-dictionary/overview/intro-to-tweet-json
- Parameters:
tweetText - text
normalize - whether to normalize text by lowercasing tags, etc.
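A usage sketch; the concrete return type is not shown in this extract (likely a collection of strings), so var is used, and the sample tweet is hypothetical:
    import org.opensextant.util.TextUtils;

    public class HashtagDemo {
        public static void main(String[] args) {
            String tweet = "Big storm tonight #Weather #weather #[storm watch]";
            // normalize=true should lowercase and deduplicate the tags.
            var tags = TextUtils.parseHashTags(tweet, true);
            System.out.println(tags);
        }
    }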
-
parseNaturalLanguage
see default implementation below- Parameters:
raw
- raw text- Returns:
- cleaner looking text
- See Also:
-
parseNaturalLanguage
public static String parseNaturalLanguage(String raw, boolean unescapeHtml, boolean remURLs, boolean remTags, boolean remEntities)
Given tweet text or any [social media] text, remove entities or other markers:
- URLs are removed
- entities are stripped of "@"
- hashtags are stripped of "#"
- HTML: "&amp;" is converted to an ampersand
- HTML: escaped angle brackets are replaced with { and } for gt and lt, respectively
- HTML: remaining special chars are converted back to unicode; the remaining ampersand is replaced with "+"
Whitespace (space, newlines, tabs, etc.) is reduced.
DEPRECATED: the use of the tags=true flag to replace hashtags with blank is not supported. #tag<unicode text> is a problem; it is hard to tell in some cases where the hashtag ends. In Weibo, #tag#<unicode text> is used to denote that the tag has a start/end, but in Twitter the tag format is "#tag" or "#[phrase here]" etc. So there is no generic hashtag replacement.
- Parameters:
raw - original text
unescapeHtml - unescape HTML
remURLs - remove URLs
remTags - remove hash tags
remEntities - remove other entities
- Returns:
text less entities.
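A sketch of the full-control form next to the default one-argument form (the sample tweet and exact cleaned output are hypothetical):
    import org.opensextant.util.TextUtils;

    public class SocialTextCleanupDemo {
        public static void main(String[] args) {
            String raw = "@user check this &amp; that http://t.co/abc123 #breaking";
            // Default behavior of the one-argument form.
            System.out.println(TextUtils.parseNaturalLanguage(raw));
            // Explicit flags: unescape HTML, remove URLs, keep hashtags, strip other entities.
            System.out.println(TextUtils.parseNaturalLanguage(raw, true, true, false, true));
        }
    }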
-
parseDate
A limited-scope date parsing: parse properly formatted strings, for example ISO date/time strings stored in one of our Solr indices.
- Parameters:
dt - ISO date/time string.
- Returns:
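A final sketch; the input must already be a properly formatted ISO date/time string, and error handling for malformed input is not shown in this extract (the sample timestamp is hypothetical):
    import java.util.Date;
    import org.opensextant.util.TextUtils;

    public class ParseDateDemo {
        public static void main(String[] args) throws Exception {
            // Hypothetical ISO timestamp such as one stored in a Solr index.
            Date d = TextUtils.parseDate("2017-04-05T12:30:00Z");
            System.out.println(d);
        }
    }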