Class LangDetect
java.lang.Object
org.opensextant.extractors.langid.LangDetect
Wrapper around Cybozu Labs LangDetect. This tool provides a simple
"guessLanguage" routine, since the default Cybozu LangDetect may fail to
return a response due to I/O errors and/or may provide multiple guesses
with probabilities.
guessLanguage offers a fallback that looks at unknown text to see whether
it is in the ASCII or CJK families.
Use this API wrapper in conjunction with the Xponents TextUtils.getLanguage()
routine and the Language class to connect LangID output with the actual
ISO 639 standard code pages.
ISO 2-char and 3-char language IDs differ depending on the use:
historical/bibliographic vs. linguistic/locales.
- Author:
- ubaldino
-
Field Summary
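Modifier and Type, Field, Description:
static final int DEFAULT_WORKING_SIZE
    If working size, in chars, is less than 180 (20 words of 8 chars + 1 whitespace word break each, i.e. 20 x 9);
static final Language LANGUAGE_ID_GROUP_CJK
static final Language LANGUAGE_ID_GROUP_ENGLISH
static final Language LANGUAGE_ID_GROUP_UNKNOWN
static double MIN_LANG_DETECT_PROBABILITY
static final int MIN_LENGTH_UNK_TEXT_THRESHOLD
    A simple threshold for demarcating when we might infer simple language ID with minimal content.
-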
Constructor Summary
Constructor, Description:
LangDetect()
    Default use requires you to unpack LangDetect profiles here: /langdetect-profiles
LangDetect(int textSz)
    If you anticipate working with short text - queries, tweets, excerpts, etc.
LangDetect(int textSz, String profiles)
LangDetect(String profiles)
-
Method Summary
Modifier and Type, Method, Description:
alternativeCJKLangID(String data)
    Detects whether the script of the text is Japanese, Korean or Chinese.
static Language alternativeLangID(String data)
    Look at raw bytes/characters to see which Unicode block they fall into.
detect(String text)
    API for LangDetect, cybozu.labs
Map<String,LangID> detect(String text, boolean withProbabilities)
    API for LangDetect, cybozu.labs.
detectSocialMediaLang(String lang, String naturalLanguage)
    Find best lang ID for short texts.
detectSocialMediaLang(String lang, String naturalLanguage, boolean findCJK)
    EXPERIMENTAL. UPDATE, 2015.
guessLanguage(String data)
    Routine to guess the language ID. Scrub data prior to guessing language.
void initLangId
    Taken straight from the LangDetect example. NOTE: /langdetect/profiles must be a folder on disk, although I have a variation that could work with JAR resources.
void setWorkingSize(int sz)
sort
    Sort what was found; returns LangID entries ordered from highest score to lowest.
-
Field Details
-
DEFAULT_WORKING_SIZE
public static final int DEFAULT_WORKING_SIZE
If working size, in chars, is less than 180 (20 words of 8 chars + 1 whitespace word break each, i.e. 20 x 9);
-
LANGUAGE_ID_GROUP_ENGLISH
-
LANGUAGE_ID_GROUP_CJK
-
LANGUAGE_ID_GROUP_UNKNOWN
-
MIN_LENGTH_UNK_TEXT_THRESHOLD
public static final int MIN_LENGTH_UNK_TEXT_THRESHOLD
A simple threshold for demarcating when we might infer a simple language ID from minimal content. E.g., given 16 chars of ASCII text, we can possibly say it is English. However, this is really only making a guess.
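The short-text heuristic described above can be sketched as follows. This is only an illustration, not the LangDetect implementation: the class ShortTextGuess, the method looksLikeAsciiText and the value 16 (taken from the "16 chars" example in the description) are assumptions.

```java
/** Hypothetical helper for illustration; not part of the LangDetect API. */
final class ShortTextGuess {
    // Assumed value, borrowed from the "16 chars of ASCII text" example.
    static final int MIN_LENGTH = 16;

    /** Returns true when the text has at least MIN_LENGTH chars and all of them are ASCII. */
    static boolean looksLikeAsciiText(String text) {
        if (text == null || text.length() < MIN_LENGTH) {
            return false;
        }
        // ASCII covers code points 0..127.
        return text.chars().allMatch(c -> c < 128);
    }
}
```

A true result only suggests the English/ASCII group; as the description says, it remains a guess.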
-
MIN_LANG_DETECT_PROBABILITY
public static double MIN_LANG_DETECT_PROBABILITY
-
-
Constructor Details
-
LangDetect
Default use requires you to unpack LangDetect profiles here: /langdetect-profiles
- Throws:
ConfigException
-
LangDetect
- Throws:
ConfigException
-
LangDetect
If you anticipate working with short text (queries, tweets, excerpts, etc.), then indicate that here. The text working size is given in number of chars.
- Parameters:
textSz -
- Throws:
ConfigException
-
LangDetect
- Throws:
ConfigException
-
-
Method Details
-
setWorkingSize
public void setWorkingSize(int sz)
- Parameters:
sz -
-
initLangId
Taken straight from the LangDetect example. NOTE: /langdetect/profiles must be a folder on disk, although I have a variation that could work with JAR resources. So classpath ordering is important, but the folder itself must also exist. TODO: workingSize is used only to guide the default profile directory: short message (sm) or not. In the future workingSize
- Throws:
ConfigException
-
detect
API for LangDetect, cybozu.labs
- Parameters:
text - ISO language ID or Locale. Straight from the Cybozu API
- Returns:
- Throws:
com.cybozu.labs.langdetect.LangDetectException
-
detect
public Map<String,LangID> detect(String text, boolean withProbabilities) throws com.cybozu.labs.langdetect.LangDetectException
API for LangDetect, cybozu.labs. However, this does not return a cybozu.Language object; this method returns its own LangID class.
- Parameters:
text -
withProbabilities - true to include probabilities on results
- Returns:
- Throws:
com.cybozu.labs.langdetect.LangDetectException
-
sort
Sort what was found; returns LangID entries ordered from highest score to lowest.
- Parameters:
lids -
- Returns:
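As an illustration of the ordering described for sort, here is a minimal sketch of sorting guesses from highest score to lowest. The Guess class is a hypothetical stand-in; the real LangID class and the actual sort signature may differ.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Illustration only; not the LangDetect implementation. */
final class ScoreSortSketch {
    /** Hypothetical stand-in for a language guess with a score. */
    static final class Guess {
        final String lang;
        final double score;

        Guess(String lang, double score) {
            this.lang = lang;
            this.score = score;
        }
    }

    /** Returns a new list ordered from highest score to lowest; the input is left untouched. */
    static List<Guess> sortByScore(List<Guess> guesses) {
        List<Guess> sorted = new ArrayList<>(guesses);
        sorted.sort(Comparator.comparingDouble((Guess g) -> g.score).reversed());
        return sorted;
    }
}
```

Copying the list before sorting keeps the caller's collection unchanged; whether the real method sorts in place is not stated here.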
-
guessLanguage
Routine to guess the language ID. Scrub data prior to guessing language: non-language text (jargon, codes, tables, URLs, hashtags, data) will interfere with or overwhelm the volume of natural language text.
- Parameters:
data -
- Returns:
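To illustrate the kind of scrubbing recommended before guessing a language, here is a minimal sketch that strips URLs, hashtags and mentions with regular expressions. The class ScrubSketch and its patterns are assumptions made for illustration; they are not the Xponents scrubbing routine and will miss many kinds of non-language text.

```java
import java.util.regex.Pattern;

/** Illustration only; a rough pre-cleaning step, not the Xponents implementation. */
final class ScrubSketch {
    private static final Pattern URL = Pattern.compile("https?://\\S+");
    // UNICODE_CHARACTER_CLASS lets \w match non-ASCII letters in tags.
    private static final Pattern TAG =
            Pattern.compile("[#@]\\w+", Pattern.UNICODE_CHARACTER_CLASS);
    private static final Pattern SPACES = Pattern.compile("\\s+");

    /** Removes URLs, hashtags and mentions, then collapses whitespace. */
    static String scrub(String text) {
        String t = URL.matcher(text).replaceAll(" ");
        t = TAG.matcher(t).replaceAll(" ");
        return SPACES.matcher(t).replaceAll(" ").trim();
    }
}
```

The scrubbed string, rather than the raw input, would then be passed to the language guesser.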
-
alternativeLangID
Look at raw bytes/characters to see which Unicode block they fall into.
- Parameters:
data -
- Returns:
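One way to look at which Unicode block the characters fall into is to tally letters per block, as sketched below. This is an assumption about the general technique, not the actual alternativeLangID logic; the class BlockTallySketch and the method dominantBlock are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustration only; not the LangDetect implementation. */
final class BlockTallySketch {
    /** Counts letters per Unicode block; returns the most frequent block, or null if there are no letters. */
    static Character.UnicodeBlock dominantBlock(String text) {
        Map<Character.UnicodeBlock, Integer> counts = new HashMap<>();
        text.codePoints()
            .filter(Character::isLetter)
            .forEach(cp -> counts.merge(Character.UnicodeBlock.of(cp), 1, Integer::sum));

        Character.UnicodeBlock best = null;
        int bestCount = 0;
        // On ties, whichever block is visited first wins; map iteration order is unspecified.
        for (Map.Entry<Character.UnicodeBlock, Integer> e : counts.entrySet()) {
            if (e.getValue() > bestCount) {
                best = e.getKey();
                bestCount = e.getValue();
            }
        }
        return best;
    }
}
```

A caller could then map a block such as BASIC_LATIN or a CJK block to one of the language ID groups; that mapping is left out here.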
-
alternativeCJKLangID
Detects whether the script of the text is Japanese, Korean or Chinese. Because the Chinese Unicode block contains the CJK unified ideographs, the presence of Chinese characters does not uniquely indicate any of the three languages. This is used only if Cybozu LangDetect fails OR if you want to detect language(s) in mixed text.
- Parameters:
data -
- Returns:
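The script reasoning above can be sketched with java.lang.Character.UnicodeScript: Hangul points to Korean, kana points to Japanese, and Han ideographs alone are treated as Chinese. The class CjkScriptSketch, the method guessCjk and the precedence of Korean over Japanese when both occur are assumptions for illustration, not the actual alternativeCJKLangID logic.

```java
/** Illustration only; not the LangDetect implementation. */
final class CjkScriptSketch {
    /** Returns "ko", "ja" or "zh" based on the scripts present, or null when no CJK script is found. */
    static String guessCjk(String text) {
        boolean han = false;
        boolean kana = false;
        boolean hangul = false;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            Character.UnicodeScript script = Character.UnicodeScript.of(cp);
            if (script == Character.UnicodeScript.HANGUL) {
                hangul = true;
            } else if (script == Character.UnicodeScript.HIRAGANA
                    || script == Character.UnicodeScript.KATAKANA) {
                kana = true;
            } else if (script == Character.UnicodeScript.HAN) {
                han = true;
            }
        }
        if (hangul) {
            return "ko";
        }
        if (kana) {
            return "ja";
        }
        // Han ideographs alone are shared by all three languages; Chinese is the fallback.
        return han ? "zh" : null;
    }
}
```

Japanese text written entirely in kanji would come back as "zh" under this sketch, which is the ambiguity the description warns about.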
-
detectSocialMediaLang
Find best lang ID for short texts. By default this will not search for CJK language ID if CJK characters are present.
- Parameters:
lang -
naturalLanguage -
- Returns:
-
detectSocialMediaLang
EXPERIMENTAL.
UPDATE, 2015: Cybozu LangDetect 1.3 (released June 2014) operates better on tweets than the previous version. A lot of this confusion was related to the lack of optimization early versions had for social media.
Not the proper method for general use. Lang ID is shunted for short text. If lang is non-null, then "~lang" is returned for short text. If lang is null, we'll give it a shot. Short means roughly two words of natural language, approx. 16 chars. The objective is to return a single, best lang-id. More general-purpose routines are TBD: e.g., validate all lang-ids found by LangDetect or another solution.
Workflow used here for ANY text:
1. Get the natural language of the text (the data, less any URLs, hashtags, etc.). For large documents, this is not necessary. TODO: evaluate LangDetect or others on longer texts (blog with comments) to find all languages, etc.
2. Text is too short? If lang is non-null, then return "~XX".
3. Find if the text contains CJK: if it contains Korean or Japanese characters, return the respective langID; otherwise the text is unified CJK chars, which is at least Chinese.
4. Use LangDetect. On error, use alternate LangID detection. If good and the answer is < 0.65 (threshold), then report "~XX", as "~" implies low confidence.
5. Have a "lang-id" from all of the above? If the lang-id is a locale, e.g., en_au, en_gb, zh_tw, cn_tw, etc., return just the language part; return a two-char ISO langID.
- Parameters:
lang - given lang ID or null
naturalLanguage - text to determine lang ID; caller must prepare this text, so consider using DataUtility.scrubTweetText(t).trim();
findCJK - if findCJK is true, then this will try to find the best language ID if Chinese/Japanese/Korean characters exist at all.
- Returns:
lang ID, possibly different than given lang ID.
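The last two ideas in the workflow above, marking low-confidence answers with "~" and reducing a locale to its language part, can be sketched as follows. The class LangIdPostProcessSketch and its methods are hypothetical; the 0.65 value comes from the description, and the real method may handle these cases differently.

```java
import java.util.Locale;

/** Illustration only; not the LangDetect implementation. */
final class LangIdPostProcessSketch {
    // Threshold quoted in the workflow description.
    static final double THRESHOLD = 0.65;

    /** Reduces locale-style IDs such as "en_au" or "zh-TW" to the language part. */
    static String languagePart(String langId) {
        String id = langId.toLowerCase(Locale.ROOT);
        int cut = id.indexOf('_');
        if (cut < 0) {
            cut = id.indexOf('-');
        }
        return cut > 0 ? id.substring(0, cut) : id;
    }

    /** Prefixes "~" when the score is below the threshold, signalling low confidence. */
    static String report(String langId, double score) {
        String lang = languagePart(langId);
        return score < THRESHOLD ? "~" + lang : lang;
    }
}
```

For instance, report("en_gb", 0.5) yields "~en", while report("ja", 0.9) yields "ja".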
-