Class LangDetect
- java.lang.Object
-
- org.opensextant.extractors.langid.LangDetect
-
public class LangDetect extends java.lang.Object
Wrapper around Cybozu Labs LangDetect. This tool provides a simple guessLanguage() call, because the default Cybozu LangDetect may fail to return a response due to IO errors and/or may return multiple guesses with probabilities. guessLanguage() offers a fallback that looks at unknown text to see whether it is in the ASCII or CJK families. Use this API wrapper in conjunction with the Xponents TextUtils.getLanguage() routine and the Language class to connect LangID output with actual ISO 639 standard codes. ISO 2-char and 3-char language IDs differ depending on the use: historical/bibliographic vs. linguistic/locales.
- Author:
- ubaldino
-
-
Field Summary
Fields
Modifier and Type / Field / Description
static int DEFAULT_WORKING_SIZE
If the working size, in chars, is less than 180 (20 eight-char words, each with 1 whitespace word break);
static Language LANGUAGE_ID_GROUP_CJK
static Language LANGUAGE_ID_GROUP_ENGLISH
static Language LANGUAGE_ID_GROUP_UNKNOWN
static double MIN_LANG_DETECT_PROBABILITY
static int MIN_LENGTH_UNK_TEXT_THRESHOLD
A simple threshold for demarcating when we might infer simple language ID with minimal content.
-
Constructor Summary
Constructors
Constructor / Description
LangDetect()
Default use requires you to unpack LangDetect profiles here: /langdetect-profiles
LangDetect(int textSz)
If you anticipate working with short text: queries, tweets, excerpts, etc.
LangDetect(int textSz, java.lang.String profiles)
LangDetect(java.lang.String profiles)
-
Method Summary
Modifier and Type / Method / Description
static java.util.Map<java.lang.String,LangID> alternativeCJKLangID(java.lang.String data)
Detects whether the script of the text is Japanese, Korean or Chinese.
static Language alternativeLangID(java.lang.String data)
Look at raw bytes/characters to see which Unicode block they fall into.
java.lang.String detect(java.lang.String text)
API for LangDetect, cybozu.labs.
java.util.Map<java.lang.String,LangID> detect(java.lang.String text, boolean withProbabilities)
API for LangDetect, cybozu.labs.
Language detectSocialMediaLang(java.lang.String lang, java.lang.String naturalLanguage)
Find best lang ID for short texts.
Language detectSocialMediaLang(java.lang.String lang, java.lang.String naturalLanguage, boolean findCJK)
EXPERIMENTAL. UPDATE, 2015.
Language guessLanguage(java.lang.String data)
Routine to guess the language ID. Scrub data prior to guessing the language.
void initLangId()
Taken straight from the LangDetect example. NOTE: /langdetect/profiles must be a folder on disk, although I have a variation that could work with JAR resources.
void setWorkingSize(int sz)
static java.util.List<LangID> sort(java.util.Map<java.lang.String,LangID> lids)
Sort what was found; returns LangID entries ordered from highest score to lowest.
-
-
-
Field Detail
-
DEFAULT_WORKING_SIZE
public static final int DEFAULT_WORKING_SIZE
If the working size, in chars, is less than 180 (20 eight-char words, each with 1 whitespace word break);
- See Also:
- Constant Field Values
-
LANGUAGE_ID_GROUP_ENGLISH
public static final Language LANGUAGE_ID_GROUP_ENGLISH
-
LANGUAGE_ID_GROUP_CJK
public static final Language LANGUAGE_ID_GROUP_CJK
-
LANGUAGE_ID_GROUP_UNKNOWN
public static final Language LANGUAGE_ID_GROUP_UNKNOWN
-
MIN_LENGTH_UNK_TEXT_THRESHOLD
public static final int MIN_LENGTH_UNK_TEXT_THRESHOLD
A simple threshold for demarcating when we might infer a simple language ID from minimal content. E.g., given 16 chars of ASCII text, we can possibly say it is English. However, this is really only making a guess.
- See Also:
- Constant Field Values
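The threshold idea can be sketched as below. This is a minimal illustration, not the class's actual logic: the class name ShortTextGuessSketch, the value 16 (borrowed from the "16 chars" example in the description; the real constant is listed under Constant Field Values), the choice to require at least that many chars, and the returned strings are all assumptions.

```java
// Hypothetical sketch; not the LangDetect implementation.
class ShortTextGuessSketch {
    // Assumed value, taken from the "16 chars" example in the description.
    static final int MIN_LENGTH = 16;

    /** Returns "en" for ASCII-only text of at least MIN_LENGTH chars, else "unknown". */
    static String guess(String text) {
        if (text == null || text.length() < MIN_LENGTH) {
            return "unknown"; // too little content to infer anything
        }
        for (int i = 0; i < text.length(); i++) {
            if (text.charAt(i) > 0x7F) {
                return "unknown"; // non-ASCII char: this heuristic does not apply
            }
        }
        return "en"; // only a guess: ASCII text could be many languages
    }

    public static void main(String[] args) {
        System.out.println(guess("see you tomorrow at noon"));
        System.out.println(guess("ok"));
    }
}
```

As the description warns, a result from a rule like this is a guess and should be treated as low confidence.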
-
MIN_LANG_DETECT_PROBABILITY
public static double MIN_LANG_DETECT_PROBABILITY
-
-
Constructor Detail
-
LangDetect
public LangDetect() throws ConfigException
Default use requires you to unpack LangDetect profiles here: /langdetect-profiles
- Throws:
ConfigException
-
LangDetect
public LangDetect(java.lang.String profiles) throws ConfigException
- Throws:
ConfigException
-
LangDetect
public LangDetect(int textSz) throws ConfigException
If you anticipate working with short text (queries, tweets, excerpts, etc.), indicate that here. The text working size is in number of chars.
- Parameters:
textSz
-
- Throws:
ConfigException
-
LangDetect
public LangDetect(int textSz, java.lang.String profiles) throws ConfigException
- Throws:
ConfigException
-
-
Method Detail
-
setWorkingSize
public void setWorkingSize(int sz)
- Parameters:
sz
-
-
initLangId
public void initLangId() throws ConfigException
Taken straight from the LangDetect example. NOTE: /langdetect/profiles must be a folder on disk, although I have a variation that could work with JAR resources. So classpath ordering is important, but the folder itself must also exist. TODO: workingSize is used only to guide the default profile directory: short message (sm) or not. In the future workingSize
- Throws:
ConfigException
-
detect
public java.lang.String detect(java.lang.String text) throws com.cybozu.labs.langdetect.LangDetectException
API for LangDetect, cybozu.labs.
- Parameters:
text
- ISO language ID or Locale. Straight from the Cybozu API.
- Returns:
- Throws:
com.cybozu.labs.langdetect.LangDetectException
-
detect
public java.util.Map<java.lang.String,LangID> detect(java.lang.String text, boolean withProbabilities) throws com.cybozu.labs.langdetect.LangDetectException
API for LangDetect, cybozu.labs. However, this does not return a cybozu.Language object; this method returns its own LangID class.
- Parameters:
text
-
withProbabilities
- true to include probabilities in results
- Returns:
- Throws:
com.cybozu.labs.langdetect.LangDetectException
-
sort
public static java.util.List<LangID> sort(java.util.Map<java.lang.String,LangID> lids)
Sort what was found; returns LangID entries ordered from highest score to lowest.
- Parameters:
lids
-
- Returns:
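A highest-to-lowest ordering of this kind can be sketched as follows. ScoredLang is a hypothetical stand-in for LangID, whose fields are not shown on this page, so treat the field names and the class SortByScoreSketch as assumptions rather than the library's code.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class SortByScoreSketch {
    /** Hypothetical stand-in for LangID: a language code plus a score. */
    static final class ScoredLang {
        final String code;
        final double score;

        ScoredLang(String code, double score) {
            this.code = code;
            this.score = score;
        }
    }

    /** Returns the map's values ordered from highest score to lowest. */
    static List<ScoredLang> sort(Map<String, ScoredLang> found) {
        List<ScoredLang> ordered = new ArrayList<>(found.values());
        ordered.sort(Comparator.comparingDouble((ScoredLang s) -> s.score).reversed());
        return ordered;
    }

    public static void main(String[] args) {
        Map<String, ScoredLang> found = new LinkedHashMap<>();
        found.put("es", new ScoredLang("es", 0.28));
        found.put("pt", new ScoredLang("pt", 0.71));
        for (ScoredLang s : sort(found)) {
            System.out.println(s.code + " " + s.score);
        }
    }
}
```

Taking the first element of the sorted list gives the single best candidate when several guesses come back with probabilities.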
-
guessLanguage
public Language guessLanguage(java.lang.String data)
Routine to guess the language ID. Scrub data prior to guessing the language. If you feed it non-language text (jargon, codes, tables, URLs, hashtags, data), that text will interfere with or overwhelm the volume of natural language text.
- Parameters:
data
-
- Returns:
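Scrubbing before guessing might look like the sketch below. It is an illustration only: ScrubSketch and its regular expressions are assumptions, they cover just URLs and hashtags, and they are not the Xponents scrubbing routine.

```java
import java.util.regex.Pattern;

class ScrubSketch {
    // Assumed patterns: http(s) URLs, and hashtags made of letters, digits or underscores.
    private static final Pattern URL = Pattern.compile("https?://\\S+");
    private static final Pattern HASHTAG = Pattern.compile("#[\\p{L}\\p{N}_]+");
    private static final Pattern SPACES = Pattern.compile("\\s+");

    /** Removes URLs and hashtags, then collapses runs of whitespace. */
    static String scrub(String text) {
        String s = URL.matcher(text).replaceAll(" ");
        s = HASHTAG.matcher(s).replaceAll(" ");
        return SPACES.matcher(s).replaceAll(" ").trim();
    }

    public static void main(String[] args) {
        System.out.println(scrub("Great day #sunny http://example.com/x at the beach"));
    }
}
```

The scrubbed string, rather than the raw input, would then be the argument passed for language guessing.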
-
alternativeLangID
public static Language alternativeLangID(java.lang.String data)
Look at raw bytes/characters to see which Unicode block they fall into.
- Parameters:
data
-
- Returns:
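A Unicode-block check of this kind can be sketched with the JDK's Character.UnicodeBlock. The class UnicodeBlockSketch, the set of blocks counted as CJK, the majority rule, and the returned group names are assumptions made for illustration; the values behind LANGUAGE_ID_GROUP_ENGLISH, LANGUAGE_ID_GROUP_CJK and LANGUAGE_ID_GROUP_UNKNOWN are not shown on this page.

```java
class UnicodeBlockSketch {
    /** Returns "ascii", "cjk" or "unknown" depending on which block group dominates. */
    static String group(String data) {
        int ascii = 0;
        int cjk = 0;
        int other = 0;
        for (int i = 0; i < data.length(); ) {
            int cp = data.codePointAt(i);
            i += Character.charCount(cp);
            if (Character.isWhitespace(cp)) {
                continue; // whitespace says nothing about the language
            }
            Character.UnicodeBlock block = Character.UnicodeBlock.of(cp);
            if (block == Character.UnicodeBlock.BASIC_LATIN) {
                ascii++;
            } else if (block == Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS
                    || block == Character.UnicodeBlock.HIRAGANA
                    || block == Character.UnicodeBlock.KATAKANA
                    || block == Character.UnicodeBlock.HANGUL_SYLLABLES) {
                cjk++;
            } else {
                other++;
            }
        }
        if (cjk > ascii && cjk > other) {
            return "cjk";
        }
        if (ascii > cjk && ascii > other) {
            return "ascii";
        }
        return "unknown";
    }

    public static void main(String[] args) {
        System.out.println(group("hello world"));
        System.out.println(group("\u4F60\u597D\u4E16\u754C")); // four Han characters
    }
}
```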
-
alternativeCJKLangID
public static java.util.Map<java.lang.String,LangID> alternativeCJKLangID(java.lang.String data)
Detects whether the script of the text is Japanese, Korean or Chinese. Because the Chinese Unicode block contains the CJK unified ideographs, the presence of Chinese characters does not uniquely indicate any of the three languages. This is used only if Cybozu LangDetect fails OR if you want to detect language(s) in mixed text.
- Parameters:
data
-
- Returns:
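The ambiguity of Han characters can be made concrete with a small sketch based on Character.UnicodeScript. It is an assumption-laden illustration, not the library's method: the class CjkScriptSketch, the integer counts (the real method returns LangID values), and the rule that Han characters count toward "zh" only when no kana is present are all hypothetical.

```java
import java.util.Map;
import java.util.TreeMap;

class CjkScriptSketch {
    /** Counts script evidence per language code; an empty map means no CJK characters. */
    static Map<String, Integer> evidence(String data) {
        int kana = 0;
        int hangul = 0;
        int han = 0;
        for (int i = 0; i < data.length(); ) {
            int cp = data.codePointAt(i);
            i += Character.charCount(cp);
            Character.UnicodeScript script = Character.UnicodeScript.of(cp);
            if (script == Character.UnicodeScript.HIRAGANA
                    || script == Character.UnicodeScript.KATAKANA) {
                kana++;
            } else if (script == Character.UnicodeScript.HANGUL) {
                hangul++;
            } else if (script == Character.UnicodeScript.HAN) {
                han++;
            }
        }
        Map<String, Integer> found = new TreeMap<>();
        if (kana > 0) {
            found.put("ja", kana);
        }
        if (hangul > 0) {
            found.put("ko", hangul);
        }
        // Assumption: Han characters alone are "at least Chinese"; when kana is
        // present they are ignored here, since Japanese text mixes both.
        if (han > 0 && kana == 0) {
            found.put("zh", han);
        }
        return found;
    }

    public static void main(String[] args) {
        // Five hiragana characters followed by two Han characters.
        System.out.println(evidence("\u3053\u3093\u306B\u3061\u306F\u4E16\u754C"));
    }
}
```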
-
detectSocialMediaLang
public Language detectSocialMediaLang(java.lang.String lang, java.lang.String naturalLanguage)
Find best lang ID for short texts. By default this will not search for CJK language ID if CJK characters are present.
- Parameters:
lang
-
naturalLanguage
-
- Returns:
-
detectSocialMediaLang
public Language detectSocialMediaLang(java.lang.String lang, java.lang.String naturalLanguage, boolean findCJK)
EXPERIMENTAL. UPDATE, 2015: Cybozu LangDetect 1.3 (released June 2014) operates better on tweets than the previous version. A lot of this confusion was related to the lack of optimization early versions had for social media.
Not the proper method for general use. Lang ID is shunted for short text: if lang is non-null, then "~lang" is returned for short text; if lang is null, we'll give it a shot. Short ~ two words of natural language, approx 16 chars. The objective is to return a single, best lang-id. More general purpose routines are TBD: e.g., validate all lang-id found by LangDetect or other solution.
Workflow used here for ANY text:
- Get the natural language of the text (the data, less any URLs, hashtags, etc.). For large documents, this is not necessary. TODO: evaluate LangDetect or others on longer texts (blog with comments) to find all languages, etc.
- Text is too short? If lang is non-null, then return "~XX".
- Find if the text contains CJK: if it contains K or J, then return the respective langID; else the text is unified CJK chars, which is at least Chinese.
- Use LangDetect. If error, use alternate LangID detection. If good and answer < 0.65 (threshold), then report "~XX", as "~" implies low confidence.
- Have a "lang-id" from all of the above? If lang-id is a locale, e.g., en_au, en_gb, zh_tw, cn_tw, etc., return just the language part. Return a two-char ISO langID.
- Parameters:
lang
- given lang ID or null
naturalLanguage
- text to determine lang ID; the caller must prepare this text, so consider using DataUtility.scrubTweetText(t).trim();
findCJK
- if findCJK is true, then this will try to find the best language ID if Chinese/Japanese/Korean characters exist at all.
- Returns:
- lang ID, possibly different than given lang ID.
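The control flow of this workflow can be sketched as below, under stated assumptions. Detector, Guess, the alternate function and the class ShortTextWorkflowSketch are hypothetical stand-ins, not Xponents or Cybozu types; the 16-char and 0.65 values are the ones quoted in the description; the CJK step is left out to keep the sketch short.

```java
import java.util.Locale;
import java.util.function.Function;

class ShortTextWorkflowSketch {
    static final int SHORT_TEXT_CHARS = 16;     // "approx 16 chars" in the description
    static final double MIN_PROBABILITY = 0.65; // threshold quoted in the description

    /** Hypothetical best answer from a statistical detector. */
    static final class Guess {
        final String lang;
        final double probability;

        Guess(String lang, double probability) {
            this.lang = lang;
            this.probability = probability;
        }
    }

    /** Hypothetical detector; a real one would wrap a language detection library. */
    interface Detector {
        Guess detect(String text) throws Exception;
    }

    /** Reduces a locale such as "en_au" or "zh-TW" to its language part. */
    static String languagePart(String id) {
        String s = id.toLowerCase(Locale.ROOT);
        int cut = s.indexOf('_');
        if (cut < 0) {
            cut = s.indexOf('-');
        }
        return cut < 0 ? s : s.substring(0, cut);
    }

    static String bestLang(String lang, String text, Detector primary,
                           Function<String, String> alternate) {
        // Too short to detect reliably: echo the given ID, marked as low confidence.
        if (lang != null && text.length() < SHORT_TEXT_CHARS) {
            return "~" + languagePart(lang);
        }
        try {
            Guess g = primary.detect(text);
            String found = languagePart(g.lang);
            // "~" marks an answer below the probability threshold.
            return g.probability < MIN_PROBABILITY ? "~" + found : found;
        } catch (Exception err) {
            // Detector failed: fall back to the alternate detection.
            return languagePart(alternate.apply(text));
        }
    }

    public static void main(String[] args) {
        Detector spanish = t -> new Guess("es", 0.9);
        Function<String, String> alternate = t -> "en";
        System.out.println(bestLang("en_GB", "hi there", spanish, alternate));
        System.out.println(bestLang(null, "buenos dias a todos ustedes", spanish, alternate));
    }
}
```

Callers can test for the "~" prefix to tell low-confidence answers apart from confident ones.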
-
-