org.opensextant.extractors.langid.LangDetect

public class LangDetect extends Object

Wrapper around Cybozu Labs LangDetect. This tool provides a simple "guessLanguage" routine for cases where the default Cybozu LangDetect may fail to return a response due to IO errors and/or may provide multiple guesses with probabilities. guessLanguage offers a fallback that looks at unknown text to see if it is in the ASCII or CJK families. Use this API wrapper in conjunction with the Xponents TextUtils.getLanguage() routine and Language class to connect LangID output with actual ISO 639 standard codes. ISO 2-char and 3-char language IDs differ depending on the use: historical/bibliographic vs. linguistic/locales.

Author: ubaldino

Field Summary

Fields

Modifier and Type

Field

Description

static final int

DEFAULT_WORKING_SIZE

Working size threshold, in chars: 180 (20 words of 8 chars, each with 1 whitespace word break).

static final Language

LANGUAGE_ID_GROUP_CJK

static final Language

LANGUAGE_ID_GROUP_ENGLISH

static final Language

LANGUAGE_ID_GROUP_UNKNOWN

static double

MIN_LANG_DETECT_PROBABILITY

static final int

MIN_LENGTH_UNK_TEXT_THRESHOLD

A simple threshold for demarcating when we might infer simple language ID with minimal content.
Constructor Summary

Constructors

Constructor

Description

LangDetect()

Default use requires you unpack LangDetect profiles here: /langdetect-profiles

LangDetect(int textSz)

If you anticipate working with short text - queries, tweets, excerpts, etc.

LangDetect(int textSz, String profiles)

LangDetect(String profiles)
Method Summary

Modifier and Type

Method

Description

static Map<String,LangID>

alternativeCJKLangID(String data)

Detects whether the script of the text is Japanese, Korean or Chinese.

static Language

alternativeLangID(String data)

Look at raw bytes/characters to see which Unicode block they fall into.

String

detect(String text)

API for LangDetect, cybozu.labs

Map<String,LangID>

detect(String text, boolean withProbabilities)

API for LangDetect, cybozu.labs.

Language

detectSocialMediaLang(String lang, String naturalLanguage)

Find best lang ID for short texts.

Language

detectSocialMediaLang(String lang, String naturalLanguage, boolean findCJK)

EXPERIMENTAL. UPDATE, 2015.

Language

guessLanguage(String data)

Routine to guess the language ID. Scrub data prior to guessing language.

void

initLangId()

Taken straight from the LangDetect example. NOTE: /langdetect/profiles must be a folder on disk, although I have a variation that could work with JAR resources.

void

setWorkingSize(int sz)

static List<LangID>

sort(Map<String,LangID> lids)

Sort what was found; Returns LangID by highest score to lowest.

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Field Details
- DEFAULT_WORKING_SIZE
  
  public static final int DEFAULT_WORKING_SIZE
  
  Working size threshold, in chars: 180 (20 words of 8 chars, each with 1 whitespace word break).
  See Also:
  
  Constant Field Values
- LANGUAGE_ID_GROUP_ENGLISH
  
  public static final Language LANGUAGE_ID_GROUP_ENGLISH
- LANGUAGE_ID_GROUP_CJK
  
  public static final Language LANGUAGE_ID_GROUP_CJK
- LANGUAGE_ID_GROUP_UNKNOWN
  
  public static final Language LANGUAGE_ID_GROUP_UNKNOWN
- MIN_LENGTH_UNK_TEXT_THRESHOLD
  
  public static final int MIN_LENGTH_UNK_TEXT_THRESHOLD
  
  A simple threshold for demarcating when we might infer simple language ID with minimal content. E.g., with 16 chars of ASCII text we can possibly say it is English. However, this is really only making a guess.
  See Also:
  
  Constant Field Values
- MIN_LANG_DETECT_PROBABILITY
  
  public static double MIN_LANG_DETECT_PROBABILITY
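
The idea behind MIN_LENGTH_UNK_TEXT_THRESHOLD can be illustrated with a minimal sketch. It is not the library's implementation: the class name, the method name, the returned strings, and the value 16 (borrowed from the "16 chars" example in the field description) are all assumptions, as is treating the threshold as a minimum length.

```java
final class ShortTextGuessSketch {
    // Assumed value, taken from the "16 chars" example in the field description.
    static final int MIN_LENGTH = 16;

    /**
     * Returns "en" when the trimmed text is at least MIN_LENGTH chars
     * and consists only of ASCII characters; otherwise "unknown".
     * This is only a guess, as the field description itself warns.
     */
    static String guessFamily(String text) {
        if (text == null) {
            return "unknown";
        }
        String t = text.trim();
        if (t.length() < MIN_LENGTH) {
            return "unknown"; // too little content to infer anything
        }
        boolean ascii = t.chars().allMatch(c -> c < 128);
        return ascii ? "en" : "unknown";
    }
}
```

Under these assumptions, `ShortTextGuessSketch.guessFamily("This is plain ASCII text")` returns `en`, while a five-character input returns `unknown`.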
Constructor Details
- LangDetect
  
  public LangDetect() throws ConfigException
  
  Default use requires you unpack LangDetect profiles here: /langdetect-profiles
  
  Throws:
  
  ConfigException
- LangDetect
  
  public LangDetect(String profiles) throws ConfigException
  
  Throws:
  
  ConfigException
- LangDetect
  
  public LangDetect(int textSz) throws ConfigException
  
  If you anticipate working with short text (queries, tweets, excerpts, etc.), then indicate that here. Text working size is in number of chars.
  
  Parameters:
  
  textSz - text working size, in number of chars
  
  Throws:
  
  ConfigException
- LangDetect
  
  public LangDetect(int textSz, String profiles) throws ConfigException
  
  Throws:
  
  ConfigException
Method Details
- setWorkingSize
  
  public void setWorkingSize(int sz)
  
  Parameters:
  
  sz - working size, in chars
- initLangId
  
  public void initLangId() throws ConfigException
  
  Taken straight from the LangDetect example. NOTE: /langdetect/profiles must be a folder on disk, although I have a variation that could work with JAR resources. So ordering of the classpath is important, but the folder itself must also exist. TODO: workingSize is used only to guide the default profile directory: short message (sm) or not. In the future workingSize
  
  Throws:
  
  ConfigException
- detect
  
  public String detect(String text) throws com.cybozu.labs.langdetect.LangDetectException
  
  API for LangDetect, cybozu.labs
  
  Parameters:
  
  text - the text to identify
  
  Returns:
  
  ISO language ID or Locale, straight from the Cybozu API
  
  Throws:
  
  com.cybozu.labs.langdetect.LangDetectException
- detect
  
  public Map<String,LangID> detect(String text, boolean withProbabilities) throws com.cybozu.labs.langdetect.LangDetectException
  
  API for LangDetect, cybozu.labs. However, this does not return a cybozu.Language object; this method returns its own LangID class.
  
  Parameters:
  
  text -
  
  withProbabilities - true to include probabilities on results
  
  Returns:
  
  Throws:
  
  com.cybozu.labs.langdetect.LangDetectException
- sort
  
  public static List<LangID> sort(Map<String,LangID> lids)
  
  Sort what was found; Returns LangID by highest score to lowest.
  
  Parameters:
  
  lids -
  
  Returns:
  
  LangID entries ordered from highest score to lowest
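
As an illustration of the ordering described for sort, here is a minimal sketch. The `Scored` class is a hypothetical stand-in for LangID (whose fields are not documented on this page), and `sortDescending` is an assumed name, not the library method.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

final class SortByScoreSketch {
    /** Hypothetical stand-in for LangID: a language code plus a score. */
    static final class Scored {
        final String lang;
        final double score;

        Scored(String lang, double score) {
            this.lang = lang;
            this.score = score;
        }
    }

    /** Returns the map's values ordered from highest score to lowest. */
    static List<Scored> sortDescending(Map<String, Scored> found) {
        List<Scored> out = new ArrayList<>(found.values());
        out.sort(Comparator.comparingDouble((Scored s) -> s.score).reversed());
        return out;
    }
}
```

The first element of the returned list is then the best guess, which is convenient when a caller wants a single language ID out of several candidates.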
- guessLanguage
  
  public Language guessLanguage(String data)
  
  Routine to guess the language ID. Scrub data prior to guessing language: non-language text (jargon, codes, tables, URLs, hashtags, data) will interfere with or overwhelm the volume of natural language text.
  
  Parameters:
  
  data -
  
  Returns:
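
The scrubbing advice for guessLanguage might look like the following sketch. It is a simple regex-based approximation, not the library's own scrubbing routine; the class name, method name, and patterns are all assumptions.

```java
import java.util.regex.Pattern;

final class ScrubSketch {
    // Assumed patterns; real-world scrubbing may need to be more thorough.
    private static final Pattern URL = Pattern.compile("https?://\\S+");
    // \w matches ASCII word characters only, so non-ASCII hashtags are kept.
    private static final Pattern TAG_OR_MENTION = Pattern.compile("[#@]\\w+");
    private static final Pattern SPACES = Pattern.compile("\\s+");

    /** Removes URLs, hashtags and mentions, then collapses whitespace. */
    static String scrub(String raw) {
        String t = URL.matcher(raw).replaceAll(" ");
        t = TAG_OR_MENTION.matcher(t).replaceAll(" ");
        return SPACES.matcher(t).replaceAll(" ").trim();
    }
}
```

For instance, `ScrubSketch.scrub("Great day in Paris! http://t.co/abc123 #travel @friend")` yields `Great day in Paris!`, which leaves only natural language for the detector to work with.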
- alternativeLangID
  
  public static Language alternativeLangID(String data)
  
  Look at raw bytes/characters to see which Unicode block they fall into.
  
  Parameters:
  
  data -
  
  Returns:
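
A rough sketch of the Unicode-block check described for alternativeLangID follows. It uses the JDK's `Character.UnicodeBlock`; the class name, the family labels, the choice of blocks counted as CJK, and the simple majority rule are assumptions and may differ from what the library does.

```java
import java.lang.Character.UnicodeBlock;

final class ScriptFamilySketch {
    /** Returns "ascii", "cjk" or "unknown" based on which blocks dominate. */
    static String family(String data) {
        int ascii = 0;
        int cjk = 0;
        int other = 0;
        for (int i = 0; i < data.length(); ) {
            int cp = data.codePointAt(i);
            i += Character.charCount(cp);
            if (Character.isWhitespace(cp)) {
                continue; // whitespace says nothing about the script
            }
            UnicodeBlock b = UnicodeBlock.of(cp);
            if (b == UnicodeBlock.BASIC_LATIN) {
                ascii++;
            } else if (isCjk(b)) {
                cjk++;
            } else {
                other++;
            }
        }
        if (cjk > 0 && cjk >= ascii && cjk >= other) {
            return "cjk";
        }
        if (ascii > 0 && ascii >= other) {
            return "ascii";
        }
        return "unknown";
    }

    // Assumed subset of CJK-related blocks, chosen for illustration.
    private static boolean isCjk(UnicodeBlock b) {
        return b == UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS
                || b == UnicodeBlock.HIRAGANA
                || b == UnicodeBlock.KATAKANA
                || b == UnicodeBlock.HANGUL_SYLLABLES;
    }
}
```

The three labels loosely mirror the LANGUAGE_ID_GROUP_ENGLISH, LANGUAGE_ID_GROUP_CJK and LANGUAGE_ID_GROUP_UNKNOWN constants, though the mapping here is only illustrative.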
- alternativeCJKLangID
  
  public static Map<String,LangID> alternativeCJKLangID(String data)
  
  Detects whether the script of the text is Japanese, Korean or Chinese. Given that the Chinese Unicode block contains CJK unified ideographs, the presence of Chinese characters does not indicate any of the three languages uniquely. This is used only if Cybozu LangDetect fails OR if you want to detect language(s) in mixed text.
  
  Parameters:
  
  data -
  
  Returns:
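
The reasoning above (kana implies Japanese, Hangul implies Korean, ideographs alone imply at least Chinese) can be sketched with the JDK's `Character.UnicodeScript`. This is an illustration under assumptions: the class and method names are hypothetical, and the result is a plain set of language codes rather than the Map of LangID the real method returns.

```java
import java.lang.Character.UnicodeScript;
import java.util.Set;
import java.util.TreeSet;

final class CjkScriptSketch {
    /**
     * Returns candidate language codes found in the text:
     * "ja" if kana is present, "ko" if Hangul is present,
     * "zh" only when Han ideographs appear without kana or Hangul.
     */
    static Set<String> candidates(String data) {
        Set<String> found = new TreeSet<>();
        boolean han = false;
        for (int i = 0; i < data.length(); ) {
            int cp = data.codePointAt(i);
            i += Character.charCount(cp);
            UnicodeScript s = UnicodeScript.of(cp);
            if (s == UnicodeScript.HIRAGANA || s == UnicodeScript.KATAKANA) {
                found.add("ja");
            } else if (s == UnicodeScript.HANGUL) {
                found.add("ko");
            } else if (s == UnicodeScript.HAN) {
                han = true; // shared by all three languages
            }
        }
        if (han && found.isEmpty()) {
            found.add("zh"); // unified ideographs only: at least Chinese
        }
        return found;
    }
}
```

Returning a set rather than a single code keeps the mixed-text case visible: text containing both kana and Hangul yields both `ja` and `ko`.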
- detectSocialMediaLang
  
  public Language detectSocialMediaLang(String lang, String naturalLanguage)
  
  Find best lang ID for short texts. By default this will not search for CJK language ID if CJK characters are present.
  
  Parameters:
  
  lang - given lang ID or null
  
  naturalLanguage - text to determine lang ID
  
  Returns:
  
  lang ID, possibly different than given lang ID.
- detectSocialMediaLang
  
  public Language detectSocialMediaLang(String lang, String naturalLanguage, boolean findCJK)
  
  EXPERIMENTAL.
  
  UPDATE, 2015: Cybozu LangDetect 1.3 (released June 2014) operates better on tweets than the previous version. A lot of this confusion was related to the lack of optimization early versions had for social media.
  
  Not the proper method for general use. Lang ID is shunted for short text:
  - If lang is non-null, then "~lang" is returned for short text.
  - If lang is null, we'll give it a shot.
  
  Short means about two words of natural language, approx. 16 chars. The objective is to return a single, best lang-id. More general-purpose routines are TBD, e.g., validate all lang-id found by LangDetect or other solution.
  
  Workflow used here for ANY text:
  - Get the natural language of the text (the data, less any URLs, hashtags, etc.). For large documents, this is not necessary. TODO: evaluate LangDetect or others on longer texts (blog with comments) to find all languages, etc.
  - Text is too short? If lang is non-null, then return "~XX".
  - Find if text contains CJK: if it contains K or J, then return the respective langID; else the text is unified CJK chars, which is at least Chinese.
  - Use LangDetect. If error, use alternate LangID detection; if good and answer < 0.65 (threshold), then report "~XX", as "~" implies low confidence.
  - Have a "lang-id" from all of the above? If lang-id is a locale, e.g., en_au, en_gb, zh_tw, cn_tw, etc., return just the language part. Return a two-char ISO langID.
  
  Parameters:
  
  lang - given lang ID or null
  
  naturalLanguage - text to determine lang ID; Caller must prepare this text, so consider using DataUtility.scrubTweetText(t).trim();
  
  findCJK - if true, this will try to find the best language ID if Chinese/Japanese/Korean characters exist at all.
  
  Returns:
  
  lang ID, possibly different than given lang ID.
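
Parts of the workflow described for detectSocialMediaLang (the short-text shunt, the low-confidence "~" marker, and reducing a locale to its language part) can be sketched as follows. The class and method names are hypothetical; 0.65 and 16 come from the description above, and whether the library uses exactly these comparisons is an assumption.

```java
import java.util.Locale;

final class LangIdPostProcessSketch {
    static final double THRESHOLD = 0.65; // threshold quoted in the workflow
    static final int SHORT_TEXT_CHARS = 16; // "approx 16 chars"

    /** Reduces a locale such as "en_au" or "zh-TW" to its language part. */
    static String languagePart(String id) {
        int cut = id.indexOf('_');
        if (cut < 0) {
            cut = id.indexOf('-');
        }
        String lang = cut < 0 ? id : id.substring(0, cut);
        return lang.toLowerCase(Locale.ROOT);
    }

    /** Prefixes "~" when the detector's probability is below the threshold. */
    static String mark(String id, double probability) {
        String lang = languagePart(id);
        return probability < THRESHOLD ? "~" + lang : lang;
    }

    /**
     * Short-text shunt: for short text with a given lang, return "~lang".
     * Returns null when the caller should continue with detection.
     */
    static String shortTextShunt(String givenLang, String naturalLanguage) {
        if (givenLang != null && naturalLanguage.length() < SHORT_TEXT_CHARS) {
            return "~" + languagePart(givenLang);
        }
        return null;
    }
}
```

With these assumptions, `mark("en_gb", 0.5)` gives `~en` and `mark("fr", 0.9)` gives `fr`, so downstream code can tell a confident ID from a tentative one by the leading "~".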
