Class LangDetect

java.lang.Object
org.opensextant.extractors.langid.LangDetect

public class LangDetect extends Object
Wrapper around Cybozu Labs LangDetect. This tool provides a simple "guessLanguage" for cases where the default Cybozu LangDetect may fail to return a response due to IO errors and/or may provide multiple guesses with probabilities. guessLanguage here offers a fallback that looks at unknown text to see if it is in the ASCII or CJK families. Use this API wrapper in conjunction with the Xponents TextUtils.getLanguage() routine and Language class to connect LangID output with actual ISO 639 standards code pages. ISO 2-char and 3-char language IDs differ depending on the use: historical/bibliographic vs. linguistic/locales.
Author:
ubaldino
  • Field Details

    • DEFAULT_WORKING_SIZE

      public static final int DEFAULT_WORKING_SIZE
      Threshold on working size, in chars: 180, i.e. 20 words of 8 chars, each with 1 whitespace word break.
    • LANGUAGE_ID_GROUP_ENGLISH

      public static final Language LANGUAGE_ID_GROUP_ENGLISH
    • LANGUAGE_ID_GROUP_CJK

      public static final Language LANGUAGE_ID_GROUP_CJK
    • LANGUAGE_ID_GROUP_UNKNOWN

      public static final Language LANGUAGE_ID_GROUP_UNKNOWN
    • MIN_LENGTH_UNK_TEXT_THRESHOLD

      public static final int MIN_LENGTH_UNK_TEXT_THRESHOLD
      A simple threshold for demarcating when we might infer a simple language ID from minimal content. E.g., given 16 chars of ASCII text, we can possibly say it is English. However, this is really only making a guess.
    • MIN_LANG_DETECT_PROBABILITY

      public static double MIN_LANG_DETECT_PROBABILITY
  • Constructor Details

  • Method Details

    • setWorkingSize

      public void setWorkingSize(int sz)
      Parameters:
      sz -
    • initLangId

      public void initLangId() throws ConfigException
      Taken straight from the LangDetect example. NOTE: /langdetect/profiles must be a folder on disk, although I have a variation that could work with JAR resources. So the ordering of the classpath is important, but the folder itself must also exist. TODO: workingSize is used only to guide the default profile directory: short message (sm) or not. In the future workingSize
      Throws:
      ConfigException
    • detect

      public String detect(String text) throws com.cybozu.labs.langdetect.LangDetectException
      API for LangDetect, cybozu.labs
      Parameters:
      text - the text to identify
      Returns:
      ISO language ID or Locale, straight from the Cybozu API
      Throws:
      com.cybozu.labs.langdetect.LangDetectException
    • detect

      public Map<String,LangID> detect(String text, boolean withProbabilities) throws com.cybozu.labs.langdetect.LangDetectException
      API for LangDetect, cybozu.labs. However, this does not return a cybozu.Language object; this method returns its own LangID class.
      Parameters:
      text -
      withProbabilities - true to include probabilities on results
      Returns:
      Throws:
      com.cybozu.labs.langdetect.LangDetectException
    • sort

      public static List<LangID> sort(Map<String,LangID> lids)
      Sorts what was found; returns LangIDs ordered from highest score to lowest.
      Parameters:
      lids -
      Returns:
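      The ordering described above can be sketched in isolation. The following is a minimal illustration, not the Xponents implementation: Guess is a hypothetical stand-in for LangID (its fields are assumptions), and sortByScore is an illustrative name.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class SortByScoreSketch {
    /** Hypothetical stand-in for LangID: a language code plus a score. */
    static final class Guess {
        final String lang;
        final double score;

        Guess(String lang, double score) {
            this.lang = lang;
            this.score = score;
        }
    }

    /** Returns the map's values ordered from highest score to lowest. */
    static List<Guess> sortByScore(Map<String, Guess> guesses) {
        List<Guess> sorted = new ArrayList<>(guesses.values());
        sorted.sort(Comparator.comparingDouble((Guess g) -> g.score).reversed());
        return sorted;
    }

    public static void main(String[] args) {
        Map<String, Guess> guesses = new LinkedHashMap<>();
        guesses.put("fr", new Guess("fr", 0.20));
        guesses.put("en", new Guess("en", 0.75));
        guesses.put("de", new Guess("de", 0.05));
        // Print each guess, best first.
        for (Guess g : sortByScore(guesses)) {
            System.out.println(g.lang + " " + g.score);
        }
    }
}
```

      The input map is left untouched; the sorted values are returned as a new list, so the first element is the best guess.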
    • guessLanguage

      public Language guessLanguage(String data)
      Routine to guess the language ID. Scrub data prior to guessing language: if you feed in non-language text (jargon, codes, tables, URLs, hashtags, data), it will interfere with or overwhelm the volume of natural language text.
      Parameters:
      data -
      Returns:
    • alternativeLangID

      public static Language alternativeLangID(String data)
      Look at raw bytes/characters to see which Unicode block they fall into.
      Parameters:
      data -
      Returns:
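      A Unicode-block check of this kind might look like the sketch below. It is an illustration under stated assumptions, not the library's code: the Group enum only loosely mirrors the ENGLISH/CJK/UNKNOWN group constants, the set of blocks counted as CJK is an assumption, and classify is a hypothetical name.

```java
import java.lang.Character.UnicodeBlock;

class UnicodeBlockFallbackSketch {
    /** Coarse groups; hypothetical, loosely mirroring the group constants. */
    enum Group { ASCII, CJK, UNKNOWN }

    /** Assumption: these four blocks are treated as the CJK family. */
    static boolean isCjkBlock(UnicodeBlock b) {
        return b == UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS
            || b == UnicodeBlock.HIRAGANA
            || b == UnicodeBlock.KATAKANA
            || b == UnicodeBlock.HANGUL_SYLLABLES;
    }

    /** Classifies text by the Unicode blocks of its letters. */
    static Group classify(String data) {
        int ascii = 0;
        int cjk = 0;
        int letters = 0;
        for (int i = 0; i < data.length(); ) {
            int cp = data.codePointAt(i);
            i += Character.charCount(cp);
            if (!Character.isLetter(cp)) {
                continue; // skip digits, punctuation, whitespace
            }
            letters++;
            UnicodeBlock b = UnicodeBlock.of(cp);
            if (b == UnicodeBlock.BASIC_LATIN) {
                ascii++;
            } else if (isCjkBlock(b)) {
                cjk++;
            }
        }
        if (letters == 0) {
            return Group.UNKNOWN;
        }
        if (cjk > 0) {
            return Group.CJK;
        }
        return ascii == letters ? Group.ASCII : Group.UNKNOWN;
    }

    public static void main(String[] args) {
        System.out.println(classify("Hello world"));
        System.out.println(classify("\u6771\u4EAC")); // two Han ideographs
        System.out.println(classify("12345 !!"));
    }
}
```

      An ASCII result says only that the letters are basic Latin; as the threshold note for short text puts it, calling that English is a guess.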
    • alternativeCJKLangID

      public static Map<String,LangID> alternativeCJKLangID(String data)
      Detects whether the script of the text is Japanese, Korean or Chinese. Given that the Chinese Unicode block contains CJK unified ideographs, the presence of Chinese characters does not indicate any of the three languages uniquely. This is used only if Cybozu LangDetect fails OR if you want to detect language(s) in mixed text.
      Parameters:
      data -
      Returns:
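      The script reasoning above can be illustrated with java.lang.Character.UnicodeScript. This is a sketch only: the real method returns a Map of LangID values, whereas the hypothetical candidateLanguages below returns a plain set of two-letter codes, and the rule "Han with no kana or Hangul means Chinese" is a simplifying assumption.

```java
import java.lang.Character.UnicodeScript;
import java.util.LinkedHashSet;
import java.util.Set;

class CjkScriptSketch {
    /**
     * Returns candidate language codes based on the scripts present.
     * Kana implies Japanese and Hangul implies Korean; Han ideographs alone
     * are shared by all three languages, so they are reported as Chinese
     * only when no kana or Hangul is seen.
     */
    static Set<String> candidateLanguages(String data) {
        boolean han = false;
        boolean kana = false;
        boolean hangul = false;
        for (int i = 0; i < data.length(); ) {
            int cp = data.codePointAt(i);
            i += Character.charCount(cp);
            UnicodeScript s = UnicodeScript.of(cp);
            if (s == UnicodeScript.HAN) {
                han = true;
            } else if (s == UnicodeScript.HIRAGANA || s == UnicodeScript.KATAKANA) {
                kana = true;
            } else if (s == UnicodeScript.HANGUL) {
                hangul = true;
            }
        }
        Set<String> langs = new LinkedHashSet<>();
        if (kana) {
            langs.add("ja");
        }
        if (hangul) {
            langs.add("ko");
        }
        if (han && !kana && !hangul) {
            langs.add("zh");
        }
        return langs;
    }

    public static void main(String[] args) {
        System.out.println(candidateLanguages("\u65E5\u672C\u8A9E\u306E")); // Han plus hiragana
        System.out.println(candidateLanguages("\u4E2D\u6587"));             // Han only
    }
}
```

      Mixed text can yield more than one code, which matches the stated use for detecting language(s) in mixed text.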
    • detectSocialMediaLang

      public Language detectSocialMediaLang(String lang, String naturalLanguage)
      Find best lang ID for short texts. By default this will not search for CJK language ID if CJK characters are present.
      Parameters:
      lang -
      naturalLanguage -
      Returns:
    • detectSocialMediaLang

      public Language detectSocialMediaLang(String lang, String naturalLanguage, boolean findCJK)
      EXPERIMENTAL, EXPERIMENTAL, EXPERIMENTAL
      UPDATE, 2015: Cybozu LangDetect 1.3 (released June 2014) operates better on tweets than the previous version. A lot of this confusion was related to the lack of optimization early versions had for social media.
      Not the proper method for general use. Lang ID is shunted for short text: if lang is non-null, then "~lang" is returned for short text; if lang is null, we'll give it a shot. Short ~ two words of natural language, approx. 16 chars. The objective is to return a single, best lang-id. More general-purpose routines are TBD, e.g., validate all lang-ids found by LangDetect or another solution.
       Workflow used here for ANY text:
       - get natural language of text ( the data, less any URLs, hashtags, etc.)
         For large documents, this is not necessary. TODO: evaluate LangDetect or others
         on longer texts (Blog with comments) to find all languages, etc.
      
       - Text is too Short?  if lang is non-null, then return "~XX"
       - Find if text contains CJK:
            if contains K or J,  then return respective langID
            else text is unified CJK chars which is at least Chinese.
      
       - Use LangDetect
            if Error, use alternate LangID detection
            if Good and answer < 0.65 (threshold), then report "~XX", as "~" implies low confidence.
      
       - Have a "lang-id" from all of the above?
            if lang-id is a locale, e.g., en_au, en_gb, zh_tw, cn_tw, etc.
            return just the language part;
      
        Return a two-char ISO langID
       
      Parameters:
      lang - given lang ID or null
      naturalLanguage - text to determine the lang ID for; the caller must prepare this text, so consider using DataUtility.scrubTweetText(t).trim()
      findCJK - if true, this will try to find the best language ID if Chinese/Japanese/Korean characters exist at all.
      Returns:
      lang ID, possibly different than given lang ID.
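      Three steps of the workflow above (the short-text shunt, the low-confidence "~" marker, and reducing a locale to its language part) can be sketched as follows. This is an illustration under assumptions, not the method's actual code: the class, method names and constants are hypothetical, with 0.65 and 16 taken from the description above.

```java
import java.util.Locale;
import java.util.Optional;

class ShortTextLangIdSketch {
    static final double THRESHOLD = 0.65; // threshold named in the workflow
    static final int MIN_CHARS = 16;      // "approx 16 chars" from the description

    /** Reduces a locale such as "en_au" or "zh-TW" to its language part. */
    static String languagePart(String langId) {
        String id = langId.trim().toLowerCase(Locale.ROOT);
        int cut = id.indexOf('_');
        if (cut < 0) {
            cut = id.indexOf('-');
        }
        return cut < 0 ? id : id.substring(0, cut);
    }

    /** Prefixes "~" when the detector's probability is below the threshold. */
    static String report(String langId, double probability) {
        String lang = languagePart(langId);
        return probability < THRESHOLD ? "~" + lang : lang;
    }

    /**
     * Short-text shunt: if the text is too short and a lang ID was given,
     * answer "~lang" right away; otherwise return empty so the caller
     * continues with detection.
     */
    static Optional<String> shortTextShunt(String givenLang, String text) {
        if (givenLang != null && text.trim().length() < MIN_CHARS) {
            return Optional.of("~" + languagePart(givenLang));
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(report("en_gb", 0.90));
        System.out.println(report("fr", 0.40));
        System.out.println(shortTextShunt("es", "hola"));
    }
}
```

      Callers would still need to handle the "~" prefix before mapping the result to an ISO 639 code.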