Class LangDetect


  • public class LangDetect
    extends java.lang.Object
    Wrapper around Cybozu Labs LangDetect. This tool provides a simple guessLanguage routine for cases where the default Cybozu LangDetect may fail to return a response due to IO errors and/or may provide multiple guesses with probabilities. guessLanguage offers a fallback that looks at unknown text to see if it is in the ASCII or CJK families. Use this API wrapper in conjunction with the Xponents TextUtils.getLanguage() routine and Language class to connect LangID output with actual ISO 639 standard code pages. ISO 2-char and 3-char language IDs differ depending on the use: historical/bibliographic vs. linguistic/locales.
    Author:
    ubaldino
    • Constructor Summary

      Constructors 
      Constructor Description
      LangDetect()
      Default use requires you unpack LangDetect profiles here: /langdetect-profiles
      LangDetect​(int textSz)
      If you anticipate working with short text - queries, tweets, excerpts, etc.
      LangDetect​(int textSz, java.lang.String profiles)  
      LangDetect​(java.lang.String profiles)  
    • Method Summary

      Modifier and Type Method Description
      static java.util.Map<java.lang.String,​LangID> alternativeCJKLangID​(java.lang.String data)
      Detects whether the script of the text is Japanese, Korean or Chinese.
      static Language alternativeLangID​(java.lang.String data)
      Look at raw bytes/characters to see which Unicode block they fall into.
      java.lang.String detect​(java.lang.String text)
      API for LangDetect, cybozu.labs
      java.util.Map<java.lang.String,​LangID> detect​(java.lang.String text, boolean withProbabilities)
      API for LangDetect, cybozu.labs.
      Language detectSocialMediaLang​(java.lang.String lang, java.lang.String naturalLanguage)
      Find best lang ID for short texts.
      Language detectSocialMediaLang​(java.lang.String lang, java.lang.String naturalLanguage, boolean findCJK)
      EXPERIMENTAL. UPDATE, 2015.
      Language guessLanguage​(java.lang.String data)
      Routine to guess the language ID. Scrub data prior to guessing language.
      void initLangId()
      Taken straight from the LangDetect example. NOTE: /langdetect/profiles must be a folder on disk, although I have a variation that could work with JAR resources.
      void setWorkingSize​(int sz)  
      static java.util.List<LangID> sort​(java.util.Map<java.lang.String,​LangID> lids)
      Sort what was found; Returns LangID by highest score to lowest.
      • Methods inherited from class java.lang.Object

        clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
    • Field Detail

      • DEFAULT_WORKING_SIZE

        public static final int DEFAULT_WORKING_SIZE
        Threshold for the working size, in characters: 180, i.e. 20 words of 8 characters, each followed by 1 whitespace word break (20 x 9 = 180).
        See Also:
        Constant Field Values
      • LANGUAGE_ID_GROUP_ENGLISH

        public static final Language LANGUAGE_ID_GROUP_ENGLISH
      • LANGUAGE_ID_GROUP_CJK

        public static final Language LANGUAGE_ID_GROUP_CJK
      • LANGUAGE_ID_GROUP_UNKNOWN

        public static final Language LANGUAGE_ID_GROUP_UNKNOWN
      • MIN_LENGTH_UNK_TEXT_THRESHOLD

        public static final int MIN_LENGTH_UNK_TEXT_THRESHOLD
        A simple threshold for demarcating when we might infer a simple language ID from minimal content. E.g., with 16 chars of ASCII text we can possibly say it is English. However, this is really only making a guess.
        See Also:
        Constant Field Values
      • MIN_LANG_DETECT_PROBABILITY

        public static double MIN_LANG_DETECT_PROBABILITY
    • Constructor Detail

      • LangDetect

        public LangDetect()
                   throws ConfigException
        Default use requires you unpack LangDetect profiles here: /langdetect-profiles
        Throws:
        ConfigException
      • LangDetect

        public LangDetect​(int textSz)
                   throws ConfigException
        If you anticipate working with short text (queries, tweets, excerpts, etc.), then indicate that here. The text working size is given in number of characters.
        Parameters:
        textSz - text working size, in characters
        Throws:
        ConfigException
    • Method Detail

      • setWorkingSize

        public void setWorkingSize​(int sz)
        Parameters:
        sz -
      • initLangId

        public void initLangId()
                        throws ConfigException
        Taken straight from the LangDetect example. NOTE: /langdetect/profiles must be a folder on disk, although I have a variation that could work with JAR resources. So classpath ordering is important, but the folder itself must also exist. TODO: workingSize is used only to guide the choice of default profile directory: short message (sm) or not. In the future workingSize
        Throws:
        ConfigException
      • detect

        public java.lang.String detect​(java.lang.String text)
                                throws com.cybozu.labs.langdetect.LangDetectException
        API for LangDetect, cybozu.labs
        Parameters:
        text - text to identify
        Returns:
        ISO language ID or Locale, straight from the Cybozu API
        Throws:
        com.cybozu.labs.langdetect.LangDetectException
      • detect

        public java.util.Map<java.lang.String,​LangID> detect​(java.lang.String text,
                                                                   boolean withProbabilities)
                                                            throws com.cybozu.labs.langdetect.LangDetectException
        API for LangDetect, cybozu.labs. However, this does not return a cybozu.Language object; this method returns its own LangID class.
        Parameters:
        text -
        withProbabilities - true to include probabilities on results
        Returns:
        Throws:
        com.cybozu.labs.langdetect.LangDetectException
      • sort

        public static java.util.List<LangID> sort​(java.util.Map<java.lang.String,​LangID> lids)
        Sort what was found; Returns LangID by highest score to lowest.
        Parameters:
        lids -
        Returns:
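
The ordering described for sort can be sketched as follows. This is a minimal illustration, not the Xponents implementation: ScoredLang is a hypothetical stand-in for LangID, and its lang and score fields are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative only; ScoredLang is a hypothetical stand-in for LangID. */
final class LangScoreSortSketch {

    /** Hypothetical value type: a language code plus a detection score. */
    static final class ScoredLang {
        final String lang;
        final double score;

        ScoredLang(String lang, double score) {
            this.lang = lang;
            this.score = score;
        }
    }

    /** Returns the map's values ordered from highest score to lowest. */
    static List<ScoredLang> sortByScore(Map<String, ScoredLang> found) {
        List<ScoredLang> ordered = new ArrayList<>(found.values());
        ordered.sort(Comparator.comparingDouble((ScoredLang s) -> s.score).reversed());
        return ordered;
    }

    public static void main(String[] args) {
        Map<String, ScoredLang> found = new LinkedHashMap<>();
        found.put("en", new ScoredLang("en", 0.40));
        found.put("fr", new ScoredLang("fr", 0.90));
        found.put("de", new ScoredLang("de", 0.10));
        for (ScoredLang s : sortByScore(found)) {
            System.out.println(s.lang + " " + s.score);
        }
    }
}
```

Copying the values into a new list leaves the caller's map untouched, which seems a reasonable choice for a static helper.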
      • guessLanguage

        public Language guessLanguage​(java.lang.String data)
        Routine to guess the language ID. Scrub data prior to guessing language: if you feed in non-language text (jargon, codes, tables, URLs, hashtags, data), it will interfere with or overwhelm the volume of natural language text.
        Parameters:
        data -
        Returns:
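
One way to scrub text before calling guessLanguage might look like the sketch below. It is an assumption-laden illustration: the class name, the scrub method and the three regular expressions are hypothetical and are not part of Xponents, which provides its own text utilities.

```java
import java.util.regex.Pattern;

/** Illustrative only; a hypothetical pre-processing step, not the Xponents scrubber. */
final class ScrubSketch {

    // Assumed patterns: http(s) URLs, then #hashtags and @mentions.
    private static final Pattern URL = Pattern.compile("https?://\\S+");
    private static final Pattern TAG =
            Pattern.compile("[#@]\\w+", Pattern.UNICODE_CHARACTER_CLASS);
    private static final Pattern SPACES = Pattern.compile("\\s+");

    /** Removes URLs, hashtags and mentions, then collapses whitespace. */
    static String scrub(String raw) {
        String text = URL.matcher(raw).replaceAll(" ");
        text = TAG.matcher(text).replaceAll(" ");
        return SPACES.matcher(text).replaceAll(" ").trim();
    }

    public static void main(String[] args) {
        System.out.println(scrub("Check this http://t.co/abc #wow @bob great day"));
    }
}
```

URLs are removed first so that a "#fragment" inside a URL is not treated as a hashtag.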
      • alternativeLangID

        public static Language alternativeLangID​(java.lang.String data)
        Look at raw bytes/characters to see which Unicode block they fall into.
        Parameters:
        data -
        Returns:
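
The idea of inspecting Unicode blocks to place text in the ASCII or CJK family can be sketched as below. This is a minimal sketch under stated assumptions: the Family enum, the classify method and the particular set of blocks checked are hypothetical, and the actual method returns a Language object rather than an enum.

```java
import java.lang.Character.UnicodeBlock;

/** Illustrative only; a hypothetical block-based classifier. */
final class ScriptFamilySketch {

    enum Family { ASCII, CJK, UNKNOWN }

    /** Classifies text by the Unicode blocks of its non-whitespace code points. */
    static Family classify(String data) {
        if (data == null) {
            return Family.UNKNOWN;
        }
        int total = 0;
        int ascii = 0;
        int cjk = 0;
        for (int i = 0; i < data.length(); ) {
            int cp = data.codePointAt(i);
            i += Character.charCount(cp);
            if (Character.isWhitespace(cp)) {
                continue;
            }
            total++;
            UnicodeBlock block = UnicodeBlock.of(cp);
            if (block == UnicodeBlock.BASIC_LATIN) {
                ascii++;
            } else if (isCjkBlock(block)) {
                cjk++;
            }
        }
        if (total == 0) {
            return Family.UNKNOWN;
        }
        if (cjk > 0) {
            return Family.CJK;        // any CJK character marks the text as CJK
        }
        return ascii == total ? Family.ASCII : Family.UNKNOWN;
    }

    // Assumed subset of CJK-related blocks; a fuller check would list more.
    private static boolean isCjkBlock(UnicodeBlock block) {
        return block == UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS
                || block == UnicodeBlock.HIRAGANA
                || block == UnicodeBlock.KATAKANA
                || block == UnicodeBlock.HANGUL_SYLLABLES;
    }

    public static void main(String[] args) {
        System.out.println(classify("plain ascii text"));
        System.out.println(classify("\u3053\u3093\u306b\u3061\u306f"));
    }
}
```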
      • alternativeCJKLangID

        public static java.util.Map<java.lang.String,​LangID> alternativeCJKLangID​(java.lang.String data)
        Detects whether the script of the text is Japanese, Korean or Chinese. Because the Chinese Unicode block contains the CJK unified ideographs, the presence of Chinese characters does not indicate any of the three languages uniquely. This is used only if Cybozu LangDetect fails OR if you want to detect language(s) in mixed text.
        Parameters:
        data -
        Returns:
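
The script-based reasoning above can be sketched with java.lang.Character.UnicodeScript. This is a hedged illustration, not the library's code: the class and method names are hypothetical, and it returns a plain set of ISO 639-1 codes instead of the Map of LangID objects the real method returns.

```java
import java.lang.Character.UnicodeScript;
import java.util.LinkedHashSet;
import java.util.Set;

/** Illustrative only; a hypothetical script check for CJK text. */
final class CjkScriptSketch {

    /**
     * Hangul implies Korean and kana implies Japanese. Han characters alone
     * are shared by all three languages, so they are reported as Chinese
     * only when no Hangul or kana is present.
     */
    static Set<String> candidates(String data) {
        boolean han = false;
        boolean kana = false;
        boolean hangul = false;
        for (int i = 0; i < data.length(); ) {
            int cp = data.codePointAt(i);
            i += Character.charCount(cp);
            switch (UnicodeScript.of(cp)) {
                case HANGUL:
                    hangul = true;
                    break;
                case HIRAGANA:
                case KATAKANA:
                    kana = true;
                    break;
                case HAN:
                    han = true;
                    break;
                default:
                    break;
            }
        }
        Set<String> langs = new LinkedHashSet<>();
        if (hangul) {
            langs.add("ko");
        }
        if (kana) {
            langs.add("ja");
        }
        if (han && langs.isEmpty()) {
            langs.add("zh");
        }
        return langs;
    }

    public static void main(String[] args) {
        // Kana mixed with Han characters
        System.out.println(candidates("\u3053\u3093\u306b\u3061\u306f\u4e16\u754c"));
    }
}
```

Returning a set rather than a single code keeps the mixed-text case visible: text with both Hangul and kana yields both "ko" and "ja".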
      • detectSocialMediaLang

        public Language detectSocialMediaLang​(java.lang.String lang,
                                              java.lang.String naturalLanguage)
        Find best lang ID for short texts. By default this will not search for CJK language ID if CJK characters are present.
        Parameters:
        lang -
        naturalLanguage -
        Returns:
      • detectSocialMediaLang

        public Language detectSocialMediaLang​(java.lang.String lang,
                                              java.lang.String naturalLanguage,
                                              boolean findCJK)
        EXPERIMENTAL. UPDATE, 2015: Cybozu LangDetect 1.3 (released June 2014) operates better on tweets than the previous version. A lot of the earlier confusion was related to the lack of optimization for social media in early versions.
        Not the proper method for general use. Lang ID is shunted for short text: if lang is non-null, then "~lang" is returned for short text; if lang is null, we'll give it a shot. Short ~ two words of natural language, approx. 16 chars. The objective is to return a single, best lang-id. More general-purpose routines are TBD, e.g., validate all lang-ids found by LangDetect or another solution.
         Workflow used here for ANY text:
         - get natural language of text ( the data, less any URLs, hashtags, etc.)
           For large documents, this is not necessary. TODO: evaluate LangDetect or others
           on longer texts (Blog with comments) to find all languages, etc.
        
         - Text is too Short?  if lang is non-null, then return "~XX"
         - Find if text contains CJK:
              if contains K or J,  then return respective langID
              else text is unified CJK chars which is at least Chinese.
        
         - Use LangDetect
              if Error, use alternate LangID detection
              if Good and answer < 0.65 (threshold), then report "~XX", as "~" implies low confidence.
        
         - Have a "lang-id" from all of the above?
              if lang-id is a locale, e.g., en_au, en_gb,  zh_tw, cn_tw, etc.
              return just the language part;
        
          Return a two-char ISO langID
         
        Parameters:
        lang - given lang ID or null
        naturalLanguage - text to determine lang ID; Caller must prepare this text, so consider using DataUtility.scrubTweetText(t).trim();
        findCJK - if true, this will try to find the best language ID if Chinese/Japanese/Korean characters exist at all.
        Returns:
        lang ID, possibly different than given lang ID.
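
The short-text, low-confidence and locale-trimming steps of the workflow above can be sketched as follows. This is a partial sketch under stated assumptions: the class, method and constant names are hypothetical, the CJK step and the LangDetect call itself are left out, and the detected ID and its probability are passed in by the caller. The 16-character and 0.65 figures come from the description above.

```java
import java.util.Locale;

/** Illustrative only; hypothetical post-processing of a detected lang ID. */
final class SocialLangSketch {

    static final int SHORT_TEXT_CHARS = 16;     // "approx 16 chars" in the description
    static final double MIN_PROBABILITY = 0.65; // threshold named in the workflow

    /** Reduces a locale such as en_AU or zh-TW to its language part. */
    static String languagePart(String id) {
        int cut = id.indexOf('_');
        if (cut < 0) {
            cut = id.indexOf('-');
        }
        String lang = cut < 0 ? id : id.substring(0, cut);
        return lang.toLowerCase(Locale.ROOT);
    }

    /**
     * @param givenLang lang ID supplied with the message, or null
     * @param text      natural-language text, already scrubbed by the caller
     * @param detected  lang ID or locale reported by a detector (assumed input)
     * @param prob      probability reported with that ID (assumed input)
     */
    static String resolve(String givenLang, String text, String detected, double prob) {
        // Too short to detect reliably: fall back on the given ID, marked "~".
        if (text.length() < SHORT_TEXT_CHARS && givenLang != null) {
            return "~" + languagePart(givenLang);
        }
        String lang = languagePart(detected);
        // "~" implies low confidence.
        return prob < MIN_PROBABILITY ? "~" + lang : lang;
    }

    public static void main(String[] args) {
        System.out.println(resolve("en_AU", "hi there", "fr", 0.90));
        System.out.println(resolve(null, "a reasonably long sentence here", "zh-TW", 0.99));
    }
}
```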