Package org.opensextant.extractors.langid

Language Detection

The Xponents LangDetect class wraps Cybozu LangDetect (Maven: http://search.maven.org/#artifactdetails%7Ccom.norconex.language%7Clangdetect%7C1.3.0%7Cjar). Cybozu is a decent library, but it needed a number of additions to make it more usable:

  • LangDetect.detect() returns an unordered list of probabilities. Xponents simplifies this so the caller can get an ordered list with the highest probability first, plus a simple method that returns just a language object (code + name).
  • Profiles are accessible only from the file system, not via a resource path or stream, so profiles for LangDetect must be unpacked on the file system before use. Xponents packages these profiles for runtime use.
  • When very little natural language is present in the text, Cybozu may fail. Xponents provides a variety of fallbacks that cover the most common language detection possibilities.
  • Overall, Xponents LangDetect provides a simplified approach to language detection while still giving full access to the underlying Cybozu library as needed.
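
The ordering and fallback ideas in the list above can be sketched in a few lines of plain Java. This is only an illustration, not the Xponents or Cybozu API: the class LangRankSketch, the type LangScore, and the methods ranked() and best() are hypothetical names, and the probability map stands in for whatever the underlying detector returns. The fallback value "und" is the ISO 639-2 code for an undetermined language and is used here as an assumed default.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; none of these names come from Xponents or Cybozu.
final class LangRankSketch {

    /** Hypothetical value type: a language code paired with its probability. */
    static final class LangScore {
        final String code;
        final double probability;

        LangScore(String code, double probability) {
            this.code = code;
            this.probability = probability;
        }
    }

    /** Turns unordered (code, probability) pairs into a list sorted highest first. */
    static List<LangScore> ranked(Map<String, Double> raw) {
        List<LangScore> out = new ArrayList<>();
        for (Map.Entry<String, Double> e : raw.entrySet()) {
            out.add(new LangScore(e.getKey(), e.getValue()));
        }
        out.sort(Comparator.comparingDouble((LangScore s) -> s.probability).reversed());
        return out;
    }

    /** Returns the most probable code, or the fallback when nothing was detected. */
    static String best(Map<String, Double> raw, String fallback) {
        List<LangScore> scores = ranked(raw);
        return scores.isEmpty() ? fallback : scores.get(0).code;
    }

    public static void main(String[] args) {
        // Made-up probabilities, deliberately inserted out of order.
        Map<String, Double> raw = new HashMap<>();
        raw.put("en", 0.21);
        raw.put("fr", 0.74);
        raw.put("it", 0.05);

        for (LangScore s : ranked(raw)) {
            System.out.println(s.code + " " + s.probability);
        }
        // An empty result models text with too little natural language.
        System.out.println(best(new HashMap<String, Double>(), "und"));
    }
}
```

Keeping the ranked list and the single best answer as separate methods mirrors the two access styles described above: callers that want every candidate read the ordered list, while callers that want one language get it directly, with a defined value when detection yields nothing.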