Package org.opensextant.extractors.langid
Language Detection
The Xponents LangDetect class wraps Cybozu LangDetect (Maven: http://search.maven.org/#artifactdetails%7Ccom.norconex.language%7Clangdetect%7C1.3.0%7Cjar). It is a decent library, but it required a number of additions to make it more usable:
- LangDetect.detect() returns an unordered list of probabilities. Xponents simplifies this so the caller can get an ordered list with the highest probability first, plus a simple method that returns just a language object (code + name).
- Profiles are accessible only from the file system, not via a resource path or stream. LangDetect profiles must be unpacked on the file system before use; Xponents packages them for runtime use.
- In situations where very little natural language is present in the text, Cybozu may fail. Xponents provides a variety of fallbacks that cover the most common language detection possibilities.
Overall, Xponents LangDetect provides a simplified approach to
language detection, while still providing full access to the
underlying Cybozu library as needed.
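The ordering and fallback ideas described above can be illustrated with a minimal, self-contained sketch. This is not the Xponents or Cybozu API: the class `LangRankSketch`, the methods `ranked` and `best`, and the `UNKNOWN` fallback value are hypothetical names chosen for illustration, and the scores are made-up inputs standing in for whatever an underlying detector reports.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; none of these names come from Xponents or Cybozu.
class LangRankSketch {

    // Assumed fallback code for text that yields no language scores at all.
    static final String UNKNOWN = "unknown";

    // Copy the unordered (language code -> probability) scores into a list
    // sorted by probability, highest first.
    static List<Map.Entry<String, Double>> ranked(Map<String, Double> scores) {
        List<Map.Entry<String, Double>> out = new ArrayList<>(scores.entrySet());
        out.sort(Map.Entry.<String, Double>comparingByValue().reversed());
        return out;
    }

    // Return only the most probable language code, or the fallback
    // when the detector produced nothing.
    static String best(Map<String, Double> scores) {
        List<Map.Entry<String, Double>> r = ranked(scores);
        return r.isEmpty() ? UNKNOWN : r.get(0).getKey();
    }

    public static void main(String[] args) {
        // Made-up scores for demonstration.
        Map<String, Double> scores = Map.of("fr", 0.14, "en", 0.71, "es", 0.15);
        System.out.println(ranked(scores));
        System.out.println(best(scores));
        System.out.println(best(Map.of()));
    }
}
```

A real wrapper would likely use richer fallbacks than a single constant (for example, script or character-range checks); the constant here only marks where such logic would plug in.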