Package org.opensextant.util

Utilities for Extraction

Don't get me wrong -- there are a lot of good utilities already out there for NLP work, and I found Apache Commons StringUtils, File*Utils and other APIs very helpful.  However, some oddities in the Java and Unicode world still need handling, as do gaps in the standards world, where reasonable metadata is simply missing.

  • FileUtility -- provides simpler method calls and macro-like calls for common file tasks.  The most commonly used method is readFile(path, encoding)
  • GeodeticUtility -- simple geo math, validation and some geohash utilities
  • TextUtils -- text buffer cleanup routines, language metadata and simple language detection.
    • isASCII, isEnglish, isLatin, isJapanese...: detect language codes and do simple text detection
    • checkCase, measureCase, isUpper, isLower: operations for character/text case metrics
    • hasDiacritics, replaceDiacritics, removeDiacritics..: work with diacritics
    • removeAny, removeAnyLeft, removeEmoticons, removeSymbols,...: non-text removals
    • tokens, tokensRight, tokensLeft: split on whitespace and return normalized tokens
    • parseHashTags, parseNaturalLanguage: work with jargon text or social media
  • GeonamesUtility -- a helper for working with country metadata: ISO, FIPS, names and codes.
  • SolrProxy and SolrUtil -- SolrProxy is a catch-all for interfacing with a Solr index or server. Its primary use case is interfacing Extractors with their underlying SolrTextTagger. SolrUtil supports some general and specific schema interaction for the OpenSextant gazetteer Solr schema.
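
To make the kind of text checks listed under TextUtils more concrete, here is a minimal sketch written against the JDK only. It is not the OpenSextant implementation: the class name TextSketch is hypothetical, and the method names merely echo the list above; the real TextUtils signatures and behavior may differ.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration class; NOT part of org.opensextant.util. */
final class TextSketch {

    private TextSketch() {}

    /** True if every char is 7-bit ASCII (an empty string counts as ASCII). */
    static boolean isASCII(String text) {
        for (int i = 0; i < text.length(); i++) {
            if (text.charAt(i) > 0x7F) {
                return false;
            }
        }
        return true;
    }

    /** True if canonical decomposition (NFD) exposes at least one combining mark. */
    static boolean hasDiacritics(String text) {
        String decomposed = Normalizer.normalize(text, Normalizer.Form.NFD);
        return decomposed.codePoints()
                .anyMatch(cp -> Character.getType(cp) == Character.NON_SPACING_MARK);
    }

    /**
     * Strips combining marks after NFD decomposition.
     * Letters with no canonical decomposition (for example 'ø') pass through unchanged.
     */
    static String removeDiacritics(String text) {
        String decomposed = Normalizer.normalize(text, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{Mn}+", "");
    }

    /** Splits on runs of whitespace and drops empty tokens; no further normalization. */
    static List<String> tokens(String text) {
        List<String> out = new ArrayList<>();
        for (String token : text.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                out.add(token);
            }
        }
        return out;
    }
}
```

Under these assumptions, TextSketch.removeDiacritics("São Tomé") yields "Sao Tome", and TextSketch.tokens("  New   York City ") yields the three tokens New, York and City. Decomposing first and then dropping the marks keeps the base letters intact, which matters when matching place names against an ASCII-folded gazetteer.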