Utilities for Extraction
Don't get me wrong -- there are a lot of good utilities already
out there for NLP work. I found using Apache Commons
StringUtils, File*Utils and other APIs very helpful.
However, there are some oddities in the Java and Unicode world
that need handling, as well as in the standards world where
reasonable metadata is just missing.
- FileUtility -- provides simpler method calls and macro-like helpers for common file tasks. The most frequently used one is readFile(path, encoding).
- GeodeticUtility -- simple geo math, validation and some geohash utilities
- TextUtils -- text buffer cleanup routines, language metadata, and simple language detection. Its methods fall into these groups:
  - isASCII, isEnglish, isLatin, isJapanese, ...: detect language codes and perform simple text detection
  - checkCase, measureCase, isUpper, isLower: operations for character/text case metrics
  - hasDiacritics, replaceDiacritics, removeDiacritics, ...: work with diacritics
  - removeAny, removeAnyLeft, removeEmoticons, removeSymbols, ...: non-text removals
  - tokens, tokensRight, tokensLeft: split on whitespace and return normalized tokens
  - parseHashTags, parseNaturalLanguage: work with jargon text or social media
- GeonamesUtility -- a helper for working with country metadata: ISO, FIPS, names and codes.
- SolrProxy and SolrUtil -- SolrProxy is a catch-all for interfacing with a Solr
index or server. Its primary use case is connecting
Extractors to their underlying SolrTextTagger. SolrUtil supports some general and some specific
schema interaction for the OpenSextant gazetteer Solr schema.
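
To make the kinds of helpers listed above more concrete, here is a minimal sketch in plain JDK Java. It is not the OpenSextant code: the class name UtilitySketch and every method signature below are assumptions made for illustration, and only the method names (readFile, isASCII, hasDiacritics, removeDiacritics, tokens) come from the list above. The real FileUtility and TextUtils may differ in arguments, return types and normalization rules.

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical stand-ins for illustration; NOT the OpenSextant classes. */
final class UtilitySketch {

    // Any Unicode combining mark. Treating every mark as a "diacritic" is a
    // simplifying assumption of this sketch.
    private static final Pattern MARKS = Pattern.compile("\\p{M}+");

    /** Read a whole file into a String using an explicit encoding. */
    static String readFile(Path path, Charset encoding) throws IOException {
        return new String(Files.readAllBytes(path), encoding);
    }

    /** True if every char is within the 7-bit ASCII range. */
    static boolean isASCII(String text) {
        for (int i = 0; i < text.length(); i++) {
            if (text.charAt(i) > 0x7F) {
                return false;
            }
        }
        return true;
    }

    /** True if the canonically decomposed text contains a combining mark. */
    static boolean hasDiacritics(String text) {
        String decomposed = Normalizer.normalize(text, Normalizer.Form.NFD);
        return MARKS.matcher(decomposed).find();
    }

    /** Decompose, drop combining marks, then recompose what is left. */
    static String removeDiacritics(String text) {
        String decomposed = Normalizer.normalize(text, Normalizer.Form.NFD);
        String stripped = MARKS.matcher(decomposed).replaceAll("");
        return Normalizer.normalize(stripped, Normalizer.Form.NFC);
    }

    /** Split on runs of whitespace and drop empty tokens. */
    static List<String> tokens(String text) {
        List<String> out = new ArrayList<>();
        for (String t : text.trim().split("\\s+")) {
            if (!t.isEmpty()) {
                out.add(t);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String city = "Qu\u00e9bec"; // "Quebec" with an acute accent on the e
        System.out.println(isASCII(city));           // false
        System.out.println(hasDiacritics(city));     // true
        System.out.println(removeDiacritics(city));  // Quebec
        System.out.println(tokens("  Rue  de la   Paix "));
    }
}
```

The sketch passes the encoding explicitly to readFile, as the FileUtility entry above suggests, rather than relying on the platform default charset.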