Package org.opensextant.extractors.xcoord
XCoord: Geographic Coordinate Extraction
XCoord is a developer toolkit for extracting three major forms of coordinate patterns from any textual data:
- UTM - Universal Transverse Mercator
- MGRS - Military Grid Reference System
- Degrees, Minutes, Seconds and variants (DD, DM, DMS)
XCoord allows the user to define their own coordinate patterns or
extend the default patterns. There are about 2 dozen
coordinate patterns, defined here: ./doc/XCoord_Patterns.htm
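For orientation, the degree-based variants (DD, DM, DMS) all reduce to the same decimal-degree value. The following is a minimal sketch of that arithmetic, assuming a hypothetical helper class `DmsArithmetic` that is not part of the XCoord API:

```java
// Hypothetical helper for illustration only; not part of the XCoord API.
final class DmsArithmetic {
    private DmsArithmetic() {}

    /**
     * Convert degrees, minutes, seconds plus a hemisphere letter to decimal degrees.
     * South and West hemispheres yield negative values.
     */
    static double toDecimalDegrees(int degrees, int minutes, double seconds, char hemisphere) {
        double value = degrees + minutes / 60.0 + seconds / 3600.0;
        char h = Character.toUpperCase(hemisphere);
        return (h == 'S' || h == 'W') ? -value : value;
    }

    public static void main(String[] args) {
        // 42 degrees 30 minutes North
        System.out.println(toDecimalDegrees(42, 30, 0.0, 'N'));
        // 71 degrees 15 minutes West
        System.out.println(toDecimalDegrees(71, 15, 0.0, 'W'));
    }
}
```

The sketch does no range checking (for example, minutes below 60); a real extractor would validate each field before converting.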
Usage
From the command line you can quickly test XCoord on a set of
given test cases or provide a file of your own.
ant -f ./script/testing.xml test-xcoord
file? test/mytest.txt
ant test-default
... runs internal unit tests coupled with the given patterns
configuration file
Programmatically, the essential usage is:
XCoord xc = new XCoord();
xc.configure();
TextMatchResult geocodes = xc.extract_coordinates(text, text_id);
//... Now iterate over geocodes.matches
Alternatively, the Extractor.extract() interface implemented by XCoord is leaner:
// 1. just text as an input.
List<TextMatch> geocodes1 = xc.extract(text);
// 2. pass in a TextInput argument, for example DocInput that represents a document.
List<TextMatch> geocodes2 = xc.extract( new TextInput(text, text_id));
In the first case you can extract coordinates from any string of
text. In the second case, if you are managing your input records
using some identifiers and want to carry such IDs on through your
extraction results, use the TextInput method.
Tuning performance happens at many levels. XCoord can toggle each coordinate pattern family (UTM, MGRS, DM, DMS, DD) if only a limited or known set of formats is desired. In addition, when embedding XCoord in other systems (such as its parent project OpenSextant), you can supply a configuration file, for example, xc.configure("mypatterns.cfg"). Such configuration files must currently be on the CLASSPATH.
When interpreting geocoding results, the caller of XCoord should
check whether an individual match is a submatch (GeocoordMatch.is_submatch).
Because each pattern is assessed individually, multiple matches
may produce overlapping annotations. The intention is that the
longest distinct match is the most relevant for any given span of
text, although in some uses all matches are worth seeing. To be clear,
matches that are contained entirely within other matches are
marked as submatches and are therefore less likely to be the item of
interest for geocoding. Other matches may overlap (GeocoordMatch.is_overlap = true).
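The submatch rule can be sketched with plain character offsets. The `Span` class and the `SubmatchDemo` methods below are hypothetical stand-ins written for illustration; they are not the GeocoordMatch implementation, and XCoord's own logic may differ in detail.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a text match; not the XCoord GeocoordMatch class.
final class Span {
    final int start;            // inclusive character offset
    final int end;              // exclusive character offset
    boolean isSubmatch = false;
    boolean isOverlap = false;

    Span(int start, int end) {
        this.start = start;
        this.end = end;
    }

    /** True if this span lies entirely within other and is not identical to it. */
    boolean isWithin(Span other) {
        boolean sameExtent = start == other.start && end == other.end;
        return !sameExtent && start >= other.start && end <= other.end;
    }

    /** True if the two spans share at least one character. */
    boolean intersects(Span other) {
        return start < other.end && other.start < end;
    }
}

final class SubmatchDemo {
    /** Flag contained spans as submatches; flag other intersecting spans as overlaps. */
    static void markSubmatches(List<Span> spans) {
        for (Span a : spans) {
            for (Span b : spans) {
                if (a == b) {
                    continue;
                }
                if (a.isWithin(b)) {
                    a.isSubmatch = true;
                } else if (a.intersects(b) && !b.isWithin(a)) {
                    a.isOverlap = true;
                }
            }
        }
    }

    /** Keep only the spans that are not contained in a longer span. */
    static List<Span> withoutSubmatches(List<Span> spans) {
        List<Span> kept = new ArrayList<>();
        for (Span s : spans) {
            if (!s.isSubmatch) {
                kept.add(s);
            }
        }
        return kept;
    }
}
```

A caller that only wants the longest distinct matches would use something like `withoutSubmatches`; a caller that wants to see everything would simply ignore the flags.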
Pattern Definition
FlexPat (derived from a few other MITRE efforts) allows XCoord to
define the coordinate patterns as regular expressions, using named
pattern groups. As of Java version 6, the Java regular
expression (regex) capability does not support the full regex
grammar, including named pattern groups. FlexPat was
designed to address this gap in functionality as well as to
provide a foundation for simple text matches, pattern definition,
and pattern test cases. See the documentation in XCoord's
PatternManager.
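To illustrate what a named pattern group provides, here is a minimal sketch using the standard `java.util.regex` named-group syntax, which is available from Java 7 onward. It is not FlexPat, and the simplified degrees-minutes pattern and the group names `latDeg`, `latMin`, and `latHemi` are assumptions made for this example rather than XCoord's pattern definitions.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: a simplified latitude pattern, not one of XCoord's patterns.
final class NamedGroupDemo {
    // Degrees (1-2 digits), a separator, minutes (2 digits), optional space, hemisphere letter.
    static final Pattern LAT_DM = Pattern.compile(
        "(?<latDeg>\\d{1,2})[-: ](?<latMin>\\d{2}) ?(?<latHemi>[NS])");

    /** Return the latitude in decimal degrees, or null if the text has no match. */
    static Double parseLatitude(String text) {
        Matcher m = LAT_DM.matcher(text);
        if (!m.find()) {
            return null;
        }
        // Fields are read by name rather than by group position.
        int deg = Integer.parseInt(m.group("latDeg"));
        int min = Integer.parseInt(m.group("latMin"));
        double value = deg + min / 60.0;
        return "S".equals(m.group("latHemi")) ? -value : value;
    }

    public static void main(String[] args) {
        System.out.println(parseLatitude("Position reported at 42-30N today"));
    }
}
```

Reading fields by name keeps the parsing code stable when a pattern is reordered or extended, which is the convenience named groups are meant to offer.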
Runtime Flags and Optimization
Configuration file parameters imply that a parameter holds a
single value for the duration of the current process. Since
processing may be context-sensitive, we instead use static
runtime flags (a bit mask of flags from XConstants) to influence
and tune behavior. Current flags include toggling coordinate
pattern families and the option to extract context text.
XCoord.RUNTIME_FLAGS ^= XConstants.FILTER_DMS_ON;    // Toggle DMS filters using XOR; this turns them OFF when they are currently ON.
XCoord.RUNTIME_FLAGS |= XConstants.FLAG_ALL_FILTERS; // Return to default filter behavior with all filters; other flags are untouched.
XCoord.RUNTIME_FLAGS = XConstants.FLAG_ALL_FILTERS;  // Return to default behavior: only the all-filters flags are set.
Other FLAG parameters will be added over time to allow XCoord
behavior to be adapted at runtime.
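Because XOR flips a bit rather than clearing it, applying it twice re-enables a filter. The sketch below contrasts toggling, clearing, and setting a flag on a bit mask. The `FlagDemo` class, the numeric values, and the `FILTER_MGRS_ON` and `EXAMPLE_CONTEXT_FLAG` names are hypothetical, chosen only for illustration; the real constants are defined in XConstants.

```java
// Hypothetical flag values for illustration; the real constants are defined in XConstants.
final class FlagDemo {
    static final long FILTER_DMS_ON = 0x01;
    static final long FILTER_MGRS_ON = 0x02;        // assumed name
    static final long FLAG_ALL_FILTERS = FILTER_DMS_ON | FILTER_MGRS_ON;
    static final long EXAMPLE_CONTEXT_FLAG = 0x10;  // assumed name, stands in for a non-filter flag

    /** True if every bit of flag is set in flags. */
    static boolean isSet(long flags, long flag) {
        return (flags & flag) == flag;
    }

    /** Flip the bits of flag: ON becomes OFF and OFF becomes ON. */
    static long toggle(long flags, long flag) {
        return flags ^ flag;
    }

    /** Turn the bits of flag OFF regardless of their current state. */
    static long clear(long flags, long flag) {
        return flags & ~flag;
    }

    /** Turn the bits of flag ON regardless of their current state. */
    static long set(long flags, long flag) {
        return flags | flag;
    }

    public static void main(String[] args) {
        long flags = FLAG_ALL_FILTERS | EXAMPLE_CONTEXT_FLAG;
        flags = toggle(flags, FILTER_DMS_ON);
        System.out.println("DMS filter on after one toggle? " + isSet(flags, FILTER_DMS_ON));
        flags = toggle(flags, FILTER_DMS_ON);
        System.out.println("DMS filter on after two toggles? " + isSet(flags, FILTER_DMS_ON));
    }
}
```

When the current state of a flag is not known, the AND-NOT form in `clear` is the safer way to switch a filter off, since it gives the same result however many times it is applied.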
Class Descriptions
- DMS Filters include ignoring these patterns: dd-dd-dd HH:MM:ss (where dd-dd-dd HH-MM-ss would be a valid coordinate as the field separators for lat/lon are the same).
- DMSOrdinate represents all the various fields a WGS84 cartesian coordinate could have.
- Resolution field for DMS.ms
- GeocoordMatch holds all the annotation data for the actual raw and normalized coordinate.
- Filtering matches is a matter of practicality.
- Represent a Hemisphere symbol and value.
- MGRS Filters include ignoring these patterns: 1234, 123456, 12345678, 1234567890; recent calendar dates of the form ddMMMyyyy, "14DEC1990" (MGRS: 14D EC 19 90); recent calendar dates with time, ddMMMHHmm, "14DEC1200" (noon on 14DEC).
- This is the culmination of various coordinate extraction efforts in Python and Java.
- Use this XCoord class for both test and development of patterns, as well as to extract coordinates at runtime.
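The MGRS date rule mentioned above (a string such as "14DEC1990" reads as a date but could also parse as the MGRS value 14D EC 19 90) can be sketched as a simple pre-check. The `DateLikeFilter` class, its method name, and the year window used to decide what counts as "recent" are assumptions made for illustration; this is not the XCoord MGRS filter implementation.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of a date-like filter; not the XCoord MGRS filter implementation.
final class DateLikeFilter {
    private static final Pattern DD_MMM_YYYY = Pattern.compile(
        "(\\d{2})(JAN|FEB|MAR|APR|MAY|JUN|JUL|AUG|SEP|OCT|NOV|DEC)(\\d{4})");

    // Assumed window for what counts as a "recent" year; chosen for illustration.
    static final int MIN_YEAR = 1900;
    static final int MAX_YEAR = 2100;

    /** True if the text, ignoring whitespace and case, reads as a ddMMMyyyy date. */
    static boolean looksLikeDate(String text) {
        String compact = text.replaceAll("\\s+", "").toUpperCase(Locale.ROOT);
        Matcher m = DD_MMM_YYYY.matcher(compact);
        if (!m.matches()) {
            return false;
        }
        int day = Integer.parseInt(m.group(1));
        int year = Integer.parseInt(m.group(3));
        return day >= 1 && day <= 31 && year >= MIN_YEAR && year <= MAX_YEAR;
    }
}
```

A candidate MGRS match for which `looksLikeDate` returns true would be discarded as a likely false positive; the check deliberately ignores month lengths to stay short.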