Package org.opensextant.extractors.xcoord


XCoord: Geographic Coordinate Extraction

XCoord is a developer toolkit for extracting 3 major forms of coordinate patterns from any textual data:
  • UTM - Universal Transverse Mercator
  • MGRS - Military Grid Reference System
  • Degrees, Minutes, Seconds and variants (DD, DM, DMS)

XCoord lets you define your own coordinate patterns or extend the default ones.  About two dozen default patterns are defined here:  ./doc/XCoord_Patterns.htm
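
To make the DMS family concrete, here is a minimal sketch of how degrees, minutes, and seconds reduce to a single decimal-degree value. The class and method names are hypothetical and are not part of XCoord; the arithmetic itself is the standard conversion.

```java
/** Hypothetical helper for illustration only; not part of the XCoord API. */
final class DmsToDecimal {

    /**
     * Converts DMS fields to signed decimal degrees.
     *
     * @param degrees    whole degrees, non-negative
     * @param minutes    whole minutes, 0..59
     * @param seconds    seconds, 0 up to (but not including) 60
     * @param hemisphere one of 'N', 'S', 'E', 'W'
     */
    static double toDecimalDegrees(int degrees, int minutes, double seconds, char hemisphere) {
        double value = degrees + minutes / 60.0 + seconds / 3600.0;
        // Southern and western hemispheres are negative by convention.
        boolean negative = hemisphere == 'S' || hemisphere == 'W';
        return negative ? -value : value;
    }

    public static void main(String[] args) {
        System.out.println(toDecimalDegrees(42, 30, 0.0, 'N'));
        System.out.println(toDecimalDegrees(71, 15, 36.0, 'W'));
    }
}
```

A DM value is the same calculation with seconds set to zero, and a DD value needs only the hemisphere sign applied.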

Usage

From the command line you can quickly test XCoord on a set of given test cases or provide a file of your own.

ant -f ./script/testing.xml test-xcoord
file?    test/mytest.txt
ant test-default

The test-default target runs the internal unit tests against the given patterns configuration file.

Programmatically, the essential usage is:

XCoord xc = new XCoord();
xc.configure();
TextMatchResult geocodes = xc.extract_coordinates(text, text_id);

//... Now iterate over geocodes.matches

Alternatively, the Extractor.extract() interface implemented by XCoord is leaner still:

// 1. just text as an input.
List<TextMatch> geocodes1 = xc.extract(text);

// 2. pass in a TextInput argument, for example DocInput that represents a document.
List<TextMatch> geocodes2 = xc.extract( new TextInput(text, text_id));


In the first case you can extract coordinates from any string of text. In the second case, if you track your input records by identifier and want those IDs carried through to the extraction results, use the TextInput method.

Tuning performance happens at many levels.  XCoord can toggle each coordinate pattern family (UTM, MGRS, DM, DMS, DD) when only a limited or known set of formats is wanted.  When embedding XCoord in other systems (such as its parent project OpenSextant), XCoord can also be given a configuration file, for example, xc.configure( "mypatterns.cfg"). Such configuration files must currently be on the CLASSPATH.

When interpreting geocoding results, the caller of XCoord should check whether an individual match is a submatch (GeocoordMatch.is_submatch).  Each pattern is assessed individually, so multiple matches may produce overlapping annotations.  The intention is that the longest distinct match is the most relevant for any given span of text, although in some uses all matches are worth seeing.  To be clear, matches that are contained entirely within other matches are marked as submatches and are therefore less likely to be the item of interest for geocoding. Other matches may only partially overlap (GeocoordMatch.is_overlap = true).
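
The containment rule described above can be sketched with plain offsets. This is an illustration only: Span and SpanMarker are hypothetical stand-ins, not XCoord classes, and the sketch assumes that two matches with identical extents are left unmarked.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a match; carries only what the illustration needs. */
final class Span {
    final int start;            // inclusive character offset
    final int end;              // exclusive character offset
    boolean isSubmatch = false; // fully contained in a longer span
    boolean isOverlap = false;  // partially overlaps another span

    Span(int start, int end) {
        this.start = start;
        this.end = end;
    }
}

final class SpanMarker {

    /** Flags each span as a submatch or a partial overlap relative to the others. */
    static void mark(List<Span> spans) {
        for (Span a : spans) {
            for (Span b : spans) {
                if (a == b) continue;
                boolean intersects = a.start < b.end && b.start < a.end;
                if (!intersects) continue;

                boolean aInsideB = a.start >= b.start && a.end <= b.end;
                boolean bInsideA = b.start >= a.start && b.end <= a.end;
                if (aInsideB && !bInsideA) {
                    a.isSubmatch = true;  // strictly contained in a longer span
                } else if (!aInsideB && !bInsideA) {
                    a.isOverlap = true;   // shares text with b, neither contains the other
                }
            }
        }
    }

    /** Returns the spans a caller would normally keep: everything not marked as a submatch. */
    static List<Span> primary(List<Span> spans) {
        List<Span> kept = new ArrayList<>();
        for (Span s : spans) {
            if (!s.isSubmatch) kept.add(s);
        }
        return kept;
    }
}
```

A caller interested only in the most relevant coordinates would take the result of primary(); a caller that wants every candidate would ignore the flags.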

Pattern Definition

FlexPat (derived from a few other MITRE efforts) lets XCoord define its coordinate patterns as regular expressions that use named pattern groups.  As of Java version 6, the Java regular expression (regex) capability does not support the full regex grammar, notably named pattern groups.  FlexPat was designed to address this gap in functionality as well as to provide a foundation for simple text matches, pattern definition, and pattern test cases.  See the documentation in XCoord's PatternManager.
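
To illustrate what named pattern groups buy you, the sketch below uses the (?<name>...) syntax that java.util.regex gained in Java 7. It is not FlexPat's own pattern syntax, and the pattern and class name are invented for the example; the point is only that fields such as degrees, minutes, and hemisphere can be pulled out by name rather than by position.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative only; this pattern is far simpler than any XCoord default pattern. */
final class NamedGroupDemo {

    // Latitude in degrees and minutes followed by a hemisphere letter, e.g. "42-30N" or "42 30 N".
    static final Pattern LAT_DM = Pattern.compile(
            "(?<deg>\\d{1,2})[ -](?<min>\\d{2})\\s?(?<hemi>[NS])");

    /** Returns "deg|min|hemi" for the first match in the text, or null if nothing matches. */
    static String describe(String text) {
        Matcher m = LAT_DM.matcher(text);
        if (!m.find()) {
            return null;
        }
        // Fields are addressed by name, so reordering groups in the pattern does not break callers.
        return m.group("deg") + "|" + m.group("min") + "|" + m.group("hemi");
    }

    public static void main(String[] args) {
        System.out.println(describe("Site located at 42-30N today"));
    }
}
```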


Runtime Flags and Optimization

Configuration file parameters imply a single value for each parameter for the duration of the current process.  Since processing may be context-sensitive, we use static runtime flags (a bit mask of flags from XConstants) to influence and tune behavior.  Current flags include toggling coordinate pattern families and the option to extract context text.

    XCoord.RUNTIME_FLAGS ^= XConstants.FILTER_DMS_ON;    // XOR toggles the DMS filter bit: OFF if it was ON
    XCoord.RUNTIME_FLAGS |= XConstants.FLAG_ALL_FILTERS; // turn all filters back ON, leaving other flags as they are
    XCoord.RUNTIME_FLAGS = XConstants.FLAG_ALL_FILTERS;  // reset to default behavior with all filters (overwrites any other flags)


Other FLAG parameters will be added over time to allow XCoord behavior to be adapted at runtime.
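
Because XOR flips a bit rather than clearing it, applying it twice turns a filter back on. The self-contained sketch below shows the usual bit-mask idioms for setting, clearing, toggling, and testing a flag. The constant values and the FlagDemo class are hypothetical; the real flag definitions live in XConstants.

```java
/** Illustrative bit-mask helpers; constant values are made up for the example. */
final class FlagDemo {

    static final int DD_FILTER = 0x01;
    static final int DMS_FILTER = 0x02;
    static final int MGRS_FILTER = 0x04;
    static final int ALL_FILTERS = DD_FILTER | DMS_FILTER | MGRS_FILTER;

    /** Sets the flag regardless of its current state. */
    static int enable(int flags, int flag) {
        return flags | flag;
    }

    /** Clears the flag regardless of its current state. */
    static int disable(int flags, int flag) {
        return flags & ~flag;
    }

    /** Flips the flag: ON becomes OFF and OFF becomes ON. */
    static int toggle(int flags, int flag) {
        return flags ^ flag;
    }

    /** True when every bit of the flag is set. */
    static boolean isSet(int flags, int flag) {
        return (flags & flag) == flag;
    }
}
```

When the current state is unknown, clearing with AND-NOT is the safer way to switch a filter off, since it is idempotent.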


Class Descriptions

  • DMS Filters include ignoring these patterns: dd-dd-dd HH:MM:ss (where dd-dd-dd HH-MM-ss would be a valid coordinate as the field separators for lat/lon are the same).
  • DMSOrdinate represents all the various fields a WGS84 cartesian coordinate could have.
  • Resolution field for DMS.ms
  • GeocoordMatch holds all the annotation data for the actual raw and normalized coordinate.
  • Filtering matches is a matter of practicality.
  • Represent a Hemisphere symbol and value.
  • MGRS Filters include ignoring these patterns: 1234, 123456, 12345678, 1234567890; recent calendar dates of the form ddMMMyyyy, "14DEC1990" (MGRS: 14D EC 19 90); recent calendar dates with time, ddMMHHmm, "14DEC1200" (noon on 14DEC).
  • This is the culmination of various coordinate extraction efforts in python and Java.
  • Use this XCoord class for both test and development of patterns, as well as to extract coordinates at runtime.
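
As a rough illustration of the MGRS filter idea described above, the sketch below rejects candidates that are digits only or that look like a calendar date once whitespace is removed. MgrsLikeFilter is a hypothetical class, not the XCoord implementation, and it simplifies in one respect: it does not check whether a date is recent.

```java
import java.util.Locale;
import java.util.regex.Pattern;

/** Hypothetical filter for illustration; the real MGRS filter in XCoord may differ. */
final class MgrsLikeFilter {

    // Digit-only strings such as 1234 or 1234567890 carry no zone or grid-square letters.
    static final Pattern DIGITS_ONLY = Pattern.compile("\\d+");

    // Two digits, a three-letter month abbreviation, then four digits (a year or a time).
    static final Pattern DATE_LIKE = Pattern.compile(
            "\\d{2}(JAN|FEB|MAR|APR|MAY|JUN|JUL|AUG|SEP|OCT|NOV|DEC)\\d{4}");

    /** Returns true when the candidate text should not be reported as an MGRS match. */
    static boolean shouldIgnore(String candidate) {
        // Compare on a compact, upper-case form so "14D EC 19 90" is seen as "14DEC1990".
        String compact = candidate.replaceAll("\\s+", "").toUpperCase(Locale.ROOT);
        return DIGITS_ONLY.matcher(compact).matches()
                || DATE_LIKE.matcher(compact).matches();
    }
}
```

The trade-off is deliberate: a genuine grid reference that happens to spell a date is dropped, in exchange for far fewer false positives on ordinary dates and numbers.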