Package org.opensextant.extractors.geo



Geo Extraction: PlaceGeocoder, SolrGazetteer, GazetteerMatcher and related items

This package covers complete geotagging of unstructured text, along with any supporting functions. The foundation is the gazetteer and the tagger, represented by SolrGazetteer and GazetteerMatcher, respectively. The gazetteer is the database of all place names, abbreviations, codes, and location information. The tagger is a Solr handler on top of that database. (Here, "database" means "Solr index".)
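To make the gazetteer/tagger split concrete, here is a minimal in-memory sketch. It is not how SolrGazetteer or GazetteerMatcher are implemented (those sit on a Solr index); the class ToyGazetteerTagger, its Tag type, and the single-word matching are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical toy: a map plays the gazetteer, a regex scan plays the tagger. */
class ToyGazetteerTagger {

    /** A tagged span of text plus the country codes the gazetteer lists for that name. */
    static class Tag {
        final String text;
        final int start;
        final int end;
        final List<String> countries;

        Tag(String text, int start, int end, List<String> countries) {
            this.text = text;
            this.start = start;
            this.end = end;
            this.countries = countries;
        }
    }

    // Runs of letters only, so multi-word names are out of scope for this sketch.
    private static final Pattern WORD = Pattern.compile("\\p{L}+");

    // "Gazetteer": lower-cased name -> country codes where a place of that name exists.
    private final Map<String, List<String>> gazetteer = new HashMap<>();

    void add(String name, String countryCode) {
        gazetteer.computeIfAbsent(name.toLowerCase(Locale.ROOT), k -> new ArrayList<>())
                 .add(countryCode);
    }

    /** "Tagger": report every word in the text that the gazetteer knows about. */
    List<Tag> tag(String text) {
        List<Tag> tags = new ArrayList<>();
        Matcher m = WORD.matcher(text);
        while (m.find()) {
            List<String> hits = gazetteer.get(m.group().toLowerCase(Locale.ROOT));
            if (hits != null) {
                tags.add(new Tag(m.group(), m.start(), m.end(), hits));
            }
        }
        return tags;
    }
}
```

Note that a name such as "Paris" can map to several locations. The tagger only reports the ambiguity; resolving it is the geocoder's job.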

Now the fun part: PlaceGeocoder. This is a complex rules processor for all geotagging and geocoding work in pure Java. Lots of internal resource dependencies are involved: gazetteer, resource files, lookup tables, configuration data, tuning parameters. But the API is simple: 

  • initialize: PlaceGeocoder p = new PlaceGeocoder(); /* optionally: */ p.setParameters(...);
  • use: List<TextMatch> results = p.extract( input )

PlaceGeocoder is based on the Xponents geocoding methodology: https://github.com/OpenSextant/Xponents/blob/master/doc/Geocoder_Handbook.md. The rules package is a collection of implemented GeocodeRules that operate on PlaceCandidates found in your input text. Each rule may look at the match itself, the candidate locations behind the match, or the surrounding evidence near the match or within the document. The general approach of the geocoder is:

  1. Extract hard evidence first, e.g., coordinates of acceptable resolution in the text or in document metadata ==> infers Country and/or Province. Note that hard evidence is not always available in the text.
  2. Use language ID of text ==> guides text tokenization, matching, and filtering of matches. In rich metadata scenarios, the language of the data may also indirectly imply the Country of origin or of topic (see the Country data class, which has primary language and timezone information).
  3. Decorate candidates / input document with obvious soft evidence ==> mentions of countries by name or well-known abbreviations of provinces infer the geographic region and weight location names in those regions higher.
  4. Filter obvious false positives ==> organizations or person names that are confounded with place names
  5. Choose location ==> given all the evidence, record the rules behind each choice and assign a confidence level (a relative score on a 100-point scale)
  6. Emit all matches: filtered "out" matches are marked as such; matches may be PlaceCandidate, TaxonMatch (org or person), GeocoordMatch.
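
The flow above can be sketched in a few lines. This is an illustration of steps 3 through 6 only, not the PlaceGeocoder implementation: the Location and Candidate classes, the fixed boost of 20, and the rule label "CountryEvidence" are all hypothetical, and hard evidence (step 1) and language ID (step 2) are left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Set;

/** Hypothetical sketch of evidence weighting, filtering, and location choice. */
class GeocodePipelineSketch {

    /** Stand-in for one gazetteer location behind a matched name. */
    static class Location {
        final String name;
        final String countryCode;
        double score;                                  // relative score, 0..100
        final List<String> rules = new ArrayList<>();  // labels of rules that fired

        Location(String name, String countryCode, double baseScore) {
            this.name = name;
            this.countryCode = countryCode;
            this.score = baseScore;
        }
    }

    /** Stand-in for a name matched in the text, with its possible locations. */
    static class Candidate {
        final String text;
        final List<Location> locations = new ArrayList<>();
        boolean filteredOut = false;
        Location chosen = null;
        int confidence = 0;

        Candidate(String text, Location... locs) {
            this.text = text;
            this.locations.addAll(Arrays.asList(locs));
        }
    }

    static List<Candidate> resolve(List<Candidate> candidates,
                                   Set<String> countriesInEvidence,
                                   Set<String> knownNonPlaces) {
        for (Candidate c : candidates) {
            // Step 4: mark obvious false positives, but keep them in the output.
            if (knownNonPlaces.contains(c.text)) {
                c.filteredOut = true;
                continue;
            }
            for (Location loc : c.locations) {
                // Step 3: soft evidence raises the weight of locations in mentioned countries.
                if (countriesInEvidence.contains(loc.countryCode)) {
                    loc.score += 20;                   // arbitrary boost for this sketch
                    loc.rules.add("CountryEvidence");
                }
                // Step 5: keep the best-scoring location seen so far.
                if (c.chosen == null || loc.score > c.chosen.score) {
                    c.chosen = loc;
                }
            }
            if (c.chosen != null) {
                c.confidence = (int) Math.min(100, Math.round(c.chosen.score));
            }
        }
        // Step 6: emit everything, including matches that were filtered out.
        return candidates;
    }
}
```

Filtered candidates stay in the returned list with a flag set rather than being dropped, so a caller can still inspect why a name was rejected.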

Once the caller receives the List of TextMatch, all of the rules and other metadata can be accessed through the data classes' APIs. The caller must cast each TextMatch to its subclass to use those methods.
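
A minimal sketch of that casting pattern follows. The nested classes are simplified stand-ins that share names with the real TextMatch, PlaceCandidate, TaxonMatch, and GeocoordMatch types; their fields and constructors are assumptions for illustration and do not reflect the actual Xponents data-class APIs.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: dispatch on the concrete subclass of each match. */
class MatchDispatchSketch {

    /** Stand-in base class; only the matched text and the filter flag are modelled. */
    static class TextMatch {
        final String text;
        final boolean filteredOut;

        TextMatch(String text, boolean filteredOut) {
            this.text = text;
            this.filteredOut = filteredOut;
        }
    }

    static class PlaceCandidate extends TextMatch {
        final String countryCode;   // assumed field: country of the chosen location
        final int confidence;       // assumed field: relative score on a 100-point scale

        PlaceCandidate(String text, boolean filteredOut, String countryCode, int confidence) {
            super(text, filteredOut);
            this.countryCode = countryCode;
            this.confidence = confidence;
        }
    }

    static class TaxonMatch extends TextMatch {
        final String taxonType;     // assumed field: e.g. "org" or "person"

        TaxonMatch(String text, boolean filteredOut, String taxonType) {
            super(text, filteredOut);
            this.taxonType = taxonType;
        }
    }

    static class GeocoordMatch extends TextMatch {
        final double latitude;
        final double longitude;

        GeocoordMatch(String text, double latitude, double longitude) {
            super(text, false);
            this.latitude = latitude;
            this.longitude = longitude;
        }
    }

    /** Summarize each match, casting to the subclass to reach its specific fields. */
    static List<String> describe(List<TextMatch> matches) {
        List<String> out = new ArrayList<>();
        for (TextMatch m : matches) {
            if (m.filteredOut) {
                out.add("filtered out: " + m.text);
            } else if (m instanceof PlaceCandidate) {
                PlaceCandidate p = (PlaceCandidate) m;
                out.add("place: " + p.text + " (" + p.countryCode + "), confidence " + p.confidence);
            } else if (m instanceof GeocoordMatch) {
                GeocoordMatch g = (GeocoordMatch) m;
                out.add("coordinate: " + g.text + " at " + g.latitude + ", " + g.longitude);
            } else if (m instanceof TaxonMatch) {
                TaxonMatch t = (TaxonMatch) m;
                out.add(t.taxonType + ": " + t.text);
            } else {
                out.add("other: " + m.text);
            }
        }
        return out;
    }
}
```

Checking the filter flag before the subclass keeps rejected matches from being reported as places, while still leaving them visible in the output.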

Examples of PlaceGeocoder usage:

  • BasicGeotemporalExtraction (Examples subproject)
  • org.opensextant.extractors.geo.social (Experimental) Geo-inferencing on Tweets: XponentsTextGeotagger and XponentsGeocoder are PlaceGeocoder applications driven by the demo SimpleProcessorDemo.
  • Xlayer (subproject) XGeo REST service.  This is a Restlet application that provisions a PlaceGeocoder as a RESTful extractor.


  • Class
    Description
    Emit a boundary event when you come across a concrete reference to a boundary, e.g., county or state, district or prefecture.
    Country metrics
     
    Connects to a Solr server via HTTP and tags place names in a document.
    Apply this interface where application logic observes a coordinate or any hard location reference.
    A PlaceCandidate represents a portion of a document which has been identified as a possible named geographic location.
    Place metrics.
    A PlaceEvidence represents a fragment of evidence about a Place.
    SCOPE - Where did this evidence come from with respect to the PlaceCandidate (PC) it is part of?
      APRIORI - derived from the gazetteer only, not from any information in the document
      LOCAL - directly associated with this instance of the PC
      COREF - associated with another (related) PC in the document
      MERGED - came from the merger of multiple PlaceEvidences (future use)
      DOCUMENT - in the same document but has no other direct association
    A simple variation on the geocoding algorithms: geotag all possible things and determine a best geo-location for each tagged item.
    PostalGeocoder -- a GazetteerMatcher that uses the "postal" Solr index to quickly tag any known postal codes worldwide.
    Postal Tagger tags and returns any alphanumeric token or phrase that resembles postal codes and abbreviations.
    A class to hold a Place and a score together.
    Connects to a Solr server via HTTP and tags place names in a document.