Package org.opensextant.extractors.geo
Geo Extraction: PlaceGeocoder, SolrGazetteer, GazetteerMatcher and related items
This package is all about complete geotagging of unstructured
text and any supporting functions. The foundation is the
gazetteer and the tagger, represented by SolrGazetteer and
GazetteerMatcher, respectively. The gazetteer is
the database of all place names, abbreviations, codes and
location information. The tagger is a Solr handler on top of the
database. (Here "database" == "Solr index".)
Now the fun part: PlaceGeocoder. This is a complex rules
processor for all geotagging and geocoding work in pure Java. Lots
of internal resource dependencies are involved: gazetteer,
resource files, lookup tables, configuration data, tuning
parameters. But the API is simple:
- initialize:
PlaceGeocoder p = new PlaceGeocoder(); /* optionally: */ p.setParameters(...);
- use:
List<TextMatch> results = p.extract(input);
- Extract hard evidence first, e.g., coordinates of acceptable
resolution in the text or in document metadata ==>
infers Country and/or Province. Note that hard evidence
is not always available in the text.
- Use language ID of text ==> guides text
tokenization, matching and filtering of matches. In rich
metadata scenarios, the language of the data may also indirectly imply
the Country of origin or of topic (see the Country
data class, which has primary language and timezone information).
- Decorate candidates / input document with obvious soft evidence ==> mentions of countries by name or well-known abbreviations of provinces infer a geographic region and weight location names in those regions higher.
- Filter obvious false positives ==> organization or person names that are confused with place names
- Choose location ==> given all the evidence, record the rules behind the choices made and assign a confidence level (a relative score on a 100-point scale)
- Emit all matches: filtered "out" matches are marked as such; matches may be PlaceCandidate, TaxonMatch (org or person), or GeocoordMatch.
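The steps above can be illustrated with a minimal, self-contained sketch. It does not use the Xponents classes: `Candidate`, `Location`, `choose`, the rule labels and `COUNTRY_BOOST` are all hypothetical stand-ins. It only shows one way that a country mention could raise the weight of locations in that country, how filtered matches can stay in the output but be marked, and how a relative confidence on a 100-point scale might be derived.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Illustrative only: hypothetical stand-ins, not the Xponents API. */
public class GeoRulesSketch {

    /** One possible gazetteer location for a name (hypothetical). */
    static class Location {
        final String name;
        final String countryCode;
        final double baseScore; // assumed positive a-priori weight

        Location(String name, String countryCode, double baseScore) {
            this.name = name;
            this.countryCode = countryCode;
            this.baseScore = baseScore;
        }
    }

    /** A tagged span of text with its possible locations (hypothetical). */
    static class Candidate {
        final String text;
        final List<Location> locations = new ArrayList<>();
        final List<String> rules = new ArrayList<>();
        boolean filteredOut = false;
        Location chosen = null;
        int confidence = 0; // relative score, 0..100

        Candidate(String text) {
            this.text = text;
        }
    }

    /** Assumed multiplier for locations in a country the document mentions. */
    static final double COUNTRY_BOOST = 2.0;

    static void choose(List<Candidate> candidates,
                       Set<String> countriesMentioned,
                       Set<String> knownNonPlaces) {
        for (Candidate c : candidates) {
            // Filter obvious false positives, but keep them in the output, marked.
            if (knownNonPlaces.contains(c.text)) {
                c.filteredOut = true;
                c.rules.add("NonPlaceName");
                continue;
            }
            double best = -1.0;
            double total = 0.0;
            boolean bestBoosted = false;
            for (Location loc : c.locations) {
                // Soft evidence: weight locations in mentioned countries higher.
                boolean boosted = countriesMentioned.contains(loc.countryCode);
                double score = boosted ? loc.baseScore * COUNTRY_BOOST : loc.baseScore;
                total += score;
                if (score > best) {
                    best = score;
                    bestBoosted = boosted;
                    c.chosen = loc;
                }
            }
            if (c.chosen != null) {
                if (bestBoosted) {
                    c.rules.add("CountryMention");
                }
                // Confidence here is the winner's share of the total weight.
                c.confidence = (int) Math.round(100.0 * best / total);
            }
        }
    }

    public static void main(String[] args) {
        Candidate place = new Candidate("Cambridge");
        place.locations.add(new Location("Cambridge", "GB", 1.5));
        place.locations.add(new Location("Cambridge", "US", 1.0));
        Candidate person = new Candidate("Jordan Smith");
        person.locations.add(new Location("Jordan", "JO", 2.0));

        List<Candidate> all = Arrays.asList(place, person);
        choose(all, new HashSet<>(Arrays.asList("US")),
               new HashSet<>(Arrays.asList("Jordan Smith")));

        for (Candidate c : all) {
            String where = (c.chosen == null) ? "none" : c.chosen.countryCode;
            System.out.println(c.text + " filteredOut=" + c.filteredOut
                    + " chosen=" + where + " confidence=" + c.confidence
                    + " rules=" + c.rules);
        }
    }
}
```

In this sketch the filtered candidate is still returned to the caller, which mirrors the "emit all matches" step; the scoring formula itself is an assumption and not a description of how PlaceGeocoder computes confidence.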
Once the caller receives the List of TextMatch, all of the rules
and other metadata can be accessed through the data classes'
APIs. The caller must cast each TextMatch to its subclass to use
such methods.
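The cast described above might look like the following sketch. The nested classes are small hypothetical stand-ins that only borrow the names TextMatch, PlaceCandidate, TaxonMatch and GeocoordMatch from this package description; their fields and the `describe` helper are assumptions for illustration, not the real Xponents signatures.

```java
import java.util.Arrays;
import java.util.List;

/** Illustrative only: nested classes are hypothetical stand-ins, not the Xponents classes. */
public class MatchCastSketch {

    /** Stand-in base type: matched text plus a "filtered out" marker. */
    static class TextMatch {
        final String text;
        boolean filteredOut = false;

        TextMatch(String text) {
            this.text = text;
        }
    }

    /** Stand-in place match carrying assumed confidence and rule fields. */
    static class PlaceCandidate extends TextMatch {
        final int confidence;
        final List<String> rules;

        PlaceCandidate(String text, int confidence, List<String> rules) {
            super(text);
            this.confidence = confidence;
            this.rules = rules;
        }
    }

    /** Stand-in org or person match. */
    static class TaxonMatch extends TextMatch {
        final String taxon; // e.g. "org" or "person"

        TaxonMatch(String text, String taxon) {
            super(text);
            this.taxon = taxon;
        }
    }

    /** Stand-in coordinate match. */
    static class GeocoordMatch extends TextMatch {
        final double lat;
        final double lon;

        GeocoordMatch(String text, double lat, double lon) {
            super(text);
            this.lat = lat;
            this.lon = lon;
        }
    }

    /** Inspect the runtime type, then cast to reach subclass-specific data. */
    static String describe(TextMatch m) {
        String prefix = m.filteredOut ? "[filtered out] " : "";
        if (m instanceof PlaceCandidate) {
            PlaceCandidate pc = (PlaceCandidate) m;
            return prefix + "place '" + pc.text + "' confidence=" + pc.confidence
                    + " rules=" + pc.rules;
        } else if (m instanceof GeocoordMatch) {
            GeocoordMatch g = (GeocoordMatch) m;
            return prefix + "coordinate '" + g.text + "' at " + g.lat + "," + g.lon;
        } else if (m instanceof TaxonMatch) {
            TaxonMatch t = (TaxonMatch) m;
            return prefix + t.taxon + " '" + t.text + "'";
        }
        return prefix + "match '" + m.text + "'";
    }

    public static void main(String[] args) {
        TaxonMatch person = new TaxonMatch("Jordan Smith", "person");
        person.filteredOut = true;
        List<TextMatch> results = Arrays.asList(
                new PlaceCandidate("Boston", 72, Arrays.asList("CountryMention")),
                new GeocoordMatch("42.36N 71.06W", 42.36, -71.06),
                person);
        for (TextMatch m : results) {
            System.out.println(describe(m));
        }
    }
}
```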
Examples of PlaceGeocoder usage:
- BasicGeotemporalExtraction (Examples subproject)
- org.opensextant.extractors.geo.social: (Experimental) Geo-inferencing on Tweets. XponentsTextGeotagger and XponentsGeocoder are PlaceGeocoder applications driven by the demo SimpleProcessorDemo.
- Xlayer (subproject) XGeo REST service. This is a Restlet
application that provisions a PlaceGeocoder as a RESTful
extractor.
Class descriptions:
- Emit a boundary event when you come across a concrete reference to a boundary, e.g., county or state, district or prefecture.
- Country metrics.
- Connects to a Solr server via HTTP and tags place names in document.
- Apply this interface where application logic observes a coordinate or any hard location reference.
- A PlaceCandidate represents a portion of a document which has been identified as a possible named geographic location.
- Place metrics.
- A PlaceEvidence represents a fragment of evidence about a Place.
- SCOPE: where did this evidence come from with respect to the PlaceCandidate it is part of?
  APRIORI - derived from the gazetteer only, not from any information in the document
  LOCAL - directly associated with this instance of PC
  COREF - associated with another (related) PC in the document
  MERGED - came from the merger of multiple PlaceEvidences (future use)
  DOCUMENT - in the same document but has no other direct association
- A simple variation on the geocoding algorithms: geotag all possible things and determine a best geo-location for each tagged item.
- PostalGeocoder -- a GazetteerMatcher that uses the "postal" Solr index to quickly tag any known postal codes worldwide.
- Postal Tagger tags and returns any alphanumeric token or phrase that resembles postal codes and abbreviations.
- A class to hold a Place and a score together.
- Connects to a Solr server via HTTP and tags place names in document.