Xponents

Geographic Place, Date/time, and Pattern entity extraction toolkit along with text extraction from unstructured data and GIS outputters.


Geocoder’s Handbook for Xponents

Author: Marc Ubaldino

Copyright OpenSextant.org, 2017

Updated 2021

Video: “Discovering World Geography in Your Data”, presented at Lucene/Solr Revolution 2017 in Las Vegas on 14 September 2017. The 12-minute talk begins at minute 29:50 of the video.


Welcome to Xponents. We realize geocoding text or data can be tedious and mind-numbing. Hopefully this handbook will help you walk through the techniques defined in Xponents and the rest of OpenSextant in a way that makes it obvious which rules will impact geocoding of your data.

One important thing to note is that in all the information and entity extraction performed here, you will not see much discussion of traditional natural language processing (NLP), e.g., parts of speech, co-references, sentence boundaries, etc. Much of the language-specific processing is delegated to Solr and Lucene, which handle this reasonably well. Xponents APIs are then able to focus more on the critical extraction and encoding challenges. Because you have less per-language tuning to do up front, you can field a decent geotagging capability faster and discover what you do not know. You can refine language-specific performance later. In short, NLP mechanics and theory are very important; however, as a developer or integrator you need not be so concerned with them initially.

The main topics to cover are shown in Figure 1:

Figure 1. General topics in our geotagging workflow

In either topic we will encounter the concept of filters, which either negate or promote a finding. Lastly, evidence is any metadata that can be attached to a geotag or geocode to further support the choice of a location. Look for pointers to Xponents solutions, i.e., Java classes, in each topic.

The primary implementation for this handbook is the Java package org.opensextant.extractors.geo. The design of this package provides some good terminology for understanding the methodology here:

Tagging Conventions

Geo-inferencing and Geocoding Rules

Rules are organized and fired by a main program; the reference implementation here is PlaceGeocoder in the Xponents Extraction project. Some rules are fired generically in order, while others are fired separately. All rules (of type GeocodeRule) are evaluated (evaluate()) after the tagging has occurred. Tagging yields a list of PlaceCandidates, which may have been filtered by the tagging phase. Each candidate may also carry heuristics about the text, including whether the text is all upper case, all lower case, or pure ASCII vs. diacritic or non-ASCII. As rules fire, they contribute a rule label, a score increment and/or additional evidence to each candidate.
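The flow described above (tag, then fire rules that add a label and a score increment to each candidate) can be sketched roughly as follows. This is a minimal illustration and not the Xponents API: the names RulePipelineSketch, Candidate, Rule and demoRules, the rule labels, and the weights are all hypothetical, and the real PlaceCandidate and GeocodeRule classes carry far more state.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

final class RulePipelineSketch {

    /** Hypothetical stand-in for a tagged mention; NOT the Xponents PlaceCandidate class. */
    static final class Candidate {
        final String text;
        final boolean isUpper;  // has a letter and is unchanged by upper-casing
        final boolean isLower;  // has a letter and is unchanged by lower-casing
        final boolean isAscii;  // every character is 7-bit ASCII
        final List<String> ruleLabels = new ArrayList<>();
        double score = 0.0;

        Candidate(String text) {
            this.text = text;
            boolean hasLetter = text.chars().anyMatch(Character::isLetter);
            this.isUpper = hasLetter && text.equals(text.toUpperCase(Locale.ROOT));
            this.isLower = hasLetter && text.equals(text.toLowerCase(Locale.ROOT));
            this.isAscii = text.chars().allMatch(ch -> ch < 128);
        }

        /** Record that a rule fired and apply its score increment. */
        void addRule(String label, double increment) {
            ruleLabels.add(label);
            score += increment;
        }
    }

    /** Hypothetical rule contract, loosely modeled on the evaluate() step described above. */
    interface Rule {
        void evaluate(List<Candidate> candidates);
    }

    /** Two made-up rules; the labels and weights are illustrative assumptions only. */
    static List<Rule> demoRules() {
        Rule baseline = cands -> cands.forEach(c -> c.addRule("Sketch.Baseline", 1.0));
        Rule lowerCasePenalty = cands -> cands.stream()
                .filter(c -> c.isLower)
                .forEach(c -> c.addRule("Sketch.LowerCase", -0.5));
        return Arrays.asList(baseline, lowerCasePenalty);
    }

    /** Wrap each mention as a candidate, then fire every rule in order. */
    static List<Candidate> run(List<String> mentions, List<Rule> rules) {
        List<Candidate> candidates = new ArrayList<>();
        for (String mention : mentions) {
            candidates.add(new Candidate(mention));
        }
        for (Rule rule : rules) {
            rule.evaluate(candidates);
        }
        return candidates;
    }

    public static void main(String[] args) {
        for (Candidate c : run(Arrays.asList("San Antonio", "texas", "TX"), demoRules())) {
            System.out.println(c.text + " rules=" + c.ruleLabels + " score=" + c.score);
        }
    }
}
```

Keeping the text heuristics on the candidate means each rule can consult them without re-inspecting the raw text, which is the design choice this sketch tries to mirror.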

A final rule, a Location Chooser, assesses given evidence, context, rules and scores for each candidate. Ultimately the best score wins and a confidence (100 point scale) is associated with the choice to make it easier to compare geotagging and geocoding confidence across documents and data sets.
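As a rough illustration of "the best score wins", the sketch below ranks the scored locations for one candidate and keeps the winner and the runner-up. The class and method names (ChooserSketch, ScoredLocation, Choice, choose) are hypothetical and not part of Xponents. How the 100-point confidence is derived is not described in this section, so the sketch deliberately stops at the score margin and does not invent a confidence formula.

```java
import java.util.Arrays;
import java.util.List;

final class ChooserSketch {

    /** Hypothetical scored gazetteer entry for one candidate name. */
    static final class ScoredLocation {
        final String name;
        final String adminCode;
        final String country;
        final double score;

        ScoredLocation(String name, String adminCode, String country, double score) {
            this.name = name;
            this.adminCode = adminCode;
            this.country = country;
            this.score = score;
        }
    }

    /** The winning location plus the runner-up (null when there is only one location). */
    static final class Choice {
        final ScoredLocation best;
        final ScoredLocation runnerUp;

        Choice(ScoredLocation best, ScoredLocation runnerUp) {
            this.best = best;
            this.runnerUp = runnerUp;
        }

        /** Gap between first and second place; the full best score if unopposed. */
        double margin() {
            return runnerUp == null ? best.score : best.score - runnerUp.score;
        }
    }

    /** Single pass that tracks the two highest scores; returns null for an empty list. */
    static Choice choose(List<ScoredLocation> locations) {
        ScoredLocation best = null;
        ScoredLocation second = null;
        for (ScoredLocation loc : locations) {
            if (best == null || loc.score > best.score) {
                second = best;
                best = loc;
            } else if (second == null || loc.score > second.score) {
                second = loc;
            }
        }
        return best == null ? null : new Choice(best, second);
    }

    public static void main(String[] args) {
        // Scores copied from the San Antonio, TX debug output in this handbook.
        Choice choice = choose(Arrays.asList(
                new ScoredLocation("San Antonio", "24", "MX", 15.54),
                new ScoredLocation("San Antonio", "48", "US", 26.31)));
        System.out.println("best=" + choice.best.country + " margin=" + choice.margin());
    }
}
```

A wide margin between first and second place is one plausible input to a confidence measure, but treat that as an assumption of this sketch rather than a description of the actual Location Chooser.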

Examples

TestPlaceGeocoder is a test routine that helps execute discrete test and evaluation activities, for example:

All of these are means of feeding the geotagger to find out – in detail – how decisions are made, what is missed and what false positives are emitted. There may be serious, systematic errors in rules or just missing reference gazetteer data. All of these scenarios need to be assessed with library and reference data changes.

Let’s look at the style of output in debug mode. Logging is controlled by logback.xml; by default, geocoding and geotagging classes are in DEBUG mode.

Okay. Look at the text San Antonio, TX.

Consider variants that may change your mindset around decisions:

Return to the variant mentioned above: San Antonio, TX. Xponents teases this apart with an evolving set of rules. The important notes include:

The list goes on, and it keeps growing as more opportunities are observed. Many of those opportunities come with additional metadata or reference data. The raw output of the TestPlaceGeocoder evaluation is below:


MENTIONS ALL == 3
Name:Antonio, Type:taxon
	(filtered out: Antonio)
Name:San Antonio, TX, Type:generic
Rules = [
  Contains.PersonName, 
  AdminCode, 
  DefaultScore, 
  MajorPlace.Population, 
  CollocatedNames.boundary, 
  Feature, 
  Location.InAdmin]
	geocoded @ San Antonio (48, US, PPL), score=26.31 with conf=81, at [29.4241,-98.4936]
	geocoded @ San Antonio (24, MX, PPL), score=15.54 second place
Name:TX, Type:generic
Filtered Out.  Rules = [DefaultScore]
MENTIONS DISTINCT PLACES == 1
[San Antonio, TX]
MENTIONS COUNTRIES == 0
[]
MENTIONS COORDINATES == 0
[]
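When comparing many evaluation runs, it can help to pull the fields out of the "geocoded @" lines programmatically. The sketch below is a hypothetical helper (GeocodedLineSketch is not part of Xponents), and its regular expression is inferred only from the two sample lines above, so it may need adjusting if the debug format differs or changes.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class GeocodedLineSketch {
    // Pattern inferred from the sample output only. The confidence and
    // coordinate part is optional because runner-up lines omit it.
    private static final Pattern LINE = Pattern.compile(
            "geocoded @ (.+?) \\((\\w+), (\\w+), (\\w+)\\), score=([\\d.]+)"
            + "(?: with conf=(\\d+), at \\[(-?[\\d.]+),(-?[\\d.]+)\\])?");

    final String name;
    final String adminCode;
    final String country;
    final String featureCode;
    final double score;
    final Integer confidence;  // null when the line has no conf= part
    final Double lat;          // null when the line has no coordinates
    final Double lon;

    private GeocodedLineSketch(Matcher m) {
        name = m.group(1);
        adminCode = m.group(2);
        country = m.group(3);
        featureCode = m.group(4);
        score = Double.parseDouble(m.group(5));
        confidence = m.group(6) == null ? null : Integer.valueOf(m.group(6));
        lat = m.group(7) == null ? null : Double.valueOf(m.group(7));
        lon = m.group(8) == null ? null : Double.valueOf(m.group(8));
    }

    /** Returns the parsed fields, or null if the line is not a "geocoded @" line. */
    static GeocodedLineSketch parse(String line) {
        Matcher m = LINE.matcher(line);
        return m.find() ? new GeocodedLineSketch(m) : null;
    }

    public static void main(String[] args) {
        GeocodedLineSketch g = parse(
                "\tgeocoded @ San Antonio (48, US, PPL), score=26.31 with conf=81, at [29.4241,-98.4936]");
        System.out.println(g.name + " " + g.country + " conf=" + g.confidence);
    }
}
```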