Author: Marc Ubaldino
Copyright OpenSextant.org, 2017
Video: “Discovering World Geography in Your Data”, presented at Lucene/Solr Revolution 2017 in Las Vegas, 14 September 2017. The talk starts at minute 29:50 of the video and is a 12-minute talk.
Welcome to Xponents. We realize geocoding text or data can be tedious and mind-numbing. Hopefully this handbook will walk you through the techniques defined in Xponents and the rest of OpenSextant in a way that makes it obvious which rules will affect geocoding of your data.

One important thing to note is that, in all the information and entity extraction performed here, you will not see much discussion of traditional natural language processing (NLP), e.g., parts of speech, co-references, sentence boundaries, etc. Much of the language-specific processing is delegated to Solr and Lucene, which handle this reasonably well. Xponents APIs are then able to focus more on the critical extraction and encoding challenges. Because you have less per-language tuning to do up front, you can field a decent geotagging capability faster and discover what you do not know. You can refine language-specific performance later. In short, NLP mechanics and theory are very important; however, as a developer or integrator you need not be so concerned with them initially.
These are the main topics to cover (Figure 1):
In either topic we will encounter the concept of filters that either negate or promote a finding.
Lastly, evidence is any metadata that can be attached to a geotag or geocode to further support the choice of a location.
Look for pointers to Xponents solutions, aka Java classes, in each topic.
The primary implementation for this handbook is the Java package
The design of this package provides some good terminology for understanding the methodology here:
TextInput class to carry basic metadata like language ID and a Unicode text buffer; however, language detection must be done externally, as it differs based on the type and length of text.
SolrGazetteer.placesAt() method to reveal the nearest cities, the province and country containing the location, or the fact that the coordinate is not near anything (e.g., over water). The Xponents Patterns project has the XCoord extractor, which detects and geocodes the coordinate patterns listed in the XCoord Patterns reference manual.
GazetteerMatcher class exposes the SolrTextTagger capability indirectly by issuing a Solr request. The Gazetteer is the primary source of consolidated gazetteer data.
Detroit City Council is an organization name that contains a city name, whereas the Smithfield Group may be an institution not actually located in a place called Smithfield. Either way, it is important to be able to detect all such cases so that negation or amplification can be handled all at once. The Xponents Extraction project provides TaxonMatcher, aka XTax, which provisions a lexicon of such “well-known” non-places of your choosing or the default set used by Xponents.
Country class that makes use of a timezone table (source: Geonames.org), which helps infer one or more country codes for a given timezone. The Xponents Patterns XTemporal extractor can be used to detect and code dates and times if a date/time/timezone is not already provided. The important part here is to recognize the innate time and timezone in the original data, not the UTC time.
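To illustrate why the original time and timezone matter, here is a minimal sketch using only the standard java.time API. It is not the XTemporal extractor or the Country class; the class name LocalTimeEvidence and the sample timestamp are assumptions made for illustration.

```java
import java.time.Instant;
import java.time.OffsetDateTime;
import java.time.ZoneOffset;

final class LocalTimeEvidence {
    /** Parses an ISO-8601 timestamp and keeps the UTC offset it was written with. */
    static OffsetDateTime parseKeepingOffset(String isoText) {
        return OffsetDateTime.parse(isoText);
    }

    public static void main(String[] args) {
        // Hypothetical timestamp found in the source data.
        OffsetDateTime local = parseKeepingOffset("2017-09-14T10:30:00-07:00");

        // The offset is the geographic hint: it narrows down where the data originated.
        ZoneOffset offset = local.getOffset();

        // Converting to an Instant keeps the moment in time but discards the offset.
        Instant utc = local.toInstant();

        System.out.println("local=" + local + " offset=" + offset + " utc=" + utc);
    }
}
```

Once the value is normalized to UTC, the offset can no longer be used to look up candidate countries, so it should be captured as evidence before any normalization.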
Rules are organized and fired by a main program; the reference implementation here is the Xponents Extraction project. Some rules are fired generically in order, while others are fired separately.
All rules (of type GeocodeRule) are evaluated (evaluate()) after the tagging has occurred. Tagging yields a list of PlaceCandidates, which may have been filtered by the tagging phase. Each candidate may also carry heuristics about the text, including whether the text is all upper case, all lower case, or pure ASCII vs. diacritic or non-ASCII. As rules fire, they contribute a rule label, a score increment and/or additional evidence to each candidate.
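The flow described above can be sketched roughly as follows. This is a simplified stand-in, not the Xponents GeocodeRule or PlaceCandidate API: the Candidate and Rule types, the MixedCase rule and all score increments are hypothetical. Only the DefaultScore label is taken from the debug output shown later in this handbook. The sketch needs Java 8 or later.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class RuleFiringSketch {
    /** Hypothetical stand-in for a tagged candidate; not the Xponents PlaceCandidate class. */
    static final class Candidate {
        final String text;
        final List<String> rules = new ArrayList<>();
        double score = 0.0;

        Candidate(String text) { this.text = text; }

        /** Records the rule label and adds its score increment. */
        void addRule(String label, double increment) {
            rules.add(label);
            score += increment;
        }
    }

    /** Hypothetical rule contract: each rule sees every candidate after tagging. */
    interface Rule {
        void evaluate(List<Candidate> candidates);
    }

    /** Every candidate gets a baseline score (the increment is made up). */
    static final Rule DEFAULT_SCORE = candidates -> {
        for (Candidate c : candidates) c.addRule("DefaultScore", 1.0);
    };

    /** Made-up text heuristic: mixed-case mentions earn an extra increment. */
    static final Rule MIXED_CASE = candidates -> {
        for (Candidate c : candidates) {
            boolean allUpper = c.text.equals(c.text.toUpperCase(Locale.ROOT));
            boolean allLower = c.text.equals(c.text.toLowerCase(Locale.ROOT));
            if (!allUpper && !allLower) c.addRule("MixedCase", 2.0);
        }
    };

    /** Fires the rules in order; each one may label and re-score the candidates. */
    static void fire(List<Rule> rules, List<Candidate> candidates) {
        for (Rule rule : rules) rule.evaluate(candidates);
    }
}
```

Keeping the label list next to the score is what makes a debug listing of fired rules per candidate possible.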
A final rule, a Location Chooser, assesses given evidence, context, rules and scores for each candidate. Ultimately the best score wins and a confidence (100 point scale) is associated with the choice to make it easier to compare geotagging and geocoding confidence across documents and data sets.
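As a rough illustration of the final choice, the sketch below picks the highest-scoring candidate and maps its lead over the runner-up onto a 0 to 100 scale. The confidence formula is an assumption made purely for illustration; it is not the formula Xponents uses and will not reproduce its confidence values. The class and method names are hypothetical.

```java
import java.util.List;

final class ChooserSketch {
    /** Hypothetical scored location; not an Xponents class. */
    static final class Scored {
        final String label;
        final double score;

        Scored(String label, double score) {
            this.label = label;
            this.score = score;
        }
    }

    /** Returns the highest-scoring candidate, or null for an empty list; ties keep the earlier one. */
    static Scored choose(List<Scored> candidates) {
        Scored best = null;
        for (Scored c : candidates) {
            if (best == null || c.score > best.score) best = c;
        }
        return best;
    }

    /**
     * Illustrative confidence only (an assumption, not the Xponents formula):
     * the winner's share of the top two scores, scaled to 0..100.
     */
    static int confidence(List<Scored> candidates) {
        Scored best = choose(candidates);
        if (best == null || best.score <= 0) return 0;
        double runnerUp = 0.0;
        for (Scored c : candidates) {
            if (c != best) runnerUp = Math.max(runnerUp, c.score);
        }
        return (int) Math.round(100.0 * best.score / (best.score + runnerUp));
    }
}
```

Whatever formula is used, putting confidence on a fixed scale is what allows comparison across documents, since raw scores depend on how many rules happened to fire.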
San Fran, CA
city of, etc. preceding the name of populated places (feature types P/PPL) score those candidates higher. Likewise, province preceding or following a name scores candidates that are Level-1 administrative boundaries higher. This could be enhanced by having table-driven rules per language.
SolrGazetteer.placesAt provides a simple recursive reverse lookup from a location to its containing boundaries or nearby places. A 10 km radius is tried first, then a 30 km radius, looking for the closest match.
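The widening-radius idea can be sketched without Solr as follows. This is not the SolrGazetteer.placesAt implementation: the Place type, the in-memory list and the use of a loop rather than recursion are all assumptions, and distance is computed with the standard haversine formula on a spherical Earth.

```java
import java.util.List;

final class ReverseLookupSketch {
    /** Hypothetical gazetteer entry; not an Xponents class. */
    static final class Place {
        final String name;
        final double lat;
        final double lon;

        Place(String name, double lat, double lon) {
            this.name = name;
            this.lat = lat;
            this.lon = lon;
        }
    }

    /** Great-circle distance in kilometers (haversine, mean Earth radius 6371 km). */
    static double haversineKm(double lat1, double lon1, double lat2, double lon2) {
        double dLat = Math.toRadians(lat2 - lat1);
        double dLon = Math.toRadians(lon2 - lon1);
        double a = Math.sin(dLat / 2) * Math.sin(dLat / 2)
                + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
                * Math.sin(dLon / 2) * Math.sin(dLon / 2);
        return 2 * 6371.0 * Math.asin(Math.sqrt(a));
    }

    /** Tries a 10 km radius, then 30 km; returns null when nothing is nearby (e.g., over water). */
    static Place nearest(List<Place> gazetteer, double lat, double lon) {
        for (double radiusKm : new double[] {10.0, 30.0}) {
            Place best = null;
            double bestKm = Double.MAX_VALUE;
            for (Place p : gazetteer) {
                double km = haversineKm(lat, lon, p.lat, p.lon);
                if (km <= radiusKm && km < bestKm) {
                    best = p;
                    bestKm = km;
                }
            }
            if (best != null) return best;
        }
        return null;
    }
}
```

With a linear scan the two passes are redundant; the point of starting small is that, against a real spatial index, the narrow query is cheaper and usually sufficient.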
COUNTRY.ADM1.ADM2 is known as a hierarchical tree that represents containment of boundaries in a lexical string. So, USA.06.4221 is county (district) # 4221 in “California” (06), “USA”. A city in that district will have the same hierarchical coding, whereas a city of the same name in a different country would not.
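Comparing two such paths level by level shows how much containment two places share. The helper below is a minimal sketch; the class and method names are hypothetical, and it assumes paths are dot-separated exactly as in the example above.

```java
final class HierarchyPathSketch {
    /** Builds a COUNTRY.ADM1.ADM2 path from its parts. */
    static String path(String country, String adm1, String adm2) {
        return country + "." + adm1 + "." + adm2;
    }

    /**
     * Counts the leading levels two paths have in common.
     * Comparison stops at the first mismatch, so an identical ADM1 code
     * under a different country does not count as shared.
     */
    static int sharedLevels(String a, String b) {
        String[] x = a.split("\\.");
        String[] y = b.split("\\.");
        int n = 0;
        while (n < x.length && n < y.length && x[n].equals(y[n])) {
            n++;
        }
        return n;
    }
}
```

For example, sharedLevels("USA.06.4221", "USA.06.0001") counts country and province as shared but not the district, which is the kind of signal a collocation rule can turn into evidence.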
filter out) tagged place names. Specific rules such as R2. Name, CODE run ahead of this rule to ensure that situations such as “Eugene, OR” are not filtered out as a person name (i.e., “Eugene”), because it is a well-qualified place mention.
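The ordering can be sketched as a check that runs before the person-name filter. The lexicons, the regular expression and the class below are hypothetical and far smaller than anything Xponents uses; the point is only that a qualifying code protects the name from being filtered.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class QualifiedNameSketch {
    // Tiny hypothetical lexicons, for illustration only.
    static final Set<String> PERSON_NAMES = new HashSet<>(Arrays.asList("Eugene", "Antonio"));
    static final Set<String> ADMIN_CODES = new HashSet<>(Arrays.asList("OR", "TX", "CA"));

    // A comma followed by a two-letter upper-case code, directly after the name.
    static final Pattern CODE_AFTER_NAME = Pattern.compile("^\\s*,\\s*([A-Z]{2})\\b");

    /** True when the text right after the name is ", CODE" with a known admin code. */
    static boolean isQualifiedByCode(String textAfterName) {
        Matcher m = CODE_AFTER_NAME.matcher(textAfterName);
        return m.find() && ADMIN_CODES.contains(m.group(1));
    }

    /** The Name, CODE check runs first; only unqualified names reach the person-name filter. */
    static boolean keepAsPlace(String name, String textAfterName) {
        if (isQualifiedByCode(textAfterName)) return true;
        return !PERSON_NAMES.contains(name);
    }
}
```

Here keepAsPlace("Eugene", ", OR") keeps the mention, while keepAsPlace("Eugene", " went home") filters it as a person name.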
TestPlaceGeocoder is a test routine that helps execute discrete test and evaluation activities, for example:
All of these are means of feeding the geotagger to find out – in detail – how decisions are made, what is missed and what false positives are emitted. There may be serious, systematic errors in rules or just missing reference gazetteer data. All of these scenarios need to be assessed with library and reference data changes.
Let’s look at the style of output in debug mode.
controls logging: By default geocoding and geotagging classes are in
Okay. Look at the text
San Antonio, TX.
Consider variants that may change your mindset around decisions:
S. Antonio, TX
SAN ANTONIO TX
SAN ANTONIO TEXAS
San Antonio, Texas, Mexico
San Antonio near the Texas-Mexico border
San \n Antonio, Austin and other cities in southern Texas
san antonio, texas
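Several of these variants differ only in letter case or character set, which are the kinds of text heuristics a candidate can carry. The sketch below computes such flags with plain Java; the class and method names are hypothetical and it inspects UTF-16 chars only, which is enough for the examples here.

```java
final class TextFlagsSketch {
    /** True when the text has at least one letter and no letter is lower case. */
    static boolean isUpper(String text) {
        boolean sawLetter = false;
        for (int i = 0; i < text.length(); i++) {
            char ch = text.charAt(i);
            if (Character.isLetter(ch)) {
                sawLetter = true;
                if (!Character.isUpperCase(ch)) return false;
            }
        }
        return sawLetter;
    }

    /** True when the text has at least one letter and no letter is upper case. */
    static boolean isLower(String text) {
        boolean sawLetter = false;
        for (int i = 0; i < text.length(); i++) {
            char ch = text.charAt(i);
            if (Character.isLetter(ch)) {
                sawLetter = true;
                if (!Character.isLowerCase(ch)) return false;
            }
        }
        return sawLetter;
    }

    /** True when every character is in the 7-bit ASCII range. */
    static boolean isAscii(String text) {
        for (int i = 0; i < text.length(); i++) {
            if (text.charAt(i) > 127) return false;
        }
        return true;
    }
}
```

For "SAN ANTONIO TX" the upper-case flag is set, so a two-letter token such as "TX" carries less weight as a code than it does in the mixed-case "San Antonio, TX".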
Return to the variant mentioned above:
San Antonio, TX.
Xponents teases this apart with an evolving set of rules. The important notes include:
NAME, ADMIN-CODE where the administrative boundary code represents the place that contains the named place,
San Antonio number in the range of 200, but the city in Texas, USA has a significant population
P/PPL coding for example) and administrative boundaries (
Austin helps improve the confidence around the connection between cities or sites located in the same district or province or other spatial proximity.
Antonio. But if the location name appears as a subset of the tag, then we should ignore the location tag, e.g.
San Antonio Pharma Group is a likely company, possibly with no specific geographic reference.
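The "subset of the tag" test comes down to comparing character offsets of tagged spans. The sketch below is an assumption about how such a check could look, with a hypothetical Span type; it is not the Xponents matcher API.

```java
final class SpanSketch {
    /** Hypothetical tagged span: text, type, and [start, end) character offsets. */
    static final class Span {
        final String text;
        final String type;
        final int start;
        final int end;

        Span(String text, String type, int start, int end) {
            this.text = text;
            this.type = type;
            this.start = start;
            this.end = end;
        }
    }

    /** True when inner lies wholly inside outer and is strictly shorter. */
    static boolean isWithin(Span inner, Span outer) {
        boolean contained = inner.start >= outer.start && inner.end <= outer.end;
        boolean shorter = (inner.end - inner.start) < (outer.end - outer.start);
        return contained && shorter;
    }

    /** A place mention is ignored when a longer non-place tag (e.g., an organization) covers it. */
    static boolean ignorePlace(Span place, Span taxon) {
        return isWithin(place, taxon);
    }
}
```

In "San Antonio Pharma Group", the place span for "San Antonio" covers offsets 0 to 11 and the organization span covers 0 to 24, so the place mention is ignored. The same test run in the other direction explains why a person-name tag for "Antonio" is dropped when it sits inside the place mention "San Antonio".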
The list goes on. The list is always growing as more opportunities are observed. Lots of those opportunities come with additional metadata or reference data. The raw output of the TestPlaceGeocoder evaluation is below:
MENTIONS ALL == 3
Name:Antonio, Type:taxon
    (filtered out: Antonio)
Name:San Antonio, TX, Type:generic
    Rules = [ Contains.PersonName, AdminCode, DefaultScore, MajorPlace.Population, CollocatedNames.boundary, Feature, Location.InAdmin]
    geocoded @ San Antonio (48, US, PPL), score=26.31 with conf=81, at [29.4241,-98.4936]
    geocoded @ San Antonio (24, MX, PPL), score=15.54 second place
Name:TX, Type:generic
    Filtered Out. Rules = [DefaultScore]
MENTIONS DISTINCT PLACES == 1
[San Antonio, TX]
MENTIONS COUNTRIES == 0
MENTIONS COORDINATES == 0