Geographic Place, Date/time, and Pattern entity extraction toolkit along with text extraction from unstructured data and GIS outputters.
Author: Marc Ubaldino
Copyright OpenSextant.org, 2017
Updated 2021
Video: “Discovering World Geography in Your Data”, presented at Lucene/Solr Revolution 2017 in Las Vegas, 14 September 2017. The talk starts at minute 29:50 of the video and runs 12 minutes.
Welcome to Xponents. We realize geocoding text or data can be tedious and mind-numbing. Hopefully this handbook will help you walk through the techniques defined in Xponents and the rest of OpenSextant in a way that makes it obvious which rules will impact geocoding of your data. One important thing to note is that, for all the information and entity extraction performed here, you will not see much discussion of traditional natural language processing (NLP), e.g., parts of speech, co-references, sentence boundaries, etc. Much of the language-specific processing is delegated to Solr and Lucene, which handle this reasonably well. Xponents APIs are then able to focus more on the critical extraction and encoding challenges. Because you have less per-language tuning to do up front, you can field a decent geotagging capability faster and discover what you do not know. You can refine language-specific performance later. In conclusion, NLP mechanics and theory are very important; however, as a developer or integrator you need not be so concerned with them initially.
There are two main topics to cover (Figure 1):
In either topic we will encounter the concept of filters that either negate or promote a finding.
Lastly, evidence is any metadata that can be attached to a geotag or geocode to further support the choice of a location. Look for pointers to Xponents solutions, i.e., Java classes, in each topic.
The primary implementation for this handbook is the Java package org.opensextant.extractors.geo.
The design of this package provides some good terminology to understand the methodology here:
TextInput class: carries basic metadata such as the language ID and a Unicode text buffer. Language detection must be done externally, however, as it differs based on the type and length of text.
SolrGazetteer.placesAt() method: reveals the nearest cities, the province and country containing the location, or the fact that the coordinate is not near anything (e.g., over water). The Xponents Patterns project has the XCoord extractor, which detects and geocodes the coordinate patterns listed in the XCoord Patterns reference manual.
GazetteerMatcher class: exposes the SolrTextTagger capability indirectly by issuing a Solr request.
The Gazetteer is the primary source of consolidated gazetteer data.
Detroit City Council is an organization name that contains a city name, whereas the Smithfield Group may be an institution not actually located in a place called Smithfield. Either way, it is important to be able to detect all such cases in order to work with negation or amplification all at once. The Xponents Extraction project provides TaxonMatcher, aka XTax, which provisions a lexicon of such “well-known” non-places of your choosing, or the default set used by Xponents.
Country class: makes use of a timezone table (source: Geonames.org), which helps infer one or more country codes for a given timezone. The Xponents Patterns XTemporal extractor can be used to detect and code dates and times if a date/time/timezone is not already provided. The important part here is to recognize the innate time and timezone in the original data, not the UTC time.
Rules are organized and fired by some main program; the reference implementation here is PlaceGeocoder in the Xponents Extraction project. Some rules are fired generically in order, while others are fired separately.
All rules (of type GeocodeRule) are evaluated (evaluate()) after the tagging has occurred. Tagging yields a list of PlaceCandidates, which may have been filtered by the tagging phase. Each candidate may also carry heuristics about the text, including whether the text is all upper case, all lower case, or pure ASCII versus diacritic or non-ASCII. As rules fire, they contribute a rule label, a score increment, and/or additional evidence to each candidate.
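The rule-firing flow described above can be sketched in a few lines of Java. This is a minimal illustration, not the Xponents API: the Candidate class, the Rule interface, the rule labels and the score weights below are all hypothetical stand-ins for PlaceCandidate and GeocodeRule, chosen only to show how rules add labels and score increments to each candidate after tagging.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class RulePipelineSketch {
    /** Baseline rule: every candidate gets a starting score (the weight is an assumption). */
    static final Rule DEFAULT_SCORE = cands -> cands.forEach(c -> c.addRule("DefaultScore", 1.0));

    /** Hypothetical rule: mixed-case text earns a small boost over all-upper or all-lower text. */
    static final Rule CASE_HEURISTIC = cands -> {
        for (Candidate c : cands) {
            if (!c.isUpper() && !c.isLower()) {
                c.addRule("MixedCase", 0.5);
            }
        }
    };

    /** Fires each rule, in order, over the full list of tagged candidates. */
    static void run(List<Rule> rules, List<Candidate> candidates) {
        for (Rule r : rules) {
            r.evaluate(candidates);
        }
    }

    public static void main(String[] args) {
        List<Candidate> cands = List.of(new Candidate("San Antonio"), new Candidate("TX"));
        run(List.of(DEFAULT_SCORE, CASE_HEURISTIC), cands);
        for (Candidate c : cands) {
            System.out.println(c.text + " rules=" + c.rules + " score=" + c.score);
        }
    }
}

/** Hypothetical rule contract, loosely modeled on the idea of an evaluate() step. */
interface Rule {
    void evaluate(List<Candidate> candidates);
}

/** Hypothetical stand-in for a tagged place-name candidate. */
class Candidate {
    final String text;
    final List<String> rules = new ArrayList<>(); // labels of rules that fired
    double score = 0.0;

    Candidate(String text) {
        this.text = text;
    }

    // Text heuristics of the kind mentioned above.
    boolean isUpper() { return hasLetter() && text.equals(text.toUpperCase(Locale.ROOT)); }
    boolean isLower() { return hasLetter() && text.equals(text.toLowerCase(Locale.ROOT)); }
    boolean isAscii() { return text.chars().allMatch(ch -> ch < 128); }
    private boolean hasLetter() { return text.chars().anyMatch(Character::isLetter); }

    /** Records the rule label and applies its score increment. */
    void addRule(String label, double increment) {
        rules.add(label);
        score += increment;
    }
}
```

Keeping the rule label alongside the score increment is what makes a debug trace readable later: each candidate carries the list of rules that contributed to its final score.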
A final rule, a Location Chooser, assesses given evidence, context, rules and scores for each candidate. Ultimately the best score wins and a confidence (100 point scale) is associated with the choice to make it easier to compare geotagging and geocoding confidence across documents and data sets.
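As a rough sketch of the chooser step, the code below picks the highest-scoring location and attaches a confidence on a 100-point scale. The ScoredLocation class is hypothetical, and the confidence formula (the winner's share of the top two scores) is purely an assumption made for illustration; the handbook does not state how Xponents computes confidence, and this sketch will not reproduce its values.

```java
import java.util.List;

class ChooserSketch {
    /** Hypothetical scored location; the fields are illustrative only. */
    static final class ScoredLocation {
        final String label;
        final double score;

        ScoredLocation(String label, double score) {
            this.label = label;
            this.score = score;
        }
    }

    /** Best score wins. Returns null when there are no locations. */
    static ScoredLocation choose(List<ScoredLocation> locations) {
        ScoredLocation best = null;
        for (ScoredLocation loc : locations) {
            if (best == null || loc.score > best.score) {
                best = loc;
            }
        }
        return best;
    }

    /**
     * Assumed confidence measure on a 0..100 scale: the winner's share of the
     * combined score of the winner and the runner-up. Not the Xponents formula.
     */
    static int confidence(List<ScoredLocation> locations) {
        ScoredLocation best = choose(locations);
        if (best == null || best.score <= 0) {
            return 0;
        }
        double runnerUp = 0.0;
        for (ScoredLocation loc : locations) {
            if (loc != best && loc.score > runnerUp) {
                runnerUp = loc.score;
            }
        }
        return (int) Math.round(100.0 * best.score / (best.score + runnerUp));
    }

    public static void main(String[] args) {
        List<ScoredLocation> locs = List.of(
                new ScoredLocation("City A", 20.0),
                new ScoredLocation("City B", 5.0));
        // prints "City A conf=80"
        System.out.println(choose(locs).label + " conf=" + confidence(locs));
    }
}
```

Whatever the actual formula, normalizing to a fixed scale is what allows the confidence of one choice to be compared with another across documents, which raw scores do not allow.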
San Fran, CA
Phrases such as town of, city of, etc. preceding the name of populated places (feature types P/PPL) score those candidates higher. Likewise, province preceding or following a name scores higher those candidates that are Level-1 administrative boundaries. This could be enhanced by having table-driven rules per language.
SolrGazetteer.placesAt provides a simple recursive reverse lookup from a location to its containing boundaries or nearby places. First a 10 KM radius, then a 30 KM radius is tried, looking for the closest match.
COUNTRY.ADM1.ADM2 is known as a hierarchical tree that represents containment of boundaries in a lexical string. So, USA.06.4221 is county (district) # 4221 in “California” (06), “USA”. A city in that district will have the same hierarchical coding, whereas a city of the same name in a different country would not.
Tagged place names that match person names are filtered out. Specific rules such as R2. Name, CODE run ahead of this rule to ensure that situations such as “Eugene, OR” are not filtered out as a person name (i.e., “Eugene”), because it is a well-qualified place mention.
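The COUNTRY.ADM1.ADM2 coding described above lends itself to simple string operations. The following is a minimal sketch, assuming hypothetical helper names (path, isWithin) that are not part of Xponents: it builds the hierarchical string and tests containment with a prefix check.

```java
class HierarchySketch {
    /** Joins the non-empty levels with '.', e.g. ("USA", "06", "4221") gives "USA.06.4221". */
    static String path(String country, String adm1, String adm2) {
        StringBuilder sb = new StringBuilder(country);
        for (String level : new String[] {adm1, adm2}) {
            if (level == null || level.isEmpty()) {
                break; // deeper levels are meaningless without their parent
            }
            sb.append('.').append(level);
        }
        return sb.toString();
    }

    /** True if 'inner' is the same boundary as 'outer' or is contained by it. */
    static boolean isWithin(String inner, String outer) {
        return inner.equals(outer) || inner.startsWith(outer + ".");
    }

    public static void main(String[] args) {
        String county = path("USA", "06", "4221");
        System.out.println(county);                                // USA.06.4221
        System.out.println(isWithin(county, path("USA", "06", null))); // true
        System.out.println(isWithin(county, "MEX.06"));            // false
    }
}
```

The trailing '.' in the prefix check matters: without it, a code such as USA.061 would wrongly appear to sit inside USA.06.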
TestPlaceGeocoder is a test routine that helps execute discrete test and evaluation activities, for example:
/data/placename-tests.txt (in src/test/resources).
All of these are means of feeding the geotagger to find out, in detail, how decisions are made, what is missed, and what false positives are emitted. There may be serious, systematic errors in rules or just missing reference gazetteer data. All of these scenarios need to be assessed with library and reference data changes.
Let’s look at the style of output in debug mode. logback.xml controls logging; by default, geocoding and geotagging classes are in DEBUG mode.
Okay. Look at the text San Antonio, TX.
Consider variants that may change your mindset around decisions:
S. Antonio, TX
SAN ANTONIO TX
SAN ANTONIO TEXAS
San Antonio, Texas, Mexico
San Antonio near the Texas-Mexico border
San \n Antonio, Austin and other cities in southern Texas
san antonio, texas
SanAnt
Returning to the variant mentioned above: San Antonio, TX.
Xponents teases this apart with an evolving set of rules. The important notes include:
The pattern NAME, ADMIN-CODE, where the administrative boundary code represents the place that contains the named place, NAME.
Places named San Antonio number in the range of 200, but the city in Texas, USA has a significant population.
Feature types distinguish populated places (P/PPL coding, for example) from administrative boundaries (A/ADM1 for TX or Texas, USA).
A collocated name such as Austin helps improve confidence in the connection between cities or sites located in the same district or province, or in other spatial proximity.
The name San Antonio contains a person name, Antonio. But if the location name appears as a subset of the tag, then we should ignore the location tag; e.g., San Antonio Pharma Group is likely a company, possibly with no specific geographic reference.
The list goes on, and it is always growing as more opportunities are observed. Many of those opportunities come with additional metadata or reference data. The raw output of the TestPlaceGeocoder evaluation is below:
MENTIONS ALL == 3
Name:Antonio, Type:taxon
(filtered out: Antonio)
Name:San Antonio, TX, Type:generic
Rules = [
Contains.PersonName,
AdminCode,
DefaultScore,
MajorPlace.Population,
CollocatedNames.boundary,
Feature,
Location.InAdmin]
geocoded @ San Antonio (48, US, PPL), score=26.31 with conf=81, at [29.4241,-98.4936]
geocoded @ San Antonio (24, MX, PPL), score=15.54 second place
Name:TX, Type:generic
Filtered Out. Rules = [DefaultScore]
MENTIONS DISTINCT PLACES == 1
[San Antonio, TX]
MENTIONS COUNTRIES == 0
[]
MENTIONS COORDINATES == 0
[]
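The NAME, ADMIN-CODE association that separates the two San Antonio candidates in this output can be sketched as follows. This is an illustration under stated assumptions, not the Xponents AdminCode rule: the class name, the method name and the small abbreviation table are hypothetical, and the table holds only two sample entries (the codes 48 and 06 are taken from the examples in this handbook).

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NameCodeSketch {
    /** Hypothetical lookup from a postal abbreviation to a COUNTRY.ADM1 code. */
    static final Map<String, String> ADM1_BY_ABBREV = Map.of("TX", "US.48", "CA", "US.06");

    /**
     * True if the text contains "name, CODE" and CODE resolves to the same
     * ADM1 boundary as the candidate location.
     */
    static boolean adminCodeMatches(String text, String name, String candidateAdm1) {
        // The name as tagged, a comma, then a 2-3 letter upper-case code.
        Pattern p = Pattern.compile(Pattern.quote(name) + "\\s*,\\s*([A-Z]{2,3})\\b");
        Matcher m = p.matcher(text);
        if (!m.find()) {
            return false;
        }
        String adm1 = ADM1_BY_ABBREV.get(m.group(1));
        return adm1 != null && adm1.equals(candidateAdm1);
    }

    public static void main(String[] args) {
        String text = "San Antonio, TX";
        System.out.println(adminCodeMatches(text, "San Antonio", "US.48")); // true
        System.out.println(adminCodeMatches(text, "San Antonio", "MX.24")); // false
    }
}
```

A sketch this small ignores variants such as SAN ANTONIO TX (no comma) or San Antonio, Texas (a name rather than a code); handling those would need further rules of the kind listed above.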