Geographic Place, Date/time, and Pattern entity extraction toolkit along with text extraction from unstructured data and GIS outputters.
Back to OpenSextant
About our nomenclature – OpenSextant is a family of projects for geotagging and other NLP and information extraction work. Xponents is an actively developed implementation of our OpenSextant mindset around geotagging: be accurate, be simple, be extensible and show the work.
Xponents is a set of information extraction libraries including to extract and normalize geographic entities, date/time patterns, keywords/taxonomies, and various patterns. For example as depicted in Figure 1 where a tourist spots in the French country side are detected and geolocated. That’s easy
when there is only one such known location with that name. It becomes challenging when we try to detect such names in any language in any part of the world. Is “McDonald’s” a farm or a restaurant?
Which one is the right one – there’s thousands of restaurant locations by that name.
Figure 1. A General Tagging and Coding Paradigm
This table below loosely portrays the scenarios in which Xponents operates – parsing and conditioning knowable geo/temporal references in text into usable data structures.
input text | notional output with normalization |
---|---|
“Boise, ID is fun!” | Place names, geocoded:geo = { match:"Boise ID", adm1:"US.16", lat=43.61, lon=-116.20, feat_code:"PPL", confidence=78} |
“Born on 30 DECIEMBRE 1990 … “ | Normalized date/time:date = { match="30 DECIEMBRE 1990", date_norm="1990-12-30"} |
“Epicenter at 01°44’N 101°22’E …“ | Geo Coordinates:coord = { match="01°44'N 101°22'E", lat=1.733, lon=101.367, pattern="DM-01"} |
“The Swiss delegation…“ | Keywords:taxon = { match="Swiss", id="nationality.CHI" } |
“User accessed IP 233.12.0.11” | Patterns:pattern= { match="233.12.0.11", pattern_id="IPADDRESS" } |
Define your own patterns or compose your own Extractor apps. As a Java API, the following application classes implement the extraction above:
Here are some fast-tracks for applying Xponents:
A. Geotagging and everything – Deploy the Docker service as prescribed on our Docker Hub. Consult at least the Python client opensextant.xlayer
as noted in the Docker page and in the Python setup below.
B. Pattern extraction – The Python library opensextant.FlexPat
or its Java counterpart org.opensextant.extractors.flexpat
– offer a lean and effective manner to develop a regular-expression
pipeline. In either case minimal dependencies are needed. See Python setup below. Complete FlexPat
overview is available at Patterns.md
C. Geotagging and everything,…. but for some reason you feel that you need to build
it all yourself. You’ll need to follow the notes here in BUILD.md
and in ./solr/README.md
.
The typical approach is to deploy the docker instance of xponents and interact with it using the Python client, opensextant.xlayer
. The xponents-service.sh
demonstrates how to run the
REST service with or without Docker.
“Discoverying World Geography in Your Data”, presented at Lucene/Solr Revolution 2017 in Las Vegas 14 September, 2017. In video, at minute 29:50. This is a 12 minute talk
The Geocoder Handbook represents the Xponents methodology to geotagging and geocoding that pertains to coordinate extraction and general geotagging, i.e., XCoord and PlaceGeocoder, respectively.
Using XCoord and PlaceGeocoder here are two examples of extracting geographic entities from this made-up text:
String text = "Earthquake epicenter occurred at 39.56N, -123.45W or "+
"an hour west of the Mendocino National Forest ";
// INIT
//==================
XCoord xcoord = new XCoord();
xcoord.configure();
SolrGazetteer gaz = new SolrGazetteer();
// EXTRACT
//==================
List<TextMatch> coords = xcoord.extract( text );
for (TextMatch match : coords) {
/* if match instanceof GeocoordMatch do something.
*/
print("FOUND:" + match);
print("Near named place " + gaz.placeAt(match));
}
/* "Do something" might produce this output: print the location found could
* be reverse geocoded to Arnold, CA, US
*/
-=-=-=-=-=-=-==-=-=-=-=-=-=
FOUND: 39.56N, -123.45W @(33:49) matched by DD-02
Near named place Arnold (ADM1=06, CC=US, FEAT=PPL)
-=-=-=-=-=-=-==-=-=-=-=-=-=
Now with the same text as above, the second and more complex example applies the PlaceGeocoder:
// INIT
//==================
tagger = new PlaceGeocoder();
Parameters xponentsParams = new Parameters();
xponentsParams.resolve_localities = true;
tagger.setParameters(xponentsParams);
tagger.configure();
/* In this example, ""Mendocino National Forest" is found and is
* coded as { cc=US, adm1="06",...} representing that the forest
* is in California ("US.06"). To actually resolve "US.06" to a named
* province we need the resolve_localities flag ON. There is a small
* performance hit which adds up if you do this for every place found.
*/
// EXTRACT
//==================
List<TextMatch> allPlaces = tagger.extract( text );
for (TextMatch match : allPlaces) {
/* PlaceGeocoder yields many types of entities!!
if match instanceof GeocoordMatch, PlaceCandidate, TaxonMatch, etc.
then do something.
*/
}
/* These examples print to stdout, but imagine saving to a database,
* exporting KML or a spreadsheet on the fly
*/
-=-=-=-=-=-=-STDOUT==-=-=-=-=-=-=
....
Name:Mendocino National Forest
Rules = [DefaultScore]
geocoded @ Mendocino National Forest (06, US, FRST), score=20.56 with conf=93
MENTIONS DISTINCT PLACES == 1
[Mendocino National Forest]
MENTIONS COUNTRIES == 0
[]
MENTIONS COORDINATES == 1
[39.56N, -123.45W]
....
-=-=-=-=-=-=-==-=-=-=-=-=-=
Python and Java functionality overlaps but is still drastically differ. The Core API resembles the Python library somewhat. The primary NLP/geocoding work is done in the Xponents SDK proper, whereas the other modules support that work as utilities and foundational data classes, etc.
Xponents-Core
repo.XText
repo is used in the SDK Examples folder to demonstrate text
extraction from documents and feeding the extractors using TextInput
class with the tuple of (text, ID, language ID)
opensextant
(v1.5) Python API offers data utilities; Solr clients for TaxCat and Gazetteer;
basic data models for text spans, place objects, etc. Xponents REST client (xlayer
) which interacts with the Java Xponents REST service.As mentioned above to work with just pattern extraction, the Core API is needed.
But to do anything more, like geotagging, the SDK API is needed along with the instance of
Solr as prescribed in the ./solr
folder. Here are some useful pre-requisites:
Maven
Insert these dependencies into your POM depending on what you need.
<!-- Xponents Core API -->
<dependency>
<groupId>org.opensextant</groupId>
<artifactId>opensextant-xponents-core</artifactId>
<version>3.7.0</version>
</dependency>
<!-- Xponents SDK API -->
<dependency>
<groupId>org.opensextant</groupId>
<artifactId>opensextant-xponents</artifactId>
<version>3.7.0</version>
</dependency>
For reference: OpenSextant Xponents on Maven. For that matter, the only relevant artifacts in our org.opensextant
group are:
opensextant-xponents-core 3.7.*
- This Core APIopensextant-xponents 3.7.*
- This Solr-based tagger SDKopensextant-xponents-xtext 3.6.*
- XText, the text extraction toolkitThese libraries have been folded into Xponents Core library for long term Java sustainment; They
are rooted under org.opensextant.
geodesy 2.0.1
- Geodetic operations and coordinate system calculationsgiscore 2.0.2
- GIS I/OPython
Someday we’ll just post this to PyPi.
pushd ./python
python3 ./setup.py sdist
popd
# Install built lib with dependencies to ./piplib
pip3 install -U --target ./piplib ./python/dist/opensextant-1.5*.tar.gz
pip3 install -U --target ./piplib lxml bs4 arrow requests pycountry
# Note - if working with a distribution release, the built Python
# package is in ./python/ (not ./python/dist/)
# * IMPORTANT *
export PYTHONPATH=$PWD/piplib
# Adjust the "--target piplib" and the PYTHONPATH according to how
# you like it.
See the Examples material that you can use from within the Docker image or from a full checkout/build of the project. Pipeline topics covered there are :
XText
. XText is included in this distributionThese extractors are in the org.opensextant.extractors
packages, and demonstrated in the Examples sub-project using Groovy. These libraries and builds are located in Maven Central and in Docker Hub. Here is a quick overview of the organization of the project:
A packaged distribution has the API docs at ./lib/opensextant-*-javadoc.jar
Xponents Philosophy: The intent of Xponents is to provide the extraction without too much infrastructure, as you likely already have that. This library tool chest contains the following ideas and capabilities:
PlaceGeocoder
a geotagger/geocoder for ALL geographic entities in free-text of any size.XCoord
a geographic coordinate extractor to pull out and geocode MGRS, UTM or latitude/longitude in degrees (decimal, minutes, seconds, etc)XTemporal
a date/time pattern extractor that identifies and codes practical well-known date and date/time formatsXTax
a keyword extractor that allows you to associate important metadata and taxonomic information with keywordsLangDetect
langid, using mainly CyboZu LangDetectFlexPat
(see XCoord
and XTemporal
packages for examples)Country, Place, LatLon
are all examples of the Geocoding
interface (see org.opensextant.data
)Language
and LangID
represents a simple but powerful and overlooked data: language codes and names. Here we use language codes to drive various language-dependent features and then also to present language name to end users. (see ISO-339 standards; our source table is from Library of Congress)TextEntity
(a text span) and TextMatch
(a span matched by a rule or extractor) represent the essential unit of data emitted by all Extractors in Xponents. These include: GeocoordMatch
, PlaceCandidate
, DateMatch
, TaxonMatch
, and others.SolrGazetteer
provides a clean API to query gazetteer data using Solr query mechanics and Xponents data classes. The index is optimized for full text search and geospatial queriesGazetteerMatcher
provides a direct API around the text tagging capability Solr TextTagger
(present in Solr v7.4 or later)See BUILD.md