A toolkit for Geographic Place, Date/Time, and Pattern entity extraction, along with text extraction from unstructured data and GIS outputters.
The OpenSextant Gazetteer is a catalog of place names and basic geographic metadata, such as country code, location, and feature codes. In Xponents, Solr 7+ is used to index and provision the large lexicons, such as the gazetteer and taxonomies.
The related APIs include:

- the `opensextant.gazetteer` API to query such things
- the `opensextant.Place` class, enriched with the internal text model
- `PlaceHeuristics`, which produces name and location biasing based on general assumptions
All of this reference data is loaded into `master_gazetteer.sqlite` and subsequently into the Solr indices. Solr (via a custom request handler) is used to identify that reference data in an input argument to `/tag`. The main indices are `gazetteer`, `taxcat`, and `postal`.
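A minimal sketch of that flow, assuming the Xponents Solr server is running locally on its default port 7000 (see the build notes below). The `/tag` endpoint and its parameters follow standard Solr tagger-handler conventions; the `fl` field names here are illustrative.

```python
# Sketch: post raw text to the gazetteer core's /tag handler and inspect matches.
# Port 7000 and the field list are assumptions based on this README.
import requests

text = "We flew from Boston to Cairo last March."
resp = requests.post(
    "http://localhost:7000/solr/gazetteer/tag",
    params={"wt": "json", "overlaps": "NO_SUB", "matchText": "true", "fl": "id,name,cc"},
    data=text.encode("utf-8"),
    headers={"Content-Type": "text/plain; charset=UTF-8"},
)
result = resp.json()
print(result["tags"])      # character offsets of each matched name in the input
print(result["response"])  # the gazetteer records behind those matches
```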
You do NOT need to know all about SQLite, Solr, or Lucene to make use of this, but it helps when you need to optimize or extend things for new languages.
You have a few options:
Option 1. Download an Xponents SDK release (libraries, docs, and a pre-built Xponents Solr)
Option 2. Check out the Xponents project and build from the latest source and data.
Where Python is referred to in any instructions, we mean Python 3.8+ only. `pip` may be further qualified as `pip3` in many scripts.
The estimated disk space to build a complete distribution is on the order of 20 GB, including various temporary files.
Expectations around Data
| Source ID | Data Source | ETL Time | Place Name Count |
|-----------|-------------|----------|------------------|
| N | NGA GNIS | 25 min | 16.6 million |
| | | 5 min | 2.3 million |
| U | US FIPS state postal/numeric codes and names | 1 min | 150+ |
| G | Geonames | 25 min | 23+ million |
| NE | NaturalEarth Admin Boundaries | 25 min | 73K |
| ISO | ISO 3166 | 1 min | 850+ |
| X | Xponents Derived | 5 min | 235K |
| - | Geonames Postal | 30 min | 7 million |
Distinct Place Names: 24 million
Build Setup - Python, Java, etc
Managing public domain data sets that are pulled down, scraped, harvested, etc. involves additional Python libraries that are not required for normal use of the `opensextant` package. Add these pip-installable items now from the Xponents root folder:
```shell
cd Xponents
./setup.sh

# Note - if working with a distribution release, the built Python package is in ./python/ (not ./python/dist/)
# Note - choose any means you want to set your effective Python environment; I use the PYTHONPATH var
export PYTHONPATH=$PWD/piplib
# or
. ./dev.env
```
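A quick way to confirm the environment took effect is to import the package, e.g.:

```python
# Sanity check: this should print a path under ./piplib (or your site-packages)
# if PYTHONPATH / dev.env was set up as above.
import opensextant
print(opensextant.__file__)
```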
Linux/Mac kernel configuration related to Solr server usage also requires increasing certain "ulimit" limits above defaults:
```shell
# As root
/sbin/sysctl fs.file-max=65536000
# or temporarily
ulimit -n 65536000

# As user, increase user process max
ulimit -u 8092

# As root:
sudo /sbin/sysctl -p
```
Option 2. Build Gazetteer From Scratch
---------------------------------------------

Here is an overview of this data curation process:

- The main sources are ISO 3166, US NGA, and USGS to cover worldwide geography
- Secondary sources (Geonames.org, Natural Earth, ad-hoc entries, generated name variants) are assembled by these Xponents scripts
- A one-time collection of `wordstats` is needed to identify common terms that collide with location names
- With all data collected, each data set is loaded into SQLite with specific source identifiers
- With the master SQLite gazetteer complete, entries can be de-duplicated, marked, and optimized
- Finally, the master gazetteer entries (non-duplicates) are funneled to the default Xponents Solr instance

The steps here represent the journey of how to produce this behemoth -- a process we are constantly trying to streamline and automate.

**1. Data Collection**

```shell
cd ./solr
ant gaz-resources
ant gaz-stopwords

# US Gov sites ~ USGS and NGA.mil websites are not consistently
# secured with an obvious CA chain. `curl -k` is used to insecurely download some data
./build-1-get-sources.sh

# Other data sources, and unpacking all of it.
ant gaz-sources
ant postal-sources
```
In parallel, run the wordstats collection ONCE; this material does not change. You end up with about a 1.0 GB SQLite file with unigram counts from the Google Books Ngrams project.
```shell
./script/wordstats.sh download
./script/wordstats.sh assemble

# Once fully debugged this script may change to streaming or delete download files when done.
# You may remove the ./tmp/wordstats/*.gz content once this script has completed.
# Output: ./tmp/wordstats.sqlite
```
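Once `wordstats.sqlite` exists, you can peek at it directly. A minimal sketch, assuming hypothetical table and column names (`wordstats`, `word`, `count`); inspect the real schema before relying on this:

```python
# Sketch only: gauge how "common" a term is before trusting it as a place name.
# The table/column names are assumptions -- check ./tmp/wordstats.sqlite for the schema.
import sqlite3

db = sqlite3.connect("tmp/wordstats.sqlite")
for term in ("of", "paris", "timbuktu"):
    row = db.execute('SELECT "count" FROM wordstats WHERE word = ?', (term,)).fetchone()
    print(term, row[0] if row else "(not found)")
db.close()
```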
**2. Collect and Ingest Secondary Sources**
This SQLite master curation process is central to the Xponents gazetteer/geotagger.
All of the metadata and source data is channeled through this and optimized. The raw SQLite master approaches 10 GB, containing about 45 million place names. By contrast, the resulting Solr index is about 3.0 GB with 25 million place names. The optimization steps are essential to balancing index size with comprehensive coverage.
```shell
cd Xponents/solr
./build-2-sqlite-master.sh

# A simple test attempts to pull in only 100,000 rows of data from each source to see how things work.
# ./build-2-sqlite-master.sh test
```
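To illustrate the de-duplication idea mentioned above (and detailed under the Solr cores below), here is a sketch of marking duplicates in SQLite. This is not the actual Xponents implementation; the `placenames` table, its columns, and the `duplicate` flag are assumptions:

```python
# Sketch: flag rows as duplicates when feature + name + location + country repeat,
# keeping the first row of each group. Table/column names are hypothetical.
import sqlite3

db = sqlite3.connect("tmp/master_gazetteer.sqlite")
db.execute("""
    UPDATE placenames SET duplicate = 1
     WHERE rowid NOT IN (
         SELECT MIN(rowid) FROM placenames
          GROUP BY feat_class, feat_code, name, lat, lon, cc)
""")
db.commit()
db.close()
```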
**3. Postal Gazetteer**

The Postal gazetteer/tagger has its own sources (postal codes) but also pulls in metadata for worldwide provinces from the master gazetteer. Make sure your master gazetteer (or test file) completes successfully above.

```shell
./build-3-sqlite-postal.sh
```
This is a stock instance of Solr 7.x with a number of custom Solr cores. The main cores are `gazetteer`, `taxcat`, and `postal`. They are populated by the build scripts above, using their SQLite databases as the intermediate data:

- `gazetteer`: 99.9% of the distinct entries in `tmp/master_gazetteer.sqlite` will be indexed into the Solr gazetteer. A limited number of default filters omit odd names ~ short names, names of obscure hydrological features (wells, intermittent streams, etc.). Duplicate place names (feature + name + location + country) are not indexed.
- `taxcat`: TaxCat will contain taxonomic entries such as well-known named entities, nationalities, generic person names, and other useful lexica. See the XTax README.
- `postal`: The postal index is populated straight from its SQLite database.
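Once populated, any core can be queried with the standard Solr select API. A sketch, assuming the default port 7000; field names such as `name`, `cc`, and `feat_class` are taken from the descriptions above, so verify them against the core's schema:

```python
# Sketch: look up gazetteer records by name, filtered to one country.
import requests

resp = requests.get(
    "http://localhost:7000/solr/gazetteer/select",
    params={"q": 'name:"San Antonio"', "fq": "cc:US", "rows": 5, "wt": "json"},
)
for doc in resp.json()["response"]["docs"]:
    print(doc.get("name"), doc.get("cc"), doc.get("feat_class"))
```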
These notes cover the general situation: standing up Solr and iterating through common tasks.
Step 1. Get Solr 7.x
To get a fully working Solr instance running, unpack the full Solr 7.x distribution here as `./solr7-dist`. This involves some extra steps, but it is relatively well tested.
Using the latest Solr distribution would possibly involve updating the Maven POM, as well as reviewing the Solr index configurations.
```shell
wget http://archive.apache.org/dist/lucene/solr/7.7.3/solr-7.7.3.zip
unzip solr-7.7.3.zip
SOLR_DIST=./solr7-dist
mv ./solr-7.7.3 $SOLR_DIST

rm -rf $SOLR_DIST/example $SOLR_DIST/server/solr/configsets $SOLR_DIST/contrib $SOLR_DIST/dist/test-framework

# We could automate this, sure. But you need only do it once and hopefully it is not repetitive.
# NOTE: If Solr 8.x is in use, the distro is ./solr8-dist. Differences from Solr 7 to Solr 8 are still
# being investigated.
```
Step 3. Configuration and Deployment Paths
By default, you have this runtime environment in a checkout or in a distribution:

- `./Xponents/solr/solr7` will contain the Solr indices
- `./Xponents/solr/solr7-dist` will contain the Solr server that serves the indices
We refer to Xponents Solr informally as `XP_SOLR`, which is `./Xponents/solr` in the source tree, but in a distribution it defaults to `./Xponents-VER/xponents-solr` to distinguish it from the raw source. The typical release schedule for the `XP_SOLR` distribution is quarterly.
The OpenSextant/Xponents JVM argument used to set this index path must point to the `solr7` index folder, i.e. `XP_SOLR/solr7`. This may be an absolute or relative path. Keep the entire `./xponents-solr/` folder intact, although only the `solr7` index folder is used at runtime; the other folders provide a fully operational Solr server.
Step 4. Build Indices
The build process can be brittle, so let's get educated so you can make decisions on your own. The `build.sh` script is the central brain behind the data assembly. Use that script alone to build and manage indices; however, if there are problems, see the individual steps below to intervene and redo any step.
To update the Gazetteer Meta resources, review these few steps:
```shell
cd ./solr
./build.sh meta
cd ..
mvn install
```
The `meta` step above gathers the resources below and pushes them up to the Maven project, where they become part of the CLASSPATH (via `opensextant-xponents-*.jar` or from the file system). Resources include:

- `/lang/` – Lucene and other stopword sets
- `/filters/` – exclusions for tagging and downstream tuning
**FIRST USE:**

```shell
./build.sh clean
./build.sh meta
./build.sh start clean data gazetteer
./build.sh taxcat
./build.sh postal
```
If you have gotten to this step and feel confident things look good, this one invocation of `build.sh` should run steps 4a, 4b, and 4c below all in one command.

STOP HERE. If the above succeeded, check your running Solr instance at http://localhost:7000/solr/ and inspect the different cores. If you don't know Solr, please go learn a little bit about using the Solr Admin interface. If you have about 20 million rows in the gazetteer, you are likely ready to go start using the Xponents SDK.

**NEXT USES:** You should not need to reacquire data sources, clean, or restart Solr after that first use of `build.sh`. Subsequent uses may be only `build.sh gazetteer`, for example, to focus on reloading the gazetteer. The rest of this nonsense is to provide more transparency on the individual steps in the event something went wrong.
Synopsis:

```shell
./build.sh start clean data gazetteer taxcat

# clean     = Clean Solr indices and initialize library folders with copies of dependencies, etc.
# start     = Start the Solr server on the default Xponents port 7000.
#             The Solr server is only used at build time, not at runtime.
# data      = Acquire additional data, e.g., Census, Geonames.org, JRC entities, etc.
#             These data sets are not cleaned by 'clean'.
# meta      = Build and gather metadata resources.
# gazetteer = Regenerate only the gazetteer index.
# taxcat    = Regenerate only the taxcat index.
# postal    = Regenerate only the postal index.
```
**Step 4.a Initialize**

```shell
ant init
```
**Step 4.b Get Supporting Data**

```shell
./build.sh data
```
This will pull down data sets used by the Gazetteer and TaxCat taggers and resources, using the Ant tasks:

* `ant gaz-resources`
* `ant taxcat-jrc`

**Step 4.c Load Gazetteer & TaxCat**

In this step, you can use:
```shell
./build.sh [start]
```
which will build the Solr gazetteer index and add to the taxcat index. If the Solr server is not running, it will be started. The Solr access URL is http://localhost:7000/solr
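A quick scripted check of the result, using Solr's standard core-admin API (port 7000 assumed, as above):

```python
# Sketch: confirm the Solr server is up and the expected cores exist.
import requests

resp = requests.get("http://localhost:7000/solr/admin/cores", params={"wt": "json"})
cores = resp.json().get("status", {})
for name in ("gazetteer", "taxcat", "postal"):
    print(name, "present" if name in cores else "MISSING")
```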
It is important to periodically look at terms and situations where a phrase is marked to avoid tagging, or anything else that will prevent a tag from getting back to the user. Filtration happens in at least two manners:
Look at terms marked as `search_only` in the gazetteer and not valid in taxcat. Use `wt=csv` to see CSV format; the default JSON output is set up to list facet patterns of the most frequent terms.
This step falls under the category of geotagger tuning. E.g., see the Extraction `PlaceGeocoder` class as an implementation of a full geotagging capability. To negate false positives we need a source of known things that are not places, rules that guide us in how to judge non-places, or some other means, such as statistical models, to do so.
The XTax API uses TaxCat (the `./solr7/taxcat` core). This API supports the Gazetteer and Xponents taggers with lexicons of various types. Like the GazetteerMatcher tagger, the XTax tagger uses the TaxCat catalog to mark up documents with known entities in the catalogs.
Note: as XTax JRC (and other catalogs you add) tags text, you naturally find lots of additional entities. Some of them can be used to negate false positives in geotagging; other entities found are just interesting. You should save them all as a part of your pipeline.
Below are relative metrics on feature classes, followed immediately by some commentary on how often such feature types are mentioned in most text. For example, we do not often hear folks talking about Undersea features. If a solid exact match for such a name is tagged and geocoded, our confidence in that finding is based on a few aspects:
**Mention Weight** is a relative weight applied to the Xponents confidence metric.
Approximate Feature Count from the OpenSextant Gazetteer (2020):

| Feature Type | Count | Mention Weight |
|--------------|-----------|----------------|
| Places (P) | 9,000,000 | 1.0 |
| Hydro (H) | 3,200,000 | 0.7 default |
| H/STM* | 50% | 0.3 |
| H/LK* | 10% | |
| H/RSV, SPNG, WLL | 10% | 0.3 |
| H/BAY, COVE | 1% | |
| Spot (S) | 2,700,000 | 0.8 |
| Terrain (T) | 2,300,000 | 0.8 |
| Land (L) | 700,000 | 0.8 |
| Admin (A) | 700,000 | 1.0 |
| Vegetation (V) | 85,000 | 0.8 |
| Roadways (R) | 65,000 | 0.7 |
| Undersea (U) | 12,000 | 0.5 |
This is an initial, experimental model for features based on intuition.
Populated Place and Administrative features are far more common in most data, but this depends on your domain.
This weighting will NOT omit particular feature types, but it will help with disambiguating (choosing a most likely feature) and informing the confidence in that conclusion. More test data is needed to objectively build a reasonable feature model.
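For illustration only, a sketch of how such a weight could fold into a confidence score. This is not the Xponents implementation; the weights come from the table above, and the simple multiplicative rule is an assumption:

```python
# Sketch: scale a base match confidence by how often the feature class is
# mentioned in typical text. Weights are from the feature table above.
MENTION_WEIGHT = {
    "P": 1.0,  # populated places
    "A": 1.0,  # administrative boundaries
    "S": 0.8, "T": 0.8, "L": 0.8, "V": 0.8,
    "H": 0.7, "R": 0.7,
    "U": 0.5,  # undersea features are rarely mentioned
}

def weighted_confidence(base_confidence: float, feat_class: str) -> float:
    """Scale confidence by the mention weight; unknown classes get a middling default."""
    return base_confidence * MENTION_WEIGHT.get(feat_class, 0.8)

print(weighted_confidence(0.9, "P"))  # 0.9  -- populated place, full weight
print(weighted_confidence(0.9, "U"))  # 0.45 -- undersea, heavily discounted
```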