Package org.opensextant.extractors.geo
Class PlaceGeocoder
java.lang.Object
org.opensextant.extraction.SolrMatcherSupport
org.opensextant.extractors.geo.GazetteerMatcher
org.opensextant.extractors.geo.PlaceGeocoder
- All Implemented Interfaces:
Closeable, AutoCloseable, org.opensextant.extraction.Extractor, BoundaryObserver, CountryObserver, LocationObserver
public class PlaceGeocoder
extends GazetteerMatcher
implements org.opensextant.extraction.Extractor, CountryObserver, BoundaryObserver, LocationObserver
A simple variation on the geocoding algorithms: geotag all possible things and determine a best geo-location for each tagged item. This uses the following components:
- PlacenameMatcher: place name tagging and gazetteering
- XCoord: coordinate extraction
- geo.rules.* pkg: disambiguation rules to choose the best location for tagged names
- Author:
- Marc C. Ubaldino, MITRE, ubaldino at mitre dot org
-
Field Summary
Modifier and Type / Field / Description
- static final int COORDINATE_PROXIMITY_ADM1_THRESHOLD
- static final int COORDINATE_PROXIMITY_CITY_THRESHOLD: Find nearest city within r=25 KM to infer geography of a given coordinate, e.g., What state is (x,y) in? Found locations are sorted by distance to point.
- static final String METHOD_DEFAULT
- static final String VERSION
Fields inherited from class org.opensextant.extractors.geo.GazetteerMatcher
AR_TAG_FIELD, CJK_TAG_FIELD, DEFAULT_TAG_FIELD, filter, lang2nameField
Fields inherited from class org.opensextant.extraction.SolrMatcherSupport
DEFAULT_TAG_LIMIT, getNamesTime, log, requestHandler, solr, tagNamesTime, totalTime
Fields inherited from interface org.opensextant.extraction.Extractor
NO_DOC_ID
-
Constructor Summary
Constructor / Description
- PlaceGeocoder(): A default geocoding app that demonstrates how to invoke the geocoding pipeline start to finish.
- PlaceGeocoder(boolean lowercaseAllowed)
-
Method Summary
Modifier and Type / Method / Description
- void addRule(GeocodeRule r): Add your own geocode rules to add evidence, adjust score, outright choose Place instances on PlaceCandidates, etc.
- void boundaryLevel1InScope(String nameNorm, org.opensextant.data.Place p): Observer pattern that sees any time a possible boundary (state, province, district, etc.) is mentioned.
- void boundaryLevel2InScope(String nameNorm, org.opensextant.data.Place p): Given the name (lower case, strip quotes), the location candidate infers an ADMIN boundary.
- void cleanup(): Please shut down the application cleanly when done.
- void close(): Close solr resources.
- void configure(): We do whatever is needed to init resources...
- void configure(String patfile): Configure an Extractor using a config file named by a path.
- void configure(URL patfile): Configure an Extractor using a config file named by a URL.
- int countryCount()
- void countryInScope(String cc): Record how often country references are made.
- void countryInScope(org.opensextant.data.Country c): Record how often country references are made.
- countryMentionCount(): Calculate country mention totals and ratios.
- boolean countryObserved(String cc): Have you seen this country before?
- boolean countryObserved(org.opensextant.data.Country C): Have you seen this country before?
- void enablePersonNameMatching(boolean b): Deprecated.
- org.opensextant.data.Place evaluateCoordinate(org.opensextant.data.Geocoding g): Compound method that is crucial in reverse geocoding COORDINATE to KNOWN PLACE.
- List<org.opensextant.extraction.TextMatch> extract(String input_buf): Generic tagging.
- List<org.opensextant.extraction.TextMatch> extract(org.opensextant.data.TextInput input): See extract(TextInput, Parameters) below.
- List<org.opensextant.extraction.TextMatch> extract(org.opensextant.data.TextInput input, org.opensextant.processing.Parameters jobParams): Extractor.extract() first calls XCoord to get coordinates, then PlacenameMatcher. In the end you have all geo entities ranked and scored.
- String getName()
- boolean isCoordExtractionEnabled()
- boolean isPersonNameMatchingEnabled(): Deprecated. This has no effect.
- void locationInScope(org.opensextant.data.Geocoding geo): When coordinates are found, track them.
- placeMentionCount(): Weight mentions or indirect references to provinces in the document.
- boolean placeObserved(org.opensextant.data.Place p): Tell us if this place P was inferred by hard location mentions.
- void reportMetrics(): We have some emerging metrics to report out...
- void setParameters(org.opensextant.processing.Parameters p)
- void setRules(List<GeocodeRule> rlist): You don't like the default rule set,..
Methods inherited from class org.opensextant.extractors.geo.GazetteerMatcher
createPlace, createTag, getCoreName, getFiltrationRatio, getGazetteer, getMatcherParameters, initialize, placesAt, reportMemory, searchAdvanced, searchAdvanced, setAllowLowerCase, setAllowLowerCaseAbbreviations, setEnableCaseFilter, setEnableCodeHunter, setMatchFilter, tagText, tagText, tagText, tagText, tagText
Methods inherited from class org.opensextant.extraction.SolrMatcherSupport
getRetrievingNamesTime, getTaggingNamesTime, getTotalTime, setTaggerHandler, tagTextCallSolrTagger
-
Field Details
-
VERSION
- See Also:
-
METHOD_DEFAULT
-
taxonCatalogs
-
COORDINATE_PROXIMITY_CITY_THRESHOLD
public static final int COORDINATE_PROXIMITY_CITY_THRESHOLD
Find the nearest city within r=25 km to infer the geography of a given coordinate, e.g., what state is (x,y) in? Found locations are sorted by distance to the point.
- See Also:
-
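The radius search described for COORDINATE_PROXIMITY_CITY_THRESHOLD can be sketched as follows. This is a minimal illustration, not the library's implementation: the City type, the citiesNear helper, and the haversine distance are assumptions made for the example, and the 25.0 value simply mirrors the "r=25 KM" wording above.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class NearestCitySketch {
    /** Search radius in km, mirroring the documented r=25 KM (illustrative value). */
    static final double RADIUS_KM = 25.0;
    static final double EARTH_RADIUS_KM = 6371.0;

    /** Hypothetical gazetteer entry; not an OpenSextant class. */
    static final class City {
        final String name;
        final double lat;
        final double lon;

        City(String name, double lat, double lon) {
            this.name = name;
            this.lat = lat;
            this.lon = lon;
        }
    }

    /** Great-circle distance in km between two lat/lon points (haversine formula). */
    static double haversineKm(double lat1, double lon1, double lat2, double lon2) {
        double dLat = Math.toRadians(lat2 - lat1);
        double dLon = Math.toRadians(lon2 - lon1);
        double a = Math.sin(dLat / 2) * Math.sin(dLat / 2)
                + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
                * Math.sin(dLon / 2) * Math.sin(dLon / 2);
        return 2 * EARTH_RADIUS_KM * Math.asin(Math.sqrt(a));
    }

    /** Cities within RADIUS_KM of the point, sorted nearest first. */
    static List<City> citiesNear(final double lat, final double lon, List<City> cities) {
        List<City> hits = new ArrayList<>();
        for (City c : cities) {
            if (haversineKm(lat, lon, c.lat, c.lon) <= RADIUS_KM) {
                hits.add(c);
            }
        }
        hits.sort(Comparator.comparingDouble((City c) -> haversineKm(lat, lon, c.lat, c.lon)));
        return hits;
    }
}
```

The first element of the returned list is the nearest city, whose country and province codes could then answer "what state is (x,y) in?".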
COORDINATE_PROXIMITY_ADM1_THRESHOLD
public static final int COORDINATE_PROXIMITY_ADM1_THRESHOLD
- See Also:
-
-
Constructor Details
-
PlaceGeocoder
public PlaceGeocoder() throws org.opensextant.ConfigException
A default geocoding app that demonstrates how to invoke the geocoding pipeline start to finish. It uses XCoord to parse and geocode coordinates, SolrGazetteer/GazetteerMatcher to match named places, and XTax to tag person names. Match filters and rules work in conjunction to filter and further tag any candidates.
- Throws:
org.opensextant.ConfigException
- if resource files could not be found in CLASSPATH
-
PlaceGeocoder
public PlaceGeocoder(boolean lowercaseAllowed) throws org.opensextant.ConfigException
- Parameters:
lowercaseAllowed
- if lower case abbreviations are allowed. See GazetteerMatcher
- Throws:
org.opensextant.ConfigException
- if resource files could not be found in CLASSPATH
-
-
Method Details
-
getName
- Specified by:
getName
in interface org.opensextant.extraction.Extractor
-
configure
Configure an Extractor using a config file named by a path.
- Specified by:
configure
in interface org.opensextant.extraction.Extractor
- Parameters:
patfile
- configuration file path
- Throws:
org.opensextant.ConfigException
- on err
-
configure
Configure an Extractor using a config file named by a URL.
- Specified by:
configure
in interface org.opensextant.extraction.Extractor
- Parameters:
patfile
- configuration URL
- Throws:
org.opensextant.ConfigException
-
reportMetrics
public void reportMetrics()
We have some emerging metrics to report out...
-
configure
public void configure() throws org.opensextant.ConfigException
We do whatever is needed to init resources; that varies depending on the use case.
Guidelines: this class is custodian of the app controller, Corpus feeder, and any Document instances passed into or out of the feeder.
This geocoder requires a default /exclusions/person-name-filter.txt, which can be empty, but most often it will be a list of person names (which are non-place names).
Rules configured, in approximate order:
- CountryRule -- tag all country names
- NameCodeRule -- parse any Name, CODE, or Name1, Name2 patterns for "Place, AdminPlace" evidence
- PersonNameRule -- annotate, negate any patterns or matches that appear to be known celebrity persons or organizations. Qualified places are not negated, e.g., "Eugene, Oregon" is a place; "Eugene" with no other evidence is a person name.
- CoordRule -- if requested, parse any coordinate patterns; reverse geocode Country + Province.
- ProvinceAssociationRule -- associate places with Province inferred by coordinates.
- MajorPlaceRule -- identify major places by feature type, class or location population.
- LocationChooserRule -- final rule that assigns confidence and chooses best location(s)
- Specified by:
configure
in interface org.opensextant.extraction.Extractor
- Throws:
org.opensextant.ConfigException
- on err
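
The ordered rule list above can be pictured as a chain in which earlier rules add or negate evidence and a final step chooses among the survivors. The sketch below is a simplified illustration under stated assumptions: Candidate, Rule, QualifiedNameRule, PersonNameSketchRule, and chooseBest are hypothetical stand-ins, not the real GeocodeRule classes, and the scoring values are arbitrary.

```java
import java.util.List;
import java.util.Locale;
import java.util.Set;

final class RuleChainSketch {
    /** Hypothetical candidate tag; the real PlaceCandidate carries far more state. */
    static final class Candidate {
        final String name;
        final boolean qualified; // true when followed by an admin place, e.g. "Eugene, Oregon"
        double score = 0;
        boolean filteredOut = false;

        Candidate(String name, boolean qualified) {
            this.name = name;
            this.qualified = qualified;
        }
    }

    /** Hypothetical stand-in for a geocode rule; only the "evaluate" idea is borrowed. */
    interface Rule {
        void evaluate(List<Candidate> candidates);
    }

    /** Adds evidence for "Place, AdminPlace" patterns (loosely analogous to NameCodeRule). */
    static final class QualifiedNameRule implements Rule {
        public void evaluate(List<Candidate> candidates) {
            for (Candidate c : candidates) {
                if (c.qualified) {
                    c.score += 5; // arbitrary illustrative weight
                }
            }
        }
    }

    /** Negates unqualified person names (loosely analogous to PersonNameRule). */
    static final class PersonNameSketchRule implements Rule {
        private final Set<String> personNames; // lower-case names

        PersonNameSketchRule(Set<String> personNames) {
            this.personNames = personNames;
        }

        public void evaluate(List<Candidate> candidates) {
            for (Candidate c : candidates) {
                if (!c.qualified && personNames.contains(c.name.toLowerCase(Locale.ROOT))) {
                    c.filteredOut = true;
                }
            }
        }
    }

    /** Runs rules in list order, then picks the best surviving candidate (null if none). */
    static Candidate chooseBest(List<Rule> rules, List<Candidate> candidates) {
        for (Rule rule : rules) {
            rule.evaluate(candidates);
        }
        Candidate best = null;
        for (Candidate c : candidates) {
            if (!c.filteredOut && (best == null || c.score > best.score)) {
                best = c;
            }
        }
        return best;
    }
}
```

With a person-name set containing "eugene", a bare "Eugene" is filtered out while a qualified "Eugene, Oregon" candidate survives and is chosen, which matches the PersonNameRule example above. A custom rule would simply be one more entry in the list.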
-
addRule
Add your own geocode rules to add evidence, adjust scores, outright choose Place instances on PlaceCandidates, etc. As long as your rule implements or overrides the GeocodeRule.evaluate() methods, candidate tags will be fully evaluated.
- Parameters:
r
- a rule
-
setRules
If you don't like the default rule set, add your own.
-
cleanup
public void cleanup()
Please shut down the application cleanly when done.
- Specified by:
cleanup
in interface org.opensextant.extraction.Extractor
-
close
public void close()
Description copied from class: SolrMatcherSupport
Close solr resources.
- Specified by:
close
in interface AutoCloseable
- Specified by:
close
in interface Closeable
- Overrides:
close
in class SolrMatcherSupport
-
setParameters
public void setParameters(org.opensextant.processing.Parameters p)
-
isCoordExtractionEnabled
public boolean isCoordExtractionEnabled()
-
isPersonNameMatchingEnabled
Deprecated. This has no effect. Tagging Parameters here are mainly considered for filtering output.
Person name matching is ALWAYS on; this flag indicates whether results are reported in the returned array.
- Returns:
-
enablePersonNameMatching
Deprecated.
- Parameters:
b
-
-
extract
public List<org.opensextant.extraction.TextMatch> extract(org.opensextant.data.TextInput input) throws org.opensextant.extraction.ExtractionException
See extract(TextInput, Parameters) below. This is the default extraction routine. If you need to tune extraction, call extract(input, parameters).
- Specified by:
extract
in interface org.opensextant.extraction.Extractor
- Throws:
org.opensextant.extraction.ExtractionException
-
extract
public List<org.opensextant.extraction.TextMatch> extract(org.opensextant.data.TextInput input, org.opensextant.processing.Parameters jobParams) throws org.opensextant.extraction.ExtractionException
Extractor.extract() first calls XCoord to get coordinates, then PlacenameMatcher. In the end you have all geo entities ranked and scored.
LangID can be set on TextInput input.langid. Use lowercase langIDs only: 'zh' and 'ar' tag text for those languages in particular. Null and other values are treated as generic as of v2.8.
Use TextMatch.getType() to determine how to interpret TextMatch / Geocoding results:
- Given TextMatch match, then
- Place tag: ((PlaceCandidate)match).getChosen() OR
- Coord tag: (Geocoding)match, OR
- Other tag: match might be TaxonMatch or Pattern (PoliMatch)
- Parameters:
input
- input buffer, doc ID, and optional langID.
- Returns:
- TextMatch instances which are all PlaceCandidates.
- Throws:
org.opensextant.extraction.ExtractionException
- on err
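
The interpretation steps listed above (place tag, coord tag, other tag) amount to a dispatch on the kind of match. The following sketch shows that shape with hypothetical stand-in classes (Match, PlaceMatch, CoordMatch) rather than the real TextMatch, PlaceCandidate, and Geocoding types; it also swaps in instanceof checks where the documentation suggests TextMatch.getType(), because the type labels are not listed on this page.

```java
import java.util.Locale;

final class MatchDispatchSketch {
    /** Hypothetical base type standing in for a generic text match. */
    static class Match {
        final String text;

        Match(String text) {
            this.text = text;
        }
    }

    /** Hypothetical place match; "chosen" is null when no location was chosen. */
    static final class PlaceMatch extends Match {
        final String chosen;

        PlaceMatch(String text, String chosen) {
            super(text);
            this.chosen = chosen;
        }
    }

    /** Hypothetical coordinate match carrying a parsed latitude/longitude. */
    static final class CoordMatch extends Match {
        final double lat;
        final double lon;

        CoordMatch(String text, double lat, double lon) {
            super(text);
            this.lat = lat;
            this.lon = lon;
        }
    }

    /** Returns a one-line description, branching on the kind of match. */
    static String describe(Match m) {
        if (m instanceof PlaceMatch) {
            PlaceMatch p = (PlaceMatch) m;
            return p.chosen == null
                    ? "place (unresolved): " + p.text
                    : "place: " + p.text + " -> " + p.chosen;
        }
        if (m instanceof CoordMatch) {
            CoordMatch c = (CoordMatch) m;
            return String.format(Locale.ROOT, "coord: %s -> (%.4f, %.4f)", c.text, c.lat, c.lon);
        }
        // Anything else (taxon, pattern, etc.) is passed through with its text only.
        return "other: " + m.text;
    }
}
```

In real use the same branching would wrap the casts shown in the list above, with a null check on the chosen place before reading it.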
-
countryInScope
public void countryInScope(org.opensextant.data.Country c)
Record how often country references are made.
- Specified by:
countryInScope
in interface CountryObserver
- Parameters:
c
- country obj
-
countryCount
public int countryCount()
- Specified by:
countryCount
in interface CountryObserver
-
countryMentionCount
Calculate country mention totals and ratios. These ratios help qualify what the document is about. Mentions may appear in the text or be inferred, e.g., a coordinate infers a particular country.
- Specified by:
countryMentionCount
in interface CountryObserver
- Returns:
- map of country code : counts
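
As a rough picture of "mention totals and ratios", the sketch below tallies country codes and derives each country's share of all mentions. CountryTallySketch and its methods (record, seen, distinct, ratios) are hypothetical names for illustration; the real return type and counting rules of countryMentionCount() are not shown on this page.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class CountryTallySketch {
    private final Map<String, Integer> counts = new LinkedHashMap<>();
    private int total = 0;

    /** Record one mention (explicit or inferred) of a country code. */
    void record(String cc) {
        counts.merge(cc, 1, Integer::sum);
        total++;
    }

    /** True if the country code has been recorded at least once. */
    boolean seen(String cc) {
        return counts.containsKey(cc);
    }

    /** Number of distinct country codes recorded so far. */
    int distinct() {
        return counts.size();
    }

    /** Share of all mentions per country code; empty when nothing was recorded. */
    Map<String, Double> ratios() {
        Map<String, Double> out = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            out.put(e.getKey(), e.getValue() / (double) total);
        }
        return out;
    }
}
```

A document with three "US" mentions and one "FR" mention yields ratios of 0.75 and 0.25, which suggests the document is mostly about the US.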
-
placeMentionCount
Weight mentions or indirect references to provinces in the document.
- Specified by:
placeMentionCount
in interface BoundaryObserver
- Returns:
-
countryInScope
Record how often country references are made.
- Specified by:
countryInScope
in interface CountryObserver
- Parameters:
cc
-
-
countryObserved
Description copied from interface: CountryObserver
Have you seen this country before?
- Specified by:
countryObserved
in interface CountryObserver
- Parameters:
cc
- country code
- Returns:
- true if observer saw country
-
countryObserved
public boolean countryObserved(org.opensextant.data.Country C)
Description copied from interface: CountryObserver
Have you seen this country before?
- Specified by:
countryObserved
in interface CountryObserver
- Parameters:
C
- country object
- Returns:
- true if observer saw country
-
locationInScope
public void locationInScope(org.opensextant.data.Geocoding geo)
When coordinates are found, track them. A coordinate is critical: it informs us of city, province, and country. If the location is offshore or in no man's land, these chains of observers should respect that and fail quietly. There are at least two opportunities here:
1. Given a geo coordinate, use that hard location to disambiguate other named places.
2. Given a geo coordinate, identify the nearest known place(s). Such places may not be present in the document or text.
The first improves overall location accuracy; the second offers location enrichment and discovery.
- Specified by:
locationInScope
in interface LocationObserver
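
The first opportunity, using a hard coordinate to disambiguate named places, can be sketched as preferring candidates that share the country and province inferred from the coordinate. Everything here is an assumption for illustration: Loc, evidence, and prefer are hypothetical, the 0/1/2 scores are arbitrary, and the real rules weigh much more evidence.

```java
import java.util.List;

final class CoordinateEvidenceSketch {
    /** Hypothetical gazetteer entry: name plus country and province (ADM1) codes. */
    static final class Loc {
        final String name;
        final String cc;
        final String adm1;

        Loc(String name, String cc, String adm1) {
            this.name = name;
            this.cc = cc;
            this.adm1 = adm1;
        }
    }

    /**
     * Evidence for a candidate given the country/province inferred from a coordinate:
     * 2 = same province, 1 = same country only, 0 = no support.
     * A null country (e.g., an offshore coordinate) quietly yields 0 for every candidate.
     */
    static int evidence(Loc candidate, String coordCc, String coordAdm1) {
        if (!candidate.cc.equals(coordCc)) {
            return 0;
        }
        return candidate.adm1.equals(coordAdm1) ? 2 : 1;
    }

    /** First candidate with the highest evidence, or null for an empty list. */
    static Loc prefer(List<Loc> candidates, String coordCc, String coordAdm1) {
        Loc best = null;
        int bestScore = -1;
        for (Loc c : candidates) {
            int s = evidence(c, coordCc, coordAdm1);
            if (s > bestScore) {
                best = c;
                bestScore = s;
            }
        }
        return best;
    }
}
```

When the coordinate gives no usable country, all candidates tie at zero and the original candidate order is preserved, which is one way to "fail quietly".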
-
boundaryLevel1InScope
Observer pattern that sees any time a possible boundary (state, province, district, etc.) is mentioned. Example: the mention "Florida" linked to location Florida (ADM1, FL, US) infers the boundary "US.FL", as would "Miami" (PPL, FL, US). We care more about the distinct and various mentions than about the location counts; e.g., "Florida" has 185 locations worldwide, with multiples in some countries.
- Specified by:
boundaryLevel1InScope
in interface BoundaryObserver
- Parameters:
nameNorm
- text or name related to the place, p
p
- ID of a boundary.
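
The "US.FL" example above suggests a boundary key built from country and province codes, with distinct mention names tracked per key. The sketch below illustrates that idea only; BoundaryTallySketch, boundaryKey, boundaryInScope (with this three-string signature), and distinctMentions are hypothetical and do not reproduce the BoundaryObserver interface.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

final class BoundaryTallySketch {
    /** Boundary key ("CC.ADM1") mapped to the distinct normalized names that inferred it. */
    private final Map<String, Set<String>> mentions = new LinkedHashMap<>();

    /** Builds a key such as "US.FL" from a country code and a province code. */
    static String boundaryKey(String cc, String adm1) {
        return cc + "." + adm1;
    }

    /** Record that a normalized name (e.g., "florida" or "miami") infers a boundary. */
    void boundaryInScope(String nameNorm, String cc, String adm1) {
        mentions.computeIfAbsent(boundaryKey(cc, adm1), k -> new LinkedHashSet<>()).add(nameNorm);
    }

    /** Number of distinct names that inferred the boundary; repeats are not counted. */
    int distinctMentions(String key) {
        Set<String> names = mentions.get(key);
        return names == null ? 0 : names.size();
    }
}
```

Recording "florida" twice and "miami" once for US.FL yields two distinct mentions, reflecting the emphasis on distinct mentions over raw location counts.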
-
boundaryLevel2InScope
Description copied from interface: BoundaryObserver
Given the name (lower case, strip quotes), the location candidate infers an ADMIN boundary.
- Specified by:
boundaryLevel2InScope
in interface BoundaryObserver
-
extract
public List<org.opensextant.extraction.TextMatch> extract(String input_buf) throws org.opensextant.extraction.ExtractionException
Generic tagging. No doc ID or language ID given. Nothing language specific will be done here.
- Specified by:
extract
in interface org.opensextant.extraction.Extractor
- Throws:
org.opensextant.extraction.ExtractionException
-
evaluateCoordinate
public org.opensextant.data.Place evaluateCoordinate(org.opensextant.data.Geocoding g) throws org.apache.solr.client.solrj.SolrServerException, IOException
Compound method that is crucial in reverse geocoding COORDINATE to KNOWN PLACE.
A method to retrieve one or more distinct admin boundaries containing the coordinate. This depends on the resolution of the gazetteer at hand. Secondarily, as nearby places are encountered they are added to a coordinate, providing a basic reverse-geocoding solution.
- Parameters:
g
- geo coordinate found in text.
- Returns:
- Place object near the geocoding.
- Throws:
org.apache.solr.client.solrj.SolrServerException
- a query against the Solr index may throw a Solr error.
IOException
-
placeObserved
public boolean placeObserved(org.opensextant.data.Place p)
Tell us if this place P was inferred by hard location mentions.
- Specified by:
placeObserved
in interface LocationObserver
- Returns:
-