Package org.opensextant.extractors.geo
Class PlaceGeocoder
java.lang.Object
org.opensextant.extraction.SolrMatcherSupport
org.opensextant.extractors.geo.GazetteerMatcher
org.opensextant.extractors.geo.PlaceGeocoder
- All Implemented Interfaces:
Closeable, AutoCloseable, org.opensextant.extraction.Extractor, BoundaryObserver, CountryObserver, LocationObserver
public class PlaceGeocoder
extends GazetteerMatcher
implements org.opensextant.extraction.Extractor, CountryObserver, BoundaryObserver, LocationObserver
A simple variation on the geocoding algorithms: geotag all possible things and determine a best geo-location for each tagged item.
This uses the following components:
- PlacenameMatcher: place name tagging and gazetteering
- XCoord: coordinate extraction
- geo.rules.* pkg: disambiguation rules to choose the best location for tagged names
- Author:
- Marc C. Ubaldino, MITRE, ubaldino at mitre dot org
-
Field Summary
Fields (Modifier and Type, Field: Description)
- static final int COORDINATE_PROXIMITY_ADM1_THRESHOLD
- static final int COORDINATE_PROXIMITY_CITY_THRESHOLD: Find nearest city within r=25 KM to infer geography of a given coordinate, e.g., What state is (x,y) in? Found locations are sorted by distance to point.
- static final String METHOD_DEFAULT
- static final String VERSION
Fields inherited from class org.opensextant.extractors.geo.GazetteerMatcher
AR_TAG_FIELD, CJK_TAG_FIELD, DEFAULT_TAG_FIELD, filter, lang2nameField
Fields inherited from class org.opensextant.extraction.SolrMatcherSupport
DEFAULT_TAG_LIMIT, getNamesTime, log, requestHandler, solr, tagNamesTime, totalTime
Fields inherited from interface org.opensextant.extraction.Extractor
NO_DOC_ID
-
Constructor Summary
Constructors (Constructor: Description)
- PlaceGeocoder(): A default geocoding app that demonstrates how to invoke the geocoding pipeline start to finish.
- PlaceGeocoder(boolean lowercaseAllowed)
-
Method Summary
Methods (Modifier and Type, Method: Description)
- void addRule(GeocodeRule r): Add your own geocode rules to enable you to add evidence, adjust score, outright choose Place instances on PlaceCandidates, etc.
- void boundaryLevel1InScope(String nameNorm, org.opensextant.data.Place p): Observer pattern that sees any time a possible boundary (state, province, district, etc.) is mentioned.
- void boundaryLevel2InScope(String nameNorm, org.opensextant.data.Place p): Given the name (lower case, strip quotes), the location candidate infers an ADMIN boundary.
- void cleanup(): Please shutdown the application cleanly when done.
- void close(): Close solr resources.
- void configure(): We do whatever is needed to init resources...
- void configure(String patfile): Configure an Extractor using a config file named by a path.
- void configure(URL patfile): Configure an Extractor using a config file named by a URL.
- int countryCount()
- void countryInScope(String cc): Record how often country references are made.
- void countryInScope(org.opensextant.data.Country c): Record how often country references are made.
- countryMentionCount(): Calculate country mention totals and ratios.
- boolean countryObserved(String cc): Have you seen this country before?
- boolean countryObserved(org.opensextant.data.Country C): Have you seen this country before?
- void enablePersonNameMatching(boolean b): Deprecated.
- org.opensextant.data.Place evaluateCoordinate(org.opensextant.data.Geocoding g): Compound method that is crucial in reverse geocoding COORDINATE to KNOWN PLACE.
- List<org.opensextant.extraction.TextMatch> extract(String input_buf): Generic tagging.
- List<org.opensextant.extraction.TextMatch> extract(org.opensextant.data.TextInput input): See extract(TextInput, Parameters) below.
- List<org.opensextant.extraction.TextMatch> extract(org.opensextant.data.TextInput input, org.opensextant.processing.Parameters jobParams): Extractor.extract() first calls XCoord to get coordinates, then PlacenameMatcher. In the end you have all geo entities ranked and scored.
- getName()
- boolean isCoordExtractionEnabled()
- boolean isPersonNameMatchingEnabled(): Deprecated. This has no effect.
- void locationInScope(org.opensextant.data.Geocoding geo): When coordinates are found track them.
- placeMentionCount(): Weight mentions or indirect references to Provinces in the document.
- boolean placeObserved(org.opensextant.data.Place p): Tell us if this place P was inferred by hard location mentions.
- void reportMetrics(): We have some emerging metrics to report out...
- void setParameters(org.opensextant.processing.Parameters p)
- void setRules(List<GeocodeRule> rlist): You don't like the default rule set,..
Methods inherited from class org.opensextant.extractors.geo.GazetteerMatcher
createPlace, createTag, getCoreName, getFiltrationRatio, getGazetteer, getMatcherParameters, initialize, placesAt, reportMemory, searchAdvanced, searchAdvanced, setAllowLowerCase, setAllowLowerCaseAbbreviations, setEnableCaseFilter, setEnableCodeHunter, setMatchFilter, tagText, tagText, tagText, tagText, tagText
Methods inherited from class org.opensextant.extraction.SolrMatcherSupport
getRetrievingNamesTime, getTaggingNamesTime, getTotalTime, setTaggerHandler, tagTextCallSolrTagger
-
Field Details
-
VERSION
-
METHOD_DEFAULT
-
taxonCatalogs
-
COORDINATE_PROXIMITY_CITY_THRESHOLD
public static final int COORDINATE_PROXIMITY_CITY_THRESHOLD
Find nearest city within r=25 KM to infer geography of a given coordinate, e.g., What state is (x,y) in? Found locations are sorted by distance to point.
-
COORDINATE_PROXIMITY_ADM1_THRESHOLD
public static final int COORDINATE_PROXIMITY_ADM1_THRESHOLD
-
-
Constructor Details
-
PlaceGeocoder
public PlaceGeocoder() throws org.opensextant.ConfigException
A default geocoding app that demonstrates how to invoke the geocoding pipeline start to finish. It uses XCoord to parse and geocode coordinates, SolrGazetteer/GazetteerMatcher to match named places, and XTax to tag person names. Match filters and rules work in conjunction to further filter and tag any candidates.
- Throws:
org.opensextant.ConfigException - if resource files could not be found in CLASSPATH
-
PlaceGeocoder
public PlaceGeocoder(boolean lowercaseAllowed) throws org.opensextant.ConfigException
- Parameters:
lowercaseAllowed - if lower case abbreviations are allowed. See GazetteerMatcher
- Throws:
org.opensextant.ConfigException - if resource files could not be found in CLASSPATH
-
-
Method Details
-
getName
- Specified by:
getName in interface org.opensextant.extraction.Extractor
-
configure
Configure an Extractor using a config file named by a path.
- Specified by:
configure in interface org.opensextant.extraction.Extractor
- Parameters:
patfile - configuration file path
- Throws:
org.opensextant.ConfigException - on error
-
configure
Configure an Extractor using a config file named by a URL.
- Specified by:
configure in interface org.opensextant.extraction.Extractor
- Parameters:
patfile - configuration URL
- Throws:
org.opensextant.ConfigException
-
reportMetrics
public void reportMetrics()
We have some emerging metrics to report out...
-
configure
public void configure() throws org.opensextant.ConfigException
We do whatever is needed to initialize resources; that varies depending on the use case. Guidelines: this class is custodian of the app controller, Corpus feeder, and any Document instances passed into or out of the feeder. This geocoder requires a default /exclusions/person-name-filter.txt, which can be empty, but most often it will be a list of person names (which are non-place names).
Rules configured, in approximate order:
- CountryRule -- tag all country names
- NameCodeRule -- parse any Name, CODE, or Name1, Name2 patterns for "Place, AdminPlace" evidence
- PersonNameRule -- annotate, negate any patterns or matches that appear to be known celebrity persons or organizations. Qualified places are not negated, e.g., "Eugene, Oregon" is a place; "Eugene" with no other evidence is a person name.
- CoordRule -- if requested, parse any coordinate patterns; reverse geocode Country + Province.
- ProvinceAssociationRule -- associate places with the Province inferred by coordinates.
- MajorPlaceRule -- identify major places by feature type, class or location population.
- LocationChooserRule -- final rule that assigns confidence and chooses best location(s)
- Specified by:
configure in interface org.opensextant.extraction.Extractor
- Throws:
org.opensextant.ConfigException - on error
-
addRule
Add your own geocode rules to add evidence, adjust scores, outright choose Place instances on PlaceCandidates, etc. As long as your rule implements or overrides the GeocodeRule.evaluate() methods, candidate tags will be fully evaluated.
- Parameters:
r - a rule
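To illustrate how an added rule takes part in evaluation, here is a minimal, self-contained sketch. The `Candidate`, `Rule`, and `QualifiedNameRule` types are hypothetical stand-ins, not the library's PlaceCandidate or GeocodeRule classes. Only the idea of rules applied in order, each adding evidence or adjusting a score, is taken from the description above.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-ins for illustration only; not the OpenSextant classes. */
class RuleChainSketch {

    /** A tagged name with a running score, standing in for a place candidate. */
    static class Candidate {
        final String text;
        double score = 0.0;
        final List<String> evidence = new ArrayList<String>();

        Candidate(String text) {
            this.text = text;
        }
    }

    /** Stand-in for a rule with a single evaluate step. */
    interface Rule {
        void evaluate(Candidate c);
    }

    private final List<Rule> rules = new ArrayList<Rule>();

    /** Append a rule; rules run in the order they were added. */
    void addRule(Rule r) {
        rules.add(r);
    }

    /** Apply every rule, in order, to every candidate. */
    void evaluateAll(List<Candidate> candidates) {
        for (Candidate c : candidates) {
            for (Rule r : rules) {
                r.evaluate(c);
            }
        }
    }

    /** Example custom rule: boost names written as "Place, AdminPlace". */
    static class QualifiedNameRule implements Rule {
        @Override
        public void evaluate(Candidate c) {
            if (c.text.contains(",")) {
                c.score += 1.0;
                c.evidence.add("QualifiedName");
            }
        }
    }
}
```

In this sketch a candidate such as "Eugene, Oregon" gains evidence while a bare "Eugene" does not; a real rule would work against the library's own candidate and rule types.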
-
setRules
You don't like the default rule set? Add your own.
-
cleanup
public void cleanup()
Please shutdown the application cleanly when done.
- Specified by:
cleanup in interface org.opensextant.extraction.Extractor
-
close
public void close()
Description copied from class: SolrMatcherSupport
Close solr resources.
- Specified by:
close in interface AutoCloseable
- Specified by:
close in interface Closeable
- Overrides:
close in class SolrMatcherSupport
-
setParameters
public void setParameters(org.opensextant.processing.Parameters p)
-
isCoordExtractionEnabled
public boolean isCoordExtractionEnabled()
-
isPersonNameMatchingEnabled
Deprecated. This has no effect.
Tagging Parameters here are mainly considered for filtering output. Person name matching is ALWAYS on; this flag indicates whether results are reported in the returned array.
-
enablePersonNameMatching
Deprecated.
- Parameters:
b
-
extract
public List<org.opensextant.extraction.TextMatch> extract(org.opensextant.data.TextInput input) throws org.opensextant.extraction.ExtractionException
See extract(TextInput, Parameters) below. This is the default extraction routine. If you need to tune extraction, call extract(input, parameters).
- Specified by:
extract in interface org.opensextant.extraction.Extractor
- Throws:
org.opensextant.extraction.ExtractionException
-
extract
public List<org.opensextant.extraction.TextMatch> extract(org.opensextant.data.TextInput input, org.opensextant.processing.Parameters jobParams) throws org.opensextant.extraction.ExtractionException
Extractor.extract() first calls XCoord to get coordinates, then PlacenameMatcher. In the end you have all geo entities ranked and scored. LangID can be set on TextInput input.langid. Use only lowercase langIDs: 'zh' and 'ar' tag text for those languages in particular. Null and other values are treated as generic as of v2.8.
Use TextMatch.getType() to determine how to interpret TextMatch / Geocoding results. Given TextMatch match:
- Place tag: ((PlaceCandidate) match).getChosen(), OR
- Coord tag: (Geocoding) match, OR
- Other tag: match might be TaxonMatch or Pattern (PoliMatch)
- Parameters:
input - input buffer, doc ID, and optional langID.
- Returns:
TextMatch instances which are all PlaceCandidates.
- Throws:
org.opensextant.extraction.ExtractionException - on error
-
countryInScope
public void countryInScope(org.opensextant.data.Country c)
Record how often country references are made.
- Specified by:
countryInScope in interface CountryObserver
- Parameters:
c - country obj
-
countryCount
public int countryCount()
- Specified by:
countryCount in interface CountryObserver
-
countryMentionCount
Calculate country mention totals and ratios. These ratios help qualify what the document is about. These may be mentions in text or inferred mentions of the countries listed, e.g., a coordinate infers a particular country.
- Specified by:
countryMentionCount in interface CountryObserver
- Returns:
map of country code : counts
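The totals and ratios described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the library's implementation: the class and its `record`, `seen`, `distinctCount`, and `ratios` methods are hypothetical names, and the ratio is assumed to be each country's share of all recorded mentions.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical tally of country mentions; names and ratio definition are assumptions. */
class CountryTallySketch {
    private final Map<String, Integer> counts = new LinkedHashMap<String, Integer>();
    private int total = 0;

    /** Record one direct or inferred mention of a country code. */
    void record(String cc) {
        Integer n = counts.get(cc);
        counts.put(cc, n == null ? 1 : n + 1);
        total++;
    }

    /** Has this country code been recorded before? */
    boolean seen(String cc) {
        return counts.containsKey(cc);
    }

    /** Number of distinct country codes recorded. */
    int distinctCount() {
        return counts.size();
    }

    /** Share of all mentions attributed to each country, between 0 and 1. */
    Map<String, Double> ratios() {
        Map<String, Double> result = new LinkedHashMap<String, Double>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            result.put(e.getKey(), e.getValue() / (double) total);
        }
        return result;
    }
}
```

A document that yields three mentions of one country and one of another would, under this definition, be weighted 0.75 and 0.25 respectively.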
-
placeMentionCount
Weight mentions or indirect references to Provinces in the document.
- Specified by:
placeMentionCount in interface BoundaryObserver
-
countryInScope
Record how often country references are made.
- Specified by:
countryInScope in interface CountryObserver
- Parameters:
cc - country code
-
countryObserved
Description copied from interface: CountryObserver
Have you seen this country before?
- Specified by:
countryObserved in interface CountryObserver
- Parameters:
cc - country code
- Returns:
true if observer saw country
-
countryObserved
public boolean countryObserved(org.opensextant.data.Country C)
Description copied from interface: CountryObserver
Have you seen this country before?
- Specified by:
countryObserved in interface CountryObserver
- Parameters:
C - country object
- Returns:
true if observer saw country
-
locationInScope
public void locationInScope(org.opensextant.data.Geocoding geo)
When coordinates are found, track them. A coordinate is critical: it informs us of city, province, and country. If the location is offshore or in no-man's-land, these chains of observers should respect that and fail quietly. There are at least two opportunities here:
1. Given a geo coordinate, use that hard location to disambiguate other named places.
2. Given a geo coordinate, identify the nearest known place(s). Such places may not be present in the document or text.
The first improves overall location accuracy; the second offers location enrichment and discovery.
- Specified by:
locationInScope in interface LocationObserver
-
boundaryLevel1InScope
Observer pattern that sees any time a possible boundary (state, province, district, etc.) is mentioned. Example: a mention of "Florida" linked to location Florida (ADM1, FL, US) infers the boundary "US.FL", as would "Miami" (PPL, FL, US). We care more about the distinct and various mentions than the location counts; e.g., "Florida" has 185 locations worldwide, with multiples in some countries.
- Specified by:
boundaryLevel1InScope in interface BoundaryObserver
- Parameters:
nameNorm - text or name related to the place, p
p - ID of a boundary.
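The "Florida" and "Miami" example above can be sketched as a tally of distinct names per boundary. This is an illustrative stand-in, not the library's code: the class, its method names, and the use of separate country and ADM1 code strings (rather than a Place object) are assumptions; only the "US.FL" key form and the emphasis on distinct mentions come from the description.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical boundary tally; tracks distinct names, not location counts. */
class BoundaryTallySketch {
    /** Boundary key (e.g., "US.FL") mapped to the distinct normalized names that implied it. */
    private final Map<String, Set<String>> mentions = new HashMap<String, Set<String>>();

    /** Build a key such as "US.FL" from a country code and an ADM1 code. */
    static String boundaryKey(String countryCode, String adm1Code) {
        return countryCode + "." + adm1Code;
    }

    /** Record that a normalized name implied the given level-1 boundary. */
    void record(String nameNorm, String countryCode, String adm1Code) {
        String key = boundaryKey(countryCode, adm1Code);
        Set<String> names = mentions.get(key);
        if (names == null) {
            names = new HashSet<String>();
            mentions.put(key, names);
        }
        names.add(nameNorm);
    }

    /** Number of distinct names that implied the boundary; repeated names count once. */
    int distinctMentions(String key) {
        Set<String> names = mentions.get(key);
        return names == null ? 0 : names.size();
    }
}
```

Using a set per boundary means that many gazetteer entries for the same name in one boundary add a single mention, which matches the stated preference for distinct mentions over location counts.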
-
boundaryLevel2InScope
Description copied from interface: BoundaryObserver
Given the name (lower case, strip quotes), the location candidate infers an ADMIN boundary.
- Specified by:
boundaryLevel2InScope in interface BoundaryObserver
-
extract
public List<org.opensextant.extraction.TextMatch> extract(String input_buf) throws org.opensextant.extraction.ExtractionException
Generic tagging. No doc ID or language ID given. Nothing language specific will be done here.
- Specified by:
extract in interface org.opensextant.extraction.Extractor
- Throws:
org.opensextant.extraction.ExtractionException
-
evaluateCoordinate
public org.opensextant.data.Place evaluateCoordinate(org.opensextant.data.Geocoding g) throws org.apache.solr.client.solrj.SolrServerException, IOException
Compound method that is crucial in reverse geocoding COORDINATE to KNOWN PLACE. A method to retrieve one or more distinct admin boundaries containing the coordinate. This depends on the resolution of the gazetteer at hand. Secondarily, as nearby places are encountered they are added to a coordinate, providing a basic reverse-geocoding solution.
- Parameters:
g - geo coordinate found in text.
- Returns:
Place object near the geocoding.
- Throws:
org.apache.solr.client.solrj.SolrServerException - a query against the Solr index may throw a Solr error.
IOException
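As a rough illustration of the proximity idea behind this method and COORDINATE_PROXIMITY_CITY_THRESHOLD (nearest city within r=25 KM, results sorted by distance), the sketch below filters and sorts an in-memory list by great-circle distance. It is not how the library queries its gazetteer: the `City` type, the `within` helper, and the haversine formula with a mean Earth radius of 6371 km are assumptions chosen for a self-contained example.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

/** Hypothetical in-memory proximity search; the library itself queries a Solr gazetteer. */
class NearestPlaceSketch {
    static final double EARTH_RADIUS_KM = 6371.0;

    /** Minimal stand-in for a gazetteer place. */
    static class City {
        final String name;
        final double lat;
        final double lon;

        City(String name, double lat, double lon) {
            this.name = name;
            this.lat = lat;
            this.lon = lon;
        }
    }

    /** Great-circle distance in kilometers using the haversine formula. */
    static double haversineKm(double lat1, double lon1, double lat2, double lon2) {
        double dLat = Math.toRadians(lat2 - lat1);
        double dLon = Math.toRadians(lon2 - lon1);
        double a = Math.sin(dLat / 2) * Math.sin(dLat / 2)
                + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
                * Math.sin(dLon / 2) * Math.sin(dLon / 2);
        return 2 * EARTH_RADIUS_KM * Math.asin(Math.sqrt(a));
    }

    /** Cities within radiusKm of the point, nearest first. */
    static List<City> within(final double lat, final double lon, List<City> cities, double radiusKm) {
        List<City> hits = new ArrayList<City>();
        for (City c : cities) {
            if (haversineKm(lat, lon, c.lat, c.lon) <= radiusKm) {
                hits.add(c);
            }
        }
        Collections.sort(hits, new Comparator<City>() {
            @Override
            public int compare(City x, City y) {
                return Double.compare(
                        haversineKm(lat, lon, x.lat, x.lon),
                        haversineKm(lat, lon, y.lat, y.lon));
            }
        });
        return hits;
    }
}
```

The first element of the returned list plays the role of the "nearest city"; its province and country could then be attached to the coordinate, which is the inference the field description mentions.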
-
placeObserved
public boolean placeObserved(org.opensextant.data.Place p)
Tell us if this place P was inferred by hard location mentions.
- Specified by:
placeObserved in interface LocationObserver
-