Class PlaceGeocoder

All Implemented Interfaces:
Closeable, AutoCloseable, org.opensextant.extraction.Extractor, BoundaryObserver, CountryObserver, LocationObserver

public class PlaceGeocoder extends GazetteerMatcher implements org.opensextant.extraction.Extractor, CountryObserver, BoundaryObserver, LocationObserver
A simple variation on the geocoding algorithms: geotag all possible things and determine a best geo-location for each tagged item. This uses the following components:
  • PlacenameMatcher: place name tagging and gazetteering
  • XCoord: coordinate extraction
  • geo.rules.* pkg: disambiguation rules to choose the best location for tagged names
Author:
Marc C. Ubaldino, MITRE, ubaldino at mitre dot org
  • Field Details

    • VERSION

      public static final String VERSION
    • METHOD_DEFAULT

      public static final String METHOD_DEFAULT
    • taxonCatalogs

      public final Set<String> taxonCatalogs
    • COORDINATE_PROXIMITY_CITY_THRESHOLD

      public static final int COORDINATE_PROXIMITY_CITY_THRESHOLD
      Find the nearest city within r = 25 km to infer the geography of a given coordinate, e.g., what state is (x,y) in? Found locations are sorted by distance to the point.
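      The radius search described here can be sketched in isolation. The following is a minimal illustration, not the library's implementation: City, Hit, citiesWithin, and RADIUS_KM are hypothetical names, the in-memory list stands in for the Solr-backed gazetteer, and the haversine distance is an assumed choice of metric.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

final class NearestCitySketch {
    /** Hypothetical stand-in for a gazetteer entry; not an OpenSextant class. */
    static final class City {
        final String name;
        final String adm1;
        final double lat;
        final double lon;

        City(String name, String adm1, double lat, double lon) {
            this.name = name;
            this.adm1 = adm1;
            this.lat = lat;
            this.lon = lon;
        }
    }

    /** A city paired with its distance to the query point. */
    static final class Hit {
        final City city;
        final double km;

        Hit(City city, double km) {
            this.city = city;
            this.km = km;
        }
    }

    /** Assumed to mirror the 25 km radius mentioned in the field description. */
    static final double RADIUS_KM = 25.0;
    static final double EARTH_RADIUS_KM = 6371.0088;

    /** Great-circle distance in kilometers (haversine formula). */
    static double haversineKm(double lat1, double lon1, double lat2, double lon2) {
        double dLat = Math.toRadians(lat2 - lat1);
        double dLon = Math.toRadians(lon2 - lon1);
        double a = Math.sin(dLat / 2) * Math.sin(dLat / 2)
                + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
                * Math.sin(dLon / 2) * Math.sin(dLon / 2);
        return 2 * EARTH_RADIUS_KM * Math.asin(Math.sqrt(a));
    }

    /** Cities within radiusKm of (lat, lon), nearest first. */
    static List<Hit> citiesWithin(List<City> cities, double lat, double lon, double radiusKm) {
        List<Hit> hits = new ArrayList<>();
        for (City c : cities) {
            double d = haversineKm(lat, lon, c.lat, c.lon);
            if (d <= radiusKm) {
                hits.add(new Hit(c, d));
            }
        }
        hits.sort(Comparator.comparingDouble(h -> h.km));
        return hits;
    }

    public static void main(String[] args) {
        List<City> gazetteer = Arrays.asList(
                new City("Boston", "US.MA", 42.3601, -71.0589),
                new City("Providence", "US.RI", 41.8240, -71.4128),
                new City("Cambridge", "US.MA", 42.3736, -71.1097));
        // The ADM1 code of the nearest hit answers "what state is (x,y) in?"
        for (Hit h : citiesWithin(gazetteer, 42.37, -71.11, RADIUS_KM)) {
            System.out.printf("%s (%s) at %.1f km%n", h.city.name, h.city.adm1, h.km);
        }
    }
}
```

      Sorting by distance means the first hit is the natural candidate for inferring the province of the coordinate.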
    • COORDINATE_PROXIMITY_ADM1_THRESHOLD

      public static final int COORDINATE_PROXIMITY_ADM1_THRESHOLD
  • Constructor Details

    • PlaceGeocoder

      public PlaceGeocoder() throws org.opensextant.ConfigException
      A default geocoding app that demonstrates how to invoke the geocoding pipeline from start to finish. It uses XCoord to parse and geocode coordinates, SolrGazetteer/GazetteerMatcher to match named places, and XTax to tag person names. Match filters and rules work in conjunction to further filter and tag any candidates.
      Throws:
      org.opensextant.ConfigException - if resource files could not be found in CLASSPATH
    • PlaceGeocoder

      public PlaceGeocoder(boolean lowercaseAllowed) throws org.opensextant.ConfigException
      Parameters:
      lowercaseAllowed - if lower case abbreviations are allowed. See GazetteerMatcher
      Throws:
      org.opensextant.ConfigException - if resource files could not be found in CLASSPATH
  • Method Details

    • getName

      public String getName()
      Specified by:
      getName in interface org.opensextant.extraction.Extractor
    • configure

      public void configure(String patfile) throws org.opensextant.ConfigException
      Configure an Extractor using a config file named by a path.
      Specified by:
      configure in interface org.opensextant.extraction.Extractor
      Parameters:
      patfile - configuration file path
      Throws:
      org.opensextant.ConfigException - on error
    • configure

      public void configure(URL patfile) throws org.opensextant.ConfigException
      Configure an Extractor using a config file named by a URL
      Specified by:
      configure in interface org.opensextant.extraction.Extractor
      Parameters:
      patfile - configuration URL
      Throws:
      org.opensextant.ConfigException
    • reportMetrics

      public void reportMetrics()
      We have some emerging metrics to report out...
    • configure

      public void configure() throws org.opensextant.ConfigException
      Initializes whatever resources are needed; this varies depending on the use case. Guidelines: this class is custodian of the app controller, Corpus feeder, and any Document instances passed into/out of the feeder. This geocoder requires a default /exclusions/person-name-filter.txt, which can be empty, but most often it will be a list of person names (which are non-place names).

      Rules are configured in this approximate order:
      • CountryRule -- tag all country names
      • NameCodeRule -- parse any Name, CODE, or Name1, Name2 patterns for "Place, AdminPlace" evidence
      • PersonNameRule -- annotate, negate any patterns or matches that appear to be known celebrity persons or organizations. Qualified places are not negated, e.g., "Eugene, Oregon" is a place; "Eugene" with no other evidence is a person name.
      • CoordRule -- if requested, parse any coordinate patterns; Reverse geocode Country + Province.
      • ProvinceAssociationRule -- associate places with Province inferred by coordinates.
      • MajorPlaceRule -- identify major places by feature type, class or location population.
      • LocationChooserRule -- final rule that assigns confidence and chooses best location(s)
      Your Rule Here -- use addRule( GeocodeRule ) to add a rule on the stack. It will be evaluated just before the final LocationChooserRule. Your rule should improve Place scores on PlaceCandidates and name the rules that fire.
      Specified by:
      configure in interface org.opensextant.extraction.Extractor
      Throws:
      org.opensextant.ConfigException - on error
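      The ordering guarantee described for configure() (default rules first, custom rules next, the location chooser always last) can be illustrated with a small model. This is only a sketch under assumptions: Rule, named, and RuleStackSketch are hypothetical stand-ins, not GeocodeRule or the internals of this class.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class RuleStackSketch {
    /** Hypothetical stand-in for a geocode rule; it only carries a name here. */
    interface Rule {
        String name();
    }

    /** Convenience factory for a rule that does nothing but report its name. */
    static Rule named(String name) {
        return () -> name;
    }

    private final List<Rule> rules = new ArrayList<>();
    private final Rule chooser;

    RuleStackSketch(List<Rule> defaults, Rule finalChooser) {
        rules.addAll(defaults);
        chooser = finalChooser;
    }

    /** Custom rules go after the defaults but stay ahead of the final chooser. */
    void addRule(Rule r) {
        rules.add(r);
    }

    /** Rule names in evaluation order: defaults, custom rules, then the chooser. */
    List<String> evaluationOrder() {
        List<String> order = new ArrayList<>();
        for (Rule r : rules) {
            order.add(r.name());
        }
        order.add(chooser.name());
        return order;
    }

    public static void main(String[] args) {
        RuleStackSketch stack = new RuleStackSketch(
                Arrays.asList(named("CountryRule"), named("NameCodeRule"), named("CoordRule")),
                named("LocationChooserRule"));
        stack.addRule(named("MyCustomRule"));
        System.out.println(stack.evaluationOrder());
    }
}
```

      Keeping the chooser outside the mutable list is one simple way to ensure that no added rule can run after the final selection step.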
    • addRule

      public void addRule(GeocodeRule r)
      Add your own geocode rules to add evidence, adjust scores, outright choose Place instances on PlaceCandidates, etc. As long as your rule implements or overrides the GeocodeRule.evaluate() methods, candidate tags will be fully evaluated.
      Parameters:
      r - a rule
    • setRules

      public void setRules(List<GeocodeRule> rlist)
      If you don't like the default rule set, supply your own.
    • cleanup

      public void cleanup()
      Please shutdown the application cleanly when done.
      Specified by:
      cleanup in interface org.opensextant.extraction.Extractor
    • close

      public void close()
      Description copied from class: SolrMatcherSupport
      Close solr resources.
      Specified by:
      close in interface AutoCloseable
      Specified by:
      close in interface Closeable
      Overrides:
      close in class SolrMatcherSupport
    • setParameters

      public void setParameters(org.opensextant.processing.Parameters p)
    • isCoordExtractionEnabled

      public boolean isCoordExtractionEnabled()
    • isPersonNameMatchingEnabled

      @Deprecated public boolean isPersonNameMatchingEnabled()
      Deprecated.
      This has no effect. Tagging Parameters here are mainly considered for filtering output.
      Person name matching is ALWAYS on; this flag indicates whether results are reported in the returned array.
      Returns:
    • enablePersonNameMatching

      @Deprecated public void enablePersonNameMatching(boolean b)
      Deprecated.
      Parameters:
      b -
    • extract

      public List<org.opensextant.extraction.TextMatch> extract(org.opensextant.data.TextInput input) throws org.opensextant.extraction.ExtractionException
      See extract(TextInput, Parameters) below. This is the default extraction routine. If you need to tune extraction, call extract(input, parameters).
      Specified by:
      extract in interface org.opensextant.extraction.Extractor
      Throws:
      org.opensextant.extraction.ExtractionException
    • extract

      public List<org.opensextant.extraction.TextMatch> extract(org.opensextant.data.TextInput input, org.opensextant.processing.Parameters jobParams) throws org.opensextant.extraction.ExtractionException
      Extractor.extract() first calls XCoord to get coordinates, then PlacenameMatcher. In the end you have all geo entities ranked and scored. A language ID can be set on the TextInput (input.langid). Use only lowercase language IDs: 'zh' and 'ar', for example, cause text to be tagged for those languages in particular. Null and other values are treated as generic as of v2.8.

      Use TextMatch.getType() to determine how to interpret TextMatch / Geocoding results:

      • Given TextMatch match, then
      • Place tag: ((PlaceCandidate)match).getChosen() OR
      • Coord tag: (Geocoding)match, OR
      • Other tag: match might be TaxonMatch or Pattern (PoliMatch)
      Both methods yield a geocoding.
      Parameters:
      input - input buffer, doc ID, and optional langID.
      Returns:
      TextMatch instances which are all PlaceCandidates.
      Throws:
      org.opensextant.extraction.ExtractionException - on error
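      The interpretation steps listed for extract() amount to a dispatch on the kind of match. The sketch below models that dispatch with hypothetical stand-in classes (Match, PlaceMatch, CoordMatch); it does not use the real TextMatch, PlaceCandidate, or Geocoding types, and it checks the runtime class rather than assuming particular getType() values.

```java
import java.util.Arrays;
import java.util.List;

final class MatchDispatchSketch {
    /** Hypothetical stand-in for a generic text match. */
    static class Match {
        final String text;

        Match(String text) {
            this.text = text;
        }
    }

    /** Hypothetical place tag; chosen is the best location, or null if none was picked. */
    static class PlaceMatch extends Match {
        final String chosen;

        PlaceMatch(String text, String chosen) {
            super(text);
            this.chosen = chosen;
        }
    }

    /** Hypothetical coordinate tag; the match itself carries the location. */
    static class CoordMatch extends Match {
        final double lat;
        final double lon;

        CoordMatch(String text, double lat, double lon) {
            super(text);
            this.lat = lat;
            this.lon = lon;
        }
    }

    /** Summarize a match according to its kind. */
    static String describe(Match m) {
        if (m instanceof PlaceMatch) {
            PlaceMatch p = (PlaceMatch) m;
            return p.chosen == null ? "place, unresolved: " + p.text : "place: " + p.chosen;
        }
        if (m instanceof CoordMatch) {
            CoordMatch c = (CoordMatch) m;
            return "coord: " + c.lat + "," + c.lon;
        }
        // Anything else (taxon or pattern matches in the real API) has no geocoding.
        return "other: " + m.text;
    }

    public static void main(String[] args) {
        List<Match> matches = Arrays.asList(
                new PlaceMatch("Miami", "Miami (PPL, FL, US)"),
                new CoordMatch("25.5N 80.25W", 25.5, -80.25),
                new Match("John Smith"));
        for (Match m : matches) {
            System.out.println(describe(m));
        }
    }
}
```

      With the real API, the equivalent branches would cast to PlaceCandidate and call getChosen(), or treat the match as a Geocoding, as the list above describes.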
    • countryInScope

      public void countryInScope(org.opensextant.data.Country c)
      Record how often country references are made.
      Specified by:
      countryInScope in interface CountryObserver
      Parameters:
      c - country obj
    • countryCount

      public int countryCount()
      Specified by:
      countryCount in interface CountryObserver
    • countryMentionCount

      public Map<String,CountryCount> countryMentionCount()
      Calculate country mention totals and ratios. These ratios help qualify what the document is about. These may be mentions in text or inferred mentions to the countries listed, e.g., a coord infers a particular country.
      Specified by:
      countryMentionCount in interface CountryObserver
      Returns:
      map of country code : counts
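      As an illustration of mention totals and ratios, the sketch below tallies a list of observed country codes. Tally and tally() are hypothetical names, not the library's CountryCount, and the ratio is assumed here to be a country's share of all observed mentions.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class CountryRatioSketch {
    /** Hypothetical per-country tally; not the library's CountryCount. */
    static final class Tally {
        final String cc;
        int count;
        double ratio;

        Tally(String cc) {
            this.cc = cc;
        }
    }

    /** Count each observed country code, then compute its share of all mentions. */
    static Map<String, Tally> tally(List<String> observedCodes) {
        Map<String, Tally> totals = new LinkedHashMap<>();
        for (String cc : observedCodes) {
            totals.computeIfAbsent(cc, Tally::new).count++;
        }
        int total = observedCodes.size();
        for (Tally t : totals.values()) {
            t.ratio = (double) t.count / total;
        }
        return totals;
    }

    public static void main(String[] args) {
        // Codes may come from names in text or be inferred, e.g., from a coordinate.
        Map<String, Tally> totals = tally(Arrays.asList("US", "US", "MX", "US"));
        for (Tally t : totals.values()) {
            System.out.printf("%s: %d mentions, ratio %.2f%n", t.cc, t.count, t.ratio);
        }
    }
}
```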
    • placeMentionCount

      public Map<String,PlaceCount> placeMentionCount()
      Weight mentions or indirect references to Provinces in the document
      Specified by:
      placeMentionCount in interface BoundaryObserver
      Returns:
    • countryInScope

      public void countryInScope(String cc)
      Record how often country references are made.
      Specified by:
      countryInScope in interface CountryObserver
      Parameters:
      cc - country code
    • countryObserved

      public boolean countryObserved(String cc)
      Description copied from interface: CountryObserver
      Have you seen this country before?
      Specified by:
      countryObserved in interface CountryObserver
      Parameters:
      cc - country code
      Returns:
      true if observer saw country
    • countryObserved

      public boolean countryObserved(org.opensextant.data.Country C)
      Description copied from interface: CountryObserver
      Have you seen this country before?
      Specified by:
      countryObserved in interface CountryObserver
      Parameters:
      C - country object
      Returns:
      true if observer saw country
    • locationInScope

      public void locationInScope(org.opensextant.data.Geocoding geo)
      When coordinates are found, track them. A coordinate is critical: it informs us of city, province, and country. If the location is offshore or in no man's land, these chains of observers should respect that and fail quietly. There are at least two opportunities here:
      1. Given a geo coordinate, use that hard location to disambiguate other named places.
      2. Given a geo coordinate, identify the nearest known place(s). Such places may not be presented in the document or text.
      The first improves overall location accuracy; the second offers location enrichment and discovery.
      Specified by:
      locationInScope in interface LocationObserver
    • boundaryLevel1InScope

      public void boundaryLevel1InScope(String nameNorm, org.opensextant.data.Place p)
      Observer pattern that fires any time a possible boundary (state, province, district, etc.) is mentioned. Example: a mention of "Florida" linked to the location Florida (ADM1, FL, US) infers the boundary "US.FL", as does "Miami" (PPL, FL, US). We care more about the distinct and various mentions than about the location counts; e.g., "Florida" has 185 locations worldwide, with multiples in some countries.
      Specified by:
      boundaryLevel1InScope in interface BoundaryObserver
      Parameters:
      nameNorm - text or name related to the place, p
      p - ID of a boundary.
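      The emphasis on distinct mentions over location counts can be sketched as a map from boundary key to the set of names that implied it. BoundaryMentionSketch and its methods are hypothetical; the "US.FL" key format follows the example above, and the real method receives a Place rather than separate country and ADM1 codes.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

final class BoundaryMentionSketch {
    /** Boundary key (e.g., "US.FL") mapped to the distinct names that implied it. */
    private final Map<String, Set<String>> mentions = new TreeMap<>();

    /** Record that a mention of nameNorm implies the boundary countryCode.adm1. */
    void boundaryInScope(String nameNorm, String countryCode, String adm1) {
        String key = countryCode + "." + adm1;
        mentions.computeIfAbsent(key, k -> new TreeSet<>()).add(nameNorm);
    }

    /** Number of distinct names seen for a boundary; repeats do not add weight. */
    int distinctMentions(String key) {
        Set<String> names = mentions.get(key);
        return names == null ? 0 : names.size();
    }

    public static void main(String[] args) {
        BoundaryMentionSketch observer = new BoundaryMentionSketch();
        observer.boundaryInScope("florida", "US", "FL");
        observer.boundaryInScope("miami", "US", "FL");
        observer.boundaryInScope("florida", "US", "FL"); // repeated name, same boundary
        System.out.println("US.FL distinct mentions: " + observer.distinctMentions("US.FL"));
    }
}
```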
    • boundaryLevel2InScope

      public void boundaryLevel2InScope(String nameNorm, org.opensextant.data.Place p)
      Description copied from interface: BoundaryObserver
      Given the name (lower case, strip quotes), the location candidate infers an ADMIN boundary
      Specified by:
      boundaryLevel2InScope in interface BoundaryObserver
    • extract

      public List<org.opensextant.extraction.TextMatch> extract(String input_buf) throws org.opensextant.extraction.ExtractionException
      Generic tagging. No doc ID or language ID given. Nothing language specific will be done here.
      Specified by:
      extract in interface org.opensextant.extraction.Extractor
      Throws:
      org.opensextant.extraction.ExtractionException
    • evaluateCoordinate

      public org.opensextant.data.Place evaluateCoordinate(org.opensextant.data.Geocoding g) throws org.apache.solr.client.solrj.SolrServerException, IOException
      Compound method that is crucial in reverse geocoding a COORDINATE to a KNOWN PLACE.

      A method to retrieve one or more distinct admin boundaries containing the coordinate. This depends on the resolution of the gazetteer at hand. Secondarily, as nearby places are encountered, they are added to the coordinate, providing a basic reverse-geocoding solution.

      Parameters:
      g - geo coordinate found in text.
      Returns:
      Place object near the geocoding.
      Throws:
      org.apache.solr.client.solrj.SolrServerException - a query against the Solr index may throw a Solr error.
      IOException
    • placeObserved

      public boolean placeObserved(org.opensextant.data.Place p)
      Tell us if this place P was inferred by hard location mentions
      Specified by:
      placeObserved in interface LocationObserver
      Returns: