Class GazetteerMatcher

java.lang.Object
org.opensextant.extraction.SolrMatcherSupport
org.opensextant.extractors.geo.GazetteerMatcher
All Implemented Interfaces:
Closeable, AutoCloseable
Direct Known Subclasses:
PlaceGeocoder, PostalTagger

public class GazetteerMatcher extends SolrMatcherSupport
Connects to a Solr server via HTTP and tags place names in a document. The SOLR_HOME environment variable must be set to the location of the Solr server.

This class is not thread-safe, although it could be made so with little effort.
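
Because instances must not be shared across threads, one common workaround is to give each worker thread its own instance. The following is a minimal sketch of that pattern only: StandInMatcher and its tag() method are hypothetical stand-ins, not the real GazetteerMatcher, whose constructors also throw a checked ConfigException that would need to be wrapped inside the supplier.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class PerThreadMatcherSketch {

    /** Hypothetical stand-in for a non-thread-safe tagger; NOT the real GazetteerMatcher. */
    static class StandInMatcher {
        // Unsynchronized mutable state, which is what makes sharing unsafe.
        private final StringBuilder scratch = new StringBuilder();

        String tag(String text) {
            scratch.setLength(0);
            scratch.append(text.trim().toUpperCase());
            return scratch.toString();
        }
    }

    // Each worker thread lazily creates its own instance and then reuses it.
    static final ThreadLocal<StandInMatcher> MATCHER =
            ThreadLocal.withInitial(StandInMatcher::new);

    /** Tags every document on a fixed-size pool; results keep the input order. */
    static List<String> tagAll(List<String> docs, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<String>> futures = new ArrayList<>();
            for (String doc : docs) {
                Callable<String> task = () -> MATCHER.get().tag(doc);
                futures.add(pool.submit(task));
            }
            List<String> results = new ArrayList<>();
            for (Future<String> f : futures) {
                results.add(f.get());
            }
            return results;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("tagging failed", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(tagAll(List.of(" boston ", "paris"), 2));
    }
}
```

With the real class, each per-thread instance would also need to be closed when its thread is finished, since the class implements Closeable.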

Author:
David Smiley - dsmiley@mitre.org, Marc Ubaldino - ubaldino@mitre.org
  • Field Details

    • filter

      protected TagFilter filter
    • DEFAULT_TAG_FIELD

      public static final String DEFAULT_TAG_FIELD
      Most languages
    • CJK_TAG_FIELD

      public static final String CJK_TAG_FIELD
      Use Solr param 'field' = name_tag_cjk to tag in Asian scripts.
    • AR_TAG_FIELD

      public static final String AR_TAG_FIELD
      Use Solr param 'field' = name_tag_ar for Arabic. TODO: Generalize this or expand so Farsi and Urdu are managed separately.
    • lang2nameField

      protected static final HashMap<String,String> lang2nameField
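
The tag field constants and the lang2nameField map suggest a lookup from a language ID to the Solr field used for tagging. The sketch below illustrates that idea under stated assumptions: the values name_tag_cjk and name_tag_ar come from the descriptions above, while the default value "name_tag", the map entries, and the fieldFor() helper are hypothetical, since this page does not list them.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

class TagFieldSketch {
    // Values quoted in the field descriptions on this page.
    static final String CJK_TAG_FIELD = "name_tag_cjk";
    static final String AR_TAG_FIELD = "name_tag_ar";
    // ASSUMPTION: the page does not state the default field name.
    static final String DEFAULT_TAG_FIELD = "name_tag";

    // ASSUMPTION: illustrative entries only; the real map contents are not documented here.
    static final Map<String, String> LANG_TO_FIELD = new HashMap<>();
    static {
        LANG_TO_FIELD.put("zh", CJK_TAG_FIELD);
        LANG_TO_FIELD.put("ja", CJK_TAG_FIELD);
        LANG_TO_FIELD.put("ko", CJK_TAG_FIELD);
        LANG_TO_FIELD.put("ar", AR_TAG_FIELD);
    }

    /** Hypothetical helper: picks the tag field for a language ID, falling back to the default. */
    static String fieldFor(String langId) {
        if (langId == null) {
            return DEFAULT_TAG_FIELD;
        }
        return LANG_TO_FIELD.getOrDefault(langId.toLowerCase(Locale.ROOT), DEFAULT_TAG_FIELD);
    }

    public static void main(String[] args) {
        System.out.println(fieldFor("ja"));
        System.out.println(fieldFor("ar"));
        System.out.println(fieldFor("en"));
    }
}
```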
  • Constructor Details

    • GazetteerMatcher

      public GazetteerMatcher() throws org.opensextant.ConfigException
      Throws:
      org.opensextant.ConfigException
    • GazetteerMatcher

      public GazetteerMatcher(boolean lowercaseAllowed) throws org.opensextant.ConfigException
      Parameters:
      lowercaseAllowed - variant is case insensitive.
      Throws:
      org.opensextant.ConfigException - on err
  • Method Details

    • initialize

      public void initialize() throws org.opensextant.ConfigException
      Description copied from class: SolrMatcherSupport
      Initialize. This capability does not support taggers/matchers that use an HTTP server. For now, it is intended for an in-memory, local embedded Solr server.
      Overrides:
      initialize in class SolrMatcherSupport
      Throws:
      org.opensextant.ConfigException - if solr server cannot be established from local index or from http server
    • getCoreName

      public String getCoreName()
      Description copied from class: SolrMatcherSupport
      Be explicit about the solr core to use for tagging.
      Specified by:
      getCoreName in class SolrMatcherSupport
      Returns:
      the core name
    • getMatcherParameters

      public org.apache.solr.common.params.SolrParams getMatcherParameters()
      Description copied from class: SolrMatcherSupport
      Return the Solr Parameters for the tagger op.
      Specified by:
      getMatcherParameters in class SolrMatcherSupport
      Returns:
      SolrParams
    • getGazetteer

      public SolrGazetteer getGazetteer()
      For use within package or by subclass
      Returns:
      internal gazetteer instance
    • reportMemory

      public void reportMemory()
    • setAllowLowerCaseAbbreviations

      public void setAllowLowerCaseAbbreviations(boolean b)
      A flag that allows "in" or "in." to be tagged as a possible abbreviation. By default such lower case forms are not treated as abbreviations: Indiana is typically written IN, In. or Ind., and Oregon is typically written OR or Ore., but almost never 'in' or 'or'.
      Parameters:
      b - flag true = allow lower case abbreviations to be tagged, e.g., as in social media or
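
The effect of this flag can be pictured with a small stand-alone sketch. It is not the library's implementation: the abbreviation list and the acceptAsAbbreviation() helper are hypothetical, and only the setter name mirrors the method documented above.

```java
import java.util.Locale;
import java.util.Set;

class LowerCaseAbbrevSketch {
    // ASSUMPTION: a tiny illustrative abbreviation list, stored in lower case.
    static final Set<String> ABBREVIATIONS = Set.of("in", "in.", "ind.", "or", "ore.");

    // Off by default, matching the documented default behavior.
    private boolean allowLowerCaseAbbreviations = false;

    void setAllowLowerCaseAbbreviations(boolean b) {
        this.allowLowerCaseAbbreviations = b;
    }

    /** Hypothetical check: accepts a token as an abbreviation unless it is all lower case and the flag is off. */
    boolean acceptAsAbbreviation(String token) {
        String lower = token.toLowerCase(Locale.ROOT);
        if (!ABBREVIATIONS.contains(lower)) {
            return false; // not a known abbreviation at all
        }
        boolean allLowerCase = token.equals(lower);
        return !allLowerCase || allowLowerCaseAbbreviations;
    }

    public static void main(String[] args) {
        LowerCaseAbbrevSketch sketch = new LowerCaseAbbrevSketch();
        System.out.println(sketch.acceptAsAbbreviation("IN"));
        System.out.println(sketch.acceptAsAbbreviation("in"));
        sketch.setAllowLowerCaseAbbreviations(true);
        System.out.println(sketch.acceptAsAbbreviation("in"));
    }
}
```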
    • setAllowLowerCase

      public void setAllowLowerCase(boolean b)
      Enable/disable the match filter for lower case matches. Lower case text matches are filtered primarily against stopword lists and length filters.
      Parameters:
      b - flag
    • setEnableCaseFilter

      public void setEnableCaseFilter(boolean b)
      Enable/disable the document-level case filter.
      Parameters:
      b - flag
    • setEnableCodeHunter

      public void setEnableCodeHunter(boolean b)
    • setMatchFilter

      public void setMatchFilter(org.opensextant.extraction.MatchFilter f)
      User-provided filters to filter out matched names immediately. Avoid filtering out things that are indeed places, but require disambiguation or refinement.
      Parameters:
      f - a match filter
    • searchAdvanced

      public List<org.opensextant.data.Place> searchAdvanced(String place, boolean as_solr) throws org.apache.solr.client.solrj.SolrServerException, IOException
      Default advanced search.
      Parameters:
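      place - the place string or text; or a Solr query
      as_solr - true if place is a Solr query to be used as-is
      Returns:
      list of scoreable place entries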
      Throws:
      org.apache.solr.client.solrj.SolrServerException
      IOException
    • searchAdvanced

      public List<org.opensextant.data.Place> searchAdvanced(String place, boolean as_solr, int maxLen) throws org.apache.solr.client.solrj.SolrServerException, IOException
      This is a variation on SolrGazetteer.search(), except that this creates ScoredPlace objects, which are immediately usable for scoring and ranking matches. The score for a ScoredPlace is created when it is added to a PlaceCandidate: a default score is created for the place.
        Usage:
          pc = PlaceCandidate();
          list = gaz.searchAdvanced("name:Boston", true) // solr fielded query used as-is.
          for ScoredPlace p: list: pc.addPlace( p )
       
      Parameters:
      place - the place string or text; or a Solr query
      as_solr - true if place is a Solr query to be used as-is
      maxLen - max length of gazetteer place names.
      Returns:
      places List of scoreable place entries
      Throws:
      org.apache.solr.client.solrj.SolrServerException - the solr server exception
      IOException
    • tagText

      public List<PlaceCandidate> tagText(String buffer, String docid) throws org.opensextant.extraction.ExtractionException
      Geotag a buffer and return all candidates of gazetteer entries whose name matches phrases in the buffer.
      Parameters:
      buffer - text
      docid - ID
      Returns:
      list of place candidates
      Throws:
      org.opensextant.extraction.ExtractionException - on err
    • tagText

      public List<PlaceCandidate> tagText(String buffer, String docid, boolean tagOnly) throws org.opensextant.extraction.ExtractionException
      Throws:
      org.opensextant.extraction.ExtractionException
    • tagText

      public List<PlaceCandidate> tagText(String buffer, String docid, boolean tagOnly, String fld) throws org.opensextant.extraction.ExtractionException
      Throws:
      org.opensextant.extraction.ExtractionException
    • tagText

      public List<PlaceCandidate> tagText(org.opensextant.data.TextInput t, boolean tagOnly) throws org.opensextant.extraction.ExtractionException
      More convenient way of passing input args, using tuple TextInput (buffer, docid, langid)
      Parameters:
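      t - text input (buffer, docid, langid)
      tagOnly - True if you wish to get the matched phrases only. False if you want the full list of Place Candidates.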
      Returns:
      geocoded matches. see tagText()
      Throws:
      org.opensextant.extraction.ExtractionException
    • tagText

      public List<PlaceCandidate> tagText(org.opensextant.data.TextInput input, boolean tagOnly, String fld) throws org.opensextant.extraction.ExtractionException
      Geotag a document, returning PlaceCandidates for the mentions in the document. Optionally just return the PlaceCandidates with name only and no Place objects attached. Names of continents are passed back as matches, with geo matches. Continents are filtered out by default.
      Parameters:
      input - text object
      tagOnly - True if you wish to get the matched phrases only. False if you want the full list of Place Candidates.
      fld - gazetteer field to use for tagging
      Returns:
      place_candidates List of place candidates which may be empty if nothing is found.
      Throws:
      org.opensextant.extraction.ExtractionException - on err
    • getFiltrationRatio

      public double getFiltrationRatio()
      This computes the cumulative filtering rate of user-defined and other non-place name patterns.
      Returns:
      filtration ratio
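
A cumulative filtering rate of this kind can be sketched as a pair of running counters. This is an illustration only: it assumes the ratio is the number of filtered matches divided by the total number of matches seen, which this page does not confirm, and the record() method and counter names are hypothetical.

```java
class FiltrationRatioSketch {
    // Running totals across all documents processed so far (hypothetical names).
    private long totalMatches;
    private long filteredMatches;

    /** Records one raw match and whether a filter removed it. */
    void record(boolean filteredOut) {
        totalMatches++;
        if (filteredOut) {
            filteredMatches++;
        }
    }

    /** ASSUMPTION: ratio = filtered / total; returns 0.0 before any match is recorded. */
    double getFiltrationRatio() {
        return totalMatches == 0 ? 0.0 : (double) filteredMatches / totalMatches;
    }

    public static void main(String[] args) {
        FiltrationRatioSketch stats = new FiltrationRatioSketch();
        stats.record(true);
        stats.record(false);
        stats.record(false);
        stats.record(false);
        System.out.println(stats.getFiltrationRatio());
    }
}
```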
    • createTag

      public Object createTag(org.apache.solr.common.SolrDocument tag)
      Description copied from class: SolrMatcherSupport
      Caller must implement their domain objects, POJOs... this callback handler only hashes them.
      Specified by:
      createTag in class SolrMatcherSupport
      Parameters:
      tag - record to convert to Place record
      Returns:
      object representing a Place
    • createPlace

      public static org.opensextant.data.Place createPlace(org.apache.solr.common.SolrDocument gazEntry)
      Adapt the SolrProxy method for creating a Place object. Here, gazetteer metrics are added for downstream disambiguation.
      Parameters:
      gazEntry - a solr record from the gazetteer
      Returns:
      Place (Xponents) object
    • placesAt

      @Deprecated public List<org.opensextant.data.Place> placesAt(org.opensextant.data.LatLon yx) throws org.apache.solr.client.solrj.SolrServerException, IOException
      Deprecated.
      Use SolrGazetteer directly
      Find places located at a particular location.
      Parameters:
      yx - location
      Returns:
      list of places near location
      Throws:
      org.apache.solr.client.solrj.SolrServerException - on err
      IOException