Class TaxonMatcher

java.lang.Object
org.opensextant.extraction.SolrMatcherSupport
org.opensextant.extractors.xtax.TaxonMatcher
All Implemented Interfaces:
Closeable, AutoCloseable, org.opensextant.extraction.Extractor

public class TaxonMatcher extends SolrMatcherSupport implements org.opensextant.extraction.Extractor
TaxonMatcher uses SolrTextTagger to tag mentions of phrases in documents. The phrases can come from simple word lists or can connect to a taxonomy of sorts. Tagging runs against the "taxcat" Solr core (see Xponents/solr/taxcat).
Author:
Marc Ubaldino - ubaldino@mitre.org
  • Field Details

    • DEFAULT_MIN_LENGTH

      public static final int DEFAULT_MIN_LENGTH
      Caller can adjust this default constant if shorter tags are desired.
    • catalogs

      public final Set<String> catalogs
      The set of catalog IDs the caller wants to tag for. If set, only taxon matches whose catalog ID is in this set are returned by tagText().
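      The filter rule described for catalogs can be illustrated with a minimal sketch. It assumes that "set" means the set is non-empty and that an empty set lets every catalog through. The class CatalogFilterSketch, the method accepts, and the catalog IDs "catalogA" and "catalogB" are hypothetical stand-ins for illustration, not part of TaxonMatcher.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical model of the catalog filter rule; not the library's code. */
class CatalogFilterSketch {
    // Stand-in for the public catalogs field described above.
    final Set<String> catalogs = new HashSet<>();

    /** Returns true when a match from the given catalog would be kept. */
    boolean accepts(String catalogId) {
        // Assumption: an empty set means no filter is active.
        return catalogs.isEmpty() || catalogs.contains(catalogId);
    }

    public static void main(String[] args) {
        CatalogFilterSketch filter = new CatalogFilterSketch();
        System.out.println(filter.accepts("catalogA")); // true: no filter set
        filter.catalogs.add("catalogA");
        System.out.println(filter.accepts("catalogA")); // true: listed catalog
        System.out.println(filter.accepts("catalogB")); // false: not listed
    }
}
```

      In the real class, addCatalogFilter, addCatalogFilters and removeFilters presumably maintain this set; that link is an inference from the method names, not something this page states.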
    • commonTaxonLabels

      protected static final String[] commonTaxonLabels
  • Constructor Details

    • TaxonMatcher

      public TaxonMatcher() throws org.opensextant.ConfigException
      Throws:
      org.opensextant.ConfigException - errors related to configuration, resource files or Solr setup
  • Method Details

    • cleanup

      public void cleanup()
      Extractor interface.
      Specified by:
      cleanup in interface org.opensextant.extraction.Extractor
    • getCoreName

      public String getCoreName()
      Be explicit about the Solr core to use for tagging.
      Specified by:
      getCoreName in class SolrMatcherSupport
      Returns:
      the core name
    • getMatcherParameters

      public org.apache.solr.common.params.SolrParams getMatcherParameters()
      Return the Solr Parameters for the tagger op.
      Specified by:
      getMatcherParameters in class SolrMatcherSupport
      Returns:
      solr params
    • createTag

      public Object createTag(org.apache.solr.common.SolrDocument refData)
      Create a Taxon tag, which is filtered based on established catalog filters. Caller must implement their domain objects, POJOs... this callback handler only hashes them.
      Specified by:
      createTag in class SolrMatcherSupport
      Parameters:
      refData - solr doc
      Returns:
      tag data
    • createTaxon

      public static org.opensextant.data.Taxon createTaxon(org.apache.solr.common.SolrDocument refData)
      Parse the taxon reference data from a solr doc and return Taxon obj.
      Parameters:
      refData - solr doc
      Returns:
      taxon obj
    • getName

      public String getName()
      Extractor interface: getName
      Specified by:
      getName in interface org.opensextant.extraction.Extractor
      Returns:
      Extractor name
    • configure

      public void configure() throws org.opensextant.ConfigException
      Specified by:
      configure in interface org.opensextant.extraction.Extractor
      Throws:
      org.opensextant.ConfigException
    • configure

      public void configure(String patfile) throws org.opensextant.ConfigException
      Configure an Extractor using a config file named by a path
      Specified by:
      configure in interface org.opensextant.extraction.Extractor
      Parameters:
      patfile - configuration file path
      Throws:
      org.opensextant.ConfigException
    • configure

      public void configure(URL patfile) throws org.opensextant.ConfigException
      Configure an Extractor using a config file named by a URL
      Specified by:
      configure in interface org.opensextant.extraction.Extractor
      Parameters:
      patfile - configuration URL
      Throws:
      org.opensextant.ConfigException
    • addCatalogFilters

      public void addCatalogFilters(String[] cats)
    • addCatalogFilter

      public void addCatalogFilter(String cat)
    • removeFilters

      public void removeFilters()
    • excludeTaxons

      public void excludeTaxons(String prefix)
      Add a prefix for a type of taxon you do not want returned. For example, excluding "Place...." still allows "Org" and "Person" taxons to pass through.
      Parameters:
      prefix - taxon name prefix
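      A minimal sketch of prefix-based exclusion follows. It assumes a plain, case-sensitive String.startsWith test on the taxon name; the actual matching rules (case folding, separators) are not documented on this page. The class TaxonPrefixExcludeSketch, the method isExcluded, and the taxon names used are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical model of prefix exclusion; not the library's code. */
class TaxonPrefixExcludeSketch {
    private final List<String> excludedPrefixes = new ArrayList<>();

    /** Mirrors the documented method name: register a prefix to exclude. */
    void excludeTaxons(String prefix) {
        excludedPrefixes.add(prefix);
    }

    /** Returns true when the taxon name starts with any excluded prefix. */
    boolean isExcluded(String taxonName) {
        for (String prefix : excludedPrefixes) {
            // Assumption: case-sensitive prefix comparison.
            if (taxonName.startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        TaxonPrefixExcludeSketch sketch = new TaxonPrefixExcludeSketch();
        sketch.excludeTaxons("Place");
        System.out.println(sketch.isExcluded("Place.City"));  // true
        System.out.println(sketch.isExcluded("Org.Company")); // false
        System.out.println(sketch.isExcluded("Person.Name")); // false
    }
}
```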
    • extract

      public List<org.opensextant.extraction.TextMatch> extract(String input_buf) throws org.opensextant.extraction.ExtractionException
      Lightweight usage: text in, matches out. Behavior: ACRONYMS matching lower-case terms are automatically omitted from results.
      Specified by:
      extract in interface org.opensextant.extraction.Extractor
      Returns:
      Null if nothing found, otherwise a list of TextMatch objects
      Throws:
      org.opensextant.extraction.ExtractionException
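      Because this method is documented to return null when nothing is found, callers may want to normalize the result before iterating. The helper below is a sketch of that guard; NullSafeMatches and orEmpty are hypothetical names, and the helper is generic so it does not depend on the TextMatch type.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Hypothetical caller-side helper; not part of the library. */
class NullSafeMatches {
    /** Returns the list unchanged, or an immutable empty list when it is null. */
    static <T> List<T> orEmpty(List<T> matches) {
        return matches == null ? Collections.emptyList() : matches;
    }

    public static void main(String[] args) {
        // A null result (nothing found) becomes an empty list that is safe to loop over.
        List<String> none = orEmpty(null);
        System.out.println(none.size()); // 0

        // A non-null result is passed through as-is.
        List<String> some = orEmpty(Arrays.asList("match1", "match2"));
        System.out.println(some.size()); // 2
    }
}
```

      With the real API this would wrap the call site, for example orEmpty(matcher.extract(text)), where matcher is a configured TaxonMatcher.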
    • extract

      public List<org.opensextant.extraction.TextMatch> extract(org.opensextant.data.TextInput input, org.opensextant.processing.Parameters params) throws org.opensextant.extraction.ExtractionException
      Parameterized extraction, e.g., for REST service or other fine tuning.
      Parameters:
      input - text to tag
      params - tagging parameters
      Returns:
      list of TextMatch
      Throws:
      org.opensextant.extraction.ExtractionException - on Solr Tagger error
    • setAllowLowerCase

      public void setAllowLowerCase(boolean b)
    • extract

      public List<org.opensextant.extraction.TextMatch> extract(org.opensextant.data.TextInput input) throws org.opensextant.extraction.ExtractionException
      Tag the input
      Specified by:
      extract in interface org.opensextant.extraction.Extractor
      Parameters:
      input - TextInput
      Returns:
      list of TextMatch, or null
      Throws:
      org.opensextant.extraction.ExtractionException - the extraction exception
    • search

      public static List<org.opensextant.data.Taxon> search(org.apache.solr.client.solrj.SolrClient index, String query) throws org.apache.solr.client.solrj.SolrServerException, IOException
      Throws:
      org.apache.solr.client.solrj.SolrServerException
      IOException
    • search

      public static List<org.opensextant.data.Taxon> search(org.apache.solr.client.solrj.SolrClient index, org.apache.solr.common.params.SolrParams qparams) throws org.apache.solr.client.solrj.SolrServerException, IOException
      Throws:
      org.apache.solr.client.solrj.SolrServerException
      IOException
    • search

      public List<org.opensextant.data.Taxon> search(String query) throws org.apache.solr.client.solrj.SolrServerException, IOException
      Search the current taxonomic catalog.
      Parameters:
      query - Solr "q" parameter only
      Returns:
      list of taxons
      Throws:
      org.apache.solr.client.solrj.SolrServerException - on error
      IOException - on error
    • search

      public List<org.opensextant.data.Taxon> search(org.apache.solr.common.params.SolrParams qparams) throws org.apache.solr.client.solrj.SolrServerException, IOException
      Search the current taxonomic catalog.
      Parameters:
      qparams - Solr parameters in full.
      Returns:
      list of taxons
      Throws:
      org.apache.solr.client.solrj.SolrServerException - on error
      IOException - on error