Package org.opensextant.extractors.xtax
Class TaxonMatcher
java.lang.Object
org.opensextant.extraction.SolrMatcherSupport
org.opensextant.extractors.xtax.TaxonMatcher
- All Implemented Interfaces:
Closeable, AutoCloseable, org.opensextant.extraction.Extractor
public class TaxonMatcher
extends SolrMatcherSupport
implements org.opensextant.extraction.Extractor
TaxonMatcher uses SolrTextTagger to tag mentions of phrases in documents. The phrases can come
from simple word lists or connect to a taxonomy of sorts. Tagging runs against the "taxcat" Solr core (see Xponents/solr/taxcat).
- Author:
- Marc Ubaldino - ubaldino@mitre.org
-
Field Summary
| Modifier and Type | Field | Description |
|---|---|---|
| | catalogs | Catalogs is a list of catalogs caller wants to tag for. |
| protected static final String[] | commonTaxonLabels | |
| static final int | DEFAULT_MIN_LENGTH | Caller can adjust this default constant if shorter tags are desired. |

Fields inherited from class org.opensextant.extraction.SolrMatcherSupport
DEFAULT_TAG_LIMIT, getNamesTime, log, requestHandler, solr, tagNamesTime, totalTime
Fields inherited from interface org.opensextant.extraction.Extractor
NO_DOC_ID
-
Constructor Summary
TaxonMatcher()
-
Method Summary
| Modifier and Type | Method | Description |
|---|---|---|
| void | addCatalogFilter(String cat) | |
| void | addCatalogFilters(String[] cats) | |
| void | cleanup() | Extractor interface. |
| void | configure() | |
| void | configure(patfile) | Configure an Extractor using a config file named by a path |
| void | configure(patfile) | Configure an Extractor using a config file named by a URL |
| | createTag(org.apache.solr.common.SolrDocument refData) | Create a Taxon tag, which is filtered based on established catalog filters. |
| static org.opensextant.data.Taxon | createTaxon(org.apache.solr.common.SolrDocument refData) | Parse the taxon reference data from a solr doc and return Taxon obj. |
| void | excludeTaxons(String prefix) | Add prefixes of types of taxons you do not want returned. |
| List<org.opensextant.extraction.TextMatch> | extract(String input_buf) | Light-weight usage: text in, matches out. |
| List<org.opensextant.extraction.TextMatch> | extract(org.opensextant.data.TextInput input) | Tag the input |
| List<org.opensextant.extraction.TextMatch> | extract(org.opensextant.data.TextInput input, org.opensextant.processing.Parameters params) | Parameterized extraction, e.g., for REST service or other fine tuning. |
| | getCoreName() | Be explicit about the solr core to use for tagging |
| org.apache.solr.common.params.SolrParams | getMatcherParameters() | Return the Solr Parameters for the tagger op. |
| | getName() | Extractor interface: getName |
| void | removeFilters() | |
| List<org.opensextant.data.Taxon> | search(String query) | search the current taxonomic catalog. |
| static List<org.opensextant.data.Taxon> | search(org.apache.solr.client.solrj.SolrClient index, String query) | |
| static List<org.opensextant.data.Taxon> | search(org.apache.solr.client.solrj.SolrClient index, org.apache.solr.common.params.SolrParams qparams) | |
| List<org.opensextant.data.Taxon> | search(org.apache.solr.common.params.SolrParams qparams) | search the current taxonomic catalog. |
| void | setAllowLowerCase(boolean b) | |

Methods inherited from class org.opensextant.extraction.SolrMatcherSupport
close, getRetrievingNamesTime, getTaggingNamesTime, getTotalTime, initialize, setTaggerHandler, tagTextCallSolrTagger
-
Field Details
-
DEFAULT_MIN_LENGTH
public static final int DEFAULT_MIN_LENGTH
Caller can adjust this default constant if shorter tags are desired.
-
catalogs
A list of catalogs the caller wants to tag for. If set, tagText() returns only taxon matches whose catalog ID is in this list.
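The filtering rule described for catalogs can be illustrated with a small, self-contained sketch. It is not the TaxonMatcher implementation: the class CatalogFilterSketch, its accepts method, and the catalog IDs "catalogA" and "catalogB" are hypothetical, and treating an empty filter set as "accept every catalog" is an assumption read from the phrase "If set".

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical stand-in for a catalog filter; not the TaxonMatcher source. */
final class CatalogFilterSketch {
    // Catalog IDs the caller wants to tag for. Empty means "no filter" (assumption).
    private final Set<String> catalogs = new HashSet<>();

    /** Add a single catalog ID to the filter. */
    public void addCatalogFilter(String cat) {
        catalogs.add(cat);
    }

    /** Add several catalog IDs to the filter. */
    public void addCatalogFilters(String[] cats) {
        catalogs.addAll(Arrays.asList(cats));
    }

    /** Drop all catalog filters so that every catalog passes again. */
    public void removeFilters() {
        catalogs.clear();
    }

    /** A match is kept if no filter is set or its catalog ID is in the filter. */
    public boolean accepts(String catalogId) {
        return catalogs.isEmpty() || catalogs.contains(catalogId);
    }

    public static void main(String[] args) {
        CatalogFilterSketch filter = new CatalogFilterSketch();
        filter.addCatalogFilter("catalogA");
        System.out.println(filter.accepts("catalogA")); // true
        System.out.println(filter.accepts("catalogB")); // false
    }
}
```

The method names mirror addCatalogFilter, addCatalogFilters and removeFilters from this class only to make the correspondence easy to follow; how TaxonMatcher stores and applies the filter internally is not stated on this page.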
commonTaxonLabels
-
-
Constructor Details
-
TaxonMatcher
public TaxonMatcher() throws org.opensextant.ConfigException
- Throws:
org.opensextant.ConfigException - errors related to configuration, resource files or Solr setup
-
-
Method Details
-
cleanup
public void cleanup()
Extractor interface.
- Specified by:
cleanup in interface org.opensextant.extraction.Extractor
-
getCoreName
Be explicit about the Solr core to use for tagging.
- Specified by:
getCoreName in class SolrMatcherSupport
- Returns:
the core name
-
getMatcherParameters
public org.apache.solr.common.params.SolrParams getMatcherParameters()
Return the Solr parameters for the tagger op.
- Specified by:
getMatcherParameters in class SolrMatcherSupport
- Returns:
Solr params
-
createTag
Create a Taxon tag, which is filtered based on established catalog filters. Caller must implement their domain objects, POJOs... this callback handler only hashes them.
- Specified by:
createTag in class SolrMatcherSupport
- Parameters:
refData - Solr doc
- Returns:
tag data
-
createTaxon
public static org.opensextant.data.Taxon createTaxon(org.apache.solr.common.SolrDocument refData)
Parse the taxon reference data from a Solr doc and return a Taxon object.
- Parameters:
refData - Solr doc
- Returns:
Taxon object
-
getName
Extractor interface: getName
- Specified by:
getName in interface org.opensextant.extraction.Extractor
- Returns:
Extractor name
-
configure
public void configure() throws org.opensextant.ConfigException
- Specified by:
configure in interface org.opensextant.extraction.Extractor
- Throws:
org.opensextant.ConfigException
-
configure
Configure an Extractor using a config file named by a path.
- Specified by:
configure in interface org.opensextant.extraction.Extractor
- Parameters:
patfile - configuration file path
- Throws:
org.opensextant.ConfigException
-
configure
Configure an Extractor using a config file named by a URL.
- Specified by:
configure in interface org.opensextant.extraction.Extractor
- Parameters:
patfile - configuration URL
- Throws:
org.opensextant.ConfigException
-
addCatalogFilters
public void addCatalogFilters(String[] cats)
-
addCatalogFilter
public void addCatalogFilter(String cat)
-
removeFilters
public void removeFilters()
-
excludeTaxons
public void excludeTaxons(String prefix)
Add prefixes of types of taxons you do not want returned. For example, excluding "Place...." still allows "Org" and "Person" taxons to pass through.
- Parameters:
prefix - taxon name prefix
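The prefix rule above can be sketched in a few lines of plain Java. This is an illustration under stated assumptions, not the TaxonMatcher source: the class TaxonPrefixExclusionSketch, the isExcluded and filter helpers, and the dotted taxon names such as "Place.City" are hypothetical, and the case-sensitive startsWith comparison is an assumption about how prefixes are matched.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical illustration of excluding taxons by name prefix. */
final class TaxonPrefixExclusionSketch {
    // Prefixes of taxon names that should not be returned.
    private final List<String> excluded = new ArrayList<>();

    /** Register one more prefix to exclude. */
    public void excludeTaxons(String prefix) {
        excluded.add(prefix);
    }

    /** True if the taxon name starts with any registered prefix (case-sensitive, by assumption). */
    public boolean isExcluded(String taxonName) {
        for (String prefix : excluded) {
            if (taxonName.startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }

    /** Return only the taxon names that do not match an excluded prefix, in input order. */
    public List<String> filter(List<String> taxonNames) {
        List<String> kept = new ArrayList<>();
        for (String name : taxonNames) {
            if (!isExcluded(name)) {
                kept.add(name);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        TaxonPrefixExclusionSketch sketch = new TaxonPrefixExclusionSketch();
        sketch.excludeTaxons("Place");
        // prints [Org.Company, Person.Name]
        System.out.println(sketch.filter(Arrays.asList("Place.City", "Org.Company", "Person.Name")));
    }
}
```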
-
extract
public List<org.opensextant.extraction.TextMatch> extract(String input_buf) throws org.opensextant.extraction.ExtractionException
Light-weight usage: text in, matches out. Behaviors: ACRONYMS matching lower case terms will automatically be omitted from results.
- Specified by:
extract in interface org.opensextant.extraction.Extractor
- Returns:
null if nothing found, otherwise a list of TextMatch objects
- Throws:
org.opensextant.extraction.ExtractionException
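Because extract(String) is documented to return null when nothing is found, callers may want to normalize the result before iterating. The helper below is one possible sketch; NullSafeMatches and orEmpty are hypothetical names, not part of Xponents, and the helper is generic so it runs without the library on the classpath.

```java
import java.util.Collections;
import java.util.List;

/** Hypothetical helper for treating a null result list as "no matches". */
final class NullSafeMatches {
    private NullSafeMatches() {
        // static helper only
    }

    /** Return the list unchanged, or an immutable empty list when it is null. */
    public static <T> List<T> orEmpty(List<T> matches) {
        return matches == null ? Collections.<T>emptyList() : matches;
    }

    public static void main(String[] args) {
        List<String> none = null; // stands in for a "nothing found" result
        for (String match : orEmpty(none)) {
            System.out.println(match); // never reached for a null result
        }
        System.out.println(orEmpty(none).size()); // prints 0
    }
}
```

With a configured matcher, a caller might wrap the call as NullSafeMatches.orEmpty(matcher.extract(text)) and loop over the result without a separate null check; matcher and text are placeholder variable names.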
-
extract
public List<org.opensextant.extraction.TextMatch> extract(org.opensextant.data.TextInput input, org.opensextant.processing.Parameters params) throws org.opensextant.extraction.ExtractionException
Parameterized extraction, e.g., for a REST service or other fine tuning.
- Parameters:
input - text to tag
params - tagging parameters
- Returns:
list of TextMatch
- Throws:
org.opensextant.extraction.ExtractionException - on Solr Tagger error
-
setAllowLowerCase
public void setAllowLowerCase(boolean b)
-
extract
public List<org.opensextant.extraction.TextMatch> extract(org.opensextant.data.TextInput input) throws org.opensextant.extraction.ExtractionException
Tag the input.
- Specified by:
extract in interface org.opensextant.extraction.Extractor
- Parameters:
input - TextInput
- Returns:
list of TextMatch, or null
- Throws:
org.opensextant.extraction.ExtractionException - the extraction exception
-
search
public static List<org.opensextant.data.Taxon> search(org.apache.solr.client.solrj.SolrClient index, String query) throws org.apache.solr.client.solrj.SolrServerException, IOException
- Throws:
org.apache.solr.client.solrj.SolrServerException
IOException
-
search
public static List<org.opensextant.data.Taxon> search(org.apache.solr.client.solrj.SolrClient index, org.apache.solr.common.params.SolrParams qparams) throws org.apache.solr.client.solrj.SolrServerException, IOException
- Throws:
org.apache.solr.client.solrj.SolrServerException
IOException
-
search
public List<org.opensextant.data.Taxon> search(String query) throws org.apache.solr.client.solrj.SolrServerException, IOException
Search the current taxonomic catalog.
- Parameters:
query - Solr "q" parameter only
- Returns:
list of taxons
- Throws:
org.apache.solr.client.solrj.SolrServerException - on error
IOException - on error
-
search
public List<org.opensextant.data.Taxon> search(org.apache.solr.common.params.SolrParams qparams) throws org.apache.solr.client.solrj.SolrServerException, IOException
Search the current taxonomic catalog.
- Parameters:
qparams - Solr parameters in full
- Returns:
list of taxons
- Throws:
org.apache.solr.client.solrj.SolrServerException - on error
IOException - on error
-