Package org.opensextant.extractors.geo
Class GazetteerMatcher
java.lang.Object
org.opensextant.extraction.SolrMatcherSupport
org.opensextant.extractors.geo.GazetteerMatcher
- All Implemented Interfaces:
Closeable, AutoCloseable
- Direct Known Subclasses:
PlaceGeocoder, PostalTagger
Connects to a Solr server via HTTP and tags place names in a document. The SOLR_HOME environment variable must be set to the location of the Solr server.
This class is not thread-safe. It could be made thread-safe with little effort.
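Since the class is not thread-safe, one common way to use such an object from several threads is to give each thread its own instance. The sketch below shows that pattern with java.lang.ThreadLocal. `StubMatcher` and `PerThreadMatcher` are hypothetical stand-ins that keep the example self-contained; in real code the stand-in would be replaced by a GazetteerMatcher, whose constructor throws ConfigException and which should be closed when its thread is done.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

class PerThreadMatcher {

    // Hypothetical stand-in for a non-thread-safe matcher such as GazetteerMatcher.
    static class StubMatcher {
        static final AtomicInteger CREATED = new AtomicInteger();
        private int calls = 0; // unsynchronized state, safe only within one thread

        StubMatcher() {
            CREATED.incrementAndGet();
        }

        int tag(String text) {
            calls++;
            return calls;
        }
    }

    // Each thread lazily creates its own instance and then reuses it.
    private static final ThreadLocal<StubMatcher> MATCHER =
            ThreadLocal.withInitial(StubMatcher::new);

    static StubMatcher get() {
        return MATCHER.get();
    }

    public static void main(String[] args) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        for (int i = 0; i < 6; i++) {
            final String doc = "document " + i;
            pool.execute(() -> PerThreadMatcher.get().tag(doc));
        }
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.SECONDS);
        // No more than one matcher is created per worker thread.
        System.out.println("matchers created: " + StubMatcher.CREATED.get());
    }
}
```

The trade-off is memory: each thread holds its own matcher, so this suits a small, fixed pool of worker threads better than many short-lived threads.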
- Author:
- David Smiley - dsmiley@mitre.org, Marc Ubaldino - ubaldino@mitre.org
-
Field Summary
Modifier and Type | Field | Description
static final String | AR_TAG_FIELD | Use Solr param 'field' = name_tag_ar for Arabic.
static final String | CJK_TAG_FIELD | Use Solr param 'field' = name_tag_cjk to tag in Asian scripts.
static final String | DEFAULT_TAG_FIELD | Most languages
protected TagFilter | filter |
Fields inherited from class org.opensextant.extraction.SolrMatcherSupport
DEFAULT_TAG_LIMIT, getNamesTime, log, requestHandler, solr, tagNamesTime, totalTime
-
Constructor Summary
-
Method Summary
Modifier and Type | Method | Description
static org.opensextant.data.Place | createPlace(org.apache.solr.common.SolrDocument gazEntry) | Adapt the SolrProxy method for creating a Place object.
 | createTag(org.apache.solr.common.SolrDocument tag) | Caller must implement their domain objects, POJOs...
 | getCoreName | Be explicit about the solr core to use for tagging.
double | getFiltrationRatio() | This computes the cumulative filtering rate of user-defined and other non-place name patterns
 | getGazetteer | For use within package or by subclass
org.apache.solr.common.params.SolrParams | getMatcherParameters() | Return the Solr Parameters for the tagger op.
void | initialize() | Initialize.
List<org.opensextant.data.Place> | placesAt(org.opensextant.data.LatLon yx) | Deprecated. Use SolrGazetteer directly
void | reportMemory() |
List<org.opensextant.data.Place> | searchAdvanced(String place, boolean as_solr) | Default advanced search.
List<org.opensextant.data.Place> | searchAdvanced(String place, boolean as_solr, int maxLen) | This is a variation on SolrGazetteer.search(), just this creates ScoredPlace which is immediately usable with scoring and ranking matches.
void | setAllowLowerCase(boolean b) | Enable/disable the match filter for lower case matches.
void | setAllowLowerCaseAbbreviations(boolean b) | A flag that will allow us to tag "in" or "in." as a possible abbreviation.
void | setEnableCaseFilter(boolean b) | Enable/disable the document-level case filter.
void | setEnableCodeHunter(boolean b) |
void | setMatchFilter(org.opensextant.extraction.MatchFilter f) | User-provided filters to filter out matched names immediately.
List<PlaceCandidate> | tagText(String buffer, String docid) | Geotag a buffer and return all candidates of gazetteer entries whose name matches phrases in the buffer.
List<PlaceCandidate> | tagText(String buffer, String docid, boolean tagOnly) |
List<PlaceCandidate> | tagText(String buffer, String docid, boolean tagOnly, String fld) |
List<PlaceCandidate> | tagText(org.opensextant.data.TextInput t, boolean tagOnly) | More convenient way of passing input args, using tuple TextInput (buffer, docid, langid)
List<PlaceCandidate> | tagText(org.opensextant.data.TextInput input, boolean tagOnly, String fld) | Geotag a document, returning PlaceCandidates for the mentions in document.
Methods inherited from class org.opensextant.extraction.SolrMatcherSupport
close, getRetrievingNamesTime, getTaggingNamesTime, getTotalTime, setTaggerHandler, tagTextCallSolrTagger
-
Field Details
-
filter
-
DEFAULT_TAG_FIELD
Most languages
-
CJK_TAG_FIELD
Use Solr param 'field' = name_tag_cjk to tag in Asian scripts.
-
AR_TAG_FIELD
Use Solr param 'field' = name_tag_ar for Arabic. TODO: Generalize this or expand so Farsi and Urdu are managed separately.
-
lang2nameField
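The tag-field constants, together with a field named lang2nameField, suggest that the tagging field is looked up from the language of the input. A minimal sketch of that kind of lookup follows. It is illustrative only: the class `TagFieldChooser`, the language codes, and the default value `name_tag` are assumptions; only the field names `name_tag_cjk` and `name_tag_ar` come from the descriptions above.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

class TagFieldChooser {
    // Assumed value; this page does not state what DEFAULT_TAG_FIELD contains.
    static final String DEFAULT_FIELD = "name_tag";
    static final String CJK_FIELD = "name_tag_cjk";
    static final String AR_FIELD = "name_tag_ar";

    // Hypothetical language-to-field map; the language codes are assumptions.
    private static final Map<String, String> LANG_TO_FIELD = new HashMap<>();
    static {
        LANG_TO_FIELD.put("ar", AR_FIELD);
        LANG_TO_FIELD.put("zh", CJK_FIELD);
        LANG_TO_FIELD.put("ja", CJK_FIELD);
        LANG_TO_FIELD.put("ko", CJK_FIELD);
    }

    // Return the tag field for a language ID, falling back to the default field.
    static String fieldFor(String langId) {
        if (langId == null) {
            return DEFAULT_FIELD;
        }
        return LANG_TO_FIELD.getOrDefault(langId.toLowerCase(Locale.ROOT), DEFAULT_FIELD);
    }

    public static void main(String[] args) {
        System.out.println(fieldFor("ar")); // name_tag_ar
        System.out.println(fieldFor("ja")); // name_tag_cjk
        System.out.println(fieldFor("en")); // name_tag (assumed default)
    }
}
```

A field chosen this way would then be passed as the fld argument of tagText(TextInput, boolean, String).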
-
-
Constructor Details
-
GazetteerMatcher
public GazetteerMatcher() throws org.opensextant.ConfigException
- Throws:
org.opensextant.ConfigException
-
GazetteerMatcher
public GazetteerMatcher(boolean lowercaseAllowed) throws org.opensextant.ConfigException
- Parameters:
lowercaseAllowed
- variant is case insensitive.
- Throws:
org.opensextant.ConfigException
- on error
-
-
Method Details
-
initialize
public void initialize() throws org.opensextant.ConfigException
Description copied from class: SolrMatcherSupport
Initialize. This capability does not support taggers/matchers that use an HTTP server. For now, it is intended for an in-memory, local embedded Solr server.
- Overrides:
initialize in class SolrMatcherSupport
- Throws:
org.opensextant.ConfigException
- if solr server cannot be established from local index or from http server
-
getCoreName
Description copied from class: SolrMatcherSupport
Be explicit about the solr core to use for tagging.
- Specified by:
getCoreName in class SolrMatcherSupport
- Returns:
- the core name
-
getMatcherParameters
public org.apache.solr.common.params.SolrParams getMatcherParameters()
Description copied from class: SolrMatcherSupport
Return the Solr Parameters for the tagger op.
- Specified by:
getMatcherParameters in class SolrMatcherSupport
- Returns:
- SolrParams
-
getGazetteer
For use within package or by subclass
- Returns:
- internal gazetteer instance
-
reportMemory
public void reportMemory()
-
setAllowLowerCaseAbbreviations
public void setAllowLowerCaseAbbreviations(boolean b)
A flag that will allow us to tag "in" or "in." as a possible abbreviation. By default, such lower case forms are not treated as abbreviations: Indiana is typically written IN, In., or Ind., and Oregon is written OR or Ore., but almost never "in" or "or".
- Parameters:
b
- flag true = allow lower case abbreviations to be tagged, e.g., as in social media or
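To make the rule concrete, here is a rough sketch of a filter that rejects lower case matches that could be abbreviations unless a flag allows them. This is not the library's implementation: the class `AbbreviationFilter`, the method `accept`, and the small abbreviation list are hypothetical and exist only for illustration.

```java
import java.util.Locale;
import java.util.Set;

class AbbreviationFilter {
    // Hypothetical sample list, stored upper case and without a trailing period.
    private static final Set<String> ABBREVIATIONS = Set.of("IN", "IND", "OR", "ORE");

    private boolean allowLowerCaseAbbreviations = false;

    void setAllowLowerCaseAbbreviations(boolean b) {
        this.allowLowerCaseAbbreviations = b;
    }

    // Return true if the matched text should be kept as a candidate.
    boolean accept(String matched) {
        // Drop one trailing period so "in." and "in" are treated alike.
        String bare = matched.endsWith(".")
                ? matched.substring(0, matched.length() - 1)
                : matched;
        if (!ABBREVIATIONS.contains(bare.toUpperCase(Locale.ROOT))) {
            return true; // not a known abbreviation; this sketch does not filter it
        }
        boolean isLowerCase = bare.equals(bare.toLowerCase(Locale.ROOT));
        return !isLowerCase || allowLowerCaseAbbreviations;
    }

    public static void main(String[] args) {
        AbbreviationFilter filter = new AbbreviationFilter();
        System.out.println(filter.accept("IN"));  // true
        System.out.println(filter.accept("in"));  // false
        filter.setAllowLowerCaseAbbreviations(true);
        System.out.println(filter.accept("in.")); // true
    }
}
```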
-
setAllowLowerCase
public void setAllowLowerCase(boolean b)
Enable/disable the match filter for lower case matches. Primarily lower case text matches are filtered against stopword lists and length filters.
- Parameters:
b
- flag
-
setEnableCaseFilter
public void setEnableCaseFilter(boolean b)
Enable/disable the document-level case filter.
- Parameters:
b
- flag
-
setEnableCodeHunter
public void setEnableCodeHunter(boolean b)
-
setMatchFilter
public void setMatchFilter(org.opensextant.extraction.MatchFilter f)
User-provided filters to filter out matched names immediately. Avoid filtering out things that are indeed places, but require disambiguation or refinement.
- Parameters:
f
- a match filter
-
searchAdvanced
public List<org.opensextant.data.Place> searchAdvanced(String place, boolean as_solr) throws org.apache.solr.client.solrj.SolrServerException, IOException
Default advanced search.
- Parameters:
place
as_solr
- Throws:
org.apache.solr.client.solrj.SolrServerException
IOException
-
searchAdvanced
public List<org.opensextant.data.Place> searchAdvanced(String place, boolean as_solr, int maxLen) throws org.apache.solr.client.solrj.SolrServerException, IOException
This is a variation on SolrGazetteer.search(); the difference is that it creates ScoredPlace objects, which are immediately usable for scoring and ranking matches. The score for a ScoredPlace is created when it is added to a PlaceCandidate: a default score is created for the place.
Usage:
    pc = PlaceCandidate();
    list = gaz.searchAdvanced("name:Boston", true) // solr fielded query used as-is.
    for ScoredPlace p: list:
        pc.addPlace( p )
- Parameters:
place
- the place string or text; or a Solr query
as_solr
- the as_solr flag; in the usage above, true means the Solr fielded query is used as-is
maxLen
- max length of gazetteer place names.
- Returns:
- places List of scoreable place entries
- Throws:
org.apache.solr.client.solrj.SolrServerException
- the solr server exception
IOException
-
tagText
public List<PlaceCandidate> tagText(String buffer, String docid) throws org.opensextant.extraction.ExtractionException
Geotag a buffer and return all candidates of gazetteer entries whose name matches phrases in the buffer.
- Parameters:
buffer
- text
docid
- ID
- Returns:
- list of place candidates
- Throws:
org.opensextant.extraction.ExtractionException
- on error
-
tagText
public List<PlaceCandidate> tagText(String buffer, String docid, boolean tagOnly) throws org.opensextant.extraction.ExtractionException
- Throws:
org.opensextant.extraction.ExtractionException
-
tagText
public List<PlaceCandidate> tagText(String buffer, String docid, boolean tagOnly, String fld) throws org.opensextant.extraction.ExtractionException
- Throws:
org.opensextant.extraction.ExtractionException
-
tagText
public List<PlaceCandidate> tagText(org.opensextant.data.TextInput t, boolean tagOnly) throws org.opensextant.extraction.ExtractionException
More convenient way of passing input arguments, using the tuple TextInput (buffer, docid, langid).
- Parameters:
t
tagOnly
- Returns:
- geocoded matches. See tagText()
- Throws:
org.opensextant.extraction.ExtractionException
-
tagText
public List<PlaceCandidate> tagText(org.opensextant.data.TextInput input, boolean tagOnly, String fld) throws org.opensextant.extraction.ExtractionException
Geotag a document, returning PlaceCandidates for the mentions in the document. Optionally return just the PlaceCandidates with name only and no Place objects attached. Names of continents are passed back as matches, with geo matches. Continents are filtered out by default.
- Parameters:
input
- text object
tagOnly
- True if you wish to get the matched phrases only. False if you want the full list of Place Candidates.
fld
- gazetteer field to use for tagging
- Returns:
- place_candidates List of place candidates which may be empty if nothing is found.
- Throws:
org.opensextant.extraction.ExtractionException
- on error
-
getFiltrationRatio
public double getFiltrationRatio()
This computes the cumulative filtering rate of user-defined and other non-place name patterns.
- Returns:
- filtration ratio
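One plausible reading of "cumulative filtering rate" is the share of all matches seen so far that were discarded by filters. The sketch below computes a ratio on that assumption. The class `FiltrationCounter`, its counters, and the choice to return 0 when nothing has been recorded are hypothetical and are not taken from the library.

```java
class FiltrationCounter {
    private long totalMatches = 0;    // every match seen so far
    private long filteredMatches = 0; // matches discarded by a filter

    // Record one match and whether a filter discarded it.
    void record(boolean filteredOut) {
        totalMatches++;
        if (filteredOut) {
            filteredMatches++;
        }
    }

    // Ratio of discarded matches to all matches; 0 if nothing was recorded (assumption).
    double getFiltrationRatio() {
        if (totalMatches == 0) {
            return 0.0;
        }
        return (double) filteredMatches / totalMatches;
    }

    public static void main(String[] args) {
        FiltrationCounter counter = new FiltrationCounter();
        counter.record(true);
        counter.record(false);
        counter.record(false);
        counter.record(false);
        System.out.println(counter.getFiltrationRatio()); // 0.25
    }
}
```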
-
createTag
Description copied from class:SolrMatcherSupport
Caller must implement their domain objects, POJOs... this callback handler only hashes them.
- Specified by:
createTag in class SolrMatcherSupport
- Parameters:
tag
- record to convert to Place record
- Returns:
- object representing a Place
-
createPlace
public static org.opensextant.data.Place createPlace(org.apache.solr.common.SolrDocument gazEntry)
Adapt the SolrProxy method for creating a Place object. Here, gazetteer metrics are added for downstream disambiguation.
- Parameters:
gazEntry
- a solr record from the gazetteer
- Returns:
- Place (Xponents) object
-
placesAt
@Deprecated public List<org.opensextant.data.Place> placesAt(org.opensextant.data.LatLon yx) throws org.apache.solr.client.solrj.SolrServerException, IOException
Deprecated. Use SolrGazetteer directly
Find places located at a particular location.
- Parameters:
yx
- location
- Returns:
- list of places near location
- Throws:
org.apache.solr.client.solrj.SolrServerException
- on error
IOException
-