Package org.opensextant.extractors.geo
Class TagFilter
java.lang.Object
org.opensextant.extraction.MatchFilter
org.opensextant.extractors.geo.TagFilter
public class TagFilter
extends org.opensextant.extraction.MatchFilter
-
Field Summary
Fields inherited from class org.opensextant.extraction.MatchFilter
tagFilter
-
Constructor Summary
-
Method Summary
Modifier and Type    Method    Description
boolean    assessAllFilters(String textnorm)    Run a term (already lowercased) against all stop filters.
void    enableCaseSensitive(boolean b)
void    enableStopwordFilter(boolean b)
boolean    filterOut(t)    Default filtering rules: (a) If filter is in case-sensitive mode (DEFAULT), all lower case matches are ignored; only mixed case or upper case passes. (b) If match term, t, is in stop word list it is filtered out.
boolean    filterOut(langId, termLower)
boolean    filterOut(PlaceCandidate t, String langId, boolean docIsUpper, boolean docIsLower)    Experimental.
static Set<String>    loadExclusions(InputStream filestream)    Exclusions have two columns in a CSV file.
-
Constructor Details
-
TagFilter
NOTE: This constructor expects all resource files to be available; it fails if any resource file is missing.
Throws:
org.opensextant.ConfigException - if any file has a problem.
IOException
-
-
Method Details
-
enableStopwordFilter
public void enableStopwordFilter(boolean b)
-
enableCaseSensitive
public void enableCaseSensitive(boolean b)
-
filterOut
Default filtering rules:
(a) If the filter is in case-sensitive mode (DEFAULT), all lower case matches are ignored; only mixed case or upper case passes.
(b) If the match term, t, is in the stop word list, it is filtered out. Case is ignored.
TODO: filter rules. If the text match is all lower case and the filter is case-sensitive, this filters out any lower case match. That is not optimal; the rule should take into account the alpha-case of the document.
TODO: trivial for the general case, but important: stopTerms is hashed only by lower case value, so native-case lookup is not possible.
Overrides:
filterOut in class org.opensextant.extraction.MatchFilter
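The two default rules can be illustrated with a small stand-alone sketch. This is not the TagFilter source: the class name DefaultRuleSketch, its fields, and the order in which the rules are checked are assumptions made for illustration, and the stop word filter is assumed to be on by default.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/** Hypothetical stand-in for illustration only; not part of the library. */
final class DefaultRuleSketch {
    // Stop terms are stored lowercased only, mirroring the TODO about native-case lookup.
    private final Set<String> stopTerms = new HashSet<>();
    private boolean caseSensitive = true;   // documented default for rule (a)
    private boolean stopwordFilter = true;  // assumption: enabled by default

    DefaultRuleSketch(Set<String> terms) {
        for (String t : terms) {
            stopTerms.add(t.toLowerCase(Locale.ROOT));
        }
    }

    void enableCaseSensitive(boolean b) { caseSensitive = b; }

    void enableStopwordFilter(boolean b) { stopwordFilter = b; }

    /** Returns true if the match text should be filtered out. */
    boolean filterOut(String t) {
        // Rule (a): in case-sensitive mode, all-lower-case matches are ignored.
        if (caseSensitive && isAllLower(t)) {
            return true;
        }
        // Rule (b): stop words are filtered out, ignoring case.
        return stopwordFilter && stopTerms.contains(t.toLowerCase(Locale.ROOT));
    }

    /** True if the text has at least one lower case letter and no upper case letter. */
    private static boolean isAllLower(String t) {
        boolean sawLower = false;
        for (int i = 0; i < t.length(); i++) {
            char c = t.charAt(i);
            if (Character.isUpperCase(c)) {
                return false;
            }
            if (Character.isLowerCase(c)) {
                sawLower = true;
            }
        }
        return sawLower;
    }
}
```

With this sketch, "paris" is filtered out in the default mode while "Paris" passes, which is the behavior the first TODO calls not optimal for documents written entirely in lower case.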
-
filterOut
Experimental. Using a proper language ID (ISO 2-char for now), determine if the given term, t, is a stop term in that language.
Parameters:
t - the term to check
langId - language ID (ISO 2-char for now)
docIsUpper - true if input doc is mostly upper case
docIsLower - true if input doc is mostly lower case
Returns:
-
filterOut
Parameters:
langId - lang ID to check.
termLower - lower case term.
Returns:
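A language-specific stop term check of this shape could be backed by a map from language ID to a set of lowercased terms. The sketch below is a guess at that structure, not the library's implementation: LangStopSketch and addStopTerm are hypothetical names, and returning false for an unknown language is an assumption.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/** Hypothetical per-language stop term index; for illustration only. */
final class LangStopSketch {
    private final Map<String, Set<String>> stopByLang = new HashMap<>();

    /** Registers a stop term for a language; the term is stored lowercased. */
    void addStopTerm(String langId, String term) {
        stopByLang.computeIfAbsent(langId, k -> new HashSet<>())
                  .add(term.toLowerCase(Locale.ROOT));
    }

    /**
     * Returns true if termLower is a stop term for langId.
     * The caller is expected to pass an already lowercased term.
     * Assumption: a language with no stop list filters nothing.
     */
    boolean filterOut(String langId, String termLower) {
        Set<String> terms = stopByLang.get(langId);
        return terms != null && terms.contains(termLower);
    }
}
```

Because the lookup uses the term exactly as passed, a caller that forgets to lowercase it will get no match even for a registered stop term.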
-
assessAllFilters
Run a term (already lowercased) against all stop filters.
Parameters:
textnorm - the term, already lowercased
Returns:
-
loadExclusions
public static Set<String> loadExclusions(InputStream filestream) throws org.opensextant.ConfigException
Exclusions have two columns in a CSV file: 'exclusion', 'category'. A "#" in the exclusion column implies a comment. The caller is responsible for getting the I/O stream.
Parameters:
filestream - URL/file with exclusion terms
Returns:
set of filter terms
Throws:
org.opensextant.ConfigException - if filter is not found
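The file format described above can be read with a few lines of standard Java. The sketch below is an independent illustration and not the library method: ExclusionLoaderSketch is a hypothetical class, it throws IOException rather than ConfigException, and it assumes UTF-8 input, no header row, no quoted fields, comments marked by a leading "#", and lowercased output terms.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/** Hypothetical loader for a two-column exclusion CSV; for illustration only. */
final class ExclusionLoaderSketch {

    /** Reads 'exclusion,category' rows and returns the exclusion terms, lowercased. */
    static Set<String> loadExclusions(InputStream in) throws IOException {
        Set<String> terms = new HashSet<>();
        try (BufferedReader reader =
                 new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                // Naive split: first column is the exclusion, the rest is the category.
                String term = line.split(",", 2)[0].trim();
                // Skip blank rows and rows whose exclusion column starts with "#".
                if (term.isEmpty() || term.startsWith("#")) {
                    continue;
                }
                terms.add(term.toLowerCase(Locale.ROOT));
            }
        }
        return terms;
    }
}
```

The category column is ignored here because the documented return value is only the set of filter terms. As the description notes, opening the stream is left to the caller.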
-