java.lang.Object

org.opensextant.extraction.MatchFilter

org.opensextant.extraction.TagFilter

public class TagFilter extends org.opensextant.extraction.MatchFilter

Field Summary

Fields inherited from class org.opensextant.extraction.MatchFilter
tagFilter
Constructor Summary

Constructors

| Constructor | Description |
| --- | --- |
| TagFilter() | NOTE: This expects all resource files to be available. |
Method Summary

| Modifier and Type | Method | Description |
| --- | --- | --- |
| boolean | assessAllFilters(String textnorm) | Run a term (already lowercased) against all stop filters. |
| void | enableCaseSensitive(boolean b) | |
| void | enableStopwordFilter(boolean b) | |
| boolean | filterOut(String t) | Default filtering rules: (a) if the filter is in case-sensitive mode (the default), all lower-case matches are ignored and only mixed-case or upper-case matches pass; (b) if the match term, t, is in the stop word list, it is filtered out. |
| boolean | filterOut(String langId, String termLower) | |
| boolean | filterOut(PlaceCandidate t, String langId, boolean docIsUpper, boolean docIsLower) | Experimental. |
| static Set<String> | loadExclusions(InputStream filestream) | Exclusions have two columns in a CSV file. |

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Constructor Details
- TagFilter
  
  public TagFilter() throws IOException
  
  NOTE: This expects all resource files to be available. Construction fails if any resource file is missing.
  
  Throws:
  
  org.opensextant.ConfigException - if any file has a problem.
  
  IOException
Method Details
- enableStopwordFilter
  
  public void enableStopwordFilter(boolean b)
- enableCaseSensitive
  
  public void enableCaseSensitive(boolean b)
- filterOut
  
  public boolean filterOut(String t)
  
  Default filtering rules:
  
  (a) If the filter is in case-sensitive mode (the default), all lower-case matches are ignored; only mixed-case or upper-case matches pass.
  
  (b) If the match term, t, is in the stop word list, it is filtered out. Case is ignored for this lookup.
  
  TODO: filter rules. If the text match is all lower case and the filter is case-sensitive, this filters out every lower-case match. That is not optimal; the rule should take the alpha-case of the document into account.
  
  TODO: trivial for the general case, but important: stopTerms is hashed only by lower-case value, so native-case lookup is not possible.
  
  Overrides:
  
  filterOut in class org.opensextant.extraction.MatchFilter
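The two default rules for filterOut(String t) can be illustrated with a small stand-alone sketch. This is not the OpenSextant source: the class name `DefaultFilterSketch`, its fields, and the assumption that the stop word filter is enabled by default are all hypothetical, chosen only to mirror the behavior described above.

```java
import java.util.Locale;
import java.util.Set;

/** Hypothetical stand-in for the documented rules; not the OpenSextant implementation. */
class DefaultFilterSketch {
    private final Set<String> stopTerms;     // assumed to hold lower-case terms only
    private boolean caseSensitive = true;    // documented default
    private boolean stopwordFilter = true;   // assumption: enabled by default

    DefaultFilterSketch(Set<String> stopTermsLower) {
        this.stopTerms = stopTermsLower;
    }

    void enableCaseSensitive(boolean b) { caseSensitive = b; }

    void enableStopwordFilter(boolean b) { stopwordFilter = b; }

    /** Returns true if the match term should be filtered out. */
    boolean filterOut(String t) {
        String lower = t.toLowerCase(Locale.ROOT);
        // Rule (a): in case-sensitive mode an all-lower-case match is dropped.
        // Note: a term with no letters (e.g. digits only) also equals its
        // lower-case form and is dropped by this simplified check.
        if (caseSensitive && t.equals(lower)) {
            return true;
        }
        // Rule (b): stop terms are keyed by lower-case value, so the lookup ignores case.
        return stopwordFilter && stopTerms.contains(lower);
    }

    public static void main(String[] args) {
        DefaultFilterSketch filter = new DefaultFilterSketch(Set.of("the", "of"));
        System.out.println(filter.filterOut("paris")); // true: all lower case
        System.out.println(filter.filterOut("Paris")); // false: mixed case, not a stop term
        System.out.println(filter.filterOut("The"));   // true: stop term, case ignored
        filter.enableCaseSensitive(false);
        System.out.println(filter.filterOut("paris")); // false: rule (a) disabled
    }
}
```

The sketch also shows why the first TODO matters: in a document written entirely in lower case, rule (a) would discard every match.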
- filterOut
  
  public boolean filterOut(PlaceCandidate t, String langId, boolean docIsUpper, boolean docIsLower)
  
  Experimental. Using a proper language ID (ISO 2-character code for now), determine whether the given term, t, is a stop term in that language.
  
  Parameters:
  
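  t - the candidate term to check
  
  langId - language ID (ISO 2-character code)
  
  docIsUpper - true if input doc is mostly upper case
  
  docIsLower - true if input doc is mostly lower case
  
  Returns:
  
  true if the term should be filtered out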
- filterOut
  
  public boolean filterOut(String langId, String termLower)
  
  Parameters:
  
  langId - language ID to check.
  
  termLower - lower-case term.
  
  Returns:
  
  true if the term should be filtered out
- assessAllFilters
  
  public boolean assessAllFilters(String textnorm)
  
  Run a term (already lowercased) against all stop filters.
  
  Parameters:
  
  textnorm - the normalized (already lowercased) term
  
  Returns:
  
  true if the term matches any stop filter
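As a rough illustration of how filterOut(String langId, String termLower) and assessAllFilters(String textnorm) might relate, the sketch below keeps one set of lower-case stop terms per language ID. The class `LangStopFilterSketch`, the `addFilter` helper, the map layout, and the choice to filter nothing for an unknown language are assumptions made for this example; the actual TagFilter internals are not documented here.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

/** Hypothetical per-language stop filter; not the OpenSextant implementation. */
class LangStopFilterSketch {
    // Assumed layout: ISO 2-character language ID -> lower-case stop terms.
    private final Map<String, Set<String>> stopFilters = new HashMap<>();

    /** Hypothetical helper for registering a language's stop terms. */
    void addFilter(String langId, Set<String> termsLower) {
        stopFilters.put(langId, termsLower);
    }

    /** True if termLower is a stop term in the given language. */
    boolean filterOut(String langId, String termLower) {
        Set<String> terms = stopFilters.get(langId);
        // Assumption: a language without a stop list filters nothing.
        return terms != null && terms.contains(termLower);
    }

    /** True if the already-lowercased term appears in any language's stop list. */
    boolean assessAllFilters(String textnorm) {
        for (Set<String> terms : stopFilters.values()) {
            if (terms.contains(textnorm)) {
                return true;
            }
        }
        return false;
    }
}
```

Both lookups expect the caller to lowercase the term first, matching the "already lowercased" contract stated for assessAllFilters.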
- loadExclusions
  
  public static Set<String> loadExclusions(InputStream filestream) throws org.opensextant.ConfigException
  
  Exclusions are read from a CSV file with two columns: 'exclusion' and 'category'. A "#" in the exclusion column marks the row as a comment. The caller is responsible for obtaining the I/O stream.
  
  Parameters:
  
  filestream - input stream over the URL or file containing the exclusion terms
  
  Returns:
  
  set of filter terms
  
  Throws:
  
  org.opensextant.ConfigException - if filter is not found
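The file format described for loadExclusions can be sketched with plain JDK classes. This is a minimal illustration, not the library's parser: the class `ExclusionLoaderSketch` is hypothetical, the naive comma split ignores quoted fields, UTF-8 encoding and lower-casing of terms are assumptions, and I/O failures are wrapped in `UncheckedIOException` where the documented method throws ConfigException.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/** Hypothetical reader for a two-column exclusion CSV; not the OpenSextant implementation. */
class ExclusionLoaderSketch {

    /** Collects the 'exclusion' column, skipping blank rows and "#" comment rows. */
    static Set<String> loadExclusions(InputStream filestream) {
        Set<String> terms = new HashSet<>();
        // The caller supplied the stream, so it is left open for the caller to close.
        BufferedReader reader =
                new BufferedReader(new InputStreamReader(filestream, StandardCharsets.UTF_8));
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                // Column 1 is the exclusion term; column 2 (category) is ignored here.
                String term = line.split(",", 2)[0].trim();
                if (term.isEmpty() || term.startsWith("#")) {
                    continue; // blank row or comment row
                }
                // Assumption: terms are stored lower-cased for case-insensitive lookup.
                terms.add(term.toLowerCase(Locale.ROOT));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return terms;
    }
}
```

A caller would typically open the resource with try-with-resources and pass the stream in, which keeps stream ownership on the caller's side as the method description requires.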
