Class TagFilter

java.lang.Object
org.opensextant.extraction.MatchFilter
org.opensextant.extractors.geo.TagFilter

public class TagFilter extends org.opensextant.extraction.MatchFilter
  • Constructor Details

    • TagFilter

      public TagFilter() throws IOException
      NOTE: This constructor expects all resource files to be available and fails if any of them is missing.
      Throws:
      org.opensextant.ConfigException - if any file has a problem.
      IOException
  • Method Details

    • enableStopwordFilter

      public void enableStopwordFilter(boolean b)
    • enableCaseSensitive

      public void enableCaseSensitive(boolean b)
    • filterOut

      public boolean filterOut(String t)
      Default filtering rules:
      (a) If the filter is in case-sensitive mode (DEFAULT), all lower-case matches are ignored; only mixed-case or upper-case matches pass.
      (b) If the match term, t, is in the stop word list, it is filtered out. Case is ignored.
      TODO: filter rules -- if the text match is all lower case and the filter is case-sensitive, this filters out every lower-case match. This is not optimal; it should take into account the alpha-case of the document.
      TODO: trivial for the general case, but important: stopTerms is hashed only by lower-case value, so native-case lookup is not possible.
      Overrides:
      filterOut in class org.opensextant.extraction.MatchFilter
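      The two default rules can be sketched in isolation. The class below is a hypothetical stand-in written only from the description above; DefaultRuleSketch, its fields, and its constructor are assumptions and are not taken from the TagFilter source. The stop word filter is assumed to be enabled by default.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/** Illustrative stand-in for the documented rules; not the TagFilter source. */
class DefaultRuleSketch {
    // Stop terms are stored by lower-case value only, as the description notes.
    private final Set<String> stopTerms = new HashSet<>();
    private boolean caseSensitive = true;   // documented default
    private boolean stopwordFilter = true;  // assumed default for this sketch

    DefaultRuleSketch(Set<String> terms) {
        for (String term : terms) {
            stopTerms.add(term.toLowerCase(Locale.ROOT));
        }
    }

    void enableCaseSensitive(boolean b) { caseSensitive = b; }

    void enableStopwordFilter(boolean b) { stopwordFilter = b; }

    /** Returns true when the match text should be discarded. */
    boolean filterOut(String t) {
        String lower = t.toLowerCase(Locale.ROOT);
        // Rule (a): in case-sensitive mode, a match with no upper-case letters is ignored.
        if (caseSensitive && t.equals(lower)) {
            return true;
        }
        // Rule (b): stop words are filtered out regardless of case.
        return stopwordFilter && stopTerms.contains(lower);
    }
}
```

      In this sketch "paris" is filtered out while case-sensitive mode is on, whereas "Paris" passes unless it appears in the stop word list.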
    • filterOut

      public boolean filterOut(PlaceCandidate t, String langId, boolean docIsUpper, boolean docIsLower)
      Experimental. Using a proper language ID (ISO 2-char for now), determine whether the given term, t, is a stop term in that language.
      Parameters:
      t - the place candidate to test
      langId - language ID of the text (ISO 2-char for now)
      docIsUpper - true if input doc is mostly upper
      docIsLower - true if input doc is mostly lower
      Returns:
      true if the candidate should be filtered out
    • filterOut

      public boolean filterOut(String langId, String termLower)
      Parameters:
      langId - lang ID to check.
      termLower - lower case term.
      Returns:
      true if the term should be filtered out for the given language
    • assessAllFilters

      public boolean assessAllFilters(String textnorm)
      Run a term (already lowercased) against all stop filters.
      Parameters:
      textnorm - the term to test, already lowercased
      Returns:
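      A per-language lookup and an all-languages check might look like the following sketch. It is a guess at the general shape only: LangStopSketch, addStopTerm, and the map-of-sets storage are hypothetical, and returning false for an unknown language ID is an assumption, not documented behavior.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical per-language stop-term lookup; names and storage are assumptions. */
class LangStopSketch {
    // Language ID (ISO 2-char, e.g. "es") mapped to its lower-case stop terms.
    private final Map<String, Set<String>> stopByLang = new HashMap<>();

    void addStopTerm(String langId, String termLower) {
        stopByLang.computeIfAbsent(langId, k -> new HashSet<>()).add(termLower);
    }

    /** True if termLower is a stop term in the given language. */
    boolean filterOut(String langId, String termLower) {
        Set<String> terms = stopByLang.get(langId);
        // Assumption: a language with no loaded filter never filters anything.
        return terms != null && terms.contains(termLower);
    }

    /** True if the already-lowercased term is a stop term in any loaded language. */
    boolean assessAllFilters(String textnorm) {
        for (Set<String> terms : stopByLang.values()) {
            if (terms.contains(textnorm)) {
                return true;
            }
        }
        return false;
    }
}
```

      Both methods expect lower-case input, so a caller would normalize the term once before checking it.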
    • loadExclusions

      public static Set<String> loadExclusions(InputStream filestream) throws org.opensextant.ConfigException
      Exclusions are read from a CSV file with two columns, 'exclusion' and 'category'. A "#" in the exclusion column marks a comment. The caller is responsible for obtaining the I/O stream.
      Parameters:
      filestream - input stream opened on the URL or file with exclusion terms
      Returns:
      set of filter terms
      Throws:
      org.opensextant.ConfigException - if filter is not found
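      The file layout described for loadExclusions can be illustrated with a small reader. This is a minimal sketch, not the library's parser: ExclusionLoaderSketch is a hypothetical name, terms are lower-cased by assumption, a row whose first column is literally "exclusion" is assumed to be a header, quoted CSV fields are not handled, and I/O failures are rethrown as UncheckedIOException where the documented method reports org.opensextant.ConfigException.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/** Hypothetical reader for a two-column exclusions CSV; not the library's parser. */
class ExclusionLoaderSketch {

    static Set<String> load(InputStream in) {
        Set<String> terms = new HashSet<>();
        try (BufferedReader reader =
                new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                // Column 1 is the exclusion term; column 2 (category) is ignored here.
                String exclusion = line.split(",", 2)[0].trim();
                // Skip blank rows and rows whose exclusion column starts with "#".
                if (exclusion.isEmpty() || exclusion.startsWith("#")) {
                    continue;
                }
                // Assumption: a first column of "exclusion" is the header row.
                if (exclusion.equalsIgnoreCase("exclusion")) {
                    continue;
                }
                terms.add(exclusion.toLowerCase(Locale.ROOT));
            }
        } catch (IOException err) {
            throw new UncheckedIOException(err);
        }
        return terms;
    }
}
```

      As with the documented method, the caller opens the stream; the sketch closes it once the rows have been read.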