Class NonsenseFilter

java.lang.Object
org.opensextant.extractors.geo.rules.GeocodeRule
org.opensextant.extractors.geo.rules.NonsenseFilter

public class NonsenseFilter extends GeocodeRule
Filter out nonsense tokens that match some city or state name. Indicators are irregular whitespace and mixed punctuation. This does not apply to longer matches; the default nonsense length is 10 chars or shorter.
 // Do. do do
 // ta-da
 // doo doo
 
Author:
ubaldino
  • Field Details

    • AV

      public static final int AV
    • PHRASE_DENSITY_CHAR_RATIO

      public static final int PHRASE_DENSITY_CHAR_RATIO
      Names of places should have a ratio of about N=5 chars to non-chars. "A BC" 3:1 filtered out. "AB CD" 4:1 filtered out. "AB BCD" 5:1 possibly acceptable.
  • Constructor Details

    • NonsenseFilter

      public NonsenseFilter()
  • Method Details

    • isValidAbbreviation

      public static boolean isValidAbbreviation(String s)
      Test for simple abbreviations.
      Parameters:
      s -
      Returns:
    • evaluate

      public void evaluate(List<PlaceCandidate> names)
      Evaluate the name in each list of names.
       doo doo      - FAIL
       St. Paul     - PASS
       south"  bend - FAIL
       
      Overrides:
      evaluate in class GeocodeRule
      Parameters:
      names - list of found place names
    • assessPhraseDensity

      public static boolean assessPhraseDensity(org.opensextant.extraction.TextMatch p)
      Parameters:
      p -
      Returns:
      True if the ratio of alphanumeric to non-alphanumeric content is at or above the default threshold
    • assessPhraseDensity

      public static boolean assessPhraseDensity(String name, int charRatio)
      Parameters:
      name -
      charRatio -
      Returns:
      True if the ratio of alphanumeric to non-alphanumeric content is at or above the charRatio threshold
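
      The ratio test described for PHRASE_DENSITY_CHAR_RATIO can be illustrated with a minimal sketch. This is not the library's implementation: the class and method names (PhraseDensitySketch, isDense) are hypothetical, and treating a name with no non-alphanumeric characters as dense is an assumption.

```java
// Hypothetical stand-in for the documented ratio check; not the OpenSextant code.
final class PhraseDensitySketch {

    /** Mirrors the documented default of about N=5 chars per non-char. */
    static final int DEFAULT_CHAR_RATIO = 5;

    /**
     * Returns true when alphanumeric characters outnumber all other
     * characters by at least charRatio to 1.
     */
    static boolean isDense(String name, int charRatio) {
        int alnum = 0;
        int other = 0;
        for (int i = 0; i < name.length(); ) {
            int cp = name.codePointAt(i);
            if (Character.isLetterOrDigit(cp)) {
                alnum++;
            } else {
                other++;
            }
            i += Character.charCount(cp);
        }
        // Assumption: a name made only of letters and digits counts as dense.
        if (other == 0) {
            return true;
        }
        return alnum >= charRatio * other;
    }
}
```

      Under these assumptions, "A BC" (3:1) and "AB CD" (4:1) fall below a ratio of 5, while "AB BCD" (5:1) meets it, which matches the examples given for the field.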
    • assessPunctuation

      public static boolean assessPunctuation(PlaceCandidate p)
      Optimize punctuation detection and filtration. This routine marks the candidate as filtered or not, and returns a status indicating whether something was done.

      Results:
        - no punctuation found: continue
        - valid punctuation found: exit nonsense filter
        - invalid punctuation found: mark filtered out, exit nonsense filter
        - inconclusive: continue

      Parameters:
      p -
      Returns:
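
      One way to picture the four documented results is as an enum that records, for each case, whether the candidate is marked filtered and whether the nonsense filter stops. This is only a sketch: PunctuationOutcome and its fields are hypothetical names, the actual method returns a boolean, and how the four cases map onto that boolean is not stated here.

```java
// Hypothetical model of the four documented results; not part of the NonsenseFilter API.
enum PunctuationOutcome {
    NONE(false, false),          // no punctuation found: continue
    VALID(false, true),          // valid punctuation found: exit nonsense filter
    INVALID(true, true),         // invalid punctuation found: mark filtered out, exit
    INCONCLUSIVE(false, false);  // inconclusive: continue

    /** Whether the candidate would be marked as filtered out. */
    final boolean markFiltered;
    /** Whether evaluation by the nonsense filter would stop here. */
    final boolean exitFilter;

    PunctuationOutcome(boolean markFiltered, boolean exitFilter) {
        this.markFiltered = markFiltered;
        this.exitFilter = exitFilter;
    }
}
```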
    • assessPhoneticMatch

      public void assessPhoneticMatch(PlaceCandidate p)
      Assess the validity of a match candidate against the geographic names associated with it. For example, if you have ÄEÃ, how well does it match Aeå, Aea or aeA? This is intended for ruling out short nonsense matches phonetically, but NOT for ranking location names for a given candidate.
      Parameters:
      p -
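
      The ÄEÃ example can be made concrete with a simpler stand-in for phonetic comparison: fold away diacritics and case, then compare. This is not the phonetic reduction the method uses, which is not described here; FoldedMatchSketch, fold and foldedEquals are hypothetical names.

```java
import java.text.Normalizer;
import java.util.Locale;

// Hypothetical stand-in: diacritic and case folding, not a true phonetic algorithm.
final class FoldedMatchSketch {

    /** Decompose accented letters, drop combining marks, then lowercase. */
    static String fold(String s) {
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "").toLowerCase(Locale.ROOT);
    }

    /** True when two strings differ only by diacritics or letter case. */
    static boolean foldedEquals(String a, String b) {
        return fold(a).equals(fold(b));
    }

    public static void main(String[] args) {
        String match = "\u00C4E\u00C3";   // ÄEÃ, as in the example above
        String name = "Ae\u00E5";         // Aeå
        System.out.println(foldedEquals(match, name));
        System.out.println(foldedEquals(match, "aeA"));
    }
}
```

      A match that agrees with a gazetteer name only after this much folding is weaker evidence than an exact match, which is the kind of distinction the method is described as assessing.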
    • irregularCase

      public boolean irregularCase(String txt)
      Filter out cases of acronyms of the form AAa.... which match codes and abbreviations.
      Parameters:
      txt -
      Returns:
    • shortNumericText

      public static boolean shortNumericText(String t)
      "5th Street": fine. "5th A": ambiguous. "5) Bullet": no good.
      Parameters:
      t -
      Returns:
    • irregularCommonPunct

      public static boolean irregularCommonPunct(String t)
      If common punctuation such as (), [], !, &, $ is found within the match, then the match is unlikely to be a valid place name.
      Parameters:
      t -
      Returns:
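
      A minimal sketch of a check like the one described above, assuming the test is simply whether any of the listed characters occurs anywhere in the text. CommonPunctSketch and hasCommonPunct are hypothetical names, and the actual method may apply further conditions.

```java
// Hypothetical stand-in for the documented check; not the OpenSextant code.
final class CommonPunctSketch {

    /** The punctuation listed in the description: (), [], !, &, $ */
    private static final String COMMON_PUNCT = "()[]!&$";

    /** True when any of the listed punctuation characters appears in t. */
    static boolean hasCommonPunct(String t) {
        for (int i = 0; i < t.length(); i++) {
            if (COMMON_PUNCT.indexOf(t.charAt(i)) >= 0) {
                return true;
            }
        }
        return false;
    }
}
```

      Under this assumption, "South (Bend)" is flagged, while "St. Paul" is not because a period is not in the listed set.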
    • isIrregularPunct

      public static boolean isIrregularPunct(int punct, int strLength)
    • isIrregularPunct

      public static boolean isIrregularPunct(int punct, int strLength, int validCharRate)
    • regularAbbreviationPatterns

      public static boolean regularAbbreviationPatterns(String t)
    • evaluate

      public void evaluate(PlaceCandidate name, org.opensextant.data.Place geo)
      Description copied from class: GeocodeRule
      The one evaluation scheme that all rules must implement. Given a single text match and a location, consider if the geo is a good geocoding for the match.
      Specified by:
      evaluate in class GeocodeRule
      Parameters:
      name - matched name in text
      geo - gazetteer entry or location