Class NonsenseFilter
java.lang.Object
org.opensextant.extractors.geo.rules.GeocodeRule
org.opensextant.extractors.geo.rules.NonsenseFilter
Filters out nonsense tokens that match some city or state name.
Indicators are irregular whitespace and mixed punctuation.
This does not apply to longer matches. The default nonsense length is 10 chars or shorter.
// Do. do do // ta-da // doo doo
- Author: ubaldino
-
Field Summary
Modifier and Type | Field | Description
static final int | AV |
static final int | PHRASE_DENSITY_CHAR_RATIO | Names of places should have about N=5 chars to non-chars.
Fields inherited from class org.opensextant.extractors.geo.rules.GeocodeRule
AVG_WORD_LEN, boundaryObserver, coordObserver, countryObserver, defaultMethod, LEX1, LEX2, locationOnly, log, LOWERCASE, NAME, textCase, UPPERCASE, weight
-
Constructor Summary
-
Method Summary
Modifier and Type | Method | Description
void | assessPhoneticMatch | Assess the validity of a match candidate with the geographic names associated with it.
static boolean | assessPhraseDensity(String name, int charRatio) |
static boolean | assessPhraseDensity(org.opensextant.extraction.TextMatch p) |
static boolean | assessPunctuation | Optimize punctuation detection and filtration.
void | evaluate(List<PlaceCandidate> names) | Evaluate the name in each list of names.
void | evaluate(PlaceCandidate name, org.opensextant.data.Place geo) | The one evaluation scheme that all rules must implement.
boolean | irregularCase(String txt) | Filter out cases of acronyms of the form AAa....
static boolean | irregularCommonPunct | If common punctuation (), [], !, &, $ are found within the match, then the name is not likely the right thing.
static boolean | isIrregularPunct(int punct, int strLength) |
static boolean | isIrregularPunct(int punct, int strLength, int validCharRate) |
static boolean | isValidAbbreviation | Test for simple abbreviations.
static boolean | regularAbbreviationPatterns |
static boolean | shortNumericText | 5th Street -- fine.
Methods inherited from class org.opensextant.extractors.geo.rules.GeocodeRule
filterByNameOnly, filterOutByFrequency, internalPlaceID, isRelevant, isShort, logMsg, reset, sameBoundary, sameCountry, sameCountry, sameLexicalName, setBoundaryObserver, setCountryObserver, setDefaultMethod, setGeohash, setLocationObserver, setTextCase, textCase
-
Field Details
-
AV
public static final int AV
-
PHRASE_DENSITY_CHAR_RATIO
public static final int PHRASE_DENSITY_CHAR_RATIO
Names of places should have about N=5 chars to non-chars.
"A BC" 3:1 filtered out.
"AB CD" 4:1 filtered out.
"AB BCD" 5:1 possibly acceptable.
-
-
Constructor Details
-
NonsenseFilter
public NonsenseFilter()
-
-
Method Details
-
isValidAbbreviation
Test for simple abbreviations.
- Parameters:
s
- Returns:
-
evaluate
Evaluate the name in each list of names.
doo doo - FAIL
St. Paul - PASS
south" bend - FAIL
- Overrides:
evaluate in class GeocodeRule
- Parameters:
names - list of found place names
-
assessPhraseDensity
public static boolean assessPhraseDensity(org.opensextant.extraction.TextMatch p)
- Parameters:
p
- Returns:
True if the ratio of alphanumeric to non-alphanumeric content is at or above the default threshold
-
assessPhraseDensity
- Parameters:
name
charRatio
- Returns:
True if the ratio of alphanumeric to non-alphanumeric content is at or above the charRatio threshold
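The ratio check described above can be illustrated with a minimal standalone sketch. It is not the OpenSextant source: the class name PhraseDensitySketch, the constant DEFAULT_CHAR_RATIO (set to 5 from the "N=5" note on PHRASE_DENSITY_CHAR_RATIO), and the handling of names with no non-alphanumeric characters are all assumptions made for illustration.

```java
// Illustrative sketch only; not the OpenSextant implementation.
final class PhraseDensitySketch {
    // Assumed value, taken from the documented "about N=5 chars to non-chars".
    static final int DEFAULT_CHAR_RATIO = 5;

    /**
     * Returns true if the count of alphanumeric characters is at least
     * charRatio times the count of all other characters.
     */
    static boolean assessPhraseDensity(String name, int charRatio) {
        int alnum = 0;
        int other = 0;
        for (int i = 0; i < name.length(); ) {
            int cp = name.codePointAt(i);
            if (Character.isLetterOrDigit(cp)) {
                alnum++;
            } else {
                other++;
            }
            i += Character.charCount(cp);
        }
        // Assumption: a name with no whitespace or punctuation is dense enough.
        if (other == 0) {
            return true;
        }
        // Multiply instead of dividing to stay in integer arithmetic.
        return alnum >= charRatio * other;
    }
}
```

With a ratio of 5, this sketch rejects "A BC" (3:1) and "AB CD" (4:1) and accepts "AB BCD" (5:1), matching the examples given for PHRASE_DENSITY_CHAR_RATIO.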
-
assessPunctuation
Optimize punctuation detection and filtration. This routine marks the candidate as filtered or not, as well as returning a status indicating something was done.
Results:
  no punctuation found - continue
  valid punctuation found - exit nonsense filter
  invalid punctuation found - mark filtered out, exit nonsense filter
  inconclusive - continue
- Parameters:
p
- Returns:
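The four results listed for assessPunctuation can be modeled as a small enum. This is a hypothetical sketch for reading the list, not the library's code: the names PunctuationOutcomeSketch, Outcome, markFilteredOut, and exitFilter are invented here, and how these outcomes map onto the method's actual boolean return value is not stated in this page.

```java
// Illustrative sketch only; all names here are hypothetical.
final class PunctuationOutcomeSketch {
    enum Outcome {
        NO_PUNCTUATION(false, false),      // continue with later checks
        VALID_PUNCTUATION(false, true),    // keep candidate, exit nonsense filter
        INVALID_PUNCTUATION(true, true),   // mark filtered out, exit nonsense filter
        INCONCLUSIVE(false, false);        // continue with later checks

        final boolean markFilteredOut;
        final boolean exitFilter;

        Outcome(boolean markFilteredOut, boolean exitFilter) {
            this.markFilteredOut = markFilteredOut;
            this.exitFilter = exitFilter;
        }
    }
}
```

Only the invalid-punctuation outcome marks the candidate as filtered out. The two "continue" outcomes leave the candidate untouched so that the remaining nonsense checks can still run.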
-
assessPhoneticMatch
Assess the validity of a match candidate against the geographic names associated with it. For example, given ÄEÃ, how well does it match Aeå, Aea, or aeA? This is intended for phonetically ruling out short nonsense matches, but NOT for ranking location names for a given candidate.
- Parameters:
p
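To make the ÄEÃ example concrete, the sketch below folds diacritics and case with the standard java.text.Normalizer and compares the results. It only illustrates why such strings look alike after folding while differing exactly. It is not the comparison that assessPhoneticMatch performs, and the names PhoneticFoldSketch, fold, and sameFolded are hypothetical.

```java
import java.text.Normalizer;
import java.util.Locale;

// Illustrative sketch only; not the OpenSextant implementation.
final class PhoneticFoldSketch {
    /** Strips combining marks after NFD decomposition, then lowercases. */
    static String fold(String s) {
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "").toLowerCase(Locale.ROOT);
    }

    /** True if the two strings are equal once diacritics and case are ignored. */
    static boolean sameFolded(String a, String b) {
        return fold(a).equals(fold(b));
    }
}
```

Under this folding, ÄEÃ, Aeå, Aea, and aeA all reduce to the same three letters, even though none of them equals another exactly. A filter for short matches could use that gap between exact and folded equality as a signal.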
-
-
irregularCase
Filter out cases of acronyms of the form AAa.... which match codes and abbreviations.
- Parameters:
txt
- Returns:
-
shortNumericText
5th Street -- fine.
5th A -- ambiguous
5) Bullet -- no good.
- Parameters:
t
- Returns:
-
irregularCommonPunct
If common punctuation (), [], !, &, $ are found within the match, then the name is not likely the right thing.
- Parameters:
t
- Returns:
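A minimal sketch of the check described above, assuming it amounts to scanning the match for any of the listed characters. The class and method names (CommonPunctSketch, hasCommonPunct) are hypothetical, and the real method may apply further conditions that this page does not describe.

```java
// Illustrative sketch only; not the OpenSextant implementation.
final class CommonPunctSketch {
    // The punctuation named in the description: (), [], !, &, $
    private static final String COMMON_PUNCT = "()[]!&$";

    /** True if any of the listed punctuation characters occurs in the text. */
    static boolean hasCommonPunct(String t) {
        for (int i = 0; i < t.length(); i++) {
            if (COMMON_PUNCT.indexOf(t.charAt(i)) >= 0) {
                return true;
            }
        }
        return false;
    }
}
```

In this sketch a match such as "Paris (TX)" is flagged, while "St. Paul" is not, because the period is not among the listed characters.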
-
isIrregularPunct
public static boolean isIrregularPunct(int punct, int strLength)
-
isIrregularPunct
public static boolean isIrregularPunct(int punct, int strLength, int validCharRate)
-
regularAbbreviationPatterns
-
evaluate
Description copied from class: GeocodeRule
The one evaluation scheme that all rules must implement. Given a single text match and a location, consider if the geo is a good geocoding for the match.
- Specified by:
evaluate in class GeocodeRule
- Parameters:
name - matched name in text
geo - gazetteer entry or location
-