Package org.opensextant.extractors.geo
Class PlaceCandidate
java.lang.Object
org.opensextant.extraction.TextEntity
org.opensextant.extraction.TextMatch
org.opensextant.extractors.geo.PlaceCandidate
- All Implemented Interfaces:
Comparable<org.opensextant.extraction.TextMatch>, org.opensextant.data.MatchSchema
public class PlaceCandidate
extends org.opensextant.extraction.TextMatch
A PlaceCandidate represents a portion of a document which has been identified
as a possible named geographic location. It collects the information from the
document (the evidence) together with the possible geographic locations it
could represent (the Places). It also holds the result of the final decision,
bestPlace: of all the places with the same or similar names, which place is it?
- Author:
- ubaldino, dlutz, based on OpenSextant Toolbox
-
Field Summary
Fields
Modifier and Type | Field | Description
static final int | ABBREVIATION_MAX_LEN |
static final String | DEFAULT_SCORE |
static final double | FEAT_WEIGHT |
boolean | hasDiacritics |
boolean | isAbbreviation | Match types - Abbreviation/Code, Acronym or normal (unknown).
boolean | isAcronym |
boolean | isContinent |
boolean | isCountry | Common evidence flags -- isCountry, isPerson, isOrganization, abbreviation, and acronym.
boolean | isOrganization |
boolean | isPerson |
static final String[] | KNOWN_GEO_SLOTS | Linked geographic slots, in no order.
static final double | LOCATION_BIAS_WEIGHT |
static final double | NAME_WEIGHT |
static int | SHORT_NAME_LEN |
static final Pattern | tokenizer |
static final String | VAL_SAME_COUNTRY |
Fields inherited from class org.opensextant.extraction.TextMatch
pattern_id, producer, type
Fields inherited from class org.opensextant.extraction.TextEntity
end, is_duplicate, is_overlap, is_submatch, match_id, postChar, preChar, start, text
Fields inherited from interface org.opensextant.data.MatchSchema
VAL_COORD, VAL_COUNTRY, VAL_PLACE, VAL_POSTAL, VAL_TAXON
-
Constructor Summary
Constructors
PlaceCandidate(int x1, int x2)
-
Method Summary
Modifier and Type | Method | Description
void | addAdmin1Evidence(String rule, double weight, String adm1, String cc) |
void | addCountryEvidence(String rule, double weight, String cc, org.opensextant.data.Place geo) | Add country evidence and increment score immediately.
void | addEvidence(String rule, double weight, org.opensextant.data.Place ev) |
void | addEvidence |
void | addFeatureClassEvidence(String rule, double weight, String fclass) |
void | addFeatureCodeEvidence(String rule, double weight, String fcode) |
void | addGeocoordEvidence(String rule, double weight, org.opensextant.data.LatLon coord, org.opensextant.data.Place geo, double proximityScore) | Add evidence and increment score immediately.
void | addPlace(ScoredPlace place) |
void | addPlace(ScoredPlace place, Double score) |
void | addRelated | Connect another match to this one, usually something co-occurring or collocated with this match.
void | addRule |
void | choose() | Get the most highly ranked Place, or null if the list is empty.
void | choose(ScoredPlace geo) | If the caller is willing to claim an explicit choice, so be it.
double | defaultScore(org.opensextant.data.Place g) | Given this candidate, how do you score the provided place just based on those place properties (and not on context, document properties, or other evidence)? This 'should' produce a base score of something between 0 and 1.0, or 0..10.
int | distinctCountryCount() | How many different countries contain this name?
int | distinctLocationCount() |
 | getChosen |
org.opensextant.data.Place | getChosenPlace() |
int | getConfidence() | See setConfidence.
 | getEvidence |
protected static String | getEvidenceID |
 | getFirstChoice |
org.opensextant.data.Geocoding | getGeocoding() | After the candidate has been scored, the final best place is the geocoding result for the given name in context.
 | getLinkedGeography | Get the collection of geographic slots geolocated.
 | getNDTextnorm |
 | getPlaces |
String[] | getPostmatchTokens |
String[] | getPrematchTokens |
 | getRelated |
 | getRules() |
org.opensextant.data.Place | getSecondChoice() |
double | getSecondChoiceScore() | Only call after the choose() operation.
 | getSurroundingText |
String[] | getTokens | Tokens in word.
int | getWordCount() | A basic whitespace- and punctuation-delimited count of grams. Set ONLY after inferTextSense() is invoked.
boolean | hasCJKText() |
boolean | hasEvidence() |
boolean | hasLinkedGeography(String slot) |
boolean | hasMiddleEasternText() |
boolean | hasPlaces() |
boolean | hasPostal() | Evaluate if postal matches reside in candidate locations.
boolean | hasRule |
void | incrementPlaceScore(org.opensextant.data.Place place, Double score, String rule) | Consolidate attaching rules to this name when also scoring candidate locations.
void | inferTextSense(boolean contextisLower, boolean contextisUpper) | Text heuristics.
boolean | isAbbrevLength() |
boolean | isAmbiguous() | This only makes sense if you called choose() first to sort scored places.
boolean | isAnchor() |
boolean | isDerived() |
boolean | isReviewed() |
boolean | isShortName() | Alias for "isAbbreviation || isAcronym" and a length criterion of less than PlaceCandidate.SHORT_NAME_LEN.
boolean | isValid() | True if candidate was marked as valid.
void | linkGeography(String slot, org.opensextant.data.Place geo) |
boolean | linkGeography(PlaceCandidate otherMention, String slot, String featPrefix) | Link geographic mention from another part of the document.
void | linkGeography(PlaceCandidate otherMention, String slot, org.opensextant.data.Place geo) | Forcibly link geography to the given slot.
 | makeKey(org.opensextant.data.Place p) | Each place has an ID, but this candidate scoring mechanism must score distinct ID+NAME tuples.
void | markAnchor() | Mark this mention as an anchor to build from, e.g., given a postal code, expand the tag to gather the related mentions for city, province, etc.
void | markValid() | Mark candidate as valid to protect it from being filtered out by downstream rules.
boolean | matchesCode() | To be used sparingly -- determine if a matched place for this text span is actually a code.
boolean | presentInCountry |
boolean | presentInHierarchy(String path) | Given a path 'a.b' (province b in country a), see if this name is present there.
protected double | scoreFeature(org.opensextant.data.Place g) | A preference for features that are major places or boundaries.
protected double | scoreName(org.opensextant.data.Place g) | Produce a goodness score in the range 0 to 1.0.
void | setChosen(ScoredPlace geo) | Unlike choose(Place), setChosen(Place) just sets the value.
void | setChosenPlace(org.opensextant.data.Place geo) |
void | setConfidence(int c) | Using a scale of 0 to 100, indicate how confident we are that the chosen place is best.
void | setDerived(boolean b) | Mark this candidate as something that was derived by special rules and should be treated differently, e.g., in formatting output or other situations.
void | setLinkedGeography(Map<String, org.opensextant.data.Place> geography) |
void | setPostmatchTokens(String[] toks) |
void | setPrematchTokens(String[] toks) |
void | setReviewed(boolean b) | A general purpose flag "reviewed" to indicate something was reviewed and to not repeat that task on this instance.
protected void | setSurroundingTokens(String sourceBuffer) | Get some sense of the tokens surrounding the match.
void | setText |
 | summarize(boolean dumpAll) | If you need a full printout of the data, use summarize(true).
String | toString() |
Methods inherited from class org.opensextant.extraction.TextMatch
compareTo, copy, defaultMatchId, getContentId, getMatchId, getTextnorm, getType, isDefault, isFilteredOut, isSame, isSameNorm, setFilteredOut, setType
Methods inherited from class org.opensextant.extraction.TextEntity
contains, copy, getContext, getContextAfter, getContextBefore, getLength, getText, isAfter, isASCII, isBefore, isLeftMatch, isLower, isMixedCase, isOverlap, isRightMatch, isSameMatch, isUpper, isWithin, isWithinChars, setContext, setContext, setTextOnly
-
Field Details
-
VAL_SAME_COUNTRY
- See Also:
-
KNOWN_GEO_SLOTS
Linked geographic slots, in no strict order. These help develop a fuller depiction of the context of a place mention through linked geography in these categorical slots. The slots are listed roughly in resolution order, fine to coarse.
POSTAL or other association, Country vs. "Same Country": for small territories, a POSTAL code may be associated with the country at ADM0 level, for example if there are not many admin boundaries, so the "Country" association is tight there. "Same Country" is much looser, indicating only that a mentioned place is in a mentioned country.
Holding off: VAL_COUNTRY
-
isCountry
public boolean isCountry
Common evidence flags -- isCountry, isPerson, isOrganization, abbreviation, and acronym.
-
isContinent
public boolean isContinent -
isPerson
public boolean isPerson -
isOrganization
public boolean isOrganization -
isAbbreviation
public boolean isAbbreviation
Match types - Abbreviation/Code, Acronym or normal (unknown). From found text we can only tell from case sense and punctuation whether the intended part of speech is a normal name/text or something coded, such as an abbreviation, alphanumeric code, or acronym. For this reason "isAbbreviation" accounts for both abbreviations and codes.
-
isAcronym
public boolean isAcronym -
hasDiacritics
public boolean hasDiacritics -
SHORT_NAME_LEN
public static int SHORT_NAME_LEN -
DEFAULT_SCORE
- See Also:
-
NAME_WEIGHT
public static final double NAME_WEIGHT
- See Also:
-
FEAT_WEIGHT
public static final double FEAT_WEIGHT
- See Also:
-
LOCATION_BIAS_WEIGHT
public static final double LOCATION_BIAS_WEIGHT
- See Also:
-
tokenizer
-
ABBREVIATION_MAX_LEN
public static final int ABBREVIATION_MAX_LEN
- See Also:
-
-
Constructor Details
-
PlaceCandidate
public PlaceCandidate(int x1, int x2)
-
-
Method Details
-
getNDTextnorm
-
setText
- Overrides:
setText in class org.opensextant.extraction.TextEntity
-
hasCJKText
public boolean hasCJKText() -
hasMiddleEasternText
public boolean hasMiddleEasternText() -
isAbbrevLength
public boolean isAbbrevLength() -
setDerived
public void setDerived(boolean b)
Mark this candidate as something that was derived by special rules and should be treated differently, e.g., in formatting output or other situations. A derivation may correct or subsume other non-derived mentions.
- Parameters:
b -
-
isDerived
public boolean isDerived() -
markAnchor
public void markAnchor()
Mark this mention as an anchor to build from, e.g., given a postal code, expand the tag to gather the related mentions for city, province, etc., or vice versa. In such situations you want one anchor per tuple.
-
isAnchor
public boolean isAnchor() -
setConfidence
public void setConfidence(int c)
Using a scale of 0 to 100, indicate how confident we are that the chosen place is best. Note this is different from the individual score assigned to each candidate place; we need one final confidence measure for this place mention.
- Parameters:
c -
-
getConfidence
public int getConfidence()
See setConfidence.
- Returns:
- confidence
-
choose
If the caller is willing to claim an explicit choice, so be it. Otherwise unchosen places go to disambiguation.
- Parameters:
geo -
-
addRelated
Connect another match to this one, usually something co-occurring or collocated with this match.
- Parameters:
pc -
-
getRelated
-
setSurroundingTokens
Get some sense of the tokens surrounding the match. Possibly optimize this by getting the token list from SolrTextTagger (which provides the language specifics).
- Parameters:
sourceBuffer -
-
isShortName
public boolean isShortName()
Alias for "isAbbreviation || isAcronym" and a length criterion of less than PlaceCandidate.SHORT_NAME_LEN.
- Returns:
- true if name is short and likely a code or abbreviation.
-
getGeocoding
public org.opensextant.data.Geocoding getGeocoding()
After the candidate has been scored, the final best place is the geocoding result for the given name in context.
- Returns:
- the chosen geocoding
-
setChosenPlace
public void setChosenPlace(org.opensextant.data.Place geo) -
getChosenPlace
public org.opensextant.data.Place getChosenPlace() -
getChosen
- Returns:
-
setChosen
Unlike choose(Place), setChosen(Place) just sets the value. choose() attempts to pull the ScoredPlace from internal cache.
- Parameters:
geo -
-
getFirstChoice
- Returns:
-
choose
public void choose()
Get the most highly ranked Place, or null if the list is empty. Typical usage:
choose() // this does the work; it has a performance cost
getChosen() // this is a getter; no performance cost
-
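The choose-then-read usage can be illustrated with stand-in types. This is a minimal sketch, assuming the ranking is a descending sort by score; ScoredEntry and ChooseSketch are hypothetical names and are not part of the OpenSextant API.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a scored gazetteer entry; NOT the OpenSextant ScoredPlace class.
class ScoredEntry {
    final String name;
    final double score;

    ScoredEntry(String name, double score) {
        this.name = name;
        this.score = score;
    }
}

class ChooseSketch {
    private final List<ScoredEntry> entries = new ArrayList<>();
    private ScoredEntry chosen; // cached by choose(); null until then

    void add(ScoredEntry e) {
        entries.add(e);
    }

    // Does the work: sorts by descending score and caches the top entry (null if the list is empty).
    void choose() {
        entries.sort((a, b) -> Double.compare(b.score, a.score));
        chosen = entries.isEmpty() ? null : entries.get(0);
    }

    // Plain getter: no sorting and no scoring happens here.
    ScoredEntry getChosen() {
        return chosen;
    }
}
```

Calling choose() once and then reading getChosen() as often as needed keeps the sorting cost in one place.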
matchesCode
public boolean matchesCode()
To be used sparingly -- determine if a matched place for this text span is actually a code. Example: YYZ -- an airport code; Yyz -- a transliterated name. If we are not tagging coded information then short abbreviations are ignorable.
- Returns:
- True if a Geographic place for this match is actually a CODE
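The YYZ versus Yyz distinction can be sketched as a case-sense test. This is an illustration only: it assumes a code-like span is short and all uppercase letters or digits, and the length cap of 5 and the class name CodeSenseSketch are hypothetical, not values from this class.

```java
class CodeSenseSketch {
    // Assumption: a "code-like" span is short and made up only of uppercase letters or digits.
    static boolean looksLikeCode(String text, int maxLen) {
        if (text.isEmpty() || text.length() > maxLen) {
            return false;
        }
        for (int i = 0; i < text.length(); i++) {
            char ch = text.charAt(i);
            if (!(Character.isUpperCase(ch) || Character.isDigit(ch))) {
                return false;
            }
        }
        return true;
    }

    // True only when the text looks coded AND the matched gazetteer name is itself flagged as a code.
    static boolean matchesCode(String text, boolean placeNameIsCode) {
        return placeNameIsCode && looksLikeCode(text, 5); // 5 is a hypothetical length cap
    }
}
```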
-
isAmbiguous
public boolean isAmbiguous()
This only makes sense if you called choose() first to sort scored places.
- Returns:
- true if two choices are tied
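Why the sort must come first can be seen in a small sketch. It assumes "tied" means the top two scores are exactly equal and that 0.0 stands in when there is no second choice; both are assumptions, and AmbiguitySketch is a hypothetical helper rather than the class's own logic.

```java
import java.util.Arrays;

class AmbiguitySketch {
    // Returns a copy of the scores sorted high to low, the state a choose()-style sort would leave.
    static double[] rank(double[] scores) {
        double[] sorted = scores.clone();
        Arrays.sort(sorted); // ascending
        for (int i = 0, j = sorted.length - 1; i < j; i++, j--) {
            double tmp = sorted[i];
            sorted[i] = sorted[j];
            sorted[j] = tmp;
        }
        return sorted;
    }

    // Assumption: "tied" means the top two ranked scores are exactly equal.
    static boolean isAmbiguous(double[] ranked) {
        return ranked.length > 1 && ranked[0] == ranked[1];
    }

    // Assumption: 0.0 stands in when there is no second choice.
    static double secondChoiceScore(double[] ranked) {
        return ranked.length > 1 ? ranked[1] : 0.0;
    }
}
```

Without the ranking step, positions 0 and 1 are arbitrary entries and the tie test is meaningless.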
-
getSecondChoiceScore
public double getSecondChoiceScore()
Only call after the choose() operation.
- Returns:
- score
-
getSecondChoice
public org.opensextant.data.Place getSecondChoice()
- Returns:
- ScoredPlace, choice2
-
getPlaces
- Returns:
- all values of scored places. Not a copy
-
addPlace
- Parameters:
place -
-
makeKey
Each place has an ID, but this candidate scoring mechanism must score distinct ID+NAME tuples, as name variances play into scoring and choosing.
- Parameters:
p -
- Returns:
-
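To illustrate why a key over ID+NAME matters, the sketch below scores two name variants of one place separately. The key format, the separator, the ID "G123", and the class name PlaceKeySketch are all hypothetical; the actual key layout used by makeKey is not shown in this page.

```java
import java.util.HashMap;
import java.util.Map;

class PlaceKeySketch {
    // Hypothetical key format: place ID and name joined by a separator.
    static String makeKey(String placeId, String name) {
        return placeId + "#" + name;
    }

    public static void main(String[] args) {
        Map<String, Double> scores = new HashMap<>();
        // Same (made-up) place ID, two name variants: each gets its own score slot.
        scores.put(makeKey("G123", "Ivory Coast"), 0.9);
        scores.put(makeKey("G123", "Cote d'Ivoire"), 0.7);
        System.out.println(scores.size() + " scored ID+NAME tuples");
    }
}
```

Keying on the ID alone would collapse both variants into a single entry and lose the per-name score.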
addPlace
- Parameters:
place -
score -
-
defaultScore
public double defaultScore(org.opensextant.data.Place g)
Given this candidate, how do you score the provided place based only on the place's own properties (and not on context, document properties, or other evidence)? This should produce a base score between 0 and 1.0, or 0 and 10. These scores do not strictly need to stay in that range, as they are all relative. However, as rules fire and compare location data it is better to stay in a known range for sanity's sake.
- Parameters:
g -
- Returns:
- objective score for the gazetteer entry
-
scoreName
protected double scoreName(org.opensextant.data.Place g)
Produce a goodness score in the range 0 to 1.0. Trivial examples of name matching, given some patterns of 'geo' matching text:
case 1. 'Alberta' matches ALBERTA or alberta just fine.
case 2. 'La' matches LA; however, knowing "LA" is an acronym/abbreviation adds to the score of any geo that actually is "LA".
case 3. 'Afghanestan' matches Afghanistan, but decrement because it is not perfectly spelled.
- Parameters:
g -
- Returns:
- score for a given name based on all of its diacritics
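The three cases can be approximated with a small scoring sketch. It is not the scoring used by this class: the 80/20 weighting, the use of edit distance for misspellings, and the names NameScoreSketch, fold, and editDistance are all assumptions chosen to keep the result in the 0 to 1.0 range.

```java
import java.text.Normalizer;
import java.util.Locale;

class NameScoreSketch {
    // Lowercase and strip combining marks so that case and diacritics do not affect comparison.
    static String fold(String s) {
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}", "").toLowerCase(Locale.ROOT);
    }

    // Classic Levenshtein distance using two rolling rows.
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            int[] cur = new int[b.length() + 1];
            cur[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                cur[j] = Math.min(Math.min(cur[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            prev = cur;
        }
        return prev[b.length()];
    }

    // Hypothetical weighting: 80% spelling similarity, 20% agreement on "is this a code/abbreviation?".
    static double scoreName(String text, boolean textIsCode, String geoName, boolean geoIsCode) {
        String a = fold(text);
        String b = fold(geoName);
        int longest = Math.max(1, Math.max(a.length(), b.length()));
        double similarity = 1.0 - (double) editDistance(a, b) / longest;
        double senseBonus = (textIsCode == geoIsCode) ? 0.2 : 0.0;
        return 0.8 * similarity + senseBonus;
    }
}
```

Under these assumptions, case 1 scores at the top of the range, case 2 scores higher when both sides agree the name is a code, and case 3 loses a little for the one-letter misspelling.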
-
scoreFeature
protected double scoreFeature(org.opensextant.data.Place g)
A preference for features that are major places or boundaries. This yields a feature score on a 0 to 1.0 point scale.
- Parameters:
g -
- Returns:
- feature score
-
incrementPlaceScore
Consolidate attaching rules to this name when also scoring candidate locations. This operation says a given Place deserves a certain increment in score for a certain reason.
- Parameters:
place -
score -
rule -
-
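The "increment plus reason" idea behind incrementPlaceScore can be sketched as one call that updates a per-place score and records the rule that fired. The storage layout, the place key "p1", the rule names, and the class name ScoreIncrementSketch are hypothetical.

```java
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

class ScoreIncrementSketch {
    private final Map<String, Double> scores = new HashMap<>(); // keyed by a place key
    private final Set<String> rules = new LinkedHashSet<>();    // rules seen for this name

    // One call does both jobs: bump the place's score and remember why.
    void incrementPlaceScore(String placeKey, double increment, String rule) {
        scores.merge(placeKey, increment, Double::sum);
        rules.add(rule);
    }

    double score(String placeKey) {
        return scores.getOrDefault(placeKey, 0.0);
    }

    boolean hasRule(String rule) {
        return rules.contains(rule);
    }
}
```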
getRules
- Returns:
- all rules
-
hasRule
- Parameters:
rule -
- Returns:
- true if candidate has seen this rule already
-
addRule
- Parameters:
rule -
-
getEvidenceID
- Parameters:
ev - evidence
- Returns:
- internal ID for evidence (rule + location)
-
addEvidence
- Parameters:
ev - evidence object
-
addEvidence
- Parameters:
rule -
weight -
ev -
-
addCountryEvidence
public void addCountryEvidence(String rule, double weight, String cc, org.opensextant.data.Place geo)
Add country evidence and increment score immediately.
- Parameters:
rule -
weight -
cc -
geo -
-
addAdmin1Evidence
- Parameters:
rule -
weight -
adm1 -
cc -
-
addFeatureClassEvidence
- Parameters:
rule -
weight -
fclass -
-
addFeatureCodeEvidence
- Parameters:
rule -
weight -
fcode -
-
addGeocoordEvidence
public void addGeocoordEvidence(String rule, double weight, org.opensextant.data.LatLon coord, org.opensextant.data.Place geo, double proximityScore)
Add evidence and increment score immediately.
- Parameters:
rule -
weight -
coord -
geo -
proximityScore -
-
getEvidence
- Returns:
- the current evidence
-
hasPlaces
public boolean hasPlaces()
- Returns:
- true if candidate has any associated potential locations
-
toString
- Overrides:
toString in class org.opensextant.extraction.TextMatch
- Returns:
- string representation of candidate
-
summarize
If you need a full printout of the data, use summarize(true).
- Parameters:
dumpAll -
- Returns:
- summary of evidence, rules and chosen location
-
getPrematchTokens
- Returns:
- the preceding tokens
-
setPrematchTokens
- Parameters:
toks - set preceding tokens
-
getPostmatchTokens
- Returns:
- tokens following name span
-
setPostmatchTokens
- Parameters:
toks - set following tokens
-
getSurroundingText
-
presentInHierarchy
Given a path 'a.b' (province b in country a), see if this name is present there.
- Parameters:
path -
- Returns:
- true if given path is represented by candidates' potential locations
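A path lookup of this kind can be sketched as a set of "country.province" strings built from the candidate locations. The storage, the example codes, and the class name HierarchySketch are assumptions for illustration only.

```java
import java.util.HashSet;
import java.util.Set;

class HierarchySketch {
    private final Set<String> paths = new HashSet<>();

    // Record a candidate location as "country.province"; the codes used by callers are illustrative.
    void addLocation(String countryCode, String adm1) {
        paths.add(countryCode + "." + adm1);
    }

    // True if any recorded candidate location sits at exactly this path.
    boolean presentInHierarchy(String path) {
        return paths.contains(path);
    }
}
```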
-
presentInCountry
- Parameters:
cc - country code
- Returns:
- true if candidate has potential locations for the given country code.
-
distinctCountryCount
public int distinctCountryCount()
How many different countries contain this name?
- Returns:
- count of distinct country codes inferred
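As a rough illustration, the count reduces to the size of a set of country codes gathered from the candidate locations. The input list and the class name CountryCountSketch are hypothetical.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class CountryCountSketch {
    // Duplicated codes collapse in the set, so only distinct countries are counted.
    static int distinctCountryCount(List<String> countryCodes) {
        Set<String> distinct = new HashSet<>(countryCodes);
        return distinct.size();
    }
}
```

A name found twice in "US" and once each in "FR" and "CA" yields a count of 3, a simple signal of how ambiguous the name is across countries.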
-
distinctLocationCount
public int distinctLocationCount()
- Returns:
- distinct locations by ID, not by geodetic location
-
markValid
public void markValid()
Mark candidate as valid to protect it from being filtered out by downstream rules.
-
isValid
public boolean isValid()
True if candidate was marked as valid. If valid, then avoid filters.
- Returns:
- true if rules have marked this candidate valid
-
hasEvidence
public boolean hasEvidence()
- Returns:
- true if candidate has any evidence.
-
getWordCount
public int getWordCount()
A basic whitespace- and punctuation-delimited count of grams. Set ONLY after inferTextSense() is invoked.
- Returns:
- token word count
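A whitespace- and punctuation-delimited count can be sketched with a single delimiter pattern. The exact pattern held in the tokenizer field is not shown in this page, so the expression below and the class name WordCountSketch are assumptions.

```java
import java.util.regex.Pattern;

class WordCountSketch {
    // Assumed delimiter pattern: runs of whitespace and/or ASCII punctuation.
    static final Pattern DELIMITERS = Pattern.compile("[\\s\\p{Punct}]+");

    // Collapse every delimiter run to one space, trim, then split.
    static String[] tokens(String text) {
        String normalized = DELIMITERS.matcher(text).replaceAll(" ").trim();
        return normalized.isEmpty() ? new String[0] : normalized.split(" ");
    }

    static int wordCount(String text) {
        return tokens(text).length;
    }
}
```

With this pattern "St. Louis" counts as two grams, and a hyphenated name such as "Winston-Salem" also counts as two.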
-
inferTextSense
public void inferTextSense(boolean contextisLower, boolean contextisUpper)
Text heuristics.
- Parameters:
contextisLower - True if text around mention is mainly lowercase
contextisUpper - True if text around mention is mainly uppercase
-
getTokens
Tokens in word. Available only after inferTextSense() is invoked.
- Returns:
-
getLinkedGeography
Get the collection of geographic slots geolocated. E.g., for a "Town Hall" building location you might link the Place object representing the "city" slot.
- Returns:
-
setLinkedGeography
-
linkGeography
Forcibly link geography to the given slot.
- Parameters:
otherMention -
slot -
geo -
- See Also:
-
linkGeography
-
hasLinkedGeography
-
linkGeography
Link geographic mention from another part of the document. E.g., for a "Town Hall" building location you might link the PlaceCandidate mention object representing the "city" slot. Method added to support PostalGeocoder. TBD.
- Parameters:
otherMention -
slot -
featPrefix -
- Returns:
- True if any link was made or already existed.
-
setReviewed
public void setReviewed(boolean b)
A general purpose flag "reviewed" to indicate something was reviewed and to not repeat that task on this instance.
- Parameters:
b -
-
isReviewed
public boolean isReviewed() -
hasPostal
public boolean hasPostal()
Evaluate if postal matches reside in candidate locations. Evaluate only once and save the result. We distinguish between "hasPostal" matches and marking this place as "is Postal": that is the difference between factual and inferential.
- Returns:
- true if postal features exist here.
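The "evaluate only once and save the result" behavior is a simple memoization, sketched below. The test for a postal feature (a feature code starting with "POST"), the evaluations counter, and the class name PostalCheckSketch are assumptions made for the illustration.

```java
import java.util.List;

class PostalCheckSketch {
    private final List<String> featureCodes; // feature codes of the candidate locations
    private Boolean hasPostal;               // null until first evaluated
    int evaluations = 0;                     // exposed only to make the caching visible

    PostalCheckSketch(List<String> featureCodes) {
        this.featureCodes = featureCodes;
    }

    boolean hasPostal() {
        if (hasPostal == null) {
            evaluations++;
            boolean found = false;
            for (String code : featureCodes) {
                if (code.startsWith("POST")) { // the "POST" prefix is an assumption
                    found = true;
                    break;
                }
            }
            hasPostal = found; // saved so later calls skip the scan
        }
        return hasPostal;
    }
}
```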
-