Package org.opensextant.extractors.geo
Class PlaceCandidate
java.lang.Object
org.opensextant.extraction.TextEntity
org.opensextant.extraction.TextMatch
org.opensextant.extractors.geo.PlaceCandidate
- All Implemented Interfaces:
Comparable<org.opensextant.extraction.TextMatch>, org.opensextant.data.MatchSchema
public class PlaceCandidate
extends org.opensextant.extraction.TextMatch
A PlaceCandidate represents a portion of a document that has been identified
as a possible named geographic location. It collects the information from the
document (the evidence) together with the possible geographic locations the
text could represent (the Places). It also holds the result of the final
decision, bestPlace: of all the places with the same or similar names, which
place is it?
- Author:
- ubaldino, dlutz, based on OpenSextant Toolbox
-
Field Summary
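Modifier and Type    Field    Description
static final int    ABBREVIATION_MAX_LEN
static final String    DEFAULT_SCORE
static final double    FEAT_WEIGHT
boolean    hasDiacritics
boolean    isAbbreviation    Match types - Abbreviation/Code, Acronym or normal (unknown).
boolean    isAcronym
boolean    isContinent
boolean    isCountry    Common evidence flags -- isCountry, isPerson, isOrganization, abbreviation, and acronym.
boolean    isOrganization
boolean    isPerson
static final String[]    KNOWN_GEO_SLOTS    Linked geographic slots, in no order.
static final double    LOCATION_BIAS_WEIGHT
static final double    NAME_WEIGHT
static int    SHORT_NAME_LEN
static final Pattern    tokenizer
static final String    VAL_SAME_COUNTRY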
Fields inherited from class org.opensextant.extraction.TextMatch
pattern_id, producer, type
Fields inherited from class org.opensextant.extraction.TextEntity
end, is_duplicate, is_overlap, is_submatch, match_id, postChar, preChar, start, text
Fields inherited from interface org.opensextant.data.MatchSchema
VAL_COORD, VAL_COUNTRY, VAL_PLACE, VAL_POSTAL, VAL_TAXON
-
Constructor Summary
-
Method Summary
Modifier and Type    Method    Description
void
addAdmin1Evidence
(String rule, double weight, String adm1, String cc) void
addCountryEvidence
(String rule, double weight, String cc, org.opensextant.data.Place geo) Add country evidence and increment score immediately.void
addEvidence
(String rule, double weight, org.opensextant.data.Place ev) void
void
addFeatureClassEvidence
(String rule, double weight, String fclass) void
addFeatureCodeEvidence
(String rule, double weight, String fcode) void
addGeocoordEvidence
(String rule, double weight, org.opensextant.data.LatLon coord, org.opensextant.data.Place geo, double proximityScore) Add evidence and increment score immediately.void
addPlace
(ScoredPlace place) void
addPlace
(ScoredPlace place, Double score) void
Connect another match to this one, usually something co-occurring or collocated with this match.
void
void
choose()
Get the most highly ranked Place, or Null if empty list.void
choose
(ScoredPlace geo) If caller is willing to claim an explicit choice, so be it.double
defaultScore
(org.opensextant.data.Place g) Given this candidate, how do you score the provided place just based on those place properties (and not on context, document properties, or other evidence)? This 'should' produce a base score of something between 0 and 1.0, or 0..10.int
How many different countries contain this name?
int
org.opensextant.data.Place
int
see setConfidence.protected static String
org.opensextant.data.Geocoding
After candidate has been scored and all, the final best place is the geocoding result for the given name in context.Get the collection of geographic slots geolocated.String[]
String[]
getRules()
org.opensextant.data.Place
double
Only call after choose() operation.String[]
Tokens in word.int
A basic count of grams delimited by whitespace and punctuation. Set ONLY after inferTextSense() is invoked.
boolean
boolean
boolean
hasLinkedGeography
(String slot) boolean
boolean
boolean
Evaluate if postal matches reside in candidate locations.boolean
void
incrementPlaceScore
(org.opensextant.data.Place place, Double score, String rule) Consolidate attaching Rules to this name when also scoring candidate locations.void
inferTextSense
(boolean contextisLower, boolean contextisUpper) Text heuristics.
boolean
boolean
This only makes sense if you tried choose() first to sort scored places.boolean
isAnchor()
boolean
boolean
boolean
Alias for "isAbbreviation || isAcronym" and a length criterion of less than PlaceCandidate.SHORT_NAME_LEN.
boolean
isValid()
if candidate was marked as valid.void
linkGeography
(String slot, org.opensextant.data.Place geo) boolean
linkGeography
(PlaceCandidate otherMention, String slot, String featPrefix) Link geographic mention from other part of the document.void
linkGeography
(PlaceCandidate otherMention, String slot, org.opensextant.data.Place geo) Forcibly link geography to the given slot.
makeKey
(org.opensextant.data.Place p) Each place has an ID, but this candidate scoring mechanism must score distinct ID+NAME tuples.void
Mark this mention as an anchor to build from, e.g., given a postal code expand the tag to gather the related mentions for city, province, etc.void
Mark candidate as valid to protect it from being filtered out by downstream rules.boolean
To be used sparingly -- determine if a matched place for this text span is actually a code.boolean
boolean
presentInHierarchy
(String path) Given a path, 'a.b' ( province b in country a), see if this name is present there.protected double
scoreFeature
(org.opensextant.data.Place g) A preference for features that are major places or boundaries.protected double
scoreName
(org.opensextant.data.Place g) Produce a goodness score in the range 0 to 1.0 Trivial examples of name matching:void
setChosen
(ScoredPlace geo) Unlike choose(Place), setChosen(Place) just sets the value.void
setChosenPlace
(org.opensextant.data.Place geo) void
setConfidence
(int c) Using a scale of 0 to 100, indicate how confident we are that the chosen place is best.void
setDerived
(boolean b) Mark this candidate as something that was derived by special rules and to treat it differently, e.g., in formatting output or other situations.void
setLinkedGeography
(Map<String, org.opensextant.data.Place> geography) void
setPostmatchTokens
(String[] toks) void
setPrematchTokens
(String[] toks) void
setReviewed
(boolean b) A general purpose flag "reviewed" to indicate something was reviewed and to not repeat that task on this instance.protected void
setSurroundingTokens
(String sourceBuffer) Get some sense of tokens surrounding match.void
summarize
(boolean dumpAll) If you need a full printout of the data, use summarize(true).
toString()
Methods inherited from class org.opensextant.extraction.TextMatch
compareTo, copy, defaultMatchId, getContentId, getMatchId, getTextnorm, getType, isDefault, isFilteredOut, isSame, isSameNorm, setFilteredOut, setType
Methods inherited from class org.opensextant.extraction.TextEntity
contains, copy, getContext, getContextAfter, getContextBefore, getLength, getText, isAfter, isASCII, isBefore, isLeftMatch, isLower, isMixedCase, isOverlap, isRightMatch, isSameMatch, isUpper, isWithin, isWithinChars, setContext, setContext, setTextOnly
-
Field Details
-
VAL_SAME_COUNTRY
- See Also:
-
KNOWN_GEO_SLOTS
Linked geographic slots, in no order. These help develop a fuller depiction of the context of a place mention through linked geography in these categorical slots. These are ordered roughly in resolution order, fine to coarse.
POSTAL or other association, Country vs. "Same Country": for small territories, a POSTAL code may be associated with the country at the ADM0 level, for example, if there are not many admin boundaries. So the "Country" association is tight there. "Same Country" is much looser, indicating only that a mentioned place is in a mentioned country.
Holding off: VAL_COUNTRY
-
isCountry
public boolean isCountry
Common evidence flags -- isCountry, isPerson, isOrganization, abbreviation, and acronym.
-
isContinent
public boolean isContinent -
isPerson
public boolean isPerson -
isOrganization
public boolean isOrganization -
isAbbreviation
public boolean isAbbreviation
Match types - Abbreviation/Code, Acronym or normal (unknown). From found text we can only tell from case sense and punctuation whether the intended part of speech is a normal name/text or something coded, such as an abbreviation, alphanumeric code, or acronym. For this reason "isAbbreviation" accounts for both abbreviations and codes.
-
isAcronym
public boolean isAcronym -
hasDiacritics
public boolean hasDiacritics -
SHORT_NAME_LEN
public static int SHORT_NAME_LEN -
DEFAULT_SCORE
- See Also:
-
NAME_WEIGHT
public static final double NAME_WEIGHT
- See Also:
-
FEAT_WEIGHT
public static final double FEAT_WEIGHT
- See Also:
-
LOCATION_BIAS_WEIGHT
public static final double LOCATION_BIAS_WEIGHT
- See Also:
-
tokenizer
-
ABBREVIATION_MAX_LEN
public static final int ABBREVIATION_MAX_LEN
- See Also:
-
-
Constructor Details
-
PlaceCandidate
public PlaceCandidate(int x1, int x2)
-
-
Method Details
-
getNDTextnorm
-
setText
- Overrides:
setText
in class org.opensextant.extraction.TextEntity
-
hasCJKText
public boolean hasCJKText() -
hasMiddleEasternText
public boolean hasMiddleEasternText() -
isAbbrevLength
public boolean isAbbrevLength() -
setDerived
public void setDerived(boolean b)
Mark this candidate as something that was derived by special rules and should be treated differently, e.g., in formatting output or other situations. A derivation may correct or subsume other non-derived mentions.
- Parameters:
b
-
-
isDerived
public boolean isDerived() -
markAnchor
public void markAnchor()
Mark this mention as an anchor to build from, e.g., given a postal code, expand the tag to gather the related mentions for city, province, etc., or vice versa. In such situations you want one anchor in such a tuple.
-
isAnchor
public boolean isAnchor() -
setConfidence
public void setConfidence(int c)
Using a scale of 0 to 100, indicate how confident we are that the chosen place is best. Note that this is different from the individual score assigned to each candidate place. We just need one final confidence measure for this place mention.
- Parameters:
c
-
-
getConfidence
public int getConfidence()
See setConfidence.
- Returns:
- confidence
-
choose
If caller is willing to claim an explicit choice, so be it. Otherwise unchosen places go to disambiguation.- Parameters:
geo
-
-
addRelated
Connect another match to this one, usually something co-occurring or collocated with this match.
- Parameters:
pc
-
-
getRelated
-
setSurroundingTokens
Get some sense of tokens surrounding match. Possibly optimize this by getting token list from SolrTextTagger (which provides the lang-specifics)- Parameters:
sourceBuffer
-
-
isShortName
public boolean isShortName()
Alias for "isAbbreviation || isAcronym" and a length criterion of less than PlaceCandidate.SHORT_NAME_LEN.
- Returns:
- true if name is short and likely a code or abbreviation.
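The check described for isShortName() can be sketched as follows. This is a minimal illustration, assuming the flag test and the length test are combined with a logical AND. The class name ShortNameSketch, the method parameters, and the limit of 4 are hypothetical; this page does not state the actual value of SHORT_NAME_LEN.

```java
final class ShortNameSketch {
    // Hypothetical limit; the actual PlaceCandidate.SHORT_NAME_LEN value is not given on this page.
    static int SHORT_NAME_LEN = 4;

    /** True if the text is flagged as an abbreviation or acronym and is shorter than the limit. */
    static boolean isShortName(String text, boolean isAbbreviation, boolean isAcronym) {
        return (isAbbreviation || isAcronym) && text.length() < SHORT_NAME_LEN;
    }

    public static void main(String[] args) {
        System.out.println(isShortName("LA", true, false));
        System.out.println(isShortName("Alberta", false, false));
    }
}
```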
-
getGeocoding
public org.opensextant.data.Geocoding getGeocoding()
After the candidate has been scored, the final best place is the geocoding result for the given name in context.
- Returns:
- the chosen geocoding
-
setChosenPlace
public void setChosenPlace(org.opensextant.data.Place geo) -
getChosenPlace
public org.opensextant.data.Place getChosenPlace() -
getChosen
- Returns:
-
setChosen
Unlike choose(Place), setChosen(Place) just sets the value. choose() attempts to pull the ScoredPlace from internal cache.- Parameters:
geo
-
-
getFirstChoice
- Returns:
-
choose
public void choose()
Get the most highly ranked Place, or Null if empty list. Typical usage:
    choose()      // this does work; performance cost.
    getChosen()   // this is a getter; no performance cost.
-
matchesCode
public boolean matchesCode()
To be used sparingly -- determine if a matched place for this text span is actually a code. Example: YYZ -- an airport code; Yyz -- a transliterated name. If we are not tagging coded information then short abbreviations are ignorable.
- Returns:
- True if a Geographic place for this match is actually a CODE
-
isAmbiguous
public boolean isAmbiguous()This only makes sense if you tried choose() first to sort scored places.- Returns:
- true if two choices are tied
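The relationship between choose(), getChosen(), and isAmbiguous() can be illustrated with a small stand-in class. This is a sketch under stated assumptions, not the library's implementation: ChooseSketch and Scored are hypothetical names, places are reduced to a name and a score, and a "tie" is assumed to mean that the top two scores are equal.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class ChooseSketch {
    /** Hypothetical stand-in for a scored gazetteer entry. */
    static final class Scored {
        final String name;
        final double score;
        Scored(String name, double score) { this.name = name; this.score = score; }
    }

    private final List<Scored> places = new ArrayList<>();
    private Scored chosen;
    private Scored second;

    void addPlace(String name, double score) { places.add(new Scored(name, score)); }

    /** Does the work: sorts by descending score and remembers the top two entries. */
    void choose() {
        places.sort(Comparator.comparingDouble((Scored s) -> s.score).reversed());
        chosen = places.isEmpty() ? null : places.get(0);
        second = places.size() > 1 ? places.get(1) : null;
    }

    /** Plain getter; returns null if choose() found nothing or was never called. */
    Scored getChosen() { return chosen; }

    /** Only meaningful after choose(); assumes a tie means equal top scores. */
    boolean isAmbiguous() {
        return chosen != null && second != null && chosen.score == second.score;
    }
}
```

Keeping the sort inside choose() and leaving getChosen() as a plain getter mirrors the cost split noted in the usage comment for choose().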
-
getSecondChoiceScore
public double getSecondChoiceScore()Only call after choose() operation.- Returns:
- score
-
getSecondChoice
public org.opensextant.data.Place getSecondChoice()- Returns:
- ScoredPlace, choice2
-
getPlaces
- Returns:
- all values of scored places. Not a copy
-
addPlace
- Parameters:
place
-
-
makeKey
Each place has an ID, but this candidate scoring mechanism must score distinct ID+NAME tuples, as name variations play into scoring and choosing.
- Parameters:
p
-- Returns:
-
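To see why an ID+NAME key matters, consider two name variants of one gazetteer entry. The sketch below is illustrative only: PlaceKeySketch is a hypothetical class, and the "id/name" key layout is an assumption, since this page does not show the key format that makeKey actually produces.

```java
import java.util.HashMap;
import java.util.Map;

final class PlaceKeySketch {
    /** Builds a key from place ID and name; the "id/name" layout is an assumption. */
    static String makeKey(String placeId, String name) {
        return placeId + "/" + name;
    }

    public static void main(String[] args) {
        Map<String, Double> scores = new HashMap<>();
        // Same gazetteer ID, two name variants: each variant gets its own score slot.
        scores.put(makeKey("G123", "Afghanistan"), 0.9);
        scores.put(makeKey("G123", "Afghanestan"), 0.7);
        System.out.println(scores.size());
    }
}
```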
addPlace
- Parameters:
place
-score
-
-
defaultScore
public double defaultScore(org.opensextant.data.Place g)
Given this candidate, how do you score the provided place based only on that place's properties (and not on context, document properties, or other evidence)? This 'should' produce a base score of something between 0 and 1.0, or 0..10. These scores do not necessarily need to stay in that range, as they are all relative. However, as rules fire and compare location data, it is better to stay in a known range for sanity's sake.
- Parameters:
g
-- Returns:
- objective score for the gazetteer entry
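One plausible way to keep a base score in a known range is a weighted sum of a name score and a feature score. The sketch below assumes exactly that; the page lists NAME_WEIGHT and FEAT_WEIGHT as constants but does not say how defaultScore combines them, so the formula, the class name DefaultScoreSketch, and the weight values 0.75 and 0.25 are all hypothetical.

```java
final class DefaultScoreSketch {
    // Hypothetical weights; the real NAME_WEIGHT and FEAT_WEIGHT values are not shown on this page.
    static final double NAME_WEIGHT = 0.75;
    static final double FEAT_WEIGHT = 0.25;

    /**
     * Combines a name score and a feature score, each expected in 0..1.
     * Because the assumed weights sum to 1.0, the result also stays in 0..1.
     */
    static double defaultScore(double nameScore, double featureScore) {
        return NAME_WEIGHT * nameScore + FEAT_WEIGHT * featureScore;
    }

    public static void main(String[] args) {
        System.out.println(defaultScore(1.0, 0.5));
    }
}
```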
-
scoreName
protected double scoreName(org.opensextant.data.Place g)
Produce a goodness score in the range 0 to 1.0. Trivial examples of name matching, given some patterns, 'geo' match Text:
case 1. 'Alberta' matches ALBERTA or alberta just fine.
case 2. 'La' matches LA; however, knowing "LA" is an acronym/abbreviation adds to the score of any geo that actually is "LA".
case 3. 'Afghanestan' matches Afghanistan, but decrement because it is not perfectly spelled.
- Parameters:
g
-- Returns:
- score for a given name based on all of its diacritics
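The three cases listed for scoreName can be sketched with a case-insensitive comparison, a bonus for an exact coded match, and a penalty proportional to spelling difference. This is not the library's algorithm: NameScoreSketch is a hypothetical class, the 0.8 base and 1.0 bonus values are illustrative, and Levenshtein edit distance is swapped in as one simple way to measure "not perfectly spelled".

```java
final class NameScoreSketch {
    /** Classic dynamic-programming Levenshtein edit distance. */
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] cur = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            cur[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                cur[j] = Math.min(Math.min(cur[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = cur;
            cur = tmp;
        }
        return prev[b.length()];
    }

    /** Score in 0..1; the 0.8 base and the 1.0 exact-code score are illustrative values only. */
    static double scoreName(String matchText, boolean matchIsAbbrev, String geoName) {
        if (matchText.equalsIgnoreCase(geoName)) {
            // Cases 1 and 2: a case-insensitive match, with a bonus when a coded
            // mention agrees exactly with the gazetteer name.
            boolean exactCode = matchIsAbbrev && matchText.equals(geoName);
            return exactCode ? 1.0 : 0.8;
        }
        // Case 3: a near match, decremented in proportion to the spelling difference.
        String a = matchText.toLowerCase();
        String b = geoName.toLowerCase();
        int maxLen = Math.max(a.length(), b.length());
        return 0.8 * (1.0 - (double) editDistance(a, b) / maxLen);
    }
}
```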
-
scoreFeature
protected double scoreFeature(org.opensextant.data.Place g) A preference for features that are major places or boundaries. This yields a feature score on a 0 to 1.0 point scale.- Parameters:
g
-- Returns:
- feature score
-
incrementPlaceScore
Consolidate attaching Rules to this name when also scoring candidate locations. This operation says a given Place deserves a certain increment in score for a certain reason.- Parameters:
place
-score
-rule
-
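The idea behind incrementPlaceScore, adding to a place's score while recording the rule that justified it, can be sketched as below. IncrementScoreSketch is a hypothetical class; places are identified by a plain string key rather than a Place object, and the storage layout is an assumption.

```java
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

final class IncrementScoreSketch {
    private final Map<String, Double> scores = new HashMap<>(); // keyed by a hypothetical place key
    private final Set<String> rules = new LinkedHashSet<>();    // rules seen, in firing order

    /** Adds the increment to the place's running score and records which rule caused it. */
    void incrementPlaceScore(String placeKey, double increment, String rule) {
        scores.merge(placeKey, increment, Double::sum);
        rules.add(rule);
    }

    double scoreOf(String placeKey) { return scores.getOrDefault(placeKey, 0.0); }

    boolean hasRule(String rule) { return rules.contains(rule); }
}
```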
-
getRules
- Returns:
- all rules
-
hasRule
- Parameters:
rule
-- Returns:
- true if candidate has seen this rule already
-
addRule
- Parameters:
rule
-
-
getEvidenceID
- Parameters:
ev
- evidence- Returns:
- internal ID for evidence (rule + location)
-
addEvidence
- Parameters:
ev
- evidence object
-
addEvidence
- Parameters:
rule
-weight
-ev
-
-
addCountryEvidence
public void addCountryEvidence(String rule, double weight, String cc, org.opensextant.data.Place geo) Add country evidence and increment score immediately.- Parameters:
rule
-weight
-cc
-geo
-
-
addAdmin1Evidence
- Parameters:
rule
-weight
-adm1
-cc
-
-
addFeatureClassEvidence
- Parameters:
rule
-weight
-fclass
-
-
addFeatureCodeEvidence
- Parameters:
rule
-weight
-fcode
-
-
addGeocoordEvidence
public void addGeocoordEvidence(String rule, double weight, org.opensextant.data.LatLon coord, org.opensextant.data.Place geo, double proximityScore) Add evidence and increment score immediately.- Parameters:
rule
-weight
-coord
-geo
-proximityScore
-
-
getEvidence
- Returns:
- the current evidence
-
hasPlaces
public boolean hasPlaces()- Returns:
- true if candidate has any associated potential locations
-
toString
- Overrides:
toString
in class org.opensextant.extraction.TextMatch
- Returns:
- string representation of candidate
-
summarize
If you need a full printout of the data, use summarize(true).
- Parameters:
dumpAll
-- Returns:
- summary of evidence, rules and chosen location
-
getPrematchTokens
- Returns:
- the preceding tokens
-
setPrematchTokens
- Parameters:
toks
- set preceding tokens
-
getPostmatchTokens
- Returns:
- tokens following name span
-
setPostmatchTokens
- Parameters:
toks
- set following tokens
-
getSurroundingText
-
presentInHierarchy
Given a path, 'a.b' (province b in country a), see if this name is present there.
- Parameters:
path
-- Returns:
- true if given path is represented by candidates' potential locations
-
presentInCountry
- Parameters:
cc
- country code- Returns:
- true if candidate has potential locations for the given country code.
-
distinctCountryCount
public int distinctCountryCount()
How many different countries contain this name?
- Returns:
- count of distinct country codes inferred
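The hierarchy and country checks (presentInHierarchy, presentInCountry, distinctCountryCount) can be sketched over a list of candidate locations. The class HierarchySketch and its nested Loc type are hypothetical stand-ins for the candidate's scored places, and representing the path as "countryCode.adm1Code" is an assumption based on the 'a.b' description.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class HierarchySketch {
    /** Hypothetical stand-in for a candidate location with country and ADM1 codes. */
    static final class Loc {
        final String cc;
        final String adm1;
        Loc(String cc, String adm1) { this.cc = cc; this.adm1 = adm1; }
    }

    /** True if any location sits at the path "cc.adm1", e.g. "US.TX". */
    static boolean presentInHierarchy(List<Loc> places, String path) {
        for (Loc p : places) {
            if ((p.cc + "." + p.adm1).equals(path)) return true;
        }
        return false;
    }

    /** True if any location is in the given country. */
    static boolean presentInCountry(List<Loc> places, String cc) {
        for (Loc p : places) {
            if (p.cc.equals(cc)) return true;
        }
        return false;
    }

    /** Number of distinct country codes among the locations. */
    static int distinctCountryCount(List<Loc> places) {
        Set<String> countries = new HashSet<>();
        for (Loc p : places) countries.add(p.cc);
        return countries.size();
    }
}
```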
-
distinctLocationCount
public int distinctLocationCount()- Returns:
- distinct locations by ID, not by geodetic location
-
markValid
public void markValid()Mark candidate as valid to protect it from being filtered out by downstream rules. -
isValid
public boolean isValid()
True if candidate was marked as valid. If valid, the candidate avoids filters.
- Returns:
- true if rules have marked this candidate valid
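How markValid() protects a candidate from later filtering can be sketched with two flags. ValidFlagSketch and its applyFilter method are hypothetical; the real downstream filter rules are not described on this page.

```java
final class ValidFlagSketch {
    private boolean valid = false;
    private boolean filteredOut = false;

    void markValid() { valid = true; }

    boolean isValid() { return valid; }

    boolean isFilteredOut() { return filteredOut; }

    /** Hypothetical downstream filter: drops the candidate unless it was marked valid. */
    void applyFilter(boolean ruleWantsToDrop) {
        if (ruleWantsToDrop && !valid) {
            filteredOut = true;
        }
    }
}
```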
-
hasEvidence
public boolean hasEvidence()- Returns:
- true if candidate has any evidence.
-
getWordCount
public int getWordCount()
A basic count of grams delimited by whitespace and punctuation. Set ONLY after inferTextSense() is invoked.
- Returns:
- token word count
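A whitespace and punctuation delimited count can be sketched with a regular expression. The delimiter pattern below is an assumption; the class has a static tokenizer Pattern, but its actual regex is not shown on this page. WordCountSketch is a hypothetical name.

```java
import java.util.regex.Pattern;

final class WordCountSketch {
    // Hypothetical delimiter pattern: runs of whitespace and ASCII punctuation.
    static final Pattern TOKENIZER = Pattern.compile("[\\s\\p{Punct}]+");

    /** Splits text into tokens, dropping empty tokens at either end. */
    static String[] tokens(String text) {
        String trimmed = TOKENIZER.matcher(text).replaceAll(" ").trim();
        return trimmed.isEmpty() ? new String[0] : trimmed.split(" ");
    }

    static int wordCount(String text) {
        return tokens(text).length;
    }
}
```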
-
inferTextSense
public void inferTextSense(boolean contextisLower, boolean contextisUpper)
Text heuristics.
- Parameters:
contextisLower
- True if text around mention is mainly lowercase
contextisUpper
- True if text around mention is mainly uppercase
-
getTokens
Tokens in word. Only after inferTextSense() is invoked.- Returns:
-
getLinkedGeography
Get the collection of geographic slots geolocated. E.g., for a "Town Hall" building location you might link the Place object representing the "city" slot.- Returns:
-
setLinkedGeography
-
linkGeography
Forcibly link geography to the given slot.
- Parameters:
otherMention
-slot
-geo
-- See Also:
-
linkGeography
-
hasLinkedGeography
-
linkGeography
Link a geographic mention from another part of the document. E.g., for a "Town Hall" building location you might link the PlaceCandidate mention object representing the "city" slot. Method added to support PostalGeocoder. TBD.
- Parameters:
otherMention
-slot
-featPrefix
-- Returns:
- True if any link was made or already existed.
-
setReviewed
public void setReviewed(boolean b) A general purpose flag "reviewed" to indicate something was reviewed and to not repeat that task on this instance.- Parameters:
b
-
-
isReviewed
public boolean isReviewed() -
hasPostal
public boolean hasPostal()Evaluate if postal matches reside in candidate locations. Evaluate only once and save result. We distinguish between "hasPostal" matches vs. marking this place as "is Postal". That's the difference between factual and inferential.- Returns:
- true if postal features exist here.
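The "evaluate only once and save result" behavior described for hasPostal() is a lazy, cached check. The sketch below shows that pattern only; PostalCheckSketch is a hypothetical class, locations are reduced to feature-code strings, and treating codes that start with "POST" as postal is an assumption made for illustration.

```java
import java.util.List;

final class PostalCheckSketch {
    private final List<String> featureCodes; // hypothetical: feature codes of candidate locations
    private Boolean hasPostal = null;        // null until the first evaluation
    int evaluations = 0;                     // counts scans, for demonstration only

    PostalCheckSketch(List<String> featureCodes) {
        this.featureCodes = featureCodes;
    }

    /** Scans the candidate locations on the first call, then returns the saved answer. */
    boolean hasPostal() {
        if (hasPostal == null) {
            evaluations++;
            boolean found = false;
            for (String code : featureCodes) {
                if (code.startsWith("POST")) { // assumed marker for postal features
                    found = true;
                    break;
                }
            }
            hasPostal = found;
        }
        return hasPostal;
    }
}
```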
-