Class PlaceCandidate

java.lang.Object
org.opensextant.extraction.TextEntity
org.opensextant.extraction.TextMatch
org.opensextant.extractors.geo.PlaceCandidate
All Implemented Interfaces:
Comparable<org.opensextant.extraction.TextMatch>, org.opensextant.data.MatchSchema

public class PlaceCandidate extends org.opensextant.extraction.TextMatch
A PlaceCandidate represents a portion of a document that has been identified as a possible named geographic location. It collects the information from the document (the evidence) as well as the possible geographic locations it could represent (the Places). It also holds the result of the final decision, bestPlace: of all the places with the same or similar names, which place is it?
Author:
ubaldino, dlutz, based on OpenSextant Toolbox
  • Field Details

    • VAL_SAME_COUNTRY

      public static final String VAL_SAME_COUNTRY
    • KNOWN_GEO_SLOTS

      public static final String[] KNOWN_GEO_SLOTS
      Linked geographic slots, in no strict order. These help develop a fuller depiction of the context of a place mention through linked geography in these categorical slots. They are listed roughly in resolution order, fine to coarse. POSTAL or other association: Country vs. "Same Country". For small territories, for example, a POSTAL code may be associated with the country at ADM0 level if there are not many admin boundaries, so the "Country" association is tight there. "Same Country" is much looser, indicating only that a mentioned place is in a mentioned country. Holding off: VAL_COUNTRY
    • isCountry

      public boolean isCountry
      Common evidence flags -- isCountry, isPerson, isOrganization, abbreviation, and acronym.
    • isContinent

      public boolean isContinent
    • isPerson

      public boolean isPerson
    • isOrganization

      public boolean isOrganization
    • isAbbreviation

      public boolean isAbbreviation
      Match types: abbreviation/code, acronym, or normal (unknown). From the found text alone, only case and punctuation indicate whether the intended sense is a normal name or something coded, such as an abbreviation, alphanumeric code, or acronym. For this reason "isAbbreviation" accounts for both abbreviations and codes.
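      The case-and-punctuation reasoning above can be illustrated with a minimal sketch. The class TextSenseSketch and its three rules are assumptions made for illustration only; they are not the heuristics PlaceCandidate actually applies.

```java
final class TextSenseSketch {
    /** True if the text contains letters and all of them are uppercase, e.g. "YYZ". */
    static boolean isUpper(String text) {
        boolean sawLetter = false;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isLetter(c)) {
                if (!Character.isUpperCase(c)) {
                    return false;
                }
                sawLetter = true;
            }
        }
        return sawLetter;
    }

    /** Assumed rule: a trailing period or any digit suggests an abbreviation or code. */
    static boolean looksLikeAbbreviation(String text) {
        if (text.endsWith(".")) {
            return true;
        }
        for (int i = 0; i < text.length(); i++) {
            if (Character.isDigit(text.charAt(i))) {
                return true;
            }
        }
        return false;
    }

    /** Assumed rule: uppercase text is only telling when the context is not all uppercase. */
    static boolean looksLikeAcronym(String text, boolean contextIsUpper) {
        return !contextIsUpper && isUpper(text);
    }

    public static void main(String[] args) {
        System.out.println(looksLikeAbbreviation("Calif."));  // true
        System.out.println(looksLikeAcronym("USA", false));    // true
        System.out.println(looksLikeAcronym("USA", true));     // false
    }
}
```

      In an all-uppercase document, case carries no signal, which is why this sketch takes the context flag as input, much as inferTextSense(boolean, boolean) does.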
    • isAcronym

      public boolean isAcronym
    • hasDiacritics

      public boolean hasDiacritics
    • SHORT_NAME_LEN

      public static int SHORT_NAME_LEN
    • DEFAULT_SCORE

      public static final String DEFAULT_SCORE
    • NAME_WEIGHT

      public static final double NAME_WEIGHT
    • FEAT_WEIGHT

      public static final double FEAT_WEIGHT
    • LOCATION_BIAS_WEIGHT

      public static final double LOCATION_BIAS_WEIGHT
    • tokenizer

      public static final Pattern tokenizer
    • ABBREVIATION_MAX_LEN

      public static final int ABBREVIATION_MAX_LEN
  • Constructor Details

    • PlaceCandidate

      public PlaceCandidate(int x1, int x2)
  • Method Details

    • getNDTextnorm

      public String getNDTextnorm()
    • setText

      public void setText(String name)
      Overrides:
      setText in class org.opensextant.extraction.TextEntity
    • hasCJKText

      public boolean hasCJKText()
    • hasMiddleEasternText

      public boolean hasMiddleEasternText()
    • isAbbrevLength

      public boolean isAbbrevLength()
    • setDerived

      public void setDerived(boolean b)
      Mark this candidate as something that was derived by special rules and to treat it differently, e.g., in formatting output or other situations. A derivation may correct or subsume other non-derived mentions.
      Parameters:
      b -
    • isDerived

      public boolean isDerived()
    • markAnchor

      public void markAnchor()
      Mark this mention as an anchor to build from, e.g., given a postal code, expand the tag to gather the related mentions for city, province, etc., or vice versa. In such situations you want one anchor in such a tuple.
    • isAnchor

      public boolean isAnchor()
    • setConfidence

      public void setConfidence(int c)
      Using a scale of 0 to 100, indicate how confident we are that the chosen place is best. Note this is different than the individual score assigned to each candidate place. We just need one final confidence measure for this place mention.
      Parameters:
      c -
    • getConfidence

      public int getConfidence()
      See setConfidence(int).
      Returns:
      confidence
    • choose

      public void choose(ScoredPlace geo)
      If caller is willing to claim an explicit choice, so be it. Otherwise unchosen places go to disambiguation.
      Parameters:
      geo -
    • addRelated

      public void addRelated(PlaceCandidate pc)
      Connect another match to this one, usually something co-occurring or collocated with this match.
      Parameters:
      pc -
    • getRelated

      public Collection<PlaceCandidate> getRelated()
    • setSurroundingTokens

      protected void setSurroundingTokens(String sourceBuffer)
      Get some sense of the tokens surrounding the match. This could possibly be optimized by getting the token list from SolrTextTagger (which provides the language specifics).
      Parameters:
      sourceBuffer -
    • isShortName

      public boolean isShortName()
      Alias for "isAbbreviation || isAcronym" and a length criterion of less than PlaceCandidate.SHORT_NAME_LEN.
      Returns:
      true if name is short and likely a code or abbreviation.
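      A minimal sketch of this check follows, reading "and" as a logical conjunction. The threshold value 5 is an assumption; the actual SHORT_NAME_LEN value does not appear on this page.

```java
final class ShortNameSketch {
    /** Hypothetical threshold; the real SHORT_NAME_LEN value is not given here. */
    static final int SHORT_NAME_LEN = 5;

    /** Short name: flagged as abbreviation or acronym, and shorter than the threshold. */
    static boolean isShortName(String text, boolean isAbbreviation, boolean isAcronym) {
        return (isAbbreviation || isAcronym) && text.length() < SHORT_NAME_LEN;
    }
}
```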
    • getGeocoding

      public org.opensextant.data.Geocoding getGeocoding()
      After the candidate has been scored, the final best place is the geocoding result for the given name in context.
      Returns:
      the chosen geocoding
    • setChosenPlace

      public void setChosenPlace(org.opensextant.data.Place geo)
    • getChosenPlace

      public org.opensextant.data.Place getChosenPlace()
    • getChosen

      public ScoredPlace getChosen()
      Returns:
    • setChosen

      public void setChosen(ScoredPlace geo)
      Unlike choose(ScoredPlace), setChosen(ScoredPlace) just sets the value. choose(ScoredPlace) attempts to pull the ScoredPlace from the internal cache.
      Parameters:
      geo -
    • getFirstChoice

      public ScoredPlace getFirstChoice()
      Returns:
    • choose

      public void choose()
      Choose the most highly ranked Place; the choice is null if the list is empty. Typical usage:
           choose()     // this does the work; it has a performance cost
           getChosen()  // this is a getter; no performance cost
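      The choose-then-get pattern can be sketched as follows. ChooseSketch and its Scored type are stand-ins invented for this illustration, not the real PlaceCandidate or ScoredPlace, and the sorting strategy is an assumption.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class ChooseSketch {
    /** Stand-in for a scored gazetteer entry; not the real ScoredPlace. */
    static final class Scored {
        final String id;
        final double score;

        Scored(String id, double score) {
            this.id = id;
            this.score = score;
        }
    }

    private final List<Scored> places = new ArrayList<>();
    private Scored chosen;

    void add(String id, double score) {
        places.add(new Scored(id, score));
    }

    /** Does the ranking work: sorts by descending score and remembers the top entry. */
    void choose() {
        places.sort(Comparator.comparingDouble((Scored s) -> s.score).reversed());
        chosen = places.isEmpty() ? null : places.get(0);
    }

    /** Plain getter; returns whatever the last choose() call selected. */
    Scored getChosen() {
        return chosen;
    }
}
```

      Keeping the ranking in choose() and the lookup in getChosen() lets callers pay the sorting cost once and read the result as often as needed.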
    • matchesCode

      public boolean matchesCode()
      To be used sparingly -- determine if a matched place for this text span is actually a code. Example:
           YYZ  -- an airport code
           Yyz  -- transliterated name.
           If we are not tagging coded information then short abbreviations are ignorable.
       
      Returns:
      True if a Geographic place for this match is actually a CODE
    • isAmbiguous

      public boolean isAmbiguous()
      This only makes sense if you tried choose() first to sort scored places.
      Returns:
      true if two choices are tied
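      As a sketch of what "tied" could mean, the following compares the two highest scores. The tolerance value and the AmbiguitySketch class are assumptions; unlike the real method, this version sorts a copy of the scores itself instead of relying on a prior choose() call.

```java
import java.util.Arrays;

final class AmbiguitySketch {
    /** Hypothetical tolerance for treating two scores as equal. */
    static final double EPSILON = 1e-9;

    /** True if the best and second-best scores are equal within EPSILON. */
    static boolean isAmbiguous(double[] scores) {
        if (scores.length < 2) {
            return false;
        }
        double[] sorted = scores.clone();
        Arrays.sort(sorted); // ascending, so the best score is last
        double first = sorted[sorted.length - 1];
        double second = sorted[sorted.length - 2];
        return Math.abs(first - second) < EPSILON;
    }
}
```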
    • getSecondChoiceScore

      public double getSecondChoiceScore()
      Only call after choose() operation.
      Returns:
      score
    • getSecondChoice

      public org.opensextant.data.Place getSecondChoice()
      Returns:
      ScoredPlace, choice2
    • getPlaces

      public Collection<ScoredPlace> getPlaces()
      Returns:
      all values of scored places. Not a copy
    • addPlace

      public void addPlace(ScoredPlace place)
      Parameters:
      place -
    • makeKey

      public String makeKey(org.opensextant.data.Place p)
      Each place has an ID, but this candidate scoring mechanism must score distinct ID+NAME tuples, because name variants play into scoring and choosing.
      Parameters:
      p -
      Returns:
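      To illustrate why the key combines ID and name, here is a minimal sketch. The "ID/name" format, the place ID "G123", and the scores are all hypothetical; the real key format is not documented on this page.

```java
import java.util.HashMap;
import java.util.Map;

final class PlaceKeySketch {
    /** Assumed key format: place ID and name joined by a slash. */
    static String makeKey(String placeId, String name) {
        return placeId + "/" + name;
    }

    public static void main(String[] args) {
        // One gazetteer ID with two name variants yields two separately scored entries.
        Map<String, Double> scores = new HashMap<>();
        scores.put(makeKey("G123", "Afghanistan"), 1.0);
        scores.put(makeKey("G123", "Afghanestan"), 0.8);
        System.out.println(scores.size()); // 2
    }
}
```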
    • addPlace

      public void addPlace(ScoredPlace place, Double score)
      Parameters:
      place -
      score -
    • defaultScore

      public double defaultScore(org.opensextant.data.Place g)
      Given this candidate, how do you score the provided place based only on the place's own properties (and not on context, document properties, or other evidence)? This should produce a base score between 0 and 1.0, or between 0 and 10. Scores do not strictly need to stay in that range, as they are all relative. However, as rules fire and compare location data, it is better to stay in a known range for sanity's sake.
      Parameters:
      g -
      Returns:
      objective score for the gazetteer entry
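      One way such a base score could stay in a known range is a weighted sum of component scores that each lie between 0 and 1.0. The sketch below is an assumption about the shape of the computation: the weight values are hypothetical (the real NAME_WEIGHT, FEAT_WEIGHT, and LOCATION_BIAS_WEIGHT values are not shown on this page) and the real method may combine its inputs differently.

```java
final class DefaultScoreSketch {
    // Hypothetical weights chosen to sum to 1.0 so the result stays in 0..1.0.
    static final double NAME_WEIGHT = 0.5;
    static final double FEAT_WEIGHT = 0.25;
    static final double LOCATION_BIAS_WEIGHT = 0.25;

    /** Weighted sum of component scores, each expected in the range 0 to 1.0. */
    static double defaultScore(double nameScore, double featureScore, double locationBias) {
        return NAME_WEIGHT * nameScore
                + FEAT_WEIGHT * featureScore
                + LOCATION_BIAS_WEIGHT * locationBias;
    }
}
```

      Because the assumed weights sum to 1.0, inputs in the range 0 to 1.0 always yield a result in the same range, which keeps later rule-based increments easy to compare.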
    • scoreName

      protected double scoreName(org.opensextant.data.Place g)
      Produce a goodness score in the range 0 to 1.0. Trivial examples of name matching:
        given some patterns, 'geo' match Text
      
         case 1. 'Alberta' matches ALBERTA or alberta just fine.
         case 2. 'La' matches LA, however, knowing "LA" is an acronym/abbreviation
             adds to the score of any geo that actually is "LA"
         case 3. 'Afghanestan' matches Afghanistan, but decrement because it is not perfectly spelled.
       
      Parameters:
      g -
      Returns:
      score for a given name based on all of its diacritics
    • scoreFeature

      protected double scoreFeature(org.opensextant.data.Place g)
      A preference for features that are major places or boundaries. This yields a feature score on a 0 to 1.0 point scale.
      Parameters:
      g -
      Returns:
      feature score
    • incrementPlaceScore

      public void incrementPlaceScore(org.opensextant.data.Place place, Double score, String rule)
      Consolidate attaching Rules to this name when also scoring candidate locations. This operation says a given Place deserves a certain increment in score for a certain reason.
      Parameters:
      place -
      score -
      rule -
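      The pairing of a score increment with the rule that justified it can be sketched like this. RuleScoringSketch, the string keys, and the rule names are assumptions for illustration; the real class works with Place objects.

```java
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

final class RuleScoringSketch {
    private final Map<String, Double> scores = new HashMap<>();
    private final Set<String> rules = new LinkedHashSet<>();

    /** Records the rule on the candidate and adds the increment to the place's score. */
    void incrementPlaceScore(String placeKey, double increment, String rule) {
        rules.add(rule);
        scores.merge(placeKey, increment, Double::sum);
    }

    boolean hasRule(String rule) {
        return rules.contains(rule);
    }

    /** Current accumulated score, or 0.0 if the place was never scored. */
    double scoreOf(String placeKey) {
        return scores.getOrDefault(placeKey, 0.0);
    }
}
```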
    • getRules

      public Collection<String> getRules()
      Returns:
      all rules
    • hasRule

      public boolean hasRule(String rule)
      Parameters:
      rule -
      Returns:
      true if candidate has seen this rule already
    • addRule

      public void addRule(String rule)
      Parameters:
      rule -
    • getEvidenceID

      protected static String getEvidenceID(PlaceEvidence ev)
      Parameters:
      ev - evidence
      Returns:
      internal ID for evidence (rule + location)
    • addEvidence

      public void addEvidence(PlaceEvidence ev)
      Parameters:
      ev - evidence object
    • addEvidence

      public void addEvidence(String rule, double weight, org.opensextant.data.Place ev)
      Parameters:
      rule -
      weight -
      ev -
    • addCountryEvidence

      public void addCountryEvidence(String rule, double weight, String cc, org.opensextant.data.Place geo)
      Add country evidence and increment score immediately.
      Parameters:
      rule -
      weight -
      cc -
      geo -
    • addAdmin1Evidence

      public void addAdmin1Evidence(String rule, double weight, String adm1, String cc)
      Parameters:
      rule -
      weight -
      adm1 -
      cc -
    • addFeatureClassEvidence

      public void addFeatureClassEvidence(String rule, double weight, String fclass)
      Parameters:
      rule -
      weight -
      fclass -
    • addFeatureCodeEvidence

      public void addFeatureCodeEvidence(String rule, double weight, String fcode)
      Parameters:
      rule -
      weight -
      fcode -
    • addGeocoordEvidence

      public void addGeocoordEvidence(String rule, double weight, org.opensextant.data.LatLon coord, org.opensextant.data.Place geo, double proximityScore)
      Add evidence and increment score immediately.
      Parameters:
      rule -
      weight -
      coord -
      geo -
      proximityScore -
    • getEvidence

      public Collection<PlaceEvidence> getEvidence()
      Returns:
      the current evidence
    • hasPlaces

      public boolean hasPlaces()
      Returns:
      true if candidate has any associated potential locations
    • toString

      public String toString()
      Overrides:
      toString in class org.opensextant.extraction.TextMatch
      Returns:
      string representation of candidate
    • summarize

      public String summarize(boolean dumpAll)
      If you need a full printout of the data, use summarize(true).
      Parameters:
      dumpAll -
      Returns:
      summary of evidence, rules and chosen location
    • getPrematchTokens

      public String[] getPrematchTokens()
      Returns:
      the preceding tokens
    • setPrematchTokens

      public void setPrematchTokens(String[] toks)
      Parameters:
      toks - set preceding tokens
    • getPostmatchTokens

      public String[] getPostmatchTokens()
      Returns:
      tokens following name span
    • setPostmatchTokens

      public void setPostmatchTokens(String[] toks)
      Parameters:
      toks - set following tokens
    • getSurroundingText

      public String getSurroundingText()
    • presentInHierarchy

      public boolean presentInHierarchy(String path)
      Given a path, 'a.b' (province b in country a), see if this name is present there.
      Parameters:
      path -
      Returns:
      true if given path is represented by candidates' potential locations
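      A minimal sketch of the path check follows. The Loc stand-in, the example codes, and the exact "country.province" string comparison are assumptions; the real method operates on the candidate's scored places.

```java
import java.util.List;

final class HierarchySketch {
    /** Stand-in for a candidate location with a country code and a province code. */
    static final class Loc {
        final String cc;
        final String adm1;

        Loc(String cc, String adm1) {
            this.cc = cc;
            this.adm1 = adm1;
        }
    }

    /** Builds the 'a.b' path: country a, province b. */
    static String pathOf(Loc loc) {
        return loc.cc + "." + loc.adm1;
    }

    /** True if any candidate location sits at the given path. */
    static boolean presentInHierarchy(List<Loc> locations, String path) {
        for (Loc loc : locations) {
            if (pathOf(loc).equals(path)) {
                return true;
            }
        }
        return false;
    }
}
```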
    • presentInCountry

      public boolean presentInCountry(String cc)
      Parameters:
      cc - country code
      Returns:
      true if candidate has potential locations for the given country code.
    • distinctCountryCount

      public int distinctCountryCount()
      How many different countries contain this name?
      Returns:
      count of distinct country codes inferred
    • distinctLocationCount

      public int distinctLocationCount()
      Returns:
      distinct locations by ID, not by geodetic location
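      The difference between counting by country, by ID, and by coordinates can be sketched as below. The Loc stand-in and its field names are hypothetical; the point is only that two entries sharing a coordinate still count as two locations when their IDs differ.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class DistinctCountSketch {
    /** Stand-in for a candidate location; not the real Place class. */
    static final class Loc {
        final String id;
        final String cc;
        final double lat;
        final double lon;

        Loc(String id, String cc, double lat, double lon) {
            this.id = id;
            this.cc = cc;
            this.lat = lat;
            this.lon = lon;
        }
    }

    /** Counts distinct country codes among the candidate locations. */
    static int distinctCountryCount(List<Loc> locations) {
        Set<String> codes = new HashSet<>();
        for (Loc loc : locations) {
            codes.add(loc.cc);
        }
        return codes.size();
    }

    /** Counts distinct IDs; coordinates are deliberately ignored. */
    static int distinctLocationCount(List<Loc> locations) {
        Set<String> ids = new HashSet<>();
        for (Loc loc : locations) {
            ids.add(loc.id);
        }
        return ids.size();
    }
}
```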
    • markValid

      public void markValid()
      Mark candidate as valid to protect it from being filtered out by downstream rules.
    • isValid

      public boolean isValid()
      Reports whether the candidate was marked as valid. If valid, avoid filters.
      Returns:
      true if rules have marked this candidate valid
    • hasEvidence

      public boolean hasEvidence()
      Returns:
      true if candidate has any evidence.
    • getWordCount

      public int getWordCount()
      A basic whitespace- and punctuation-delimited count of grams. Set ONLY after inferTextSense() is invoked.
      Returns:
      token word count
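      A whitespace- and punctuation-delimited count can be sketched with a regular expression. The pattern below is an assumption; the actual expression held in the tokenizer field is not shown on this page.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

final class WordCountSketch {
    /** Hypothetical delimiter pattern: runs of whitespace or ASCII punctuation. */
    static final Pattern TOKENIZER = Pattern.compile("[\\s\\p{Punct}]+");

    /** Splits on the delimiter pattern and drops empty fragments. */
    static String[] tokens(String text) {
        List<String> out = new ArrayList<>();
        for (String token : TOKENIZER.split(text)) {
            if (!token.isEmpty()) {
                out.add(token);
            }
        }
        return out.toArray(new String[0]);
    }

    static int wordCount(String text) {
        return tokens(text).length;
    }

    public static void main(String[] args) {
        System.out.println(wordCount("Rio de Janeiro")); // 3
        System.out.println(wordCount("St. Louis"));      // 2
    }
}
```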
    • inferTextSense

      public void inferTextSense(boolean contextisLower, boolean contextisUpper)
      Text heuristics.
      Parameters:
      contextisLower - True if text around mention is mainly lowercase
      contextisUpper - True if text around mention is mainly uppercase
    • getTokens

      public String[] getTokens()
      Tokens in word. Set only after inferTextSense() is invoked.
      Returns:
    • getLinkedGeography

      public Map<String,org.opensextant.data.Place> getLinkedGeography()
      Get the collection of geographic slots geolocated. E.g., for a "Town Hall" building location you might link the Place object representing the "city" slot.
      Returns:
    • setLinkedGeography

      public void setLinkedGeography(Map<String,org.opensextant.data.Place> geography)
    • linkGeography

      public void linkGeography(PlaceCandidate otherMention, String slot, org.opensextant.data.Place geo)
      Forcibly link geography to the given slot.
      Parameters:
      otherMention -
      slot -
      geo -
    • linkGeography

      public void linkGeography(String slot, org.opensextant.data.Place geo)
    • hasLinkedGeography

      public boolean hasLinkedGeography(String slot)
    • linkGeography

      public boolean linkGeography(PlaceCandidate otherMention, String slot, String featPrefix)
      Link geographic mention from other part of the document. E.g., for a "Town Hall" building location you might link the PlaceCandidate mention object representing the "city" slot.

      Method added to support PostalGeocoder. TBD.

      Parameters:
      otherMention -
      slot -
      featPrefix -
      Returns:
      True if any link was made or already existed.
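      The return contract, true when a link is made or already exists, can be sketched with a slot-to-place map. LinkedGeoSketch, the slot name "city", the place IDs, and the keep-the-first-link policy are assumptions for illustration; the real method also filters by feature prefix, which this sketch omits.

```java
import java.util.HashMap;
import java.util.Map;

final class LinkedGeoSketch {
    /** Slot name mapped to a place identifier (a stand-in for a Place object). */
    private final Map<String, String> linked = new HashMap<>();

    /** Returns true if the slot is already linked or a new link is made. */
    boolean link(String slot, String placeId) {
        if (linked.containsKey(slot)) {
            return true; // a link already existed; keep it
        }
        if (placeId == null) {
            return false; // nothing to link
        }
        linked.put(slot, placeId);
        return true;
    }

    boolean hasLinkedGeography(String slot) {
        return linked.containsKey(slot);
    }
}
```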
    • setReviewed

      public void setReviewed(boolean b)
      A general purpose flag "reviewed" to indicate something was reviewed and to not repeat that task on this instance.
      Parameters:
      b -
    • isReviewed

      public boolean isReviewed()
    • hasPostal

      public boolean hasPostal()
      Evaluate if postal matches reside in candidate locations. Evaluate only once and save result. We distinguish between "hasPostal" matches vs. marking this place as "is Postal". That's the difference between factual and inferential.
      Returns:
      true if postal features exist here.
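      The "evaluate only once and save result" behavior can be sketched as a lazily cached check. The "POST" feature-code prefix and the PostalCheckSketch class are assumptions made for this illustration, as is the evaluations counter, which exists only to make the caching visible.

```java
import java.util.List;

final class PostalCheckSketch {
    private final List<String> featureCodes;
    private Boolean hasPostal = null; // null means not evaluated yet
    int evaluations = 0;              // illustration only: counts real scans

    PostalCheckSketch(List<String> featureCodes) {
        this.featureCodes = featureCodes;
    }

    /** Scans the candidate locations on the first call and reuses the saved answer afterwards. */
    boolean hasPostal() {
        if (hasPostal == null) {
            evaluations++;
            boolean found = false;
            for (String code : featureCodes) {
                if (code.startsWith("POST")) { // hypothetical postal feature prefix
                    found = true;
                    break;
                }
            }
            hasPostal = found;
        }
        return hasPostal;
    }
}
```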