Class PostalGeocoder

java.lang.Object
org.opensextant.extractors.geo.PostalGeocoder
All Implemented Interfaces:
org.opensextant.data.MatchSchema, org.opensextant.extraction.Extractor, BoundaryObserver, CountryObserver

public class PostalGeocoder extends Object implements org.opensextant.data.MatchSchema, org.opensextant.extraction.Extractor, BoundaryObserver, CountryObserver
PostalGeocoder -- a GazetteerMatcher that uses the "postal" Solr index to quickly tag known postal codes worldwide. Postal codes are typically 4 to 7 alphanumeric characters, possibly with a space or punctuation. Using Geonames.org data, we have identified 4 million unique patterns for COUNTRY + CODE tuples.

For example, the postal code "11111" appearing in two different countries is treated as two distinct codes: we assume a postal code is unique within a country, but the same code may occur in more than one country.
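
The uniqueness assumption above can be sketched with JDK collections alone: entries are keyed by a COUNTRY + CODE tuple, so the same digits under two country codes remain separate entries. The class `PostalKeyDemo`, the `|` separator, the placeholder country codes `XA` and `XB`, and the labels are all hypothetical and are not part of the Xponents API or its index layout.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical illustration; not part of the Xponents API.
final class PostalKeyDemo {
    private final Map<String, String> index = new HashMap<>();

    // Build a COUNTRY + CODE key, normalizing case so "xa" and "XA" agree.
    static String key(String countryCode, String postalCode) {
        return countryCode.toUpperCase(Locale.ROOT) + "|" + postalCode.toUpperCase(Locale.ROOT);
    }

    void add(String countryCode, String postalCode, String label) {
        index.put(key(countryCode, postalCode), label);
    }

    // Returns the label for this country and code, or null if unknown.
    String lookup(String countryCode, String postalCode) {
        return index.get(key(countryCode, postalCode));
    }

    int size() {
        return index.size();
    }

    public static void main(String[] args) {
        PostalKeyDemo demo = new PostalKeyDemo();
        demo.add("XA", "11111", "place in XA"); // placeholder country codes
        demo.add("XB", "11111", "place in XB");
        System.out.println(demo.size());
        System.out.println(demo.lookup("xb", "11111"));
    }
}
```

A lookup therefore always needs a country context; a bare code by itself is ambiguous across countries.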

Xponents Methodology:

- "Rules" are added to PlaceCandidates to inform caller of basic lexical rules fired
- "PlaceEvidence" is NOT used to score Places, because there is very little geographic association across tags
- Confidence is assigned to a PlaceCandidate only based on complexity of the match

Returned "TextMatch" tags are marked as filtered_out for SHORT or YEAR codes. Returned "TextMatch" tags may or may not have a location selected.
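
As a rough illustration of SHORT and YEAR filtering, the sketch below flags codes that are very short or that look like a calendar year. The source does not state the actual thresholds, so `MIN_LENGTH`, the year window, and the class `PostalFilterSketch` are assumptions made only for this example.

```java
// Hypothetical heuristics; the real Xponents rules and thresholds may differ.
final class PostalFilterSketch {
    static final int MIN_LENGTH = 4;  // assumed: anything shorter counts as SHORT
    static final int YEAR_MIN = 1800; // assumed year window
    static final int YEAR_MAX = 2100;

    // Strip spaces and punctuation so "SA6 1DN" is measured as 6 characters.
    static String normalize(String code) {
        return code.replaceAll("[^A-Za-z0-9]", "");
    }

    static boolean isShort(String code) {
        return normalize(code).length() < MIN_LENGTH;
    }

    // A bare 4-digit number inside the assumed window is treated as a year.
    static boolean looksLikeYear(String code) {
        String c = code.trim();
        if (!c.matches("[0-9]{4}")) {
            return false;
        }
        int n = Integer.parseInt(c);
        return n >= YEAR_MIN && n <= YEAR_MAX;
    }

    static boolean isFilteredOut(String code) {
        return isShort(code) || looksLikeYear(code);
    }

    public static void main(String[] args) {
        for (String code : new String[] {"1999", "94537", "A1", "SA6 1DN"}) {
            System.out.println(code + " filtered=" + isFilteredOut(code));
        }
    }
}
```

Marking such tags as filtered rather than dropping them leaves the final decision to the caller.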

Author:
ubaldino
  • Constructor Details

    • PostalGeocoder

      public PostalGeocoder()
  • Method Details

    • getName

      public String getName()
      Specified by:
      getName in interface org.opensextant.extraction.Extractor
    • configure

      public void configure() throws org.opensextant.ConfigException
      Specified by:
      configure in interface org.opensextant.extraction.Extractor
      Throws:
      org.opensextant.ConfigException
    • configure

      public void configure(String patfile) throws org.opensextant.ConfigException
      Specified by:
      configure in interface org.opensextant.extraction.Extractor
      Throws:
      org.opensextant.ConfigException
    • configure

      public void configure(URL patfile) throws org.opensextant.ConfigException
      Specified by:
      configure in interface org.opensextant.extraction.Extractor
      Throws:
      org.opensextant.ConfigException
    • setGeneralMatches

      public void setGeneralMatches(List<org.opensextant.extraction.TextMatch> arr)
      OPTIMIZATION: Set the general-purpose matches (geo, taxons, etc.) from a prior processing step. This keeps PostalGeocoder from re-running the same tagging. Only call this if the matches array includes the output of running the PlaceGeocoder.
      Parameters:
      arr - matches from a prior processing step, including PlaceGeocoder output
    • extract

      public List<org.opensextant.extraction.TextMatch> extract(org.opensextant.data.TextInput input) throws org.opensextant.extraction.ExtractionException
      Tag, choose location if possible and emit an array of text matches.

      INPUT: Free text that may have postal addresses.

      OUTPUT: TextMatch array where each match may be:

      • high confidence: Admin code + Postal code that makes sense
      • low confidence: Postal code alone

      There is little in between. For example:

             ..... CA  94537 ...     # a valid zip code in California next to "CA" postal abbreviation. HIGH confidence
             ..... 94537 ....        # a bare 5-digit number. LOW confidence.
             ..... SA6 19DN ...      # bare alpha-numeric postal code.  MED confidence
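
The three examples above can be read as a small decision rule. The sketch below encodes that reading: an agreeing admin code gives HIGH, a bare code mixing letters and digits gives MED, and anything else gives LOW. The class `PostalConfidenceSketch`, its enum, and the `adminCodeAgrees` flag are hypothetical; the actual scoring inside PostalGeocoder may weigh other signals.

```java
// Hypothetical reading of the examples above; not the Xponents scoring code.
final class PostalConfidenceSketch {
    enum Confidence { LOW, MED, HIGH }

    // adminCodeAgrees: true when an adjacent admin code (e.g. "CA") is
    // consistent with the postal code. How that is decided is out of scope here.
    static Confidence classify(String code, boolean adminCodeAgrees) {
        if (adminCodeAgrees) {
            return Confidence.HIGH;
        }
        boolean hasLetter = code.chars().anyMatch(Character::isLetter);
        boolean hasDigit = code.chars().anyMatch(Character::isDigit);
        // A mixed alphanumeric code is less likely to be an ordinary number.
        return (hasLetter && hasDigit) ? Confidence.MED : Confidence.LOW;
    }

    public static void main(String[] args) {
        System.out.println(classify("94537", true));
        System.out.println(classify("94537", false));
        System.out.println(classify("SA6 19DN", false));
    }
}
```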
         
      NOTE: Not thread-safe. A single call maintains some internal state, and a second simultaneous call on the same instance would disrupt it.
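
One common way to work with a stateful extractor from several threads is to give each thread its own instance. The sketch below shows that pattern with `ThreadLocal`; `StatefulExtractor` is a hypothetical stand-in that splits text on whitespace, not the real PostalGeocoder, which would also need its own `configure()` call per instance.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of one-instance-per-thread usage.
final class PerThreadExtractorSketch {
    // Stand-in for a stateful extractor that is not safe to share.
    static final class StatefulExtractor {
        private final List<String> scratch = new ArrayList<>(); // per-call internal state

        List<String> extract(String text) {
            scratch.clear();
            for (String token : text.trim().split("\\s+")) {
                if (!token.isEmpty()) {
                    scratch.add(token);
                }
            }
            return new ArrayList<>(scratch); // copy, so callers never see the scratch list
        }
    }

    // Each thread lazily creates its own instance, so calls never share state.
    private static final ThreadLocal<StatefulExtractor> EXTRACTOR =
            ThreadLocal.withInitial(StatefulExtractor::new);

    static StatefulExtractor current() {
        return EXTRACTOR.get();
    }

    static List<String> extract(String text) {
        return current().extract(text);
    }

    public static void main(String[] args) {
        System.out.println(extract("CA 94537"));
    }
}
```

The alternative is to serialize calls on one shared instance, which is simpler but limits throughput.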
      Specified by:
      extract in interface org.opensextant.extraction.Extractor
      Parameters:
      input - TextInput
      Returns:
      array of TextMatch
      Throws:
      org.opensextant.extraction.ExtractionException - if extraction fails (Solr or Lucene errors) or the rules mechanics fail.
    • extract

      public List<org.opensextant.extraction.TextMatch> extract(String input) throws org.opensextant.extraction.ExtractionException
      Specified by:
      extract in interface org.opensextant.extraction.Extractor
      Throws:
      org.opensextant.extraction.ExtractionException
    • cleanup

      public void cleanup()
      Very simple resource reporting and cleanup.
      Specified by:
      cleanup in interface org.opensextant.extraction.Extractor
    • reset

      public void reset()
    • boundaryLevel1InScope

      public void boundaryLevel1InScope(String nameNorm, org.opensextant.data.Place p)
      Description copied from interface: BoundaryObserver
      Given the name (lower case, strip quotes), the location candidate infers an ADMIN boundary
      Specified by:
      boundaryLevel1InScope in interface BoundaryObserver
    • boundaryLevel2InScope

      public void boundaryLevel2InScope(String nameNorm, org.opensextant.data.Place p)
      Description copied from interface: BoundaryObserver
      Given the name (lower case, strip quotes), the location candidate infers an ADMIN boundary
      Specified by:
      boundaryLevel2InScope in interface BoundaryObserver
    • placeMentionCount

      public Map<String,PlaceCount> placeMentionCount()
      Description copied from interface: BoundaryObserver
      Calculates totals and ratios for the discovered set of boundaries, inferred or explicit.
      Specified by:
      placeMentionCount in interface BoundaryObserver
      Returns:
      counts for boundary places mentioned or inferred
    • countryInScope

      public void countryInScope(String cc)
      Description copied from interface: CountryObserver
      Use a country code to signal that a country was mentioned.
      Specified by:
      countryInScope in interface CountryObserver
      Parameters:
      cc - country code
    • countryInScope

      public void countryInScope(org.opensextant.data.Country C)
      Description copied from interface: CountryObserver
      Use a country object to signal a country was mentioned or is in scope
      Specified by:
      countryInScope in interface CountryObserver
      Parameters:
      C - country object
    • countryObserved

      public boolean countryObserved(String cc)
      Description copied from interface: CountryObserver
      Have you seen this country before?
      Specified by:
      countryObserved in interface CountryObserver
      Parameters:
      cc - country code
      Returns:
      true if observer saw country
    • countryObserved

      public boolean countryObserved(org.opensextant.data.Country C)
      Description copied from interface: CountryObserver
      Have you seen this country before?
      Specified by:
      countryObserved in interface CountryObserver
      Parameters:
      C - country object
      Returns:
      true if observer saw country
    • countryCount

      public int countryCount()
      Specified by:
      countryCount in interface CountryObserver
    • countryMentionCount

      public Map<String,CountryCount> countryMentionCount()
      Description copied from interface: CountryObserver
      Calculates totals and ratios for the discovered set of countries.
      Specified by:
      countryMentionCount in interface CountryObserver
      Returns:
      map of country code : counts
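
To make "totals and ratios" concrete, here is a minimal sketch of a country tally. It borrows method names from CountryObserver for readability, but `CountryTallySketch` is a hypothetical re-implementation: it returns plain `Double` ratios rather than `CountryCount` objects, and the real class may compute its ratios differently.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical tally; not the Xponents CountryObserver implementation.
final class CountryTallySketch {
    private final Map<String, Integer> counts = new LinkedHashMap<>();
    private int total = 0;

    // Record one mention of a country code.
    void countryInScope(String cc) {
        counts.merge(cc.toUpperCase(Locale.ROOT), 1, Integer::sum);
        total++;
    }

    boolean countryObserved(String cc) {
        return counts.containsKey(cc.toUpperCase(Locale.ROOT));
    }

    // Number of distinct countries seen.
    int countryCount() {
        return counts.size();
    }

    // Share of all mentions attributed to each country code.
    Map<String, Double> mentionRatios() {
        Map<String, Double> ratios = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            ratios.put(e.getKey(), e.getValue() / (double) total);
        }
        return ratios;
    }

    public static void main(String[] args) {
        CountryTallySketch tally = new CountryTallySketch();
        tally.countryInScope("US");
        tally.countryInScope("US");
        tally.countryInScope("CA");
        System.out.println(tally.mentionRatios());
    }
}
```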
    • associateMatches

      public static void associateMatches(List<PlaceCandidate> matches, List<PlaceCandidate> postalMatches)
      Given geotagging from a prior pass of PlaceGeocoder or another tagger, compare and align those tags with POSTAL tags.
    • linkGeography

      public static boolean linkGeography(PlaceCandidate postal, PlaceCandidate otherMention, String slot, String featPrefix)
    • deriveMatches

      public static List<org.opensextant.extraction.TextMatch> deriveMatches(List<PlaceCandidate> postalMatches, org.opensextant.data.TextInput t)
      For situations of the form:
           CITY  PROV POSTAL
           CITY  PROV POSTAL COUNTRY
                 PROV POSTAL COUNTRY
      
          etc., where PROV is either a name or an ADM1 postal code, and POSTAL may appear in any order in the tuple.
       
      Do the following:
      (a) generate a new span (PlaceCandidate) match
      (b) set the chosen location to the City or Province, whichever is the finest resolution
      (c) insert the new match into the original array

      Returns the superset of all matches. This makes use of the linkedGeography.

      Parameters:
      postalMatches - postal matches
      t - text input
      Returns:
      all postal matches, now with derived ones added.
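
Steps (a) through (c) can be sketched as follows. `Span` is a hypothetical stand-in for PlaceCandidate, the type labels are assumptions, and the rule "city beats province" is one plausible reading of "finest resolution"; none of this is the actual Xponents implementation.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; Span stands in for PlaceCandidate.
final class DerivedSpanSketch {
    static final class Span {
        final int start;
        final int end;
        final String type; // "CITY", "PROVINCE", "COUNTRY", "POSTAL" or "DERIVED"
        final String name; // matched text, or the chosen location for a derived span

        Span(int start, int end, String type, String name) {
            this.start = start;
            this.end = end;
            this.type = type;
            this.name = name;
        }
    }

    // (a) + (b): build one span covering the postal code and its linked mentions,
    // choosing the city if present, otherwise the province.
    // Returns null when neither a city nor a province is linked.
    static Span derive(Span postal, List<Span> linked) {
        int start = postal.start;
        int end = postal.end;
        Span city = null;
        Span province = null;
        for (Span s : linked) {
            start = Math.min(start, s.start);
            end = Math.max(end, s.end);
            if ("CITY".equals(s.type)) {
                city = s;
            } else if ("PROVINCE".equals(s.type)) {
                province = s;
            }
        }
        Span chosen = (city != null) ? city : province;
        return (chosen == null) ? null : new Span(start, end, "DERIVED", chosen.name);
    }

    // (c): return a superset of the input matches with the derived span appended.
    static List<Span> withDerived(List<Span> postalMatches, Span postal, List<Span> linked) {
        List<Span> all = new ArrayList<>(postalMatches);
        Span derived = derive(postal, linked);
        if (derived != null) {
            all.add(derived);
        }
        return all;
    }
}
```

Because the derived span is appended rather than substituted, callers still see the original postal match alongside the wider one.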
    • unqualifiedPostalLocation

      public static boolean unqualifiedPostalLocation(PlaceCandidate match)