Class PostalTagger

All Implemented Interfaces:
Closeable, AutoCloseable, org.opensextant.data.MatchSchema, org.opensextant.extraction.Extractor

public class PostalTagger extends GazetteerMatcher implements org.opensextant.data.MatchSchema, org.opensextant.extraction.Extractor
PostalTagger tags and returns any alphanumeric token or phrase that resembles a postal code or postal abbreviation. It applies simple filter rules only and does not attempt geocoding.
Author:
ubaldino
  • Constructor Details

    • PostalTagger

      public PostalTagger() throws org.opensextant.ConfigException
      Throws:
      org.opensextant.ConfigException
  • Method Details

    • getName

      public String getName()
      Specified by:
      getName in interface org.opensextant.extraction.Extractor
    • getCoreName

      public String getCoreName()
      Description copied from class: SolrMatcherSupport
      Be explicit about the Solr core to use for tagging.
      Overrides:
      getCoreName in class GazetteerMatcher
      Returns:
      the core name
    • configure

      public void configure()
      Specified by:
      configure in interface org.opensextant.extraction.Extractor
    • configure

      public void configure(String patfile) throws org.opensextant.ConfigException
      Specified by:
      configure in interface org.opensextant.extraction.Extractor
      Throws:
      org.opensextant.ConfigException
    • configure

      public void configure(URL patfile) throws org.opensextant.ConfigException
      Specified by:
      configure in interface org.opensextant.extraction.Extractor
      Throws:
      org.opensextant.ConfigException
    • extract

      public List<org.opensextant.extraction.TextMatch> extract(org.opensextant.data.TextInput input) throws org.opensextant.extraction.ExtractionException
      Tags the input, chooses a location if possible, and emits a list of text matches.

      INPUT: Free text that may contain postal addresses.

      OUTPUT: List of TextMatch covering all possible postal codes that pass trivial noise filters.

      Specified by:
      extract in interface org.opensextant.extraction.Extractor
      Parameters:
      input - TextInput
      Returns:
      list of TextMatch
      Throws:
      org.opensextant.extraction.ExtractionException - if extraction fails because of Solr or Lucene errors, or because of a failure in the rules mechanics.
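
The idea of tagging alphanumeric tokens and then applying trivial noise filters can be sketched in plain Java. This is a hypothetical illustration only, not the PostalTagger implementation (which tags against a Solr core): the class name PostalCandidateSketch, the token pattern, and the "must contain a digit" rule are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch; not part of the OpenSextant API. */
final class PostalCandidateSketch {

    /** A candidate span: the matched text plus its character offsets in the input. */
    static final class Candidate {
        final String text;
        final int start;
        final int end;

        Candidate(String text, int start, int end) {
            this.text = text;
            this.start = start;
            this.end = end;
        }
    }

    // Assumed token shape: runs of ASCII letters and digits, optionally joined by hyphens.
    private static final Pattern TOKEN = Pattern.compile("[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*");

    /** Returns tokens that pass two assumed noise filters: minimum length and at least one digit. */
    static List<Candidate> tag(String text, int minLen) {
        List<Candidate> found = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            String token = m.group();
            boolean longEnough = token.length() >= minLen;
            boolean hasDigit = token.chars().anyMatch(Character::isDigit);
            if (longEnough && hasDigit) {
                found.add(new Candidate(token, m.start(), m.end()));
            }
        }
        return found;
    }
}
```

With the input "Mail to Springfield, VA 22150-1234 by Friday" and a minimum length of 4, this sketch keeps only the span "22150-1234". Purely alphabetic abbreviations such as "VA" are dropped by the assumed digit rule, which is a simplification of whatever rules the real tagger applies.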
    • extract

      public List<org.opensextant.extraction.TextMatch> extract(String input) throws org.opensextant.extraction.ExtractionException
      Specified by:
      extract in interface org.opensextant.extraction.Extractor
      Throws:
      org.opensextant.extraction.ExtractionException
    • cleanup

      public void cleanup()
      Very simple resource reporting and cleanup.
      Specified by:
      cleanup in interface org.opensextant.extraction.Extractor
    • setMinLen

      public void setMinLen(int l)
      Override the default minimum length (MIN_LEN=4) for a postal code. Any TextMatch shorter than this length is filtered out. CA, FO, GB, GG, IE, IM, IS, JE and MT all have postal codes that are 2 or 3 alphanumeric characters long, so lower the minimum if such codes should be kept.
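
The effect of the minimum length can be shown with a small stand-alone filter. This is a sketch under stated assumptions: MinLenFilterSketch and keepLongEnough are hypothetical names, and plain strings stand in for TextMatch objects. Only the default of 4 and the "shorter than the minimum is dropped" rule come from the description above.

```java
import java.util.List;
import java.util.stream.Collectors;

/** Hypothetical sketch; not part of the OpenSextant API. */
final class MinLenFilterSketch {

    /** Mirrors the documented default, MIN_LEN=4. */
    static final int DEFAULT_MIN_LEN = 4;

    /** Keeps only matches whose length is at least minLen; shorter ones are dropped. */
    static List<String> keepLongEnough(List<String> matches, int minLen) {
        return matches.stream()
                .filter(m -> m.length() >= minLen)
                .collect(Collectors.toList());
    }
}
```

Given the candidates "K1A", "90210" and "101", the default minimum of 4 keeps only "90210". Lowering the minimum to 3 keeps all three, which is the situation setMinLen is meant to address for countries with short codes.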