Class MatcherUtils

java.lang.Object
org.opensextant.extraction.MatcherUtils

public class MatcherUtils extends Object
  • Constructor Details

    • MatcherUtils

      public MatcherUtils()
  • Method Details

    • reduceMatches

      public static void reduceMatches(List<TextMatch> matches)
      Reduces the set of valid matches by identifying duplicates and sub-matches. Matches that merely overlap another span are not treated as filtered out.
      Parameters:
      matches - the list of matches to sift through for items that should be filtered out.
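      The reduction rule described above can be illustrated with a minimal sketch. This is not the MatcherUtils implementation: SpanSketch and ReduceSketch are hypothetical stand-ins that assume a match is just a pair of offsets (end exclusive) with a filteredOut flag, and the real TextMatch class may track duplicates and sub-matches differently.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a match; NOT the OpenSextant TextMatch class. */
final class SpanSketch {
    final int start;
    final int end; // exclusive offset (an assumption of this sketch)
    boolean filteredOut = false;

    SpanSketch(int start, int end) {
        this.start = start;
        this.end = end;
    }

    boolean sameSpanAs(SpanSketch o) {
        return start == o.start && end == o.end;
    }

    /** True when this span lies entirely inside o and is not identical to it. */
    boolean isWithin(SpanSketch o) {
        return start >= o.start && end <= o.end && !sameSpanAs(o);
    }
}

final class ReduceSketch {
    /** Flags duplicates and sub-matches in place; overlapping spans are left alone. */
    static void reduce(List<SpanSketch> matches) {
        for (int i = 0; i < matches.size(); i++) {
            SpanSketch a = matches.get(i);
            for (int j = i + 1; j < matches.size(); j++) {
                SpanSketch b = matches.get(j);
                if (a.sameSpanAs(b)) {
                    b.filteredOut = true; // duplicate: keep the first occurrence
                } else if (a.isWithin(b)) {
                    a.filteredOut = true; // a is a sub-match of b
                } else if (b.isWithin(a)) {
                    b.filteredOut = true; // b is a sub-match of a
                }
            }
        }
    }

    public static void main(String[] args) {
        List<SpanSketch> matches = new ArrayList<>();
        matches.add(new SpanSketch(0, 10)); // kept
        matches.add(new SpanSketch(0, 10)); // duplicate of the first
        matches.add(new SpanSketch(2, 5));  // sub-match of the first
        matches.add(new SpanSketch(8, 14)); // overlaps the first, so it is kept
        reduce(matches);
        for (SpanSketch m : matches) {
            System.out.println(m.start + "-" + m.end + " filteredOut=" + m.filteredOut);
        }
    }
}
```

      In this sketch the caller reads the flag afterwards rather than receiving a new list, which mirrors the void return type of reduceMatches.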
    • findTagSpans

      public static List<TextEntity> findTagSpans(String text)
      A trivial attempt at locating the edges of tags in data. This allows us to tag any data and then post-filter any items that match within tags. For example, given
       [A]text[A] [A]text[/A] [A data]text
       
      where A is a tag and a bracket (angle, paren, square, or curly) marks the start of a tag area. This method finds the start and end of each tag area, not the span of text enclosed by a matching pair of tags. Only angle brackets (<) and square brackets ([) are supported as opening characters for now.
        Tags are:
          CHAR TEXT ? CHAR     # <a href=''>
      
        Tags are not:
          CHAR SPACE TEXT ...  # an open tag, followed by non-alpha and/or not closed.
      
        Tag names are always ASCII, as these are simple tag detection tools.
        Unicode tags are allowable.
      
        To properly detect end tags such as [/a] or </a>, "/" is the only character allowed
        between a tag's opening char and its name.
       
      Parameters:
      text - the text to scan for tag areas
      Returns:
      list of TextEntity with no text, just span offsets
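      The tag rules above can be sketched as a small scanner. This is an illustration under stated assumptions, not the library's code: TagSpanSketch is a hypothetical class, spans are returned as {start, end} int pairs (end exclusive) instead of TextEntity objects, and the closing character is simply the next > or ] found after the tag name.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of tag-area detection; NOT MatcherUtils.findTagSpans itself. */
final class TagSpanSketch {

    private static boolean isAsciiLetter(char c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
    }

    /** Returns {start, end} offset pairs (end exclusive) for each tag area found. */
    static List<int[]> findTagSpans(String text) {
        List<int[]> spans = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            char open = text.charAt(i);
            char close;
            if (open == '<') {
                close = '>';
            } else if (open == '[') {
                close = ']';
            } else {
                i++;
                continue;
            }
            // "/" is the only character allowed between the opening char and the name.
            int nameAt = i + 1;
            if (nameAt < text.length() && text.charAt(nameAt) == '/') {
                nameAt++;
            }
            boolean named = nameAt < text.length() && isAsciiLetter(text.charAt(nameAt));
            int end = named ? text.indexOf(close, nameAt) : -1;
            if (end < 0) {
                i++; // not a tag: opening char followed by non-alpha, or never closed
                continue;
            }
            spans.add(new int[] {i, end + 1});
            i = end + 1;
        }
        return spans;
    }

    public static void main(String[] args) {
        String text = "<a href=''>link</a> a < b [x";
        for (int[] s : findTagSpans(text)) {
            System.out.println(s[0] + "-" + s[1] + " " + text.substring(s[0], s[1]));
        }
    }
}
```

      For the sample text, the sketch reports the two tag areas and skips both the "< b" comparison (opening char followed by a space) and the unclosed "[x".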
    • filterMatchesBySpans

      public static void filterMatchesBySpans(String buffer, List<TextMatch> matches)
      A simple demonstration of how to sift through matches to identify those that appear within tags. Given [A]match[/A], match is good. Given [A]match, match is good. Given [A match]text other_match, match is not good, but other_match is fine. As a result, matches inside tags are marked "filteredOut".
      Parameters:
      buffer - the raw signal text where the matches were found.
      matches - list of TextMatch found in buffer
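      The filtering step can be sketched as a containment check of each match against the tag areas. This is a hedged illustration only: TagFilterSketch and its nested Match class are hypothetical, offsets are assumed to be end-exclusive, and the tag areas are passed in directly rather than computed from the buffer as filterMatchesBySpans presumably does.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of filtering matches by tag areas; NOT the MatcherUtils code. */
final class TagFilterSketch {

    /** Hypothetical stand-in for TextMatch: offsets plus a filteredOut flag. */
    static final class Match {
        final int start;
        final int end; // exclusive offset (an assumption of this sketch)
        boolean filteredOut = false;

        Match(int start, int end) {
            this.start = start;
            this.end = end;
        }
    }

    /** Marks every match that lies entirely inside a tag area as filtered out. */
    static void filterBySpans(List<int[]> tagSpans, List<Match> matches) {
        for (Match m : matches) {
            for (int[] tag : tagSpans) {
                if (m.start >= tag[0] && m.end <= tag[1]) {
                    m.filteredOut = true;
                    break;
                }
            }
        }
    }

    public static void main(String[] args) {
        String buffer = "[A match]text other_match";
        List<int[]> tagSpans = new ArrayList<>();
        tagSpans.add(new int[] {0, 9}); // the tag area "[A match]"

        List<Match> matches = new ArrayList<>();
        matches.add(new Match(3, 8));   // "match", inside the tag area
        matches.add(new Match(14, 25)); // "other_match", outside any tag area
        filterBySpans(tagSpans, matches);

        for (Match m : matches) {
            System.out.println(buffer.substring(m.start, m.end) + " filteredOut=" + m.filteredOut);
        }
    }
}
```

      Text that sits between an opening and a closing tag, as in [A]match[/A], falls outside both tag areas and therefore stays unfiltered in this sketch.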