Package org.opensextant.extraction
Class MatcherUtils
java.lang.Object
    org.opensextant.extraction.MatcherUtils
public class MatcherUtils extends java.lang.Object
Field Summary
Fields
Modifier and Type    Field    Description
static java.lang.String
CLOSE_CHARS
static java.lang.String
START_CHARS
Constructor Summary
Constructors
Constructor    Description
MatcherUtils()
Method Summary
Modifier and Type    Method    Description
static void
filterMatchesBySpans(java.lang.String buffer, java.util.List<TextMatch> matches)
A simple demonstration of how to sift through matches identifying which matches appear within tags.
static java.util.List<TextEntity>
findTagSpans(java.lang.String text)
Trivial attempt at locating edges of tags in data.
static void
reduceMatches(java.util.List<TextMatch> matches)
Reduce actual valid matches by identifying duplicates or sub-matches.
Field Detail
START_CHARS
public static final java.lang.String START_CHARS
See Also:
Constant Field Values
CLOSE_CHARS
public static final java.lang.String CLOSE_CHARS
See Also:
Constant Field Values
Method Detail
reduceMatches
public static void reduceMatches(java.util.List<TextMatch> matches)
Reduces the list of valid matches by identifying duplicates and sub-matches (matches contained within another match). Matches whose spans merely overlap are not considered filtered out.
Parameters:
matches - the list of matches to sift through to identify filtered-out items.
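
The reduction described above can be illustrated with a minimal sketch. It does not use the library's classes: `Span`, its `duplicate` and `submatch` flags, and `reduce` are hypothetical stand-ins for TextMatch and reduceMatches, and the pairwise comparison shown here is only one plausible way to get the documented behavior, not necessarily how the library does it.

```java
import java.util.ArrayList;
import java.util.List;

class ReduceMatchesSketch {

    /** Hypothetical stand-in for TextMatch: a character span plus two flags. */
    static class Span {
        final int start;
        final int end;
        boolean duplicate = false;
        boolean submatch = false;

        Span(int start, int end) {
            this.start = start;
            this.end = end;
        }

        boolean isFilteredOut() {
            return duplicate || submatch;
        }
    }

    /** Flags duplicates and sub-matches in place; partial overlaps are left alone. */
    static void reduce(List<Span> spans) {
        for (int i = 0; i < spans.size(); i++) {
            Span a = spans.get(i);
            if (a.isFilteredOut()) {
                continue;
            }
            for (int j = i + 1; j < spans.size(); j++) {
                Span b = spans.get(j);
                if (b.isFilteredOut()) {
                    continue;
                }
                if (a.start == b.start && a.end == b.end) {
                    b.duplicate = true;   // identical span: keep the first one seen
                } else if (a.start <= b.start && b.end <= a.end) {
                    b.submatch = true;    // b lies entirely inside a
                } else if (b.start <= a.start && a.end <= b.end) {
                    a.submatch = true;    // a lies entirely inside b
                    break;
                }
            }
        }
    }

    public static void main(String[] args) {
        List<Span> spans = new ArrayList<>();
        spans.add(new Span(0, 10));  // kept
        spans.add(new Span(0, 10));  // same span as the first
        spans.add(new Span(2, 5));   // contained in the first
        spans.add(new Span(8, 15));  // only overlaps the first
        reduce(spans);
        for (Span s : spans) {
            System.out.println(s.start + "-" + s.end + " filteredOut=" + s.isFilteredOut());
        }
    }
}
```

In this sketch the span 8-15 survives because it only partially overlaps 0-10, which matches the statement that overlapping spans are not considered filtered out.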
findTagSpans
public static java.util.List<TextEntity> findTagSpans(java.lang.String text)
Trivial attempt at locating the edges of tags in data. This allows us to tag any data, then post-filter any matches that fall within tags. That is, if you have

 [A]text[A]
 [A]text[/A]
 [A data]text

where A is a tag, the opening bracket (angle, paren, square, curly) marks the start of a tag area. We are finding the starts and ends of each tag area, not the text span enclosed by a pair of matching tags.

Supported brackets are angle (< >) and square ([ ]) for now.

 Tags are:      CHAR TEXT ? CHAR        # <a href=''>
 Tags are not:  CHAR SPACE TEXT .....   # an open tag, followed by non-alpha and/or not closed.

Tag names are assumed to be ASCII, as this is a simple tag detection tool, though Unicode tags are allowable. To properly detect end tags such as [/a] or </a>, "/" is the only character allowed between the opening char and the tag name.
Parameters:
text - the text to scan for tag spans
Returns:
list of TextEntity with no text, just span offsets
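
The tag rules above can be illustrated with a minimal sketch. It is not the library's implementation: the `OPEN` and `CLOSE` strings are assumed values (the actual START_CHARS and CLOSE_CHARS constants are not shown on this page), spans are returned as hypothetical `int[] {start, end}` pairs rather than TextEntity objects, and using `Character.isLetter` to accept tag names is an assumption.

```java
import java.util.ArrayList;
import java.util.List;

class TagSpanSketch {

    // Assumed bracket sets; index k in OPEN pairs with index k in CLOSE.
    static final String OPEN = "<[";
    static final String CLOSE = ">]";

    /** Returns {start, end} offsets of tag areas; end is just past the closing char. */
    static List<int[]> findTagSpans(String text) {
        List<int[]> spans = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            int kind = OPEN.indexOf(text.charAt(i));
            if (kind < 0) {
                i++;
                continue;
            }
            int next = i + 1;
            // "/" is the only character allowed between the opening char and the name.
            if (next < text.length() && text.charAt(next) == '/') {
                next++;
            }
            // CHAR SPACE TEXT (or any non-letter) is not a tag.
            if (next >= text.length() || !Character.isLetter(text.charAt(next))) {
                i++;
                continue;
            }
            int close = text.indexOf(CLOSE.charAt(kind), next);
            if (close < 0) {
                i++;      // opened but never closed: not a tag
                continue;
            }
            spans.add(new int[] {i, close + 1});
            i = close + 1;
        }
        return spans;
    }

    public static void main(String[] args) {
        for (int[] span : findTagSpans("[A]match[/A] and <a href=''>link")) {
            System.out.println(span[0] + ".." + span[1]);
        }
    }
}
```

Note that the sketch reports the tag areas themselves, such as `[A]` and `[/A]`, and not the text between them, in line with the description above.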
filterMatchesBySpans
public static void filterMatchesBySpans(java.lang.String buffer, java.util.List<TextMatch> matches)
A simple demonstration of how to sift through matches and identify which ones appear within tags. For example:

 [A]match[/A]                  match is good.
 [A]match                      match is good.
 [A match]text other_match     match is not good; other_match is fine.

The result is that matches inside tags are marked "filteredOut".
Parameters:
buffer - the raw signal text where the matches were found.
matches - list of TextMatch
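
The third case above can be illustrated with a minimal sketch. The `Match` class, its `filteredOut` field, and `filterBySpans` are hypothetical stand-ins, not the library's API. The tag span is supplied by hand as an `int[] {start, end}` pair to keep the sketch self-contained, whereas the real method receives the raw buffer.

```java
import java.util.ArrayList;
import java.util.List;

class FilterBySpansSketch {

    /** Hypothetical stand-in for TextMatch. */
    static class Match {
        final int start;
        final int end;
        boolean filteredOut = false;

        Match(int start, int end) {
            this.start = start;
            this.end = end;
        }
    }

    /** Marks every match that lies entirely inside some tag span. */
    static void filterBySpans(List<int[]> tagSpans, List<Match> matches) {
        for (Match m : matches) {
            for (int[] tag : tagSpans) {
                if (tag[0] <= m.start && m.end <= tag[1]) {
                    m.filteredOut = true;
                    break;
                }
            }
        }
    }

    public static void main(String[] args) {
        String buffer = "[A match]text other_match";

        List<int[]> tagSpans = new ArrayList<>();
        tagSpans.add(new int[] {0, 9});      // the tag area "[A match]"

        List<Match> matches = new ArrayList<>();
        matches.add(new Match(3, 8));        // "match", inside the tag
        matches.add(new Match(14, 25));      // "other_match", outside any tag

        filterBySpans(tagSpans, matches);
        for (Match m : matches) {
            System.out.println(buffer.substring(m.start, m.end) + " filteredOut=" + m.filteredOut);
        }
    }
}
```

Running the sketch flags `match` and leaves `other_match` untouched, mirroring the "match is not good; other_match is fine" case.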