Package org.opensextant.extraction
Class MatcherUtils
java.lang.Object
org.opensextant.extraction.MatcherUtils
-
Field Summary
-
Constructor Summary
-
Method Summary
Modifier and Type / Method / Description
static void
filterMatchesBySpans(String buffer, List<TextMatch> matches)
    A simple demonstration of how to sift through matches identifying which matches appear within tags.
static List<TextEntity>
findTagSpans(String text)
    Trivial attempt at locating edges of tags in data.
static void
reduceMatches(List<TextMatch> matches)
    Reduce actual valid matches by identifying duplicates or sub-matches.
-
Field Details
-
START_CHARS
- See Also:
-
CLOSE_CHARS
- See Also:
-
-
Constructor Details
-
MatcherUtils
public MatcherUtils()
-
-
Method Details
-
reduceMatches
Reduce the valid matches by identifying duplicates and sub-matches. Matches whose spans merely overlap are not considered filtered out.
- Parameters:
matches
- list of matches to sift through to find filtered-out items.
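The reduction described above can be sketched as follows. This is a minimal illustration, not the library's implementation: the `Span` class, its `duplicate` and `subMatch` flags, and the `reduce` method are hypothetical stand-ins for `TextMatch` and `reduceMatches`, and the assumption that the later of two identical spans is the one flagged is ours.

```java
import java.util.ArrayList;
import java.util.List;

final class ReduceMatchesSketch {

    /** Hypothetical stand-in for a match: a half-open span [start, end) plus flags. */
    static final class Span {
        final int start;
        final int end;
        boolean duplicate;
        boolean subMatch;

        Span(int start, int end) {
            this.start = start;
            this.end = end;
        }

        boolean filteredOut() {
            return duplicate || subMatch;
        }
    }

    /** Flags duplicates and sub-matches in place; partial overlaps are left alone. */
    static void reduce(List<Span> spans) {
        for (int i = 0; i < spans.size(); i++) {
            Span a = spans.get(i);
            for (int j = i + 1; j < spans.size(); j++) {
                Span b = spans.get(j);
                if (a.start == b.start && a.end == b.end) {
                    b.duplicate = true;          // same offsets: flag the later one
                } else if (a.start <= b.start && b.end <= a.end) {
                    b.subMatch = true;           // b lies entirely inside a
                } else if (b.start <= a.start && a.end <= b.end) {
                    a.subMatch = true;           // a lies entirely inside b
                }
                // Otherwise the spans are disjoint or only partially overlap:
                // neither is flagged.
            }
        }
    }

    public static void main(String[] args) {
        List<Span> spans = new ArrayList<>();
        spans.add(new Span(0, 10));   // widest span
        spans.add(new Span(0, 10));   // same offsets as the first
        spans.add(new Span(2, 6));    // contained in the first
        spans.add(new Span(8, 14));   // partially overlaps the first
        reduce(spans);
        for (Span s : spans) {
            System.out.println(s.start + "-" + s.end + " filteredOut=" + s.filteredOut());
        }
    }
}
```

The pairwise comparison is quadratic in the number of matches, which is usually acceptable for the handful of matches found in one document; sorting by offset first would be the natural refinement.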
-
findTagSpans
Trivial attempt at locating the edges of tags in data. This allows us to tag any data, then post-filter any items that match within tags. That is, if you have
    [A]text[A]
    [A]text[/A]
    [A data]text
where A is a tag, the (angle, paren, square, curly) bracket marks the start of a tag area. We are finding the start and end of each tag area, not the text span enclosed by the matching tags. Supported opening characters are < and [ for now.
    Tags are:      CHAR TEXT ? CHAR        # <a href=''>
    Tags are not:  CHAR SPACE TEXT .....   # an open tag, followed by non-alpha and/or not closed.
Tag names are always ASCII, as these are simple tag detection tools. Unicode tags are allowable. To properly detect end tags such as [/a] or </a>, "/" is the only non-alphabetic character allowed directly after an opening char for a tag.
- Parameters:
text
- the text to scan for tag areas
- Returns:
- list of TextEntity with no text, just span offsets
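The tag rules above can be sketched as a small scanner. This is an illustration under stated assumptions rather than the library's code: the `OPEN` and `CLOSE` strings are assumed values (the actual START_CHARS and CLOSE_CHARS constants are not shown on this page), and the sketch returns plain `{start, end}` offset pairs instead of `TextEntity` objects.

```java
import java.util.ArrayList;
import java.util.List;

final class TagSpanSketch {

    // Assumed opening and closing characters, paired by index.
    static final String OPEN = "<[";
    static final String CLOSE = ">]";

    static boolean isAsciiLetter(char c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
    }

    /** Returns {start, end} offsets (end exclusive) of each tag area found. */
    static List<int[]> findTagSpans(String text) {
        List<int[]> spans = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            int kind = OPEN.indexOf(text.charAt(i));
            if (kind < 0 || i + 1 >= text.length()) {
                i++;                       // not an opening char, or nothing follows it
                continue;
            }
            char next = text.charAt(i + 1);
            if (!isAsciiLetter(next) && next != '/') {
                i++;                       // "CHAR SPACE TEXT" and similar are not tags
                continue;
            }
            int close = text.indexOf(CLOSE.charAt(kind), i + 1);
            if (close < 0) {
                i++;                       // never closed, so not a tag
                continue;
            }
            spans.add(new int[] {i, close + 1});
            i = close + 1;                 // resume scanning after the tag area
        }
        return spans;
    }

    public static void main(String[] args) {
        String text = "<a href=''>text</a> 3 < 4 [B]x";
        for (int[] span : findTagSpans(text)) {
            System.out.println(span[0] + "-" + span[1] + " " + text.substring(span[0], span[1]));
        }
    }
}
```

In the sample text the lone `<` in `3 < 4` is followed by a space, so it is skipped, while `<a href=''>`, `</a>` and `[B]` are each reported as one tag area.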
-
filterMatchesBySpans
A simple demonstration of how to sift through matches, identifying which ones appear within tags. For example:
    [A]match[/A]                 match is good.
    [A]match                     match is good.
    [A match]text other_match    match is not good; other_match is fine.
The result is that matches inside tags are marked "filteredOut".
- Parameters:
buffer
- the raw signal text where the matches were found.
matches
- list of TextMatch
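The filtering step can be sketched as a containment check between match offsets and tag areas. This is a minimal sketch, not the library's method: the `Match` class, its `filteredOut` field, and the `filterBySpans` signature (which takes the tag areas as an explicit argument) are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

final class FilterBySpansSketch {

    /** Hypothetical stand-in for a match: a half-open span [start, end) and a flag. */
    static final class Match {
        final int start;
        final int end;
        boolean filteredOut;

        Match(int start, int end) {
            this.start = start;
            this.end = end;
        }
    }

    /** Marks every match that lies entirely inside a tag area {start, end}. */
    static void filterBySpans(List<Match> matches, List<int[]> tagSpans) {
        for (Match m : matches) {
            for (int[] tag : tagSpans) {
                if (tag[0] <= m.start && m.end <= tag[1]) {
                    m.filteredOut = true;
                    break;                 // one enclosing tag area is enough
                }
            }
        }
    }

    public static void main(String[] args) {
        String buffer = "[A match]text other_match";

        // The tag area runs from '[' through ']' inclusive.
        List<int[]> tagSpans = new ArrayList<>();
        tagSpans.add(new int[] {buffer.indexOf('['), buffer.indexOf(']') + 1});

        int inner = buffer.indexOf("match");
        int outer = buffer.indexOf("other_match");
        List<Match> matches = new ArrayList<>();
        matches.add(new Match(inner, inner + "match".length()));
        matches.add(new Match(outer, outer + "other_match".length()));

        filterBySpans(matches, tagSpans);
        for (Match m : matches) {
            System.out.println(buffer.substring(m.start, m.end) + " filteredOut=" + m.filteredOut);
        }
    }
}
```

Here `match` sits inside the tag area `[A match]` and is marked, while `other_match` follows the tag and is left untouched.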
-