- builtins.object
  - PatternTestCase
  - RegexPattern
  - RegexPatternManager
- opensextant.Extractor(abc.ABC)
  - PatternExtractor
- opensextant.TextMatch(opensextant.TextEntity)
  - PatternMatch
class PatternExtractor(opensextant.Extractor)
|
PatternExtractor(pattern_manager)
Discussion: Read first https://opensextant.github.io/Xponents/doc/Patterns.md
Example:
```
from opensextant.extractors.poli import PatternsOfLifeManager
from opensextant.FlexPat import PatternExtractor
# INIT
#=====================
# Invoke a particular REGEX rule set, here poli_patterns.cfg
# @see https://github.com/OpenSextant/Xponents/blob/master/Core/src/main/resources/poli_patterns.cfg
mgr = PatternsOfLifeManager("poli_patterns.cfg")
pex = PatternExtractor(mgr)
# DEV/TEST
#=====================
# "default_tests()" is useful to run during development and
# encourages you to capture critical pattern variants in your "TEST" data.
# Look at your pass/fail situations -- what test cases are failing your rule?
test_results = pex.default_tests()
print("TEST RESULTS")
for result in test_results:
    print(repr(result))
# RUN
#=====================
real_results = pex.extract(".... text blob 1-800-123-4567...")
print("REAL RESULTS")
for result in real_results:
    print(repr(result))
    # NOTE: render_match is not imported in this example; import it before use.
    print(" RAW DICT:", render_match(result))
```
|
- Method resolution order:
- PatternExtractor
- opensextant.Extractor
- abc.ABC
- builtins.object
Methods defined here:
- __init__(self, pattern_manager)
- Invoke RegexPatternManager(your_cfg_file) or implement a custom RegexPatternManager (rare).
NOTE - `PatternsOfLifeManager` is a particular subclass of RegexPatternManager because
it manipulates the input patterns config file, which is shared with the Java demo.
The `CLASS` names in that file are unfortunately specific to either Python or Java.
:param pattern_manager: RegexPatternManager
- default_tests(self, scope='rule')
- Default Tests run all TEST cases for each RULE in patterns config.
TESTs marked with a 'FAIL' comment are intended to return 0 matches or only matches that are filtered out.
Otherwise a TEST is intended to return 1 or more matches.
By default, this runs each test and observes only results that were triggered by that rule being tested.
If scope is "ruleset" then any results from any rule will be allowed.
"rule" scope is much better for detailed rule development as it tells you if your rule tests are testing the
right thing.
Runs the default tests on the provided configuration and prints plenty of debug output to the screen.
It also returns the test results as an array, e.g., to write to CSV for review.
This uses PatternExtractor.extract_patterns() to avoid any collision with the generic use
of Extractor.extract() parent method.
:param scope: rule or ruleset. Rule scope means only results for rule test case are evaluated.
ruleset scope means that all results for a test are evaluated.
:return: test results array; Each result represents a TEST case run against a RULE
- extract(self, text, **kwargs)
- Default Extractor API.
- extract_patterns(self, text, **kwargs)
- Given some text input, apply all relevant pattern families against the text.
Surrounding text is added to each match for post-processing.
:param text:
:param kwargs:
:return:
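As a rough illustration of what applying pattern families to a text involves, the sketch below uses only the standard `re` module. The `FAMILIES` table, the `find_all` function and the dict layout are hypothetical stand-ins for this example; they are not the FlexPat implementation or its config format.

```python
import re

# Hypothetical rule table: family name -> list of (rule_id, compiled regex).
FAMILIES = {
    "PHONE": [("PHONE-01", re.compile(r"\b1-\d{3}-\d{3}-\d{4}\b"))],
    "MONEY": [("MONEY-01", re.compile(r"\$\d+(?:\.\d{2})?"))],
}


def find_all(text, families=FAMILIES):
    """Apply every rule in every family; return matches sorted by offset."""
    found = []
    for family, rules in families.items():
        for rule_id, regex in rules:
            for m in regex.finditer(text):
                found.append({
                    "family": family,
                    "pattern_id": rule_id,
                    "text": m.group(0),
                    "start": m.start(),
                    "end": m.end(),
                })
    return sorted(found, key=lambda d: d["start"])


matches = find_all("Call 1-800-123-4567 and pay $25.00 today.")
for match in matches:
    print(match)
```

Recording the family and rule ID on each match is what later makes per-rule testing and filtering possible.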
Data and other attributes defined here:
- __abstractmethods__ = frozenset()
Data descriptors inherited from opensextant.Extractor:
- __dict__
- dictionary for instance variables (if defined)
- __weakref__
- list of weak references to the object (if defined)
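The difference between the "rule" and "ruleset" scopes of `default_tests` can be sketched with a toy test runner. Everything here (`RULES`, `TESTS`, `run_tests`, the result dict keys) is hypothetical and only mirrors the behavior described above; the real method reads its rules and TEST cases from the patterns config.

```python
import re

# Hypothetical rules and tests; real ones come from the patterns config file.
RULES = {
    "PHONE-01": re.compile(r"\d{3}-\d{3}-\d{4}"),
    "PHONE-02": re.compile(r"\(\d{3}\) \d{3}-\d{4}"),
}
TESTS = [
    # (rule under test, test text, is this a 'FAIL' test?)
    ("PHONE-01", "call 800-123-4567", False),
    ("PHONE-02", "call 800-123-4567", False),  # only PHONE-01 fires here
    ("PHONE-01", "no digits here", True),
]


def run_tests(scope="rule"):
    """Return one result dict per test case."""
    results = []
    for rule_id, text, expect_fail in TESTS:
        hits = [rid for rid, rx in RULES.items() if rx.search(text)]
        if scope == "rule":
            # Count only matches produced by the rule being tested.
            hits = [rid for rid in hits if rid == rule_id]
        matched = len(hits) > 0
        results.append({"rule": rule_id, "text": text,
                        "passed": matched != expect_fail})
    return results


rule_scope = run_tests("rule")
ruleset_scope = run_tests("ruleset")
for r in rule_scope:
    print(r)
```

In this sketch the second test passes under "ruleset" scope only because a different rule matched, which is the kind of misplaced test that "rule" scope exposes.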
|
class PatternMatch(opensextant.TextMatch)
|
PatternMatch(*args, pattern_id=None, label=None, match_groups=None)
A general Pattern-based TextMatch.
This Python variation consolidates PoliMatch (patterns-of-life = poli) ideas in the Java API.
|
- Method resolution order:
- PatternMatch
- opensextant.TextMatch
- opensextant.TextEntity
- builtins.object
Methods defined here:
- __init__(self, *args, pattern_id=None, label=None, match_groups=None)
- Initialize self. See help(type(self)) for accurate signature.
- add_surrounding_text(self, text, text_len, length=16)
- Given this match's span and the text it was derived from,
populate pre_text and post_text with the number of characters specified by length.
:param text: The text in which this match was found.
:param text_len: the length of the text buffer. (avoid repeating len(text))
:param length: the pre/post text length to attach.
:return:
- attributes(self)
- Render domain details as a meaningful exported view of the data.
:return:
- copy_attrs(self, arr)
- Default copy of match group slots. Does not work for every situation.
:param arr:
:return:
- get_value(self, k)
- Get Slot value -- returns first one.
:param k:
:return:
- normalize(self)
- Optional but recommended routine to normalize the matched data,
that is, parse fields, uppercase, streamline punctuation, etc.
The normalized result also provides an opportunity to
validate the match.
:return:
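The normalize-then-validate idea behind `normalize` can be sketched for a phone-number match. The `normalize_phone` helper and its validity rule are assumptions made for this example, not part of the opensextant API; a real subclass would set fields on the match instead of returning a tuple.

```python
import re


def normalize_phone(matched_text):
    """Return (normalized, is_valid) for a raw phone-like match.

    Hypothetical helper: strips punctuation, then validates the digit count.
    """
    digits = re.sub(r"\D", "", matched_text)
    # Assumed validity rule for this sketch: 10 digits, or 11 with a leading 1.
    is_valid = len(digits) == 10 or (len(digits) == 11 and digits.startswith("1"))
    return digits, is_valid


good = normalize_phone("1-800-123-4567")
bad = normalize_phone("(800) 123.45")
print(good, bad)
```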
Data and other attributes defined here:
- FOUND_CASE = 0
- LOWER_CASE = 2
- UPPER_CASE = 1
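The windowing that `add_surrounding_text` describes (a fixed number of characters before and after the match span) can be sketched as a standalone function. The function name, the returned tuple and the assumption that the end offset is exclusive are choices made for this example only.

```python
def surrounding_text(text, text_len, start, end, length=16):
    """Return (pre_text, post_text) around the span [start, end).

    Offsets are clamped so a match near either edge does not index
    outside the buffer. text_len is passed in to avoid repeating len(text).
    """
    pre_start = max(0, start - length)
    post_end = min(text_len, end + length)
    return text[pre_start:start], text[end:post_end]


blob = ".... text blob 1-800-123-4567..."
start = blob.index("1-800")
end = start + len("1-800-123-4567")
pre, post = surrounding_text(blob, len(blob), start, end, length=5)
print(repr(pre), repr(post))
```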
Methods inherited from opensextant.TextMatch:
- __str__(self)
- Return str(self).
- populate(self, attrs: dict)
- Populate a TextMatch to normalize the set of attributes -- separate class fields on TextMatch from additional
optional attributes.
:param attrs: dict of standard Xponents API outputs.
:return:
Methods inherited from opensextant.TextEntity:
- contains(self, x1)
- Whether this span contains the offset x1
:param x1:
- exact_match(self, t)
- is_after(self, t)
- is_before(self, t)
- is_within(self, t)
- Whether the given annotation, t, contains this span
:param t:
:return:
- overlaps(self, t)
- Determine if t overlaps self. If Right or Left match, t overlaps if it is longer.
If t is contained entirely within self, then it is not considered overlap -- it is Contained within.
:param t:
:return:
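The span relations listed above can be sketched with a small stand-in class. `Span` is hypothetical, treats `end` as exclusive, and interprets the `overlaps` description as "shares characters with t, unless t lies entirely within self"; the actual TextEntity logic may differ in edge cases.

```python
from dataclasses import dataclass


@dataclass
class Span:
    """Hypothetical stand-in for a text span; end is treated as exclusive."""
    start: int
    end: int

    def contains(self, x1):
        return self.start <= x1 < self.end

    def is_within(self, t):
        # True when t fully covers this span.
        return t.start <= self.start and self.end <= t.end

    def is_before(self, t):
        return self.end <= t.start

    def is_after(self, t):
        return self.start >= t.end

    def overlaps(self, t):
        # Shares at least one character, but t is not contained in self.
        shares = self.start < t.end and t.start < self.end
        return shares and not t.is_within(self)


a = Span(10, 20)
print(a.overlaps(Span(15, 25)))  # partial overlap on the right
print(a.overlaps(Span(12, 18)))  # contained in a, so not an overlap
print(a.overlaps(Span(10, 25)))  # left edges match and t is longer
```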
Data descriptors inherited from opensextant.TextEntity:
- __dict__
- dictionary for instance variables (if defined)
- __weakref__
- list of weak references to the object (if defined)
|
class PatternTestCase(builtins.object)
|
PatternTestCase(tid, family, text)
|
|
Methods defined here:
- __init__(self, tid, family, text)
- Initialize self. See help(type(self)) for accurate signature.
Data descriptors defined here:
- __dict__
- dictionary for instance variables (if defined)
- __weakref__
- list of weak references to the object (if defined)
|
class RegexPattern(builtins.object)
|
RegexPattern(fam, pid, desc)
|
|
Methods defined here:
- __init__(self, fam, pid, desc)
- Initialize self. See help(type(self)) for accurate signature.
- __str__(self)
- Return str(self).
Data descriptors defined here:
- __dict__
- dictionary for instance variables (if defined)
- __weakref__
- list of weak references to the object (if defined)
|
class RegexPatternManager(builtins.object)
|
RegexPatternManager(patterns_cfg, module_file=None, debug=False, testing=False)
RegexPatternManager is the patterns configuration file parser.
See documentation: https://opensextant.github.io/Xponents/doc/Patterns.md
|
Methods defined here:
- __init__(self, patterns_cfg, module_file=None, debug=False, testing=False)
- Initialize self. See help(type(self)) for accurate signature.
- create_pattern(self, fam, rule, desc)
- Override pattern class creation as needed.
- create_testcase(self, tid, fam, text)
- disable_all(self)
- enable_all(self)
- get_pattern(self, pid)
- set_enabled(self, some: str, flag: bool)
- Enable or disable a pattern family.
:param some: prefix of a family or family-variant
:param flag: bool setting
:return:
- validate_pattern(self, repat)
- Default validation returns True.
Override this if necessary, e.g., when the pattern implementation has additional metadata.
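Prefix-based enabling, as described for `set_enabled`, can be sketched with a toy registry. `ToyPatternRegistry` and its `enabled` dict are hypothetical; they only illustrate how a family prefix such as "PHONE" could select several family variants at once, and do not reflect RegexPatternManager internals.

```python
class ToyPatternRegistry:
    """Hypothetical registry keyed by pattern ID such as 'PHONE-01'."""

    def __init__(self, pattern_ids):
        self.enabled = {pid: True for pid in pattern_ids}

    def set_enabled(self, some, flag):
        # Toggle every pattern whose ID starts with the given prefix.
        for pid in self.enabled:
            if pid.startswith(some):
                self.enabled[pid] = flag

    def disable_all(self):
        for pid in self.enabled:
            self.enabled[pid] = False

    def enable_all(self):
        for pid in self.enabled:
            self.enabled[pid] = True


registry = ToyPatternRegistry(["PHONE-01", "PHONE-02", "MONEY-01"])
registry.disable_all()
registry.set_enabled("PHONE", True)
active = sorted(pid for pid, on in registry.enabled.items() if on)
print(active)
```

Disabling everything and then enabling one prefix is a convenient way to run a single family in isolation during rule development.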
Data descriptors defined here:
- __dict__
- dictionary for instance variables (if defined)
- __weakref__
- list of weak references to the object (if defined)