Package org.opensextant.extractors.flexpat
FlexPat -- A Pattern Definition & Testing Library
FlexPat is a pattern-based extractor that allows you to define regular expressions (RE or regex) along with the test data that you believe should be matched. Some of FlexPat's "features" stem from a deficiency in Java's RE support: Java SDK versions before Java 7 do not support named groups. FlexPat solves this by defining fields (aka RE groups) that are used to compose more complex patterns. The fields are sub-patterns themselves and serve two purposes:
- They keep your pattern library organized and more object-oriented and reusable. e.g., once you define a field for a date pattern, you can reuse that by naming it where you need it.
- They help you recall fields from matches by name so you can post-process matches, e.g., for normalization or filtering.
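The idea of composing fields into a pattern and recalling them by name can be sketched as follows. This is a minimal illustration, not FlexPat's implementation: the class FieldComposer and its groupMap method are hypothetical names, and the sketch assumes DEFINEd sub-patterns contain no capturing groups of their own.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch; not one of FlexPat's classes. */
class FieldComposer {
    // Finds <field> references inside a rule pattern.
    private static final Pattern FIELD_REF = Pattern.compile("<(\\w+)>");

    final Pattern regex;
    // Field names in the order their groups appear in the compiled regex.
    final List<String> fieldOrder = new ArrayList<>();

    FieldComposer(Map<String, String> defines, String rule) {
        Matcher m = FIELD_REF.matcher(rule);
        StringBuilder sb = new StringBuilder();
        int last = 0;
        while (m.find()) {
            String name = m.group(1);
            String sub = defines.get(name);
            if (sub == null) {
                throw new IllegalArgumentException("Undefined field: " + name);
            }
            // Copy the literal text before the reference, then wrap the
            // sub-pattern in a plain capturing group.
            sb.append(rule, last, m.start());
            sb.append('(').append(sub).append(')');
            fieldOrder.add(name);
            last = m.end();
        }
        sb.append(rule, last, rule.length());
        regex = Pattern.compile(sb.toString());
    }

    /** Maps field names to matched text; group i+1 belongs to fieldOrder.get(i). */
    Map<String, String> groupMap(Matcher match) {
        Map<String, String> fields = new LinkedHashMap<>();
        for (int i = 0; i < fieldOrder.size(); i++) {
            fields.put(fieldOrder.get(i), match.group(i + 1));
        }
        return fields;
    }
}
```

With the fields YEAR defined as \d{4} and MON as \d{2}, the rule <YEAR>-<MON> compiles to (\d{4})-(\d{2}), and groupMap returns the year and month keyed by field name. A sub-pattern that needs grouping would have to use a non-capturing group, (?:...), or the index bookkeeping in this sketch would be off.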
A config file is processed by RegexPatternManager.
The file consists of DEFINES, RULES, and TESTS.
DEFINE -- a field name and a valid RE pattern.
#DEFINE field pattern
RULE and TEST -- a valid RE pattern that defines things you wish to match. Each <field> must be a valid field DEFINEd ahead of time. RULEs are enumerated within a family of rules. RegexPatternManager and your implementation should allow the enabling/disabling of families of rules as well as individual rules. RULEs are immediately followed by TEST cases that share the family and enumeration of a given rule.
#RULE name family enum pattern
#TEST name family enum data
name, family and enum are code keywords with no white space. Enumerations are any alphanumeric string; however, for ease of use, they are typically numbers followed by a few alphabetic characters as modifiers.
pattern := RE, which is any valid combination of <field> and RE expressions excluding RE groups. That is, RULE patterns may not contain additional unnamed/un-DEFINEd groups. The use of "(group)" in a RULE will cause FlexPat to fail.
TEST data := any string of characters. $NL is typically used to represent a \n character, which should be inserted during testing. FlexPat does not do this -- the caller must handle it. This is only a convention. Data may also contain an optional comment. Again, this is a convention: the caller should know what to do with the comment. By convention, if the comment/data includes the term "FAIL", the test represents a true negative, i.e., the data should not match or should not parse as a true positive.
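Since the $NL and FAIL conventions are left to the caller, a caller-side helper might look like the sketch below. The class TestDataConventions is hypothetical, and the assumption that a comment starts at the first '#' in the data is made up for this illustration; the text above does not say how comments are delimited.

```java
import java.util.regex.Pattern;

/** Hypothetical caller-side helper; not part of FlexPat. */
class TestDataConventions {

    /**
     * Returns the text to run the rule against.
     * Assumption (not from FlexPat): a trailing comment begins at the first '#'.
     */
    static String textOf(String data) {
        int hash = data.indexOf('#');
        String text = hash >= 0 ? data.substring(0, hash) : data;
        // Expand the $NL convention into a real newline character.
        return text.trim().replace("$NL", "\n");
    }

    /** A test whose data or comment mentions FAIL is a true negative. */
    static boolean expectsMatch(String data) {
        return !data.contains("FAIL");
    }

    /** The test passes when the observed outcome equals the expected one. */
    static boolean passes(Pattern rule, String data) {
        boolean matched = rule.matcher(textOf(data)).find();
        return matched == expectsMatch(data);
    }
}
```

For a rule compiled from \d{4}-\d{2}, the data "2012-07" passes because it matches, and "2012/07 # FAIL wrong separator" also passes because it is marked as a true negative and does not match.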
DEFINEs and RULEs are RE strings and are escaped properly within RegexPatternManager -- you, the user, do not need to escape tokens for the programming language, e.g., "\s+" is sufficient -- "\\s+" is not needed to escape the "\" modifier.
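The difference comes from Java source syntax rather than from the regex itself, as the small sketch below illustrates. EscapeDemo is a hypothetical class; fromConfigFile stands in for text read from a config file.

```java
import java.util.regex.Pattern;

/** Hypothetical illustration of config-file text versus a Java string literal. */
class EscapeDemo {

    /** Simulates the three characters \ s + exactly as typed in a config file. */
    static String fromConfigFile() {
        return new String(new char[] {'\\', 's', '+'});
    }

    /** True when the whole input matches the given regex text. */
    static boolean matchesWhitespaceRun(String regexText, String input) {
        return Pattern.compile(regexText).matcher(input).matches();
    }
}
```

The Java literal "\\s+" denotes the same three characters that fromConfigFile returns, so text read from a file can be handed to Pattern.compile unchanged.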
Defining patterns involves these classes
- RegexPatternManager -- the central pattern manager as described above. It takes a config file as a URL or file. DEFINEs are ephemeral -- they are used only to create RULEs during initialization and are not used afterward.
- PatternTestCase -- maps to the TEST objects.
- RegexPattern -- maps to the RULE objects.
Implementation
Subclass RegexPatternManager, implementing the create_pattern, validate_pattern, enable_pattern and create_testcase methods. These are specific to your patterns. Your own patterns will subclass RegexPattern; optionally, test cases can subclass PatternTestCase. See also: XCoord and XTemp implementations.
class MyPattern extends RegexPattern {
    public String attr = null;
}
...
myManager.create_pattern("MYFAM", "09a", "A rule for matching not much");
This call would create a MyPattern instance with the data above.
Starting up your application should look like this:
// A URL needs a scheme; a bare "/path/..." string throws MalformedURLException.
patterns = new MyPatternManager(new URL("file:///path/to/my-patterns.cfg"));
patterns.testing = debug;
patterns.initialize();
Using your patterns manager should look like a loop that evaluates all enabled patterns. That is, at runtime or compile time you can decide in your app how to allow users or integrators to enable or disable rules. FlexPat does not consider how you implement this -- it simply requires that you implement a per-rule toggler, enable_pattern( <rule-id> ).
/**
 * For tracking purposes you should assign each text object to a text ID.
 * TextMatches and results can then be associated with text by this ID.
 */
public MyPatternResult extract_mystuff(String text, String text_id) {

    int bufsize = text.length();

    MyPatternResult results = new MyPatternResult();
    results.result_id = text_id;
    results.matches = new ArrayList<TextMatch>();

    for (RegexPattern repat : patterns.get_patterns()) {

        /* If repat is enabled, evaluate it.
         * Once you know you want to evaluate it you will likely want to cast
         * the generic RegexPattern to your own MyPattern
         * and do more specific stuff with it.
         */
        MyPattern pat = (MyPattern) repat;
        Matcher match = pat.regex.matcher(text);

        // This tracks for this result that at least one rule was evaluated on the data.
        // If no rules were evaluated, you have a bigger issue with logic or your config file.
        //
        results.evaluated = true;

        while (match.find()) {

            MyMatch domainMatch = new MyMatch(); // a TextMatch sub-class

            // Here you parse through the matches.
            // You use the base pattern manager's ability to map the DEFINES to fields by name.
            //
            // Get basic RE metadata and then parse out fields from the RULE as needed.
            //
            domainMatch.pattern_id = pat.id;
            domainMatch.start = match.start();
            domainMatch.end = match.end();
            domainMatch.text = match.group();

            Elements fields = patterns.group_map(pat, match);

            // Your domain logic for normalization...
            //
            Utility.normalizeFields(domainMatch, fields);

            // Filter? Check for false positives and filter out junk.
            if (filter.filterOut(domainMatch)) {
                continue;
            }

            results.matches.add(domainMatch);
        }
    }

    // You've now assessed all RULES on input text.
    // All results are assembled, filtered, normalized, etc.
    return results;
}
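The loop above leaves the "is this rule enabled" check to the implementer. One possible way to back a per-rule toggler that also honors whole families is sketched below. RuleToggles is a hypothetical helper, and the "family-enum" rule ID format is an assumption made for this example, not something FlexPat prescribes.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical helper; FlexPat leaves the toggling policy to the implementer. */
class RuleToggles {
    private final Set<String> disabledFamilies = new HashSet<>();
    private final Set<String> disabledRules = new HashSet<>();

    /** Assumed rule ID format for this sketch: family + "-" + enumeration. */
    static String ruleId(String family, String enumeration) {
        return family + "-" + enumeration;
    }

    void disableFamily(String family) { disabledFamilies.add(family); }

    void enableFamily(String family) { disabledFamilies.remove(family); }

    void disableRule(String family, String enumeration) {
        disabledRules.add(ruleId(family, enumeration));
    }

    void enableRule(String family, String enumeration) {
        disabledRules.remove(ruleId(family, enumeration));
    }

    /** A rule runs only if neither it nor its family has been disabled. */
    boolean isEnabled(String family, String enumeration) {
        return !disabledFamilies.contains(family)
                && !disabledRules.contains(ruleId(family, enumeration));
    }
}
```

Everything is enabled by default in this sketch, so an application only records what a user or integrator has switched off. The extraction loop would consult isEnabled before building a Matcher for a rule.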
Class Description
- FlexPat Extractor -- given a set of pattern families, extract, filter and normalize matches.
- This is the culmination of various date/time extraction efforts in Python and Java.
- This result class holds all the results for a given text block.