Geographic Place, Date/time, and Pattern entity extraction toolkit along with text extraction from unstructured data and GIS outputters.
Regular Expression (REGEX) patterns is a common solution to detect relatively concrete sequences characters and symbols. REGEX, however, is not a complete solution to help validate what is found: for example, finding a 10-digit sequence of numbers does not alone imply you found a phone number; Finding a tuple of digits separated by slashes is not alone to imply a date format. Additional validation is important … usually.
Xponents FlexPat is a general methodology for developing REGEX-based extractors. Consider alternative solutions, such as YARA – very powerful, but much more intricate and tailored specifically toward malware/cyberforensics. FlexPat provides a more general abstraction around REGEX, specifically:
FlexPat currently operates in Java or Python with the same patterns file syntax. On the last point above, the intent of FlexPat was to get the patterns out of language specific source code and into a readable form that all team members could comprehend and weigh in on test cases.
To date Xponents FlexPat extractors include XCoord for geo-coordinate patterns, XTemporal for date/time patterns,
and PoLi for general tutorial/demonstration purposes using simple patterns like email and telephone numbers.
They are described in more detail with the actual files and supporting material below.
XCoord is a geographic coordinate extractor and normalizer that finds latitude/longitude pairs or grids such
UTM. The patterns are either decimal degrees, minutes seconds, and/or fractional parts along
with hemisphere symbology. Patterns include these: Coord Patterns, 2013-2017, as
implemented in this patterns definition geocoord_patterns.cfg. This was drafted and operationalized
in Java here and has not yet been ported to Python (In actuality the first extractor implementation before OpenSextant was
in Python not using the FlexPat approach). Coordinate patterns include:
FILES: Coord Patterns, Patterns Config: geocoord_patterns.cfg.
XTemporal a date/time extractor and normalizer that finds dates and date/time patterns, implemented
with this patterns definition datetime_patterns.cfg. Patterns include
Sept 22nd, 2017or
22 SEPT 2017 0700Z
FILES: Patterns Config: datetime_patterns.cfg
PoLi or patterns-of-life demonstration, which includes well-known patterns like telephone numbers, email address,
and money. Those patterns are contained in poli_patterns.cfg. As a demonstration of FlexPat,
this set of patterns was provided only to show the development process of additional patterns. It is here for
illustration. On other projects we have implemented such patterns in much more depth, albeit such things are
not always open sourced. These patterns include:
FILES: Patterns Config: poli_patterns.cfg
While XCoord and XTemporal above are complex regarding their parsing, they are relatively well-contained and easy. Patterns tackled using the more general solutions demonstrated in PoLi show that the REGEX detection is just the first part of the problem, and the user has to bring in that sense of validation.
That validation is implemented (in Python) by subclassing
opensextant.FlexPat.PatternMatch and implementing a
normalize() function to validate the detected pattern and groups. We’ll get into that more below with the optional
First here is the outline of the standard FlexPat patterns configuration file – which should be language independent
(yes, until you specify the optional
CLASS, which is the name of your custom class which may vary depending on your
This FlexPat uses a “patterns configuration” file, which contains the clauses for
– the essential ingredients for a pattern extractor pipeline. Outlining these more:
DEFINE: define discrete groups or “sub-patterns” that recur in your patterns
RULE: an actual REGEX, defined with named groups only identified by your
DEFINEsor any other valid REGEX syntax
MDYhas about 6 total month-day-year patterns. But all such variants are easily referenced by naming the pattern family
TEST: is a single example of the pattern+variant to be detected, parsed and/or normalized by the
RULE. By convention a
FAILcomment trailing the test is used to denote a test case that should NOT be detected by the
RULE. Alternatively, the pattern may detect a match, but through the
RULEwill yield a pattern match that is filtered out. In conclusion, there are two chances to succeed here – (a)
RULEdetect or not detect the match, and/or (b)
CLASSnormalization validates or invalidates the match.
TESTtest cases as possible to touch on variants.
CLASS: An optional custom class that subclasses
PatternMatchto carry REGEX subgroups and allow for additional normalization, validation, etc.
The PoLi example is provided as a template for starting your own set of patterns: [poli_patterns.cfg](https://github.com/OpenSextant/Xponents/blob/master/Core/src/main/resources/poli_patterns.cfg