Geographic Place, Date/time, and Pattern entity extraction toolkit along with text extraction from unstructured data and GIS outputters.
Regular Expression (REGEX) patterns is a common solution to detect relatively concrete sequences characters and symbols. REGEX, however, is not a complete solution to help validate what is found: for example, finding a 10-digit sequence of numbers does not alone imply you found a phone number; Finding a tuple of digits separated by slashes is not alone to imply a date format. Additional validation is important … usually.
Xponents FlexPat is a general methodology for developing REGEX-based extractors. Consider alternative solutions, such as YARA – very powerful, but much more intricate and tailored specifically toward malware/cyberforensics. FlexPat provides a more general abstraction around REGEX, specifically:
FlexPat currently operates in Java or Python with the same patterns file syntax. On the last point above, the intent of FlexPat was to get the patterns out of language specific source code and into a readable form that all team members could comprehend and weigh in on test cases.
To date Xponents FlexPat extractors include XCoord for geo-coordinate patterns, XTemporal for date/time
patterns,
and PoLi for general tutorial/demonstration purposes using simple patterns like email and telephone numbers.
They are described in more detail with the actual files and supporting material below.
XCoord
is a geographic coordinate extractor and normalizer that finds latitude/longitude pairs or grids such
as MGRS
or UTM
. The patterns are either decimal degrees, minutes seconds, and/or fractional parts along
with hemisphere symbology. Patterns include these: Coord Patterns, 2013-2017, as
implemented in this patterns
definition geocoord_patterns.cfg.
This was drafted and operationalized
in Java here and has not yet been ported to Python (In actuality the first extractor implementation before OpenSextant
was
in Python not using the FlexPat approach). Coordinate patterns include:
FILES: Coord Patterns, Patterns Config: * *geocoord_patterns.cfg **.
XTemporal
a date/time extractor and normalizer that finds dates and date/time patterns, implemented
with this patterns
definition datetime_patterns.cfg.
Patterns include
Sept 22nd, 2017
or 09/22/2017
22 SEPT 2017 0700Z
2017-09-22
2017-09-22T0700-0500
FILES: Patterns Config: * *datetime_patterns.cfg **
PoLi
or patterns-of-life demonstration, which includes well-known patterns like telephone numbers, email address,
and money. Those patterns are contained
in poli_patterns.cfg.
As a demonstration of FlexPat,
this set of patterns was provided only to show the development process of additional patterns. It is here for
illustration. On other projects we have implemented such patterns in much more depth, albeit such things are
not always open sourced. These patterns include:
"user@domain"
FILES: Patterns Config: * *poli_patterns.cfg**
While XCoord and XTemporal above are complex regarding their parsing, they are relatively well-contained and easy. Patterns tackled using the more general solutions demonstrated in PoLi show that the REGEX detection is just the first part of the problem, and the user has to bring in that sense of validation.
That validation is implemented (in Python) by subclassing opensextant.FlexPat.PatternMatch
and implementing a
normalize()
function to validate the detected pattern and groups. We’ll get into that more below with the
optional CLASS
directive.
First here is the outline of the standard FlexPat patterns configuration file – which should be language independent
(yes, until you specify the optional CLASS
, which is the name of your custom class which may vary depending on your
programming language).
This FlexPat uses a “patterns configuration” file, which contains the clauses for DEFINE
, RULE
, TEST
, and CLASS
– the essential ingredients for a pattern extractor pipeline. Outlining these more:
DEFINE
: define discrete groups or “sub-patterns” that recur in your patternsRULE
: an actual REGEX, defined with named groups only identified by your DEFINEs
or any other valid REGEX syntaxMDY
has about 6 total month-day-year patterns. But all such variants are easily referenced by naming the pattern
family MDY
.TEST
: is a single example of the pattern+variant to be detected, parsed and/or normalized by the RULE
.
By convention a FAIL
comment trailing the test is used to denote a test case that should NOT be detected by
the RULE
. Alternatively, the pattern may detect a match, but through the CLASS
.normalize()
implementation the
RULE
will yield a pattern match that is filtered out. In conclusion, there are two chances to succeed here
– (a) RULE
detect or not detect the match, and/or (b) CLASS
normalization validates or invalidates the match.TEST
test cases as possible to touch on variants.CLASS
: An optional custom class that subclasses PatternMatch
to carry REGEX subgroups and allow for additional
normalization, validation, etc.The PoLi example is provided as a template for starting your own set of patterns: [poli_patterns.cfg](https://github.com/OpenSextant/Xponents/blob/master/Core/src/main/resources/poli_patterns.cfg
./script/xponents-demo.sh
, TestXCoord.java