Xponents

Geographic Place, Date/time, and Pattern entity extraction toolkit along with text extraction from unstructured data and GIS outputters.

Xponents Patterns

Regular expression (REGEX) patterns are a common way to detect relatively concrete sequences of characters and symbols. REGEX alone, however, does not validate what it finds: a 10-digit sequence of numbers is not necessarily a phone number, and a tuple of digits separated by slashes is not necessarily a date. Additional validation is important … usually.
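As a minimal sketch of that distinction, the Python below first detects slash-separated digit groups with a REGEX and then validates each candidate as a calendar date. The function name `find_dates` and the month/day/year ordering are assumptions made for illustration; this is not FlexPat or XTemporal code.

```python
import re
from datetime import datetime

# Detection only: three digit groups separated by slashes.
DATE_LIKE = re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b")


def find_dates(text):
    """Return (matched_text, date_or_None) for every slash-separated candidate."""
    results = []
    for m in DATE_LIKE.finditer(text):
        try:
            # Validation: assumes month/day/year ordering (an illustrative choice).
            value = datetime.strptime(m.group(0), "%m/%d/%Y").date()
        except ValueError:
            value = None  # Looked like a date, but is not a real calendar date.
        results.append((m.group(0), value))
    return results


for candidate, value in find_dates("Filed 02/30/2021, revised 3/4/2021"):
    print(candidate, "->", value)
```

Here the REGEX accepts both candidates, but only `3/4/2021` survives validation; `02/30/2021` is detected and then rejected because February has no 30th day.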

Xponents FlexPat is a general methodology for developing REGEX-based extractors. Alternative solutions exist, such as YARA, which is very powerful but much more intricate and tailored specifically toward malware and cyberforensics. FlexPat provides a more general abstraction around REGEX, specifically:

FlexPat currently operates in Java or Python with the same patterns file syntax. On the last point above, the intent of FlexPat was to get the patterns out of language-specific source code and into a readable form that all team members could comprehend, and whose test cases they could weigh in on.

To date Xponents FlexPat extractors include XCoord for geo-coordinate patterns, XTemporal for date/time patterns, and PoLi for general tutorial/demonstration purposes using simple patterns like email and telephone numbers.
They are described in more detail with the actual files and supporting material below.

XCoord

XCoord is a geographic coordinate extractor and normalizer that finds latitude/longitude pairs or grids such as MGRS or UTM. The patterns are expressed in decimal degrees, in degrees with minutes and seconds, and/or with fractional parts, along with hemisphere symbology. Patterns include these: Coord Patterns, 2013-2017, as implemented in this patterns definition geocoord_patterns.cfg. This was drafted and operationalized in Java here and has not yet been ported to Python (in actuality, the first extractor implementation, before OpenSextant, was in Python and did not use the FlexPat approach). Coordinate patterns include:

FILES: Coord Patterns, Patterns Config: **geocoord_patterns.cfg**.
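To make "extractor and normalizer" concrete, here is a small Python sketch that detects one degrees/minutes/seconds notation and normalizes it to signed decimal degrees. It is not XCoord (which is implemented in Java) and does not use the patterns in geocoord_patterns.cfg; the single pattern `PAIR` and the helper names are hypothetical and chosen only for illustration.

```python
import re

# Hypothetical pattern: one latitude/longitude pair such as 42°18'30"N 102°24'00"W.
PAIR = re.compile(
    r"(\d{1,2})°(\d{1,2})'(\d{1,2})\"([NS])\s+"
    r"(\d{1,3})°(\d{1,2})'(\d{1,2})\"([EW])"
)


def dms_to_decimal(deg, minutes, seconds, hemisphere):
    """Convert degrees/minutes/seconds to signed decimal degrees, or None if out of range."""
    limit = 90 if hemisphere in "NS" else 180
    if minutes >= 60 or seconds >= 60:
        return None
    value = deg + minutes / 60 + seconds / 3600
    if value > limit:
        return None
    # South and West are negative by convention.
    return -value if hemisphere in "SW" else value


def find_coordinates(text):
    """Return (matched_text, latitude, longitude) for each valid pair found."""
    found = []
    for m in PAIR.finditer(text):
        d1, m1, s1, h1, d2, m2, s2, h2 = m.groups()
        lat = dms_to_decimal(int(d1), int(m1), int(s1), h1)
        lon = dms_to_decimal(int(d2), int(m2), int(s2), h2)
        if lat is not None and lon is not None:
            found.append((m.group(0), lat, lon))
    return found


print(find_coordinates('Site at 42°18\'30"N 102°24\'00"W'))
```

A candidate such as `95°00'00"N` matches the REGEX but is dropped during normalization because latitude cannot exceed 90 degrees.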

XTemporal

XTemporal is a date/time extractor and normalizer that finds dates and date/time patterns, implemented with this patterns definition datetime_patterns.cfg. Patterns include:

FILES: Patterns Config: **datetime_patterns.cfg**

PoLi

PoLi, or patterns-of-life, is a demonstration that includes well-known patterns like telephone numbers, email addresses, and money. Those patterns are contained in poli_patterns.cfg. This set of patterns is provided only to illustrate the process of developing additional patterns with FlexPat. On other projects we have implemented such patterns in much more depth, although that work is not always open sourced. These patterns include:

FILES: Patterns Config: **poli_patterns.cfg**

Developing with FlexPat

While XCoord and XTemporal above are complex in their parsing, they are relatively well-contained and easy. The more general patterns demonstrated in PoLi show that REGEX detection is just the first part of the problem; the user has to supply the validation.

That validation is implemented (in Python) by subclassing opensextant.FlexPat.PatternMatch and implementing a normalize() function to validate the detected pattern and groups. We’ll get into that more below with the optional CLASS directive.
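The shape of that idea can be sketched without the library. In the Python below, `PatternMatchSketch` is a hypothetical stand-in written for this example; it is not the `opensextant.FlexPat.PatternMatch` API, whose constructor and attributes may differ. The subclass overrides `normalize()` to validate a detected phone-like match using the North American rule that area codes and exchanges begin with a digit from 2 to 9.

```python
import re


class PatternMatchSketch:
    """Hypothetical stand-in for a match object; NOT the real PatternMatch class."""

    def __init__(self, text, groups):
        self.text = text          # raw matched text
        self.groups = groups      # named REGEX groups for the match
        self.is_valid = False
        self.normalized = None

    def normalize(self):
        """Subclasses validate and normalize the raw match here."""
        raise NotImplementedError


class PhoneMatch(PatternMatchSketch):
    def normalize(self):
        area, exchange = self.groups["area"], self.groups["exchange"]
        # North American numbering: area code and exchange start with 2-9.
        self.is_valid = area[0] not in "01" and exchange[0] not in "01"
        if self.is_valid:
            self.normalized = f"{area}-{exchange}-{self.groups['line']}"


# Detection: any 10 digits in 3-3-4 groups with optional delimiters.
PHONE = re.compile(
    r"\b(?P<area>\d{3})[-. ]?(?P<exchange>\d{3})[-. ]?(?P<line>\d{4})\b"
)


def extract_phones(text):
    """Detect candidates with the REGEX, then let normalize() validate each one."""
    matches = []
    for m in PHONE.finditer(text):
        pm = PhoneMatch(m.group(0), m.groupdict())
        pm.normalize()
        matches.append(pm)
    return matches


for pm in extract_phones("Call 617-555-0123 or quote ref 0123456789"):
    print(pm.text, pm.is_valid, pm.normalized)
```

Both strings are detected as 10-digit candidates, but only the first passes validation. Keeping detection in the pattern and validation in `normalize()` mirrors the division of labor described above.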

First, here is the outline of the standard FlexPat patterns configuration file, which should be language independent (that is, until you specify the optional CLASS, the name of your custom class, which may vary depending on your programming language).

FlexPat uses a “patterns configuration” file, which contains the clauses for DEFINE, RULE, TEST, and CLASS, the essential ingredients for a pattern extractor pipeline. Outlining these more:

The PoLi example is provided as a template for starting your own set of patterns: [poli_patterns.cfg](https://github.com/OpenSextant/Xponents/blob/master/Core/src/main/resources/poli_patterns.cfg)
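As a rough illustration of how DEFINE, RULE, and TEST clauses could fit together, the Python sketch below parses a small configuration string, expands named fields into a REGEX, and runs the test lines against it. The line syntax shown (`#DEFINE name regex`, `#RULE family id pattern`, `#TEST family id text`, with `<name>` references) is an assumption made for this example and may not match the real file format; consult poli_patterns.cfg for the actual syntax. The functions `load_patterns` and `run_tests` are hypothetical and are not part of the FlexPat API.

```python
import re

# Assumed, simplified syntax for illustration only; not copied from poli_patterns.cfg.
CONFIG = r"""
#DEFINE area      \d{3}
#DEFINE exchange  \d{3}
#DEFINE line      \d{4}
#RULE   PHONE 01  \b<area>[-. ]<exchange>[-. ]<line>\b
#TEST   PHONE 01  Call 617-555-0123 today
#TEST   PHONE 01  Order number 61755501234
"""


def load_patterns(config_text):
    """Parse DEFINE/RULE/TEST lines into compiled rules and a list of test cases."""
    defines, rules, tests = {}, {}, []
    for raw in config_text.splitlines():
        line = raw.strip()
        if line.startswith("#DEFINE"):
            _, name, regex = line.split(None, 2)
            defines[name] = regex
        elif line.startswith("#RULE"):
            _, family, rule_id, pattern = line.split(None, 3)

            def expand(m):
                # Replace <name> with a named group built from its DEFINE.
                return f"(?P<{m.group(1)}>{defines[m.group(1)]})"

            rules[(family, rule_id)] = re.compile(re.sub(r"<(\w+)>", expand, pattern))
        elif line.startswith("#TEST"):
            _, family, rule_id, text = line.split(None, 3)
            tests.append(((family, rule_id), text))
    return rules, tests


def run_tests(rules, tests):
    """Report, for each test line, whether its rule found a match."""
    return [(text, bool(rules[key].search(text))) for key, text in tests]


rules, tests = load_patterns(CONFIG)
for text, matched in run_tests(rules, tests):
    print(matched, text)
```

Keeping test text next to the rule it exercises means team members can review and extend the cases without reading Java or Python source.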

Code References and Examples