Class AnnotationHelper


  • public class AnnotationHelper
    extends java.lang.Object
     The basis for this optional helper class was three or four different projects that used DeepEye as
     a model for persisting annotations from typical named entity and geo/time extraction work.
    
     Common annotation practices include:
     - All annotations should have both their own ID as well as a Record ID;  this involves tracking unique annots for a given doc.
       The intent of DeepEye identifiers is that they are unique within a system or a data set, not necessarily UUID:
         - Record IDs should be related to their source identifier
         - Annotation IDs should be deterministic based on metadata tuple: MD5( name + value + contrib + rec_id ) or a similar predictable, reproducible hash
     - Caching and consolidating repetitive annotations. Example:  seeing 'U.S.' 100 times in a
       single document as a "GPE" or as a "geo" can be overwhelming.  Is it 100 individual annots,
       or 1 annotation for GPE and 1 annotation for the geo?  Each annotation tracks as many offsets as it needs.
       If individual instances in the document vary in attributes or other metadata, then those
       should be considered unique annotations.
     - Allow common doc-level annotations, such as a topic tag, even when there is no offset or text span in question. "Span-less" annotations.
    
     Advice for what to persist to a DeepEye database:
     - it is costly or complex to compute
     - it is helpful to aggregate many raw extracted values for review over a large data set
    
     Consider not storing:
     - annotations that are trivial to compute at runtime, e.g., indexing a derived metadata tag,
       such as converting 'US' to 'United States'
     - NLP artifacts such as tokens like pronouns or other parts of speech.  These might be
     - Filtered values  -- if you are filtering out certain data in all your analyses or downstream
       operations, consider filtering out such things before you store them blindly.
     
    SEE ALSO: Xponents SDK class org.opensextant.output.Transform. That utility class offers more ideas on standard JSON representations for REST, whereas this utility is aimed at a more reliable, pure representation of the match data for storing in and retrieving from a data store.
    Author:
    ubaldino
    • Field Summary

      Fields 
      Modifier and Type Field Description
      static java.lang.String NUM_SEP
      The Constant NUM_SEP.
    • Method Summary

      Modifier and Type Method Description
      Annotation cacheAnnotation​(java.lang.String contrib, java.lang.String etype, java.lang.String value, int start, java.lang.String docid)
      Cache entity annotations, accumulating unique offsets for a name/value pair.
      void cacheAnnotation​(Annotation ea)
      Cache annotation.
      void cacheAnnotation​(Annotation ea, int start)
      Cache an entity annotation in memory. Note: the actual ID or key in the database is usually composed of name+value+contrib.
      void cacheAnnotation​(Annotation ea, java.lang.String key)
      Cache an annotation.
      protected Annotation cacheTaxonAnnotation​(java.lang.String contrib, Taxon taxon, java.lang.String value, int offset, java.lang.String docid)
      Cache taxon entity annotation.
      static Annotation createAnnotation​(java.lang.String contrib, java.lang.String type, java.lang.String val, int offset, int len, java.lang.String docid)  
      static Annotation createAnnotation​(java.lang.String contrib, java.lang.String type, java.lang.String val, int offset, java.lang.String docid)
      Creates a standard named entity annotation.
      static Country createCountry​(Annotation a)
      Returns an instance of a Country object using annotation value as country name, and attr[cc] optionally as code.
      static Annotation createCountryAnnotation​(java.lang.String contrib, java.lang.String type, java.lang.String val, int offset, java.lang.String docid, java.lang.String country_code)
      Tracking a country name match of some sort.
      static Place createGeocoding​(Annotation a)
      Decode: Geocoding See OpenSextant Geocoding interface.
      static Annotation createGeocodingAnnotation​(java.lang.String contrib, java.lang.String type, java.lang.String val, int offset, java.lang.String docid, Geocoding g)
      Encode geocoding annotations to be saved.
      static Taxon createTaxon​(Annotation a)
      Recreates a Taxon from a stored annotation.
      static Annotation createTaxonAnnotation​(java.lang.String contrib, java.lang.String type, java.lang.String val, int offset, java.lang.String docid, Taxon n)
      Create an annotation for a Taxon node that has a found value, val, in document, docid at offset.
      static Annotation createTemporalAnnotation​(java.lang.String contrib, java.lang.String type, java.lang.String val, int offset, int len, java.lang.String docid, java.util.Date d, java.lang.String resolution)
      Same as createTemporalEntityAnnotation, but with a len parameter.
      static Annotation createTemporalEntityAnnotation​(java.lang.String contrib, java.lang.String type, java.lang.String val, int offset, java.lang.String docid, java.util.Date d, java.lang.String resolution)
      Creates the temporal entity annotation.
      static java.util.List<Annotation> decodeAnnotations​(java.util.List<Annotation> codedAnnots)
      Given encoded annotations from db, decode them and yield a flattened set of annotations, e.g., for use with MAT
      static java.util.List<java.lang.Integer> decodeOffsets​(java.lang.String list)
      Takes a delimited string of numbers and converts it to a list of Integers, e.g., "1;5;89;777" => [ 1, 5, 89, 777 ].
      static java.util.List<Annotation> decodeOffsets​(Annotation meta, java.lang.String offsetList)
      Generate annotations in a linear fashion.
      static java.lang.String encodeOffsets​(java.util.Collection<java.lang.Integer> offsets)
      Encode offsets.
      static java.lang.String getAnnotationId​(java.lang.String rec_id, java.lang.String contrib, java.lang.String atype, java.lang.String val)
      New, required format for an annotation ID: MD5 hash made up of rec_id + contributor + type + value.
      Annotation getCachedAnnotation​(java.lang.String etype, java.lang.String value)
      Careful -- there is no guarantee that two entity annotations will not share the same type/value unintentionally.
      java.util.Collection<Annotation> getCachedAnnotations()
      Gets the cached annotations, unordered.
      static long getFirstOffset​(Annotation a)
      Gets the first offset.
      boolean hasCachedAnnotation​(java.lang.String etype, java.lang.String value)
      Checks for cached annotation.
      void reset()
      Reset() clears the internal cache.
      • Methods inherited from class java.lang.Object

        clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
    • Field Detail

      • NUM_SEP

        public static final java.lang.String NUM_SEP
        The Constant NUM_SEP.
        See Also:
        Constant Field Values
    • Constructor Detail

      • AnnotationHelper

        public AnnotationHelper()
    • Method Detail

      • decodeAnnotations

        public static java.util.List<Annotation> decodeAnnotations​(java.util.List<Annotation> codedAnnots)
        Given encoded annotations from db, decode them and yield a flattened set of annotations, e.g., for use with MAT
        Parameters:
        codedAnnots - the coded annots
        Returns:
        the list
      • getFirstOffset

        public static long getFirstOffset​(Annotation a)
        Gets the first offset.
        Parameters:
        a - the annotation
        Returns:
        the first offset
      • encodeOffsets

        public static java.lang.String encodeOffsets​(java.util.Collection<java.lang.Integer> offsets)
        Encode offsets.
        Parameters:
        offsets - the offsets
        Returns:
        the string
      • decodeOffsets

        public static java.util.List<java.lang.Integer> decodeOffsets​(java.lang.String list)
        Takes a delimited string of numbers and converts it to a list of Integers, e.g., "1;5;89;777" => [ 1, 5, 89, 777 ].
        Parameters:
        list - the list
        Returns:
        the list
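        The encode/decode round trip can be sketched as below. This is a standalone illustration, not the library source: the class name OffsetCodecSketch and its method names are hypothetical, and the ";" separator is an assumption taken from the "1;5;89;777" example (the actual value of NUM_SEP is not shown on this page).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.List;

/** Illustrative stand-in for encodeOffsets/decodeOffsets; hypothetical names. */
final class OffsetCodecSketch {
    // Assumption: the separator is ";", as in the "1;5;89;777" example.
    static final String SEP = ";";

    /** Join offsets into one delimited string, e.g. [1, 5] becomes "1;5". */
    static String encode(Collection<Integer> offsets) {
        StringBuilder sb = new StringBuilder();
        for (Integer o : offsets) {
            if (sb.length() > 0) {
                sb.append(SEP);
            }
            sb.append(o);
        }
        return sb.toString();
    }

    /** Split a delimited string back into integers; blank input yields an empty list. */
    static List<Integer> decode(String list) {
        List<Integer> out = new ArrayList<>();
        if (list == null || list.trim().isEmpty()) {
            return out;
        }
        for (String tok : list.split(SEP)) {
            out.add(Integer.parseInt(tok.trim()));
        }
        return out;
    }

    public static void main(String[] args) {
        String encoded = encode(Arrays.asList(1, 5, 89, 777));
        System.out.println(encoded);          // prints 1;5;89;777
        System.out.println(decode(encoded));  // prints [1, 5, 89, 777]
    }
}
```

        Storing offsets as one delimited string keeps a consolidated annotation to a single row or field while still letting a reader recover every position.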
      • decodeOffsets

        public static java.util.List<Annotation> decodeOffsets​(Annotation meta,
                                                               java.lang.String offsetList)
        Generate annotations in a linear fashion. Given the optimized Annotation, A, create duplicate annotations, each with an offset from the list of offsets.
        Parameters:
        meta - the meta
        offsetList - the offset list
        Returns:
        the list
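        Expanding one consolidated annotation into one copy per offset might look like the following sketch. The Annot class is a hypothetical, minimal stand-in for the real Annotation type (which has more fields), and the ";" separator is again an assumption based on the offset-list example.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: flattens a consolidated annotation into per-offset copies. */
final class ExpandOffsetsSketch {
    /** Hypothetical, minimal annotation; not the DeepEye Annotation class. */
    static final class Annot {
        final String type;
        final String value;
        final int offset;

        Annot(String type, String value, int offset) {
            this.type = type;
            this.value = value;
            this.offset = offset;
        }
    }

    /** Create one copy of meta per offset in a ";"-delimited list (separator assumed). */
    static List<Annot> expand(Annot meta, String offsetList) {
        List<Annot> out = new ArrayList<>();
        if (offsetList == null || offsetList.trim().isEmpty()) {
            return out;
        }
        for (String tok : offsetList.split(";")) {
            out.add(new Annot(meta.type, meta.value, Integer.parseInt(tok.trim())));
        }
        return out;
    }

    public static void main(String[] args) {
        // The consolidated annotation carries no single offset of its own here.
        Annot consolidated = new Annot("GPE", "U.S.", -1);
        for (Annot a : expand(consolidated, "12;340;2051")) {
            System.out.println(a.type + " '" + a.value + "' @ " + a.offset);
        }
    }
}
```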
      • getAnnotationId

        public static java.lang.String getAnnotationId​(java.lang.String rec_id,
                                                       java.lang.String contrib,
                                                       java.lang.String atype,
                                                       java.lang.String val)
        New, required format for an annotation ID: MD5 hash made up of:
            rec_id + contributor + type + value
        
            Distinct entities from a single contributor for a single document will be recorded (and overwritten over time).
            Reprocessing data will overwrite the previously recorded value with the new one.
        
             Doc abc has a NAMEX = 'the diplomat', provided by xyz extractor.
        
             key is MD5( 'abc;xyz;NAMEX;the diplomat' )
             Multiple occurrences of the same value in the same document must be recorded as "atts.offsets" = [n1,n2,n3...] offsets
         
        Parameters:
        rec_id - the rec_id
        contrib - the contrib
        atype - the atype
        val - the val
        Returns:
        the annotation id
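        A deterministic ID of this kind can be sketched with the JDK's MessageDigest. This is an illustration under stated assumptions, not the library's implementation: the class name AnnotationIdSketch is hypothetical, and the ";" join, UTF-8 encoding, and lowercase hex output are assumptions based on the MD5( 'abc;xyz;NAMEX;the diplomat' ) example above.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Illustrative only: reproducible annotation ID from a metadata tuple. */
final class AnnotationIdSketch {
    /** Hash rec_id;contrib;type;value with MD5 and return 32 lowercase hex characters. */
    static String annotationId(String recId, String contrib, String type, String value) {
        // Assumption: fields are joined with ";" as in the documented example key.
        String key = String.join(";", recId, contrib, type, value);
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] digest = md5.digest(key.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is a required algorithm on every Java platform, so this is unexpected.
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    public static void main(String[] args) {
        String first = annotationId("abc", "xyz", "NAMEX", "the diplomat");
        String again = annotationId("abc", "xyz", "NAMEX", "the diplomat");
        System.out.println(first);
        System.out.println("reproducible: " + first.equals(again));
    }
}
```

        Because the ID depends only on the tuple, reprocessing the same document with the same extractor yields the same key, which is what lets a later run overwrite the earlier record instead of duplicating it.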
      • reset

        public void reset()
        Clears the internal cache. Ideally, call reset() for each new record when processing records in a loop.
      • getCachedAnnotations

        public java.util.Collection<Annotation> getCachedAnnotations()
        Gets the cached annotations, unordered.
        Returns:
        the cached annotations
      • cacheAnnotation

        public Annotation cacheAnnotation​(java.lang.String contrib,
                                          java.lang.String etype,
                                          java.lang.String value,
                                          int start,
                                          java.lang.String docid)
        Cache entity annotations, accumulating unique offsets for a name/value pair.
        Parameters:
        contrib - the contrib
        etype - the etype
        value - the value
        start - the start
        docid - the docid
        Returns:
        the entity annotation
      • cacheAnnotation

        public void cacheAnnotation​(Annotation ea,
                                    java.lang.String key)
        Cache an annotation.
        Parameters:
        ea - the ea
        key - the key
      • cacheAnnotation

        public void cacheAnnotation​(Annotation ea)
        Cache annotation.
        Parameters:
        ea - your annotation
      • cacheAnnotation

        public void cacheAnnotation​(Annotation ea,
                                    int start)
        Cache an entity annotation in memory. Note: the actual ID or key in the database is usually composed of name+value+contrib. NOTE: these are normalized, case-insensitive values. NOTE: if the annotation already exists, only the start offset is added to the existing entry. Name and value must not be null.
        Parameters:
        ea - your annotation
        start - offset into your doc.
        Throws:
        java.lang.NullPointerException - if name and value are not set on Annotation.
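        The consolidation behavior described here (one cached entry per name/value pair, accumulating unique offsets, cleared per record) can be sketched as follows. All names in this block are hypothetical, and keying on lowercased type plus value is an assumption drawn from the notes above rather than the library's actual key format.

```java
import java.util.Collection;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Illustrative only: an in-memory cache that merges repeated matches. */
final class AnnotationCacheSketch {
    /** Hypothetical cached entry: one type/value pair plus every offset seen. */
    static final class Cached {
        final String type;
        final String value;
        final Set<Integer> offsets = new TreeSet<>();

        Cached(String type, String value) {
            this.type = type;
            this.value = value;
        }
    }

    private final Map<String, Cached> cache = new LinkedHashMap<>();

    /** Assumption: the key is the case-insensitive type/value pair. */
    static String key(String type, String value) {
        return (type + "/" + value).toLowerCase(Locale.ROOT);
    }

    /** Add a match; an existing entry only gains the new start offset. */
    Cached cache(String type, String value, int start) {
        if (type == null || value == null) {
            throw new NullPointerException("type and value are required");
        }
        String k = key(type, value);
        Cached entry = cache.get(k);
        if (entry == null) {
            entry = new Cached(type, value);
            cache.put(k, entry);
        }
        entry.offsets.add(start);
        return entry;
    }

    Collection<Cached> entries() {
        return cache.values();
    }

    /** Clear the cache, e.g. before processing the next record. */
    void reset() {
        cache.clear();
    }

    public static void main(String[] args) {
        AnnotationCacheSketch helper = new AnnotationCacheSketch();
        helper.cache("GPE", "U.S.", 10);
        helper.cache("GPE", "U.S.", 250);
        helper.cache("geo", "U.S.", 10);
        for (Cached c : helper.entries()) {
            System.out.println(c.type + " '" + c.value + "' offsets=" + c.offsets);
        }
        helper.reset();
    }
}
```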
      • hasCachedAnnotation

        public boolean hasCachedAnnotation​(java.lang.String etype,
                                           java.lang.String value)
        Checks for cached annotation.
        Parameters:
        etype - the etype
        value - the value
        Returns:
        true, if successful
      • getCachedAnnotation

        public Annotation getCachedAnnotation​(java.lang.String etype,
                                              java.lang.String value)
        Careful -- there is no guarantee that two entity annotations will not share the same type/value unintentionally. For example, if a "tx" type annot implies a taxon from one contrib and "tx" means a transaction from another, then you, the developer, should choose more distinct entity type codes.
        Parameters:
        etype - entity type
        value - value
        Returns:
        the cached annotation
      • cacheTaxonAnnotation

        protected Annotation cacheTaxonAnnotation​(java.lang.String contrib,
                                                  Taxon taxon,
                                                  java.lang.String value,
                                                  int offset,
                                                  java.lang.String docid)
        Cache taxon entity annotation.
        Parameters:
        contrib - contributor ID
        taxon - taxon obj
        value - string value
        offset - offset
        docid - docid
        Returns:
        the entity annotation
      • createAnnotation

        public static Annotation createAnnotation​(java.lang.String contrib,
                                                  java.lang.String type,
                                                  java.lang.String val,
                                                  int offset,
                                                  java.lang.String docid)
        Creates a standard named entity annotation.
        Parameters:
        contrib - contributor ID
        type - annotation type/ID
        val - string value
        offset - offset
        docid - docid
        Returns:
        the annotation
      • createAnnotation

        public static Annotation createAnnotation​(java.lang.String contrib,
                                                  java.lang.String type,
                                                  java.lang.String val,
                                                  int offset,
                                                  int len,
                                                  java.lang.String docid)
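        Parameters:
        contrib - contributor ID
        type - annotation type/ID
        val - string value
        offset - offset
        len - length of the matched span
        docid - docid
        Returns:
        the annotation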
      • createTaxonAnnotation

        public static Annotation createTaxonAnnotation​(java.lang.String contrib,
                                                       java.lang.String type,
                                                       java.lang.String val,
                                                       int offset,
                                                       java.lang.String docid,
                                                       Taxon n)
        Create an annotation for a Taxon node that has a found value, val, in document, docid at offset. Taxon match has a type and a contributor, usually the tagger or extractor that processed the document.
        Parameters:
        contrib - the contrib
        type - the type
        val - the val
        offset - the offset
        docid - the docid
        n - the Taxon node
        Returns:
        the entity annotation
      • createTaxon

        public static Taxon createTaxon​(Annotation a)
        Recreates a Taxon from a stored annotation. Required fields:
          a.attrs[name] -- taxon node name
          a.attrs[cat] -- catalog
          a.name -- not used here
          a.value -- the value of the matched text
        Parameters:
        a - the stored annotation
        Returns:
        the taxon
      • createCountryAnnotation

        public static Annotation createCountryAnnotation​(java.lang.String contrib,
                                                         java.lang.String type,
                                                         java.lang.String val,
                                                         int offset,
                                                         java.lang.String docid,
                                                         java.lang.String country_code)
        Tracks a country name match of some sort. Since you know this is a country, please enrich the annotation with the country code here. You can always look up the country code later from a given country name or match, but the answer may be context specific. For example, with Georgia / GE, putting the country code here gives more confidence that you found Georgia the country and not the US state. You might also have other means of deriving the country code for a given value: for example, you found "GOI", a geopolitical entity you infer to be the Govt. of India, so you emit "IN" as the country code: create( xxx, 'GPE', 'GOI', ..., 'IN' )
        Parameters:
        contrib - the contrib
        type - the type
        val - the val
        offset - the offset
        docid - the docid
        country_code - the country_code
        Returns:
        the entity annotation
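        The idea of recording the country code at match time, while the context is still known, can be sketched as below. This is not the library's Annotation schema: the class name and the flat map layout are hypothetical, and only the "cc" attribute name comes from the createCountry description (attr[cc]).

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

/** Illustrative only: a country match that carries its own country code. */
final class CountryAnnotationSketch {
    /** Build a flat, hypothetical annotation map; "cc" holds the country code. */
    static Map<String, String> countryAnnotation(String contrib, String type, String val,
                                                 int offset, String docid, String countryCode) {
        Map<String, String> a = new LinkedHashMap<>();
        a.put("contrib", contrib);
        a.put("type", type);
        a.put("value", val);
        a.put("offset", Integer.toString(offset));
        a.put("rec_id", docid);
        if (countryCode != null) {
            // Store the code in upper case so "ge" and "GE" compare equal later.
            a.put("cc", countryCode.toUpperCase(Locale.ROOT));
        }
        return a;
    }

    public static void main(String[] args) {
        // "GOI" alone is ambiguous; the extractor's inference is recorded as cc=IN.
        System.out.println(countryAnnotation("xyz", "GPE", "GOI", 42, "abc", "IN"));
        // Georgia the country, not the US state.
        System.out.println(countryAnnotation("xyz", "GPE", "Georgia", 7, "abc", "ge"));
    }
}
```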
      • createCountry

        public static Country createCountry​(Annotation a)
        Returns an instance of a Country object using annotation value as country name, and attr[cc] optionally as code. This does not reproduce a full Country object as if queried from
        Parameters:
        a - annot
        Returns:
        the country
        See Also:
        GeonamesUtility.getCountry(String)
      • createGeocodingAnnotation

        public static Annotation createGeocodingAnnotation​(java.lang.String contrib,
                                                           java.lang.String type,
                                                           java.lang.String val,
                                                           int offset,
                                                           java.lang.String docid,
                                                           Geocoding g)
        Encode geocoding annotations to be saved. This schema follows from EH/GLINT/Glare, etc.
        Parameters:
        contrib - the contrib
        type - the type
        val - the val
        offset - the offset
        docid - the docid
        g - the geocoding to encode
        Returns:
        the entity annotation
      • createGeocoding

        public static Place createGeocoding​(Annotation a)
        Decodes a geocoding annotation; see the OpenSextant Geocoding interface. The required annotation fields are: lat, lon, prec, cc, adm1, place, feat_class, feat_code, method.
        Parameters:
        a - the stored annotation
        Returns:
        the geocoded data
      • createTemporalEntityAnnotation

        public static Annotation createTemporalEntityAnnotation​(java.lang.String contrib,
                                                                java.lang.String type,
                                                                java.lang.String val,
                                                                int offset,
                                                                java.lang.String docid,
                                                                java.util.Date d,
                                                                java.lang.String resolution)
        Creates the temporal entity annotation.
        Parameters:
        contrib - the contrib
        type - the type
        val - the val
        offset - the offset
        docid - the docid
        d - the date
        resolution - the resolution
        Returns:
        the entity annotation
      • createTemporalAnnotation

        public static Annotation createTemporalAnnotation​(java.lang.String contrib,
                                                          java.lang.String type,
                                                          java.lang.String val,
                                                          int offset,
                                                          int len,
                                                          java.lang.String docid,
                                                          java.util.Date d,
                                                          java.lang.String resolution)
        Same as createTemporalEntityAnnotation, but with a len parameter.
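        Parameters:
        contrib - the contrib
        type - the type
        val - the val
        offset - the offset
        len - length of the matched span
        docid - the docid
        d - the date
        resolution - the resolution
        Returns:
        the entity annotation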