Class AnnotationHelper

java.lang.Object
org.opensextant.annotations.AnnotationHelper

public class AnnotationHelper extends Object
 This optional helper class grew out of three or four different projects that used DeepEye as
 a model for persisting annotations from typical named entity and geo/time extraction work.

 Common annotation practices include:
 - All annotations should have both their own ID and a Record ID;  this involves tracking unique annots for a given doc.
   The intent of DeepEye identifiers is that they are unique within a system or a data set, not necessarily UUIDs:
     - Record IDs should be related to their source identifier
     - Annotation IDs should be deterministic based on metadata tuple: MD5( name + value + contrib + rec_id ) or a similar predictable, reproducible hash
 - Caching and consolidating repetitive annotations. Example:  seeing 'U.S.' 100 times in a
   single document as a "GPE" or as a "geo" can be overwhelming.  Is it 100 individual annots,
   or 1 annotation for the GPE and 1 annotation for the geo?  Each annotation tracks as many offsets as it needs.
   Where individual instances in the document vary in attributes or other metadata,
   they should be considered unique annotations.
 - Allow common doc-level annotations, such as a topic tag, even when there is no offset or text span in question. "Span-less" annotations.

 Advice for what to persist to a DeepEye database:
 - it is costly or complex to compute
 - it is helpful to aggregate many raw extracted values to review over a large data set

 Consider not storing:
 - annotations that are trivial to compute at runtime, e.g., indexing a derived metadata tag,
   such as converting 'US' to 'United States'
 - NLP artifacts such as tokens like pronouns or other parts of speech.  These might be
 - Filtered values  -- if you are filtering out certain data in all your analyses or downstream
   operations, consider filtering out such things before you store them blindly.
 
SEE ALSO: Xponents SDK class org.opensextant.output.Transform. That utility class offers more ideas on standard JSON representations for REST, whereas this utility is aimed at a more reliable, pure representation of the match data for storing to and retrieving from a data store.
Author:
ubaldino
  • Field Details

  • Constructor Details

    • AnnotationHelper

      public AnnotationHelper()
  • Method Details

    • decodeAnnotations

      public static List<Annotation> decodeAnnotations(List<Annotation> codedAnnots)
      Given encoded annotations from db, decode them and yield a flattened set of annotations, e.g., for use with MAT
      Parameters:
      codedAnnots - the coded annots
      Returns:
      the list
    • getFirstOffset

      public static long getFirstOffset(Annotation a)
      Gets the first offset.
      Parameters:
      a - the annotation
      Returns:
      the first offset
    • encodeOffsets

      public static String encodeOffsets(Collection<Integer> offsets)
      Encode offsets.
      Parameters:
      offsets - the offsets
      Returns:
      the string
    • decodeOffsets

      public static List<Integer> decodeOffsets(String list)
      Take a delimited string of numbers and convert it to an Integer list, e.g., "1;5;89;777" => List<> [ 1, 5, 89, 777 ].
      Parameters:
      list - the list
      Returns:
      the list
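      The offset string format can be illustrated with a minimal, self-contained sketch. It assumes the semicolon delimiter shown in the "1;5;89;777" example and assumes that encodeOffsets produces the inverse of what decodeOffsets reads. OffsetCodecSketch is a hypothetical stand-in, not the library's implementation.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.StringJoiner;

/** Hypothetical stand-in for the offset codec; not the library's code. */
final class OffsetCodecSketch {
    // Assumption: the delimiter is ';', as in the "1;5;89;777" example.
    private static final String DELIM = ";";

    /** Joins offsets into one delimited string. */
    static String encode(Collection<Integer> offsets) {
        StringJoiner joined = new StringJoiner(DELIM);
        for (Integer offset : offsets) {
            joined.add(Integer.toString(offset));
        }
        return joined.toString();
    }

    /** Splits a delimited string back into a list of integers. */
    static List<Integer> decode(String list) {
        List<Integer> offsets = new ArrayList<>();
        if (list == null || list.trim().isEmpty()) {
            return offsets; // nothing to decode
        }
        for (String token : list.split(DELIM)) {
            offsets.add(Integer.parseInt(token.trim()));
        }
        return offsets;
    }
}
```

      Storing offsets as a single string keeps one database row per distinct annotation, however many times the value occurs in the document.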
    • decodeOffsets

      public static List<Annotation> decodeOffsets(Annotation meta, String offsetList)
      Generate annotations in a linear fashion. Given the optimized Annotation, A, create duplicate annotations, each with an offset from the list of offsets.
      Parameters:
      meta - the meta
      offsetList - the offset list
      Returns:
      the list
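      The "linear" expansion described here can be sketched as follows. The Ann class is a hypothetical, stripped-down annotation (type, value, one offset) used only for illustration, and the semicolon delimiter is assumed from the decodeOffsets(String) example; the real Annotation class carries more metadata.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: expand one consolidated annotation into one copy per offset. */
final class ExpandOffsetsSketch {

    /** Minimal stand-in for an annotation; not the library's Annotation class. */
    static final class Ann {
        final String type;
        final String value;
        final int offset;

        Ann(String type, String value, int offset) {
            this.type = type;
            this.value = value;
            this.offset = offset;
        }
    }

    /** Creates one annotation per offset, each sharing the same type and value. */
    static List<Ann> expand(String type, String value, String offsetList) {
        List<Ann> expanded = new ArrayList<>();
        for (String token : offsetList.split(";")) { // assumed delimiter
            expanded.add(new Ann(type, value, Integer.parseInt(token.trim())));
        }
        return expanded;
    }
}
```

      For example, expand("GPE", "U.S.", "10;250;900") yields three annotations that differ only in their offset.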
    • getAnnotationId

      public static String getAnnotationId(String rec_id, String contrib, String atype, String val)
      New, required format for an annotation ID: MD5 hash made up of:
          rec_id + contributor + type + value
      
          Distinct entities from a single contributor for a single document will be recorded (and overwritten over time).
          Reprocessing data will overwrite the stored value with the new one.
      
           Doc abc has a NAMEX = 'the diplomat', provided by xyz extractor.
      
           key is MD5( 'abc;xyz;NAMEX;the diplomat' )
           Multiple occurrences of the same value in the same document must be recorded as "atts.offsets" = [n1,n2,n3...] offsets
       
      Parameters:
      rec_id - the rec_id
      contrib - the contrib
      atype - the atype
      val - the val
      Returns:
      the annotation id
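      A deterministic ID of this kind could be computed roughly as below. This is a sketch under stated assumptions: the fields are joined with ';' in the order shown in the 'abc;xyz;NAMEX;the diplomat' example, the key is encoded as UTF-8, and the digest is rendered as lowercase hex. The actual normalization and output encoding used by getAnnotationId may differ.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Hypothetical sketch of a reproducible annotation ID; not the library's code. */
final class AnnotationIdSketch {

    static String annotationId(String recId, String contrib, String type, String value) {
        // Assumption: ';'-joined key, as in MD5( 'abc;xyz;NAMEX;the diplomat' ).
        String key = String.join(";", recId, contrib, type, value);
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] digest = md5.digest(key.getBytes(StandardCharsets.UTF_8));
            // Assumption: lowercase hex output, 32 characters for a 16-byte digest.
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 digest not available", e);
        }
    }
}
```

      Because the ID depends only on the metadata tuple, reprocessing the same document with the same extractor produces the same key, which is what allows a stored annotation to be overwritten rather than duplicated.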
    • reset

      public void reset()
      Reset() clears the internal cache. Ideally, you hit reset on each new Record in a loop.
    • getCachedAnnotations

      public Collection<Annotation> getCachedAnnotations()
      Gets the cached annotations, unordered.
      Returns:
      the cached annotations
    • cacheAnnotation

      public Annotation cacheAnnotation(String contrib, String etype, String value, int start, String docid)
      Cache entity annotations, accumulating unique offsets for a name/value pair.
      Parameters:
      contrib - the contrib
      etype - the etype
      value - the value
      start - the start
      docid - the docid
      Returns:
      the entity annotation
    • cacheAnnotation

      public void cacheAnnotation(Annotation ea, String key)
      Cache an annotation.
      Parameters:
      ea - the ea
      key - the key
    • cacheAnnotation

      public void cacheAnnotation(Annotation ea)
      Cache annotation.
      Parameters:
      ea - your annotation
    • cacheAnnotation

      public void cacheAnnotation(Annotation ea, int start)
      Cache an entity annotation in memory. Note that the actual ID or key in the database is usually composed of name+value+contrib; these are normalized, case-insensitive values. If the annotation already exists, the start offset is simply added to the existing entry. Name and value must not be null.
      Parameters:
      ea - your annotation
      start - offset into your doc.
      Throws:
      NullPointerException - if name and value are not set on Annotation.
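      The accumulate-offsets behavior can be sketched with a small in-memory map. This is an illustrative stand-in, not the class's implementation: the Entry type is hypothetical, and the key is assumed to be the lowercased type and value only (matching the hasCachedAnnotation(etype, value) lookup), whereas the database key described above also involves the contributor.

```java
import java.util.Collection;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Objects;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical sketch of a per-record annotation cache; not the library's code. */
final class AnnotationCacheSketch {

    /** Minimal stand-in for a cached annotation with its accumulated offsets. */
    static final class Entry {
        final String type;
        final String value;
        final Set<Integer> offsets = new TreeSet<>(); // unique, sorted offsets

        Entry(String type, String value) {
            this.type = type;
            this.value = value;
        }
    }

    private final Map<String, Entry> cache = new LinkedHashMap<>();

    // Assumption: case-insensitive key built from type and value only.
    private static String key(String type, String value) {
        return (type + "/" + value).toLowerCase(Locale.ROOT);
    }

    /** Adds a new entry, or only records the offset if the entry already exists. */
    Entry add(String type, String value, int start) {
        Objects.requireNonNull(type, "type must be set");
        Objects.requireNonNull(value, "value must be set");
        Entry entry = cache.computeIfAbsent(key(type, value), k -> new Entry(type, value));
        entry.offsets.add(start);
        return entry;
    }

    /** Clears the cache; intended to be called for each new record. */
    void reset() {
        cache.clear();
    }

    Collection<Entry> entries() {
        return cache.values();
    }
}
```

      With this shape, seeing 'U.S.' as a "GPE" at offsets 10 and 250 produces one entry holding two offsets rather than two separate annotations.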
    • hasCachedAnnotation

      public boolean hasCachedAnnotation(String etype, String value)
      Checks for cached annotation.
      Parameters:
      etype - the etype
      value - the value
      Returns:
      true, if an annotation with this type and value is cached
    • getCachedAnnotation

      public Annotation getCachedAnnotation(String etype, String value)
      Careful: there is no guarantee that two entity annotations will not share the same type/value unintentionally. For example, if a "tx" type annot implies a taxon from one contrib and a transaction from another, then you, the developer, should choose more distinct entity type codes.
      Parameters:
      etype - entity type
      value - value
      Returns:
      the cached annotation
    • cacheTaxonAnnotation

      protected Annotation cacheTaxonAnnotation(String contrib, Taxon taxon, String value, int offset, String docid)
      Cache taxon entity annotation.
      Parameters:
      contrib - contributor ID
      taxon - taxon obj
      value - string value
      offset - offset
      docid - docid
      Returns:
      the entity annotation
    • createAnnotation

      public static Annotation createAnnotation(String contrib, String type, String val, int offset, String docid)
      Creates a standard named entity annotation.
      Parameters:
      contrib - contributor ID
      type - annotation type/ID
      val - string value
      offset - offset
      docid - docid
      Returns:
      the annotation
    • createAnnotation

      public static Annotation createAnnotation(String contrib, String type, String val, int offset, int len, String docid)
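      Parameters:
      contrib - contributor ID
      type - annotation type/ID
      val - string value
      offset - offset
      len - length of the matched text
      docid - docid
      Returns:
      the annotation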
    • createTaxonAnnotation

      public static Annotation createTaxonAnnotation(String contrib, String type, String val, int offset, String docid, Taxon n)
      Create an annotation for a Taxon node that has a found value, val, in document, docid at offset. Taxon match has a type and a contributor, usually the tagger or extractor that processed the document.
      Parameters:
      contrib - the contrib
      type - the type
      val - the val
      offset - the offset
      docid - the docid
      n - the n
      Returns:
      the entity annotation
    • createTaxon

      public static Taxon createTaxon(Annotation a)
      Recreates a Taxon from a stored annotation. Required fields:
          a.attrs[name] -- taxon node name
          a.attrs[cat]  -- catalog
          a.name        -- not used here
          a.value       -- the value of the matched text
      Parameters:
      a - the a
      Returns:
      the taxon
    • createCountryAnnotation

      public static Annotation createCountryAnnotation(String contrib, String type, String val, int offset, String docid, String country_code)
      Track a country name match of some sort. Since you know this is a country, please enrich the annotation with the country code here. The country code can always be looked up later from a given country name or match, but that lookup may be context-specific. For example, Georgia / GE: putting the country code here gives more confidence that you found Georgia the country and not the US state. You might also have other means of deriving the country code for a given value, e.g., you found "GOI", a geopolitical entity you infer to be Govt. of India, so you emit "IN" as the country code: create( xxx, 'GPE', 'GOI', ..., 'IN' )
      Parameters:
      contrib - the contrib
      type - the type
      val - the val
      offset - the offset
      docid - the docid
      country_code - the country_code
      Returns:
      the entity annotation
    • createCountry

      public static Country createCountry(Annotation a)
      Returns an instance of a Country object using annotation value as country name, and attr[cc] optionally as code. This does not reproduce a full Country object as if queried from
      Parameters:
      a - annot
      Returns:
      the country
    • createGeocodingAnnotation

      public static Annotation createGeocodingAnnotation(String contrib, String type, String val, int offset, String docid, Geocoding g)
      Encode geocoding annotations to be saved. This schema follows from EH/GLINT/Glare, etc.
      Parameters:
      contrib - the contrib
      type - the type
      val - the val
      offset - the offset
      docid - the docid
      g - the g
      Returns:
      the entity annotation
    • createGeocoding

      public static Place createGeocoding(Annotation a)
      Decode a Geocoding; see the OpenSextant Geocoding interface. Required annotation fields are:
          lat, lon, prec
          cc, adm1, place
          feat_class, feat_code
          method
      Parameters:
      a - the a
      Returns:
      the geocoded data
    • createTemporalEntityAnnotation

      public static Annotation createTemporalEntityAnnotation(String contrib, String type, String val, int offset, String docid, Date d, String resolution)
      Creates the temporal entity annotation.
      Parameters:
      contrib - the contrib
      type - the type
      val - the val
      offset - the offset
      docid - the docid
      d - the d
      resolution - the resolution
      Returns:
      the entity annotation
    • createTemporalAnnotation

      public static Annotation createTemporalAnnotation(String contrib, String type, String val, int offset, int len, String docid, Date d, String resolution)
      Same as createTemporalEntityAnnotation, but with an additional len parameter.
      Parameters:
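      contrib - the contrib
      type - the type
      val - the val
      offset - the offset
      len - length of the matched text
      docid - the docid
      d - the date
      resolution - the resolution
      Returns:
      the entity annotation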