org.opensextant.annotations.AnnotationHelper

public class AnnotationHelper extends Object

This optional helper class grew out of three or four different projects that used DeepEye as
a model for persisting annotations from typical Named Entity and Geo/Time extraction work.

Common annotation practices include:
- All annotations should have both their own ID and a Record ID; this involves tracking unique annotations for a given document.
  DeepEye identifiers are intended to be unique within a system or a data set, and are not necessarily UUIDs:
  - Record IDs should be related to their source identifier.
  - Annotation IDs should be deterministic, based on a metadata tuple: MD5( name + value + contrib + rec_id ) or a similar predictable, reproducible hash.
- Caching and consolidating repetitive annotations. Example: seeing 'U.S.' 100 times in a
single document as a "GPE" or as a "geo" can be overwhelming. Is it 100 individual annotations,
or 1 annotation for the GPE and 1 annotation for the geo? Each annotation tracks as many offsets as it needs.
Where individual instances in the document vary in attributes or other metadata, those
should be considered unique annotations.
- Allow common doc-level annotations, such as a topic tag, even when there is no offset or text span in question. These are "span-less" annotations.

Advice for what to persist to a DeepEye database:
- it's costly or complex to compute
- it's helpful to aggregate many raw extracted values to review over a large data set

Consider not storing:
- annotations that are trivial to compute at runtime, e.g., indexing a derived metadata tag,
such as converting 'US' to 'United States'
- NLP artifacts such as tokens like pronouns or other parts of speech. These might be
- Filtered values -- if you are filtering out certain data in all your analyses or downstream
operations, consider filtering out such things before you store them blindly.

SEE ALSO: Xponents SDK class org.opensextant.output.Transform. That utility class offers more ideas on standard JSON representations for REST, whereas this utility is aimed at a more reliable, pure representation of the match data for storing to and retrieving from a data store.

Author: ubaldino

Field Summary

Fields

Modifier and Type

Field

Description

static final String

NUM_SEP

The Constant NUM_SEP.
Constructor Summary

Constructors

Constructor

Description

AnnotationHelper()
Method Summary

Modifier and Type

Method

Description

Annotation

cacheAnnotation(String contrib, String etype, String value, int start, String docid)

Cache entity annotations, accumulating unique offsets for a name/value pair.

void

cacheAnnotation(Annotation ea)

Cache annotation.

void

cacheAnnotation(Annotation ea, int start)

Cache an entity annotation in memory. Note: the actual ID or key in the database is usually composed of name+value+contrib.

void

cacheAnnotation(Annotation ea, String key)

Cache an annotation.

protected Annotation

cacheTaxonAnnotation(String contrib, Taxon taxon, String value, int offset, String docid)

Cache taxon entity annotation.

static Annotation

createAnnotation(String contrib, String type, String val, int offset, int len, String docid)

static Annotation

createAnnotation(String contrib, String type, String val, int offset, String docid)

Creates a standard named entity annotation.

static Country

createCountry(Annotation a)

Returns an instance of a Country object using annotation value as country name, and attr[cc] optionally as code.

static Annotation

createCountryAnnotation(String contrib, String type, String val, int offset, String docid, String country_code)

Tracking a country name match of some sort.

static Place

createGeocoding(Annotation a)

Decode a Geocoding; see the OpenSextant Geocoding interface.

static Annotation

createGeocodingAnnotation(String contrib, String type, String val, int offset, String docid, Geocoding g)

Encode geocoding annotations to be saved.

static Taxon

createTaxon(Annotation a)

Recreates a Taxon from a stored annotation.

static Annotation

createTaxonAnnotation(String contrib, String type, String val, int offset, String docid, Taxon n)

Create an annotation for a Taxon node that has a found value, val, in document, docid at offset.

static Annotation

createTemporalAnnotation(String contrib, String type, String val, int offset, int len, String docid, Date d, String resolution)

Same as createTemporalEntityAnnotation, just with a len param.

static Annotation

createTemporalEntityAnnotation(String contrib, String type, String val, int offset, String docid, Date d, String resolution)

Creates the temporal entity annotation.

static List<Annotation>

decodeAnnotations(List<Annotation> codedAnnots)

Given encoded annotations from db, decode them and yield a flattened set of annotations, e.g., for use with MAT

static List<Integer>

decodeOffsets(String list)

Take a list of numbers and convert to Integer list "1;5;89;777" => List<> [ 1, 5, 89, 777 ].

static List<Annotation>

decodeOffsets(Annotation meta, String offsetList)

Generate annotations in a linear fashion.

static String

encodeOffsets(Collection<Integer> offsets)

Encode offsets.

static String

getAnnotationId(String rec_id, String contrib, String atype, String val)

New, required format for an annotation ID: MD5 hash made up of rec_id + contributor + type + value.

Annotation

getCachedAnnotation(String etype, String value)

Careful -- there is no guarantee that two entity annotations will not share the same type/value unintentionally.

Collection<Annotation>

getCachedAnnotations()

Gets the cached annotations, unordered.

static long

getFirstOffset(Annotation a)

Gets the first offset.

boolean

hasCachedAnnotation(String etype, String value)

Checks for cached annotation.

void

reset()

Reset() clears the internal cache.

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Field Details
- NUM_SEP
  
  public static final String NUM_SEP
  
  The Constant NUM_SEP.
  See Also:
  
  Constant Field Values
Constructor Details
- AnnotationHelper
  
  public AnnotationHelper()
Method Details
- decodeAnnotations
  
  public static List<Annotation> decodeAnnotations(List<Annotation> codedAnnots)
  
  Given encoded annotations from db, decode them and yield a flattened set of annotations, e.g., for use with MAT
  
  Parameters:
  
  codedAnnots - the coded annots
  
  Returns:
  
  the list
- getFirstOffset
  
  public static long getFirstOffset(Annotation a)
  
  Gets the first offset.
  
  Parameters:
  
  a - the a
  
  Returns:
  
  the first offset
- encodeOffsets
  
  public static String encodeOffsets(Collection<Integer> offsets)
  
  Encode offsets.
  
  Parameters:
  
  offsets - the offsets
  
  Returns:
  
  the string
- decodeOffsets
  
  public static List<Integer> decodeOffsets(String list)
  
  Take a list of numbers and convert to Integer list "1;5;89;777" => List<> [ 1, 5, 89, 777 ].
  
  Parameters:
  
  list - the list
  
  Returns:
  
  the list
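
The offset encoding described by encodeOffsets and decodeOffsets can be illustrated with a minimal, self-contained sketch. It is not the Xponents source: the class name OffsetCodecSketch and its methods are hypothetical, and the ";" separator is an assumption taken from the "1;5;89;777" example above (the actual value of NUM_SEP is not shown on this page).

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.StringJoiner;

/** Hypothetical stand-in for the offset codec; not the Xponents implementation. */
class OffsetCodecSketch {
    // Assumption: ";" is the separator, as in the documented example "1;5;89;777".
    static final String SEP = ";";

    /** Join offsets into a single delimited string. */
    static String encode(Collection<Integer> offsets) {
        StringJoiner joined = new StringJoiner(SEP);
        for (Integer offset : offsets) {
            joined.add(Integer.toString(offset));
        }
        return joined.toString();
    }

    /** Split a delimited string back into integer offsets; blank input yields an empty list. */
    static List<Integer> decode(String list) {
        List<Integer> offsets = new ArrayList<>();
        if (list == null || list.trim().isEmpty()) {
            return offsets;
        }
        for (String token : list.split(SEP)) {
            offsets.add(Integer.parseInt(token.trim()));
        }
        return offsets;
    }

    public static void main(String[] args) {
        System.out.println(decode("1;5;89;777"));           // prints [1, 5, 89, 777]
        System.out.println(encode(List.of(1, 5, 89, 777))); // prints 1;5;89;777
    }
}
```

Storing offsets as one delimited string keeps a consolidated annotation in a single record, at the cost of a parse step when reading it back.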
- decodeOffsets
  
  public static List<Annotation> decodeOffsets(Annotation meta, String offsetList)
  
  Generate annotations in a linear fashion. Given the optimized Annotation, A, create duplicate annotations, each with an offset from the list of offsets.
  
  Parameters:
  
  meta - the meta
  
  offsetList - the offset list
  
  Returns:
  
  the list
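
As a rough illustration of the "linear" expansion described here, the sketch below turns one consolidated type/value pair plus an offset list into one object per offset. The Span class and the expand method are hypothetical stand-ins for Annotation and decodeOffsets, and the ";" separator is again an assumption.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of flattening one consolidated annotation into per-offset copies. */
class FlattenOffsetsSketch {
    /** Minimal stand-in for an annotation located at a single offset. */
    static final class Span {
        final String type;
        final String value;
        final int offset;

        Span(String type, String value, int offset) {
            this.type = type;
            this.value = value;
            this.offset = offset;
        }
    }

    /** Create one Span per offset in the ";"-delimited list (separator is an assumption). */
    static List<Span> expand(String type, String value, String offsetList) {
        List<Span> spans = new ArrayList<>();
        if (offsetList == null || offsetList.trim().isEmpty()) {
            return spans;
        }
        for (String token : offsetList.split(";")) {
            spans.add(new Span(type, value, Integer.parseInt(token.trim())));
        }
        return spans;
    }

    public static void main(String[] args) {
        for (Span s : expand("GPE", "U.S.", "12;340;1022")) {
            System.out.println(s.type + " '" + s.value + "' @ " + s.offset);
        }
    }
}
```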
- getAnnotationId
  
  public static String getAnnotationId(String rec_id, String contrib, String atype, String val)
  
  New, required format for an annotation ID: MD5 hash made up of rec_id + contributor + type + value.
  
  Distinct entities from a single contributor for a single document will be recorded (and overwritten over time). Reprocessing data will overwrite the stored value with a new one. Example: doc abc has a NAMEX = 'the diplomat', provided by the xyz extractor; the key is MD5( 'abc;xyz;NAMEX;the diplomat' ). Multiple occurrences of the same value in the same document must be recorded as "atts.offsets" = [n1,n2,n3...] offsets.
  
  Parameters:
  
  rec_id - the rec_id
  
  contrib - the contrib
  
  atype - the atype
  
  val - the val
  
  Returns:
  
  the annotation id
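
A deterministic ID of this kind can be sketched with the JDK's MessageDigest. This is an illustration under stated assumptions, not the library's code: the class and method names are hypothetical, the ";" join follows the 'abc;xyz;NAMEX;the diplomat' example above, and the UTF-8 encoding and lowercase hex output are assumptions.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Hypothetical sketch of a reproducible annotation ID; not the Xponents implementation. */
class AnnotationIdSketch {
    static String annotationId(String recId, String contrib, String type, String value) {
        // Assumption: fields are joined with ";" in the order rec_id, contributor, type, value.
        String key = String.join(";", recId, contrib, type, value);
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] digest = md5.digest(key.getBytes(StandardCharsets.UTF_8));
            // Render the 16-byte digest as 32 lowercase hex characters, zero-padded on the left.
            return String.format("%032x", new BigInteger(1, digest));
        } catch (NoSuchAlgorithmException e) {
            // MD5 is a standard algorithm on Java platforms, so this is not expected.
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    public static void main(String[] args) {
        String first = annotationId("abc", "xyz", "NAMEX", "the diplomat");
        String again = annotationId("abc", "xyz", "NAMEX", "the diplomat");
        System.out.println(first);
        System.out.println(first.equals(again)); // prints true
    }
}
```

Because the same tuple always hashes to the same ID, reprocessing a document overwrites the earlier record instead of creating a duplicate.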
- reset
  
  public void reset()
  
  Reset() clears the internal cache. Ideally, you hit reset on each new Record in a loop.
- getCachedAnnotations
  
  public Collection<Annotation> getCachedAnnotations()
  
  Gets the cached annotations, unordered.
  
  Returns:
  
  the cached annotations
- cacheAnnotation
  
  public Annotation cacheAnnotation(String contrib, String etype, String value, int start, String docid)
  
  Cache entity annotations, accumulating unique offsets for a name/value pair.
  
  Parameters:
  
  contrib - the contrib
  
  etype - the etype
  
  value - the value
  
  start - the start
  
  docid - the docid
  
  Returns:
  
  the entity annotation
- cacheAnnotation
  
  public void cacheAnnotation(Annotation ea, String key)
  
  Cache an annotation.
  
  Parameters:
  
  ea - the ea
  
  key - the key
- cacheAnnotation
  
  public void cacheAnnotation(Annotation ea)
  
  Cache annotation.
  
  Parameters:
  
  ea - your annotation
- cacheAnnotation
  
  public void cacheAnnotation(Annotation ea, int start)
  
  Cache an entity annotation in memory. Note: the actual ID or key in the database is usually composed of name+value+contrib. NOTE: these are normalized, case-insensitive values. NOTE: if the annotation already exists, then all we do is add the start offset to the existing entry. Name and value must not be null.
  
  Parameters:
  
  ea - your annotation
  
  start - offset into your doc.
  
  Throws:
  
  NullPointerException - if name and value are not set on Annotation.
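
The caching behavior described above (one entry per case-insensitive type/value pair, accumulating unique offsets, cleared per record) can be sketched as follows. All names here are hypothetical, and the key format is an assumption; the real cache may key on different fields.

```java
import java.util.Collection;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Objects;
import java.util.TreeSet;

/** Hypothetical sketch of an in-memory annotation cache; not the Xponents implementation. */
class AnnotationCacheSketch {
    /** One consolidated entry: a type/value pair with every offset where it was seen. */
    static final class Entry {
        final String type;
        final String value;
        final TreeSet<Integer> offsets = new TreeSet<>();

        Entry(String type, String value) {
            this.type = type;
            this.value = value;
        }
    }

    private final Map<String, Entry> cache = new HashMap<>();

    /** Assumption: the key is the lower-cased "type/value" pair. Nulls are rejected. */
    static String key(String type, String value) {
        Objects.requireNonNull(type, "type");
        Objects.requireNonNull(value, "value");
        return (type + "/" + value).toLowerCase(Locale.ROOT);
    }

    /** Add an offset to the existing entry, creating the entry on first sight. */
    Entry cache(String type, String value, int start) {
        Entry entry = cache.computeIfAbsent(key(type, value), k -> new Entry(type, value));
        entry.offsets.add(start);
        return entry;
    }

    /** Cached entries, in no particular order. */
    Collection<Entry> cached() {
        return cache.values();
    }

    /** Clear the cache, e.g. at the start of each new record. */
    void reset() {
        cache.clear();
    }

    public static void main(String[] args) {
        AnnotationCacheSketch helper = new AnnotationCacheSketch();
        helper.cache("GPE", "U.S.", 10);
        helper.cache("gpe", "u.s.", 250); // same entry: keys are case-insensitive
        for (Entry e : helper.cached()) {
            System.out.println(e.type + " '" + e.value + "' " + e.offsets); // prints GPE 'U.S.' [10, 250]
        }
        helper.reset();
    }
}
```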
- hasCachedAnnotation
  
  public boolean hasCachedAnnotation(String etype, String value)
  
  Checks for cached annotation.
  
  Parameters:
  
  etype - the etype
  
  value - the value
  
  Returns:
  
  true, if successful
- getCachedAnnotation
  
  public Annotation getCachedAnnotation(String etype, String value)
  
  Careful -- there is no guarantee that two entity annotations will not share the same type/value unintentionally. For example, if a "tx" type annotation implies a taxon from one contributor and "tx" means a transaction from another, then you, the developer, should choose more distinct entity type codes.
  
  Parameters:
  
  etype - entity type
  
  value - value
  
  Returns:
  
  the cached annotation
- cacheTaxonAnnotation
  
  protected Annotation cacheTaxonAnnotation(String contrib, Taxon taxon, String value, int offset, String docid)
  
  Cache taxon entity annotation.
  
  Parameters:
  
  contrib - contributor ID
  
  taxon - taxon obj
  
  value - string value
  
  offset - offset
  
  docid - docid
  
  Returns:
  
  the entity annotation
- createAnnotation
  
  public static Annotation createAnnotation(String contrib, String type, String val, int offset, String docid)
  
  Creates a standard named entity annotation.
  
  Parameters:
  
  contrib - contributor ID
  
  type - annotation type/ID
  
  val - string value
  
  offset - offset
  
  docid - docid
  
  Returns:
  
  the annotation
- createAnnotation
  
  public static Annotation createAnnotation(String contrib, String type, String val, int offset, int len, String docid)
  
  Parameters:
  
  contrib -
  
  type -
  
  val -
  
  offset -
  
  len -
  
  docid -
  
  Returns:
- createTaxonAnnotation
  
  public static Annotation createTaxonAnnotation(String contrib, String type, String val, int offset, String docid, Taxon n)
  
  Create an annotation for a Taxon node that has a found value, val, in document, docid at offset. Taxon match has a type and a contributor, usually the tagger or extractor that processed the document.
  
  Parameters:
  
  contrib - the contrib
  
  type - the type
  
  val - the val
  
  offset - the offset
  
  docid - the docid
  
  n - the n
  
  Returns:
  
  the entity annotation
- createTaxon
  
  public static Taxon createTaxon(Annotation a)
  
  Recreates a Taxon from a stored annotation. Required fields:
  - a.attrs[name] -- taxon node name
  - a.attrs[cat] -- catalog
  - a.name -- Not used here.
  - a.value -- the value of the matched text.
  
  Parameters:
  
  a - the a
  
  Returns:
  
  the taxon
- createCountryAnnotation
  
  public static Annotation createCountryAnnotation(String contrib, String type, String val, int offset, String docid, String country_code)
  
  Tracks a country name match of some sort. You know this is a country, so please enrich the annotation with the country code here. The country code can always be looked up later from a given country name or match, but that lookup may be context specific. Example: Georgia / GE. Putting the country code here gives more confidence that you found Georgia the country and not the US state. You might have other means for deriving the country code for a given value; for example, you found "GOI", a geopolitical entity you infer to be Govt. of India, so you emit "IN" as the country code: create( xxx, 'GPE', 'GOI', ..., 'IN' )
  
  Parameters:
  
  contrib - the contrib
  
  type - the type
  
  val - the val
  
  offset - the offset
  
  docid - the docid
  
  country_code - the country_code
  
  Returns:
  
  the entity annotation
- createCountry
  
  public static Country createCountry(Annotation a)
  
  Returns an instance of a Country object using annotation value as country name, and attr[cc] optionally as code. This does not reproduce a full Country object as if queried from
  Parameters:
  
  a - annot
  
  Returns:
  
  the country
  
  See Also:
  
  GeonamesUtility.getCountry(String)
- createGeocodingAnnotation
  
  public static Annotation createGeocodingAnnotation(String contrib, String type, String val, int offset, String docid, Geocoding g)
  
  Encode geocoding annotations to be saved. This schema follows from EH/GLINT/Glare, etc.
  
  Parameters:
  
  contrib - the contrib
  
  type - the type
  
  val - the val
  
  offset - the offset
  
  docid - the docid
  
  g - the g
  
  Returns:
  
  the entity annotation
- createGeocoding
  
  public static Place createGeocoding(Annotation a)
  
  Decode a Geocoding; see the OpenSextant Geocoding interface. The required annotation fields are: lat, lon, prec, cc, adm1, place, feat_class, feat_code, method.
  
  Parameters:
  
  a - the a
  
  Returns:
  
  the geocoded data
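
Before decoding a stored annotation, a caller may want to check that the required fields are present. The sketch below is hypothetical: it treats the attributes as a plain Map and assumes the field names listed above are nine separate keys. How createGeocoding itself handles missing fields is not documented here.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical pre-check for geocoding attributes; not part of the Xponents API. */
class GeocodingFieldsSketch {
    // Assumption: these are the attribute keys named in the createGeocoding description.
    static final List<String> REQUIRED = List.of(
            "lat", "lon", "prec", "cc", "adm1", "place", "feat_class", "feat_code", "method");

    /** Return the required keys that are absent or null in the given attributes. */
    static List<String> missingFields(Map<String, ?> attrs) {
        List<String> missing = new ArrayList<>();
        for (String field : REQUIRED) {
            if (attrs.get(field) == null) {
                missing.add(field);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        Map<String, Object> attrs = new HashMap<>();
        attrs.put("lat", 41.7);
        attrs.put("lon", 44.8);
        attrs.put("cc", "GE");
        // Lists the keys still needed before the annotation could be decoded.
        System.out.println(missingFields(attrs));
    }
}
```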
- createTemporalEntityAnnotation
  
  public static Annotation createTemporalEntityAnnotation(String contrib, String type, String val, int offset, String docid, Date d, String resolution)
  
  Creates the temporal entity annotation.
  
  Parameters:
  
  contrib - the contrib
  
  type - the type
  
  val - the val
  
  offset - the offset
  
  docid - the docid
  
  d - the d
  
  resolution - the resolution
  
  Returns:
  
  the entity annotation
- createTemporalAnnotation
  
  public static Annotation createTemporalAnnotation(String contrib, String type, String val, int offset, int len, String docid, Date d, String resolution)
  
  Same as createTemporalEntityAnnotation, just with a len param.
  
  Parameters:
  
  contrib -
  
  type -
  
  val -
  
  offset -
  
  len -
  
  docid -
  
  d -
  
  resolution -
  
  Returns:
