Package org.opensextant.annotations
Class AnnotationHelper
java.lang.Object
org.opensextant.annotations.AnnotationHelper
Basis for this optional helper class was three or four different projects using DeepEye as a model for persisting annotations from the typical Named Entity and Geo/Time extraction work.
Common annotation practices include:
- All annotations should have both their own ID and a Record ID; this involves tracking unique annotations for a given document. The intent of DeepEye identifiers is that they are unique within a system or a data set, not necessarily UUIDs:
  - Record IDs should be related to their source identifier
  - Annotation IDs should be deterministic, based on a metadata tuple: MD5( name + value + contrib + rec_id ) or a similar predictable, reproducible hash
- Caching and consolidating repetitive annotations. Example: seeing 'U.S.' 100 times in a single document as a "GPE" or as a "geo" can be overwhelming. Is it 100 individual annotations, or 1 annotation for the GPE and 1 annotation for the geo? Each annotation tracks as many offsets as it needs. If individual instances in the document vary in attributes or other metadata, then those should be considered unique annotations.
- Allow common doc-level annotations, such as a topic tag, even when there is no offset or text span in question. These are "span-less" annotations.
Advice for what to persist to a DeepEye database:
- it's costly or complex to compute
- it's helpful to aggregate many raw extracted values to review over a large data set
Consider not storing:
- annotations that are trivial to compute at runtime, e.g., indexing a derived metadata tag, such as converting 'US' to 'United States'
- NLP artifacts such as tokens like pronouns or other parts of speech. These might be
- Filtered values: if you are filtering out certain data in all your analyses or downstream operations, consider filtering out such things before you store them blindly.
SEE ALSO: Xponents SDK class org.opensextant.output.Transform: this utility class offers more ideas on standard JSON representations for REST, whereas this utility is aimed at a more reliable pure representation of the match data for storing/retrieving from a data store.
- Author:
- ubaldino
-
Field Summary
-
Constructor Summary
-
Method Summary
Modifier and Type / Method / Description
- Annotation cacheAnnotation(String contrib, String etype, String value, int start, String docid)
  Cache entity annotations, accumulating unique offsets for a name/value pair.
- void cacheAnnotation(Annotation ea)
  Cache annotation.
- void cacheAnnotation(Annotation ea, int start)
  Cache entity annotation in memory. Note: the actual ID or key in the database is usually composed of name+value+contrib.
- void cacheAnnotation(Annotation ea, String key)
  Cache an annotation.
- protected Annotation cacheTaxonAnnotation(String contrib, Taxon taxon, String value, int offset, String docid)
  Cache taxon entity annotation.
- static Annotation createAnnotation(String contrib, String type, String val, int offset, int len, String docid)
- static Annotation createAnnotation(String contrib, String type, String val, int offset, String docid)
  Creates a standard named entity annotation.
- static Country createCountry(Annotation a)
  Returns an instance of a Country object using annotation value as country name, and attr[cc] optionally as code.
- static Annotation createCountryAnnotation(String contrib, String type, String val, int offset, String docid, String country_code)
  Tracking a country name match of some sort.
- static Place createGeocoding(Annotation a)
  Decode a Geocoding; see the OpenSextant Geocoding interface.
- static Annotation createGeocodingAnnotation(String contrib, String type, String val, int offset, String docid, Geocoding g)
  Encode geocoding annotations to be saved.
- static Taxon createTaxon(Annotation a)
  Recreates a Taxon from a stored annotation.
- static Annotation createTaxonAnnotation(String contrib, String type, String val, int offset, String docid, Taxon n)
  Create an annotation for a Taxon node that has a found value, val, in document, docid at offset.
- static Annotation createTemporalAnnotation(String contrib, String type, String val, int offset, int len, String docid, Date d, String resolution)
  Same as createTemporalEntityAnnotation, but with a len parameter.
- static Annotation createTemporalEntityAnnotation(String contrib, String type, String val, int offset, String docid, Date d, String resolution)
  Creates the temporal entity annotation.
- static List<Annotation> decodeAnnotations(List<Annotation> codedAnnots)
  Given encoded annotations from db, decode them and yield a flattened set of annotations, e.g., for use with MAT.
- decodeOffsets(String list)
  Take a list of numbers and convert to Integer list: "1;5;89;777" => List<> [ 1, 5, 89, 777 ].
- static List<Annotation> decodeOffsets(Annotation meta, String offsetList)
  Generate annotations in a linear fashion.
- static String encodeOffsets(Collection<Integer> offsets)
  Encode offsets.
- static String getAnnotationId(String rec_id, String contrib, String atype, String val)
  New, required format for an annotation ID: MD5 hash made up of rec_id + contributor + type + value.
- getCachedAnnotation(String etype, String value)
  Careful: there is no guarantee that two entity annotations will not share the same type/value unintentionally.
- getCachedAnnotations()
  Gets the cached annotations, unordered.
- static long getFirstOffset(Annotation a)
  Gets the first offset.
- boolean hasCachedAnnotation(String etype, String value)
  Checks for cached annotation.
- void reset()
  Reset() clears the internal cache.
-
Field Details
-
NUM_SEP
The Constant NUM_SEP.
-
Constructor Details
-
AnnotationHelper
public AnnotationHelper()
-
-
Method Details
-
decodeAnnotations
Given encoded annotations from db, decode them and yield a flattened set of annotations, e.g., for use with MAT.
- Parameters:
  codedAnnots - the coded annotations
- Returns:
  the list
-
getFirstOffset
Gets the first offset.
- Parameters:
  a - the annotation
- Returns:
  the first offset
-
encodeOffsets
Encode offsets.
- Parameters:
  offsets - the offsets
- Returns:
  the string
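To illustrate the encoding, here is a minimal sketch, not the library's implementation. The class name OffsetEncoderSketch and the ";" separator are assumptions; the separator is inferred from the "1;5;89;777" example in this page, since the actual value of NUM_SEP is not shown here.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.stream.Collectors;

// Hypothetical helper, for illustration only.
class OffsetEncoderSketch {
    // Assumed separator, inferred from the "1;5;89;777" example.
    static final String SEP = ";";

    // Join the offsets, in iteration order, into one delimited string.
    static String encode(Collection<Integer> offsets) {
        return offsets.stream()
                .map(String::valueOf)
                .collect(Collectors.joining(SEP));
    }

    public static void main(String[] args) {
        System.out.println(encode(Arrays.asList(1, 5, 89, 777))); // prints 1;5;89;777
    }
}
```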
-
decodeOffsets
Take a list of numbers and convert to Integer list: "1;5;89;777" => List<> [ 1, 5, 89, 777 ].
- Parameters:
  list - the list
- Returns:
  the list
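The conversion described above can be sketched as follows. This is an illustrative stand-in, not the library code: the class name OffsetDecoderSketch, the ";" separator, and the handling of null or blank input (returning an empty list) are all assumptions.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, for illustration only.
class OffsetDecoderSketch {
    // Assumed separator, inferred from the "1;5;89;777" example.
    static final String SEP = ";";

    // Split the delimited string and parse each token as an integer offset.
    static List<Integer> decode(String list) {
        List<Integer> offsets = new ArrayList<>();
        if (list == null || list.trim().isEmpty()) {
            return offsets; // assumption: empty input yields an empty list
        }
        for (String token : list.split(SEP)) {
            offsets.add(Integer.parseInt(token.trim()));
        }
        return offsets;
    }

    public static void main(String[] args) {
        System.out.println(decode("1;5;89;777")); // prints [1, 5, 89, 777]
    }
}
```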
-
decodeOffsets
Generate annotations in a linear fashion. Given the optimized Annotation, A, create duplicate annotations, each with an offset from the list of offsets.
- Parameters:
  meta - the meta
  offsetList - the offset list
- Returns:
  the list
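A rough sketch of this expansion, under stated assumptions: SimpleAnnot is a hypothetical stand-in for the real Annotation class (whose fields are not shown on this page), and the ";" separator is inferred from the offset example elsewhere in this page.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration; SimpleAnnot is NOT the real Annotation class.
class ExpandOffsetsSketch {
    static class SimpleAnnot {
        final String name;
        final String value;
        final int offset;

        SimpleAnnot(String name, String value, int offset) {
            this.name = name;
            this.value = value;
            this.offset = offset;
        }
    }

    // Create one copy of the consolidated annotation per encoded offset.
    static List<SimpleAnnot> expand(String name, String value, String offsetList) {
        List<SimpleAnnot> out = new ArrayList<>();
        for (String token : offsetList.split(";")) { // assumed separator
            out.add(new SimpleAnnot(name, value, Integer.parseInt(token.trim())));
        }
        return out;
    }

    public static void main(String[] args) {
        for (SimpleAnnot a : expand("GPE", "U.S.", "10;42;300")) {
            System.out.println(a.name + " " + a.value + " @" + a.offset);
        }
    }
}
```

The stored form keeps one record with many offsets; the expanded form is convenient for tools that expect one annotation per text span.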
-
getAnnotationId
New, required format for an annotation ID: MD5 hash made up of rec_id + contributor + type + value.
Distinct entities from a single contributor for a single document will be recorded (and overwritten over time). Reprocessing data will overwrite the existing value with a new one.
Example: Doc abc has a NAMEX = 'the diplomat', provided by xyz extractor. The key is MD5( 'abc;xyz;NAMEX;the diplomat' ).
Multiple occurrences of the same value in the same document must be recorded as "atts.offsets" = [n1,n2,n3...] offsets.
- Parameters:
  rec_id - the rec_id
  contrib - the contrib
  atype - the atype
  val - the val
- Returns:
  the annotation id
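The ID scheme above can be approximated with the JDK's MessageDigest. This is a sketch under assumptions, not the library's code: the class name AnnotationIdSketch is hypothetical, and the ";" join, UTF-8 encoding, and lowercase hex output are assumptions based on the 'abc;xyz;NAMEX;the diplomat' example, so the resulting strings may not match IDs produced by the real method.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper, for illustration only.
class AnnotationIdSketch {
    // Build a deterministic ID from the (rec_id, contrib, type, value) tuple.
    static String annotationId(String recId, String contrib, String atype, String val) {
        String key = String.join(";", recId, contrib, atype, val); // assumed separator
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] digest = md5.digest(key.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b)); // two hex digits per byte
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    public static void main(String[] args) {
        // Same inputs always give the same 32-character ID.
        System.out.println(annotationId("abc", "xyz", "NAMEX", "the diplomat"));
    }
}
```

Because the ID depends only on the tuple, reprocessing the same document with the same extractor yields the same key, which is what allows a stored record to be overwritten rather than duplicated.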
-
reset
public void reset()
Reset() clears the internal cache. Ideally, you call reset on each new Record in a loop.
-
getCachedAnnotations
Gets the cached annotations, unordered.
- Returns:
  the cached annotations
-
cacheAnnotation
public Annotation cacheAnnotation(String contrib, String etype, String value, int start, String docid)
Cache entity annotations, accumulating unique offsets for a name/value pair.
- Parameters:
  contrib - the contrib
  etype - the etype
  value - the value
  start - the start
  docid - the docid
- Returns:
  the entity annotation
-
cacheAnnotation
Cache an annotation.
- Parameters:
  ea - the ea
  key - the key
-
cacheAnnotation
Cache annotation.
- Parameters:
  ea - your annotation
-
cacheAnnotation
Cache entity annotation in memory. Note: the actual ID or key in the database is usually composed of name+value+contrib.
NOTE: these are normalized case-insensitive values.
NOTE: If annotation already exists, then all we do is add the start offset to the existing entry. Name and value must not be null.
- Parameters:
  ea - your annotation
  start - offset into your doc.
- Throws:
  NullPointerException - if name and value are not set on Annotation.
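The caching behavior described above (a case-insensitive key, offsets accumulated on an existing entry, a NullPointerException on missing name or value) could look roughly like the sketch below. Everything here is hypothetical: AnnotationCacheSketch and CachedAnnot are stand-ins, and the name + "/" + value key format is an assumption, not the key the library actually builds.

```java
import java.util.Collection;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Objects;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical in-memory cache, for illustration only.
class AnnotationCacheSketch {
    static class CachedAnnot {
        final String name;
        final String value;
        final Set<Integer> offsets = new TreeSet<>(); // unique offsets only

        CachedAnnot(String name, String value) {
            this.name = name;
            this.value = value;
        }
    }

    private final Map<String, CachedAnnot> cache = new LinkedHashMap<>();

    // Assumed key format: lowercased "name/value"; throws NullPointerException on nulls.
    static String key(String name, String value) {
        Objects.requireNonNull(name, "name");
        Objects.requireNonNull(value, "value");
        return (name + "/" + value).toLowerCase(Locale.ROOT);
    }

    // Create the entry on first sight; afterwards only add the start offset.
    CachedAnnot cache(String name, String value, int start) {
        CachedAnnot entry = cache.computeIfAbsent(key(name, value), k -> new CachedAnnot(name, value));
        entry.offsets.add(start);
        return entry;
    }

    boolean has(String name, String value) {
        return cache.containsKey(key(name, value));
    }

    Collection<CachedAnnot> all() {
        return cache.values();
    }

    // Clear the cache, e.g., before processing the next record.
    void reset() {
        cache.clear();
    }

    public static void main(String[] args) {
        AnnotationCacheSketch helper = new AnnotationCacheSketch();
        helper.cache("GPE", "U.S.", 10);
        helper.cache("GPE", "U.S.", 42);
        helper.cache("geo", "U.S.", 10);
        System.out.println(helper.all().size()); // prints 2
    }
}
```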
-
hasCachedAnnotation
Checks for cached annotation.
- Parameters:
  etype - the etype
  value - the value
- Returns:
  true, if an annotation is cached for this type and value
-
getCachedAnnotation
Careful: there is no guarantee that two entity annotations will not share the same type/value unintentionally. For example, if a "tx" type annotation implies a taxon from one contributor and "tx" means a transaction from another, then you, the developer, should choose more distinct entity type codes.
- Parameters:
  etype - entity type
  value - value
- Returns:
  the cached annotation
-
cacheTaxonAnnotation
protected Annotation cacheTaxonAnnotation(String contrib, Taxon taxon, String value, int offset, String docid)
Cache taxon entity annotation.
- Parameters:
  contrib - contributor ID
  taxon - taxon obj
  value - string value
  offset - offset
  docid - docid
- Returns:
  the entity annotation
-
createAnnotation
public static Annotation createAnnotation(String contrib, String type, String val, int offset, String docid)
Creates a standard named entity annotation.
- Parameters:
  contrib - contributor ID
  type - annotation type/ID
  val - string value
  offset - offset
  docid - docid
- Returns:
  the annotation
-
createAnnotation
public static Annotation createAnnotation(String contrib, String type, String val, int offset, int len, String docid)
- Parameters:
  contrib
  type
  val
  offset
  len
  docid
-
createTaxonAnnotation
public static Annotation createTaxonAnnotation(String contrib, String type, String val, int offset, String docid, Taxon n)
Create an annotation for a Taxon node that has a found value, val, in document, docid at offset. Taxon match has a type and a contributor, usually the tagger or extractor that processed the document.
- Parameters:
  contrib - the contrib
  type - the type
  val - the val
  offset - the offset
  docid - the docid
  n - the taxon node
- Returns:
  the entity annotation
-
createTaxon
Recreates a Taxon from a stored annotation. Required fields:
- a.attrs[name] -- taxon node name
- a.attrs[cat] -- catalog
- a.name -- Not used here.
- a.value -- the value of the matched text.
- Parameters:
  a - the annotation
- Returns:
  the taxon
-
createCountryAnnotation
public static Annotation createCountryAnnotation(String contrib, String type, String val, int offset, String docid, String country_code)
Tracking a country name match of some sort. If you know this is a country, please enrich the annotation with the country code here. You can always find the country code later from a given country name/match; however, this may be context specific.
Georgia / GE: putting the country code here gives more confidence that you found Georgia, the country, and not the US state.
You might have other means for deriving the country code for a given value. For example, you found "GOI", a geopolitical entity you infer to be Govt. of India, so you emit "IN" as the country code: create( xxx, 'GPE', 'GOI', ..., 'IN' )
- Parameters:
  contrib - the contrib
  type - the type
  val - the val
  offset - the offset
  docid - the docid
  country_code - the country_code
- Returns:
  the entity annotation
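As a loose illustration of the enrichment idea, the sketch below records a country match as a plain map with the country code attached. It is not the library's Annotation class: CountryAnnotSketch and every field name except "cc" (which this page mentions as attr[cc]) are hypothetical, and the document ID and offset in the example are made up for demonstration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical map-based stand-in for an annotation, for illustration only.
class CountryAnnotSketch {
    static Map<String, String> countryAnnotation(String contrib, String type, String val,
                                                 int offset, String docid, String countryCode) {
        Map<String, String> annot = new LinkedHashMap<>();
        annot.put("contrib", contrib);   // assumed field name
        annot.put("name", type);         // assumed field name
        annot.put("value", val);         // assumed field name
        annot.put("offset", String.valueOf(offset));
        annot.put("rec_id", docid);      // assumed field name
        if (countryCode != null) {
            annot.put("cc", countryCode); // the context-specific country code
        }
        return annot;
    }

    public static void main(String[] args) {
        // "GOI" inferred to mean Govt. of India, so the caller supplies "IN".
        System.out.println(countryAnnotation("xxx", "GPE", "GOI", 120, "doc-1", "IN"));
    }
}
```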
-
createCountry
Returns an instance of a Country object using annotation value as country name, and attr[cc] optionally as code. This does not reproduce a full Country object as if queried from
- Parameters:
  a - annot
- Returns:
  the country
-
createGeocodingAnnotation
public static Annotation createGeocodingAnnotation(String contrib, String type, String val, int offset, String docid, Geocoding g)
Encode geocoding annotations to be saved. This schema follows from EH/GLINT/Glare, etc.
- Parameters:
  contrib - the contrib
  type - the type
  val - the val
  offset - the offset
  docid - the docid
  g - the g
- Returns:
  the entity annotation
-
createGeocoding
Decode a Geocoding; see the OpenSextant Geocoding interface. Required annotation fields are:
- lat, lon, prec
- cc, adm1, place
- feat_class, feat_code
- method
- Parameters:
  a - the a
- Returns:
  the geocoded data
-
createTemporalEntityAnnotation
public static Annotation createTemporalEntityAnnotation(String contrib, String type, String val, int offset, String docid, Date d, String resolution)
Creates the temporal entity annotation.
- Parameters:
  contrib - the contrib
  type - the type
  val - the val
  offset - the offset
  docid - the docid
  d - the d
  resolution - the resolution
- Returns:
  the entity annotation
-
createTemporalAnnotation
public static Annotation createTemporalAnnotation(String contrib, String type, String val, int offset, int len, String docid, Date d, String resolution)
Same as createTemporalEntityAnnotation, but with a len parameter.
- Parameters:
  contrib
  type
  val
  offset
  len
  docid
  d
  resolution