Package org.opensextant.data


Xponents Data Model

The key constructs here are GeoBase and Geocoding.  GeoBase is a base class for anything that has an ID, a name or label, and a coordinate.  Geocoding is an interface for any heuristic that helps ground some data to a coordinate while also providing metadata about the geocoding itself.  For example, beyond the coordinate itself, useful geocoding attributes include:
  • precision and confidence
  • country code and province code or name
  • method or source for geocoding, e.g., derivation or rote lookup
  • etc.
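
To make the split between the base class and the geocoding metadata concrete, here is a minimal sketch. All type and method names below (GeocodingSketch, GeoBaseSketch, PlaceSketch, the precision unit, the confidence scale) are assumptions made for illustration; they are not the actual org.opensextant.data API, so consult the class documentation for the real signatures.

```java
// Hypothetical sketch only: none of these types are the real org.opensextant.data classes.
interface GeocodingSketch {
    double getLatitude();
    double getLongitude();
    int getPrecision();       // assumed unit: error radius in meters
    int getConfidence();      // assumed scale: 0 (no trust) to 100 (certain)
    String getCountryCode();  // e.g., a 2-char country code
    String getProvince();     // province code or name; may be null
    String getMethod();       // e.g., "lookup" or "derived"
}

// A base object: anything with an ID, a name or label, and a coordinate.
class GeoBaseSketch {
    private final String id;
    private final String name;
    private final double latitude;
    private final double longitude;

    GeoBaseSketch(String id, String name, double latitude, double longitude) {
        this.id = id;
        this.name = name;
        this.latitude = latitude;
        this.longitude = longitude;
    }

    public String getId() { return id; }
    public String getName() { return name; }
    public double getLatitude() { return latitude; }
    public double getLongitude() { return longitude; }
}

// A place extends the base object and adds metadata about how it was geocoded.
class PlaceSketch extends GeoBaseSketch implements GeocodingSketch {
    private final int precision;
    private final int confidence;
    private final String countryCode;
    private final String province;
    private final String method;

    PlaceSketch(String id, String name, double lat, double lon, int precision,
                int confidence, String countryCode, String province, String method) {
        super(id, name, lat, lon);
        this.precision = precision;
        this.confidence = confidence;
        this.countryCode = countryCode;
        this.province = province;
        this.method = method;
    }

    public int getPrecision() { return precision; }
    public int getConfidence() { return confidence; }
    public String getCountryCode() { return countryCode; }
    public String getProvince() { return province; }
    public String getMethod() { return method; }
}

class GeocodingSketchDemo {
    public static void main(String[] args) {
        // Sample values are made up for the demonstration.
        GeocodingSketch g = new PlaceSketch("p1", "Boston", 42.36, -71.06,
                5000, 80, "US", "MA", "lookup");
        System.out.println(g.getCountryCode() + " via " + g.getMethod()
                + ", +/-" + g.getPrecision() + " m");
    }
}
```

Keeping the metadata on an interface lets downstream code treat any geocoded result uniformly, whether it came from a rote lookup or from a derivation.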

Country and Place objects are extensions of GeoBase. Country is used extensively in place name extraction, reverse geocoding, and general country name/code lookups.  See GeonamesUtility for more country metadata tools.

The Language object ties a language code to its name. LangDetect and LangId (org.opensextant.extractors.langid) provide some tools for language detection. A language ID does not always line up literally with a known Language code, as LangID may yield language + locale, so explicit and inferred language/locale codings need to be parsed and managed.
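
As an illustration of the language + locale problem, the sketch below splits an identifier such as "zh-cn" or "pt_BR" into its language and region parts using the standard java.util.Locale.forLanguageTag(). The class name LangIdParseSketch and its parse() helper are hypothetical; this is not how Xponents itself parses language IDs, only one plausible approach.

```java
import java.util.Locale;

// Hypothetical helper, not part of Xponents.
class LangIdParseSketch {

    /**
     * Splits a language ID into { language, region }.
     * The region is an empty string when the ID carries no locale part.
     */
    static String[] parse(String langId) {
        // Language tags use '-' as a separator; some tools emit '_' instead.
        String tag = langId.trim().replace('_', '-');
        Locale loc = Locale.forLanguageTag(tag);
        return new String[] { loc.getLanguage(), loc.getCountry() };
    }

    public static void main(String[] args) {
        for (String id : new String[] { "zh-cn", "pt_BR", "en" }) {
            String[] parts = parse(id);
            System.out.println(id + " -> language=" + parts[0] + ", region=" + parts[1]);
        }
    }
}
```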

The Java SDK Locale classes appear to cover only the languages used for computer internationalization, and alternatives such as the ICU4J libraries do not have a simple, clear API. So, I created language lookup tables around ISO-639 codes (sourced from the Library of Congress), which are found in org.opensextant.util.TextUtility: getLanguage(), getLanguageCode(), getLanguageMap().
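
The general idea of a code/name lookup table can be sketched as follows. This is not the TextUtility implementation: the class LanguageTableSketch and its register(), nameOf(), and codeOf() methods are invented for illustration, and the real getLanguage(), getLanguageCode(), and getLanguageMap() signatures should be taken from the TextUtility documentation.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch: not the org.opensextant.util.TextUtility implementation.
class LanguageTableSketch {
    private final Map<String, String> nameByCode = new HashMap<>();
    private final Map<String, String> codeByName = new HashMap<>();

    /** Registers one language under its 2-char and 3-char ISO-639 codes. */
    void register(String iso2, String iso3, String name) {
        nameByCode.put(iso2, name);
        nameByCode.put(iso3, name);
        codeByName.put(name.toLowerCase(Locale.ROOT), iso2);
    }

    /** Returns the display name for a code, or null if the code is unknown. */
    String nameOf(String code) {
        return code == null ? null : nameByCode.get(code.toLowerCase(Locale.ROOT));
    }

    /** Returns the 2-char code for a display name, or null if the name is unknown. */
    String codeOf(String name) {
        return name == null ? null : codeByName.get(name.toLowerCase(Locale.ROOT));
    }
}

class LanguageTableSketchDemo {
    public static void main(String[] args) {
        LanguageTableSketch table = new LanguageTableSketch();
        table.register("fr", "fra", "French");
        table.register("ar", "ara", "Arabic");
        System.out.println(table.nameOf("FRA"));
        System.out.println(table.codeOf("arabic"));
    }
}
```

Normalizing keys to lower case with Locale.ROOT keeps lookups stable regardless of the default locale of the JVM.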





  • Class
    Description
    Country metadata provided on this class includes:
      • ISO-3166 country code, 2-char and 3-char forms, aligned with US standard FIPS 10-4 codes
      • Country aliases: nicknames, variant names, abbreviations
      • Affiliated territories
      • Timezone and UTC offset for temporal calculations
      • Primary and Secondary languages
     
    Use only for cases where you have document inputs instead of raw records.
    An intermediary between the simple LatLon and other conceptual classes: Place, Country, etc.
    An interface that describes any data that can be geocoded -- the metadata behind deriving location is as important as the actual location is.
    Simple mapping of ISO 639 id to display name for languages
     
    Improving control over Xponents schema fields and common, constant values.
    Place class represents all the metadata about a location.
    A Taxon is an entry in a taxonomy, which could be as simple as a flat word list or something with lots of structure.
    TextInput is a unit of data -- a tuple that represents the text and its language and an identifier for downstream processing, export formatting, databasing results keyed by text identifier, etc.