- builtins.object
- Geotagger
- XlayerClient
class Geotagger(builtins.object)
Geotagger(cfg: dict, debug=False, features=['geo', 'postal', 'taxons'])
GEOTAGGER REST client
Methods defined here:
- __init__(self, cfg: dict, debug=False, features=['geo', 'postal', 'taxons'])
- Initialize self. See help(type(self)) for accurate signature.
- dbg(self, msg, *args, **kwargs)
- error(self, msg, *args, **kwargs)
- infer_locations(self, locs: list) -> dict
- Choose the best location from the list -- Most specific is preferred.
:param locs: list of locations
:return:
- info(self, msg, *args, **kwargs)
- populate_mentions(self, spans: list) -> dict
- summarize(self, doc_id, text, lang=None) -> dict
- Calls the XlayerClient process() endpoint and
distills the output tags into `geoinferences` and `mentions` (all other non-geo tags).
A valid 2-char ISO 639 language code helps to tune the tagging.
:param doc_id: ID of text
:param text: the text input
:param lang: language of the text
:return: A single geoinference
Data descriptors defined here:
- __dict__
- dictionary for instance variables (if defined)
- __weakref__
- list of weak references to the object (if defined)
Data and other attributes defined here:
- ALLOWED_SLOTS = {'admin', 'city', 'country', 'postal', 'site'}
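The "most specific is preferred" rule described for infer_locations() can be illustrated with a minimal sketch. The specificity order, the `slot` and `name` keys, and the helper name `pick_most_specific` are all assumptions made for illustration; only the slot names come from ALLOWED_SLOTS, and the actual Geotagger logic may rank or represent locations differently.

```python
# Hypothetical ranking, most specific first. The slot names come from
# ALLOWED_SLOTS; the order itself is an assumption, not the library's rule.
SPECIFICITY = ["site", "postal", "city", "admin", "country"]


def pick_most_specific(locs):
    """Return the location with the most specific slot, or None if none qualify."""
    ranked = [loc for loc in locs if loc.get("slot") in SPECIFICITY]
    if not ranked:
        return None
    # A lower index in SPECIFICITY means a more specific location.
    return min(ranked, key=lambda loc: SPECIFICITY.index(loc["slot"]))


# Assumed shape of a location record: a dict with "slot" and "name" keys.
candidates = [
    {"slot": "country", "name": "France"},
    {"slot": "city", "name": "Lyon"},
    {"slot": "admin", "name": "Auvergne-Rhône-Alpes"},
]

best = pick_most_specific(candidates)
print(best["name"])  # Lyon
```

Locations whose slot is not in the ranking are ignored here rather than raising an error, which keeps the sketch tolerant of tags outside ALLOWED_SLOTS.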
class XlayerClient(builtins.object)
|
XlayerClient(server, options='')
Methods defined here:
- __init__(self, server, options='')
- @param server: URL for the service. E.g., host:port or 'http://SERVER/xlayer/rest/process'.
@keyword options: STRING. a comma-separated list of options to send with each request.
There are no default options supported.
- ping(self, timeout=30)
- Timeout of 30 seconds is used here so calls do not hang indefinitely.
:return: True if successful.
- process(self, docid, text, lang=None, features=['geo'], timeout=10, minlen=-1, preferred_countries=None, preferred_locations=None)
- Process text, extracting entities.
lang = "xx" or None, where "xx" is an ISO 639 2-char language code.
For general Chinese/Japanese/Korean (CJK) support, use lang = 'cjk'
Language IDs that have some additional tuning include:
"ja", "th", "tr", "id", "ar", "fa", "ur", "ru", "it",
"pt", "de", "nl", "es", "en", "tl", "ko", "vi"
Behavior: an Arabic (ar) or CJK (cjk) lang ID directs the tagger to use language-specific tokenizers.
Any other lang ID only invokes language-specific stopword filters.
features are places, coordinates, countries, orgs, persons, patterns, postal.
feature alias "geo" selects all geographic entities (places, coordinates, countries).
feature "taxons" selects any taxon: "taxons", "persons", "orgs". As of Xponents 3.6 this reports ALL
other taxons available in the TaxCat tagger. "all_taxons" is offered as a means to distinguish old and new behavior.
feature "postal" tags obvious, qualified postal codes that are paired with a CITY, PROVINCE, or COUNTRY tag.
feature "patterns" is an alias for dates and any other pattern-based extractors. For now "dates" is the only one.
feature "codes" tags, uses and reports coded information for any place, primarily administrative boundaries.
options are not observed by Xlayer "Xgeo", but you can adapt your own service
to accommodate such options. Possible options include clean_input and lowercase:
* clean_input scrubs the input text if it has HTML or other content in it.
* lowercase allows lowercase matches to pass through the tagger.
The interpretation of "clean text" and "lower case" support is subjective,
so these options are not supported out of the box here.
:param docid: identifier of transaction
:param text: Unicode text to process
:param lang: One of ["ar", "cjk", .... other ISO language IDs]
:param features: list of geo OR [places, coordinates, countries], orgs, persons, patterns, taxons
:param timeout: defaults to 10 seconds. Increase it if your processing takes longer
and you see timeout exceptions.
:param minlen: minimum length of unqualified matches, used to reduce noise in geotags. The server has a default
of 4 chars for general-purpose noise filtering.
:param preferred_countries: Array of country codes representing the preferred fallbacks when
location names are ambiguous.
:param preferred_locations: Array of geohashes representing the general area of preferred matches
:return: array of TextMatch objects or empty array.
- stop(self, timeout=30)
- Timeout of 30 seconds is used here so calls do not hang indefinitely.
The service URL is inferred: /process and /control endpoints should be next to each other.
:return: True if successful or if "Connection aborted" ConnectionError occurs
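The note that the /process and /control endpoints sit next to each other, together with the two accepted forms of `server`, suggests a URL derivation along the following lines. This is a sketch under stated assumptions: the helper names `normalize_process_url` and `infer_control_url` are hypothetical, the `localhost:8787` address is only an example value, and the client may derive its URLs differently.

```python
def normalize_process_url(server: str) -> str:
    """Expand a bare host:port into a full /process URL; leave full URLs as-is.

    Assumption: the default path matches the example in the __init__ docs,
    http://SERVER/xlayer/rest/process.
    """
    if server.startswith(("http://", "https://")):
        return server
    return f"http://{server}/xlayer/rest/process"


def infer_control_url(process_url: str) -> str:
    """Swap the trailing /process path segment for /control."""
    base, _, last = process_url.rstrip("/").rpartition("/")
    if last != "process":
        raise ValueError(f"expected a URL ending in /process, got {process_url!r}")
    return f"{base}/control"


process_url = normalize_process_url("localhost:8787")  # example address only
print(process_url)
print(infer_control_url(process_url))
```

Replacing only the final path segment, rather than every occurrence of "process" in the string, avoids corrupting a host name or path prefix that happens to contain the same word.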
Data descriptors defined here:
- __dict__
- dictionary for instance variables (if defined)
- __weakref__
- list of weak references to the object (if defined)
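As a rough illustration of how the process() arguments documented for XlayerClient might be assembled into a request body, consider the sketch below. The helper name `build_process_request`, the JSON field names (assumed to mirror the parameter names), and the comma-joined `features` string are assumptions; the actual wire format of the Xlayer service is not stated in this reference.

```python
import json


def build_process_request(docid, text, lang=None, features=("geo",), minlen=-1,
                          preferred_countries=None, preferred_locations=None,
                          options=""):
    """Assemble a request body from process()-style arguments.

    Field names are assumed to mirror the parameter names; unset or default
    values are omitted so the server can apply its own defaults.
    """
    request = {"docid": docid, "text": text}
    if options:
        request["options"] = options
    if features:
        # Assumption: features travel as one comma-separated string.
        request["features"] = ",".join(features)
    if lang:
        request["lang"] = lang
    if minlen > 0:
        # minlen=-1 means "not set"; the server default (4 chars) then applies.
        request["minlen"] = minlen
    if preferred_countries:
        request["preferred_countries"] = list(preferred_countries)
    if preferred_locations:
        request["preferred_locations"] = list(preferred_locations)
    return request


payload = build_process_request(
    "doc-1",
    "Flooding reported near Lyon, France.",
    lang="en",
    features=["geo", "postal"],
    preferred_countries=["FR"],
)
print(json.dumps(payload, indent=2))
```

Using a tuple as the default for `features` sidesteps the shared-mutable-default pitfall that a list default such as `['geo']` can cause if the function ever modifies it.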