Python: module opensextant.utility


Copyright 2015-2021 The MITRE Corporation.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

@author: ubaldino

OpenSextant utilities

Modules

csv
os
re

Classes



builtins.object

ConfigUtility

class ConfigUtility(builtins.object)

    ConfigUtility(config=None, rootdir='.')

    A utility to load parameter lists, CSV files, word lists, etc. from a folder *dir*.

    Functions here take an Oxygen cfg parameter keyword or a file path. If the keyword is valid and points to a valid file path, then the file path is used. In other words, keywords are aliases for a file on disk.

    Ex. 'mywords' = './cfg/mywords_v03_filtered.txt'

    The oxygen.cfg file would have this mapping. Your code just references 'mywords' to load it.

Methods defined here:

__init__(self, config=None, rootdir='.')
Initialize self.  See help(type(self)) for accurate signature.

loadCSVFile(self, keyword, delim)
Load a named CSV file.  If the name is not a cfg parameter, the keyword name *is* the file.

loadDataFromFile(self, path, delim)
:param path: file path
:param delim: delimiter
:return: Array of tuples.

loadFile(self, keyword)
Load a named word list file. If the name is not a cfg parameter, the keyword name *is* the file.

loadListFromFile(self, path)
Load text data from a file. Returns array of non-comment rows. One non-whitespace row per line.

Data descriptors defined here:

__dict__

dictionary for instance variables (if defined)

__weakref__

list of weak references to the object (if defined)
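
The keyword-as-alias lookup described for ConfigUtility can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the library's implementation: `resolve_path` is a hypothetical helper, and a plain dict stands in for the parsed oxygen.cfg mapping.

```python
import os


def resolve_path(keyword, config=None, rootdir="."):
    """Hypothetical helper: return the file a cfg keyword points to.

    If the keyword is not in the mapping, or the mapped file does not
    exist, the keyword itself is treated as the file path.
    """
    config = config or {}
    candidate = config.get(keyword)
    if candidate:
        # Relative cfg entries are resolved against rootdir (an assumption).
        path = candidate if os.path.isabs(candidate) else os.path.join(rootdir, candidate)
        if os.path.exists(path):
            return path
    return keyword
```

With a mapping such as `{'mywords': 'mywords_v03_filtered.txt'}`, calling `resolve_path('mywords', cfg, rootdir)` returns the full path only when that file is actually present.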

Functions


bytes2unicode(buf, encoding=None)
Convert bytes to Unicode by guessing the character set.
:param buf:
:param encoding:
:return:

ensure_dirs(fpath)
Given a file path, ensure parent folders exist. If the path is intended to be a directory, use os.makedirs(path) instead. May throw an exception; the caller should handle it.
:param fpath: path to a file
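
A minimal sketch of the behavior described for ensure_dirs, assuming it amounts to creating the parent directory of the given file path. The name `ensure_parent_dirs` is hypothetical and the actual implementation may differ.

```python
import os


def ensure_parent_dirs(fpath):
    """Create the parent folders of a file path if they are missing.

    The file itself is not created. OSError may propagate to the caller.
    """
    parent = os.path.dirname(os.path.abspath(fpath))
    os.makedirs(parent, exist_ok=True)
    return parent
```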

fast_replace(t, sep, sub=None)
Replace separators (sep) with substitute char, sub. Many-to-one substitute.

    "a.b, c"   SEP='.,'

:param t: input text
:param sep: string of chars to replace
:param sub: replacement char
:return: text with separators replaced
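
The many-to-one substitution described for fast_replace can be sketched with a translation table. This is an illustration only: `replace_separators` is a hypothetical name, and defaulting `sub` to a single space when it is None is an assumption, since the page does not say what None means.

```python
def replace_separators(t, sep, sub=None):
    """Replace every character listed in sep with the single string sub."""
    if sub is None:
        sub = " "  # assumed default; not confirmed by the documentation
    table = str.maketrans({ch: sub for ch in sep})
    return t.translate(table)
```

For the documented input "a.b, c" with SEP='.,' both the period and the comma are replaced by the same substitute.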

get_bool(token)

get_csv_reader(fh, columns, delim=',')

get_csv_writer(fh, columns, delim=',')

get_list(text, delim=',', lower=False)
Take a string and return trimmed segments given the delimiter:

    "A,  B,        C" => ["A", "B", "C"]

:param text:
:param delim: delimiter str
:param lower: True if you want items lowercased
:return: array
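
A short sketch of the split-and-trim behavior shown in the get_list example. `split_list` is a hypothetical name, and keeping empty segments (rather than dropping them) is an assumption.

```python
def split_list(text, delim=",", lower=False):
    """Split text on delim and strip whitespace around each segment."""
    items = [segment.strip() for segment in text.split(delim)]
    if lower:
        items = [item.lower() for item in items]
    return items
```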

get_number(token)
Turn leading part of a string into a number, if possible.
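
One way to read "leading part of a string" is a numeric prefix matched by a regular expression. The sketch below assumes that reading; `leading_number`, the pattern, and the int-versus-float return rule are all illustrative choices, not the documented behavior of get_number.

```python
import re

# Optional sign, digits, optional fractional part, at the start of the text.
_LEADING_NUMBER = re.compile(r"\s*([-+]?\d+(?:\.\d+)?)")


def leading_number(token):
    """Return the numeric prefix of token as int or float, or None if absent."""
    match = _LEADING_NUMBER.match(str(token))
    if not match:
        return None
    text = match.group(1)
    return float(text) if "." in text else int(text)
```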

get_text(t)
Default is to return Unicode string from raw data

get_text_window(offset, matchlen, textsize, width)
prepreprepre MATCH postpostpost
^            ^   ^            ^
l-width      l   l+len        l+len+width
left_y  left_x   right_x      right_y
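
The four positions in the get_text_window diagram can be computed as below. This is a sketch under assumptions: `text_window` is a hypothetical name, the return order follows the diagram's labels, and clamping to the range 0..textsize is assumed because the signature takes textsize.

```python
def text_window(offset, matchlen, textsize, width):
    """Return (left_y, left_x, right_x, right_y) around a match.

    left_x..right_x spans the match; left_y and right_y extend it by
    width characters on each side, clamped to the text bounds.
    """
    left_x = offset
    left_y = max(0, left_x - width)
    right_x = offset + matchlen
    right_y = min(textsize, right_x + width)
    return left_y, left_x, right_x, right_y


text = "prepreprepre MATCH postpostpost"
left_y, left_x, right_x, right_y = text_window(13, 5, len(text), 4)
print(text[left_y:left_x], "|", text[left_x:right_x], "|", text[right_x:right_y])
```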

guess_encoding(text)
Given bytes, determine the character set encoding.
@return: dict with encoding and confidence

has_arabic(text)
Infer if text has Arabic / Middle-eastern scripts ~ Urdu, Farsi, Arabic.
:param text:
:return:

has_cjk(text)
Infer if Chinese (Unihan), Korean (Hangul) or Japanese (Hiragana) characters are present.
:param text:
:return:
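
Script checks such as has_arabic and has_cjk can be approximated by testing code points against Unicode block ranges. The sketch below is not the library's implementation: the function names are hypothetical, and the ranges cover only the main blocks (Arabic, CJK Unified Ideographs, Hiragana, Katakana, Hangul Syllables), which is a simplifying assumption.

```python
# (start, end) code point ranges, inclusive. Deliberately partial.
ARABIC_RANGES = [(0x0600, 0x06FF)]
CJK_RANGES = [
    (0x4E00, 0x9FFF),  # CJK Unified Ideographs
    (0x3040, 0x309F),  # Hiragana
    (0x30A0, 0x30FF),  # Katakana
    (0xAC00, 0xD7AF),  # Hangul Syllables
]


def _in_ranges(text, ranges):
    """True if any character of text falls inside one of the ranges."""
    return any(lo <= ord(ch) <= hi for ch in text for lo, hi in ranges)


def contains_arabic(text):
    return _in_ranges(text, ARABIC_RANGES)


def contains_cjk(text):
    return _in_ranges(text, CJK_RANGES)
```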

has_digit(text)
Used primarily to report places and appears to be critical for name filtering when doing phonetics.

is_abbreviation(nm: str)
Determine if something is an abbreviation. Otherwise if text ends with "." we'll conclude so.

Examples:
    Ala.     YES
    Ala      NO
    S. Bob   NO   -- abbreviated, yes, but this is more like a contraction.
    S. B.    YES

:param nm: textual name
:return: True if obj is inferred to be an abbreviation
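
The four documented examples are all consistent with a trailing-period rule. The sketch below implements only that rule; `ends_like_abbreviation` is a hypothetical name, and the real is_abbreviation may apply additional checks that this page does not describe.

```python
def ends_like_abbreviation(nm):
    """Return True when the trimmed name ends with a period."""
    return nm.strip().endswith(".")


for name in ("Ala.", "Ala", "S. Bob", "S. B."):
    print(name, "YES" if ends_like_abbreviation(name) else "NO")
```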

is_ascii(s)

is_code(t: str, nlen=6)
Test if a string is an ASCII code, typically 1-3 chars in length.
:param t: text
:param nlen: threshold for string len
:return:

is_text(t)

is_upper_text(t, threadshold=0.9)

is_value(v)
Working more with pandas or sci libraries -- you run into various types of default "Null" values. This checks to see if the value is non-trivial, non-empty.
:param v:
:return:

isnan(x, /)
Return True if x is a NaN (not a number), and False otherwise.

levenshtein_distance(s, t)
Wikipedia page on Levenshtein Edit Distance: https://en.wikipedia.org/wiki/Levenshtein_distance
This is the fastest, simplest of 3 methods documented for Python.
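
For reference, a common iterative form of Levenshtein distance keeps only two rows of the dynamic-programming table. The sketch below is a standard textbook version under the hypothetical name `edit_distance`; it is not necessarily the variant this module uses.

```python
def edit_distance(s, t):
    """Levenshtein distance between s and t using two rolling rows."""
    if s == t:
        return 0
    if not s:
        return len(t)
    if not t:
        return len(s)
    prev = list(range(len(t) + 1))  # distances from the empty prefix of s
    for i, s_char in enumerate(s, start=1):
        curr = [i]
        for j, t_char in enumerate(t, start=1):
            cost = 0 if s_char == t_char else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]
```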

load_datafile(path, delim)
:param path: file path
:param delim: delimiter
:return: Array of tuples.

load_list(path, lower=False)
Load text data from a file. Returns array of non-comment rows. One non-whitespace row per line.
:param path: file to load.
:param lower: Lowercased is optional.
:return: array of terms
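
A minimal sketch of the word-list loading described for load_list. `read_terms` is a hypothetical name, and treating lines that start with "#" as comments and reading the file as UTF-8 are both assumptions, since the page does not name the comment marker or the encoding.

```python
def read_terms(path, lower=False):
    """Return non-blank, non-comment lines of a text file, trimmed."""
    terms = []
    with open(path, "r", encoding="utf-8") as fh:
        for line in fh:
            term = line.strip()
            if not term or term.startswith("#"):  # assumed comment marker
                continue
            terms.append(term.lower() if lower else term)
    return terms
```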

measure_case(t)
:param t: text
:return: tuple: counts of UPPER, lower, Alpha, Non-Alpha, WS
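
The five counts returned by measure_case could be gathered in one pass over the text. This sketch uses the hypothetical name `count_case` and assumes whitespace is counted separately from other non-alphabetic characters; the library may partition the counts differently.

```python
def count_case(t):
    """Return (upper, lower, alpha, non_alpha, whitespace) counts for t."""
    upper = lower = alpha = non_alpha = ws = 0
    for ch in t:
        if ch.isalpha():
            alpha += 1
            if ch.isupper():
                upper += 1
            elif ch.islower():
                lower += 1
        elif ch.isspace():
            ws += 1
        else:
            non_alpha += 1
    return upper, lower, alpha, non_alpha, ws
```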

parse_float(v)

replace_diacritics(txt: str)
Leverage the OpenSextant traditional ASCII Folding map for now. Yes, encoded("ascii", "ignore") may do this....
:param txt:
:return: a non-diacritic version of the text
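
Map-based folding of the kind replace_diacritics describes can be sketched with a per-character lookup. The table below contains only the LATIN1_FOLDING entries visible on this page; the full map is elided in the source, so `PARTIAL_FOLDING` and `fold_diacritics` are illustrative stand-ins rather than the module's own names.

```python
# Only the entries shown in the Data section; the real map is larger.
PARTIAL_FOLDING = {
    "À": "A", "Á": "A", "Â": "A", "Ã": "A", "Ä": "A", "Å": "A",
    "Æ": "AE", "Ç": "C", "È": "E", "É": "E",
}


def fold_diacritics(txt):
    """Replace each mapped character; leave everything else untouched."""
    return "".join(PARTIAL_FOLDING.get(ch, ch) for ch in txt)
```

Unlike dropping non-ASCII bytes, a lookup table can expand one character into several, as with 'Æ' to 'AE'.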

scrub_eol(t)

squeeze_whitespace(s)

strip_quotes(t)
Run replace_diacritics first -- this routine only attempts to remove normal quotes: ' and "

trivial_bias(name)
Experimental: Determine how unique a name is using length, character set, and number of words.

    Abcd          (4/2 + 1 + 0) x 0.02 = 0.06
    Abcde fghi    (10/2 + 2 + 0) x 0.02 = 0.14
    Abcdé fghi    (10/2 + 2 + 1) x 0.02 = 0.16
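
The three worked examples for trivial_bias fit a score of (length/2 + word count + diacritic count) x 0.02. The sketch below reproduces that arithmetic; the interpretation of each term is inferred from the examples, counting non-ASCII characters as diacritics is an assumption, and `name_bias` is a hypothetical name.

```python
def name_bias(name):
    """Score a name as (len/2 + words + non-ASCII chars) * 0.02."""
    length_term = len(name) / 2          # length includes spaces
    word_term = len(name.split())
    diacritic_term = sum(1 for ch in name if ord(ch) > 127)  # assumed meaning
    return (length_term + word_term + diacritic_term) * 0.02


for example in ("Abcd", "Abcde fghi", "Abcdé fghi"):
    print(example, round(name_bias(example), 2))
```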

Data

BOOL_F_STR = {0, '', 'f', '0', 'n', 'no', ...}
BOOL_T_STR = {1, 'y', 'yes', 't', '1', 'true'}
CHARDET_LATIN2_ENCODING = 'ISO-8859-1'
COMMON_DIACRITC_HASHMARKS = re.compile('["\'`´‘’]')
LATIN1_FOLDING = {'À': 'A', 'Á': 'A', 'Â': 'A', 'Ã': 'A', 'Ä': 'A', 'Å': 'A', 'Æ': 'AE', 'Ç': 'C', 'È': 'E', 'É': 'E', ...}
code_pattern = re.compile('^[A-Z0-9]{1,}$', re.ASCII)
reSqueezeWhiteSpace = re.compile('\\s+', re.MULTILINE)
