| |
- bytes2unicode(buf, encoding=None)
- Convert bytes to Unicode by guessing the character set.
:param buf:
:param encoding:
:return:
- ensure_dirs(fpath)
- Given a file path, ensure parent folders exist.
If path is intended to be a directory -- use os.makedirs(path) instead.
May throw exception -- caller should handle.
:param fpath: path to a file
- fast_replace(t, sep, sub=None)
- Replace each separator character in sep with the substitute char, sub. Many-to-one substitution.
Example input: "a.b, c" with sep='.,'
:param t: input text
:param sep: string of chars to replace
:param sub: replacement char
:return: text with separators replaced
- get_bool(token)
- get_csv_reader(fh, columns, delim=',')
- get_csv_writer(fh, columns, delim=',')
- get_list(text, delim=',', lower=False)
- Take a string, split it on the delimiter, and return the trimmed segments:
"A, B, C" => ["A", "B", "C"]
:param text:
:param delim: delimiter str
:param lower: True if you want items lowercased
:return: array
- get_number(token)
- Turn leading part of a string into a number, if possible.
- get_text(t)
- Default is to return Unicode string from raw data
- get_text_window(offset, matchlen, textsize, width)
- prepreprepre MATCH postpostpost
Window bounds around a match at offset l with length len:
left_y = l-width, left_x = l, right_x = l+len, right_y = l+len+width
- guess_encoding(text)
- Given bytes, determine the character set encoding.
:return: dict with encoding and confidence
- has_arabic(text)
- Infer whether text contains Arabic or related Middle Eastern scripts, e.g., Urdu, Farsi, Arabic.
:param text:
:return:
- has_cjk(text)
- Infer whether Chinese (Unihan), Korean (Hangul) or Japanese (Hiragana) characters are present.
:param text:
:return:
- has_digit(text)
- Used primarily to report places and appears to be critical for
name filtering when doing phonetics.
- is_abbreviation(nm: str)
- Determine if something is an abbreviation.
If the text ends with "." we conclude that it is.
Examples:
Ala. YES
Ala NO
S. Bob NO -- abbreviated, yes, but this is more like a contraction.
S. B. YES
:param nm: textual name
:return: True if obj is inferred to be an abbreviation
- is_ascii(s)
- is_code(t: str, nlen=6)
- Test if a string is an ASCII code typically 1-3 chars in len.
:param t: text
:param nlen: threshold for string len
:return:
- is_text(t)
- is_upper_text(t, threadshold=0.9)
- is_value(v)
- When working with pandas or scientific libraries, you run into various types of default "Null" values.
This checks whether a value is non-trivial and non-empty.
:param v:
:return:
- isnan(x, /)
- Return True if x is a NaN (not a number), and False otherwise.
- levenshtein_distance(s, t)
- Wikipedia page on Levenshtein Edit Distance
https://en.wikipedia.org/wiki/Levenshtein_distance
This is the fastest, simplest of 3 methods documented for Python.
- load_datafile(path, delim)
- :param path: file path
:param delim: delimiter
:return: Array of tuples.
- load_list(path, lower=False)
- Load text data from a file.
Returns array of non-comment rows. One non-whitespace row per line.
:param path: file to load.
:param lower: True if you want terms lowercased (optional)
:return: array of terms
- measure_case(t)
- :param t: text
:return: tuple: counts of UPPER, lower, Alpha, Non-Alpha, WS
- parse_float(v)
- replace_diacritics(txt: str)
- Leverage the OpenSextant traditional ASCII Folding map for now.
Yes, txt.encode("ascii", "ignore") may do this....
:param txt:
:return: a non-diacritic version of the text
- scrub_eol(t)
- squeeze_whitespace(s)
- strip_quotes(t)
- Run replace_diacritics first -- this routine only attempts to remove normal quotes ~ ', "
- trivial_bias(name)
- Experimental: Determine how unique a name is using its length, character set, and number of words.
Abcd (4/2 + 1 + 0) x 0.02 = 0.06
Abcde fghi (10/2 + 2 + 0) x 0.02 = 0.14
Abcdé fghi (10/2 + 2 + 1) x 0.02 = 0.16
|
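
To make a few of the docstrings above concrete, here is a minimal sketch of how fast_replace, get_list, get_text_window and trivial_bias might behave. These are not the library's implementations. The `_sketch` names are hypothetical, and several details are assumptions: that `sub=None` deletes separators, that the text window is clamped to the text bounds and returned in the order (left_y, left_x, right_x, right_y), and that the third term of the bias formula counts non-ASCII characters.

```python
def fast_replace_sketch(t, sep, sub=None):
    """Replace every character of `sep` found in `t` with `sub`.

    Assumption: sub=None removes the separator characters entirely.
    """
    table = {ord(ch): sub for ch in sep}
    return t.translate(table)


def get_list_sketch(text, delim=",", lower=False):
    """Split on `delim` and trim each segment: "A, B, C" => ["A", "B", "C"]."""
    items = [seg.strip() for seg in text.split(delim)]
    return [item.lower() for item in items] if lower else items


def text_window_sketch(offset, matchlen, textsize, width):
    """Return (left_y, left_x, right_x, right_y) around a match.

    Assumption: the outer bounds are clamped to [0, textsize].
    """
    left_x = offset
    right_x = offset + matchlen
    left_y = max(0, left_x - width)
    right_y = min(textsize, right_x + width)
    return left_y, left_x, right_x, right_y


def trivial_bias_sketch(name):
    """Score = (len/2 + word count + non-ASCII char count) x 0.02.

    Assumption: the third term counts characters outside ASCII.
    """
    non_ascii = sum(1 for ch in name if ord(ch) > 127)
    words = len(name.split())
    return (len(name) / 2 + words + non_ascii) * 0.02


if __name__ == "__main__":
    text = "prepreprepre MATCH postpostpost"
    left_y, left_x, right_x, right_y = text_window_sketch(13, 5, len(text), 4)
    # Slices give the left context, the match, and the right context.
    print(repr(text[left_y:left_x]), repr(text[left_x:right_x]), repr(text[right_x:right_y]))
    print(fast_replace_sketch("a.b, c", ".,", " "))
    print(get_list_sketch("A, B, C", lower=True))
    print(round(trivial_bias_sketch("Abcd\u00e9 fghi"), 2))
```

The window sketch returns plain offsets rather than substrings so that callers can slice the original text themselves, which matches the four-position layout in the get_text_window docstring.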