A record is a representation of the raw original data. It carries a
processing date (also known as the ingest date), metadata, a source ID,
a record ID, and a payload.
Conventions:
Record identity is very important in the context of a full pipeline. If the
data already carries an identity, use it and maintain it consistently.
MD5 digests and UUIDs, for example, have often been used to create compact
identifiers. If the ID is left null, database systems can assign object IDs,
but transactional web services are not typically responsible for generating
missing identifiers. The lesson is that identifiers should not be ignored.
If for nothing else, a record ID is practical for debugging and logging
during record processing.
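As a rough sketch of the identity convention above, the helper below prefers
a given identity, falls back to an MD5 digest of the payload, and only then
generates a UUID. The function name make_record_id and its parameters are
hypothetical, chosen for illustration; they are not part of the Record API.

```python
import hashlib
import uuid


def make_record_id(source_id, given_id=None, payload=None):
    """Return a compact record ID (hypothetical helper for illustration)."""
    if given_id:
        # Prefer the identity the data already carries, namespaced by source.
        return f"{source_id}:{given_id}"
    if payload is not None:
        # Deterministic: the same source and payload always give the same ID.
        data = f"{source_id}\n{payload}".encode("utf-8")
        return hashlib.md5(data).hexdigest()
    # Last resort: a random UUID, unique but not reproducible across runs.
    return uuid.uuid4().hex
```

A digest-based ID lets a re-ingest of the same data map back to the same
record, which a random UUID cannot do.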
The metadata "attributes" are considered optional, but are usually helpful.
Record the raw metadata as-is when you can.
If metadata attributes can be conditioned or normalized easily, do that;
e.g., tag data with an ISO2 (ISO 3166-1 alpha-2) country code rather than
with a country name or FIPS code.
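One way to picture this kind of normalization is sketched below. The
attribute keys (country_name, country_fips, country_iso2) and the tiny lookup
tables are assumptions made for this example only; a real pipeline would load
a complete country code table.

```python
# Illustrative lookups only; not a complete country table.
ISO2_BY_NAME = {"germany": "DE", "japan": "JP", "united states": "US"}
# FIPS 10-4 codes can differ from ISO codes, e.g., Germany is GM, Japan is JA.
ISO2_BY_FIPS = {"GM": "DE", "JA": "JP", "US": "US"}


def normalize_country(attrs):
    """Return a copy of attrs with a 'country_iso2' tag added when resolvable."""
    out = dict(attrs)  # the raw attributes are preserved as-is
    name = str(attrs.get("country_name", "")).strip().lower()
    fips = str(attrs.get("country_fips", "")).strip().upper()
    iso2 = ISO2_BY_NAME.get(name) or ISO2_BY_FIPS.get(fips)
    if iso2:
        out["country_iso2"] = iso2
    return out
```

The raw values stay in place next to the normalized tag, so nothing from the
original metadata is lost.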
The proc_date is usually determined at ingest time. It makes a good shard
key for balancing the load of records across distributed storage or
databases.
Record "value" vs. "content": content is intended to capture the textual
content of files, since trying to store raw binary content quickly leads to
performance problems. For file-based sources (file system or folder crawls,
web crawls, etc.), content stores a compressed, UTF-8 encoded byte array,
and the record value is the file path to the original. For non-file-based
records, however, it may make more sense to use .value to record the most
obvious innate value.
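The split between value and content might look like the sketch below. The
text does not say which compression is used, so zlib is an assumption here,
and the dictionary layout, file path, and helper names are hypothetical.

```python
import zlib


def pack_content(text):
    """Encode text as UTF-8 and compress it (zlib is an assumed choice)."""
    return zlib.compress(text.encode("utf-8"))


def unpack_content(blob):
    """Reverse pack_content: decompress, then decode UTF-8."""
    return zlib.decompress(blob).decode("utf-8")


# File-based record: value points at the original, content holds its text.
file_record = {
    "value": "/data/crawl/report.txt",  # hypothetical path
    "content": pack_content("Extracted text of the report."),
}

# Non-file record: value carries the most obvious innate value directly.
row_record = {"value": "42.5", "content": None}
```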
A processing date/time key that has as much resolution as you need. This is
a string because a lexical sort is likely easier to manage than an actual
date/time field with date/time math.
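A minimal sketch of such a key follows. The exact format is not specified
here, so the fixed-width UTC patterns below are an assumption, and
proc_date_key is a hypothetical helper. Any zero-padded, most-significant-first
layout sorts lexically in chronological order.

```python
from datetime import datetime, timezone

# Assumed formats; each is fixed-width so string order matches time order.
FORMATS = {"day": "%Y%m%d", "hour": "%Y%m%d%H", "second": "%Y%m%d%H%M%S"}


def proc_date_key(when, resolution="day"):
    """Format a timezone-aware datetime as a sortable UTC string key."""
    return when.astimezone(timezone.utc).strftime(FORMATS[resolution])


earlier = proc_date_key(datetime(2023, 9, 5, 14, 30, tzinfo=timezone.utc))
later = proc_date_key(datetime(2023, 10, 1, tzinfo=timezone.utc))
# Plain string comparison is enough; no date/time math is needed.
in_order = earlier < later
```

Range queries then reduce to string comparisons or prefix matches on the key.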