@dtkav
Contributor

@dtkav dtkav commented Mar 20, 2019

This API relies on my fork of the datalake-common library (datalake-common-dtkav):
planetlabs/datalake-common@master...dtkav:dist

The goal of the Files API is to:

  1. Provide a centralized, searchable place to store file metadata of all kinds, especially logs, indexed along the time axis. (DatalakeFile model)
  2. Provide a content-addressed backend (S3) that de-duplicates and compresses file contents.
  3. Provide a managed set of allowed 'what's that an admin can provision without deploying MC. This is done via the What model.
  4. Provide a mechanism for relating files to MC objects. This is done via the RelatedFile model and the work-id field on the DatalakeFile model.
  5. Store metadata alongside the data so the database can be repopulated after a catastrophic failure.

Design:
DatalakeFile objects are just metadata, and each contains a pointer to the content-id (cid) of a file.
The content-id is: multibase('base32', multihash('blake2b-16', blake2b(file_contents, digest_size=16)))
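
As a rough illustration of the content-id formula above, here is a minimal Python sketch using only the standard library. It assumes that the multihash function code for a 16-byte BLAKE2b digest is 0xb210 and that 'base32' means multibase's lowercase, unpadded variant (prefix "b"). Neither assumption is confirmed by this PR, the helper names are hypothetical, and the datalake-common fork may produce different bytes.

```python
import base64
import hashlib

# Assumption: multicodec code for a 16-byte (128-bit) BLAKE2b digest.
BLAKE2B_16_BYTE_CODE = 0xB210


def encode_varint(n: int) -> bytes:
    """Encode a non-negative integer as an unsigned LEB128 varint."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def content_id(file_contents: bytes) -> str:
    """Hypothetical helper: multibase(base32, multihash(blake2b, 16 bytes))."""
    digest = hashlib.blake2b(file_contents, digest_size=16).digest()
    # multihash layout: <varint function code><varint digest length><digest>
    multihash = (
        encode_varint(BLAKE2B_16_BYTE_CODE) + encode_varint(len(digest)) + digest
    )
    b32 = base64.b32encode(multihash).decode("ascii").lower().rstrip("=")
    # Assumption: "b" is the multibase prefix for lowercase, unpadded base32.
    return "b" + b32


if __name__ == "__main__":
    print(content_id(b"example log line\n"))
```

Because the digest depends only on the bytes, two uploads with identical contents map to the same cid, which is what makes de-duplication in the backend possible.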

The work-id field can be any string (so users can include their own prefixes).
However, there is a special set of work-id prefixes that correspond to objects in mission control.
Adding files with these prefixes will make the files available from those objects directly.
A work-id looks like this: mc-<model_name>.<uuid>.
Alternatively, a user running jobs with Jenkins might use something like: jenkins-job.<id>. These files would not be related to any objects in mission control, but users in the ecosystem can still use the datalake to store all of their files in one place.
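
The work-id convention above could be recognized with a helper along these lines. This is only a sketch: parse_mc_work_id is a hypothetical name, and it assumes the model name contains no dot and that the uuid is in a form Python's uuid.UUID accepts. How the PR actually resolves these prefixes is not shown in this conversation.

```python
import uuid
from typing import Optional, Tuple

MC_PREFIX = "mc-"


def parse_mc_work_id(work_id: str) -> Optional[Tuple[str, uuid.UUID]]:
    """Return (model_name, uuid) for 'mc-<model_name>.<uuid>', else None."""
    if not work_id.startswith(MC_PREFIX):
        return None  # e.g. 'jenkins-job.<id>': a valid work-id, but not an MC object
    model_name, sep, raw_uuid = work_id[len(MC_PREFIX):].partition(".")
    if not sep or not model_name:
        return None
    try:
        return model_name, uuid.UUID(raw_uuid)
    except ValueError:
        # The part after the dot is not a well-formed UUID.
        return None
```

Any work-id for which this returns None would still be stored as a plain string; it simply would not be linked to a mission control object.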

@dtkav
Contributor Author

dtkav commented Mar 20, 2019

This is at the point where it could use a proper review. I learned a lot of lessons along the way.
The main changes: I've set version back to the metadata version, because versioning is done by time (the latest file in time is the latest version).
Also, the CID is no longer unique, since multiple metadata records can happily reference the same file contents.

@dtkav force-pushed the files branch 8 times, most recently from 522d143 to 65667f2 (March 29, 2019 06:44)
@dtkav changed the title from "WIP: Files API" to "Files API" (Mar 29, 2019)
@dtkav
Contributor Author

dtkav commented Mar 29, 2019

I just noticed the files api uses start and end, whereas the rest of the API uses start_time and end_time. :/

    elif isinstance(o, datetime.timedelta):
        return duration_iso_string(o)
    elif isinstance(o, (decimal.Decimal, uuid.UUID, Promise)):
        return str(o)
Member

str of a Promise doesn't sound right?
Also Promise doesn't seem to be defined anywhere?

return data


class ISODateTimeField(models.DateTimeField):
Member

I feel there must be a better way to do this than copy-pasting this class across Django apps.

    data = {}
    for f in chain(opts.concrete_fields, opts.private_fields, opts.many_to_many):
-        if f.name is 'id':
+        if f.name == 'id':
Member

Good catch :\
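
For context on the fix in this thread: in Python, `is` compares object identity while `==` compares values, so a check like `f.name is 'id'` only succeeds when both strings happen to be the same object. A small standalone illustration (not code from this PR; the variable names are made up):

```python
literal = "id"
# Build an equal string at runtime so it is a separate object from the literal.
built = "".join(["i", "d"])

same_value = built == literal   # value comparison: True
same_object = built is literal  # identity comparison: depends on the interpreter

print(same_value, same_object)
```

The value comparison is always true here, while the identity result is an implementation detail, which is why `==` is the right operator for matching the field name.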

/datalake/admin/whats/:
  get:
    security:
      - jwt: ['admins']
Member

I'm unclear what this part of the tag will do?
What defines who an 'admin' is?

Member

admins doesn't appear to be defined anywhere else?

2 participants