file processing application design

Erwan Colin edited this page Aug 27, 2019 · 7 revisions

File processing application design

The file processing application is designed to be extensible. It can be extended with native actions, which rely on native packages managed by the language-specific package manager (Cargo for Rust, npm for JavaScript, pip for Python...). These packages can be local or hosted on the package manager's registry. The application can also be extended with Docker actions, which are based on Docker containers. In the same way, the Docker image can be local or hosted on a Docker registry.

Structure of the code

Root folder

fpa/
|__ bin/
|__ src/
|   |__ commands/
|   |   |__ plugin/
|   |   |   |__ add.js
|   |   |
|   |   |__ consume.js
|   |   |__ install.js
|   |
|   |__ actioner.js
|   |__ index.js
|   |__ ta-util.js
|
|__ tests/
|__ tmp/
|__ actions.json
|__ .env

bin/

This folder contains the executable of the command line interface.

src/

The sources of the application

index.js

The entrypoint of the command line application

ta-util.js

A set of functions reused across the code base of the application

actioner.js
  • The actioner takes as an argument the payload of the task given by the consumer.
  • The actioner then creates a temporary work folder inside tmp/ (tmp/xxx/) and two sub-folders: tmp/xxx/output/ and tmp/xxx/input/
  • The actioner then gets the file(s) of the payload from the file storage service and puts them inside tmp/xxx/input/
  • The actioner then uses the action-list.conf file to get the type of the action (native or docker).
  • If the action is a native action:
    • The actioner imports the corresponding library and runs it, passing it the location of the temporary work folder tmp/xxx/ and the arguments of the action.
  • If the action is a docker action:
    • The actioner launches the container named after the action.
    • The tmp/xxx/ folder is mounted as a volume with the binding: /path/to/tmp/xxx:/app/files/
    • The arguments of the command run by the launched container are the arguments of the action.
  • For both types of action:
    • The action writes the resulting files inside the folder tmp/xxx/output, which is bound to /app/files/output inside a container.
    • The action reads the tmp/xxx/results.json file (bound to /app/files/results.json for a docker action) to learn the naming convention of the output files enforced by the task payload. The task payload may enforce no file naming. In that case, since the number of files resulting from the action cannot be known in advance, the actioner names the resulting files randomly.
    • The action writes the metadata resulting from the action into the tmp/xxx/output/metadata.json file (bound to /app/files/output/metadata.json for a docker action).
    • The action writes a live report of the process into the tmp/xxx/output/status.json file (bound to /app/files/output/status.json for a docker action).
  • The actioner reads the callback from the action.
  • The actioner gives feedback to the MMF API.
  • The actioner then pushes the resulting files to the file storage service.
  • The actioner then gives feedback to the MMF API concerning the location of these files on the file storage service and the metadata created by the action.
  • Finally, the actioner cleans the temporary work folder.
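
The workspace handling and the Docker volume binding described above can be sketched as follows. This is only an illustration, not the actual actioner.js code: the function names, the `tmpRoot` parameter, the `task-` folder prefix and the `--rm` flag are assumptions, and the image is assumed to carry the same name as the action.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Create tmp/xxx/ together with its input/ and output/ sub-folders.
// `tmpRoot` is a hypothetical parameter standing for the tmp/ folder.
function createWorkspace(tmpRoot) {
  fs.mkdirSync(tmpRoot, { recursive: true });
  const workspace = fs.mkdtempSync(path.join(tmpRoot, 'task-'));
  fs.mkdirSync(path.join(workspace, 'input'));
  fs.mkdirSync(path.join(workspace, 'output'));
  return workspace;
}

// Build the argument list for `docker run`, binding the workspace
// to /app/files/ and forwarding the action arguments to the container.
// Assumption: the image name equals the action name.
function buildDockerRunArgs(actionName, workspace, actionArgs) {
  return [
    'run', '--rm',
    '--volume', `${path.resolve(workspace)}:/app/files/`,
    actionName,
    ...actionArgs,
  ];
}

// Remove the temporary work folder once the results have been pushed.
function cleanWorkspace(workspace) {
  fs.rmSync(workspace, { recursive: true, force: true });
}
```

The argument list could then be handed to `child_process.spawn('docker', args)`. Note that `fs.rmSync` requires Node 14.14 or later.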

commands/

The folder containing the available commands of the command line interface

plugin/

The folder contains the commands used for plugin management

add.js

This is the part of the code used to add a plugin to the task-actioner

consume.js

This is the code for the command that creates an AMQP consumer, which gets a task from a queue and transfers the payload of the task to the actioner. The payload of the task is a JSON document containing the name of the action, the files to be processed and a list of arguments for the action.

install.js

This is the code for the command that installs plugins based on the actions.json file.

tests/

The tests folder

.env

This file contains the URLs of the file storage service API and the MMF platform API, the secrets associated with these APIs, the RabbitMQ connection settings, and the UID, GID and name of the user that runs actions inside Docker containers.

MMF_API_BASE_URL=
MMF_API_SECRET_KEY=
FILE_STORAGE_HOST=
FILE_STORAGE_PORT=
FILE_STORAGE_USE_SSL=
FILE_STORAGE_SECRET_KEY=
FILE_STORAGE_ACCESS_KEY=
RABBITMQ_HOST=
RABBITMQ_PORT=
RABBITMQ_USER=
RABBITMQ_PASSWORD=
UID=
GID=
UNAME=
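
As an illustration, these variables could be gathered into a single configuration object. This is a sketch under assumptions: the `loadConfig` name, the shape of the returned object, the choice of which variables are mandatory, and the `'true'`/`'false'` encoding of FILE_STORAGE_USE_SSL are not taken from the application.

```javascript
// Build a configuration object from an environment-like object
// (for example process.env once the .env file has been loaded).
function loadConfig(env) {
  // Assumption: these three variables are treated as mandatory.
  const required = ['MMF_API_BASE_URL', 'FILE_STORAGE_HOST', 'RABBITMQ_HOST'];
  const missing = required.filter((name) => !env[name]);
  if (missing.length > 0) {
    throw new Error(`Missing environment variables: ${missing.join(', ')}`);
  }
  return {
    mmf: {
      baseUrl: env.MMF_API_BASE_URL,
      secretKey: env.MMF_API_SECRET_KEY,
    },
    storage: {
      host: env.FILE_STORAGE_HOST,
      port: Number(env.FILE_STORAGE_PORT),
      // Assumption: the flag is written as the string 'true' or 'false'.
      useSSL: env.FILE_STORAGE_USE_SSL === 'true',
      accessKey: env.FILE_STORAGE_ACCESS_KEY,
      secretKey: env.FILE_STORAGE_SECRET_KEY,
    },
    amqp: {
      host: env.RABBITMQ_HOST,
      port: Number(env.RABBITMQ_PORT),
      user: env.RABBITMQ_USER,
      password: env.RABBITMQ_PASSWORD,
    },
    // User running the action inside a Docker container.
    user: { uid: env.UID, gid: env.GID, name: env.UNAME },
  };
}
```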

actions

An action is either a Docker action or a native action.

Native action

It contains the library of the action, written in the native language of the task actioner. It must be a Node module whose main function is exported as run(), which takes as arguments the arguments of the action contained in the task payload and the workspace of the action.
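
A minimal native action could look like the sketch below. The action itself (upper-casing text files) is hypothetical, and the argument order `run(args, workspace)` as well as the synchronous style are assumptions based on the description above.

```javascript
const fs = require('fs');
const path = require('path');

// Hypothetical native action: upper-cases every file found in
// <workspace>/input/ and writes the result, under the same name,
// into <workspace>/output/. `args` is unused in this example.
function run(args, workspace) {
  const inputDir = path.join(workspace, 'input');
  const outputDir = path.join(workspace, 'output');
  for (const name of fs.readdirSync(inputDir)) {
    const content = fs.readFileSync(path.join(inputDir, name), 'utf8');
    fs.writeFileSync(path.join(outputDir, name), content.toUpperCase());
  }
}

module.exports = { run };
```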

Docker action

It contains:

  • a README
  • a Dockerfile

README

The README describes the different files the action can output and how to access them following the output files convention.

Dockerfile

The dockerfile of the docker image

It must respect this template:

FROM <parent_image>

ARG UNAME=worker
ARG UID=1000
ARG GID=1000

# Keep only ONE of the two RUN lines below, depending on the parent image.

# For a classic parent image
RUN groupadd --gid $GID $UNAME && useradd --gid $GID --uid $UID $UNAME

# For an Alpine Linux parent image
RUN addgroup -g $GID -S $UNAME && adduser -u $UID -S $UNAME -G $UNAME

WORKDIR /app

# Do everything you need to be done

# RUN something
# COPY something
# ...

RUN chown -R $UNAME:$UNAME /app

USER $UNAME

ENTRYPOINT ["my_entrypoint"]
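
Since the template declares UNAME, UID and GID as build arguments, the values from the .env file could be passed when building the image. The sketch below only builds the argument list for `docker build`. The function name, the `user` object shape and the idea that the task actioner builds images this way are assumptions.

```javascript
// Build the argument list for `docker build`, overriding the ARG
// defaults of the Dockerfile template with the configured user.
function buildDockerBuildArgs(imageTag, contextDir, user) {
  return [
    'build',
    '--build-arg', `UNAME=${user.name}`,
    '--build-arg', `UID=${user.uid}`,
    '--build-arg', `GID=${user.gid}`,
    '--tag', imageTag,
    contextDir,
  ];
}
```

For example, `buildDockerBuildArgs('action-a', './action-a', { name: 'worker', uid: 1000, gid: 1000 })` yields the arguments of `docker build --build-arg UNAME=worker --build-arg UID=1000 --build-arg GID=1000 --tag action-a ./action-a`.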

Workflow

File processing workflow

Data structure

Task payload

A JSON string

{
    "id": 12315,
    "action": "action-a",
    "inputFiles": [
        {
            "bucketName": "my bucket",
            "objectName": "my object"
        },
        {
            "bucketName": "my other or same bucket",
            "objectName": "my other object"
        },
        ...
    ],
    "gpu": true,
    "args": [
        "arg1",
        "arg2",
        ...
    ],
    "outputFiles": {
        "key1": {
            "location": "my/location",
            "name": "my_name"
        },
        "key2": {
            "location": "my/location",
            "name": "my_name"
        },
        ...
    },
    "s3Location": {
        "bucketName": "my bucket",
        "keyPrefix": "my prefix"
    }
}
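
Before handing the payload to the actioner, the consumer could check it against this structure. The following is a sketch, not the application's code: `parseTaskPayload` is a hypothetical helper, and treating `args` and `outputFiles` as optional is an assumption.

```javascript
// Parse the JSON string of a task and check the fields the
// actioner relies on. Throws an Error when a field is invalid.
function parseTaskPayload(raw) {
  const payload = JSON.parse(raw);
  if (typeof payload.action !== 'string' || payload.action === '') {
    throw new Error('Invalid payload: "action" must be a non-empty string');
  }
  if (!Array.isArray(payload.inputFiles)) {
    throw new Error('Invalid payload: "inputFiles" must be an array');
  }
  for (const file of payload.inputFiles) {
    for (const field of ['bucketName', 'objectName']) {
      if (typeof file[field] !== 'string') {
        throw new Error(`Invalid payload: input file without "${field}"`);
      }
    }
  }
  // Assumption: "args" and "outputFiles" may be omitted.
  return {
    ...payload,
    args: payload.args || [],
    outputFiles: payload.outputFiles || {},
  };
}
```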

results.json

{
    "key1": {
        "location": "my_location",
        "name": "my_name"
    },
    "key2": {
        "location": "my/location",
        "name": "my_name"
    },
    ...
}
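
An action could use this file to decide where to write each output, falling back to a random name when no naming is enforced. The sketch below assumes that `location` is a folder path and `name` a file name inside it. That reading, like the `resolveOutputPath` helper itself, is an assumption.

```javascript
const crypto = require('crypto');
const path = require('path');

// Return the path, relative to the output folder, of the file
// identified by `key` in the parsed results.json content.
function resolveOutputPath(results, key) {
  const entry = results[key];
  if (entry && entry.name) {
    return path.posix.join(entry.location || '', entry.name);
  }
  // No naming enforced for this key: use a random 16-character hex name.
  return crypto.randomBytes(8).toString('hex');
}
```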

status.json

This file is used to give a live short report of the process.

{
    "step_name": {
        "status": "done" or "in progress",
        "progress": percentage representing the completion of the step
    },
    "step_name": {
        "status": "done" or "in progress",
        "progress": percentage representing the completion of the step
    },
    ...
}
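
A native action could keep this file up to date with a small helper such as the one sketched below. The `reportStep` name is hypothetical, and the progress is assumed to be a number between 0 and 100, with 100 meaning the step is done.

```javascript
const fs = require('fs');
const path = require('path');

// Record the progress of one step in <outputDir>/status.json,
// keeping the entries already written for the other steps.
function reportStep(outputDir, stepName, progress) {
  const statusFile = path.join(outputDir, 'status.json');
  const status = fs.existsSync(statusFile)
    ? JSON.parse(fs.readFileSync(statusFile, 'utf8'))
    : {};
  status[stepName] = {
    status: progress >= 100 ? 'done' : 'in progress',
    progress,
  };
  fs.writeFileSync(statusFile, JSON.stringify(status, null, 2));
  return status;
}
```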