
Setting up ML backend for Label Studio

Requirements

A prerequisite for setting up an ML backend for Label Studio is Docker Compose, which in turn requires Docker Engine.

See Installing Docker Compose and Installing Docker Engine

(Thankfully the standard way to install Engine also installs Compose)

Deploying an example backend

To set up the example used in the repo's docs, run the following

git clone https://github.com/heartexlabs/label-studio-ml-backend
cd label-studio-ml-backend/label_studio_ml/examples/simple_text_classifier
docker compose up

If you installed via a different route, you may have docker-compose as the command instead.

Here's the project structure:

label-studio-ml-backend/label_studio_ml/examples/simple_text_classifier $ tree .

.
├── data
│   ├── redis
│   └── server
│       └── models
├── docker-compose.yml
├── Dockerfile
├── logs
├── README.md
├── requirements.txt
├── simple_text_classifier.py
└── _wsgi.py

5 directories, 6 files

There's a data directory with a subdirectory for each of the services, redis and server (with a subdirectory models), and a logs directory.

  • In fact these are all created when the service starts: notice they're not checked into the repo

Compose specification

The docker-compose.yml Compose specification file specifies the service names and where to mount these directories, and the port number for the server service:

  • The redis service:
    • mounts ./data/redis as /data
    • also names its container and hostname redis
  • The server service:
    • mounts ./data/server as /data
    • mounts ./logs as /tmp
    • sets environment variables including MODEL_DIR as /data/models and an API key
    • defines a network link to the container in the redis service, thereby determining the order of service startup.
    • specifies that it depends_on the redis service, again determining order of service startup and shutdown.
version: "3.8"

services:
  redis:
    image: redis:alpine
    container_name: redis
    hostname: redis
    volumes:
      - "./data/redis:/data"
    expose:
      - 6379
  server:
    container_name: server
    build: .
    environment:
      - MODEL_DIR=/data/models
      - RQ_QUEUE_NAME=default
      - REDIS_HOST=redis
      - REDIS_PORT=6379
      - LABEL_STUDIO_ML_BACKEND_V2=true
      - LABEL_STUDIO_HOSTNAME=http://localhost:8000
      - LABEL_STUDIO_API_KEY=d6f8a2622d39e9d89ff0dfef1a80ad877f4ee9e3
    ports:
      - "9090:9090"
    depends_on:
      - redis
    links:
      - redis
    volumes:
      - "./data/server:/data"
      - "./logs:/tmp"
  • The LABEL_STUDIO_API_KEY is specific to this backend (see grep.app)
  • The YAML specifies 2 services:
    • one named redis (which is spun up from the redis:alpine image and binds the ./data/redis)
    • one named server (which depends on the one named redis)

WSGI and model modules

The obvious question here is how ./data/server/models got made (see the directory tree above). Since /data/models is passed in as an environment variable MODEL_DIR, it seems obvious that either _wsgi.py or simple_text_classifier.py boots up when the service starts and touches it.

_wsgi.py

A 126 line module that looks in its directory for a config.json (note: not used) and exposes a CLI parser that reads a kwarg config of parameters as well as port (from the env. var.), host, debug flag, log level, and model dir (defaulting to the file's dir). When run from the command line it initialises the app via label_studio_ml.api.init_app and runs it; if not being run from the command line, it goes straight to initialising the app with the args from the environment but does not run it.

Note: the app is initialised by a module-level singleton LabelStudioMLManager instance, _manager, passing the model_class as the SimpleTextClassifier class imported from the simple_text_classifier.py module. The _server object returned here as app is a Flask app, again a module-level singleton instance. Importantly, the SimpleTextClassifier class gets bound to the model_class of the LabelStudioMLManager and ends up getting bound inside the _current_model attribute (a dict).
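
In condensed form, the wiring looks something like the sketch below (from memory, so the exact init_app kwargs may differ slightly from the real _wsgi.py):

import os

from label_studio_ml.api import init_app
from simple_text_classifier import SimpleTextClassifier

# Condensed sketch of the wiring in _wsgi.py (kwargs abridged): bind the model
# class to the manager and get back the module-level Flask app singleton.
app = init_app(
    model_class=SimpleTextClassifier,
    model_dir=os.environ.get("MODEL_DIR", os.path.dirname(__file__)),
)

if __name__ == "__main__":
    # the CLI path also parses host/port/debug/log-level arguments before this
    app.run(host="0.0.0.0", port=9090)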

simple_text_classifier.py

A 161 line module that ensures an API key is set as an env. var, then defines the SimpleTextClassifier class which subclasses LabelStudioMLBase and uses some imported sklearn helpers (LogisticRegression, TfidfVectorizer, make_pipeline). The class has only a few methods:

  • __init__:

    • checks self.parsed_label_config is [a dict] of length 1 (this comes from the base class), and sets self.name and self.info from its key/value.
      • The base class sets this attribute from parse_config(self.label_config) if self.label_config else {}.
      • The base class's __init__ signature is self, label_config=None, train_output=None, **kwargs. My understanding here is that the Flask app sends a POST request that includes the config of the project, among which is the labelling config set up in the UI, and that if this isn't text classification then initialising this backend will fail.
    • checks that the config's type (now in self.info) value is "Choices" (i.e. it's for classification).
    • checks that the config's to_name and inputs are length 1 (i.e. the model has just 1 input), and that the type of the input is "Text"
    • sets self.to_name from the config's to_name
    • sets self.value from the config's first and only inputs value
  • reset_model creates the following simple 2-step model:

    self.model = make_pipeline(
        TfidfVectorizer(ngram_range=(1, 3), token_pattern=r"(?u)\b\w\w+\b|\w"),
        LogisticRegression(C=10, verbose=True)
    )
  • predict gets the input text from the data of each of the tasks (passed in as an argument), runs self.predict_proba() on them, gets argmax indices of the predicted labels and scores, then zips the labels against the scores in a dict that get listed for all the tasks.

    for idx, score in zip(predicted_label_indices, predicted_scores):
        predicted_label = self.labels[idx]
        # prediction result for the single task
        result = [{
            'from_name': self.from_name,
            'to_name': self.to_name,
            'type': 'choices',
            'value': {'choices': [predicted_label]}
        }]
    
        # expand predictions with their scores for all tasks
        predictions.append({'result': result, 'score': score})
  • _get_annotated_dataset takes a project_id and is used to support webhook-based training workflows (in fit). It is marked "just for demo purposes", and uses the API key to authenticate a request to localhost/api/projects/{project_id}/export to retrieve data annotations for the project.

    • N.B. this is why the API key is needed.
  • fit that takes annotations (which gets renamed to tasks) and a workdir (not used if MODEL_DIR env. var. is set), builds a list of input_texts from accessing ["data"].get(self.value) of each of the tasks, calls reset_model then self.model.fit with the input_texts and output_labels_idx, and returns a dict of labels (the sorted set of output labels coerced to list) and model_file (the pickled model).

    • The optional kwarg data can be used to override annotations, as tasks = _get_annotated_dataset(data["project"]["id"]).

Testing the service

The README for this backend says to run curl http://localhost:9090/health to check the service is running OK, and indeed it returns some JSON:

{"model_dir":"/data/models","status":"UP","v2":"true"}

If we search through the repo for the word health we find that this is a Flask route defined in label_studio_ml/api.py

@_server.route('/health', methods=['GET'])
@_server.route('/', methods=['GET'])
@exception_handler
def health():
    return jsonify({
        'status': 'UP',
        'model_dir': _manager.model_dir,
        'v2': os.getenv('LABEL_STUDIO_ML_BACKEND_V2', default=LABEL_STUDIO_ML_BACKEND_V2_DEFAULT)
    })

Reading this also tells us that we can curl http://localhost:9090/ (the / route) and get the same output. These are bound to our deployed app because when we initialised it we got the _server module-level singleton object defined in this api.py module.

Another thing to notice here about this route funcdef is that it takes no arguments (pretty standard for a GET request), and only uses environment variables and the _manager (module-level global variable).

Compare this to the POST request route for _predict:

@_server.route('/predict', methods=['POST'])
@exception_handler
def _predict():
    data = request.json
    tasks = data.get('tasks')
    project = data.get('project')
    label_config = data.get('label_config')
    force_reload = data.get('force_reload', False)
    try_fetch = data.get('try_fetch', True)
    params = data.get('params') or {}
    predictions, model = _manager.predict(
        tasks, project, label_config, force_reload, try_fetch, **params
    )
    response = {
        'results': predictions,
        'model_version': model.model_version
    }
    return jsonify(response)

Here there is the implicit parameter request which is provided by Flask to a route when called.

It's known as a 'context' in Flask's docs, implemented in werkzeug as a context local. I don't really know much about the implementation details here, other than that you access the .json attribute and then you're just working with a regular dict (similar to locals()).

But how does this all work together? How can we test the /predict route? We can't just send a plain POST request:

curl --header "Content-Type: application/json" --request POST --data '{}' http://localhost:9090/predict

We hit an exception in the LabelStudioMLManager.predict class method, which receives the empty data and tells us the model is not loaded:

    @classmethod
    def predict(
        cls, tasks, project=None, label_config=None, force_reload=False, try_fetch=True, **kwargs
    ):
        if not os.getenv('LABEL_STUDIO_ML_BACKEND_V2', default=LABEL_STUDIO_ML_BACKEND_V2_DEFAULT):
            if try_fetch:
                m = cls.fetch(project, label_config, force_reload)
            else:
                m = cls.get(project)
                if not m:
                    raise FileNotFoundError('No model loaded. Specify "try_fetch=True" option.')
            predictions = m.model.predict(tasks, **kwargs)
            return predictions, m

        if not cls._current_model:
            raise ValueError(f'Model is not loaded for {cls.__class__.__name__}: run setup() before using predict()')

        predictions = cls._current_model.model.predict(tasks, **kwargs)
        return predictions, cls._current_model

In other words, you should [let the UI] set this backend up before trying to decipher the inner workings any deeper.
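
For reference, the keys that _predict reads from request.json show the shape a manual request would need; here is a hedged sketch using Python requests, where the task data and label_config values are placeholders (the real ones come from Label Studio itself):

import requests

payload = {
    "tasks": [{"data": {"text": "some example text"}}],  # made-up task data
    "project": "1",                                       # project identifier
    "label_config": "<View>...</View>",                   # the project's labelling config XML
    "force_reload": False,
    "try_fetch": True,
    "params": {},
}
response = requests.post("http://localhost:9090/predict", json=payload)
print(response.json())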

Backend specification

I don't want a text classifier like this though; I want a bounding box predictor (an object detection model). This one doesn't tick all the boxes for my needs, which are:

  • A HuggingFace model (the repo's huggingface example uses transformers.AutoTokenizer and transformers.AutoModelForCausalLM; the simple text classifier above is plain sklearn). These will be useful for figuring out what to put in the predict method of the model class in my ML backend.
  • An image model (not too complicated: similar to a language model, it'll use a tokenizer/processor and a model). This will be useful for figuring out how to pass the image data into the model (which has some requirements that get validated in the model's __init__ method).

It's clear that of these two, the first priority should be to find another object detection Label Studio backend, so I'd be able to look at the assertions made in its equivalent of the simple_text_classifier's SimpleTextClassifier.__init__() method.

Detectron2 example

I searched for the name of the base class LabelStudioMLBase on the code search site grep.app (here) and indeed I landed on an image model built with Detectron2, the well-known detection/segmentation library (closer to object detection with bboxes, though I expected it to output pixel-level masks).

Edit: in fact it is giving bboxes: in the code excerpt below, result type is "rectanglelabels".

Edit 2: it turned out I overlooked one right under my nose: mmdetection contains an object detection example in this repo!

After some further digging, this turned out to be a personal copy, belonging to one of the LayoutParser developers, of code that would go on to become part of the official LayoutParser annotation service.

This is much closer to what I am aiming for; in fact it's the exact same task. However, I want to use the LayoutLMv3 model on HuggingFace, whereas this example (obviously) is using a layoutparser model (specifically lp.Detectron2LayoutModel).

From this example however, it's clear what function signatures we should aim for in an object detection API:

class ObjectDetectionAPI(LabelStudioMLBase):
    def __init__(self, freeze_extractor=False, **kwargs):
        ...

    def predict(self, tasks, **kwargs):
        image_urls = [task["data"][self.value] for task in tasks]
        images = [load_image_from_url(url) for url in image_urls]
        layouts = [self.model.detect(image) for image in images]
        predictions = []
        for image, layout in zip(images, layouts):
            height, width = image.shape[:2]
            result = [
                {
                    "from_name": self.from_name,
                    "to_name": self.to_name,
                    "original_height": height,
                    "original_width": width,
                    "source": "$image",
                    "type": "rectanglelabels",
                    "value": convert_block_to_value(block, height, width),
                }
                for block in layout
            ]
            predictions.append({"result": result})
        return predictions

    def fit(self, completions, workdir=None, batch_size=32, num_epochs=10, **kwargs):
        image_urls, image_classes = [], []
        print("Collecting completions...")
        # for completion in completions:
        #     if is_skipped(completion):
        #         continue
        #     image_urls.append(completion['data'][self.value])
        #     image_classes.append(get_choice(completion))
        print("Creating dataset...")
        # dataset = ImageClassifierDataset(image_urls, image_classes)
        # dataloader = DataLoader(dataset, shuffle=True, batch_size=batch_size)
        print("Train model...")
        # self.reset_model()
        # self.model.train(dataloader, num_epochs=num_epochs)
        print("Save model...")
        # model_path = os.path.join(workdir, 'model.pt')
        # self.model.save(model_path)
        return {"model_path": None, "classes": None}

It's pretty clear here that this is a work in progress (i.e. all the commented out code). After getting to grips with how Label Studio backends work, I'm fairly certain that the training API isn't operational; the prediction service looks like it could be, though.

The commented out line dataset = ImageClassifierDataset(image_urls, image_classes) caught my attention, as it suggests that this was building on prior work. Indeed, searching on grep.app shows that this name comes from the label-studio-ml-backend repo:

One thing I like about the code at this link is that it has nicely contained methods. (To be covered below: the HuggingFace example has quite a messy fit method).

It's a pretty simple PyTorch dataset, but I'm not personally going to use URLs for my data, so it's not quite aligned to my needs.

I expect I'm going to use something more like this example by Niels Rogge (MLE at HuggingFace):

from torch.utils.data import Dataset
from PIL import Image

class CustomDataset(Dataset):
    def __init__(self, root, df, processor):
        self.root = root
        self.df = df
        self.processor = processor

    def __getitem__(self, idx):
        # get document image + corresponding words and boxes
        item = self.df.iloc[idx]
        image = Image.open(self.root + ...).convert('RGB')
        words = item.words
        boxes = item.boxes

        # use processor to prepare everything for the model
        encoding = self.processor(image, words, boxes=boxes)

        return encoding

This is just a draft, assuming you have a root folder with all your document images, and a Pandas dataframe that contains the words + boxes for each document image.

You can then instantiate the dataset as follows:

from transformers import LayoutLMv3Processor

processor = LayoutLMv3Processor.from_pretrained("microsoft/layoutlmv3-base")

dataset = CustomDataset(root="path_to_your_root", df="your_dataframe", processor=processor)

Note that in the tutorial notebook it's clarified what the processor is:

Next, we prepare the dataset for the model. This can be done very easily using LayoutLMv3Processor, which internally wraps a LayoutLMv3FeatureExtractor (for the image modality) and a LayoutLMv3Tokenizer (for the text modality) into one.

Back to the code at hand though! (There's not much to say)

The result here would be a good use case for a typing.TypedDict as the keys will always be the same.
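
For illustration, a sketch of what such TypedDicts could look like, based on the keys in the ObjectDetectionAPI result above and the convert_block_to_value output shown below (the names are my own):

from typing import List, TypedDict


class RectangleValue(TypedDict):
    x: float
    y: float
    width: float
    height: float
    rotation: int
    rectanglelabels: List[str]
    score: float


class RectangleResult(TypedDict):
    from_name: str
    to_name: str
    original_height: int
    original_width: int
    source: str
    type: str
    value: RectangleValue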

Note here that convert_block_to_value(block, image_height, image_width) returns:

{
    "height": block.height / image_height * 100,
    "rectanglelabels": [str(block.type)],
    "rotation": 0,
    "width": block.width / image_width * 100,
    "x": block.coordinates[0] / image_width * 100,
    "y": block.coordinates[1] / image_height * 100,
    "score": block.score,
}

...and block is a single object from layouts, which is a list returned from self.model.detect(image) where as already stated, the model is Detectron2LayoutModel from layoutparser (source here).

If we keep digging, we see the detect method returns what it gets from running the gather_output method on the output of calling self.model. To disregard the model here, the "gathering" involves creating Layout objects (from the lp.elements.layout module) and putting TextBlock objects in them, each populated with a block argument made of a Rectangle, both from the lp.elements.layout_elements module.

The Rectangle is a "manual dataclass" made of x_1, y_1, x_2, y_2 (or 'Lord of the Rings Bilbo' notation as I remember it: LT,RB).

So that's the block being iterated over in the predict method of the ObjectDetectionAPI class (which subclasses LabelStudioMLBase), and therefore we can interpret the values: "x" and "y" are the bbox's left and top coordinates as percentages of the image width and height, while the bbox's "width" and "height" are likewise relative to the image's width and height (again as percentages).
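
To make the percentage convention concrete, a tiny worked example with made-up numbers:

# a hypothetical 1000x800 image with one block at pixel coords (150, 200)-(450, 600)
image_width, image_height = 1000, 800
x1, y1, x2, y2 = 150, 200, 450, 600

x = x1 / image_width * 100               # 15.0  (% of image width)
y = y1 / image_height * 100              # 25.0  (% of image height)
width = (x2 - x1) / image_width * 100    # 30.0
height = (y2 - y1) / image_height * 100  # 50.0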

The block "score" comes from the model. It's difficult to look at the model directly in the code due to the weird metaprogramming approach used (it comes from another package, fvcore, via fvcore.common.registry and detectron2's registry), which instantiates a module-wide 'registry' of architectures that get recorded through the @META_ARCH_REGISTRY.register() decorator (see search results here).
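
The registry pattern itself is simple though; a simplified sketch of the idea (not detectron2's actual code):

# Simplified sketch of the registry pattern: a module-level mapping that
# decorated classes add themselves to, so an architecture can later be
# looked up by the string name given in a config.
META_ARCH_REGISTRY = {}


def register(cls):
    META_ARCH_REGISTRY[cls.__name__] = cls
    return cls


@register
class GeneralizedRCNN:
    pass


model_cls = META_ARCH_REGISTRY["GeneralizedRCNN"]  # detectron2 selects this via cfg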

The block "type" is passed as the label of predicted classes list (pred_classes.tolist()).

MMDetection example

I missed the OpenMMLab MMDetection toolbox example backend at first, perhaps because it has such a simple class structure: it only has __init__ and predict methods (as well as a _get_image_url helper method). It's not trainable through the Label Studio interface; you just load trained checkpoints from file.

This one's a bit unusual: it's the only one I've seen here that asks to specify device in the class __init__ signature (defaulting to "cpu").

It also loads labels from a file (the other backends populate their labels attribute from the labels value in the info attribute that comes from the parsed_label_config dict's value).

Instead, the parsed_label_config dict's first value is assigned to schema which looks like it's also a dict, with yet more dicts nested inside... (Type annotations would be valuable here!)

(
    self.from_name,
    self.to_name,
    self.value,
    self.labels_in_config,
) = get_single_tag_keys(self.parsed_label_config, "RectangleLabels", "Image")
schema = list(self.parsed_label_config.values())[0]
self.labels_in_config = set(self.labels_in_config)

# Collect label maps from `predicted_values="airplane,car"` attribute
# in <Label> tag
self.labels_attrs = schema.get("labels_attrs")
if self.labels_attrs:
    for label_name, label_attrs in self.labels_attrs.items():
        for predicted_value in label_attrs.get("predicted_values", "").split(
            ","
        ):
            self.label_map[predicted_value] = label_name

That get_single_tag_keys function is from the label_studio_ml.utils module:

def get_single_tag_keys(parsed_label_config, control_type, object_type):
    """
    Gets parsed label config, and returns data keys related to the single
    control tag and the single object tag schema
    (e.g. one "Choices" with one "Text")

    :param parsed_label_config: parsed label config returned by
                                "label_studio.misc.parse_config" function
    :param control_type: control tag str as it written in label config
                         (e.g. 'Choices')
    :param object_type: object tag str as it written in label config
                        (e.g. 'Text')
    :return: 3 string keys and 1 array of string labels:
             (from_name, to_name, value, labels)
    """
    assert len(parsed_label_config) == 1
    from_name, info = list(parsed_label_config.items())[0]
    assert info["type"] == control_type, (
        'Label config has control tag "<'
        + info["type"]
        + '>" but "<'
        + control_type
        + '>" is expected for this model.'
    )  # noqa

    assert len(info["to_name"]) == 1
    assert len(info["inputs"]) == 1
    assert info["inputs"][0]["type"] == object_type
    to_name = info["to_name"][0]
    value = info["inputs"][0]["value"]
    return from_name, to_name, value, info["labels"]

As well as a 'getter', this is a helper validating the parsed_label_config!

The control_type appears to be more like the "type of task label" (so classification task has example "Choices" type of task label) which is "RectangleLabels" here because the labels are bboxes, and the object_type is more like the "modality type of the task" (so a text classifier has example "Text" type of modality) which is "Image" here because the objects are detected in images.

HuggingFace Transformers backends

I don't just want an image inputs/rectangular labels ML backend though, I specifically want to use HuggingFace's Transformers library to load my model and then make predictions with it.

If we hop into the examples directory and search for the transformers import statement we can see what's been demo'd:

grep -r --include \*.py transformers

huggingface/gpt.py:from transformers import AutoTokenizer, AutoModelForCausalLM
bert/bert_classifier.py:from transformers import BertTokenizer, BertForSequenceClassification
bert/bert_classifier.py:from transformers import AdamW, get_linear_schedule_with_warmup
ner/ner.py:from transformers import (
ner/ner.py:from transformers import AdamW, get_linear_schedule_with_warmup
electra/electra.py:from transformers import ElectraTokenizerFast, ElectraForSequenceClassification
electra/electra.py:from transformers import Trainer
electra/electra.py:from transformers import TrainingArguments

So that's GPT, BERT, and Electra, all (text) language models. The ner directory likewise obviously contains language models (for named entity recognition: BERT, Roberta, DistilBert, CamemBert).

Note from the imported names that bert is the first in this list that does classification like the simple_text_classifier backend we saw above (BertForSequenceClassification). This seems like a good example to compare to (it should otherwise be similar to simple_text_classifier).

BERT backend

The only difference between the two Docker Compose specs (bert and simple_text_classifier) is that the BERT example does not specify any of the Label Studio-related env. vars:

diff simple_text_classifier/docker-compose.yml bert/docker-compose.yml

20,22d19
<       - LABEL_STUDIO_ML_BACKEND_V2=true
<       - LABEL_STUDIO_HOSTNAME=http://localhost:8000
<       - LABEL_STUDIO_API_KEY=d6f8a2622d39e9d89ff0dfef1a80ad877f4ee9e3

The hostname and API key are used to GET data annotations from the Label Studio API [locally], and it turned out these aren't set up for this backend, hence the env. vars not being set.

Let's now look at the bert_classifier.py module. The class BertClassifier defines nearly all of the same methods as SimpleTextClassifier. It doesn't define _get_annotated_dataset, which I interpret as meaning this model does not support webhook-based training workflows (unconfirmed).

The rest:

  • __init__ identical but now with a few extra lines after the base class gets initialised (assigning attributes that were added to the previously blank self, **kwargs signature, all of which have defaults):
            self.pretrained_model = pretrained_model
            self.maxlen = maxlen
            self.batch_size = batch_size
            self.num_epochs = num_epochs
            self.logging_steps = logging_steps
            self.train_logs = train_logs
    • Not quite identical at the end either, where the pickle loading is replaced with from_pretrained loading of the model saved with save_pretrained in HuggingFace.
  • reset_model is used to set up an initial model, but rather than the sklearn pipeline in SimpleTextClassifier, it's the model loaded from_pretrained again.
  • predict does a few things differently:
    • First off, it won't return anything if the tokenizer attribute wasn't set by running the load() method in the __init__ method (i.e. by passing the truthiness check on self.train_output, which gets set in the base class when the train_output kwarg is passed).
    • Rather than just iterating over the tasks and sticking task["data"].get(self.value) into a list of input_texts, a proper dataloader is used (it's cooked up in the utils.prepare_texts function), and iterating over it gives input IDs and attention masks which are moved to the appropriate device upon being dataloaded.
    • After dataloading, model inference runs in a torch.no_grad block (which disables the gradient calculation), and then after this block the resulting logits are detached from the graph, and put back on the CPU.
    • The scores and labels are assigned more neatly than in the sklearn model. The predicted label is listed directly rather than waiting to zip the argmax index against the score and look up the label just before building the result dict.
  • fit takes no annotations argument but instead completions, which has the annotations nested inside it. Compare simple_text_classifier vs. bert_classifier; they're clearly accessing the same thing:
    output_label = annotation['result'][0]['value']['choices'][0]
    output_label = completion['annotations'][0]['result'][0]['value']['choices'][0]
    After that, there's a ton more going on (whereas the sklearn backend's fit method abstracted it all away into the sklearn model.fit call). It's not particularly worth going through: a neural net training loop, with logging, a tqdm'd dataloader, a model.train() call, loss backprop., and early stopping.

...it also defines

  • a load method, which loads the pretrained model and overwrites some of the attributes defined at __init__ with values restored from the trained model (namely: batch_size, labels, maxlen).
  • a not_trained property, which relies on checking if the self.tokenizer attribute has been set (it gets set in the load method).
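
Roughly, the load/not_trained relationship works like this (a paraphrase of the pattern, not the exact BertClassifier source):

class BertLikeClassifier:
    """Minimal sketch (not the actual BertClassifier) of the load/not_trained pattern."""

    def load(self, train_output):
        # the real load() restores the saved model and sets self.tokenizer as a side effect
        self.tokenizer = "<restored tokenizer>"

    @property
    def not_trained(self):
        # predict() bails out early while this is True
        return not hasattr(self, "tokenizer")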

Electra backend

The first thing notable in the electra backend is that its _wsgi.py is near identical to the bert directory's: the BertClassifier [subclass of LabelStudioMLBase] is just replaced with ElectraTextClassifier.

The electra.py module is simpler however: totalling 145 lines to the BERT module's 221. Right from the start you can see it has fewer imports.

On closer inspection this is because the Electra model is trained with the HuggingFace Trainer class, so other than transformers, the only libraries loaded in the module are requests and json!

  • See its source code for details on what is abstracted away into this class or click through to particular sections from the docs
diff <(grep import examples/bert/bert_classifier.py) <(grep import examples/electra/electra.py)

2c2,3
< import numpy as np
---
> import requests
> import json
4,10c5,7
< from torch.utils.data import SequentialSampler
< from tqdm import tqdm, trange
< from collections import deque
< from tensorboardX import SummaryWriter
< from transformers import BertTokenizer, BertForSequenceClassification
< from transformers import AdamW, get_linear_schedule_with_warmup
< from torch.utils.data import TensorDataset, DataLoader, RandomSampler
---
> from transformers import ElectraTokenizerFast, ElectraForSequenceClassification
> from transformers import Trainer
> from transformers import TrainingArguments
12c9
< from utils import prepare_texts, calc_slope
---
> from label_studio_tools.core.label_config import parse_config
  • The __init__ method is conspicuously lacking any assert statements (the other 2 examples had checks such as the config's inputs value being length 1, i.e. a single input). It seems to just rely on this implicitly, however, and behaves the same.

    self.value = self.info["inputs"][0]["value"]
  • The fit method is shrunk back down closer to the simple_text_classifier backend, after being crammed full of training loop logic in the BERT backend.

  • There's no load method (which in the BERT backend was what set self.tokenizer); here self.tokenizer gets set in the __init__ method.

  • There is a load_config method, but this is used to initialise the parsed_label_config if fit is called before that's set. It's set when the base class initialises but can be {} if no config is passed (i.e. if the ElectraTextClassifier class isn't passed a label_config kwarg on init).

  • The predict method has the nice neat HuggingFace style of predictions (seen in the BERT example) but keeps the label index from argmax as seen in the simple_text_classifier's sklearn code. This is the best of both worlds.

  • The _get_annotated_dataset method is back, and handles the 'webhook' events with the API key (though the API key is hardcoded here, rather than set as an env. var. in the Docker Compose spec as done in the simple_text_classifier).

  • There is also a new _get_text_from_s3 method which I don't need.

It also includes a CustomDataset class similar to the draft above.

Model loading in all backends

To step aside and review how each of the models is loaded and what that entails for the resulting backend's capabilities (we can put a checkbox beside them to indicate whether they are compatible with local training):

grep -r --include \*.py "self.model ="

label_studio_ml/examples/flair/ner_ml_backend.py:            self.model = self.load(self.train_output["base_path"])
label_studio_ml/examples/huggingface/gpt.py:        self.model = AutoModelForCausalLM.from_pretrained(self.model_name)
label_studio_ml/examples/mmdetection/mmdetection.py:        self.model = init_detector(config_file, checkpoint_file, device=device)
label_studio_ml/examples/bert/bert_classifier.py:        self.model = BertForSequenceClassification.from_pretrained(pretrained_model)
label_studio_ml/examples/nemo/asr.py:        self.model = nemo_asr.models.EncDecCTCModel.from_pretrained(
label_studio_ml/examples/tensorflow/mobilenet_finetune.py:        self.model = tf.keras.Sequential(
label_studio_ml/examples/simple_text_classifier/simple_text_classifier.py:                self.model = pickle.load(f)
label_studio_ml/examples/simple_text_classifier/simple_text_classifier.py:        self.model = make_pipeline(
label_studio_ml/examples/pytorch_transfer_learning/pytorch_transfer_learning.py:        self.model = models.resnet18(pretrained=True)
label_studio_ml/examples/pytorch_transfer_learning/pytorch_transfer_learning.py:        self.model = self.model.to(device)
label_studio_ml/examples/pytorch_transfer_learning/pytorch_transfer_learning.py:            self.model = ImageClassifier(len(self.classes), freeze_extractor)
label_studio_ml/examples/pytorch_transfer_learning/pytorch_transfer_learning.py:            self.model = ImageClassifier(len(self.classes), freeze_extractor)
label_studio_ml/examples/pytorch_transfer_learning/pytorch_transfer_learning.py:        self.model = ImageClassifier(len(self.classes), self.freeze_extractor)
label_studio_ml/examples/electra/electra.py:            self.model = ElectraForSequenceClassification.from_pretrained("my_model")
label_studio_ml/examples/electra/electra.py:            self.model = ElectraForSequenceClassification.from_pretrained(
  • The flair backend assigns self.model in the __init__ method in a conditional block checking self.train_output (which gets set in the base class __init__) and if the check fails it just doesn't load a model (doesn't even assign None to the attribute! Dicey).

    • The model is always loaded from a local path with filename best_model.pt
  • The gpt backend doesn't do any check, and the model just gets loaded from self.model_name, so it can't be trained (so it isn't usable in active learning retraining workflows).

  • The mmdetection backend sets it once, as for gpt.

  • The bert backend sets it from pretrained_model in a load method. Unless I'm mistaken, there's a bug where reset_model is a no-op: the model it returns isn't assigned to self.model (as the simple_text_classifier sklearn backend did). In fact the only other example with a reset_model method is the pytorch_transfer_learning backend, and indeed that one binds self.model too.

    • That said, if it were fixed (so the call to reset_model in the __init__ method assigned to self.model) it would be training-compatible.

  • There is also the ner backend which sets self._model from the self.train_output attribute's model_path value (which with HuggingFace can of course be a HuggingFace Hub-hosted model path rather than a local file path).

  • The nemo ASR backend sets it once, as for gpt.

  • The tensorflow backend sets it once but loads weights afterwards if self.train_output is set (truthily).

  • The simple_text_classifier backend either sets it from reset_model and immediately fits to initialise if train_output is falsey, or sets it from the pickle if a local model_file is passed in train_output.

  • The pytorch_transfer_learning backend loads its classes if train_output is passed, and loads weights into it after assigning too, otherwise it just initialises it. This is done quite neatly (probably helped by defining the model class in the module itself, not relying on importing an external one).

  • The electra backend checks whether a hardcoded path exists but then doesn't actually use that hardcoded path, though it clearly is supposed to. Again: yes, you can train the model and use it here.

So to summarise the training-friendly backend examples and whether they're good templates to build from:

  • bert pulls in the labels from the label config info before resetting the model if training output is not available; if available it loads that and gets the labels and [should get] the model from there.
    • It uses the fact that only load (not reset_model) sets the tokenizer to distinguish whether it's not_trained (so as to refuse to predict until being trained)
  • ner uses the model_path from the training output if provided, otherwise just sets labels
    • I.e. it does the same as the BERT backend, and can't predict until trained.
  • tensorflow is the odd one out, starting with the same model regardless of self.train_output but then loading weights into it if available. This uses Keras not HuggingFace, so not applicable.
  • simple_text_classifier calls fit directly on self.model after resetting the model if not trained, otherwise unpickles the model.
  • pytorch_transfer_learning calls load if training output is available, otherwise instantiates the model directly (not put in a reset_model method but same idea). Really it should call reset_model in both blocks of that condition.
  • electra instantiates the model directly, just changes the path based on whether the model file exists. I'm not a fan of the hardcoded value, but I do like that the attributes are consistent regardless of whether the model was trained already (self.tokenizer gets set either way too).

I'd like:

  • The attribute assignment simplicity of electra (and subsequent ability to predict regardless of whether trained or not)
  • The model path handling of ner
  • The proper reset_model/load method handling of bert (when fixed as above)
  • The proper assertion checks on __init__ of bert

Custom LayoutLMv3 object detection backend

So having homed in on these 5 (SimpleTextClassifier, BERT, Electra, the NER tagger, and MMDetection), as well as the partial Detectron2 example, it's clear that we actually want to mix and match aspects of code from various sources.

  • API key handling is only done properly (in the Docker Compose spec) by the SimpleTextClassifier. Electra uses it too in _get_annotated_dataset, but there it's a hardcoded module-level string literal.

  • Training is done most neatly (i.e. more simply, abstracting the details away) in Electra, and matches the use of the Trainer API in the LayoutLMv3 tutorial by Niels Rogge (via the Transformer-Tutorials repo).

  • Prediction is done most neatly in Electra, but I'd still prefer a TypedDict for the results to make it even cleaner.

  • Bounding box handling is done in Detectron2 and MMDetection. The result type will be changed from choices to rectanglelabels.

  • Config assertions are done in BERT's __init__ method, and these may be useful to write (they're not in Electra).

  • GPU handling is only done explicitly in BERT, but I expect the Trainer class handles that in Electra. This is handled through the place_model_on_device property of the TrainingArguments class,

    • ...which is True if transformers.utils.import_utils.py's is_sagemaker_mp_enabled() evaluates to False (i.e. if not using model parallelism, which is set via SM_HP_MP_PARAMETERS env. var. else defaults to False).
  • The processor is going to go where the tokenizer goes in Electra (in the __init__ method) for use in predict and fit. Even though it's said to be 'pretrained', it doesn't get retrained so we don't need to load it, so it doesn't need to be conditional on there being train_output (see discussion).

  • The model is going to go where it goes in Electra (in the __init__ method) but rather than instantiating it here from a hardcoded MODEL_FILE module-level global variable, it's going to be loaded via the path given by the load method like in BERT if train_output is available, otherwise from reset_model. This condition will look more like BERT but without moving devices (unsure?). Like the SimpleTextClassifier, the labels attribute is set from train_output if loading else from info if using reset_model.

    • reset_model should not be passing hardcoded defaults through (as in BERT), they should be method defaults (as in SimpleTextClassifier). The method should take no arguments.

At the risk of overemphasising, let's turn that inside out so it's in terms of what 'features' we want from each source:

  • BERT: reset_model/load pattern; config assertions in __init__
  • Electra: simple attribute assignment [in particular of the tokenizer, i.e. processor in my case] in __init__ (permitting use of predict even if no train_output); prediction with Trainer API (with automatic GPU device handling); _get_annotated_dataset
  • Simple Text Classifier: API key handling in Docker Compose spec and _get_annotated_dataset; method-level defaults in reset_model (not hardcoded in __init__'s call to that method)
  • Detectron2 and MMDetection: bbox handling
  • Niels Rogge's Transformer-Tutorials LayoutLMv3 notebook: Trainer API usage; custom Dataset (from issue #123)
  • NER tagger: model_path handling

Backend rewriting recipe

Since this is quite an ambitious rewrite (with at least 4 different sources in the examples here, plus likely reusing some of Niels Rogge's code for datasets and training with the Trainer API), I'll want to take a principled approach, and record what I do, with version control so I can roll back (or at least review) any mistakes.

  • The first step is to begin adapting the most relevant template (Electra), which will achieve GPU handling (which we get 'for free' with the Trainer API) and immediately check one of our features off the to-do list.
  • Then we should start modifying the model itself. The first 'easy win' is API key usage.
  • Next, we should move onto the model class __init__ method, and tackle some real wins:
    • Adapting the signature to be relevant to the LayoutLMv3 model's args/kwargs.
    • The config assertions (just guess if unsure, we can fix if they fail)
    • The processor instantiation (another easy win)
    • The model instantiation within a train_output conditional block.
  • That just leaves:
    • Prediction (of which bbox handling is a component), which will give us preannotation
    • Training, which will give us a retrainable (fine-tuneable) model which will learn from the annotation labels we provide in Label Studio

Our recipe is therefore:

  1. GPU handling
  2. API key handling
  3. Config assertions
  4. Processor
  5. Model
  6. Prediction
  7. Bbox handling
  8. Training

Choosing a template

The first question is obviously: where to start? I.e. which template to begin adapting from?

Well, out of the sources above, Electra has the longest list of 'features' I want.

Looked at another way, the most complexity-reducing thing we have here is the Trainer API, and that's only in Electra (Niels Rogge's Trainer API example is not a Label Studio backend example).

At the risk of bikeshedding I'm going to just go with that impulse...

Adapting the WSGI server module

  1. Copy the directory and rename to layoutlmv3
  2. Rename the model module to layoutlmv3.py
  3. Overwrite the import line for the model module with the new model module (layoutlmv3) and class names (LayoutLMv3Classifier)
  4. Overwrite the model class name with the new one
cp -r electra layoutlmv3
cd layoutlmv3
mv electra.py layoutlmv3.py
sed -i 's/from electra import ElectraTextClassifier/from layoutlmv3 import LayoutLMv3Classifier/' _wsgi.py
sed -i 's/ElectraTextClassifier/LayoutLMv3Classifier/g' _wsgi.py

Before we start adapting the model class, we should really ensure that class is renamed too (so far it's just renamed in the server module).

sed -i 's/ElectraTextClassifier/LayoutLMv3Classifier/g' layoutlmv3.py

Modifying the Compose spec and API key usage

A major feature that is missing from Electra is that it doesn't take the API key from an environment variable set in the Docker Compose spec; it uses a hard-coded string instead. We can get this easily enough by copying the Docker Compose spec over from simple_text_classifier and then using it in layoutlmv3.py the same way the simple_text_classifier.py module uses it.

Just copy the variables in the environment section of the YAML (I just did this in a text editor)

--- a/label_studio_ml/examples/layoutlmv3/docker-compose.yml
+++ b/label_studio_ml/examples/layoutlmv3/docker-compose.yml
@@ -18,6 +18,9 @@ services:
       - REDIS_HOST=redis
       - REDIS_PORT=6379
       - USE_REDIS=true
+      - LABEL_STUDIO_ML_BACKEND_V2=true
+      - LABEL_STUDIO_HOSTNAME=http://localhost:8000
+      - LABEL_STUDIO_API_KEY=d6f8a2622d39e9d89ff0dfef1a80ad877f4ee9e3

Since it's all local, I imagine you could change that key to be whatever you wanted instead? (TBC)

The model module layoutlmv3.py now needs to use those environment variables like simple_text_classifier.py does:

--- a/label_studio_ml/examples/layoutlmv3/layoutlmv3.py
+++ b/label_studio_ml/examples/layoutlmv3/layoutlmv3.py
@@ -12,9 +12,15 @@
+from label_studio_ml.utils import DATA_UNDEFINED_NAME, get_env
+
+HOSTNAME = get_env("HOSTNAME", "http://localhost:8080")
+API_KEY = get_env("API_KEY")
+
+print("=> LABEL STUDIO HOSTNAME = ", HOSTNAME)
+if not API_KEY:
+    print("=> WARNING! API_KEY is not set")
 
-HOSTNAME = "https://app.heartex.com/"
-API_KEY = ""

Finally, we also need to modify the _get_annotated_dataset method (which Electra had) to add the exception handling it was missing, following the simple_text_classifier's more careful approach:

--- a/label_studio_ml/examples/layoutlmv3/layoutlmv3.py
+++ b/label_studio_ml/examples/layoutlmv3/layoutlmv3.py
@@ -142,6 +142,11 @@ class LayoutLMv3Classifier(LabelStudioMLBase):
         response = requests.get(
             download_url, headers={"Authorization": f"Token {API_KEY}"}
         )
+        if response.status_code != 200:
+            raise Exception(
+                f"Can't load task data using {download_url}, "
+                f"response status_code = {response.status_code}"
+            )
         return json.loads(response.content)

and with that we should have enabled webhook-triggered training with the Docker Compose-specified API key.

The only reason you might not want to do this is if the error would crash your annotation session, but I'd expect it to fail early, before you'd done any annotation, so you wouldn't lose any work.
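
Putting the diffs together, the adapted helper ends up looking roughly like this (a sketch: in the backend it's a method on the model class rather than a free function, and the URL construction follows simple_text_classifier's):

import json

import requests

from label_studio_ml.utils import get_env

HOSTNAME = get_env("HOSTNAME", "http://localhost:8080")
API_KEY = get_env("API_KEY")


def _get_annotated_dataset(project_id):
    """Retrieve all annotated tasks for a project from the Label Studio export API."""
    download_url = f"{HOSTNAME.rstrip('/')}/api/projects/{project_id}/export"
    response = requests.get(download_url, headers={"Authorization": f"Token {API_KEY}"})
    if response.status_code != 200:
        raise Exception(
            f"Can't load task data using {download_url}, "
            f"response status_code = {response.status_code}"
        )
    return json.loads(response.content)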

Config checks

BERT had some confident assertions that demonstrate data validation on the input config, so that we can't accidentally use this backend with the wrong task type (or something like that).

The obvious question here is: what are we going to check? What are we expecting? Well, we can't just reuse the BERT code as we are not expecting to classify choices but rather to have labelled bounding boxes or rectanglelabels as they're known.

Here are the checks the BERT classifier does:

        # then collect all keys from config which will be used to extract data from task and to form prediction
        # Parsed label config contains only one output of <Choices> type
        assert len(self.parsed_label_config) == 1
        self.from_name, self.info = list(self.parsed_label_config.items())[0]
        assert self.info["type"] == "Choices"

        # the model has only one textual input
        assert len(self.info["to_name"]) == 1
        assert len(self.info["inputs"]) == 1
        assert self.info["inputs"][0]["type"] == "Text"
        self.to_name = self.info["to_name"][0]
        self.value = self.info["inputs"][0]["value"]

We aren't using outputs of Choices type, but RectangleLabels (recall this is known as the control_type).

If you review the code above from the get_single_tag_keys helper function in label_studio_ml.utils, it is in fact the exact same check. So we can just call that and have a much more concise (thus more maintainable) model class __init__.

So in fact, we really want to copy the mmdetection backend's routine here.

diff --git a/label_studio_ml/examples/layoutlmv3/layoutlmv3.py b/label_studio_ml/examples/layoutlmv3/layoutlmv3.py
index 0e7e5bb..2ddbfdb 100644
--- a/label_studio_ml/examples/layoutlmv3/layoutlmv3.py
+++ b/label_studio_ml/examples/layoutlmv3/layoutlmv3.py
@@ -25,13 +25,17 @@ MODEL_FILE = "my_model"
 
 
 class LayoutLMv3Classifier(LabelStudioMLBase):
+    control_type: str = "RectangleLabels"
+    object_type: str = "Image"
+
     def __init__(self, **kwargs):
         super(LayoutLMv3Classifier, self).__init__(**kwargs)
         try:
-            self.from_name, self.info = list(self.parsed_label_config.items())[0]
-            self.to_name = self.info["to_name"][0]
-            self.value = self.info["inputs"][0]["value"]
-            self.labels = sorted(self.info["labels"])
+            self.from_name, self.to_name, self.value, self.labels = get_single_tag_keys(
+                self.parsed_label_config,
+                control_type=self.control_type,
+                object_type=self.object_type,
+            )
         except BaseException:
             print("Couldn't load label config")

While we're at it, we may as well set some class attributes and add type annotations to make things clearer.

These print statements are annoyingly amateur though: I then swapped them all for logger.error calls.

Next I removed some code repetition and made load_config only take the self argument.

With that, the config step was all done, and tucked away neatly into a load_config method.
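
For reference, roughly what the tidied-up load_config ends up as (a sketch reconstructed from the diff above, not verbatim):

import logging

from label_studio_ml.model import LabelStudioMLBase
from label_studio_ml.utils import get_single_tag_keys

logger = logging.getLogger(__name__)


class LayoutLMv3Classifier(LabelStudioMLBase):
    control_type: str = "RectangleLabels"
    object_type: str = "Image"

    def load_config(self) -> None:
        # parsed_label_config can be {} if no label_config was passed on init
        if not self.parsed_label_config:
            return
        try:
            (
                self.from_name,
                self.to_name,
                self.value,
                self.labels,
            ) = get_single_tag_keys(
                self.parsed_label_config,
                control_type=self.control_type,
                object_type=self.object_type,
            )
        except BaseException:
            logger.error("Couldn't load label config", exc_info=True)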

Processor instantiation

See the LayoutLMv3 processor source code and its tests

We create the processor just once, as Electra did for its tokenizer (so we just need to adapt this tokenizer to be a processor).

To make it neater, I made the processor name a class attribute, and the processor class another.

I swapped the Electra tokenizer import for LayoutLMv3Processor (while at it also swapping the ElectraForSequenceClassification with LayoutLMv3ForTokenClassification) and was now halfway done migrating it from Electra to LayoutLMv3:

class LayoutLMv3Classifier(LabelStudioMLBase):
    control_type: str = "RectangleLabels"
    object_type: str = "Image"
    hf_hub_name: str = "microsoft/layoutlmv3-base"
    hf_model_cls: Type = LayoutLMv3ForTokenClassification
    hf_processor_cls: Type = LayoutLMv3Processor
    
    def __init__(self, **kwargs):
        super(LayoutLMv3Classifier, self).__init__(**kwargs)
        self.load_config()
        self.processor = self.hf_processor_cls.from_pretrained(self.hf_hub_name)

Model initialisation

See the LayoutLMv3 model source code and its tests

There are two options for the model class: LayoutLMv3ForSequenceClassification and LayoutLMv3ForTokenClassification. The 'sequence' is a document (e.g. if you wanted to distinguish different types of document), and the 'token' is a part of a document (I want to annotate and classify parts of documents, so I chose the latter).

We instantiate our model in two different ways: either with reset_model or with load (if we have train_output).

What I did initially was to simplify the Electra model initialisation into two lines:

        model_to_load = MODEL_FILE if Path(MODEL_FILE).exists() else self.hf_hub_name
        self.model = self.hf_model_cls.from_pretrained(model_to_load)

At this point I committed my changes in case I messed the next step up

However as already established, this conditional block should actually be as in simple_text_classifier and bert.

This part of the code had hardcoded device="cpu", so I replaced that with a module-level global

DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

This left me with the outline of the new model, but still with the BERT kwargs. I added more type annotations, made the reset_model take no arguments and return nothing, and annotated load as returning nothing too.

        if not self.train_output:
            self.labels = self.info["labels"]
            self.reset_model()
            load_repr = "Initialised with"
        else:
            self.load(self.train_output)
            load_repr = f"Loaded from train output with"
        logger.info(f"{load_repr} {self.from_name=}, {self.to_name=}, {self.labels=!s}")
        
    def reset_model(self) -> None:
        # THESE KWARGS HAVE NOT BEEN CHANGED FROM BERT ! TODO
        model_kwargs = dict(
            num_labels=len(self.labels),
            output_attentions=False,
            output_hidden_states=False,
            cache_dir=None,
        )   
        model = self.hf_model_cls.from_pretrained(
            self.hf_hub_name,
            **model_kwargs
        )   
        model.to(DEVICE)
        self.model = model
        return
        
    def load(self, train_output) -> None:
        pretrained_model = train_output["model_path"]
        self.model = self.hf_model_cls.from_pretrained(pretrained_model)
        self.model.to(DEVICE)
        self.model.eval()
        self.batch_size = train_output["batch_size"]
        self.labels = train_output["labels"]
        self.maxlen = train_output["maxlen"]

Now getting the arguments to the model class looks tricky: if we review the BERT signature which we are adapting:

class BertForSequenceClassification(BertPreTrainedModel)
 |      Args:
 |          input_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`):
 |              Indices of input sequence tokens in the vocabulary.
 |      
 |              Indices can be obtained using [`BertTokenizer`]. See [`PreTrainedTokenizer.encode`] and
 |              [`PreTrainedTokenizer.__call__`] for details.
 |      
 |              [What are input IDs?](../glossary#input-ids)
 |          attention_mask (`torch.FloatTensor` of shape `(batch_size, sequence_length)`, *optional*):
 |              Mask to avoid performing attention on padding token indices. Mask values selected in `[0, 1]`:
 |      
 |              - 1 for tokens that are **not masked**,
 |              - 0 for tokens that are **masked**.
 |      
 |              [What are attention masks?](../glossary#attention-mask)
 |          token_type_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
 |              Segment token indices to indicate first and second portions of the inputs. Indices are selected in `[0,
 |              1]`:
 |      
 |              - 0 corresponds to a *sentence A* token,
 |              - 1 corresponds to a *sentence B* token.
 |      
 |              [What are token type IDs?](../glossary#token-type-ids)
 |          position_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
 |              Indices of positions of each input sequence tokens in the position embeddings. Selected in the range `[0,
 |              config.max_position_embeddings - 1]`.
 |      
 |              [What are position IDs?](../glossary#position-ids)
 |          head_mask (`torch.FloatTensor` of shape `(num_heads,)` or `(num_layers, num_heads)`, *optional*):
 |              Mask to nullify selected heads of the self-attention modules. Mask values selected in `[0, 1]`:
 |      
 |              - 1 indicates the head is **not masked**,
 |              - 0 indicates the head is **masked**.
 |      
 |          inputs_embeds (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`, *optional*):
 |              Optionally, instead of passing `input_ids` you can choose to directly pass an embedded representation. This
 |              is useful if you want more control over how to convert `input_ids` indices into associated vectors than the
 |              model's internal embedding lookup matrix.
 |          output_attentions (`bool`, *optional*):
 |              Whether or not to return the attentions tensors of all attention layers. See `attentions` under returned
 |              tensors for more detail.
 |          output_hidden_states (`bool`, *optional*):
 |              Whether or not to return the hidden states of all layers. See `hidden_states` under returned tensors for
 |              more detail.
 |          return_dict (`bool`, *optional*):
 |              Whether or not to return a [`~utils.ModelOutput`] instead of a plain tuple.
 |      
 |          labels (`torch.LongTensor` of shape `(batch_size,)`, *optional*):
 |              Labels for computing the sequence classification/regression loss. Indices should be in `[0, ...,
 |              config.num_labels - 1]`. If `config.num_labels == 1` a regression loss is computed (Mean-Square loss), If
 |              `config.num_labels > 1` a classification loss is computed (Cross-Entropy).
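
For LayoutLMv3 the forward signature differs from BERT's in that it also takes bbox and pixel_values, and the processor produces exactly those keys, so the pieces should fit together roughly as below (the file name, words and boxes are made up for illustration):

from PIL import Image
from transformers import LayoutLMv3ForTokenClassification, LayoutLMv3Processor

processor = LayoutLMv3Processor.from_pretrained("microsoft/layoutlmv3-base", apply_ocr=False)
model = LayoutLMv3ForTokenClassification.from_pretrained("microsoft/layoutlmv3-base", num_labels=5)

image = Image.open("page.png").convert("RGB")                    # hypothetical document image
words = ["Invoice", "Total", "$100"]                             # made-up words
boxes = [[10, 10, 80, 30], [10, 40, 60, 60], [70, 40, 120, 60]]  # boxes normalised to 0-1000

encoding = processor(image, words, boxes=boxes, return_tensors="pt")
# encoding contains input_ids, attention_mask, bbox and pixel_values,
# which is what the model's forward() expects
outputs = model(**encoding)
logits = outputs.logits  # shape: (batch, sequence_length, num_labels)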

Preannotation prediction and bbox handling

This one's really simple: we just call the model with the inputs. The inputs must be in a specific format though, and we need to handle the bounding box rectangles properly.
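
As a sketch of the bbox handling part (not the final predict(); the default tag names here are assumptions), converting pixel boxes into the rectanglelabels percentage format seen earlier:

from typing import Dict, List, Tuple

Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2) in pixels


def boxes_to_ls_results(
    boxes: List[Tuple[str, float, Box]],
    image_width: int,
    image_height: int,
    from_name: str = "label",  # assumed control tag name
    to_name: str = "image",    # assumed object tag name
) -> List[Dict]:
    """Convert (label, score, box) triples into Label Studio rectanglelabels results."""
    results = []
    for label, score, (x1, y1, x2, y2) in boxes:
        results.append({
            "from_name": from_name,
            "to_name": to_name,
            "original_width": image_width,
            "original_height": image_height,
            "type": "rectanglelabels",
            "value": {
                "x": x1 / image_width * 100,
                "y": y1 / image_height * 100,
                "width": (x2 - x1) / image_width * 100,
                "height": (y2 - y1) / image_height * 100,
                "rotation": 0,
                "rectanglelabels": [label],
                "score": score,
            },
        })
    return results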

Training for in-the-loop fine-tuning

This one's really simple: we just call Trainer.train()
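
To flesh out the intent a little, a minimal sketch with assumed hyperparameters (the real fit() still has to build train_dataset from the Label Studio annotations via the processor):

from transformers import Trainer, TrainingArguments


def finetune(model, train_dataset, output_dir: str = "my_model") -> str:
    """Train the model on annotation-derived data and return the save path
    (which fit() would hand back as model_path in its train output)."""
    training_args = TrainingArguments(
        output_dir=output_dir,
        num_train_epochs=10,
        per_device_train_batch_size=2,
        logging_steps=10,
    )
    trainer = Trainer(model=model, args=training_args, train_dataset=train_dataset)
    trainer.train()
    trainer.save_model(output_dir)
    return output_dir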

(TODO)
