Setting up ML backend for Label Studio
- Background links
- Requirements
- Deploying an example backend
- Backend specification
- Detectron2 example
- MMDetection example
- HuggingFace Transformers backends
- Custom LayoutLMv3 object detection backend
A prerequisite for setting up an ML backend for Label Studio is Docker Compose, which in turn requires Docker Engine.
- Specifically, the Quickstart with an example ML backend guide walks you through setting up Compose.
See Installing Docker Compose and Installing Docker Engine.
(Thankfully the standard way to install Engine also installs Compose.)
To set up the example used in the repo's docs, run the following:

git clone https://github.com/heartexlabs/label-studio-ml-backend
cd label-studio-ml-backend/label_studio_ml/examples/simple_text_classifier
docker compose up

If you installed via a different route you may have docker-compose as the command instead.
Here's the project structure:
label-studio-ml-backend/label_studio_ml/examples/simple_text_classifier $ tree .
.
├── data
│ ├── redis
│ └── server
│ └── models
├── docker-compose.yml
├── Dockerfile
├── logs
├── README.md
├── requirements.txt
├── simple_text_classifier.py
└── _wsgi.py
5 directories, 6 files
There's a data directory with a subdirectory for each of the services, redis and server
(with a subdirectory models), and a logs directory.
- In fact these are all created when the service starts: notice they're not checked into the repo
The docker-compose.yml Compose specification
file specifies the service names and where to mount these directories,
and the port number for the server service:
- The redis service:
  - mounts ./data/redis as /data
  - also names its container and hostname redis
- The server service:
  - mounts ./data/server as /data
  - mounts ./logs as /tmp
  - sets environment variables including MODEL_DIR as /data/models and an API key
  - defines a network link to the container in the redis service, thereby determining the order of service startup
  - specifies that it depends_on the redis service, again determining the order of service startup and shutdown
version: "3.8"
services:
redis:
image: redis:alpine
container_name: redis
hostname: redis
volumes:
- "./data/redis:/data"
expose:
- 6379
server:
container_name: server
build: .
environment:
- MODEL_DIR=/data/models
- RQ_QUEUE_NAME=default
- REDIS_HOST=redis
- REDIS_PORT=6379
- LABEL_STUDIO_ML_BACKEND_V2=true
- LABEL_STUDIO_HOSTNAME=http://localhost:8000
- LABEL_STUDIO_API_KEY=d6f8a2622d39e9d89ff0dfef1a80ad877f4ee9e3
ports:
- "9090:9090"
depends_on:
- redis
links:
- redis
volumes:
- "./data/server:/data"
- "./logs:/tmp"- The
LABEL_STUDIO_API_KEYis specific to this backend (see grep.app) - The YAML specifies 2 services:
- one named
redis(which is spun up from theredis:alpineimage and binds the./data/redis) - one named
server(which depends on the one namedredis)
- one named
The obvious question here is how ./data/server/models got made (see the directory tree above).
Since /data/models is passed in as an environment variable MODEL_DIR, it seems obvious that
either _wsgi.py or simple_text_classifier.py boots up when the service starts and touches it.
_wsgi.py is a 126-line module that looks in its directory for a config.json (note: not used),
exposes a CLI parser that reads a kwarg config of parameters as well as the port (from the env. var.),
host, debug flag, log level, and model dir (defaulting to the file's dir), and runs the app
that it initialises via label_studio_ml.api.init_app;
or, if not being run from the command line, it goes straight to initialising
the app with the args from the environment but does not run it.
Note: the app is initialised by a module-level singleton LabelStudioMLManager instance _manager, passing the model_class as the SimpleTextClassifier class imported from the simple_text_classifier.py module, and the _server object returned here as app is a Flask app, again a module-level singleton instance. Importantly, the SimpleTextClassifier class gets bound to the model_class of the LabelStudioMLManager and ends up getting bound inside the _current_model attribute (a dict).
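From that description, the initialisation in _wsgi.py boils down to something like the following sketch (a paraphrase of the behaviour described above, not the verbatim file; the port and kwargs are illustrative):

from label_studio_ml.api import init_app
from simple_text_classifier import SimpleTextClassifier

# init_app binds the model class to the module-level _manager and returns the
# module-level Flask _server as `app`
app = init_app(model_class=SimpleTextClassifier)

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=9090, debug=False)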
simple_text_classifier.py is a 161-line module that ensures an API key is set as an env. var.,
then defines the SimpleTextClassifier class, which subclasses LabelStudioMLBase and uses some imported sklearn helpers
(LogisticRegression, TfidfVectorizer, make_pipeline). The class has only a few methods:
- __init__:
  - checks that self.parsed_label_config is [a dict] of length 1 (this comes from the base class), and sets self.name and self.info from its key/value (see the sketch of a parsed config after this list).
    - The base class sets this attribute from parse_config(self.label_config) if self.label_config else {}.
    - The base class's __init__ signature is self, label_config=None, train_output=None, **kwargs. My understanding here is that the Flask app sends a POST request that includes the config of the project, among which is the labelling config set up in the UI, and that if this isn't text classification then initialising this backend will fail.
  - checks that the config's type (now in self.info) value is "Choices" (i.e. it's for classification).
  - checks that the config's to_name and inputs are length 1 (i.e. the model has just 1 input), and that the type of the input is "Text".
  - sets self.to_name from the config's to_name.
  - sets self.value from the config's first and only inputs value.
- reset_model creates the following simple 2-step model:

  self.model = make_pipeline(
      TfidfVectorizer(ngram_range=(1, 3), token_pattern=r"(?u)\b\w\w+\b|\w"),
      LogisticRegression(C=10, verbose=True),
  )

- predict gets the input text from the data of each of the tasks (passed in as an argument), runs self.predict_proba() on them, gets argmax indices of the predicted labels and scores, then zips the labels against the scores in a dict that gets listed for all the tasks.

  for idx, score in zip(predicted_label_indices, predicted_scores):
      predicted_label = self.labels[idx]
      # prediction result for the single task
      result = [{
          'from_name': self.from_name,
          'to_name': self.to_name,
          'type': 'choices',
          'value': {'choices': [predicted_label]}
      }]
      # expand predictions with their scores for all tasks
      predictions.append({'result': result, 'score': score})

- _get_annotated_dataset takes a project_id and is used to support webhook-based training workflows (in fit). It is marked "just for demo purposes", and uses the API key to authenticate a request to localhost/api/projects/{project_id}/export to retrieve data annotations for the project.
  - N.B. this is why the API key is needed.
- fit takes annotations (which gets renamed to tasks) and a workdir (not used if the MODEL_DIR env. var. is set), builds a list of input_texts by accessing ["data"].get(self.value) on each of the tasks, calls reset_model then self.model.fit with the input_texts and output_labels_idx, and returns a dict of labels (the sorted set of output labels coerced to a list) and model_file (the pickled model).
  - The optional kwarg data can be used to override annotations as tasks = _get_annotated_dataset(data["project"]["id"]).
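For concreteness, here is a hypothetical example of the shape of parsed_label_config that these checks expect for a single-Choices/single-Text project (the key and label names are made up; the fields follow the checks above and the get_single_tag_keys helper shown later):

# Assumed shape only; "sentiment"/"text"/label names are illustrative placeholders
parsed_label_config = {
    "sentiment": {                                 # the key becomes self.name
        "type": "Choices",                         # control tag type checked in __init__
        "to_name": ["text"],                       # must have length 1
        "inputs": [
            {"type": "Text", "value": "text"},     # single Text input; "value" -> self.value
        ],
        "labels": ["Positive", "Negative"],
    },
}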
The README
for this backend says to run curl http://localhost:9090/health to check the service is running OK,
and indeed it returns some JSON:
{"model_dir":"/data/models","status":"UP","v2":"true"}If we search through the repo for the word health we find that this is a Flask route defined in
label_studio_ml/api.py
@_server.route('/health', methods=['GET'])
@_server.route('/', methods=['GET'])
@exception_handler
def health():
return jsonify({
'status': 'UP',
'model_dir': _manager.model_dir,
'v2': os.getenv('LABEL_STUDIO_ML_BACKEND_V2', default=LABEL_STUDIO_ML_BACKEND_V2_DEFAULT)
})

Reading this also tells us that we can curl http://localhost:9090/ (the / route) and get the same output.
These are bound to our deployed app because when we initialised it we got the _server
module-level singleton object defined in this api.py module.
Another thing to notice here about this route funcdef is that it takes no arguments (pretty standard
for a GET request), and only uses environment variables and the _manager (module-level global variable).
Compare this to the POST request route for _predict:
@_server.route('/predict', methods=['POST'])
@exception_handler
def _predict():
data = request.json
tasks = data.get('tasks')
project = data.get('project')
label_config = data.get('label_config')
force_reload = data.get('force_reload', False)
try_fetch = data.get('try_fetch', True)
params = data.get('params') or {}
predictions, model = _manager.predict(
tasks, project, label_config, force_reload, try_fetch, **params
)
response = {
'results': predictions,
'model_version': model.model_version
}
return jsonify(response)

Here there is the implicit parameter request, which is provided by Flask
to a route when called.
It's known as a 'context' in Flask's docs, implemented in werkzeug as a context local. I don't know much about the implementation details here, other than that you access the
.json attribute and then you're just working with a regular dict (similar to locals()).
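As a minimal illustration of that pattern (a toy route, not from the repo):

from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route("/echo", methods=["POST"])
def echo():
    data = request.json                       # the parsed JSON body, a plain dict
    return jsonify({"got_tasks": data.get("tasks", [])})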
But how does this all work together? How can we test the /predict route? We can't just send a
plain POST request:
curl --header "Content-Type: application/json" --request POST --data '{}' http://localhost:9090/predictWe hit an exception in the LabelStudioMLManager.predict class method which receives the empty data
and get told the model is not loaded:
@classmethod
def predict(
cls, tasks, project=None, label_config=None, force_reload=False, try_fetch=True, **kwargs
):
if not os.getenv('LABEL_STUDIO_ML_BACKEND_V2', default=LABEL_STUDIO_ML_BACKEND_V2_DEFAULT):
if try_fetch:
m = cls.fetch(project, label_config, force_reload)
else:
m = cls.get(project)
if not m:
raise FileNotFoundError('No model loaded. Specify "try_fetch=True" option.')
predictions = m.model.predict(tasks, **kwargs)
return predictions, m
if not cls._current_model:
raise ValueError(f'Model is not loaded for {cls.__class__.__name__}: run setup() before using predict()')
predictions = cls._current_model.model.predict(tasks, **kwargs)
return predictions, cls._current_model

In other words, you should [let the UI] set this backend up before trying to decipher the inner workings any deeper.
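For reference, once a project and label config have been registered, a request shaped like the _predict route above would look roughly like this (a hypothetical payload; the task shape and project value are assumptions):

import requests

payload = {
    "tasks": [{"data": {"text": "A short example sentence."}}],   # assumed task shape
    "project": "1",
    "label_config": None,        # let the backend keep the config it was set up with
    "force_reload": False,
    "try_fetch": True,
    "params": {},
}
response = requests.post("http://localhost:9090/predict", json=payload)
print(response.json())           # {"results": [...], "model_version": ...}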
I don't want a text classifier like this though, I want a bounding box predictor (object detection model). This one doesn't tick all the boxes for my needs, which are:
- A HuggingFace model (the text classifier indeed uses HuggingFace transformers.AutoTokenizer and transformers.AutoModelForCausalLM). These will be useful for figuring out what to put in the predict method of the model class in my ML backend.
- An image model (not too complicated: similar to a language model, it'll use a tokenizer/processor and a model). This will be useful for figuring out how to pass the image data into the model (which has some requirements that get validated in the model's __init__ method).
It's clear that of these two, the first priority should be to find another object detection labelling studio backend,
so I'd be able to look at the assertions made in its equivalent of the simple_text_classifier's
SimpleTextClassifier.__init__() method.
I searched for the name of the base class LabelStudioMLBase on the code search site grep.app
(here) and indeed I landed on an image model,
Detectron2, a well-known semantic segmentation model (which is closer to object detection with
bboxes, but I expect will be outputting pixel-level masks).
Edit: in fact it is giving bboxes: in the code excerpt below, result type is "rectanglelabels".
Edit 2: it turned out I overlooked one right under my nose: mmdetection contains an object
detection example in this repo!
After some further digging, this turned out to be one of the LayoutParser developers' personal copy of code that would go on to become part of the official LayoutParser annotation service.
This is much closer to what I am aiming for, in fact it's the exact same task even,
however I want to use the LayoutLMv3 model on HuggingFace whereas this example
(obviously) is using a layoutparser model
(specifically lp.Detectron2LayoutModel)
From this example however, it's clear what function signatures we should aim for in an object detection API:
class ObjectDetectionAPI(LabelStudioMLBase):
def __init__(self, freeze_extractor=False, **kwargs):
...
def predict(self, tasks, **kwargs):
image_urls = [task["data"][self.value] for task in tasks]
images = [load_image_from_url(url) for url in image_urls]
layouts = [self.model.detect(image) for image in images]
predictions = []
for image, layout in zip(images, layouts):
height, width = image.shape[:2]
result = [
{
"from_name": self.from_name,
"to_name": self.to_name,
"original_height": height,
"original_width": width,
"source": "$image",
"type": "rectanglelabels",
"value": convert_block_to_value(block, height, width),
}
for block in layout
]
predictions.append({"result": result})
return predictions
def fit(self, completions, workdir=None, batch_size=32, num_epochs=10, **kwargs):
image_urls, image_classes = [], []
print("Collecting completions...")
# for completion in completions:
# if is_skipped(completion):
# continue
# image_urls.append(completion['data'][self.value])
# image_classes.append(get_choice(completion))
print("Creating dataset...")
# dataset = ImageClassifierDataset(image_urls, image_classes)
# dataloader = DataLoader(dataset, shuffle=True, batch_size=batch_size)
print("Train model...")
# self.reset_model()
# self.model.train(dataloader, num_epochs=num_epochs)
print("Save model...")
# model_path = os.path.join(workdir, 'model.pt')
# self.model.save(model_path)
return {"model_path": None, "classes": None}It's pretty clear here that this is a work in progress (i.e. all the commented out code). After getting to grips with how Label Studio backends work, I'm fairly certain that the training API isn't operational, the prediction service looks like it could be though.
The commented out line dataset = ImageClassifierDataset(image_urls, image_classes) caught my
attention, as it suggests that this was building on prior work. Indeed, searching on
grep.app shows that this name comes
from the label-studio-ml-backend repo:
- label_studio_ml/examples/pytorch_transfer_learning/pytorch_transfer_learning.py
- docs/source/tutorials/pytorch-image-transfer-learning.md
One thing I like about the code at this link is that it has nicely contained methods.
(To be covered below: the HuggingFace example has quite a messy fit method).
It's a pretty simple PyTorch dataset, but I'm not personally going to use URLs for my data, so it's not quite aligned to my needs.
I expect I'm going to use something more like this example by Niels Rogge (MLE at HuggingFace):
from torch.utils.data import Dataset
from PIL import Image
class CustomDataset(Dataset):
def __init__(self, root, df, processor):
self.root = root
self.df = df
self.processor = processor
def __getitem__(self, idx):
# get document image + corresponding words and boxes
item = self.df.iloc[idx]
image = Image.open(self.root + ...).convert('RGB')
words = item.words
boxes = item.boxes
# use processor to prepare everything for the model
encoding = self.processor(image, words, boxes=boxes)
return encoding

This is just a draft, assuming you have a root folder with all your document images, and a Pandas dataframe that contains the words + boxes for each document image.
You can then instantiate the dataset as follows:
from transformers import LayoutLMv3Processor
processor = LayoutLMv3Processor.from_pretrained("microsoft/layoutlmv3-base")
dataset = CustomDataset(root="path_to_your_root", df="your_dataframe", processor=processor)

Note that in the tutorial notebook it's clarified what the processor is:
Next, we prepare the dataset for the model. This can be done very easily using
LayoutLMv3Processor, which internally wraps a LayoutLMv3FeatureExtractor (for the image modality) and a LayoutLMv3Tokenizer (for the text modality) into one.
Back to the code at hand though! (There's not much to say)
The result here would be a good use case for a typing.TypedDict as the keys will always be the same.
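For example, a sketch of such a TypedDict, with field names taken from the result dicts above (illustrative only, not code from the repo):

from typing import List, TypedDict

class RectangleValue(TypedDict):
    x: float                     # left edge, % of image width
    y: float                     # top edge, % of image height
    width: float                 # % of image width
    height: float                # % of image height
    rotation: float
    rectanglelabels: List[str]
    score: float

class RectanglePrediction(TypedDict):
    from_name: str
    to_name: str
    original_height: int
    original_width: int
    source: str
    type: str                    # "rectanglelabels"
    value: RectangleValue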
Note here that convert_block_to_value(block, image_height, image_width) returns:
{
"height": block.height / image_height * 100,
"rectanglelabels": [str(block.type)],
"rotation": 0,
"width": block.width / image_width * 100,
"x": block.coordinates[0] / image_width * 100,
"y": block.coordinates[1] / image_height * 100,
"score": block.score,
}

...and block is a single object from layouts, which is a list returned from
self.model.detect(image) where as already stated, the model is Detectron2LayoutModel
from layoutparser (source here).
If we keep digging,
we see the detect method returns what it gets from running the gather_output
method on the output of calling self.model. To disregard the model here, the "gathering" involves
creating Layout objects
(from the lp.elements.layout module)
and putting TextBlock objects
in them, each populated with a block argument made of a
Rectangle,
both from the lp.elements.layout_elements module.
The Rectangle is a "manual dataclass" made of x_1, y_1, x_2, y_2
(or 'Lord of the Rings Bilbo' notation as I remember it: LT,RB).
So that's the block being iterated over in the predict method of
the ObjectDetectionAPI class (which subclasses LabelStudioMLBase),
and therefore we can interpret the values: the "x" and "y" are the
bbox's left and top coordinates as percentages of the image width and height,
while the "height" and "width" of the bbox are again relative to the image's
height and width (again as percentages).
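A worked example of that conversion, using a hypothetical 1000×800 image and a block with corner coordinates (x_1, y_1, x_2, y_2) = (100, 200, 300, 400):

image_width, image_height = 1000, 800
x_1, y_1, x_2, y_2 = 100, 200, 300, 400

value = {
    "x": x_1 / image_width * 100,                   # 10.0 (left edge as % of width)
    "y": y_1 / image_height * 100,                  # 25.0 (top edge as % of height)
    "width": (x_2 - x_1) / image_width * 100,       # 20.0
    "height": (y_2 - y_1) / image_height * 100,     # 25.0
    "rotation": 0,
}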
The block "score" comes from the model, it's difficult to look at the model directly in the code
due to the weird metaprogramming approach used (it comes from another package fvcore,
fvcore.common.registry,
via detectron2's registry),
which instantiates a module-wide 'registry' of architectures which get recorded through the
@META_ARCH_REGISTRY.register() decorator
(see search results here).
The block "type" is passed as the label of predicted classes list (pred_classes.tolist()).
I missed the OpenMMLab MMDetection toolbox example
backend at first, perhaps because it has such a simple class structure: it only has __init__ and
predict methods (as well as a _get_image_url helper method). It's not trainable through the
Label Studio interface, you just load trained checkpoints from file.
This one's a bit unusual: it's the only one I've seen here that asks to specify device in the class
__init__ signature (defaulting to "cpu").
It also loads labels from a file (the other backends populate their labels attribute
from the labels value in the info attribute that comes from the parsed_label_config dict's value).
Instead, the parsed_label_config dict's first value is assigned to schema which looks like it's
also a dict, with yet more dicts nested inside... (Type annotations would be valuable here!)
(
self.from_name,
self.to_name,
self.value,
self.labels_in_config,
) = get_single_tag_keys(self.parsed_label_config, "RectangleLabels", "Image")
schema = list(self.parsed_label_config.values())[0]
self.labels_in_config = set(self.labels_in_config)
# Collect label maps from `predicted_values="airplane,car"` attribute
# in <Label> tag
self.labels_attrs = schema.get("labels_attrs")
if self.labels_attrs:
for label_name, label_attrs in self.labels_attrs.items():
for predicted_value in label_attrs.get("predicted_values", "").split(
","
):
self.label_map[predicted_value] = label_nameThat get_single_tag_keys function is from the label_studio_ml.utils module:
def get_single_tag_keys(parsed_label_config, control_type, object_type):
"""
Gets parsed label config, and returns data keys related to the single
control tag and the single object tag schema
(e.g. one "Choices" with one "Text")
:param parsed_label_config: parsed label config returned by
"label_studio.misc.parse_config" function
:param control_type: control tag str as it written in label config
(e.g. 'Choices')
:param object_type: object tag str as it written in label config
(e.g. 'Text')
:return: 3 string keys and 1 array of string labels:
(from_name, to_name, value, labels)
"""
assert len(parsed_label_config) == 1
from_name, info = list(parsed_label_config.items())[0]
assert info["type"] == control_type, (
'Label config has control tag "<'
+ info["type"]
+ '>" but "<'
+ control_type
+ '>" is expected for this model.'
) # noqa
assert len(info["to_name"]) == 1
assert len(info["inputs"]) == 1
assert info["inputs"][0]["type"] == object_type
to_name = info["to_name"][0]
value = info["inputs"][0]["value"]
return from_name, to_name, value, info["labels"]As well as a 'getter', this is a helper validating the parsed_label_config!
The control_type appears to be more like the "type of task label"
(so classification task has example "Choices" type of task label)
which is "RectangleLabels" here because the labels are bboxes,
and the object_type is more like the "modality type of the task"
(so a text classifier has example "Text" type of modality)
which is "Image" here because the objects are detected in images.
I don't just want an image inputs/rectangular labels ML backend though, I specifically want to use HuggingFace's Transformers library to load my model and then make predictions with it.
If we hop into the examples directory and search for the transformers import statement we can see
what's been demo'd:
grep -r --include \*.py transformers
huggingface/gpt.py:from transformers import AutoTokenizer, AutoModelForCausalLM
bert/bert_classifier.py:from transformers import BertTokenizer, BertForSequenceClassification
bert/bert_classifier.py:from transformers import AdamW, get_linear_schedule_with_warmup
ner/ner.py:from transformers import (
ner/ner.py:from transformers import AdamW, get_linear_schedule_with_warmup
electra/electra.py:from transformers import ElectraTokenizerFast, ElectraForSequenceClassification
electra/electra.py:from transformers import Trainer
electra/electra.py:from transformers import TrainingArguments
So that's GPT, BERT, and Electra, all (text) language models. The ner directory likewise obviously
contains language models (for named entity recognition: BERT, Roberta, DistilBert, CamemBert).
Note from the imported names that bert is the first in this list that does classification
like the simple_text_classifier backend we saw above (BertForSequenceClassification).
This seems like a good example to compare to (it should otherwise be similar to simple_text_classifier).
The only difference between the two Docker Compose specs (bert and simple_text_classifier) is
that the BERT example does not specify any of the Label Studio-related env. vars:
diff simple_text_classifier/docker-compose.yml bert/docker-compose.yml
20,22d19
< - LABEL_STUDIO_ML_BACKEND_V2=true
< - LABEL_STUDIO_HOSTNAME=http://localhost:8000
< - LABEL_STUDIO_API_KEY=d6f8a2622d39e9d89ff0dfef1a80ad877f4ee9e3

The hostname and API key are used to GET data annotations from the Label Studio API [locally], and it turned out these aren't set up for this backend, hence the env. vars not being set.
Let's now look at the bert_classifier.py module. The class BertClassifier defines nearly all of the
same methods as SimpleTextClassifier. It doesn't define _get_annotated_dataset,
which I interpret as meaning this model does not support webhook-based training workflows (unconfirmed).
The rest:
- __init__ is identical but now with a few extra lines after the base class gets initialised (assigning attributes that were added to the previously blank self, **kwargs signature, all of which have defaults):

  self.pretrained_model = pretrained_model
  self.maxlen = maxlen
  self.batch_size = batch_size
  self.num_epochs = num_epochs
  self.logging_steps = logging_steps
  self.train_logs = train_logs

  - It's also not quite identical at the end, where the pickle loading is replaced with loading from_pretrained the model saved with save_pretrained in HuggingFace.
- reset_model is used to set up an initial model, but rather than the sklearn pipeline in SimpleTextClassifier, it's the model loaded from_pretrained again.
- predict does a few things differently:
  - First off, it won't return anything if the tokenizer attribute wasn't set by running the load() method in the __init__ method [by passing the truthiness check on self.train_output, which gets set in the base class when the train_output kwarg is passed].
  - Rather than just iterating over the tasks and sticking task["data"].get(self.value) into a list of input_texts, a proper dataloader is used (it's cooked up in the utils.prepare_texts function), and iterating over it gives input IDs and attention masks which are moved to the appropriate device upon being dataloaded.
  - After dataloading, model inference runs in a torch.no_grad block (which disables the gradient calculation), and then after this block the resulting logits are detached from the graph and put back on the CPU.
  - The scores and labels are assigned more neatly than in the sklearn model. The predicted label is listed directly rather than waiting to zip the argmax index against the score and look up the label just before building the result dict.
- fit has no argument annotations but instead completions, which has the annotations nested inside it. Compare the simple_text_classifier vs. bert_classifier; they're clearly pulling out the same thing:

  output_label = annotation['result'][0]['value']['choices'][0]
  output_label = completion['annotations'][0]['result'][0]['value']['choices'][0]

  After that, there's a ton more that goes on (whereas the sklearn backend's fit method abstracted it all away into the sklearn model.fit call). This is not particularly worth going through: a neural net training loop, with logs, tqdm'd dataloader, a model.train() call, loss backprop., and early stopping.

...it also defines:

- a load method, which loads the pretrained model, and overwrites some of the attributes defined at __init__ with attributes restored from the saved model (namely: batch_size, labels, maxlen).
- a not_trained property, which relies on checking if the self.tokenizer attribute has been set (it gets set in the load method).
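A rough sketch of that load/not_trained gating, paraphrased from the description above (not the repo's verbatim code):

class SketchedBertBackend:
    def load(self, train_output):
        # loading from_pretrained would happen here; setting the tokenizer is
        # what marks the backend as trained
        self.tokenizer = ...  # a BertTokenizer in the real backend
        self.model = ...      # a BertForSequenceClassification in the real backend

    @property
    def not_trained(self):
        return not hasattr(self, "tokenizer")

    def predict(self, tasks, **kwargs):
        if self.not_trained:
            return []         # refuse to predict until the model has been trained/loaded
        ...                   # the dataloader + torch.no_grad() inference described above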
The first thing notable in the electra backend is that its _wsgi.py is near identical to the
bert directory's: the BertClassifier [subclass of LabelStudioMLBase] is just replaced with ElectraTextClassifier.
The electra.py module is simpler however: totalling 145 lines to the BERT module's 221.
Right from the start you can see it has fewer imports.
On closer inspection this is because the Electra model is trained with
the HuggingFace Trainer class,
so other than transformers, the only libraries loaded in the module are requests and json!
- See its source code for details on what is abstracted away into this class or click through to particular sections from the docs
diff <(grep import examples/bert/bert_classifier.py) <(grep import examples/electra/electra.py)
2c2,3
< import numpy as np
---
> import requests
> import json
4,10c5,7
< from torch.utils.data import SequentialSampler
< from tqdm import tqdm, trange
< from collections import deque
< from tensorboardX import SummaryWriter
< from transformers import BertTokenizer, BertForSequenceClassification
< from transformers import AdamW, get_linear_schedule_with_warmup
< from torch.utils.data import TensorDataset, DataLoader, RandomSampler
---
> from transformers import ElectraTokenizerFast, ElectraForSequenceClassification
> from transformers import Trainer
> from transformers import TrainingArguments
12c9
< from utils import prepare_texts, calc_slope
---
> from label_studio_tools.core.label_config import parse_config

- The __init__ method is conspicuously lacking any assert statements (the other 2 examples had checks for the config's inputs value being length 1, i.e. for single labels in annotations). It seems to just rely on it implicitly however, and behaves the same:

  self.value = self.info["inputs"][0]["value"]

- The fit method is shrunk back down closer to the simple_text_classifier backend, after being crammed full of training loop logic in the BERT backend.
- There's no load method (which in the BERT model was checking if self.tokenizer was set). Here self.tokenizer gets set in the __init__ method.
- There is a load_config method, but this is used to initialise the parsed_label_config if fit is called before that's set. It's set when the base class initialises but can be {} if no config is passed (i.e. if the ElectraTextClassifier class isn't passed a label_config kwarg on init).
- The predict method has the nice neat HuggingFace style of predictions (seen in the BERT example) but keeps the label index from argmax as seen in the simple_text_classifier's sklearn code. This is the best of both worlds.
- The _get_annotated_dataset method is back, and handles the 'webhook' events with the API key (though the API key is hardcoded here, rather than set as an env. var. in the Docker Compose spec as done in the simple_text_classifier).
- There is also a new _get_text_from_s3 method which I don't need.
It also includes a CustomDataset class similar to the draft above.
To step aside and review how each of the models are loaded and what it entails for the resulting backend capabilities (we can put a checkbox beside them to indicate if they are compatible with local training):
grep -r --include \*.py "self.model ="⇣
label_studio_ml/examples/flair/ner_ml_backend.py: self.model = self.load(self.train_output["base_path"])
label_studio_ml/examples/huggingface/gpt.py: self.model = AutoModelForCausalLM.from_pretrained(self.model_name)
label_studio_ml/examples/mmdetection/mmdetection.py: self.model = init_detector(config_file, checkpoint_file, device=device)
label_studio_ml/examples/bert/bert_classifier.py: self.model = BertForSequenceClassification.from_pretrained(pretrained_model)
label_studio_ml/examples/nemo/asr.py: self.model = nemo_asr.models.EncDecCTCModel.from_pretrained(
label_studio_ml/examples/tensorflow/mobilenet_finetune.py: self.model = tf.keras.Sequential(
label_studio_ml/examples/simple_text_classifier/simple_text_classifier.py: self.model = pickle.load(f)
label_studio_ml/examples/simple_text_classifier/simple_text_classifier.py: self.model = make_pipeline(
label_studio_ml/examples/pytorch_transfer_learning/pytorch_transfer_learning.py: self.model = models.resnet18(pretrained=True)
label_studio_ml/examples/pytorch_transfer_learning/pytorch_transfer_learning.py: self.model = self.model.to(device)
label_studio_ml/examples/pytorch_transfer_learning/pytorch_transfer_learning.py: self.model = ImageClassifier(len(self.classes), freeze_extractor)
label_studio_ml/examples/pytorch_transfer_learning/pytorch_transfer_learning.py: self.model = ImageClassifier(len(self.classes), freeze_extractor)
label_studio_ml/examples/pytorch_transfer_learning/pytorch_transfer_learning.py: self.model = ImageClassifier(len(self.classes), self.freeze_extractor)
label_studio_ml/examples/electra/electra.py: self.model = ElectraForSequenceClassification.from_pretrained("my_model")
label_studio_ml/examples/electra/electra.py: self.model = ElectraForSequenceClassification.from_pretrained(
- The flair backend assigns self.model in the __init__ method in a conditional block checking self.train_output (which gets set in the base class __init__), and if the check fails it just doesn't load a model (doesn't even assign None to the attribute! Dicey).
  - The model is always loaded from a local path with filename best_model.pt.
- The gpt backend doesn't do any check, and the model gets assigned from self.model_name, so it can't be trained (so it isn't usable in active learning retraining workflows).
- The mmdetection backend sets it once, as for gpt.
- The bert backend sets it from pretrained_model in a load method. Unless I'm mistaken, there's a bug where reset_model is a no-op: the model it returns isn't assigned to self.model (as the simple_text_classifier sklearn backend did). In fact the only other example with a reset_model method is the pytorch_transfer_learning backend, and indeed it binds self.model too.
  - That said, if it were fixed (so the call to reset_model in the __init__ method assigned to self.model) it would be training-compatible.
- There is also the ner backend, which sets self._model from the self.train_output attribute's model_path value (which with HuggingFace can of course be a HuggingFace Hub-hosted model path rather than a local file path).
- The nemo ASR backend sets it once, as for gpt.
- The tensorflow backend sets it once but loads weights afterwards if self.train_output is set (truthily).
- The simple_text_classifier backend either sets it from reset_model and immediately fits to initialise it if train_output is falsey, or sets it from the pickle if a local model_file is passed in train_output.
- The pytorch_transfer_learning backend loads its classes if train_output is passed, and loads weights into the model after assigning it too; otherwise it just initialises it. This is done quite neatly (probably helped by defining the model class in the module itself, not relying on importing an external one).
- The electra backend checks if a hardcoded path exists then doesn't use the hardcoded path (but it clearly is supposed to). Again, yes, you can train the model and use it here.
So to summarise the training-friendly backend examples and whether they're good templates to build from:

- bert pulls in the labels from the label config info before resetting the model if training output is not available; if available it loads that and gets the labels and [should get] the model from there.
  - It uses the fact that only load (not reset_model) sets the tokenizer to distinguish whether it's not_trained (so as to refuse to predict until being trained).
- ner uses the model_path from the training output if provided, otherwise just sets labels.
  - I.e. it does the same as the BERT backend, and can't predict until trained.
- tensorflow is the odd one out, starting with the same model regardless of self.train_output but then loading weights into it if available. This uses Keras not HuggingFace, so it's not applicable.
- simple_text_classifier calls fit directly on self.model after resetting the model if not trained, otherwise unpickles the model.
- pytorch_transfer_learning calls load if training output is available, otherwise instantiates the model directly (not put in a reset_model method but the same idea). Really it should call reset_model in both blocks of that condition.
- electra instantiates the model directly, just changing the path based on whether the model file exists. I'm not a fan of the hardcoded value, but I do like that the attributes are consistent regardless of whether the model was trained already (self.tokenizer gets set either way too).

I'd like:

- The attribute assignment simplicity of electra (and the subsequent ability to predict regardless of whether trained or not)
- The model path handling of ner
- The proper reset_model/load method handling of bert (when fixed as above)
- The proper assertion checks on __init__ of bert
So having homed in on these 5 (SimpleTextClassifier, BERT, Electra, the NER tagger, and MMDetection), as well as the partial Detectron2 example, it's clear that we actually want to mix and match aspects of code from various sources.

- API key handling is only done properly (in the Docker Compose spec) by the SimpleTextClassifier. Electra uses it too in _get_annotated_dataset but it's a hardcoded module string literal.
- Training is done most neatly (i.e. more simply, abstracting the details away) in Electra, and matches the use of the Trainer API in the LayoutLMv3 tutorial by Niels Rogge (via the Transformer-Tutorials repo).
- Prediction is done most neatly in Electra, but I'd still prefer a TypedDict for the results to make it even cleaner.
- Bounding box handling is done in Detectron2 and MMDetection. The result type will be changed from choices to rectanglelabels.
- Config assertions are done in BERT's __init__ method, and these may be useful to write (they're not in Electra).
- GPU handling is only done explicitly in BERT, but I expect the Trainer class handles that in Electra. This is handled through the place_model_on_device property of the TrainingArguments class,
  - ...which is True if transformers.utils.import_utils's is_sagemaker_mp_enabled() evaluates to False (i.e. if not using model parallelism, which is set via the SM_HP_MP_PARAMETERS env. var., else defaults to False).
- The processor is going to go where the tokenizer goes in Electra (in the __init__ method) for use in predict and fit. Even though it's said to be 'pretrained', it doesn't get retrained so we don't need to load it, so it doesn't need to be conditional on there being train_output (see discussion).
- The model is going to go where it goes in Electra (in the __init__ method) but rather than instantiating it here from a hardcoded MODEL_FILE module-level global variable, it's going to be loaded via the path given by the load method like in BERT if train_output is available, otherwise from reset_model. This condition will look more like BERT but without moving devices (unsure?). Like the SimpleTextClassifier, the labels attribute is set from train_output if loading, else from info if using reset_model.
  - reset_model should not be passing hardcoded defaults through (as in BERT); they should be method defaults (as in SimpleTextClassifier). The method should take no arguments.
At the risk of overemphasising, let's turn that inside out so it's in terms of what 'features' we want from each source:
- BERT: reset_model/load pattern; config assertions in __init__
- Electra: simple attribute assignment [in particular of the tokenizer, i.e. processor in my case] in __init__ (permitting use of predict even if no train_output); prediction with Trainer API (with automatic GPU device handling); _get_annotated_dataset
- Simple Text Classifier: API key handling in Docker Compose spec and _get_annotated_dataset; method-level defaults in reset_model (not hardcoded in __init__'s call to that method)
- Detectron2 and MMDetection: bbox handling
- Niels Rogge's Transformer-Tutorials LayoutLMv3 notebook: Trainer API usage; custom Dataset (from issue #123)
- NER tagger: model_path handling
Since this is quite an ambitious rewrite (with at least 4 different sources in the
examples here, plus likely reusing some of Niels Rogge's code for
datasets
and training with the Trainer API),
I'll want to take a principled approach, and record what I do, with version control so I can roll
back (or at least review) any mistakes.
- The first step is to begin adapting the most relevant template (Electra), which will achieve GPU handling (which we get 'for free' with the Trainer API) and immediately check one of our features off the to-do list.
- Then we should start modifying the model itself. The first 'easy win' is API key usage.
- Next, we should move onto the model class __init__ method, and tackle some real wins:
  - Adapting the signature to be relevant to the LayoutLMv3 model's args/kwargs.
  - The config assertions (just guess if unsure; we can fix them if they fail)
  - The processor instantiation (another easy win)
  - The model instantiation within a train_output conditional block.
- That just leaves:
  - Prediction (of which bbox handling is a component), which will give us preannotation
  - Training, which will give us a retrainable (fine-tuneable) model which will learn from the annotation labels we provide in Label Studio
Our recipe is therefore:
- GPU handling
- API key handling
- Config assertions
- Processor
- Model
- Prediction
- Bbox handling
- Training
The first question is obviously: where to start? I.e. which template to begin adapting from?
Well, out of the sources above, Electra has the longest list of 'features' I want.
Looked at another way, the most complexity-reducing thing we have here is the Trainer API,
and that's only in Electra (Niels Rogge's Trainer API example is not a Label Studio backend example).
At the risk of bikeshedding I'm going to just go with that impulse...
- Copy the directory and rename to layoutlmv3
- Rename the model module to layoutlmv3.py
- Overwrite the import line for the model module with the new model module (layoutlmv3) and class name (LayoutLMv3Classifier)
- Overwrite the model class name with the new one
cp -r electra layoutlmv3
cd layoutlmv3
mv electra.py layoutlmv3.py
sed -i 's/from electra import ElectraTextClassifier/from layoutlmv3 import LayoutLMv3Classifier/' _wsgi.py
sed -i 's/ElectraTextClassifier/LayoutLMv3Classifier/g' _wsgi.py

Before we start adapting the model class, we should really ensure that class is renamed too (so far it's just renamed in the server module).

sed -i 's/ElectraTextClassifier/LayoutLMv3Classifier/g' layoutlmv3.py

A major feature that is missing from Electra is that it doesn't handle the API key from an
environment variable set in the Docker Compose spec, it handles it from a hard-coded string.
We can get this easily enough by copying the Docker Compose spec over from simple_text_classifier
and then using it in layoutlmv3.py the same way the simple_text_classifier.py module uses it.
Just copy the variables in the environment section of the YAML (I just did this in a text editor)
--- a/label_studio_ml/examples/layoutlmv3/docker-compose.yml
+++ b/label_studio_ml/examples/layoutlmv3/docker-compose.yml
@@ -18,6 +18,9 @@ services:
- REDIS_HOST=redis
- REDIS_PORT=6379
- USE_REDIS=true
+ - LABEL_STUDIO_ML_BACKEND_V2=true
+ - LABEL_STUDIO_HOSTNAME=http://localhost:8000
+ - LABEL_STUDIO_API_KEY=d6f8a2622d39e9d89ff0dfef1a80ad877f4ee9e3

Since it's all local, I imagine you could change that key to be whatever you wanted instead? (TBC)
The model module layoutlmv3.py now needs to use those environment variables like
simple_text_classifier.py does:
--- a/label_studio_ml/examples/layoutlmv3/layoutlmv3.py
+++ b/label_studio_ml/examples/layoutlmv3/layoutlmv3.py
@@ -12,9 +12,15 @@
+from label_studio_ml.utils import DATA_UNDEFINED_NAME, get_env
+
+HOSTNAME = get_env("HOSTNAME", "http://localhost:8080")
+API_KEY = get_env("API_KEY")
+
+print("=> LABEL STUDIO HOSTNAME = ", HOSTNAME)
+if not API_KEY:
+ print("=> WARNING! API_KEY is not set")
-HOSTNAME = "https://app.heartex.com/"
-API_KEY = ""Finally we also need to modify the _get_annotated_dataset method (which Electra had)
to use the same 'best practice' method of simple_text_classifier (missing exception handling):
--- a/label_studio_ml/examples/layoutlmv3/layoutlmv3.py
+++ b/label_studio_ml/examples/layoutlmv3/layoutlmv3.py
@@ -142,6 +142,11 @@ class LayoutLMv3Classifier(LabelStudioMLBase):
response = requests.get(
download_url, headers={"Authorization": f"Token {API_KEY}"}
)
+ if response.status_code != 200:
+ raise Exception(
+ f"Can't load task data using {download_url}, "
+ f"response status_code = {response.status_code}"
+ )
return json.loads(response.content)

And with that we should have enabled webhook-triggered training with the Docker Compose-specified API key.
The only reason you might not want to do this is if the error would crash your annotation session, but I'd expect it to fail early, before you'd done any annotation, so not losing any work.
BERT had some confident assertions that demonstrate data validation on the input config, so that we can't accidentally use this backend with the wrong task type (or something like that).
The obvious question here is: what are we going to check? What are we expecting?
Well, we can't just reuse the BERT code as we are not expecting to classify choices but rather
to have labelled bounding boxes or rectanglelabels as they're known.
Here are the checks the BERT classifier does:
# then collect all keys from config which will be used to extract data from task and to form prediction
# Parsed label config contains only one output of <Choices> type
assert len(self.parsed_label_config) == 1
self.from_name, self.info = list(self.parsed_label_config.items())[0]
assert self.info["type"] == "Choices"
# the model has only one textual input
assert len(self.info["to_name"]) == 1
assert len(self.info["inputs"]) == 1
assert self.info["inputs"][0]["type"] == "Text"
self.to_name = self.info["to_name"][0]
self.value = self.info["inputs"][0]["value"]We aren't using outputs of Choices type, but RectangleLabels (recall this is known as the control_type).
If you review the code above from the get_single_tag_keys helper function in
label_studio_ml.utils, it is in fact the exact same check. So we can just call that and have a
much more concise (thus more maintainable) model class __init__.
So in fact, we really want to copy the mmdetection backend's routine here.
diff --git a/label_studio_ml/examples/layoutlmv3/layoutlmv3.py
b/label_studio_ml/examples/layoutlmv3/layoutlmv3.py
index 0e7e5bb..2ddbfdb 100644
--- a/label_studio_ml/examples/layoutlmv3/layoutlmv3.py
+++ b/label_studio_ml/examples/layoutlmv3/layoutlmv3.py
@@ -25,13 +25,17 @@ MODEL_FILE = "my_model"
class LayoutLMv3Classifier(LabelStudioMLBase):
+ control_type: str = "RectangleLabels"
+ object_type: str = "Image"
+
def __init__(self, **kwargs):
super(LayoutLMv3Classifier, self).__init__(**kwargs)
try:
- self.from_name, self.info = list(self.parsed_label_config.items())[0]
- self.to_name = self.info["to_name"][0]
- self.value = self.info["inputs"][0]["value"]
- self.labels = sorted(self.info["labels"])
+ self.from_name, self.to_name, self.value, self.labels = get_single_tag_keys(
+ self.parsed_label_config,
+ control_type=self.control_type,
+ object_type=self.object_type,
+ )
except BaseException:
print("Couldn't load label config")While we're at it, we may as well set some class attributes and type annotate to make it clearer.
These print statements are annoyingly amateur though: I then swapped them all for logger.error calls.
Next I removed some code repetition and made load_config only take the self argument.
With that, the config step was all done, and tucked away neatly into a load_config method.
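The result is roughly the following shape (a sketch under the assumptions above, not the verbatim module):

import logging

from label_studio_ml.utils import get_single_tag_keys

logger = logging.getLogger(__name__)

class LayoutLMv3ClassifierConfigSketch:
    control_type = "RectangleLabels"
    object_type = "Image"

    def load_config(self) -> None:
        # self.parsed_label_config is set by the LabelStudioMLBase base class
        if not self.parsed_label_config:
            logger.error("No parsed_label_config available yet")
            return
        try:
            self.from_name, self.to_name, self.value, self.labels = get_single_tag_keys(
                self.parsed_label_config,
                control_type=self.control_type,
                object_type=self.object_type,
            )
        except Exception:
            logger.error("Couldn't load label config", exc_info=True)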
We create the processor just once, as Electra did for its tokenizer (so we just need to adapt
this tokenizer to be a processor).
To make it neater, I made the processor name a class attribute, and the processor class another.
I swapped the Electra tokenizer import for LayoutLMv3Processor
(while at it also swapping the ElectraForSequenceClassification with LayoutLMv3ForTokenClassification)
and was now halfway done migrating it from Electra to LayoutLMv3:
class LayoutLMv3Classifier(LabelStudioMLBase):
control_type: str = "RectangleLabels"
object_type: str = "Image"
hf_hub_name: str = "microsoft/layoutlmv3-base"
hf_model_cls: Type = LayoutLMv3ForTokenClassification
hf_processor_cls: Type = LayoutLMv3Processor
def __init__(self, **kwargs):
super(LayoutLMv3Classifier, self).__init__(**kwargs)
self.load_config()
self.processor = self.hf_processor_cls.from_pretrained(self.hf_hub_name)

There are two options for the model class: LayoutLMv3ForSequenceClassification and LayoutLMv3ForTokenClassification.
The 'sequence' is a document (e.g. if you wanted to distinguish different types of document),
and the 'token' is a part of a document (I want to annotate and classify parts of documents so I chose this).
We instantiate our model in two different ways: either with reset_model or with load (if we have
train_output).
What I did initially was to simplify the Electra model initialisation into two lines:
model_to_load = MODEL_FILE if Path(MODEL_FILE).exists() else self.hf_hub_name
self.model = self.hf_model_cls.from_pretrained(model_to_load)

At this point I committed my changes in case I messed the next step up.
However as already established, this conditional block should actually be as in
simple_text_classifier and bert.
This part of the code had hardcoded device="cpu", so I replaced that with a module-level global
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"This left me with the outline of the new model, but still with the BERT kwargs.
I added more type annotations, made the reset_model take no arguments and return nothing,
and annotated load as returning nothing too.
if not self.train_output:
self.labels = self.info["labels"]
self.reset_model()
load_repr = "Initialised with"
else:
self.load(self.train_output)
load_repr = f"Loaded from train output with"
logger.info(f"{load_repr} {self.from_name=}, {self.to_name=}, {self.labels=!s}")
def reset_model(self) -> None:
# THESE KWARGS HAVE NOT BEEN CHANGED FROM BERT ! TODO
model_kwargs = dict(
num_labels=len(self.labels),
output_attentions=False,
output_hidden_states=False,
cache_dir=None,
)
model = self.hf_model_cls.from_pretrained(
self.hf_hub_name,
**model_kwargs
)
model.to(DEVICE)
self.model = model
return
def load(self, train_output) -> None:
pretrained_model = train_output["model_path"]
self.model = self.hf_model_cls.from_pretrained(pretrained_model)
self.model.to(DEVICE)
self.model.eval()
self.batch_size = train_output["batch_size"]
self.labels = train_output["labels"]
self.maxlen = train_output["maxlen"]

Now getting the arguments to the model class looks tricky: if we review the BERT signature which we are adapting:
class BertForSequenceClassification(BertPreTrainedModel)
| Args:
| input_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`):
| Indices of input sequence tokens in the vocabulary.
|
| Indices can be obtained using [`BertTokenizer`]. See [`PreTrainedTokenizer.encode`] and
| [`PreTrainedTokenizer.__call__`] for details.
|
| [What are input IDs?](../glossary#input-ids)
| attention_mask (`torch.FloatTensor` of shape `(batch_size, sequence_length)`, *optional*):
| Mask to avoid performing attention on padding token indices. Mask values selected in `[0, 1]`:
|
| - 1 for tokens that are **not masked**,
| - 0 for tokens that are **masked**.
|
| [What are attention masks?](../glossary#attention-mask)
| token_type_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
| Segment token indices to indicate first and second portions of the inputs. Indices are selected in `[0,
| 1]`:
|
| - 0 corresponds to a *sentence A* token,
| - 1 corresponds to a *sentence B* token.
|
| [What are token type IDs?](../glossary#token-type-ids)
| position_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
| Indices of positions of each input sequence tokens in the position embeddings. Selected in the range `[0,
| config.max_position_embeddings - 1]`.
|
| [What are position IDs?](../glossary#position-ids)
| head_mask (`torch.FloatTensor` of shape `(num_heads,)` or `(num_layers, num_heads)`, *optional*):
| Mask to nullify selected heads of the self-attention modules. Mask values selected in `[0, 1]`:
|
| - 1 indicates the head is **not masked**,
| - 0 indicates the head is **masked**.
|
| inputs_embeds (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`, *optional*):
| Optionally, instead of passing `input_ids` you can choose to directly pass an embedded representation. This
| is useful if you want more control over how to convert `input_ids` indices into associated vectors than the
| model's internal embedding lookup matrix.
| output_attentions (`bool`, *optional*):
| Whether or not to return the attentions tensors of all attention layers. See `attentions` under returned
| tensors for more detail.
| output_hidden_states (`bool`, *optional*):
| Whether or not to return the hidden states of all layers. See `hidden_states` under returned tensors for
| more detail.
| return_dict (`bool`, *optional*):
| Whether or not to return a [`~utils.ModelOutput`] instead of a plain tuple.
|
| labels (`torch.LongTensor` of shape `(batch_size,)`, *optional*):
| Labels for computing the sequence classification/regression loss. Indices should be in `[0, ...,
| config.num_labels - 1]`. If `config.num_labels == 1` a regression loss is computed (Mean-Square loss), If
| `config.num_labels > 1` a classification loss is computed (Cross-Entropy).

This one's really simple: we just call the model with the inputs. The inputs must be in a specific format though, and we need to handle the bounding box rectangles properly.
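A sketch of what that forward pass could look like with the LayoutLMv3 processor and model (an assumption-laden sketch, not the final predict: it presumes the processor is created with apply_ocr=False so we can supply our own words and boxes, and that boxes are in the 0-1000 normalised LayoutLM format):

import torch
from PIL import Image
from transformers import LayoutLMv3ForTokenClassification, LayoutLMv3Processor

processor = LayoutLMv3Processor.from_pretrained("microsoft/layoutlmv3-base", apply_ocr=False)
model = LayoutLMv3ForTokenClassification.from_pretrained("microsoft/layoutlmv3-base", num_labels=2)

def predict_tokens(image: Image.Image, words, boxes):
    # prepare pixel values, input ids and bbox tensors in one call
    encoding = processor(image, words, boxes=boxes, return_tensors="pt")
    with torch.no_grad():                    # no gradients needed at inference time
        outputs = model(**encoding)
    # one predicted class index per token
    return outputs.logits.argmax(-1).squeeze().tolist()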
The fit method, on the other hand, really is simple: we just call Trainer.train()
(TODO)
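A rough sketch of what that fit could boil down to with the Trainer API, assuming a train_dataset built from the Label Studio annotations (the function name and hyperparameters are placeholders, mirroring the electra backend's approach):

from transformers import Trainer, TrainingArguments

def fit_sketch(model, train_dataset, workdir="my_model"):
    training_args = TrainingArguments(
        output_dir=workdir,
        num_train_epochs=3,
        per_device_train_batch_size=2,
        logging_steps=10,
    )
    trainer = Trainer(model=model, args=training_args, train_dataset=train_dataset)
    trainer.train()
    trainer.save_model(workdir)      # reload later with from_pretrained(workdir)
    return {"model_path": workdir}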