74 changes: 55 additions & 19 deletions docs/en/alab/install.md
@@ -102,46 +102,82 @@ commands inside `annotationlab-installer.sh` and `annotationlab-updater.sh` file

### Backup and restore

#### Backup

You can enable daily backups by adding the following variables, with the `--set` option, to the helm command in `annotationlab-updater.sh`:

```bash
backup.enable=true
backup.files=true
backup.s3_access_key="<ACCESS_KEY>"
backup.s3_secret_key="<SECRET_KEY>"
backup.s3_bucket_fullpath="<FULL_PATH>"
```

`<ACCESS_KEY>` - your access key for AWS S3 access
`<SECRET_KEY>` - your secret key for AWS S3 access
`<FULL_PATH>` - full path to your backup directory in the S3 bucket (e.g. s3://example.com/path/to/my/backup/dir)
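
For illustration, here is a rough sketch of how these flags might be appended to the existing helm command in `annotationlab-updater.sh` (the release name, chart path, and namespace below are assumptions; keep whatever your script already uses):

```bash
# Sketch only: release name, chart path and namespace are placeholders.
helm upgrade annotationlab ./annotationlab \
  --namespace annotationlab \
  --set backup.enable=true \
  --set backup.files=true \
  --set backup.s3_access_key="<ACCESS_KEY>" \
  --set backup.s3_secret_key="<SECRET_KEY>" \
  --set backup.s3_bucket_fullpath="s3://example.com/path/to/my/backup/dir"
```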

*Notice:* File backups are enabled by default. If you don't need to back up files, change

```bash
backup.files=true
```

to

```bash
backup.files=false
```

**Configure Backup from the UI**

Backups can also be configured by an admin user from the UI. Go to Settings > Backup and set the parameters.

<img class="image image--xl" src="/assets/images/annotation_lab/3.1.0/backupRestoreUI.png" style="width:100%; align:center; box-shadow: 0 3px 6px rgba(0,0,0,0.16), 0 3px 6px rgba(0,0,0,0.23);"/>


#### Restore

**Database**

To restore Annotation Lab from a backup you need a new, clean installation of Annotation Lab. Create it with `annotationlab-install.sh`. Then download the latest backup from your S3 bucket and move the archive to the `restore/database/` directory. Next, go to the `restore/database/` directory and execute the `restore_all_databases.sh` script with the name of your backup archive as the argument.

For example:

```
cd restore/database/
sudo ./restore_all_databases.sh 2022-04-14-annotationlab-all-databases.tar.xz
```
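
If your backups live in S3, fetching the latest archive beforehand could look roughly like this (a sketch assuming the AWS CLI is installed and configured with the same credentials used for the backup; the bucket path reuses the example from above):

```bash
# List available backups, then copy the newest archive into restore/database/
aws s3 ls s3://example.com/path/to/my/backup/dir/
aws s3 cp s3://example.com/path/to/my/backup/dir/2022-04-14-annotationlab-all-databases.tar.xz restore/database/
```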

*Notice:* You need `xz` and `bash` installed to execute this script.

*Notice:* This script works only with backups created by the Annotation Lab backup system.

*Notice:* Run this script with the `sudo` command.

After the database restore completes, you can check the logs in the `restore_log` directory created by the restore script.

**Files**

Download your files backup and move it to the `restore/files` directory. Go to the `restore/files` directory and execute the `restore_files.sh` script with the name of your backup archive as the argument. For example:

```
cd restore/files/
sudo ./restore_files.sh 2022-04-14-annotationlab-files.tar
```

*Notice:* You need `bash` installed to execute this script.

*Notice:* This script works only with backups created by the Annotation Lab backup system.

*Notice:* Run this script with the `sudo` command.

**Reboot**

After restoring the database and files, reboot Annotation Lab:

```
sudo reboot
```

## Recommended Configurations

2 changes: 1 addition & 1 deletion docs/en/ocr.md
@@ -18,7 +18,7 @@ Spark OCR is another commercial extension of Spark NLP for optical character rec


Spark OCR is built on top of ```Apache Spark``` and offers the following capabilities:
- Image pre-processing algorithms to improve text recognition results:
- Adaptive thresholding & denoising
- Skew detection & correction
- Adaptive scaling
6 changes: 3 additions & 3 deletions docs/en/ocr_install.md
@@ -20,7 +20,7 @@ Currently, it supports 3.0.*, 2.4.* and 2.3.* versions of Spark.
It is recommended to have basic knowledge of the framework and a working environment before using Spark OCR. Refer to Spark [documentation](http://spark.apache.org/docs/2.4.4/index.html) to get started with Spark.


Spark OCR requires:
- Scala 2.11 or 2.12, matching the Spark version
- Python 3.7+ (if using PySpark)

@@ -47,7 +47,7 @@ You can start a spark REPL with Scala by running in your terminal a spark-shell
spark-shell --jars ####
```

The #### is a secret URL only available for license users. If you have purchased a license but did not receive it, please contact us at [email protected].

</div>

@@ -85,7 +85,7 @@ Install python package using pip:
pip install spark-ocr==1.8.0.spark24 --extra-index-url #### --ignore-installed
```

The #### is a secret URL only available for license users. If you have purchased a license but did not receive it, please contact us at [email protected].

</div><div class="h3-box" markdown="1">

6 changes: 3 additions & 3 deletions docs/en/ocr_object_detection.md
@@ -15,7 +15,7 @@ sidebar:
## ImageHandwrittenDetector

`ImageHandwrittenDetector` is a DL model for detecting handwritten text in images.
It's based on a Cascade Region-based CNN network.

The detector supports the following labels:
- 'signature'
@@ -139,8 +139,8 @@ display_images(data, "image_with_regions")

## ImageTextDetector

`ImageTextDetector` is a DL model for detecting text in images.
It's based on the CRAFT network architecture.


#### Input Columns
44 changes: 22 additions & 22 deletions docs/en/ocr_pipeline_components.md
@@ -33,8 +33,8 @@ Next section describes the transformers that deal with PDF files with the purpos
{:.table-model-big}
| Param name | Type | Default | Description |
| --- | --- | --- | --- |
| splitPage | bool | true | Whether to split the document into pages |
| textStripper | | TextStripperType.PDF_TEXT_STRIPPER | Extract unstructured text |
| sort | bool | false | Sort text during extraction with TextStripperType.PDF_LAYOUT_STRIPPER |
| partitionNum | int | 0 | Force repartition of the dataframe if set to a value greater than 0. |
| onlyPageNum | bool | false | Extract only page numbers. |
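
For orientation, a minimal PySpark sketch of how these params are typically set (this assumes the table belongs to `PdfToText` and that setters follow the usual `set<ParamName>` convention; treat the setter names as assumptions):

```python
from sparkocr.transformers import PdfToText

# Sketch only: setter names are assumed from the param names above.
pdf_to_text = PdfToText() \
    .setInputCol("content") \
    .setOutputCol("text") \
    .setSplitPage(True) \
    .setSort(False)
```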
@@ -117,8 +117,8 @@ data.select("pagenum", "text").show()

`PdfToImage` renders PDF to an image. To be used with scanned PDF documents.
The output dataframe contains a `total_pages` field with the total number of pages.
When processing a PDF with a large number of pages, prefer to split the PDF by setting the `splitNumBatch` param.
The number of partitions should be equal to the number of cores/executors.
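
As a hedged illustration of that advice (the `setSplitNumBatch` setter name is assumed from the param name; the path and core count are placeholders):

```python
from sparkocr.transformers import PdfToImage

num_cores = 8  # match the number of cores/executors in your cluster

# `spark` is an active SparkSession.
# Read scanned PDFs as binary files (example path) and repartition to match the cores.
df = spark.read.format("binaryFile").load("path/to/pdfs/*.pdf").repartition(num_cores)

pdf_to_image = PdfToImage() \
    .setInputCol("content") \
    .setOutputCol("image") \
    .setSplitNumBatch(num_cores)  # assumption: setter for the splitNumBatch param

result = pdf_to_image.transform(df)
```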

##### Input Columns

Expand Down Expand Up @@ -228,7 +228,7 @@ column and create multipage PDF document.

**Example:**

Read images and store them as single-page PDF documents.


<div class="tabs-box pt0" markdown="1">
@@ -289,8 +289,8 @@ pdf_df.select("content").show()

### TextToPdf

`TextToPdf` renders OCR results to a PDF document as a text layout. Each symbol will render to the same position
with the same font size as in the original image or PDF.
If the dataframe contains multiple records for the same origin path, it groups images by the origin
column and creates a multipage PDF document.

@@ -1088,7 +1088,7 @@ data.select("tables").show()

### PptToPdf

`PptToPdf` converts PPT and PPTX documents to PDF documents.

##### Input Columns

@@ -1364,14 +1364,14 @@ data.select("image").show()

`GPUImageTransformer` allows running image pre-processing operations on the GPU.

It supports the following operations:
- Scaling
- Otsu thresholding
- Huang thresholding
- Erosion
- Dilation

`GPUImageTransformer` allows adding multiple operations. To add operations, you need to call
one of the methods with params:

{:.table-model-big}
@@ -1474,7 +1474,7 @@ display_images(result, "transformed_image")

### ImageBinarizer

`ImageBinarizer` transforms an image to a binary color schema, based on a threshold.
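
A minimal sketch of how this might be used (the `setThreshold` setter and the output column name are assumptions for illustration):

```python
from sparkocr.transformers import ImageBinarizer

# Sketch only: setter and column names are assumptions.
binarizer = ImageBinarizer() \
    .setInputCol("image") \
    .setOutputCol("binary_image") \
    .setThreshold(128)  # pixels above/below this value map to white/black
```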

##### Input Columns

@@ -1559,11 +1559,11 @@ data.show()
### ImageAdaptiveBinarizer

Supported Methods:
- OTSU. Returns a single intensity threshold that separates pixels into two classes, foreground and background.
- Gaussian local thresholding. Thresholds the image using a locally adaptive threshold that is computed
using a local square region centered on each pixel. The threshold is equal to the Gaussian weighted sum
of the surrounding pixels times the scale.
- Sauvola. A local thresholding technique that is useful for images where the background is not uniform.


#### Input Columns
@@ -2147,12 +2147,12 @@ data.select("path", "noiselevel").show()

**python only**

`ImageRemoveObjects` removes background objects.
It supports removing:
- objects smaller than font elements of _minSizeFont_ size
- objects smaller than _minSizeObject_
- holes smaller than _minSizeHole_
- objects larger than _maxSizeObject_

#### Input Columns

@@ -2505,7 +2505,7 @@ data.show()

### ImageSplitRegions

`ImageSplitRegions` splits an image into regions.

#### Input Columns

@@ -3468,7 +3468,7 @@ Next section describes the extra transformers

### PositionFinder

`PositionFinder` finds the position of input text entities in the original document.

#### Input Columns

@@ -3759,7 +3759,7 @@ results.show()
### FoundationOneReportParser

`FoundationOneReportParser` is a transformer for parsing FoundationOne reports.
The current implementation supports parsing patient info, genomic and biomarker findings, and gene lists
from the appendix.
The output format is JSON.
