
Bag-of-Tagged-Words metric

The `ie-eval botw` command computes the bag-of-tagged-words recognition and error rates, globally and for each semantic category.

Metric description

Recognition rate (Precision, Recall, F1)

The Bag-of-Tagged-Words (BoTW) recognition rate checks whether predicted words appear in the ground truth and whether ground truth words appear in the prediction, regardless of their position. Note that words tagged as other (O in the IOB2 notation) are ignored.

  • The number of True Positives (TP) is the number of words that appear in both the label and the prediction.
  • The number of False Positives (FP) is the number of words that appear in the prediction, but not in the label.
  • The number of False Negatives (FN) is the number of words that appear in the label, but not in the prediction.

From these counts, the Precision, Recall and F1-score can be computed (see the sketch after this list):

  • The Precision (P) is the fraction of predicted words that also appear in the ground truth. It is defined by \(\frac{TP}{TP + FP}\).
  • The Recall (R) is the fraction of ground truth words that are predicted by the automatic model. It is defined by \(\frac{TP}{TP + FN}\).
  • The F1-score is the harmonic mean of the Precision and Recall. It is defined by \(\frac{2 \times P \times R}{P + R}\).
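To make the counting concrete, here is a minimal sketch of the recognition-rate computation in Python. It assumes tagged words are represented as (tag, word) tuples and counted as multisets via collections.Counter; the botw_scores helper is hypothetical, not the actual ie-eval implementation.

```python
# Minimal sketch: BoTW precision/recall/F1 over (tag, word) tuples.
# Assumption: bags are multisets, so repeated tagged words count once per occurrence.
from collections import Counter


def botw_scores(label, prediction):
    """Return (precision, recall, f1) for bags of (tag, word) tuples."""
    label_bag = Counter(label)      # ground-truth tagged words
    pred_bag = Counter(prediction)  # predicted tagged words

    # TP: tagged words present in both bags (multiset intersection)
    tp = sum((label_bag & pred_bag).values())
    fp = sum(pred_bag.values()) - tp   # predicted, but absent from the label
    fn = sum(label_bag.values()) - tp  # labelled, but absent from the prediction

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


label = [("person", "Georges"), ("person", "Washington"), ("date", "1732")]
prediction = [("person", "Georgs"), ("person", "Washington")]
print(botw_scores(label, prediction))  # (0.5, 0.333..., 0.4): only "Washington" matches
```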

Error rates (bWER)

The Bag-of-Tagged-Words (BoTW) error rate is derived from the bag-of-words WER (bWER) metric proposed by Vidal et al. in End-to-End Page-Level Assessment of Handwritten Text Recognition. Tagged words are defined as a combination of a word and its semantic tag. For example:

  • Label: [("person", "Georges"), ("person", "Washington"), ("date", 1732)]
  • Prediction: [("person", "Georgs"), ("person", "Washington")]

From the ground truth and predicted tagged words, we count the number of errors and compute the error rate, as illustrated in the sketch after this list:

  • The number of insertions and deletions (\(N_{ID}\)) is the absolute difference between the number of ground truth tagged words and predicted tagged words. In this case, ("date", "1732") counts as a deletion, so \(N_{ID} = 1\).
  • The number of substitutions (\(N_S\)) is defined as \((N_{SID} - N_{ID}) / 2\), where \(N_{SID}\) is the total number of tagged words that appear in only one of the two bags. Here \(N_{SID} = 3\): ("person", "Georges") and ("date", "1732") from the label, and ("person", "Georgs") from the prediction. In this case, ("person", "Georgs") counts as a substitution of ("person", "Georges"), so \(N_S = (3 - 1) / 2 = 1\).
  • The error rate (\(BoTW_{WER}\)) is then defined as \((N_{ID} + N_S) / |G|\), where \(|G|\) is the number of ground truth tagged words. In this example, \(BoTW_{WER} = 2 / 3 \approx 0.67\).
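Under the same assumptions as above (tagged words as (tag, word) tuples in a Counter multiset), the error-rate computation can be sketched as follows; botw_wer is again a hypothetical helper, not the ie-eval code.

```python
# Minimal sketch: BoTW error rate (bWER) over (tag, word) tuples.
from collections import Counter


def botw_wer(label, prediction):
    """Return the bag-of-tagged-words error rate."""
    label_bag = Counter(label)
    pred_bag = Counter(prediction)

    # N_ID: size difference between the bags (pure insertions/deletions)
    n_id = abs(sum(label_bag.values()) - sum(pred_bag.values()))
    # N_SID: tagged words that appear in only one of the two bags
    n_sid = sum((label_bag - pred_bag).values()) + sum((pred_bag - label_bag).values())
    # N_S: the remaining mismatches pair up into substitutions
    n_s = (n_sid - n_id) // 2
    return (n_id + n_s) / sum(label_bag.values())


label = [("person", "Georges"), ("person", "Washington"), ("date", "1732")]
prediction = [("person", "Georgs"), ("person", "Washington")]
print(f"{botw_wer(label, prediction):.2f}")  # 0.67, as in the example above
```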

Parameters

Here are the available parameters for this metric:

| Parameter          | Description                                             | Type           | Default |
|:-------------------|:--------------------------------------------------------|:---------------|:--------|
| `--label-dir`      | Path to the directory containing BIO label files.       | `pathlib.Path` |         |
| `--prediction-dir` | Path to the directory containing BIO prediction files.  | `pathlib.Path` |         |
| `--by-category`    | Whether to display the metric for each category.        | `bool`         | `False` |

The parameters are also described when running `ie-eval botw --help`.

Examples

Global evaluation

Use the following command to compute the overall BoTW metrics:

```shell
ie-eval botw --label-dir Simara/labels/ \
             --prediction-dir Simara/predictions/
```

It will output the results in Markdown format:

```
2024-01-24 12:25:37,866 INFO/bio_parser.utils: Loading labels...
2024-01-24 12:25:37,996 INFO/bio_parser.utils: Loading prediction...
2024-01-24 12:25:38,082 INFO/bio_parser.utils: The dataset is complete and valid.
| Category | bWER (%) | Precision (%) | Recall (%) | F1 (%) | N words | N documents |
|:---------|:--------:|:-------------:|:----------:|:------:|:-------:|:-----------:|
| total    |  17.40   |     84.82     |   83.70    | 84.26  |  17894  |     804     |
```

Evaluation for each category

Use the following command to compute the BoTW metrics for each semantic category:

```shell
ie-eval botw --label-dir Simara/labels/ \
             --prediction-dir Simara/predictions/ \
             --by-category
```

It will output the results in Markdown format:

```
2024-01-24 12:25:27,019 INFO/bio_parser.utils: Loading labels...
2024-01-24 12:25:27,148 INFO/bio_parser.utils: Loading prediction...
2024-01-24 12:25:27,232 INFO/bio_parser.utils: The dataset is complete and valid.
| Category            | bWER (%) | Precision (%) | Recall (%) | F1 (%) | N words | N documents |
|:--------------------|:--------:|:-------------:|:----------:|:------:|:-------:|:-----------:|
| total               |  17.40   |     84.82     |   83.70    | 84.26  |  17894  |     804     |
| precisions_sur_cote |  14.39   |     90.48     |   87.70    | 89.07  |   813   |     675     |
| intitule            |  20.73   |     82.18     |   81.15    | 81.66  |   8173  |     804     |
| cote_article        |   4.28   |     95.94     |   97.64    | 96.78  |   678   |     676     |
| cote_serie          |   3.25   |     97.21     |   97.78    | 97.49  |   676   |     676     |
| date                |   2.67   |     97.61     |   97.44    | 97.52  |   1799  |     751     |
| analyse_compl       |  22.92   |     81.50     |   78.97    | 80.22  |   5602  |     771     |
| classement          |  13.73   |     86.36     |   86.93    | 86.64  |   153   |      77     |
```