Overview

The "ICDAR2021 Competition on Mathematical Formula Detection" on the IBEM Dataset competition is organized in the framework of the ICDAR 2021 competitions by the Pattern Recognition and Human Language Technologies research centre.

Searching massive collections of digitized printed scientific documents with queries that are mathematical expressions is a scarcely explored research area. A crucial first step towards addressing this problem is the detection of the regions that may contain mathematical expressions. This contest aims to tackle that step, and several reasons make it interesting for research groups to participate:
  1. Groups researching Mathematical Expression Recognition need, at some point, to address the problem of automatic detection of mathematical expressions in a document;
  2. Participants in this contest will have access to a large labeled dataset;
  3. The method used to obtain labeled data for the IBEM corpus is scalable, so the collection is expected to grow in the future, and this new data could be used in future editions of this contest.
The proposed IBEM dataset consists of a subset of documents from the KDD Cup 2003 competition dataset. It has been automatically generated by processing the LaTeX version of the STEM papers available in the KDD Cup dataset. The extracted ground truth contains information about the position of the mathematical expressions at page level and their LaTeX definition. Given that the LaTeX definitions of these mathematical formulas play no role in the detection process, we will provide a simplified version of the IBEM dataset with only the relevant information.

The mathematical formulas that appear in STEM documents, and therefore in the IBEM dataset, are either embedded in the text lines or isolated. For each mathematical expression, either embedded or isolated, the exact position of its minimum bounding box will be provided.

The IBEM dataset consists of 600 documents with a total of 8 273 pages containing 58 834 isolated and 260 323 embedded expressions. As can be noted, each document has on average approximately 14 pages, 98 isolated mathematical expressions, and 434 embedded mathematical expressions. Each document has been broken down into individual pages with the corresponding ground truth.

Figure 1. An example of the ground truth (right) projected onto the second page of document 0001015 (left). The coordinates and dimensions of the bounding boxes are proportional to the dimensions of the page.

Description and goals

The IBEM dataset has been divided into two sets in order to allow different types of experiments to be performed. A first set of documents will be provided in the training phase, and a second set of documents will be provided in the evaluation phase.

First, the 600 documents contained in the dataset have been shuffled at document level. The set of documents prepared for the training phase has been created by dividing the first 500 documents as follows:
  1. 300 documents have been used to create a dataset for training, which we will refer to as Tr00;
  2. the next 50 documents have been set apart for validation (Va00);
  3. the next 50 documents have been set apart for test (Ts00);
Then, the remaining 100 documents were shuffled at page level.
  1. 50% (760 pages) of these images are used for training (Tr01);
  2. 25% (380 pages) of these images are used for validation (Va01);
  3. 25% (380 pages) of these images are used for test (Ts01);
The set of 100 documents reserved for the evaluation phase has been divided as follows:
  1. the first 50 documents are used for testing (Ts10);
Then, the remaining 50 documents were shuffled at page level.
  1. 50% (329 pages) of these images are used for training (Tr10);
  2. 50% (329 pages) of these images are used for testing (Ts11);
Note that Ts10 is used for performing a task-independent evaluation, while Ts11 is used for performing a task-dependent evaluation, since the remaining pages of the Ts11 documents are available for training in Tr10. A Python sketch of the full partition is given below.
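For reference, the following sketch reproduces the partition just described. It is illustrative only: the document identifiers, the shuffling, and the pages_of() helper are assumptions made for the example, and only the split sizes and proportions follow the description above.

    import random

    def pages_of(doc_id):
        # Hypothetical helper: in practice this would list the page images
        # of a document; here it fakes the ~14 pages of an average document.
        return [f"{doc_id}_page{n:02d}" for n in range(14)]

    documents = [f"{i:07d}" for i in range(600)]  # hypothetical document ids
    random.shuffle(documents)                     # shuffled at document level

    # Training phase: the first 500 documents.
    tr00 = documents[:300]                        # training (Tr00)
    va00 = documents[300:350]                     # validation (Va00)
    ts00 = documents[350:400]                     # test (Ts00)
    pages = [p for d in documents[400:500] for p in pages_of(d)]
    random.shuffle(pages)                         # shuffled at page level
    n = len(pages)
    tr01 = pages[:n // 2]                         # 50% of pages (Tr01)
    va01 = pages[n // 2:3 * n // 4]               # 25% of pages (Va01)
    ts01 = pages[3 * n // 4:]                     # 25% of pages (Ts01)

    # Evaluation phase: the remaining 100 documents.
    ts10 = documents[500:550]                     # task-independent test (Ts10)
    eval_pages = [p for d in documents[550:600] for p in pages_of(d)]
    random.shuffle(eval_pages)                    # shuffled at page level
    m = len(eval_pages)
    tr10 = eval_pages[:m // 2]                    # 50% for training (Tr10)
    ts11 = eval_pages[m // 2:]                    # 50% task-dependent test (Ts11)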

The available data for the Tr00, Va00, Ts00, Tr01, Va01, Ts01, and Tr10 sets will consist of:
  1. The original images of all the training pages.
  2. A txt file per training page, containing the corresponding ground truth. The bounding boxes of the mathematical formulas are organized by type of expression and sorted in ascending order by their y coordinate, using a reference system in which the (0, 0) coordinate point is at the upper left corner of the page. This ground truth has been checked and corrected manually (a parsing sketch is given after this list).
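As an illustration, such a ground-truth file could be read as sketched below. The assumed line format, one expression per line with a type label followed by four bounding-box values relative to the page dimensions (see Figure 1), is a hypothetical example, not the official specification distributed with the data.

    def read_ground_truth(path, page_width, page_height):
        # Hypothetical parser: assumes lines of the form "type x y w h",
        # with x, y, w, h relative to the page size and the (0, 0) origin
        # at the upper left corner of the page.
        boxes = []
        with open(path) as f:
            for line in f:
                fields = line.split()
                if len(fields) != 5:
                    continue  # skip empty or unexpected lines
                kind = fields[0]  # e.g. an isolated or embedded expression
                x, y, w, h = map(float, fields[1:])
                # Convert relative coordinates to pixels of the page image.
                boxes.append((kind, x * page_width, y * page_height,
                              w * page_width, h * page_height))
        return boxes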
The available data provided for the Ts10 and Ts11 test sets will consist of:
  1. The original images of all the test pages.
The goal of the competition is to obtain the best mathematical expression detection rate on the Ts10 and Ts11 datasets.

Test datasets

As explained previously, the IBEM dataset has been divided into two sets: document-level and page-level data. Each such set has a corresponding test set. One week before the deadline of the competition, the Tr10, Ts10, and Ts11 sets will be made available to the participants.

The test sets will not have the associated ground truth available, and they will be merged with several thousand page images for which there is no ground truth either. The participants will not be able to distinguish the actual test sets from the other page images, which prevents them from tuning their systems to the test data. The dataset, both training and test, will be made freely available, with the corresponding ground truth, once the competition is finished.

Evaluation modalities

Evaluation will be performed using Intersection-over-Union (IoU), and systems will be ranked based on their F-measure after matching output formula boxes to ground truth formula regions.
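As an orientation, the sketch below computes the IoU of two axis-aligned boxes and derives precision, recall, and F-measure from a greedy matching. The (x, y, w, h) box representation and the 0.5 matching threshold are assumptions made for the example; the official evaluation tool (see the Tools section) may define matching differently.

    def iou(a, b):
        # Intersection over Union of two (x, y, w, h) boxes.
        ix = max(0.0, min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0]))
        iy = max(0.0, min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1]))
        inter = ix * iy
        union = a[2] * a[3] + b[2] * b[3] - inter
        return inter / union if union > 0 else 0.0

    def f_measure(detections, ground_truth, threshold=0.5):
        # Greedily match each detected box to the best unmatched
        # ground-truth box whose IoU is at or above the threshold.
        matched, tp = set(), 0
        for det in detections:
            best, best_iou = None, threshold
            for i, gt in enumerate(ground_truth):
                score = iou(det, gt)
                if i not in matched and score >= best_iou:
                    best, best_iou = i, score
            if best is not None:
                matched.add(best)
                tp += 1
        precision = tp / len(detections) if detections else 0.0
        recall = tp / len(ground_truth) if ground_truth else 0.0
        if precision + recall == 0.0:
            return 0.0
        return 2 * precision * recall / (precision + recall)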

The participants will not receive any feedback about their results on the test sets, since providing evaluation results while the competition is open could help participants fit their systems to the test data. Several submissions per participant will be allowed, and results for several systems per participant will also be allowed. A participant who submits results for several systems has to describe the differences between the systems; the differences have to be substantial in order to guarantee that the systems are really different.
In any case, only the last submitted results will be considered for ranking the participants.

Registration and access to data

To register in this contest, send an e-mail to jandreu_AT_prhlt_DOT_upv_DOT_es with the subject "ICDAR 2021 MFD competition registration", providing the requested registration data in the message. A username and password will be given to each registered participant, which will grant access to the data and the evaluation page.

Data now available to registered participants!

Test set available! (24/03/2021)


Tools

An evaluation tool will be provided shortly.

Schedule

The schedule will be the following:

Organizers