Data Structures

EDS-PDF stores PDFs and their annotations in custom data structures that are designed to be easy to use and manipulate. We must distinguish between:

the data models used to store the PDFs and exchange them between the different components of EDS-PDF
the tensor structures used to process the PDFs with deep learning models

Itinerary of a PDF

A PDF is first converted to a PDFDoc object, which contains the raw PDF content. This task is usually performed by a PDF extractor component. Once the PDF is converted, the same object is used and updated by the different components, and returned at the end of the pipeline.

When running a trainable component, the PDFDoc is preprocessed and converted to tensors containing relevant features for the task. This is done in the preprocess method of the component. The resulting tensors are then collated together to form a batch, in the collate method of the component. After running the forward method of the component, the tensor predictions are finally assigned as annotations to the original PDFDoc objects in the postprocess method.
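The four steps above can be sketched with a toy component. This is a minimal illustration, not the EDS-PDF implementation: `ToyDoc` and `ToyLineClassifier` are hypothetical names, plain Python lists stand in for tensors, and the "model" is a fixed length threshold. Only the method names `preprocess`, `collate`, `forward` and `postprocess` come from the text.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ToyDoc:
    # Hypothetical stand-in for PDFDoc: line texts plus predicted labels.
    lines: List[str]
    labels: List[str] = field(default_factory=list)


class ToyLineClassifier:
    # Hypothetical component; only the four method names come from the text.

    def preprocess(self, doc: ToyDoc) -> Dict[str, List[int]]:
        # One document -> features (here, the length of each line).
        return {"lengths": [len(line) for line in doc.lines]}

    def collate(self, samples: List[Dict[str, List[int]]]) -> Dict[str, List[List[int]]]:
        # Pad every sample to the same number of lines to form a batch.
        width = max(len(s["lengths"]) for s in samples)
        return {
            "lengths": [
                s["lengths"] + [0] * (width - len(s["lengths"])) for s in samples
            ]
        }

    def forward(self, batch: Dict[str, List[List[int]]]) -> List[List[str]]:
        # Placeholder "model": long lines are body text, short ones are titles.
        return [["body" if n > 20 else "title" for n in row] for row in batch["lengths"]]

    def postprocess(self, docs: List[ToyDoc], predictions: List[List[str]]) -> List[ToyDoc]:
        # Write predictions back onto the original documents, dropping padding.
        for doc, row in zip(docs, predictions):
            doc.labels = row[: len(doc.lines)]
        return docs


docs = [
    ToyDoc(["Report", "This line is long enough to count as body text."]),
    ToyDoc(["Note"]),
]
component = ToyLineClassifier()
batch = component.collate([component.preprocess(d) for d in docs])
docs = component.postprocess(docs, component.forward(batch))
```

Note that the same document objects go in and come out: the predictions only add annotations to them, which mirrors the itinerary described above.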

Data models

The main data structure is the PDFDoc, which represents a full PDF document. It contains the raw PDF content and the annotations for the full document, regardless of pages. A PDF is split into Page objects that store their number, dimensions and, optionally, an image of the rendered page.

The PDF annotations are stored in Box objects, which represent a rectangular region of the PDF. At the moment, a Box can only be specialized into a TextBox, which represents a text region such as a line extracted by a PDF extractor. Aggregated texts are stored in Text objects, which are not associated with a specific box.

A TextBox contains a list of TextProperties objects that store the style properties of styled spans of its text.
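To make the relationships between these objects concrete, here is a minimal sketch that uses plain dataclasses as stand-ins. The attribute names follow the reference below, but the classes are simplified assumptions: the real models derive from BaseModel, carry a back-reference to their document, and expose more attributes. The `boxes_on_page` helper and the coordinate values are purely illustrative.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class TextProperties:
    italic: bool
    bold: bool
    begin: int
    end: int
    fontname: Optional[str] = None


@dataclass
class TextBox:
    page_num: int
    x0: float
    x1: float
    y0: float
    y1: float
    text: str
    props: List[TextProperties] = field(default_factory=list)


@dataclass
class Page:
    page_num: int
    width: float
    height: float


@dataclass
class PDFDoc:
    content: bytes
    pages: List[Page] = field(default_factory=list)
    content_boxes: List[TextBox] = field(default_factory=list)

    def boxes_on_page(self, page_num: int) -> List[TextBox]:
        # Hypothetical helper: boxes are stored on the document and
        # refer to their page by number.
        return [box for box in self.content_boxes if box.page_num == page_num]


doc = PDFDoc(
    content=b"%PDF-",  # placeholder bytes, not a real PDF
    pages=[Page(0, 595.0, 842.0), Page(1, 595.0, 842.0)],
    content_boxes=[
        TextBox(
            0, 50.0, 300.0, 40.0, 60.0, "Discharge summary",
            props=[TextProperties(italic=False, bold=True, begin=0, end=17)],
        ),
        TextBox(1, 50.0, 300.0, 40.0, 60.0, "Page two text"),
    ],
)
first_page_boxes = doc.boxes_on_page(0)
```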

Reference

`PDFDoc`

Bases: BaseModel

This is the main data structure of the library to hold PDFs. It contains the content of the PDF, as well as box annotations and text outputs.

ATTRIBUTE	DESCRIPTION
`content`	The content of the PDF document. TYPE: `bytes`
`id`	The ID of the PDF document. TYPE: `(str, optional)`
`pages`	The pages of the PDF document. TYPE: `List[Page]`
`error`	Whether there was an error when processing this PDF document. TYPE: `(bool, optional)`
`content_boxes`	The content boxes/annotations of the PDF document. TYPE: `List[Union[TextBox, ImageBox]]`
`aggregated_texts`	The aggregated text outputs of the PDF document. TYPE: `Dict[str, Text]`
`text_boxes`	The text boxes of the PDF document. TYPE: `List[TextBox]`

`Page`

Bases: BaseModel

The Page class represents a page of a PDF document.

ATTRIBUTE	DESCRIPTION
`page_num`	The page number of the page. TYPE: `int`
`width`	The width of the page. TYPE: `float`
`height`	The height of the page. TYPE: `float`
`doc`	The PDF document that this page belongs to. TYPE: `PDFDoc`
`image`	The rendered image of the page, stored as a NumPy array. TYPE: `Optional[ndarray]`
`text_boxes`	The text boxes of the page. TYPE: `List[TextBox]`

`TextProperties`

Bases: BaseModel

The TextProperties class represents the style properties of a span of text in a TextBox.

ATTRIBUTE	DESCRIPTION
`italic`	Whether the text is italic. TYPE: `bool`
`bold`	Whether the text is bold. TYPE: `bool`
`begin`	The beginning index of the span of text. TYPE: `int`
`end`	The ending index of the span of text. TYPE: `int`
`fontname`	The font name of the span of text. TYPE: `Optional[str]`
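As an illustration of how `begin` and `end` can be used, the sketch below slices a text into its styled spans. It assumes that the indices are character offsets into the text and that `end` is exclusive, as in Python slices; the source does not state this convention. `SpanStyle` and `styled_spans` are hypothetical names, not part of the library.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class SpanStyle:
    # Hypothetical stand-in with the same fields as TextProperties.
    italic: bool
    bold: bool
    begin: int
    end: int
    fontname: Optional[str] = None


def styled_spans(text: str, props: List[SpanStyle]) -> List[Tuple[str, SpanStyle]]:
    # Assumption: begin/end are character offsets, with end exclusive.
    return [(text[p.begin:p.end], p) for p in props]


text = "Diagnosis: pneumonia"
props = [
    SpanStyle(italic=False, bold=True, begin=0, end=10),
    SpanStyle(italic=True, bold=False, begin=11, end=20),
]
bold_parts = [chunk for chunk, style in styled_spans(text, props) if style.bold]
```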

`Box`

Bases: BaseModel

The Box class represents a box annotation in a PDF document. It is the base class of TextBox.

ATTRIBUTE	DESCRIPTION
`doc`	The PDF document that this box belongs to. TYPE: `PDFDoc`
`page_num`	The page number of the box. TYPE: `Optional[int]`
`x0`	The left x-coordinate of the box. TYPE: `float`
`x1`	The right x-coordinate of the box. TYPE: `float`
`y0`	The top y-coordinate of the box. TYPE: `float`
`y1`	The bottom y-coordinate of the box. TYPE: `float`
`label`	The label of the box. TYPE: `Optional[str]`
`page`	The page object that this box belongs to. TYPE: `Page`
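The coordinates make simple geometric operations straightforward. The sketch below sorts boxes in a naive reading order and computes their area. It assumes that `y0` is the top edge and that y values grow downward, as the attribute descriptions above suggest; `SimpleBox`, `reading_order` and `area` are hypothetical names used only for this example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SimpleBox:
    # Hypothetical stand-in for Box, without the doc and page references.
    page_num: int
    x0: float
    x1: float
    y0: float
    y1: float
    label: str = ""


def reading_order(boxes: List[SimpleBox]) -> List[SimpleBox]:
    # Page first, then top to bottom, then left to right.
    return sorted(boxes, key=lambda b: (b.page_num, b.y0, b.x0))


def area(box: SimpleBox) -> float:
    return (box.x1 - box.x0) * (box.y1 - box.y0)


boxes = [
    SimpleBox(1, 10.0, 100.0, 20.0, 30.0, "page 2 header"),
    SimpleBox(0, 10.0, 100.0, 50.0, 60.0, "body"),
    SimpleBox(0, 10.0, 100.0, 20.0, 30.0, "title"),
]
ordered = [b.label for b in reading_order(boxes)]
```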

`Text`

Bases: BaseModel

The Text class represents a text object that is not bound to any box.

It can be used to store aggregated text from multiple boxes for example.

ATTRIBUTE	DESCRIPTION
`text`	The text content. TYPE: `str`
`properties`	The style properties of the text. TYPE: `List[TextProperties]`

`TextBox`

Bases: Box

The TextBox class represents a text box annotation in a PDF document.

ATTRIBUTE	DESCRIPTION
`text`	The text content of the text box. TYPE: `str`
`props`	The style properties of the text box. TYPE: `List[TextProperties]`

Tensor structure

The tensors used to process PDFs with deep learning models usually contain 4 main dimensions, in addition to the standard embedding dimensions:

samples: one entry per PDF in the batch
pages: one entry per page in a PDF
boxes: one entry per box in a page
tokens: one entry per token in a box (only for text boxes)

These tensors use a special FoldedTensor format to store the data in a compact way and reshape the data depending on the requirements of a layer.
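The idea of reshaping the same data for different layers can be illustrated without any tensor library. The sketch below is not the FoldedTensor API: it uses nested Python lists, and `flatten_levels` and `pad` are hypothetical helpers. It shows how one nested structure (samples, pages, boxes, tokens) can be viewed either as one padded row per box or as one padded row per sample.

```python
from typing import Any, List


def flatten_levels(data: List[Any], levels: int) -> List[Any]:
    # Merge the `levels` outermost nesting levels into the next one.
    for _ in range(levels):
        data = [item for group in data for item in group]
    return data


def pad(rows: List[List[int]], fill: int = 0) -> List[List[int]]:
    # Right-pad every row to the length of the longest one.
    width = max((len(row) for row in rows), default=0)
    return [row + [fill] * (width - len(row)) for row in rows]


# samples -> pages -> boxes -> token ids
batch = [
    [[[1, 2, 3], [4]], [[5, 6]]],  # sample 0: two pages
    [[[7]]],                       # sample 1: one page
]

# View 1: one row per box, e.g. for a layer that encodes each box separately.
per_box = pad(flatten_levels(batch, 2))

# View 2: one row per sample, e.g. for a layer that attends over a whole document.
per_sample = pad([flatten_levels(sample, 2) for sample in batch])
```

Here `per_box` has four rows (one per box) and `per_sample` has two rows (one per PDF). A dedicated format can switch between such views without rebuilding the nested lists, which is the motivation given above.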