Data Structures

EDS-PDF stores PDFs and their annotations in custom data structures that are designed to be easy to use and manipulate. We must distinguish between:

  • the data models used to store the PDFs and exchange them between the different components of EDS-PDF
  • the tensor structures used to process the PDFs with deep learning models

Itinerary of a PDF

A PDF is first converted to a PDFDoc object, which contains the raw PDF content. This task is usually performed by a PDF extractor component. Once the PDF is converted, the same object is used and updated by the different components, and returned at the end of the pipeline.

When running a trainable component, the PDFDoc is preprocessed and converted to tensors containing relevant features for the task. This is done in the preprocess method of the component. The resulting tensors are then collated together to form a batch, in the collate method of the component. After running the forward method of the component, the tensor predictions are finally assigned as annotations to the original PDFDoc objects in the postprocess method.
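The four steps above can be sketched with a toy component. This is a minimal illustration in plain Python and does not use the EDS-PDF classes: ToyDoc, ToyComponent, the method signatures and the upper-case heuristic are all assumptions made for the example, and lists stand in for tensors.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ToyDoc:
    """Hypothetical stand-in for PDFDoc: text lines plus predicted labels."""
    lines: List[str]
    labels: List[str] = field(default_factory=list)


class ToyComponent:
    """Illustrative only: labels a line 'title' when it is written in upper case."""

    def preprocess(self, doc: ToyDoc) -> Dict[str, List[float]]:
        # One feature per line: the share of upper-case letters.
        feats = []
        for line in doc.lines:
            letters = [c for c in line if c.isalpha()]
            upper = sum(c.isupper() for c in letters)
            feats.append(upper / max(len(letters), 1))
        return {"upper_ratio": feats}

    def collate(self, samples: List[Dict[str, List[float]]]) -> Dict[str, list]:
        # Pad every sample to the longest one and remember which cells are real.
        width = max(len(s["upper_ratio"]) for s in samples)
        padded, mask = [], []
        for s in samples:
            n = len(s["upper_ratio"])
            padded.append(s["upper_ratio"] + [0.0] * (width - n))
            mask.append([True] * n + [False] * (width - n))
        return {"upper_ratio": padded, "mask": mask}

    def forward(self, batch: Dict[str, list]) -> List[List[bool]]:
        # Stand-in for a neural network: a fixed threshold on the feature.
        return [[ratio > 0.9 for ratio in row] for row in batch["upper_ratio"]]

    def postprocess(self, docs, preds, batch):
        # Write predictions back onto the original documents, skipping padding.
        for doc, row, mask in zip(docs, preds, batch["mask"]):
            doc.labels = ["title" if p else "body" for p, m in zip(row, mask) if m]
        return docs


docs = [ToyDoc(["INTRODUCTION", "Some body text."]), ToyDoc(["Only one line"])]
component = ToyComponent()
batch = component.collate([component.preprocess(d) for d in docs])
docs = component.postprocess(docs, component.forward(batch), batch)
print([d.labels for d in docs])
```

Note that the same ToyDoc objects go in and come out: only their annotations change, which mirrors the way a PDFDoc is updated in place along the pipeline.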

Data models

The main data structure is the PDFDoc, which represents a full PDF document. It contains the raw PDF content and the annotations for the full document, regardless of pages. A PDF is split into Page objects that store their number, dimensions and optionally an image of the rendered page.

The PDF annotations are stored in Box objects, which represent a rectangular region of the PDF. At the moment, a Box can only be specialized into a TextBox to represent text regions, such as lines extracted by a PDF extractor. Aggregated texts are stored in Text objects, which are not associated with a specific box.

A TextBox contains a list of TextProperties objects that store the style properties of styled spans of the text.
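The relationships between these objects can be sketched with simplified stand-ins. The Mini* dataclasses below are hypothetical and are not the EDS-PDF classes: they only reuse the attribute names listed in the reference. The coordinate values are arbitrary, and the example assumes that begin and end are character offsets into the box text with an exclusive end.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class MiniTextProperties:
    begin: int  # assumed: character offset where the styled span starts
    end: int    # assumed: character offset where it stops (exclusive)
    bold: bool = False
    italic: bool = False
    fontname: Optional[str] = None


@dataclass
class MiniTextBox:
    page_num: int
    x0: float
    x1: float
    y0: float
    y1: float
    text: str
    props: List[MiniTextProperties] = field(default_factory=list)
    label: Optional[str] = None


@dataclass
class MiniPage:
    page_num: int
    width: float
    height: float


@dataclass
class MiniPDFDoc:
    content: bytes
    pages: List[MiniPage] = field(default_factory=list)
    content_boxes: List[MiniTextBox] = field(default_factory=list)


def boxes_on_page(doc: MiniPDFDoc, page_num: int) -> List[MiniTextBox]:
    # Boxes live on the document; a page view is a filter on page_num.
    return [box for box in doc.content_boxes if box.page_num == page_num]


doc = MiniPDFDoc(content=b"%PDF-", pages=[MiniPage(0, 612.0, 792.0)])
box = MiniTextBox(
    page_num=0, x0=0.1, x1=0.9, y0=0.05, y1=0.08,
    text="Discharge summary",
    props=[MiniTextProperties(begin=0, end=9, bold=True)],
)
doc.content_boxes.append(box)

# Recover the styled spans of the box from the offsets.
bold_spans = [box.text[p.begin:p.end] for p in box.props if p.bold]
print(bold_spans)
```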

Reference

PDFDoc

Bases: BaseModel

This is the main data structure of the library to hold PDFs. It contains the content of the PDF, as well as box annotations and text outputs.

ATTRIBUTE DESCRIPTION
content

The content of the PDF document.

TYPE: bytes

id

The ID of the PDF document.

TYPE: (str, optional)

pages

The pages of the PDF document.

TYPE: List[Page]

error

Whether there was an error when processing this PDF document.

TYPE: (bool, optional)

content_boxes

The content boxes/annotations of the PDF document.

TYPE: List[Union[TextBox, ImageBox]]

aggregated_texts

The aggregated text outputs of the PDF document.

TYPE: Dict[str, Text]

text_boxes

The text boxes of the PDF document.

TYPE: List[TextBox]

Page

Bases: BaseModel

The Page class represents a page of a PDF document.

ATTRIBUTE DESCRIPTION
page_num

The page number of the page.

TYPE: int

width

The width of the page.

TYPE: float

height

The height of the page.

TYPE: float

doc

The PDF document that this page belongs to.

TYPE: PDFDoc

image

The rendered image of the page, stored as a NumPy array.

TYPE: Optional[ndarray]

text_boxes

The text boxes of the page.

TYPE: List[TextBox]

TextProperties

Bases: BaseModel

The TextProperties class represents the style properties of a span of text in a TextBox.

ATTRIBUTE DESCRIPTION
italic

Whether the text is italic.

TYPE: bool

bold

Whether the text is bold.

TYPE: bool

begin

The beginning index of the span of text.

TYPE: int

end

The ending index of the span of text.

TYPE: int

fontname

The font name of the span of text.

TYPE: Optional[str]

Box

Bases: BaseModel

The Box class represents a box annotation in a PDF document. It is the base class of TextBox.

ATTRIBUTE DESCRIPTION
doc

The PDF document that this box belongs to.

TYPE: PDFDoc

page_num

The page number of the box.

TYPE: Optional[int]

x0

The left x-coordinate of the box.

TYPE: float

x1

The right x-coordinate of the box.

TYPE: float

y0

The top y-coordinate of the box.

TYPE: float

y1

The bottom y-coordinate of the box.

TYPE: float

label

The label of the box.

TYPE: Optional[str]

page

The page object that this box belongs to.

TYPE: Page

Text

Bases: BaseModel

The Text class represents a text object that is not bound to any box.

It can be used to store aggregated text from multiple boxes for example.

ATTRIBUTE DESCRIPTION
text

The text content.

TYPE: str

properties

The style properties of the text.

TYPE: List[TextProperties]

TextBox

Bases: Box

The TextBox class represents a text box annotation in a PDF document.

ATTRIBUTE DESCRIPTION
text

The text content of the text box.

TYPE: str

props

The style properties of the text box.

TYPE: List[TextProperties]

Tensor structure

The tensors used to process PDFs with deep learning models usually contain 4 main dimensions, in addition to the standard embedding dimensions:

  • samples: one entry per PDF in the batch
  • pages: one entry per page in a PDF
  • boxes: one entry per box in a page
  • tokens: one entry per token in a box (only for text boxes)

These tensors use a special FoldedTensor format to store the data in a compact way and reshape the data depending on the requirements of a layer.
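To see why a compact, reshapeable layout helps, here is a minimal sketch in plain Python. It does not use the FoldedTensor API: the flatten_boxes and pad helpers are hypothetical, and the token ids are arbitrary. The nested list follows the samples > pages > boxes > tokens hierarchy, and the sketch flattens it to one row per box while keeping the index needed to restore the hierarchy.

```python
from typing import List, Tuple

# Token ids nested as samples > pages > boxes > tokens (arbitrary values).
nested = [
    [                      # sample 0
        [[1, 2, 3], [4]],  # page 0: two boxes
        [[5, 6]],          # page 1: one box
    ],
    [                      # sample 1
        [[7]],             # page 0: one box
    ],
]


def flatten_boxes(nested) -> Tuple[List[List[int]], List[Tuple[int, int, int]]]:
    """List every box once, with the (sample, page, box) index it came from."""
    boxes, index = [], []
    for s, sample in enumerate(nested):
        for p, page in enumerate(sample):
            for b, box in enumerate(page):
                boxes.append(box)
                index.append((s, p, b))
    return boxes, index


def pad(boxes: List[List[int]], pad_value: int = 0):
    """Pad the token dimension only, and return a mask of the real cells."""
    width = max(len(box) for box in boxes)
    data = [box + [pad_value] * (width - len(box)) for box in boxes]
    mask = [[True] * len(box) + [False] * (width - len(box)) for box in boxes]
    return data, mask


boxes, index = flatten_boxes(nested)
data, mask = pad(boxes)
print(data)   # one row per box, one column per token
print(index)  # where each row sits in the samples/pages/boxes hierarchy
```

In this example the flattened layout holds 4 boxes of 3 tokens, that is 12 cells, whereas padding every dimension would need 2 samples, 2 pages, 2 boxes and 3 tokens, that is 24 cells. A layer that works on boxes can consume the flat rows directly, and the index allows the result to be regrouped by page or by sample afterwards.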