Data Structures
EDS-PDF stores PDFs and their annotations in custom data structures that are designed to be easy to use and manipulate. We must distinguish between:
- the data models used to store the PDFs and exchange them between the different components of EDS-PDF
- the tensor structures used to process the PDFs with deep learning models
Itinerary of a PDF
A PDF is first converted to a PDFDoc object, which contains the raw PDF content. This task is usually performed by a PDF extractor component. Once the PDF is converted, the same object is used and updated by the different components, and returned at the end of the pipeline.
When running a trainable component, the PDFDoc is preprocessed and converted to tensors containing relevant features for the task. This step is performed in the preprocess method of the component. The resulting tensors are then collated together to form a batch, in the collate method of the component. After running the forward method of the component, the tensor predictions are finally assigned as annotations to the original PDFDoc objects in the postprocess method.
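The itinerary above can be sketched with a toy component. Everything in this example is hypothetical: `ToyDoc` and `ToyLineLengthClassifier` are illustrative names, plain Python lists stand in for real tensors, and the method signatures are assumptions rather than the actual EDS-PDF ones. The sketch only mirrors the order of the preprocess, collate, forward and postprocess steps.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ToyDoc:
    """Hypothetical stand-in for PDFDoc: text lines plus predicted labels."""
    lines: List[str]
    labels: List[str] = field(default_factory=list)


class ToyLineLengthClassifier:
    """Hypothetical component that follows the four-step itinerary."""

    def preprocess(self, doc: ToyDoc) -> Dict[str, List[int]]:
        # Convert one document into features (here: one length per line).
        return {"lengths": [len(line) for line in doc.lines]}

    def collate(self, samples: List[Dict[str, List[int]]]) -> Dict[str, list]:
        # Pad every sample to the same number of lines to form a batch,
        # and keep a mask telling real entries from padding.
        max_lines = max(len(s["lengths"]) for s in samples)
        padded, mask = [], []
        for s in samples:
            missing = max_lines - len(s["lengths"])
            padded.append(s["lengths"] + [0] * missing)
            mask.append([True] * len(s["lengths"]) + [False] * missing)
        return {"lengths": padded, "mask": mask}

    def forward(self, batch: Dict[str, list]) -> List[List[str]]:
        # Stand-in for a neural network: threshold the feature.
        return [["body" if n > 20 else "title" for n in row]
                for row in batch["lengths"]]

    def postprocess(self, docs, predictions, batch):
        # Assign predictions back to the original documents, skipping padding.
        for doc, preds, mask in zip(docs, predictions, batch["mask"]):
            doc.labels = [p for p, keep in zip(preds, mask) if keep]
        return docs

    def __call__(self, docs: List[ToyDoc]) -> List[ToyDoc]:
        batch = self.collate([self.preprocess(d) for d in docs])
        return self.postprocess(docs, self.forward(batch), batch)


docs = [
    ToyDoc(lines=["Report", "This line is long enough to count as body text."]),
    ToyDoc(lines=["Summary"]),
]
docs = ToyLineLengthClassifier()(docs)
```

Note that the same `ToyDoc` objects go in and come out: the component only adds annotations to them, which is the behaviour described for PDFDoc above.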
Data models
The main data structure is the PDFDoc, which represents a full PDF document. It contains the raw PDF content and the annotations for the full document, regardless of pages. A PDF is split into Page objects that store their number, dimensions and optionally an image of the rendered page.
The PDF annotations are stored in Box objects, which represent a rectangular region of the PDF. At the moment, a Box can only be specialized into a TextBox to represent text regions, such as lines extracted by a PDF extractor. Aggregated texts are stored in Text objects, which are not associated with a specific box.
A TextBox contains a list of TextProperties objects to store the style properties of the styled spans of its text.
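The relationships between these models can be illustrated with a small sketch. It uses standard-library dataclasses as stand-ins, whereas the real classes derive from BaseModel. The `Sketch*` names are hypothetical, the field types are guesses based on the attribute descriptions, and the assumption that `page_num` is a 0-based index into `doc.pages` is ours, not something the reference states.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class SketchProps:
    """Stand-in for TextProperties: style of one span of the text."""
    begin: int
    end: int
    bold: bool = False
    italic: bool = False
    fontname: Optional[str] = None


@dataclass
class SketchPage:
    """Stand-in for Page."""
    page_num: int
    width: float
    height: float


@dataclass
class SketchDoc:
    """Stand-in for PDFDoc: raw content, pages and box annotations."""
    id: str
    content: bytes
    pages: List[SketchPage] = field(default_factory=list)
    content_boxes: List["SketchTextBox"] = field(default_factory=list)


@dataclass
class SketchTextBox:
    """Stand-in for TextBox: a rectangular text region on one page."""
    doc: SketchDoc = field(repr=False)  # back-reference, hidden from repr
    page_num: int
    x0: float
    x1: float
    y0: float
    y1: float
    text: str
    label: Optional[str] = None
    props: List[SketchProps] = field(default_factory=list)

    @property
    def page(self) -> SketchPage:
        # Assumption: page_num indexes directly into doc.pages (0-based).
        return self.doc.pages[self.page_num]


# Arbitrary illustrative values.
doc = SketchDoc(id="doc-1", content=b"%PDF-1.7", pages=[SketchPage(0, 595.0, 842.0)])
box = SketchTextBox(
    doc=doc, page_num=0, x0=50.0, x1=300.0, y0=40.0, y1=60.0,
    text="Hello world", props=[SketchProps(begin=0, end=5, bold=True)],
)
doc.content_boxes.append(box)
```

The box keeps a reference to its document and resolves its page from it, so annotations stay attached to the document rather than to the page objects.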
Reference
PDFDoc
Bases: BaseModel
This is the main data structure of the library to hold PDFs. It contains the content of the PDF, as well as box annotations and text outputs.
ATTRIBUTE | DESCRIPTION |
---|---|
content | The content of the PDF document. |
id | The ID of the PDF document. |
pages | The pages of the PDF document. |
error | Whether there was an error when processing this PDF document. |
content_boxes | The content boxes/annotations of the PDF document. |
aggregated_texts | The aggregated text outputs of the PDF document. |
text_boxes | The text boxes of the PDF document. |
Page
Bases: BaseModel
The Page class represents a page of a PDF document.
ATTRIBUTE | DESCRIPTION |
---|---|
page_num | The page number of the page. |
width | The width of the page. |
height | The height of the page. |
doc | The PDF document that this page belongs to. |
image | The rendered image of the page, stored as a NumPy array. |
text_boxes | The text boxes of the page. |
TextProperties
Bases: BaseModel
The TextProperties class represents the style properties of a span of text in a TextBox.
ATTRIBUTE | DESCRIPTION |
---|---|
italic | Whether the text is italic. |
bold | Whether the text is bold. |
begin | The beginning index of the span of text. |
end | The ending index of the span of text. |
fontname | The font name of the span of text. |
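As an illustration of how span indices relate to the text, the following sketch extracts the styled substrings of a text. It assumes that `begin` and `end` are character offsets into the text and that `end` is exclusive, as in Python slicing; the reference does not state this. Plain dictionaries stand in for TextProperties objects, and `styled_spans` is a hypothetical helper, not a library function.

```python
from typing import Dict, List, Tuple


def styled_spans(text: str, props: List[Dict]) -> List[Tuple[str, Dict[str, bool]]]:
    """Return (substring, style) pairs, ordered by position in the text."""
    spans = []
    for prop in sorted(props, key=lambda p: p["begin"]):
        # Assumption: begin is inclusive and end is exclusive.
        snippet = text[prop["begin"]:prop["end"]]
        spans.append((snippet, {"bold": prop["bold"], "italic": prop["italic"]}))
    return spans


text = "Total amount due"
props = [
    {"begin": 6, "end": 16, "bold": False, "italic": True, "fontname": "Helvetica-Oblique"},
    {"begin": 0, "end": 5, "bold": True, "italic": False, "fontname": "Helvetica-Bold"},
]
spans = styled_spans(text, props)
```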
Box
Bases: BaseModel
The Box class represents a box annotation in a PDF document. It is the base class of TextBox.
ATTRIBUTE | DESCRIPTION |
---|---|
doc | The PDF document that this box belongs to. |
page_num | The page number of the box. |
x0 | The left x-coordinate of the box. |
x1 | The right x-coordinate of the box. |
y0 | The top y-coordinate of the box. |
y1 | The bottom y-coordinate of the box. |
label | The label of the box. |
page | The page object that this box belongs to. |
Text
Bases: BaseModel
The Text class represents a text object that is not bound to any box.
It can be used, for example, to store aggregated text from multiple boxes.
ATTRIBUTE | DESCRIPTION |
---|---|
text | The text content. |
properties | The style properties of the text. |
TextBox
Bases: Box
The TextBox class represents a text box annotation in a PDF document.
ATTRIBUTE | DESCRIPTION |
---|---|
text | The text content of the text box. |
props | The style properties of the text box. |
Tensor structure
The tensors used to process PDFs with deep learning models usually contain 4 main dimensions, in addition to the standard embedding dimensions:
- samples: one entry per PDF in the batch
- pages: one entry per page in a PDF
- boxes: one entry per box in a page
- token: one entry per token in a box (only for text boxes)
These tensors use a special FoldedTensor format to store the data in a compact way and reshape the data depending on the requirements of a layer.
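The idea of reshaping the same data for different layers can be illustrated without any tensor library. The sketch below is not the FoldedTensor implementation: `flatten_pages` and `pad` are hypothetical helpers working on nested Python lists. They show how one nested structure (samples, then pages, then boxes) can be viewed either as one padded row per sample or as one padded row per page.

```python
from typing import List, Tuple

# Nested data: samples -> pages -> boxes (one feature value per box).
nested = [
    [[1, 2, 3], [4]],  # PDF 0: two pages with 3 and 1 boxes
    [[5, 6]],          # PDF 1: one page with 2 boxes
]


def flatten_pages(data: List[List[List[int]]]) -> List[List[int]]:
    """Merge the pages dimension: one row of boxes per sample."""
    return [[box for page in sample for box in page] for sample in data]


def pad(rows: List[List[int]], pad_value: int = 0) -> Tuple[List[List[int]], List[List[bool]]]:
    """Pad rows to a common width and return the data with its mask."""
    width = max(len(row) for row in rows)
    data = [row + [pad_value] * (width - len(row)) for row in rows]
    mask = [[True] * len(row) + [False] * (width - len(row)) for row in rows]
    return data, mask


# View with dimensions (samples, boxes), e.g. for a document-level layer.
per_sample, sample_mask = pad(flatten_pages(nested))

# View with dimensions (pages, boxes), e.g. for a page-level layer.
per_page, page_mask = pad([page for sample in nested for page in sample])
```

Both views hold the same six values; only the grouping and the amount of padding change, which is the kind of trade-off a folded format is meant to handle.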