
edspdf.structures

PDFDoc

Bases: BaseModel

PDFDoc is the library's main data structure for holding a PDF. It stores the raw content of the PDF, along with its box annotations and text outputs.

ATTRIBUTE DESCRIPTION
content

The content of the PDF document.

TYPE: bytes

id

The ID of the PDF document.

TYPE: str, optional

pages

The pages of the PDF document.

TYPE: List[Page]

error

Whether there was an error when processing this PDF document.

TYPE: bool, optional

content_boxes

The content boxes/annotations of the PDF document.

TYPE: List[Union[TextBox, ImageBox]]

aggregated_texts

The aggregated text outputs of the PDF document.

TYPE: Dict[str, Text]

text_boxes

The text boxes of the PDF document.

TYPE: List[TextBox]
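
To illustrate how these attributes might relate to each other, here is a minimal sketch built on plain dataclasses rather than the real pydantic models. The names SketchDoc, SketchTextBox and SketchImageBox are hypothetical stand-ins, and the sketch assumes (this page does not state it) that text_boxes is the subset of content_boxes made of TextBox instances.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Union


@dataclass
class SketchTextBox:
    """Hypothetical stand-in for TextBox, not the edspdf class."""
    text: str
    page_num: Optional[int] = None


@dataclass
class SketchImageBox:
    """Hypothetical stand-in for ImageBox, not the edspdf class."""
    page_num: Optional[int] = None


@dataclass
class SketchDoc:
    """Hypothetical stand-in mirroring the documented PDFDoc attributes."""
    content: bytes
    id: Optional[str] = None
    error: Optional[bool] = False
    content_boxes: List[Union[SketchTextBox, SketchImageBox]] = field(
        default_factory=list
    )
    # The real attribute maps names to Text objects; plain strings keep
    # this sketch short.
    aggregated_texts: Dict[str, str] = field(default_factory=dict)

    @property
    def text_boxes(self) -> List[SketchTextBox]:
        # Assumption: text boxes are the TextBox entries of content_boxes.
        return [b for b in self.content_boxes if isinstance(b, SketchTextBox)]


doc = SketchDoc(
    content=b"%PDF-1.4 ...",
    id="example",
    content_boxes=[
        SketchTextBox("Hello", 0),
        SketchImageBox(0),
        SketchTextBox("World", 1),
    ],
)
texts = [b.text for b in doc.text_boxes]
```

In this sketch the image box is skipped, so texts only holds the content of the two text boxes.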

Page

Bases: BaseModel

The Page class represents a page of a PDF document.

ATTRIBUTE DESCRIPTION
page_num

The page number of the page.

TYPE: int

width

The width of the page.

TYPE: float

height

The height of the page.

TYPE: float

doc

The PDF document that this page belongs to.

TYPE: PDFDoc

image

The rendered image of the page, stored as a NumPy array.

TYPE: Optional[np.ndarray]

text_boxes

The text boxes of the page.

TYPE: List[TextBox]
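
One plausible way to obtain the text boxes of a page is to filter a document's boxes on page_num. The sketch below shows that idea with a hypothetical SketchBox stand-in and a hypothetical boxes_on_page helper. Whether the library computes Page.text_boxes this way, and whether page numbers start at 0, are assumptions.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SketchBox:
    """Hypothetical stand-in for a text box carrying a page number."""
    text: str
    page_num: Optional[int] = None


def boxes_on_page(boxes: List[SketchBox], page_num: int) -> List[SketchBox]:
    """Return the boxes whose page_num matches the requested page."""
    return [b for b in boxes if b.page_num == page_num]


all_boxes = [
    SketchBox("Header", 0),
    SketchBox("First paragraph", 0),
    SketchBox("Second page text", 1),
]
first_page = boxes_on_page(all_boxes, 0)
first_page_texts = [b.text for b in first_page]
```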

TextProperties

Bases: BaseModel

The TextProperties class represents the style properties of a span of text in a TextBox.

ATTRIBUTE DESCRIPTION
italic

Whether the text is italic.

TYPE: bool

bold

Whether the text is bold.

TYPE: bool

begin

The beginning index of the span of text.

TYPE: int

end

The ending index of the span of text.

TYPE: int

fontname

The font name of the span of text.

TYPE: Optional[str]
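
The begin and end indices can be used to recover the styled substring from the text they annotate. The sketch below assumes that they are character offsets into the text of the containing box, with begin inclusive and end exclusive, as in Python slicing. This page does not confirm that convention. SketchProps and bold_spans are hypothetical names.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SketchProps:
    """Hypothetical stand-in mirroring the documented TextProperties fields."""
    italic: bool
    bold: bool
    begin: int
    end: int
    fontname: Optional[str] = None


def bold_spans(text: str, props: List[SketchProps]) -> List[str]:
    """Return the substrings of `text` covered by bold spans.

    Assumes begin/end are character offsets, end exclusive.
    """
    return [text[p.begin:p.end] for p in props if p.bold]


text = "Patient report: stable"
props = [
    SketchProps(italic=False, bold=True, begin=0, end=14, fontname="Helvetica-Bold"),
    SketchProps(italic=True, bold=False, begin=16, end=22, fontname="Helvetica-Oblique"),
]
bold = bold_spans(text, props)
```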

Box

Bases: BaseModel

The Box class represents a box annotation in a PDF document. It is the base class of TextBox.

ATTRIBUTE DESCRIPTION
doc

The PDF document that this box belongs to.

TYPE: PDFDoc

page_num

The page number of the box.

TYPE: Optional[int]

x0

The left x-coordinate of the box.

TYPE: float

x1

The right x-coordinate of the box.

TYPE: float

y0

The top y-coordinate of the box.

TYPE: float

y1

The bottom y-coordinate of the box.

TYPE: float

label

The label of the box.

TYPE: Optional[str]

page

The page object that this box belongs to.

TYPE: Page
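
Since y0 is the top coordinate and y1 the bottom one, the vertical axis appears to grow downward, so a box's size can be derived by simple subtraction. The following sketch computes a box area and tests whether two boxes overlap. SketchBox, area and overlaps are hypothetical names, and the sketch assumes x0 <= x1 and y0 <= y1.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SketchBox:
    """Hypothetical stand-in mirroring the documented Box coordinates."""
    x0: float
    x1: float
    y0: float
    y1: float
    page_num: Optional[int] = None
    label: Optional[str] = None


def area(box: SketchBox) -> float:
    """Area of the box, clamped at zero for degenerate boxes."""
    return max(0.0, box.x1 - box.x0) * max(0.0, box.y1 - box.y0)


def overlaps(a: SketchBox, b: SketchBox) -> bool:
    """True if both boxes are on the same page and their interiors intersect."""
    if a.page_num != b.page_num:
        return False
    return a.x0 < b.x1 and b.x0 < a.x1 and a.y0 < b.y1 and b.y0 < a.y1


a = SketchBox(x0=0.0, x1=0.5, y0=0.0, y1=0.25, page_num=0, label="title")
b = SketchBox(x0=0.25, x1=0.75, y0=0.125, y1=0.5, page_num=0, label="body")
c = SketchBox(x0=0.25, x1=0.75, y0=0.125, y1=0.5, page_num=1, label="body")
```

Boxes on different pages never overlap in this sketch, even if their coordinates coincide.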

Text

Bases: BaseModel

The Text class represents a text object that is not bound to any box.

It can be used, for example, to store text aggregated from multiple boxes.

ATTRIBUTE DESCRIPTION
text

The text content.

TYPE: str

properties

The style properties of the text.

TYPE: List[TextProperties]
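
When text from several boxes is concatenated, the style spans of each box need to be shifted by the length of the text that precedes them. The sketch below shows one way this could be done. It is not the library's aggregation logic: the aggregate helper, the SketchProps stand-in and the newline separator are all assumptions.

```python
from dataclasses import dataclass, replace
from typing import List, Tuple


@dataclass
class SketchProps:
    """Hypothetical stand-in for TextProperties (subset of fields)."""
    italic: bool
    bold: bool
    begin: int
    end: int


def aggregate(
    boxes: List[Tuple[str, List[SketchProps]]], sep: str = "\n"
) -> Tuple[str, List[SketchProps]]:
    """Join box texts with `sep` and shift each span by its box offset."""
    parts: List[str] = []
    props: List[SketchProps] = []
    offset = 0
    for i, (text, box_props) in enumerate(boxes):
        if i > 0:
            # Account for the separator inserted before this box.
            offset += len(sep)
        for p in box_props:
            props.append(replace(p, begin=p.begin + offset, end=p.end + offset))
        parts.append(text)
        offset += len(text)
    return sep.join(parts), props


boxes = [
    ("Title", [SketchProps(italic=False, bold=True, begin=0, end=5)]),
    ("body text", [SketchProps(italic=True, bold=False, begin=0, end=4)]),
]
full_text, full_props = aggregate(boxes)
```

After aggregation, slicing full_text with a shifted span returns the same substring that the span covered in its original box.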

TextBox

Bases: Box

The TextBox class represents a text box annotation in a PDF document.

ATTRIBUTE DESCRIPTION
text

The text content of the text box.

TYPE: str

props

The style properties of the text box.

TYPE: List[TextProperties]
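
Because a TextBox combines coordinates with text, a list of text boxes can be sorted into an approximate reading order. The sketch below sorts by page, then top to bottom, then left to right. This is a simplifying heuristic chosen for illustration, not a method provided by the library, and SketchTextBox and reading_order are hypothetical names.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SketchTextBox:
    """Hypothetical stand-in combining Box coordinates with text."""
    text: str
    x0: float
    x1: float
    y0: float
    y1: float
    page_num: Optional[int] = None


def reading_order(boxes: List[SketchTextBox]) -> List[SketchTextBox]:
    """Sort boxes by page, then by top edge, then by left edge."""
    def key(b: SketchTextBox):
        # Boxes without a page number are placed first (arbitrary choice).
        page = b.page_num if b.page_num is not None else -1
        return (page, b.y0, b.x0)

    return sorted(boxes, key=key)


boxes = [
    SketchTextBox("Footer", x0=0.1, x1=0.9, y0=0.9, y1=0.95, page_num=0),
    SketchTextBox("Next page", x0=0.1, x1=0.9, y0=0.1, y1=0.2, page_num=1),
    SketchTextBox("Title", x0=0.1, x1=0.9, y0=0.1, y1=0.2, page_num=0),
]
ordered_texts = [b.text for b in reading_order(boxes)]
```

Multi-column layouts would need a more careful ordering than this single sort key.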