edspdf.pipeline
Pipeline
The Pipeline is the core object of EDS-PDF. It orchestrates the components and processes PDF documents end-to-end.
A pipeline is usually created empty and then populated with components via the add_pipe method. Here is an example:
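from edspdf.pipeline import Pipeline

# Create an empty pipeline, then add components by factory name
pipeline = Pipeline()
pipeline.add_pipe("pdfminer-extractor")
pipeline.add_pipe("mask-classifier")
pipeline.add_pipe("simple-aggregator")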
Source code in edspdf/pipeline.py
cfg
property
Returns the initial configuration of the pipeline
trainable_components: List[TrainableComponent]
property
Returns the list of trainable components in the pipeline.
__init__(components=None, components_config=None, batch_size=4)
Initializes the pipeline. The pipeline is empty by default and can be
populated with components via the add_pipe method.
| PARAMETER | DESCRIPTION |
|---|---|
| components | List of component names |
| components_config | Dictionary of component configurations. The keys of the dictionary must match the component names. The values are the component configurations, which can contain unresolved configuration for nested components or instances of components. |
| batch_size | The default number of documents to process in parallel when running the pipeline |
add_pipe(factory_name, name=None, config=None)
Adds a component to the pipeline. The component can be either a factory name or an instantiated component. If a factory name is provided, the component will be instantiated with the class from the registry matching the factory name and using the provided config as arguments.
| PARAMETER | DESCRIPTION |
|---|---|
| factory_name | Either a factory name or an instantiated component |
| name | Name of the component. DEFAULT: None |
| config | Configuration of the component. The configuration can contain unresolved configuration for nested components such as … DEFAULT: None |

| RETURNS | DESCRIPTION |
|---|---|
| Component | The added component |
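The two ways of calling add_pipe described above (a factory name resolved through a registry, or an already instantiated component) can be illustrated with a toy stand-in. This is a minimal sketch, not EDS-PDF's implementation: `REGISTRY`, `register`, `UpperCaser`, and `ToyPipeline` are hypothetical names, and the fallback of naming an instance after its class is an assumption made only for this example.

```python
from typing import Any, Callable, Dict, List, Tuple

# Hypothetical registry mapping factory names to component constructors.
REGISTRY: Dict[str, Callable[..., Any]] = {}


def register(factory_name: str):
    """Decorator that records a component factory under a name."""
    def decorator(factory):
        REGISTRY[factory_name] = factory
        return factory
    return decorator


@register("upper-caser")
class UpperCaser:
    """Toy component: upper-cases a string and appends a suffix."""

    def __init__(self, suffix: str = ""):
        self.suffix = suffix

    def __call__(self, doc: str) -> str:
        return doc.upper() + self.suffix


class ToyPipeline:
    def __init__(self):
        self.components: List[Tuple[str, Any]] = []

    def add_pipe(self, factory_name, name=None, config=None):
        if isinstance(factory_name, str):
            # Factory name: look it up and instantiate it with the config
            component = REGISTRY[factory_name](**(config or {}))
            name = name or factory_name
        else:
            # Already an instance: use it as-is (naming rule is an assumption)
            component = factory_name
            name = name or type(component).__name__
        self.components.append((name, component))
        return component


toy = ToyPipeline()
added = toy.add_pipe("upper-caser", name="shout", config={"suffix": "!"})
toy.add_pipe(UpperCaser())
```

In this sketch the config dictionary is passed to the factory as keyword arguments, which mirrors the "using the provided config as arguments" wording above.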
reset_cache(cache=None)
Reset the caches of the components in this pipeline
| PARAMETER | DESCRIPTION |
|---|---|
| cache | The cache to reset (either … |
__call__(doc)
Applies the pipeline on a sample
| PARAMETER | DESCRIPTION |
|---|---|
| doc | The document to process |

| RETURNS | DESCRIPTION |
|---|---|
| OutputT | |
pipe(docs)
Apply the pipeline on a collection of documents
| PARAMETER | DESCRIPTION |
|---|---|
| docs | The documents to process |

| RETURNS | DESCRIPTION |
|---|---|
| Iterable | An iterable collection of processed documents |
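As a rough illustration of processing a collection of documents in batches, here is a minimal sketch with plain Python callables standing in for components. `batched` and `toy_pipe` are hypothetical helpers, and the assumption that documents are grouped by `batch_size` before each component runs is ours, not a statement about the actual source code.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, List


def batched(items: Iterable, batch_size: int) -> Iterator[List]:
    """Yield successive lists of at most `batch_size` items."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def toy_pipe(
    docs: Iterable, components: List[Callable], batch_size: int = 4
) -> Iterator:
    """Lazily apply every component, in order, to each batch of documents."""
    for batch in batched(docs, batch_size):
        for component in components:
            batch = [component(doc) for doc in batch]
        yield from batch


components = [str.upper, lambda doc: doc + "!"]
processed = list(toy_pipe(["a", "b", "c", "d", "e"], components, batch_size=2))
```

Because `toy_pipe` is a generator, nothing is processed until the result is iterated, which fits the "iterable collection" return type listed above.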
initialize(data)
Initializes the components of the pipeline. Each component must be initialized before the next components are run. Since a component might need the full training data to be initialized, all data may be fed to the component, making it impossible to enable batch caching.
Therefore, the cache is disabled during the entire operation, so heavy computation (such as embeddings) that is usually shared will be repeated for each initialized component.
| PARAMETER | DESCRIPTION |
|---|---|
| data | |
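The ordering constraint described above (a component is initialized on the full data, then run, before the next component is initialized) can be sketched with toy components. `Lowercase`, `VocabComponent`, and `toy_initialize` are hypothetical; the sketch only shows the ordering and leaves caching out entirely.

```python
class Lowercase:
    """Toy component with nothing to learn at initialization."""

    def initialize(self, docs):
        pass

    def __call__(self, doc):
        return [token.lower() for token in doc]


class VocabComponent:
    """Toy component that needs to see all the data to build its vocabulary."""

    def __init__(self):
        self.vocab = None

    def initialize(self, docs):
        self.vocab = sorted({token for doc in docs for token in doc})

    def __call__(self, doc):
        return [self.vocab.index(token) for token in doc]


def toy_initialize(components, data):
    docs = list(data)
    for component in components:
        # The component sees the full data as transformed by earlier components
        component.initialize(docs)
        # Run it right away so that the next component sees its output
        docs = [component(doc) for doc in docs]
    return docs


vocab_component = VocabComponent()
encoded = toy_initialize([Lowercase(), vocab_component], [["B", "a"], ["A"]])
```

Here the vocabulary is built from lower-cased tokens only because `Lowercase` was run before `VocabComponent` was initialized.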
score(docs)
Scores a pipeline against a sequence of annotated documents
| PARAMETER | DESCRIPTION |
|---|---|
| docs | The documents to score |

| RETURNS | DESCRIPTION |
|---|---|
| Dict[str, Any] | A dictionary containing the metrics of the pipeline, as well as the speed of the pipeline. Each component that has a scorer will also be scored and its metrics will be included in the returned dictionary under a key named after each component. |
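To make the shape of the returned dictionary concrete, here is a minimal sketch of a scoring loop over toy documents. Everything in it is hypothetical: the `"speed"` key name, the documents-per-second unit, the `(predicted, gold)` scorer interface, and the helper names are assumptions chosen for illustration, not the library's actual conventions.

```python
import time
from typing import Any, Callable, Dict, List, Tuple

Doc = Dict[str, Any]


def length_classifier(doc: Doc) -> Doc:
    # Toy component: label a text "long" if it has more than 3 characters
    return {**doc, "label": "long" if len(doc["text"]) > 3 else "short"}


def accuracy_scorer(pairs: List[Tuple[Doc, Doc]]) -> Dict[str, float]:
    # Each pair is (predicted document, annotated document)
    correct = sum(pred["label"] == gold["label"] for pred, gold in pairs)
    return {"accuracy": correct / len(pairs)}


def toy_score(
    components: List[Tuple[str, Callable[[Doc], Doc]]],
    scorers: Dict[str, Callable],
    gold_docs: List[Doc],
) -> Dict[str, Any]:
    gold_docs = list(gold_docs)
    start = time.perf_counter()
    preds = []
    for gold in gold_docs:
        doc = {"text": gold["text"]}  # drop the annotations before predicting
        for _, component in components:
            doc = component(doc)
        preds.append(doc)
    elapsed = time.perf_counter() - start

    # Pipeline-level speed (documents per second), then per-component metrics
    speed = len(gold_docs) / elapsed if elapsed > 0 else float("inf")
    metrics: Dict[str, Any] = {"speed": speed}
    for name, _ in components:
        if name in scorers:
            metrics[name] = scorers[name](list(zip(preds, gold_docs)))
    return metrics


gold = [
    {"text": "hi", "label": "short"},
    {"text": "hello", "label": "long"},
    {"text": "hey", "label": "long"},
]
metrics = toy_score(
    [("classifier", length_classifier)], {"classifier": accuracy_scorer}, gold
)
```

Only components that have an entry in `scorers` contribute a key, matching the "each component that has a scorer" wording above.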
preprocess(doc, supervision=False)
Runs the preprocessing methods of each component in the pipeline on a document and returns a dictionary containing the results, with the component names as keys.
| PARAMETER | DESCRIPTION |
|---|---|
| doc | The document to preprocess |
| supervision | Whether to include supervision information in the preprocessing |

| RETURNS | DESCRIPTION |
|---|---|
| Dict[str, Any] | |
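The structure of the result (one entry per component, keyed by component name) can be sketched as follows. The toy components and the split between a `preprocess` and a `preprocess_supervised` method are assumptions made for this example; the sketch only illustrates how the `supervision` flag could switch between the two.

```python
class ToyEmbedding:
    def preprocess(self, doc):
        return {"token_ids": [len(word) for word in doc["words"]]}

    def preprocess_supervised(self, doc):
        # Nothing to supervise for this component
        return self.preprocess(doc)


class ToyClassifier:
    def preprocess(self, doc):
        return {"n_words": len(doc["words"])}

    def preprocess_supervised(self, doc):
        # Add the annotation needed to compute a training loss
        return {**self.preprocess(doc), "label": doc["label"]}


def toy_preprocess(components, doc, supervision=False):
    """Collect each component's preprocessing output under its name."""
    result = {}
    for name, component in components:
        if supervision:
            result[name] = component.preprocess_supervised(doc)
        else:
            result[name] = component.preprocess(doc)
    return result


components = [("embedding", ToyEmbedding()), ("classifier", ToyClassifier())]
doc = {"words": ["a", "bcd"], "label": "body"}
plain = toy_preprocess(components, doc)
supervised = toy_preprocess(components, doc, supervision=True)
```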
preprocess_many(docs, compress=True, supervision=True)
Runs the preprocessing methods of each component in the pipeline on a collection of documents and returns an iterable of dictionaries containing the results, with the component names as keys.
| PARAMETER | DESCRIPTION |
|---|---|
| docs | |
| compress | Whether to deduplicate identical preprocessing outputs when multiple documents share identical subcomponents. This step is required to enable the cache mechanism when training or running the pipeline over tabular datasets, such as pyarrow tables, that do not store referential equality information. DEFAULT: True |
| supervision | Whether to include supervision information in the preprocessing. DEFAULT: True |

| RETURNS | DESCRIPTION |
|---|---|
| Iterable[OutputT] | |
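One plausible reading of the `compress` option is that equal preprocessing outputs are replaced by a single shared object, so that an identity-based cache can recognize them. The sketch below illustrates that idea on plain dictionaries and lists; `deduplicate` is a hypothetical helper, and using a JSON dump as the equality key is an assumption that only works for JSON-serializable values.

```python
import json


def deduplicate(samples):
    """Replace equal sub-structures with one shared object (same identity)."""
    seen = {}

    def dedup(value):
        if isinstance(value, dict):
            # Deduplicate children first so that equal parents compare equal
            value = {key: dedup(child) for key, child in value.items()}
        key = json.dumps(value, sort_keys=True)
        # Return the first object seen with this content, or register this one
        return seen.setdefault(key, value)

    return [dedup(sample) for sample in samples]


samples = [
    {"embedding": {"ids": [1, 2]}, "classifier": {"ids": [1, 2]}},
    {"embedding": {"ids": [1, 2]}, "classifier": {"ids": [3]}},
]
compressed = deduplicate(samples)
```

After this step the content is unchanged, but `compressed[0]["embedding"]` and `compressed[1]["embedding"]` are the same object, which is the kind of referential equality that a tabular storage format would otherwise lose.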
collate(batch, device=None)
Collates a batch of preprocessed samples into a single (maybe nested) dictionary of tensors by calling the collate method of each component.
| PARAMETER | DESCRIPTION |
|---|---|
| batch | The batch of preprocessed samples |
| device | The device to move the tensors to before returning them |

| RETURNS | DESCRIPTION |
|---|---|
| Dict[str, Any] | The collated batch |
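The delegation described above (the pipeline calls the collate method of each component and gathers the results in a nested dictionary) can be sketched without tensors. In this toy version, padded Python lists stand in for tensors and the `device` argument is left out; `PaddingComponent` and `toy_collate` are hypothetical names.

```python
class PaddingComponent:
    """Toy component that pads token id lists to a common length."""

    def collate(self, samples):
        max_len = max(len(sample["token_ids"]) for sample in samples)
        token_ids, mask = [], []
        for sample in samples:
            ids = sample["token_ids"]
            pad = max_len - len(ids)
            token_ids.append(ids + [0] * pad)
            mask.append([1] * len(ids) + [0] * pad)
        return {"token_ids": token_ids, "mask": mask}


def toy_collate(components, batch):
    """Collate each component's samples under that component's name."""
    collated = {}
    for name, component in components:
        collated[name] = component.collate([sample[name] for sample in batch])
    return collated


batch = [
    {"embedding": {"token_ids": [1, 2, 3]}},
    {"embedding": {"token_ids": [4]}},
]
collated = toy_collate([("embedding", PaddingComponent())], batch)
```

With real tensors, the final step would typically move each collated value to the requested device before returning it.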
train(mode=True)
Enables training mode on PyTorch modules
| PARAMETER | DESCRIPTION |
|---|---|
| mode | Whether to enable training or not. DEFAULT: True |
no_cache()
Disable caching for all (trainable) components in the pipeline
parameters()
Returns an iterator over the PyTorch parameters of the components in the pipeline
__iter__()
Returns an iterator over the components in the pipeline.
__len__()
Returns the number of components in the pipeline.