CatalogBank: A Structured and Interoperable Catalog Dataset with a Semi-Automatic Annotation Tool DocumentLabeler for Engineering System Design
Published in Proceedings of the ACM Symposium on Document Engineering 2024, 2024
📌 Overview
CatalogBank: A Structured and Interoperable Catalog Dataset with a Semi‑Automatic Annotation Tool (DocumentLabeler) for Engineering System Design was presented at DocEng ’24 (ACM Symposium on Document Engineering) in August 2024. It was authored by Hasan Sinan Bank (Craftnetics Inc.) and Daniel R. Herber (Colorado State University) (ResearchGate), and it was nominated for the Best Paper Award (Colorado State University Engineering).
Objective
This work introduces CatalogBank, a dataset designed to connect textual engineering catalogs with structured, interoperable metadata. It targets improved automation in engineering workflows and downstream NLP tasks like layout analysis and knowledge extraction (GitHub).
Core Contributions
1. CatalogBank Dataset
- A collection of diverse PDF-based catalogs from engineering vendors (e.g. Thorlabs, McMaster‑Carr).
- Provides structured information extracted from the catalogs, including product specifications, images, tables, and layout elements (GitHub, ResearchGate).
- Bridges modalities: text, geometry, images, and graph-format data.
2. DocumentLabeler Tool
- An open‑source, semi‑automatic annotation tool tailored for engineering documents.
- Speeds up annotation by combining automatic labeling with human review to generate structured labels from PDF layouts (GitHub).
3. Interoperability & Standardization
- Defines a unified schema and metadata model to enable consistent extraction and downstream integration.
- Helps overcome challenges posed by non‑standard PDF catalog formats and manual entry bottlenecks.
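To make the combination of a unified schema and semi-automatic review more concrete, here is a minimal Python sketch. It is not the CatalogBank schema or the DocumentLabeler API: the class names, field names, label set, and the 0.8 review threshold are all assumptions chosen for illustration. It only shows the general idea of storing catalog content in one consistent record and routing low-confidence automatic labels to a human reviewer.

```python
import json
from dataclasses import dataclass, field, asdict

# Hypothetical label set; the labels used by the actual tool may differ.
LAYOUT_TYPES = {"text", "table", "image"}


@dataclass
class LayoutElement:
    kind: str                 # one of LAYOUT_TYPES
    page: int                 # 1-based page number in the source PDF
    bbox: tuple               # (x0, y0, x1, y1) in page coordinates
    content: str = ""         # extracted text, if any
    confidence: float = 1.0   # score from the automatic pass

    def __post_init__(self):
        if self.kind not in LAYOUT_TYPES:
            raise ValueError(f"unknown layout type: {self.kind}")


@dataclass
class CatalogRecord:
    vendor: str
    part_number: str
    specs: dict = field(default_factory=dict)
    elements: list = field(default_factory=list)


def needs_review(record, threshold=0.8):
    """Return the elements whose automatic label scored below the threshold."""
    return [e for e in record.elements if e.confidence < threshold]


# Made-up example data, not taken from any real catalog.
record = CatalogRecord(
    vendor="ExampleVendor",
    part_number="EX-001",
    specs={"length_mm": 25.0},
    elements=[
        LayoutElement("text", 1, (72.0, 90.0, 300.0, 110.0), "Example bracket, 25 mm", 0.95),
        LayoutElement("table", 1, (72.0, 120.0, 540.0, 400.0), "", 0.55),
    ],
)

# Serialize to JSON so other tools can consume the same record.
as_json = json.dumps(asdict(record), sort_keys=True)

# Only the low-confidence table is queued for human review.
print([e.kind for e in needs_review(record)])  # ['table']
```

Keeping the confidence score on each element lets the same record serve both the automatic pass and the human review step, and the JSON form is one simple way to hand the result to downstream tools.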
Significance & Benefits
- Enables automation in engineering design workflows by providing structured, labeled data from catalogs.
- Supports layout analysis and information extraction in Document Engineering and NLP pipelines.
- Contributes a robust benchmark dataset for training and evaluating models on catalog-style documents (ResearchGate).
TL;DR Summary Table
Element | Description |
---|---|
Title | CatalogBank dataset with DocumentLabeler tool |
Event | ACM DocEng ’24 (August 2024) |
Authors | Hasan Sinan Bank & Daniel R. Herber |
Goal | Structure textual catalogs into interoperable data formats |
Mechanism | PDF parsing, semi-automated annotation, metadata schema standardization |
Tools | CatalogBank dataset + DocumentLabeler annotation tool |
Use Cases | NLP layout analysis, engineering design automation, dataset benchmarks |
BibTeX Citation:
@inproceedings{bank2024catalogbank,
  title={CatalogBank: A Structured and Interoperable Catalog Dataset with a Semi-Automatic Annotation Tool (DocumentLabeler) for Engineering System Design},
  author={Bank, Hasan Sinan and Herber, Daniel R},
  booktitle={Proceedings of the ACM Symposium on Document Engineering 2024},
  pages={1--9},
  year={2024}
}