Curate Labs

March 13, 2024 | External research | Web extraction | Graph data

External research: Curate Labs did not author this paper.

The paper "Combining Language and Graph Models for Semi-structured Information Extraction on the Web" focuses on targeted relation extraction from webpages. The task framing is practical: given a relation name and a short description, extract the matching values from semi-structured web pages without training a new vertical-specific extractor.

The model, GraphScholarBERT, combines language representations with graph representations of page structure. That is the right instinct for this domain. Webpages are not ordinary prose; their layout, repeated templates, local DOM neighborhoods, and field-like structures carry signal that a sentence encoder can easily flatten away.
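To make the "structure carries signal" point concrete, here is a minimal sketch of turning a page into a graph using only Python's standard library. It is not the paper's pipeline: the class name `DomGraphBuilder`, the helper `tree_distance`, the node fields, and the sample HTML are all assumptions made for illustration. It only shows that a field label and its value can sit a few hops apart in the DOM tree, which is information a flat text encoder never sees.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag (illustrative subset, not exhaustive).
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class DomGraphBuilder(HTMLParser):
    """Hypothetical helper: collects elements and text as nodes with parent-child edges."""

    def __init__(self):
        super().__init__()
        self.nodes = []   # index is the node id; each entry is {"kind", "label"}
        self.edges = []   # (parent_id, child_id) pairs
        self._stack = []  # ids of elements that are currently open

    def _add_node(self, kind, label):
        node_id = len(self.nodes)
        self.nodes.append({"kind": kind, "label": label})
        if self._stack:
            self.edges.append((self._stack[-1], node_id))
        return node_id

    def handle_starttag(self, tag, attrs):
        node_id = self._add_node("element", tag)
        if tag not in VOID_TAGS:
            self._stack.append(node_id)

    def handle_endtag(self, tag):
        # Only close the element if it matches the innermost open one.
        if tag in VOID_TAGS:
            return
        if self._stack and self.nodes[self._stack[-1]]["label"] == tag:
            self._stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if text:
            self._add_node("text", text)


def tree_distance(edges, a, b):
    """Number of parent-child hops between two nodes via their lowest common ancestor."""
    parent = {child: par for par, child in edges}

    def path_to_root(node):
        path = [node]
        while node in parent:
            node = parent[node]
            path.append(node)
        return path

    path_a, path_b = path_to_root(a), path_to_root(b)
    depth_in_a = {node: i for i, node in enumerate(path_a)}
    for j, node in enumerate(path_b):
        if node in depth_in_a:
            return depth_in_a[node] + j
    return None  # nodes are in different trees


# Made-up page fragment: a label cell next to a value cell.
builder = DomGraphBuilder()
builder.feed("<table><tr><th>Director</th><td>Jane Doe</td></tr></table>")
text_ids = [i for i, n in enumerate(builder.nodes) if n["kind"] == "text"]
hops = tree_distance(builder.edges, text_ids[0], text_ids[1])
print(hops)
```

In this toy fragment the label and the value are four hops apart through their shared row. A graph model can use that kind of locality directly, whereas a plain text dump of the page would only preserve word order.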

Why we're excited

The paper reports improvements on SWDE, expanded SWDE, and PPPDB, including gains in zero-shot domain and website settings. The most important result is not just the metric; it is the evidence that graph features help when the source data is semi-structured.

Our community read

This is relevant to any agentic extraction system that consumes websites, portals, vendor pages, or public filings. Treating those inputs as "text only" throws away structure.

The limitation is that benchmarked web extraction is still much more controlled than the live web. Real production systems also need layout drift handling, JavaScript rendering policy, deduplication, and provenance. GraphScholarBERT is therefore best read as a strong modeling pattern, not a complete web-ingestion system.
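As a rough illustration of two of those production concerns, deduplication and provenance, the sketch below groups extracted values by a normalized key while keeping every source record attached. The `ExtractedValue` fields, the normalization rule, and the example URLs are hypothetical choices for this note, not anything described in the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExtractedValue:
    """Hypothetical record: one extracted value plus where it came from."""
    relation: str
    value: str
    source_url: str
    dom_path: str
    fetched_at: str


def dedup_key(item):
    # Assumed normalization: lowercase and collapse runs of whitespace.
    normalized = " ".join(item.value.lower().split())
    return (item.relation, normalized)


def deduplicate(items):
    """Group records that share a key; every provenance record is kept."""
    merged = {}
    for item in items:
        merged.setdefault(dedup_key(item), []).append(item)
    return merged


# Made-up records from two different pages.
records = [
    ExtractedValue("director", "Jane Doe", "https://example.com/a",
                   "/html/body/table/tr[1]/td", "2024-03-13"),
    ExtractedValue("director", "  jane   doe ", "https://example.com/b",
                   "/html/body/div/span", "2024-03-13"),
]
merged = deduplicate(records)
for key, sources in merged.items():
    print(key, [s.source_url for s in sources])
```

Keeping the full list of sources per key, instead of discarding duplicates, is one way to make a value auditable later. A real system would likely need a more careful normalization than this.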

Source

arXiv: 2402.14129

Community Reading: GraphScholarBERT for Semi-Structured Web IE
