Web Data Extraction from Query Result Pages Based on Visual and Content Features
    Download PDF
Daiyue Weng,Jun Hong,David A. Bell. Web Data Extraction from Query Result Pages Based on Visual and Content Features. International Journal of Software and Informatics, 2012,6(3):453~472
Hits: 4041
Download times: 4219
Abstract:A rapidly increasing number of Web databases are now become accessible via their HTML form-based query interfaces. Query result pages are dynamically generated in response to user queries, which encode structured data and are displayed for human use. Query result pages usually contain other types of information in addition to query results, e.g., advertisements, navigation bar etc. The problem of extracting structured data from query result pages is critical for web data integration applications, such as comparison shopping, meta-search engines etc, and has been intensively studied. A number of approaches have been proposed. As the structures of Web pages become more and more complex, the existing approaches start to fail, and most of them do not remove irrelevant contents which may affect the accuracy of data record extraction. We propose an automated approach for Web data extraction. First, it makes use of visual features and query terms to identify data sections and extracts data records in these sections. We also represent several content and visual features of visual blocks in a data section, and use them to filter out noisy blocks. Second, it measures similarity between data items in different data records based on their visual and content features, and aligns them into different groups so that the data in the same group have the same semantics. The results of our experiments with a large set of Web query result pages in di?erent domains show that our proposed approaches are highly effective.
keywords:web data mining  web data extraction  data record extraction  web data alignment
View Full Text  View/Add Comment  Download reader

 

 

more>>  
Visitor:3140676
Top Paper  |  E-mail Alert  |  Publication Ethics  |  New Version

© Copyright by Institute of Software, the Chinese Academy of Sciences
京ICP备05046678号-5

京公网安备 11040202500065号