Daniel Janus 4/18/2010

Sunflower

Sunflower is a tool that automates extraction of the main textual content from multiple HTML documents that come from the same source. The user picks a sample document and identifies key strings that belong to its main content; Sunflower then finds the smallest HTML subtree containing those strings and uses the same subtree to extract content from every document in the collection. The article describes the tool's GUI, its use in building the National Corpus of Polish, and the author's shift to a Swing widget-based architecture for managing application state.
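
The smallest-containing-subtree idea can be illustrated with a minimal sketch. This is not Sunflower's own code: it is a Python approximation that uses only the standard library, and every name in it (`Node`, `TreeBuilder`, `smallest_subtree_path`, `extract`) is hypothetical. It assumes that all documents from one source share the same element layout, so that a path of child indexes found in one sample document can be replayed on the others.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag (a small, incomplete list for the sketch).
VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}


class Node:
    """A hypothetical minimal DOM node: children are Nodes or plain strings."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []

    def elements(self):
        return [c for c in self.children if isinstance(c, Node)]

    def text(self):
        return "".join(c if isinstance(c, str) else c.text() for c in self.children)


class TreeBuilder(HTMLParser):
    """Builds a forgiving tree out of the parser's start/end/data events."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Climb to the nearest open element with this tag; ignore stray end tags.
        node = self.current
        while node is not None and node.tag != tag:
            node = node.parent
        if node is not None and node.parent is not None:
            self.current = node.parent

    def handle_data(self, data):
        self.current.children.append(data)


def parse(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def smallest_subtree_path(root, key_strings):
    """Return child indexes leading to the deepest element whose text holds every key string."""
    if not all(k in root.text() for k in key_strings):
        raise ValueError("key strings not found in document")
    path, node = [], root
    while True:
        for i, child in enumerate(node.elements()):
            child_text = child.text()
            if all(k in child_text for k in key_strings):
                path.append(i)
                node = child
                break
        else:
            return path  # no single child contains everything: node is the smallest subtree


def extract(root, path):
    """Replay a path on another document; return whitespace-normalized text or None."""
    node = root
    for i in path:
        children = node.elements()
        if i >= len(children):
            return None  # this document does not share the sample's layout
        node = children[i]
    return " ".join(node.text().split())


# Sample document in which the user marked two strings from the main content.
doc_a = ("<html><body><div>Menu</div><div><h1>Title A</h1>"
         "<p>First paragraph.</p><p>Last paragraph.</p></div>"
         "<div>Footer</div></body></html>")
# Another document from the same (made-up) source.
doc_b = ("<html><body><div>Menu</div><div><h1>Title B</h1>"
         "<p>Other text.</p></div><div>Footer</div></body></html>")

path = smallest_subtree_path(parse(doc_a), ["Title A", "Last paragraph."])
print(path)
print(extract(parse(doc_b), path))
```

Identifying the subtree by child indexes is the simplest possible choice and breaks as soon as one page has an extra banner or sidebar; matching on tag names or attributes such as `id` and `class` would likely be more robust, at the cost of a longer sketch.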
