Daniel Janus 4/18/2010

Sunflower


Sunflower is a tool that automates extracting the main textual content from many HTML documents that come from the same source. The user identifies a few key strings within the essential content of one document; Sunflower then finds the smallest HTML subtree containing those strings and uses it to extract the content from every document in the collection. The article describes the tool's GUI, its use in building the National Corpus of Polish, and the author's shift to a Swing widget-based architecture for managing application state.
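
The smallest-containing-subtree idea can be illustrated with a minimal Python sketch. This is not Sunflower's own code: the function names `smallest_subtree_path` and `extract` are hypothetical, the subtree is located by a path of child indices (one plausible way to reuse a location across documents, assumed here), and the standard-library XML parser is used, so the input must be well-formed markup.

```python
import xml.etree.ElementTree as ET


def smallest_subtree_path(root, key_strings):
    """Return a child-index path to the smallest element containing all key strings.

    Hypothetical helper: descends from the root as long as some single child
    still contains every key string in its text.
    """
    if not all(s in "".join(root.itertext()) for s in key_strings):
        raise ValueError("the document does not contain all key strings")
    path, node = [], root
    while True:
        for index, child in enumerate(node):
            text = "".join(child.itertext())
            if all(s in text for s in key_strings):
                path.append(index)
                node = child
                break
        else:
            # No single child holds every key string, so `node` is the smallest subtree.
            return path


def extract(root, path):
    """Follow the same path in another document and return its whitespace-normalized text."""
    node = root
    for index in path:
        if index >= len(node):
            return None  # this document has a different structure
        node = node[index]
    return " ".join(" ".join(node.itertext()).split())


# Made-up sample pages standing in for two documents from the same source.
FIRST = ("<html><body><div>Menu</div>"
         "<div><h1>Title one</h1><p>First body text.</p></div>"
         "<div>Footer</div></body></html>")
SECOND = ("<html><body><div>Menu</div>"
          "<div><h1>Another title</h1><p>Different body.</p></div>"
          "<div>Footer</div></body></html>")

path = smallest_subtree_path(ET.fromstring(FIRST), ["Title one", "First body"])
content = extract(ET.fromstring(SECOND), path)
```

Here the key strings picked from the first page select the middle `div`, and the same path pulls the title and body out of the second page while skipping the menu and footer. Real-world HTML is rarely well-formed, so an actual tool would need a tolerant HTML parser in place of `xml.etree.ElementTree`.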

