Developers Arena

Social Media Web Tips, Social Media News & Technology Updates

Head to Head Comparison of Text Extraction Algorithms

A few months ago we linked to Tomaž Kovačič’s overview of text extraction algorithms. Now Kovačič has posted an evaluation of several text extraction algorithms and services, including Boilerpipe, NCleaner, the Python and Node.js versions of Readability and the Extractiv API.

To conduct his evaluations, Kovačič used the Cleaneval data set, which includes 681 documents, and a Google News data set of 621 documents harvested by the authors of Boilerpipe.
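To make the idea of scoring an extractor against a gold-standard data set concrete, here is a minimal sketch. It assumes bag-of-words precision, recall and F1, which is a common way to score this kind of task; the article does not say which metrics Kovačič used, and the function names and sample strings below are hypothetical.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def score_extraction(extracted, gold):
    """Return (precision, recall, f1) based on bag-of-words overlap.

    Assumption: this is an illustrative metric, not necessarily the
    one used in the evaluation the article describes.
    """
    extracted_counts = Counter(tokenize(extracted))
    gold_counts = Counter(tokenize(gold))

    # Multiset intersection: a token counts only as often as it
    # appears in both the extracted text and the gold standard.
    overlap = sum((extracted_counts & gold_counts).values())
    n_extracted = sum(extracted_counts.values())
    n_gold = sum(gold_counts.values())

    precision = overlap / n_extracted if n_extracted else 0.0
    recall = overlap / n_gold if n_gold else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


# Hypothetical example: the extractor kept the article text but also
# let three words of navigation boilerplate through.
gold = "Kovacic compared several text extraction tools"
extracted = "Home About Kovacic compared several text extraction tools Subscribe"
precision, recall, f1 = score_extraction(extracted, gold)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

In this example the extractor recovers every gold-standard word, so recall is perfect, but the leftover boilerplate lowers precision. Averaging such scores over every document in a data set gives one number per tool, which is the kind of figure a head-to-head comparison rests on.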

Figure: Text extraction algorithms compared (metric for the Google News data set)

A few notes:

NCleaner did better on its own Cleaneval data set than it did on the Google News data set, but Boilerpipe did well on both sets.
Kovačič was surprised by Readability’s poor performance, and notes the discrepancy between the two ports. He thinks the original JavaScript version may do better.
The commercial APIs had the most consistent results.

Image by Andrew Mason

Posted in General, Technology, Web.

Tagged with Analysis.

By Klint Finley – June 11, 2011
