The Semantic Web initiative promotes standardized data formats that embed machine-readable information about content in web pages. Although more sites are adopting microformats and similar techniques to advance this initiative, the web still consists predominantly of unstructured text.
So, what do you do when you have to automate the handling of a large volume of content that does not have embedded tags to identify properties such as authorship or publication date? Ask the publisher to allocate resources to mark up their content? Write your own algorithms to interpret the text to find the information?
Temnos Elements detects and extracts a variety of document properties, including:
- Creator: The person who created the content (i.e., the author)
- Date: When the content was published
- Language: Which language the page is written in
- Publisher: The entity that makes the content available for consumption
- Reading level: A measure of the minimum grade level required to understand the content
- Title: The main title of an article or blog post
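
To make the list above concrete, here is a minimal sketch of how a client application might hold these properties once they have been extracted. It is an illustration only: the `DocumentProperties` class, its field names, the `missing_fields` helper, and the sample values are all hypothetical and are not taken from the Temnos Elements API.

```python
import datetime
from dataclasses import asdict, dataclass
from typing import List, Optional


@dataclass
class DocumentProperties:
    """Hypothetical container for the properties listed above.

    Every field is optional because any given page may lack a
    detectable value for it.
    """
    creator: Optional[str] = None                   # person who created the content
    published_date: Optional[datetime.date] = None  # when the content was published
    language: Optional[str] = None                  # e.g. a language code such as "en"
    publisher: Optional[str] = None                 # entity making the content available
    reading_level: Optional[float] = None           # minimum grade level to understand it
    title: Optional[str] = None                     # main title of the article or post


def missing_fields(props: DocumentProperties) -> List[str]:
    """Return the names of properties that could not be determined."""
    return [name for name, value in asdict(props).items() if value is None]


# Made-up sample values, for illustration only.
example = DocumentProperties(
    creator="Jane Doe",
    published_date=datetime.date(2011, 3, 14),
    language="en",
    publisher="Example News",
    reading_level=8.5,
    title="An Example Article",
)
```

Keeping every field optional lets downstream code distinguish "not detected" from an empty value, and `missing_fields(example)` gives a quick way to check which properties still need attention.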
Contact us to get started!