Temnos offers scalable content intelligence services on a PaaS (Platform-as-a-Service) model. In essence, you send content (documents or URLs) to our API, and we return metadata, or metacontent, for each item that our platform processes. Armed with this information, you gain insight into what your pages are about, which pieces of content are most relevant to each other (and why), and the various ways that pages can be characterized or re-packaged to your benefit.
The Temnos platform and its field-proven technologies were built by a core team that has worked together through two different acquired start-ups (and their acquirers). With over 120 person-years of engineering invested in it, the technology has been tested and proven by the likes of CNET, About.com, IDG, General Motors, Federated Media, Disqus, Metaweb, and more. Blending several methodologies from diverse fields of knowledge, the Temnos platform offers developers of content-centric applications an unprecedented opportunity to make their products richer and smarter.
Building Content Intelligence
Keywords vs. Word Senses
Before even starting down the road of building content intelligence, the first hurdle to overcome is the multi-sense, or polysemous, nature of language. Though we rarely think much about it, we all know that words can mean different things. Is the word “Mercury” a car, a planet, a chemical element, or a Roman god? Our team was one of the first to tackle scalable, automated word sense disambiguation (WSD). Some of our technologists have worked on this for more than 15 years. Everything that we do is cognizant of the fact that keywords are not good enough any more — you must know whether the intended sense of a word was found in a document or not.
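To make the idea of word senses concrete, here is a minimal sketch of one well-known family of techniques: picking the sense whose cue words overlap most with the surrounding text (a simplified, Lesk-style approach). It is not a description of the Temnos implementation. The sense inventory `SENSES` and the function `disambiguate` are hypothetical names invented for this illustration.

```python
import re

# Hypothetical mini sense inventory, invented for illustration only.
# A production lexicon would hold far more senses and cue words.
SENSES = {
    "mercury": {
        "planet": {"orbit", "sun", "solar", "planet", "nasa"},
        "element": {"thermometer", "toxic", "metal", "liquid", "chemical"},
        "car": {"sedan", "dealer", "ford", "engine", "model"},
        "god": {"roman", "messenger", "myth", "jupiter", "temple"},
    }
}


def disambiguate(word, text):
    """Return (sense, overlap) for the best-matching sense, or (None, 0)."""
    tokens = set(re.findall(r"[a-z]+", text.lower()))
    best_sense, best_overlap = None, 0
    for sense, cues in SENSES.get(word.lower(), {}).items():
        overlap = len(tokens & cues)  # number of cue words seen in the text
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense, best_overlap


sentence = "Mercury has the shortest orbit around the Sun of any planet."
print(disambiguate("Mercury", sentence))  # ('planet', 3)
```

A keyword match would treat this sentence the same as one about thermometers or sedans; the sense-level result is what lets a system say whether the intended meaning was actually found.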
Rule-based vs. Statistics-based Intelligence
One way of building up content intelligence is to apply statistical measures to document collections. This is quite useful, to a point. Using statistics, we can quickly and easily create clusters of related words, such as “dog-bark-tail-fur-pet-vet-bone.”
However, this technique can be deceiving. Most of us would agree that “physician” is closer in meaning to “doctor” than “nurse” is. Yet when a statistical method determines which words are most frequently found near the word “doctor,” it so happens that “nurse” is found more frequently than “physician.” Therefore, the statistical method incorrectly concludes that “nurse” is more closely related in meaning to “doctor” than “physician” is. Any child would know that this conclusion is not true.
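The pitfall described above can be reproduced with a few lines of code. The sketch below counts sentence-level co-occurrence on a tiny corpus that was invented for this illustration; `CORPUS` and `cooccurrence` are hypothetical names, and real systems use much larger collections and more refined measures.

```python
import re
from collections import Counter

# Toy corpus, invented for illustration only.
CORPUS = [
    "The doctor and the nurse checked the patient.",
    "A nurse called the doctor to the ward.",
    "The doctor thanked the nurse for her help.",
    "A physician is a doctor of medicine.",
]


def cooccurrence(target, sentences):
    """Count how many sentences containing `target` also contain each other word."""
    counts = Counter()
    for sentence in sentences:
        tokens = set(re.findall(r"[a-z]+", sentence.lower()))
        if target in tokens:
            counts.update(tokens - {target})
    return counts


counts = cooccurrence("doctor", CORPUS)
print(counts["nurse"], counts["physician"])  # 3 1
```

On this corpus, “nurse” appears alongside “doctor” three times and “physician” only once, so a purely frequency-based measure ranks the wrong word as the closer relative.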
Another approach, which avoids this error, is to create hand-written rules that explicitly state that “physician” is a synonym of “doctor”, but “nurse” is not. That sounds like a good rule, but we would need thousands (or millions) more such rules to handle a real-world document collection. Such rules require expert editors who must constantly change and improve the rule base to keep in sync with a living language.
For example, what happens when the system of rules that correctly separates nurses from doctors, one day encounters a new term called a “nurse practitioner” (one who can prescribe medications and see patients independently)? Is this concept closer to a regular nurse, or to a regular doctor? Should it be a sub-type of nurse or a new category unto itself? Whatever is decided, it is a challenge to update the rule base in reaction to the ever-expanding world of information.
Clearly, the rule-based and statistical approaches both have inherent strengths and weaknesses. Many organizations offer services based largely on one or the other of these methodologies, but at Temnos, we employ both. A large set of editorially-designated relations and rules is supplemented by a vast statistical framework. It is rather like the left and right sides of the brain working in concert to reach a decision. Statistical approaches give us scalability and rapid learning, while rule-based methods give us transparency and an easy way to implement a manual override on top of the statistical outputs.
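One way to picture a manual override layered on top of statistical outputs is sketched below. This is an assumption-laden illustration, not the Temnos design: the scores in `STAT_SCORES`, the rule sets, the 0.5 cap, and the function `relatedness` are all hypothetical.

```python
# Hypothetical statistical relatedness scores (invented numbers).
STAT_SCORES = {
    ("doctor", "nurse"): 0.82,
    ("doctor", "physician"): 0.61,
}

# Hypothetical editorial rules that take precedence over the statistics.
SYNONYMS = {frozenset({"doctor", "physician"})}
NOT_SYNONYMS = {frozenset({"doctor", "nurse"})}

NON_SYNONYM_CAP = 0.5  # assumed ceiling for pairs editors marked as non-synonyms


def relatedness(a, b):
    """Return (score, reason), letting editorial rules override statistics."""
    pair = frozenset({a, b})
    if pair in SYNONYMS:
        return 1.0, "rule: editorial synonym"
    stat = STAT_SCORES.get((a, b), STAT_SCORES.get((b, a), 0.0))
    if pair in NOT_SYNONYMS:
        return min(stat, NON_SYNONYM_CAP), "rule: capped, not a synonym"
    return stat, "statistics"


print(relatedness("doctor", "physician"))  # (1.0, 'rule: editorial synonym')
print(relatedness("doctor", "nurse"))      # (0.5, 'rule: capped, not a synonym')
```

Returning the reason along with the score keeps the override visible, which is the transparency benefit that rules bring to the combination.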
Top-Down vs. Bottom-Up
One can build a content intelligence framework around a set of high-level concepts and categories into which everything fits (a top-down approach), or one can build up a huge list of keywords from raw text, then try to roll them up into lists, clusters, tag clouds, and the like (a bottom-up approach).
Examples of the top-down approach are the IAB taxonomy, the DMOZ directory, or the sitemap of a large publisher’s web site such as Newsweek. In these cases, editors have decided the categories of content ahead of time and every new item must fit into one or more of those pre-established categories. Otherwise, a new category must be requested or created ad hoc. This approach has the advantages that a neat-and-tidy “category tree” can be used to classify all the content, and the number of categories can be kept down to a manageable number (usually dozens or hundreds of categories).
Google’s keyword search index is an example of the bottom-up approach. Every word of every document is indexed. These keywords, taken en masse, constitute the primary material of the information service. One advantage of this approach is that newly-coined words and phrases (as well as misspelled words) are promptly crawled and indexed. If a new rock-n-roll band opens up a website, calling itself “Frontera” (a made-up word), then that word is immediately entered into the index just like any other keyword. It is immediately searchable and findable.
Many companies exclusively employ a top-down or bottom-up approach to organizing and understanding their content repository. At Temnos, we blend both approaches. The top-down approach gives us an intelligent way to control which large subsets of a corpus are of interest for a given use case (i.e., for a given app, site, user, session, etc.). Thus, we can handle just the Sports, Business, or Small Business content.
We also use the bottom-up method to ensure that we are taking an exhaustive approach to the data set, keeping a finger on the pulse of every new phrase, name, and concept that crops up in our corpora. As a result, our system automatically recognizes newly-formed words and phrases.
So, on the one hand, we build multiple classifiers and taxonomies (top-down), while on the other hand, we detect endemic tags from the distribution of raw text (bottom-up). Tags that predominate within the content are promoted to topics, which are then grouped by common concerns and interests into metatopics, which are then placed under the established categories in our taxonomies. Metatopics are born from the overlap between the top-down and bottom-up methodologies; they inhabit the space where the two approaches meet, at an ontological middle.
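The tag-to-topic-to-metatopic-to-category flow described above can be sketched as follows. The data, the `min_docs` threshold, and the lookup tables `METATOPIC_OF` and `CATEGORY_OF` are hypothetical; in particular, how topics are actually grouped into metatopics is not specified here, so a fixed lookup stands in for that step.

```python
from collections import Counter

# Tags detected per document (bottom-up); invented for illustration.
DOC_TAGS = [
    ["quickbooks", "invoice"],
    ["quickbooks", "schedule c"],
    ["sba loans", "quickbooks"],
    ["schedule c", "sba loans"],
    ["frontera"],
]

# Hypothetical groupings: topic -> metatopic, metatopic -> taxonomy category.
METATOPIC_OF = {
    "quickbooks": "Small Business",
    "schedule c": "Small Business",
    "sba loans": "Small Business",
}
CATEGORY_OF = {"Small Business": "Business"}


def promote(doc_tags, min_docs=2):
    """Promote frequent tags to topics, then roll them up to metatopics and categories."""
    doc_freq = Counter(tag for tags in doc_tags for tag in set(tags))
    topics = sorted(tag for tag, n in doc_freq.items() if n >= min_docs)
    rollup = {}
    for topic in topics:
        metatopic = METATOPIC_OF.get(topic)
        if metatopic is None:
            continue  # topic not yet grouped under any metatopic
        category = CATEGORY_OF.get(metatopic, "Uncategorized")
        rollup.setdefault(category, {}).setdefault(metatopic, []).append(topic)
    return topics, rollup


topics, rollup = promote(DOC_TAGS)
print(topics)  # ['quickbooks', 'sba loans', 'schedule c']
print(rollup)  # {'Business': {'Small Business': ['quickbooks', 'sba loans', 'schedule c']}}
```

Rare tags such as "frontera" remain detected but are not promoted until they predominate in enough documents.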
Bringing Order to Content
Taxonomic Pluralism vs. Absolutism
A content taxonomy is a hierarchical collection of categories into which documents may be placed. Many efforts at categorizing Web content impose a static taxonomy on that content. In some cases, publishers or advertisers are faced with a single taxonomy that has been decided upon by an editorial committee, which seldom shares the same business interests and infrequently updates its taxonomy.
In reality, no single taxonomy is the “right” one and taxonomies need to ceaselessly evolve.
Temnos embraces multiple taxonomies, maps them to each other, and allows you to request your own custom classification schema. Whether you want IAB, DMOZ, MSI, or a custom taxonomy, we can provide the necessary information architecture.
In addition to taxonomic pluralism, we focus a lot on middle-level ontology and for good reason. Middle-level ontology is the most relevant, most of the time. It is where humans prefer to think and talk. Suppose you see a dog knock over a newspaper rack across the street. You might say to the friend next to you, “Hey, look at what that dog is doing!” You are very unlikely to reference a higher-level ontology to say “Hey, look at what that mammal is doing,” nor are you likely to use a low-level ontology to say, “Hey, look at what that Cavalier King Charles Spaniel is doing!” For practical purposes, we naturally default to a level of specification that is neither obtusely abstract, nor painstakingly particular. We find the middle-level ontology to be the most useful.
A pure top-down approach to content intelligence inappropriately privileges an inevitably arbitrary set of higher-level categories. Should sports be first of all divided into team vs. individual sports or indoor vs. outdoor sports? These seem to be the world-making decisions. Meanwhile, a bottom-up approach does the opposite. It dodges those questions to march headlong into a brute-force, self-imposed servitude to mindlessly transcribing every detail of the content in the belief that you never can tell what might matter to the next query. Gone is the attempt to get any overall interpretation of the grand scheme of things.
If you choose between a purely top-down approach and a purely bottom-up approach to content intelligence, you are choosing whether you will miss the trees for the forest or the forest for the trees. Instead, what you really need is a framework that helps you recognize practical, medium-sized patterns in things. To give an example that follows the forest analogy, imagine a grove of redwoods amidst a mostly-oak forest. Seeing this enables you to infer a high likelihood of a natural spring being present — a valuable find! However, looking through the whole forest one tree at a time would fail to show you this. Likewise, looking at the broad outline of the entire forest would miss it, too. You need a medium level of focus to find the most interesting, valuable items within a vast collection.
MetaTopics vs. Parent Topics
One of the strongest and most valuable elements of the Temnos platform is Temnos MetaTopics: topics that connect many other topics together. These are highly valued not only by marketers, but by readers and authors, too. A good way to grasp what a metatopic is, is to consider how it differs from a higher-level, or parent, topic.
A parent topic is one that has sub-topics under it, such that every sub-topic is simply a more-specific form of its parent. For example, Sports is a parent topic of Football, because Football is a specific Sport. A parent topic’s relationship to its child topics can easily be represented by a diagram of the content taxonomy — the tree-like structure that we are all used to seeing.
By contrast, a metatopic jumps all around the taxonomy tree and connects topics that share a single overarching theme or interest. A great example is “Small Business” as a metatopic connecting the topics of “QuickBooks”, “Schedule C”, “SBA Loans”, etc. Note that parent-child relationships do not fit here. QuickBooks is not a kind of Small Business. Rather, it is a software tool used almost exclusively by small businesses.
The advantage of a metatopic is that it can easily embody hundreds of topics, all of which pertain to one particular community or mode of discourse. It is thus of immense interest to both publishers and marketers.
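The difference between a parent topic and a metatopic can be expressed as two separate data structures: a tree of parent links and a cross-cutting membership map. The sketch below is illustrative only. The taxonomy placements in `PARENT` (for example, QuickBooks under "Accounting Software") are assumptions made up for this example, as are the function names.

```python
# Taxonomy tree as child -> parent links (hypothetical placements).
PARENT = {
    "Football": "Sports",
    "QuickBooks": "Accounting Software",
    "Accounting Software": "Software",
    "Schedule C": "Tax Forms",
    "Tax Forms": "Taxes",
    "SBA Loans": "Loans",
    "Loans": "Finance",
}

# Metatopics cut across the tree: metatopic -> member topics.
METATOPICS = {
    "Small Business": {"QuickBooks", "Schedule C", "SBA Loans"},
}


def ancestors(topic):
    """Walk parent links upward; each ancestor is a more general kind of the topic."""
    chain = []
    while topic in PARENT:
        topic = PARENT[topic]
        chain.append(topic)
    return chain


def metatopics_for(topic):
    """Return the metatopics a topic belongs to, regardless of its tree position."""
    return sorted(m for m, members in METATOPICS.items() if topic in members)


print(ancestors("QuickBooks"))       # ['Accounting Software', 'Software']
print(metatopics_for("QuickBooks"))  # ['Small Business']
```

"Small Business" never appears among the ancestors of QuickBooks, yet the membership map still links the two, along with topics that sit in entirely different branches of the tree.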
Aligning Technology to Customer Needs
Supervised vs. Unsupervised
In many cases, our clients want a completely unsupervised system: a totally automated tool. In some cases, however, you want a measure of editorial control; essentially, you want to supervise the process. We offer both supervised and unsupervised services. For supervised processes, we use a beginning-and-end approach. You start by seeding the process with examples of the output that you desire. We then train or tune our system to produce that kind of output. After judging the results, you supply relevance feedback, which is used to re-train or re-tune the system. This optional procedure gives you extra control and visibility when you are customizing our platform for a new style of output or content, or when you simply want to ensure the utmost precision in the results.
Transparent vs. Opaque
Some services are “opaque” in that there is no way to view, in layman’s terms, an explanation of “why” the system gave a particular result. For example, why was an article classified as Sports when I think it should be Health? In an opaque system, someone with a PhD may be required to spend several hours tracing through a few thousand data points to unravel why the system arrived at the answer that it did.
In a transparent system, the reason that the system gave a particular answer would be revealed fairly clearly and easily. Perhaps a rule dictates that “NHL” and “NBA” indicate Sports, and these terms were found in the document. Or it could be something else, such as there being several documents of similar characteristics already classified as Sports, which were found to be the most strongly-comparative examples. Either way, the process is readily understandable, even by a non-technical person such as an editor, author, or agency representative.
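A rule-based classifier of the kind described above can return its evidence together with its answer. The following is a minimal sketch under stated assumptions: the rule table `RULES` (including the Health cue words) and the function `classify` are hypothetical and are not the Temnos rule base.

```python
import re

# Hypothetical rule table: category -> cue terms that indicate it.
RULES = {
    "Sports": {"nhl", "nba"},
    "Health": {"clinic", "vaccine"},
}


def classify(text):
    """Return (category, matched_terms); the matched terms explain the decision."""
    tokens = set(re.findall(r"[a-z]+", text.lower()))
    best_category, evidence = None, []
    for category, cues in RULES.items():
        hits = sorted(tokens & cues)
        if len(hits) > len(evidence):
            best_category, evidence = category, hits
    return best_category, evidence


print(classify("The NBA and NHL seasons overlap in the spring."))
# ('Sports', ['nba', 'nhl'])
```

An editor who disagrees with the result can read the matched terms directly and see which rule fired, without tracing through thousands of data points.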
At Temnos, we adopt transparent methods whenever we can — which is nearly all the time. In some cases, such as with customizable document classifiers, incredible scale can be accomplished with opaque methods that otherwise would not be feasible with transparent methods. However, the majority of our outputs can be traced to lexical entries, encyclopedic entries, reference documents, and readily-visible statistical metrics. We have one of the more “explainable” systems in the AI community.
E Pluribus Unum
Out of many AI techniques comes one robust platform for content intelligence. This is an important aspect of the Temnos platform’s value proposition.
In the world of artificial intelligence, you may come across companies whose offering is based primarily on a single methodology. One company touts its expertise in neural nets, another is all about genetic algorithms, and still another focuses on lexical chains or on algorithms known by acronyms such as LDA, LSI, and SVM. These can be thought of as the various “schools” (some would say, “religions”) of the AI community.
At Temnos, we don’t have one religion. We are proud to say we are “jacks of all trades”. Our focus is not to publish the next research paper proving that a new tweak in a genetic algorithm makes it 0.0001% more accurate on a stock set of test cases. Instead, we are concerned with how genetic algorithms (or any other AI technique) can serve a practical purpose.
We believe that one technology seldom solves a problem by itself or makes a good application all by itself. Rather, the synthesis of several technologies is what produces the best real-world solution. So, we have all the aforementioned AI tools, and more besides, in our toolbox.
There is no single magic bullet to create content intelligence.
Ready to put content intelligence to work for you? Contact us to get started!