16 April 2008
Calais is a web service from Reuters based on the natural language processing (NLP) technology they acquired along with software company ClearForest. Reuters are making powerful entity extraction and semantic algorithms available free through this API. That makes it easy for developers to get structured representations of text documents, which in turn makes a variety of analytical problems easier to solve. We ran a set of 3,200 recent technology job postings (*) through Calais to see how well it can classify the key technologies employers are looking for. The results are below, with some comments on the value of this approach and some observations on where the technology could be improved.
Top 25 “Technologies” occurring in 3,200 tech job postings (**)
April 14, 2008
| Technology | % of Listings |
| Html | 10.61% |
| Java | 8.62% |
| Php | 5.00% |
| Linux | 4.82% |
| Ajax | 4.34% |
| Xml | 4.23% |
| Perl | 3.48% |
| Information Technology | 2.70% |
| Actionscript | 2.51% |
| Telecommuting | 2.20% |
| Unix | 2.13% |
| Jsp | 1.99% |
| Search Engine | 1.75% |
| Apache | 1.56% |
| Web Technologies | 1.39% |
| Operating Systems | 1.37% |
| Dhtml | 1.06% |
| Soa | 0.91% |
| Xsltl | 0.89% |
| Doml | 0.81% |
| Content Management | 0.75% |
| Relational Database | 0.73% |
| Rdbms | 0.72% |
| Content Management Systems | 0.68% |
| Crm | 0.63% |
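For anyone who wants to build a similar tally, here is a minimal sketch of the counting step in Python. It assumes the extraction has already been done and that each posting has been reduced to a list of "Technology" entity names; that input shape and the function name `technology_share` are my own illustrative choices, not part of the Calais API. Each technology is counted at most once per posting, which is how I read "% of Listings".

```python
from collections import Counter


def technology_share(postings_entities, top_n=25):
    """Return the top_n technologies as (name, percent of postings) pairs.

    postings_entities is a hypothetical input: one list of extracted
    technology names per job posting.
    """
    total = len(postings_entities)
    counts = Counter()
    for techs in postings_entities:
        # A set counts each technology once per posting; title-casing
        # merges spellings such as "HTML" and "html" into "Html".
        counts.update({t.strip().title() for t in techs})
    return [(name, 100.0 * n / total) for name, n in counts.most_common(top_n)]


if __name__ == "__main__":
    sample = [["HTML", "Java"], ["html"], ["Java", "PHP"], []]
    for name, pct in technology_share(sample):
        print(f"{name}: {pct:.2f}%")
```

Postings with no recognized technology still count toward the total, so the percentages are shares of all listings rather than of listings with at least one match.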
Value:
Using NLP from a system like Calais for entity extraction and classification gives us advantages over a purely keyword search-based approach. In this context of trying to understand what technologies employers are looking for, it answers questions we didn’t know how to ask effectively. To produce the list above using a keyword-based trend search (e.g. on Indeed’s Job Trends), I would need to manually enter 25 keywords. More problematically, I would need to know in advance which 25 keywords to search for. The entity-extraction approach mines for keywords on our behalf, meaning that we can now also attempt search in the opposite direction, with information more able to “find” people interested in it.
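To make the contrast concrete, here is a rough sketch of the keyword-based alternative. It is my own illustration, not how Indeed's Job Trends works: the function name `keyword_share` and the whole-word matching rule are assumptions. The point is that the keyword list has to be supplied up front, so any technology missing from it never shows up in the results.

```python
import re


def keyword_share(postings, keywords):
    """Percent of postings mentioning each keyword as a whole word.

    postings is a list of raw posting texts; keywords must be
    chosen by hand before any counting can happen.
    """
    total = len(postings)
    shares = {}
    for kw in keywords:
        # Lookarounds stop "java" from matching inside "JavaScript".
        pattern = re.compile(
            r"(?<!\w)" + re.escape(kw) + r"(?!\w)", re.IGNORECASE
        )
        hits = sum(1 for text in postings if pattern.search(text))
        shares[kw] = 100.0 * hits / total if total else 0.0
    return shares


if __name__ == "__main__":
    sample = [
        "We need Java and XML skills",
        "JavaScript developer wanted",
        "Perl, java",
    ]
    print(keyword_share(sample, ["java", "perl"]))
```

In the sample, XML and JavaScript appear in the postings but not in the output, simply because nobody thought to ask for them.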
Concerns:
Important terms can be misclassified, leading to major inaccuracies. For example: Calais doesn’t seem to recognize “python” as a programming language, so it wouldn’t be represented on this list. It seems to be classifying “asp.net”, “ado.net”, etc. as companies rather than as technologies. Many of these examples are easily fixable.
But the general problem is a very serious one. One of the key points of value comes from being able to learn what you didn’t know how to effectively ask (e.g. which programming languages that I’ve never heard of are most popular with employers). If a system like Calais is to solve the problem, we’re counting on it to have the key knowledge we might be missing (e.g. that Ruby is a programming language).
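The easily fixable cases could be patched on our side with a small hand-maintained override table applied after extraction. The sketch below is only an illustration under assumptions: the `(name, entity_type)` pair format, the "Company" label, and the names `TYPE_OVERRIDES` and `apply_overrides` are hypothetical and do not reflect the actual Calais output format.

```python
# Hypothetical hand-maintained corrections, keyed by lowercase entity name.
TYPE_OVERRIDES = {
    "asp.net": "Technology",
    "ado.net": "Technology",
}


def apply_overrides(entities, overrides=TYPE_OVERRIDES):
    """Re-type known misclassified entities.

    entities is an assumed format: (name, entity_type) pairs
    as they might come back from an extractor.
    """
    corrected = []
    for name, entity_type in entities:
        # Fall back to the extractor's own type when no override exists.
        fixed_type = overrides.get(name.strip().lower(), entity_type)
        corrected.append((name, fixed_type))
    return corrected


if __name__ == "__main__":
    extracted = [("ASP.NET", "Company"), ("Reuters", "Company")]
    print(apply_overrides(extracted))
```

This only helps with terms that were extracted under the wrong type. It does nothing for a term that was never extracted at all, and it still depends on us already knowing which terms to correct, which is exactly the general problem described above.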
Conclusion:
Calais is a powerful tool for managing unstructured data. With reasonable amounts of supervision, it can yield some pretty amazing results today in terms of machine understanding of text. And with feedback from a growing number of developers using it in real-world applications, it looks likely to get much better.
(*) 3200 job listings sampled from about 25 internet & software technology-oriented RSS job feeds from SimplyHired, Indeed, Dice, etc. This was not a scientifically rigorous process in any way.
(**) “Technology” is one of the entity types in the Calais Semantic Metadata
2 Responses to 'OpenCalais on Jobs Data'
on April 16th, 2008 at 4:29 pm
Mark:
Tom Tague from Calais here.
What a great experiment. I’m constantly amazed at the new ways people are attempting to get value out of Calais - things we never would have thought of ourselves.
What you’re addressing here is fundamentally interesting and a good proof point for some of the issues we’re all having with search as the primary tool for information discovery. As you point out: search works fine if you know what to search for. It’s a pretty poor discovery tool.
In the upcoming release of Calais we’re rolling out our first experiments in using open data sources (such as Wikipedia and others) to improve recognition of certain types of entities. This first experiment is going to be a small foray into the area of popular culture (entertainment and sports), but it is also serving to help us learn how to leverage these data assets.
In the near future we’ll be able to increase our intelligence by dozens of entity types per month - and I’m certain this will include items such as technologies. We’re also investigating some “near real time” options where entities could be discovered in the morning - and we’d be able to categorize and type them within a few hours.
Thanks again for taking the time to do such a thoughtful experiment.
on April 17th, 2008 at 11:27 am
Nice! I’m curious how Yahoo’s entity extractor stacks up: http://pipes.yahoo.com/pipes/pipes.popular