全部版块 我的主页
论坛 计量经济学与统计论坛 五区 计量经济学与统计软件 LATEX论坛
1342 5
2017-02-22

本帖隐藏的内容

In this article, we discuss a general machine learning technique to make predictions or score transnational data, applicable to very big, streaming data. This hybrid technique combines different algorithms to boost accuracy, outperforming each algorithm taken separately, yet it is simple enough to be reliably automated It is illustrated in the context of predicting the performance of articles published in media outlets or blogs, and has been used by the author to build an AI (artificial intelligence) system to detect articles worth curating, as well as to automatically schedule tweets and other postings in social networks.for maximum impact, with a goal of eventually fully automating digital publishing. This application is broad enough that the methodology can be applied to most NLP (natural language processing) contexts with large amounts of unstructured data. The results obtained in our particular case study are also very interesting.

Figure 1: HDT 1.0. Here we describe HDT 2.0.


The algorithmic framework described here applies to any data set, text or not, with quantitative, non-quantitative (gender, race) or a mix of variables. It consists of several components; we discuss in details those that are new and original, The other, non original components are briefly mentioned, with references provided for further reading. No deep technical expertise and no mathematical knowledge is required to understand the concepts and methodology described here. The methodology, though state-of-the-art, is simple enough that it can even be implemented in Excel, for small data sets (one million observations.).

1. The Problem

Rather than first presenting a general, abstract framework and then showing how it applies to a specific problem (case study), we decided to proceed the other way around, as we believe that it will help the reader understand better our methodology. We then generalize to any kind of data set.

In its simplest form, our particular problem consists of analyzing historical data about articles and blog posts, to identify features (also called metrics or variables) that are good predictors of blog popularity when combined together, to build a system that can predict the popularity of an article before it gets published. The goal is to select the right mix of relevant articles to publish, to increase web traffic, and thus advertising dollars, for a niche digital publisher.

As in any similar problem, the historical data is called training set, and it is split into test data and control data for cross-validation purposes to avoid over-fitting. The features are selected to maximize some measure of predictive power, as described here. All of this is (so far) standard practice; the reader not familiar with this can Google the keywords introduced in this paragraph. In our particular case, we use our domain expertise to identify great features. These features are pretty generic and apply to numerous NLP contexts, so you can re-use them for your own data sets.

Feature Selection and Best Practices

One caveat is that some metrics are very sensitive to distortion. In our case, the response (that is, what we are trying to predict, also called dependent variable by statisticians) is the traffic volume. It can be measured in page views, unique page views, or number of users who read the article. Page views can easily be manipulated and the number is inflated by web robots, especially for articles that have little traffic. So instead, we chose "unique page views", a more robust metric available through Google Analytics. Also, older articles have accumulated more page views over time, so we need to correct for this effect. Correcting for time is explained in this article. Here we used a very simple approach instead: focusing on articles from the most recent, big channel instead (the time window is about two years), and taking the logarithm of unique page views (denoted as pv in the source code in the last section).

Taking the logarithm not only smooths out the effect of time and web robots, but also it makes perfect sense as the page view distribution is highly skewed -- well modeled using a Zipf distribution -- with a few incredibly popular (viral) articles and a large number of articles with average traffic: it is a bit like the income distribution.

As for selecting the features, we have two kinds of metrics that we can choose as predictors:

1. Metrics based on the article title, easy to compute:

  • Keywords found in the title
  • Article category (blog, event, forum question)
  • Channel
  • Creation date
  • Title contains numbers?
  • Title is a question?
  • Title contains special characters?
  • Length of title

2. Metrics based on the article body, more difficult to compute:

  • Size of article
  • Does it contain pictures?
  • Keywords found in body
  • Author (and author popularity)
  • First few words

Despite focusing only on a subset of features associated with the article title, we were able to get very interesting, actionable insights; we only used title keywords, and whether the posting is a blog, or not. You have to keep in mind that the methodology used here takes into account all potential key-value combinations, where a key is a subset of features, and value, the respective values: for instance key = (keyword_1, keyword_2, article category) and value = ("Python", "tutorial", "Blog"). So it is important to appropriately bin the variables when turning them into features, to prevent the number of key-value pairs from exploding. Another mechanism described  further down in this article is also used to keep the key-value database, stored as an hash table or associate array, manageable. Finally, it can easily be implemented in a distributed environment (Hadoop.)  

Due to the analogy with decision trees, a key-value is also called a node, and plays the same role as a node in a decision tree.

2. Methodology and Solution

As we have seen in the previous section, the problem consists of predicting pv, the logarithm of unique page views for an article (over some time period), as a function of keywords found in the title, and whether the article in question is a blog or not.

In order to do so, we created lists of all one-token and two-token keywords found in all the titles, as well as blog status, after cleaning the titles and eliminating some stop word such as "that", "and" or "the", that don't have impacts on the predictions. We were also careful about not eliminating all keywords made up of one or two letters: the one-letter keyword "R", corresponding to the programming language R, has a high predictive power.

For each element in our lists, we recorded the frequency and traffic popularity. More precisely, for each key-value pair, we recorder the number of articles (titles, actually) that are associated with it, as well as the average, minimum and maximum pv across these articles.

Example

For instance, the element or key-value (keyword_1 = "R", keyword_2 = "Python", article = "Blog") is associated with 6 articles, and has the following statistics: avg pv = 8.52, min pv = 7.41, and max pv = 10.45.

Since the average pv across all articles is equal to 6.83, this specific key-value pair (also called node) generates exp(8.52 - 6.83) = 5.42 times more traffic than an average article. It is thus a great node. Even the worst article, among the 6 articles belonging to this node, with a pv of 7.41, outperforms the average article across all nodes. So not only this is a great node, but also a stable one. Some nodes have a far larger volatility, for instance when one of the keywords has different meanings, such as the word "training", in "training deep learning" (training set) versus "deep learning training" (courses.)

Hidden decision trees (HDT) revisited

Note that here, the nodes are overlapping, allowing considerable flexibility. In particular, nodes with two keywords are sub-nodes of nodes with one keyword. A previous version of this technique, described here, did not consider overlapping nodes. Also, with highly granular features, the number of nodes explodes exponentially. A solution to this problem consists of

  • shuffling the observations
  • working with nodes built on no more than 4 or 5 features
  • proper binning
  • visiting the observations sequentially (after the shuffle) and every one million observations, deleting nodes that contain only one observation

The general idea behind this technique is to group articles into buckets that are large enough to provide predictions that are sound, without explicitly building decision trees. Not only the nodes are simple and easy to interpret, but unstable nodes are easy to detect and discard. There is no splitting/pruning involved as with classical decision trees, making this methodology simple and robust, and thus fit for artificial intelligence (automated processing.)

General framework

Whether you are dealing with predicting the popularity of an article, or the risk for a client to default on a loan, the basic methodology is identical. It involves training sets, cross-validation, feature selection, binning, and populating hash tables of key-value pairs (referred to here as the nodes).

When you process a new observation, you check which node(s) it belongs to. If the best node it belongs to is stable and not too small, you use it to predict the future performance or value of your observation, or to score the transaction if you are dealing with transactional data such as credit card transactions. In our example, if the performance metric (the average pv in the node in question) is significantly above the global average, and other constraints are met (the node is not too small, and the minimum pv in the node in question not too low to guarantee stability), then we classify the observation as good, just like the node it belongs to. In our case, the observation is a potential article.

Also, you need to update your training set and the node table (including automatically discovered new nodes) every six months or so.




二维码

扫码加我 拉你入群

请注明:姓名-公司-职位

以便审核进群资格,未注明则拒绝

全部回复
2017-2-22 21:22:11

Parameters must be calibrated to guarantee that

  • error rate (classifying a good observation as bad or the other way around) is small enough; it is measured using a confusion matrix
  • the system is robust: we have a reasonable number of stable nodes that are big enough; it is great if less than 3,000 stable, not too small nodes cover 80% of the observations (by stable, we mean nodes with low variance) with an average of at least 10 observations per node
  • the binning and feature selection mechanism offer real predictive power: the average response (our pv) measured in a node classified as good, is much above the general average, and the other way around for nodes classified as bad; in addition, the response shows little volatility within each node (in our case, pv is relatively stable across all observations within a same usable node)
  • we have enough usable nodes (that is, after excluding the small ones) to cover at least 50% of all observations, and if possible up to 95% of all observations (100% would be ideal but never exists in practice)

We discuss the parameters of our technique, and how to fine-tune them, in the next section. Fine-tuning can be automated or made more robust by testing (say) 2,000 sets of parameters and identify regions of stability that meet our criteria (in terms of error rate and so on) in the parameter space.   

A big question is what to do with observations not belonging to any usable node: they can not be classified. In our example it does not matter if 30% of the observations can not be classified, but in many applications, it does matter. One way to address this issue is to use super-nodes: in our case, a node for all posts that are blogs, and another one for all posts that are not blogs (these two nodes cover 100% of observations, both past and future.) The problem is that usually, these super-nodes don't have much predictive power. A better solution consists of using two algorithms: the one described here based on usable nodes (let's call it algorithm A) and another one called algorithm B, that classifies all observations. Observations that can't be classified or scored with algorithm A are classified/scored with algorithm B. You can read the details about how to blend the results of two algorithms, in one of my patents.  In practice, we have used Jackknife regression for algorithm B, a technique easy to implement, easy to understand, leading to simple interpretations, and very robust. These feature are important for systems that are designed to run automatically.

The resulting hybrid algorithm is called Hidden Decision Trees - hidden because you don't even realize that you have created a bunch of mini decision trees: it was all implicit. The version described here is version 2, with new features to prevent the node table from exploding, and allowing nodes to overlap, making it more suitable for data sets with a larger number of variables.


二维码

扫码加我 拉你入群

请注明:姓名-公司-职位

以便审核进群资格,未注明则拒绝

2017-2-23 06:13:35
Thanks a lot
二维码

扫码加我 拉你入群

请注明:姓名-公司-职位

以便审核进群资格,未注明则拒绝

2017-2-24 10:13:28
encounter,遭遇
二维码

扫码加我 拉你入群

请注明:姓名-公司-职位

以便审核进群资格,未注明则拒绝

2017-2-25 22:08:09
二维码

扫码加我 拉你入群

请注明:姓名-公司-职位

以便审核进群资格,未注明则拒绝

2017-2-25 22:28:56
...................
二维码

扫码加我 拉你入群

请注明:姓名-公司-职位

以便审核进群资格,未注明则拒绝

相关推荐
栏目导航
热门文章
推荐文章

说点什么

分享

扫码加好友,拉您进群
各岗位、行业、专业交流群