combing1のブログ -21ページ目

combing1のブログ

ブログの説明を入力します。

Spark 1.2 MLlib introduced in the stochastic gradient enhance Air Max 2011 Womens Purple Black Grey forest and tree (GBTs). These two machine learning methods suitable for classification and regression, and is the largest and most successful 597806-400 Nike LeBron X EXT QS Denim-Pink Outlet applications have algorithms in machine learning algorithms. Random Forest and GBTs are integrated learning algorithm, which through integrated decision tree to tree realize the strong classifier. This blog, we will illustrate these models and they are distributed in MLlib Realization. We also give some simple examples and points so that you know how to get started. Integrated learning method is simple, the integrated learning approach is a machine learning algorithm based on other machine learning algorithms, and put them together effective combination. Algorithm combinations resulting algorithm 579756 403 Purple White Nike Black Mamba 24 Kobe Sale model as compared to any of them more powerful and accurate. In MLlib 1.2, we used as the basis for the decision tree model. We offer two integrated algorithms: random forests and gradient upgrade tree (GBTs). The main difference between the two is that each tree training sequence. Random forest through random sampling of data to individual training every tree. This randomness makes the model more robust with respect to a single tree, and difficult to produce in the training set overfitting. GBTs the train only once a tree, behind every tree a new tree in front of the gradual correction of errors generated by the decision tree. With the addition of the tree, the stronger expression of force model. Finally, both methods generate the right to a decision tree heavy collection. The integration model through the combined results Meike independent tree to predict. The following figure shows a tree by the three integrated simple example. In return Air Max Nike Free Run 3 Women 2012 Black Navy Blue White set in the above example, each tree has predicted a real value. These predicted values ​​are combined to produce the final integrated forecast results. Here, we get the final result by taking the average prediction method (of course, different tasks require prediction algorithms used in different combinations). Distributed learning algorithm integrated learning in MLlib randomly forests and GBTs data are by instances (rows) stored. To achieve the original decision tree algorithm code base, each of them a decision tree using a distributed learning (mentioned in earlier blog). Many of our algorithms are optimized with reference to Google's PLANET project, in particular an article about the integrated environment of distributed learning. Random Forests: Random forests are 585388-083 Anti-Nerf Nike KD V Elite Outlet individual training every tree, more trees parallel training (in addition, individual training of each tree can be parallelized). MLlib is indeed doing so: according to the current iteration memory limits dynamically adjusts the number of parallel training subtree. GBTs: Because GBTs training only once a tree, so the parallel granularity training only to a single tree. We emphasize here what MLlib used in two important optimization techniques Memory: Random Forests uses a different sample data training every tree. We use data TreePoint this data structure to store each sub-sample, instead of direct copy of each data sub-sampling method, thus saving memory. Communications: Despite frequent tree by selecting all of the features in the tree for each decision point for training, but random forests are often limit the selection at each node a random subset. MLlib implementation take full advantage of the characteristics of this sub-sampling to reduce the communication: for example, if the value of each node uses 1/3 of features, then we will reduce traffic 1/3. See detailed section section MLlib integrated programming guide. Use MLlib integrated learning we will demonstrate how to use MLlib learning integrated model. Scala The following examples illustrate how to read the data set, the data set is divided into training and test sets, learning a New Nike Free Run 3 Shoes Silver 3 model, and print out the model and test accuracy. Examples of Java and Pyton please refer MLlib Programming Guide. Note that GBTs No Python interface, but we expect Spark1.3 release will contain. (Via Github PR 3951) Random Forest examples import org.apache.spark.mllib.tree.RandomForestimport org.apache.spark.mllib.tree.configuration.Strategyimport org.apache.spark.mllib.util.MLUtils // Load and parse the data file.val data = MLUtils.loadLibSVMFile (sc, \u0026 quot; data / mllib / sample_libsvm_data.txt \u0026 quot;) // Split data into training / test setsval splits = data.randomSplit (Array (0.7, 0.3)) val (trainingData, testData) = (splits (0), splits (1)) // Train a RandomForest model.val treeStrategy = Strategy.defaultStrategy (\u0026 quot; Classification \u0026 quot;) val numTrees = 3 // Use more in practice.val featureSubsetStrategy = \u0026 quot; auto \u0026 quot ; // Let the algorithm choose.val model = RandomForest.trainClassifier (trainingData, treeStrategy, numTrees, featureSubsetStrategy, seed = 12345) // Evaluate model on test instances and compute test errorval testErr = testData.map Nike Air Max {point = \u0026 gt; val prediction = model.predict (point.features) if (point.label == prediction) 1.0 else 0.0} .mean () println (\u0026 quot; Test Error = \u0026 quot; + testErr) println (\u0026 quot; Learned Air Jordan Outlet Random Forest: n \u0026 quot; + model .toDebugString) GBTs example import org.apache.spark.mllib.tree.GradientBoostedTreesimport org.apache.spark.mllib.tree.configuration.BoostingStrategyimport org.apache.spark.mllib.util.MLUtils // Load and parse the data file. val data = MLUtils.loadLibSVMFile (sc, \u0026 quot; data / mllib / sample_libsvm_data.txt \u0026 quot;) // Split data into training / test setsval splits = data.randomSplit (Array (0.7, 0.3)) val (trainingData, testData) = ( splits (0), splits (1)) // Train a GradientBoostedTrees model.val boostingStrategy = BoostingStrategy.defaultParams (\u0026 quot; Classification \u0026 quot;) boostingStrategy.numIterations = 3 // Note: Use more in practiceval model = GradientBoostedTrees.train (trainingData, boostingStrategy) // Evaluate model on test instances and compute test errorval testErr = testData.map {point = \u0026 gt; val prediction = model.predict (point.features) if (point.label == prediction) 1.0 else 0.0} .mean ( ) println (\u0026 quot; Test Error = \u0026 quot; + testErr) println (\u0026 quot; Learned GBT model: n \u0026 quot; + model.toDebugString) Scalability Scalability Mens Nike Free Run 3 Shoes Black 3 binary classification problem by empirical results, we proved MLlib scalability. The following chart the sheets were the characteristics of GBTs and Random Forests are compared, in which each tree has a different maximum depth. These tests are a return to the task, that is, from the audio feature songs from the predicted release date (YearPredictionMSD data sets from UCI ML repository). Air Max 2011 Womens Red We use the EC2 r3.2xlarge machine, unless otherwise specified parameters Air Jordan Heel of the algorithm uses default values. Telescopic model size: training time and testing error following two diagrams show the effects of increasing the number of trees on the integrated effect. Nike Free Run 3 For both GBTs and random forests, increasing the number of trees will increase the training time (the first figure), while also increasing the number of trees to improve the prediction accuracy (average test mean square error of measure , shown in Figure 2). When comparing the two, random forests training time is shorter, but to achieve the same prediction accuracy and GBTs deeper tree is required. GBTs is able to significantly reduce the error at each iteration, but after too many iterations, it is too easy to over-fitting (an increase of measurement error). Random Forest is not easy over-fitting, test error also stabilized. The following is the mean square error with a single tree tree depth (depth respectively 2,5,10) change curve. Description: 463,715 training instances 16 nodes. Telescopic training set: training time and testing error following two charts show the effects of using different training set of algorithms produce results. Chart shows that although the larger data set, the two methods of training time longer, but was able to produce better test results. Further stretch: more nodes, faster training speed last chart shows the use of larger computer clusters to solve the above problem results, the conclusion is GBTs and random forest on a large cluster speed has been significantly improved. For example, he said single tree depth GBTs 2 in the 16 nodes on training speed is about 4.7 times two nodes. The larger the data set to enhance the effect of the more obvious. Looking GBTs will soon provide the Python API. Another issue is the future development can be inserted: the integrated approach not only can integrate decision tree, which can be integrated Nike Dunk Heels almost all of the classification and regression algorithms. In Spark 1.2, the in Pipelines API experiments spark.ml package introduced will allow universal integration methods, and truly can be inserted. Learn more about API and relevant examples see MLlib integrated learning document. To learn more background knowledge to build decision tree collections, see the previous blog. Acknowledgements MLlib integration algorithm of cooperation by the blog's authors developed, they are Qiping Li (Alibaba), Sung Chung (Alpine Data Labs), and Davies Liu (Databricks). We also thank Lee Yang, Andrew Feng, and Hirakendu Das (Yahoo ), they help design and test. We also welcome you to 2015 Nike Free 5.0 contribute to a force!how random forests and gradient upgrade tree (GBTs) in MLlib in?