combing1's Blog - Page 8

Hadoop provides an open source implementation of the MapReduce computing framework, and companies such as Yahoo!, Facebook, Taobao, China Mobile, Baidu, and Tencent use it for massive data processing. The performance of a Hadoop system depends not only on the task scheduler's allocation policy but also on how efficiently each task runs once it has been assigned. Task execution typically involves specific stages such as reading, sorting, merging, compression, and writing. The HCE computing framework is an open source project that aims to optimize every stage of task execution and so improve the efficiency of the whole Hadoop system. Compared with the Hadoop Java framework, MapReduce tasks based on the HCE framework can save, at best, more than 30 percent of CPU resources.
The HCE computing framework in the Hadoop ecosystem
Figure 1 shows where the HCE framework sits in the Hadoop ecosystem. In an OLTP system, the Web front end generates a request on behalf of the user; middleware processes the request, the resulting data is written to a database or KV storage system, and a log is produced. The data and logs generated by OLTP systems serve as input for analysis systems such as search engines and advertising systems, and the daily log volume easily exceeds a terabyte. Logs and business data are generally stored in a mass storage system, either the HDFS file system or a KV storage system, and a MapReduce distributed computing framework generally runs on top of that storage. Thousands of MapReduce jobs run every day to process this data, and their results go to one of three destinations: they are stored in the mass storage system for later use; imported into a database to generate reports or for analysis; or imported into the online store as input to the OLTP system.
MapReduce jobs are generally submitted by internal users in one of three ways: through the native Hadoop client, through the Pig / DISQL languages, or through the Hive data warehouse client. Job results can be queried through a SQL client.
The problem
More and more companies are using Hadoop and its surrounding systems for massive data analysis. The Hadoop clusters at Yahoo! and Facebook have each grown beyond 10,000 nodes, and the growth in node count shows no sign of slowing; domestic companies such as Tencent and Baidu face the same problem. The resource shortage caused by growing business needs and business data is the main reason clusters keep expanding, and CPU is one of the scarcest resources. (Note: one solution to a shortage of storage resources is to enable heavyweight compression, as Facebook does, which in turn consumes CPU resources.) To control costs, optimizing the utilization of cluster resources is imperative. At the distributed computing level there are two ways to optimize resource usage. The first is to maximize the use of global resources through fine-grained resource scheduling, which usually involves sound scheduling algorithms and lightweight resource isolation. The second is to optimize computing tasks and user programs so that jobs make better use of the existing computing resources. The HCE computing framework focuses on the latter.
Analysis of the additional computing overhead of the Hadoop MapReduce framework
A cross-platform, highly extensible computing framework with general-purpose interfaces also brings additional computing overhead; every implementation involves trade-offs. The MapReduce framework is implemented in Java to guarantee cross-platform compatibility. However, domestic Internet companies generally run on Intel x86 platforms, so the compatibility advantage is hard to realize, and the MapReduce framework can instead be implemented in a language that performs better than Java but is not cross-platform.
To maximize extensibility, Hadoop wraps its data processing flow in multiple layers of stream abstractions, which costs the Java framework some performance; when processing large volumes of data, a more direct path can improve processing efficiency. Most feature development engineers at domestic Internet companies work with text in C++ or scripting languages, and the Java interfaces are too cumbersome for them. Hadoop Streaming provides a programming interface that lets a user program exchange data with the computing framework through a medium, namely standard input and output, which removes the language restriction; many user tasks are therefore launched through the Streaming interface. The advantage of the Streaming interface is its support for multi-language development, but the added generality costs performance, namely the overhead of copying data through pipes and of splitting keys (approximately 2% to 5%); a native programming interface in the user's own language is a better fit by comparison.
User programs are beyond the control of the computing framework
Figure 2: Schematic of MapReduce task execution
Besides the cost of the framework itself, the resource consumption of a computing task also includes the user program. As shown in Figure 2, the Hadoop Streaming and Pipes frameworks let C++ users develop MapReduce applications: the framework starts the user's executable, and the framework and the user program run in two separate processes, each consuming its own resources.
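
The Streaming contract described above (records in on standard input, tab-separated key/value records out on standard output) can be sketched in Python. This is a minimal word-count illustration, not code from the article: the function names `map_lines` and `reduce_lines` are hypothetical, and the framework's sort between the two stages is simulated with `sorted`.

```python
from itertools import groupby


def map_lines(lines):
    """Mapper: emit one 'word<TAB>1' text record per word in the input lines."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reduce_lines(sorted_lines):
    """Reducer: sum the counts per key; the input must already be sorted by key."""
    # Every text record has to be split back into key and value. This is the
    # kind of per-record "key split" work the article attributes to Streaming.
    records = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for key, group in groupby(records, key=lambda kv: kv[0]):
        yield f"{key}\t{sum(int(value) for _, value in group)}"


def simulate_job(lines):
    """Stand-in for the framework: map, sort by key, then reduce."""
    return list(reduce_lines(sorted(map_lines(lines))))
```

In a real Streaming job each function would sit in its own script that iterates over `sys.stdin` and prints its records, so every record crosses a pipe as text in both directions; that copying is the overhead the article refers to.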
For simple analysis, the user program does not take much CPU and accounts for a small share of the task's total execution time; in this case, optimizing the computing framework brings substantial benefits. For complex analysis programs, however, the user program takes far longer than the computing framework, and optimizing the framework may bring only minimal benefit. Saving cluster CPU resources therefore cannot be separated from optimizing user programs. Optimizing users' jobs can be divided into two levels. The higher level is to have users submit jobs through a unified entry point, for example the Hive data warehouse, where users do not touch the Hadoop MapReduce API and Hive optimizes internally in a unified way, including static or dynamic adjustment of job parameters so that tasks run efficiently with minimal resources. The second level is to optimize jobs that operate the MapReduce API directly; Hive itself, of course, also falls into this category. If the user's job or the data warehouse is implemented in C++ and compiled by the user, and the platform only invokes an executable through the user interface, then optimizing the user program becomes more difficult.
There are dynamic and static optimization methods. The dynamic approach (note: for details, see Starfish: A Self-tuning System for Big Data Analytics (CIDR'11)) adds a layer on top of MapReduce and adjusts job parameters dynamically using profiler and sampler techniques. The static approach has users depend on header files and libraries provided by the framework at compile time and uses compiler optimization techniques to improve the performance of the user program. The HCE computing framework addresses static optimization of the framework and the user program in order to improve the CPU utilization of computing tasks.
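
The argument that framework optimization pays off for light jobs but not for heavy ones is simple proportional arithmetic. The sketch below makes it concrete; the function name and the sample shares are assumptions chosen for illustration, not measurements from HCE.

```python
def overall_cpu_saving(framework_share, framework_saving):
    """Fraction of a task's total CPU saved when only the framework is optimized.

    framework_share:  fraction of the task's CPU time spent in framework code
    framework_saving: fraction of that framework time the optimization removes
    """
    for name, value in (("framework_share", framework_share),
                        ("framework_saving", framework_saving)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1, got {value}")
    # The user program's share (1 - framework_share) is untouched, so the
    # saving scales linearly with the framework's share of the task.
    return framework_share * framework_saving


# Illustrative numbers only: the same 40% framework improvement applied to a
# light job (framework dominates) and a heavy job (user program dominates).
light_job = overall_cpu_saving(framework_share=0.8, framework_saving=0.4)
heavy_job = overall_cpu_saving(framework_share=0.1, framework_saving=0.4)
```

With these assumed shares the light job saves roughly a third of its CPU while the heavy job saves only a few percent, which is why the article insists that the user program has to be optimized as well.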
As shown in Figure 2, the framework and the user program are integrated into a single process for optimization: the framework and the user program can be compiled with the same toolchain, and key-value pairs are processed in the same process space without having to pass through a medium (pipes or sockets). The user programming interfaces provided by HCE and Hadoop are compared in Table 1.
Table 1: Comparison of the user interfaces of each computing framework
Efficient C++ implementation of the MapReduce framework
The HCE framework implements the MapReduce data processing logic in C++. Because C++ performs better than Java, data processing achieves better CPU utilization, and native libraries can be called directly rather than through JNI. (Note: the compression libraries are native implementations; Hadoop calls the compression methods through JNI, whereas in HCE compression runs in the same process space.) In addition, efficient compiler optimization, for example with the ICC compiler, can further exploit the framework's performance advantage. The HCE framework implements MapReduce data processing in a streamlined way; compared with Java's multi-level stream wrapping, the HCE processing flow is more efficient. The HCE framework provides multi-language interfaces such as C++ and Python, which makes programming convenient for users and also saves the overhead of the Streaming interface. HCE also provides an interface fully compatible with the original Java Streaming, so existing jobs can be migrated to the HCE framework seamlessly.
Static compile-time optimization of user programs
HCE optimizes user programs statically; dynamic optimization is left to the data warehouse layer above.
For user programs with a heavy CPU load, HCE provides a C++ programming interface. The user program must be compiled locally against the framework's header files and libraries; because the header files contain built-in optimized code such as SSE, the user program is optimized at compile time. This simple approach can improve the execution efficiency of user programs substantially.
The HCE framework implements the functional components that the Hadoop framework supports: for example, RecordReader and RecordWriter for the Text or SequenceFile formats are supported in the C++ space, as are four compression formats, Gzip, Lzo, QuickLz, and Lzma. Because input files are split in the Hadoop client, the Split method is still implemented in the Java space. User-defined Mappers and Reducers, of course, must be implemented in the C++ space; for Hive to run on the HCE framework, for example, it must implement C++ versions of its Mapper, Reducer, and other functional components.
Figure 3: HCE framework data processing flow
Figure 3 shows the data processing flow of the HCE framework. The HCE framework efficiently implements a number of extensible functional modules in the C++ space, such as RecordReader, OutputCollector, Shuffle, ReduceInputReader, RecordWriter, Committer, Partitioner, Mapper, Reducer, and Combiner, and its processing logic is more compact and efficient than that of Hadoop MapReduce. In the Java space, Hadoop's MapRunner and ReduceRunner only collect status information. The performance gain of the HCE framework is concentrated in the Map stage, at roughly 40% or more. For typical MapReduce programs, the Map stage also consumes the most resources compared with the Shuffle and Reduce stages, because a job's final output is generally only about 10% of its input and much of the data processing is done in the Map stage.
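
To make the map-side flow concrete, here is a minimal single-process sketch of how a record reader, mapper, partitioner, and combiner can be chained without any pipe between them. The function names echo the component names listed above, but this is not HCE's API: the word-count logic, the CRC32-based partitioning, and all signatures are assumptions made for illustration.

```python
import zlib
from collections import defaultdict


def record_reader(text):
    """Yield (byte offset, line) records from a text split."""
    offset = 0
    for line in text.splitlines(keepends=True):
        yield offset, line.rstrip("\r\n")
        offset += len(line.encode("utf-8"))


def mapper(_offset, line):
    """Emit (word, 1) for every word in the line."""
    for word in line.split():
        yield word, 1


def partitioner(key, num_reducers):
    """Choose a reducer with a stable hash, so a key always lands in one partition."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def run_map_task(text, num_reducers):
    """Run reader -> mapper -> partitioner -> combiner inside one process."""
    partitions = [defaultdict(int) for _ in range(num_reducers)]
    for offset, line in record_reader(text):
        for key, value in mapper(offset, line):
            # Combiner step: pre-aggregate per key before anything leaves the
            # task, which is how map output can be much smaller than the input.
            partitions[partitioner(key, num_reducers)][key] += value
    return [dict(partition) for partition in partitions]
```

Calling `run_map_task("a b a\nb c\n", 2)` returns one dictionary per reducer. Every key-value pair is handed from stage to stage as an ordinary function call, which is the in-process behavior the article contrasts with Streaming's pipes.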
The basic HCE framework alone is not enough, because a large number of jobs run through the Streaming interface and, besides the C++ development interface, script developers also want development interfaces in their own languages. Fortunately, the scripting languages are all implemented in C, so a simple interpreter can be built in to translate the scripting language into C; execution still happens in the HCE framework, and the interpretation overhead is small. Of course, the overhead of Streaming itself is unavoidable, but Streaming jobs running on the HCE framework can use the framework's performance advantage to improve CPU utilization, which is still a considerable gain for lightweight jobs. Figure 4 shows the data processing paths of four frameworks: Java Streaming, HCE, Streaming Over HCE, and Python Over HCE. In Java Streaming, data processing is still done in the Java space, whereas in HCE, Streaming Over HCE, and Python Over HCE it is completed in the C++ space, and the Child JVM process only collects the status information of the HCE task.
Figure 4: Schematic of Streaming Over HCE and Python Over HCE
In the future, the MapReduce computing framework need not rely solely on the HDFS storage system; it can also be built on other storage systems, such as the KV system Hypertable. Many block storage and KV systems are currently implemented in C++. To run native Hadoop MapReduce on them, one must go through a language conversion interface on the storage system side (for example, Hypertable's Thrift) or on the computing system side (for example, Hadoop's AvroRPC), and the problem is that this inevitably brings data serialization and deserialization overhead.
Therefore, with the HCE framework, storage systems implemented in non-Java languages can support Hadoop MapReduce computation more efficiently; they do, of course, need to implement the corresponding Split, RecordReader, RecordWriter, Committer, and other components.
Summary
The HCE framework is a derivative of the Hadoop MapReduce framework. Relying on the HCE framework's efficient native processing, Hadoop jobs can save up to 30% of CPU resources. In addition, HCE provides C++, Python, and other programming interfaces while keeping the existing interfaces backward compatible; a variety of compiler optimization techniques can be easily applied to the MapReduce framework; and finally, HCE optimizes the execution of user programs through compiler optimization, a built-in interpreter, and similar means.
HCE: A MapReduce framework for improving resource utilization