Distributed Key-Value Store for Large-Scale Systems

MAX META SIZE. Fig 5.2 demonstrates the metadata stored in non-leaf nodes. TSetMetaPath is a structure in the Metadata Service that describes a path from the root metadata node to a leaf node. It is used both in the algorithms that query and manipulate metadata and in the big-set algorithms. It consists of split information and a list of child items called TSetMetaPathItem, as described in Listing 5.3.

Listing 5.3: MetaPath description

struct TNeedSplitInfo {
  1: TMetaKey metaID,
  2: TMetaKey parentID,
  3: i32 childCount,
  4: bool isSmallSet = false
}

struct TSetMetaPathItem {
  1: TItemKey minItem,
  2: i64 metaID,
  3: byte level,
}

struct TSetMetaPath {
  1: list<TSetMetaPathItem> metaPath,
  2: optional TSetMetaPathItem smallSetInfo,
  3: optional TNeedSplitInfo splitInfo
}

In the TSetMetaPath structure, the elements of the metaPath list have decreasing levels: the i-th element is the parent of the (i+1)-th element, and the final element of metaPath is the parent of smallSetInfo.

5.4.4.2 Metadata Service Algorithms

The Metadata Service contains several important algorithms that form parts of the big-set operations. The Traversal algorithm locates the path from the root node to the small set associated with an item, for adding, modifying or reading it. The Splitting algorithms split a full metadata node when it reaches the maximum number of children. These algorithms are implemented in the Service Model layer of the framework described in Subsection 1.3.

Traversal. To insert an item into a big set, we have to find the small set that will store that item. The Traversal operation is described in Algorithm 5. It constructs a path from the root node to the leaf node (small set) that may contain the searched itemKey. At each metadata node, the algorithm decides which child node to visit next by binary search on the minItem field of the TMetaItem objects among the children of the current TSetMeta. The node with the largest number of children seen so far is also recorded in the splitInfo of the metadata path, and the information of the found small set is written into smallSetInfo of TSetMetaPath. The Big Set is organized as a distributed B+Tree, and the number of nodes this algorithm visits is proportional to the height of that B+Tree, so its complexity is O(log(N)), where N is the number of items in the Big Set.

Algorithm 5 Metadata Traversal
1: Input: rootMetaID, itemKey
2: Output: metaPath
3: metaID ← rootMetaID
4: while TRUE do
5:   meta ← metaOf(metaID)
6:   if meta.children is empty then return EmptyPath
7:   end if
8:   currentCount ← childrenCount(meta)
9:   oldCount ← metaPath.splitInfo.childCount
10:  if currentCount > oldCount then
11:    updateSplitInfo(meta, metaPath.splitInfo)
12:  end if
13:  TMetaItem mtItem ← meta.findByItem(itemKey)
14:  if meta.level = 0 then
15:    metaPath.smallSetInfo.metaID ← mtItem.metaID
16:    updateSmallSetInfo(metaPath)
17:    break
18:  else
19:    TSetMetaPathItem pathItem
20:    pathItem ← mtItem
21:    pathItem.level ← meta.level - 1
22:    metaPath.push_back(pathItem)
23:    metaID ← mtItem.metaID
24:  end if
25: end while
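To make the traversal concrete, the following Python sketch mirrors Algorithm 5 under simplifying assumptions: metadata nodes are fetched by a user-supplied meta_of function, children are kept sorted by min_item, and the class and helper names are illustrative rather than the actual Thrift API of the implementation.

import bisect
from dataclasses import dataclass

@dataclass
class MetaItem:
    min_item: str      # smallest item key covered by this child
    meta_id: int       # key of the child node (or small set) in the key-value store
    count: int = 0     # number of items stored under this child

@dataclass
class MetaNode:
    level: int         # 0 means the children point directly to small sets
    children: list     # list of MetaItem, sorted by min_item

def traversal(root_meta_id, item_key, meta_of):
    """Builds the path from the root metadata node to the small set that
    may contain item_key, and remembers the fullest node seen (split info)."""
    path, split_info, small_set_id = [], None, None
    meta_id = root_meta_id
    while True:
        meta = meta_of(meta_id)                 # one key-value read per level
        if not meta.children:
            return None                         # empty path
        if split_info is None or len(meta.children) > split_info[1]:
            split_info = (meta_id, len(meta.children))
        # binary search on min_item: last child whose min_item <= item_key
        keys = [c.min_item for c in meta.children]
        idx = max(bisect.bisect_right(keys, item_key) - 1, 0)
        child = meta.children[idx]
        if meta.level == 0:
            small_set_id = child.meta_id        # leaf level: this child is the small set
            break
        path.append((child.meta_id, meta.level - 1))
        meta_id = child.meta_id
    return path, small_set_id, split_info

The descent performs one metadata read per level, which is where the O(log(N)) cost of the operation comes from.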
Split Root Metadata node. When the root metadata node reaches the maximum allowed size, it is split into two smaller nodes: two new child metadata nodes are created, the original root node keeps only two meta items (one per new child), and its level is increased by 1. The detailed steps are described in Algorithm 6, and Fig 5.3 demonstrates an example of splitting the root metadata node with KMAX_META_SIZE = 3.

Algorithm 6 Split Root Metadata Node and Level Up
Input: rootMeta.children = [mt_0, mt_1, mt_2, mt_3, ..., mt_n], rootMetaID, firstChildID, secondChildID; each mt_i is a MetaItem structure
Output: splitResult
1. Check size:
   if n < 4 then splitResult ← TooSmall; return end if
2. Construct the two new child metadata nodes:
   fcmt ⇐ [mt_0, mt_1, ..., mt_{n/2}]
   scmt ⇐ [mt_{n/2+1}, mt_{n/2+2}, ..., mt_n]
3. Increase the level of the root metadata node:
   rootMeta.level ⇐ rootMeta.level + 1
4. Construct a meta item for each new child:
   minItem1 ← minItem(fcmt); count1 ← itemCount(fcmt)
   metaItem1 ⇐ {firstChildID, minItem1, count1}
   minItem2 ← minItem(scmt); count2 ← itemCount(scmt)
   metaItem2 ⇐ {secondChildID, minItem2, count2}
5. Construct the new children of the root metadata node:
   rootMeta.children ⇐ [metaItem1, metaItem2]
6. Save the metadata nodes to the key-value store:
   saveData(rootMetaID, rootMeta)
   saveData(firstChildID, fcmt)
   saveData(secondChildID, scmt)
   splitResult ← OK
   return splitResult

Fig. 5.3: Split root node and level up

Split middle meta-data node. This algorithm is used when the number of children of a non-root metadata node exceeds the maximum size, caused by the splitting of a lower-level node or of a small set. The algorithm creates a new metadata node at the same level (a new brother node), moves the second half of the children into the new node, and adds a new meta item describing the new node to the parent metadata node. This algorithm is demonstrated in Fig 5.4 with KMAX_META_SIZE = 3.

Fig. 5.4: Split and create a new brother metadata node

5.4.5 Forest of Distributed B+Tree

The Forest of Distributed B+Tree manages a large number of distributed B+Trees. It is realized by a component called the BigSet Service, a server application that provides the big-set functions to clients. It coordinates the other services and uses the algorithms of the Small Set Service and the Metadata Service; all big-set operations are implemented in this service.

5.4.5.1 Add an Item to Big Set

Algorithm 7 describes the steps to add an item into a big set. First, it gets the root metadata id of the big set in the function getOrCreateRootIDOfBigSet; that function automatically creates a new meta id using IDGenerator if no root metadata id is associated with the big set yet, and stores it in a key-value store service called Key2MasterMetaID (Fig 5.1). It then uses the Traversal algorithm of the Metadata Service above to fill an object of class TSetMetaPath, which describes the path from the root metadata node to the small set that will contain the new item. The item is then added to the small set found on this path. If the item is new in that small set, the count field of the associated child entry of every metadata node on the path is increased by one (it is decreased by one in the Remove function).

Algorithm 7 Add Item into Big Set
1: Input: bigsetID, item
2: Output: TAddItemResult
3: rootMeta ← getOrCreateRootIDOfBigSet(bigSetID)
4: Travel from the root metadata node to get metaPath in the Metadata Service:
5: metaPath ← traversal(bigSetID, rootMeta)
6: if a small set exists in metaPath then
7:   smallSetID ← metaPath.smallSetInfo.metaID
8:   sresult ← smallSetService.addItem(smallSetID, item)
9:   if sresult.code = EAdded then
10:    metaDataService.addNumberInPath(metaPath, 1)
11:    return EAdded
12:  else
13:    return EReplacedOrNoChange
14:  end if
15: else
16:   create the first small set
17:   addItem to the created small set
18:   update the root metadata
19: end if
20: checkSplit(metaPath.splitInfo)

After adding an item into the small set, the small set may reach the size limit lmax, or a full metadata node may have been found on the path.
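The coordination between the services in Algorithm 7 can be sketched in Python as follows. This is only illustrative: key2root, id_generator, metadata_service and small_set_service stand for the corresponding service clients, and their method names are assumptions rather than the exact Thrift interface of the implementation.

def add_item(big_set_id, item, key2root, id_generator,
             metadata_service, small_set_service):
    # 1. Get (or lazily create) the root metadata id of this big set.
    root_id = key2root.get(big_set_id)
    if root_id is None:
        root_id = id_generator.next_id()
        key2root.put(big_set_id, root_id)

    # 2. Locate the target small set through the Metadata Service traversal.
    meta_path = metadata_service.traversal(root_id, item.key)

    if meta_path is not None and meta_path.small_set_id is not None:
        # 3. Insert into the small set; bump the counters on the path
        #    only when a genuinely new item was added.
        result = small_set_service.add_item(meta_path.small_set_id, item)
        if result == "EAdded":
            metadata_service.add_number_in_path(meta_path, 1)
    else:
        # 4. First item of this big set: create the first small set
        #    and register it in the root metadata node.
        small_set_id = id_generator.next_id()
        small_set_service.add_item(small_set_id, item)
        metadata_service.register_first_small_set(root_id, small_set_id, item.key)
        result = "EAdded"

    # 5. Split a full small set or metadata node if the traversal flagged one.
    metadata_service.check_split(meta_path.split_info if meta_path else None)
    return result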
If the small set reaches the size limit lmax, it is split into two smaller sets in the function checkSplit: a new small set is created with the second half of the items from the original small set. The id (key) of the new small set in the Small Set Service is generated by the IDGenerator service. Small sets therefore have auto-increasing integer keys, which lets the persistent layer exploit ZDB from Chapter 3 efficiently. Fig 5.5 shows an example of splitting a small set. This operation always ensures that the items in the next small set are greater than the items in the current small set, so we can easily travel backward and forward through the big set from a specific position in a small set. This is a characteristic of a B+Tree, and it can be realized on a key-value store easily.

Fig. 5.5: Split small set example

If a full metadata node is found on the insertion path, that node is split in the Metadata Service using Algorithm 6 or the algorithm presented in Fig 5.4, depending on whether the metadata node is the root or not. The id of the new metadata node (its key in the Metadata Service) is also generated by IDGenerator, so the keys of the Metadata Service are auto-increasing integers as well and can take advantage of ZDB just like the Small Set Service. The algorithm for the Remove function of the Big Set Service is similar to the algorithm for adding an item into a big set, except that it omits the checkSplit step.

5.4.5.2 Query Items from Big Set Algorithms

The Big Set Service provides several types of query on a big set:

• GetSlice: query n items from a specific position s
• GetSliceFromItem: query nq items starting from an item key
• RangeQuery: query items in a range between two item keys, such as [startItemKey, endItemKey)

We developed two approaches for these query types. The first approach is suitable for sequential processing of a Big Set. Because all items in a big set are sorted by item key, and according to the design of the Metadata Service and Small Set Service presented above, we can answer these queries by determining the starting small set ID of the result. The first small set ID is obtained by the Traversal algorithm for GetSliceFromItem and RangeQuery, while GetSlice uses Algorithm 8 to find it. Starting from this small set ID, items are retrieved from the Small Set Service: the matching items of this small set go into the query result, and items from the following small sets (found through the nextSetID field of the Small Set information) are retrieved until enough items have been collected, the upper-bound item of the query is reached, or the end of the big set is reached.

The second approach is suitable for parallel processing of a Big Set. For each query, we first estimate all relevant small set IDs using the Metadata Service. Each metadata element describes a branch of the B+Tree: the metadata elements in the children of a TSetMeta carry the fields minItem, the lower-bound item key they store, and count, the total number of items they store. These fields are used to compute the complete list of small set IDs needed to process the query efficiently, and the data of these small sets can then be queried in parallel from the Small Set Service.
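The sequential scan of the first approach can be sketched as follows. Here next_set_id is the link to the next small set mentioned above; get_small_set stands for a Small Set Service lookup, and the bound handling is simplified for illustration.

def range_query(first_small_set_id, start_key, end_key, get_small_set, limit=None):
    """Collects items with start_key <= key < end_key, walking small sets
    in order through their next_set_id links (the B+Tree leaf chain)."""
    results = []
    set_id = first_small_set_id
    while set_id is not None and set_id >= 0:
        small_set = get_small_set(set_id)          # one Small Set Service call
        for item in small_set.items:               # items are sorted inside a small set
            if item.key < start_key:
                continue
            if end_key is not None and item.key >= end_key:
                return results                     # passed the upper bound of the query
            results.append(item)
            if limit is not None and len(results) >= limit:
                return results                     # collected enough items
        set_id = small_set.next_set_id             # move to the next small set
    return results                                 # reached the end of the big set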
Algorithm 8 Find first Small Set ID for GetSlice
Input: p, rootMetaID
Output: smallSetID, offset
metaID ← rootMetaID
while TRUE do
  mt ← metaOf(metaID)
  x ← the largest index such that Σ(i=0..x) mt.children[i].count ≤ p
  if x > mt.children.size() - 1 then
    smallSetID ← -1   // query out of range
    break
  else if mt.level = 0 then
    smallSetID ← mt.children[x+1].metaID
    offset ← p - Σ(i=0..x) mt.children[i].count
    break
  else if mt.level > 0 then
    metaID ← mt.children[x+1].metaID
    p ← p - Σ(i=0..x) mt.children[i].count
  end if
end while

5.4.6 General key-value store using Forest of Distributed B+Tree

With the algorithms and architecture described above, we can efficiently store big sets in a commodity distributed key-value store. ZDB in Chapter 3 optimizes its engine for auto-increasing integer keys and does not directly support other key types such as binary keys. In this work, the items of a big set can themselves be key-value pairs, and a small set can store multiple items. The keys in both the Small Set Service and the Metadata Service are auto-increasing integers, so each big set can act as a general key-value store that fully supports binary or string keys. The key-value pairs in big sets are both ordered and distributed. This differs from consistent-hashing distribution, where keys are placed by their hash values: keys under consistent hashing are unordered, which makes range queries difficult. Although the small sets and metadata nodes themselves are distributed over key-value stores by consistent hashing, the items of the proposed Big Set are ordered by item key, which supports range queries easily in a distributed environment. This is in fact an order-preserving distribution [51] built on top of consistent hashing. Moreover, the Forest of Distributed B+Tree can store a large number of big sets, so the resulting general key-value store is scalable and elastic.

5.5 Evaluation

In this section, we present the workload design and configuration used to benchmark the proposed Big Set and compare it with existing popular solutions: Redis [72] and Cassandra [48]. The main purpose of the benchmarks is to verify the efficiency of the proposed architecture for big-set storage problems. We measured the number of operations per second (ops) and the scalability of the system. We used the method of the Yahoo! Cloud Serving Benchmark (YCSB) [20] to generate the workloads. The benchmark workloads are described below:

• Insertion workload: all clients generate and add new items into big sets.
• Read Only workload: all clients check the existence of items in the big set.
• Mixed workload R: 90% of client operations read items from the big set, 10% update values in the big set.
• Mixed workload W: 10% of client operations read items from the big set, 90% update values in the big set.

In these benchmarks, we used 8 storage servers and 4 machines to run clients, with the following configuration:

Operating system: Ubuntu Server 64-bit
CPU: Intel Xeon quad-core
Memory: 64 GB DDR
File system: ext4 on a 600 GB SSD
Network: wired 1 Gbps

We first deployed Redis on the 8 servers. The Ordered Set (sorted set) of Redis is used in this benchmark, and items are inserted with the zadd command. Redis consumes about 1.2 GB of main memory to store 10 million items generated by the Insertion workload in one sorted set. Because all Redis data resides in main memory, a server with 64 GB of RAM can handle only about fifty-four big sets of 10 million items each, no matter how large the SSD or HDD of the server is.
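As an illustration of how the insertion workload drives Redis, the sketch below assumes the redis-py Python client and one sorted set per big set; the key name, member format and score choice are arbitrary. The 1.2 GB per 10 million items figure above also explains the capacity limit: 64 / 1.2 ≈ 53 such sets per 64 GB server, in line with the "about fifty-four big sets" reported.

import random
import string
import redis  # assumes the redis-py client

def insertion_workload(host, big_set_id, n_items, value_size=64):
    """YCSB-style insertion workload against a Redis sorted set (ZADD)."""
    r = redis.Redis(host=host)
    key = "bigset:%s" % big_set_id              # one sorted set per big set
    for i in range(n_items):
        member = "item:%012d:%s" % (
            i, "".join(random.choices(string.ascii_letters, k=value_size)))
        r.zadd(key, {member: i})                # score i keeps members in insertion order
    return r.zcard(key)                         # number of items actually stored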
We then cleaned up the servers and deployed Cassandra on them. We used columns in Cassandra for storing the set data. Cassandra uses the disk to store its data (SSTables and commit log) efficiently, so its capacity depends only on disk space. In the Insertion workload we inserted big sets of 500 million items (one row per big set, each column storing one item), and Cassandra can store a big set of 2 billion items as well; however, the items of one big set must fit in a single server.

Finally, instances of ZDBService were deployed on the 8 servers as the persistent layer of the Big Set service, the Small Set Service was deployed on 4 servers, and the Metadata Service on 4 servers. We also inserted big sets of 500 million items into the proposed Big Set; moreover, it can store a big set of 4 billion items, with the items distributed over the ZDBService instances on the servers. In this benchmark, the items in both Cassandra and the proposed Big Set are key-value pairs (a simple column in Cassandra is a key-value pair), and each big set in Cassandra is a wide row. An item in Redis is a string of the same size as a key-value pair in Cassandra. On the client machines, we ran 16 processes for each workload.

Tab. 5.1: Other capability comparison
  Capability                            Cassandra      Redis   Big Set
  Supports billions of items per set    Yes (max 2B)   No      Yes (4B tested)
  Range query by item                   Yes            No      Yes
  Range query by position               No             Yes     Yes
  Item distributability within a set    No             No      Yes

Fig 5.6 shows the write benchmark throughput. Cassandra has the best performance in the Write Only workload, and the proposed Big Set has slightly lower performance than Redis Ordered Sets. All these storage systems scale linearly as the number of servers increases. Fig 5.7 shows the read-only benchmark throughput: despite its good write performance, the number of read operations per second of Cassandra is much lower than that of Redis and the proposed Big Set. The mixed workloads show the same pattern in Fig. 5.8 and Fig. 5.9; the proposed Big Set and Redis outperform Cassandra in the read and mixed workloads. The capabilities of these storage systems are also compared in Table 5.1 and Fig 5.10.

Fig. 5.6: Write only throughput
Fig. 5.7: Read only throughput
Fig. 5.8: Mixed Workload R
Fig. 5.9: Mixed Workload W
Fig. 5.10: Capacity comparison

5.6 Discussion

As the results above show, Cassandra performs well on write operations; its write throughput is the highest of the three systems. The reason is that a write in Cassandra is always a sequential append to the commit-log files plus an update of a limited-size in-memory structure called the MemTable. Unfortunately, in the Read Only and Mixed workloads Cassandra does not perform as well as in the write-only workload. On the read path, Cassandra first looks up the MemTable and then the on-disk SSTable(s), using a Bloom filter to locate the columns. The Bloom filter in Cassandra uses only the row key, so it is not efficient when a wide row is used as a big set. Some work has been done to improve Cassandra's read performance, such as [57], which adds a row-cache layer on SSD; in this benchmark we used SSDs and Cassandra's read performance was still not good enough. Another disadvantage of using Cassandra for the Big Set problem is that it is difficult to count how many items a big set contains, whereas the proposed distributed modified B+Tree can count the total number of items with a single read of the root metadata node in the Metadata Service. For small sets whose whole data fits in main memory, Redis is good in both read and write operations.
Redis is implemented with efficient data structures for managing in-memory data and is well suited to small caches. With a large data set that does not fit into main memory, Redis cannot store all of the data, and additional writes into a full-memory Redis server may lead to data loss because of its eviction policy. This explains the benchmark results above, which showed that Redis's capacity is significantly smaller than that of both the proposed Big Set and Cassandra.

The Forest of Distributed B+Tree for big-set problems is a trade-off between read/write performance and capacity. It has good throughput in both read and write operations, its capacity is proportional to the size of the file system, and main memory is used to cache hot small sets and hot metadata nodes in the Small Set Service and the Metadata Service. In the workloads above its performance is slightly lower than Redis (when the sets are small or medium-sized), but it is better at storing big sets: the proposed work can store bigger sets than both Cassandra and Redis. Moreover, each big set stored with the proposed method is naturally distributed over multiple servers thanks to the underlying key-value store, and in particular ZDB, since the keys in the Small Set Service and the Metadata Service are generated as auto-increasing integers. Based on the principles of this architecture, we can build other scalable distributed key-value stores or NoSQL databases for big data structures efficiently.

In the previous sections we presented the proposed big-set architecture. Data are stored in two main services: the Small Set Service and the Metadata Service. In the implementation we can unify these services using object-oriented programming to model Small Sets and Metadata. Small Set IDs and Metadata IDs can be generated by the same ID generator, and both kinds of data can use the same distributed key-value store farm. When ZDB is used, it is easy to scale the data out to many servers. Last but not least, this design can be used with any high-performance key-value store.

5.7 Applications of ZDB and Forest of Distributed B+Tree

The results of this chapter are used in many projects in both production and research. The most recent research project is "Unsupervised Anomaly Detection in Online Game". Online gaming is one of the most successful businesses on the Internet; as the business grows, in-game cheating becomes popular and is the biggest challenge for online game systems. In this research, we investigate the application of anomaly detection techniques to cheating detection in an online game (JX2) of the VNG company. A method to evaluate the performance of unsupervised anomaly detection techniques was proposed, and six unsupervised anomaly detection algorithms were tested. In the computing architecture of this research, the distributed B+Tree is used as the main data storage for all actions.

5.7.1 Computing Architecture for Anomaly Detection System

Achieving high performance is critical in a real anomaly detection system. We used a map-reduce framework [23] to increase the performance of the system. Moreover, an in-house storage system was implemented to store the historical actions of game players. This in-house storage is based on the Forest of Distributed B+Tree built on the ZDB key-value store from Chapter 3. Each user has a long list of actions that can be queried by position or by time range; this list is stored on multiple servers and managed by a distributed B+Tree.
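A minimal sketch of how such a per-user action list maps onto the big-set abstraction is shown below; big_set_client, its method names and the key layout are assumptions for illustration, not the production API. The item key encodes the timestamp, so the ordered, range-queryable keys of the distributed B+Tree directly support time-range queries.

import json
import time

def log_action(big_set_client, user_id, action_type, payload):
    """Appends one action to the user's action big set."""
    key = "%020d" % int(time.time() * 1000)        # millisecond timestamp as sort key
    value = json.dumps({"type": action_type, "data": payload})
    big_set_client.add_item(big_set_id="actions:%d" % user_id,
                            item_key=key, item_value=value)

def actions_in_range(big_set_client, user_id, t_start_ms, t_end_ms):
    """Fetches a user's actions in [t_start_ms, t_end_ms) with a range query."""
    return big_set_client.range_query(big_set_id="actions:%d" % user_id,
                                      start_key="%020d" % t_start_ms,
                                      end_key="%020d" % t_end_ms)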
The computing architecture is shown in Fig 5.11.

Fig. 5.11: Computing System Architecture

In this system, raw logs from the game servers are pushed daily to the Log Processor component. The log data is processed, and information about the interactions between users is extracted. Then, the user information is saved into the Log Store Service. The Log Store Service is built on a distributed structured storage called Big Set, which in turn is built on ZDB; each item of a Big Set is a key-value pair. This component increases the availability and scalability of the structured storage system. The core component of the architecture is the Anomaly Detection Service, which applies anomaly detection techniques to find abnormal users. The anomaly detection component was designed so that new detection techniques can comfortably be added to the system. Finally, the detected results are stored into a database and visualized for post-analysis. The results of this research are presented in the sixth publication of this thesis.

5.7.2 Specific Storage Solution for Specific Structured Data

ZDB from Chapter 3 and the Data Storage Framework from Subsection 1.3 are applied in many backend systems of Zing Me, CSM, UBus, etc. Most of the data of Zing Me users, such as feeds, friend lists and user profiles, is stored in ZDB. One of the most important components of Zing Me, the Friend Recommendation System, uses ZDB to store its data. This problem is often considered as the link prediction problem in a network. The component is the result of a research project and is presented in the fifth publication of this thesis. In this research, various direct and indirect information about users is stored in ZDB and exploited to predict future friendships (links) of Zing Me users.

Formal Definitions. The social network is modeled as a directed graph G = (V, E) in which each vertex u ∈ V represents a user in the network and each edge e_{u,v} ∈ E represents the relationship "v is a friend of u". Because friendship in the ZingMe social network is symmetric, if e_{u,v} exists then e_{v,u} also exists. The weight c_{u,v} represents the strength of the friendship between u and v, and t_{u,v} (or t_e) is the time when the edge e_{u,v} is established. For times t, t1, t2, let G[t] ⊆ G and G[t1, t2] ⊆ G consist of the edges with t_e ≤ t and t1 ≤ t_e ≤ t2, respectively. The link prediction problem for social networks can then be defined as follows: given times t0 < t′0 ≤ t1 < t′1, based on G[t0, t′0], we want to output a list of edges which are not in G[t0, t′0] and are predicted to appear in G[t1, t′1] [49].

Knowledge Sources and their storage. The proposed friend suggestion approach uses several knowledge sources: the network structure, the attributes of vertices (for example age, sex or education) and the attributes of edges (for example interactions, when a user comments on a picture, likes a page or sends a message). Specific features are derived from these knowledge sources under the assumption that if two vertices are not connected, the more mutual information they have, the higher the probability that they become connected; and if they are already connected, the mutual information represents the strength of their connection, which is greater when they share more information. The type and reliability of the data are also considered when choosing these features. For the graph structure, the derived feature is the number of mutual adjacent vertices, equal to |N_u ∩ N_v|; data from ZingMe and Facebook [8] show that more than half of new friendships are established between users who have mutual friends. For the attributes of vertices, the selected features are the number of mutual schools, the number of mutual groups and the number of mutual IP addresses. For the attributes of edges, the chosen feature is the number of "mutual interactions"; for instance, if two users like the same picture, that can be considered a mutual interaction between them. There are many types of mutual interactions, such as users being tagged in the same pictures or commenting on the same posts, and it might be better to treat each type as an individual feature, but they are considered as one feature because of the computing capability of the real system.
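These mutual-information features all reduce to set intersections. The small Python sketch below illustrates this; the per-user lookups (neighbors, schools, groups, ips, interactions) are assumed inputs mapping a user id to a set, not part of the actual system.

def mutual_features(u, v, neighbors, schools, groups, ips, interactions):
    """Feature vector for a candidate pair (u, v)."""
    return {
        # graph structure: |N_u ∩ N_v|, the number of mutual adjacent vertices
        "mutual_friends": len(neighbors[u] & neighbors[v]),
        # vertex attributes
        "mutual_schools": len(schools[u] & schools[v]),
        "mutual_groups": len(groups[u] & groups[v]),
        "mutual_ips": len(ips[u] & ips[v]),
        # edge attributes: interactions on the same objects (likes, tags, comments)
        "mutual_interactions": len(interactions[u] & interactions[v]),
    }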
Our approach reuses some features from the current approach, including the number of mutual adjacent vertices, the number of mutual schools and the number of mutual IP addresses. It also adds features derived from the groups and interactions of users, which were neglected in the current approach, and it eliminates the feature related to the companies of users because of its unreliability: in ZingMe, users provide the names of their companies as free text without any standardization, and by manually examining those names we found them unreliable. All these knowledge sources are stored in backend services using ZDB from Chapter 3, the Data Storage Framework from Subsection 1.3, and the Forest of Distributed B+Tree. These backend services are scalable in practice and meet the high-performance, low-latency requirements of real products.

5.8 Summary

The proposed Forest of Distributed B+Tree has many advantages for building various data storage services. Every Big Set, as a value in a key-value store, is split into multiple small sets. Small sets are managed in a storage backend service that uses a memory cache for fast access and a distributed key-value store to persist the data. A Metadata Service is designed on top of the key-value store to manage the big-set structure as a distributed B+Tree. This supports building scalable data storage systems for big data structures. While other works often build a key-value store using a B+Tree, this research builds a distributed B+Tree using a key-value store from scratch. This work can easily distribute a big value, such as a big set or big dictionary, over multiple servers using the advantages of the distributed key-value store, while others such as Cassandra and Redis do not support this feature yet. The proposed distributed modified B+Tree has relatively high performance in both read and write operations and the ability to store big data structures distributively; the complexity of the operations on the proposed Big Set is O(log(N)). The proposed work is used to store data for friend suggestion in the Zing Me social network and to store time series of user actions for the game-data anomaly detection system of VNG games. It can be applied to storing big data for many kinds of applications and systems.

Conclusion and Future works

The development from RDBMS to NoSQL is an inevitable trend for building storage systems for a wide range of applications such as text processing, data warehouses, stream processing and scientific databases. The key-value store is a category of NoSQL database and plays an important role in building large-scale applications. The performance of a distributed key-value store consists of multiple metrics that depend on the architecture and algorithms inside the key-value store engine. To achieve a high-performance distributed key-value store that can efficiently store various data types, we have to face several problems:

• How to minimize the latency of every operation of a persistent key-value store, and how to optimize the index of the persistent key-value engine for one of the most popular key types.
• How to store large values into key-value stores?
How to manage a large number of big files in cloud storage with a distributed key-value store and minimal metadata per file?
• How to store big data structures into key-value stores? How to manage a large number of big sets or big data structures while they are modified and queried frequently?

This thesis first presents a method to build a high-performance key-value store on a single node, whose data can be distributed over multiple servers using consistent hashing. Every write operation can be performed sequentially on the disk drive, and every read operation needs at most one disk seek. All writes can be configured to be sequential to achieve the best write performance on both SSDs and HDDs. The proposed key-value store uses a shared-memory Flat Index to quickly look up the position of a key-value pair in the data file without unnecessary disk access, and it is optimized for auto-increasing integer keys, one of the most popular key types. A high-performance key-value store called Zing Database (ZDB) was implemented. These results are presented in the first two publications of this thesis.

Secondly, this thesis proposes an architecture for building big-file cloud storage based on a key-value store. It takes advantage of the proposed key-value store to minimize the size of the metadata when the system manages a large number of big files serving millions of users. To store big files (big values) in the key-value store, every file has metadata of the same size; each big file is split into multiple fixed-size chunks and stored in ZDB. The chunks of a file have a contiguous ID range, so it is easy to distribute the data and scale out the storage system, especially when using ZDB. This thesis thus proposes a method to store big files efficiently with the advantages of a key-value store.

Finally, this thesis proposes the Forest of Distributed B+Tree based on a key-value store. This result efficiently converts a binary key space into an auto-increasing integer key space, which is useful for building scalable NoSQL data storage for large data structures such as big sets and wide-column data. Data is distributed in the key-value store automatically, which makes it easy to scale the system. Every Big Set, as a value in a key-value store, is split into multiple small sets that are stored in the distributed ZDB key-value store. It supports building scalable data storage systems for big data structures. The experimental results show that the Forest of Distributed B+Tree has good performance in both read and write operations; moreover, it is a general key-value store that supports the binary key type efficiently and is order-preserving.

In the future, we will continue to extend and research storage architectures for big data. We will first focus on data storage systems that support computation in data mining systems more efficiently, such as large time-series storage: in the "Internet of Things" trend, data sources from millions of sensors with multiple long time series need to be stored so that they can be queried and mined efficiently. Second, we will research how to make our storage systems more secure in a network environment, making them not only high-performance but also secure.

Publications

[1] Thanh Trung Nguyen, Minh Hieu Nguyen. “ZDB-High performance key-value store.” In Proceedings of the 2013 Third World Congress on Information and Communication Technologies (WICT 2013)
[2] Thanh Trung Nguyen, Minh Hieu Nguyen.
“Zing Database: high-performance key- value store for large-scale storage service.” Vietnam Journal of Computer Science, Febru- ary 2015, Volume 2, Issue 1, pp 13-23 [3] Thanh Trung Nguyen, Tin Khac Vu, Minh Hieu Nguyen. “BFC: High-Performance Distributed Big-File Cloud Storage Based On Key-Value Store” in Proceeding of 16th IEEE/ACIS International Conference on Software Engineering, Artificial Intelligence, Net- working and Parallel/Distributed Computing (SNPD 2015). [4] Thanh Trung Nguyen, Anh Tuan Nguyen, Tuan Anh Ha Nguyen, Ly Thi Vu, Quang Uy Nguyen, and Long Dao Hai. 2015. “Unsupervised Anomaly Detection in Online Game.” In Proceedings of the Sixth International Symposium on Information and Communication Technology (SoICT 2015). ACM, New York, NY, USA, 4-10. DOI= 10.1145/2833258.2833305 [5] Thanh Trung Nguyen, Minh Hieu Nguyen. “Forest of Distributed B+Tree Based On Key-Value Store for Big-Set Problem.” In Database Systems for Advanced Applications Volume 9645 of the series Lecture Notes in Computer Science pp 268-282, 2016 [6] Thanh Trung Nguyen, Minh Hieu Nguyen. “Distributed and High Performance Big- File Cloud Storage Based On Key-Value Store” International Journal of Networked and Distributed Computing, Volume 4, Issue 3, pp 159 - 172, July 2016 124 Bibliography [1] Daniel Abadi, Rakesh Agrawal, Anastasia Ailamaki, Magdalena Balazinska, Philip A. Bernstein, Michael J. Carey, Surajit Chaudhuri, Jeffrey Dean, AnHai Doan, Michael J. Franklin, Johannes Gehrke, Laura M. Haas, Alon Y. Halevy, Joseph M. Heller- stein, Yannis E. Ioannidis, H. V. Jagadish, Donald Kossmann, Samuel Madden, Sharad Mehrotra, Tova Milo, Jeffrey F. Naughton, Raghu Ramakrishnan, Volker Markl, Christopher Olston, Beng Chin Ooi, Christopher Re´, Dan Suciu, Michael Stonebraker, Todd Walter, and Jennifer Widom. The beckman report on database research. SIGMOD Rec., 43(3):61–70, December 2014. ISSN 0163-5808. doi: 10.1145/2694428.2694441. URL [2] Divyakant Agrawal, Sudipto Das, and Amr El Abbadi. Big data and cloud comput- ing: current state and future opportunities. In Proceedings of the 14th International Conference on Extending Database Technology, pages 530–533. ACM, 2011. [3] Marcos K Aguilera, Wojciech Golab, and Mehul A Shah. A practical scalable dis- tributed b-tree. Proceedings of the VLDB Endowment, 1(1):598–609, 2008. [4] Ashok Anand, Chitra Muthukrishnan, Steven Kappes, Aditya Akella, and Suman Nath. Cheap and Large CAMs for High Performance Data-Intensive Networked Sys- tems. In NSDI, volume 10, pages 29–29, 2010. [5] David G Andersen, Jason Franklin, Michael Kaminsky, Amar Phanishayee, Lawrence 125 Tan, and Vijay Vasudevan. FAWN: A fast array of wimpy nodes. In Proceedings of the ACM SIGOPS 22nd symposium on Operating systems principles, pages 1–14. ACM, 2009. [6] NoSQL Archive. NoSQL DEFINITION: Next Generation Databases mostly address- ing some of the points: being non-relational, distributed, open-source and horizontally scalable. 2016. Accessed September 4, 2016. [7] Marcos D Assunc¸a˜o, Rodrigo N Calheiros, Silvia Bianchi, Marco AS Netto, and Ra- jkumar Buyya. Big data computing and clouds: Trends and future directions. Journal of Parallel and Distributed Computing, 79:3–15, 2015. [8] Lars Backstrom and Jure Leskovec. Supervised random walks: predicting and rec- ommending links in social networks. In Proceedings of the fourth ACM international conference on Web search and data mining, pages 635–644. ACM, 2011. [9] Anirudh Badam, KyoungSoo Park, Vivek S Pai, and Larry L Peterson. 
HashCache: Cache Storage for the Next Billion. In NSDI, volume 9, pages 123–136, 2009. [10] Doug Beaver, Sanjeev Kumar, Harry C Li, Jason Sobel, Peter Vajgel, et al. Finding a Needle in Haystack: Facebook’s Photo Storage. In OSDI, volume 10, pages 1–8, 2010. [11] Dhruba Borthakur. HDFS architecture guide. Hadoop Apache Project, page 53, 2008. [12] Eric Brew. CAP Twelve Years Later: How the ”Rules” Have Changed. infoq.com/articles/cap-twelve-years-later-how-the-rules-have-changed, 2012. Accessed December 1st , 2014. [13] Eric A Brewer. Towards robust distributed systems. In PODC, page 7, 2000. [14] Eric A Brewer. Lessons from giant-scale services. Internet Computing, IEEE, 5(4): 46–55, 2001. 126 [15] Mike Burrows. The chubby lock service for loosely-coupled distributed systems. In Proceedings of the 7th symposium on Operating systems design and implementation, pages 335–350. USENIX Association, 2006. [16] Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C Hsieh, Deborah A Wallach, Mike Burrows, Tushar Chandra, Andrew Fikes, and Robert E Gruber. Bigtable: A distributed storage system for structured data. ACM Transactions on Computer Systems (TOCS), 26(2):4, 2008. [17] Laura Chappell and Gerald Combs. Wireshark network analysis: the official Wire- shark certified network analyst study guide. Protocol Analysis Institute, Chappell University, 2010. [18] Peter M Chen, David Patterson, et al. Storage performance-metrics and benchmarks. Proceedings of the IEEE, 81(8):1151–1165, 1993. [19] Douglas Comer. Ubiquitous B-tree. ACM Computing Surveys (CSUR), 11(2):121–137, 1979. [20] Brian F Cooper, Adam Silberstein, Erwin Tam, Raghu Ramakrishnan, and Russell Sears. Benchmarking cloud serving systems with YCSB. In Proceedings of the 1st ACM symposium on Cloud computing, pages 143–154. ACM, 2010. [21] Bin Cui, Hong Mei, and Beng Chin Ooi. Big data: the driver for innovation in databases. National Science Review, 1(1):27–30, 2014. [22] Jeff Dean. Software engineering advice from building large-scale distributed systems, 2007. [23] Jeffrey Dean and Sanjay Ghemawat. Mapreduce: simplified data processing on large clusters. Communications of the ACM, 51(1):107–113, 2008. 127 [24] Biplob Debnath, Sudipta Sengupta, and Jin Li. FlashStore: high throughput persis- tent key-value store. Proceedings of the VLDB Endowment, 3(1-2):1414–1425, 2010. [25] Biplob Debnath, Sudipta Sengupta, and Jin Li. SkimpyStash: RAM space skimpy key-value store on flash-based storage. In Proceedings of the 2011 ACM SIGMOD International Conference on Management of data, pages 25–36. ACM, 2011. [26] Giuseppe DeCandia, Deniz Hastorun, Madan Jampani, Gunavardhan Kakulapati, Avinash Lakshman, Alex Pilchin, Swaminathan Sivasubramanian, Peter Vosshall, and Werner Vogels. Dynamo: amazon’s highly available key-value store. In SOSP, vol- ume 7, pages 205–220, 2007. [27] Idilio Drago, Marco Mellia, Maurizio M Munafo, Anna Sperotto, Ramin Sadre, and Aiko Pras. Inside dropbox: understanding personal cloud storage services. In Proceed- ings of the 2012 ACM conference on Internet measurement conference, pages 481–494. ACM, 2012. [28] Idilio Drago, Enrico Bocchi, Marco Mellia, Herman Slatman, and Aiko Pras. Bench- marking personal cloud storage. In Proceedings of the 2013 conference on Internet measurement conference, pages 205–212. ACM, 2013. [29] Dropbox. Dropbox Tech Blog. https://tech.dropbox.com/, 2014. Accessed October 28, 2014. [30] PUB FIPS. 197: the official AES standard. 
Figure2: Working scheme with four LFSRs and their IV generation LFSR1 LFSR, 2, 2001. [31] Brad Fitzpatrick. A distributed memory object caching system. com/memcached/, 2013. Accessed September 4, 2013. [32] Peter Ge´czy. Big data characteristics. The Macrotheme Review, 3(6):94–104, 2014. 128 [33] Sanjay Ghemawat and Jeff Dean. LevelDB is a fast key-value storage library written at Google that provides an ordered mapping from string keys to string values. https: //github.com/google/leveldb, 2014. Accessed November 2, 2014. [34] Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung. The Google file system. In ACM SIGOPS Operating Systems Review, volume 37, pages 29–43. ACM, 2003. [35] Jim Gray. Microsoft SQL Server. January 1997. URL com/apps/pubs/default.aspx?id=68492. [36] Jim Gray and Andreas Reuter. Transaction Processing: Concepts and Techniques. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1st edition, 1992. ISBN 1558601902. [37] Jim Gray et al. The transaction concept: Virtues and limitations. In VLDB, vol- ume 81, pages 144–154, 1981. [38] Yunhong Gu and Robert L Grossman. UDT: UDP-based data transfer for high-speed wide area networks. Computer Networks, 51(7):1777–1799, 2007. [39] Patrick Hunt, Mahadev Konar, Flavio P Junqueira, and Benjamin Reed. ZooKeeper: wait-free coordination for internet-scale systems. In Proceedings of the 2010 USENIX conference on USENIX annual technical conference, volume 8, pages 11–11, 2010. [40] Google Inc. LevelDB - A fast and lightweight key/value database library by Google. 2013. Accessed July 23, 2013. [41] Doug Judd. Hypertable. hypertable\_nosql.pdf, 2009. 129 [42] Stephen Kaisler, Frank Armour, Juan Antonio Espinosa, and William Money. Big data: Issues and challenges moving forward. In System Sciences (HICSS), 2013 46th Hawaii International Conference on, pages 995–1004. IEEE, 2013. [43] Robert Kallman, Hideaki Kimura, Jonathan Natkins, Andrew Pavlo, Alexander Rasin, Stanley Zdonik, Evan P. C. Jones, Samuel Madden, Michael Stonebraker, Yang Zhang, John Hugg, and Daniel J. Abadi. H-Store: a High-Performance, Distributed Main Memory Transaction Processing System. Proc. VLDB Endow., 1(2):1496–1499, 2008. ISSN 2150-8097. doi: 10.1145/1454159.1454211. URL edu/papers/hstore-demo.pdf. [44] David Karger, Eric Lehman, Tom Leighton, Rina Panigrahy, Matthew Levine, and Daniel Lewin. Consistent hashing and random trees: Distributed caching protocols for relieving hot spots on the World Wide Web. In Proceedings of the twenty-ninth annual ACM symposium on Theory of computing, pages 654–663. ACM, 1997. [45] David Karger, Alex Sherman, Andy Berkheimer, Bill Bogstad, Rizwan Dhanidina, Ken Iwamoto, Brian Kim, Luke Matkins, and Yoav Yerushalmi. Web caching with consistent hashing. Computer Networks, 31(11):1203–1213, 1999. [46] FAL Labs. Kyoto Cabinet: a straightforward implementation of DBM. http:// fallabs.com/kyotocabinet, 2013. Accessed May 1, 2013. [47] Eric Lai. No to SQL? Anti-database movement gains steam. http: //www.computerworld.com/article/2526317/database-administration/ no-to-sql--anti-database-movement-gains-steam.html, 2009. [48] Avinash Lakshman and Prashant Malik. Cassandra: a decentralized structured stor- age system. ACM SIGOPS Operating Systems Review, 44(2):35–40, 2010. 130 [49] David Liben-Nowell and Jon Kleinberg. The Link Prediction Problem for Social Networks. In Proceedings of the Twelfth International Conference on Information and Knowledge Management, volume 58 of CIKM ’03, pages 556–559, New York, NY, USA, 2003. ACM. 
ISBN 1-58113-723-0. doi: 10.1145/956863.956972. URL [50] Hyeontaek Lim, Bin Fan, David G Andersen, and Michael Kaminsky. SILT: A memory-efficient, high-performance key-value store. In Proceedings of the Twenty- Third ACM Symposium on Operating Systems Principles, pages 1–13. ACM, 2011. [51] Witold Litwin, Marie-Anne Neimat, and Donovan Schneider. Rp*: A family of order preserving scalable distributed data structures. In VLDB, volume 94, pages 12–15, 1994. [52] Yandong Mao, Eddie Kohler, and Robert Tappan Morris. Cache craftiness for fast multicore key-value storage. In Proceedings of the 7th ACM european conference on Computer Systems, pages 183–196. ACM, 2012. [53] Mapkeeper. MapKeeper. https://github.com/m1ch1/mapkeeper, 2014. Accessed June 1, 2014. [54] Ward Douglas Maurer and Theodore Gyle Lewis. Hash table methods. ACM Com- puting Surveys (CSUR), 7(1):5–19, 1975. [55] Nimrod Megiddo and Dharmendra S Modha. ARC: A Self-Tuning, Low Overhead Replacement Cache. In FAST, volume 3, pages 115–130, 2003. [56] Nimrod Megiddo and Dharmendra S Modha. Outperforming lru with an adaptive replacement cache algorithm. Computer, 37(4):58–65, 2004. 131 [57] Prathyush Menon, Tilmann Rabl, Mohammad Sadoghi, and Hans-Arno Jacobsen. Cassandra: An ssd boosted key-value store. In Data Engineering (ICDE), 2014 IEEE 30th International Conference on, pages 1162–1167. IEEE, 2014. [58] Changwoo Min, Kangnyeon Kim, Hyunjin Cho, Sang-Won Lee, and Young Ik Eom. SFS: Random write considered harmful in solid state drives. In Proc. of the 10th USENIX Conf. on File and Storage Tech, 2012. [59] Jeffrey C Mogul, Yee-Man Chan, and Terence Kelly. Design, Implementation, and Evaluation of Duplicate Transfer Detection in HTTP. In NSDI, volume 4, pages 4–4, 2004. [60] MySQL. Disadvantages of Creating Many Tables in the Same Database, 2015. [61] Elizabeth J O’neil, Patrick E O’neil, and Gerhard Weikum. The lru-k page replace- ment algorithm for database disk buffering. ACM SIGMOD Record, 22(2):297–306, 1993. [62] Elizabeth J O’neil, Patrick E O’Neil, and Gerhard Weikum. An optimality proof of the lru-k page replacement algorithm. Journal of the ACM (JACM), 46(1):92–112, 1999. [63] Patrick O’Neil, Edward Cheng, Dieter Gawlick, and Elizabeth O’Neil. The log- structured merge-tree (LSM-tree). Acta Informatica, 33(4):351–385, 1996. [64] Oracle. Oracle Berkeley DB 12c: Persistent key value store. com/technetwork/products/berkeleydb, 2013. [65] John Ousterhout, Parag Agrawal, David Erickson, Christos Kozyrakis, Jacob Leverich, David Mazie`res, Subhasish Mitra, Aravind Narayanan, Diego Ongaro, Guru Parulkar, et al. The case for RAMCloud. Communications of the ACM, 54(7):121–130, 2011. 132 [66] Rasmus Pagh and Flemming Friche Rodler. Cuckoo hashing. Journal of Algorithms, 51(2):122–144, 2004. [67] David Peleg and Avishai Wool. The availability of quorum systems. Information and Computation, 123(2):210–223, 1995. [68] Martin Placek and Rajkumar Buyya. A taxonomy of distributed storage systems. Reporte te´cnico, Universidad de Melbourne, Laboratorio de sistemas distribuidos y co´mputo grid, 2006. [69] FIPS PUB. Secure Hash Standard (SHS). 2012. [70] William Pugh. Skip lists: a probabilistic alternative to balanced trees. Communica- tions of the ACM, 33(6):668–676, 1990. [71] Stephen M Rumble, Ankita Kejriwal, and John K Ousterhout. Log-structured memory for DRAM-based storage. In FAST, pages 1–16, 2014. [72] Salvatore Sanfilippo and Pieter Noordhuis. Redis. 2013. Accessed June 07, 2013. 
[73] Spencer Shepler, Mike Eisler, David Robinson, Brent Callaghan, Robert Thurlow, David Noveck, and Carl Beame. Network file system (NFS) version 4 protocol. Net- work, 2003. [74] Konstantin Shvachko, Hairong Kuang, Sanjay Radia, and Robert Chansler. The Hadoop Distributed File System. In Proceedings of the 2010 IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST), MSST ’10, pages 1–10, Washington, DC, USA, 2010. IEEE Computer Society. ISBN 978-1-4244-7152-2. doi: 10.1109/ MSST.2010.5496972. URL 133 [75] Mark Slee, Aditya Agarwal, and Marc Kwiatkowski. Thrift: Scalable cross-language services implementation. Facebook White Paper, 5, 2007. [76] Benjamin Sowell, Wojciech Golab, and Mehul A Shah. Minuet: a scalable distributed multiversion b-tree. Proceedings of the VLDB Endowment, 5(9):884–895, 2012. [77] Jan Stanek, Alessandro Sorniotti, Elli Androulaki, and Lukas Kencl. A Secure Data Deduplication Scheme for Cloud Storage. 2014. [78] Michael Stonebraker and Rick Cattell. 10 rules for scalable performance in’simple operation’datastores. Communications of the ACM, 54(6):72–80, 2011. [79] Michael Stonebraker and Ugur Cetintemel. ” One size fits all”: an idea whose time has come and gone. In Data Engineering, 2005. ICDE 2005. Proceedings. 21st Inter- national Conference on, pages 2–11. IEEE, 2005. [80] Michael Stonebraker, Chuck Bear, Ug˘ur C¸etintemel, Mitch Cherniack, Tingjian Ge, Nabil Hachem, Stavros Harizopoulos, John Lifter, Jennie Rogers, and Stan Zdonik. One size fits all? Part 2: Benchmarking results. In Proc. CIDR, 2007. [81] Michael Stonebraker, Samuel Madden, Daniel J Abadi, Stavros Harizopoulos, Nabil Hachem, and Pat Helland. The end of an architectural era:(it’s time for a complete rewrite). In Proceedings of the 33rd international conference on Very large data bases, pages 1150–1160. VLDB Endowment, 2007. [82] Christof Strauch, Ultra-Large Scale Sites, and Walter Kriha. NoSQL databases. Lec- ture Notes, Stuttgart Media University, 2011. [83] Miklos Szeredi et al. Fuse: Filesystem in userspace. Accessed on, 2010. [84] D. B. Terry, M. M. Theimer, Karin Petersen, A. J. Demers, M. J. Spreitzer, and C. H. Hauser. Managing Update Conflicts in Bayou, a Weakly Connected Replicated 134 Storage System. In Proceedings of the Fifteenth ACM Symposium on Operating Sys- tems Principles, SOSP ’95, pages 172–182, New York, NY, USA, 1995. ACM. ISBN 0-89791-715-4. doi: 10.1145/224056.224070. URL 224056.224070. [85] Avadis Tevanian, Richard F Rashid, Michael Young, David B Golub, Mary R Thomp- son, William J Bolosky, and Richard Sanzi. A UNIX Interface for Shared Memory and Memory Mapped Files Under Mach. In USENIX Summer, pages 53–68. Citeseer, 1987. [86] Tom van Dijk. Analysing and Improving Hash Table Performance. 2009. [87] Robbert van Renesse and Fred B Schneider. Chain Replication for Supporting High Throughput and Availability. In OSDI, volume 4, pages 91–104, 2004. [88] Werner Vogels. Eventually consistent. Communications of the ACM, 52(1):40–44, 2009. [89] Sage A. Weil, Scott A. Brandt, Ethan L. Miller, Darrell D. E. Long, and Carlos Maltzahn. Ceph: A Scalable, High-performance Distributed File System. In Proceed- ings of the 7th Symposium on Operating Systems Design and Implementation, OSDI ’06, pages 307–320, Berkeley, CA, USA, 2006. USENIX Association. ISBN 1-931971- 47-1. URL [90] Sage A. Weil, Scott A. Brandt, Ethan L. Miller, and Carlos Maltzahn. CRUSH: Controlled, Scalable, Decentralized Placement of Replicated Data. 
In Proceedings of the 2006 ACM/IEEE Conference on Supercomputing, SC ’06, New York, NY, USA, 2006. ACM. ISBN 0-7695-2700-0. doi: 10.1145/1188455.1188582. URL acm.org/10.1145/1188455.1188582. 135 [91] M Widenius, D Axmark, and M AB. MySQL 5.5 Reference Manual, 2014. [92] Demetrios Zeinalipour-Yazti, Song Lin, Vana Kalogeraki, Dimitrios Gunopulos, and Walid A Najjar. MicroHash: An Efficient Index Structure for Flash-Based Sensor Devices. In FAST, volume 5, pages 3–3, 2005. [93] Kai Zhang, Kaibo Wang, Yuan Yuan, Lei Guo, Rubao Lee, and Xiaodong Zhang. Mega-kv: a case for gpus to maximize the throughput of in-memory key-value stores. Proceedings of the VLDB Endowment, 8(11):1226–1237, 2015. 136
