The number of children in each metadata node is bounded by KMAX META SIZE. Fig 5.2 demonstrates
metadata in non-leaf nodes.
TSetMetaPath is a structure in the Metadata Service that describes a path from a root metadata
node to a leaf node. It is used in the algorithms for querying and manipulating metadata and
big sets. It consists of split information and a list of child items of type TSetMetaPathItem,
as described in Listing 5.3.
Listing 5.3: MetaPath description
struct TNeedSplitInfo {
  1: TMetaKey metaID,
  2: TMetaKey parentID,
  3: i32 childCount,
  4: bool isSmallSet = false
}

struct TSetMetaPathItem {
  1: TItemKey minItem,
  2: i64 metaID,
  3: byte level,
}

struct TSetMetaPath {
  1: list<TSetMetaPathItem> metaPath,
  2: optional TSetMetaPathItem smallSetInfo,
  3: optional TNeedSplitInfo splitInfo
}
In the TSetMetaPath structure, the elements of the metaPath list have decreasing levels: the
i-th element is the parent of the (i+1)-th element. The final element in metaPath is the parent
of smallSetInfo.
5.4.4.2 Metadata Service Algorithms
The Metadata Service contains several important algorithms that form parts of the big set
operations. The Traversal algorithm locates the path from the root node to the small set
associated with an item for adding, modifying or reading. The Splitting algorithms split a
full metadata node when it reaches the maximum number of children. These algorithms are
implemented in the Service Model layer of the framework described in Subsection 1.3.
Traversal To insert an item into a big set, we have to find the small set that will store
that item. The traversal operation is described in Algorithm 5. The algorithm constructs a
path from the root node to the leaf node (small set) that may contain the searched itemKey.
At each metadata node, it decides which child node to visit next by binary search on the
minItem fields of the TMetaItem objects in the children of the current TSetMeta. The node
with the largest number of children seen so far is recorded in the splitInfo of the meta path,
and the found small set is recorded in smallSetInfo of TSetMetaPath. The big set is organized
as a distributed B+Tree and the number of nodes this algorithm visits is proportional to the
height of the B+Tree, so its complexity is O(log(N)), where N is the number of items in the
big set.
Algorithm 5 Metadata Traversal
1: Input: rootMetaID, itemKey
2: Output: metaPath
3: metaID=rootMetaID
4: while TRUE do
5: meta← metaOf(metaID)
6: if meta.children is empty then return EmptyPath;
7: end if
8: currentCount← childrenCount(meta)
9: oldCount← metaPath.splitInfo.childCount
10: if currentCount > oldCount then
11: updateSplitInfo(meta, metaPath.splitInfo);
12: end if
13: TMetaItem mtItem ← meta.findByItem(itemKey)
14: if meta.level = 0 then
15: metaPath.smallSetInfo.metaID ← mtItem.metaID
16: updateSmallSetInfo(metaPath)
17: break
18: else
19: TSetMetaPathItem pathItem;
20: pathItem ← mtItem;
21: pathItem.level ← meta.level -1;
22: metaPath.push_back(pathItem)
23: metaID ← mtItem.metaID
24: end if
25: end while
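The following Python sketch mirrors Algorithm 5 in simplified form. The MetaItem and MetaNode classes and the meta_of lookup function are illustrative stand-ins for the Thrift structures and the key-value reads of the Metadata Service, not the actual service API, and the sketch returns only the list of visited metadata IDs instead of a full TSetMetaPath.

import bisect
from dataclasses import dataclass

@dataclass
class MetaItem:           # child descriptor inside a metadata node (illustrative)
    min_item: str         # smallest item key stored under this child
    meta_id: int          # key of the child node (or small set) in the KV store
    count: int = 0        # number of items stored under this child

@dataclass
class MetaNode:           # in-memory stand-in for a TSetMeta node (illustrative)
    level: int            # 0 means the children point to small sets
    children: list        # list of MetaItem, kept sorted by min_item

def traversal(meta_of, root_meta_id, item_key):
    """Return (list of visited metadata IDs, small set ID) for item_key."""
    path, meta_id = [], root_meta_id
    while True:
        meta = meta_of(meta_id)           # one key-value read per level: O(log N) reads
        if not meta.children:
            return [], None               # empty big set
        keys = [c.min_item for c in meta.children]
        # rightmost child whose min_item is <= item_key (binary search)
        idx = max(bisect.bisect_right(keys, item_key) - 1, 0)
        child = meta.children[idx]
        if meta.level == 0:               # leaf metadata node: the child is a small set
            return path, child.meta_id
        path.append(child.meta_id)        # descend one level
        meta_id = child.meta_id

Keeping the children of every node sorted by minItem is what makes the binary search at each level possible.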
Split Root Metadata node. When a root metadata node reaches its maximum size limit, it is
split into two smaller nodes; the detailed steps are described in Algorithm 6.
Algorithm 6 Split Root Metadata node and Level Up
Input: rootMeta.children = [mt_0, mt_1, mt_2, mt_3, ..., mt_n], rootMetaID, firstChildID, secondChildID; each mt_i is a MetaItem structure
Output: splitResult
1. Check size:
if n < 4 then
splitResult ← TooSmall;
return
end if
2. Construct the new children metadata nodes:
fcmt ⇐ [mt_0, mt_1, ..., mt_{n/2}]
scmt ⇐ [mt_{n/2+1}, mt_{n/2+2}, ..., mt_n]
3. Increase the level of the root metadata node:
rootMeta.level ⇐ rootMeta.level + 1
4. Construct meta items for the two new children:
minItem1 ← minItem(fcmt)
count1 ← itemCount(fcmt)
metaItem1 ⇐ {firstChildID, minItem1, count1}
minItem2 ← minItem(scmt)
count2 ← itemCount(scmt)
metaItem2 ⇐ {secondChildID, minItem2, count2}
5. Construct the new children of the root metadata node:
rootMeta.children ⇐ [metaItem1, metaItem2]
6. Save the metadata nodes to the key-value store:
saveData(rootMetaID, rootMeta)
saveData(firstChildID, fcmt)
saveData(secondChildID, scmt)
splitResult = OK
return splitResult
Two new metadata nodes are created, the original root node keeps only two meta items that
describe them, and its level is increased by 1. Fig 5.3 demonstrates an example of splitting
a root metadata node with KMAX META SIZE=3.
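A minimal Python sketch of the root split, mirroring Algorithm 6 and reusing the MetaItem/MetaNode stand-ins from the traversal sketch; save(key, node) stands in for the key-value store write, and the exact split point is only approximate.

def split_root(root_meta, root_meta_id, first_child_id, second_child_id, save):
    """Split a full root metadata node into two children and raise the root one level."""
    n = len(root_meta.children)
    if n < 4:
        return "TooSmall"                       # too few children to be worth splitting
    mid = n // 2
    fcmt = root_meta.children[:mid]             # first half of the child descriptors
    scmt = root_meta.children[mid:]             # second half of the child descriptors
    root_meta.level += 1                        # the root moves one level up

    def descriptor(meta_id, children):
        # meta item that describes a new child node inside the root
        return MetaItem(min_item=children[0].min_item,
                        meta_id=meta_id,
                        count=sum(c.count for c in children))

    first = MetaNode(level=root_meta.level - 1, children=fcmt)
    second = MetaNode(level=root_meta.level - 1, children=scmt)
    root_meta.children = [descriptor(first_child_id, fcmt),
                          descriptor(second_child_id, scmt)]
    save(root_meta_id, root_meta)               # persist the root and both new children
    save(first_child_id, first)
    save(second_child_id, second)
    return "OK"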
Fig. 5.3: Split root node and level up

Fig. 5.4: Split and create new brother meta data node

Split middle meta-data node. This algorithm is used when the number of children of a non-root
metadata node exceeds the maximum size limit, caused by the splitting of a lower-level node or
of a small set. The algorithm creates a new same-level metadata node (a new brother node),
moves the second half of the children into the new node, and adds a new meta item describing
the new node into the parent metadata node. This algorithm is demonstrated in Fig 5.4 with
KMAX META SIZE=3.
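A sketch of the brother-node split in the same style (illustrative names, reusing the MetaItem/MetaNode stand-ins); a full implementation would also reduce the count recorded for the split node in its parent descriptor.

def split_middle(node, node_id, brother_id, parent, parent_id, save):
    """Split a full non-root metadata node by moving its second half into a new
    same-level brother node and registering the brother in the parent node."""
    mid = len(node.children) // 2
    moved = node.children[mid:]                 # second half goes to the brother
    node.children = node.children[:mid]
    brother = MetaNode(level=node.level, children=moved)

    brother_item = MetaItem(min_item=moved[0].min_item,
                            meta_id=brother_id,
                            count=sum(c.count for c in moved))
    # keep the parent's children sorted by min_item
    pos = next((i for i, c in enumerate(parent.children)
                if c.min_item > brother_item.min_item), len(parent.children))
    parent.children.insert(pos, brother_item)

    save(node_id, node)                         # persist all touched metadata nodes
    save(brother_id, brother)
    save(parent_id, parent)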
5.4.5 Forest of Distributed B+Tree
The Forest of Distributed B+Tree manages a large number of distributed B+Trees. It is realized
by a component called BigSet Service. This service is a server application that exposes big set
functions to clients; it coordinates the other services and uses the algorithms from the Small
Set Service and the Metadata Service. All big set operations are implemented in this service.
5.4.5.1 Add an Item to Big Set
Algorithm 7 describes the steps to add an item into a big set. First, it gets the root metadata
ID of the big set via the function getOrCreateRootIDOfBigSet. That function automatically
creates a new meta ID using IDGenerator if no root metadata ID is associated with the big set
yet, and stores it into a key-value store service called Key2MasterMetaID (Fig 5.1). It then
uses the Traversal Algorithm of the Metadata Service above to fill an object of class
TSetMetaPath. The result describes a path from the root metadata node to the small set that
will contain the added item.
Algorithm 7 Add Item into Big Set
1: Input: bigSetID, item
2: Output: TAddItemResult
3: rootMeta ← getOrCreateRootIDOfBigSet(bigSetID)
4: Traverse from the root meta node to get metaPath from the Metadata Service:
5: metaPath ← traversal(bigSetID, rootMeta)
6: if a small set exists in metaPath then
7: smallSetID ← metaPath.smallSetInfo.metaID;
8: sresult ← smallSetService.addItem(smallSetID, item)
9: if sresult.code = EAdded then
10: metaDataService.addNumberInPath(metaPath,1)
11: return EAdded;
12: else
13: return EReplacedOrNoChange
14: end if
15: else
16: create first small set
17: addItem to created small set
18: update root metadata
19: end if
20: checkSplit(metaPath.splitInfo)
The item is added to the small set found on the path. If this is a new item in that small set,
the count field of the associated child in each metadata node on the path is increased by one
(it is decreased by one in the Remove function).
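A sketch of this count maintenance along the path (the real addNumberInPath of the Metadata Service may differ); it reuses the MetaNode/MetaItem stand-ins and the meta_of/save helpers assumed in the traversal sketch above.

def add_number_in_path(meta_of, save, root_meta_id, path_ids, small_set_id, delta):
    """Bump the count of the child descriptor chosen at every level of the path
    after an insert (delta = +1) or a removal (delta = -1)."""
    nodes = [root_meta_id] + path_ids          # metadata nodes visited, root first
    chosen = path_ids + [small_set_id]         # child followed at each of those nodes
    for node_id, child_id in zip(nodes, chosen):
        node = meta_of(node_id)
        for child in node.children:
            if child.meta_id == child_id:
                child.count += delta           # keep subtree item counts consistent
                break
        save(node_id, node)                    # persist the updated metadata node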
After adding an item into the small set, the small set may reach the size limit l_max, or a
full metadata node may be found on the path. If the small set reaches the limit l_max, it is
split into two smaller sets in the function checkSplit: a new small set is created with the
second half of the items from the original small set. The ID (key) of the new small set in the
Small Set Service is generated using the IDGenerator service. This gives small sets
auto-increasing integer keys, which efficiently exploit ZDB from Chapter 3 when persisting
data. Fig 5.5 shows an example of splitting a small set. This operation always ensures that
every item in the next small set is greater than every item in the current small set, so we
can easily traverse backward and forward through the big set from a specific position in a
small set. This is a characteristic of a B+Tree, and it is easy to implement on a key-value
store.
Fig. 5.5: Split small set example
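A sketch of the small-set split performed inside checkSplit; the dict layout with 'items' (a sorted list of key-value pairs) and 'next_set_id' is an assumed stand-in for the real small set record, and the new set ID is expected to come from the IDGenerator service.

def split_small_set(small_set, new_set_id):
    """Move the second half of a full small set into a new small set and keep the
    forward link so range scans can still walk the leaves in key order."""
    items = small_set["items"]                 # kept sorted by item key
    mid = len(items) // 2
    new_set = {
        "items": items[mid:],                  # every moved item is greater than the kept ones
        "next_set_id": small_set["next_set_id"],
    }
    small_set["items"] = items[:mid]
    small_set["next_set_id"] = new_set_id      # the old set now points to the new set
    return new_set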
If a full metadata node is found on the insertion path, that node is split in the Metadata
Service using Algorithm 6 or the algorithm presented in Fig 5.4, depending on whether the
metadata node is the root or not. The ID (key) of the new metadata node in the Metadata
Service is also generated using IDGenerator. This makes the keys of the Metadata Service
auto-increasing integers, which also exploit ZDB efficiently, as in the Small Set Service.
The algorithm for the Remove function of the Big Set Service is similar to the algorithm for
adding an item into a big set; however, it omits the checkSplit step.
5.4.5.2 Query Items from Big Set Algorithms
Big Set Service provides several types of queries on a big set:
• GetSlice: query n items starting from a specific position s
• GetSliceFromItem: query nq items starting from an item key
• RangeQuery: query items in a range between two item keys, such as [startItemKey, endItemKey)
We developed two approaches for these query types. The first approach is suitable for
sequential processing in a big set. Since all items in a big set are sorted by item key, and
given the design of the Metadata Service and the Small Set Service presented above, these
query types can be answered by finding the starting small set ID of the result. For
GetSliceFromItem and RangeQuery, the first small set ID is obtained by the Traversal
Algorithm; GetSlice uses Algorithm 8 to find the first small set ID. From this small set,
items are retrieved from the Small Set Service: matching items from this small set are added
to the result, and items in the following small sets (reached through the nextSetID field in
the small set information) are retrieved until the requested number of items is collected,
the upper bound of the query is reached, or the end of the big set is reached.
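A sketch of this sequential approach for GetSliceFromItem; load_small_set is an assumed helper returning the dict layout used in the small-set split sketch, and traversal is the earlier traversal sketch.

def get_slice_from_item(meta_of, load_small_set, root_meta_id, start_key, n):
    """Collect up to n (key, value) items starting from start_key by walking
    the linked small sets in key order."""
    _, set_id = traversal(meta_of, root_meta_id, start_key)
    result = []
    while set_id is not None and len(result) < n:
        sset = load_small_set(set_id)
        for key, value in sset["items"]:
            if key >= start_key:               # skip items before the starting key
                result.append((key, value))
                if len(result) == n:
                    break
        set_id = sset["next_set_id"]           # follow the link to the next leaf
    return result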
The second approach is suitable for parallel processing in a big set. For each query, we first
determine all relevant small set IDs using the Metadata Service. Each metadata element
describes a branch of the B+Tree: the metadata elements in the children of TSetMeta carry the
fields minItem, the lower bound of the item keys stored in that branch, and count, the total
number of items it stores. These fields are used to collect all small set IDs needed to
process the query efficiently; the data of these small sets can then be queried in parallel
from the Small Set Service.
Algorithm 8 Find first Small Set ID for GetSlice
Input: p, rootMetaID
Output: smallSetID, offset
metaID ← rootMetaID
while TRUE do
mt ← metaOf(metaID)
x ← argmax x such that Σ_{i=0}^{x} mt.children[i].count ≤ p
if x > mt.children.size() - 1 then
smallSetID ← -1 // query out of range
break
else if mt.level = 0 then
smallSetID ← mt.children[x+1]
offset ← p − Σ_{i=0}^{x} mt.children[i].count
break
else if mt.level > 0 then
metaID ← mt.children[x+1]
p ← p − Σ_{i=0}^{x} mt.children[i].count
end if
end while
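A Python sketch equivalent to Algorithm 8 with 0-based positions, skipping whole subtrees by their item counts; it reuses the illustrative MetaNode/MetaItem stand-ins and the meta_of helper from the traversal sketch.

def find_first_small_set(meta_of, root_meta_id, p):
    """Return (small_set_id, offset) of the small set holding global position p,
    or (-1, 0) if p is beyond the end of the big set."""
    meta_id = root_meta_id
    while True:
        meta = meta_of(meta_id)
        remaining = p
        for child in meta.children:            # skip whole branches by their counts
            if remaining < child.count:
                break
            remaining -= child.count
        else:
            return -1, 0                       # position out of range
        if meta.level == 0:
            return child.meta_id, remaining    # offset of p inside this small set
        meta_id = child.meta_id                # descend into the chosen branch
        p = remaining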
5.4.6 General key-value store using Forest of Distributed B+Tree
With the algorithms and architecture described above, we can efficiently store big sets in a
commodity distributed key-value store. ZDB in Chapter 3 optimizes the engine for key-value
stores with auto-increasing integer keys; other key types such as binary keys are not
supported directly. In this work, items in big sets can be key-value pairs, and a small set
can store multiple items. Keys in both the Small Set Service and the Metadata Service are
auto-increasing integers. Each big set can therefore act as a general key-value store that
fully supports binary or string keys. The key-value pairs in big sets are both ordered and
distributed. This differs from consistent hashing distribution, where keys are placed by hash
value: keys under consistent hashing are unordered, which makes range queries difficult.
Although small sets and metadata nodes are distributed over key-value stores using consistent
hashing, the items in the proposed Big Set are ordered by item key, which supports range
queries easily in a distributed environment. This is effectively order-preserving distribution
[51] built on consistent hashing. Moreover, the Forest of Distributed B+Tree can store a large
number of big sets, so the new general key-value store is scalable and elastic.
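An illustrative wrapper showing how a single big set can be used as an ordered general key-value store; the method names addItem and getSliceFromItem follow the operations described in this chapter, but the exact client API is an assumption.

class GeneralKVStore:
    """One big set used as an order-preserving key-value store (sketch only)."""

    def __init__(self, big_set):
        self.big_set = big_set                 # client of the Big Set Service (assumed)

    def put(self, key: bytes, value: bytes):
        # each key-value pair is one item of the big set, kept sorted by key
        return self.big_set.addItem((key, value))

    def get(self, key: bytes):
        # exact lookup via the first item at or after `key`
        hits = self.big_set.getSliceFromItem(key, 1)
        if hits and hits[0][0] == key:
            return hits[0][1]
        return None

    def range(self, start: bytes, end: bytes, limit: int = 1000):
        # ordered leaves make range queries a sequential scan over small sets
        return [(k, v) for k, v in self.big_set.getSliceFromItem(start, limit)
                if k < end]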
5.5 Evaluation
In this section, we present the workload design and the configuration used to benchmark and
compare the proposed big set with existing popular solutions: Redis [72] and Cassandra [48].
The main purpose of the benchmarks is to verify the efficiency of the proposed architecture
for big set storage problems.
We measured the number of operations per second (ops) and the scalability of the system. We
used the methodology of the Yahoo! Cloud Serving Benchmark (YCSB) [20] to generate workloads
for the benchmarks. The four workloads are described below:
• Insertion workload: all clients generate and add new items into big sets.
• Read Only workload: all clients check the existence of items in the big set.
• Mixed workload R: 90% of client operations read items from the big set, 10% of client
operations update values in the big set.
• Mixed workload W: 10% of client operations read items from the big set, 90% of client
operations update values in the big set.
In these benchmark workloads, we used 8 storage servers and 4 client machines with the
configuration below:
Operating System: Ubuntu Server 64-bit
CPU: Intel Xeon Quad core
Memory: 64 GB DDR
File system: ext4 on 600 GB SSD
Network: Wired 1 Gbps
We first deployed Redis on the 8 servers. The Redis Sorted Set (ordered set) type is used in
this benchmark, and items are inserted with the zadd command. Redis consumes about 1.2 GB of
main memory to store 10 million items generated by the Insertion Workload in a single sorted
set. Because all Redis data resides in main memory, a server with 64 GB of RAM can handle only
about fifty-four such big sets (each containing 10 million items), no matter how large the SSD
or HDD of the server is.
We then cleaned up the servers and deployed Cassandra on them. Columns in Cassandra are used
to store set data. Cassandra stores data on disk (SSTables and commit log) efficiently, so its
capacity depends only on disk space. In the Insertion Workload, we inserted big sets of 500
million items (one row per big set, with each column storing an item), and Cassandra can store
big sets of up to 2 billion items. However, all items of a big set are stored in, and must fit
on, a single server.
Finally, instances of ZDBService, the persistent layer of the Big Set service, were deployed
on the 8 servers; the Small Set Service was deployed on 4 servers and the Metadata Service on
4 servers. We also inserted big sets of 500 million items into the proposed Big Set; moreover,
it can store big sets of 4 billion items. Items of a big set are distributed across the
ZDBService instances deployed on the servers.
Items in both Cassandra and the proposed Big Set in this benchmark are key-value pairs (a
simple column in Cassandra is a key-value pair), and each big set in Cassandra is one wide
row. Each item in Redis is a string of the same size as a key-value pair in Cassandra.
Tab. 5.1: Other capability comparison

Capability                        Cassandra       Redis   Big Set
Supports billions of items/set    Yes (max 2B)    No      Yes (4B tested)
Range query by item               Yes             No      Yes
Range query by position           No              Yes     Yes
Item distributability in a set    No              No      Yes
On each client machine, we run 16 processes for each workload.
Fig 5.6 shows the write benchmark throughput. Cassandra has the best performance in the
write-only workload. The proposed Big Set has slightly lower performance than Redis sorted
sets. All of these storage systems scale linearly as the number of servers increases.
Fig 5.7 shows the read-only benchmark throughput. Despite its good write performance,
Cassandra's read operations per second are much lower than those of Redis and the proposed Big
Set. The mixed workloads show the same pattern, as presented in Fig. 5.8 and Fig. 5.9: the
proposed Big Set and Redis outperform Cassandra in the read and mixed workloads.
The capabilities of these storage systems are also compared in Table 5.1 and Fig 5.10.
Fig. 5.6: Write only throughput
Fig. 5.7: Read only throughput
Fig. 5.8: Mixed Workload R
Fig. 5.9: Mixed Workload W
Fig. 5.10: Capacity comparison
5.6 Discussion
As the results above show, Cassandra performs well for write operations; its write throughput
is the highest of the three systems. The reason is that write operations in Cassandra are
always sequential writes to commit-log files plus updates to a bounded in-memory data
structure called the MemTable. In the read-only and mixed workloads, however, Cassandra does
not perform as well as in the write-only workload. On the read path, Cassandra first looks up
the MemTable and then looks up the on-disk SSTable(s), using a Bloom filter, to read columns.
The Bloom filter in Cassandra uses only the row key, so it is not efficient when a wide row is
used as a big set. Some work has been done to improve the read performance of Cassandra, such
as [57], which adds a row cache layer on SSD. In this benchmark we used SSDs, and Cassandra's
read performance was still not good enough. Another disadvantage of using Cassandra for the
big set problem is that it is difficult to count how many items are in a big set, while the
proposed distributed modified B+Tree can count the total number of items with a single read of
the root metadata node in the Metadata Service.
For sets small enough to fit entirely in main memory, Redis performs well for both read and
write operations: it manages in-memory data with efficient data structures and is well suited
to small caches. For large data sets that do not fit into main memory, Redis cannot store
everything; additional writes to a Redis server with full memory may lead to data loss due to
its eviction policy. This explains the benchmark results above, which show that Redis's
capacity is significantly smaller than that of both the proposed Big Set and Cassandra.
The Forest of Distributed B+Tree for big set problems is a trade-off between read/write
performance and capacity. It has good throughput for both read and write operations. Its
capacity is proportional to the size of the file system, while main memory is used to cache
hot small sets and hot metadata nodes in the Small Set Service and the Metadata Service. In
the workloads above, its performance is slightly lower than Redis (when the sets are small or
medium-sized), but it is better at storing big sets: the proposed work can store bigger sets
than both Cassandra and Redis. Moreover, each big set stored with the proposed method is
naturally distributed over multiple servers, taking advantage of the key-value store, and in
particular of ZDB, since keys in the Small Set Service and the Metadata Service are generated
as auto-increasing integers. Based on the principles of this architecture, we can efficiently
build other scalable distributed key-value stores or NoSQL databases for big data structures.
In the previous sections, we presented the proposed big set architecture. Data is stored in
two main services: the Small Set Service and the Metadata Service. In the implementation,
these services can be unified by modeling small sets and metadata with object-oriented
programming. Small set IDs and metadata IDs can be generated by the same ID generator, and
both kinds of data can use the same distributed key-value store farm. When ZDB is used, it is
easy to scale the data out to many servers. Last but not least, this design can be used with
any high-performance key-value store.
5.7 Applications of ZDB and Forest of Distributed B+Tree
The results of this chapter are used in many projects, in both production and research. The
most recent research project is "Unsupervised Anomaly Detection in Online Game". Online gaming
is one of the most successful businesses on the Internet. As the online game business grows,
in-game cheating becomes widespread and is the biggest challenge for online game systems. In
this research, we investigate the application of anomaly detection techniques to cheating
detection in an online game (JX2) of the VNG company. A method to evaluate the performance of
unsupervised anomaly detection techniques was proposed, and six unsupervised anomaly detection
algorithms were tested. In the computing architecture of this research, the distributed B+Tree
is used as the main data storage for all player actions.
5.7.1 Computing Architecture for Anomaly Detection System
Achieving high performance is critical in a real anomaly detection system. We used the
MapReduce framework [23] to increase the performance of the system. Moreover, an in-house
storage system was implemented to store the historical actions of game players. The in-house
storage is based on the Forest of Distributed B+Tree built on the ZDB key-value store from
Chapter 3. Each user has a long list of actions that can be queried by position or by time
range. This long list is stored on multiple servers and managed by a distributed B+Tree. The
computing architecture is shown in Fig 5.11.

Fig. 5.11: Computing System Architecture.
In this system, raw logs from the game servers are pushed daily to the Log Processor
component. The log data is processed, and information about the interactions between users is
extracted. The user information is then saved into the Log Store Service, which is built on a
distributed structured storage called Big Set, itself built on ZDB. Each item of the Big Set
is a key-value pair. This component increases the availability and scalability of the
structured storage system. The core component of the architecture is the Anomaly Detection
Service, which applies anomaly detection techniques to find abnormal users. The anomaly
detection component was designed so that new detection techniques can easily be added to the
system. Finally, the detected results are stored in a database and visualized for
post-analysis. The result of this research is presented in the sixth publication of this
thesis.
5.7.2 Specific Storage Solution for Specific Structured Data
ZDB from Chapter 3 and the Data Storage Framework from Subsection 1.3 are applied in many
backend systems of Zing Me, CSM, UBus, etc. Most of the data of Zing Me users, such as feeds,
friend lists and user profiles, is stored in ZDB. One of the most important components of Zing
Me, the Friend Recommendation System, uses ZDB to store its data. This problem is often viewed
as the link prediction problem in a network. The component is the result of a research project
and is presented in the fifth publication of this thesis. In this research, various kinds of
user information, both direct and indirect, are stored in ZDB and exploited to predict future
friendships (links) of Zing Me users.
Formal Definitions The social network is modeled as a directed graph G = (V, E) in which each
vertex u ∈ V represents a user in the network and each edge e_{u,v} ∈ E represents the
relationship "v is a friend of u". Because friendship in the ZingMe social network is
symmetric, if e_{u,v} exists then e_{v,u} also exists. The weight c_{u,v} represents the
strength of the friendship between u and v, and t_{u,v} (or t_e) is the time when the edge
e_{u,v} is established.
For times t, t_1, t_2, let G[t] ⊆ G and G[t_1, t_2] ⊆ G consist of the edges with t_e ≤ t and
t_1 ≤ t_e ≤ t_2, respectively. The link prediction problem for social networks can be defined
as follows: given times t_0 < t'_0 ≤ t_1 < t'_1, based on G[t_0, t'_0], we want to output a
list of edges which are not in G[t_0, t'_0] and are predicted to appear in G[t_1, t'_1] [49].
Knowledge Sources and their storage The proposed friend suggestion approach uses three
knowledge sources: the network structure; attributes of vertices, for example age, sex or
education; and attributes of edges, for example interactions such as a user commenting on a
picture, liking a page or sending a message.
Specific features are derived from these knowledge sources under the assumption that, if two
vertices are not connected, the more mutual information they have, the higher the probability
that they will become connected; and if they are already connected, the mutual information
reflects the strength of their connection, which is greater when they share more mutual
information. The type and reliability of the data are also considered when choosing these
features.
For the graph structure, the derived feature is the number of mutual adjacent vertices, equal
to |N_u ∩ N_v|. Data from ZingMe and Facebook [8] also show that more than half of new
friendships are established between users having mutual friends.
For the attributes of vertices, the selected features are the number of mutual schools, the
number of mutual groups and the number of mutual IP addresses. For the attributes of edges,
the chosen feature is the number of "mutual interactions". For instance, if two users like the
same picture, that can be considered a mutual interaction between them. There are many types
of mutual interactions, such as users being tagged in the same pictures or commenting on the
same posts; it might be better to consider each type of mutual interaction as an individual
feature, but they are treated as one because of the computing capacity of the real system. Our
approach reuses some features from the current approach, including the number of mutual
adjacent vertices, the number of mutual schools and the number of mutual IP addresses. It also
adds features derived from groups and interactions of users, which were neglected in the
current approach. It eliminates the feature related to users' companies because of its
unreliability: in ZingMe, users provide the names of their companies as free text without any
standardization, and by manually examining those names, we found them unreliable.
All these knowledge sources are stored in backend services using ZDB from Chapter 3, the Data
Storage Framework from Subsection 1.3 and the Forest of Distributed B+Tree. These backend
services are scalable in practice and meet the high-performance, low-latency requirements of
real products.
5.8 Summary
The proposed Forest of Distributed B+Tree has many advantages for building various data
storage services. Every big set, as a value in a key-value store, is split into multiple small
sets. The small sets are managed in a storage backend service that uses a memory cache for
fast access and a distributed key-value store to persist data. A Metadata Service is designed
on top of the key-value store to manage the big set structure as a distributed B+Tree. This
supports building scalable data storage systems for big data structures. While other works
often build a key-value store using a B+Tree, this research builds a distributed B+Tree on top
of a key-value store from scratch. This work can easily distribute big values such as big sets
and big dictionaries across multiple servers using the advantages of the distributed key-value
store, a feature that systems such as Cassandra and Redis do not yet support. The proposed
distributed modified B+Tree has relatively high performance in both read and write operations
and the capability to store big data structures distributively. The complexity of the
operations in the proposed Big Set is O(log(N)). The proposed work is used to store data for
friend suggestion in the Zing Me social network and to store time series of user actions for
the game data anomaly detection system for VNG's games. It can be applied to storing big data
for many kinds of applications and systems.
Conclusion and Future works
The development from RDBMS to NoSQL is an inevitable trend for building storage systems for a
wide range of applications such as text processing, data warehouses, stream processing and
scientific databases. The key-value store is one category of NoSQL database and plays an
important role in building large-scale applications. The performance of a distributed
key-value store consists of multiple metrics, which depend on the architecture and algorithms
inside the key-value store engine.
To achieve a high-performance distributed key-value store and make it possible to efficiently
store various data types, we have to face several problems:
• How to minimize latency in every operation of a persistent key-value store, and how to
optimize the index of the persistent key-value engine for one of the most popular key types.
• How to store large values in key-value stores? How to manage a large number of big files in
cloud storage with a distributed key-value store and minimal metadata per file?
• How to store big data structures in key-value stores? How to manage a large number of big
sets or big data structures while they are modified and queried frequently?
This thesis first presents a method to build a high-performance key-value store on a single
node; data can be distributed over multiple servers using consistent hashing. Every write
operation can be done sequentially on the disk drive and every read operation needs at most
one disk seek. All writes can be configured to be sequential to achieve the best write
performance on both SSD and HDD. The proposed key-value store uses a shared-memory flat index
to quickly look up the position of a key-value pair in the data file without unnecessary disk
access. The proposed key-value store is optimized for auto-increasing integer keys, one of the
most popular key types. A high-performance key-value store called Zing Database (ZDB) is
implemented. The results are presented in the first two publications of this thesis.
Secondly, this thesis proposes an architecture for building big-file cloud storage based on a
key-value store. It takes advantage of the proposed key-value store to minimize the size of
metadata when the system manages a large number of big files serving millions of users. To
store big files (big values) in the key-value store, every file has metadata of the same size.
Each big file is split into multiple fixed-size chunks and stored in ZDB. The chunks of a file
have a contiguous ID range, so it is easy to distribute the data and scale out the storage
system, especially when using ZDB. This thesis thus proposes a method to store big files
efficiently using the advantages of a key-value store.
Finally, this thesis proposes the Forest of Distributed B+Tree based on a key-value store.
This result efficiently converts a binary key space into an auto-increasing integer key space.
It is useful for building scalable NoSQL data storage for large data structures such as big
sets and wide-column data. Data is distributed in the key-value store automatically, which
makes it easy to scale the system. Every big set, as a value in a key-value store, is split
into multiple small sets that are stored in the distributed ZDB key-value store. It supports
building scalable data storage systems for big data structures. The experimental results show
that the Forest of Distributed B+Tree has good performance in both read and write operations.
Moreover, it is a general key-value store that supports binary keys efficiently and is
order-preserving.
In the future, we will continue to extend and research storage architectures for big data. We
will first focus on data storage systems that support computation in data mining systems more
efficiently, such as large time series storage. With the "Internet of Things" trend, data from
millions of sensors, each producing multiple long time series, needs to be stored for
efficient querying and mining. Second, we will research how to make our storage systems more
secure in a network environment, so that they are not only high-performance but also secure.
Publications
[1] Thanh Trung Nguyen, Minh Hieu Nguyen. “ZDB-High performance key-value store.”
In Proceedings of the 2013 Third World Congress on Information and Communication
Technologies (WICT 2013)
[2] Thanh Trung Nguyen, Minh Hieu Nguyen. “Zing Database: high-performance key-
value store for large-scale storage service.” Vietnam Journal of Computer Science, Febru-
ary 2015, Volume 2, Issue 1, pp 13-23
[3] Thanh Trung Nguyen, Tin Khac Vu, Minh Hieu Nguyen. “BFC: High-Performance
Distributed Big-File Cloud Storage Based On Key-Value Store” in Proceeding of 16th
IEEE/ACIS International Conference on Software Engineering, Artificial Intelligence, Net-
working and Parallel/Distributed Computing (SNPD 2015).
[4] Thanh Trung Nguyen, Anh Tuan Nguyen, Tuan Anh Ha Nguyen, Ly Thi Vu, Quang Uy
Nguyen, and Long Dao Hai. 2015. “Unsupervised Anomaly Detection in Online Game.”
In Proceedings of the Sixth International Symposium on Information and Communication
Technology (SoICT 2015). ACM, New York, NY, USA, 4-10. DOI=
10.1145/2833258.2833305
[5] Thanh Trung Nguyen, Minh Hieu Nguyen. “Forest of Distributed B+Tree Based On
Key-Value Store for Big-Set Problem.” In Database Systems for Advanced Applications
Volume 9645 of the series Lecture Notes in Computer Science pp 268-282, 2016
[6] Thanh Trung Nguyen, Minh Hieu Nguyen. “Distributed and High Performance Big-
File Cloud Storage Based On Key-Value Store” International Journal of Networked and
Distributed Computing, Volume 4, Issue 3, pp 159 - 172, July 2016
Bibliography
[1] Daniel Abadi, Rakesh Agrawal, Anastasia Ailamaki, Magdalena Balazinska, Philip A.
Bernstein, Michael J. Carey, Surajit Chaudhuri, Jeffrey Dean, AnHai Doan, Michael J.
Franklin, Johannes Gehrke, Laura M. Haas, Alon Y. Halevy, Joseph M. Heller-
stein, Yannis E. Ioannidis, H. V. Jagadish, Donald Kossmann, Samuel Madden,
Sharad Mehrotra, Tova Milo, Jeffrey F. Naughton, Raghu Ramakrishnan, Volker
Markl, Christopher Olston, Beng Chin Ooi, Christopher Ré, Dan Suciu, Michael
Stonebraker, Todd Walter, and Jennifer Widom. The beckman report on database
research. SIGMOD Rec., 43(3):61–70, December 2014. ISSN 0163-5808. doi:
10.1145/2694428.2694441. URL
[2] Divyakant Agrawal, Sudipto Das, and Amr El Abbadi. Big data and cloud comput-
ing: current state and future opportunities. In Proceedings of the 14th International
Conference on Extending Database Technology, pages 530–533. ACM, 2011.
[3] Marcos K Aguilera, Wojciech Golab, and Mehul A Shah. A practical scalable dis-
tributed b-tree. Proceedings of the VLDB Endowment, 1(1):598–609, 2008.
[4] Ashok Anand, Chitra Muthukrishnan, Steven Kappes, Aditya Akella, and Suman
Nath. Cheap and Large CAMs for High Performance Data-Intensive Networked Sys-
tems. In NSDI, volume 10, pages 29–29, 2010.
[5] David G Andersen, Jason Franklin, Michael Kaminsky, Amar Phanishayee, Lawrence
Tan, and Vijay Vasudevan. FAWN: A fast array of wimpy nodes. In Proceedings of the
ACM SIGOPS 22nd symposium on Operating systems principles, pages 1–14. ACM,
2009.
[6] NoSQL Archive. NoSQL DEFINITION: Next Generation Databases mostly address-
ing some of the points: being non-relational, distributed, open-source and horizontally
scalable. 2016. Accessed September 4, 2016.
[7] Marcos D. Assunção, Rodrigo N. Calheiros, Silvia Bianchi, Marco A. S. Netto, and
Rajkumar Buyya. Big data computing and clouds: Trends and future directions. Journal
of Parallel and Distributed Computing, 79:3–15, 2015.
[8] Lars Backstrom and Jure Leskovec. Supervised random walks: predicting and rec-
ommending links in social networks. In Proceedings of the fourth ACM international
conference on Web search and data mining, pages 635–644. ACM, 2011.
[9] Anirudh Badam, KyoungSoo Park, Vivek S Pai, and Larry L Peterson. HashCache:
Cache Storage for the Next Billion. In NSDI, volume 9, pages 123–136, 2009.
[10] Doug Beaver, Sanjeev Kumar, Harry C Li, Jason Sobel, Peter Vajgel, et al. Finding a
Needle in Haystack: Facebook’s Photo Storage. In OSDI, volume 10, pages 1–8, 2010.
[11] Dhruba Borthakur. HDFS architecture guide. Hadoop Apache Project, page 53, 2008.
[12] Eric Brew. CAP Twelve Years Later: How the ”Rules” Have Changed.
infoq.com/articles/cap-twelve-years-later-how-the-rules-have-changed,
2012. Accessed December 1st , 2014.
[13] Eric A Brewer. Towards robust distributed systems. In PODC, page 7, 2000.
[14] Eric A Brewer. Lessons from giant-scale services. Internet Computing, IEEE, 5(4):
46–55, 2001.
[15] Mike Burrows. The chubby lock service for loosely-coupled distributed systems. In
Proceedings of the 7th symposium on Operating systems design and implementation,
pages 335–350. USENIX Association, 2006.
[16] Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C Hsieh, Deborah A Wallach,
Mike Burrows, Tushar Chandra, Andrew Fikes, and Robert E Gruber. Bigtable:
A distributed storage system for structured data. ACM Transactions on Computer
Systems (TOCS), 26(2):4, 2008.
[17] Laura Chappell and Gerald Combs. Wireshark network analysis: the official Wire-
shark certified network analyst study guide. Protocol Analysis Institute, Chappell
University, 2010.
[18] Peter M Chen, David Patterson, et al. Storage performance-metrics and benchmarks.
Proceedings of the IEEE, 81(8):1151–1165, 1993.
[19] Douglas Comer. Ubiquitous B-tree. ACM Computing Surveys (CSUR), 11(2):121–137,
1979.
[20] Brian F Cooper, Adam Silberstein, Erwin Tam, Raghu Ramakrishnan, and Russell
Sears. Benchmarking cloud serving systems with YCSB. In Proceedings of the 1st
ACM symposium on Cloud computing, pages 143–154. ACM, 2010.
[21] Bin Cui, Hong Mei, and Beng Chin Ooi. Big data: the driver for innovation in
databases. National Science Review, 1(1):27–30, 2014.
[22] Jeff Dean. Software engineering advice from building large-scale distributed systems,
2007.
[23] Jeffrey Dean and Sanjay Ghemawat. Mapreduce: simplified data processing on large
clusters. Communications of the ACM, 51(1):107–113, 2008.
[24] Biplob Debnath, Sudipta Sengupta, and Jin Li. FlashStore: high throughput persis-
tent key-value store. Proceedings of the VLDB Endowment, 3(1-2):1414–1425, 2010.
[25] Biplob Debnath, Sudipta Sengupta, and Jin Li. SkimpyStash: RAM space skimpy
key-value store on flash-based storage. In Proceedings of the 2011 ACM SIGMOD
International Conference on Management of data, pages 25–36. ACM, 2011.
[26] Giuseppe DeCandia, Deniz Hastorun, Madan Jampani, Gunavardhan Kakulapati,
Avinash Lakshman, Alex Pilchin, Swaminathan Sivasubramanian, Peter Vosshall, and
Werner Vogels. Dynamo: amazon’s highly available key-value store. In SOSP, vol-
ume 7, pages 205–220, 2007.
[27] Idilio Drago, Marco Mellia, Maurizio M Munafo, Anna Sperotto, Ramin Sadre, and
Aiko Pras. Inside dropbox: understanding personal cloud storage services. In Proceed-
ings of the 2012 ACM conference on Internet measurement conference, pages 481–494.
ACM, 2012.
[28] Idilio Drago, Enrico Bocchi, Marco Mellia, Herman Slatman, and Aiko Pras. Bench-
marking personal cloud storage. In Proceedings of the 2013 conference on Internet
measurement conference, pages 205–212. ACM, 2013.
[29] Dropbox. Dropbox Tech Blog. https://tech.dropbox.com/, 2014. Accessed October
28, 2014.
[30] FIPS PUB 197: Advanced Encryption Standard (AES). National Institute of Standards and
Technology, 2001.
[31] Brad Fitzpatrick. A distributed memory object caching system.
com/memcached/, 2013. Accessed September 4, 2013.
[32] Peter Géczy. Big data characteristics. The Macrotheme Review, 3(6):94–104, 2014.
[33] Sanjay Ghemawat and Jeff Dean. LevelDB is a fast key-value storage library written
at Google that provides an ordered mapping from string keys to string values. https:
//github.com/google/leveldb, 2014. Accessed November 2, 2014.
[34] Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung. The Google file system. In
ACM SIGOPS Operating Systems Review, volume 37, pages 29–43. ACM, 2003.
[35] Jim Gray. Microsoft SQL Server. January 1997. URL
com/apps/pubs/default.aspx?id=68492.
[36] Jim Gray and Andreas Reuter. Transaction Processing: Concepts and Techniques.
Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1st edition, 1992. ISBN
1558601902.
[37] Jim Gray et al. The transaction concept: Virtues and limitations. In VLDB, vol-
ume 81, pages 144–154, 1981.
[38] Yunhong Gu and Robert L Grossman. UDT: UDP-based data transfer for high-speed
wide area networks. Computer Networks, 51(7):1777–1799, 2007.
[39] Patrick Hunt, Mahadev Konar, Flavio P Junqueira, and Benjamin Reed. ZooKeeper:
wait-free coordination for internet-scale systems. In Proceedings of the 2010 USENIX
conference on USENIX annual technical conference, volume 8, pages 11–11, 2010.
[40] Google Inc. LevelDB - A fast and lightweight key/value database library by Google.
2013. Accessed July 23, 2013.
[41] Doug Judd. Hypertable.
hypertable\_nosql.pdf, 2009.
[42] Stephen Kaisler, Frank Armour, Juan Antonio Espinosa, and William Money. Big
data: Issues and challenges moving forward. In System Sciences (HICSS), 2013 46th
Hawaii International Conference on, pages 995–1004. IEEE, 2013.
[43] Robert Kallman, Hideaki Kimura, Jonathan Natkins, Andrew Pavlo, Alexander Rasin,
Stanley Zdonik, Evan P. C. Jones, Samuel Madden, Michael Stonebraker, Yang Zhang,
John Hugg, and Daniel J. Abadi. H-Store: a High-Performance, Distributed Main
Memory Transaction Processing System. Proc. VLDB Endow., 1(2):1496–1499, 2008.
ISSN 2150-8097. doi: 10.1145/1454159.1454211. URL
edu/papers/hstore-demo.pdf.
[44] David Karger, Eric Lehman, Tom Leighton, Rina Panigrahy, Matthew Levine, and
Daniel Lewin. Consistent hashing and random trees: Distributed caching protocols
for relieving hot spots on the World Wide Web. In Proceedings of the twenty-ninth
annual ACM symposium on Theory of computing, pages 654–663. ACM, 1997.
[45] David Karger, Alex Sherman, Andy Berkheimer, Bill Bogstad, Rizwan Dhanidina,
Ken Iwamoto, Brian Kim, Luke Matkins, and Yoav Yerushalmi. Web caching with
consistent hashing. Computer Networks, 31(11):1203–1213, 1999.
[46] FAL Labs. Kyoto Cabinet: a straightforward implementation of DBM. http://
fallabs.com/kyotocabinet, 2013. Accessed May 1, 2013.
[47] Eric Lai. No to SQL? Anti-database movement gains steam. http:
//www.computerworld.com/article/2526317/database-administration/
no-to-sql--anti-database-movement-gains-steam.html, 2009.
[48] Avinash Lakshman and Prashant Malik. Cassandra: a decentralized structured stor-
age system. ACM SIGOPS Operating Systems Review, 44(2):35–40, 2010.
[49] David Liben-Nowell and Jon Kleinberg. The Link Prediction Problem for Social
Networks. In Proceedings of the Twelfth International Conference on Information
and Knowledge Management, volume 58 of CIKM ’03, pages 556–559, New York,
NY, USA, 2003. ACM. ISBN 1-58113-723-0. doi: 10.1145/956863.956972. URL
[50] Hyeontaek Lim, Bin Fan, David G Andersen, and Michael Kaminsky. SILT: A
memory-efficient, high-performance key-value store. In Proceedings of the Twenty-
Third ACM Symposium on Operating Systems Principles, pages 1–13. ACM, 2011.
[51] Witold Litwin, Marie-Anne Neimat, and Donovan Schneider. Rp*: A family of order
preserving scalable distributed data structures. In VLDB, volume 94, pages 12–15,
1994.
[52] Yandong Mao, Eddie Kohler, and Robert Tappan Morris. Cache craftiness for fast
multicore key-value storage. In Proceedings of the 7th ACM european conference on
Computer Systems, pages 183–196. ACM, 2012.
[53] Mapkeeper. MapKeeper. https://github.com/m1ch1/mapkeeper, 2014. Accessed
June 1, 2014.
[54] Ward Douglas Maurer and Theodore Gyle Lewis. Hash table methods. ACM Com-
puting Surveys (CSUR), 7(1):5–19, 1975.
[55] Nimrod Megiddo and Dharmendra S Modha. ARC: A Self-Tuning, Low Overhead
Replacement Cache. In FAST, volume 3, pages 115–130, 2003.
[56] Nimrod Megiddo and Dharmendra S Modha. Outperforming lru with an adaptive
replacement cache algorithm. Computer, 37(4):58–65, 2004.
[57] Prathyush Menon, Tilmann Rabl, Mohammad Sadoghi, and Hans-Arno Jacobsen.
Cassandra: An ssd boosted key-value store. In Data Engineering (ICDE), 2014 IEEE
30th International Conference on, pages 1162–1167. IEEE, 2014.
[58] Changwoo Min, Kangnyeon Kim, Hyunjin Cho, Sang-Won Lee, and Young Ik Eom.
SFS: Random write considered harmful in solid state drives. In Proc. of the 10th
USENIX Conf. on File and Storage Tech, 2012.
[59] Jeffrey C Mogul, Yee-Man Chan, and Terence Kelly. Design, Implementation, and
Evaluation of Duplicate Transfer Detection in HTTP. In NSDI, volume 4, pages 4–4,
2004.
[60] MySQL. Disadvantages of Creating Many Tables in the Same Database, 2015.
[61] Elizabeth J O’neil, Patrick E O’neil, and Gerhard Weikum. The lru-k page replace-
ment algorithm for database disk buffering. ACM SIGMOD Record, 22(2):297–306,
1993.
[62] Elizabeth J O’neil, Patrick E O’Neil, and Gerhard Weikum. An optimality proof of
the lru-k page replacement algorithm. Journal of the ACM (JACM), 46(1):92–112,
1999.
[63] Patrick O’Neil, Edward Cheng, Dieter Gawlick, and Elizabeth O’Neil. The log-
structured merge-tree (LSM-tree). Acta Informatica, 33(4):351–385, 1996.
[64] Oracle. Oracle Berkeley DB 12c: Persistent key value store.
com/technetwork/products/berkeleydb, 2013.
[65] John Ousterhout, Parag Agrawal, David Erickson, Christos Kozyrakis, Jacob Leverich,
David Mazières, Subhasish Mitra, Aravind Narayanan, Diego Ongaro, Guru Parulkar,
et al. The case for RAMCloud. Communications of the ACM, 54(7):121–130, 2011.
[66] Rasmus Pagh and Flemming Friche Rodler. Cuckoo hashing. Journal of Algorithms,
51(2):122–144, 2004.
[67] David Peleg and Avishai Wool. The availability of quorum systems. Information and
Computation, 123(2):210–223, 1995.
[68] Martin Placek and Rajkumar Buyya. A taxonomy of distributed storage systems.
Technical report, University of Melbourne, Distributed Systems and Grid Computing
Laboratory, 2006.
[69] FIPS PUB. Secure Hash Standard (SHS). 2012.
[70] William Pugh. Skip lists: a probabilistic alternative to balanced trees. Communica-
tions of the ACM, 33(6):668–676, 1990.
[71] Stephen M Rumble, Ankita Kejriwal, and John K Ousterhout. Log-structured memory
for DRAM-based storage. In FAST, pages 1–16, 2014.
[72] Salvatore Sanfilippo and Pieter Noordhuis. Redis. 2013. Accessed
June 07, 2013.
[73] Spencer Shepler, Mike Eisler, David Robinson, Brent Callaghan, Robert Thurlow,
David Noveck, and Carl Beame. Network file system (NFS) version 4 protocol. Net-
work, 2003.
[74] Konstantin Shvachko, Hairong Kuang, Sanjay Radia, and Robert Chansler. The
Hadoop Distributed File System. In Proceedings of the 2010 IEEE 26th Symposium on
Mass Storage Systems and Technologies (MSST), MSST ’10, pages 1–10, Washington,
DC, USA, 2010. IEEE Computer Society. ISBN 978-1-4244-7152-2. doi: 10.1109/
MSST.2010.5496972. URL
[75] Mark Slee, Aditya Agarwal, and Marc Kwiatkowski. Thrift: Scalable cross-language
services implementation. Facebook White Paper, 5, 2007.
[76] Benjamin Sowell, Wojciech Golab, and Mehul A Shah. Minuet: a scalable distributed
multiversion b-tree. Proceedings of the VLDB Endowment, 5(9):884–895, 2012.
[77] Jan Stanek, Alessandro Sorniotti, Elli Androulaki, and Lukas Kencl. A Secure Data
Deduplication Scheme for Cloud Storage. 2014.
[78] Michael Stonebraker and Rick Cattell. 10 rules for scalable performance in’simple
operation’datastores. Communications of the ACM, 54(6):72–80, 2011.
[79] Michael Stonebraker and Uğur Çetintemel. "One size fits all": an idea whose time
has come and gone. In Data Engineering, 2005. ICDE 2005. Proceedings. 21st Inter-
national Conference on, pages 2–11. IEEE, 2005.
[80] Michael Stonebraker, Chuck Bear, Uğur Çetintemel, Mitch Cherniack, Tingjian Ge,
Nabil Hachem, Stavros Harizopoulos, John Lifter, Jennie Rogers, and Stan Zdonik.
One size fits all? Part 2: Benchmarking results. In Proc. CIDR, 2007.
[81] Michael Stonebraker, Samuel Madden, Daniel J Abadi, Stavros Harizopoulos, Nabil
Hachem, and Pat Helland. The end of an architectural era:(it’s time for a complete
rewrite). In Proceedings of the 33rd international conference on Very large data bases,
pages 1150–1160. VLDB Endowment, 2007.
[82] Christof Strauch, Ultra-Large Scale Sites, and Walter Kriha. NoSQL databases. Lec-
ture Notes, Stuttgart Media University, 2011.
[83] Miklos Szeredi et al. Fuse: Filesystem in userspace. Accessed on, 2010.
[84] D. B. Terry, M. M. Theimer, Karin Petersen, A. J. Demers, M. J. Spreitzer, and
C. H. Hauser. Managing Update Conflicts in Bayou, a Weakly Connected Replicated
Storage System. In Proceedings of the Fifteenth ACM Symposium on Operating Sys-
tems Principles, SOSP ’95, pages 172–182, New York, NY, USA, 1995. ACM. ISBN
0-89791-715-4. doi: 10.1145/224056.224070. URL
224056.224070.
[85] Avadis Tevanian, Richard F Rashid, Michael Young, David B Golub, Mary R Thomp-
son, William J Bolosky, and Richard Sanzi. A UNIX Interface for Shared Memory
and Memory Mapped Files Under Mach. In USENIX Summer, pages 53–68. Citeseer,
1987.
[86] Tom van Dijk. Analysing and Improving Hash Table Performance. 2009.
[87] Robbert van Renesse and Fred B Schneider. Chain Replication for Supporting High
Throughput and Availability. In OSDI, volume 4, pages 91–104, 2004.
[88] Werner Vogels. Eventually consistent. Communications of the ACM, 52(1):40–44,
2009.
[89] Sage A. Weil, Scott A. Brandt, Ethan L. Miller, Darrell D. E. Long, and Carlos
Maltzahn. Ceph: A Scalable, High-performance Distributed File System. In Proceed-
ings of the 7th Symposium on Operating Systems Design and Implementation, OSDI
’06, pages 307–320, Berkeley, CA, USA, 2006. USENIX Association. ISBN 1-931971-
47-1. URL
[90] Sage A. Weil, Scott A. Brandt, Ethan L. Miller, and Carlos Maltzahn. CRUSH:
Controlled, Scalable, Decentralized Placement of Replicated Data. In Proceedings of
the 2006 ACM/IEEE Conference on Supercomputing, SC ’06, New York, NY, USA,
2006. ACM. ISBN 0-7695-2700-0. doi: 10.1145/1188455.1188582. URL
acm.org/10.1145/1188455.1188582.
[91] M Widenius, D Axmark, and M AB. MySQL 5.5 Reference Manual, 2014.
[92] Demetrios Zeinalipour-Yazti, Song Lin, Vana Kalogeraki, Dimitrios Gunopulos, and
Walid A Najjar. MicroHash: An Efficient Index Structure for Flash-Based Sensor
Devices. In FAST, volume 5, pages 3–3, 2005.
[93] Kai Zhang, Kaibo Wang, Yuan Yuan, Lei Guo, Rubao Lee, and Xiaodong Zhang.
Mega-kv: a case for gpus to maximize the throughput of in-memory key-value stores.
Proceedings of the VLDB Endowment, 8(11):1226–1237, 2015.