CAPITULO II MARCO TEÓRICO
2.12.2. La enfermedad o accidente no profesional
At this point, you may wonder what kind of indexing is taking place in HBase. We’ve talked about it in the last two chapters, but it becomes important when thinking about your table designs. The tall table versus wide table discussion is fundamentally a dis- cussion of what needs to be indexed and what doesn’t. Putting more information into the rowkey gives you the ability to answer some questions in constant time. Remember the read path and block index from chapter 2? That’s what’s at play here, enabling you to get to the right row quickly.
Only the keys (the Key part of the KeyValue object, consisting of the rowkey, col- umn qualifier, and timestamp) are indexed in HBase tables. Think of it as the primary key in a relational database system, but you can’t change the column that forms the primary key, and the key is a compound of three data elements (rowkey, column qual- ifier, and timestamp). The only way to access a particular row is through the rowkey. Index- ing the qualifier and timestamp lets you skip to the right column without scanning all the previous columns in that row. The KeyValue object that you get back is basically a row from the HFile, as shown in figure 4.10.
There are two ways to retrieve data from a table: Get and Scan. If you want a single row, you use a Get call, in which case you have to provide the rowkey. If you want to execute a Scan, you can choose to provide the start and stop rowkeys if you know them, and limit the number of rows the scanner object will scan.
r5 c1:v1 c3:v5 c7:v8 r4 c2:v4 c2:v3 c5:v6 r3 c1:v2 c3:v6 r2 c6:v2 c1:v9 c1:v1 r1 CF1 CF2 r1:CF1:c1:t1:v1 r2:CF1:c1:t2:v2 r2:CF1:c3:t3:v6 r3:CF1:c2:t1:v3 r4:CF1:c2:t1:v4 r5:CF1:c1:t2:v1 r5:CF1:c3:t3:v5 r1:CF2:c1:t1:v9 r1:CF2:c6:t4:v2 r3:CF2:c5:t4:v6 r5:CF2:c7:t3:v8 HFile for CF1 HFile for CF2
r5:cf2:c7:t3:v8 r5:CF1:c3:t3:v5 r5:CF1:c1:t2:v1
Result object returned for a Get() on row r5
KeyValue objects Value Cell Time Stamp Col Qual Col Fam Row Key Key Value Logical representation of an HBase table.
We'll look at what it means to Get() row r5 from this table. Actual physical storage of the table
Structure of a KeyValue object
Figure 4.10 Logical to physical translation of an HBase table. The KeyValue object represents a
single entry in the HFile. The figure shows the result of executing get(r5) on the table to retrieve
99
How to approach schema design
When you execute a Get, you can skip to the exact block that contains the row you’re looking for. From there, it scans the block to find the relevant KeyValue objects that form the row. In the Get object, you can also specify the column family and the col- umn qualifiers if you want. By specifying the column family, you can limit the client to accessing only the HFiles for the specified column families. Specifying the column qualifier doesn’t play a role in limiting the HFiles that are read off the disk, but it does limit what’s sent back over the wire. If multiple HFiles exist for a column family on a given region, all of them are accessed to find the components of the row you specify in the Get call. This access is regardless of how many HFiles contain data relevant to the request. Being as specific as possible in your Get is useful, though, so you don’t transfer unnecessary data across the wire to the client. The only cost is potential disk
I/O on the RegionServer. If you specify timestamps in your Get object, you can avoid reading HFiles that are older than that timestamp. Figure 4.11 illustrates this in a sim- ple table.
You can use this information to inform your table design. Putting data into the cell value occupies the same amount of storage space as putting it into the column quali- fier or the rowkey. But you can possibly achieve better performance by moving it up from the cell to the rowkey. The downside to putting more in the rowkey is a bigger block index, given that the keys are the only bits that go into the index.
You’ve learned quite a few things so far, and we’ll continue to build on them in the rest of the chapter. Let’s do a quick recap before we proceed:
■ HBase tables are flexible, and you can store anything in the form of byte[]. ■ Store everything with similar access patterns in the same column family.
■ Indexing is done on the Key portion of the KeyValue objects, consisting of the
rowkey, qualifier, and timestamp in that order. Use it to your advantage.
■ Tall tables can potentially allow you to move toward O(1) operations, but you trade atomicity.
■ De-normalizing is the way to go when designing HBase schemas.
Limit disk I/O Limit network I/O Limit rows Limit HFiles
Col qual Time stamp Col fam Rowkey Key X X X X X
Figure 4.11 Depending on what part of the key you specify, you can limit the amount of data you read off the disk or transfer over the network. Specifying the rowkey lets you read just the exact row you need. But the server returns the entire row to the client. Specifying the column family lets you further specify what part of the row to read, thereby allowing for reading only a subset of the HFiles if the row spans multiple families. Further specifying the column qualifier and timestamp lets you save on the number of columns returned to the client, thereby saving on network I/O.
■ Think how you can accomplish your access patterns in single API calls rather
than multiple API calls. HBase doesn’t have cross-row transactions, and you want to avoid building that logic in your client code.
■ Hashing allows for fixed-length keys and better distribution but takes away
ordering.
■ Column qualifiers can be used to store data, just like cells.
■ The length of column qualifiers impacts the storage footprint because you can put data in them. It also affects the disk and network I/O cost when the data is accessed. Be concise.
■ The length of the column family name impacts the size of data sent over the wire to the client (in KeyValue objects). Be concise.
Having worked through an example table design process and learned a bunch of con- cepts, let’s solidify some of the core ideas and look at how you can use them while designing HBase tables.
4.2
De-normalization is the word in HBase land
One of the key concepts when designing HBase tables is de-normalization. We’ll explore it in detail in this section. So far, you’ve looked at maintaining a list of the users an individual user follows. When a TwitBase user logs in to their account and wants to see twits from the people they follow, your application fetches the list of fol- lowed users and their twits, returning that information. This process can take time as the number of users in the system grows. Moreover, if a user is being followed by lots of users, their twits are accessed every time a follower logs in. The region hosting the popular user’s twits is constantly answering requests because you’ve created a read hot spot. The way to solve that is to maintain a twit stream for every user in the system and add twits to it the moment one of the users they follow writes a twit.
Think about it. The process for displaying a user’s twit stream changes. Earlier, you read the list of users they follow and then combined the latest twits for each of them to form the stream. With this new idea, you’ll have a persisted list of twits that make up a user’s stream. You’re basically de-normalizing your tables.
Normalization and de-normalization
Normalization is a technique in the relational database world where every type of re-
peating information is put into a table of its own. This has two benefits: you don’t have to worry about the complexity of updating all copies of the given data when an update or delete happens; and you reduce the storage footprint by having a single copy instead of multiple copies. The data is recombined at query time using JOIN clauses in SQL statements.
De-normalization is the opposite concept. Data is repeated and stored at multiple lo-
cations. This makes querying the data much easier and faster because you no longer need expensive JOIN clauses.
101
De-normalization is the word in HBase land
In this case, you can de-normalize by having another table for twit streams. By doing this, you’ll take away the read-scalability issue and solve it by having multiple copies of the data (in this case, a popular user’s twits) available for all readers (the users follow- ing the popular user).
As of now, you have the users table, the twits table, and the follows table. When a user logs in, you get their twit stream by using the following process:
1 Get a list of people the user follows.
2 Get the twits for each of those followed users.
3 Combine the twits and order them by timestamp, latest first.
You can use a couple of options to de-normalize. You can add another column family to the users table and maintain a stream there for each user. Or you can have another table for twit streams. Putting the twit stream in the users table isn’t ideal because the rowkey design of that table is such that it isn’t optimal for what you’re trying to do. Keep reading; you’ll see this reasoning soon.
The access pattern for the twit stream table consists of two parts:
■ Reading a list of twits to show to a given user when the user logs in, and display- ing it in reverse order of creation timestamp (latest first)
■ Adding to the list of twits for a user when any of the users they follow writes a twit Another thing to think about is the retention policy for the twit stream. You may want to maintain a stream of the twits from only the last 72 hours, for instance. We talk about Time To Live (TTL) later, as a part of advanced column family configurations.
Using the concepts that we’ve covered so far, you can see that putting the user ID
and the reverse timestamp in the rowkey makes sense. You can easily scan a set of rows in the table to retrieve the twits that form a user’s twit stream. You also need the user
ID of the person who created each twit. This information can go into the column qual- ifier. The table looks like figure 4.12.
(continued)
From a performance standpoint, normalization optimizes for writes, and de-normaliza- tion optimizes for reads.
De-normalization Poor design Dreamland Normalization Read performance Write performance
Normalization optimizes the table for writes; you pay the cost of combining data at read time. De- normalization optimizes for reads, but you pay the cost of writing multiple copies at write time.
When someone creates a twit, all their followers should get that twit in their respective streams. This can be accomplished using coprocessors, which we talk about in the next chapter. Here, we can talk about what the process is. When a user creates a twit, a list of all their followers is fetched from the relationships table, and the twit is added to each of the followers’ streams. To accomplish this, you need to first be able to find the list of users following any given user, which is the inverse of what you’ve solved so far in your relationships table. In other words, you want to answer the question, “Who follows me?” effectively.
With the current table design, this question can be answered by scanning the entire table and looking for rows where the second half of the rowkey is the user you’re interested in. Again, this process is inefficient. In a relational database system, this can be solved by adding an index on the second column and making a slight change to the SQL query. Also keep in mind that the amount of data you would work with is much smaller. What you’re trying to accomplish here is the ability to perform these kinds of operations with large volumes of data.
4.3
Heterogeneous data in the same table
HBase schemas are flexible, and you’ll use that flexibility now to avoid doing scans every time you want a list of followers for a given user. The intent is to expose you to the various ideas involved in designing HBase tables. The relationships table as you have it now has the rowkey as follows:
md5(user) + md5(followed user)
You can add the relationship information to this key and make it look like this:
md5(user) + relationship type + md5(user)
That lets you store both kinds of relationships in the same table: following as well as fol-
lowed by. Answering the questions you’ve been working with so far now involves check-
ing for the relationship information from the key. When you’re accessing all the followers for a particular user or all the users a particular user is following, you’ll do scans over a set of rows. When you’re looking for a list of users for the first case, you
md5(TheRealMT) + reverse ts HRogers:Twit foo Olivia:Second twit md5(TheRealMT) + reverse ts Olivia:Second twit md5(TheFakeMT) + reverse ts HRogers:Twit foo TheRealMT:First twit md5(TheFakeMT) + reverse ts md5(TheFakeMT) + reverse ts TheRealMT:Twit bar md5(TheFakeMT) + reverse ts info
Reverse timestamp = Long.MAX_VALUE - timestamp
Figure 4.12 A table to store twit streams for every user. Reverse timestamps let you sort
the twits with the latest twit first. That allows for efficient scanning and retrieval of the n
103
Rowkey design strategies
want to avoid having to read information for the other case. In other words, when you’re looking for a list of followers for a user, you don’t want a list of users that the user follows in the dataset returned to your client application. You can accomplish this by specifying the start and end keys for the scan.
Let’s look at another possible key structure: putting the relationship information in the first part of the key. The new key looks like this:
relationship type + md5(user) + md5(user)
Think about how the data is distributed across the RegionServers now. Everything for a particular type of relationship is collocated. If you’re querying for a list of followers more often than the followed list, the load isn’t well distributed across the various regions. That is the downside of this key design and the challenge in storing heteroge- neous data in the same table.
TIP Isolate different access patterns as much as possible.
The way to improve the load distribution in this case is to have separate tables for the two types of relationships you want to store. You can create a table called followedBy with the same design as the follows table. By doing that, you avoid putting the rela- tionship type information in the key. This allows for better load distribution across the cluster.
One of the challenges we haven’t addressed yet is keeping the two relationship entries consistent. When Mark Twain decides to follow his fanboy, two entries need to be made in the tables: one in the follows table and the other in the followedBy table. Given that HBase doesn’t allow inter-table or inter-row transactions, the client application writing these entries has to ensure that both rows are written. Failures happen all the time, and it will make the client application complicated if you try to implement transactional logic there. In an ideal world, the underlying database sys- tem should handle this for you; but design decisions are different at scale, and this isn’t a solved problem in the field of distributed systems at this point.
4.4
Rowkey design strategies
By now, you may have seen a pattern in the design process you went through to come up with the two tables to store relationship information. The thing you’ve been tweak- ing is the rowkey.
TIP In designing HBase tables, the rowkey is the single most important thing. You should model keys based on the expected access pattern.
Your rowkeys determine the performance you get while interacting with HBase tables. Two factors govern this behavior: the fact that regions serve a range of rows based on the rowkeys and are responsible for every row that falls in that range, and the fact that
HFiles store the rows sorted on disk. These factors are interrelated. HFiles are formed when regions flush the rows they’re holding in memory; these rows are already in
sorted order and get flushed that way as well. This sorted nature of HBase tables and their underlying storage format allows you to reason about performance based on how you design your keys and what you put in your column qualifiers. To refresh your memory about HFiles, look at figure 4.13; it’s the HFile you read about in chapter 2.
Unlike relational databases, where you can index on multiple columns, HBase indexes only on the key; the only way to access data is by using the rowkey. If you don’t know the rowkey for the data you want to access, you’ll end up scanning a chunk of rows, if not the entire table. There are various techniques to design rowkeys that are optimized for different access patterns, as we’ll explore next.
4.5
I/O considerations
The sorted nature of HBase tables can turn out to be a great thing for your applica- tion—or not. For instance, when we looked at the twit stream table in the previous sec- tion, its sorted nature gave you the ability to quickly scan a small set of rows to find the latest twits that should show up in a user’s stream. But the same sorted nature can hurt you when you’re trying to write a bunch of time-series data into an HBase table (remember hot-spotting?). If you choose your rowkey to be the timestamp, you’ll always be writing to a single region, whichever is responsible for the range in which the timestamp falls. In fact, you’ll always be writing to the end of a table, because time- stamps are monotonically increasing in nature. This not only limits your throughput to what a single region can handle but also puts you at risk of overloading a single machine where other machines in the cluster are sitting idle. The trick is to design