From the course: Microsoft Azure Cosmos DB Developer Specialty (DP-420) Cert Prep by Microsoft Press

Plan for transactions when choosing a partition key

Let's take a look at some practical examples here on how we could start with a JSON document and choose the best partition key for that workload. Now, some requirements of that partition key are that the key is a JSON property that needs to exist in every document in the container, and ideally, that partition key has high cardinality. In other words, there's a wide variety of values, because again, you want to think about spreading these JSON documents as evenly as you can across the underlying Cosmos compute cluster. A partition key value can be either a string or a number. You know that JSON offers six data types, but string and number are what we want for our partition key. I had mentioned previously that every JSON document in the Cosmos DB SQL API has a built-in id property. Now, that is case-sensitive. The JSON here in our files is case-sensitive, so we want to think about that as well and find ways that we can enforce consistency wherever possible. But we can use that built-in id property as our partition key if we want to or we need to. If we don't add an id property to the incoming data, Cosmos will create one. As you can see in this example here, let's bring my drawing tools out, we've got on Line 2 an id field that's using the classic globally unique identifier, or GUID, form. So we could either generate this ourselves or let Cosmos do it.

Now, in this example, we're obviously storing contact or customer information. Notice that there's also an element called customerId which, let's say, is also a value that's unique to every customer, which means that you have some choice here in terms of your partition key selection. You want to think about, as I've mentioned several times, how you'll be querying or accessing the data. In other words, if our application needs to report on Thomas Grey, what data besides customer ID and perhaps the customer's first name and last name might we want to retrieve? Remember that from the previous lesson? If it's one-to-one or one-to-few data, where you have bounded related data, for instance, in our contact database, maybe you've only preallocated two or three slots for contact info, that would be a bounded relationship, so it might make sense to include that as an array value that's embedded in the source document. By contrast, when we have one-to-many relationships, we might want to keep the customer's orders in separate documents. And in that pattern, we're linking out to order ID values that live in other documents within the container. In this case, it looks like Thomas Grey has one order with the ID W2PRY80. All right. So hopefully you're starting to get a sense for best practices for designing these containers and choosing partition key values.

Now, when you're reviewing the Microsoft Docs for Cosmos DB design, you'll see references to what are called hot partitions. So what I want to do here, again, let me bring my drawing tools out, is first of all discuss that notion of partitioning again. If you've worked in a relational context, like with SQL Server, for example, you can take huge tables that may have millions of rows and horizontally partition them. This happens a lot in data warehousing in the relational world so that you can take advantage of parallel compute. And then when you're querying those customers, instead of having to scan through one long table on a single server, each server has just a partition of the data that it can work with. Well, there's a somewhat similar idea here.
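Going back to that Thomas Grey customer document for a moment, here is a minimal sketch of what the embed-versus-reference pattern just described might look like, written as a Python dictionary. The property names other than id (customerId, contactDetails, orderIds) are illustrative assumptions rather than the exact fields on the course slide: a bounded contact array is embedded, while the unbounded set of orders is referenced by ID.

```python
# Illustrative sketch of the embed-versus-reference pattern; property
# names other than "id" are assumptions, not taken from the course slide.
customer_doc = {
    "id": "5f3a9c2e-7b41-4d8a-9e15-2c6d8f0a1b34",  # GUID, supplied by us or generated by Cosmos
    "customerId": "CUST-1001",                      # unique per customer: a partition key candidate
    "firstName": "Thomas",
    "lastName": "Grey",
    # One-to-few, bounded related data: embed it in the same document.
    "contactDetails": [
        {"type": "home", "phone": "555-0100"},
        {"type": "work", "phone": "555-0199"},
    ],
    # One-to-many, unbounded related data: reference separate order documents.
    "orderIds": ["W2PRY80"],
}
```

Either id or customerId would satisfy the partition key requirements described above. Keep in mind that transactional batch operations in Cosmos DB are scoped to a single logical partition, so a key like customerId, which groups all of a customer's documents together, is what lets you update them transactionally.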
Remember, in Cosmos you have your physical partitions, which we can look at as compute nodes. All right. So under the hood we've got this collection of Azure virtual machines running on Azure Service Fabric. Those are our physical partitions. We don't get any visibility into or control over those; they're managed by Azure. However, for every unique partition key value within a container, and that's what we have here in this diagram, we have a logical partition, and those logical partitions are going to be placed on the underlying physical partitions. So over time, you might have some logical partitions on Host 1, others on Host 2, others on Host n. So again, when it comes to querying, you may very well have to go across some or all of that physical partition structure, hence the benefit of creating partition keys with lots of values, because when you're rounding up the data in your queries, you want to take advantage of all that parallel processing so that you can bring the data back with less latency.

All right. Now, in this example, we're using order date as our partition key. You might think, well, okay, we're tracking orders in these JSON documents, and so we've got a unique order ID, kind of like what we saw in the previous slide with that Thomas Grey example. And you might think, well, if we do our partition key on order date, then we should definitely have nice, equal spreading, right? Because for each day, week, month, year, whatever division you choose for order date, you would expect a separate logical partition. And although that sounds great initially, depending upon your order velocity, it may not be a good idea. So in this case, notice that we've got pretty variable order date information, where January is very quiet, February and April are kind of in between, but March is very heavy, with one very busy day. The notion of the hot partition is where you're reading from and writing to the same logical partition a lot more than you are the others. So again, order date generally is not a good idea as a partition key. If we were partitioning by order date by day, for example, every day you would be hitting a single node in your compute cluster. You've got all these other nodes that are potentially available, but you're just hammering the one because of the way you've structured your partition key. If you used order ID, for example, then every discrete order would be its own logical partition. The idea, in summary, is that you want to have a wide number of logical partitions, so Cosmos can spread those and their associated data across compute nodes and take advantage of massively parallel compute. All right. So we want to avoid the hot partition.

As you're thinking about how to model your data for best performance and for best cost optimization, any tool that the cloud provider can give you is, it seems to me, a good thing. So I want to make sure that you're aware of the Cosmos DB Capacity Calculator. I gave you a short link right at the bottom of the slide, timw.info/ccc. This is a public website that's similar to the Azure pricing calculator, if you're familiar with that, only it's specific to Cosmos. So here what you can do is model, first of all, the API, we're assuming SQL, how many regions you plan to use with your account, and whether you're going to do multi-region writes.
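The calculator asks you to describe a workload against containers like the ones we've been discussing, so to connect the two ideas, here is a minimal sketch, assuming the azure-cosmos Python SDK, of where the partition key decision actually gets made: at container creation. The account URL, key, and the database and container names here are placeholders of my choosing, not values from the course.

```python
from azure.cosmos import CosmosClient, PartitionKey

# Placeholder endpoint and key; substitute your own account values.
client = CosmosClient("https://<your-account>.documents.azure.com:443/",
                      credential="<your-primary-key>")
database = client.create_database_if_not_exists(id="retail")

# High-cardinality key: every discrete order becomes its own logical
# partition, which Cosmos can spread across the physical partitions.
orders = database.create_container_if_not_exists(
    id="orders",
    partition_key=PartitionKey(path="/orderId"),
    offer_throughput=400,
)

# By contrast, PartitionKey(path="/orderDate") would funnel each day's
# reads and writes into one logical partition -- the hot-partition risk.
```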
And then, per region and per container, what are your estimates in terms of velocity: your creates, updates, deletes, and queries per second, and your point reads per second. A point read is where you already know the id and partition key of the document you're after, so you're not going to have to make use of the Cosmos index or query engine. All right. Bottom line is, you model your workload here, and you can see on the right a live cost estimate in terms of paying for your storage as well as your request units, that is, your throughput and capacity. And that's going to be a function of how many regions you're going to replicate that workload to. And what's nice about the Cosmos DB Capacity Calculator is that it's not just pulling numbers out of thin air; the numbers that the Capacity Calculator gives you are reflective of your region and your environment, because it taps into the live cost and pricing APIs. So it should be a pretty close estimate of what you'll pay in the real world.
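To tie the calculator's operation types back to code, here is a short continuation of the sketch above, reusing the hypothetical orders container, that contrasts a point read with a query. The document values are made up; the point is that these two operations carry very different request unit costs, which is exactly the distinction the calculator asks you to estimate.

```python
# Point read: you supply both the id and the partition key value, so no
# query engine or index scan is involved. Here we assume the order
# document's id and orderId carry the same value.
order = orders.read_item(item="W2PRY80", partition_key="W2PRY80")

# Query: this fans out across logical partitions and costs more request
# units than a point read, which is why the calculator asks for queries
# per second and point reads per second as separate inputs.
march_orders = list(orders.query_items(
    query="SELECT * FROM o WHERE o.orderDate >= @since",
    parameters=[{"name": "@since", "value": "2024-03-01"}],
    enable_cross_partition_query=True,
))
```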
