From the course: Microsoft Azure Cosmos DB Developer Specialty (DP-420) Cert Prep by Microsoft Press
Choose between Azure Synapse Link and Spark Connector
So there are two main ways to integrate Cosmos DB with an Apache Spark cluster running in Azure. If you're using Azure Databricks or HDInsight with Apache Spark, you can take advantage of an open source project called the Cosmos DB Connector for Spark. It gives you read and write access to Cosmos DB via Apache Spark DataFrames, and the two supported languages are Python and Scala.

You might wonder, well, why do you care about this integration? You're going to see it more when we look at Synapse in particular, but the idea is that you can report and run big data analytics jobs on Cosmos DB data without disturbing any transactional flow that you already have in Cosmos. See, a big business driver for data warehousing is that you're taking data out of a transactional system, either on a schedule, or in a streaming fashion, or both with a lambda architecture, and putting it into a separate system for analysis. On the SQL side, you definitely want to minimize concurrency and contention in the transactional environment. Well, we have a similar situation even in the NoSQL world. You'll learn momentarily that Cosmos DB has two separate data stores available within it, one for its native transactional work and another for big data analytics. But I'm getting a little bit ahead of myself.

You should know, just for DP-420, that the main Apache Spark solutions available in Azure are Azure HDInsight, that's the original product; Azure Databricks, which is a partnership between Microsoft and the Databricks Corporation; and, of course, Apache Spark for Azure Synapse. The Cosmos DB Connector would mainly be for folks who are already invested in HDInsight or Databricks. If you know you're going to work within Synapse, there is native integration there called Azure Synapse Link for Cosmos DB.
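To make the connector idea concrete, here's a minimal PySpark-style sketch of how a Cosmos DB container is read into a DataFrame with the open source Spark connector. The helper function, endpoint, key, database, and container names below are all placeholders I've made up for illustration; the option keys follow the Cosmos DB Spark 3 connector's configuration naming, but check the connector's docs for your version.

```python
# Sketch: reading a Cosmos DB container into a Spark DataFrame with the
# open source Cosmos DB Spark connector. All account/database/container
# values below are placeholders -- substitute your own.

def cosmos_read_config(endpoint, key, database, container):
    """Build the option map the Cosmos DB Spark 3 connector expects."""
    return {
        "spark.cosmos.accountEndpoint": endpoint,
        "spark.cosmos.accountKey": key,
        "spark.cosmos.database": database,
        "spark.cosmos.container": container,
    }

cfg = cosmos_read_config(
    "https://myaccount.documents.azure.com:443/",  # placeholder endpoint
    "<account-key>",                               # placeholder key
    "RetailDB",                                    # placeholder database
    "Orders",                                      # placeholder container
)

# On a Databricks or HDInsight cluster with the connector installed,
# you would then read with something like:
# df = spark.read.format("cosmos.oltp").options(**cfg).load()
# and write back with df.write.format("cosmos.oltp").options(**cfg).save()
```

The point of the sketch is just that, from Spark's perspective, Cosmos DB looks like any other DataFrame source once the connector options are set.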
And this gives you what's called cloud-native hybrid transactional and analytical processing, or HTAP. That's an acronym you might want to know for your exam. The exam's never going to test your knowledge of acronyms, but I don't want you to be surprised if you see one. So the business case here, to put it in a nutshell, is: "How can we use Cosmos DB data in our data analytics workloads without disturbing the production workload?" That's really the kernel of what we're talking about here. And the reason we can do this extract-transform-load out of Cosmos into Synapse without the need for large-scale data movement and complex pipelines is because, as we'll see, there are two data stores available in Cosmos that don't step on each other. As far as languages go, Synapse Spark supports Scala, Python, Spark SQL, and C#, so you actually get more flexibility in your client-side programming language when you're using Azure Synapse Link as opposed to the open source connector for Spark.
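For comparison with the connector sketch above, here's a hedged sketch of what the Synapse Link side looks like from a Synapse Spark notebook. The helper function and the linked-service and container names are placeholders I've invented; the `cosmos.olap` format and the linked-service option reflect how Synapse Link reads from the analytical store in Microsoft's documentation, which is exactly why the query doesn't touch the transactional store.

```python
# Sketch: reading Cosmos DB's *analytical* store from a Synapse Spark pool
# via Azure Synapse Link. The "cosmos.olap" format targets the analytical
# store, so this read never contends with transactional (OLTP) traffic.
# The linked-service and container names are placeholders.

def synapse_link_read_config(linked_service, container):
    """Options for a cosmos.olap (analytical store) DataFrame read."""
    return {
        "spark.synapse.linkedService": linked_service,
        "spark.cosmos.container": container,
    }

cfg = synapse_link_read_config("CosmosDbLinkedService", "Orders")  # placeholders

# Inside a Synapse notebook you would then run something like:
# df = spark.read.format("cosmos.olap").options(**cfg).load()
# df.groupBy("status").count().show()  # analytics without touching OLTP
```

Note there's no account key here: in Synapse, the connection details live in the workspace's linked service, which is part of why the Synapse Link path feels more "native" than the open source connector.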