Load data into Azure SQL Database from Azure Databricks
Published: 2019-05-11



In this article, we will learn how we can load data into Azure SQL Database from Azure Databricks using Scala and Python notebooks.


With unprecedented volumes of data being generated, captured, and shared by organizations, fast processing of this data to gain meaningful insights has become a dominant concern for businesses. One of the popular frameworks that offer fast processing and analysis of big data workloads is Apache Spark.


Azure Databricks is the implementation of Apache Spark analytics on Microsoft Azure, and it integrates well with several Azure services like Azure Blob Storage, Azure Synapse Analytics, and Azure SQL Database. Spinning up clusters in a fully managed Apache Spark environment with the benefits of the Azure cloud platform has never been easier. In case you are new to Databricks, you can pick up its basics from this tutorial.


Data processing is one vital step in the overall data life cycle. Once this data is processed with the help of fast processing clusters, it needs to be stored in storage repositories for it to be easily accessed and analyzed for a variety of future purposes like reporting.


In this article, we will load the processed data into the SQL Database on Azure from Azure Databricks. Databricks in Azure supports APIs for several languages like Scala, Python, R, and SQL. As Apache Spark itself is written in Scala, Scala is typically the fastest language to program it with.


Let’s go ahead and demonstrate the data load into SQL Database using both Scala and Python notebooks from Databricks on Azure.


Preparations before the demo

Before we start with our exercise, we will need to have the following prerequisites:


  1. Azure Subscription – if you don't have one, you can create it
  2. Azure Databricks – you need to set up both the Databricks service and a cluster in Azure; you can go over the steps in this article to create these. As shown in that article, we have created a Databricks service named "azdatabricks" and a Databricks cluster named "azdatabrickscluster"
  3. Azure SQL Database – creating a SQL Database on Azure is a straightforward process. The screenshots below give a quick idea of how to create a SQL Database on Azure

On the Azure portal, you can either click directly on the Create a resource button or select SQL databases on the left vertical menu bar to land on the Create SQL Database screen.


Provide details like the database name and its configuration, and create or select the server name. Click on the Review + create button to create this SQL database on Azure.


Check out the official documentation by Microsoft, where the process to create a SQL database is described in great detail.


Uploading a CSV file on the Azure Databricks cluster

We will be loading a CSV file (semi-structured data) into the Azure SQL Database from Databricks. To that end, let's quickly upload a CSV file on the Databricks portal. You can download it from . Click on the Data icon on the left vertical menu bar and select the Add Data button.


Browse and choose the file that you want to upload on Azure Databricks.


Uploading data on Databricks portal in Azure.

Once uploaded, you can see the file "1000 Sales Records.csv" on the Azure Databricks service. Take note of the path name of the file: /FileStore/tables/1000_Sales_Records-d540d.csv. We will use this path in the notebooks to read data.


CSV file uploaded in Databricks portal on Azure.

Load data into Azure SQL Database from Azure Databricks using Scala

Click on the Workspace icon, hit the Create button, and select Notebook to create a notebook.


Create a Notebook on Azure Databricks.

Type in a name for the notebook and select Scala as the language. The cluster name is self-populated since only one cluster has been created; in case you have more clusters, you can always select from the drop-down list of your clusters. Finally, click Create to create a Scala notebook.


Create a Scala notebook on Databricks.

We will start by typing in the code, as shown in the following screenshot. Let's break this chunk of code into small parts and try to understand each of them.


In the below code, we will first create the JDBC URL, which contains information like the SQL Server name and the SQL database name on Azure, along with other details like the port number, user, and password.


val url = "jdbc:sqlserver://azsqlshackserver.database.windows.net:1433;database=azsqlshackdb;user=gauri;password=*******"
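
An equivalent way is to assemble the same URL from its parts, which mirrors what the Python notebook does later in this article and keeps the credentials out of the URL string; a minimal sketch, reusing the demo's server and database names:

// Build the JDBC URL from its parts; the user and password can then be
// supplied solely through the Properties object created in the next step.
val jdbcHostname = "azsqlshackserver.database.windows.net"
val jdbcPort = 1433
val jdbcDatabase = "azsqlshackdb"
val url = s"jdbc:sqlserver://$jdbcHostname:$jdbcPort;database=$jdbcDatabase"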

Next, we will create a Properties object to hold the connection parameters.


import java.util.Properties

val myproperties = new Properties()
myproperties.put("user", "gauri")
myproperties.put("password", "******")
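
Hard-coding the password in a notebook is shown here for simplicity; on Databricks, a secret scope is the usual alternative. A sketch, assuming a secret scope named "azsqlshack" with a key "sqlpassword" has been created (both names are hypothetical):

// Fetch the password from a Databricks secret scope instead of hard-coding it
// (scope and key names below are hypothetical and must exist in your workspace).
myproperties.put("password", dbutils.secrets.get(scope = "azsqlshack", key = "sqlpassword"))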

The following code sets the JDBC driver class in the connection properties, which is used when connecting to the SQL Server database.


val driverClass = "com.microsoft.sqlserver.jdbc.SQLServerDriver"
myproperties.setProperty("Driver", driverClass)
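
The property above only names the driver class; to actually verify connectivity, a small test connection can be opened and closed. This is a sketch assuming the SQL Server JDBC driver is available on the cluster, as it is on Databricks runtimes:

import java.sql.DriverManager

// Open a test connection with the JDBC URL defined earlier (credentials are
// embedded in the URL); isClosed() returning false confirms the connection.
Class.forName(driverClass)
val connection = DriverManager.getConnection(url)
println(connection.isClosed())
connection.close()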

Lastly, we will read the CSV file into the mydf data frame. With the header = true option, the values in the first row of the CSV file will be treated as the data frame's column names. Using inferSchema = true, we are telling Spark to automatically infer the schema of each column.


val mydf = spark.read.format("csv")
    .option("header", "true")
    .option("inferSchema", "true")
    .load("/FileStore/tables/1000_Sales_Records-d540d.csv")
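
Since the schema is inferred rather than declared, it is worth a quick sanity check on what Spark deduced before loading the data anywhere; for example:

// Inspect the inferred column names/types and the row count.
mydf.printSchema()
println(mydf.count())   // the sample file should contain 1000 rows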

We will use the display() function to show records of the mydf data frame.


display(mydf)

Transforming the data

Now, let's try some quick data munging on the dataset: we will rename the column SalesChannel to SalesPlatform using the withColumnRenamed() function.


val transformedmydf = mydf.withColumnRenamed("SalesChannel", "SalesPlatform")

display(transformedmydf)
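
Note that withColumnRenamed() does not modify mydf in place; it returns a new data frame. A quick way to confirm the rename took effect:

// mydf still carries "SalesChannel", while transformedmydf has the new name.
println(transformedmydf.columns.contains("SalesPlatform"))   // true
println(mydf.columns.contains("SalesChannel"))               // true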


Before we load the transformed data into the Azure SQL Database, let's quickly take a peek at the database on the Azure portal. For this, go to the portal, select the SQL database, and click on the Query editor (preview).


Provide your login and password to query the SQL database on Azure, and click OK.


The below screenshot shows that currently there are no tables and no data in this database.


Loading the processed data into Azure SQL Database using Scala

On the Azure Databricks portal, execute the below code. This will load the CSV file into a table named SalesTotalProfit in the SQL Database on Azure.


transformedmydf.write.jdbc(url, "SalesTotalProfit", myproperties)
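
Note that write.jdbc() uses the default SaveMode, ErrorIfExists, so re-running this cell fails once the table exists. A sketch of a re-runnable variant:

// "overwrite" drops and recreates the target table on each run;
// "append" would add rows to the existing table instead.
transformedmydf.write
  .mode("overwrite")
  .jdbc(url, "SalesTotalProfit", myproperties)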

Head back to the Azure portal, refresh the window and execute the below query to select records from the SalesTotalProfit table.


SELECT * FROM [dbo].[SalesTotalProfit]

The data is loaded into the table SalesTotalProfit in the database azsqlshackdb on Azure. You can now perform any operations on the data, as you would in any regular database; for example, the query below updates a value in the ItemType column and re-selects the records.


UPDATE [dbo].[SalesTotalProfit]
SET ItemType = 'Clothing'
WHERE ItemType = 'Clothes';

SELECT * FROM [dbo].[SalesTotalProfit];

The following code reads the data from the SalesTotalProfit table back into Databricks. Here, we are aggregating the data per Region, computing the average TotalProfit, and displaying the results.


val azsqldbtable = spark.read.jdbc(url, "SalesTotalProfit", myproperties)

display(azsqldbtable.select("Region", "TotalProfit").groupBy("Region").avg("TotalProfit"))
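
The avg() aggregation produces a column literally named avg(TotalProfit); as a small refinement, it can be aliased and the output ordered for readability. A sketch:

import org.apache.spark.sql.functions.avg

// Same aggregation with a friendlier column name and sorted output.
val avgProfitByRegion = azsqldbtable
  .groupBy("Region")
  .agg(avg("TotalProfit").alias("AvgTotalProfit"))
  .orderBy("Region")

display(avgProfitByRegion)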

Load data into Azure SQL Database from Azure Databricks using Python

Let's create a new notebook for the Python demonstration. Just select Python as the language when you are creating this notebook. We will name this notebook loadintoazsqldb.


The following code sets various parameters like the server name, database name, user, and password.


jdbcHostname = "azsqlshackserver.database.windows.net"
jdbcPort = "1433"
jdbcDatabase = "azsqlshackdb"
properties = { "user" : "gauri", "password" : "******" }

The below code creates a JDBC URL. We will use sqlContext.read.csv() to read the CSV file, and the mydf data frame is created as shown in the screenshot below.


url = "jdbc:sqlserver://{0}:{1};database={2}".format(jdbcHostname, jdbcPort, jdbcDatabase)

mydf = sqlContext.read.csv("/FileStore/tables/1000_Sales_Records-d540d.csv", header=True)

We will wrap the mydf data frame in a DataFrameWriter named myfinaldf, and finally write this data frame into the table TotalProfit with the given properties. In case this table already exists, we can overwrite it by setting the mode to overwrite.


from pyspark.sql import DataFrameWriter

# Wrap the Spark data frame in a DataFrameWriter and push it to Azure SQL
# over JDBC; mode="overwrite" replaces the table if it already exists.
myfinaldf = DataFrameWriter(mydf)
myfinaldf.jdbc(url=url, table="TotalProfit", mode="overwrite", properties=properties)

Go to Azure Portal, navigate to the SQL database, and open Query Editor. Open the Tables folder to see the CSV data successfully loaded into the table TotalProfit in the Azure SQL database, azsqlshackdb.


Conclusion

Azure Databricks, a fast and collaborative Apache Spark-based analytics service, integrates seamlessly with a number of Azure Services, including Azure SQL Database. In this article, we demonstrated step-by-step processes to populate SQL Database from Databricks using both Scala and Python notebooks.



查看>>