How to Use OLAP with Python Pandas
By: Hristo Hristov | Updated: 2022-07-18 | Related: More > Python

Problem

In case you want to perform extra operations on your data stored in SQL, such as describing, analyzing, and visualizing it, you need an extra tool. If you have the flexibility, or a requirement not to use Power BI, you can resort to scripting.

Solution

In this tutorial, we examine the scenario where you want to read SQL data, parse it directly into a dataframe, and perform data analysis on it. When connecting to an analytical data store, this process will enable you to extract insights directly from your database, without having to export or sync the data to another system.

Getting Started

Please read my tip on How to Get Started Using Python Using Anaconda and VS Code, if you have not already. Then, open VS Code in your working directory and create a new file with the .ipynb extension. Next, open the file by double-clicking on it and select a kernel. You will get a list of all your conda environments and any default interpreters (if installed). You can pick an existing one, or create one beforehand from the conda interface or terminal. Assuming you do not yet have the pandas package in your environment, install it now. Repeat the same for the sqlalchemy package.

Establishing a connection

Having set up our development environment, we are ready to connect to our local SQL Server instance. First, import the packages needed and run the cell:

```python
import pandas as pd
from sqlalchemy import create_engine
```

Next, we must establish a connection to our server. This is what a connection string for the local database looks like with inferred credentials (the "trusted connection" under pyodbc):

```python
engine = create_engine(
    'mssql+pyodbc://'
    '@./AdventureWorks2019?'  # username:pwd@server:port/database
    'driver=ODBC+Driver+17+for+SQL+Server'
)
```

Let us break it down: the empty username and password before the @ sign mean the credentials are inferred from the current Windows user, . points to the local default server instance, AdventureWorks2019 is the target database, and the query string at the end names the ODBC driver to use.
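Since the URL-encoded driver name in the connection string is easy to get wrong, an alternative is to build the string programmatically. The sketch below uses only the standard library and assumes the same local server, database, and ODBC Driver 17 as above:

```python
from urllib.parse import quote_plus

# Raw ODBC connection string; "." (local server) and AdventureWorks2019
# match the example above, and the driver name is an assumption about
# what is installed locally.
odbc_str = (
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=.;"
    "DATABASE=AdventureWorks2019;"
    "Trusted_Connection=yes;"
)

# Percent-encode the whole thing so it can ride inside a SQLAlchemy URL.
url = "mssql+pyodbc:///?odbc_connect=" + quote_plus(odbc_str)
print(url)
```

The resulting string can be handed to create_engine() exactly like the hand-written URL above.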
Reading data with the Pandas Library

The read_sql function reads the result of a query directly into a dataframe. Let us try out a simple query:

```python
df = pd.read_sql(
    'SELECT [CustomerID]\
    ,[PersonID]\
    ,[StoreID]\
    ,[TerritoryID]\
    ,[AccountNumber]\
    ,[ModifiedDate]\
    FROM [Sales].[Customer]',
    engine,
    index_col='CustomerID')
```

The first argument (lines 2 – 8) is a string of the query we want to be executed. The second argument (line 9) is the engine object we previously built to connect to the server. Lastly (line 10), we have an argument for the index column. Here it is CustomerID, and it is not required. However, if you have a bigger dataset, it can be very useful: for example, thousands of rows where each row has a timestamp column and a numerical value column. There, it can be very useful to set the index to the timestamp of each row at query run time instead of post-processing later.

Explore the dataframe

Let us pause for a bit and focus on what a dataframe is and its benefits. The pandas dataframe is a tabular data structure consisting of rows, columns, and data. It is like a two-dimensional array; however, the data contained can also have one or multiple dimensions. Within the pandas module, the dataframe is a cornerstone object allowing quick (relatively, as there are technically quicker ways), straightforward, and intuitive data selection, filtering, and ordering. Additionally, the dataframe can provide a good overview of an entire dataset by using additional pandas methods or additional modules to describe (profile) the dataset. Turning your SQL table into a pandas dataframe 'on the fly' enables you, as the analyst, to gain an overview of the data at hand. You can also process the data and prepare it for further analysis.

More complex example

Let us investigate defining a more complex query with a join and some parameters. Parametrizing your query can be a powerful approach if you want to use variables that exist elsewhere in your code.
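To make the "explore" step above concrete without a live server, here is a minimal sketch on a few hypothetical rows shaped like the Sales.Customer query result (all values are made up):

```python
import pandas as pd

# Hypothetical stand-in for the Sales.Customer result set,
# so the dataframe mechanics can be tried without a server.
df = pd.DataFrame({
    "CustomerID": [1, 2, 3, 4],
    "PersonID": [291, 293, 295, 297],
    "StoreID": [934, 1028, 642, 932],
    "TerritoryID": [1, 1, 4, 4],
}).set_index("CustomerID")

print(df.head())      # first rows, indexed by CustomerID
print(df.describe())  # count, mean, spread, quartiles per numeric column
```

head() and describe() are the usual first two calls on an unfamiliar dataframe: one shows the raw shape of the data, the other profiles every numeric column at a glance.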
For example:

```python
start_date = '2012-01-01'
end_date = '2012-12-31'
product_name = '%shorts%'

df2 = pd.read_sql(
    'SELECT AVG(sod.OrderQty) [Avg Order Qty],\
    p.Name,\
    FORMAT(soh.OrderDate, \'yyyy-MM\') [Year-Month]\
    FROM Sales.SalesOrderHeader soh\
    JOIN Sales.SalesOrderDetail sod ON sod.SalesOrderID = soh.SalesOrderID\
    JOIN Production.Product p ON sod.ProductID = p.ProductID\
    WHERE soh.OrderDate >= ?\
    AND soh.OrderDate <= ?\
    AND p.Name LIKE ?\
    GROUP BY p.Name, FORMAT(soh.OrderDate, \'yyyy-MM\')\
    ORDER BY FORMAT(soh.OrderDate, \'yyyy-MM\') DESC',
    engine,
    params=[start_date, end_date, product_name])
df2.head()
```

For this query, we have first defined three variables for our parameter values: a start date, an end date, and a product name pattern for the LIKE operator. The three ? placeholders in the WHERE clause are bound, in order, to the values passed in the params list.
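The same ?-placeholder mechanics can be rehearsed against an in-memory SQLite database, which pandas also supports through the standard library's sqlite3 module. The table and values below are made up purely for illustration:

```python
import sqlite3

import pandas as pd

# In-memory stand-in table; the AdventureWorks schema is not needed
# to demonstrate how the "?" placeholders are bound.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE Product (Name TEXT, OrderDate TEXT, OrderQty INTEGER);
    INSERT INTO Product VALUES
        ('Mens Shorts',  '2012-03-15', 3),
        ('Mens Shorts',  '2012-06-01', 5),
        ('Racing Socks', '2012-04-20', 2);
    """
)

start_date, end_date, product_name = "2012-01-01", "2012-12-31", "%shorts%"
df2 = pd.read_sql(
    "SELECT Name, AVG(OrderQty) AS [Avg Order Qty] "
    "FROM Product "
    "WHERE OrderDate BETWEEN ? AND ? AND Name LIKE ? "
    "GROUP BY Name",
    conn,
    params=[start_date, end_date, product_name],
)
print(df2)  # one aggregated row for the matching product
```

The binding works the same way as in the SQL Server example: values are supplied positionally, so the query text never has to be built with string concatenation.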
Pivot

To take full advantage of this dataframe, I assume the end goal would be some visualization. In this case, we should pivot the data on the product name column to make it more suitable for a stacked bar chart visualization:

```python
df_piv = df2.pivot(index=['Year-Month'],
                   columns=['Name'],
                   values=['Avg Order Qty'])
```

Visualize

Finally, we can use the pivoted dataframe to visualize it in a suitable way with this syntax:

```python
import matplotlib.pyplot as plt

plt.rcParams['figure.figsize'] = (15, 10)
ax = df_piv.plot(stacked=True, kind='bar')
for bar in ax.patches:
    height = bar.get_height()
    width = bar.get_width()
    x = bar.get_x()
    y = bar.get_y()
    label_text = height
    label_x = x + width / 2
    label_y = y + height / 2
    ax.text(label_x, label_y, label_text, ha='center', va='center')
ax.set_xticklabels(df_piv.index, rotation='horizontal')
```

First, we must import the matplotlib package. Then we set the figsize argument to 15x10 inches. Next, we plot the pivoted dataframe as a stacked bar chart and, for every bar segment, add a text label with its value, centered within the segment. Finally, we keep the x-axis tick labels (the Year-Month values) horizontal for readability.

Conclusion

In this tutorial, we examined how to connect to SQL Server and query data from one or many tables directly into a pandas dataframe. With this technique, we can take full advantage of additional Python packages such as pandas and matplotlib.

Next Steps
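As a next step, you can rehearse the pivot-and-plot workflow above on a small synthetic frame (hypothetical product names and values) before pointing it at live data:

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # render off-screen; no display required

# Made-up query output: one row per product per month.
df2 = pd.DataFrame({
    "Year-Month": ["2012-01", "2012-01", "2012-02", "2012-02"],
    "Name": ["Mens Shorts", "Womens Shorts"] * 2,
    "Avg Order Qty": [3.0, 4.0, 2.5, 5.0],
})

# Wide layout: one row per month, one column per product.
df_piv = df2.pivot(index="Year-Month", columns="Name",
                   values="Avg Order Qty")

ax = df_piv.plot(kind="bar", stacked=True, figsize=(15, 10))
ax.set_xticklabels(df_piv.index, rotation="horizontal")
```

Each column of the pivoted frame becomes one colored segment of the stacked bars, which is exactly the shape the chart in this tip expects.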
About the author

Hristo Hristov is a Microsoft certified data professional, specializing in Power Apps, Power BI and Python.