
There are many ways to analyze data in a project. Statisticians usually use R as their main tool for this work, while some software-development models focus on, for example, how to optimize data structures.

Figure 1. Illustration of data analysis (Applications of Data Analysis)

Some examples of data analysis in practice include the following:

  • Paper by Facebook on exposure to ideologically diverse information
  • OKCupid blog post on the best questions to ask on a first date
  • How Walmart used data analysis to increase sales
  • How Bill James applied data analysis to baseball
  • A pharmaceutical company uses data analysis to predict which chemical compounds are likely to make effective drugs
Running Python Locally

In this course, we’ll assume you have the ability to run Python locally. Whether you already have Python installed on your computer or not, we recommend downloading and installing Anaconda. This is a scientific Python installation that comes with a lot of libraries and tools we’ll be using in this course, some of which are otherwise very difficult to install.

If you haven’t already, go through our short course on setting up your computer. I’ve included an environment file you can use to create a conda environment that will provide all the necessary packages and versions for this course. If the resource links open up the environment files as text files in your browser, you can use right-click (Windows) or control-click (Mac) to open up a menu to “Save as…” to download the file. If you’d like to set up your own environment, this course requires Python 2.7, numpy, pandas, matplotlib, and seaborn.

Downloading Data Files

You should also download the data files from the Resources section. Make sure you save these in the same directory as your IPython notebook. The files we’ll be using in this lesson are enrollments.csv, daily_engagement.csv, and project_submissions.csv. daily_engagement_full.csv contains more detailed data than daily_engagement.csv, but it's a larger file (about 500 MB), so downloading and using this file is optional.

You should also download and read table_descriptions.txt, which describes what data is present in each file (or table) and what columns are present. The data has been anonymized, and contains a random selection of Data Analyst Nanodegree students who had completed the first project at the time the data was collected, as well as a random selection of students who had not.

Supporting Materials

ipython_notebook_tutorial.ipynb

DAND Environment (OS X)

DAND Environment (Windows)

Reminder: You should download the notebook and data files from the Resources section to follow along with the analysis performed through the rest of this lesson. You can open this section by clicking on the icon to the upper right of the classroom.

Python’s csv Module

This page contains documentation for Python’s csv module. Instead of csv, you'll be using unicodecsv in this course. unicodecsv works exactly the same as csv, but it comes with Anaconda and has support for unicode. The csv documentation page is still the best way to learn how to use the unicodecsv library, since the two libraries work exactly the same way.
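As a quick sketch of reading one of these files, the snippet below uses the standard library's csv module; unicodecsv exposes an identical interface, so only the import line would change. The sample rows are invented for illustration, but the account_key field name matches the course's data files.

```python
import csv
import io

# An in-memory stand-in for a file such as enrollments.csv
sample = io.StringIO(
    "account_key,status\n"
    "448,canceled\n"
    "448,current\n"
    "700,completed\n"
)

reader = csv.DictReader(sample)  # each row becomes a dict keyed by the header line
rows = list(reader)
print(rows[0])
```

With unicodecsv under Python 2, you would open the file in binary mode and write `import unicodecsv as csv` instead; everything else stays the same.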

Iterators in Python

This page explains the difference between iterators and lists in Python, and how to use iterators.

Solutions

If you want to check our solution for the problem, look at the end of this lesson for Quiz Solutions.

Removing an Element from a Dictionary

If you’re not sure how to remove an element from a dictionary, this post might be helpful.

Solutions

If you want to check our solution for the problem, look at the end of this lesson for Quiz Solutions.

Updated Code for Previous Exercise

After running the above code, Caroline also shows rewriting the solution from the previous exercise to the following code:

def get_unique_students(data):
    unique_students = set()
    for data_point in data:
        unique_students.add(data_point['account_key'])
    return unique_students

len(enrollments)
unique_enrolled_students = get_unique_students(enrollments)
len(unique_enrolled_students)

len(daily_engagement)
unique_engagement_students = get_unique_students(daily_engagement)
len(unique_engagement_students)

len(project_submissions)
unique_project_submitters = get_unique_students(project_submissions)
len(unique_project_submitters)
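As a self-contained illustration of how this function deduplicates records, here is a run on a few made-up enrollment rows (the account keys and statuses are invented):

```python
def get_unique_students(data):
    unique_students = set()
    for data_point in data:
        unique_students.add(data_point['account_key'])
    return unique_students

# Two records for the same student, one for another student
enrollments = [
    {'account_key': '448', 'status': 'canceled'},
    {'account_key': '448', 'status': 'current'},
    {'account_key': '700', 'status': 'completed'},
]

unique_enrolled_students = get_unique_students(enrollments)
print(len(enrollments))               # 3 rows in the table
print(len(unique_enrolled_students))  # 2 distinct students
```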
Adding labels and titles

In matplotlib, you can add axis labels using plt.xlabel() and plt.ylabel(). For histograms, you usually only need an x-axis label, but for other plot types a y-axis label may also be needed. You can also add a title using plt.title().

Making plots look nicer with seaborn

You can automatically make matplotlib plots look nicer using the seaborn library. This library is not automatically included with Anaconda, but Anaconda includes something called a package manager to make it easier to add new libraries. The package manager is called conda, and to use it, you should open the Command Prompt (on a PC) or terminal (on Mac or Linux), and type the command conda install seaborn.

If you are using a different Python installation than Anaconda, you may have a different package manager. The most common ones are pip and easy_install, and you can use them with the commands pip install seaborn or easy_install seaborn, respectively.

Once you have installed seaborn, you can import it anywhere in your code using the line import seaborn as sns. Then any plot you make afterwards will automatically look better. Give it a try!

If you’re wondering why the abbreviation for seaborn is sns, it’s because seaborn was named after the character Samuel Norman Seaborn from the show The West Wing, and sns are his initials.

The seaborn package also includes some extra functions you can use to make complex plots that would be difficult in matplotlib. We won’t be covering those in this course, but if you’d like to see what functions seaborn has available, you can look through the documentation.

Adding extra arguments to your plot

You’ll also frequently want to add some arguments to your plot to tune how it looks. You can see what arguments are available on the documentation page for the hist function. One common argument to pass is the bins argument, which sets the number of bins used by your histogram. For example, plt.hist(data, bins=20) would make sure your histogram has 20 bins.
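Putting the last few sections together, a minimal sketch might look like the following. The data values are made up, and the Agg backend is selected only so the example runs without a display; in an interactive session you would call plt.show() instead of savefig().

```python
import matplotlib
matplotlib.use("Agg")  # render without opening a window
import matplotlib.pyplot as plt

# Made-up data standing in for one of the lesson's variables
data = [1, 1, 2, 3, 3, 3, 4, 5, 5, 9]

counts, edges, patches = plt.hist(data, bins=20)  # bins sets the number of bars
plt.xlabel("Value")
plt.ylabel("Count")
plt.title("Distribution of a made-up variable")
```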

Improving one of your plots

Use these techniques to improve at least one of the plots you made earlier.

Sharing your findings

Finally, decide which of the discoveries you made this lesson you would most want to communicate to someone else, and write a forum post sharing your findings.

Solution Code

A notebook containing all code shown in this lesson is available in the Downloadables section, as well as the Quiz Solutions page at the end of the lesson.

Supporting Materials

L1_Solution_Code.ipynb

Gapminder data

The data in this lesson was obtained from the site gapminder.org. The variables included are:

  • Aged 15+ Employment Rate [%]
  • Life Expectancy [years]
  • GDP/capita [US$, inflation adjusted]
  • Primary school completion [% of boys]
  • Primary school completion [% of girls]

You can also obtain the data to analyze on your own from the Downloadables section.

Bitwise Operations

See this article for more information about bitwise operations.

In NumPy, a & b performs a bitwise and of a and b. This is not necessarily the same as a logical and, which is what you would want if, for example, you were checking whether matching terms in two integer vectors were both non-zero. However, if a and b are both arrays of booleans, rather than integers, bitwise and and logical and are the same thing. If you want to perform a logical and on integer vectors, then you can use the NumPy function np.logical_and() or convert them into boolean vectors first.

Similarly, a | b performs a bitwise or, and ~a performs a bitwise not. However, if your arrays contain booleans, these will be the same as performing logical or and logical not. NumPy also has similar functions (np.logical_or(), np.logical_not()) for performing these logical operations on integer-valued arrays.
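A small sketch of the difference on invented integer arrays:

```python
import numpy as np

a = np.array([1, 2, 0, 4])
b = np.array([3, 0, 5, 1])

bitwise = a & b                 # AND of the integer bit patterns
logical = np.logical_and(a, b)  # True where BOTH entries are non-zero

# 4 & 1 is 0 (bit patterns 100 and 001 share no bits), but logically
# both values are non-zero, so the two results differ in the last slot.
print(bitwise)
print(logical)

# On boolean arrays, the two operations agree
x = np.array([True, True, False])
y = np.array([True, False, False])
```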

For the quiz, assume that the number of males and females are equal, i.e., we can take a simple average to get an overall completion rate.

In the solution, we may want to divide by 2. instead of just 2. This is because in Python 2, dividing an integer by another integer drops fractions, so if our inputs are also integers, we may end up losing information. If we divide by a float instead, then we will definitely retain decimal values.
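For reference, a small sketch of the distinction, written to run under Python 3, where / is always true division and // is the operator that drops fractions:

```python
# The course targets Python 2, where 7 / 2 evaluates to 3.
print(7 // 2)   # floor division drops the fraction, like Python 2's integer /
print(7 / 2.)   # dividing by a float keeps decimals in both Python 2 and 3
```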

Erratum: The output of cell [3] in the solution video is incorrect: it appears that one of the variables has not been set to the proper value set in cell [2]. All values except for the first will be different. The correct output in cell [3] should instead start with:

array([ 192.83205,  205.28855,  202.82258,  186.63257,  206.91115,
Pandas idxmax()

Note: The argmax() function mentioned in the videos has been renamed to idxmax(), and returns the index of the first maximally-valued element. You can find documentation for the idxmax() function in Pandas here.

Remember that Jupyter notebooks will just print out the results of the last expression run in a code cell, as though a print expression was run. If you want to save the results of your operations for later, remember to assign the results to a variable or, for some Pandas functions, pass inplace=True to modify the starting object without needing to reassign it.
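A minimal sketch of idxmax() on an invented Series:

```python
import pandas as pd

# Hypothetical completion rates indexed by country
completion = pd.Series([0.61, 0.93, 0.80],
                       index=['Algeria', 'Argentina', 'Armenia'])

best = completion.idxmax()  # index label of the first maximally-valued element
print(best)
```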

Note: The grader will execute your finished function on some test Series of names when you submit your answer. Make sure that this function returns another Series with the transformed names.

split()

You can find documentation for Python’s split() function here.

Plotting in Pandas

If the variable data is a NumPy array or a Pandas Series, just like if it is a list, the code

import matplotlib.pyplot as plt
plt.hist(data)

will create a histogram of the data.

Pandas also has built-in plotting that uses matplotlib behind the scenes, so if data is a Series, you can create a histogram using data.hist().

There’s no difference between these two in this case, but sometimes the Pandas wrapper can be more convenient. For example, you can make a line plot of a series using data.plot(). The index of the Series will be used for the x-axis and the values for the y-axis.
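A quick sketch of the Series plot() wrapper, using invented life-expectancy values and the non-interactive Agg backend so it runs headlessly:

```python
import matplotlib
matplotlib.use("Agg")  # render without a display
import pandas as pd

# Hypothetical life-expectancy values indexed by year
life_expectancy = pd.Series([68.1, 69.4, 70.2, 71.0],
                            index=[1990, 1995, 2000, 2005])

ax = life_expectancy.plot()  # index -> x-axis, values -> y-axis
xs = list(ax.lines[0].get_xdata())
ys = list(ax.lines[0].get_ydata())
print(xs)
print(ys)
```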

In the following quiz, we’ve created Series containing the various variables we’ve been looking at this lesson. Pick a country you’re interested in, and make a plot of each variable over time.

The Udacity editor will only show one plot each time you click “Test Run”, so you can look at multiple plots by clicking “Test Run” multiple times. If you’re running plotting code locally, you may need to add the line plt.show() depending on your setup.

Memory Layout

This page describes the memory layout of 2D NumPy arrays.

Understanding and Interpreting Correlations
  • This page contains some scatterplots of variables with different values of correlation.
  • This page lets you use a slider to change the correlation and see how the data might look.
  • Pearson’s r only measures linear correlation! This page shows some different linear and non-linear relationships and what Pearson’s r will be for those relationships.
Corrected vs. Uncorrected Standard Deviation

By default, Pandas’ std() function computes the standard deviation using Bessel's correction. Calling std(ddof=0) ensures that Bessel's correction will not be used.
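A small sketch of the two conventions on invented scores:

```python
import numpy as np
import pandas as pd

scores = [2.0, 4.0, 6.0, 8.0]
s = pd.Series(scores)

sample_sd = s.std()            # Bessel's correction: divide by n - 1
population_sd = s.std(ddof=0)  # divide by n, matching NumPy's default

print(sample_sd, population_sd, np.std(scores))
```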

Previous Exercise

The exercise where you used a simple heuristic to estimate correlation was the “Pandas Series” exercise in the previous lesson, “NumPy and Pandas for 1D Data”.

Pearson’s r in NumPy

NumPy’s corrcoef() function can be used to calculate Pearson’s r, also known as the correlation coefficient.

Pandas shift()

Documentation for the Pandas shift() function is here. If you’re still not sure how the function works, try it out and see!
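A minimal sketch of shift() on an invented cumulative counter:

```python
import pandas as pd

# Cumulative readings, e.g. a station's entry counter sampled periodically
cumulative = pd.Series([10, 25, 45, 70])

shifted = cumulative.shift(1)  # values move down one slot; the first becomes NaN
hourly = cumulative - shifted  # difference between consecutive readings

print(hourly.tolist())
```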

Alternative Solution

As an alternative to using vectorized operations, you could also use the Pandas diff() function to calculate the answer in a single step.

Note: The grader will execute your finished function on some test DataFrames of grades when you submit your answer. Make sure that this function returns a DataFrame with the converted grades. Hint: You may need to define a helper function to use with applymap().

Note: In order to get the proper computations, we should actually be setting the value of the “ddof” parameter to 0 in the std() function.

Note that the type of standard deviation calculated by default is different between NumPy’s std() and Pandas’ std() functions. By default, NumPy calculates a population standard deviation, with "ddof = 0". On the other hand, Pandas calculates a sample standard deviation, with "ddof = 1". If we know all of the scores, then we have a population, so to standardize using Pandas, we need to set "ddof = 0".

Using groupby() to Calculate Hourly Entries and Exits

In the quiz where you calculated hourly entries and exits, you did so for a single set of cumulative entries. However, in the original data, there was a separate set of numbers for each station.

Thus, to correctly calculate the hourly entries and exits, it was necessary to group by station and day, then calculate the hourly entries and exits within each day.

Write a function to do that. You should use the apply() function to call the function you wrote previously. You should also make sure you restrict your grouped data to just the entries and exits columns, since your function may cause an error if it is called on non-numerical data types.

If you would like to learn more about using groupby() in Pandas, this page contains more details.

Note: You will not be able to reproduce the ENTRIESn_hourly and EXITSn_hourly columns in the full dataset using this method. When creating the dataset, we did extra processing to remove erroneous values.

To clarify the structure of the data, the original data recorded the cumulative number of entries on each station at four-hour intervals. For the quiz, you just need to look at the differences between consecutive measurements on each station: by computing “hourly entries”, we just mean recording the number of new tallies between each recording period as a contrast to “cumulative entries”.
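A compact sketch of the pattern on an invented two-station table. The station and column names here are made up, and group_keys=False is passed to groupby() so that apply() returns values aligned with the original rows:

```python
import pandas as pd

# Hypothetical cumulative entries for two stations on one day
df = pd.DataFrame({
    'station':  ['A', 'A', 'A', 'B', 'B', 'B'],
    'ENTRIESn': [100, 130, 180, 500, 580, 640],
})

def get_hourly_entries(entries):
    # the same shift-based difference as the single-station exercise
    return entries - entries.shift(1)

# Apply the single-station function separately within each station's group
df['hourly'] = df.groupby('station', group_keys=False)['ENTRIESn'].apply(get_hourly_entries)
print(df['hourly'].tolist())
```

Note that the first reading of each station becomes NaN, since there is no earlier reading within that group to subtract.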

Plotting with DataFrames

Just like Pandas Series, DataFrames also have a plot() method. If df is a DataFrame, then df.plot() will produce a line plot with a different colored line for each variable in the DataFrame. This can be a convenient way to get a quick look at your data, especially for small DataFrames, but for more complicated plots you will usually want to use matplotlib directly.

In the following quiz, create a plot of your choice showing something interesting about the New York subway data. For example, you might create:

  • Histograms of subway ridership on both days with rain and days without rain
  • A scatterplot of subway stations with latitude and longitude as the x and y axes and ridership as the bubble size
  • If you choose this option, you may wish to use the as_index=False argument to groupby(). There is example code in the following quiz.
  • A scatterplot with subway ridership on one axis and precipitation or temperature on the other

If you’re not sure how to make the plot you want, try searching on Google or take a look at the matplotlib documentation. Once you’ve created a plot you’re happy with, share what you’ve found on the forums!

Three-Dimensional Data

Now that you’ve worked with one-dimensional and two-dimensional data, you might be wondering how to work with three or more dimensions.

3D data in NumPy

NumPy arrays can have arbitrarily many dimensions. Just like you can create a 1D array from a list, and a 2D array from a list of lists, you can create a 3D array from a list of lists of lists, and so on. For example, the following code would create a 3D array:

import numpy as np

a = np.array([
    [['A1a', 'A1b', 'A1c'], ['A2a', 'A2b', 'A2c']],
    [['B1a', 'B1b', 'B1c'], ['B2a', 'B2b', 'B2c']]
])
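A brief sketch of indexing into such an array, with one index per dimension:

```python
import numpy as np

a = np.array([
    [['A1a', 'A1b', 'A1c'], ['A2a', 'A2b', 'A2c']],
    [['B1a', 'B1b', 'B1c'], ['B2a', 'B2b', 'B2c']]
])

print(a.shape)     # (2, 2, 3): two blocks of two rows of three values
print(a[0, 1, 2])  # one index per dimension selects a single element
print(a[1, 0])     # fewer indices select a whole slice
```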
3D data in Pandas

Pandas has a data structure called a Panel, which is similar to a DataFrame or a Series, but for 3D data. If you would like, you can learn more about Panels here.

Pandas Links
  • Red pandas playing in the snow
  • Giant panda puts baby panda to bed
  • Pandas playing on a slide
Project Overview

Note: This course is currently only available for free, so you won’t be able to submit your work for review. We encourage you to use the specifications and evaluation tools to complete it, then self-assess and seek feedback from family, friends, and your social networks. Use their feedback to improve and you’ll have a great example of your work to show off anytime!

In this project, you will analyze a dataset and then communicate your findings about it. You will use the Python libraries NumPy, Pandas, and Matplotlib to make your analysis easier.

What do I need to install?

You will need an installation of Python, plus the following libraries:

  • pandas
  • numpy
  • matplotlib
  • csv or unicodecsv

We recommend installing Anaconda, which comes with all of the necessary packages, as well as IPython notebook. You can find installation instructions here.

Why this Project?

This project will introduce you to the data analysis process. In this project, you will go through the entire process so that you know how all the pieces fit together. Other courses in the Data Analyst Nanodegree focus on individual pieces of the data analysis process. In this project, you will also gain experience using the Python libraries NumPy, Pandas, and Matplotlib, which make writing data analysis code in Python a lot easier!

What will I learn?

After completing the project, you will:

  • Know all the steps involved in a typical data analysis process
  • Be comfortable posing questions that can be answered with a given dataset and then answering those questions
  • Know how to investigate problems in a dataset and wrangle the data into a format you can use
  • Have practice communicating the results of your analysis
  • Be able to use vectorized operations in NumPy and Pandas to speed up your data analysis code
  • Be familiar with Pandas’ Series and DataFrame objects, which let you access your data more conveniently
  • Know how to use Matplotlib to produce plots showing your findings
Why is this Important to my Career?

This project will show off a variety of data analysis skills, as well as showing potential employers that you know how to go through the entire data analysis process.

Introduction

For the final project, you will conduct your own data analysis and create a file to share that documents your findings. You should start by taking a look at your dataset and brainstorming what questions you could answer using it. Then you should use Pandas and NumPy to answer the questions you are most interested in, and create a report sharing the answers. You will not be required to use statistics or machine learning to complete this project, but you should make it clear in your communications that your findings are tentative. This project is open-ended in that we are not looking for one right answer.

Step One — Choose Your Data Set

Choose one of the following datasets to analyze for your project:

  • Titanic Data — Contains demographics and passenger information from 891 of the 2224 passengers and crew on board the Titanic. You can view a description of this dataset on the Kaggle website, where the data was obtained.
  • Baseball Data — A data set containing complete batting and pitching statistics from 1871 to 2014, plus fielding statistics, standings, team stats, managerial records, post-season data, and more. This dataset contains many files, but you can choose to analyze only the one[s] you are most interested in.
  • Choose the comma-delimited version, which contains CSV files.
Step Two — Get Organized

Eventually you’ll want to share your project with friends, family, and employers. Get organized before you begin. We recommend creating a single folder that will eventually contain:

  • The report communicating your findings
  • Any Python code you wrote as part of your analysis
  • The data set you used [which you will not need to submit]

You may wish to use IPython notebook, in which case you can share both the code you wrote and the report of your findings in the same document. Otherwise, you will need to store your report and code separately.

Step Three — Analyze Your Data

Brainstorm some questions you could answer using the data set you chose, then start answering those questions. Here are some ideas to get you started:

  • Titanic Data
  • What factors made people more likely to survive?
  • Baseball Data
  • What is the relationship between different performance metrics? Do any have a strong negative or positive relationship?
  • What are the characteristics of baseball players with the highest salaries?

Make sure you use NumPy and Pandas where they are appropriate!

Step Four — Share Your Findings

Once you have finished analyzing the data, create a report that shares the findings you consider most interesting. You might wish to use IPython notebook to share your findings alongside the code you used to perform the analysis, but you can also use another tool if you wish.

Step Five — Review

Use the Project Rubric to review your project. If you are happy with your project, then you’re finished! If you see room for improvement, keep working to improve your project.

Supporting Materials

titanic_data.csv

Evaluation

Use the Project Rubric to review your project. If you are happy with your project, then you are ready to share it with others for feedback! If you see room for improvement in any category in which you do not meet specifications, keep working!

You may wish to ask those who review your work to give feedback according to the same Project Rubric.

Sharing your work

Ready to share your work? Send an email to the person who will give you feedback with the following:

  1. A PDF or HTML file containing your analysis. This file should include:
  • A note specifying which dataset you analyzed
  • A statement of the question[s] you posed
  • A description of what you did to investigate those questions
  • Documentation of any data wrangling you did
  • Summary statistics and plots communicating your final results
  2. If the code you used to perform your analysis is not included in the above, you can attach the code separately in .py file(s).
  3. A list of Web sites, books, forums, blog posts, github repositories, etc. that you referred to or used in creating your submission (add N/A if you did not use any such resources).
IPython notebook instructions

If you used IPython notebook to create your analysis, you can download your notebook as an HTML file. Click on File -> Download As -> HTML (.html) within the notebook. This way, your reviewer will not need to have IPython notebook installed to view your work. If you get an error about “No module named …”, then open a terminal and try installing the missing module using pip install (don't include the quotes or any words following a period in the module name).
