Web Scraping with Python (using BeautifulSoup)
When performing data science tasks, it is common to want to use data found on the internet. You can usually access this data through an Application Programming Interface (API) or in some other format. However, there are times when the data you want is only available as part of a web page. In cases like this, a technique called web scraping comes into play.
To apply this technique and get data out of a web page, we need some basic knowledge of the structure of a web page and the tags used in web development (e.g. <html>, <body>, <div>, <p>, and so on). If you are new to web development, you can learn about it here.
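As a quick sketch of that structure, here is a tiny page held in a string and parsed with BeautifulSoup (the page and its contents are made up for illustration):

```python
from bs4 import BeautifulSoup

# A made-up minimal page, showing how common tags nest
html_doc = """
<html>
  <head><title>A page title</title></head>
  <body>
    <p>A paragraph of text.</p>
  </body>
</html>
"""
soup = BeautifulSoup(html_doc, "html.parser")
print(soup.title.get_text())  # A page title
print(soup.p.get_text())      # A paragraph of text.
```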
So, to get started with web scraping, we will use a simple website. We will use the requests module to fetch the content of the web page, i.e. its source code.
import requests
page = requests.get("https://dataquestio.github.io/web-scraping-pages/simple.html")
print(page.content)  # shows the source code
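In practice it is worth checking that the download actually succeeded before parsing. A small sketch (fetch is our own hypothetical helper, not part of requests):

```python
import requests

def fetch(url):
    # Hypothetical helper: download a page and fail loudly on HTTP errors
    page = requests.get(url, timeout=10)
    page.raise_for_status()  # raises requests.HTTPError for 4xx/5xx responses
    return page.content

# content = fetch("https://dataquestio.github.io/web-scraping-pages/simple.html")
```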
Now we will use the bs4 module to parse the content and pull out the useful data.
from bs4 import BeautifulSoup
soup = BeautifulSoup(page.content, 'html.parser')
print(soup.prettify())  # shows the source in indented HTML format
You can find the tags you need using the inspect-element tool in your browser. Now, say you want to get all the data stored in a particular tag.
In this tutorial, we will show how to do web scraping using Python 3 and the BeautifulSoup library. We will scrape weather forecasts from the National Weather Service, then analyze them using the Pandas library.
Before we start, if you are looking for more background on APIs or the csv format, you may want to check out our Dataquest courses on:
- APIs
- Data analysis
The requests library
Let's try downloading a simple sample website.
First, we have to download the page using the requests module:

import requests
page = requests.get("https://dataquestio.github.io/web-scraping-pages/simple.html")
page               # <Response [200]>
page.status_code   # 200 means the download succeeded
page.content       # displays the raw HTML
Parsing the page with BeautifulSoup
from bs4 import BeautifulSoup
soup = BeautifulSoup(page.content, 'html.parser')
print(soup.prettify())   # print the HTML, nicely indented
list(soup.children)      # the top-level children of the document
[type(item) for item in list(soup.children)]
# [bs4.element.Doctype, bs4.element.NavigableString, bs4.element.Tag]
# [0] `Doctype` object, which contains information about the type of the document.
# [1] `NavigableString`, which represents text found in the HTML document.
# [2] `Tag` object, which contains other nested tags.
html = list(soup.children)[2]
list(html.children)
# ['\n',
#  <head><title>A simple example page</title></head>,
#  '\n',
#  <body><p>Here is some simple content for this page.</p></body>,
#  '\n']
body = list(html.children)[3]   # get the <body> tag
list(body.children)             # its contents
# ['\n', <p>Here is some simple content for this page.</p>, '\n']
p = list(body.children)[1]      # the <p> tag
p.get_text()                    # extract the text from the tag
# 'Here is some simple content for this page.'
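Indexing into children by position is brittle, since whitespace between tags counts as a child. As a sketch, the same walk works on an inline copy of the page, and find is sturdier than counting children:

```python
from bs4 import BeautifulSoup

# An inline stand-in for simple.html (same shape, typed out by hand)
html_doc = ("<!DOCTYPE html><html><head><title>A simple example page</title></head>"
            "<body><p>Here is some simple content for this page.</p></body></html>")
soup = BeautifulSoup(html_doc, "html.parser")
body = soup.find("body")   # no need to count children
print(body.p.get_text())   # Here is some simple content for this page.
```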
Finding all instances of a tag at once
Using find_all, we can pull out every matching tag instead of traversing the tree manually:

soup.find_all('p')                 # returns a list of all <p> tags found
# [<p>Here is some simple content for this page.</p>]
soup.find_all('p')[0].get_text()   # go directly to the text we want
# 'Here is some simple content for this page.'
soup.find('p')                     # only gets the 1st instance found
# <p>Here is some simple content for this page.</p>
Finding instances by class or id
page = requests.get("https://dataquestio.github.io/web-scraping-pages/ids_and_classes.html")
soup = BeautifulSoup(page.content, 'html.parser')
outer_text = soup.find_all(class_="outer-text")        # find all elements with a class
outer_text = soup.find_all('p', class_='outer-text')   # find <p> tags with a class
first_id = soup.find_all(id="first")                   # find elements by id
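To see these in action without re-downloading, here is a sketch using an inline fragment shaped like the ids_and_classes page (the fragment itself is made up):

```python
from bs4 import BeautifulSoup

# A made-up fragment with classes and ids, like the page above
html_doc = """
<div>
  <p class="inner-text" id="first">First paragraph.</p>
  <p class="outer-text">Second paragraph.</p>
</div>
"""
soup = BeautifulSoup(html_doc, "html.parser")
print(soup.find_all(class_="outer-text"))        # every element with the class
print(soup.find_all('p', class_='outer-text'))   # only <p> tags with the class
print(soup.find_all(id="first")[0].get_text())   # First paragraph.
```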
Using CSS Selectors
You can also search for items using CSS selectors.
CSS selectors specify tags by pattern, for example:

- div p — finds all p tags inside of a div tag.
- body p a — finds any a tags inside of a p tag inside of a body tag.
- p.outer-text — finds all p tags with a class of outer-text.
- p#first — finds all p tags with an id of first.
- body p.outer-text — finds any p tags with a class of outer-text inside of a body tag.

BeautifulSoup supports CSS selectors through the select method:
soup.select("div p")   # finds all <p> tags inside a <div>
# [<p ...>First paragraph.</p>,
#  <p ...>Second paragraph.</p>]
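Note that select always returns a list, even for a single match; select_one returns just the first match, similar to find. A small sketch on an inline fragment:

```python
from bs4 import BeautifulSoup

# A made-up fragment for demonstration
html_doc = '<div><p id="first">First paragraph.</p><p>Second paragraph.</p></div>'
soup = BeautifulSoup(html_doc, "html.parser")
print(soup.select("div p"))         # every <p> inside a <div>, as a list
print(soup.select_one("p#first"))   # just the first match, like find
```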
Downloading weather data
We will get weather information about downtown San Francisco from the National Weather Service site.
Exploring page structure with Chrome DevTools
The first thing we need to do is examine the page using Chrome DevTools.
We want the div with id seven-day-forecast, which contains the extended forecast. If you click around in the console and explore the div, you will discover that each forecast item (like "Tonight", "Thursday", and "Thursday Night") is contained in a div with the class tombstone-container.
We now know enough to download the page and start parsing it. In the code below, we:
- Download the web page containing the forecast.
- Create a BeautifulSoup object to parse the page.
- Find the div with id seven-day-forecast, and assign it to seven_day.
- Inside seven_day, find each individual forecast item.
- Extract and print the first forecast item.
import requests
from bs4 import BeautifulSoup

page = requests.get("https://forecast.weather.gov/MapClick.php?lat=37.7772&lon=-122.4168")
soup = BeautifulSoup(page.content, 'html.parser')
seven_day = soup.find(id="seven-day-forecast")
forecast_items = seven_day.find_all(class_="tombstone-container")
tonight = forecast_items[0]
print(tonight.prettify())
Extracting information from the page
As you can see, tonight contains all the information we want: the name of the forecast period, a short description of the conditions, and the temperature low. We can extract each one using find with a class name:

period = tonight.find(class_="period-name").get_text()
short_desc = tonight.find(class_="short-desc").get_text()
temp = tonight.find(class_="temp").get_text()
print(period)
print(short_desc)
print(temp)
We can also extract the title attribute of the img tag, which holds a longer text description of the forecast:

img = tonight.find("img")
desc = img['title']
print(desc)
Extracting all the information from the page
Now that we know how to extract each piece of information individually, we can combine CSS selectors with list comprehensions to extract everything at once:

period_tags = seven_day.select(".tombstone-container .period-name")
periods = [pt.get_text() for pt in period_tags]
print(periods)
short_descs = [sd.get_text() for sd in seven_day.select(".tombstone-container .short-desc")]
print(short_descs)
temps = [t.get_text() for t in seven_day.select(".tombstone-container .temp")]
print(temps)
descs = [d["title"] for d in seven_day.select(".tombstone-container img")]
print(descs)
Combining our data into a Pandas DataFrame
We can now combine the data into a Pandas DataFrame and analyze it. A DataFrame is an object that can store tabular data, making data analysis easy. If you want to learn more about any of the topics covered here, check out our interactive courses, which you can start for free.
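The combining step can be sketched as follows; the sample lists stand in for the scraped periods, short_descs, temps, and descs (real forecast values change daily):

```python
import pandas as pd

# Sample values shaped like the scraped lists (real forecasts vary)
periods = ["Tonight", "Thursday", "ThursdayNight"]
short_descs = ["Mostly Clear", "Sunny", "Mostly Clear"]
temps = ["Low: 49 F", "High: 63 F", "Low: 50 F"]
descs = ["Tonight: Mostly clear.", "Thursday: Sunny.", "Thursday Night: Mostly clear."]

weather = pd.DataFrame({
    "period": periods,
    "short_desc": short_descs,
    "temp": temps,
    "desc": descs,
})
# Pull the number out of each temperature string so we can do math on it
weather["temp_num"] = weather["temp"].str.extract(r"(\d+)", expand=False).astype(int)
print(weather)
print(weather["temp_num"].mean())  # 54.0
```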