When performing data science tasks, it is common to want to use data found on the internet. You can usually access this data via an Application Programming Interface (API) or in some other format. Sometimes, however, the data you want is only available as part of a web page. In cases like this, a technique called web scraping comes into play.
To apply this technique and get data out of a web page, we need basic knowledge of the structure of a web page and of the tags used in web development (e.g. `html`, `div`, etc.). If you are new to web development, you can learn about it here.
To get started with web scraping, we will use a simple website. We will use the requests module to fetch the content of the web page, i.e. its source code.
import requests
page = requests.get("http://dataquestio.github.io/web-scraping-pages/simple.html")
print(page.content)  # shows the source code
Now we will use the bs4 module to parse the content and extract the useful data:
from bs4 import BeautifulSoup
soup = BeautifulSoup(page.content, 'html.parser')
print(soup.prettify())  # shows the source as formatted HTML
You can find the tags you need using the inspect-element tool in your browser. Now, say you want to get all of the data stored in a particular tag.
In this tutorial, we will show how to perform web scraping using Python 3 and the BeautifulSoup library. We will scrape weather forecasts from the National Weather Service, then analyze them using the Pandas library.
Before we get started: if you are looking for more background on APIs or the csv format, you may want to check out the Dataquest courses on those topics.
The requests library
Let's try downloading a simple sample website.
First, we have to download the page using the requests library:
import requests
page = requests.get("http://dataquestio.github.io/web-scraping-pages/simple.html")
page  # <Response [200]>
page.status_code  # 200
page.content  # the raw HTML of the page
Parsing the page with BeautifulSoup
Use the BeautifulSoup library to parse this document and extract the text:
from bs4 import BeautifulSoup
soup = BeautifulSoup(page.content, 'html.parser')
soup.prettify()  # the HTML, pretty-printed
list(soup.children)  # the direct children of the document
[type(item) for item in list(soup.children)]
# [bs4.element.Doctype, bs4.element.NavigableString, bs4.element.Tag]
# [0] `Doctype` object, which contains information about the type of the document.
# [1] `NavigableString`, which represents text found in the HTML document.
# [2] `Tag` object, which contains other nested tags.
html = list(soup.children)[2]
Now we can parse the children inside the `html` tag:
list(html.children)
# ['\n', <head><title>A simple example page</title></head>, '\n',
#  <body><p>Here is some simple content for this page.</p></body>, '\n']
body = list(html.children)[3]  # get the body tag
list(body.children)  # its contents
# ['\n', <p>Here is some simple content for this page.</p>, '\n']
p = list(body.children)[1]  # the p tag
p.get_text()  # extract the text from the tag
# 'Here is some simple content for this page.'
Finding all instances of a tag at once
Use `find_all` and `find` instead of traversing manually:
soup.find_all('p')  # returns a list of all matching tags
# [<p>Here is some simple content for this page.</p>]
soup.find_all('p')[0].get_text()  # go directly to the text we want
# 'Here is some simple content for this page.'
soup.find('p')  # only returns the first instance found
# <p>Here is some simple content for this page.</p>
Finding instances by class or id
page = requests.get("http://dataquestio.github.io/web-scraping-pages/ids_and_classes.html")
soup = BeautifulSoup(page.content, 'html.parser')
outer_text = soup.find_all(class_="outer-text") # find all elements with class
outer_text = soup.find_all('p', class_='outer-text') # find p with class
first_id = soup.find_all(id="first") # find by id
Using CSS selectors
You can also search for items using CSS selectors, the same patterns the CSS language uses to pick out HTML tags for styling. Some examples:
- `p a` finds all `a` tags inside a `p` tag
- `body p a` finds all `a` tags inside a `p` tag inside a `body` tag
- `html body` finds all `body` tags inside an `html` tag
- `p.outer-text` finds all `p` tags with the class `outer-text`
- `p#first` finds all `p` tags with the id `first`
- `body p.outer-text` finds all `p` tags with the class `outer-text` inside a `body` tag

Use CSS selectors with the `select` method:
soup.select("div p")
# [<p class="inner-text first-item" id="first">First paragraph.</p>,
#  <p class="inner-text">Second paragraph.</p>]
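As a self-contained illustration, the selectors above can be tried against an inline document. The HTML below is a stand-in with the same classes and ids used in this section, not a download of the `ids_and_classes.html` sample page:

```python
from bs4 import BeautifulSoup

# Illustrative HTML reusing the outer-text / inner-text classes and the
# "first" id from the examples above.
html_doc = """
<html><body>
<div>
<p class="inner-text first-item" id="first">First paragraph.</p>
<p class="inner-text">Second paragraph.</p>
</div>
<p class="outer-text first-item" id="second"><b>First outer paragraph.</b></p>
<p class="outer-text"><b>Second outer paragraph.</b></p>
</body></html>
"""
soup = BeautifulSoup(html_doc, "html.parser")

print(len(soup.select("div p")))             # p tags inside a div -> 2
print(len(soup.select("p.outer-text")))      # p tags with class outer-text -> 2
print(soup.select("p#first")[0].get_text())  # -> First paragraph.
```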
Downloading weather data
We will get weather information about downtown San Francisco from the National Weather Service forecast page.
The first thing we need to do is inspect the page using Chrome DevTools.
We want the `div` tag with the id `seven-day-forecast`. If you click around in the console and explore that div, you will discover that each forecast item (like "Tonight", "Thursday", and "Thursday Night") is contained in a `div` with the class `tombstone-container`.
We now know enough to download the page and start parsing it. In the code below, we:
- Download the web page containing the forecast
- Create a BeautifulSoup object to parse the page
- Find the `div` with id `seven-day-forecast` and assign it to `seven_day`
- Inside `seven_day`, find each individual forecast item
- Extract and print the first forecast item
Parse the weather page and print the first forecast:
import requests
from bs4 import BeautifulSoup
page = requests.get("http://forecast.weather.gov/MapClick.php?lat=37.7772&lon=-122.4168")
soup = BeautifulSoup(page.content, 'html.parser')
seven_day = soup.find(id="seven-day-forecast")
forecast_items = seven_day.find_all(class_="tombstone-container")
tonight = forecast_items[0]
print(tonight.prettify())
Note the use of `find` (which returns only the first match) versus `find_all` (which returns every match) in the code above. Now extract the period name, the short description, and the temperature:
period = tonight.find(class_="period-name").get_text()
short_desc = tonight.find(class_="short-desc").get_text()
temp = tonight.find(class_="temp").get_text()
print(period)
print(short_desc)
print(temp)
Extract the `title` attribute from the `img` tag:
img = tonight.find("img")
desc = img['title']
print(desc)
Select all of the items with the class `period-name` inside items with the class `tombstone-container`:
period_tags = seven_day.select(".tombstone-container .period-name")
periods = [pt.get_text() for pt in period_tags]
print(periods)
short_descs = [sd.get_text() for sd in seven_day.select(".tombstone-container .short-desc")]
print(short_descs)
temps = [t.get_text() for t in seven_day.select(".tombstone-container .temp")]
print(temps)
descs = [d["title"] for d in seven_day.select(".tombstone-container img")]
print(descs)
Combining our data into a Pandas DataFrame
We can now combine the data into a Pandas DataFrame and analyze it. A DataFrame is an object that can store tabular data, making analysis easy. If you want to learn more about any of the topics covered here, check out our interactive courses, which you can start for free: Web Scraping with Python.
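The text stops short of showing the DataFrame step. A minimal sketch might look like the following; since the live forecast changes, the rows below are invented sample values standing in for the `periods`, `short_descs`, `temps`, and `descs` lists built above:

```python
import pandas as pd

# Sample values standing in for the scraped lists; in the real run these
# come from the seven_day.select(...) calls above.
periods = ["Tonight", "Thursday", "ThursdayNight"]
short_descs = ["Mostly Clear", "Sunny", "Partly Cloudy"]
temps = ["Low: 49 F", "High: 63 F", "Low: 50 F"]
descs = ["Tonight: Mostly clear.", "Thursday: Sunny.", "Thursday Night: Partly cloudy."]

weather = pd.DataFrame({
    "period": periods,
    "short_desc": short_descs,
    "temp": temps,
    "desc": descs,
})

# Pull the numeric part out of the temp strings so we can analyze it.
weather["temp_num"] = weather["temp"].str.extract(r"(\d+)", expand=False).astype(int)
print(weather["temp_num"].mean())  # -> 54.0
```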
How do I scrape HTML with R?
In general, web scraping in R (or in any other language) boils down to the following three steps:
Get the HTML for the web page you want to scrape
Decide which parts of the page you want to read, and find out what HTML/CSS you need to select them
Select the HTML and analyze it in whatever way you need
Can I scrape data from GitHub?
How do I scrape data from a website using BeautifulSoup?
Beautiful Soup: Build a Web Scraper With Python.
Find elements by ID
Find elements by HTML class name
Extract text from HTML elements
Find elements by class name and text content
Pass a function to a Beautiful Soup method
Identify error conditions
Access parent elements
Extract attributes from HTML elements
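Several of the techniques in that list can be sketched against an inline document. The HTML, tag names, and job titles below are invented purely for illustration:

```python
from bs4 import BeautifulSoup

# Invented sample document for demonstrating the lookup techniques.
html_doc = """
<div id="jobs">
  <div class="card"><h2 class="title">Python Developer</h2><p class="location">Remote</p></div>
  <div class="card"><h2 class="title">Data Engineer</h2><p class="location">Berlin</p></div>
</div>
"""
soup = BeautifulSoup(html_doc, "html.parser")

# Find an element by ID.
jobs = soup.find(id="jobs")

# Find elements by HTML class name.
cards = jobs.find_all("div", class_="card")

# Extract text from an HTML element.
first_title = cards[0].find("h2").get_text()

# Pass a function to a Beautiful Soup method: match any h2 whose
# text mentions "python", case-insensitively.
python_jobs = soup.find_all(
    lambda tag: tag.name == "h2" and "python" in tag.get_text().lower()
)

# Access a parent element: step from the matching h2 up to its card.
python_card = python_jobs[0].parent

print(first_title)                       # -> Python Developer
print(python_card.find("p").get_text())  # -> Remote
```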
How do I use Python for web scraping?
To extract data using web scraping with Python, you follow these basic steps:
Find the URL you want to scrape
Inspect the page
Find the data you want to extract
Write the code
Run the code and extract the data
Store the data in the required format
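The steps above can be sketched end-to-end. Since a live URL changes over time, the sketch below runs the extraction step against an inline page; the tag names and values are placeholders, not from any real site:

```python
from bs4 import BeautifulSoup
import csv, io

# Step 1 would be fetching the URL you found, e.g. page = requests.get(url).
# Steps 2-3: after inspecting the page, suppose we want the text of every
# <li class="item">; an inline page stands in for the downloaded HTML here.
html = "<ul><li class='item'>alpha</li><li class='item'>beta</li></ul>"

# Step 4: write the extraction code.
soup = BeautifulSoup(html, "html.parser")
items = [li.get_text() for li in soup.find_all("li", class_="item")]

# Step 5: run the code and extract the data.
print(items)  # -> ['alpha', 'beta']

# Step 6: store the data in the required format (CSV, in this sketch).
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["item"])
for it in items:
    writer.writerow([it])
```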