Web Scraping with Python (using BeautifulSoup)
When performing data science tasks, it is common to want to use data found on the internet. You can usually access this data through an Application Programming Interface (API) or in some other format. However, there are times when the data you want is only available as part of a web page. In cases like this, a technique called web scraping comes into play.
To apply this technique and get data out of a web page, we need some basic knowledge of the structure of a web page and the tags used in web development (e.g. <html>, <body>, <div>, <p>, and so on). If you are new to web development, you can learn about it here.
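As a quick sketch of that structure, here is a tiny page held in a string and parsed with BeautifulSoup (the page and its contents are made up for illustration):

```python
from bs4 import BeautifulSoup

# A made-up minimal page, showing how common tags nest
html_doc = """
<html>
  <head><title>A page title</title></head>
  <body>
    <p>A paragraph of text.</p>
  </body>
</html>
"""
soup = BeautifulSoup(html_doc, "html.parser")
print(soup.title.get_text())  # A page title
print(soup.p.get_text())      # A paragraph of text.
```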
So, to get started with web scraping, we will use a simple website. We will use the requests module to fetch the content of the web page, i.e. its source code.
import requests
page = requests.get("https://dataquestio.github.io/web-scraping-pages/simple.html")
print(page.content)  # shows the source code
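In practice it is worth checking that the download actually succeeded before parsing. A small sketch (fetch is our own hypothetical helper, not part of requests):

```python
import requests

def fetch(url):
    # Hypothetical helper: download a page and fail loudly on HTTP errors
    page = requests.get(url, timeout=10)
    page.raise_for_status()  # raises requests.HTTPError for 4xx/5xx responses
    return page.content

# content = fetch("https://dataquestio.github.io/web-scraping-pages/simple.html")
```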
Now we will use the bs4 module to parse the content and pull out the useful data.
from bs4 import BeautifulSoup
soup = BeautifulSoup(page.content, 'html.parser')
print(soup.prettify())  # shows the source in indented HTML format
You can find the tags you need using the inspect-element tool in your browser. Now, say you want to get all the data stored in a particular tag.
In this tutorial, we will show how to do web scraping using Python 3 and the BeautifulSoup library. We will scrape weather forecasts from the National Weather Service, then analyze them using the Pandas library.
Before we start, if you are looking for more background on APIs or the csv format, you may want to check out our Dataquest courses on:
- APIs
- Data analysis
The requests library
Let's try downloading a simple sample website.
First, we have to download the page using the requests module:

import requests
page = requests.get("https://dataquestio.github.io/web-scraping-pages/simple.html")
page               # <Response [200]>
page.status_code   # 200 means the download succeeded
page.content       # displays the raw HTML
Parsing the page with BeautifulSoup
from bs4 import BeautifulSoup
soup = BeautifulSoup(page.content, 'html.parser')
print(soup.prettify())   # print the HTML, nicely indented
list(soup.children)      # the top-level children of the document
[type(item) for item in list(soup.children)]
# [bs4.element.Doctype, bs4.element.NavigableString, bs4.element.Tag]
# [0] `Doctype` object, which contains information about the type of the document.
# [1] `NavigableString`, which represents text found in the HTML document.
# [2] `Tag` object, which contains other nested tags.
html = list(soup.children)[2]
list(html.children)
# ['\n',
#  <head><title>A simple example page</title></head>,
#  '\n',
#  <body><p>Here is some simple content for this page.</p></body>,
#  '\n']
body = list(html.children)[3]   # get the <body> tag
list(body.children)             # its contents
# ['\n', <p>Here is some simple content for this page.</p>, '\n']
p = list(body.children)[1]      # the <p> tag
p.get_text()                    # extract the text from the tag
# 'Here is some simple content for this page.'
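Indexing into children by position is brittle, since whitespace between tags counts as a child. As a sketch, the same walk works on an inline copy of the page, and find is sturdier than counting children:

```python
from bs4 import BeautifulSoup

# An inline stand-in for simple.html (same shape, typed out by hand)
html_doc = ("<!DOCTYPE html><html><head><title>A simple example page</title></head>"
            "<body><p>Here is some simple content for this page.</p></body></html>")
soup = BeautifulSoup(html_doc, "html.parser")
body = soup.find("body")   # no need to count children
print(body.p.get_text())   # Here is some simple content for this page.
```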
Finding all instances of a tag at once
Using find_all, we can pull out every matching tag instead of traversing the tree manually:

soup.find_all('p')                 # returns a list of all <p> tags found
# [<p>Here is some simple content for this page.</p>]
soup.find_all('p')[0].get_text()   # go directly to the text we want
# 'Here is some simple content for this page.'
soup.find('p')                     # only gets the 1st instance found
# <p>Here is some simple content for this page.</p>
Finding instances by class or id
page = requests.get("https://dataquestio.github.io/web-scraping-pages/ids_and_classes.html")
soup = BeautifulSoup(page.content, 'html.parser')
outer_text = soup.find_all(class_="outer-text")        # find all elements with a class
outer_text = soup.find_all('p', class_='outer-text')   # find <p> tags with a class
first_id = soup.find_all(id="first")                   # find elements by id
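To see these in action without re-downloading, here is a sketch using an inline fragment shaped like the ids_and_classes page (the fragment itself is made up):

```python
from bs4 import BeautifulSoup

# A made-up fragment with classes and ids, like the page above
html_doc = """
<div>
  <p class="inner-text" id="first">First paragraph.</p>
  <p class="outer-text">Second paragraph.</p>
</div>
"""
soup = BeautifulSoup(html_doc, "html.parser")
print(soup.find_all(class_="outer-text"))        # every element with the class
print(soup.find_all('p', class_='outer-text'))   # only <p> tags with the class
print(soup.find_all(id="first")[0].get_text())   # First paragraph.
```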
Using CSS Selectors
You can also search for items using CSS selectors.
CSS selectors specify tags by pattern, for example:

- div p — finds all p tags inside of a div tag.
- body p a — finds any a tags inside of a p tag inside of a body tag.
- p.outer-text — finds all p tags with a class of outer-text.
- p#first — finds all p tags with an id of first.
- body p.outer-text — finds any p tags with a class of outer-text inside of a body tag.

BeautifulSoup supports CSS selectors through the select method:
soup.select("div p")   # finds all <p> tags inside a <div>
# [<p ...>First paragraph.</p>,
#  <p ...>Second paragraph.</p>]
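Note that select always returns a list, even for a single match; select_one returns just the first match, similar to find. A small sketch on an inline fragment:

```python
from bs4 import BeautifulSoup

# A made-up fragment for demonstration
html_doc = '<div><p id="first">First paragraph.</p><p>Second paragraph.</p></div>'
soup = BeautifulSoup(html_doc, "html.parser")
print(soup.select("div p"))         # every <p> inside a <div>, as a list
print(soup.select_one("p#first"))   # just the first match, like find
```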
Downloading weather data
We will get weather information about downtown San Francisco from the National Weather Service site.
Exploring page structure with Chrome DevTools
The first thing we need to do is examine the page using Chrome DevTools.
We want the div with id seven-day-forecast, which contains the extended forecast. If you click around in the console and explore the div, you will discover that each forecast item (like "Tonight", "Thursday", and "Thursday Night") is contained in a div with the class tombstone-container.
We now know enough to download the page and start parsing it. In the code below, we:
- Download the web page containing the forecast.
- Create a BeautifulSoup object to parse the page.
- Find the div with id seven-day-forecast, and assign it to seven_day.
- Inside seven_day, find each individual forecast item.
- Extract and print the first forecast item.
import requests
from bs4 import BeautifulSoup

page = requests.get("https://forecast.weather.gov/MapClick.php?lat=37.7772&lon=-122.4168")
soup = BeautifulSoup(page.content, 'html.parser')
seven_day = soup.find(id="seven-day-forecast")
forecast_items = seven_day.find_all(class_="tombstone-container")
tonight = forecast_items[0]
print(tonight.prettify())
Extracting information from the page
As you can see, tonight contains all the information we want: the name of the forecast period, a short description of the conditions, and the temperature low. We can extract each one using find with a class name:

period = tonight.find(class_="period-name").get_text()
short_desc = tonight.find(class_="short-desc").get_text()
temp = tonight.find(class_="temp").get_text()
print(period)
print(short_desc)
print(temp)
We can also extract the title attribute of the img tag, which holds a longer text description of the forecast:

img = tonight.find("img")
desc = img['title']
print(desc)
Extracting all the information from the page
Now that we know how to extract each piece of information individually, we can combine CSS selectors with list comprehensions to extract everything at once:

period_tags = seven_day.select(".tombstone-container .period-name")
periods = [pt.get_text() for pt in period_tags]
print(periods)
short_descs = [sd.get_text() for sd in seven_day.select(".tombstone-container .short-desc")]
print(short_descs)
temps = [t.get_text() for t in seven_day.select(".tombstone-container .temp")]
print(temps)
descs = [d["title"] for d in seven_day.select(".tombstone-container img")]
print(descs)
Combining our data into a Pandas DataFrame
We can now combine the data into a Pandas DataFrame and analyze it. A DataFrame is an object that can store tabular data, making data analysis easy. If you want to learn more about any of the topics covered here, check out our interactive courses, which you can start for free.
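The combining step can be sketched as follows; the sample lists stand in for the scraped periods, short_descs, temps, and descs (real forecast values change daily):

```python
import pandas as pd

# Sample values shaped like the scraped lists (real forecasts vary)
periods = ["Tonight", "Thursday", "ThursdayNight"]
short_descs = ["Mostly Clear", "Sunny", "Mostly Clear"]
temps = ["Low: 49 F", "High: 63 F", "Low: 50 F"]
descs = ["Tonight: Mostly clear.", "Thursday: Sunny.", "Thursday Night: Mostly clear."]

weather = pd.DataFrame({
    "period": periods,
    "short_desc": short_descs,
    "temp": temps,
    "desc": descs,
})
# Pull the number out of each temperature string so we can do math on it
weather["temp_num"] = weather["temp"].str.extract(r"(\d+)", expand=False).astype(int)
print(weather)
print(weather["temp_num"].mean())  # 54.0
```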