Web Scraping with Python

Get URL Content

urllib.request module

urllib.request — Extensible library for opening URLs
The urllib.request module defines functions and classes which help in opening URLs (mostly HTTP) in a complex world — basic and digest authentication, redirections, cookies and more.

Example


			from urllib import request

			def get_html(url):
				if url.startswith("http"):
					# build the request - set a browser-like User-Agent
					# so the server does not deny requests from scripts
					req = request.Request(url, headers={'User-Agent': 'Mozilla/5.0'})

					# get the response and return its body
					response = request.urlopen(req)
					return response.read()
				else:
					# for tests and debugging - read a local file instead
					with open(url, "r") as f:
						return f.read()
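
A quick usage sketch (the URL and file name below are only placeholders): the helper fetches a live page when given an http(s) URL, and falls back to reading a local file otherwise.

			# hypothetical usage - the URL and file name are placeholders
			html = get_html("https://example.com/products")
			print(html[:200])                          # first 200 bytes of the response

			local_html = get_html("saved_page.html")   # read a saved copy for debugging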
		

Scrape data using BeautifulSoup

Overview

Beautiful Soup is a Python library for pulling data out of HTML and XML files.
BeautifulSoup Docs
You have to install the module:

			pipenv install beautifulsoup4
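
Before the full example, here is a minimal sketch of the core API (the HTML snippet below is made up for illustration): parse a string with BeautifulSoup and locate elements with find() and select().

			from bs4 import BeautifulSoup

			# tiny, made-up HTML snippet just to illustrate the API
			html = "<ul class='products'><li><h2>Laptop</h2><span>999</span></li></ul>"

			soup = BeautifulSoup(html, 'html.parser')
			item = soup.find('li')
			print(item.h2.string)                   # -> Laptop
			print(item.select('span')[0].string)    # -> 999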
		

Example


			from bs4 import BeautifulSoup

			def scrape_data(page):
				bs_parser = BeautifulSoup(page, 'html.parser')
				products_html = bs_parser.find('ul', 'products')

				products = []
				for product in products_html.find_all("article"):
					# the product name sits in a nested <div><h2> element
					try:
						name = product.select("div h2")[0].string
					except IndexError:
						name = "NoName"

					# the price is rendered inside an SVG <tspan> element
					try:
						price = int(product.select("div tspan")[0].string)
					except (IndexError, TypeError, ValueError):
						price = None

					products.append((name, price))

				return products
		

The whole code

You can play with the code on laptopbg-scraper-with-BS4
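
Putting the two helpers together gives a small end-to-end pipeline (a rough sketch; the URL below is only a placeholder, the real target lives in the repository linked above):

			# hypothetical entry point combining get_html() and scrape_data();
			# the URL is a placeholder
			if __name__ == "__main__":
				page = get_html("https://example.com/laptops")
				for name, price in scrape_data(page):
					print(name, price)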

Submission

Please prefix your filenames/archive with your name initials before sending.
For instance: iep_task1.py or iep_tasks.rar
Send files to progressbg.python.course@gmail.com

These slides are based on a customised version of Hakimel's reveal.js framework.