Web Scraping with Python

Get URL Content

urllib.request module

urllib.request — Extensible library for opening URLs
The urllib.request module defines functions and classes which help in opening URLs (mostly HTTP) in a complex world — basic and digest authentication, redirections, cookies and more.

Example


			from urllib import request

			def get_html(url):
				if url.startswith("http"):
					# build the request - set a browser-like User-Agent
					# so the server does not deny requests from scripts
					req = request.Request(url, headers={'User-Agent': 'Mozilla/5.0'})

					# get the response and return its body
					response = request.urlopen(req)
					return response.read()
				else:
					# for tests and debugging - read a local file instead
					with open(url, "r") as f:
						return f.read()
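
A quick usage sketch (the URL and file name below are only placeholders): the helper fetches a live page when given an http(s) URL, and falls back to reading a local file otherwise.

			# hypothetical usage - the URL and file name are placeholders
			html = get_html("https://example.com/products")
			print(html[:200])                          # first 200 bytes of the response

			local_html = get_html("saved_page.html")   # read a saved copy for debugging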
		

Scrape data using BeautifulSoup

Overview

Beautiful Soup is a Python library for pulling data out of HTML and XML files.
BeautifulSoup Docs
You have to install the module:

			pipenv install beautifulsoup4
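
Before the full example, here is a minimal sketch of the core API (the HTML snippet below is made up for illustration): parse a string with BeautifulSoup and locate elements with find() and select().

			from bs4 import BeautifulSoup

			# tiny, made-up HTML snippet just to illustrate the API
			html = "<ul class='products'><li><h2>Laptop</h2><span>999</span></li></ul>"

			soup = BeautifulSoup(html, 'html.parser')
			item = soup.find('li')
			print(item.h2.string)                   # -> Laptop
			print(item.select('span')[0].string)    # -> 999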
		

Example


			from bs4 import BeautifulSoup

			def scrape_data(page):
				bs_parser = BeautifulSoup(page, 'html.parser')
				products_html = bs_parser.find('ul', 'products')

				products = []
				for product in products_html.find_all("article"):
					# the product name sits in a nested <div><h2> element
					try:
						name = product.select("div h2")[0].string
					except IndexError:
						name = "NoName"

					# the price is rendered inside an SVG <tspan> element
					try:
						price = int(product.select("div tspan")[0].string)
					except (IndexError, TypeError, ValueError):
						price = None

					products.append((name, price))

				return products
		

The whole code

You can play with the code on laptopbg-scraper-with-BS4
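
Putting the two helpers together gives a small end-to-end pipeline (a rough sketch; the URL below is only a placeholder, the real target lives in the repository linked above):

			# hypothetical entry point combining get_html() and scrape_data();
			# the URL is a placeholder
			if __name__ == "__main__":
				page = get_html("https://example.com/laptops")
				for name, price in scrape_data(page):
					print(name, price)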

Submission

Please prefix your filenames/archive with your name initials before sending.
For instance: iep_task1.py or iep_tasks.rar
Send files to progressbg.python.course@gmail.com

These slides are based on a customised version of Hakimel's reveal.js framework.