Parsing various file formats

Data Serialization

Overview

Data serialization is the process of converting structured data (lists, dictionaries, objects, etc.) into a format that can be stored (for example, in a file) or transmitted, and reconstructed later.
Python's native serialization module is called pickle.
pickle — Python object serialization

Example


      import pickle

      # let's serialize a simple dict
      prices = { 'apples': 2.50, 'oranges': 1.90, 'bananas': 2.40 }

      # convert the object to a serialized byte string (a bytes object)
      serialized_prices = pickle.dumps( prices )
      print(serialized_prices)

      #de-serialize (unpickle) an object
      received_prices = pickle.loads( serialized_prices )
      print(received_prices)
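
The overview mentions storing serialized data in a file. A minimal sketch of that follows, assuming a throwaway file name `prices.pickle` (hypothetical) inside a temporary directory. `pickle.dump()` and `pickle.load()` are the file-based counterparts of `dumps()` and `loads()`, and the file has to be opened in binary mode.

```python
import os
import pickle
import tempfile

prices = {'apples': 2.50, 'oranges': 1.90, 'bananas': 2.40}

# hypothetical file name, placed in a temporary directory for the demo
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'prices.pickle')

    # pickle produces bytes, so the file is opened in binary mode
    with open(path, 'wb') as f:
        pickle.dump(prices, f)

    # read the bytes back and rebuild the dict
    with open(path, 'rb') as f:
        restored_prices = pickle.load(f)

print(restored_prices == prices)
```

Only unpickle data that comes from a source you trust, since loading a pickle can execute arbitrary code.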
    

Notes

For storing data in databases and archival storage, you’re probably better off using a more standard data encoding, such as JSON, XML or CSV.
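
As a rough illustration of the note above, the same `prices` dict can be written as JSON with the standard `json` module. The result is plain text that other languages and tools can read, unlike the Python-specific byte stream that pickle produces.

```python
import json

prices = {'apples': 2.50, 'oranges': 1.90, 'bananas': 2.40}

# JSON output is a plain str, not bytes
json_prices = json.dumps(prices)
print(json_prices)

# for a simple dict of strings and numbers the round trip is lossless
print(json.loads(json_prices) == prices)
```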

Parsing JSON

Overview

JSON: JavaScript Object Notation
JSON is an open-standard file format that uses human-readable text to transmit data objects consisting of attribute–value pairs and array data types (or any other serializable value).
JSON @wikipedia

Overview - Examples

Example of JSON data (from Examples @JSON Schema):


      {
          "title": "Person",
          "type": "object",
          "properties": {
              "firstName": {
                  "type": "string"
              },
              "lastName": {
                  "type": "string"
              },
              "age": {
                  "description": "Age in years",
                  "type": "integer",
                  "minimum": 0
              }
          },
          "required": ["firstName", "lastName"]
      }
    

Overview - JSON in Python

The Python standard library provides the json module.
json — JSON encoder and decoder

Parse JSON string to Python objects.

json.loads() function
Deserializes s (a str, bytes or bytearray instance containing a JSON document) to a Python object, using the conversion table given in the json module documentation.
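
The conversion table maps each JSON type to a Python type. A small sketch of what `json.loads()` returns for each of them (the sample document and its keys are made up for illustration):

```python
import json

# made-up sample document that covers every JSON type
doc = """
{
    "name": "apple",
    "price": 1.80,
    "stock": 12,
    "organic": true,
    "origin": null,
    "tags": ["fruit", "red"]
}
"""

data = json.loads(doc)

# JSON object -> dict
print(type(data).__name__)

# string -> str, real number -> float, integer -> int,
# true/false -> bool, null -> NoneType, array -> list
for key, value in data.items():
    print(key, type(value).__name__)
```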

json.loads() example


      import json
      from operator import itemgetter

      json_str = """ 
        [
            {
                "name": "apple",
                "price": 1.80
            },
            {
                "name": "orange",
                "price": 2.10
            },
            {
                "name": "bananas",
                "price": 1.60
            }
        ]
      """


      #read json from string
      json_data = json.loads(json_str)

      # print the list sorted by "price" 
      for i in sorted(json_data,key=itemgetter("price")):
        print(i)
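
If the input is not valid JSON, `json.loads()` raises `json.JSONDecodeError` (a subclass of `ValueError`). A minimal sketch of handling that, using a deliberately broken string and a hypothetical helper named `parse_or_none`:

```python
import json

def parse_or_none(text):
    """Return the parsed object, or None if text is not valid JSON."""
    try:
        return json.loads(text)
    except json.JSONDecodeError as err:
        # lineno and colno locate the problem in the input
        print(f"Invalid JSON at line {err.lineno}, column {err.colno}: {err.msg}")
        return None

# JSON requires double quotes, so single-quoted keys are rejected
print(parse_or_none("{'name': 'apple'}"))

# the same data with double quotes parses normally
print(parse_or_none('{"name": "apple"}'))
```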
    

Parse JSON file to Python objects.

json.load() function
Deserializes fp (a .read()-supporting file-like object containing a JSON document) to a Python object, using the same conversion table as json.loads().

json.load() sample file


        [
          {
              "name": "apple",
              "price": 1.80
          },
          {
              "name": "orange",
              "price": 2.10
          },
          {
              "name": "bananas",
              "price": 1.60
          }
        ]
    

json.load() example


    	import json
    	from operator import itemgetter

    	json_file = "sample.json"

    	#read json from file
    	with open(json_file) as f:
    	    json_data = json.load(f)

    	for i in sorted(json_data,key=itemgetter("price")):
    	  print(i)
    

Convert Python objects to JSON

json.dump() function
Serializes obj as a JSON-formatted stream to fp (a .write()-supporting file-like object), using the conversion table given in the json module documentation.
json.dumps() function
Same as dump(), but serializes to a string and returns it.
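
A minimal sketch of `json.dump()` writing to a file, then reading the file back with `json.load()` to check the round trip. The file name `fruits.json` and the sample data are assumptions made for the demo, and a temporary directory is used so that nothing is left behind.

```python
import json
import os
import tempfile

fruits = [
    {"name": "apple", "price": 1.80},
    {"name": "orange", "price": 2.10},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "fruits.json")  # hypothetical file name

    # dump() writes text, so the file is opened in text mode
    with open(path, "w", encoding="utf-8") as f:
        json.dump(fruits, f, indent=2)

    # read the file back to check the round trip
    with open(path, encoding="utf-8") as f:
        loaded = json.load(f)

print(loaded == fruits)
```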

json.dumps() - list example


  		import json

  		mylist = [1,2,3]

  		matrix = [
  		    [1,2,3],
  		    [4,5,6],
  		    [7,8,9],
  		]

  		print('List :', json.dumps(mylist))
  		print('Matrix :', json.dumps(matrix))
  	

json.dumps() - indented list example


  		import json

  		mylist = [1,2,3]

  		matrix = [
  		    [1,2,3],
  		    [4,5,6],
  		    [7,8,9],
  		]

  		print('List :', json.dumps(mylist,indent=2))
  		print('Matrix :', json.dumps(matrix,indent=2))
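
`json.dumps()` handles dictionaries in the same way. A short sketch, using a made-up `book` dict, of two details worth knowing: `sort_keys=True` gives a stable key order, and Python-only values are mapped to the nearest JSON type (a tuple becomes an array, `None` becomes `null`), so they do not come back unchanged.

```python
import json

# made-up data: note the tuple and the None value
book = {"title": "Example", "tags": ("python", "json"), "isbn": None}

text = json.dumps(book, sort_keys=True)
print(text)

# after a round trip the tuple has become a list
restored = json.loads(text)
print(type(restored["tags"]).__name__)
```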
  	

Parsing CSV

Overview

CSV: Comma-Separated Values
CSV is the most common import and export format for spreadsheets and databases.
For CSV parsing, Python provides the built-in csv module.
csv — CSV File Reading and Writing

Parsing CSV data - example


  		Symbol,Price,Date,Time,Change,Volume
  		"AA",39.48,"6/11/2007","9:36am",-0.18,181800
  		"AIG",71.38,"6/11/2007","9:36am",-0.15,195500
  		"AXP",62.58,"6/11/2007","9:36am",-0.46,935000
  		"BA",98.31,"6/11/2007","9:36am",+0.12,104800
  		"C",53.08,"6/11/2007","9:36am",-0.25,360900
  		"CAT",78.29,"6/11/2007","9:36am",-0.23,225400
  	

Parsing CSV data - example


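  		import csv

  		# the csv module expects files to be opened with newline=''
  		with open('sample_data.csv', newline='') as f:
  		    f_csv = csv.reader(f)
  		    # the first row holds the column names
  		    headers = next(f_csv)
  		    # sort the remaining rows by the first column (Symbol)
  		    for row in sorted(f_csv, key=lambda a: a[0]):
  		        print(row)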
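
`csv.reader` returns every field as a string, addressed by position. As a sketch of an alternative, `csv.DictReader` uses the header row as keys, and numeric fields can then be converted explicitly. The data is embedded through `io.StringIO` here only to keep the example self-contained; with a real file you would pass the open file object instead.

```python
import csv
import io

csv_text = '''Symbol,Price,Date,Time,Change,Volume
"AA",39.48,"6/11/2007","9:36am",-0.18,181800
"AIG",71.38,"6/11/2007","9:36am",-0.15,195500
"BA",98.31,"6/11/2007","9:36am",+0.12,104800
'''

rows = []
for row in csv.DictReader(io.StringIO(csv_text)):
    # every value arrives as str, so convert the numeric columns
    row['Price'] = float(row['Price'])
    row['Volume'] = int(row['Volume'])
    rows.append(row)

# sort by price, highest first
by_price = sorted(rows, key=lambda r: r['Price'], reverse=True)
for row in by_price:
    print(row['Symbol'], row['Price'], row['Volume'])
```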
  	

Parsing XML

Overview

XML: eXtensible Markup Language
XML is a markup language that defines a set of rules for encoding documents in a format that is both human-readable and machine-readable.
XML @wikipedia

Overview

Processing XML in Python:

xml
Built-in XML Processing Modules
untangle
Converts XML to Python objects
xmltodict
xmltodict is a Python module that makes working with XML feel like you are working with JSON

Parsing XML data - example


			<data>  
			    <items>
			        <item atr1="value1">item1 data</item>
			        <item atr2="value2">item2 data</item>
			    </items>
			</data>  
		

Parsing XML data - example


			import xml.etree.ElementTree as ET  
			tree = ET.parse('sample_data.xml')  
			root = tree.getroot()


			# print items attributes
			print('\nAll attributes:')  
			for elem in root:  
			    for subelem in elem:
			        print(subelem.attrib)

			# print items data
			print('\nAll item data:')  
			for elem in root:  
			    for subelem in elem:
			        print(subelem.text)
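
The nested loops above depend on the exact depth of the document. As a sketch of an alternative, `Element.iter()` finds elements by tag name at any depth, and `find()` accepts a simple path. The XML is parsed from a string with `ET.fromstring()` only to keep the example self-contained.

```python
import xml.etree.ElementTree as ET

xml_text = """
<data>
    <items>
        <item atr1="value1">item1 data</item>
        <item atr2="value2">item2 data</item>
    </items>
</data>
"""

root = ET.fromstring(xml_text.strip())

# iter() walks the whole tree, whatever the nesting depth
items = [(item.attrib, item.text) for item in root.iter('item')]
for attrib, text in items:
    print(attrib, text)

# find() returns the first match for a path relative to root;
# get() returns None when the attribute is missing
first = root.find('items/item')
print(first.get('atr1'), first.get('atr2'))
```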
		

Exercises

Task 1: Extract the "Advanced" books from a JSON file

The Task

You are given the JSON file pythonbooks.revolunet.com.issues.json, which contains information about the books listed on pythonbooks.revolunet.com.
Write a program that extracts from this data only the "title", "author" and "url" fields of the books labeled "Advanced", and prints the extracted information as shown in the sample output.

Submission

Please prefix your file or archive names with your initials before sending.
For instance: iep_task1.py or iep_tasks.rar
Send files to progressbg.python.course@gmail.com

These slides are based on a customised version of Hakimel's reveal.js framework.