Basics of data analysis using Python

Course Leader: Vitalii Naumov

Home Institution: Cracow University of Technology

Course prerequisites: basics of calculus and probability theory; basic programming skills are desirable but not mandatory

Course Overview

In the article “Data Scientist: The Sexiest Job of the 21st Century,” published in Harvard Business Review in October 2012, T.H. Davenport and D.J. Patil predicted that data scientists would become the most sought-after specialists in every market, driven by the development of information and communication technologies. This trend continues today: despite the numerous courses and specializations launched at universities and online, data science professionals remain among the most sought-after specialists in every field.

The course is intended for anyone who wants to acquire essential data analysis skills and, in doing so, catch the wave and become a sought-after professional.

During the course, students will become acquainted with the theoretical basis of data science – statistical analysis. We’re going to begin with the description of a random variable, its distribution functions, and numeric characteristics. Then I will present the basics of distribution fitting and more advanced techniques of mathematical statistics – correlation and regression analysis.
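As an illustration of the numeric characteristics and correlation analysis mentioned above, a minimal sketch in plain Python might look like the following. The function names and the sample data are made up for this example and are not taken from the course materials; in practice, library routines such as those in NumPy or pandas would usually be preferred.

```python
import math


def mean(values):
    """Arithmetic mean of a non-empty sequence of numbers."""
    return sum(values) / len(values)


def sample_variance(values):
    """Unbiased sample variance (divides by n - 1)."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values) / (len(values) - 1)


def pearson_r(xs, ys):
    """Pearson's product-moment correlation of two equal-length samples."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Hypothetical data: hours studied and test scores (illustrative only)
hours = [1.0, 2.0, 3.0, 4.0, 5.0]
scores = [2.0, 4.0, 6.0, 8.0, 10.0]

print(mean(hours))              # sample mean
print(sample_variance(hours))   # sample variance
print(pearson_r(hours, scores)) # close to 1 for perfectly linear data
```

Because the made-up scores are an exact linear function of the hours, the correlation coefficient comes out at (numerically very close to) 1.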

All the methods and techniques presented will be accompanied by the corresponding tools in the Python programming language. Students will learn the basics of Python and will also become acquainted with the most popular libraries for data analysis: pandas, NumPy, Matplotlib, and scikit-learn.

In the last part of the course, I will present essential machine learning tools – simple classifiers and neural networks. Implementation of these tools and their features will be explained with the help of examples in Python.
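To give a feel for the kind of example this part of the course refers to, here is a minimal sketch of a perceptron trained with the classic error-driven update rule, written in plain Python. The function names, the learning rate, the number of epochs, and the toy data (the logical AND function) are assumptions chosen for illustration; the course itself may rely on scikit-learn or another implementation.

```python
def train_perceptron(samples, labels, epochs=20, lr=1.0):
    """Train a perceptron on labels in {-1, +1}; return (weights, bias)."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            activation = sum(w * xi for w, xi in zip(weights, x)) + bias
            predicted = 1 if activation >= 0 else -1
            if predicted != y:
                # Move the decision boundary toward the misclassified point
                weights = [w + lr * y * xi for w, xi in zip(weights, x)]
                bias += lr * y
    return weights, bias


def predict(weights, bias, x):
    """Classify one sample as +1 or -1."""
    activation = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if activation >= 0 else -1


# Toy, linearly separable data: logical AND encoded with -1 / +1 labels
samples = [(0, 0), (0, 1), (1, 0), (1, 1)]
labels = [-1, -1, -1, 1]

weights, bias = train_perceptron(samples, labels)
predictions = [predict(weights, bias, x) for x in samples]
print(predictions)
```

The perceptron finds a separating line only when the classes are linearly separable, which is why AND works here while a function such as XOR would not.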

Learning Outcomes

By the end of the course, students will be able to use the Python language and its libraries to perform basic data processing operations. They will be proficient in statistical inference, including distribution fitting, correlation, and regression analysis. Students will also have basic skills in data visualization using Python libraries.

Course Content

  1. Basics of Python: data types, conditions, loops, functions
  2. Random variable: distribution functions, numeric characteristics
  3. Creating simple functions in Python for basic data analysis
  4. Basic distributions of random variables: discrete and continuous distributions
  5. Using Python for distribution fitting: Pearson’s chi-squared test and Kolmogorov-Smirnov test
  6. NumPy library: the most important functions for data processing and analysis
  7. Correlation analysis: Pearson’s product-moment coefficient, rank correlation coefficients, and correlation matrices
  8. Introduction to machine learning: simple classifiers with the scikit-learn library (decision trees and the k-nearest neighbors method)
  9. Regression analysis using Python: estimation of regression coefficients and significance tests
  10. Basics of neural networks with Python: linear classification using the perceptron
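
As a small illustration of topic 9 in the list above, the estimation of regression coefficients for a simple linear model y = a + b*x can be sketched with ordinary least squares in plain Python. The function names and the data are hypothetical and serve only as an example; significance tests are not covered in this sketch, and the course may use library routines instead.

```python
def fit_line(xs, ys):
    """Ordinary least squares estimates (intercept, slope) for y = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    slope = s_xy / s_xx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def r_squared(xs, ys, intercept, slope):
    """Coefficient of determination of the fitted line."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


# Made-up observations scattered around the line y = 1 + 2x
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.1, 2.9, 5.2, 6.8, 9.0]

intercept, slope = fit_line(xs, ys)
r2 = r_squared(xs, ys, intercept, slope)
print(intercept, slope, r2)
```

The estimated slope and intercept land close to the values 2 and 1 used to make up the data, and the coefficient of determination is close to 1.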

Instructional Method

Course time will be split equally (50/50) between lectures and individual projects.

Required Course Materials

All the required materials will be provided by the instructor during the course.

Recommended additional reading:

Madsen, B.S. Statistics for Non-Statisticians, Springer, 2016

Downey, A.B. Think Python: How to Think Like a Computer Scientist, O'Reilly, 2015

Raschka, S., Mirjalili, V. Python Machine Learning, Packt, 2017

Assessment

The final grade will be based on two tests (a midterm and a final) and the project developed during the course. The tests together will contribute 80% of the final grade, and the project the remaining 20%.