| Date | Lecture Topics and Materials | Assignments |
| --- | --- | --- |
| Tue 9/2 | Introduction: What is data science. Major tools used by data scientists. Class overview. Lecture Notes. | Lab 0: Basic usage of github, VirtualBox, IPython Notebook (Due 9/12) |
| Thu 9/4 | Basic Statistics: statistical tests, samples, fallacies. Lecture Notes. | |
| Tue 9/9 | Basic Statistics: linear regression, classification, clustering. Lecture Notes. | Lab 1: Python basic stats and plotting (Due 9/19) |
| Thu 9/11 | Data Models: Overview, Why modeling is essential, Commonly used models (Relational, JSON, Protocol Buffers). Lecture Notes. | |
| Tue 9/16 | Relational Databases, SQL. Lecture Notes. | Lab 2: Basic SQL; Python Pandas and Dataframes; Avro (Due 10/3) |
| Thu 9/18 | (cntd) | |
| Tue 9/23 | (cntd) | |
| Thu 9/25 | Data scraping and wrangling, Unix tools, GUIs. Lecture Notes. | Lab 3: Advanced SQL and Pandas (Due 10/10) |
| Tue 9/30 | (cntd) | |
| Thu 10/2 | Data Integration: Overview, Schema mapping, Entity Resolution (Lecture Notes Continued) | Lab 4: Data cleaning using unix tools, Data Wrangler (Due 10/17) |
| Tue 10/7 | (cntd) | |
| Thu 10/9 | Information Extraction: Overview, Key Techniques (Lecture Notes Continued) | Lab 5: Entity Resolution and Information Extraction (Due 10/28) |
| Tue 10/14 | Implementation of Relational Databases. Lecture Notes. | |
| Thu 10/16 | (cntd) | |
| Tue 10/21 | Distributed programming frameworks: Parallel Databases, MapReduce, Apache Spark, Hadoop Ecosystem. Lecture Notes. | Lab 6: Hadoop, Spark (Due: 11/7) |
| Thu 10/23 | MIDTERM | |
| Tue 10/28 | Distributed programming frameworks (cntd) | |
| Thu 10/30 | (cntd) | |
| Tue 11/4 | (cntd) | Lab 7: Cassandra and MongoDB (Due: 11/17) |
| Thu 11/6 | (cntd) | |
| Tue 11/11 | Key-value stores: Basics, Differences from Relational Databases, Consistency/Replication issues. Lecture Notes. | Lab 8: Spark Streaming, Storm (Due: 11/26) |
| Thu 11/13 | (cntd) | |
| Tue 11/18 | Visualization: D3.js (see Lab 10 for notes) | |
| Thu 11/20 | Data streaming/Real-time analytics: Data streams in relational databases, Spark Streaming, Storm. Lecture Notes. | Lab 9: Neo4j, GraphX (Due: 12/8) |
| Tue 11/25 | (cntd) | |
| Tue 12/2 | Graph Databases and Graph Analytics. Lecture Notes. | Lab 10: D3 (Due: 12/11) |
| Thu 12/4 | (cntd) | |
| Tue 12/9 | Cloud computing: Overview, Virtualization, Data centers, Platform/Infrastructure-as-a-Service. Lecture Notes. | |
| Thu 12/11 | (cntd) | |