Beautiful Data: Introduction to Practical Data Science

      Fall 2017

      Instructor: Alex Szalay

      TA: Heshy Roskes

Class times: MW 15:00-16:15
Class location: Bloomberg 176

Send mail to

Homework assignments:
    There will be 6 homework assignments over the semester, roughly one every 2 weeks, including a take-home mid-term and a take-home final. The homeworks will contain practical problems related to data science, using real data that contains real errors. Solutions should be done in an online IPython notebook, set up on a server at JHU, so no software installation is required.

    Collaboration in small teams is encouraged. Grades will be based on the effort demonstrated during class and in the homeworks, as the problems do not necessarily have a single "correct" solution.

  • Homework #1, due Sep 26

Useful material:


Database resources:

PowerPoint links

  • Data-Intensive Computing
    • The Fourth Paradigm
    • History of e-Science
    • Big Data in Science
  • Introduction to Databases
    • Relational databases, ACID
    • Indexing
    • Introduction to SQL
    • User defined functions
  • Hardware architectures
    • Storage hierarchy
    • Nature of low level I/O
    • Redundant storage, RAID, erasure codes
    • Networking issues
    • Balanced systems, Amdahl's Laws
    • Cloud computing vs Beowulf
  • Elementary Statistics
    • Distributions
    • Expectation values, moments
    • Central limit theorem
    • Linear regression
    • Principal component analysis
    • Random forests
  • Data transformations
    • Fourier transforms
    • Wavelets
    • Random projections
  • Data structures
    • Trees
    • K-d trees
    • Quad- and octrees, space filling curves
  • Hashing
    • Hash functions
    • Locality sensitive hashing
    • Bloom filters
  • Graphs
    • Representation of graphs
    • Properties of graphs
    • Laplacian, eigenvalues
    • Graphs as spring networks
  • Sorting and Searching
    • Quicksort
    • Queues
    • Merge-sort
  • Data streams, streaming algorithms
    • Mean, median
    • Sketching
  • Data visualization