Beautiful Data: Introduction to Practical Data Science

      Spring 2020

      Instructor: Alex Szalay

      TA: Yuzo Ishikawa

Class times: MW 15:00-16:15
Class location: Bloomberg 176
Class Zoom link:

Send mail to

Homework assignments:
    There will be 6 homework assignments over the semester, roughly one every 2 weeks, including a take-home mid-term and a take-home final. The homeworks will contain practical problems related to data science, using real data that contains real errors. The solutions should be done in an online IPython notebook, set up on a server at JHU, so no software installation is required. A copy of the data files should be available online in a data container as part of the SciServer environment, but as a backup they can also be found here.

    Collaboration in small teams is encouraged. Grades will be based on the effort demonstrated during class and in the homeworks, as the problems do not necessarily have a single "correct" solution.

Remote classes:
Office Hours: Thursdays 10am-11am, 2pm-3pm EDT
    For office hours and for individual and small-group discussions, we will use my personal Zoom account. I have set up a waiting room, so we can arrange both individual and group discussions. Send an email ahead of time to arrange a meeting.
Useful material:


PowerPoint links

  • Data-Intensive Computing
    • The Fourth Paradigm
    • History of e-Science
    • Big Data in Science
  • Introduction to Databases
    • Relational databases, ACID
    • Indexing
    • Introduction to SQL
    • User defined functions
  • Hardware architectures
    • Storage hierarchy
    • Nature of low level I/O
    • Redundant storage, RAID, erasure codes
    • Networking issues
    • Balanced systems, Amdahl's Laws
    • Cloud computing vs Beowulf
  • Elementary Statistics
    • Distributions
    • Expectation values, moments
    • Central limit theorem
    • Linear regression
    • Principal component analysis
    • Random forests
  • Data transformations
    • Fourier transforms
    • Wavelets
    • Random projections
  • Data structures
    • Trees
    • K-d trees
    • Quad- and octrees, space filling curves
  • Hashing
    • Hash functions
    • Locality sensitive hashing
    • Bloom filters
  • Graphs
    • Representation of graphs
    • Properties of graphs
    • Laplacian, eigenvalues
    • Graphs as spring networks
  • Sorting and Searching
    • Quicksort
    • Queues
    • Merge-sort
  • Data streams, streaming algorithms
    • Mean, median
    • Sketching
  • Data visualization