Beautiful Data: Introduction to Practical Data Science

      Spring 2019

      Instructor: Alex Szalay

      TA: Lingyuan Ji


Class times: MW 15:00-16:15
Class location: Bloomberg 176

Send mail to

Homework assignments:
    There will be 6 homework assignments over the semester, roughly one every 2 weeks, including a take-home mid-term and a take-home final. The homeworks will contain practical problems related to data science, using real data that contains real errors. The solutions should be done in an on-line iPython notebook, set up on a server at JHU, so no software installation is required. A copy of the data files should be on-line in a data container as part of the SciServer environment, but as a backup they can also be found here.

    Collaboration in small teams is encouraged. Grades will be based on the effort demonstrated during class and in the homeworks, as the problems do not necessarily have a single "correct" solution.

  • Homework #1, due Feb 25

Useful material:

Resources:

Database resources:

Powerpoint links


Syllabus
  • Data-Intensive Computing
    • The Fourth Paradigm
    • History of e-Science
    • Big Data in Science
  • Introduction to Databases
    • Relational databases, ACID
    • Indexing
    • Introduction to SQL
    • User defined functions
  • Hardware architectures
    • Storage hierarchy
    • Nature of low level I/O
    • Redundant storage, RAID, erasure codes
    • Networking issues
    • Balanced systems, Amdahl's Laws
    • Cloud computing vs Beowulf
  • Elementary Statistics
    • Distributions
    • Expectation values, moments
    • Central limit theorem
    • Linear regression
    • Principal component analysis
    • Random forests
  • Data transformations
    • Fourier transforms
    • Wavelets
    • Random projections
  • Data structures
    • Trees
    • K-d trees
    • Quad- and octrees, space filling curves
  • Hashing
    • Hash functions
    • Locality sensitive hashing
    • Bloom filters
  • Graphs
    • Representation of graphs
    • Properties of graphs
    • Laplacian, eigenvalues
    • Graphs as spring networks
  • Sorting and Searching
    • Quicksort
    • Queues
    • Merge-sort
  • Data streams, streaming algorithms
    • Mean, median
    • Sketching
  • Data visualization