Beautiful Data: Introduction to Practical Data Science

      Spring 2024

      Instructor: Alex Szalay

      TA: Sanjana Sekhar


Class times: MW 15:00-16:15
Class location: Bloomberg 464

Send mail to

Office Hours: Tuesdays 3pm-4pm EDT
    For office hours and for individual or small-group discussions, we will use either my office or my personal Zoom account. Please send an email ahead of time to schedule a meeting.
Remote classes:
Homework assignments:

    The goal of this class is to help you develop an intuitive sense of how to tackle real-life data problems, and to understand the deep connections between different areas of statistics, applied mathematics, computer science, and signal processing.

    The class requires prior knowledge of Linear Algebra and Calculus. If you have not mastered these, much of the material will be a struggle, so I recommend that you take this class only after you have completed these prerequisites.

    There will be about 8-10 homework assignments over the semester, roughly one every 2 weeks, including a take-home midterm and a take-home final. The homework will contain practical problems related to data science, with real data, containing real errors. The solutions should be done in an online IPython notebook, set up on a server at JHU, so no software installation is required. A copy of the data files should be online in a data container as part of the SciServer environment, but as a backup they can also be found here.

    Collaboration in small teams is encouraged. Grades will be based on the effort demonstrated during class and in the homework, as the problems do not necessarily have a single "correct" solution. There is a short template Jupyter notebook that you can start from. You should download it, upload it into your own Jupyter directory on SciServer, and insert your additional commands there.

    The first homework should be emailed to Sanjana as a PDF file.

  • Homework #1, due Jan 31, 2024
  • Homework #2, due Feb 14, 2024
  • Homework #3, due Feb 28, 2024
  • Homework #4 (midterm), due Mar 25, 2024
  • Homework #5, due Apr 15, 2024
  • Homework #6, due May 13, 2024
Useful material:

Resources:

Powerpoint links

Syllabus
  • Data-Intensive Computing
    • The Fourth Paradigm
    • History of e-Science
    • Big Data in Science
  • Introduction to Databases
    • Relational databases, ACID
    • Indexing
    • Introduction to SQL
    • User defined functions
  • Hardware architectures
    • Storage hierarchy
    • Nature of low level I/O
    • Redundant storage, RAID, erasure codes
    • Networking issues
    • Balanced systems, Amdahl's Laws
    • Cloud computing vs Beowulf
  • Elementary Statistics
    • Distributions
    • Expectation values, moments
    • Central limit theorem
    • Linear regression
    • Principal component analysis
    • Random forests
  • Data transformations
    • Fourier transforms
    • Wavelets
    • Random projections
  • Data structures
    • Trees
    • K-d trees
    • Quad- and octrees, space filling curves
  • Hashing
    • Hash functions
    • Locality sensitive hashing
    • Bloom filters
  • Graphs
    • Representation of graphs
    • Properties of graphs
    • Laplacian, eigenvalues
    • Graphs as spring networks
  • Sorting and Searching
    • Quicksort
    • Queues
    • Merge-sort
  • Data streams, streaming algorithms
    • Mean, median
    • Sketching
  • Introduction to Machine Learning
    • Dimensional reduction, embedding techniques
    • Classification and regression
    • Deep Learning, Architectures for CNNs