Description of the Covid-19 data set

Posted on April 03, 2020

The Data
    The current data is all placed into a SQL Server database on the SciServer. This is accessible through SQL by selecting CasJobs upon login, or through iPython notebooks (see template below).

    The data is assembled from several sources. One is US data about the virus in different geographic areas, assembled by the New York Times [NYT]. This consists of two separate tables, one with data with a State granularity, the other with details from every County in each state [1,2,3]. These two tables are called statsS and statsC for the statistics about the states and counties, respectively. They contain a date column (date), the number of days measured from 2020-01-21 (dd), the name of the state (state), the county (county), the cumulative number of cases testing positive (cases), and the number of deaths (deaths), as reported on every day.

    We have also added a few other tables. One contains simple data about each of the states, the name, the (FIPS code, and the postal code).

    We also have harvested a simple census summary database for each county from the US Census Bureau. This table contains teh FIPS code, state, county, population (Population), housing units (HousingUnits), the total area (TotalArea), water area (WaterArea), land area (LandArea), population density (PopDensity) and density of houses (HouseDensity). This can provide relevant sociometric information about the importance of different factors in infection rates. The census data has been modified to be consistent with the NYT data, i.e. the two cities [1,2] have been added, and the relevant counties have been removed. In order to properly capture the average population density, we have used a population weighted averaging procedure over the different counties.

    An addition table is about policies and interventions that the different states and the federal authorities have implemented. The source of this is on github, created by Jie Ying Wu of JHU. The columns are FIPS (fips), state, county, and a column for each type of intervention, containing the day when the ruling was announced. The origin has been modified to 2020-01-21, matching the startt date of the NYT data. The actions are (with column names): stay at home (stayhome), no gatherings for more then 50 people (gather50), no gatherings for more than 500 people (gather500), schools closed (schools), no dine-in inrestuarants (dinein), no public entertainment or public gym use (entgym), and a federal ban on foreign travel (fedGuide).

    Here are some relevant SQL files for creating the database. These may be useful to see how a real life application is put together.
    1. One exception is county='New York City', which is reported as a single county, while it is really composed of the Bronx, New York, Kings and Queens counties, and appears under the name 'New York City', with a FIPS=36999.
    2. The other exception is county='Kansas City', which spans two different states, yet aggregated in the reported data as a single virtual county. It is composed of the following: Jackson, Cass, Clay, Platte counties in Missouri, and Johnson, Leavenworth and Wyandotte counties in Kansas. This has FIPS=29999 assigned.
    3. For almost every state there are a few cases reported with county='Unknown'. As there was no FIPS added to these, we have assigned 1000*(stateFIPS)+998 to these data.
Using the data through SQL:
    1. Using your SciServer account, login to the SciServer. At the Dashboard, click on CasJobs.
    2. A dark screen appears. On the upper left, just below Context, click on the drop-down menu, and select the COVIDNYT database context.
    3. Type in the following simple SQL statement: select * from States
    4. Find the [quick] button in the upper right-hand corner, and press it. The result should be returned to you instanteneously, and shown on the bottom of the screen.
    5. Just below the displayed set of rows and columns, click on the format drop-down list, showing HTML and click "Save As". An HTML fiule should be downloaded to your browser. Instead of HTML, you can also select CSV as the desired output format. Try this, and the query result should appear as a .csv file.
    6. Now that you have successfully ran your first query, click on Help, and explore the system. You should be able to save results also in your own personal database. Try to type select * into MyDB.mystates from States and press "Quick". No results come back to the screen, rather the result set has been saved in your own database as a table called mystates. Click on MyDB in the Menu Bar, and you can see this table. You can include this in further queries, or download the data as html or csv.
    Shortly I will add a few additional query examples here that you can explore.
Using the data with iPython
    Here is a link to an iPython notebook that will show some examples on how to run a SQL query against the database, fetch the result into a Pandas DataFrame, and plot some attributes.