v3
@@ -2,3 +2,5 @@ # Data Analysis Project: COVID-19 Risk Evaluation
* Author: Marco Andronaco * Subject: Statistical Laboratory * Teacher: Alessandro Ortis - University of Catania + +Click [here](https://github.com/Bi-Rabittoh/covid-data-analysis/blob/master/data-analysis.ipynb) to see the analysis.
@@ -16,8 +16,8 @@ "cell_type": "markdown",
"id": "8986d33e-a7a7-4949-9bdd-e84b6dc116ce", "metadata": {}, "source": [ - "## Step 0: libraries, functions and constants\n", - "First of all, let's install the single required library for this analysis.\n", + "## Step 0: libraries and constants\n", + "First of all, let's install the required libraries for this analysis.\n", "\n", "The `require()` function returns `FALSE` if the requested library is not installed, so we can install it only when necessary." ]@@ -95,7 +95,7 @@ "cell_type": "markdown",
"id": "674984b8-405b-4812-b82c-1a4cf072fc83", "metadata": {}, "source": [ - "## Step 1: loading and cleaning our data\n", + "## Step 1: Data Pre-Processing\n", "The dataset comes from [Kaggle](https://www.kaggle.com/datasets/meirnizri/covid19-dataset). Here's the provided description:\n", "\n", ">Coronavirus disease (COVID-19) is an infectious disease caused by a newly discovered coronavirus. Most people infected with COVID-19 virus will experience mild to moderate respiratory illness and recover without requiring special treatment. Older people, and those with underlying medical problems like cardiovascular disease, diabetes, chronic respiratory disease, and cancer are more likely to develop serious illness.\n",@@ -103,7 +103,7 @@ ">During the entire course of the pandemic, one of the main problems that healthcare providers have faced is the shortage of medical resources and a proper plan to efficiently distribute them. In these tough times, being able to predict what kind of resource an individual might require at the time of being tested positive or even before that will be of immense help to the authorities as they would be able to procure and arrange for the resources necessary to save the life of that patient.\n",
"\n", ">The main goal of this project is to build a machine learning model that, given a Covid-19 patient's current symptom, status, and medical history, will predict whether the patient is in high risk or not.\n", "\n", - ">The dataset was provided by the [Mexican government](https://datos.gob.mx/busca/dataset/informacion-referente-a-casos-covid-19-en-mexico). This dataset contains an enormous number of anonymized patient-related information including pre-conditions. The raw dataset consists of 21 unique features and 1,048,576 unique patients. In the Boolean features, 1 means \"yes\" and 2 means \"no\". values as 97 and 99 are missing data.\n", + ">The dataset was provided by the [Mexican government](https://datos.gob.mx/busca/dataset/informacion-referente-a-casos-covid-19-en-mexico). This dataset contains an enormous number of anonymized patient-related information including pre-conditions. The raw dataset consists of 21 unique features and 1,048,576 unique patients. In the Boolean features, 1 means \"yes\" and 2 means \"no\". Values such as 97 and 99 are missing data.\n", "\n", "| Variable | Description |\n", "|:---|:-------------|\n",@@ -301,7 +301,7 @@ "output_type": "display_data"
} ], "source": [ - "df = read.csv(\"Covid Data.csv\")\n", + "df = read.csv(\"res/Covid Data.csv\")\n", "\n", "row.n1 = nrow(df)\n", "print(paste(row.n1, \"rows loaded.\"))\n",@@ -314,7 +314,7 @@ "cell_type": "markdown",
"id": "b78530e5-b7dd-42de-b231-b4e6a7f461fc", "metadata": {}, "source": [ - "First of all, we want to focus our analysis on the patients that have tested positive for COVID, so we can just remove all rows which have values $\\geq 5$ on the `CLASIFFICATION_FINAL` column." + "First of all, we want to focus our analysis on the patients that have tested positive for COVID, so we can just remove all rows which have values $\\geq 4$ on the `CLASIFFICATION_FINAL` column." ] }, {@@ -669,7 +669,7 @@ "cell_type": "markdown",
"id": "46b3e1a3-8872-4ae2-b6fb-239787aca959", "metadata": {}, "source": [ - "## Step 2: Data visualization\n", + "## Step 2: Data Visualization\n", "Now that we have a clean dataset, we can plot some of our data.\n", "\n", ">Please be wary that from now on all visualizations are valid under the assumption that they have already been tested positive for COVID.\n",@@ -1052,7 +1052,7 @@ "metadata": {},
"source": [ "## Step 3: Unsupervised Learning\n", "### PCA\n", - "We can use the `prcomp` function to perform principal component analysis. However, it quickly becomes very slow for datasets with a big number of rows.\n", + "We can use the `prcomp()` function to perform principal component analysis. However, it quickly becomes very slow for datasets with a big number of rows.\n", "\n", "To solve this, we can use the `sample()` function to take a random subset of 10.000 elements." ]@@ -1078,7 +1078,7 @@ "cell_type": "markdown",
"id": "f15d499b-9226-4504-88d8-7880e57dd4cc", "metadata": {}, "source": [ - "Now we can use the `scale()` function to scale our subset, then `prcomp()` to make our PCA." + "Now we can use the `prcomp()` function to make our PCA." ] }, {@@ -1128,8 +1128,6 @@ "output_type": "display_data"
} ], "source": [ - "#df.scaled = scale(df.subset) # scale our subset\n", - "#pca = prcomp(df.scaled)\n", "pca = prcomp(df.subset, scale=TRUE)\n", "names(pca)" ]@@ -1190,7 +1188,7 @@ "cell_type": "markdown",
"id": "94b9b7bd-a8d3-4e71-a5e2-5a40bbb17bd4", "metadata": {}, "source": [ - "We can quickly visualize the first two principal components by creating a `biplot` with this function:" + "We can quickly visualize the first two principal components by creating a biplot with this function:" ] }, {@@ -2004,8 +2002,8 @@ "cell_type": "markdown",
"id": "cd3e1200-9e03-49b0-85d1-82bc3e359a62", "metadata": {}, "source": [ - "Finally it's time to fit our model. We can use the `glm` function, which is used to fit generalized linear models. We specify that we want to use the `binomial()` family, which uses the `logit` (sigmoid) link function.\n", - "![sigmoid graph](sigmoid.png \"Sigmoid\")" + "Finally it's time to fit our model. We can use the `glm` function, which is used to fit generalized linear models. We specify that we want to use the `binomial()` family, which uses the _logit_ (_sigmoid_) link function.\n", + "![sigmoid graph](res/sigmoid.png)" ] }, {@@ -2175,8 +2173,6 @@ "output_type": "display_data"
} ], "source": [ - "# The type=\"response\" option tells R to output probabilities of the\n", - "# form P(Y = 1|X), as opposed to other information such as the logit.\n", "predictions = predict(logistic_model, X_test, type=\"response\")\n", "head(round(predictions, 3))" ]
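The comment removed in the last hunk explained that `type="response"` makes `predict()` return probabilities of the form P(Y = 1|X) rather than the logit. A minimal sketch of that difference follows. It assumes a small synthetic dataset, and the names `toy`, `toy_model`, `log_odds` and `probs` are hypothetical and not taken from the notebook. It also shows that `binomial()` uses the logit link, whose inverse is the sigmoid (`plogis()` in base R).

```r
# Synthetic data for illustration only; this is NOT the COVID dataset.
set.seed(42)
n <- 200
toy <- data.frame(x = rnorm(n))
toy$y <- rbinom(n, size = 1, prob = plogis(-0.5 + 1.5 * toy$x))

# Fit a logistic regression; binomial() defaults to the logit link.
toy_model <- glm(y ~ x, data = toy, family = binomial())

# type = "link" returns the linear predictor (log-odds);
# type = "response" returns probabilities P(Y = 1 | X).
log_odds <- predict(toy_model, newdata = toy, type = "link")
probs <- predict(toy_model, newdata = toy, type = "response")

# The two scales are related by the sigmoid (inverse logit).
stopifnot(max(abs(probs - plogis(log_odds))) < 1e-8)
stopifnot(all(probs > 0 & probs < 1))

head(round(probs, 3))
```

If the explanatory comment is dropped from the code cell, a short note like this in the preceding markdown cell would keep the meaning of `type="response"` visible to readers.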