Hello, data enthusiasts! Welcome back to our Data Science blog. Today, we’ll be working with a COVID-19 dataset from Kaggle (https://www.kaggle.com/imdevskp/corona-virus-report), which contains daily confirmed, death, recovered, and active case counts by country and region. Our goal is to download, clean, and prepare this data for modeling using a Jupyter Notebook. Let’s get started!
Setting up the Kaggle API and downloading the COVID-19 dataset
To download the dataset directly within a Jupyter Notebook, we’ll use the Kaggle API. First, install the Kaggle API package if you haven’t already:
!pip install kaggle
Next, you need to upload your Kaggle API credentials (kaggle.json) to the Jupyter Notebook. To obtain the API credentials, follow these steps:
- Log in to your Kaggle account.
- Click on your profile picture in the top-right corner, and then click “Account.”
- Scroll down to the “API” section, and click “Create New API Token.” This will download a file called “kaggle.json.”
Upload the “kaggle.json” file to your Jupyter Notebook environment.
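If you’re working in a Unix-like notebook environment (an assumption; adjust the commands for your OS), a common way to register the token is to copy it into the default ~/.kaggle directory and restrict its permissions, so the Kaggle API stops warning about a world-readable credential:
!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json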
Now, you can download the COVID-19 dataset using the Kaggle API:
import os
# If kaggle.json lives somewhere other than ~/.kaggle, point the API at its directory
os.environ['KAGGLE_CONFIG_DIR'] = '/path/to/your/kaggle.json/directory/'
# Download the dataset
!kaggle datasets download -d imdevskp/corona-virus-report
This will download a compressed file containing the dataset in CSV format. Unzip the file:
!unzip corona-virus-report.zip
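If the unzip utility isn’t installed in your environment, the Python standard library can do the same job; here’s a minimal sketch using zipfile:
import zipfile

# Extract the downloaded archive into the current working directory
with zipfile.ZipFile('corona-virus-report.zip', 'r') as archive:
    archive.extractall('.')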
Loading and exploring the data
Now that we have the dataset, let’s load it into Python and explore its contents. We’ll use the popular pandas library to load and manipulate the data:
import pandas as pd
# Load the dataset
data = pd.read_csv('covid_19_clean_complete.csv')
# Check the first few rows of the dataset
print(data.head())
# Get general information about the dataset
print(data.info())
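Beyond head() and info(), a few quick checks help you get a feel for the data’s shape and coverage (the column name below assumes the covid_19_clean_complete.csv schema):
# Summary statistics for the numeric columns
print(data.describe())

# How many rows and columns, and how many distinct countries are covered?
print(data.shape)
print(data['Country/Region'].nunique())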
Cleaning the data
Before using the data for modeling, we need to clean and preprocess it. Here are some common data-cleaning steps:
- Handle missing values: Check for missing values and decide how to handle them. You can fill them with a default value, interpolate, or drop rows/columns with missing values.
# Check for missing values
print(data.isnull().sum())
# Fill missing values with a default value or interpolate
data.fillna(value=0, inplace=True) # or data.interpolate()
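Note that a blanket fillna(0) also replaces missing Province/State strings with zeros. If you’d rather treat text and numeric columns differently, a more targeted sketch (assuming this dataset’s usual column names) could be:
# Fill missing province names with an empty string
data['Province/State'] = data['Province/State'].fillna('')

# Fill missing case counts with 0
for col in ['Confirmed', 'Deaths', 'Recovered', 'Active']:
    data[col] = data[col].fillna(0)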
- Convert data types: Ensure that each column has the correct data type.
# Convert the 'Date' column to a datetime object
data['Date'] = pd.to_datetime(data['Date'])
- Create new features: Extract or create new features that may be useful for modeling.
# Extract the month and year from the 'Date' column
data['Month'] = data['Date'].dt.month
data['Year'] = data['Date'].dt.year
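Because Confirmed, Deaths, and Recovered are cumulative totals, another useful derived feature is the number of new cases reported each day. A sketch (assuming one row per location per date, with missing values already filled as above) might look like:
# Sort so that diff() compares consecutive dates within each location
data = data.sort_values(['Country/Region', 'Province/State', 'Date'])

# Daily new confirmed cases = day-over-day change in the cumulative total
data['NewConfirmed'] = (
    data.groupby(['Country/Region', 'Province/State'])['Confirmed']
        .diff()
        .fillna(0)
)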
- Drop unnecessary columns: Remove columns that are not relevant for modeling or may cause data leakage.
# Drop unnecessary columns (if applicable)
data.drop(columns=['Lat', 'Long'], inplace=True)
Preparing the data for modeling
Once the data is clean, you may need to perform additional preprocessing steps to prepare it for modeling, such as:
- Encoding categorical variables: Convert categorical variables into a numerical format using techniques like one-hot encoding or label encoding.
- Scaling and normalization: Scale numerical features so they have similar ranges, improving model performance.
- Splitting the data: Divide the data into training and testing sets to evaluate your model’s performance.
Here’s an example of how to perform these steps using the pandas, scikit-learn, and category_encoders libraries:
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
import category_encoders as ce
# Encode categorical variables
encoder = ce.OneHotEncoder(cols=['Country/Region', 'Province/State'])
data_encoded = encoder.fit_transform(data)
# Scale numerical features
scaler = StandardScaler()
data_encoded[['Confirmed', 'Deaths', 'Recovered', 'Active']] = scaler.fit_transform(data_encoded[['Confirmed', 'Deaths', 'Recovered', 'Active']])
# Split the data into training and testing sets
X = data_encoded.drop(columns=['Confirmed', 'Date']) # Features (excluding target and date column)
y = data_encoded['Confirmed'] # Target variable (confirmed cases)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
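As a quick sanity check, confirm the split sizes line up with test_size=0.2:
# Roughly 80% of the rows should land in the training set
print(X_train.shape, X_test.shape)
print(y_train.shape, y_test.shape)
One caveat worth flagging: the scaler above was fit on the full dataset before splitting, which leaks information from the test rows into the training features. In a stricter workflow you would split first, fit the scaler on X_train only, and then apply that same fitted transform to X_test.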
In this blog post, we walked through the process of downloading, cleaning, and preparing a COVID-19 dataset from Kaggle for modeling. We covered essential data cleaning and preprocessing techniques, such as handling missing values, converting data types, creating new features, and preparing the data for modeling with encoding, scaling, and splitting.
With this clean and prepared data, you're now ready to start building and evaluating machine learning models to gain insights and make predictions related to the COVID-19 pandemic. Keep in mind that the specific steps and techniques you use for data cleaning and preprocessing may vary depending on the dataset and the problem you're trying to solve. The key is to be thorough and thoughtful throughout the process to ensure that your data is clean, relevant, and well-suited for modeling. Happy data cleaning!