In this post we are going to see how
to apply EDA in live problems. For that we taken the coronavirus data
set from kaggle and going to apply EDA . This post will help the
beginners to learn how to apply EDA in data science field.
What is Coronavirus
2019 Novel Coronavirus (2019-nCoV) is a virus (more specifically, a
coronavirus) identified as the cause of an outbreak of respiratory
illness first detected in Wuhan, China. Early on, many of the patients
in the outbreak in Wuhan, China reportedly had some link to a large
seafood and animal market, suggesting animal-to-person spread. However, a
growing number of patients reportedly have not had exposure to animal
markets, indicating person-to-person spread is occurring. At this time,
it’s unclear how easily or sustainably this virus is spreading between
people - CDC
This dataset has daily level information on the number of affected cases, deaths and recovery from 2019 novel coronavirus.
The data is available from 22 Jan 2020.
Define the Problem
Coronaviruses are a large family of viruses that are common in many
different species of animals, including camels, cattle, cats, and bats.
Rarely, animal coronaviruses can infect people and then spread between
people such as with MERS, SARS, and now with 2019-nCoV.
Outbreaks of novel virus infections among people are always of public
health concern. The risk from these outbreaks depends on characteristics
of the virus, including whether and how well it spreads between people,
the severity of resulting illness, and the medical or other measures
available to control the impact of the virus (for example, vaccine or
treatment medications).
This is a very serious public health threat. The fact that this virus
has caused severe illness and sustained person-to-person spread in China
is concerning, but it’s unclear how the situation in the United States
will unfold at this time.
The risk to individuals is dependent on exposure. At this time, some
people will have an increased risk of infection, for example healthcare
workers caring for 2019-nCoV patients and other close contacts. For the
general American public, who are unlikely to be exposed to this virus,
the immediate health risk from 2019-nCoV is considered low. The goal of
the ongoing U.S. public health response is to prevent sustained spread
of 2019-nCov in this country.
Precautions
Health authorities and scientists say the same precautions against other
viral illnesses can be used: wash your hands frequently, cover up your
coughs, try not to touch your face. And anyone who does come down with
the virus should be placed in isolation. "Considering that substantial
numbers of patients with SARS and MERS were infected in health-care
settings", precautions need to be taken to prevent that happening again,
the Chinese team warned in The Lancet.
Coronavirus Exploratory Data Analysis
We can explore the analysis of the corona virus affected stats
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
1
2
| nCov_df = pd.read_csv('2019_nCoV_data.csv')
nCov_df.columns
|
The output is
Index(['Sno', 'Date', 'Province/State', 'Country', 'Last Update', 'Confirmed',
'Deaths', 'Recovered'],
dtype='object')
Column Description
- Sno - Serial number
- Date - Date and time of the observation in MM/DD/YYYY HH:MM:SS
- Province / State - Province or state of the observation (Could be empty when missing)
- Country - Country of observation
- Last Update - Time in UTC at which the row is updated for the given
province or country. (Not standardised currently. So please clean them
before using it)
- Confirmed - Number of confirmed cases
- Deaths - Number of deaths
- Recovered - Number of recovered cases
The sample data are given below,
the output is given below,
Sno Date Province/State Country Last Update Confirmed Deaths Recovered
0 1 01/22/2020 12:00:00 Anhui China 01/22/2020 12:00:00 1.0 0.0 0.0
1 2 01/22/2020 12:00:00 Beijing China 01/22/2020 12:00:00 14.0 0.0 0.0
2 3 01/22/2020 12:00:00 Chongqing China 01/22/2020 12:00:00 6.0 0.0 0.0
3 4 01/22/2020 12:00:00 Fujian China 01/22/2020 12:00:00 1.0 0.0 0.0
4 5 01/22/2020 12:00:00 Gansu China 01/22/2020 12:00:00 0.0 0.0 0.0
nCov_df.info()
the output is,
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1199 entries, 0 to 1198
Data columns (total 8 columns):
Sno 1199 non-null int64
Date 1199 non-null object
Province/State 888 non-null object
Country 1199 non-null object
Last Update 1199 non-null object
Confirmed 1199 non-null float64
Deaths 1199 non-null float64
Recovered 1199 non-null float64
dtypes: float64(3), int64(1), object(4)
memory usage: 75.0+ KB
Based on the above information ,The Province/State having some missing values
nCov_df.describe()
the output is,
Sno Confirmed Deaths Recovered
count 1199.000000 1199.000000 1199.000000 1199.000000
mean 600.000000 276.213511 5.961635 14.617181
std 346.265794 1966.264622 58.082724 103.959136
min 1.000000 0.000000 0.000000 0.000000
25% 300.500000 2.000000 0.000000 0.000000
50% 600.000000 10.000000 0.000000 0.000000
75% 899.500000 82.000000 0.000000 2.000000
max 1199.000000 31728.000000 974.000000 2222.000000
nCov_df[['Confirmed', 'Deaths', 'Recovered']].sum().plot(kind='bar')
The out put of the above line is ,
Observations
- The data set is contains many countries like China, Japan, US, India and so on.
- The comparison of confirmed with Recovered, It clearly states that the recovery action from virus is dead slow.
- The data clearly indicating the spread of virus
Data Clean up
Removing the unwanted columns from the data
nCov_df.drop(['Sno', 'Last Update'], axis=1, inplace=True)
nCov_df.columns
The result is
Index(['Date', 'Province/State', 'Country', 'Confirmed', 'Deaths',
'Recovered'],
dtype='object')
Converted the date data type object into datetime
nCov_df['Date'] = nCov_df['Date'].apply(pd.to_datetime)
nCov_df['Date'].head()
The result is
0 2020-01-22 12:00:00
1 2020-01-22 12:00:00
2 2020-01-22 12:00:00
3 2020-01-22 12:00:00
4 2020-01-22 12:00:00
Name: Date, dtype: datetime64[ns]
Replacing the wrongly mapped country value towards states
nCov_df[nCov_df['Province/State'] == 'Taiwan']['Country'] = 'Taiwan'
nCov_df[nCov_df['Province/State'] == 'Hong Kong']['Country'] = 'Hong Kong'
nCov_df.replace({'Country': 'Mainland China'}, 'China', inplace=True)
Listing all the countries which is affected with corona virus
nCov_df['Country'].unique()
The result is
array(['China', 'US', 'Japan', 'Thailand', 'South Korea',
'Mainland China', 'Hong Kong', 'Macau', 'Taiwan', 'Singapore',
'Philippines', 'Malaysia', 'Vietnam', 'Australia', 'Mexico',
'Brazil', 'France', 'Nepal', 'Canada', 'Cambodia', 'Sri Lanka',
'Ivory Coast', 'Germany', 'Finland', 'United Arab Emirates',
'India', 'Italy', 'Sweden', 'Russia', 'Spain', 'UK', 'Belgium',
'Others'], dtype=object)
Country based virus affected people information
nCov_df.groupby(['Country']).Confirmed.count().reset_index().sort_values(['Country'], ascending = True)
The output is,
Country Confirmed
0 Australia 56
1 Belgium 7
2 Brazil 1
3 Cambodia 15
4 Canada 38
5 China 618
6 Finland 13
7 France 18
8 Germany 15
9 Hong Kong 19
10 India 12
11 Italy 12
12 Ivory Coast 1
13 Japan 20
14 Macau 19
15 Malaysia 18
16 Mexico 1
17 Nepal 17
18 Others 4
19 Philippines 13
20 Russia 11
21 Singapore 19
22 South Korea 20
23 Spain 11
24 Sri Lanka 15
25 Sweden 11
26 Taiwan 19
27 Thailand 20
28 UK 11
29 US 113
30 United Arab Emirates 13
31 Vietnam 19
Top most Severely affected countries
nCov_df.groupby(['Country']).Confirmed.count().reset_index().sort_values(['Confirmed'], ascending=False).head(10)
The output is,
Country Confirmed
5 China 618
29 US 113
0 Australia 56
4 Canada 38
13 Japan 20
27 Thailand 20
22 South Korea 20
26 Taiwan 19
21 Singapore 19
14 Macau 19
List all the Provinces/States that were affected with Virus
nCov_df['Province/State'].unique()
The output is
array(['Anhui', 'Beijing', 'Chongqing', 'Fujian', 'Gansu', 'Guangdong',
'Guangxi', 'Guizhou', 'Hainan', 'Hebei', 'Heilongjiang', 'Henan',
'Hong Kong', 'Hubei', 'Hunan', 'Inner Mongolia', 'Jiangsu',
'Jiangxi', 'Jilin', 'Liaoning', 'Macau', 'Ningxia', 'Qinghai',
'Shaanxi', 'Shandong', 'Shanghai', 'Shanxi', 'Sichuan', 'Taiwan',
'Tianjin', 'Tibet', 'Washington', 'Xinjiang', 'Yunnan', 'Zhejiang',
nan, 'Chicago', 'Illinois', 'California', 'Arizona', 'Ontario',
'New South Wales', 'Victoria', 'Bavaria', 'British Columbia',
'Queensland', 'Chicago, IL', 'South Australia', 'Boston, MA',
'Los Angeles, CA', 'Orange, CA', 'Santa Clara, CA', 'Seattle, WA',
'Tempe, AZ', 'Toronto, ON', 'San Benito, CA', 'London, ON',
'Madison, WI', 'Cruise Ship', 'Diamond Princess cruise ship'],
dtype=object)
Impact in india
nCov_df[nCov_df.Country == 'India']
The output is,
Date Province/State Country Confirmed Deaths Recovered
432 2020-01-30 21:30:00 NaN India 1.0 0.0 0.0
491 2020-01-31 19:00:00 NaN India 1.0 0.0 0.0
552 2020-02-01 23:00:00 NaN India 1.0 0.0 0.0
611 2020-02-02 21:00:00 NaN India 2.0 0.0 0.0
675 2020-02-03 21:40:00 NaN India 3.0 0.0 0.0
745 2020-02-04 22:00:00 NaN India 3.0 0.0 0.0
815 2020-02-05 12:20:00 NaN India 3.0 0.0 0.0
885 2020-02-06 20:05:00 NaN India 3.0 0.0 0.0
958 2020-02-07 20:24:00 NaN India 3.0 0.0 0.0
1030 2020-02-08 23:04:00 NaN India 3.0 0.0 0.0
1102 2020-02-09 23:20:00 NaN India 3.0 0.0 0.0
1175 2020-02-10 19:30:00 NaN India 3.0 0.0 0.0
Country most affected
nCov_df.groupby(['Country']).Confirmed.max().reset_index().sort_values(['Confirmed'], ascending=False).head(20).plot(x='Country',
kind='bar', figsize=(12,6))
Country most recovered
nCov_df.groupby(['Country']).Recovered.max().reset_index().sort_values(['Recovered'], ascending=False).head(20).plot(x='Country',
kind='bar', figsize=(12,6))
Country faced more deaths over the world
nCov_df.groupby(['Country']).Deaths.max().reset_index().sort_values(['Deaths'], ascending=False).head(20).plot(x='Country',
kind='bar', figsize=(12,6))
Recovery vs Deaths in world wide
nCov_df[['Country', 'Deaths', 'Recovered']].groupby('Country').max().plot(kind='bar', figsize=(12, 7))
Recovery vs Deaths in world wide other than China
nCov_df[nCov_df['Country'] != 'China'][['Country', 'Deaths', 'Recovered']].groupby('Country').max().plot(kind='bar', figsize=(12, 7))
Philippines clearly show that the no recovered happen
nCov_df[nCov_df['Country'] == 'Philippines'][['Country', 'Confirmed', 'Deaths', 'Recovered']].groupby('Country').max().plot(kind='bar')
When did Virus Confirmed initially?
nCov_df['Date'].min()
The result is,
Timestamp('2020-01-22 12:00:00')
When was the Virus Confirmed recently?
nCov_df['Date'].max()
The output is,
Timestamp('2020-02-10 19:30:00')
How many total no.of persons were identified with Virus on each day
nCov_df.groupby('Date')[['Confirmed', 'Deaths', 'Recovered']].max().reset_index()
The output is,
Date Confirmed Deaths Recovered
0 2020-01-22 12:00:00 444.0 0.0 0.0
1 2020-01-23 12:00:00 444.0 17.0 28.0
2 2020-01-24 12:00:00 549.0 24.0 31.0
3 2020-01-25 22:00:00 1052.0 52.0 42.0
4 2020-01-26 23:00:00 1423.0 76.0 44.0
5 2020-01-27 20:30:00 2714.0 100.0 47.0
6 2020-01-28 23:00:00 3554.0 125.0 80.0
7 2020-01-29 21:00:00 4586.0 162.0 90.0
8 2020-01-30 21:30:00 5806.0 204.0 116.0
9 2020-01-31 19:00:00 7153.0 249.0 169.0
10 2020-02-01 23:00:00 9074.0 294.0 215.0
11 2020-02-02 21:00:00 11177.0 350.0 295.0
12 2020-02-03 21:40:00 13522.0 414.0 396.0
13 2020-02-04 22:00:00 16678.0 479.0 522.0
14 2020-02-05 12:20:00 16678.0 479.0 538.0
15 2020-02-06 20:05:00 22112.0 618.0 817.0
16 2020-02-07 20:24:00 22112.0 618.0 867.0
17 2020-02-08 23:04:00 27100.0 780.0 1440.0
18 2020-02-09 23:20:00 29631.0 871.0 1795.0
19 2020-02-10 19:30:00 31728.0 974.0 2222.0
Case confirmed for each countries
nCov_df.groupby(['Country']).Confirmed.max().reset_index().plot(x='Country', kind='bar', figsize=(10,6))
Case confirmed other than China
nCov_df[nCov_df['Country'] != 'China'].groupby(['Country']).Confirmed.max().reset_index().plot(x='Country', kind='bar', figsize=(10,6))
The virus spreadness over the confirmed, Deaths and Recovered in globally
nCov_df.groupby('Date')[['Confirmed', 'Deaths', 'Recovered']].max().reset_index().plot(x='Date',
y=['Confirmed', 'Deaths', 'Recovered'],
figsize=(12, 7))
Spreadness of virus , Deaths and recovery data other than China
List the States in China which were affected
nCov_df[nCov_df['Country'] == 'China'].groupby('Province/State')[['Confirmed']].count().reset_index().plot(x='Province/State',
y=['Confirmed'],kind='bar',
figsize=(12, 7))
nCov_df[nCov_df.Country == 'China'][['Province/State', 'Deaths', 'Recovered']].groupby('Province/State').max().plot(kind='bar',
figsize=(12, 7))
Countries those have worst recovery services
nCov_df[nCov_df['Recovered'] < nCov_df['Deaths']][['Country', 'Confirmed', 'Deaths', 'Recovered']].groupby('Country').max().plot(kind='bar')
Countries death rate high and 0 recovery rate
nCov_df[(nCov_df['Recovered'] < nCov_df['Deaths'])&(nCov_df['Country'] != 'China')][['Country', 'Confirmed', 'Deaths', 'Recovered']].groupby('Country').max().plot(kind='bar',
figsize=(12,7))
In those countries are having the more infacted people along with hign deaths. There is no recovery happened.
Very slow recovery in china
nCov_df[(nCov_df['Country'] == 'China') & (nCov_df['Recovered'] == 0 )&( nCov_df['Deaths'] != 0)][['Province/State', 'Confirmed', 'Deaths', 'Recovered']].groupby('Province/State').max().plot(kind='bar',
figsize=(12, 7))
From all the above information the recovery rate is very slow and the
death rate is increasing day by day. Many of states are strugling to
recover their own citizen. The goodness is India protecting their people
well compared with china's other neighbours.