People who come within 2 m of someone infected with Covid-19 have a high probability of testing positive. The infected person is called F0, and their close contacts are called F1.
To group people who are in close contact, the DBSCAN algorithm combined with the Haversine distance formula (which calculates the distance between two points on a sphere) is a suitable choice.
There are two datasets required:
- The dataset of coordinates of people in Vietnam.
- The dataset of the geometry of the provinces in Vietnam.
The coordinates are randomly generated within the bounds of Vietnam. This has already been done, so all that remains is to read the saved CSV file.
The GeoJSON file of Vietnam can be found at this link.
Here are the libraries required to run the code:
import numpy as np
import pandas as pd
import geopandas as gpd
from sklearn.cluster import DBSCAN
If there is already a saved CSV file, read it. Otherwise, generate a new one.
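A minimal sketch of this read-or-generate step is shown below. The file path, the function name load_or_generate, and the bounding box values are assumptions made for illustration; the box is only a rough rectangle around Vietnam, so some generated points will fall in the sea or in neighbouring countries.

```python
from pathlib import Path

import numpy as np
import pandas as pd

# Hypothetical path and approximate bounding box of Vietnam (assumptions).
CSV_PATH = Path("./data/coordinates.csv")
LON_MIN, LON_MAX = 102.14, 109.47
LAT_MIN, LAT_MAX = 8.18, 23.40


def load_or_generate(path=CSV_PATH, n=10_000_000, seed=None):
    """Read the saved coordinates if the file exists, otherwise generate and save them."""
    path = Path(path)
    if path.exists():
        return pd.read_csv(path)

    rng = np.random.default_rng(seed)
    df = pd.DataFrame({
        "lo": rng.uniform(LON_MIN, LON_MAX, size=n),  # longitude in degrees
        "la": rng.uniform(LAT_MIN, LAT_MAX, size=n),  # latitude in degrees
    })
    path.parent.mkdir(parents=True, exist_ok=True)
    df.to_csv(path, index=False)
    return df
```

Calling df = load_or_generate() then gives a DataFrame with the lo and la columns used in the next step.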
Then rename column lo to longitude and la to latitude
df = df.rename(columns={'lo': 'longitude', 'la': 'latitude'})
Assign an id to each row, using the index:
df['id'] = df.index
The 'infected' column is randomly generated with a rate of 1% positive and 99% negative out of 10,000,000 people:
df['infected'] = np.random.choice([True, False],
                                  size=10000000,
                                  p=[0.01, 0.99])
The f1 column initially has the value False:
df['f1'] = False
The geometry dataset is a GeoJSON file.
Use geopandas to read it:
df_geo = gpd.read_file('./data/diaphantinh.geojson')
The input X for DBSCAN is the latitude and longitude, converted to radians. scikit-learn's haversine metric expects each row in the order [latitude, longitude]:
X = df[['latitude', 'longitude']].to_numpy()
X = np.radians(X)
Initiate the DBSCAN algorithm with:
- Metric: Haversine distance
- Epsilon:
  - The distance to determine close contact is 2 meters (0.002 km)
  - Earth's radius is 6371 km
  - So the epsilon, expressed in radians, is 0.002/6371 ≈ 0.000000314
- Min_samples: As long as 2 people come into contact with each other, it will be considered a cluster
- Algorithm: ball_tree
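Because the haversine metric works with angles in radians, the contact distance has to be expressed in kilometres before it is divided by Earth's radius. The sketch below checks that conversion with a hand-written haversine function; the function name and the sample coordinates are arbitrary choices for illustration, not part of the original code.

```python
import math

EARTH_RADIUS_KM = 6371.0
CONTACT_DISTANCE_KM = 0.002  # 2 metres

# Angular distance (in radians) that corresponds to 2 m on the Earth's surface.
eps = CONTACT_DISTANCE_KM / EARTH_RADIUS_KM


def haversine_rad(lat1, lon1, lat2, lon2):
    """Great-circle distance in radians between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * math.asin(math.sqrt(a))


# Two arbitrary points about 1.1 m apart (0.00001 degrees of latitude).
near = haversine_rad(21.0285, 105.8542, 21.02851, 105.8542)
# Two arbitrary points about 5.6 m apart (0.00005 degrees of latitude).
far = haversine_rad(21.0285, 105.8542, 21.02855, 105.8542)

print(near <= eps)  # the first pair counts as close contact
print(far <= eps)   # the second pair does not
```

Multiplying either result by EARTH_RADIUS_KM converts it back to kilometres.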
dbscan = DBSCAN(metric='haversine',
                eps=0.002/6371,  # 2 m expressed in radians
                min_samples=2,
                algorithm='ball_tree')
Fit the model to get a cluster label for each person (-1 means the person is not in any cluster):
clusters = dbscan.fit_predict(X)
Create a new DataFrame consisting of the original DataFrame combined with the resulting clusters column:
df_result = df.assign(cluster=clusters)
This DataFrame only needs to keep the people who are F0 and those who are close to each other:
df_result = df_result[(df_result['cluster'] != -1) |
                      (df_result['infected'] == True)]
People in close contact with someone infected with Covid-19 are labeled as F1:
for index, row in df_result[(df_result['infected'] == True)
                            & (df_result['cluster'] >= 0)].iterrows():
    same_id = retrieve_same_cluster(df_result,
                                    row.id,
                                    exclusive=True).id.values
    df_result.loc[same_id, 'f1'] = True
Draw a map of people in close contact with each other
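For reference, the F1 labeling loop calls retrieve_same_cluster, whose definition is not shown. The sketch below is one plausible version, inferred only from how it is called, so its behaviour is an assumption. The second function, label_f1, is a hypothetical vectorised alternative that should give the same f1 column without iterating over rows.

```python
import pandas as pd


def retrieve_same_cluster(df, person_id, exclusive=False):
    """Return the rows sharing a cluster with person_id (assumed behaviour)."""
    cluster = df.loc[df["id"] == person_id, "cluster"].iloc[0]
    if cluster == -1:  # noise points have no cluster mates
        return df.iloc[0:0]
    same = df[df["cluster"] == cluster]
    if exclusive:  # leave out the person themselves
        same = same[same["id"] != person_id]
    return same


def label_f1(df):
    """Mark everyone who shares a cluster with at least one other infected person."""
    in_cluster = df["cluster"] >= 0
    # Number of infected people in each real cluster.
    infected_per_cluster = df[in_cluster].groupby("cluster")["infected"].sum()
    n_infected = df["cluster"].map(infected_per_cluster).fillna(0)
    # Infected people in the same cluster, not counting the person themselves.
    others_infected = n_infected - df["infected"].astype(int)
    return df.assign(f1=in_cluster & (others_infected >= 1))
```

As in the loop, an infected person is also marked F1 when their cluster contains another infected person.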
Demo a cluster of people in close contact with each other
Draw a map according to the number of people infected with Covid in each province
In addition, a column chart shows the number of infected people by province, sorted from highest to lowest
Similarly, draw two graphs for the people labeled F1
Finally, a grouped bar chart shows the total number of infected people and their contacts
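The per-province numbers behind these charts can be sketched with a pandas groupby. This assumes each person already has a province column, for example from a spatial join between the points and the province geometries; that column name and the function name count_by_province are hypothetical and do not appear in the code above.

```python
import pandas as pd


def count_by_province(df, province_col="province"):
    """Count infected (F0) and F1 people per province, highest infected count first."""
    return (df.groupby(province_col)[["infected", "f1"]]
              .sum()            # True counts as 1, False as 0
              .astype(int)
              .sort_values("infected", ascending=False))
```

With matplotlib installed, count_by_province(df_result).plot.bar() would draw the two counts side by side for each province.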






