Data Sharks
This report starts from the point at which we had finished watching all the tutorials and acquired the knowledge necessary to begin the task we were assigned.
First of all, we needed data to check the veracity of the hypothesis. After searching for some time we landed on several trustworthy database organizations, but many of them offered an overwhelming amount of information. Given the limitations of our hardware, we concluded that not every indicator could be selected, so only around 30 of them would make it into the final analysis. Choosing the indicators was also difficult, because many of them had a lot of missing information, which forced us to inspect each table manually.
Once the selection of databases was completed, we tried importing each one through its URL in Python code; however, not all databases had a direct link. For this reason, downloading the .csv files and bundling them into a .zip was a simpler approach. This first battle with the data was a really difficult process: each file had a different column layout, many columns came with different dtypes, and some rows had missing or blank values. Each provider required its own treatment, so all the files were named following the same convention: 'NameOfTheProvider - Indicator Name'. A long time went into writing two large methods, 'preprocess' and 'rename_value_column', which read the name of each .csv and apply a special treatment depending on where the file was downloaded from.
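As a minimal sketch of that filename-driven cleaning step, assuming files named 'NameOfTheProvider - Indicator Name.csv'; the provider names and column renames below are illustrative assumptions, not the exact rules used in our 'preprocess' and 'rename_value_column' code:

```python
import os
import pandas as pd

def preprocess(path: str) -> pd.DataFrame:
    """Illustrative version of the provider-specific cleaning step."""
    filename = os.path.basename(path)
    provider, indicator = filename.removesuffix(".csv").split(" - ", 1)

    df = pd.read_csv(path)

    # Hypothetical provider names and renames; the real rules depend on
    # how each source formats its tables.
    if provider == "WorldBank":
        df = df.rename(columns={"Country Name": "country", "Year": "year"})
    elif provider == "OurWorldInData":
        df = df.rename(columns={"Entity": "country", "Year": "year"})

    df = rename_value_column(df, indicator)
    return df.dropna(subset=["country", "year"])

def rename_value_column(df: pd.DataFrame, indicator: str) -> pd.DataFrame:
    # Rename the remaining data column so every table ends up with the
    # same layout: country, year, <indicator>.
    value_col = [c for c in df.columns if c not in ("country", "year")][-1]
    return df.rename(columns={value_col: indicator})
```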
After a few issues, we managed to get the merge method working properly, resulting in a comprehensive dataset of around 70,000 rows and 34 columns, including those for the country and year.
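A minimal sketch of how such a merge can be done with pandas, assuming every preprocessed frame shares 'country' and 'year' columns plus one indicator column (the names are assumptions, not necessarily the ones in the repo):

```python
from functools import reduce
import pandas as pd

def merge_indicators(frames: list[pd.DataFrame]) -> pd.DataFrame:
    # Outer-merge every indicator table on country and year, so a
    # country-year pair is kept if it appears in at least one source.
    return reduce(
        lambda left, right: left.merge(right, on=["country", "year"], how="outer"),
        frames,
    )
```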
Once all the data was placed in one dataframe, we could refine the integration, count the missing values for each country and period of years, and narrow our study down to a sample with enough data.
We decided to ignore the entries older than 1990 and drop the countries that did not have more than 15 indicators to study. This way we reduced the size of our dataframe without compromising the quality of the data. At the same time, a model definition diagram has been created, although it still needs some improvements because the Python code is still in progress. A Poetry file also remains to be set up during the upcoming week.
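As an illustration of the year and indicator filtering described above, assuming the merged dataframe is called df (the cutoffs come from the text; the column names are assumptions):

```python
# Drop entries older than 1990.
df = df[df["year"] >= 1990]

# For each country, count how many indicator columns contain at least one value.
indicator_cols = [c for c in df.columns if c not in ("country", "year")]
indicators_per_country = df.groupby("country")[indicator_cols].count().gt(0).sum(axis=1)

# Keep only the countries with more than 15 usable indicators.
countries_to_keep = indicators_per_country[indicators_per_country > 15].index
df = df[df["country"].isin(countries_to_keep)]
```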
This first week has been a really big challenge in terms of normalizing the data; however, we think huge progress has been made, and we have learned a lot of code and about the best way to work as a team.