Data Sharks
This report starts from the point at which we had finished watching all the tutorials and acquired the knowledge needed to begin the task we were assigned.
First of all, data was needed to check the veracity of the hypothesis. After searching for some time we found several trustworthy database organizations, but many of them offered an overwhelming amount of information. Due to hardware limitations, we concluded that not all indicators could be selected, so only around 30 of them would make it into the final analysis. Moreover, choosing the indicators was difficult because many of them had a lot of missing information, which forced us to inspect each table manually.
Once the selection of databases was completed, we tried importing each database directly from its URL through Python code; however, not all databases had a direct link. For this reason, downloading the .csv files and bundling them into a .zip was a simpler approach. This first battle with the data was a really difficult process: each file had a different column format, many columns had different dtypes, and some rows had missing or blank values. A specific treatment was needed for each database provider, so all the files were named following the same convention: 'NameOfTheProvider - Indicator Name'. A long time went into writing two large methods, 'preprocess' and 'rename_value_column', which read the name of each .csv and apply a special treatment depending on where the file was downloaded from.
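As an illustration of the idea (not our exact code; the provider names and column fixes below are hypothetical), a minimal sketch of such a filename-driven dispatch could look like this:

```python
import pandas as pd

def preprocess(path: str) -> pd.DataFrame:
    """Read a CSV named 'Provider - Indicator.csv' and apply provider-specific cleaning."""
    filename = path.rsplit("/", 1)[-1].removesuffix(".csv")
    provider, indicator = filename.split(" - ", 1)
    df = pd.read_csv(path)

    # Hypothetical provider-specific column fixes; the real code handles each source it uses.
    if provider == "WorldBank":
        df = df.rename(columns={"Country Name": "country", "Year": "year"})
    elif provider == "OurWorldInData":
        df = df.rename(columns={"Entity": "country", "Year": "year"})

    df = rename_value_column(df, indicator)
    df["year"] = pd.to_numeric(df["year"], errors="coerce")
    return df.dropna(subset=["country", "year"])

def rename_value_column(df: pd.DataFrame, indicator: str) -> pd.DataFrame:
    """Name the remaining value column after the indicator it measures."""
    value_col = [c for c in df.columns if c not in ("country", "year")][-1]
    return df.rename(columns={value_col: indicator})
```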
After a few issues, we managed to make the merge method work properly, resulting in a comprehensive database of around 70,000 rows and 34 columns, including those for the country and year.
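A merge of this kind can be sketched as a chain of outer joins on country and year, so that no observation from any indicator is lost (a simplified illustration of the approach, not our exact implementation):

```python
from functools import reduce
import pandas as pd

def merge_indicators(frames: list[pd.DataFrame]) -> pd.DataFrame:
    """Outer-join every preprocessed indicator table on (country, year)."""
    return reduce(
        lambda left, right: pd.merge(left, right, on=["country", "year"], how="outer"),
        frames,
    )
```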
Once all of the data was placed into one dataframe, we could refine the integration, identify the total number of missing values for each country or period of years, and narrow our study to a sample with enough data.
In a first approach, we decided to ignore entries older than 1990 and drop the countries that had 15 or fewer indicators to study. This way, we reduced the size of our dataframe without compromising the quality of the data. This trimming process will be refined in the near future, as we finish the main part of the code and can focus more on the subject of study.
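Roughly, this trimming could look like the sketch below (the thresholds and column names are illustrative, assuming the merged dataframe described above):

```python
import pandas as pd

def trim(df: pd.DataFrame, min_year: int = 1990, min_indicators: int = 16) -> pd.DataFrame:
    """Keep only recent entries and countries that report enough indicators."""
    df = df[df["year"] >= min_year]
    indicator_cols = [c for c in df.columns if c not in ("country", "year")]
    # Count, per country, how many indicator columns have at least one value.
    counts = df.groupby("country")[indicator_cols].apply(lambda g: g.notna().any().sum())
    keep = counts[counts >= min_indicators].index
    return df[df["country"].isin(keep)]
```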
At the same time, a model definition diagram was created; it still needs some improvements because the Python code is still in progress. The structure and size of the code made Jupyter Notebook no longer suitable for development, so we migrated our code to Visual Studio and split methods, functions and classes into different modules.
This change allowed for the creation of a Poetry file and more agile sharing via GitHub, but required considerable effort to adapt the code to the new environment. We noticed that the code could be further restructured to make it more readable and tidy in this new environment, but that had to be postponed; instead, we focused on making the code work properly again.
Finally, we spent some time planning the next step in the treatment of the data. The null values are the first sensible target, and we decided to fill the blanks with values estimated through interpolation between the closest non-null values.
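With pandas, this per-country interpolation could be sketched as follows (a simplified illustration of the approach, not our exact code):

```python
import pandas as pd

def fill_gaps(df: pd.DataFrame) -> pd.DataFrame:
    """Interpolate missing indicator values between the closest non-null years, per country."""
    indicator_cols = [c for c in df.columns if c not in ("country", "year")]
    df = df.sort_values(["country", "year"]).copy()
    df[indicator_cols] = (
        df.groupby("country")[indicator_cols]
          .transform(lambda s: s.interpolate(limit_area="inside"))
    )
    return df
```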
The second stage involved dealing with outliers. First of all, we had to identify which values were just anomalies due to explainable circumstances and which were true outliers. This process was done manually at first, and later with the help of a primitive algorithm that roughly estimated whether the values were part of a trend. Those that weren't had to be treated, and we decided to apply the same criterion as with the null values and use interpolation methods.
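One simple way to flag values that break the local trend is a rolling-median rule like the one below (an illustrative sketch; the window and threshold are assumptions, not the exact algorithm we used):

```python
import numpy as np
import pandas as pd

def fix_outliers(series: pd.Series, window: int = 5, threshold: float = 3.0) -> pd.Series:
    """Mask points far from the local rolling median and interpolate over them."""
    median = series.rolling(window, center=True, min_periods=1).median()
    deviation = (series - median).abs()
    scale = deviation.rolling(window, center=True, min_periods=1).median().replace(0, np.nan)
    outliers = deviation > threshold * scale   # comparisons with NaN are False, so flat series pass
    return series.mask(outliers).interpolate(limit_area="inside")
```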
This second week has been a really big challenge in terms of normalizing the data and structuring the code properly; however, we think huge progress was made, and we learned a lot about coding and about the best way to work as a team.
This week's main focus was to tidy up the code, design useful test cases and rearrange the code so it is more readable and easier to maintain. We also started working with thresholds for the countries and indicators, and we defined a year range of 1990 to 2020 at first, and later of 2000 to 2020. We then realized that we might need more sources of information, as we had to cut a few indicators for which there was not enough data available within the scope of our analysis.
During the restructuring of the code, a few bugs were found that, even though they did not affect the results, could make the code fail if the way the methods were used was slightly changed.
Finally, we modified some constants and arguments in the code so it can treat and preprocess any other database we decide to use in the future, with only a few changes needed depending on how different it is.
In summary, we invested a considerable amount of time in making the code more robust, versatile and intuitive to use, all while still filtering which information was useful and which should be discarded.
With the code properly tuned, we started writing the notebook that would separate the countries by region, while preparing the first batch of charts and tests for the newly formed groups.
We decided to create a .csv file for every region with all the countries it contained, but first we needed to define which countries were in which region. After searching for different ways of dividing them, we finally downloaded a small .csv from the World Bank that grouped the countries not just by continent, but also by a more socio-economic criterion.
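Splitting the data into one .csv per region can then be sketched like this (the column names of the World Bank lookup file are assumptions for illustration):

```python
import pandas as pd

def split_by_region(df: pd.DataFrame, regions_path: str, out_dir: str = ".") -> None:
    """Write one CSV per region, using a country-to-region lookup table."""
    regions = pd.read_csv(regions_path)  # assumed columns: 'country', 'region'
    merged = df.merge(regions, on="country", how="left")
    for region, group in merged.groupby("region"):
        group.drop(columns="region").to_csv(f"{out_dir}/{region}.csv", index=False)
```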
We also ran a few tests with some clustering algorithms so we could group the countries based on more empirical evidence. Even though we managed to obtain satisfactory results, we decided to postpone the clustering analysis so we could first refine our study methods and have more time to read about the options clustering algorithms offer to improve the results.
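For reference, a first experiment of that kind could look like the sketch below, using scikit-learn's KMeans on standardized per-country averages (the number of clusters and the feature set are illustrative assumptions, not our final setup):

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

def cluster_countries(df: pd.DataFrame, n_clusters: int = 5) -> pd.Series:
    """Group countries by the average value of their indicators over the studied years."""
    indicator_cols = [c for c in df.columns if c not in ("country", "year")]
    features = df.groupby("country")[indicator_cols].mean().dropna()
    scaled = StandardScaler().fit_transform(features)
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(scaled)
    return pd.Series(labels, index=features.index, name="cluster")
```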
This week, we started working with widgets that provide a more interactive and visual way of representing the information obtained from our study, such as showing the most correlated indicators based on a threshold the user can set for what they consider a significant correlation value.
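A minimal sketch of such a widget, assuming ipywidgets and a correlation matrix computed with pandas (the slider range and default value are illustrative):

```python
import pandas as pd
from IPython.display import display
from ipywidgets import FloatSlider, interact

def correlation_explorer(df: pd.DataFrame) -> None:
    """Let the user pick a threshold and list the indicator pairs correlated above it."""
    indicator_cols = [c for c in df.columns if c not in ("country", "year")]
    corr = df[indicator_cols].corr()

    def show(threshold: float = 0.7) -> None:
        pairs = corr.where(corr.abs() >= threshold).stack()
        # Drop self-correlations and keep each pair only once.
        pairs = pairs[[a < b for a, b in pairs.index]]
        display(pairs.sort_values(ascending=False))

    interact(show, threshold=FloatSlider(min=0.0, max=1.0, step=0.05, value=0.7))
```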
On the other hand, we coded an aggregation algorithm that allowed us to infer the values of whole regions, or even the world, based on a sum of the absolute indicators and a weighted average of the relative ones. We decided to use the population of each country as the weight for this average, but any other absolute indicator, such as GDP, can also be used with a simple modification to the code.
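The core of this aggregation could be sketched as follows (the 'population' weight column and the split into absolute and relative indicators are assumptions for illustration):

```python
import pandas as pd

def aggregate_region(df: pd.DataFrame, absolute: list[str], relative: list[str],
                     weight_col: str = "population") -> pd.DataFrame:
    """Aggregate a region per year: sum absolute indicators, weight-average relative ones."""
    def combine(group: pd.DataFrame) -> pd.Series:
        out = {col: group[col].sum() for col in absolute}
        weights = group[weight_col]
        for col in relative:
            out[col] = (group[col] * weights).sum() / weights.sum()
        return pd.Series(out)

    return df.groupby("year").apply(combine)
```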