Data Sharks

smanolesCAPG edited this page Jul 1, 2022 · 13 revisions

WEEKLY REPORT

FIRST WEEK: 21 June to 28 June

This report starts from the point at which we had finished watching all the tutorials and acquired the knowledge needed to begin our assigned task.

First, we needed data to test the hypothesis. After some searching we found several trustworthy database organizations, but many of them offered far more information than we could use. Because of hardware limitations, we concluded that not every indicator could be selected, so only around 30 of them would make it into the final analysis. Choosing the indicators was also difficult: many of them had a lot of missing information, so we had to inspect each table manually.
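The manual inspection described above could be partly automated by ranking indicators by how much data they are missing. The following is only a minimal sketch under stated assumptions: the function names `missing_share` and `select_indicators`, the 20% threshold, and the in-memory layout (a dict mapping indicator names to lists of values) are all hypothetical and not taken from our code; only the cap of around 30 indicators comes from the report.

```python
def missing_share(values):
    """Return the fraction of entries that are None or blank strings."""
    if not values:
        return 1.0  # an empty column is treated as fully missing
    missing = sum(1 for v in values if v is None or str(v).strip() == "")
    return missing / len(values)


def select_indicators(columns, max_missing=0.2, limit=30):
    """Keep the most complete indicators.

    columns     -- dict mapping indicator name to its list of values (assumed layout)
    max_missing -- hypothetical threshold: drop indicators above this missing share
    limit       -- cap on the number of indicators (around 30 in our case)
    """
    # Sort by missing share first, then by name, so the result is deterministic.
    scored = sorted((missing_share(vals), name) for name, vals in columns.items())
    return [name for share, name in scored if share <= max_missing][:limit]
```

A threshold-based filter like this would only narrow the candidates; tables with unusual layouts would still need to be checked by hand.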

Once the databases were selected, we tried importing each one from its URL with Python code; however, not all databases had a direct link. Downloading the .csv files and bundling them into a single .zip was therefore a simpler approach. The first battle with the data was really difficult: each file had a different column format, many columns had different dtypes, and some rows had missing or blank values. Each database provider needed its own treatment, so all the files were named with the same convention: 'NameOfTheProvider - Indicator Name'. A lot of time went into writing two large methods, 'preprocess' and 'rename_value_column', which read the name of the .csv and apply the specific treatment for the provider the file was downloaded from.
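The idea of dispatching on the file name can be sketched as follows. This is an illustration and not our actual implementation: the function bodies, the helper `split_filename`, the provider names `ProviderA` and `ProviderB`, and their value-column names are all hypothetical, and the sketch uses only the standard library instead of a dataframe library. It reads every .csv in a .zip, splits the name into provider and indicator, renames the provider's value column to the indicator name, and drops rows whose value is blank.

```python
import csv
import io
import zipfile

# Hypothetical mapping: which column holds the value for each provider.
VALUE_COLUMN = {
    "ProviderA": "Value",
    "ProviderB": "OBS_VALUE",
}


def split_filename(name):
    """Split 'NameOfTheProvider - Indicator Name.csv' into (provider, indicator)."""
    stem = name.rsplit("/", 1)[-1]  # ignore any folder inside the archive
    if stem.lower().endswith(".csv"):
        stem = stem[:-4]
    provider, _, indicator = stem.partition(" - ")
    return provider.strip(), indicator.strip()


def rename_value_column(row, provider, indicator):
    """Return a copy of the row with the provider's value column renamed to the indicator."""
    source = VALUE_COLUMN[provider]
    renamed = {key: val for key, val in row.items() if key != source}
    raw = (row.get(source) or "").strip()
    # Blank cells become None; non-numeric text would raise ValueError here.
    renamed[indicator] = float(raw) if raw else None
    return renamed


def preprocess(zip_file):
    """Read every .csv in the archive and return {indicator: list of cleaned rows}."""
    tables = {}
    with zipfile.ZipFile(zip_file) as archive:
        for name in archive.namelist():
            if not name.lower().endswith(".csv"):
                continue
            provider, indicator = split_filename(name)
            with archive.open(name) as raw:
                text = io.TextIOWrapper(raw, encoding="utf-8-sig", newline="")
                rows = [rename_value_column(r, provider, indicator)
                        for r in csv.DictReader(text)]
            # Drop rows whose value is missing.
            tables[indicator] = [r for r in rows if r[indicator] is not None]
    return tables
```

Keeping the provider-specific details in one lookup table means that supporting a new provider only requires a new entry, as long as its files follow the same naming convention.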

At the same time, a model definition diagram was created; it still needs some improvements because the Python code is still in progress. A Poetry file is also still pending for the upcoming week.

This first week has been a really big challenge in terms of normalizing the data; however, we think we have made huge progress and have learned a lot about coding and about the best way to work in a team.
