
Deal with changing format


Why this has to be a little bit complicated

Since Facebook may change the JSON structure in the downloaded archive, we developed a series of methods to deal with the problem. The following is a short guide to getting a table with the columns you want.
The core idea is: "Let's put everything into one table first, then get the columns you want from that table." To get the data you need from the changing Facebook data format, the following three steps are required (a sketch of the idea follows the list):

  1. Create tmp tables (df_list) from json files
  2. Merge sub-DataFrames under the same top_df to generate wanted_df
  3. Select only the columns we need ("wanted_columns") to generate the final DataFrame
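As a rough illustration of that idea (plain pandas with made-up data; this is not fbjson2table's internal code), a nested Facebook-style record can be flattened into one temp table per nesting level, joined back into a single table, and then trimmed to the wanted columns:

import pandas as pd

# A tiny nested structure in the style of a Facebook archive (made-up data).
post = {
    "timestamp": 1500000000,
    "attachments": [
        {"data": [{"media": {"uri": "photos/a.jpg", "title": "A"}}]}
    ],
}

# One flat "temp table" per nesting level, linked by a synthetic id column.
top = pd.DataFrame([{"id_0": 0, "timestamp": post["timestamp"]}])
media = pd.DataFrame([{"id_0": 0, **post["attachments"][0]["data"][0]["media"]}])

# Put everything into one table first...
merged = top.merge(media, on="id_0", how="left")

# ...then get only the columns you want.
print(merged[["timestamp", "uri", "title"]])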

Let's take a look at: link

The first thing we need to do is get_routed_dfs.

=> Since "df" in "df_list" map may diverge to different directions, we have to get the DataFrames that in the same path (route) toward "top_df".

Second, do get_start_peeling.

=> Find the index of top_df in the df_list. This index will be used for the next step.

Then, do merge_one_to_one_sub_df.

=> Merge all one-to-one sub_dfs under the same top_df.

Finally, do get_wanted_columns to create the final DataFrame.

=> Extract the columns we need and return NaN if the columns are missing.
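The "return NaN if the columns are missing" behaviour of this last step can be pictured with plain pandas (illustrative only, not the library's code): selecting the wanted columns via reindex fills any missing column with NaN instead of raising an error.

import pandas as pd

df = pd.DataFrame({"uri": ["photos/a.jpg"], "title": ["A"]})
wanted_columns = ["uri", "title", "description"]   # "description" is missing here

# reindex keeps the wanted columns and fills missing ones with NaN.
wanted_df = df.reindex(columns=wanted_columns)
print(wanted_df)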

Run temp_to_wanted_df

The temp_to_wanted_df function takes four parameters:

  1. wanted_columns
  2. route_by_table_name
  3. start_by
  4. regex

The wanted_columns is a list of column names we want. Use the empty list [] to return all columns.

The route_by_table_name is the suffix of the table name of the DataFrame we want to use as top_df. It is needed when that DataFrame sits on a diverging branch of df_list; the routing extracts only the branch that ends with this suffix.

Take example_df_list as an example.

If we plot the relationship of df_list:

[Image: the relationships within df_list]

We can see that there are two paths (routes) in the picture. If we want to merge temp__attachments__data__media and its sub_dfs, we have to extract them from df_list; otherwise, temp__data may cause problems when we do the join operation.

If we set route_by_table_name as "media", we will get the path (route) like:

[Image: the route selected by "media"]
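As a plain-Python picture of what that routing means (the intermediate table names are inferred from the example and may not match the real ones exactly; this is not the library's code), choosing "media" keeps only the branch whose last table name ends with that suffix:

routes = [
    ["temp", "temp__attachments", "temp__attachments__data", "temp__attachments__data__media"],
    ["temp", "temp__data"],
]

route_by_table_name = "media"

# Keep the branch whose last table name ends with the requested suffix.
routed = [r for r in routes if r[-1].endswith(route_by_table_name)]
print(routed)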

The start_by is a column name that is unique to top_df within the routed_df_list.

When we do the join operation, we have to choose one df in df_list as our top_df. Taking example_df_list again: if we want to set "temp__attachments__data__media" as top_df, we have to find a column name that is unique to it, such as "uri".

Please note that "title" cannot be used as the start_by column, since that name is reserved in the temp tables.

The regex parameter controls whether start_by is matched as a regular expression when looking for that column.
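For example, assuming a TempDFs instance like the temp_dfs created in the full example below, a call along these lines is presumably how the flag is meant to be used (a hedged guess from the parameter description, not verified against the library):

one_to_one_df, _ = temp_dfs.temp_to_wanted_df(
    route_by_table_name='media',
    start_by='ur.+',   # a regular-expression pattern instead of an exact column name
    regex=True,
)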

Thus, if we want to get the "creation_timestamp", "description", "title", "uri", and "upload_ip" of the photos of posts in one table, the whole process looks like this:

from fbjson2table.func_lib import parse_fb_json
from fbjson2table.table_class import TempDFs


# Parse one downloaded Facebook JSON file and build the temp tables.
json_content = parse_fb_json($PATH_OF_JSON)
temp_dfs = TempDFs(json_content)

# Without wanted_columns, all available columns are returned.
one_to_one_df, _ = temp_dfs.temp_to_wanted_df(
    route_by_table_name='media',
    start_by='uri',
)
print(one_to_one_df.columns)

and we get:

Index(['creation_timestamp', 'description', 'id_0', 'id_attachments_1',
       'id_data_2', 'id_media_3', 'title', 'uri', 'media_metadata_id_0',
       'media_metadata_id_attachments_1', 'media_metadata_id_data_2',
       'media_metadata_id_media_metadata_4', 'photo_metadata_id_0',
       'photo_metadata_id_attachments_1', 'photo_metadata_id_data_2',
       'photo_metadata_id_media_metadata_4',
       'photo_metadata_id_photo_metadata_5', 'photo_metadata_upload_ip'],
      dtype='object')

Then,

wanted_columns = ['creation_timestamp', 'description', 'title', 'uri',
                  'photo_metadata_upload_ip']
photo_df, _ = temp_dfs.temp_to_wanted_df(
    wanted_columns=wanted_columns,
    route_by_table_name='media',
    start_by='uri',
)

The photo_df is what we want.
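Since photo_df is an ordinary pandas DataFrame at this point, the usual inspection and export calls apply, for example:

print(photo_df.head())                       # quick sanity check of the selected columns
photo_df.to_csv("photos.csv", index=False)   # persist the result for later analysis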
