Auto-pull Data Algorithm #143
Replies: 10 comments
-
EventId = 3 components
-
export async function jsonToMd5(
  json: Record<string, unknown>,
): Promise<string> {
  // Convert the JSON object to a string
  const jsonString = JSON.stringify(json);
  // Convert the JSON string to a Uint8Array
  const encoder = new TextEncoder();
  const data = encoder.encode(jsonString);
  // Calculate the MD5 digest.
  // Note: "MD5" is not part of the standard Web Crypto API; runtimes
  // such as Deno support it as an extension.
  const hashBuffer = await crypto.subtle.digest("MD5", data);
  // Convert the hash value to a hex string
  const hashArray = Array.from(new Uint8Array(hashBuffer));
  return hashArray.map((b) => b.toString(16).padStart(2, "0")).join("");
}
-
Encountered issues
Todo: Debug why we get different hashes for events that should be the same (from FDX).
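One cause worth ruling out when hashes differ for "identical" events: `JSON.stringify` preserves key insertion order, so the same fields received in a different order produce a different string and therefore a different MD5. The helper below is a hypothetical sketch (`stableStringify` is not from this project) showing a key-sorted serialization:

```typescript
// Hypothetical helper: serialize with sorted keys so logically-equal
// objects always produce the same string (and therefore the same hash).
function stableStringify(value: unknown): string {
  if (Array.isArray(value)) {
    return "[" + value.map(stableStringify).join(",") + "]";
  }
  if (value !== null && typeof value === "object") {
    const obj = value as Record<string, unknown>;
    return (
      "{" +
      Object.keys(obj)
        .sort()
        .map((k) => JSON.stringify(k) + ":" + stableStringify(obj[k]))
        .join(",") +
      "}"
    );
  }
  return JSON.stringify(value);
}

// Two objects with the same fields in a different order:
const a = { eventType: "AR", date: "2025-05-29" };
const b = { date: "2025-05-29", eventType: "AR" };
console.log(JSON.stringify(a) === JSON.stringify(b));   // false
console.log(stableStringify(a) === stableStringify(b)); // true
```

If the provider reorders fields between pulls, this alone would explain duplicate eventIds for the same underlying event.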
-
FDX's data is indeed different from what we expected. The fact that the new data contains fewer records than the old data is another strange issue; see #119 (comment). My guess is that the FDX system is constantly reorganizing its data, but we're not sure, since we don't know what kind of problems they are encountering.
-
To understand why data is deleted during the auto-pull process, we need to analyze both the newly added data and the deleted data. To do this, we need to record the deleted source data. My approach is to mark deleted records with a flag instead of physically removing them. This way, the deleted data can be retained without affecting front-end queries.
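The soft-delete idea can be sketched as follows. The actual column name is elided in the comment above, so `deletedAt` and the record shape are assumptions; the real implementation would run against the database rather than an in-memory array:

```typescript
// Hypothetical soft-delete sketch; `deletedAt` is an assumed column name.
interface EventRecord {
  eventId: string;
  payload: Record<string, unknown>;
  deletedAt: string | null; // null = still visible to front-end queries
}

// Instead of DELETE, stamp the record; the raw source data is retained
// for later analysis of what the provider removed.
function softDelete(records: EventRecord[], eventId: string): void {
  const rec = records.find((r) => r.eventId === eventId);
  if (rec) rec.deletedAt = new Date().toISOString();
}

// Front-end queries filter out soft-deleted rows
// (in SQL terms: WHERE deleted_at IS NULL).
function visibleEvents(records: EventRecord[]): EventRecord[] {
  return records.filter((r) => r.deletedAt === null);
}
```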
-
Let's rethink the logic to compare Set A and Set B
Other Ideas
-
Two data issues (observed and potential)
Design guideline
Side effects
Lessons learned
-
Issue: The auto-pull process frequently deletes database records.
Analysis: At the outset of the design, we assumed:
Through detailed analysis of FedEx data, we confirmed that FedEx frequently makes minor adjustments to historical data, causing slightly different data to generate a new eventId (hash). After discovering that the number of records in the database exceeded that of the data source, our approach was to delete data from the database that no longer exists in the data source. Due to FedEx's frequent data changes, we correspondingly end up frequently deleting data from the database.
Data change example
The following is the data pulled for fdx-881600917035 at different time points:
{
"date": "2025-05-29T01:22:00-07:00",
"eventType": "AR",
"locationId": "OAKH",
"locationType": "FEDEX_FACILITY",
"scanLocation": {
"city": "OAKLAND",
"postalCode": "94621",
"countryCode": "US",
"countryName": "United States",
"residential": false,
"streetLines": {
},
"stateOrProvinceCode": "CA"
},
"derivedStatus": "In transit",
"exceptionCode": "",
"eventDescription": "Arrived at FedEx hub",
"derivedStatusCode": "IT",
"exceptionDescription": ""
}

Data pulled at: 2025-05-30T00:55:01.945Z

{
"date": "2025-05-29T01:22:00-07:00",
"eventType": "AR",
"locationType": "FEDEX_FACILITY",
"scanLocation": {
"city": "OAKLAND",
"postalCode": "94621",
"countryCode": "US",
"countryName": "United States",
"residential": false,
"streetLines": {
},
"stateOrProvinceCode": "CA"
},
"derivedStatus": "In transit",
"exceptionCode": "",
"eventDescription": "Arrived at FedEx hub",
"derivedStatusCode": "IT",
"exceptionDescription": ""
}

The difference between these two JSON payloads is that the latter does not include the locationId field.

Objective Review: The primary goal of this product is to standardize logistics data and provide developers with an easy-to-use API. Specifically, it tracks key events during logistics transportation through status codes. Therefore, we do not present the data source's data as-is: we add status codes to the data, and similarly, we remove data that is meaningless from the perspective of our objectives. In other words, we aim to:
Algorithm Change: Based on the above objectives, we cannot simply convert source data into standardized data on a one-to-one basis. For key information, we need to perform a standardizing transformation before adding it to the database. For invalid data, we need to remove it from the database. Thus, the updated algorithm is described below.
In simple terms, for each trackingID we compare the freshly pulled events with the stored events and reconcile the difference.
-
Algorithm for events update
For a trackingID:
Set (A): Fresh event-related data pulled from the source.
Set (B): Event data already stored in the database.
Goal: find the simplest and most efficient way to do both:
a) Update: (B) = (A)
Outcome: (B) <= (A)
Steps:
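The update step above reduces to two set differences over eventIds. A minimal sketch (function name `diffEventIds` is an assumption, not from the codebase):

```typescript
// Sketch of the update step: make the stored set (B) match the fresh
// pull (A) by computing two set differences over eventIds.
function diffEventIds(a: Set<string>, b: Set<string>) {
  const toInsert = [...a].filter((id) => !b.has(id)); // in A, not in B
  const toDelete = [...b].filter((id) => !a.has(id)); // in B, not in A
  return { toInsert, toDelete };
}

const fresh = new Set(["e1", "e2", "e3"]);  // Set A
const stored = new Set(["e2", "e4"]);       // Set B
const diff = diffEventIds(fresh, stored);
// inserts e1 and e3, deletes e4; afterwards (B) = (A)
console.log(diff);
```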
-
Assume that we have two variables available:
In-memory operations
The following is the logic to check whether the event data has changed.
Db-related operations
-
Main logic
Pulling data from the logistics source every five minutes.
(1) Get incomplete trackingIDs (not delivered)
Note: A trackingID is marked Completed if its status is 3500 or 3007.
Retrieve ongoing trackingIDs and additional information (e.g., phone number) from the entities table. Include all ongoing trackingIDs where completed = false (e.g., fdx-881463876410).
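Step (1) can be sketched as a simple predicate plus a filter. The row shape below is an assumption; in practice this would be a query against the entities table (roughly `SELECT * FROM entities WHERE completed = false`):

```typescript
// Assumed shape of a row in the entities table.
interface EntityRow {
  trackingId: string;
  status: number;
  completed: boolean;
  phone?: string; // additional info carried along for notifications
}

// Status codes 3500 and 3007 mark a shipment as Completed (per the note above).
const isCompletedStatus = (status: number): boolean =>
  status === 3500 || status === 3007;

// Select the trackingIDs that still need to be pulled.
function ongoingEntities(rows: EntityRow[]): EntityRow[] {
  return rows.filter((r) => !r.completed);
}
```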
(2) Iterate through each trackingID
2.1) Based on the operator code, send a request to the data provider (e.g., SF Express or FedEx), receive the response, and convert it to our standard Entity object.
During the process of converting the JSON response to an Entity object, we navigate to the scanEvents section (for FDX) or the routes section (for SFEX), where the data provider stores the list of events. We iterate through each event item in the list, convert it into the Event object we defined, and set the eventId for each Event object. Before appending the newly created Event object to the Entity object, we check whether an event with the same eventId already exists in the Entity. If an event with the same eventId already exists, the newly created Event will be ignored.
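The dedup-on-append check described above can be sketched as follows. The `TrackingEvent`/`TrackingEntity` names are assumptions standing in for the project's Event and Entity objects:

```typescript
// Assumed stand-ins for the project's Event and Entity objects.
interface TrackingEvent {
  eventId: string;
  // ...other standardized fields
}
interface TrackingEntity {
  trackingId: string;
  events: TrackingEvent[];
}

// While iterating the provider's event list (scanEvents for FDX,
// routes for SFEX), ignore any event whose eventId already exists.
function appendEvent(entity: TrackingEntity, event: TrackingEvent): boolean {
  if (entity.events.some((e) => e.eventId === event.eventId)) {
    return false; // duplicate eventId: the new event is ignored
  }
  entity.events.push(event);
  return true;
}
```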
2.2) Retrieve existing eventIds from the database using the trackingID. Example eventId: ev_fdx-881463876410-0cb79ccf0ea0a1906434f88994f6ad45
2.3) Compare the eventIds in the returned Entity object with the existing eventIds in the database to determine whether the Entity object's events have been updated.
2.4) If updated, proceed to update the event records in the database.
Logic to Compare Two Sets of Event IDs
We have two sets of eventIds:
providerEventIds
dbEventIds
If the sizes (counts of eventIds) of these two sets are different, go to step (3).
If the sizes are the same, we need to compare the eventIds further.
It's important to note that an eventId may change when the source modifies the data in the event block, since that changes the hash value.
If both sets contain exactly the same eventIds, do nothing. Otherwise, go to step (3).
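The comparison above can be sketched as a small predicate (the function name is an assumption; `providerEventIds` and `dbEventIds` are the two sets named earlier):

```typescript
// Decide whether the events changed (step 2.3).
function eventsChanged(
  providerEventIds: Set<string>,
  dbEventIds: Set<string>,
): boolean {
  // Different sizes: definitely changed, go straight to the update step.
  if (providerEventIds.size !== dbEventIds.size) return true;
  // Same size: changed only if some id is missing on the other side.
  // (Equal sizes plus full containment implies set equality.)
  for (const id of providerEventIds) {
    if (!dbEventIds.has(id)) return true;
  }
  return false;
}
```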
(3) Update events by comparing 2 sets
3.1) If in Set A but not in B, it’s a new event to add to the database.
3.2) If in Set B but not in A, delete it from the database.
Update Logic
The updateEntity() function accepts two parameters:
Parameter 1: entity.events: the entity object with events, retrieved from the data provider. Set A
Parameter 2: eventIds[], an array containing the saved eventIds. Set B
Step 1: Update entities
Mark the trackingID as Completed if its status is 3500 or 3007.
Step 2: Update events in 2 loops
There are two loops to compare eventIds from sets A and B.
Loop 1: Insert new events
Iterate through event IDs in set A (the newly pulled data).
If an eventId is not found in set B, add the event to the database.
Loop 2: Delete events
Iterate through event IDs in set B (already in the database).
If an eventId is not found in set A, delete it from the database.
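The two loops can be sketched as below. `updateEvents`, `insertEvent`, and `deleteEvent` are stand-in names for the real database calls inside updateEntity():

```typescript
// Sketch of Step 2: two loops reconciling sets A and B.
function updateEvents(
  providerEvents: Map<string, unknown>, // Set A: eventId -> event payload
  dbEventIds: string[],                 // Set B: ids already stored
  insertEvent: (id: string, payload: unknown) => void,
  deleteEvent: (id: string) => void,
): void {
  const dbSet = new Set(dbEventIds);
  // Loop 1: ids in A but not in B are new events to insert.
  for (const [id, payload] of providerEvents) {
    if (!dbSet.has(id)) insertEvent(id, payload);
  }
  // Loop 2: ids in B but not in A no longer exist at the source.
  for (const id of dbSet) {
    if (!providerEvents.has(id)) deleteEvent(id);
  }
}
```

Passing the insert/delete operations in as callbacks keeps the set logic testable without a live database connection.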