Commit 0a975bc

Merge pull request #9 from jeronimoluza/dev
Enhanced Scraping, EPU Analysis, and Project Structure
2 parents e557290 + 947a731 commit 0a975bc

File tree

289 files changed

+33190
-4401
lines changed

.gitignore

Lines changed: 2 additions & 1 deletion
@@ -92,4 +92,5 @@ notebooks/R/.Rhistory
9292
.Rproj.user
9393

9494
# data files
95-
data/
95+
data/
96+
logs/

Makefile

Lines changed: 1 addition & 0 deletions
@@ -1,4 +1,5 @@
11

22
all:
3+
jupyter-book clean docs
34
jupyter-book build docs
45
open docs/_build/html/index.html

README.md

Lines changed: 1 addition & 1 deletion
@@ -118,4 +118,4 @@ data_startyear = 2009
118118
119119
## License
120120

121-
Materials under this repository are open-source under an [MIT license](LICENSE). The community is invited to test, adapt, and re-purpose materials as needed.
121+
Materials under this repository are open-source under an [MIT license](LICENSE.md). The community is invited to test, adapt, and re-purpose materials as needed.

docs/images/interactive/text/epu_pic.html

Lines changed: 146 additions & 57 deletions
Large diffs are not rendered by default.

docs/images/interactive/text/epu_topics_pic.html

Lines changed: 147 additions & 0 deletions
Large diffs are not rendered by default.

docs/images/interactive/text/news_count_pic.html

Lines changed: 141 additions & 0 deletions
Large diffs are not rendered by default.

docs/images/interactive/text/out_of_bag_predictions_pic.html

Lines changed: 131 additions & 0 deletions
Large diffs are not rendered by default.

docs/images/interactive/text/sentiment_pic.html

Lines changed: 141 additions & 0 deletions
Large diffs are not rendered by default.

docs/images/interactive/text/train_predictions_pic.html

Lines changed: 136 additions & 0 deletions
Large diffs are not rendered by default.

docs/text/text_intro.md

Lines changed: 93 additions & 29 deletions
@@ -1,34 +1,60 @@
11
# Economic Analysis with News Sources
22

3-
New analytical techniques have increased the role of non-traditional data sources for economic analysis, including text-based data. This research explores the use of text-based data from news articles, using natural language processing (NLP), to fill key data gaps on economic sentiments and prices, offering insights into relevant economic trends across the Pacific region.
3+
New analytical techniques have increased the role of non-traditional data sources for economic analysis, including text-based data. This research explores the use of text-based data from news articles, using natural language processing (NLP), to fill key data gaps on economic sentiments and prices, offering insights into relevant economic trends across the East Asia and Pacific region.
44

55
## Data Sources
66

7-
The Pacific region hosts a substantial corpus of accessible English-based content from newspapers and international news platforms, providing an opportunity to generate timely, comprehensive indicators of economic and political trends. Specifically, local news outlets from Pacific Island Countries (PICs), complemented by regional sources such as the Pacific Islands News Association (PINA), ABC Australia (ABC AU), and Radio New Zealand (RNZ), were selected due to their robust coverage and reliability. We used web-scraping techniques to extract articles from the selected sources, before organizing the contents into structured datasets.
7+
The East Asia and Pacific region hosts a substantial corpus of accessible English-based content from newspapers and international news platforms, providing an opportunity to generate timely, comprehensive indicators of economic and political trends. Specifically, local news outlets from East Asia and Pacific countries, complemented by regional sources such as the Pacific Islands News Association (PINA), ABC Australia (ABC AU), and Radio New Zealand (RNZ), were selected due to their robust coverage and reliability. We used web-scraping techniques to extract articles from the selected sources, before organizing the contents into structured datasets.
88

99
**Table 1: News Sources by Country**
1010

11-
| Country | Newspaper/Media Source | Number of Articles | From |
12-
|----------------|---------------------------------------------|--------------------|------------|
13-
| Fiji | Fiji Sun | 46,350 | 2008-05-27 |
14-
| Pacific | Pacific Islands News Association (PINA) | 26,151 | 2011-11-19 |
15-
| | Australian Broadcasting Corporation (ABC AU) | 16,297 | 2003-02-19 |
16-
| | Radio New Zealand (RNZ) | 18,160 | 2015-02-18 |
17-
| Papua New Guinea | Post Courier | 6,278 | 2016-04-08 |
18-
| | PNG Business News | 2,197 | 2019-05-24 |
19-
| Samoa | Samoa Observer | 35,489 | 2012-01-01 |
20-
| Solomon Islands | Solomon Islands Broadcasting Corporation (SIBC) | 9,062 | 2013-12-04 |
21-
| | Solomon Times | 11,139 | 2007-04-14 |
22-
| | Solomon Star | 14,484 | 2014-04-10 |
23-
| | The Island Sun | 9,117 | 2017-09-01 |
24-
| Tonga | Matangi Tonga Online | 14,071 | N/A |
25-
| Vanuatu | The Daily Post | 29,469 | 2014-04-08 |
26-
| | Vanuatu Business Review | 577 | 2020-04-27 |
27-
| **Total** | | **238,941** | |
11+
| Country | Newspaper/Media Source | Number of Articles | From |
12+
|---------|------------------------|--------------------|----|
13+
| Cambodia | Khmer Times | 69,680 | 1970-01-01 |
14+
| China | China Daily | 10,512 | 2014-03-28 |
15+
| | People's Daily Online | 3,442 | 2024-09-13 |
16+
| Fiji | Fiji Sun | 63,880 | 2008-05-27 |
17+
| Indonesia | Antara | 10,886 | 2025-09-23 |
18+
| | Jakarta Post | 1,635 | 2025-02-24 |
19+
| | Tempo | 77,615 | 2003-07-21 |
20+
| Japan | Japan News | 51,555 | 2022-04-29 |
21+
| | Japan Today | 4,500 | 2012-09-27 |
22+
| | The Asahi Shimbun | 11,399 | 2020-04-16 |
23+
| Lao | The Laotian Times | 8,687 | 2016-06-03 |
24+
| Malaysia | Malay Mail | 225,506 | 2013-06-18 |
25+
| Marshall Islands | MI Journal | 1,620 | 2015-01-02 |
26+
| Mongolia | UB Post | 462 | 2016-10-08 |
27+
| New Zealand | New Zealand Herald | 16,802 | 2025-06-10 |
28+
| Pacific | Australian Broadcasting Corporation (ABC AU) | 25,468 | 2003-02-19 |
29+
| | PINA | 39,176 | 2011-11-19 |
30+
| | Radio New Zealand (RNZ) | 53,118 | 2007-06-17 |
31+
| Palau | Island Times | 10,094 | 2016-06-03 |
32+
| Papua New Guinea | PNG Business News | 3,498 | 2019-05-24 |
33+
| | Post Courier | 52,768 | 2015-12-16 |
34+
| Philippines | Asia News Network | 3,067 | 2018-04-03 |
35+
| | Inquirer | 50,685 | 1998-10-07 |
36+
| | Philippine Star | 220 | 2025-10-11 |
37+
| Samoa | Samoa Observer | 77,557 | 2012-01-01 |
38+
| Singapore | The Independent | 1,885 | 2022-10-17 |
39+
| | The Straits Times | 9,789 | 2024-09-15 |
40+
| | Today Online | 616 | 2024-04-13 |
41+
| Solomon Islands | SIBC | 10,916 | 2013-12-14 |
42+
| | Solomon Star | 34,109 | 2014-04-10 |
43+
| | Solomon Times | 22,976 | 2007-04-14 |
44+
| | The Island Sun | 10,301 | 2017-09-01 |
45+
| South Korea | The Korea Herald | 12,431 | 2025-05-05 |
46+
| | The Korea Times | 94,323 | 2006-12-07 |
47+
| Thailand | Nation Thailand | 13,854 | 2024-04-22 |
48+
| Tonga | Matangi Tonga Online | 40,481 | 1997-11-04 |
49+
| Vanuatu | Vanuatu Daily Post | 35,333 | 2014-04-08 |
50+
| | Vanuatu Business Review (VBR) | 577 | 2020-04-27 |
51+
| Vietnam | Tuoi Tre | 36,564 | 1970-01-01 |
52+
| | Vietnam News | 38,577 | 2004-06-21 |
53+
| **Total** | | **1,292,472** | |
2854

2955
## Methods
3056

31-
### Economic Policy Uncertainty Index (EPU) Index
57+
### Economic Policy Uncertainty (EPU) Index
3258

3359
One of the most influential applications of text data in economics is the Economic Policy Uncertainty (EPU) index first developed by {cite:t}`baker2016measuring`. In the initial application, an index of policy uncertainty was constructed by analyzing the frequency of keywords related to economics, policy, and uncertainty in news articles. The authors found periods of elevated policy uncertainty to be strongly associated with declines in investment and employment, highlighting the negative impact of uncertainty on economic decision-making.
3460

@@ -49,34 +75,72 @@ The construction of the EPU index follows a systematic approach where a news art
4975
- Compute $M$, the mean value of $Z_t$ over the period $T_1$
5076
- Normalize the EPU index by multiplying $Z_t$ by $ \left( \frac{100}{M} \right) $ for $T_1$, resulting in the normalized EPU time-series index with a mean of 100 over $T_1$.
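
The last two steps above can be sketched in a few lines of Python. This is a minimal illustration rather than the project's implementation: the function name `normalize_epu`, the dictionary layout, and the sample values are all hypothetical, and the scaling factor computed over $T_1$ is assumed to be applied to every period in the series.

```python
def normalize_epu(z, base_periods):
    """Rescale a series Z_t so that its mean over the base period T_1 is 100.

    z            -- dict mapping a period label to the value Z_t (hypothetical layout)
    base_periods -- the period labels that make up T_1
    """
    # M: mean of Z_t over the base period T_1
    m = sum(z[t] for t in base_periods) / len(base_periods)
    # Multiply every Z_t by 100 / M
    return {t: value * 100.0 / m for t, value in z.items()}


# Illustrative values only, not taken from the dataset
z_example = {"2020-01": 2.0, "2020-02": 4.0, "2020-03": 6.0}
base_example = ["2020-01", "2020-02"]
epu_example = normalize_epu(z_example, base_example)
```

By construction, the normalized values average 100 over the base period, while periods outside $T_1$ are expressed relative to that baseline.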
5177

52-
<div class="flourish-embed flourish-chart" data-src="visualisation/22204379?2274258"><script src="https://public.flourish.studio/resources/embed.js"></script><noscript><img src="https://public.flourish.studio/visualisation/22204379/thumbnail" width="100%" alt="chart visualization" /></noscript></div>
78+
<div>
79+
<iframe src="../interactive/text/epu_pic.html"
80+
frameborder="0" marginwidth="0" marginheight="0" width="800" height="433"></iframe>
81+
</div>
5382

5483
### Topic-based EPU
5584

5685
The EPU index can also be computed for news sources related to specific policy topics. To qualify, articles need to contain at least one keyword in each of the four criteria, namely (1) Economy, (2) Uncertainty, (3) Policy, and (4) Policy Topic - a list of terms for a specific theme (labor, inflation, climate, food security). Because fewer articles meet these refined criteria, a topic-based EPU is constructed at a quarterly frequency. The graphs below display quarterly EPU for jobs and inflation.
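
The four-criteria filter can be illustrated with a short sketch. The keyword lists below are placeholders invented for this example, not the term lists used in the analysis, and plain substring matching is assumed for simplicity; a real pipeline would likely tokenize the text first.

```python
# Hypothetical keyword lists, for illustration only
EXAMPLE_TERMS = {
    "economy": ["economy", "economic"],
    "uncertainty": ["uncertain", "uncertainty"],
    "policy": ["policy", "regulation", "central bank"],
    "topic_inflation": ["inflation", "prices"],
}


def qualifies(text, term_sets):
    """Return True if the text contains at least one term from every category."""
    lowered = text.lower()
    return all(
        any(term in lowered for term in terms)
        for terms in term_sets.values()
    )


hit = qualifies(
    "Economic uncertainty grows as the central bank weighs inflation policy",
    EXAMPLE_TERMS,
)
miss = qualifies("The economy grew strongly this quarter", EXAMPLE_TERMS)
```

Here `hit` is true because every category is matched, while `miss` is false because the second text contains no uncertainty, policy, or topic term.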
5786

58-
<div class="flourish-embed flourish-chart" data-src="visualisation/22205009?2274258"><script src="https://public.flourish.studio/resources/embed.js"></script><noscript><img src="https://public.flourish.studio/visualisation/22205009/thumbnail" width="100%" alt="chart visualization" /></noscript></div>
87+
<div>
88+
<iframe src="../interactive/text/epu_topics_pic.html"
89+
frameborder="0" marginwidth="0" marginheight="0" width="800" height="433"></iframe>
90+
</div>
5991

6092
### Economic Policy Sentiment
6193

6294
We use the EPU to filter news articles that align with the economic and policy categories for targeted sentiment analysis. The sentiment analysis uses VADER (Valence Aware Dictionary and sEntiment Reasoner), a rule-based model that handles social media and news text (Hutto and Gilbert, 2014). VADER calculates the sentiment score S based on the sum of lexical features (positive, neutral, and negative words). The final sentiment score S ranges between -1 (most negative) and +1 (most positive), with neutral scores around 0.
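
To show how a summed valence ends up in the range from -1 to +1, the sketch below reproduces the normalization used by the reference VADER implementation for its compound score, which divides the raw sum $x$ by $\sqrt{x^2 + \alpha}$ with $\alpha = 15$. The function name `compound_score` is hypothetical, and the sketch covers only this final step, not VADER's lexicon lookup or its rules for negation and intensifiers.

```python
import math


def compound_score(valence_sum, alpha=15.0):
    """Map an unbounded sum of word valences to a score between -1 and +1."""
    score = valence_sum / math.sqrt(valence_sum ** 2 + alpha)
    # Clamp to guard against floating-point overshoot
    return max(-1.0, min(1.0, score))


neutral = compound_score(0.0)          # no net valence
positive = compound_score(4.0)         # moderately positive text
very_negative = compound_score(-50.0)  # strongly negative text
```

A sum of zero maps to zero, and increasingly large positive or negative sums approach +1 or -1 without exceeding them.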
6395

64-
<div class="flourish-embed flourish-chart" data-src="visualisation/22205348?2274258"><script src="https://public.flourish.studio/resources/embed.js"></script><noscript><img src="https://public.flourish.studio/visualisation/22205348/thumbnail" width="100%" alt="chart visualization" /></noscript></div>
96+
<div>
97+
<iframe src="../interactive/text/sentiment_pic.html"
98+
frameborder="0" marginwidth="0" marginheight="0" width="800" height="433"></iframe>
99+
</div>
65100

66-
### CPI and Inflation
101+
### Consumer Price Index (CPI) and Inflation
67102

68-
Once we have obtained the EPU index for each country and period, we use the result as an input to analyze and predict price movements. To do so, we apply a three-month moving average (MA3) to smooth the volatile directly measured inflation data and introduce an additional policy category (using the same index approach described above) that focuses on inflation-specific terms. Finally, we conduct a regression analysis using variables selected through the cross-validated LASSO method.
103+
Once we have obtained the EPU index for each country and period, we use the result as an input to analyze and predict price movements. To do so, we obtain International Monetary Fund (IMF) Consumer Price Index (CPI) data and apply a three-month moving average (MA3) to smooth the volatile, directly measured inflation series. Subsequently, we conduct a regression analysis using variables selected through the cross-validated LASSO method, which retains relevant variables while reducing the risk of overfitting. To further prevent the overfitting that a high-order polynomial can introduce, we limit the lag used in the analysis to a maximum of two, meaning that each prediction can draw only on the past three months of inflation information.
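
The smoothing and lag construction can be sketched as follows. Both function names are hypothetical, the moving average is assumed to be a trailing window (the text does not say whether it is trailing or centered), and a maximum lag of two is read as lags 0, 1, and 2. The LASSO step itself is omitted; the resulting rows could be passed to a cross-validated LASSO such as scikit-learn's `LassoCV`.

```python
def moving_average_3(values):
    """Trailing three-month moving average; the first two entries are None."""
    smoothed = []
    for i in range(len(values)):
        if i < 2:
            smoothed.append(None)  # not enough history yet
        else:
            smoothed.append(sum(values[i - 2:i + 1]) / 3.0)
    return smoothed


def lagged_rows(series, max_lag=2):
    """Pair each next-month target with the current value and its lags.

    With max_lag=2, the features for predicting series[t + 1] are
    series[t], series[t - 1], and series[t - 2]: three months of history.
    """
    rows = []
    for t in range(max_lag, len(series) - 1):
        features = [series[t - k] for k in range(max_lag + 1)]
        rows.append((features, series[t + 1]))
    return rows


# Illustrative values only
ma3_example = moving_average_3([1.0, 2.0, 3.0, 4.0, 5.0])
rows_example = lagged_rows([10, 20, 30, 40, 50])
```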
69104

70105
## Results
71106

72107
### Country-Specific Models
73108

74-
At the country level, Fiji achieves the lowest RMSE at 0.739, indicating that the model’s predictions deviate by approximately 0.76 percentage points from the actual inflation values. Although Samoa and the Solomon Islands fail to accurately capture the magnitude of inflation, they exhibit stronger directional accuracy, correctly identifying inflationary and deflationary trends in the model evaluation with accuracies of 63.6 percent and 68.5 percent, respectively. Inflation volatility and the rapid alternation between deflation and inflation in PICs reduce prediction accuracy. Smoothing techniques considerably enhance the performance of the pooled model compared to country-specific approaches.
109+
We use a training set of eight countries to evaluate the performance of the country-specific models: China, Fiji, Indonesia, Japan, Lao, Samoa, Solomon Islands, and Tonga. At the country level, Japan achieves the lowest RMSE at 0.11, indicating that the model’s predictions deviate by approximately 0.11 percentage points from the actual inflation values. The countries with the highest accuracy are Lao, Indonesia, and Samoa, at 0.95, 0.88, and 0.84, respectively. Inflation volatility and rapid alternation between deflation and inflation reduce prediction accuracy.
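
For reference, the two evaluation metrics can be computed as in the sketch below. It assumes that accuracy means directional accuracy, that is, the share of periods in which the predicted and actual inflation have the same sign; the function names and sample values are hypothetical.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def directional_accuracy(actual, predicted):
    """Share of periods where prediction and outcome agree on inflation (> 0) or not."""
    hits = sum(1 for a, p in zip(actual, predicted) if (a > 0) == (p > 0))
    return hits / len(actual)


# Illustrative values only
actual_example = [0.5, -0.2, 0.3, -0.1]
predicted_example = [0.4, 0.1, 0.2, -0.3]
rmse_example = rmse(actual_example, predicted_example)
accuracy_example = directional_accuracy(actual_example, predicted_example)
```

In this toy example the second prediction has the wrong sign, so three of four directions are correct even though the magnitude errors are small.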
75110

76-
<div class="flourish-embed flourish-chart" data-src="visualisation/22209056?2274258"><script src="https://public.flourish.studio/resources/embed.js"></script><noscript><img src="https://public.flourish.studio/visualisation/22209056/thumbnail" width="100%" alt="chart visualization" /></noscript></div>
111+
<div>
112+
<iframe src="../interactive/text/train_predictions_pic.html"
113+
frameborder="0" marginwidth="0" marginheight="0" width="800" height="433"></iframe>
114+
</div>
77115

78116
### Pooled Model
79117

80-
The pooled model using MA3 performs better than any of the country-level models with an accuracy of approximately 70 percent of the time and deviation around 0.70 percentage points from the actual inflation. This means that, based on historical data and the constructed EPU indexes, the models correctly predicted inflationary or deflationary trends more than two-thirds of the time.
118+
The pooled model using MA3 achieves an accuracy of approximately 83.1 percent, with a deviation of around 0.83 percentage points from actual inflation. This means that, based on historical data and the constructed EPU indexes, the model correctly predicted inflationary or deflationary trends more than four times out of five.
81119

82-
<div class="flourish-embed flourish-chart" data-src="visualisation/22209247?2274258"><script src="https://public.flourish.studio/resources/embed.js"></script><noscript><img src="https://public.flourish.studio/visualisation/22209247/thumbnail" width="100%" alt="chart visualization" /></noscript></div>
120+
For out-of-sample validation of the pooled model, we use a set of three countries: the Philippines, South Korea, and Vietnam. The Philippines achieves an RMSE of 0.14 and an accuracy of 92.91%, South Korea an RMSE of 0.15 and an accuracy of 84.25%, and Vietnam an RMSE of 0.17 and an accuracy of 88.43%.
121+
122+
<div>
123+
<iframe src="../interactive/text/out_of_bag_predictions_pic.html"
124+
frameborder="0" marginwidth="0" marginheight="0" width="800" height="433"></iframe>
125+
</div>
126+
127+
## Future Work
128+
129+
Future work will involve the development of a methodology that can interpolate quarterly CPI data to monthly values, bring lagged CPI data to the same time frequency as the EPU index, and generate inflation predictions on countries with no inflation data.
130+
131+
**Table 3: IMF CPI Data Availability by Country**
132+
133+
134+
| Country Name | ISO3 | Frequency | Last Reported |
135+
|:-----------------|:-------|:------------|:----------------|
136+
| American Samoa | ASM | No Data | No Data |
137+
| Guam | GUM | No Data | No Data |
138+
| Marshall Islands | MHL | No Data | No Data |
139+
| New Zealand | NZL | Quarterly | 2025-Q3 |
140+
| Palau | PLW | Quarterly | 2025-Q2 |
141+
| Papua New Guinea | PNG | Quarterly | 2025-Q2 |
142+
| Thailand | THA | Monthly | 2025-M03 |
143+
| Tonga | TON | Monthly | 2025-M01 |
144+
| Tuvalu | TUV | Quarterly | 2012-Q2 |
145+
| Vanuatu | VUT | Quarterly | 2023-Q4 |
146+
| Vietnam | VNM | Monthly | 2025-M03 |
