Statistics
While there are many statistical functions, the summary command is a simple way to compute descriptive statistics for a list of series. Here is an example that computes only the most basic statistics using the --simple option:
open abdata.gdt --quiet
list Y = IND YEAR n
summary Y --simple
# Store the results as a matrix
summary Y --simple
matrix stats = $result
print stats
The output is:
Mean Median S.D. Min Max
IND 5.123 5.000 2.678 1.000 9.000
YEAR 1980 1980 2.583 1976 1984
n 1.056 0.8272 1.342 -2.263 4.687
stats (3 x 5)
Mean Median S.D. Min Max
IND 5.1232 5.0000 2.6781 1.0000 9.0000
YEAR 1980.0 1980.0 2.5830 1976.0 1984.0
n 1.0560 0.82724 1.3415 -2.2634 4.6873
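Once the statistics are stored as a matrix, single entries can be picked by row and column index. The following is a minimal sketch on simulated data; the series a and b, the seed and the sample size are assumptions made only for illustration. It assumes the column order shown above (Mean, Median, S.D., Min, Max) and cross-checks two entries against the mean() and sd() functions:

```hansl
set verbose off
set seed 1234                 # assumption: fixed seed, only for replicability
nulldata 200                  # assumption: artificial cross-section
series a = normal(5, 2)       # hypothetical series
series b = uniform(0, 1)      # hypothetical series
list L = a b

summary L --simple
matrix stats = $result        # rows: series in L; columns: Mean, Median, S.D., Min, Max

# Pick single entries by row and column index
scalar mean_a = stats[1,1]
scalar sd_b = stats[2,3]
printf "mean(a) = %g (via mean(): %g)\n", mean_a, mean(a)
printf "sd(b)   = %g (via sd(): %g)\n", sd_b, sd(b)
```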
By means of the --by=Series option, you can also compute statistics for each category of some other variable. The following example prints basic statistics for series n and w for each value of series IND (industry ID):
set verbose off
open abdata.gdt --quiet
list Y = n w
summary Y --by=IND --simple
The output for the first three industries is:
IND = 1 (n = 122):
Mean Median S.D. Min Max
n 1.234 1.095 1.172 -0.5942 4.099
w 3.186 3.183 0.1511 2.757 3.581
IND = 2 (n = 88):
Mean Median S.D. Min Max
n 1.039 0.9792 1.387 -2.104 3.223
w 3.410 3.409 0.1363 2.870 3.812
IND = 3 (n = 89):
Mean Median S.D. Min Max
n 0.7006 0.4324 1.199 -1.726 3.030
w 3.287 3.331 0.1640 2.910 3.614
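If the group-wise statistics are needed as numbers rather than as printed output, one possible approach is to loop over the distinct values of the grouping series and restrict the sample in each pass. This is only a sketch on a tiny artificial dataset; the series g and v and the matrix out are hypothetical names:

```hansl
set verbose off
nulldata 6                    # assumption: tiny artificial dataset
series g = index > 3          # hypothetical grouping series: 0,0,0,1,1,1
series v = index^2            # hypothetical data series: 1,4,9,16,25,36

matrix vals = values(g)       # sorted distinct values of g
matrix out = zeros(rows(vals), 2)
loop i = 1..rows(vals)
    # keep only the observations belonging to the current group
    smpl g == vals[i] --restrict --replace
    out[i,1] = vals[i]
    out[i,2] = mean(v)
endloop
smpl full                     # restore the full sample

# first column: value of g, second column: group mean of v
print out
```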
The aggregate() function is powerful: it lets you aggregate data by groups (similar to pivot tables) by means of some aggregation function. Here is a simple example of how to compute the mean values of series n and w for each unique combination of the discrete series IND and YEAR (only the initial rows are shown):
open abdata.gdt --quiet
list Y = n w
list groupby = IND YEAR
matrix mean_values = aggregate(Y, groupby, "mean")
printf "\n%12.2f\n", mean_values
The output is:
IND YEAR count n w
1.00 1976.00 8.00 0.89 3.12
1.00 1977.00 16.00 1.34 3.11
1.00 1978.00 17.00 1.37 3.09
.
.
.
2.00 1976.00 8.00 1.51 3.58
2.00 1977.00 12.00 1.14 3.50
2.00 1978.00 12.00 1.13 3.44
You can also pass your own custom aggregation function to aggregate(). The function must return a scalar value. Here is an example for the interquartile range:
set verbose off
function scalar iqr (const series y)
/* Compute the interquartile range. */
scalar result = quantile(y, 0.75) - quantile(y, 0.25)
return result
end function
open mroz87.gdt --quiet
matrix result = aggregate(FAMINC, CIT, iqr)
printf "%12.2f\n", result
which returns the following table:
byvar count f(x)
0.00 269.00 9591.00
1.00 484.00 13349.25
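The same pattern can be tried without any data file. The sketch below uses a tiny artificial dataset and a hypothetical user function spread() (maximum minus minimum), passed to aggregate() in the same way as iqr above; the names g, v, r and spread are assumptions for illustration only:

```hansl
set verbose off

function scalar spread (const series y)
    /* Range of a series: maximum minus minimum. */
    return max(y) - min(y)
end function

nulldata 6                    # assumption: tiny artificial dataset
series g = index > 3          # hypothetical grouping series: 0,0,0,1,1,1
series v = index^2            # hypothetical data series: 1,4,9,16,25,36

# one row per value of g; columns: byvar, count, f(x)
matrix r = aggregate(v, g, spread)
print r
```

Because the groups are known by construction (1, 4, 9 and 16, 25, 36), the ranges can be checked by hand.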
The following example shows how to run a simple OLS regression and how to store post-estimation information.
set verbose off
open abdata.gdt --quiet
ols ys const n w #OPTIONAL: --robust
matrix coeff = $coeff # point estimates
matrix stderr = $stderr # std. error
series uhat = $uhat # residuals
series yhat = $yhat # fitted values
# Print values
print coeff ~ stderr
print ys yhat uhat --byobs --range=:5
The output is:
Model 1: Pooled OLS, using 1031 observations
Included 140 cross-sectional units
Time-series length: minimum 7, maximum 9
Dependent variable: ys
coefficient std. error t-ratio p-value
--------------------------------------------------------
const 4.60388 0.0351033 131.2 0.0000 ***
n 0.00626942 0.00217539 2.882 0.0040 ***
w 0.00875566 0.0110959 0.7891 0.4302
Mean dependent var 4.638015 S.D. dependent var 0.093961
Sum squared resid 9.015800 S.E. of regression 0.093650
R-squared 0.008551 Adjusted R-squared 0.006622
F(2, 1028) 4.433125 P-value(F) 0.012105
Log-likelihood 980.1866 Akaike criterion −1954.373
Schwarz criterion −1939.558 Hannan-Quinn −1948.751
rho 0.802880 Durbin-Watson 0.305346
4.6039 0.035103
0.0062694 0.0021754
0.0087557 0.011096
ys yhat uhat
1:1
1:2 4.561294 4.636576 -0.07528
1:3 4.578384 4.636651 -0.05827
1:4 4.601245 4.636334 -0.03509
1:5 4.610656 4.636581 -0.02592
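The stored objects can be combined further. As a rough sketch on simulated data (the data-generating process, the seed and all series names are assumptions), the t-ratios can be rebuilt from the stored coefficients and standard errors, and the identity y = yhat + uhat can be checked numerically:

```hansl
set verbose off
set seed 1234                     # assumption: fixed seed, only for replicability
nulldata 100                      # assumption: artificial cross-section
series x = normal()
series y = 1 + 0.5*x + normal()   # hypothetical data-generating process

ols y const x --quiet
matrix b = $coeff                 # point estimates
matrix se = $stderr               # standard errors
series uhat = $uhat               # residuals
series yhat = $yhat               # fitted values

# t-ratios: element-wise division of estimates by standard errors
matrix tratio = b ./ se
print tratio

# fitted values plus residuals should reproduce the dependent variable
series gap = abs(y - yhat - uhat)
printf "largest |y - yhat - uhat| = %g\n", max(gap)
```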
The modtest command provides various specification tests which can be conducted after a model has been estimated. Another command is reset, which runs Ramsey's RESET test. Here are some examples:
open abdata.gdt --quiet
ols ys const n w
modtest --normality --quiet
modtest --white --quiet
reset --squares-only --quiet
Gretl allows you to test hypotheses in a simple manner.
First, you can call the omit command for testing zero restrictions on coefficients. Here is a simple example for testing the removal of two variables by means of an F-test:
list X = n w
ols ys const X
# Test the restriction but do not re-estimate the model
omit X --test-only
# Test the restriction and re-estimate the model
omit X
The output is:
Test on Model 3:
Null hypothesis: the regression parameters are zero for the variables
n, w
Test statistic: F(2, 1028) = 4.43312, p-value 0.0121052
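For a model estimated by OLS with classical standard errors, this F statistic can also be computed by hand from the sums of squared residuals of the unrestricted and the restricted model, F = ((SSR_r - SSR_u) / q) / (SSR_u / df_u), where q is the number of restrictions. The following sketch uses simulated data; the data-generating process and all names are assumptions, and it is assumed that $test holds the F statistic after omit:

```hansl
set verbose off
set seed 1234                       # assumption: fixed seed, only for replicability
nulldata 200                        # assumption: artificial cross-section
series x1 = normal()
series x2 = normal()
series y = 1 + 0.3*x1 + normal()    # hypothetical DGP in which x2 is irrelevant

list X = x1 x2
ols y const X --quiet               # unrestricted model
scalar ssr_u = $ess
scalar df_u = $df
omit X --test-only --quiet
scalar F_omit = $test               # assumed: statistic stored by omit

ols y const --quiet                 # restricted model
scalar ssr_r = $ess

scalar q = nelem(X)                 # number of zero restrictions
scalar F_hand = ((ssr_r - ssr_u) / q) / (ssr_u / df_u)
scalar pv = pvalue(F, q, df_u, F_hand)
printf "F by hand = %g, via omit = %g, p-value = %g\n", F_hand, F_omit, pv
```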
The restrict block command provides a powerful apparatus for testing a set of (non-)linear restrictions. Here is an example using the --quiet option to suppress detailed output. You may also try the --bootstrap option:
restrict --quiet # --bootstrap
b[w] = 0.005
b[n] - b[w] = 0
end restrict
This returns:
Restriction set
1: b[w] = 0.005
2: b[n] - b[w] = 0
Test statistic: F(2, 1028) = 0.224807, with p-value = 0.798709
This example illustrates how to run non-parametric tests for differences between two variables. The example uses simulated series.
##################
## Non-parametric difference tests
##################
set seed 1234 # only to ensure replicability
nulldata 100 # cross-sectional dataset
# Create some random variables
series y = normal(0, 2) # expected value 0
series x = normal(10, 2) # expected value 10
list L = y x # define a list of series which can be handy
# Stats and plot
summary L --simple
boxplot L --output=display
freq y --normal --plot=display
# Non-parametric difference tests
help difftest # see the help for information
difftest y x --sign # Sign test -- less powerful
printf "\nP-value of the Sign-test = %.2f (test-stat = %g)\n", $pvalue, $test
difftest y x --rank-sum # Wilcoxon rank-sum test (aka Mann-Whitney U test)
printf "\nP-value of the Wilcoxon rank-sum test = %.2f (test-stat = %g)\n", $pvalue, $test
difftest y x --signed-rank # Wilcoxon signed-rank test
This example loads the well-known cross-sectional MROZ dataset. By means of an OLS regression employing a level dummy, we want to test whether men earn higher wages in large cities than in small cities.
open mroz87.gdt
boxplot HW CIT --factorized --output=display \
{ set title "Wages of men in small and large cities" font ',15'; }
# Regression for explaining Husband's wage by CIT (0: lives in small city, 1: lives in large city)
ols HW const CIT --robust # standard errors robust to possible heteroskedasticity
scalar pv = 2 * pvalue(t, $df, abs($coeff(CIT) / $stderr(CIT))) # two-sided p-value for CIT
printf "\nThe null hypothesis that wages in large cities are equal \n\
to wages in small cities can be rejected at the %.2f pct. \n\
significance level\n", 100 * pv
# Run a restriction by hand
help restrict
# Test the null that hourly wages are on average $3 higher in large cities
restrict --bootstrap
b[CIT] = 3
end restrict