Statistics

atecon edited this page Jan 20, 2024 · 10 revisions

Descriptive statistics

Basic example

Gretl offers many statistical functions, but the summary command is a simple way to compute descriptive statistics for a list of series. The following example uses the --simple option to print only the most basic statistics, and then stores them as a matrix:

open abdata.gdt --quiet
list Y = IND YEAR n 
summary Y --simple

# Store the results as a matrix
summary Y --simple
matrix stats = $result
print stats

The output is:

                 Mean     Median       S.D.        Min        Max
IND             5.123      5.000      2.678      1.000      9.000
YEAR             1980       1980      2.583       1976       1984
n               1.056     0.8272      1.342     -2.263      4.687

stats (3 x 5)

             Mean       Median         S.D.          Min          Max 
 IND       5.1232       5.0000       2.6781       1.0000       9.0000 
YEAR       1980.0       1980.0       2.5830       1976.0       1984.0 
   n       1.0560      0.82724       1.3415      -2.2634       4.6873 
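
As a complement to summary, the individual statistics can also be computed one at a time. The following is a minimal sketch that assumes the built-in functions mean(), median() and sd() and the abdata.gdt sample file used above; the scalar names m, med and s are only illustrative. The values for n should agree with the corresponding row of the stats matrix.

```hansl
open abdata.gdt --quiet

# Individual statistics for the series n via dedicated functions
scalar m = mean(n)
scalar med = median(n)
scalar s = sd(n)
printf "mean = %.4f, median = %.5f, sd = %.4f\n", m, med, s

# The same mean taken from the matrix returned by summary
list Y = IND YEAR n
summary Y --simple
matrix stats = $result
# n is the third series in Y, the mean is the first column
printf "mean of n taken from the stats matrix = %.4f\n", stats[3,1]
```

Using the matrix is convenient when many series are involved, while the single functions are handy inside expressions.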

Grouped statistics

By means of the --by=Series option, you can also compute statistics for each category of some other variable. The following example prints basic statistics for series n and w for each value of series IND (industry ID):

set verbose off
open abdata.gdt --quiet
list Y = n w
summary Y --by=IND --simple

The output for the first three industries is:

IND = 1 (n = 122):

                 Mean     Median       S.D.        Min        Max
n               1.234      1.095      1.172    -0.5942      4.099
w               3.186      3.183     0.1511      2.757      3.581

IND = 2 (n = 88):

                 Mean     Median       S.D.        Min        Max
n               1.039     0.9792      1.387     -2.104      3.223
w               3.410      3.409     0.1363      2.870      3.812

IND = 3 (n = 89):

                 Mean     Median       S.D.        Min        Max
n              0.7006     0.4324      1.199     -1.726      3.030
w               3.287      3.331     0.1640      2.910      3.614
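
To see what the --by option does, the same grouped statistics can be reproduced by restricting the sample by hand. This is only a sketch under the assumption that looping over the first three values of IND is sufficient for illustration; the matrix name ids is hypothetical.

```hansl
open abdata.gdt --quiet

matrix ids = values(IND)       # sorted distinct values of IND
loop i = 1..3
    # keep only the observations belonging to the i-th industry
    smpl IND == ids[i] --restrict --replace
    printf "IND = %d: mean(n) = %.4g, sd(n) = %.4g\n", ids[i], mean(n), sd(n)
endloop
smpl full                      # restore the full sample
```

The --by option is shorter and should be preferred in practice; the loop is mainly useful when additional computations are needed per group.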

Aggregation

The aggregate() function aggregates data by group, similar to a pivot table, using an aggregation function of your choice. Here is a simple example of how to compute the mean values of series n and w for each unique combination of the discrete series IND and YEAR (only the initial rows are shown):

open abdata.gdt --quiet
list Y = n w
list groupby = IND YEAR

matrix mean_values = aggregate(Y, groupby, "mean")
printf "\n%12.2f\n", mean_values

The output is:

         IND        YEAR       count           n           w
        1.00     1976.00        8.00        0.89        3.12
        1.00     1977.00       16.00        1.34        3.11
        1.00     1978.00       17.00        1.37        3.09
          .
          .
          .
        2.00     1976.00        8.00        1.51        3.58
        2.00     1977.00       12.00        1.14        3.50
        2.00     1978.00       12.00        1.13        3.44
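
Other uses of aggregate() follow the same pattern. The sketch below assumes, in line with the gretl function reference, that passing null as the first argument returns only the group counts, and that "sd" is accepted as the name of the aggregation function in the same way as "mean" above. The matrix names counts and sd_n are illustrative.

```hansl
open abdata.gdt --quiet

# Counts only: first argument null, no aggregation function
matrix counts = aggregate(null, IND)
print counts

# Standard deviation of n per industry; columns: IND, count, sd of n
matrix sd_n = aggregate(n, IND, "sd")
printf "%12.4f\n", sd_n
```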

OLS regression

Estimation

The following example shows how to run a simple OLS regression and how to store post-estimation information.

open abdata.gdt --quiet

ols ys const n w   # optionally add --robust

matrix coeff = $coeff  # point estimates
matrix stderr = $stderr  # std. error
series uhat = $uhat  # residuals
series yhat = $yhat  # fitted values
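
The stored results can be combined into further statistics. The following sketch builds a small coefficient table with t-ratios and two-sided p-values; it assumes the accessors $df, $rsq and $T and the functions pvalue() and cnameset(), and the names b, se, tratio, pv and tab are only illustrative.

```hansl
open abdata.gdt --quiet
ols ys const n w --quiet

matrix b = $coeff                  # point estimates
matrix se = $stderr                # standard errors
matrix tratio = b ./ se            # element-wise division
# two-sided p-values based on the t distribution with $df degrees of freedom
matrix pv = 2 * pvalue(t, $df, abs(tratio))

matrix tab = b ~ se ~ tratio ~ pv  # concatenate columns
cnameset(tab, "coeff stderr tratio pvalue")
print tab
printf "R-squared = %.4f (T = %d)\n", $rsq, $T
```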

Specification tests

The modtest command provides various specification tests which can be run after a model has been estimated. In addition, the reset command runs Ramsey's RESET test. Here are examples:

open abdata.gdt --quiet
ols ys const n w

modtest --normality --quiet
modtest --white --quiet
reset --squares-only --quiet
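
Each test leaves its statistic and p-value in the $test and $pvalue accessors, which are overwritten by the next test. A minimal sketch of how to keep them for later use (the scalar names are hypothetical):

```hansl
open abdata.gdt --quiet
ols ys const n w --quiet

modtest --white --quiet
scalar white_stat = $test      # store before the next test overwrites it
scalar white_pv = $pvalue

reset --squares-only --quiet
scalar reset_stat = $test
scalar reset_pv = $pvalue

printf "White test: stat = %.4f, p-value = %.4f\n", white_stat, white_pv
printf "RESET (squares): stat = %.4f, p-value = %.4f\n", reset_stat, reset_pv
```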

Hypothesis testing

Gretl allows you to test hypotheses in a simple manner.

omit variables

First, you can call the omit command to test zero restrictions on coefficients. Here is a simple example that tests the removal of two variables by means of an F-test:

list X = n w
ols ys const X

# Test the restriction but do not re-estimate the model
omit X --test-only

# Test the restriction and re-estimate the model
omit X

The output is:

Test on Model 3:

  Null hypothesis: the regression parameters are zero for the variables
    n, w
  Test statistic: F(2, 1028) = 4.43312, p-value 0.0121052

Set of linear restrictions

The restrict block provides a powerful apparatus for testing sets of (non-)linear restrictions. Here is an example using the --quiet option to suppress detailed output. You may also try the --bootstrap option:

restrict --quiet # --bootstrap
    b[w] = 0.005
    b[n] - b[w] = 0
end restrict

This returns:

Restriction set
 1: b[w] = 0.005
 2: b[n] - b[w] = 0

Test statistic: F(2, 1028) = 0.224807, with p-value = 0.798709

Non-parametric test for differences between variables

This example illustrates how to run non-parametric tests for differences between variables. The example uses simulated series.

##################
## Non-parametric difference tests
##################
set seed 1234 	# only to ensure replicability
nulldata 100 	# cross-sectional dataset

# Create some random variables
series y = normal(0, 2)  # expected value 0
series x = normal(10, 2)  # expected value 10
list L = y x			# define a list of series which can be handy

# Stats and plot
summary L --simple
boxplot L --output=display
freq y --normal --plot=display

# Non-parametric difference tests
help difftest			# see the help for information

difftest y x --sign   # Sign test -- less powerful
printf "\nP-value of the Sign-test = %.2f (test-stat = %g)\n", $pvalue, $test

difftest y x --rank-sum   # Wilcoxon rank-sum test (aka Mann-Whitney U test)
printf "\nP-value of the Wilcoxon rank-sum test = %.2f (test-stat = %g)\n", $pvalue, $test

difftest y x --signed-rank   # Wilcoxon signed-rank test

Parametric regression-based test for differences between categories

This example loads the well-known cross-sectional MROZ dataset. By means of an OLS regression employing a level dummy, we want to test whether men earn higher wages in large cities than in small cities.

open mroz87.gdt

boxplot HW CIT --factorized --output=display \
  { set title "Wages of men in small and large cities" font ',15'; }

# Regression explaining the husband's wage by CIT (0: lives in small city, 1: lives in large city)
ols HW const CIT --robust   # standard errors robust to possible heteroskedasticity
# Two-sided p-value for the CIT coefficient (the second regressor)
scalar pv = 2 * pvalue(t, $df, abs($coeff[2] / $stderr[2]))
printf "\nThe null hypothesis that wages in large cities are equal  \n\
  to wages in small cities can be rejected at the %.2f pct. \n\
  significance level\n", 100 * pv

# Run a restriction by hand
help restrict

# Test the null that hourly wages are on average $3 higher in large cities
restrict --bootstrap
    b[CIT] = 3
end restrict
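
For a single linear restriction, the test can also be computed by hand from the stored estimates. The sketch below assumes the usual t-type statistic (estimate minus hypothesized value, divided by the robust standard error) and a t distribution with $df degrees of freedom; it does not replicate the bootstrap variant, so the p-value may differ somewhat from the one reported by restrict --bootstrap. The scalar names tstat and pv are illustrative.

```hansl
open mroz87.gdt --quiet
ols HW const CIT --robust --quiet

# t-type statistic for H0: b[CIT] = 3; CIT is the second regressor
scalar tstat = ($coeff[2] - 3) / $stderr[2]
scalar pv = 2 * pvalue(t, $df, abs(tstat))
printf "t = %.4f, two-sided p-value = %.4f\n", tstat, pv
```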