
Commit ebdd372

Author: Qian, Hai

Logistic regression: Added the help message

Also updated the help messages for linear regression and the regression estimators to add a description of the summary tables.

1 parent: ef3b7d7

File tree: 9 files changed, +191 −20 lines

src/ports/postgres/modules/regress/clustered_variance.py_in

Lines changed: 9 additions & 0 deletions

@@ -288,6 +288,9 @@ def clustered_variance_linregr_help(schema_madlib, msg=None, **kwargs):
     std_err DOUBLE PRECISION[], -- Clustered standard errors for coef
     t_stats DOUBLE PRECISION[], -- t-stats of the errors
     p_values DOUBLE PRECISION[] -- p-values of the errors
+
+The output summary table is the same as that of linregr_train(); see also:
+    SELECT linregr_train('usage');
 """.format(schema_madlib=schema_madlib)

 # ========================================================================

@@ -451,6 +454,9 @@ def clustered_variance_logregr_help(schema_madlib, msg=None, **kwargs):
     std_err DOUBLE PRECISION[], -- Clustered standard errors for coef
     z_stats DOUBLE PRECISION[], -- z-stats of the errors
     p_values DOUBLE PRECISION[] -- p-values of the errors
+
+The output summary table is the same as that of logregr_train(); see also:
+    SELECT logregr_train('usage');
 """.format(schema_madlib=schema_madlib)

@@ -720,4 +726,7 @@ def clustered_variance_mlogregr_help(schema_madlib, msg=None, **kwargs):
     std_err DOUBLE PRECISION[], -- Clustered standard errors for coef
     z_stats DOUBLE PRECISION[], -- z-stats of the errors
     p_values DOUBLE PRECISION[] -- p-values of the errors
+
+The output summary table is the same as that of mlogregr_train(); see also:
+    SELECT mlogregr_train('usage');
 """.format(schema_madlib=schema_madlib)
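The help functions touched in this file all follow the same dispatch pattern: a message keyword selects a block of text, and a `{schema_madlib}` placeholder is substituted with `str.format` just before returning. A minimal standalone sketch of that pattern (the function name `example_help` and its text are illustrative, not MADlib's actual code):

```python
def example_help(schema_madlib, msg=None, **kwargs):
    """Illustrative help dispatcher in the style of the MADlib regress modules.

    msg selects which help text to return; None gives the summary.
    """
    if msg is None:
        help_string = """Example method: fits an example model.
For usage details:
    SELECT {schema_madlib}.example_train('usage')"""
    elif msg in ('usage', 'help', '?'):
        help_string = """SELECT {schema_madlib}.example_train(source_table, out_table);
The output summary table is the same as that of linregr_train(); see also:
    SELECT linregr_train('usage');"""
    else:
        help_string = "No such option. Use {schema_madlib}.example_train()"
    # The schema placeholder is substituted once, at the very end.
    return help_string.format(schema_madlib=schema_madlib)
```

Calling `example_help('madlib', 'usage')` yields the usage text with `madlib.` substituted for the placeholder, mirroring how the real help functions embed the installation schema.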

src/ports/postgres/modules/regress/linear.py_in

Lines changed: 28 additions & 18 deletions

@@ -208,8 +208,8 @@ def linregr_help_message(schema_madlib, message, **kwargs):
 Ordinary Least Squares Regression, also called Linear Regression, is a
 statistical model used to fit linear models.

-It models a linear relationship of a scalar dependent variable \f$ y \f$ to one
-or more explanatory independent variables \f$ x \f$ to build a
+It models a linear relationship of a scalar dependent variable y to one
+or more explanatory independent variables x to build a
 model of coefficients.

 For more details on function usage:

@@ -224,29 +224,39 @@ def linregr_help_message(schema_madlib, message, **kwargs):
 USAGE
 -----------------------------------------------------------------------
 SELECT {schema_madlib}.linregr_train(
-    source_table,              -- name of input table
+    source_table,               -- name of input table
     out_table,                  -- name of output table
     dependent_varname,          -- name of dependent variable
-    independent_varname,        -- name of independent variable
+    independent_varname,        -- name of independent variables
     grouping_cols,              -- names of columns to group-by
-    heteroskedasticity_option,  -- perform heteroskedasticity test?
+    heteroskedasticity_option   -- perform heteroskedasticity test?
 );

 -----------------------------------------------------------------------
-OUTUPT
+OUTPUT
 -----------------------------------------------------------------------
-The output table ('out_table' above) has the following columns
-    <...>,                         -- Grouping columns used during training
-    'coef' DOUBLE PRECISION[],     -- Vector of coefficients
-    'r2' DOUBLE PRECISION,         -- R-squared coefficient
-    'std_err' DOUBLE PRECISION[],  -- Standard errors of coefficients
-    't_stats' DOUBLE PRECISION[],  -- t-stats of the coefficients
-    'p_values' DOUBLE PRECISION[], -- p-values of the coefficients
-    'condition_no' INTEGER,        -- The condition number of the covariance matrix.
-    'bp_stats' DOUBLE PRECISION,   -- The Breush-Pagan statistic of heteroskedacity.
-                                      (if heteroskedasticity_option=TRUE)
-    'bp_p_value' DOUBLE PRECISION  -- The Breush-Pagan calculated p-value.
-                                      (if heteroskedasticity_option=TRUE)
+The output table ('out_table' above) has the following columns:
+    <...>,                         -- Grouping columns used during training
+    'coef' DOUBLE PRECISION[],     -- Vector of coefficients
+    'r2' DOUBLE PRECISION,         -- R-squared coefficient
+    'std_err' DOUBLE PRECISION[],  -- Standard errors of coefficients
+    't_stats' DOUBLE PRECISION[],  -- t-stats of the coefficients
+    'p_values' DOUBLE PRECISION[], -- p-values of the coefficients
+    'condition_no' INTEGER,        -- The condition number of the covariance matrix.
+    'bp_stats' DOUBLE PRECISION,   -- The Breusch-Pagan heteroskedasticity statistic
+                                      (if heteroskedasticity_option=TRUE)
+    'bp_p_value' DOUBLE PRECISION, -- The Breusch-Pagan p-value
+                                      (if heteroskedasticity_option=TRUE)
+    'num_rows_processed' INTEGER,  -- Number of rows actually used in each group
+    'num_missing_rows_skipped' INTEGER -- Number of rows with NULLs skipped in each group
+
+A summary table is also created at the same time, which has:
+    'source_table' VARCHAR,            -- the data source table name
+    'out_table' VARCHAR,               -- the output table name
+    'dependent_varname' VARCHAR,       -- the dependent variable
+    'independent_varname' VARCHAR,     -- the independent variable
+    'num_rows_processed' INTEGER,      -- total number of rows used
+    'num_missing_rows_skipped' INTEGER -- total number of rows skipped because of NULL values
 """
 elif message in ['example', 'examples']:
     help_string = """

src/ports/postgres/modules/regress/logistic.py_in

Lines changed: 118 additions & 1 deletion

@@ -333,7 +333,7 @@ def __logregr_train_compute(schema_madlib, tbl_source, tbl_output, dep_col,
     if num_rows['num_rows_processed'] is None:
         num_rows['num_rows_processed'] = "NULL"
         num_rows['num_missing_rows_skipped'] = "NULL"
-
+
     args.update(num_rows)

     plpy.execute(

@@ -364,3 +364,120 @@ def __logregr_train_compute(schema_madlib, tbl_source, tbl_output, dep_col,

     plpy.execute("set client_min_messages to " + old_msg_level)
     return None
+
+# --------------------------------------------------------------------
+
+def logregr_help_msg(schema_madlib, message, **kwargs):
+    """ Help message for logistic regression
+
+    @param message A string, the help message indicator
+
+    Returns:
+      A string containing the help message
+    """
+    if message is None:
+        help_string = """
+----------------------------------------------------------------
+SUMMARY
+----------------------------------------------------------------
+Binomial logistic regression models the relationship between a
+dichotomous dependent variable and one or more predictor variables.
+
+The dependent variable may be a Boolean value or a categorical variable
+that can be represented with a Boolean expression.
+
+For more details on function usage:
+    SELECT {schema_madlib}.logregr_train('usage')
+
+For a small example on using the function:
+    SELECT {schema_madlib}.logregr_train('example')
+"""
+    elif message in ['usage', 'help', '?']:
+        help_string = """
+------------------------------------------------------------------
+USAGE
+------------------------------------------------------------------
+SELECT {schema_madlib}.logregr_train(
+    source_table,        -- name of input table
+    out_table,           -- name of output table
+    dependent_varname,   -- name of dependent variable
+    independent_varname, -- names of independent variables
+    grouping_cols,       -- optional, default NULL, names of columns to group-by
+    max_iter,            -- optional, default 20, maximum number of iterations
+    optimizer,           -- optional, default 'irls', name of optimization method
+    tolerance,           -- optional, default 0.0001, the stopping threshold
+    verbose              -- optional, default FALSE, whether to print useful info
+);
+
+------------------------------------------------------------------
+OUTPUT
+------------------------------------------------------------------
+The output table ('out_table' above) has the following columns:
+    <...>,                              -- Grouping column values used during training
+    'coef' double precision[],          -- vector of fitted coefficients
+    'log_likelihood' double precision,  -- log likelihood
+    'std_err' double precision[],       -- standard errors of the coefficients
+    'z_stats' double precision[],       -- z-statistics of the coefficients
+    'p_values' double precision[],      -- p-values of the coefficients
+    'odds_ratios' double precision[],   -- odds ratios, exp(coefficients)
+    'condition_no' double precision,    -- the condition number
+    'num_rows_processed' integer,       -- number of rows actually used per group
+    'num_missing_rows_skipped' integer, -- number of rows with NULLs skipped per group
+    'num_iterations' double precision   -- number of iterations used per group
+
+A summary table is also created at the same time, which has:
+    'source_table' varchar,             -- the data source table name
+    'out_table' varchar,                -- the output table name
+    'dependent_varname' varchar,        -- the dependent variable
+    'independent_varname' varchar,      -- the independent variable
+    'optimizer_params' varchar,         -- 'optimizer=..., max_iter=..., tolerance=...'
+    'num_all_groups' integer,           -- total number of groups
+    'num_failed_groups' integer,        -- number of groups whose fitting process failed
+    'num_rows_processed' integer,       -- total number of rows used in the computation
+    'num_missing_rows_skipped' integer  -- total number of rows skipped
+"""
+    elif message in ['example', 'examples']:
+        help_string = """
+CREATE TABLE patients (id INTEGER NOT NULL,
+                       second_attack INTEGER,
+                       treatment INTEGER,
+                       trait_anxiety INTEGER);
+COPY patients FROM STDIN WITH DELIMITER '|';
+  1 | 1 | 1 | 70
+  3 | 1 | 1 | 50
+  5 | 1 | 0 | 40
+  7 | 1 | 0 | 75
+  9 | 1 | 0 | 70
+ 11 | 0 | 1 | 65
+ 13 | 0 | 1 | 45
+ 15 | 0 | 1 | 40
+ 17 | 0 | 0 | 55
+ 19 | 0 | 0 | 50
+  2 | 1 | 1 | 80
+  4 | 1 | 0 | 60
+  6 | 1 | 0 | 65
+  8 | 1 | 0 | 80
+ 10 | 1 | 0 | 60
+ 12 | 0 | 1 | 50
+ 14 | 0 | 1 | 35
+ 16 | 0 | 1 | 50
+ 18 | 0 | 0 | 45
+ 20 | 0 | 0 | 60
+\.
+
+SELECT madlib.logregr_train('patients',
+                            'patients_logregr',
+                            'second_attack',
+                            'ARRAY[1, treatment, trait_anxiety]',
+                            NULL,
+                            20,
+                            'irls');
+
+SELECT * FROM patients_logregr;
+"""
+    else:
+        help_string = "No such option. Use {schema_madlib}.logregr_train()"
+
+    return help_string.format(schema_madlib=schema_madlib)
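A side note on `logregr_help_msg` above: every branch's help text is passed through `str.format(schema_madlib=...)`, so any literal `{` or `}` inside a help string would break formatting unless doubled. A small standalone illustration (plain Python, independent of MADlib; the brace-containing string is a made-up example):

```python
# A template with only the intended placeholder formats cleanly.
template = "For more details: SELECT {schema_madlib}.logregr_train('usage')"
print(template.format(schema_madlib='madlib'))

# A string with literal braces would be misread as a format field.
bad = "SELECT '{1,2,3}'::integer[]"
try:
    bad.format(schema_madlib='madlib')
except (KeyError, IndexError, ValueError):
    # Escape literal braces by doubling them before formatting.
    safe = bad.replace('{', '{{').replace('}', '}}')
    print(safe.format(schema_madlib='madlib'))
```

This is why help strings in these modules keep to the single `{schema_madlib}` placeholder and avoid raw braces.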

src/ports/postgres/modules/regress/logistic.sql_in

Lines changed: 18 additions & 1 deletion

@@ -10,7 +10,7 @@
 *
 *//* ----------------------------------------------------------------------- */

-m4_include(`SQLCommon.m4') --'
+m4_include(`SQLCommon.m4')

 /**
 @addtogroup grp_logreg

@@ -750,6 +750,23 @@ RETURNS VOID AS $$
 SELECT MADLIB_SCHEMA.logregr_train($1, $2, $3, $4, $5, $6, $7, $8, False);
 $$ LANGUAGE sql VOLATILE;

+-----------------------------------------------------------------------
+
+-- Help messages
+
+CREATE FUNCTION MADLIB_SCHEMA.logregr_train()
+RETURNS TEXT AS $$
+BEGIN
+    RETURN MADLIB_SCHEMA.logregr_train(NULL);
+END;
+$$ LANGUAGE plpgsql VOLATILE;
+
+CREATE FUNCTION MADLIB_SCHEMA.logregr_train(
+    message TEXT
+) RETURNS TEXT AS $$
+PythonFunction(regress, logistic, logregr_help_msg)
+$$ LANGUAGE plpythonu VOLATILE;
+
 ------------------------------------------------------------------------

 /**
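The SQL above registers two overloads: a zero-argument `logregr_train()` that simply forwards to the one-argument form called with NULL, so the bare call and the explicit-NULL call return the same summary text. The same forwarding idea, sketched in Python (the function body here is an illustrative stand-in, not the generated SQL):

```python
def logregr_train_help(message=None):
    """Sketch of the SQL overload pair: calling with no argument
    behaves exactly like calling with NULL (None here)."""
    if message is None:
        return "SUMMARY: binomial logistic regression; try 'usage' or 'example'"
    if message in ('usage', 'help', '?'):
        return "USAGE: SELECT madlib.logregr_train(source_table, out_table, ...)"
    return "No such option. Use madlib.logregr_train()"

# The zero-argument form and the explicit-NULL form agree:
assert logregr_train_help() == logregr_train_help(None)
```

Keeping the zero-argument overload as a thin forwarder means the dispatch logic lives in one place, the one-argument function.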

src/ports/postgres/modules/regress/marginal.py_in

Lines changed: 6 additions & 0 deletions

@@ -274,6 +274,9 @@ The output table ('output_table' above) has the following columns
     std_err DOUBLE PRECISION[], -- Standard errors using delta method
     z_stats DOUBLE PRECISION[], -- z-stats of the standard errors
     p_values DOUBLE PRECISION[], -- p-values of the standard errors
+
+The output summary table is the same as that of logregr_train(); see also:
+    SELECT logregr_train('usage');
 """
 else:
     help_string = "No such option. Use {schema_madlib}.margins_mlogregr()"

@@ -675,6 +678,9 @@ The output table ('output_table' above) has the following columns
     std_err DOUBLE PRECISION[], -- Standard errors using delta method
     z_stats DOUBLE PRECISION[], -- z-stats of the standard errors
     p_values DOUBLE PRECISION[], -- p-values of the standard errors
+
+The output summary table is the same as that of mlogregr_train(); see also:
+    SELECT mlogregr_train('usage');
 """
 else:
     help_string = "No such option. Use {schema_madlib}.margins_mlogregr()"

src/ports/postgres/modules/regress/multilogistic.py_in

Lines changed: 3 additions & 0 deletions

@@ -584,8 +584,11 @@ The output table ('output_table' above) has the following columns

 The output summary table is named as <'output_table'>_summary has the following columns
     source_table             -- VARCHAR, Source table name
+    out_table                -- VARCHAR, Output table name
     dep_var                  -- VARCHAR, Dependent variable name
     ind_var                  -- VARCHAR, Independent variable name
+    optimizer_params         -- VARCHAR, Optimizer parameters used
+    ref_category             -- INTEGER, The value of the reference category used
     num_rows_processed       -- INTEGER, Number of rows processed during training
     num_missing_rows_skipped -- INTEGER, Number of rows skipped during training due
                                 to missing values

src/ports/postgres/modules/regress/robust_linear.py_in

Lines changed: 3 additions & 0 deletions

@@ -93,6 +93,9 @@ The output table ('output_table' above) has the following columns:
     'std_err' DOUBLE PRECISION[], -- Huber-White standard errors
     'stats' DOUBLE PRECISION[], -- T-stats of the standard errors
     'p_values' DOUBLE PRECISION[] -- p-values of the standard errors
+
+The output summary table is the same as that of linregr_train(); see also:
+    SELECT linregr_train('usage');
 """
 else:
     help_string = "No such option. Use {schema_madlib}.robust_variance_linregr()"

src/ports/postgres/modules/regress/robust_logistic.py_in

Lines changed: 3 additions & 0 deletions

@@ -78,6 +78,9 @@ The output table ('output_table' above) has the following columns:
     'std_err' DOUBLE PRECISION[], -- Huber-White standard errors
     'stats' DOUBLE PRECISION[], -- Z-stats of the standard errors
     'p_values' DOUBLE PRECISION[] -- p-values of the standard errors
+
+The output summary table is the same as that of logregr_train(); see also:
+    SELECT logregr_train('usage');
 """
 else:
     help_string = "No such option. Use {schema_madlib}.robust_variance_linregr()"

src/ports/postgres/modules/regress/robust_mlogistic.py_in

Lines changed: 3 additions & 0 deletions

@@ -329,6 +329,9 @@ The output table ('out_table' above) has the following columns
     std_err DOUBLE PRECISION[], -- Huber-White standard errors
     z_stats DOUBLE PRECISION[], -- Z-stats of the standard errors
     p_values DOUBLE PRECISION[] -- p-values of the standard errors
+
+The output summary table is the same as that of mlogregr_train(); see also:
+    SELECT mlogregr_train('usage');
 """
 else:
     help_string = "No such option. Use {schema_madlib}.robust_variance_mlogregr()"
