
Conversation

@dh-cloud
Owner

When converting a semi-join to an inner-join, a distinct agg on ctid is added
above the hash-join node. But tuples from a Values Scan may have invalid ctids,
so the assert check failed.

This patch is based on commit d8886cf.

Here are some reminders before you submit the pull request

  • Add tests for the change
  • Document changes
  • Communicate in the mailing list if needed
  • Pass make installcheck
  • Review a PR in return to support the community

Query:

SELECT * FROM tmp A WHERE (a=any(values (0),(1), (5), (99))) and (b >= '2019-09-20 00:00:00' AND b < '2019-09-21 00:00:00')
 AND (c = ANY(VALUES ('3307'),('3307'),('3307'),('3307'),('3307'),('3307'),('3307'),('3307'),('3307'),('3307'),
 ('3307'),('3307'),('3307'),('3307'),('3307'),('3307'),('3307'),('3307'),('3307'),('3307'),('3307'),('3307'),('3307'),
 ('3307'),('3307'),('3307'),('3307'),('3307'),('3307'),('3307'),('3307'),('3307'),('3307'),('3307'),('3307'),('3307'),
 ('3307'),('3307'),('3307'),('3307'),('3307'),('3307'),('3307'),('3307'),('3307'),('3307'),('3307'),('3307'),('3307'),
 ('3307'),('3307'),('3307'),('3307'),('3307'),('3307'),('3307'),('3307'),('3307'),('3307'),('3307'),('3307'),('3307'),
 ('3307'),('3307'),('3307'),('3307'),('3307'),('3307'),('3307'),('3307'),('3307'),('3307'),('3307'),('3307'),('3307'),
 ('3307'),('3307'),('3307'),('3307'),('3307'),('3307'),('3307'),('3307'),('3307'),('3307'),('3307'),('3307'),('3307'),
 ('3307'),('3307'),('3307'),('3307'),('3307'),('3307'),('3307'),('3307'),('3307'),('3307'),('3307'),('3307'),('3307'),
 ('3307'),('3307'),('3307'),('3307'),('3307'),('3307'),('3307'),('3307'),('3307'),('3307'),('3307'),('3307'),('3307'),
 ('3307'),('3307'),('3307'),('3307'),('3307'),('3307'),('3307'),('3307'),('3307'),('3307'),('3307'),('3307'),('3307'),
 ('3307'),('3307'),('3307'),('3307'),('3307'),('3307'),('3307'),('3307'),('3307'),('3307'),('3307'),('3307'),('3307'),
 ('3307'),('3307'),('3307'),('3307'),('3307'),('3307'),('3307'),('3307'),('3307'),('3307'),('3307'),('3307'),('3307'),
 ('3307'),('3307'),('3307'),('3307'),('3307'),('3307'),('3307'),('3307'),('3307')))
 

Query Plan:

 Gather Motion 2:1  (slice1; segments: 2)  (cost=126.36..126.44 rows=9 width=2224)
   ->  HashAggregate  (cost=126.36..126.44 rows=5 width=2224)
         Group Key: a.ctid, "*VALUES*".ctid
         ->  Hash Join  (cost=121.31..126.32 rows=5 width=2224)
               Hash Cond: ("*VALUES*_1".column1 = (a.c)::text)
               ->  Values Scan on "*VALUES*_1"  (cost=0.00..3.79 rows=152 width=32)
               ->  Hash  (cost=121.22..121.22 rows=4 width=2224)
                     ->  Hash Semi Join  (cost=0.10..121.22 rows=4 width=2224)
                           Hash Cond: (a.a = ("*VALUES*".column1)::numeric)
                           ->  Seq Scan on tmp a  (cost=0.00..121.00 rows=7 width=2218)
                                 Filter: ((b >= '2019-09-20 00:00:00'::timestamp without time zone) AND (b < '2019-09-21 00:00:00'::timestamp without time zone))
                           ->  Hash  (cost=0.05..0.05 rows=2 width=10)
                                 ->  Values Scan on "*VALUES*"  (cost=0.00..0.05 rows=2 width=10)
 Optimizer: Postgres query optimizer

Error Log:

ERROR:  Unexpected internal error (execTuples.c:1559)  (seg0 slice1 xx:40000 pid=145348) (execTuples.c:1559)
DETAIL:  FailedAssertion("!(((bool) (((const void*)(&(htup->t_self)) != ((void *)0)) && ((&(htup->t_self))->ip_posid != 0))))", File: "execTuples.c", Line: 1559)
HINT:  Process 145348 will wait for gp_debug_linger=120 seconds before termination.
Note that its locks and other resources will not be released until then.
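
For reference, the failed Assert is the expanded form of PostgreSQL's ItemPointerIsValid() check: a ctid is valid only when its offset number (ip_posid) is nonzero. A simplified, standalone sketch of that check (field layout abbreviated here, not the real headers):

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Simplified stand-in for ItemPointerData; the real struct packs the block
 * number differently, but the validity rule is the same. */
typedef struct ItemPointerSketch
{
    uint32_t ip_blkid;   /* block number (simplified) */
    uint16_t ip_posid;   /* offset number within the block; 0 means invalid */
} ItemPointerSketch;

static bool
item_pointer_is_valid(const ItemPointerSketch *tid)
{
    /* Mirrors the assertion above: non-NULL pointer and nonzero offset. */
    return tid != NULL && tid->ip_posid != 0;
}

int
main(void)
{
    /* A fake ctid built with offset number 0, as in the Values Scan case,
     * fails the check and trips the executor's Assert. */
    ItemPointerSketch fake_ctid = { .ip_blkid = 0, .ip_posid = 0 };

    printf("valid? %s\n", item_pointer_is_valid(&fake_ctid) ? "yes" : "no");
    return 0;
}
```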

@dh-cloud dh-cloud added the bug Something isn't working label May 12, 2020
Lisa Owen and others added 29 commits May 21, 2020 08:58
* docs - document the new plcontainer --use_local_copy option

* add more words around when to use the option

Co-authored-by: David Yozie <dyozie@pivotal.io>
In Greenplum, filespace information is maintained on the master but not
on the segments. During check mode, it's not required to pass the
tablespace information, as there is no relevant check for it. During
segment upgrade, the tablespace information is passed via a CSV file
created from the master contents.
GPHOME_LOADERS and greenplum_loaders_path.sh are still required by some users,
so export GPHOME_LOADERS as GPHOME_CLIENTS and link greenplum_loaders_path.sh
to greenplum_clients_path.sh for compatibility.

(Cherry-pick from master commit 9e9917a)
This is mainly to resolve slow responses to sequence requests under the
TCP interconnect. Sequence requests are sent through libpq connections from
QEs to the QD (we call them dispatcher connections). In the past, under the
TCP interconnect, the QD checked for events on dispatcher connections only
every 2 seconds, which is obviously inefficient.

Under UDPIFC mode, the QD also monitors the dispatcher connections while
receiving tuples from QEs, so the QD can process sequence requests in
time; this commit applies the same logic to the TCP interconnect.
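
As a generic illustration of the approach (not the actual GPDB code; the file descriptors are hypothetical placeholders), waiting on the tuple connection and the dispatcher connection in one poll() call removes the fixed 2-second polling delay:

```c
#include <poll.h>

static void
wait_for_tuples_or_dispatcher(int tuple_fd, int dispatcher_fd)
{
    struct pollfd fds[2] = {
        { .fd = tuple_fd,      .events = POLLIN },
        { .fd = dispatcher_fd, .events = POLLIN },
    };

    /* Block until either connection has data; no fixed 2-second timeout. */
    if (poll(fds, 2, -1) > 0)
    {
        if (fds[1].revents & POLLIN)
            ;   /* handle the sequence request from the QE right away */
        if (fds[0].revents & POLLIN)
            ;   /* receive tuples as usual */
    }
}
```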

Reviewed-by: Hao Wu <gfphoenix78@gmail.com>
Reviewed-by: Ning Yu <nyu@pivotal.io>
…10174)

`select c.c1, c.c2 from d1 c union all select a.c1, a.c2 from d2 a;`
Both d1 and d2 are replicated tables, but their `numsegments` values
in gp_distribution_policy are different. This could happen during
gpexpand. The bug is in the function cdbpath_create_motion_path:
both `subpath->locus` and `locus` are SegmentGeneral, but the locuses
are not equal.

Co-authored-by: Pengzhou Tang <ptang@pivotal.io>
(cherry picked from commit 4974929)
pkill is not in the /bin/ directory on Ubuntu, so gpfdist can't be killed
in the sreh test, which would make the gpfdist regression test fail.
This issue was exposed while doing an experiment to remove the
special "eval_stable_functions" handling in evaluate_function():
the qp_functions_in_* test cases would sometimes get stuck, and it
turned out to be a gp_interconnect_id disorder issue.

Under the UDPIFC interconnect, gp_interconnect_id is used to
distinguish the executions of MPP-fied plans in the same session.
On the receiver side, packets with a smaller gp_interconnect_id
are treated as 'past' packets, and the receiver tells the sender to stop
sending them.

The root cause of the hang is:
1. The QD calls InitSliceTable() to advance the gp_interconnect_id and
store it in the slice table.
2. In CdbDispatchPlan->exec_make_plan_constant(), the QD finds a
stable function that needs to be simplified to a const, so it executes
this function first.
3. The function contains SQL, so the QD initializes another slice table,
advances the gp_interconnect_id again, and dispatches and executes the
new plan.
4. After the function is simplified to a const, the QD continues to dispatch
the previous plan; however, its gp_interconnect_id is now the older one.
When a packet arrives before the receiver has set up the interconnect,
the packet is handled by handleMismatch(), treated as a 'past' packet,
and the senders are stopped early by the receiver. When the receiver later
finishes setting up the interconnect, it cannot get any packets from the
senders and gets stuck.

To resolve this, we advance the gp_interconnect_id when a plan is
actually dispatched. Plans are dispatched sequentially, so a later
dispatched plan gets a higher gp_interconnect_id.

Also limit the usage of gp_interconnect_id in the rx thread of UDPIFC;
we prefer to use sliceTable->ic_instance_id in the main thread.

Reviewed-by: Heikki Linnakangas <hlinnakangas@pivotal.io>
Reviewed-by: Asim R P <apraveen@pivotal.io>
Reviewed-by: Hubert Zhang <hzhang@pivotal.io>
Just like `man strtol` says:

>   the calling program should set errno to 0 before the call, and then determine if
>   an error occurred by checking whether errno has a nonzero value after the call.
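
A minimal standalone example of the pattern the man page describes (clear errno before the call, inspect it afterwards):

```c
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>

int
main(void)
{
    const char *str = "99999999999999999999";   /* too large for a long */
    char *end;

    errno = 0;                          /* reset before the call */
    long val = strtol(str, &end, 10);

    if (errno == ERANGE)
        printf("out of range, clamped to %ld\n", val);
    else if (end == str)
        printf("no digits were parsed\n");
    else
        printf("parsed %ld\n", val);
    return 0;
}
```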

Cherry-picked from 6e76d8a
This is a continuation of commit 456b2b3 in GPORCA, adding more errors to the list
of errors that are not logged in the log file. We are also removing the code that
writes to std::cerr, which generated a rather unreadable log message. Instead, we add
whether the error was unexpected to another log message that we already generate.

The original commit on the master branch is fba7770.
pg_resgroup_move_query is an asynchronous operation; its completion
doesn't mean the query has already moved to the destination group.
* clarifying pg_upgrade note

* gpinitsystem -I second format

* gpinitsystem edits

* edits from review
In cases where Orca generates an NLJ with a parameter on the inner side,
the executor will not pass the EXEC_FLAG_REWIND flag down, as it assumes
the inner side will always need to be rescanned. The Material node will therefore
not have its rewind flag set and can act as a no-op.

This is not always correct. While the executor will set EXEC_FLAG_REWIND
if the Materialize is directly above a motion, it does not recognize
the case where the Materialize is on the inner side with other nodes
between it and the motion, even though the Materialize serves to
prevent a rescan of the underlying Motion node.

This causes the execution to fail with:
`Illegal rescan of motion node: invalid plan (nodeMotion.c:1623)` as it
would attempt to rescan a motion.

Since Orca only produces Materialize when necessary, either for
performance reasons or to prevent rescan of an underlying Motion,
EXEC_FLAG_REWIND should be set for any Materialize generated by Orca.

Below is a valid plan generated by Orca:

```
 Result  (cost=0.00..3448.01 rows=1 width=4)
   ->  Nested Loop  (cost=0.00..3448.01 rows=1 width=1)
         Join Filter: true
         ->  Gather Motion 3:1  (slice1; segments: 3)  (cost=0.00..431.00 rows=2 width=4)
               ->  Seq Scan on foo1  (cost=0.00..431.00 rows=1 width=4)
         ->  Result  (cost=0.00..431.00 rows=1 width=1)
               Filter: (foo1.a = foo2.a)
               ->  Materialize  (cost=0.00..431.00 rows=1 width=4)
                     ->  Hash Semi Join  (cost=0.00..431.00 rows=1 width=4)
                           Hash Cond: (foo2.b = foo3.b)
                           ->  Gather Motion 3:1  (slice2; segments: 3)  (cost=0.00..0.00 rows=1 width=8)
                                 ->  Bitmap Heap Scan on foo2  (cost=0.00..0.00 rows=1 width=8)
                                       Recheck Cond: (c = 3)
                                       ->  Bitmap Index Scan on f2c  (cost=0.00..0.00 rows=0 width=0)
                                             Index Cond: (c = 3)
                           ->  Hash  (cost=431.00..431.00 rows=1 width=4)
                                 ->  Gather Motion 3:1  (slice3; segments: 3)  (cost=0.00..431.00 rows=2 width=4)
                                       ->  Seq Scan on foo3  (cost=0.00..431.00 rows=1 width=4)
 Optimizer: Pivotal Optimizer (GPORCA)
 ```

Co-authored-by: Chris Hajas <chajas@pivotal.io>
Co-authored-by: Shreedhar Hardikar <shardikar@pivotal.io>
…029)

Previously in the DPv2 transform (exhaustive2) while we penalized
cross joins for the remaining joins in greedy, we did
not for the first join, which in some cases selected a cross join.
This ended up selecting a poor join order in many cases and went against
the intent of the alternative being generated, which is to minimize
cross joins.

We also increase the cost of the default penalty from 5 to 1024, which is the value we use in the cost model during the optimization stage.

The greedy alternative also wasn't kept in the heap, so we include that now too.

(cherry picked from commit 457bb92)
For some reason, the gpmon_qexeckey_t structure used int16 for ccnt while all other GP code operates on int32. This can cause a ccnt overflow in gpperfmon packets.

This problem doesn't affect master branch as gpperfmon code has been removed from it.
But it seems to affect 6X_STABLE and 5X_STABLE branches.
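
A tiny standalone illustration of the overflow (illustrative values only): once a 16-bit counter passes INT16_MAX it wraps around on typical platforms, while an int32 keeps counting.

```c
#include <stdint.h>
#include <stdio.h>

int
main(void)
{
    int16_t ccnt16 = INT16_MAX;   /* 32767 */
    int32_t ccnt32 = INT16_MAX;

    ccnt16 = (int16_t) (ccnt16 + 1);   /* wraps to -32768 on typical platforms */
    ccnt32 = ccnt32 + 1;               /* 32768, still correct */

    printf("int16 ccnt: %d, int32 ccnt: %d\n", ccnt16, ccnt32);
    return 0;
}
```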

Authored-by: Denis Smirnov <darthunix@gmail.com>
Reviewed-by: Hao Wang <haowang@pivotal.io>
We introduced functions that run on an INITPLAN in commit a21ff2.
INITPLAN functions are designed to support "CTAS select * from udf();".
Since udf() runs on the EntryDB, and the EntryDB is always a read gang
that cannot do dispatch work, the query would fail if the function contains
a DDL statement, etc.

The idea of an INITPLAN function is to run the function on an INITPLAN, which
is in fact the QD, and store the result in a tuplestore. Later, the FunctionScan
on the EntryDB just reads tuples from the tuplestore instead of running the real function.

But the life cycle management is a little tricky. In the original commit, we
hacked the INITPLAN to close the tuplestore without deleting the file, and let
the EntryDB reader delete the file after finishing the tuple fetch. This would
leak the file if the transaction aborted before the EntryDB ran.

This commit adds a postprocess_initplans call in ExecutorEnd() of the main plan to
clean up the tuplestore created by preprocess_initplans in ExecutorStart() of
the main plan. Note that postprocess_initplans must be placed after the dispatch
work is finished, i.e. after mppExecutorFinishup().
Upstream doesn't need this function since it always uses scalar PARAMs to communicate
between the INITPLAN and the main plan.

cherry-pick from: f669acf
The large_objects test invokes pg_upgrade and must not run concurrently
with the setup DDL. This led to intermittent failures, since some setup
scripts temporarily add objects that aren't upgradable.
The query to obtain AO auxiliary catalog names is relatively expensive
compared to the other aux table queries, and it was being performed once
for every AO table. Consolidate all calls into a single query, and
manually join the results with the other relation info by relid.

Also improve the "couldn't find aux tables" FATAL message for easier
debugging (it needs to include the dbname).
If there is an ORDER BY or DISTINCT in the query, we need to bring all the
data to a single node by setting must_gather to true. An exception is
when there's a LIMIT or OFFSET clause, which is handled later when
inserting the Limit node. To tell whether there is any LIMIT or OFFSET
clause, we should use limit_needed() instead of checking limitCount or
limitOffset directly.

Fixes issue #9746.

Reviewed-by: Heikki Linnakangas <hlinnakangas@pivotal.io>
Reviewed-by: Ekta Khanna <ekhanna@pivotal.io>
…atview. (#10215)

In REFRESH MATERIALIZED VIEW command, CONCURRENTLY option is only
allowed if there is at least one unique index with no WHERE clause on
one or more columns of the matview. Previously, concurrent refresh
checked the existence of a unique index on the matview after filling
the data to new snapshot, i.e., after calling refresh_matview_datafill().
So, when there was no unique index, we could need to wait a long time
before we detected that and got the error. It was a waste of time.

To eliminate such wasting time, this commit changes concurrent refresh
so that it checks the existence of a unique index at the beginning of
the refresh operation, i.e., before starting any time-consuming jobs.
If CONCURRENTLY option is not allowed due to lack of a unique index,
concurrent refresh can immediately detect it and emit an error.

Author: Masahiko Sawada
Reviewed-by: Michael Paquier, Fujii Masao

Co-authored-by: Fujii Masao <fujii@postgresql.org>
The SkipData flag should only short-circuit in transientrel_receive on the QE.

We should still do the begin/end work, e.g. remove the newly
created temp file, or we will have a file leak.

Co-authored-by: Jinbao Chen <jinchen@pivotal.io>

Co-authored-by: Hubert Zhang <hzhang@pivotal.io>
* Make DbgPrint and OsPrint methods on CRefCount

Create a single DbgPrint() method on the CRefCount class. Also create
a virtual OsPrint() method, making some objects derived from CRefCount
easier to print from the debugger.

Note that not all the OsPrint methods had the same signatures, some
additional OsPrintxxx() methods have been generated for that.

* Making print output easier to read, print some stuff on demand

Required columns in required plan properties are always the same
for a given group. Also, equivalent expressions in required distribution
properties are important in certain cases, but in most cases they
disrupt the display and make it harder to read.

Added two traceflags, EopttracePrintRequiredColumns and
EopttracePrintEquivDistrSpecs that have to be set to print this
information. If you want to go back to the old display, use these
options when running gporca_test: -T 101016 -T 101017

* Add support for printing alternative plans

A new method, CEngine::DbgPrintExpr() can be called from
COptimizer::PexprOptimize, to allow printing of the best plan
for different contexts. This is only enabled in debug builds.

To use this:

- run an MDP using gporca_test, using a debug build
- print out memo after optimization (-T 101006 -T 101010)
- set a breakpoint near the end of COptimizer::PexprOptimize()
- if, after looking at the contents of memo, you want to see
  the optimal plan for context c of group g, do the following:
  p eng.DbgPrintExpr(g, c)

You could also get the same info from the memo printout, but it
would take a lot longer.

(cherry picked from commit b3fdede)
This reverts commit 412493b.

pgbouncer failed to compile on CentOS 6: libevent can't be found.
pgbouncer 1.13 uses pkg-config to look up libevent instead of
using --with-libevent.

Another issue: pgbouncer 1.13 does not support libevent 1.x,
but we use libevent 1.4 on CentOS 6.
The psql client ignored relstorage when it built the \dm command, so
the output of \dm was empty. Add the correct relstorage check to the command.
… in ORCA

Duplicate sensitive HashDistribute Motions generated by ORCA get
translated to Result nodes with hashFilter cols set. However, if the
Motion needs to distribute based on a complex expression (rather than
just a Var), the expression must be added into the targetlist of the
Result node and then referenced in hashFilterColIdx.

However, this can affect other operators above the Result node. For
example, a Hash operator expects the targetlist of its child node to
contain only elements that are to be hashed. Additional expressions here
can cause issues with memtuple bindings that can lead to errors.

(E.g The attached test case, when run without our fix, will give an
error: "invalid input syntax for integer:")

This PR fixes the issue by adding an additional Result node on top of
the duplicate sensitive Result node to project only the elements from
the original targetlist in such cases.
Orca uses this property for cardinality estimation of joins.
For example, a join predicate foo join bar on foo.a = upper(bar.b)
will have a cardinality estimate similar to foo join bar on foo.a = bar.b.

Other functions, like foo join bar on foo.a = substring(bar.b, 1, 1)
won't be treated that way, since they are more likely to have a greater
effect on join cardinalities.

Since this is specific to ORCA, we use logic in the translator to determine
whether a function or operator is NDV-preserving. Right now, we consider
a very limited set of operators, we may add more at a later time.

Let's assume that we join tables R and S and that f is a function or
expression that refers to a single column and does not preserve
NDVs. Let's also assume that p is a function or expression that also
refers to a single column and that does preserve NDVs:

join predicate       card. estimate                         comment
-------------------  -------------------------------------  -----------------------------
col1 = col2          |R| * |S| / max(NDV(col1), NDV(col2))  build an equi-join histogram
f(col1) = p(col2)    |R| * |S| / NDV(col2)                  use NDV-based estimation
f(col1) = col2       |R| * |S| / NDV(col2)                  use NDV-based estimation
p(col1) = col2       |R| * |S| / max(NDV(col1), NDV(col2))  use NDV-based estimation
p(col1) = p(col2)    |R| * |S| / max(NDV(col1), NDV(col2))  use NDV-based estimation
otherwise            |R| * |S| * 0.4                        this is an unsupported pred
Note that adding casts to these expressions is ok, as well as switching left and right side.
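
For concreteness, a small worked example of the estimates in the table above, with made-up row counts and NDVs (|R| = 1000, |S| = 500, NDV(col1) = 100, NDV(col2) = 50):

```c
#include <stdio.h>

static double max2(double a, double b) { return a > b ? a : b; }

int
main(void)
{
    double R = 1000, S = 500, ndv1 = 100, ndv2 = 50;

    printf("col1 = col2       : %.0f rows\n", R * S / max2(ndv1, ndv2)); /* 5000 */
    printf("f(col1) = p(col2) : %.0f rows\n", R * S / ndv2);             /* 10000 */
    printf("f(col1) = col2    : %.0f rows\n", R * S / ndv2);             /* 10000 */
    printf("p(col1) = col2    : %.0f rows\n", R * S / max2(ndv1, ndv2)); /* 5000 */
    printf("unsupported pred  : %.0f rows\n", R * S * 0.4);              /* 200000 */
    return 0;
}
```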

Here is a list of expressions that we currently treat as NDV-preserving:

coalesce(col, const)
col || const
lower(col)
trim(col)
upper(col)

One more note: We need the NDVs of the inner side of Semi and
Anti-joins for cardinality estimation, so only normal columns and
NDV-preserving functions are allowed in that case.

This is a port of these GPDB 5X and GPOrca PRs:
https://github.com/greenplum-db/gporca/pull/585
https://github.com/greenplum-db/gpdb/pull/10090

(cherry picked from commit 3ccd1eb)

Also updated join.sql expected files with minor motion changes.
zhangh43 and others added 19 commits August 19, 2020 10:13
Previously, when backends connected to a proxy, we had to set up the
domain socket pipe and send the HELLO message (and receive the ACK message)
in a blocking, non-parallel way. This made it hard for ICPROXY to introduce
check_for_interrupt during backend registration.

By utilizing the libuv loop, we can register backends in parallel. Note
that this is one of the steps to replace all the ic_tcp backend logic
currently reused by ic_proxy. In the future, we should use libuv to replace
all the backend logic, from registration to sending/receiving data.

Co-authored-by: Ning Yu <nyu@pivotal.io>
(cherry picked from commit 608514c)
ic-proxy is developed with libuv; the minimal supported libuv version is
1.18.0. But in commit 608514, we introduced APIs that are new in libuv 1.19,
which breaks compatibility on OSes like Ubuntu 18.04 whose default libuv
version is 1.18.

We should keep our code base aligned with libuv 1.18 and replace the new
APIs from libuv 1.19. The API change is mainly about how to access the data
field in a uv handle and uv loop. The new API uses function interfaces like
`uv_handle_set_data` and `uv_handle_get_data` to access the data field, while
the old API in 1.18 accesses the data field directly. Note that in the latest
libuv version, 1.38.2, the old API and the new API are both supported, and libuv
is stable enough to support the old API for a long time.
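
As a simplified illustration of the two styles (not the actual ic-proxy code; the context struct is made up), both attach per-handle context to a libuv handle, but only the direct field access compiles against libuv 1.18:

```c
#include <uv.h>

typedef struct
{
    int backend_id;
} BackendCtx;

static void
attach_ctx(uv_tcp_t *handle, BackendCtx *ctx)
{
    /* libuv 1.18-compatible: access the public data field directly. */
    ((uv_handle_t *) handle)->data = ctx;

#if UV_VERSION_HEX >= 0x011300
    /* The accessor functions only exist from libuv 1.19.0 onwards. */
    uv_handle_set_data((uv_handle_t *) handle, ctx);
#endif
}
```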

(cherry picked from commit ab36eb9)
ic-proxy now supports handling the cancel signal during
backend connection setup, so the external_table test case
can be enabled.

Reviewed-by: Ning Yu <nyu@pivotal.io>
(cherry picked from commit 1789e43)
* docs - clarify gpcopy supported greenplum versions

* use or later rather than +
…… (#10635)

* docs - add information on table rewrites when executing ALTER TABLE commands.

Information is in Notes section of ALTER TABLE.
Also, made some minor edits.

* docs - updates based on review comments.
-simplified table listing ALTER TABLE commands that perform table rewrites.
-add links to table rewrite information.
There is a race condition between receiveFrom() and
dbms_pipe_session_A::createImplicitPipe(). Sleeping an extra second
is not reliable, so wait until the data is available.
This PR changes the semantics of receiveFrom() to wait until
the data is available; no timeout happens.
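
The change follows the standard "wait until a condition holds" pattern; a generic sketch (plain pthreads, not the actual dbms_pipe code):

```c
#include <pthread.h>
#include <stdbool.h>

typedef struct
{
    pthread_mutex_t lock;
    pthread_cond_t  has_data;
    bool            data_ready;
} Pipe;

static void
receive_from(Pipe *p)
{
    pthread_mutex_lock(&p->lock);
    while (!p->data_ready)                 /* re-check on spurious wakeups */
        pthread_cond_wait(&p->has_data, &p->lock);
    /* ... consume the data here ... */
    p->data_ready = false;
    pthread_mutex_unlock(&p->lock);
}
```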

Reviewed-by: Hubert Zhang <hzhang@pivotal.io>
(cherry picked from commit 2f170b3)
This creates an explain pipeline which is used to run query workloads
using explain and identify plan differences during feature development.

This previously existed on 5X, but with the Orca in GPDB effort the
architecture needed to be updated.

The new pipeline has the following features for use with slack commands:

- Comparing a SHA with a baseline SHA
- Identifying and outputting plan diffs
- Summarizing the results with planning time, any
fallbacks/crashes/timeouts, and the type of diff (cost, cardinality,
plan change)
- Enabling custom gucs (turning optimizer off, using a different cost
model, enabling/disabling a feature flag/join order)

Some use cases that we've come across:
- Checking whether a new join order algorithm influences optimization
time/dramatically changes plans

One of the main goals with this change is to make it easy to add
workloads. The slack command can be used in the following format:

\test_gpdb6_explain <SHA_remote> <SHA> <baseline_SHA_remote>
<baseline_SHA> <workload> [optional_gucs]

where workloads is either "all" or a list with no spaces of the names of
the workloads to run (eg: workload1,workload2,workload3).
and gucs, if specified, are in the format:
```
optimizer=off
enable_nestloop=on
```
This is kept as a separate commit as an example of how to add additional
workloads.

Co-authored-by: Chris Hajas <chajas@vmware.com>
Raise a meaningful error message for this case.
GPDB doesn't support ALTER TYPE on primary key and unique
constraint columns, because it would require drop-and-recreate logic.
The drop currently only happens on the master, which leads to an error when
recreating the index (since the recreate is dispatched to the segments, where
an old constraint index still exists).

This fixes the issue https://github.com/greenplum-db/gpdb/issues/10561.

Reviewed-by: Hubert Zhang <hzhang@pivotal.io>
(cherry picked from commit 32446a3)
* allow NEq predicates with lossy casts for PS
* Updating some mdp files that were missing the PartitionTypes XML tag
* Update partition_pruning result file for planner
* adding new testcase for NEq
* move icg tests from partition_pruning to gporca
* add test for checking multilevel range/list PS

Co-authored-by: Hans Zeller <hzeller@vmware.com>
Commit 445fc7c hardened some parts of analyzedb. However, it missed a
couple of cases.

1) When the statement to get the modcount from the pg_aoseg table failed
due to a dropped table, the transaction was also terminated. This caused
further modcount queries to fail and while those tables were analyzed,
it would error and not properly record the mod count. Therefore, we now
restart the transaction when it errors.

2) If the table is dropped and then recreated while analyzedb is running
(or some other mechanism that results in the table being successfully
analyzed, but the pg_aoseg table did not exist during the initial
check), the logic to update the modcount may fail. Now, we skip the
update for the table if this occurs. In this case, the modcount would
not be recorded and the next analyzedb run will consider the table
modified (or dirty) and re-analyze it, which is the desired behavior.
* docs - add information on upgrading to PostGIS 2.5.4

Upgrade instructions 2.1.5 to different versions of 2.5.4

* docs - upgrade to PostGIS 2.5.4 review comments

* docs - more review comment updates.
reorder upgrade sections.
clarify removing PostGIS package, is for removing the gppkg

* docs - minor edit

* docs - review updates - more emphasis on removing PostGIS from a database deleting objects.
-Create separate paragraph in Upgrading section.
-Add warning in Removing PostGIS section

* docs - minor review comment update

* small edits

Co-authored-by: David Yozie <dyozie@pivotal.io>
* Fix postgres_fdw's libpq issue

When using postgres_fdw, it reports the following error:
unsupported frontend protocol 28675.0: server supports 2.0 to 3.0

Root cause: even though postgres_fdw.so is dynamically linked to a libpq.so
that is compiled with the option -DFRONTEND, when it's loaded
into gpdb and run, it uses the backend libpq which is compiled together
with the postgres program, and so reports the error.

We statically link libpq into postgres_fdw and hide all the symbols
of libpq.a with --exclude-libs=libpq.a to make it use the frontend
libpq.

Since postgres_fdw is compiled as a backend without -DFRONTEND but linked
to libpq which is a frontend, _PQconninfoOption's length differs
between backend and frontend because of a macro in it.
The backend's _PQconninfoOption has the field connofs, but the frontend's
doesn't. This leads to a crash of postgres_fdw. One solution is to
remove the FRONTEND macro in struct PQconninfoOption so that PQconninfoOption
is the same in the backend and frontend, but that brings an ABI change. To avoid that,
we add the FRONTEND macro when including the libpq header files, so that postgres_fdw
can process the libpq variables returned by libpq.a's functions as a frontend.
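
A minimal sketch of that include-time arrangement (the guard macro name is hypothetical and the real patch may structure this differently): define FRONTEND before pulling in the libpq headers so that structs such as PQconninfoOption get the frontend layout that matches the statically linked libpq.a.

```c
#ifndef FRONTEND
#define FRONTEND 1
#define FRONTEND_DEFINED_LOCALLY
#endif

#include "libpq-fe.h"      /* declares PQconninfoOption, PQconndefaults(), ... */

#ifdef FRONTEND_DEFINED_LOCALLY
#undef FRONTEND
#undef FRONTEND_DEFINED_LOCALLY
#endif
```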

* Disable postgres_fdw on macOS
* Fix dblink's libpq issue

When using dblink to connect to a postgres database, it reports the
following error:
unsupported frontend protocol 28675.0: server supports 2.0 to 3.0

Even though dblink.so is dynamically linked to a libpq.so which is compiled
with the option -DFRONTEND, when it's loaded into gpdb and run,
it uses the backend libpq which is compiled together with the
postgres program, and reports the error.

This PR is almost the same as
https://github.com/greenplum-db/gpdb/pull/10617
which fixed the postgres_fdw issue.

* Disable dblink on macOS

--exclude-libs and -Bstatic are not supported on macOS
The previous commits ab74e1c and c7befb1 did not completely solve this race
condition: they did not test for the last iteration of the while/for loop.
This could result in a failed assertion in the following loop. The patch
moves the check to the end of the for loop, which is safe because
the first iteration can never trigger Assert(activeWeight > 0.0).

Also, another race condition can trigger the assertion
Assert(gl->numFollowersActive > 0). Consider this situation:

    Backend A, B belong to the same statement.

    Timestamp1: backend A's leader is A, backend B's leader is B.

    Timestamp2: backend A's numFollowersActive remains zero due to timeout.

    Timestamp3: Sweeper calculates leader B's numFollowersActive to 1.

    Timestamp4: backend B changes it's leader to A even if A is inactive.

We stop sweeping for this race condition just like commit ab74e1c did.

Both Assert(activeWeight > 0.0) and Assert(gl->numFollowersActive > 0)
are removed.

(cherry picked from commit b1c1919)
…h recovery

After crash recovery finishes, the startup process will create an
end-of-recovery checkpoint. The checkpoint will recycle/remove xlog files
according to orphaned prepared transaction LSNs, replication slot data, etc.
The orphaned prepared transaction LSN data (TwoPhaseState->prepXacts, etc) for
checkpoint are populated in the startup process RecoverPreparedTransactions(),
but the function is called after the end-of-recovery checkpoint creation so
xlog files with orphaned prepared transactions might be recycled/removed. This
can cause "requested WAL segment pg_xlog/000000010000000000000009 has already
been removed" kind of error when bringing up the crashed primary. For example
if you run 'gprecoverseg -a -v', you might be able to see failure with stack as
below. In more details, this happens when running the single mode postgres in
pg_rewind.
	2    0x5673a1 postgres <symbol not found> (xlogutils.c:572)
	3    0x567b2f postgres read_local_xlog_page (xlogutils.c:870)
	4    0x5658f3 postgres <symbol not found> (xlogreader.c:503)
	5    0x56518a postgres XLogReadRecord (xlogreader.c:226)
	6    0x54d725 postgres RecoverPreparedTransactions (twophase.c:1955)
	7    0x55b0b7 postgres StartupXLOG (xlog.c:7760)
	8    0xb0087f postgres InitPostgres (postinit.c:704)
	9    0x97a97c postgres PostgresMain (postgres.c:4829)

Fixing this by prescanning the prepared transaction data from xlogs directly
for checkpoint creation if it's still InRecovery (i.e. when
TwoPhaseState->prepXacts is not populated).

Please note that "orphaned" might be temporary (usually happens on an
environment with heavy write-operation load) or permanent unless master dtx
recovery (implies a product bug).

This is a gpdb 6 only issue. On gpdb 7, state files are used to store prepared
transactions during checkpoint so we do not keep the wal files with orphaned
prepared transactions and thus we won't encounter this issue.

Reviewed-by: Asim R P <pasim@vmware.com>
* Fix distclean of postgres_fdw and dblink
@hlinnaka hlinnaka force-pushed the valuescan_ctid_crash_patch branch 2 times, most recently from 7468473 to a5f0675 Compare August 31, 2020 10:45
darthunix and others added 3 commits August 31, 2020 13:50
Glibc implementations are known to return inconsistent results for
strcoll() and strxfrm() on many platforms, which can cause
unpredictable bugs. Because of that, PostgreSQL has disabled strxfrm()
at compile time by default since 9.5 via the TRUST_STRXFRM definition.
Greenplum has its own mk sort implementation that can also use
strxfrm(). Hence mk sort can also be affected by strcoll() and
strxfrm() inconsistency (it breaks merge joins). That is why strxfrm()
should be disabled by default with the TRUST_STRXFRM_MK_SORT definition
for mk sort as well. We don't reuse PostgreSQL's TRUST_STRXFRM
definition because many users ran Greenplum with strxfrm() enabled for
mk sort while it was disabled in the PostgreSQL core. Keeping TRUST_STRXFRM_MK_SORT
as a separate definition allows these users not to reindex after a
version upgrade.
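
For illustration, a standalone check of the consistency assumption that the sort code relies on: for any two strings, comparing their strxfrm() keys with strcmp() should give the same sign as strcoll(). On affected glibc/locale combinations it doesn't, which is what breaks merge joins.

```c
#include <locale.h>
#include <stdio.h>
#include <string.h>

static int
sign(int x)
{
    return (x > 0) - (x < 0);
}

int
main(void)
{
    const char *a = "straße", *b = "strasse";
    char ka[256], kb[256];

    setlocale(LC_COLLATE, "");          /* use the environment's locale */
    strxfrm(ka, a, sizeof(ka));
    strxfrm(kb, b, sizeof(kb));

    int direct = sign(strcoll(a, b));
    int viakey = sign(strcmp(ka, kb));

    printf("strcoll: %d, strxfrm keys: %d -> %s\n",
           direct, viakey, direct == viakey ? "consistent" : "INCONSISTENT");
    return 0;
}
```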

Reviewed-by: Asim R P <pasim@vmware.com>
Reviewed-by: Heikki Linnakangas <linnakangash@vmware.com>
Reviewed-by: Hubert Zhang <hzhang@pivotal.io>
When we convert a semi join to an inner join, a distinct agg on ctid is added
above the hash join node. But tuples from a function scan have no ctid, so
the assert check failed.

Set a synthetic ctid by calling the existing function
slot_set_ctid_from_fake.

(cherry picked from commit d8886cf)
When converting semi-join to inner-join, a distinct agg on ctid is added
above the hash-join node. But the fake ctids generated in Values Scan
were invalid, with offset number 0, which caused an assertion failure.

This patch is based on commit d8886cf, which fixed the same issue for
Function Scans.

Co-authored-by: dh-cloud <60729713+dh-cloud@users.noreply.github.com>
Co-authored-by: Jesse Zhang <sbjesse@gmail.com>
@hlinnaka hlinnaka force-pushed the valuescan_ctid_crash_patch branch from a5f0675 to 3b7ca45 Compare August 31, 2020 13:39
dh-cloud pushed a commit that referenced this pull request Oct 18, 2021
…CREATE/ALTER resource group.

In some scenarios, the AccessExclusiveLock on table pg_resgroupcapability may cause database setup/recovery to hang. Below is why we need to change the AccessExclusiveLock to ExclusiveLock.

This lock on table pg_resgroupcapability is used to serialize concurrent updates to the table when running a "Create/Alter resource group" statement. There is a CPU limit: after modifying one resource group, we have to check that the total CPU usage of all resource groups doesn't exceed 100%.

Before this fix, AccessExclusiveLock was used. Suppose a user is running an "Alter resource group" statement. The QD dispatches this statement to all QEs, so it is a two-phase commit (2PC) transaction. When the QD dispatches the "Alter resource group" statement, each QE acquires the AccessExclusiveLock on table pg_resgroupcapability; a QE cannot release the AccessExclusiveLock on this table until the 2PC distributed transaction commits.

In the second phase, the QD calls the function doNotifyingCommitPrepared to broadcast the "commit prepared" command to all QEs; the QEs have already finished preparing, so this is a prepared transaction. Suppose at this point a primary segment goes down and a mirror is promoted to primary.

The mirror gets the "promoted" message from the coordinator and recovers based on the xlog from the primary. In order to recover the prepared transaction, it reads the prepared transaction log entry and acquires the AccessExclusiveLock on table pg_resgroupcapability. The call stack is:
#0 lock_twophase_recover (xid=, info=, recdata=, len=) at lock.c:4697
#1 ProcessRecords (callbacks=, xid=2933, bufptr=0x1d575a8 "") at twophase.c:1757
#2 RecoverPreparedTransactions () at twophase.c:2214
#3 StartupXLOG () at xlog.c:8013
#4 StartupProcessMain () at startup.c:231
#5 AuxiliaryProcessMain (argc=argc@entry=2, argv=argv@entry=0x7fff84b94a70) at bootstrap.c:459
#6 StartChildProcess (type=StartupProcess) at postmaster.c:5917
#7 PostmasterMain (argc=argc@entry=7, argv=argv@entry=0x1d555b0) at postmaster.c:1581
#8 main (argc=7, argv=0x1d555b0) at main.c:240

After that, the database instance starts up and all related initialization functions are called. However, there is a function named "InitResGroups" which acquires AccessShareLock on table pg_resgroupcapability and does some initialization work. The call stack is:
#6 WaitOnLock (locallock=locallock@entry=0x1c7f248, owner=owner@entry=0x1ca0a40) at lock.c:1999
#7 LockAcquireExtended (locktag=locktag@entry=0x7ffd15d18d90, lockmode=lockmode@entry=1, sessionLock=sessionLock@entry=false, dontWait=dontWait@entry=false, reportMemoryError=reportMemoryError@entry=true, locallockp=locallockp@entry=0x7ffd15d18d88) at lock.c:1192
#8 LockRelationOid (relid=6439, lockmode=1) at lmgr.c:126
#9 relation_open (relationId=relationId@entry=6439, lockmode=lockmode@entry=1) at relation.c:56
#10 table_open (relationId=relationId@entry=6439, lockmode=lockmode@entry=1) at table.c:47
#11 InitResGroups () at resgroup.c:581
#12 InitResManager () at resource_manager.c:83
#13 initPostgres (in_dbname=, dboid=dboid@entry=0, username=username@entry=0x1c5b730 "linw", useroid=useroid@entry=0, out_dbname=out_dbname@entry=0x0, override_allow_connections=override_allow_connections@entry=false) at postinit.c:1284
#14 PostgresMain (argc=1, argv=argv@entry=0x1c8af78, dbname=0x1c89e70 "postgres", username=0x1c5b730 "linw") at postgres.c:4812
#15 BackendRun (port=, port=) at postmaster.c:4922
#16 BackendStartup (port=0x1c835d0) at postmaster.c:4607
#17 ServerLoop () at postmaster.c:1963
#18 PostmasterMain (argc=argc@entry=7, argv=argv@entry=0x1c595b0) at postmaster.c:1589
#19 in main (argc=7, argv=0x1c595b0) at main.c:240

The AccessExclusiveLock is not released, and it is not compatible with any other lock, so the startup process waits on this lock and the mirror can't become primary successfully.

Even if users run "gprecoverseg" to recover the primary segment, the result is similar. The primary segment recovers from xlog: it recovers the prepared transactions and acquires the AccessExclusiveLock on table pg_resgroupcapability, and then the startup process waits on this lock. Only if users change the resource manager type to "queue" is the function InitResGroups not called; then nothing blocks and the primary segment can start up normally.

After this fix, ExclusiveLock is acquired when altering a resource group. In the above case, the startup process acquires AccessShareLock; ExclusiveLock and AccessShareLock are compatible, so the startup process can run successfully. After startup, the QE gets the RECOVERY_COMMIT_PREPARED command from the QD, finishes the second phase of this distributed transaction, and releases the ExclusiveLock on table pg_resgroupcapability. The call stack is:
#0 lock_twophase_postcommit (xid=, info=, recdata=0x3303458, len=) at lock.c:4758
#1 ProcessRecords (callbacks=, xid=, bufptr=0x3303458 "") at twophase.c:1757
#2 FinishPreparedTransaction (gid=gid@entry=0x323caf5 "25", isCommit=isCommit@entry=true, raiseErrorIfNotFound=raiseErrorIfNotFound@entry=false) at twophase.c:1704
#3 in performDtxProtocolCommitPrepared (gid=gid@entry=0x323caf5 "25", raiseErrorIfNotFound=raiseErrorIfNotFound@entry=false) at cdbtm.c:2107
#4 performDtxProtocolCommand (dtxProtocolCommand=dtxProtocolCommand@entry=DTX_PROTOCOL_COMMAND_RECOVERY_COMMIT_PREPARED, gid=gid@entry=0x323caf5 "25", contextInfo=contextInfo@entry=0x10e1820 ) at cdbtm.c:2279
#5 exec_mpp_dtx_protocol_command (contextInfo=0x10e1820 , gid=0x323caf5 "25", loggingStr=0x323cad8 "Recovery Commit Prepared", dtxProtocolCommand=DTX_PROTOCOL_COMMAND_RECOVERY_COMMIT_PREPARED) at postgres.c:1570
#6 PostgresMain (argc=, argv=argv@entry=0x3268f98, dbname=0x3267e90 "postgres", username=) at postgres.c:5482

The test case of this commit simulates a repro of this bug.
dh-cloud pushed a commit that referenced this pull request Oct 18, 2021
…ce (#12447)

Recently I built from the Greenplum master branch to run TPC-DS queries with 1GB of data. For Q47 and Q57, when I turned off the GUC `execute_pruned_plan` (on by default), some of the worker processes hang and the query never returns.

Take Q57 as an example. My cluster configuration is 1 QD + 2 QEs. The query looks like:

```sql
with v1 as(
  select
    i_category,i_brand,
    cc_name,d_year,d_moy,
    sum(cs_sales_price) sum_sales,
    avg(sum(cs_sales_price)) over (partition by
      i_category,i_brand,cc_name,d_year)
    avg_monthly_sales,
    rank() over (partition by
      i_category,i_brand,cc_name
    order by
      d_year,d_moy
    ) rn
  from
    item,catalog_sales,date_dim,call_center
  where
    cs_item_sk = i_item_sk and
    cs_sold_date_sk = d_date_sk and
    cc_call_center_sk= cs_call_center_sk and(
      d_year = 1999 or
      ( d_year = 1999-1 and d_moy =12) or
      ( d_year = 1999+1 and d_moy =1)
    )
  group by
    i_category,i_brand,cc_name,d_year,d_moy
),
v2 as(
  select
    v1.i_category,v1.i_brand,v1.cc_name,
    v1.d_year,v1.d_moy,v1.avg_monthly_sales,
    v1.sum_sales,v1_lag.sum_sales psum,
    v1_lead.sum_sales nsum
  from
    v1,v1 v1_lag,v1 v1_lead
  where
    v1.i_category = v1_lag.i_category and
    v1.i_category = v1_lead.i_category and
    v1.i_brand = v1_lag.i_brand and
    v1.i_brand = v1_lead.i_brand and
    v1. cc_name = v1_lag. cc_name and
    v1. cc_name = v1_lead. cc_name and
    v1.rn = v1_lag.rn + 1 and
    v1.rn = v1_lead.rn - 1
)
select *
from v2
where
  d_year = 1999 and
  avg_monthly_sales > 0 and
  case when avg_monthly_sales > 0 then
    abs(sum_sales - avg_monthly_sales) / avg_monthly_sales
    else null end > 0.1
order by
  sum_sales - avg_monthly_sales,3
limit 100;
```

When `execute_pruned_plan` is on by default, the plan looks like:

```
                                                                                                                                                                 QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Result  (cost=0.00..2832.84 rows=1 width=64) (actual time=10792.606..10792.702 rows=100 loops=1)
   ->  Gather Motion 2:1  (slice1; segments: 2)  (cost=0.00..2832.84 rows=1 width=64) (actual time=10792.597..10792.673 rows=100 loops=1)
         Merge Key: ((share0_ref4.sum_sales - share0_ref4.avg_monthly_sales)), share0_ref4.cc_name
         ->  Sort  (cost=0.00..2832.84 rows=1 width=72) (actual time=10791.203..10791.225 rows=50 loops=1)
               Sort Key: ((share0_ref4.sum_sales - share0_ref4.avg_monthly_sales)), share0_ref4.cc_name
               Sort Method:  quicksort  Memory: 152kB
               ->  Sequence  (cost=0.00..2832.84 rows=1 width=72) (actual time=10790.522..10790.559 rows=50 loops=1)
                     ->  Shared Scan (share slice:id 1:0)  (cost=0.00..1539.83 rows=1 width=1) (actual time=10140.895..10145.397 rows=16510 loops=1)
                           ->  WindowAgg  (cost=0.00..1539.83 rows=1 width=56) (actual time=10082.465..10128.750 rows=16510 loops=1)
                                 Partition By: item.i_category, item.i_brand, call_center.cc_name
                                 Order By: date_dim.d_year, date_dim.d_moy
                                 ->  Sort  (cost=0.00..1539.83 rows=1 width=48) (actual time=10082.429..10084.923 rows=16510 loops=1)
                                       Sort Key: item.i_category, item.i_brand, call_center.cc_name, date_dim.d_year, date_dim.d_moy
                                       Sort Method:  quicksort  Memory: 20078kB
                                       ->  Redistribute Motion 2:2  (slice2; segments: 2)  (cost=0.00..1539.83 rows=1 width=48) (actual time=9924.269..9989.657 rows=16510 loops=1)
                                             Hash Key: item.i_category, item.i_brand, call_center.cc_name
                                             ->  WindowAgg  (cost=0.00..1539.83 rows=1 width=48) (actual time=9924.717..9974.500 rows=16633 loops=1)
                                                   Partition By: item.i_category, item.i_brand, call_center.cc_name, date_dim.d_year
                                                   ->  Sort  (cost=0.00..1539.83 rows=1 width=126) (actual time=9924.662..9927.280 rows=16633 loops=1)
                                                         Sort Key: item.i_category, item.i_brand, call_center.cc_name, date_dim.d_year
                                                         Sort Method:  quicksort  Memory: 20076kB
                                                         ->  Redistribute Motion 2:2  (slice3; segments: 2)  (cost=0.00..1539.83 rows=1 width=126) (actual time=9394.220..9856.375 rows=16633 loops=1)
                                                               Hash Key: item.i_category, item.i_brand, call_center.cc_name, date_dim.d_year
                                                               ->  GroupAggregate  (cost=0.00..1539.83 rows=1 width=126) (actual time=9391.783..9833.988 rows=16424 loops=1)
                                                                     Group Key: item.i_category, item.i_brand, call_center.cc_name, date_dim.d_year, date_dim.d_moy
                                                                     ->  Sort  (cost=0.00..1539.83 rows=1 width=124) (actual time=9397.448..9628.606 rows=174584 loops=1)
                                                                           Sort Key: item.i_category, item.i_brand, call_center.cc_name, date_dim.d_year, date_dim.d_moy
                                                                           Sort Method:  external merge  Disk: 134144kB
                                                                           ->  Redistribute Motion 2:2  (slice4; segments: 2)  (cost=0.00..1539.83 rows=1 width=124) (actual time=6107.447..8237.581 rows=174584 loops=1)
                                                                                 Hash Key: item.i_category, item.i_brand, call_center.cc_name, date_dim.d_year, date_dim.d_moy
                                                                                 ->  Hash Join  (cost=0.00..1539.83 rows=1 width=124) (actual time=6112.706..7088.349 rows=178669 loops=1)
                                                                                       Hash Cond: (date_dim.d_date_sk = catalog_sales.cs_sold_date_sk)
                                                                                       ->  Seq Scan on date_dim  (cost=0.00..436.38 rows=204 width=12) (actual time=10.656..17.972 rows=222 loops=1)
                                                                                             Filter: ((d_year = 1999) OR ((d_year = 1998) AND (d_moy = 12)) OR ((d_year = 2000) AND (d_moy = 1)))
                                                                                             Rows Removed by Filter: 36504
                                                                                       ->  Hash  (cost=1103.41..1103.41 rows=1 width=120) (actual time=6100.040..6100.040 rows=1430799 loops=1)
                                                                                             Buckets: 16384 (originally 16384)  Batches: 32 (originally 1)  Memory Usage: 12493kB
                                                                                             ->  Broadcast Motion 2:2  (slice5; segments: 2)  (cost=0.00..1103.41 rows=1 width=120) (actual time=1.802..5410.377 rows=1434428 loops=1)
                                                                                                   ->  Nested Loop  (cost=0.00..1103.40 rows=1 width=120) (actual time=1.632..5127.625 rows=718766 loops=1)
                                                                                                         Join Filter: true
                                                                                                         ->  Redistribute Motion 2:2  (slice6; segments: 2)  (cost=0.00..1097.40 rows=1 width=22) (actual time=1.564..362.958 rows=718766 loops=1)
                                                                                                               Hash Key: catalog_sales.cs_item_sk
                                                                                                               ->  Hash Join  (cost=0.00..1097.40 rows=1 width=22) (actual time=1.112..996.643 rows=717589 loops=1)
                                                                                                                     Hash Cond: (catalog_sales.cs_call_center_sk = call_center.cc_call_center_sk)
                                                                                                                     ->  Seq Scan on catalog_sales  (cost=0.00..509.10 rows=720774 width=18) (actual time=0.144..602.362 rows=721193 loops=1)
                                                                                                                     ->  Hash  (cost=431.00..431.00 rows=1 width=12) (actual time=0.022..0.022 rows=6 loops=1)
                                                                                                                           Buckets: 32768  Batches: 1  Memory Usage: 257kB
                                                                                                                           ->  Broadcast Motion 2:2  (slice7; segments: 2)  (cost=0.00..431.00 rows=1 width=12) (actual time=0.009..0.012 rows=6 loops=1)
                                                                                                                                 ->  Seq Scan on call_center  (cost=0.00..431.00 rows=1 width=12) (actual time=0.032..0.035 rows=4 loops=1)
                                                                                                         ->  Index Scan using item_pkey on item  (cost=0.00..6.00 rows=1 width=102) (actual time=0.000..0.006 rows=1 loops=718766)
                                                                                                               Index Cond: (i_item_sk = catalog_sales.cs_item_sk)
                     ->  Redistribute Motion 1:2  (slice8)  (cost=0.00..1293.01 rows=1 width=72) (actual time=646.614..646.646 rows=50 loops=1)
                           ->  Limit  (cost=0.00..1293.01 rows=1 width=72) (actual time=10787.533..10787.700 rows=100 loops=1)
                                 ->  Gather Motion 2:1  (slice9; segments: 2)  (cost=0.00..1293.01 rows=1 width=72) (actual time=10787.527..10787.654 rows=100 loops=1)
                                       Merge Key: ((share0_ref4.sum_sales - share0_ref4.avg_monthly_sales)), share0_ref4.cc_name
                                       ->  Sort  (cost=0.00..1293.01 rows=1 width=72) (actual time=10789.933..10789.995 rows=357 loops=1)
                                             Sort Key: ((share0_ref4.sum_sales - share0_ref4.avg_monthly_sales)), share0_ref4.cc_name
                                             Sort Method:  quicksort  Memory: 14998kB
                                             ->  Result  (cost=0.00..1293.01 rows=1 width=150) (actual time=10648.280..10774.898 rows=12379 loops=1)
                                                   Filter: ((share0_ref4.d_year = 1999) AND (share0_ref4.avg_monthly_sales > '0'::numeric) AND (CASE WHEN (share0_ref4.avg_monthly_sales > '0'::numeric) THEN (abs((share0_ref4.sum_sales - share0_ref4.avg_monthly_sales)) / share0_ref4.avg_monthly_sales) ELSE NULL::numeric END > 0.1))
                                                   ->  Hash Join  (cost=0.00..1293.01 rows=1 width=150) (actual time=10648.253..10740.262 rows=13582 loops=1)
                                                         Hash Cond: ((share0_ref4.i_category = share0_ref3.i_category) AND (share0_ref4.i_brand = share0_ref3.i_brand) AND ((share0_ref4.cc_name)::text = (share0_ref3.cc_name)::text) AND (share0_ref4.rn = (share0_ref3.rn + 1)) AND (share0_ref4.rn = (share0_ref2.rn - 1)))
                                                         ->  Shared Scan (share slice:id 9:0)  (cost=0.00..431.00 rows=1 width=142) (actual time=0.013..5.570 rows=16510 loops=1)
                                                         ->  Hash  (cost=862.00..862.00 rows=1 width=142) (actual time=10647.380..10647.380 rows=209076 loops=1)
                                                               Buckets: 65536 (originally 32768)  Batches: 2 (originally 1)  Memory Usage: 31389kB
                                                               ->  Hash Join  (cost=0.00..862.00 rows=1 width=142) (actual time=10156.494..10374.421 rows=209076 loops=1)
                                                                     Hash Cond: ((share0_ref3.i_category = share0_ref2.i_category) AND (share0_ref3.i_brand = share0_ref2.i_brand) AND ((share0_ref3.cc_name)::text = (share0_ref2.cc_name)::text))
                                                                     ->  Shared Scan (share slice:id 9:0)  (cost=0.00..431.00 rows=1 width=126) (actual time=0.009..6.887 rows=16510 loops=1)
                                                                     ->  Hash  (cost=431.00..431.00 rows=1 width=126) (actual time=10156.297..10156.298 rows=16178 loops=1)
                                                                           Buckets: 32768  Batches: 1  Memory Usage: 3144kB
                                                                           ->  Shared Scan (share slice:id 9:0)  (cost=0.00..431.00 rows=1 width=126) (actual time=10139.421..10144.473 rows=16510 loops=1)
 Planning Time: 1905.667 ms
   (slice0)    Executor memory: 330K bytes.
   (slice1)    Executor memory: 4750K bytes avg x 2 workers, 4968K bytes max (seg1).  Work_mem: 4861K bytes max.
   (slice2)    Executor memory: 4701K bytes avg x 2 workers, 4952K bytes max (seg0).  Work_mem: 4894K bytes max.
   (slice3)    Executor memory: 12428K bytes avg x 2 workers, 12428K bytes max (seg0).  Work_mem: 12375K bytes max.
 * (slice4)    Executor memory: 14021K bytes avg x 2 workers, 14021K bytes max (seg0).  Work_mem: 12493K bytes max, 221759K bytes wanted.
   (slice5)    Executor memory: 77K bytes avg x 2 workers, 77K bytes max (seg0).
   (slice6)    Executor memory: 323K bytes avg x 2 workers, 323K bytes max (seg0).  Work_mem: 257K bytes max.
   (slice7)    Executor memory: 39K bytes avg x 2 workers, 39K bytes max (seg0).
   (slice8)    Executor memory: 242K bytes (entry db).
 * (slice9)    Executor memory: 35344K bytes avg x 2 workers, 35360K bytes max (seg1).  Work_mem: 31389K bytes max, 37501K bytes wanted.
 Memory used:  128000kB
 Memory wanted:  3328681kB
 Optimizer: Pivotal Optimizer (GPORCA)
 Execution Time: 10856.507 ms
(86 rows)

Time: 12779.991 ms (00:12.780)
```

There is only one share slice in this query: one producer in slice 1 and three consumers in slice 9. However, when I turned the GUC off, the query never returned, and the processes looked like this:

```
postgres  22285  22255  0 03:03 pts/1    00:00:00 psql -p9221
postgres  22288  20912  3 03:03 ?        00:00:03 postgres:  9221, postgres tpcds [local] con150 cmd16 EXPLAIN
postgres  22294  20939  0 03:03 ?        00:00:00 postgres:  9210, postgres tpcds 172.17.0.50(60732) con150 seg0 cmd17 slice1 MPPEXEC SELECT
postgres  22295  20950  0 03:03 ?        00:00:00 postgres:  9211, postgres tpcds 172.17.0.50(36177) con150 seg1 cmd17 slice1 MPPEXEC SELECT
postgres  22306  20939  5 03:03 ?        00:00:04 postgres:  9210, postgres tpcds 172.17.0.50(60742) con150 seg0 idle
postgres  22307  20950  5 03:03 ?        00:00:04 postgres:  9211, postgres tpcds 172.17.0.50(36187) con150 seg1 idle
postgres  22310  20939 11 03:03 ?        00:00:10 postgres:  9210, postgres tpcds 172.17.0.50(60745) con150 seg0 idle
postgres  22311  20950 12 03:03 ?        00:00:11 postgres:  9211, postgres tpcds 172.17.0.50(36190) con150 seg1 idle
postgres  22314  20939  5 03:03 ?        00:00:04 postgres:  9210, postgres tpcds 172.17.0.50(60748) con150 seg0 idle
postgres  22315  20950  5 03:03 ?        00:00:04 postgres:  9211, postgres tpcds 172.17.0.50(36193) con150 seg1 idle
postgres  22318  20939  1 03:03 ?        00:00:01 postgres:  9210, postgres tpcds 172.17.0.50(60750) con150 seg0 idle
postgres  22319  20950  2 03:03 ?        00:00:01 postgres:  9211, postgres tpcds 172.17.0.50(36195) con150 seg1 idle
postgres  22322  20912  0 03:03 ?        00:00:00 postgres:  9221, postgres tpcds [local] con150 seg-1 idle
postgres  22324  20939  0 03:03 ?        00:00:00 postgres:  9210, postgres tpcds 172.17.0.50(60754) con150 seg0 idle
postgres  22325  20950  0 03:03 ?        00:00:00 postgres:  9211, postgres tpcds 172.17.0.50(36199) con150 seg1 idle
postgres  22348  20939  0 03:05 ?        00:00:00 postgres:  9210, postgres tpcds 172.17.0.50(45936) con150 seg0 idle
postgres  22349  20950  0 03:05 ?        00:00:00 postgres:  9211, postgres tpcds 172.17.0.50(49614) con150 seg1 idle
postgres  22352  20939  4 03:05 ?        00:00:00 postgres:  9210, postgres tpcds 172.17.0.50(45939) con150 seg0 idle
postgres  22353  20950  4 03:05 ?        00:00:00 postgres:  9211, postgres tpcds 172.17.0.50(49617) con150 seg1 idle
```

According to my debugging, the stack of the slice 1 processes looks like this:

```
#0  0x00007fde606f94f3 in epoll_wait () from /lib64/libc.so.6
#1  0x0000000000d2eec1 in WaitEventSetWaitBlock (set=0x87d8fe0, cur_timeout=-1, occurred_events=0x7ffce695fe00, nevents=1) at latch.c:1081
#2  0x0000000000d2ed9a in WaitEventSetWait (set=0x87d8fe0, timeout=-1, occurred_events=0x7ffce695fe00, nevents=1, wait_event_info=0) at latch.c:1033
#3  0x0000000000d5987d in ConditionVariableSleep (cv=0x7fde540890b0, wait_event_info=0) at condition_variable.c:157
#4  0x0000000000b30a61 in shareinput_writer_waitdone (ref=0x87da950, nconsumers=1) at nodeShareInputScan.c:994
#5  0x0000000000b2fe89 in ExecEndShareInputScan (node=0x88c2ec0) at nodeShareInputScan.c:522
#6  0x0000000000ad63e8 in ExecEndNode (node=0x88c2ec0) at execProcnode.c:888
#7  0x0000000000b3237b in ExecEndSequence (node=0x88c2d80) at nodeSequence.c:132
#8  0x0000000000ad623f in ExecEndNode (node=0x88c2d80) at execProcnode.c:779
#9  0x0000000000b1772e in ExecEndSort (node=0x88c2658) at nodeSort.c:365
```

That is to say, the producer is waiting for the consumers to wake it up, but they never do. Further debugging showed that a **squelch** is triggered on the *Gather Motion* node upstream of the three ShareInputScan consumer nodes. In the squelch logic of ShareInputScan, a consumer notifies the producer only when `ndone == nsharers`:

```c
/* this consumer is done with the shared scan */
local_state->ndone++;

/* only the last sharer to finish notifies the producer */
if (local_state->ndone == local_state->nsharers)
{
    shareinput_reader_notifydone(node->ref, sisc->nconsumers);
    local_state->closed = true;
}
```

`ndone` is incremented by one consumer at a time, while `nsharers` is counted during `ExecInitNode`. However, the GUC `execute_pruned_plan` determines the root node from which the executor starts calling `ExecInitNode`:

- `execute_pruned_plan` set to true: initialization starts at the root node of slice 9, so `nsharers` is 3
- `execute_pruned_plan` set to false: initialization starts at the root node of the whole plan tree, so `nsharers` is 4; `ndone == nsharers` can then never hold, because there are only three consumers and `ndone` reaches 3 at most

In my understanding, the algorithm should work no matter whether this GUC is set to true or false. So this patch adds a condition to the initialization of `nsharers`: a node is counted as a sharer only when it is a consumer node of the current slice. With that, the notification logic works in both cases; a rough sketch of the idea follows.
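
For illustration only, here is a minimal sketch of the kind of guard described above. It is not the actual patch; the `ShareInputScan` field names (`this_slice_id`, `producer_slice_id`) and the `executing_slice_id` argument are assumptions based on the description:

```c
/*
 * Hypothetical helper (not the actual patch): decide whether a
 * ShareInputScan node being initialized should be counted in
 * local_state->nsharers, so that pruned and unpruned plans agree
 * on the number of sharers.  Field names are assumed.
 */
static bool
shareinput_counts_as_sharer(ShareInputScan *sisc, int executing_slice_id)
{
    /* only consumers take part in the done-notification protocol */
    bool        is_consumer = (sisc->this_slice_id != sisc->producer_slice_id);

    /* and only the consumers that belong to the slice being executed */
    return is_consumer && sisc->this_slice_id == executing_slice_id;
}

/*
 * In ExecInitShareInputScan(), roughly:
 *
 *     if (shareinput_counts_as_sharer(sisc, executing_slice_id))
 *         local_state->nsharers++;
 */
```
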
dh-cloud pushed a commit that referenced this pull request Jul 19, 2022
In commit a0684da we made a change to use the fixed TOAST_TUPLE_TARGET
instead of the custom toast_tuple_target that is calculated from the
blocksize of the AO table when toasting AO table tuples.

This caused an issue where some tuples that are well beyond the
toast_tuple_target were not being toasted (because they are smaller than
TOAST_TUPLE_TARGET). That is fine as long as the tuples still fit into the
AO table's blocksize, but if not, an error occurs. E.g.:

  postgres=# create table t (a int, b int[]) WITH (appendonly=true, blocksize=8192);
  CREATE TABLE
  postgres=# insert into t select 1, array_agg(x) from generate_series(1, 2030) x;
  ERROR:  item too long (check #1): length 8160, maxBufferLen 8168
  CONTEXT:  Append-Only table 't'

Therefore, we should revert those changes, including:

1. Still use the toast_tuple_target for toasting 'x' and 'e' columns of AO tables.
There is also a GPDB_12_MERGE_FIXME in the original comment suggesting the use of
RelationGetToastTupleTarget for AO tables (in order to unify the logic that calculates
maxDataLen for heap and AO tuples, I suppose). But that is not currently possible,
because the calculated toast_tuple_target is stored in AppendOnlyInsertDesc and
cannot be obtained from the Relation struct, and it does not look like there is an
easy way to do that. Therefore, keep that FIXME comment removed.

2. Still use the toast_tuple_target for toasting 'm' columns of AO tables. Also restore
the FIXME comment suggesting that we should use a bigger target size for 'm' columns;
that is worth investigating further.

Commit a0684da also includes a change to use the custom toast_tuple_target for heap
tables, which should be a valid change; I think it fixed an oversight in the merge of
upstream commit c251336.
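
For illustration, here is a minimal sketch of the per-relation threshold choice described above. It is not the actual GPDB code: `RelationGetToastTupleTarget` and `TOAST_TUPLE_TARGET` are the upstream PostgreSQL names, while `RelationIsAppendOptimized` and the exact `AppendOnlyInsertDesc` field access are assumptions standing in for the real call sites:

```c
/*
 * Illustrative sketch only: choose the toast threshold for a relation.
 * Heap relations honor the toast_tuple_target reloption (cf. upstream
 * commit c251336); AO relations use the target derived from their
 * blocksize, which the commit message above says is carried in
 * AppendOnlyInsertDesc.  RelationIsAppendOptimized and the field names
 * are assumptions.
 */
static Size
toast_threshold_for(Relation rel, AppendOnlyInsertDesc aoInsertDesc)
{
    if (RelationIsAppendOptimized(rel))
        return aoInsertDesc->toast_tuple_target;   /* per-blocksize target */

    return RelationGetToastTupleTarget(rel, TOAST_TUPLE_TARGET);
}
```
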
kainwen pushed a commit that referenced this pull request Jan 16, 2023
We used to rename index-backed constraints in the EXCHANGE PARTITION
command in 6X. Now we don't. We've decided to keep that behavior
in 7X after looking into the opposing arguments:

Argument #1. It might cause problems during upgrade.
- Firstly, we won't be using the legacy syntax in the dump scripts, so we
just need to worry about the existing tables produced by EXCHANGE
PARTITION, i.e. whether or not they can be restored correctly.
- For upgrades from 6X to 7X, since those tables already have constraint
and index names matching the table names, we should be OK.
- For upgrades from 7X onward, since we implement EXCHANGE PARTITION
simply as a combination of upstream-syntax commands (see
AtExecGPExchangePartition()), pg_upgrade should be able to handle
them. We've verified that manually, and the automated test should
cover it too. In fact, this is another reason we shouldn't do our own
hacky things as part of EXCHANGE PARTITION, which might confuse
pg_upgrade.

Argument #2. It might surprise users and their existing workloads.
- The index-backed constraint names are all implicitly generated and
shouldn't be used directly in most cases.
- They are not the only names that would appear changed; the names of
normal indexes (e.g. B-tree ones) would change too. So the decision to
rename one kind should be made together with renaming all of them.

For more details, see: https://docs.google.com/document/d/1enJdKYxkk5WFRV1WoqIgLgRxCGxOqI2nglJVE_Wglec
dh-cloud pushed a commit that referenced this pull request Dec 11, 2023
…586)

My previous commit 8915cd0 caused a coredump in some pipeline jobs.
Example stack:
```
Core was generated by `postgres:  7000, ic proxy process
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x0000000000b46ec3 in pg_atomic_read_u32_impl (ptr=0x7f05a8c51104) at ../../../../src/include/port/atomics/generic.h:48
(gdb) bt
#0  0x0000000000b46ec3 in pg_atomic_read_u32_impl (ptr=0x7f05a8c51104) at ../../../../src/include/port/atomics/generic.h:48
#1  pg_atomic_read_u32 (ptr=0x7f05a8c51104) at ../../../../src/include/port/atomics.h:247
#2  LWLockAttemptLock (mode=LW_EXCLUSIVE, lock=0x7f05a8c51100) at lwlock.c:751
#3  LWLockAcquire (lock=0x7f05a8c51100, mode=mode@entry=LW_EXCLUSIVE) at lwlock.c:1188
#4  0x0000000000b32fff in ShmemInitStruct (name=name@entry=0x130e160 "", size=size@entry=4, foundPtr=foundPtr@entry=0x7ffcf94513bf) at shmem.c:412
#5  0x0000000000d6d18e in ic_proxy_server_main () at ic_proxy_main.c:545
#6  0x0000000000d6c219 in ICProxyMain (main_arg=<optimized out>) at ic_proxy_bgworker.c:36
#7  0x0000000000aa9caa in StartBackgroundWorker () at bgworker.c:955
#8  0x0000000000ab9407 in do_start_bgworker (rw=<optimized out>) at postmaster.c:6450
#9  maybe_start_bgworkers () at postmaster.c:6706
#10 0x0000000000abbc59 in ServerLoop () at postmaster.c:2095
#11 0x0000000000abd777 in PostmasterMain (argc=argc@entry=5, argv=argv@entry=0x36e3650) at postmaster.c:1633
#12 0x00000000006e4764 in main (argc=5, argv=0x36e3650) at main.c:240
(gdb) p *ptr
Cannot access memory at address 0x7f05a8c51104
```

The root cause is that I forgot to initialize the shared-memory structure in
CreateSharedMemoryAndSemaphores(). This commit fixes that.
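
For context, here is a minimal sketch of the standard PostgreSQL shared-memory initialization pattern that a fix like this would follow. Only `ShmemInitStruct` and `CreateSharedMemoryAndSemaphores()` are real upstream names; the struct, function, and member names below are hypothetical:

```c
#include "postgres.h"

#include "port/atomics.h"
#include "storage/shmem.h"

/* Hypothetical shared state for the ic-proxy (name and member invented). */
typedef struct ICProxyShmemState
{
    pg_atomic_uint32    counter;        /* placeholder member */
} ICProxyShmemState;

static ICProxyShmemState *ic_proxy_shmem = NULL;

/*
 * Create or attach to the shared structure.  For the structure to be valid
 * by the time a background worker (such as the ic-proxy process) touches it,
 * this must be called while shared memory is being set up, i.e. from
 * CreateSharedMemoryAndSemaphores().
 */
void
ICProxyShmemInit(void)
{
    bool        found;

    ic_proxy_shmem = (ICProxyShmemState *)
        ShmemInitStruct("ic_proxy shared state",
                        sizeof(ICProxyShmemState), &found);

    if (!found)
        pg_atomic_init_u32(&ic_proxy_shmem->counter, 0);
}
```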