Just to give my two cents exposing a problem that I had a week ago when one of ours databases repeatedly entered in recovery mode because of a query.
Let’s put some emotion and start from the beginning.
Suddenly everyone starts to complain that the ERP is down, the company is losing money, the end of the world is near and some heads gonna roll.
With 15 people around you, you calmly check space left, load average, database connections an so on.
After some researching at the log, you could spot a query looking for an invalid memory address. After some debugging, that’s what I’ve found:
To avoid rewriting on a function, it was used a native function from PostgreSQL to get the IP and port from the database so the move from the test server to the production server would be easier. The functions are inet_server_addr() and inet_server_port().
It’s not a problem when your test and/or dev environments are equal to the production. As we don’t have to treat too many connections in the test environment, we do not use PgBouncer, a very light connection pooler.
Because of a very specific demand, we need to connect out of a transaction to get some values inside a running function, so we use a method from a C++ library to connect to the database, get a value and update it, returning to the original function an integer.
When that function was called in the production environment, PGBouncer returns nothing or some trash, and the C++ function, for some reason, tries to reach an invalid memory address and the database, to preserve itself, enters into recovery mode.
To fix the whole problem, we have a lot of solutions:
- Rewrite the C++ function to validate the parameters, which will not resolve the PgBouncer thing;
- Put fixed values for IP and port at the functions;
- Create a user for this specific call and connect to the main port, instead of the PgBouncer port, which is by far the worst solution;
- Diminishing this type of problems by creating dev/test environments equal the production.
Not a difficult problem to fix, neither to spot, but indeed a fatal error.