DATABASE August 29, 2021

Real-World Performance - 13 - Large Dynamic Connection Pools - Part 1

Word count: 15k · Estimated reading time: 13 mins.

https://mp.weixin.qq.com/s/D2PUOwAE93eHJpfYXARoXA

Today we are going to talk about connecting to the database and how we choose to connect to the database. In previous YouTube videos, we've shown how developers might connect to the database and how they would use cursors, arrays, and things like that. But now we are going to cover a topic that really should be thought about by both the system administrator and the system architect: the number of connections that we make to the database and how they connect. Are those connections permanent, or are they dynamic? This is a topic that my colleagues within the Real World Performance group have been dealing with for probably the best part of 20 years.

And basically it's done wrong in many cases, usually because of over-optimistic views of how the database can just keep on absorbing more and more connections. The reality of the situation is that there are two things at stake here. One, the database: you cannot add an infinite number of connections to it. And the second thing is that the number of connections to the database should ideally be static; it should be the same number all the time.

Unfortunately, today's applications tend to get built in middle tiers, and those middle tiers tend to specify dynamic connection pools. Very often it's a minimum number and a maximum number, and people are under the illusion that the application servers will simply keep on creating connections as required and potentially cut them back when the load goes away. Well, in fact this is actually the worst thing possible to do.
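
To make the shape of that configuration concrete, here is a minimal sketch of the kind of min/max dynamic pool being described. The video does not name a pool library, so HikariCP is used purely as an illustration, with hypothetical connection details; the sizes mirror the 144 starting size and 6,144 cap used later in the demo.

```java
import com.zaxxer.hikari.HikariConfig;
import com.zaxxer.hikari.HikariDataSource;

public class DynamicPoolSketch {
    public static void main(String[] args) {
        HikariConfig config = new HikariConfig();
        // Hypothetical connection details; substitute your own.
        config.setJdbcUrl("jdbc:oracle:thin:@//dbhost:1521/pdb1");
        config.setUsername("app_user");
        config.setPassword("app_password");

        // The "gun ready to go off": a small floor and a huge ceiling.
        // These numbers mirror the demo described later (144 initial,
        // allowed to grow toward a cap of 6,144 connections).
        config.setMinimumIdle(144);
        config.setMaximumPoolSize(6144);

        try (HikariDataSource dataSource = new HikariDataSource(config)) {
            // The application borrows connections from dataSource as usual;
            // under a load spike the pool quietly grows toward the cap.
        }

        // The talk argues for a static pool instead: set minimumIdle equal to
        // maximumPoolSize and keep the number modest relative to the database
        // server's core count.
    }
}
```

The gap between the floor and the ceiling is the whole problem: nothing in the application has to change for the pool to balloon, only the timing of requests.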

In fact, one of my colleagues often describes this sort of architecture as a "gun ready to go off." You have the potential to rapidly create a large number of connections, and with that, the ability to destabilize the entire database environment. For this reason we got quite tired of just talking to people about it, so we built demos and graphical screens to actually show what happens when the load comes onto the system.

To demonstrate the challenges associated with both a large connection pool and a dynamic connection pool, we are going to simulate load coming onto a system in slow motion, so we can watch and interrogate the system as the load comes up. The reasoning is that in slow motion we can actually see what is happening and watch the system fall apart in front of us, rather than it all happening in two seconds, the system just falling over, and everyone left wondering what happened. It becomes particularly important to understand what is happening and how quickly these things can run out of control.

So, we're back using the Real World Performance demo screens, and you'll notice we have three windows down the middle reporting what we're doing. We have a very stable system running at this point in time. You can see that the response time from the application is two milliseconds, and we're running at quite a modest transaction rate of just under 2,000 transactions a second. But let's talk about what our user community is doing. What we actually have is nearly 15,000 individual Java threads, all effectively playing cards online, and each thread is pressing the button once every 10 seconds.
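
As a back-of-the-envelope check (using the numbers quoted above, not figures taken from the demo screens), the offered load of a closed workload like this is roughly the number of threads divided by the think time between clicks:

```java
public class OfferedLoadSketch {
    public static void main(String[] args) {
        // Closed-workload approximation: throughput ~= threads / (thinkTime + responseTime).
        int threads = 15_000;           // Java threads "playing cards" in the demo
        double thinkTimeSec = 10.0;     // one button press per thread every 10 seconds
        double responseTimeSec = 0.002; // the 2 ms response time reported on screen

        double throughput = threads / (thinkTimeSec + responseTimeSec);
        System.out.printf("Approximate transaction rate: %.0f tx/sec%n", throughput);
        // Roughly 1,500 tx/sec, in the same ballpark as the rate shown on screen.
        // Halving the think time (10 s, 5 s, 2.5 s, ...) roughly doubles the
        // offered load at each step, which is exactly what the demo does next.
    }
}
```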

What we've done is create a connection pool that can literally go out of control and destabilize the system. The initial connection pool was 144 connections to the database, but we're going to let it grow to 6,000 connections when the load comes on. Now, if we look at the system at the moment, you can see we're barely recording anything on the AWR or ASH reports, and the CPU utilization is extremely low.

The other thing to notice is that even though the utilization on the system is extremely low, we've already extended the connection pool to over 200 connections to the database, up from 144, simply because the application servers occasionally found that all of their connections were busy and we allowed the pool to grow.

With that in mind, I'm going to force it to grow even more, and we're going to watch the system as it grows. I'm going to start doubling up the workload: instead of everyone pressing the button every 10 seconds, we'll make it every five seconds. And I'll continue to double it up until things get quite interesting and we've actually started to make the system yield.

At this point, with the workload just doubled, you can see the transaction rate doubling, the response time hasn't been impacted at all, and we're starting to see a little bit of climb in the CPU utilization, which is still barely seven or eight percent of the system. So let's double it up another step.

And again, here we go, we've doubled up the workload again. At this point we can see very linear behavior on the system: the transaction rate literally doubles with each step, we're barely impacting response time, and the CPU is coming up proportionally as the workload comes up.

So at this point people will start to think we've built a linear, scalable system. The other thing to note is that the size of the connection pool is now approaching nearly 500. We are really about four, nearly five, times bigger than the initial connection pool, yet we've hardly stretched the machine at all. So there are a lot of connections that aren't actually doing anything: 465 of them are sitting idle at this point in time, and we've only got 16 active. And this is on a machine with twenty-four cores, so you can see we're not even fully utilizing the machine in terms of active sessions per core.

But you can see, again, that the connection count keeps increasing, just because of occasional timing glitches: in one of the 32 JVMs driving this application, all of the connections happen to be busy at one point in time, and so we just pop another connection.
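
That "pop another connection" behavior is, in essence, the following logic. This is a deliberately simplified, hypothetical sketch, not any real pool implementation, but it shows why any instant when every connection is briefly busy costs a full database logon, the expensive operation mentioned later.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Stripped-down dynamic pool: grows whenever a borrower finds nothing idle.
class TinyDynamicPool {
    private final BlockingQueue<Connection> idle = new LinkedBlockingQueue<>();
    private final AtomicInteger total = new AtomicInteger();
    private final int maxPoolSize;
    private final String url, user, password;

    TinyDynamicPool(int maxPoolSize, String url, String user, String password) {
        this.maxPoolSize = maxPoolSize;
        this.url = url;
        this.user = user;
        this.password = password;
    }

    Connection borrow(long timeoutMillis) throws SQLException, InterruptedException {
        Connection c = idle.poll(timeoutMillis, TimeUnit.MILLISECONDS);
        if (c != null) {
            return c;                      // a free connection was available
        }
        if (total.incrementAndGet() <= maxPoolSize) {
            // Everyone was busy for an instant, so "pop" another connection.
            // Each growth event pays the full cost of a database logon.
            return DriverManager.getConnection(url, user, password);
        }
        total.decrementAndGet();
        throw new SQLException("pool exhausted at " + maxPoolSize + " connections");
    }

    void release(Connection c) {
        idle.offer(c);                     // nothing here ever shrinks the pool back
    }
}
```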

So let's double up the workload a little more, and then we should start to see load come onto the system a bit more seriously.

So we're doubling it up now, and we're starting to push the transaction rate much higher; we're getting up to 15,000 to 16,000 transactions per second. The thing to note here is that you're starting to see a much more noticeable increase in the CPU on the system, and we're approaching 25 percent utilization. But again, there's nothing to see in the database on the ASH report, and not much to see in the user response time. This is the point where very often we see system architects and administrators congratulating each other because they think, "wow, we've got this scalable system." But it's only scalable to 25 percent utilization of the database server, and that's hardly the best use of hardware resources and software licenses. So let's double up again and see where we get to; hopefully we're now going to use 50 percent of the machine.

So we're doubling up again, and we can see, again, a very immediate response. The transaction rate is doubling, the CPU workload is doubling, and we're nearly at 50 percent utilization on the system, with barely an increase in response time. Again, a very stable system. The thing to note now is that we really are starting to grow a lot more connections to the database. The middleware is running at a higher rate, and the statistical chance that all the connections in one JVM are in use at the same moment keeps increasing. Literally as I've been speaking, we've seen the connection count pop to nearly 2,000; we're at 1,900 as I speak, up from the hundreds a moment ago, and within a second or so I imagine we'll be greater than 2,000. But still, at this point the system is very predictable, very linear, and everyone seems happy. And this is at about fifty percent utilization on the system.

Now, at fifty percent utilization, the statistical chance of getting scheduled on a CPU immediately is roughly one in two. As I crank the utilization up further, the chance of getting scheduled immediately keeps dropping: we don't get scheduled every time, and we start seeing more variation in the response time. As the response time varies, we're more likely to have application servers waiting, and they're more likely to initiate more connections. So as we crank up the utilization, we get more variable response times, which forces the app servers to throw more connections at the database, which in turn produces even more unpredictable response times, and we get ourselves into what is known as a race condition. We very often call this a connection storm: we're initiating a storm of connection requests but actually getting nothing back. And remember, as shown in previous YouTube videos, logging on and off the database is quite an expensive operation, so this just compounds the effect, and we go into a spiral of degrading performance.
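
To get a feel for why response time blows up non-linearly as utilization climbs, here is a small back-of-the-envelope sketch using the textbook M/M/1 waiting-time formula. It is a deliberate simplification of a 24-core database server, not a model of the actual demo, and the 2 ms service time simply echoes the response time seen at the start.

```java
public class UtilizationVsResponseTime {
    public static void main(String[] args) {
        // M/M/1 approximation: responseTime = serviceTime / (1 - utilization).
        double serviceTimeMs = 2.0;
        double[] utilizations = {0.25, 0.50, 0.75, 0.90, 0.95, 0.99};

        for (double rho : utilizations) {
            double responseMs = serviceTimeMs / (1.0 - rho);
            System.out.printf("utilization %.0f%% -> ~%.1f ms response time%n",
                    rho * 100, responseMs);
        }
        // Past roughly 75% utilization the response time climbs steeply; each
        // slow response tempts the middle tier to open yet more connections,
        // and the connection storm described above feeds on itself.
    }
}
```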

So I'm not going to double the workload at this point, because as you can see it's been quite well behaved so far, and we've already got over 2,000 connections to the database. Instead, we're going to increase it in smaller increments: rather than doubling, I'm just going to take it to four hundred milliseconds, that is, 0.4 of a second between each mouse click.

So, some fairly serious card players. We're just going to bump the workload up a little bit more. And you can see that suddenly this small increment in work has caused quite an impact on the system. For roughly a 50 percent increase in the workload, we saw almost twice as much CPU: a disproportionate response in CPU utilization, which is getting extremely busy. We're not seeing a proportional increase in the transaction rate; in fact the transaction rate is starting to become quite choppy, we're starting to see queuing in the middle-tier machine, and we're seeing extended database response times.

And for the first time, we're actually starting to see wait events inside the database through the ASH reports. Some of you may have seen these wait events before: waits on latches and enqueues, things like row cache objects, latch free, enq: TX - index contention, and buffer busy waits. These are quite common wait events once you get into the situation where you're starting to oversubscribe the machine and the CPU is getting extremely busy. We're seeing more and more wait events, and the CPU is getting busier and busier.
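
If you want to see this kind of wait-event picture for yourself, one quick way (assuming you have the Diagnostics Pack license that ASH requires and the privileges to read the V$ views) is to summarize recent samples from v$active_session_history. A minimal JDBC sketch, with hypothetical connection details:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class AshTopWaits {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:oracle:thin:@//dbhost:1521/pdb1", "perf_user", "perf_password");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                 "SELECT event, COUNT(*) AS samples " +
                 "FROM v$active_session_history " +
                 "WHERE sample_time > SYSTIMESTAMP - INTERVAL '5' MINUTE " +
                 "AND session_state = 'WAITING' " +
                 "GROUP BY event ORDER BY samples DESC")) {
            // ASH samples active sessions once a second, so each count roughly
            // approximates seconds spent waiting on that event over the window.
            while (rs.next()) {
                System.out.printf("%-40s %d%n", rs.getString("event"), rs.getLong("samples"));
            }
        }
    }
}
```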

And you can see now that the database connection pool that was at 2,000 is approaching 6,000. Remember, we have a limit here of 6,144, so very quickly we're getting to the situation where we're approaching our connection limit.

And you'll notice that the number of active connections on the database is massively oversubscribed compared to the number of cores on the system: it's now telling me I have 6,000 active requests into the database at this point in time. As a result, you can see we've completely oversubscribed the machine. The response time has gone out of control, the wait events inside the database, enq: TX - index contention and related contention events, have gone completely out of control, and the throughput has dropped right off. In fact, even the monitoring is starting to become unpredictable, and the CPU has actually dropped off, because this machine has now rendered itself completely unstable.

And at this point we've effectively simulated a failure in the system. Sometimes it starts to recover, and you can see the trajectory occasionally bounce back erratically, but what we've really done is hang the system. Trying to use a keyboard becomes impossible, so getting debugging information out of a system in this state becomes next to impossible.
