[vworld-tech] Java scaling
ceo
ceo at grexengine.com
Sat Apr 30 19:25:08 PDT 2005
Not intended as any pro-java evangelism post; just trying to give Alex
some advice and pointers for further exploration.
Alex Chacha wrote:
> ceo wrote:
>
>> ...but now I *know* you're doing something very, very wrong, or you're
>> distributed raytracer), because java scales fairly effortlessly to 500
>> clients these days (i.e. use the sun tutorials aimed at newcomers to
>
>
> Let me give you two scenarios where I have direct experience with java.
>
> 1. The MMORPG I am working on part time (labor of love more than
...
> magnitude difference). This was done using Sun's JVM 1.4.1_06
Right. 1.4.1 was never a usable release for networking and was retired
?several years ago (2003)? IIRC. So, right from the start, you're using
an unsupported version of java - seriously, there are massive bugs in
that version that make it practically impossible to do various things,
and there are no workarounds (FYI IIRC it was mistakes in the
interfacing to low-level platform-specific networking libs on two of the
major platforms; for instance, a confused approach to using IOCP on
windows. But only IIRC...it was a long time ago).
> out which would work best). There are 10 worker threads and 1
> listener/queue thread. The design works great, but the quest and
Off the top of my head...I'm sure you're not doing this (it would be
crazy), but you're not running a single I/O thread using old I/O are you?
Assuming you're using NIO, I could make some stab-in-the-dark guesses.
If upgrading to the current version (1.4.2_XX - currently _07 IIRC)
doesn't fix it, then...I'd take a look at your synchronization design.
Sun's API comes with very sparse docs, but it does explicitly tell you
that "high quality implementations will block for a very small amount of
time, if at all, in this scenario, but simpler implementations may block
for a long time or indefinitely". They wrote an API spec with holes you
could drive a bus through - and then implemented their JVM with
ultra-low-quality.
(I believe this is known as making the free version of your product crap
so that more people buy the expensive one? Tongue firmly in cheek)
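
For anyone following along: with NIO, one thread can service many
sockets through a Selector, which old blocking I/O cannot do without a
thread per connection. Below is a minimal sketch of that arrangement,
assuming a trivial echo protocol. It is not the Gems code and not Alex's
server; the class name NioEchoSketch, the loopback bind, the 1 KB buffer
and the roundTrip helper are all hypothetical, and the write loop is a
deliberate simplification.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.nio.ByteBuffer;
import java.nio.channels.ClosedSelectorException;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;
import java.util.Iterator;

// Hypothetical sketch: one thread multiplexing every connection via a Selector.
// All names here are illustrative and do not come from the original thread.
final class NioEchoSketch implements Runnable {
    private final Selector selector;
    private final ServerSocketChannel server;

    NioEchoSketch(int port) throws IOException {
        selector = Selector.open();
        server = ServerSocketChannel.open();
        server.configureBlocking(false); // required before registering with a Selector
        server.socket().bind(new InetSocketAddress("127.0.0.1", port));
        server.register(selector, SelectionKey.OP_ACCEPT);
    }

    int port() {
        return server.socket().getLocalPort();
    }

    public void run() {
        try {
            while (selector.isOpen()) {
                selector.select(); // sleeps until at least one channel is ready
                Iterator<SelectionKey> it = selector.selectedKeys().iterator();
                while (it.hasNext()) {
                    SelectionKey key = it.next();
                    it.remove(); // the selected-key set is never cleared for us
                    if (!key.isValid()) continue;
                    if (key.isAcceptable()) accept();
                    else if (key.isReadable()) echo(key);
                }
            }
        } catch (ClosedSelectorException stopped) {
            // close() was called from another thread; fall through and exit
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    private void accept() throws IOException {
        SocketChannel client = server.accept();
        if (client == null) return; // nothing pending after all
        client.configureBlocking(false);
        // One buffer per connection, carried as the key's attachment.
        client.register(selector, SelectionKey.OP_READ, ByteBuffer.allocate(1024));
    }

    private void echo(SelectionKey key) {
        SocketChannel client = (SocketChannel) key.channel();
        ByteBuffer buf = (ByteBuffer) key.attachment();
        try {
            if (client.read(buf) < 0) { // peer closed the connection
                client.close();         // closing the channel cancels its key
                return;
            }
            buf.flip();
            // Simplification: spins if the send buffer is full. A real server
            // would register OP_WRITE and finish the write later.
            while (buf.hasRemaining()) client.write(buf);
            buf.clear();
        } catch (IOException e) {
            key.cancel();
            try { client.close(); } catch (IOException ignored) { }
        }
    }

    void close() throws IOException {
        selector.close(); // also wakes the thread blocked in select()
        server.close();
    }

    // Blocking test client: sends msg and reads back the same number of bytes.
    static String roundTrip(int port, String msg) throws IOException {
        Socket s = new Socket("127.0.0.1", port);
        try {
            byte[] out = msg.getBytes("US-ASCII");
            s.getOutputStream().write(out);
            byte[] in = new byte[out.length];
            int got = 0;
            while (got < in.length) {
                int n = s.getInputStream().read(in, got, in.length - got);
                if (n < 0) break;
                got += n;
            }
            return new String(in, 0, got, "US-ASCII");
        } finally {
            s.close();
        }
    }

    public static void main(String[] args) throws IOException {
        NioEchoSketch sketch = new NioEchoSketch(0); // port 0 = any free port
        new Thread(sketch, "nio-io").start();
        System.out.println(roundTrip(sketch.port(), "ping"));
        sketch.close();
    }
}
```

The point of the shape is that the only place the I/O thread ever blocks
is select(). Anything slow (game logic, DB calls) would be handed to
worker threads rather than run inside the loop.
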
> tradeskill engine is still using simple data for testing, so I am not
> yet CPU bound. I tried loading up the machine in this scenario and hit
> a wall when I had 50 remote clients. CPU utilization was about 70% but
First suggestion: go read my section in Game Programming Gems 4
(Thousands of clients per server...although the copy of the source on
the CD isn't usable). In some ways, I regret publishing it there,
because it means I can't (legally) post it anywhere on the web :(. I'm
not banging my drum here, but it highlights a load of separate issues,
and you may just glance at one and go "ah - that gives me an idea of
something I didn't mean to be doing".
> the JVM was spending a lot of time in context switches. I didn't really
Shrug. With your invalid JVM I'm not even sure it's worth commenting on
this, other than to say "this doesn't happen in normal usage". If you
reproduce that on a real JVM then you'll need to give a rundown of the
pseudo-code algo you're using - there may be something pathologically
bad you're doing with your arrangement of code within your method to
cause real badness here.
Note: not necessarily "wrong", just unfortunately bad as far as the JVM
goes.
> 2. Where I am currently doing time of hard labor, we get about a billion
> hits a day. With C++ backend the whole thing ran with 40 machines, we
We're building a billion-hit-per-day-at-peak system at the moment. It's
specced to handle a little more than a billion, but we're only expecting
it to run at that rate for a few days a week, and only 6 hours at a
time. So whilst I'm not in the same position at the moment, I'm working
within similar parameters.
> ported to java and pushing 1500 machines to keep the same response times
Off the top of my head, knowing practically nothing about what you do
:P, 40 to 1500 means one of two things:
1. You need to fire your System Architects for gross incompetence
2. Someone decided that it was cheaper for you to buy and maintain
1500, given the context, than to go with 40
Whilst it sounds drastic, point number 2 is fairly common these days.
Why waste manpower making a system fast and efficient if it's cheaper to
be slow and inefficient and throw cheap hardware and low-end sysadmins
at the problem? Not for me to judge...
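
To put rough numbers on the 40-versus-1500 comparison, here is a
back-of-envelope sketch. It assumes the billion hits are spread evenly
over 24 hours, which real traffic never is, so treat the results as
averages only; the class and method names are hypothetical.

```java
// Hypothetical back-of-envelope helper; assumes perfectly uniform load.
final class HitRateSketch {
    static final long SECONDS_PER_DAY = 24L * 60 * 60; // 86,400

    // Average requests per second that each machine must serve.
    static double perMachineRate(long hitsPerDay, int machines) {
        return (double) hitsPerDay / SECONDS_PER_DAY / machines;
    }

    public static void main(String[] args) {
        long hitsPerDay = 1000000000L; // "about a billion hits a day"
        System.out.println("40 machines:   " + perMachineRate(hitsPerDay, 40) + " hits/s each");
        System.out.println("1500 machines: " + perMachineRate(hitsPerDay, 1500) + " hits/s each");
    }
}
```

Under that assumption the C++ setup averages roughly 289 hits per second
per machine and the java one under 8, which is the gap the two
explanations above are trying to account for.
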
> (and this is after EJB was dumped due to dismal performance). Now
> before you say something is really wrong, we have almost 1000 people
> working full time at optimizing, tuning, debugging and coding (not
Shrug. With a large staff, I wouldn't be scared of having that many
machines if it was going to save me some hassle in some other way not
mentioned so far. So...I wouldn't say it's "wrong" per se, but I would
be very surprised if you couldn't get it down to 100 fairly easily
(conceptually; I'm guessing that in practice you've got that classic
problem that your data is in the "wrong" structure and just transforming
it to the right structure to fit the easy efficient solution would be
very costly).
> a way of life, but with it we accepted that the performance will be
> awful give the complexity of the system (lots of database interaction,
> messaging, logging, and tons of external services, etc). I can see a
To put it another way, I could walk into a 50k-staff corporation as CTO
and quickly design and roll out a system that replaced their office
systems and service systems and was horribly slow and inefficient but on
paper looked fine. It's easy to do stuff like buy-in to CORBA and assume
that "it works" means "it works as fast as we want it to", or buy-in to
Sun's J2EE because "Sun has lots of evidence of other bigger rollouts
working fast".
I'm not criticising, just pointing out how easy it is to do a complex
middleware system with Java that is awfully slow - largely because the
DEFAULT setup *wasn't aimed at people who want raw speed*. This isn't a
secret, BUT an awful lot of people whose training or experience in J2EE
isn't quite sufficient assume it was built for speed (or, at least, for
their expectation of speed), and get burnt. Badly.
> Now for even more issues to note with Java:
> 1. Client implementation (unless you mean telnet like text emulation) is
> going to be very very tough. The UI parts change between versions and
No, they don't. Which means I must be completely misunderstanding you :).
> something written for v1.3 will not look right with v1.4 and v1.5 (and
It will, in fact, look identical !?! I'm sure I'm just being stupid in
misunderstanding you, so please expand on what you mean and I should
spot it.
> vice versa). To add to this, I have ran into more versions of java
> installed than I cared to note. Trying to enforce a version on the
> client side is also a messy endeavor. After trying a ver revisions on
There are extremely effective (and free) solutions to this problem that
work very well. Webstart (part of core java), for instance, is a very
very good way of handling *all* this automatically. The only downside is
that Sun's corporate arm couldn't find its arse with both hands
sometimes, and fails to promote the best of its own new technology,
such as JWS.
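
Where Webstart isn't an option, a cruder fallback is for the client to
check the JVM version itself at startup and refuse to run on anything
older than it was tested with. This is a swapped-in technique, not what
Webstart does internally; the class name, the helper methods and the
minimum of 4 (i.e. 1.4) are assumptions for illustration. Only the
"java.specification.version" system property is standard.

```java
// Hypothetical startup guard; names and the minimum version are illustrative.
final class JavaVersionGuard {
    // Maps "1.4" -> 4, "1.8" -> 8, "11" -> 11.
    static int majorVersion(String spec) {
        String[] parts = spec.split("\\.");
        if (parts[0].equals("1") && parts.length > 1) {
            return Integer.parseInt(parts[1]); // old "1.x" numbering scheme
        }
        return Integer.parseInt(parts[0]);     // newer single-number scheme
    }

    static boolean isSupported(String spec, int minimumMajor) {
        return majorVersion(spec) >= minimumMajor;
    }

    public static void main(String[] args) {
        String spec = System.getProperty("java.specification.version");
        if (!isSupported(spec, 4)) {
            System.err.println("This client needs Java 1.4 or later, found " + spec);
            System.exit(1);
        }
        System.out.println("JVM version OK: " + spec);
    }
}
```
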
> 2. Not all JVMs are implemented the same. Recently I ran into a nasty
Which is why nearly everyone uses Sun only, unless they choose to commit
to supporting IBM too because it has in the past, for a year or two at a
time, been much more cutting-edge in terms of performance.
Getting the same codebase across the three major client platforms
(windows, linux, OS X) tends to push people towards Sun.
> slightly different results, a non-deterministic JVM was a surprise. I
Sounds horrible. This isn't C, and non-deterministic JVM activity is
extremely rare (has something to do with the arduous certification
process, I believe).
That said, I've found a couple of instances of it. In all cases,
however, it turned out not really to be the JVM, but the OS. For
instance, buggy graphics card drivers running under Microsoft's
not-quite-as-robust-as-it-should-be DirectX failing to perform an op
(like allocating RAM!) yet returning a success code, leading to
catastrophic or bizarre behaviour much later in execution.
Obviously, it's up to the JVM vendor to implement workarounds for such
bugs, so they're still responsible. But it's nothing like the
gcc/msvc/etc weirdnesses that used to go on.
> What type of work do you use java for on the back end?
The central part that's serving those 1 billion hits per day will be
java (it's not live yet. Ha. Famous last words!). Obviously, it's a lot
more complex than that, and deep at the bottom sit MySQL DBs (which are
getting close to bearable performance these days ;)). Things are going
OK so far, and we know what we're doing - e.g. one of my staff used to
run a billion-hits-a-day site entirely in perl. There are quite a few such
sites around, if you know where to look ;).
Previously, I did a lot of work on the GrexEngine, which is (in gross
simplification) "J2EE re-designed and written from scratch as a
high-performance system, specifically aimed at online games".
At grex, I would compare performance of a game-server to the latest,
fastest, apache running largely static pages, and be happy when I was
level-pegging. Day-to-day testing loads were typically in the 200-750
simulated clients to each server range over a couple of 100Mbit switches
(with the servers running very old hardware - 0.5-Gigahertz processors
for instance).
J2EE makes a lot of assumptions to the tune of "your requests are
business traffic, hence infrequent and either very very light or very
very heavy". The GE assumes all requests are very frequent and
moderately heavy.
Adam M