BryanMcPhail.com

Professional software development, amateur BMW tinkering, old arcade game stuff

By

Technos Renegade arcade PCB repair

Renegade is a JAMMA pcb and the predecessor to Double Dragon.  Initially it seemed the vertical sync was broken on this board – the image rolled vertically very fast.  I traced the sync line from the JAMMA connector back on the top board and found it goes to the bottom (video) board.  After cleaning up a lot of dirt and dust, a massive gouge in the bottom board became clear!  Must have been a fairly heavy impact as you can see two TTL pins are sheared clean off.

IMG_0532 IMG_0533 IMG_0528

Bridged the broken traces with wire and sync was restored but the image was cut into three pieces.

IMG_0537 IMG_0536

Re-checked the repair with a logic probe and found no activity on one of the repaired traces.  There was actually a through-hole to the other side of the pcb in the damaged area that needed to be soldered back onto the trace.  Then everything worked 100%.

IMG_0546 IMG_0543 IMG_0542

 

 


Data East Hoops 95 pcb non-repair

This is the Data East MLC package – which is a two layer pcb inside a protective plastic box.  Unfortunately this one seems 100% dead – no video or sound output at all.  Components are actually surface mounted to all 4 surfaces on the two layers – the main CPU (an encrypted ARM) actually sits on an inside surface so it’s hard to diagnose directly.

Using a logic probe with the game powered on shows that the data and address lines on the program EPROMs are pulsing – so the CPU is definitely trying to do something.  All the graphics hardware (ROMs, custom chip) probes as completely dead – that doesn’t prove it is dead, though – it may be that the CPU is failing for some reason and not instructing the graphics customs to start up.  My immediate theory would be that one of the main RAM chips for the CPU has failed – these are 4 Winbond chips on the main board.  The ARM is a 32 bit chip and these 8 bit chips run in parallel, so a failure in any one of them would cause the CPU program to immediately fail.

Hoops isn’t that great a game, so I’ve no plans to probe further – this can wait until I find another MLC game and swap the top & bottom boards and see what happens.

IMG_0616 IMG_0618 IMG_0617 IMG_0619


Namco Sky Kid pcb

Bought as a non-worker for parts – actually it works 100%!  The non-obvious catch is that whoever made the JAMMA adaptor wired it upside down!?  The tell-tale sign is that the ‘double’ +5V and GND traces are on the right instead of the left.

IMG_0613 IMG_0608 IMG_0607 IMG_0605 IMG_0604


Super Chase / Hantarex Polo

A reasonably rare/overlooked game from Taito in 1992. This was completely dead when I got it – no lights, sounds, picture. The power supply was the first problem as it wasn’t able to supply +5V to the pcb. In fact I had to try 3 (used) power supplies until I found one that could give a consistent +5. This pcb draws a lot more current than a lot of older titles (probably as it uses a 68020 CPU and two sub 68000 cpus, plus a lot of graphics and sound ic’s that were cutting edge at the time) so some power supplies can’t keep up and voltage drops.

IMG_0234 IMG_0251

This made two LEDs on the light driver board illuminate, but still nothing else. The sound board is quite unusual in that it expects +13V as well as +5V and +12V. Without the +13V line connected the sound amps don’t work at all – however putting 12V there made them work well enough that I could hear game sounds – so pcb confirmed as running! [I should mention that Super Chase isn't JAMMA - so I couldn't just test it in another cabinet].

The monitor remained completely dead – it’s a Hantarex Polo 25″ standard resolution. No signs of physical problems (cold solder joints, blown fuses, burnt areas, broken components). I hate working on high voltage stuff, so rather than debug anything I decided just to shotgun replace the flyback, all capacitors, and the HOT (horizontal output transistor). On a 23 year old monitor it’s a good bet the capacitors need replacing, and the flyback may have failed. Internet repair logs on this monitor suggest bad flybacks can kill the HOT, so as it’s only $5 I may as well replace it too.

IMG_0250 IMG_0249

And it worked! It’s quite unusual for a cap kit to bring a dead monitor back to life, but the Polo has a built in power supply (no isolation transformer needed) so bad caps there were probably the primary reason for not turning on. Monitor looks good as new now.

A cool thing about Super Chase is the flashing lights – these are just 40W incandescent bulbs, but both were blown – replaced them, and replaced a blown fuse on the driver board and all was good there. The driver board is quite a simple thing – it takes two 5V lines from the game pcb as input, a 110V mains source, and outputs two 110V lines to the bulbs. You can see the board is designed for 4 lights, but only 2 channels are populated.

IMG_0264 IMG_0262

(Main marquee light still not fixed in picture below)

IMG_0268 IMG_0278 IMG_0275

 


Operation Wolf

Recently RetroGamer ran a feature on Operation Wolf (Taito, 1987) that mentioned the emulation in M.A.M.E. and the ‘arcade perfect’ Taito Legends (PS2, 2005) were not quite correct – missing cutscenes, a missing boss, some other details.  Unfortunately, that’s true – the reason is Operation Wolf contains copy protection that has never been properly emulated.  I was the programmer who implemented the protection emulation on Taito Legends (PS2) and then M.A.M.E. and it’s actually based on a bootleg (pirate) version of the game.  The problem was that the original ‘cracker’ of the game didn’t get everything right.  The protection device is actually an 8 bit microcontroller of some sort that maps into 256 bytes of the main CPU address space.  It runs a private embedded program that cannot be read out, so what exactly the controller does is somewhat of a mystery.  Memory mapping is a very effective copy protection when done right, as the main CPU can just write bytes into seemingly random locations, and then expect certain results & logic back at a later time without any obvious link as to what inputs caused what outputs.  The bootleg sheds light on the setup though – one example is the end of level detection – the main CPU constantly writes the number of enemies, tanks & helicopters remaining in the level into shared memory.  Only when all of these are zero does the protection chip write a byte that signals ‘level is complete’.  The protection is effective because without a reference it’s very hard to know the internal logic of the controller.
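To make the end-of-level handshake concrete, here is a minimal Python sketch of the idea, not the real c-chip program (which cannot be read out). The byte offsets and the class name are made up for illustration; only the behaviour (three counters in, one ‘level complete’ byte out) comes from the description above.

```python
# Hypothetical offsets into the 256-byte shared window; the real
# c-chip layout is not given in the post.
ENEMIES, TANKS, HELICOPTERS, LEVEL_COMPLETE = 0x10, 0x11, 0x12, 0x13


class ProtectionChipSketch:
    """Toy model of a microcontroller mapped into 256 bytes of CPU space."""

    def __init__(self):
        self.shared = bytearray(256)

    def cpu_write(self, offset, value):
        # The main CPU only stores bytes; it never calls the chip directly.
        self.shared[offset] = value & 0xFF

    def cpu_read(self, offset):
        return self.shared[offset]

    def tick(self):
        # The chip's private program polls the counters on its own schedule
        # and raises the flag only when all three have reached zero.
        remaining = (
            self.shared[ENEMIES],
            self.shared[TANKS],
            self.shared[HELICOPTERS],
        )
        self.shared[LEVEL_COMPLETE] = 0 if any(remaining) else 1
```

From the main CPU's side there is no visible link between the writes and the flag: it stores three counters in one place and later finds a byte set somewhere else, which is what makes this style of protection hard to reverse without a reference.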

20160325_185948

Prototype

By chance, a prototype Japanese version of Operation Wolf was recently discovered – this was almost certainly a test unit placed on location for player feedback before the game was finished, and before the copy protection was added.  It’s fairly close to the final game though, and does contain the missing scenes and boss, so in theory it should be possible to compare the two programs, and make a guess at what the protection logic is doing compared to the unprotected reference program.  I thought some people might like to read about this process, hence this post – be warned, it’s going to get a bit technical…

0005

Initial Comparison

In an ideal world the two games would disassemble to near identical programs, with the only difference being a bit of protection logic replacing the original game logic from the prototype.  Unfortunately that’s not the case – what we can see is that there is a lot of code that is logically the same – but it’s assembled to different locations, and variables have moved around in memory.  Some of that is deliberate – with some variables moving into shared c-chip (protection) ram rather than main memory, but some of it is clearly just from sections being moved around in the original source and the assembler following suit.

First routine, one version (assembled at $541E):

00541E: 6100 0024 bsr $5444
005422: 3211 move.w (A1), D1
005424: 0241 8200 andi.w #$8200, D1
005428: 0C41 8000 cmpi.w #-$8000, D1
00542C: 660E bne $543c
00542E: 4251 clr.w (A1)
005430: 7200 moveq #$0, D1
005432: 1229 001F move.b ($1f,A1), D1
005436: 3F01 move.w D1, -(A7)
005438: 4E42 trap #$2
00543A: 548F addq.l #2, A7
00543C: D2C4 adda.w D4, A1
00543E: 51C8 FFE2 dbra D0, $5422
005442: 4E75 rts

First routine, other version (assembled at $6100, with D0 saved and restored around the trap):

006100: 6100 002C bsr $612e
006104: 3211 move.w (A1), D1
006106: 0241 8200 andi.w #$8200, D1
00610A: 0C41 8000 cmpi.w #-$8000, D1
00610E: 6616 bne $6126
006110: 48E7 8000 movem.l D0, -(A7)
006114: 4251 clr.w (A1)
006116: 7200 moveq #$0, D1
006118: 1229 001F move.b ($1f,A1), D1
00611C: 3F01 move.w D1, -(A7)
00611E: 4E42 trap #$2
006120: 548F addq.l #2, A7
006122: 4CDF 0001 movem.l (A7)+, D0
006126: D2C4 adda.w D4, A1
006128: 51C8 FFDA dbra D0, $6104
00612C: 4E75 rts

Second routine, original (level variable read from c-chip RAM at $ff036):

0119F8: 203C 4001 0306 move.l #$40010306, D0
0119FE: 3439 000F F036 move.w $ff036.l, D2 // current level
011A04: 0C02 0002 cmpi.b #$2, D2
011A08: 6700 0004 beq $11a0e
011A0C: 7000 moveq #$0, D0
011A0E: 3439 000F F04C move.w $ff04c.l, D2
011A14: 0242 00FF andi.w #$ff, D2
011A18: 08E9 0003 0000 bset #$3, ($0,A1)
011A1E: D5FC 0000 0010 adda.l #$10, A2
011A24: 6100 E5B8 bsr $ffde
011A28: 4E75 rts

Second routine, prototype (level variable read via ($b20,A5)):

00DD8C: 203C 4001 0306 move.l #$40010306, D0
00DD92: 0C6D 0002 0B20 cmpi.w #$2, ($b20,A5)
00DD98: 6700 0004 beq $dd9e
00DD9C: 7000 moveq #$0, D0
00DD9E: 142D 0C7D move.b ($c7d,A5), D2
00DDA2: 08E9 0003 0000 bset #$3, ($0,A1)
00DDA8: D5FC 0000 0010 adda.l #$10, A2
00DDAE: 6100 E834 bsr $c5e4
00DDB2: 4E75 rts

In the listings above the code functionally does the same thing, but you can see it’s assembled to different addresses.  The original code (the listing starting at $0119F8) accesses the 8 bit c-chip shared RAM for the level variable ($ffxxx), so it’s changed to byte instructions rather than word.
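One way to line up code that has moved in memory is to throw away the addresses, opcode bytes and operands and compare only the sequence of mnemonics. The sketch below does that with Python's standard difflib on an excerpt of the two versions of the first routine; it illustrates the comparison idea and is not the tooling actually used for this work. The function names are invented for the example.

```python
import difflib
import re

# Excerpts of the two versions of the first routine, copied from the listings.
VERSION_A = """\
005428: 0C41 8000 cmpi.w #-$8000, D1
00542C: 660E bne $543c
00542E: 4251 clr.w (A1)
005430: 7200 moveq #$0, D1
005432: 1229 001F move.b ($1f,A1), D1
005436: 3F01 move.w D1, -(A7)
005438: 4E42 trap #$2
00543A: 548F addq.l #2, A7
00543C: D2C4 adda.w D4, A1
"""

VERSION_B = """\
00610A: 0C41 8000 cmpi.w #-$8000, D1
00610E: 6616 bne $6126
006110: 48E7 8000 movem.l D0, -(A7)
006114: 4251 clr.w (A1)
006116: 7200 moveq #$0, D1
006118: 1229 001F move.b ($1f,A1), D1
00611C: 3F01 move.w D1, -(A7)
00611E: 4E42 trap #$2
006120: 548F addq.l #2, A7
006122: 4CDF 0001 movem.l (A7)+, D0
006126: D2C4 adda.w D4, A1
"""

# Address, then one or more upper-case hex opcode words, then the mnemonic.
LINE = re.compile(r"^[0-9A-F]{6}:\s+(?:[0-9A-F]{4}\s+)+(\S+)")


def mnemonics(listing):
    """Keep only the mnemonic of each line, dropping addresses and operands."""
    return [LINE.match(line).group(1) for line in listing.splitlines() if line.strip()]


def extra_instructions(a, b):
    """Mnemonics that appear in listing b but have no counterpart in a."""
    seq_a, seq_b = mnemonics(a), mnemonics(b)
    matcher = difflib.SequenceMatcher(None, seq_a, seq_b, autojunk=False)
    extra = []
    for tag, _i1, _i2, j1, j2 in matcher.get_opcodes():
        if tag in ("insert", "replace"):
            extra.extend(seq_b[j1:j2])
    return extra
```

On this excerpt the only difference reported is the pair of movem.l instructions that save and restore D0 around the trap; the different branch targets disappear because operands are ignored.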

Software Architecture

Operation Wolf has interesting software architecture. Unlike most games of this
era which have a simple main loop and linear code flow, Operation Wolf
implements a co-operative threading model where routines run in 68K user mode
until giving up their timeslice and a supervisor mode scheduler picks the next
thread to run. There are 32 thread slots, and each enemy in the game runs as its
own thread/object as well as a thread for coins, scrolling the level, level
specific gameplay and so on. The code is very robust when creating threads,
for example if there are no free slots, the creating thread just spins until
a slot frees up. The rest of the game just keeps on playing in the background.
Another interesting detail is that a thread can give up its timeslice for more
than 1 frame – this makes it really easy to implement timed events. The ‘WARNING’
text at the end of level 2 is handled by a thread that prints to screen, then just
waits a second before spawning the boss enemy thread.
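The threading model maps quite naturally onto generators. The following Python sketch is only an analogy for the 68K user/supervisor mechanism: a routine runs until it yields, and the yielded number is how many frames it wants to sleep. The 32 slots come from the description above; the class and function names, and the assumption of 60 frames per second, are mine.

```python
MAX_THREADS = 32  # the game has 32 thread slots


class Scheduler:
    def __init__(self):
        # Each slot holds [generator, frames_left_to_sleep] or None if free.
        self.slots = [None] * MAX_THREADS

    def spawn(self, routine):
        """Place a routine in the first free slot; False if all are busy."""
        for i, slot in enumerate(self.slots):
            if slot is None:
                self.slots[i] = [routine, 0]
                return True
        return False

    def run_frame(self):
        for i, slot in enumerate(self.slots):
            if slot is None:
                continue
            if slot[1] > 0:          # still sleeping, skip this frame
                slot[1] -= 1
                continue
            try:
                # Run until the routine gives up its timeslice.
                frames = next(slot[0])
                slot[1] = max(frames, 1) - 1
            except StopIteration:    # routine finished: free the slot
                self.slots[i] = None


def boss(log):
    log.append("boss spawned")
    yield 1


def warning_then_boss(sched, log):
    log.append("WARNING")
    yield 60                         # sleep for about a second (60 fps assumed)
    while not sched.spawn(boss(log)):
        yield 1                      # no free slot: spin, retry next frame
```

The timed event needs no counter variable or state machine of its own; the wait is simply part of the routine's control flow, and a failed spawn costs nothing more than a one-frame yield.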

0015

Each level in the game implements its own logic thread and often sub-threads -
this is the major difference between the protected game and the bootleg – the bootleg
mostly implements the parts that are generic between all levels rather than all of
the details. The biggest single area the bootleg did not implement revolves
around location 0x5f in the shared c-chip RAM. The original code sets up a thread
that just waits for this value to become non-zero. It then jumps to a set of
functions defined in a look-up table (that can then spawn further threads). There
are 10 non-null functions tied to this routine.
 
1: Enemy spawn for level 7 (first ‘Located’ cut-scene)

2: Enemy spawn for level 8 (second ‘Located’ cut-scene) – zoom in helicopters

3: Enemy spawn for level 9 (third ‘Located’ cut-scene)

4: Boss & hostage sequence for level 2

5: Enemy spawn when less than 45 enemies in level 2 (paratrooper drop-down)

6: Enemy spawn when less than 25 enemies in level 2

7: Enemy spawn when 0 men left in levels 2,4,5,6

8: Enemy spawn when 0 men left in level 3

9: Enemy spawn when 0 men left in level 1

10: Special explosion animation when level 4 (Powder Dump) is completed
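A minimal Python sketch of that dispatch, assuming the trigger byte is used directly as an index into the table and that the 68K clears it after reading (the post states neither). Only two of the ten handlers are filled in here, and they just return a description rather than spawning anything.

```python
TRIGGER_OFFSET = 0x5F  # location in shared c-chip RAM watched by the thread


def build_table():
    """Look-up table indexed by trigger value; None marks a null entry."""
    table = [None] * 11
    table[4] = lambda: "boss and hostage sequence for level 2"
    table[10] = lambda: "special explosion animation for level 4"
    return table


def poll_trigger(shared_ram, table):
    """Run the handler selected by the trigger byte, if there is one."""
    value = shared_ram[TRIGGER_OFFSET]
    if value == 0:
        return None                   # chip has not requested anything
    shared_ram[TRIGGER_OFFSET] = 0    # assumption: acknowledged by clearing
    handler = table[value] if value < len(table) else None
    return handler() if handler else None
```

A bootleg that never writes to this byte would leave the waiting thread idle forever, which matches the symptom of whole events simply not happening.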

 

The bootleg also misses some other details, for example in level 5 the c-chip
sets a flag when all men are destroyed (not vehicles) and this triggers the 68K
to scroll the screen vertically to focus on the remaining helicopter enemies.

The ‘Enemy has located you’ cut-scenes appear ‘randomly’ between levels in the
original game, but are deliberately disabled in the bootleg. The exact formula
for determining if the cut-scene appears is ‘(frameCount & levelNumber)==0’.  There are three different cut-scene levels.
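The formula is easy to try out. The Python sketch below evaluates it over a full run of 256 frame counter values to show how often the test passes for a given level number; whether the game numbers its levels from 0 or 1 is not stated, so the values fed in are purely illustrative.

```python
def cutscene_appears(frame_count, level_number):
    # The cut-scene is shown only when no bit is set in both values.
    return (frame_count & level_number) == 0


def chance(level_number, frames=256):
    """Fraction of frame counter values for which the cut-scene appears."""
    hits = sum(cutscene_appears(f, level_number) for f in range(frames))
    return hits / frames
```

Each bit set in the level number halves the chance, so a level number with one bit set passes half the time and one with two bits set passes a quarter of the time, which is why the cut-scenes feel random without needing a random number generator.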

 

0012

Source Dumps

Should you want to see what the code for an old arcade game looks like I’ve attached my annotated dumps of the original and prototype source code – it fills in some labels and some variables and the thread functions to make it easier to read!

 

_signal_end_of_level:
0053EC: 322D 0B20 move.w $CURRENT_LEVEL, D1
0053F0: 0C41 0006 cmpi.w #$6, D1
0053F4: 6614 bne $540a
0053F6: 1B7C 0001 0998 move.b #$1, ($998,A5)
0053FC: 4EB9 0000 3314 jsr $3314.l
005402: 3F3C 0020 move.w #$20, -(A7)
005406: 4E46 trap #$6 (SLEEP_THREAD_FOR_N_FRAMES)
005408: 548F addq.l #2, A7
00540A: 6100 0012 bsr $541e
00540E: 3F3C 003C move.w #$3c, -(A7)
005412: 4E46 trap #$6 (SLEEP_THREAD_FOR_N_FRAMES)
005414: 548F addq.l #2, A7
005416: 1B7C 0001 0999 move.b #$1, $END_OF_LEVEL_FLAG 
00541C: 4E45 trap #$5 (KILL_THREAD)

opwolf.dasm  opwolfp.dasm

0004 0008


Donkey Kong 3 pcb repair

Well, not really a repair as it was so simple.  Game booted consistently to the same garbage.  This usually suggests the CPU is doing something (as the garbage would be random each time if it wasn’t) but goes wrong because of RAM or other ICs.  However, replacing the socketed Z80 CPU fixed it!  Easy fixes are always good.

IMG_9428 IMG_9785 IMG_9787

I must admit I made a mistake when wiring the pcb at one point (DK3 style edge connector to old style DK block connectors) and ran monitor blue to sync and vice versa.  The monitor actually managed to sync to the blue output – and then you can see the sync represented on blue on screen.  The long blue line is really the vertical blank sync, and the short blue lines are the horizontal sync pulses – they are just all out of place of course.  So if your game looks like this – check wiring!

IMG_9425

 


Cheap method for procedural vegetation animation

SpeedTree has all kinds of great ways to animate trees, but if you need a very, very cheap (in performance terms) solution without having to mark up the assets in any way, I used a few tricks back in 2003 on the game Corvette 50th Anniversary.  Back then it was vertex shader 1.0 in straight assembly (no hlsl) for Xbox and VU code on PlayStation 2.

The technique is essentially to do a little bit of morph target blending, but to create the target on the fly, and subtly change it every frame.  You also vary how the blend itself is performed every frame.  So you just need to pass a matrix and a blend vector to your shader.

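 // Build the morph target on the fly by rotating the local-space vertex.
 float3 localSpacePosition = In.pos;
 float3 targetPosition = mul(localSpacePosition, g_morphMatrix);
 // Blend per axis: g_morphBlend.xyz sets how much morph applies on X, Y and Z.
 float3 blendedPosition = lerp(localSpacePosition, targetPosition, g_morphBlend.xyz);
 worldSpaceVertex = mul(float4(blendedPosition, 1), matWorld);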

So obviously this is very, very cheap in shader terms.  The magic is in how you construct your input parameters.

The morph matrix is an orthogonal matrix that will rotate an object around its up axis.  The up vector of that matrix is constructed with some tweakables, like this:

float3 up = normalize(float3(0.14, 4.0 + sin(vortexRate * time) * vortexScale, 0.14));

The trick comes from the fact if you normalize a non-unit vector the largest component is dominant – so by tweaking the Y value over time you can get a kind of ‘eccentric’ rotation that is still mostly around float3(0,1,0) but your eye won’t recognize this as an obvious linear rotation.  By tweaking x & z you can set a general ‘wind direction’.

The lateral and direction vectors of the matrix are formed just by advancing time around 2PI, you likely have an engine function to make this kind of matrix already:

g_morphMatrix.setFromAxisRotation(up, time * 2.0f * PI);

The blend vector controlling the morph is another tweakable, and this is the other half of the trick, because you can scale this vector so the morph is applied differently on each of the X, Y and Z axes.  This is really how you can tweak it to look natural – by applying more morph on the X & Z axes than on Y, vertices on branches that are further from the trunk will move more than the trunk itself.  So you can very cheaply simulate wind & gravity on these extremities while the trunk and inner pieces have a much softer sway.  Applying further damped sine waves to X & Z over time can simulate wind gusts, or even air drafts caused by objects flying by.

So, by setting up the morph target and the blend to suit, very cheap natural-looking motion can be achieved.
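For experimenting with the parameters off the GPU, the whole chain (up vector, rotation matrix, per-axis blend) can be mimicked in a few lines of Python. This is a rough stand-in, not the shipped shader or engine code: the matrix is built with Rodrigues' formula and applied in column-vector convention, which may rotate the opposite way to a given engine's helper, and the default rate and scale values are placeholders.

```python
import math


def normalize(v):
    length = math.sqrt(sum(c * c for c in v))
    return tuple(c / length for c in v)


def rotation_about_axis(axis, angle):
    """3x3 rotation matrix (Rodrigues' formula) about a unit-length axis."""
    x, y, z = axis
    c, s = math.cos(angle), math.sin(angle)
    t = 1.0 - c
    return (
        (t * x * x + c, t * x * y - s * z, t * x * z + s * y),
        (t * x * y + s * z, t * y * y + c, t * y * z - s * x),
        (t * x * z - s * y, t * y * z + s * x, t * z * z + c),
    )


def morphed_position(pos, time, blend, vortex_rate=1.0, vortex_scale=1.0):
    # Mostly-vertical axis whose tilt wobbles as the Y component changes.
    up = normalize((0.14, 4.0 + math.sin(vortex_rate * time) * vortex_scale, 0.14))
    m = rotation_about_axis(up, time * 2.0 * math.pi)
    # Morph target: the vertex rotated about the wobbling axis.
    target = tuple(sum(m[r][c] * pos[c] for c in range(3)) for r in range(3))
    # Per-axis lerp, so X/Z can take more of the morph than Y.
    return tuple(p + (t - p) * b for p, t, b in zip(pos, target, blend))
```

Calling morphed_position with a small blend such as (0.1, 0.02, 0.1) for points at increasing distance from the trunk shows the effect described above: the displacement grows with distance from the axis, while points near the axis barely move.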

chain3_0562

corvette

 

 


What’s in a frame?

A lot of people use Unity and Unreal rather than a bespoke renderer these days, and that’s fine, it all depends on what you need.  I thought it might be interesting to show how a graphics frame in a modern game is put together piece by piece and talk about relative performance.  This is in the bespoke Octane engine but the same principles apply to most engines.  As ever, even if you are using Unity or Unreal, understanding what’s going on ‘under the hood’ can aid performance work in particular, as well as debugging problems.  In this particular example it was important to hit 60 frames per second on a particular piece of hardware, so some of the choices below are trade-offs where visual quality was reduced in order to render the frame within 16.67ms.

Here is the frame I’ll talk about, though I’ve chosen one without any player elements.  In general terms the objects are shaded with a physically based shading model and most of the objects on screen here are also prelit with RNM lightmaps.

shot2

The first thing that happens is the shadow cascade renders.  I chose two sets of three cascades – the first set is a shadowmap that applies to dynamic, moving objects – so this cascade must contain all the static casters as well as dynamic caster objects.  The second set is for moving dynamic objects casting onto static objects – this is much cheaper as the static casters do not need to be rendered (as those shadows come from the pre-baked lightmaps).  The shadowmap method is based on variance shadow-mapping, and I use the little trick of packing 4 shadowmaps onto the one larger texture and using a transform to reference the correct cascade from the shader.  The shadows are a good example of trade-offs between performance and visual quality – I can ramp up to 8 cascades (via the atlas packing) or down to 1 or 2, and small object culling (via projected screen-size) is important to reduce the number of draw calls so that tiny objects don’t cast shadows far away.  The dynamic object shadowmap is set to a 2048 texture, and the static one 4096.  Even with this, shadows are still expensive, and this phase can take 3-4ms to render.
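The atlas trick amounts to a scale and offset applied to the cascade-local UVs. Here is a small Python sketch of that transform, assuming a square grid of equal tiles (2×2 for four shadowmaps); the actual layout and the shader-side code in Octane are not shown in this post, so treat the function names and tile ordering as illustrative.

```python
def cascade_atlas_transform(cascade_index, tiles_per_side=2):
    """Return (scale, offset_u, offset_v) for one cascade's atlas tile."""
    if not 0 <= cascade_index < tiles_per_side * tiles_per_side:
        raise ValueError("cascade does not fit in the atlas")
    scale = 1.0 / tiles_per_side
    col = cascade_index % tiles_per_side
    row = cascade_index // tiles_per_side
    return scale, col * scale, row * scale


def atlas_uv(cascade_index, u, v, tiles_per_side=2):
    """Map cascade-local UVs in [0, 1] into the shared atlas texture."""
    scale, off_u, off_v = cascade_atlas_transform(cascade_index, tiles_per_side)
    return u * scale + off_u, v * scale + off_v
```

Passing tiles_per_side=3 gives nine tiles, which is enough room for the 8-cascade setting, at the cost of each cascade getting a smaller share of the texture.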

3 shadow cascades packed to 1 atlas


Next up I render a depth pre-pass for the main scene – I don’t have a screenshot for this, but if you aren’t familiar with the technique then you render all your opaque objects with a cheap shader that just writes to the depth buffer, not any frame buffer.  Then when you render expensive shaders later, any overdrawn parts will be rejected by depth test (or early Z) and you save pixel fill-rate.  In order to really optimize this step I do extra processing in the build pipeline for the models – all opaque meshes under a node are combined into a new single mesh containing only positions (no material splits are needed if using a depth only shader, nor normals or uvs), and this mesh then gets its own vertex re-ordering to make best use of the vertex processor cache.  On average this takes 1 – 1.5ms, but it can save 5-10ms from the next stage.

For the main scene, I used a hybrid deferred approach that I suspect is quite different from other games – 90% of lighting is done in a forward render pass.  There is still a g-buffer, which stores normals, albedo, roughness, specular, depth and object IDs, but these buffers are mainly used for post-processing and some incidental lights composited onto the main scene.  The reason for this is the scene is lit by the sun so there is only a single directional light, and some of the benefits of the forward render are that I don’t have to light alpha separately and special shaders such as car paint, skin and water can be integrated without having to pack all sorts of special info into the g-buffer.  Now, of course, I’m not suggesting this is the solution for every scenario – it’s definitely not, you would never do a night-time first person shooter like this where you have a need for far more lights, it’s just one of many possible approaches to solve a particular problem within a particular performance constraint.

Normal and albedo g-buffer


The ‘opaque’ section of the render takes about 8-10ms – the shaders are massively heavy in terms of texture bandwidth.  A common shader type is what I call ‘PBR Decal’ – where you have a base albedo, roughness, specular (or metal), normal and ao map, but also a ‘decal’ albedo, roughness and spec that is blended in via an alpha channel and second UV set.  This is great for artists to blend in dirt and details over base materials in a single pass and break up repetition, but it does mean 8 texture reads.  To that are added radiosity lightmap reads for both direct light and indirect light (using the Half-Life 2 basis), plus cubemap reads for PBR specular (using pre-blurred cubemaps for roughness), plus the dynamic shadow cascades (which for most objects use a 3 pixel radius for the blur sample – so 81 shadowmap samples!).

PBR Decal texture inputs


[The 'red' textures above are actually monochrome BC4 format to save memory, they are just rendered from the red channel in the example above].

At the end of the opaque section, although obscured in this screenshot, there is an ocean simulation, the first part uses compute shaders to update the simulation, and the second actually renders to the screen.  The water is deliberately drawn last to take more advantage of early z rejection from the other opaque meshes.  Opaque objects intersecting the water are alpha composited later on using the stencil buffer to mask out the water pixels.

Water simulation


Following the opaque section, the Nvidia HBAO technique is used to provide dynamic ambient occlusion based on the depth buffer.  It’s a great effect and works well, though depending on the quality settings it can take up to 2ms.

HBAO buffer


Further deferred lights, alpha meshes, particles, etc, are then composited onto the scene as required.  This is usually a pretty cheap step.

The frame-buffer then goes through some post-processing – FXAA anti-aliasing (tuned down to 0.5ms), god rays, lens flare and streak flare filter (up to 1ms), velocity motion blur (not seen here, but around 0.5ms), and some cheap color grading.

Godray blur (mostly occluded)


 

Lens flare


Hopefully after all that the frame is complete within 16.67 milliseconds!  If not some of the quality (number of samples usually) has to be traded off for performance.
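Adding up the per-pass figures quoted above gives a feel for how tight the budget is. The Python sketch below uses only those numbers; passes without a stated cost are left out, and where the text only says ‘up to’ a value, a low estimate of zero is an assumption.

```python
FRAME_BUDGET_MS = 1000.0 / 60.0   # about 16.67 ms at 60 frames per second

# (low, high) estimates in milliseconds for each pass.
PASSES = {
    "shadow cascades": (3.0, 4.0),
    "depth pre-pass": (1.0, 1.5),
    "opaque": (8.0, 10.0),
    "HBAO": (0.0, 2.0),              # "up to 2ms"; low end assumed
    "FXAA": (0.5, 0.5),
    "god rays and flares": (0.0, 1.0),
    "motion blur": (0.0, 0.5),
}


def budget_report(passes, budget=FRAME_BUDGET_MS):
    """Return (best case, worst case, worst-case overshoot) in ms."""
    low = sum(lo for lo, _ in passes.values())
    high = sum(hi for _, hi in passes.values())
    return low, high, high - budget
```

With these figures the best case is 12.5ms and the worst case 19.5ms, so the frame only fits when several passes sit near the cheap end of their range, which is why the quality settings have to stay adjustable.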

shot2


Old school shadow map method

Cascaded shadow maps have been the de-facto choice for games for many years, certainly for most of the PS3/Xbox 360 era and a few titles before that.  However, CSM has never been perfect, far from it.  The biggest problem is memory: to get good quality shadows on a 720p screen you generally need the biggest cascade to be a 2048×2048 texture, or 4096 for 1080p.  Many meshes will be present on multiple cascades, so you have a heavy draw call cost.  When rendering the objects that receive shadows, in order to smooth the shadow edges you have to take a lot of samples to perform a blur (my current PC engine has quality presets for shadow sample count from 25 to 81 per pixel based on blur kernel size).

Now on PS4/Xbox One/PC that’s all fine, it’s still the best overall choice, but on low to medium mobile and on devices like the 3ds, you may end up with quite low visual quality when you scale back the buffer sizes and samples for performance reasons.
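To put rough numbers on those costs, here is a back-of-envelope Python sketch. It assumes a square blur kernel of side 2r+1 and 4 bytes per shadowmap texel; neither is stated in the post, and a variance shadowmap storing two moments could well use more per texel.

```python
def samples_for_radius(radius):
    """Shadowmap reads per pixel for a square kernel of side 2r + 1."""
    side = 2 * radius + 1
    return side * side


def shadow_map_bytes(size, bytes_per_texel=4):
    """Memory for one square shadowmap of size x size texels."""
    return size * size * bytes_per_texel
```

Under those assumptions the 25 and 81 sample presets correspond to 5×5 and 9×9 kernels, and going from a 2048 to a 4096 map quadruples the memory from 16 MiB to 64 MiB, which is exactly the kind of cost that has to be scaled back on smaller devices.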

An alternative method I used on Hot Wheels Track Attack on the Nintendo Wii and several subsequent Wii and 3ds games was a top down orthographic shadow projection.  For prelit static shadows a mesh is created by artists in Maya with the shadow/light maps as textured geometry.  Solid vertex coloured geometry is also possible.  This is rendered to an RGB buffer in-game, initialized with a white background, from an orthographic camera pointing straight down around the area of interest.  The dynamic objects can then reference this texture using their XZ world positions only and multiply the texture RGB against the object output RGB.  This sounds pretty crude, and it is!  However, the Wii and 3ds don’t have programmable pixel shaders so it was important the shadow method could be achieved in their fixed function pipelines.
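The lookup itself is just a remap of world XZ into the buffer followed by a multiply. This Python sketch uses nearest-texel sampling on a tiny in-memory buffer to show the arithmetic; the function names and the square area of interest are assumptions for the example, and on the consoles this would be done by fixed-function texture coordinate generation rather than code like this.

```python
WHITE = (1.0, 1.0, 1.0)


def world_to_shadow_uv(x, z, centre_x, centre_z, half_extent):
    """Map a world XZ position into [0, 1] UVs of the top-down buffer.

    Y never enters the calculation, which is why every point in a
    vertical column receives the same shadow value.
    """
    u = (x - centre_x) / (2.0 * half_extent) + 0.5
    v = (z - centre_z) / (2.0 * half_extent) + 0.5
    return u, v


def sample_nearest(buffer, u, v):
    """buffer is a list of rows of (r, g, b) tuples with values in [0, 1]."""
    if not (0.0 <= u <= 1.0 and 0.0 <= v <= 1.0):
        return WHITE                  # outside the area of interest: unshadowed
    height, width = len(buffer), len(buffer[0])
    col = min(int(u * width), width - 1)
    row = min(int(v * height), height - 1)
    return buffer[row][col]


def shade(object_rgb, shadow_rgb):
    """Multiply the object's output colour by the shadow buffer colour."""
    return tuple(o * s for o, s in zip(object_rgb, shadow_rgb))
```

Because the buffer starts out white, anything not covered by the baked shadow mesh multiplies by one and is left untouched, and a coloured texel tints the object instead of simply darkening it.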

Pre-baked mesh in Maya


Crudeness aside, you can achieve really high quality shadows onto dynamic objects with this method because you can not only pre-blur all your shadows and bake them at a high resolution, you can use coloured shadows (for example, light cast through coloured glass).  It’s also very cheap to light particles with this method.  You can mix dynamic objects with static by also rendering the dynamic objects to the buffer after the static, or by having two buffers and merging them later.  You can also aim for a very low number of draw calls, especially by merging and using texture atlases in the art side.

There are some clear downsides to this method – there is no self-shadowing, so an object can never cast onto itself, and there are also artifacts on vertical surfaces as the Y component is ignored in the projection, so essentially you get a single sample smeared all the way down.  You also have to be careful of the height (in Y) that you position the ortho camera at: if any objects pop above this level then they will be incorrectly shadowed.  If using multiple buffers (perhaps per object of interest) be aware that render target changes can be expensive on PVR type hardware (iOS).

As always though graphics is about trade-offs and in some applications you might still find a use for this technique, especially fast moving games.

Hot Wheels Track Attack (Wii, 2010)



Tech Demo (2011) showing particles cheaply shadowed by XZ projection


Tech Demo (2011). A small 512×512 shadow buffer is used for the car but the final visual quality is high. Also note that despite the XZ projection, Y artifacts are barely noticeable

 


Blending against deep water

Below I’m going to talk about a cheap technique for blending objects against ‘deep’ water.  If you’ve never had to do this, then you might wonder what the problem is – you just render the water with alpha blending on right?  Well, no, because if you have an intersecting object it should ‘fade out’ the deeper into the water it is, to simulate less light bouncing out from under the water surface.  A regular alpha blend would give you a uniform alpha level under the surface which would look strange.

So a technique you can use with both deferred and forward rendering, is to render your water as the very last thing in your opaque section, and use the stencil buffer to identify water pixels and non-water pixels.  In the example below I’m using 1 (essentially black) to tag water pixels (which have animated vertex waves so it’s not just a flat plane) and 32 to identify my non-water meshes.  You then re-render your intersecting meshes with alpha-blend on, and the stencil buffer set to reject any pixels that do not equal the water (value 1).  The shaders for this pass are modified to compute alpha based on the world space height, so the deeper below the water the pixel is the lower the alpha blend value.

float ComputeWaterStencilAlpha(float _y)
{
   return 1.0 - saturate((g_waterStencilParams.x - _y) * g_waterStencilParams.y);
}

I use a few tweakables as above so the alpha fall-off can be easily tweaked.  It can also look nice to apply a different fog set to the intersecting meshes, to tint the underwater objects a darker blue.  Although you pay the vertex cost for the second pass on these intersecting meshes you only pay the pixel cost for the pixels actually under the water as the stencil buffer rejects everything else.  If you want to gain some visual quality at some more cost, then instead of rendering the intersecting meshes to the framebuffer, render to an offscreen buffer (with destination alpha) and then you can composite that render target to the framebuffer, again using the stencil buffer as a mask.  With that method you can easily add caustics and fake refractions into the compositing shader.
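To see how the two tweakables shape the fall-off, the shader function can be ported to Python directly. Reading g_waterStencilParams.x as the water surface height and .y as a fall-off scale is an interpretation of the formula, not something the shader names confirm.

```python
def saturate(x):
    """Clamp to [0, 1], like the HLSL intrinsic of the same name."""
    return max(0.0, min(1.0, x))


def compute_water_stencil_alpha(y, water_height, falloff):
    # Depth below the surface, scaled by the fall-off, removes alpha.
    return 1.0 - saturate((water_height - y) * falloff)
```

With a fall-off of 0.25 a pixel 2 units under the surface gets alpha 0.5 and anything 4 or more units down is fully faded, while pixels at or above the surface keep alpha 1.0; raising the fall-off makes the water look murkier.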

No blending


 

No water


 

Stencil Blending


 

Stencil Buffer
