[AI] Some initial observations

Warmonger · August 8, 2014, 5:31pm

Yes, in particular to refresh pathfinder. The pathfinder itself is no longer a bottleneck. It still could work to cache pathfinder for all heroes and not only the current one to avoid this call, still.

Ivan · August 8, 2014, 5:40pm

Warmonger, yes, but if VCMI would be able to handle several human players acting at once implementing several AI’s at once would be simple.

rhn, simultaneous turns were implemented in H5. All players will act at the same time until two players meet each other. When this happens - game switches to normal turn-based mode.

All of them. Key here is call to sendRequest(). Results of my profiling session:
CBattleCallback::sendRequest() -> 400 seconds
calls from CCallback::setSelection() -> 306 s
calls from CCallback::moveHero() -> 81 s

How much each of calls to setSelection affects this - not sure, still trying to figure that out. Profiler refuses to show that…

Will try re-running analysis with debug build (unoptimized) to get better look. It sounds like gcc is too good at inlining

rhn · August 8, 2014, 6:08pm

My remark stays, heroes can move farther than they can see, and this can change the outcome of the game. Under other definitions of “meet” (e.g. in attack range, but invisible), this may leak information to players (oh, enemy is near!).

I’m tempted to say that having a message that doesn’t change game state is bad design… But I can see situations where this could be useful.

Ivan · August 8, 2014, 6:27pm

Actually removing setSelection completely is not a bad idea. But only once all server-side and library-side usages of this field will be dealt with.

This won’t solve moveHero slowdown though which is also significant.

AVS · August 8, 2014, 6:51pm

Should we think about moving AI to server-side?

Tow · August 8, 2014, 7:10pm

Well, it is a workaround, but… if you cannot get profiler to work, you can try do the sampling manually, ie. use debugger to stop the application during the AI turn and check stacktrace.
It is nowhere as accurate as the real profiler but if there is single chokepoint where AI is spending most of its time, you can still find it. Eg. if setSelection takes 75% of time, then even if you take 10 samples, there’s a good chance it will show up.

AI cannot live in the GUI thread because it would lead to the GUI freezups.
Still, if it was thread creation/deletion taking time, there could be a single thread for AI living all the time and activated only for AI turn. That’s quite easily implementable. Still, creating several threads per turn should be really negligible.

Interesting question. If it was the AI logic itself taking the most time, than parallelization should give quite a speed-up.
The simultaneous turns aren’t that far away if given some effort. Server should simply accept packages from multiple players (now it rejects packs from non-current players) and it should work without bigger changes.
However, there are some other problems:

Parallel turns change the game logic and may lead to results impossible with sequential turns. However “different” doesn’t necessarily mean “worse”.
Parallel turns were designed with multiplayer matches in mind. That means it works with multiple clients making simultaneous turns. Here, however, we talk about several players making simultaneous turns on a single client process. This will be the big obstacle: AI locks Game State for its turn, so it won’t get modified while AI reads it. GS is unlocked only when AI makes callback call, so it can be applied. While AIs can share GS for reading, every time one of them wants to take an action it would have wait for all other AIs to relinquish their mutexes so the request can be made.
I’m also not sure how well AI would handle observing unexpected (not resulting their own actions) game events during their turn but that’s a side issue.

Quite much of that, though I’d like to know how long was the AI turn in total. 400s of 425 total is quite different than 400s of 1000s. As for profiling the time taken by a single call, you can try doing this programatically — use static variables that’ll count total time spend function and the count of calls.

Still, it is certainly troubling that waiting for request being processed taking that much time.
Well, obviously TCP interprocess communication is slow but we should be able to get thousands of such calls per second without significant trickery. Currently I’m working on quite complex project composed of numerous processes communicating by exchanging serialzied packages by TCP sockets. RPC implemented over that isn’t fastest but won’t be a bottleneck unless there are really tons of calls. (But if there are — they should be batched.]

There might be more elements causing the delays. AI waits not only for server. Even after the pack is processed by server and reply is sent, client still has to notify all player interfaces and apply it to its copy of game state. I wonder how much time does it take. (Could it be that eg. AI waits for GUI thread to unlock the GS after applying changes?)
There might be multiple non-server elements delaying AI from resuming after the request.

AFAIR setSelection caused pathfinder to update the paths. Could this be a reason why such calls are that slow? moveHero might be delaying processing packages until GUI finishes animating the movement (assuming the movement is visible).
These are my speculations (I haven’t touched the code for quite a while) but it seems suspicious that these two calls are that slow. There must be something to them.

Active hero was used in a few server-only places AFAIR, I wouldn’t remove that message entirely. And I believe it is rather used by AI to trigger pathfinding calculations rather than anything else.
I don’t rememeber ATM whether server requires hero to be avtivated before move but even if it was the case, then such requiremenet could be easily lifted. (just automatically activate hero that is ordered to move)

Ivan · August 8, 2014, 7:45pm

Results that I see here:
28 seconds out of 465, with 466 calls in total. These calls come from:
VCAI::makeTurn
CClient::waitForMoveAndSend
VCAI::requestActionASAP::helper

So (relatively) few of them come from AI’s makeTurn

399 seconds out of 465 seconds total time with 5,106 calls to sendRequest()
Note that this includes 5-10 seconds it took me to start the game and enable autoskip

From what I see:
Client::run() executes for ~1 ms (receiving netpack most likely and passing it to AI)
Immediately after this VCAI::makeTurn activates and runs for ~5 ms, then goes back into wait for ~80ms
In the middle of that timeframe I see activation of GameHandler, likely - processing of incoming netpack.

Or see attachment for screenshot from profiler.

xyz · August 27, 2014, 5:52pm

Are there any updates on this issue?

Macron1 · August 30, 2014, 6:19am

In 0.96 build there are still many crashes preventing from full scale gaming.
Maybe it’s time to start changing approach to programming exceptions?
For example, when AI or player enters some building, and there goes crash. So game is stopped and can be no longer played at some point.
What terrible will happen, when some action causes crash?
It’s local to some hero, some dwelling.
I propose to catch exeptions and continue game, not going with program crash.
There maybe some transaction mechanism if needed (save values state on start of procedure, and if something goes wrong, just continue work with previously saved data.
In some times there no need to save data at all. If some unit gets -9999 luck, there is no need to crush. Just ignore or correct wrong values automatically.

Ivan · September 18, 2014, 7:00pm

Right now I’m trying to remove entirely server-side selection and move it to client-side. So “current selection” field will be either removed completely or will become part of CPlayerInterface/VCAI class.

Once I’m done with this I’ll re-run profiler and post new results.

Ivan · September 21, 2014, 11:14am

Okay, so this is what I’ve got, will upload into a separate branch on github today.

I’ve removed server-side setselection and now this is handled locally in CPlayerInterface/VCAI classes. BTW - does AI ever need “selecting” heroes? Except for that weird pathfinder update mechanism?
I also tried to disable TCP Nagle algorithm to ensure that there are no delays from TCP but it does not seems to have any effect.

Downsides:

cheats. Right now I have most of them disabled since I no longer know on whom they must be applied.
AI - I had to dismantle that selection-pathfinder connection. Now pathfinder info can be received for any hero (with caching of last used hero to emulate what we had before). And some bits of AI code no longer work with this change

Results:

visually there is significant increase in speed
according to profiler time spent on that condition variable decreased from 90%+ to 60%. This means ~4 times faster AI.

Possible improvements:
Now main bottleneck is same delays but for moveHero package. If I will figure out how to fix it AI performance may improve twice more. However I have no idea what can we do here. Any suggestions? Some insight from Tow would be great (if he is still here).

One approach would be to remove that wait from moveHero package. Does not sounds realistic - AI may need to process server results before moving further. Perhaps batch multiple packages in one? Specifically for this case (AI movement) - server will try to move hero as far as possible until he encounters something like event.

Another approach would be to actually fix source of this delay but on this one I am clueless

UPD: and here is how profiler looks now:

Warmonger · September 21, 2014, 11:38am

God job

No, in fact the selection was used just to update pathfinder for that hero.

Well, that’s a problem. Which ones exactly?

No wonder, since moving hero is pretty much everything AI can do, apart from recruiting creatures at towns and few occassional actions. I don’t think much can be done here, server needs to validate all the actions taken by players.

Ivan · September 21, 2014, 12:05pm

To be fair I have no idea how much AI speed would be affected on other OS’s. It is always possible that these delays are Linux specific (which includes Android BTW)

Ah, good. Then I think I’ll wipe selections from AI code completely.

Mostly when AI will query for pathfinder info without any “active” hero. This is no longer legal since pathfinder info can now be accessed only with some hero pointer.

There were other (manageable) places - for example I turned function getDistance (between two objects) into functor that needs hero to be created.

When I’ll upload code to github I’ll comment on all places where I had to disable some code so you can take a look on it.

But even then there would be large amount of sequential movement requests. So if I won’t manage to figure out what’s causing these huge delays then another option would be to batch AI movement actions:

AI generates list of MoveHero packages and sends all of them to server
Server receives list and starts moving hero tile by tile until hero will encounter something (e.g. event or battle) or till the end of this list.

This should work for AI (from what I see AI won’t interrupt movement) and may give good speedup to AI. And if there is need to interrupt movement - it is possible to change batch size or disable it entirely in some cases (e.g. don’t batch for AI scouts)

Warmonger · September 21, 2014, 12:15pm

But this doesn’t make sense. Paths can’t exist without hero.

No way. Every movement changes state of game and every change in state of game might change AI decisions.

Ivan · September 21, 2014, 12:51pm

Not exactly. There was a way to access cached pathfinder data of last active hero. So technically there is a hero. But I removed that possibility so this code is now broken. For example check here:
method FindObj::whatToDoToAchieve()

	if (o && isReachable(o)) //we don't use isAccessibleForHero as we don't know which hero it is
		return sptr (Goals::GetObj(o->id.getNum()));
	else
		return sptr (Goals::Explore());

isReachable() relies on pathfinder to check reachability but source of this pathfinder data would be last selected hero. So it may be completely wrong (e.g. hero in sector without any open exit).

In most cases there was some hero pointer around so I could easily fix code but this is a case which I can’t fix without understanding logic of this code.

Just wondering - even when moving a hero across already explored & controlled area?

In any event I’m not planning to implement such batching as of now - it may only serve as a limited workaround if there won’t be any way to get rid of those 50 ms delays

Warmonger · September 21, 2014, 1:21pm

isReachable(o) checks if object can be visited by any hero, in loop.

Of course. Hero can visit dwellings or get resources, also his movement points change which is the key for hero management.

Ivan · September 21, 2014, 1:37pm

bool isReachable(const CGObjectInstance *obj)
{
	return cb->getPathInfo(obj->visitablePos())->turns < 255;
}

This is how isReachable works now (in current git). And getPathInfo() return pathfinder data for selected hero. No loop over all heroes here.

However if “loop over all heroes” is good enough approach I can implement isReachable in this way. IIRC isReachable was main source of my AI problems.

Idea was to create batches that would contain path from hero to desired object - without any object visiting in the middle of “batch”.

Anyway - I finally found to remove those huge delays. Now I have ~15ms between two AI requests instead of original 100+ ms so these batching should not be needed anymore. At least not until someone decide to do one more run on AI optimization.

Warmonger · September 21, 2014, 1:43pm

Oh wait about that part, I need to think. I meant isAccessible() function.

…if AI is checking paths for unspecified hero, the indeed it may behave oddly.

EDIT: isReachable() actually duplicates isAccessible() and should be replaced altogether. Good spot.

Ivan · September 21, 2014, 1:58pm

Ok, replaced isReachable with VCAI::isAccessible.

Another case. This is what I have right now, a bit different from current git:

TSubgoal GatherTroops::whatToDoToAchieve()
{
	...
	if (dwellings.size())
	{
		//FIXME: we need hero to sort dwellings by distance
		//boost::sort(dwellings, CDistanceSorter(h.get()));
		return sptr (Goals::GetObj(dwellings.front()->id.getNum()));
	}
}

(CDistanceSorter is functor I created to replace original sorter function. Calculates distance from hero to passed object)

What should be done here? I need to calculate distance from hero to dwelling to sort them by distance. But hero for this goal is not selected yet (?)

Warmonger · September 21, 2014, 2:10pm

Yeah, that’s the issue with isCloser function - works only if the paths were previously calculated in correct context. It was all so simple before I added support for multiple heroes

If the hero is set, calculate distance for him.
If the hero is not set however, need to calculate distance for each hero-dwelling pair. This in fact should go to fuzzy engine for comparison, but that’s not urgent.