Scraping an HTTP API with Python and distributed task queues

The company behind the game World of Tanks provides a public HTTP API for third-party developers to create their own applications. API features include OpenID-based login, clan and player statistics, and information about the game. I had already used this API for my attendance tracker application, but now wanted to try something on a bigger scale and with different technologies. One idea was an application that tracks the clan membership of players over time.

The API offers methods to get a list of clans and the members of a clan, but only allows fetching 100 clans per request. There are about 40000 clans on the EU server cluster, which means many requests are required to get the data of all clans. Furthermore, the API is rate-limited and only allows around 10 requests per second (though this limit can be negotiated).

Celery, an asynchronous task queue, seemed perfectly suited for this. For storing the player clan history itself I chose MongoDB because, as a schema-less document database, it is very comfortable to use.

In Python, a Celery task is declared by decorating a function:
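A minimal sketch of what such a task could look like; the endpoint, parameters, and the rate_limit option are my assumptions, not necessarily the original code:

    import requests
    from celery import Celery

    app = Celery('tasks', broker='amqp://localhost')

    # Note: rate_limit is enforced per worker, so the global request
    # rate also depends on how many workers are running.
    @app.task(rate_limit='10/s')
    def get_clans(page):
        # Hypothetical endpoint and parameters; the real API requires
        # a registered application_id.
        response = requests.get(
            'https://api.worldoftanks.eu/wgn/clans/list/',
            params={'application_id': 'demo', 'page_no': page, 'limit': 100},
        )
        return response.json()['data']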

Now we can queue up tasks for all possible pages in Celery and have multiple workers execute them in parallel:
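Roughly like this, using Celery's group primitive; the page count is inferred from the numbers above (about 40000 clans at 100 per page):

    from celery import group

    # One get_clans task per page, executed in parallel by the workers.
    job = group(get_clans.s(page) for page in range(1, 401))
    result = job.apply_async()
    clan_pages = result.get()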

get() blocks until all tasks are finished and returns a list of lists (because each get_clans task returns a list) containing the JSON response data from the API. We can then start a task for each clan to request its members and store the results in the database.
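Continuing the sketch above, the second step could look like this; the endpoint, field names, and collection layout are all assumptions:

    from pymongo import MongoClient

    mongo = MongoClient()  # the local MongoDB mentioned below

    @app.task(rate_limit='10/s')
    def store_members(clan_id):
        response = requests.get(
            'https://api.worldoftanks.eu/wgn/clans/info/',
            params={'application_id': 'demo', 'clan_id': clan_id},
        )
        data = response.json()['data'][str(clan_id)]
        mongo.clan_history.clans.insert_one(
            {'clan_id': clan_id, 'members': data['members']}
        )

    # Queue one task per clan found in the first step.
    for page in clan_pages:
        for clan in page:
            store_members.delay(clan['clan_id'])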

Workers can be started from the celery command line tool:
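For example, assuming the tasks live in a module named tasks:

    celery -A tasks worker --concurrency=10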

In this case, 10 processes run in parallel to process the tasks, but Celery can also use threads or green threads.
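Switching the pool implementation is just a flag, e.g. with the eventlet package installed:

    celery -A tasks worker -P eventlet --concurrency=100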

This works quite well and retrieves the information for 40000 clans and around 550000 players in an hour. With all the information in the local MongoDB, a few interesting statistics are already possible. For example, this histogram visualizes the member counts of clans:

Member count of EU clans

Unsurprisingly, most clans have few members, and only 1030 clans have at least 80 members.
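Numbers like these come from simple queries against the collection; a sketch, assuming the hypothetical document layout from above:

    # Clans whose members array has at least 80 entries: the
    # 'members.79' $exists trick matches arrays where index 79 exists.
    large_clans = mongo.clan_history.clans.count_documents(
        {'members.79': {'$exists': True}}
    )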

This project is mostly a learning experience and currently a work in progress. The source code is on GitHub.

Organizing the SAT Competition 2014

This year I was part of the organizing team of the SAT Competition 2014, a competitive event for solvers of the Boolean Satisfiability (SAT) problem. It is organized – usually annually – as a satellite event to the International Conference on Theory and Applications of Satisfiability Testing.

SAT solvers are crucial in many applications, from resolving package dependencies in Linux package managers to hardware and software verification and planning in air traffic control. Many problems can be encoded as SAT instances and solved very effectively by general-purpose SAT solvers.

The competition was a huge success in terms of participation, with 79 participants from all over the world and 137 submitted solvers. About 10 years' worth of CPU time was spent on evaluating the solvers. The winners were awarded silver medals at the FLoC Olympic Games 2014 in Vienna.

SAT Challenge 2012, SAT Competition 2013, and now SAT Competition 2014 were conducted using EDACC, a framework for experiments with solvers on computer clusters, on which I work as a core developer. It started as a student project at Ulm University and was further improved as part of various bachelor's and master's theses. The hardware resources for the SAT Competition 2014 were provided by the Texas Advanced Computing Center (TACC). The solvers ran on the very impressive Lonestar cluster, which consists of around 22000 cores. Sadly, you cannot allocate all of them at once, but at times we had more than 1000 cores reserved 🙂

Naturally, there are always unforeseen issues when running such competitions, and this year was no different. The main technical issues were due to the large UNSAT proof files generated by solvers. These proof files can grow to hundreds of GB in size, and checking them for correctness often takes about as long as running the solver itself. Unfortunately, EDACC cannot parallelize the proof verification yet, which can leave the other cores idle while verification is running. This would be a great feature to implement in the future.

Another issue was caused by TCP connection timeouts between the computation clients running on the cluster nodes and the database server. This issue was quite hard to track down because the clients seemed to stop at random for no apparent reason, and such timeouts had never been a problem on other clusters before. The log files written by the clients helped enormously in diagnosing the cause, and we worked around it by periodically sending keep-alive queries over the connection.
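EDACC's client is not written in Python, but the workaround boils down to a pattern like this illustrative sketch: a background thread issues a trivial query at a fixed interval so the connection never sits idle long enough to be dropped.

    import threading
    import time

    def start_keep_alive(connection, interval=60):
        """Periodically run a no-op query on a DB-API connection."""
        def ping():
            while True:
                time.sleep(interval)
                # Real code must synchronize access to the connection
                # if other threads use it concurrently.
                cursor = connection.cursor()
                cursor.execute('SELECT 1')  # cheap keep-alive query
                cursor.close()
        threading.Thread(target=ping, daemon=True).start()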

Overall, it was a lot of work, but it's great to see a university project like EDACC used at such a scale and with a real purpose.

More information about the SAT Competition can be found on the website and in the results presentation slides.