AI Zone: chatbots.org

ALF

Posted: Sep 7, 2011

[ # 31 ]

Dan Hughes

Experienced member

Total posts: 64

Joined: Aug 6, 2011

Hehe, you’re right, a Cray it is not, but it should do the job!

The good thing about the way it is set up is that it is very modular, both from a hardware perspective (obviously) and in the software that handles the comms. I’ll be able to add more nodes and scale quickly and easily should I need to in the future.

I’m really eager to get cracking on the real work with this now, but I’m still in the process of setting up and ironing out the early glitches.

That said, it should be operational in a day or two, in the sense that I can start dumping data in and running queries.

Posted: Sep 8, 2011

[ # 32 ]

Dan Hughes

Experienced member

Total posts: 64

Joined: Aug 6, 2011

Quick update on the progress of the custom KB DB cluster I’ve been working on as I know a few peeps here are interested.

Initially, to try and get things working, I started off with a MySQL Cluster install on all of the machines. MySQL is pretty cool, but unfortunately, for what I needed it to do, it was lacking in a number of areas, specifically “Horizontal Sharding”. While it can do it, I found that performance was well below what I was expecting.

After trying a few other solutions, including Hibernate Shards, I decided that once again, a custom approach was probably best.

So far I have written the modules that control “INSERT/UPDATE” of DB elements. To do anything with the DB I first need to populate it, so this seemed a good place to start :D

My initial implementation was able to push around 2000 inserts/updates per second, which, compared to the original pre-cluster box, was about 4x slower. However, the degradation of performance over time was a lot less, and the code was still very crude, so I took it as a good starting position and proceeded.

For comparison’s sake, the final pre-cluster box (lots of discs, lots of RAM) started off at around 12000 inserts per sec when populating, and gradually dropped off to 500 inserts per sec once past 10M rows. It held steady at that speed from there until completed.

The entire import of DBPedia, Freebase and YAGO on the final pre-cluster box took around 4 days. In comparison, the first cluster implementation imported the same data in 2 days 9 hours: a good hike in performance, but still slow.

Continuing my efforts with the cluster, I’ve improved and added to the protocol, while still keeping it “non-batching”. By this I mean single-row inserts rather than batching them together. Batching is faster, but it is only good for data imports, and in the real world of querying databases I’ve found it’s rare that you can take advantage of it.
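As a rough illustration of the non-batching idea (not the actual shard protocol, whose details are not given in this thread), here is a minimal Python sketch that routes each single-row insert/update to one shard by hashing the row key. The four-shard count, the MD5-based `shard_for` function and the `ShardedStore` class are all assumptions made for the example, and plain dicts stand in for the MySQL servers.

```python
import hashlib

NUM_SHARDS = 4  # assumed node count for the example, not stated in the post


def shard_for(key: str, num_shards: int = NUM_SHARDS) -> int:
    """Map a row key to a shard index using a stable hash."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


class ShardedStore:
    """Routes single-row upserts to one of several stand-in shards."""

    def __init__(self, num_shards: int = NUM_SHARDS):
        # Each dict stands in for one database server in the cluster.
        self.shards = [dict() for _ in range(num_shards)]

    def upsert(self, key: str, row: dict) -> int:
        """Insert or update one row; returns the shard index that was used."""
        idx = shard_for(key, len(self.shards))
        self.shards[idx][key] = row
        return idx

    def get(self, key: str):
        """Look the key up on the one shard that can hold it."""
        return self.shards[shard_for(key, len(self.shards))].get(key)
```

Because the hash is stable, a later lookup for the same key goes straight to the same shard without asking the others. A real setup would also have to deal with adding nodes, which simple modulo hashing handles poorly.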

Currently I’m running an import using the latest shard protocol code, and the results are VERY good. Having tuned the cluster hardware and MySQL servers, and continued development of the shard protocol code, I am now seeing 8-10k inserts/second, with practically no drop-off. I am well into the DB size range where a significant drop-off in speed was present before, so this seems to be good.

Although it hasn’t finished, there is a particularly large file in the DBPedia set that I have been using as a gauge for performance. This file is around 7 GB in size, and the latest implementation imported that entire file into the DB in around 2 hours.

Previous absolute best was 16 hours. :D

Posted: Sep 9, 2011

[ # 33 ]

Andrew Smith

Senior member

Total posts: 473

Joined: Aug 28, 2010

It’s a shame that you seem to be limiting yourself to MySQL which, its popularity notwithstanding, is only a toy DBMS and one that is only suitable for amateurs and toy problems. Certainly you won’t find many experienced database developers bothering with it. Once you run out of memory, you can get better results using awk and sed and grep than MySQL.

While I haven’t tried building YAGO on my current system which, although quite powerful, still falls far short of the hardware that you have at your disposal, I did build an instance of it a couple of years ago on a much older and slower system using PostgreSQL. In spite of that, the job only took a few hours, not the days that you are reporting, and query times were quite acceptable. It’s not faster hardware that you need, but a robust scalable DBMS such as PostgreSQL or Oracle.

Failing that, you could go the NoSQL route, although I doubt that you’d gain much performance and it would take a lot more effort, especially if you will need a lot of concurrency at some point in the future.

Posted: Sep 9, 2011

[ # 34 ]

Dan Hughes

Experienced member

Total posts: 64

Joined: Aug 6, 2011

Opinion noted on MySQL, but, as this is simply research at the moment, MySQL is what I know inside and out.

I’d much rather be trying new things with respect to the AI research, as opposed to learning all about a new DB system when, for now at least, MySQL will serve the purpose just fine.

I don’t think I agree that MySQL is a toy DBMS, though. I’ve used it in many commercial-grade projects in the past, as do many of the large-scale internet services that we all use today.

That said, the DB table setup for ALF is very simple, so regardless of MySQL being a toy or not, I don’t think anything else would bring anything to the table that would warrant the time or effort to convert at present :D

Side point: YAGO isn’t too bad by itself, but coupled with the other two that I want to use as knowledge, it results in quite a big set. Past around 10G in DB size is where speeds start to taper off, and at 20G+ it gets noticeable. YAGO on its own will easily fit in around 10G and by itself would no doubt import pretty quickly… it’s the other 90G that has been a problem :D

Posted: Sep 9, 2011

[ # 35 ]

Andrew Smith

Senior member

Total posts: 473

Joined: Aug 28, 2010

I agree that it would not be worth your while to learn a new DBMS at this point, and my remarks opened with what was intended as sympathy in that regard. Even with the right software it still takes a good deal of experience and expertise to get the best performance out of it.

Yes, there are large-scale internet services that still use MySQL, and they all have one thing in common: they wish they weren’t. The trouble with database management systems is that once you settle on one you tend to be stuck with it, at least until such time as you experience a total meltdown and are forced to rebuild from scratch, usually with new staff.

Posted: Sep 9, 2011

[ # 36 ]

Dan Hughes

Experienced member

Total posts: 64

Joined: Aug 6, 2011

I guess that’s the saving grace of research: you are able to change things if you must, and not annoy any shareholders.

I have kinda built a failsafe for this, unintentionally, thanks to the overall design: the actual ALF code never interfaces directly with the MySQL server.

There are a number of layers between the logic and the DB server, which ultimately end up at a JDBC wrapper class, the module that handles MySQL communication.

Worst case, should I need to change in the future, I would have to ensure I can migrate the data and write a new wrapper class that interfaces with the new DB type.
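A minimal sketch of that kind of layering, written in Python rather than Java/JDBC for brevity. The `FactStore` interface, both backend classes and the `describe` function are hypothetical names invented for the example, and sqlite3 simply stands in for whichever DB server sits behind the wrapper.

```python
import sqlite3
from abc import ABC, abstractmethod


class FactStore(ABC):
    """Hypothetical interface that the logic layers would code against."""

    @abstractmethod
    def add_fact(self, subject: str, predicate: str, obj: str) -> None: ...

    @abstractmethod
    def facts_about(self, subject: str) -> list: ...


class SqliteFactStore(FactStore):
    """One concrete wrapper; sqlite3 stands in for the real DB server."""

    def __init__(self, path: str = ":memory:"):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS facts "
            "(subject TEXT, predicate TEXT, object TEXT)"
        )

    def add_fact(self, subject, predicate, obj):
        self.conn.execute(
            "INSERT INTO facts VALUES (?, ?, ?)", (subject, predicate, obj)
        )
        self.conn.commit()

    def facts_about(self, subject):
        cur = self.conn.execute(
            "SELECT predicate, object FROM facts "
            "WHERE subject = ? ORDER BY predicate",
            (subject,),
        )
        return cur.fetchall()


class DictFactStore(FactStore):
    """A second backend, to show that the calling code does not change."""

    def __init__(self):
        self.rows = []

    def add_fact(self, subject, predicate, obj):
        self.rows.append((subject, predicate, obj))

    def facts_about(self, subject):
        return sorted((p, o) for s, p, o in self.rows if s == subject)


def describe(store: FactStore, subject: str) -> list:
    """Logic-layer code: it depends only on the interface, not the backend."""
    return [f"{subject} {p} {o}" for p, o in store.facts_about(subject)]
```

Swapping the database then means writing one new `FactStore` subclass and migrating the data; `describe` and anything else above the interface stays as it is.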

So, I guess I’m covered? Right?

Posted: Sep 9, 2011

[ # 37 ]

Andrew Smith

Senior member

Total posts: 473

Joined: Aug 28, 2010

Given the nature of the project you certainly do have lots of wiggle room so there is nothing to worry about. The main reason that I felt compelled to comment was that upgrading hardware to improve database performance can turn into a bottomless pit, and rarely shows the kind of performance gains that can be had much more easily through other means.

As a general rule of thumb 25 percent of your performance gains will come through tweaking the client software, 25 percent through tweaking the hardware (by that I mean tuning it right across the board), and the greatest gains come through optimising the database design and having skilled database administrators keeping everything in balance.

I’m currently preoccupied with Project Gutenberg but I’ll be revisiting YAGO, Wikipedia and the others again very soon. I’ve been compiling databases of useful knowledge and linguistics data on my website at http://wixml.net in case you have any interest.

Posted: Sep 9, 2011

[ # 38 ]

Fatima Pereira

Experienced member

Total posts: 56

Joined: Jan 23, 2011

Dan,

Have you ever heard about a supercomputer called Microwulf?

http://www.calvin.edu/~adams/research/microwulf/performance/

Posted: Sep 9, 2011

[ # 39 ]

Dan Hughes

Experienced member

Total posts: 64

Joined: Aug 6, 2011

No, never heard of that, but I have now!

Very similar in design to what I’ve put together. Maybe I’ll run some GFlop benchmarks in the future to see what I can do.

Posted: Sep 10, 2011

[ # 40 ]

Dan Hughes

Experienced member

Total posts: 64

Joined: Aug 6, 2011

Just a quickie update here.

As you probably assumed, I have been further working on and tuning the ALF cluster and the DB environments.

If you remember, at the beginning of all this I had a target to return all information available (around 1000 “facts”) for Bruce Willis, from a combined data set totalling around 100G, in 5 secs or less.

I can inform you all, that…I have achieved that goal…..and then some.

I spent most of today working on an async query model for MySQL that allows queries to be both distributed (via the sharding) and also to “overlap”.

The “overlapping”, for want of a better term, simply works as follows. Instead of hitting the cluster with 4 requests of “get me all about Bruce”, each request is further diced up AT the cluster node into a number of smaller requests, each returning, say, 100 items.

As a subrequest completes, the next subrequest is performed, while another process on the cluster does some work and combines the freshly returned results with any previously completed results into the format I require for ALF.
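One way this overlapping might look, as a hedged Python sketch rather than the actual ALF code. The `fetch_page` and `overlapped_query` names are made up for the example, a list slice stands in for a paged query against one shard, and a single background thread stands in for the separate process on the cluster node.

```python
from concurrent.futures import ThreadPoolExecutor

PAGE_SIZE = 100  # "say 100 items" per sub-request


def fetch_page(rows, offset, limit=PAGE_SIZE):
    """Stand-in for one sub-request against a shard (e.g. a LIMIT/OFFSET query)."""
    return rows[offset:offset + limit]


def overlapped_query(rows, combine):
    """Fetch page N+1 in the background while page N is being combined."""
    results = []
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(fetch_page, rows, 0)
        offset = 0
        while True:
            page = pending.result()
            if not page:
                break  # an empty page means the shard has nothing more
            offset += len(page)
            # Kick off the next sub-request before doing the combine work,
            # so fetching and combining overlap in time.
            pending = pool.submit(fetch_page, rows, offset)
            results.extend(combine(page))
    return results
```

With 250 stand-in facts this issues sub-requests of 100, 100 and 50 items plus one final empty one, and the combine step for each page runs while the next page is already being fetched.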

Thus, the end result for Bruce, or any other “simple” query for that matter, is typically 500-750ms… from a 100G database, I don’t think that’s too shoddy at all.

Posted: Sep 10, 2011

[ # 41 ]

Andrew Smith

Senior member

Total posts: 473

Joined: Aug 28, 2010

Congratulations, it does sound like you are on the right track, though I suspect that some of what you are doing (e.g. the asynchronous processing) is still redundant.

While I can only guess at the details of what you were doing before and what you are doing now, my hunch is that you were using a row-at-a-time processing model before, that is retrieving each record individually and processing it on the client, instead of the chunk-at-a-time processing model that SQL was intended to use.

In other words, a smaller number of more complex queries outperforms a larger number of simpler queries because the query planning and execution processor in the DBMS is able to work out the most efficient way to do it all. A good DBMS will even maintain statistics about the distribution of data in the rows and columns on disk and factor that into the query execution plan that it develops. At any rate, letting the DBMS do all the planning ought to yield a hundredfold increase in performance for this sort of data processing and that seems to be what you are getting.
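To make the contrast concrete, here is a small Python sketch using the standard sqlite3 module as a stand-in DBMS. The table names, the three sample films and both function names are illustrative assumptions and have nothing to do with the actual ALF schema; the sketch only counts queries and does not measure speed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE acted_in (actor TEXT, film TEXT);
    CREATE TABLE film_year (film TEXT PRIMARY KEY, year INTEGER);
    INSERT INTO acted_in VALUES
        ('Bruce_Willis', 'Die_Hard'),
        ('Bruce_Willis', 'Pulp_Fiction'),
        ('Bruce_Willis', 'The_Sixth_Sense');
    INSERT INTO film_year VALUES
        ('Die_Hard', 1988), ('Pulp_Fiction', 1994), ('The_Sixth_Sense', 1999);
""")


def row_at_a_time(actor):
    """One query for the film list, then one more query per film."""
    queries = 1
    films = conn.execute(
        "SELECT film FROM acted_in WHERE actor = ?", (actor,)
    ).fetchall()
    out = []
    for (film,) in films:
        queries += 1
        (year,) = conn.execute(
            "SELECT year FROM film_year WHERE film = ?", (film,)
        ).fetchone()
        out.append((film, year))
    return sorted(out), queries


def chunk_at_a_time(actor):
    """A single join: the DBMS plans the whole retrieval itself."""
    rows = conn.execute(
        "SELECT a.film, y.year FROM acted_in a "
        "JOIN film_year y ON y.film = a.film "
        "WHERE a.actor = ? ORDER BY a.film",
        (actor,),
    ).fetchall()
    return rows, 1
```

Both functions return the same rows, but the first issues one query per film on top of the initial lookup, so its query count grows with the size of the result while the second stays at one.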

Now the good news is that if you are able to use inverted indices instead of B-trees you’ll get another ten- to one-hundredfold increase in performance again.
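For anyone unfamiliar with the term, an inverted index maps each value to the set of rows that contain it, so a multi-term lookup becomes a set intersection. The Python sketch below only illustrates the data structure; it says nothing about the performance figures above, and the function names and sample triples are assumptions made for the example.

```python
from collections import defaultdict


def build_inverted_index(facts):
    """Map each term to the set of fact ids that contain it."""
    index = defaultdict(set)
    for fact_id, (subject, predicate, obj) in enumerate(facts):
        for term in (subject, predicate, obj):
            index[term].add(fact_id)
    return index


def lookup(index, facts, *terms):
    """Return the facts containing every given term, via set intersection."""
    if not terms:
        return []
    ids = set.intersection(*(index.get(t, set()) for t in terms))
    return [facts[i] for i in sorted(ids)]
```

For example, looking up the terms "Bruce_Willis" and "actedIn" together intersects two small id sets instead of scanning every fact.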

Posted: Sep 10, 2011

[ # 42 ]

Dan Hughes

Experienced member

Total posts: 64

Joined: Aug 6, 2011

Hi Andrew, we seem to meet again in the early hours! (Well, it is for me.)

I was pulling in chunks before, but I imagine purely due to the size of the DB, it was overwhelming the machine it was on. Basically it was thrashing the hard discs constantly on any query, as I had the max memory I could have in that particular machine, which was well short of the DB size.

I concur with what you say about more complex queries outperforming simpler ones in the “grand scheme” of things, but for now I wanted to keep it simple, and be able to have a clear target.

Complex queries, however, are next on the board, as I need to be able to return ontology information (from different tables) and various other attributes that are spread around the DB.

I’ll keep updating with the results of that if folks are interested.

Posted: Sep 10, 2011

[ # 43 ]

Andrew Smith

Senior member

Total posts: 473

Joined: Aug 28, 2010

Maybe I should add that the overall size of the database should be irrelevant as far as query execution times are concerned. Obviously it will have a strong impact on the initial building phase and subsequent administration, but once the database is up and running, the query execution time should be proportional to the volume of data that is being retrieved in the query (the size of the result), and not the volume of data that is in the database.

The way to accomplish this is first and foremost to minimise the number of transactions that you use to perform each query. Each transaction has a fixed overhead in terms of set up and tear down, but even worse, having a lot of transactions conceals your intentions from the DBMS and prevents it from globally optimising its actions.

Also, you can take it as given that everyone is extremely interested in what you are doing. I’m looking forward to getting my own version of YAGO up and running again too.

Posted: Sep 10, 2011

[ # 44 ]

Dave Morton

Administrator

Total posts: 3111

Joined: Jun 14, 2010

I’ll second Andrew’s indication of interest. As I stated earlier, I may not be able to contribute much (I don’t even know what questions to ask!), but I’m keenly interested.

Posted: Sep 11, 2011

[ # 45 ]

Dan Hughes

Experienced member

Total posts: 64

Joined: Aug 6, 2011

Great, I’m always open to any kind of suggestions or ideas if anyone has any.

For ALF I don’t really have any concrete goals. There are abilities that I have pegged as “nice to haves”, but really I’m just moving forward by messing with a bunch of different things and then following the Yellow Brick Road to see where that takes me.

I started off trying to have a “design” for this, but that soon got thrown out; it’s pretty much impossible to design around anything in this field that is an unknown. Still, it’s good fun, and I’m making progress, so that’s all that counts.

With the DB now at a point where it’s usable, I’m looking at some simple QA against it, things along the lines of

“Is Bruce Willis an actor?”
“What are the films that Bruce Willis is in?”

You get the idea.
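Questions of that shape map quite directly onto lookups over subject/predicate/object facts. A toy Python sketch, where the `type` and `actedIn` predicate names, the three sample facts and both function names are illustrative assumptions rather than the real ALF data or code:

```python
# A tiny stand-in fact set; the real knowledge base is around 100G.
FACTS = {
    ("Bruce_Willis", "type", "actor"),
    ("Bruce_Willis", "actedIn", "Die_Hard"),
    ("Bruce_Willis", "actedIn", "Pulp_Fiction"),
}


def is_a(subject, category, facts=FACTS):
    """Answer "Is <subject> a <category>?" with a membership test."""
    return (subject, "type", category) in facts


def films_of(subject, facts=FACTS):
    """Answer "What are the films that <subject> is in?"."""
    return sorted(o for s, p, o in facts if s == subject and p == "actedIn")
```

Here `is_a("Bruce_Willis", "actor")` is a yes/no check and `films_of("Bruce_Willis")` is a list query; turning the English question into one of these calls is the part this sketch leaves out.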
