
YAGO2 Processing
 
 

I’ve been taking a fresh look at the YAGO2 knowledge base. I spent some time converting the first version of YAGO into a fairly efficient relational database a couple of years ago, but YAGO2 is significantly larger again.

YAGO2 comes in two versions: the full version and the core version. They are approximately 9 gigabytes and 4 gigabytes to download respectively. The full version extracts into about 90 gigabytes of text files, while the core version takes up only about 23 gigabytes.

To load these into a relational database would easily double the disk requirements, not counting the temporary storage for the extracted text files. It wouldn’t be the first time that I’ve installed a database of that size on my server and it won’t be the last, but given the simple structure of YAGO2, I may have a better idea.

YAGO2 consists entirely of relations with three fields, commonly known as “triples”. Even the full version has only about 1.2 billion triples, and there are about 1.8 billion distinct values across all the fields of all the triples. Most of those values are actually the record numbers of other triples, and the remaining 500 million are literal values such as text strings, dates and numbers. Therefore, every unique value could be encoded as a number and stored comfortably in a 4-byte integer, requiring just 12 bytes per triple.

Literal values can be stored in a separate file which, depending on how I decide to encode numeric values in the final version, is somewhat less than 2 gigabytes uncompressed, including an index. This works out to a total size of about 8 gigabytes for the core version or 16 gigabytes for the full version of YAGO2 using this format.
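
As a sketch of how the literal lookup could work (this is only one possible layout, not necessarily the final format), the literals could sit in a data file of null-terminated strings, with a separate index file holding one 32-bit byte offset per id:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <stdint.h>

/* Sketch only: look up the literal with the given id using two files:
   strings.idx - an array of uint32_t byte offsets, one per id
   strings.dat - the '\0'-terminated literal values themselves
   Returns a malloc'd copy of the string, or NULL on failure.
   (fseeko would be needed where long is only 32 bits wide.) */
char *lookup_literal(FILE *idx, FILE *dat, uint32_t id)
{
    uint32_t off;
    if (fseek(idx, (long)id * (long)sizeof off, SEEK_SET) != 0) return NULL;
    if (fread(&off, sizeof off, 1, idx) != 1) return NULL;
    if (fseek(dat, (long)off, SEEK_SET) != 0) return NULL;

    char buf[4096];                /* assumes literals shorter than 4 KB */
    size_t i = 0;
    int c;
    while (i < sizeof buf - 1 && (c = fgetc(dat)) != EOF && c != '\0')
        buf[i++] = (char)c;
    buf[i] = '\0';

    char *s = malloc(i + 1);
    if (s) memcpy(s, buf, i + 1);
    return s;
}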

I’ve written a converter program in C which takes the compressed archives that comprise YAGO2 and extracts them directly into this simple and compact file structure. I still have a fair bit of work to do to figure out what additional indices need to be built to make it useful, but by avoiding the need to extract these enormous text files to disk in the first place, I think it will be possible to create a version of YAGO2 that runs directly from a DVD or a flash drive with a minimum of installation and hardware required.
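
For what it’s worth, the main trick in the converter is simply to stream each compressed file and parse it line by line instead of extracting it first. A minimal sketch of that idea using zlib is below; the real archives may use a different compression format, in which case another library (or a pipe to an external decompressor) would be needed in place of the gz* calls:

#include <stdio.h>
#include <zlib.h>   /* link with -lz */

/* Stream one compressed tab-separated file without writing the
   extracted text to disk.  Gzip is assumed here purely for
   illustration. */
int process_archive(const char *path)
{
    gzFile f = gzopen(path, "rb");
    if (!f) return -1;

    char line[65536];
    while (gzgets(f, line, (int)sizeof line)) {
        /* parse the tab-separated fields of one fact here */
    }
    gzclose(f);
    return 0;
}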

I’ve attached a file to this post with a detailed breakdown of the contents of the YAGO2 archives. It is in the form of an ASCII table, so I hope you are able to view it ok.

File Attachments
yago-summary.text  (File Size: 10KB - Downloads: 238)
 

 
  [ # 1 ]

Wouldn’t it be simpler to have a version running on a webserver with an API to connect to it? That way, we can all use it anywhere (as long as we have internet) without downloading it. And don’t the creators of this dataset provide this service already?

The dataset looks amazing, but as you say it’s rather too large to simply include in your agent.

 

 
  [ # 2 ]

I hadn’t come across that dataset before, and clearly a local, streamlined version or a fast web service would be an interesting option. The MPI interface is only “for demonstration purposes”.

I’m always wondering about the trade-off between these “pre-processed” sources of data and the alternative of looking up the information from the original source in real-time. In this case, it seems practical to search Wikipedia and extract an answer. The benefit is that the answer/scope is always up to date and you’re not limited to the features that the original dataset’s creators thought relevant to extract. On the flip side, it can be slower and requires connectivity and the other service to be online (even Wikipedia is down sometimes).

 

 
  [ # 3 ]

True, but I just saw that the dataset includes links to the relevant Wikipedia pages wink

 

 
  [ # 4 ]
Mark tM - Oct 6, 2011:

Wouldn’t it be simpler to have a version running on a webserver with an API to connect to it? That way, we can all use it anywhere (as long as we have internet) without downloading it. And don’t the creators of this dataset provide this service already?

The dataset looks amazing, but as you say it’s rather too large to simply include in your agent.

If you read Dan Hughes’ excellent account of his own recent efforts to set up YAGO2 on a server, you’ll be able to appreciate that there is no “simpler” way to work with a knowledge base of this size, and in the grander scheme of things, YAGO2 is actually quite small.

http://www.chatbots.org/ai_zone/viewreply/6823/

Since I don’t have the budget to throw more hardware at the problem the way Dan (or indeed the original creators of YAGO) did, I’ve opted for a software solution, one which is potentially useful to a lot more people. Once I’ve developed a more efficient application, I’ll set it up with a server interface to (hopefully) get the best of both worlds.

Also, anything that can fit on a flash drive can be carried around and used anywhere that there is a computer, regardless of connectivity. It looks like that is a distinct possibility now.

 

 

 
  [ # 5 ]
OliverL - Oct 6, 2011:

I’m always wondering about the trade-off between these “pre-processed” sources of data and the alternative of looking up the information from the original source in real-time. In this case, it seems practical to search Wikipedia and extract an answer. The benefit is that the answer/scope is always up to date and you’re not limited to the features that the original dataset’s creators thought relevant to extract. On the flip side, it can be slower and requires connectivity and the other service to be online (even Wikipedia is down sometimes).

Different kinds of information change at different rates. There is much that is historical which won’t change again, barring isolated corrections or a major paradigm shift.

Therefore it makes sense to organise information in tiers: slowly changing, massive amounts of data are converted to structures optimised for fast access and small size, while rapidly changing current information is optimised for ease of modification, even if it is stored less efficiently in the short term.

 

 

 
  [ # 6 ]

After some further analysis I’ve settled on a format which takes 16 bytes per fact. For the full version of YAGO2 this requires a 19 gigabyte fact file and an additional 1.5 gigabyte file of unique strings. The core version of YAGO2 requires files approximately half these sizes. Each fact comprises four 32-bit integers; each field can uniquely identify another fact or a specific string. The fields are number, predicate, domain, range, where number is a unique id for the fact itself.
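
In C terms the record layout amounts to something like this (the field names simply follow the description above):

#include <stdint.h>

/* One 16-byte fact record: four 32-bit integers.  The predicate,
   domain and range fields each hold either the number of another
   fact or the id of an entry in the string file. */
typedef struct {
    uint32_t number;     /* unique id of this fact */
    uint32_t predicate;
    uint32_t domain;     /* subject */
    uint32_t range;      /* object  */
} fact;                  /* sizeof(fact) == 16, no padding */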

To be able to access the individual items efficiently it is necessary to sort the files. For the time being I’m keeping four copies of the fact file, each one sorted on a different key order, i.e.

number
predicate,domain,range,number
domain,range,predicate,number
range,domain,predicate,number

This will make it possible to retrieve all the facts pertaining to a particular entity as efficiently as possible. For example, all the facts pertaining to AlbertEinstein can be read in two blocks, one from the domain-sorted file and one from the range-sorted file. (Depending on how much of the file is already in memory, a small number of additional reads may be needed to perform the binary search that locates the blocks of records.)
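
To make that concrete, here is a rough sketch of locating the block of facts for one entity in the copy sorted by domain. It assumes, for simplicity, that the sorted file has been read (or memory-mapped) into an array; the type and function names are just my own for this example:

#include <stddef.h>
#include <stdint.h>

typedef struct { uint32_t number, predicate, domain, range; } fact;

/* First index in facts[0..n) whose domain field is >= id
   (a lower bound by binary search on the domain-sorted copy). */
static size_t lower_bound_domain(const fact *facts, size_t n, uint32_t id)
{
    size_t lo = 0, hi = n;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (facts[mid].domain < id)
            lo = mid + 1;
        else
            hi = mid;
    }
    return lo;
}

/* Find the contiguous block of facts whose domain is exactly id.
   *first receives the start index; the return value is the count. */
size_t facts_for_domain(const fact *facts, size_t n, uint32_t id, size_t *first)
{
    size_t lo = lower_bound_domain(facts, n, id);
    size_t hi = lo;
    while (hi < n && facts[hi].domain == id)
        hi++;
    *first = lo;
    return hi - lo;
}

Running the same kind of search against the range-sorted copy gives the second block, and between them the two blocks cover every fact that mentions the entity.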

Employing a small number of large reads from the disk like this yields a huge performance improvement over naive indexing, where it might be necessary to read thousands of different sections of the file to get all the records. Clustering the records like this is the basis of what is called inverted indexing, a method which I have employed to great advantage in the past on projects like http://tracktype.org

The downside is the extra disk space, not to mention the extra processing needed to create the more sophisticated indices; however, it will certainly be justified for a knowledge base like YAGO2.

I’ve attached the C source code that I wrote for sorting very large files on disk, since there doesn’t seem to be much in the way of commonly available libraries for doing this. I did find one called STXXL on SourceForge though, just for comparison.

http://stxxl.sourceforge.net/
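
The general shape of the program is a classic external merge sort: sort chunks that fit in memory into runs on disk, then merge the runs. A stripped-down sketch follows; this is not the attached code itself, and the buffer size, file names and key order are only placeholders:

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

typedef struct { uint32_t number, predicate, domain, range; } fact;

#define RUN_RECS (4 * 1024 * 1024)   /* facts per in-memory run (64 MB) */
#define MAX_RUNS 512

/* Order by (domain, range, predicate, number). */
static int cmp_drpn(const void *pa, const void *pb)
{
    const fact *a = pa, *b = pb;
    if (a->domain    != b->domain)    return a->domain    < b->domain    ? -1 : 1;
    if (a->range     != b->range)     return a->range     < b->range     ? -1 : 1;
    if (a->predicate != b->predicate) return a->predicate < b->predicate ? -1 : 1;
    if (a->number    != b->number)    return a->number    < b->number    ? -1 : 1;
    return 0;
}

int main(int argc, char **argv)
{
    if (argc != 3) { fprintf(stderr, "usage: %s in.facts out.facts\n", argv[0]); return 1; }

    /* Phase 1: split the input into sorted runs on disk. */
    FILE *in = fopen(argv[1], "rb");
    if (!in) { perror(argv[1]); return 1; }
    fact *buf = malloc((size_t)RUN_RECS * sizeof(fact));
    if (!buf) { fprintf(stderr, "out of memory\n"); return 1; }

    char name[64];
    int nruns = 0;
    size_t got;
    while ((got = fread(buf, sizeof(fact), RUN_RECS, in)) > 0) {
        qsort(buf, got, sizeof(fact), cmp_drpn);
        snprintf(name, sizeof name, "run%03d.tmp", nruns);
        FILE *r = fopen(name, "wb");
        if (!r) { perror(name); return 1; }
        fwrite(buf, sizeof(fact), got, r);
        fclose(r);
        if (++nruns == MAX_RUNS) { fprintf(stderr, "too many runs\n"); return 1; }
    }
    fclose(in);
    free(buf);

    /* Phase 2: k-way merge of the runs into the output file.
       A linear scan over the run heads is fine for a few hundred runs. */
    FILE *run[MAX_RUNS];
    fact head[MAX_RUNS];
    int live[MAX_RUNS];
    for (int i = 0; i < nruns; i++) {
        snprintf(name, sizeof name, "run%03d.tmp", i);
        run[i] = fopen(name, "rb");
        live[i] = run[i] && fread(&head[i], sizeof(fact), 1, run[i]) == 1;
    }
    FILE *out = fopen(argv[2], "wb");
    if (!out) { perror(argv[2]); return 1; }
    for (;;) {
        int best = -1;
        for (int i = 0; i < nruns; i++)
            if (live[i] && (best < 0 || cmp_drpn(&head[i], &head[best]) < 0))
                best = i;
        if (best < 0) break;                      /* every run exhausted */
        fwrite(&head[best], sizeof(fact), 1, out);
        live[best] = fread(&head[best], sizeof(fact), 1, run[best]) == 1;
    }
    fclose(out);
    for (int i = 0; i < nruns; i++) if (run[i]) fclose(run[i]);
    return 0;
}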

EDIT: OK, I keep getting an error when I try to upload the source file. Please email me if you’d like a copy in the meantime, until I’ve figured out how to upload it.

 

 
  [ # 7 ]

What sort of error are you getting, Andrew? Is it a permissions error, based on the file type? If so, I can host the file, and provide a link for downloading it. smile Well, I can host the file for you regardless of the reason.

 

 
  [ # 8 ]
Dave Morton - Oct 12, 2011:

What sort of error are you getting, Andrew? Is it a permissions error, based on the file type? If so, I can host the file, and provide a link for downloading it. smile Well, I can host the file for you regardless of the reason.

Error Message:
upload_unable_to_write_file

That’s what it says when I try to attach a file to a message. Maybe chatbots.org is running out of disk space.

Anyway, hosting the file isn’t a problem as I already have a very capable web server of my own. In this case I just thought it would be nice to attach the file to the message.

It’s still a work in progress anyway… just about to start coding the binary search part. It’s kind of exciting in a way, because it’s been so long since I’ve had to deal with data sets that don’t fit in 8 gigabytes of memory.

 

 

 
  [ # 9 ]

It sounds like a great improvement in efficiency, both of storage and of access. It would be interesting to see something like this put to use with a conversational agent that can make use of so much data. smile

 

 
  [ # 10 ]

I’ve read their site, but I have to say I’m still not that clear on what the added value of “full” is compared to “core”. Do you have any examples of facts or entities which are in full but not core? Do you plan to do both? On a central server I guess the full set is quite straightforward, but on a local machine, where space and resources are more constrained, an informed choice might be required.

 

 
  [ # 11 ]

At some point in time, I will probably also have to look at this dataset.
Bloody big though.

 

 
  [ # 12 ]
Dave Morton - Oct 12, 2011:

It sounds like a great improvement in efficiency, both of storage and of access. It would be interesting to see something like this put to use with a conversational agent that can make use of so much data. smile


I am anxious to see something like this working someday.


What do you think is the hardest part: the database, or building hardware capable of processing it efficiently?
As I understand it, such a huge database needs hardware able to simulate synapses? That is what Dan Hughes is trying to do with ALF, isn’t it? (Forgive my ignorance, I am just a curious person trying to learn)... LOL

 

 
  [ # 13 ]
Fatima Pereira - Oct 15, 2011:

As I understand it, such a huge database needs hardware able to simulate synapses? That is what Dan Hughes is trying to do with ALF, isn’t it? (Forgive my ignorance, I am just a curious person trying to learn)...

YAGO2 is based on the Resource Description Framework (RDF):

http://en.wikipedia.org/wiki/Resource_Description_Framework

Shortcuts and syntactic sugar aside, this consists entirely of triples in the form of

subject predicate object 

For convenience in YAGO2, each triple is assigned a unique identification number, allowing it to be referenced by other triples. This is called reification and, in natural language terms, is like the process of assigning names to ideas and concepts.

This means that the YAGO2 knowledge base consists of nothing more than over a billion sets of exactly four values.

{identifier,subject,predicate,object} 

Sometimes the terms “domain” and “range” are used instead of “subject” and “object”, reflecting the mathematical function aspect of this data, i.e. predicate(domain) -> range.
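
A made-up example may help; the identifiers and exact predicate names here are invented purely for illustration:

#1001   AlbertEinstein   wasBornIn     Ulm
#1002   #1001            occursSince   1879-03-14

The second fact makes a statement about the first fact itself, which is only possible because the first fact carries its own identifier. That is all reification really means here.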

So, you do not need (or want) to be able to simulate synapses to use this. You do need a really good relational database management system (RDBMS) though. The authors of YAGO2 have used PostgreSQL for the project, but conversion tools are available for other flavours of RDBMS if you really must use something else (not recommended).

I did a lot of experimenting with YAGO and PostgreSQL a couple of years ago. This time round I’m attempting to go the extra mile by writing some low level software in C specially for processing YAGO2.

So far it is proving to be about 10 times faster than using an RDBMS for the job (processing takes a couple of hours instead of days or weeks), and it also requires a lot less storage space (maybe less than 100 gigabytes in total), but there is still a lot of work left to be done to implement it.

 

 

 
  [ # 14 ]

One of the things that I tried doing to make the YAGO2 knowledge base more compact was to treat the entire fact file as an array instead of a table. The difference is that in a table, the unique row number has to be stored in each row, requiring 16 bytes per row instead of 12. If it was converted to an array, the unique row number would not need to be stored because it would be implied by the position of the element in the array.

The problem with this plan is that there are a lot of gaps in the row numbers. Replacing the “missing” rows with nulls to pad out the existing rows to the proper positions in the array works, but the resulting file of 12-byte records is 20 gigabytes, even larger than the 19-gigabyte file that uses packed 16-byte rows. This would still be the best way to do it in the long run, because by renumbering all the rows, the empty positions could be closed up and the resulting file would be much smaller.
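
To illustrate the difference between the two layouts, the lookups go roughly like this in C (the types and names are only for this example):

#include <stddef.h>
#include <stdint.h>

/* Table layout: the id is stored in every 16-byte row, so finding a
   fact by number means a binary search over rows sorted by number. */
typedef struct { uint32_t number, predicate, domain, range; } fact_row;

const fact_row *table_lookup(const fact_row *rows, size_t n, uint32_t id)
{
    size_t lo = 0, hi = n;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (rows[mid].number < id) lo = mid + 1; else hi = mid;
    }
    return (lo < n && rows[lo].number == id) ? &rows[lo] : NULL;
}

/* Array layout: the id is implied by position, so each element needs
   only 12 bytes and lookup is a single index operation, at the cost
   of null padding for the missing row numbers. */
typedef struct { uint32_t predicate, domain, range; } fact_elem;

const fact_elem *array_lookup(const fact_elem *elems, size_t n, uint32_t id)
{
    if (id >= n) return NULL;
    const fact_elem *e = &elems[id];
    /* an all-zero element marks a gap in the numbering */
    return (e->predicate || e->domain || e->range) ? e : NULL;
}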

However there may be an even better way which I am exploring at the moment…

 

 
  [ # 15 ]

I don’t have much to say on this subject, but wanted to let you know I’m following this thread and look forward to seeing what you put together. smile

 
