Tuesday, 23 December 2014

Using Elastic Search with mongodb

Elastic Search (ES) is super duper fast. Once you integrate it with mongodb and start redirecting your queries there, you will be able to scale much better.

But like everything else out there, ES has a learning curve, and sometimes when you get stuck it is difficult to figure out what's going wrong.

We ran into some issues while working with ES, so here goes our experience and learning with ES.

I. Installations:

    A. Install mongodb : http://docs.mongodb.org/manual/tutorial/install-mongodb-on-ubuntu/

    B. Install Elastic Search: this is simple. Just download the latest package of ES and extract it.

    C. Install mongo connector: pip install mongo-connector #maybe with sudo

         This application is used to pull data out of mongo and push to ES

    D. Install plugin head on ES: [Optional but very helpful plugin]
         (if elastic-search is installed then plugin may be found in this directory : /usr/share/elasticsearch/bin)

        elasticsearch/bin/plugin -install mobz/elasticsearch-head

II. Start Replication log on Mongodb:

    You don't need to actually set up replication on mongo. You just need to enable the replication log (the oplog), because mongo connector reads the incoming changes from it.

    A. Edit the mongodb config file

            sudo gedit /etc/mongod.conf

        Add the line

            replSet=rs0

    B. Restart mongo

            sudo service mongod restart

    C. Log in to mongo

            mongo

        and run this command

           rs.initiate()

    Now the replication log is set up.

III. Start Elastic Search

    A. Go to the ES directory

            cd elasticsearch/bin/
         
            or

            cd /usr/share/elasticsearch/bin/

    B. Run elasticsearch

             ./elasticsearch

IV. Setup mongo Connector

    A. Open a new tab and enter the following command

            mongo-connector -m localhost:27017 -t localhost:9200 -d elastic_doc_manager --oplog-ts oplogstatus.txt

Your ES is now set up. An index will be created automatically for each collection in your database.

V. Monitor all indexes [if you have installed the head plugin]

        http://localhost:9200/_plugin/head/

    Indexes are going to be created with the following naming convention:

        Mongo-connector gives each MongoDB collection its own index in Elasticsearch. For example, documents from the collection kittens in the database animals will be put into the animals.kittens index in Elasticsearch.
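
The naming convention above can be sketched in a few lines of Python. This is only an illustration of the database.collection rule described here; the helper names index_name_for and split_index_name are made up for this example and are not part of mongo-connector.

```python
def index_name_for(database, collection):
    """Join database and collection the way the convention above describes."""
    return "{}.{}".format(database, collection)


def split_index_name(index):
    """Recover (database, collection) from an index name.

    Only the first dot is used as the separator, because collection
    names may themselves contain dots.
    """
    database, collection = index.split(".", 1)
    return database, collection


index = index_name_for("animals", "kittens")
print(index)                    # animals.kittens
print(split_index_name(index))  # ('animals', 'kittens')
```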


VI. Querying Elastic search:

    A. You can use elastic search clients:
       
         Official list of supported clients : See Clients & Integrations

         http://www.elasticsearch.org/guide/

    B. Simple rest calls:
     
         You can simply make calls to ES over http

         ex: http://localhost:9200/animals.kittens/_search?pretty=1&q=Tom&size=2&fields=id,owner.name&sort=age
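
If you are building such URLs from code, it is safer to let a library do the escaping. Here is a minimal sketch using only the Python standard library; build_search_url is a hypothetical helper name, and the host and parameters are simply the ones from the example URL above.

```python
from urllib.parse import urlencode


def build_search_url(index, params, host="http://localhost:9200"):
    """Build a URI-search URL for the given index.

    safe="," keeps the comma in lists such as "id,owner.name" readable;
    everything else that needs escaping is percent-encoded.
    """
    return "{}/{}/_search?{}".format(host, index, urlencode(params, safe=","))


url = build_search_url(
    "animals.kittens",
    {"pretty": 1, "q": "Tom", "size": 2, "fields": "id,owner.name", "sort": "age"},
)
print(url)
```

The same helper will escape search terms containing spaces or symbols, which is easy to get wrong when typing the URL by hand.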

         Here you may have to experiment and try out different combinations. If you have a very complicated structure then building a query may turn out to be difficult, so do it step by step.
         Partition the query into pieces, build them separately, and then figure out a way to integrate them.

         eg: result = es.search(index="animals.kittens", body=
               {
                    "from" : 0,
                    "size" : 10,
                    "query" : {
                        "filtered" : {
                            "query": {
                                "query_string": {
                                    "query": "mouse",
                                    "fields": ["product"]
                                }
                            },
                            "filter" : {
                                "bool" : {
                                    "must" : [
                                        {
                                            "terms" : {
                                                "address.name" : ["mumbai", "pune"]
                                            }
                                        },
                                        {
                                            "range": {
                                                "cost.inr": {
                                                    "gte" : 0,
                                                    "lte" : 400
                                                }
                                            }
                                        }
                                    ],
                                    "must_not" : {
                                        "terms" : {"id" : ["43832jd0dskf09123yhjdhf012u3j"]}
                                    }
                                }
                            }
                        }
                    },
                    "sort": [
                        {
                            "rating.avg_ratting": {
                                "order": "desc"
                            }
                        }
                    ]
               })

       With the above query you can figure out what the structure of the data will be like. Play around with ES, and if you get stuck then drop a comment or ask on Stack Overflow or Quora. :)
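
One way to follow the step-by-step advice is to build each piece of the query as a small Python function and assemble them at the end. This is a minimal sketch, not the only way to do it: the helper names (text_query, terms_filter, range_filter, filtered_search) are made up for illustration, and the query shape simply mirrors the example above.

```python
def text_query(text, fields):
    """The full-text part of the search."""
    return {"query_string": {"query": text, "fields": fields}}


def terms_filter(field, values):
    """Match documents whose field equals any of the given values."""
    return {"terms": {field: values}}


def range_filter(field, low, high):
    """Match documents whose field lies between low and high (inclusive)."""
    return {"range": {field: {"gte": low, "lte": high}}}


def filtered_search(query, must, must_not, sort_field, start=0, size=10):
    """Integrate the separately built pieces into one request body."""
    return {
        "from": start,
        "size": size,
        "query": {
            "filtered": {
                "query": query,
                "filter": {"bool": {"must": must, "must_not": must_not}},
            }
        },
        "sort": [{sort_field: {"order": "desc"}}],
    }


body = filtered_search(
    query=text_query("mouse", ["product"]),
    must=[
        terms_filter("address.name", ["mumbai", "pune"]),
        range_filter("cost.inr", 0, 400),
    ],
    must_not=terms_filter("id", ["43832jd0dskf09123yhjdhf012u3j"]),
    sort_field="rating.avg_ratting",
)
```

Each piece can be tried on its own first (for example a body containing only the text query), and the resulting body can then be passed to es.search(index="animals.kittens", body=body).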

LEARNING:

I. To create a custom mapping :

    If you are planning to use ES in production for a big data set, you will probably need to write your own mapping file. Know that mongo-connector creates the index at:

   http://localhost:9200/animals.kittens/

and the mapping is put under :

    http://localhost:9200/animals.kittens/string/

So to apply your own mapping you will need to do this:

    A. First clear the existing index by deleting it. [The head plugin gives you the option of making a DELETE call on the index.]

    B. Create a new index

            curl -XPUT 'http://localhost:9200/animals.kittens/'

    C. Apply the mapping file

            curl -XPUT 'http://localhost:9200/animals.kittens/string/_mapping/' -d '
              {
                  "string": {
                      "properties": {
                          "owner": {
                              "properties": {
                                  "name": {
                                      "type": "string"
                                  },
                                  "address": {
                                      "type": "string"
                                  },
                                  "tell": {
                                      "type": "string"
                                  }
                              }
                          },
                          "breed": {
                              "type": "string"
                          },
                          "id": {
                              "type": "string",
                              "index": "not_analyzed"
                          }
                      }
                  }
              }'

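         not_analyzed is very important for fields which might contain spaces or special characters like "#$@" etc., because an analyzed string field is broken into tokens and an exact match on the full value will no longer work.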
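
To see why this matters, here is a rough Python illustration. Note the assumptions: rough_analyze is a crude stand-in that lowercases and splits on anything that is not a letter or digit; it is not Elasticsearch's actual analyzer, and the id value is invented for the example.

```python
import re


def rough_analyze(value):
    """Crude stand-in for analysis: lowercase, split on non-alphanumerics."""
    return re.findall(r"[a-z0-9]+", value.lower())


def term_matches(stored_terms, wanted):
    """An exact-match lookup compares the wanted value with stored terms as-is."""
    return wanted in stored_terms


raw_id = "kit#42@home"                 # hypothetical id with special characters

analyzed_terms = rough_analyze(raw_id)  # roughly what an analyzed field would store
not_analyzed_terms = [raw_id]           # a not_analyzed field keeps the value whole

print(analyzed_terms)                            # ['kit', '42', 'home']
print(term_matches(analyzed_terms, raw_id))      # False
print(term_matches(not_analyzed_terms, raw_id))  # True
```

The analyzed version only contains the fragments, so looking up the full id finds nothing, while the not_analyzed version still holds the exact value.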

II. DB-structure limitation:

A. To use ES efficiently you have to ensure a few consistencies in your database.

All top-level properties must be present in all documents:

eg:

            [
                {
                    "name": "Vikash",
                    "dob": "23-11-1989",
                    "address": "Indiranagar, Bangalore"
                },
                {
                    "name": "Viki",
                    "dob": "22-10-1990"
                }
            ]

This is a bad idea because the absence of the address field in the 2nd document may cause issues.

B. If you don't have a value for a particular field, replace it with null but still create the field.

                {
                    "name": "Viki",
                    "dob": "22-10-1990",
                    "address": null
                }
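
If your existing documents are not consistent yet, a small clean-up pass can add the missing fields before the data is written. The following is a minimal sketch in plain Python, assuming the documents are available as dictionaries; fill_missing_fields is a hypothetical helper name.

```python
def fill_missing_fields(documents):
    """Return copies of the documents where every top-level key seen in
    any document is present in all of them (missing values become None,
    which is stored as null in JSON)."""
    all_keys = []
    for doc in documents:
        for key in doc:
            if key not in all_keys:
                all_keys.append(key)
    return [{key: doc.get(key) for key in all_keys} for doc in documents]


people = [
    {"name": "Vikash", "dob": "23-11-1989", "address": "Indiranagar, Bangalore"},
    {"name": "Viki", "dob": "22-10-1990"},
]

normalized = fill_missing_fields(people)
print(normalized[1])
# {'name': 'Viki', 'dob': '22-10-1990', 'address': None}
```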

C. This is more a json rule than a db schema rule.

  Values in an array should be of the same type, and I don't mean just the data type; I mean in terms of meaning.

eg: if you need to store an expiration date in years and months separately

        "expiration_time":
        [
            1, //year
            6 //months
        ]

you should rather store it as a map/dictionary:

        "expiration_time":
        {
            "years" : 1,
            "months" : 6
        }
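
Converting existing data from the array form to the map form is a one-liner per document. A minimal Python sketch, assuming the array always holds exactly [years, months] in that order (expiration_to_mapping is a made-up helper name):

```python
def expiration_to_mapping(expiration):
    """Turn a positional [years, months] array into a self-describing map."""
    years, months = expiration  # raises ValueError if the array has another length
    return {"years": years, "months": months}


print(expiration_to_mapping([1, 6]))
# {'years': 1, 'months': 6}
```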

D. Another json rule:

    Keys that you use in dictionaries should not be generated randomly. Also, if possible, avoid rule-based key generation. This will help you keep your database more structured.

     eg:

        {
            "january":
            {
                "sum": value
            },
            "december":
            {
                "sum": value
            }
        }
        // if you are not going to have the sums for all 12 months stored in there then I suggest using month as a parameter in the data
     
        {
            "aggregated_data":
            [
                {
                    "month": 1,
                    "sum": value
     
                },
                {
                    "month":12,
                    "sum": value
                }
            ]
        }

 

Here it will be much easier to select the data from the min month than it would be in the 1st case. Also, you can sort by month much more easily here.
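
Moving from the first layout to the second can be done with a short script. This is a sketch in plain Python under a few assumptions: the keys are lowercase English month names, each entry has a "sum" field as in the example, and the numbers used below are invented sample values.

```python
MONTHS = [
    "january", "february", "march", "april", "may", "june",
    "july", "august", "september", "october", "november", "december",
]
MONTH_NUMBERS = {name: number for number, name in enumerate(MONTHS, start=1)}


def to_aggregated_list(by_month_name):
    """Convert {"january": {"sum": ...}, ...} into the list-based layout,
    sorted by month number."""
    rows = [
        {"month": MONTH_NUMBERS[name], "sum": entry["sum"]}
        for name, entry in by_month_name.items()
    ]
    return {"aggregated_data": sorted(rows, key=lambda row: row["month"])}


# Sample values are made up for the illustration.
old_layout = {"december": {"sum": 30}, "january": {"sum": 10}}
new_layout = to_aggregated_list(old_layout)

print(new_layout["aggregated_data"])
# [{'month': 1, 'sum': 10}, {'month': 12, 'sum': 30}]

# With the list layout, picking the earliest month is a plain min().
earliest = min(new_layout["aggregated_data"], key=lambda row: row["month"])
print(earliest)  # {'month': 1, 'sum': 10}
```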

To conclude: ES is really really really fast. I am in love with its speed. Hope you enjoy working with it too.

Bye for now.