Blog Analysis based on Elasticsearch

Hands on Elasticsearch.

Elasticsearch is a document-oriented search engine based on Lucene library. The use case I have in my work is about searching relevant documents in large document database systems based on specified keywords, where Elasticsearch can be perfectly applied. As it has open source version, I decide to build a blog search engine for my past posts at this site.

1. Download and Configure Elasticsearch

First download elasticsearch-7.7.1 from elastic. Next go to /elasticsearch-7.7.1/config/elasticsearch.yml, I only change network.host to "localhost", other fields remain defaults. Then run command in /elasticsearch-7.7.1/bin/

1	elasticsearch -d

Then go to http://localhost:9200/, see

{
  "name" : "BANGDA-PC",
  "cluster_name" : "elasticsearch",
  ...
}

indicates elasticsearch engine is already on.

2. Kibana Dashboard

There is a dashboard called kibana can be used to visualize data. Download kibana and go to /kibana-7.7.1/bin run

kibana

Then the dashboard will be run at http://localhost:5601/.

3. Import Data

The posts I have are all in markdown files, I use python to process and save all of them into a single csv file (see notebook). The csv file can be imported into kibana.

A screenshot of the dashboard

As the dashboard shows, I have 56 blogs until now. There are several periods that I have high productivity: 2018/07, 2019/03 and 2019/04.

And the distribution of blog category is like this

4. Query Blogs

In the query window, run

1	title: ("recommender")

All blogs with “recommender” in the title will be returned:

A more extensive exploration of query can be found in this notebook.

5. Visualization Using Wordcloud

Using elasticsearch_dsl can query data using python. First check all blogs with title contains “recommender”:

from elasticsearch import Elasticsearch
from elasticsearch_dsl import Search, Q
url = 'http://localhost:9200'
es = Elasticsearch(url)
s = Search()
s = s.query('term', title='recommender')
res = es.search(index='bangda_blog', body=s.to_dict(), size=50)

Three blogs are returned, the score here is by default using Lucene’s Practical Scoring Function. This is a similarity model based of Tfidf for queries.

Title                                                               | Tags                                                                           | Score
------------------------------------------------------------------------------------------------------------------------------------------------------------------
Matrix Factorization for Recommender Systems                        | [Recommender Sys, Machine Learning, Factorization Models]                      | 3.004182
Recommender Systems Overview - From A Survey                        | [Recommender Sys, Machine Learning]                                            | 2.8004267
Wide and Deep Learning for Recommender Systems                      | [Recommender Sys, Machine Learning, Deep Learning]                             | 2.6225545

A wordcloud of these three blogs:

Another wordcloud for blogs contains “machine learning” in tags.

Here is the notebook used to query blogs and generate wordclouds.