Getting destination themes from Wikivoyage text

With so much publicly available information on travel destinations these days, I was wondering if there was a simple way to extract themes for destinations around the world. This would allow me to quickly look up potential destinations to travel to based on any theme of interest. Using the Wikivoyage text dumps and some Python, I found a simple solution that worked decently for the tiny amount of effort involved. And so, we're back on the topic of extracting structured information from the considerable amount of unstructured text on Wikivoyage.

 

The Wikivoyage dumps are a gold mine for anyone looking for general information about destinations around the world. After the previous post on creating a destinations graph from the Wikivoyage text dumps, this post explores a quick way of identifying destination themes from the same dumps. Some parts of this exercise are taken from the previous post and notebook, so it's a good idea to go through that first. The complete code can be found on my Github here.

 

1. Loading data from previous exercise

The first reused item is the article text, downloaded, lightly processed and cleaned in the previous exercise. It will be cleaned further before themes are extracted. raw_text is a dictionary with destination names (or article titles) as keys and their text as the values. Refer to steps 1 to 4 in the notebook for more details on how the JSON file was created.

import json

with open('wikivoyage_latest_articles_text.json', 'r') as f:
    raw_text = json.load(f)

 

The second reused part is the cleaned destination details data, another by-product of the earlier exercise. I use this to identify which articles in the full dump to process: articles with destination details are likely to have more complete information for analysis. Step 5 in the previous notebook explains how this is obtained.

with open('destination_details.json', 'r') as f:
    destination_details = json.load(f)

 

2. Standardizing names in both datasets

Due to the order in which they were created, destination names in ‘raw_text’ are not cleaned, while those in ‘destination_details’ are. Going back to step 5 in the previous notebook, I replicate the cleaning process as a function so that I can cross-check between the two datasets consistently.

import unicodedata

def standardize_name(destination_name):
    # replace underscores, drop any template markup after '{{', trim and lowercase
    destination_name = destination_name.replace('_', ' ').split('{{')[0].strip().lower()
    # special-case the two names with non-ASCII characters (Brač and Rügen)
    # so they match the keys used in destination_details
    if unicodedata.normalize('NFKD', destination_name).encode('ascii', 'ignore') == 'brac':
        destination_name = 'brac'
    elif unicodedata.normalize('NFKD', destination_name).encode('ascii', 'ignore') == 'rugen':
        destination_name = 'rugen'
    return destination_name
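
As a quick example of what this does to a (hypothetical) raw article key: underscores become spaces, anything after '{{' is dropped and the result is lowercased, with Brač and Rügen additionally mapped to their plain-ASCII forms so they match the keys in destination_details.

print(standardize_name(u'Walt_Disney_World'))   # -> walt disney world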

 

3. Extract specific parts of article text for more efficient extraction of themes

In this final cleaning step I look for specific parts of each article’s Wikivoyage text that are most relevant for finding themes associated with a destination.

 

Selecting relevant articles/destinations

First, I go through each article in the full raw text dump. To pick only destinations with more reliable information, I only process destinations whose standardized names can be found in the destination details dataset:

for i in raw_text:
    if standardize_name(i) in destination_details:

 

Selecting relevant portions of text

Next, I pick out the introductory part of the text, which is everything before the ‘Get in’ section. The ‘Get in’ heading appears with slightly different spacing variations, so both are checked. ‘t’ holds the selected relevant text for each article, consolidated into a single string.

t = ''
if '==Get in==' in raw_text[i]:
    t = raw_text[i].split('==Get in==')[0]
    t += '\n'
elif '== Get in ==' in raw_text[i]:
    t = raw_text[i].split('== Get in ==')[0]
    t += '\n'

 

The other two sections where information about a destination's themes is most likely to be found are the ‘See’ and ‘Do’ sections, so I extract those as well and append them to ‘t’. Like the introduction, these are extracted by searching for the start and end heading delimiters.

if '==See==' in raw_text[i]:
    t+= raw_text[i].split('==See==')[1].split('==Do==')[0].split('== Do ==')[0].split('==Eat==')[0].split('== Eat ==')[0].split('==Eat and Drink==')[0].split('== Eat and Drink==')[0].split('==Buy==')[0].split('== Buy ==')[0]
elif '== See ==' in raw_text[i]:
    t += raw_text[i].split('== See ==')[1].split('==Do==')[0].split('== Do ==')[0].split('==Eat==')[0].split('== Eat ==')[0].split('==Eat and Drink==')[0].split('== Eat and Drink==')[0].split('==Buy==')[0].split('== Buy ==')[0]
if '==Do==' in raw_text[i]:
    t += raw_text[i].split('==Do==')[1].split('==Eat==')[0].split('== Eat ==')[0].split('==Eat and Drink==')[0].split('== Eat and Drink==')[0].split('==Buy==')[0].split('== Buy ==')[0]
if '== Do ==' in raw_text[i]:
    t += raw_text[i].split('== Do ==')[1].split('==Eat==')[0].split('== Eat ==')[0].split('==Eat and Drink==')[0].split('== Eat and Drink==')[0].split('==Buy==')[0].split('== Buy ==')[0]
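
As an aside, since each heading can appear with or without spaces inside the '==' markers, a small helper along these lines (not part of the original notebook; just a sketch using the same heading names as above) could reduce the repetition:

def cut_before(text, headings):
    # truncate the text at every heading variant that appears,
    # leaving only what comes before all of them
    for heading in headings:
        for variant in ('==%s==' % heading, '== %s ==' % heading):
            if variant in text:
                text = text.split(variant)[0]
    return text

# e.g. the text after the 'See' heading, cut before the next relevant section
# see_text = cut_before(raw_text[i].split('==See==')[1], ['Do', 'Eat', 'Eat and Drink', 'Buy'])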

 

Cleaning and standardizing formatting

If any text has been found in the above parts, the following step runs a series of cleaning processes to remove tags, headers and links and to standardize some formatting. Finally, a destination is only added to the data for the next step when a sufficient number of relevant words is found in its article. The actual cleaning code is in the notebook; a rough sketch of the kind of steps involved follows.
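
A minimal sketch of what that cleaning might look like, assuming the results end up in a cleaned_text dictionary keyed by the standardized name (the regexes and the 50-word threshold are illustrative assumptions, not the notebook's exact rules):

import re

def clean_wikitext(text, min_words=50):
    # strip {{...}} templates, ==section== headers and wiki link markup,
    # then collapse whitespace; drop articles with too few remaining words
    text = re.sub(r'\{\{.*?\}\}', ' ', text, flags=re.DOTALL)
    text = re.sub(r'==+[^=]+==+', ' ', text)
    text = re.sub(r'\[\[(?:[^\]|]*\|)?([^\]]*)\]\]', r'\1', text)
    text = re.sub(r'\s+', ' ', text).strip()
    return text if len(text.split()) >= min_words else None

# inside the loop over raw_text, with cleaned_text = {} initialised beforehand
cleaned = clean_wikitext(t)
if cleaned:
    cleaned_text[standardize_name(i)] = cleaned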

 

4. Create Corpus

This part creates two items. corpus is a Python list, with each item corresponding to the consolidated text from the previous step for one article. id_lookup is created so that we can identify which item in the corpus corresponds to which destination.

corpus = []
id_lookup = {}
completed = 0
for destination in cleaned_text:
    id_lookup[len(corpus)] = destination
    corpus.append(cleaned_text[destination])

For example, on the first iteration, len(corpus) is 0, so id_lookup[len(corpus)] = destination creates a key of 0 in id_lookup with the first destination in cleaned_text as its value. The extracted text for this destination (cleaned_text[destination]) is then appended to corpus as the first item (index 0). So with index 0 you can find the destination in id_lookup (via id_lookup[0]), and you can also find the associated text for that destination at corpus[0].

 

5. Perform TFIDF on data

Next I compute the prominence of each word within each destination's extracted text using the scikit-learn (sklearn) package. The algorithm used is TF-IDF, though after many iterations I decided to use the settings that simply extract the raw count of each individual word. Let's talk a little about TF-IDF, then TF-IDF in sklearn, and finally why I ended up using raw word counts.

What is tf-idf?

According to Wikipedia,

term frequency–inverse document frequency, is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus

Term frequency: within a document, a term that is found more frequently will be viewed as more important (given greater weight)

Inverse document frequency: across documents in the corpus, if a term is found across a higher proportion of documents, the term will be viewed as less significant, compared to a term which is found in a lower proportion of documents.

There is also plenty of math involved, which you can read more about on the Wikipedia page if you are interested.
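
As a toy illustration of the most common formulation (scikit-learn's implementation uses a smoothed variant), a term's score in a document is its count in that document multiplied by the log of how rare the term is across the corpus:

import math

def toy_tfidf(term_count_in_doc, docs_containing_term, total_docs):
    # tf = raw count in the document; idf = log(N / df) down-weights common terms
    return term_count_in_doc * math.log(float(total_docs) / docs_containing_term)

# e.g. 'beach' mentioned 5 times in an article, appearing in 200 of 1000 articles
print(toy_tfidf(5, 200, 1000))   # 5 * log(1000/200) = 5 * log(5), roughly 8.05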

 

sklearn.feature_extraction.text.TfidfVectorizer

There are lots of parameters that can be set for this vectorizer; you can read more about them on the documentation page for TfidfVectorizer. Here I'll go through some of the parameters I've used, followed by a small toy example.

analyzer: sets whether the vectorizer analyzes words (‘word’) or character n-grams (‘char’). In this case I'll analyze words.

ngram_range: sets the bounds on how many words or characters are analyzed as a single term. Setting it to (1,1) analyzes only individual words, while (1,3) analyzes individual words plus phrases of up to 3 words. A larger range increases processing time.

min_df: the minimum document frequency a term needs before it is included in the analysis. Float and integer values behave differently: 0.1 means any term found in less than 10% of all documents in the corpus is ignored, while 10 means any term found in fewer than 10 documents is ignored. 0 means all terms are included. A lower value increases processing time.

max_df: the opposite of min_df; a higher value increases processing time.

lowercase: converts all words to lowercase; enabled by default.

use_idf: enables the ‘inverse document frequency’ re-weighting; enabled by default. If disabled, words are not re-weighted based on how many documents they appear in.

norm: applies l1 or l2 normalization, scaling each document's vector so that its values sum to 1 (l1) or so that it has unit Euclidean length (l2). You can read more about it here.

stop_words: removes stop words (e.g. ‘the’, ‘on’) from the terms to be analyzed, since they are generally not meaningful.
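
To make a few of these parameters concrete, here's a small toy example (not from the notebook; the documents and settings are made up for illustration):

from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    'Sandy Beach has a long sandy beach',
    'The beach town is known for surfing',
    'A city known for museums and shopping',
]

# keep only terms found in at least 2 documents, keep the original casing,
# and drop common English stop words
tf = TfidfVectorizer(analyzer='word', ngram_range=(1, 1), min_df=2,
                     lowercase=False, stop_words='english')
tf.fit(docs)

# vocabulary_ maps each surviving term to its column index; with these toy
# documents only 'beach' and 'known' appear in two or more documents
# (capitalised 'Beach' is counted separately because lowercase=False)
print(tf.vocabulary_)   # expect something like {'beach': 0, 'known': 1}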

 

My initial attempt

My first setup mirrored Mark Needham's setup for analyzing ‘How I Met Your Mother’ transcripts with the tf-idf vectorizer. Here's how he set up his vectorizer, which worked well in that context:

tf = TfidfVectorizer(analyzer='word', ngram_range=(1,3), min_df = 0, stop_words = 'english')

Several issues occurred when I applied this to the Wikivoyage text to obtain destination themes from prominent words in each document, including:

  1. Each word in a very short article gets a very high weight due to its relative prominence within the document. Removing normalization helped with this.
  2. By default, all words are lowercased before analysis. Destinations whose names contain a keyword ended up being weighted very highly even if the destination had nothing to do with the theme or activity. By not lowercasing the words, I was able to separate destination names from adjectives, verbs and non-name nouns.

These and a host of other issues made interpretation difficult. So for this initial run I decided to keep things simple and extract the raw count of each term, treating the popularity of a word relative to other destination articles as an approximation of the relative prominence of the associated theme.

A few other changes were made for performance, such as the minimum and maximum document frequencies and the ngram range. Words that occur too infrequently are not that relevant to this analysis, which aims to pick out words commonly associated with particular themes. On the other hand, words that occur too frequently are not that useful for picking out the unique characteristics of a destination either.

 

Final set up

In the end it was all stripped down to this:

tf = TfidfVectorizer(analyzer='word', ngram_range=(1,1), min_df = 0.001, max_df=0.6, lowercase=False, use_idf=False, norm=None, stop_words = 'english')

The steps after this fit the corpus with the vectorizer and obtain the terms that meet the criteria set above.
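
The fitting step itself isn't reproduced in the post, but it presumably looks something like this (get_feature_names was the method name in older scikit-learn versions; newer versions use get_feature_names_out):

# fit the vectorizer on the corpus and get the document-term matrix
tfidf_matrix = tf.fit_transform(corpus)

# the terms that survived the filters, in the same order as the matrix columns
feature_names = tf.get_feature_names()   # get_feature_names_out() in newer sklearn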

 

6. Transform the output to store it

Next I transform this output into a more understandable format, then store it. Again, this loosely follows Mark's tutorial on using tf-idf on the ‘How I Met Your Mother’ transcripts.

 

This refers back to the id_lookup dictionary created earlier to identify which word-score pairs belong to which destination article. The order of the documents in the corpus is the same as the order of the rows in the tf-idf matrix, which is what makes the lookups work: the first row of tfidf_matrix.todense() is the document at index 0 of id_lookup, and doc_id starts at 0, so consolidated[id_lookup[doc_id]] picks out the right destination. Likewise, the columns of the matrix follow the order of feature_names, so feature_names[word_id] gives the right term, with word_id starting at 0 and incrementing as we iterate through the word scores in each document.

completed=0
doc_id = 0
consolidated = {}
for doc in tfidf_matrix.todense():
    word_id = 0
    consolidated[id_lookup[doc_id]] = {}
    for score in doc.tolist()[0]:
        if score > 0:
            consolidated[id_lookup[doc_id]][feature_names[word_id].encode("utf-8")] = score
        word_id +=1
    doc_id +=1

 

7. Get themes from word prominence in destination articles

Now that we know each word’s ‘prominence’ for each destination, what’s left is to map destination themes to the relevant words and we’ll be able to find out how each destination ‘scores’ for each theme.

 

Here's a simple mapping of relevant keywords to each theme. Put simply, a destination whose article mentions ‘beach’ or ‘beaches’ many times is likely to be a beach destination. Not exact, but it approximately gets the job done.

themes_dict = {
    'beach': ['beach', 'beaches'],
    'shopping': ['shopping', 'malls'],
    'temples': ['temple', 'temples'],
    'surfing': ['surf', 'surfing', 'surfers'],
    'diving': ['dive', 'diving', 'divers'],
    'hiking': ['hike', 'hiking', 'hikers', 'trek', 'trekking', 'trekkers'],
    'culture': ['culture', 'cultural', 'cultures'],
    'food': ['foodie', 'food', 'restaurants', 'delicacy', 'delicacies'],
    'museums': ['museums', 'museum'],
}

 

All that mapping is done below and the output is saved as a JSON file so it can be reused later (data here refers to the word-score dictionary built in the previous step). In the resulting dictionary the themes are the keys, each mapped to a list of tuples; each tuple holds a destination and its theme score, which is the sum of all the relevant word scores for that destination.

consolidated = {'beach': [], 'shopping': [], 'temples': [], 'surfing': [], 'diving': [], 'hiking': [], 'culture': [], 'food': [], 'museums': []}
completed=0
for destination in data:
    for theme in themes_dict:
        score = 0
        for word in themes_dict[theme]:
            if word in data[destination]:
                score+=data[destination][word]
        if score>0:
            consolidated[theme].append((destination, score))

with open('destination_themes.json', 'w') as f:
    json.dump(consolidated, f)

8. Reuse functions created in the destinations graph notebook

Before we proceed to test the output, I'll recreate some functions that were first written in the destinations graph notebook, to make testing easier.

The first, get_parent, produces the hierarchy of destinations above the selected destination:

def get_parent(current, chain=''):
    # on the first call, lowercase the destination and start the chain with it
    if chain == '':
        chain = current.lower()
        current = current.lower()
    try:
        # walk up the 'ispartof' links, prepending each parent to the chain
        for parent in destination_details[current]['ispartof']:
            chain = '%s|%s' % (parent, chain)
            chain = get_parent(parent, chain)
    except KeyError:
        # reached an article with no details recorded; stop climbing
        return chain
    return chain

For example, get_parent(‘Thailand’) produces:

asia|southeast asia|thailand

 

Next, get_child produces a list of the immediate child articles of a given parent, if any:

def get_child(search):
    child_articles = []
    for article in destination_details:
        for parent in destination_details[article]['ispartof']:
            if parent == search.lower():
                child_articles.append(article)
    return child_articles

And get_child(‘Thailand’) produces:

[u'isaan', u'eastern thailand', u'southern thailand', u'central thailand', u'northern thailand']

 

9. Get top destinations for given theme and region

Now, let’s try to get the top beach destinations in Asia.

with open('destination_themes.json', 'r') as f:
    destination_themes = json.load(f)
theme = 'beach'
region = 'Asia'
sorted_scores = sorted(destination_themes[theme], key=lambda t: t[1] * -1)

printed = 0
for score in sorted_scores:
    if region.lower() in get_parent(score[0]) and (len(get_child(score[0]))==0):
        print '%s (%s): %s' %(score[0].title(), score[1], get_parent(score[0]).title().replace('|', ' > '))
        printed +=1
        if printed == 5:
            break

After setting the variables, sorted_scores holds all the tuples found for that theme, sorted by the second value (t[1]), which is that destination's score for the theme, with the destination name in t[0]. Multiplying by -1 in the key simply sorts in descending order (passing reverse=True would do the same).

 

The second part goes through the sorted scores and prints the top 5 destinations in the specified region, checking for the region in each destination's hierarchy string. It also checks that the destination is an end node (a destination rather than a country or larger region) before printing it. The print statement does some formatting to make the output look a little nicer, and once 5 items have been printed the loop breaks.

Quy Nhon (25.0): Asia > Southeast Asia > Vietnam > Central Coast (Vietnam) > Quy Nhon
Kovalam (21.0): Asia > South Asia > India > Southern India > Kerala > Southern Travancore > Kovalam
Palolem (19.0): Asia > South Asia > India > Western India > Goa > South Goa > Canacona > Palolem
Legian (17.0): Asia > Southeast Asia > Indonesia > Bali > South Bali > Legian
Lovina (15.0): Asia > Southeast Asia > Indonesia > Bali > North Bali > Lovina

 

10. Get top themes for a given destination

Next, let’s try to do it the other way around – getting the top associated themes for a given destination, in this case, Paris.

destination = 'paris'
final_themes = []
for theme in destination_themes:
    for destination_score in destination_themes[theme]:
        if destination.lower() == destination_score[0]:
            final_themes.append((theme, destination_score[1]))
sorted_scores = sorted(final_themes, key=lambda t: t[1] * -1)

print get_parent(destination.lower()).title().replace('|', ' > ')
print
for item in sorted_scores:
    print '%s (%s)' %(item[0].title(), item[1])

This snippet goes through each theme, and within each theme looks at each tuple to see if the destination in that destination-score pair matches the input. If it does, the theme and its score are appended to the final output. The result is then sorted by the second value in each tuple, as before, and all the themes associated with the specified destination are printed out in descending order of score.

Europe > France > Île-De-France > Paris

Museums (11.0)
Food (4.0)
Culture (3.0)
Beach (1.0)

 

Final thoughts

And that concludes a simple way of extracting destination themes from Wikivoyage text. All the code used here can be found on Github here. The code and analysis are quite basic, and there's plenty of room to do this better, so hopefully I'll get to a v2 in future. For now it works well enough, and I've combined the theme extraction and destination mapping work with my very basic web skills to put up a simple web application that suggests travel destinations for a given theme and continent (with more granular regions for Asia). Please bear with the speed, as it loads from a free Heroku instance.

 

And now, on to looking at how else public information on the web can be consolidated into more useful travel-related web applications. Till the next time, stay tuned!
