Tourism Surveying from Social Media: The Validity of User-Generated Content (UGC) for the Characterization of Lodging Rankings

The aim of this research is to determine whether online User-Generated Content (UGC) about the lodging industry validates the ranking system of any accommodation property or platform in order to create an international hotel classification system that may also serve to categorize any type of accommodation.
This thesis presents the work carried out following the collection and analysis of nearly 40 million reviews of hotels worldwide downloaded from TripAdvisor and Booking.com. 
On the one hand, this research focuses on the analysis of information provided by users, and specifically their reviews of accommodation properties, to compare their scores and to determine whether the position of a hotel in both ranking systems is closely related.
On the other hand, the research consists of the analysis of hotel classification systems. It demonstrates that although international classification systems are not unified, there is a relationship between users’ ratings and hotel categories worldwide. Consequently, categories can be predicted from UGC on the Internet, among other parameters.
Finally, through machine learning, this thesis creates a model that allows accommodation properties, whether on traditional or collaborative hosting platforms (e.g., Airbnb), to be classified in such a way that different classification systems worldwide are consistent.


Goal and objectives of the dissertation
The aim of the thesis is to determine whether online User-Generated Content (UGC) within the lodging industry validates the ranking system of any accommodation property or platform in order to create an international hotel classification system that could categorize any type of accommodation based on different variables.

Objectives
- To review the state of the art of electronic Word of Mouth (eWOM) and UGC with regard to lodging websites, both sales and recommendation websites.
- To compare the behaviour of user-generated ratings, online reviews, and measurement scales on two of the most popular tourism platforms (TripAdvisor and Booking.com).
- To confirm that verified or unverified reviews do not generate better or worse hotel ranking positions.
- To predict the international hotel categories with UGC and other features, considering that the hotel classification system is not unified because each country or region applies its own regulations.
- To create a model to classify the properties offered by "peer-to-peer" (P2P) accommodation platforms based on user interaction, similar to grading scheme categories for hotels.

Methodology
In order to achieve the objectives of the thesis, it was necessary to download, process and analyse the data from TripAdvisor and Booking.com. The data were downloaded at different times, between November 2015 and April 2016, using an automatically controlled [...] in order to correctly determine the class of the instances (hotels). In other words, the supervised learning algorithm predicted a classification that could later be applied to new instances.
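The supervised learning step described above can be pictured with a minimal sketch. This section does not name the algorithm used in the thesis, so the example substitutes a nearest-centroid classifier purely as a stand-in. The feature names (normalised price, cleanliness, location), the data values and the function names are all hypothetical.

```python
from collections import defaultdict
from math import dist


def fit_centroids(features, labels):
    """Average the feature vectors of each class (here: a star category)."""
    grouped = defaultdict(list)
    for vector, label in zip(features, labels):
        grouped[label].append(vector)
    return {
        label: tuple(sum(column) / len(column) for column in zip(*vectors))
        for label, vectors in grouped.items()
    }


def predict(centroids, vector):
    """Assign the class whose centroid is closest in Euclidean distance."""
    return min(centroids, key=lambda label: dist(centroids[label], vector))


# Hypothetical labelled instances (hotels): each tuple is
# (normalised price, cleanliness, location), all assumed to lie in 0..1.
train_x = [
    (0.10, 0.55, 0.50), (0.15, 0.60, 0.55),  # 2-star hotels
    (0.40, 0.75, 0.70), (0.45, 0.80, 0.65),  # 3-star hotels
    (0.85, 0.95, 0.90), (0.90, 0.92, 0.95),  # 5-star hotels
]
train_y = [2, 2, 3, 3, 5, 5]

centroids = fit_centroids(train_x, train_y)

# A new, unlabelled instance receives the category of the nearest centroid.
new_hotel = (0.42, 0.78, 0.72)
predicted_stars = predict(centroids, new_hotel)
```

The point of the sketch is only the workflow: a model is fitted on instances whose category is known and is then applied to new instances whose category is not.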

Results
For most of the cities analysed, the hotel rankings on the two websites (Booking.com and TripAdvisor) are strongly related (Balagué, Martin-Fuentes, & Gómez, 2016). The results likewise show that the posting of fake reviews on TripAdvisor does not seem to be prevalent, as both rankings behave similarly. After comparing TripAdvisor with Booking.com, suspicions of fraud on TripAdvisor because of its unverified user reviews were not borne out. The results also confirm that there is a relationship between volume and score on TripAdvisor (Melián-González, Bulchand-Gidumal, & González López-Valcárcel, 2013) but not, generally speaking, on Booking.com, because Booking.com does not take reviews older than 24 months into account when calculating a hotel's score. When a hotel's score is also based on older reviews, it tends to be more positive but does not reflect the hotel's current reality. Booking.com deletes old reviews, which yields an overall score that is closer to the actual situation.
Moreover, there are significant differences in the score results depending on the hotel category. The relationships between users' ratings on TripAdvisor and Booking.com and other parameters were analysed, and it was confirmed that UGC does indeed validate the hotel classification systems at an international level.
Lastly, the possibility of predicting hotel categories worldwide from UGC and other features by using machine learning techniques was established. After this finding, the same methodology was applied to lodging properties of the so-called 'sharing economy', and it was found that it is indeed possible to establish a ranking system from UGC and other parameters that is easily understood worldwide for this type of accommodation.
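One common way to quantify how closely two rankings of the same hotels agree is Spearman's rank correlation. This section does not state which statistic the thesis used, so the sketch below is an illustration under that assumption. The score lists and function names are hypothetical.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equally long sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman's rho: the Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical scores for the same five hotels, each on its site's own scale.
tripadvisor_scores = [4.5, 4.0, 3.5, 4.8, 3.0]
booking_scores = [8.9, 8.4, 7.9, 9.3, 7.0]

rho = spearman(tripadvisor_scores, booking_scores)
```

Because only the ordering matters, the two sites' different rating scales do not need to be converted before the comparison. A value of `rho` near 1 would indicate that hotels occupy similar positions in both rankings.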

Theoretical conclusions
Apart from creating a model for the accommodation industry that is equivalent to the hotel grading scheme, this research also contributes to the literature by identifying the most significant elements for inferring international hotel categories, which are price, cleanliness and location. These variables fit appropriately into P2P lodging platforms because, as other authors have confirmed (although the literature in this field is scarce), price is a key issue for guests' choice of and satisfaction with these types of accommodation (Wang & Nicolau, 2017). Cleanliness and location are also among the most-mentioned topics on P2P platforms, as they are in hotel guests' UGC (Belarmino, Whalen, Koh, & Bowen, 2017; Tussyadiah & Zach, 2017). The methodology applied in this research has hardly been used in the tourism and hospitality field, yet the results are very accurate and better than those of more traditional techniques such as logistic regression, which were also used.

Practical application of the dissertation
Hotel categories at an international level are predicted from data generated especially by users through their online travel reviews and from other features, thereby avoiding criteria that could become outdated with the passage of time and making classification systems worldwide more comprehensive and simpler. This finding could help the initiative carried out by the Association of Hotels, Restaurants and Cafés in Europe to harmonize the criteria of the lodging industry by implementing a scoring system to bring the hotel categories in Europe closer to one another. Although this association has been in operation since 2004, only 17 countries were members of it at the time of writing. In addition, this finding could help the accommodation industry to find a better fit between the hotel categories and the online travel reviews in order to integrate consumers' needs into a single system (Blomberg-Nygard & Anderson, 2016).
If the experts responsible for deciding which criteria should be applied to officially assign the hotel categories agree with guests' needs, then that would be the right situation. However, if the experts believe that their ideal model is one that does not match consumers' aims, then there is a problem. In other words, the best model would be the one that fits the guests' needs, the model in which the categories are closer to what customers want and what users value.
As mentioned previously, it is especially relevant to apply these methods to the hospitality industry, not only to hotels or traditional properties, but also to the new online collaborative tourism platforms whose raison d'être lies within the collaborative web. These platforms base their operation and success on value co-creation, user interactions, information exchange and the production of UGC by different economic actors in a service ecosystem defined as "online engagement platforms" (Breidbach & Brodie, 2017).
Additionally, regulations differ completely from one country to another and, for that reason, this model stands out from the traditional system because it can predict the hotel category worldwide with great accuracy, which would allow users to compare hotels (and any other types of accommodation that may decide to implement this model). The audits carried out to check whether the category achieved corresponds exactly to the variables offered by the hotels would become redundant or would be done only in those cases where there is a mismatch between the category and the results of our model. Such a mismatch could be resolved just by knowing about certain public features, most of which are related to UGC. Thus, the bureaucracy related to the hotel classification system could be dramatically reduced, as the number of audits to check whether the criteria are met to keep or to lose a star would be minimized or better targeted.
Finally, the classification model is not intended only for hotels. The proposed model could serve as a tool for comparing hotels with other third-party lodging platforms such as P2P platforms, or as a metasearch engine that merges all the different lodging options in a given destination and provides a standard, well-known system for comparing them. Moreover, it could be used for any product or service that could potentially be classified and rated by users, or even to create a platform unifying businesses' digital reputation.

Content of the dissertation
The thesis is a compendium of five articles already published or accepted in different journals.

Abstract of Chapter 1
Hotel rankings on platforms where users are verified (Booking.com) and not verified (TripAdvisor) behave in a similar way. The conclusion of this chapter is therefore that fraudulent practices on TripAdvisor cannot be confirmed and that the impact of possible fake reviews is diluted by the high number of reviews, which are dominated by genuine comments (Martin-Fuentes, Mateu, & Fernandez, 2018a).

Abstract of Chapter 2
There is a relationship between the number of reviews and the score of the hotels on TripAdvisor but not on Booking.com. Keeping a hotel's old reviews makes its score more positive, but the score then does not show the hotel's current reality. Booking.com, by contrast, gives hotels a score that is closer to their recent situation because the website deletes reviews older than 24 months (Martin-Fuentes, Mateu, & Fernandez, 2018b).
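The effect of the 24-month cut-off can be illustrated with a small sketch. It assumes, for illustration only, that a hotel's score is the plain mean of its review scores. The actual scoring formulas of the two websites are not given in this text, and the review data below are hypothetical. Review age is measured in whole calendar months, which is a simplification.

```python
from datetime import date
from statistics import mean


def months_between(earlier, later):
    """Whole calendar months from `earlier` to `later` (days are ignored)."""
    return (later.year - earlier.year) * 12 + (later.month - earlier.month)


def windowed_score(reviews, today, max_age_months=24):
    """Mean score over reviews no older than `max_age_months`; None if none remain."""
    recent = [
        score
        for posted, score in reviews
        if months_between(posted, today) <= max_age_months
    ]
    return mean(recent) if recent else None


# Hypothetical reviews of one hotel as (date posted, score) pairs.
reviews = [
    (date(2012, 5, 1), 9.0),
    (date(2013, 1, 15), 9.0),
    (date(2015, 6, 10), 8.0),
    (date(2016, 2, 20), 7.0),
]
today = date(2016, 4, 1)

all_time = mean(score for _, score in reviews)  # keeps every review
recent_only = windowed_score(reviews, today)    # drops reviews older than 24 months
```

In this made-up sample the older reviews are the more favourable ones, so the all-time mean is higher than the windowed one, which mirrors the pattern described above.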

Abstract of Chapter 3
The results confirm that, despite the differences in the criteria used to implement the hotel star-rating classification system throughout the world, a relationship does exist between hotel category and user satisfaction on TripAdvisor and on Booking.com. In turn, it can be confirmed that customers evaluate quality in relation to price, and that hotel categories can be a predictor of room prices (Martin-Fuentes, 2016).

Abstract of Chapter 4
This chapter analyses 15.7 million TripAdvisor reviews of almost 80,000 hotels in nine European countries. The results show that the hotel classification system fulfils its purpose, as hotel scores increase with each additional star, with the exception of 1- and 2-star hotels, between which users do not perceive differences when rating their satisfaction (Martin-Fuentes, Mateu, & Fernandez, 2018c).

Abstract of Chapter 5
UGC validates the grading schemes of the lodging industry, which enables the creation of an international classification system that could categorize any type of accommodation (hotels, apartments, private properties, etc.) so that they can be compared using the same classifier. The system could also be used to classify any other product or service that could potentially be classified by UGC (Martin-Fuentes, Fernandez, Mateu, & Marine-Roig, 2018).

Table 1. Sample selection
Source: Compiled by the authors based on data from Booking.com and TripAdvisor