Tag Recommendation Method for Enhancing Web Novel Retrieval
Authors
Abstract
A huge number of web novels, i.e., user-generated novels, exist on the Internet. The increase in their number has caused crucial problems for both writers and readers: newly submitted novels, even noteworthy ones, are buried amidst existing ones, and readers have difficulty locating suitable, relevant novels. Most web novel sites attach several tags to each novel to describe its themes, characters, genre, and so on. These tags generally vary widely because they are attached by the individual writer without specific rules or uniformity. In this paper, we propose tag recommendation methods that enable readers to easily retrieve suitable novels. The recommended tags should express the contents of the novel, differ from each other, and be selected from tags attached to a sufficiently limited number of novels. To satisfy these requirements, we analyze the tag distributions in an actual web novel site. We vectorize the novels using Doc2Vec and compose each tag's vector from the vectors of the novels to which the tag is attached.
I. Introduction
Web novel sites are major communities for sharing consumer-generated media. The appearance and development of these web novel sites have allowed anyone to post and publicize their novels. "Syousetsuka-ni-narou 1 " is one of the most popular web novel sites in Japan, with over 696,000 novels and growth of 40,000 novels in the last six months.
This increase in submitted novels has given rise to two crucial problems for both authors and readers. For authors, newly submitted novels become buried among existing novels, thus losing the opportunity to be seen and read by readers. On the other hand, readers have a difficult time finding suitable suggestions amidst this vast sea of novels.
Some of the typical web novel sites include a function to attach tags when a novel is submitted. There are two types of tags: system tags prepared by site operators and free tags that can be set ad hoc by the authors. These tags denote the features of the novel, such as the themes, stages, worldviews, characters, and elements that appear in it. These tags, therefore, might have the potential to assist in search behaviors. In particular, since the free tags can describe the contents of the novel, attaching appropriate free tags will effectively support searches.
These free tags are attached ad hoc by the individual writers. Since there are no restrictions on attaching tags, different tags may be given for the same characteristic, or very rare tags may be attached. Such tags do not work well in novel searches, and tags that are attached to most novels fail to narrow the search. In some cases, even if the tag itself is not a problem, it may not be used effectively, for example when no tag is set that differentiates the novel's features from those of other novels, or when multiple similar tags are attached.
In this research, we propose a method for recommending tags to writers that effectively narrow down the search to support reader searches. The recommended tags are extracted from free tags attached to a moderate number of novels. Moreover, by calculating the distance of vectorized tags, we aim to recommend tags with different properties.
The remainder of this paper is organized as follows. We review related work in Section II to clarify the standpoint of this research. In Section III, we describe our proposed method to achieve tag recommendations. In Sections IV and V, we evaluate and discuss the effectiveness and limitations of the proposed method using an actual dataset and, finally, we conclude in Section VI.
II. Related Work
Choosing a desired item from a large selection is a difficult task. The reason may be that the user's request is not sufficiently reflected in the query so that, while the selected item may be the best based on this query, more suitable items are not provided.
Various recommendation methods have been proposed to solve this problem, which we can generally classify into social and content-based recommendation methods. In any method, well-known major items tend to be recommended. While typical items are recommended, there needs to be diversity among the recommended items. In addition, if the number of target items is large, it is necessary to further narrow the results of the selection.
On the other hand, faceted search provides recommendations to users in multiple dimensions, using various aspects of the products as the axes of refinement [1]. This method uses item features as facets, such as the price range, volume and weight, functions, and so on. The challenge of faceted search is how to determine the facet that serves as the axis of the search. A study by Vandic et al. [2] summarized a framework for automatically creating facets based on product properties in e-commerce. Wang et al. studied tag recommendation methods [3]. They created tag rankings based on the popularity of both posts and posting authors on a social network system (SNS). They concluded that tag recommendations based on this tag ranking increased browsing compared to conventional tags. In contrast, our proposed method recommends several tags based on the related novel's contents. There are also some studies involving web novels [4] [5]. Ito et al. [4] clarified the relationship between genres and tags by analyzing the tags set in novels. They also analyzed links between authors, readers, and novels [5].
In our paper targeting web novels, tags set by the authors are treated as facets. By recommending effective tags for searching web novels, it is possible to assist users in locating the requested novels and tags with different features to enable more diverse recommendations.
A. Overview
Tags attached to a novel by its author may match tags given to other novels. Thus, the set of novels and the set of tags form a bipartite graph. Some tags will be attached to a very large number of novels, while many tags will be attached to a very limited number of novels. We propose a method of analyzing the distribution of assigned tags and recommending tags that can narrow down a search. The following are the three requirements that the recommended tags must meet:
1) Selected tags should be attached to a moderate number of novels
2) Each tag should represent a different aspect of the novel
3) Tags should express the contents of the novel
An overview of the proposed method is shown in Fig. 1. As shown in this figure, we crawl both web novels and tags and perform tag distribution analysis according to the bipartite graph between the novels and tags. Here, we determine the thresholds on the number of tag appearances, $\theta_H$ and $\theta_L$, which are the upper and lower bounds for tags that are attached to a moderate number of novels.
Every novel is vectorized using Doc2Vec, and the tag vectors are then composed of the vectors of the novels indicated by the bipartite graph. Finally, appropriate tags are recommended based on the similarities of the tags whose number of appearances lies between $\theta_L$ and $\theta_H$.
B. Extract Effective Tag Range For Search
We determine the thresholds $\theta_L$ and $\theta_H$ from the distribution. In this section, we describe only the required conditions; the specific values are determined in Section IV. Tags that are effective for search are tags whose number of appearances (the number of novels in which the tag is set) is sufficiently smaller than the total number of novels submitted to the web novel site. In addition, tags that are set in only a few novels are highly dependent on the writers and are unlikely to be searched for by readers. Therefore, tags whose number of appearances is too small are not effective for searches either. Based on the above, we can calculate the range of effective search tags.
We first create a distribution of the number of appearances of the tags that were actually collected. We then exclude tags whose number of appearances is considered too small or too large, determining the effective range by referring to the actual distribution, and extract the remaining tags.
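The range extraction described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `effective_tags`, the toy data, and the choice of inclusive bounds are assumptions made for the example.

```python
from collections import Counter


def effective_tags(novel_tags, theta_low, theta_high):
    """Return {tag: appearances} for tags with theta_low <= appearances <= theta_high.

    novel_tags maps a novel id to the list of tags set on that novel.
    A tag is counted once per novel. Inclusive bounds are an assumption.
    """
    counts = Counter(tag for tags in novel_tags.values() for tag in set(tags))
    return {tag: n for tag, n in counts.items() if theta_low <= n <= theta_high}


# Toy data (hypothetical): "R-15" is too common, "hero" and "cooking" too rare.
novel_tags = {
    "n1": ["magic", "hero", "R-15"],
    "n2": ["magic", "R-15"],
    "n3": ["R-15", "cooking"],
    "n4": ["magic", "R-15"],
}
selected = effective_tags(novel_tags, theta_low=2, theta_high=3)
print(selected)
```

With these toy thresholds only "magic" survives, since it appears in three of the four novels.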
C. Vector Pre-Calculation Of Novels And Tags
Tags attached by authors represent the varying attributes of the novel, such as its worldview, theme, character elements, overall growth, and more detailed classifications that cannot be explained by genre. In this paper, the appropriate tag set for a given novel is recommended from a huge number of tags supplied in advance. To achieve this, we calculate the vectors of both the novel contents and the related tags in advance. Thus, at the time of recommendation, appropriate tags can be quickly selected by a simple operation between the vectors.
The following are detailed descriptions of how to calculate the novel vectors and reflect them into tag vectors.
1) Vectorization of web novels: Using Doc2Vec, which enables distributed representation of documents, we convert the body of each web novel into a vector. We first create the training data to be input to Doc2Vec by performing morphological analysis with MeCab on the bodies of web novels collected in advance. We extract four parts of speech (noun, adjective, verb, and adverb) from the bodies. Words that do not correspond to these four parts of speech, such as numbers and auxiliary verbs, are excluded as stop words. Novels contain many proper nouns that are not listed in the morphological analysis dictionary. Therefore, when nouns appear consecutively, they are concatenated and treated as one proper noun. We used mecab-ipadic-NEologd as MeCab's dictionary. We apply the created training data to Doc2Vec to create a learning model and vectorize each online novel using the learning model, with the vector dimension set to 300.
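The token filtering step above can be illustrated with a short sketch. It assumes the morphological analysis has already been run and produced `(surface, pos)` pairs; the English part-of-speech labels, the separate `"number"` label, and the function name `filter_tokens` are hypothetical simplifications and do not reflect MeCab's actual output format.

```python
# Parts of speech kept as content words (labels are illustrative).
CONTENT_POS = {"noun", "adjective", "verb", "adverb"}


def filter_tokens(tokens):
    """Keep content words and merge runs of consecutive nouns into one word.

    tokens is a list of (surface, pos) pairs from a morphological analyzer.
    """
    kept = []
    noun_run = []
    for surface, pos in tokens:
        if pos == "noun":
            noun_run.append(surface)  # extend the current run of nouns
            continue
        if noun_run:
            kept.append("".join(noun_run))  # close the run as one proper noun
            noun_run = []
        if pos in CONTENT_POS:
            kept.append(surface)
    if noun_run:
        kept.append("".join(noun_run))
    return kept


tokens = [
    ("異", "noun"), ("世界", "noun"), ("に", "particle"),
    ("転生", "noun"), ("する", "verb"), ("3", "number"), ("た", "auxiliary"),
]
words = filter_tokens(tokens)
print(words)
```

The resulting word lists would then serve as the per-novel training input for a Doc2Vec model.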
2) Vectorization of tags: Tags are attached by authors to represent a novel's attributes. Therefore, we assume that the attribute feature, i.e., the tag's vector, can be composed of the vector of the novel to which the tag is attached.
A bipartite graph $G(D_i \to T_j)$ is formed between the novels $D_i$ ($1 \le i \le I$) and the tags $T_j$ ($1 \le j \le J$). Therefore, the vector of the tag $T_j$ is composed from the vectors of the novels $D_i$ to which $T_j$ is attached:

$$ T_j = \sum_{i:\, D_i \to T_j} D_i \tag{1} $$
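A minimal sketch of this composition follows, assuming Eq. (1) is a plain sum of the vectors of the novels carrying the tag. The function name `tag_vectors` and the two-dimensional toy vectors are hypothetical; the paper uses 300-dimensional Doc2Vec vectors.

```python
def tag_vectors(novel_vecs, novel_tags):
    """Compose each tag vector as the sum of the vectors of its novels.

    novel_vecs maps a novel id to its vector (a list of floats).
    novel_tags maps a novel id to the list of tags set on it.
    """
    dims = len(next(iter(novel_vecs.values())))
    out = {}
    for novel_id, tags in novel_tags.items():
        vec = novel_vecs[novel_id]
        for tag in set(tags):  # count each tag once per novel
            acc = out.setdefault(tag, [0.0] * dims)
            for k, x in enumerate(vec):
                acc[k] += x
    return out


novel_vecs = {"n1": [1.0, 0.0], "n2": [0.0, 1.0], "n3": [1.0, 1.0]}
novel_tags = {"n1": ["magic"], "n2": ["magic", "hero"], "n3": ["hero"]}
vectors = tag_vectors(novel_vecs, novel_tags)
print(vectors)
```

Because cosine similarity ignores vector length, using a sum or a mean here would give the same similarity ranking.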
D. How To Extract Recommended Tags
By this point, we have calculated the vectors for all novels and tags. In order to satisfy requirement 2) of the recommended tags, i.e., "Each tag should represent a different aspect of the novel" described in Section III-A, we calculate distances using the cosine similarity between vectors. Here, the cosine similarity between a novel document $D_i$ and a tag $T_j$ is calculated as follows:
$$ \cos(D_i, T_j) = \frac{D_i \cdot T_j}{\|D_i\| \, \|T_j\|} \tag{2} $$
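The cosine similarity between a novel vector and a tag vector can be computed as in the following sketch. The function name is hypothetical, and returning 0.0 for a zero vector is a convention chosen for this example, not something the paper specifies.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b (lists of floats)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # assumption: treat a zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)


print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # orthogonal vectors
print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))  # parallel vectors
```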
To satisfy requirement 2), the recommended tags must lie around the document $D$ and differ from each other. We illustrate this in Fig. 2. In this figure, tags $T_0$ to $T_{i-1}$ have already been recommended for a given novel document $D$, and $T_i$ is determined from the two most recently recommended tags, $T_{i-1}$ and $T_{i-2}$, and $D$. Here, the tags recommended at the initial stage are formally:
$$ T_0 = D, \qquad T_1 = \arg\min_{i} |D T_i| $$
Basically, all tags that have already been recommended should be considered, but in order to reduce the amount of calculation, we consider only the two tags that were recently recommended and the next recommended tag is determined.
$\bar{T}$ in the figure is the midpoint of $T_{i-1}$ and $T_{i-2}$, and the candidate tag $T_i$ is projected onto the straight line passing from $\bar{T}$ through $D$. The point onto which $T_i$ is mapped is defined as $T'_i$, and the $T_i$ that minimizes the distance $|T'_i|$ is defined as the next recommended tag. Here, $\theta_i$ is the angle between $\bar{T}D$ and $T_i$, and its range is $0 \le \theta_i \le \pi$. Therefore, $\theta_i / 2$ is adopted so that $T_i$ is located on the opposite side of $\bar{T}$ with respect to $D$.
EQUATION (3): Not extracted; please refer to original document.
A. Data Crawling
We collected the text of online novels and tags set in online novels from web novel sites. In this paper, we collected online novels regardless of whether they were completed or serialized. In serial novels and omnibus style novels, stories, chapters, and sections are combined and treated as one novel's text.
B. Data Set
We collected the bodies and tags of web novels from "Syousetsuka-ni-narou," a Japanese web novel site operated by HinaProject Inc.
1) Collecting bodies of web novels: "Syousetsuka-ni-narou" categorizes novels into five major genres: romance, fantasy, literature, science fiction, and other. Under these major genres, more detailed divisions are provided, for a total of 20 genres. In this paper, we collected 23,036 novels that belong to the different world genre, which is classified as part of the larger romance genre. We used Beautiful Soup, a Python library that can extract data from HTML and XML files, to collect the novel bodies.
2) Collecting tags: What we call tags in this paper corresponds to the keywords and required registration keywords on "Syousetsuka-ni-narou." Keywords are divided into the following eight classifications:
• Official scenario keywords
• Replay keywords
• Fanfiction keywords
• Management keywords
• Recommend keywords
• Official keywords
• Project keywords
• Manually input keywords
Keywords are described in 10 characters or less, and up to 15 keywords can be set in one work. Only the novel's author can set the keywords. The free tags defined in this paper correspond to manually input keywords and the other seven keywords correspond to system tags. Manually input keywords can be described without any restrictions other than the character limit mentioned above.
The required registration keyword is a keyword that is required to be set when submitting a novel that includes elements that are not suitable for some users as specified by "Syousetsuka-ni-narou." The elements are as follows:
• R-15
• Boys' love
• Girls' love
• Cruelty
• Reincarnated to a different world
• Transferred to a different world
Up to 21 tags can be set in one novel, including keywords and required registration keywords. In this paper, the required registration keywords were treated as system tags in the same way as keywords corresponding to other system tags. Tags were collected using "Narou-syousetu-API 2 " from the development tool "Narou-Developer 3 " provided by HinaProject Inc., which operates "Syousetsuka-ni-narou."
To understand the general tendency of "Syousetsuka-ni-narou," we also collected the tags set on novels across the site, in addition to the collected bodies. On June 27, 2019, we collected 3,833,470 tags from 661,495 novels. There were 319,277 distinct tags and, on average, each novel had 5.80 tags. Figure 3 shows the number of tags actually set per novel. As can be seen from Figure 3, the most common number of tags per novel is four. The number of novels peaks at four tags and decreases as the tag count approaches the upper and lower limits. Based on the above values, we recommend five tags for one novel in the experiment.
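As a quick check of the reported average, dividing the number of collected tags by the number of novels gives:

```latex
\frac{3{,}833{,}470}{661{,}495} \approx 5.795 \approx 5.80
```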
C. Extract Effective Tag Range For Search
Figure 4 shows the frequency of appearance of the tags set in the web novels used as learning data. As seen in Figure 4, the frequency of appearance approximately follows a power-law distribution. In addition, comparing the distribution of all tags, including system tags, with the distribution of free tags only shows that system tags occupy the highest ranks. The frequency of appearance for all tags set in the online novels posted to "Syousetsuka-ni-narou," collected as described in Section IV-B, is shown in Figure 5. As seen in Figure 5, this frequency also approximately follows a power law, as in Figure 4, and system tags again occupy the highest ranks. In this experiment, we set the effective tag range for a search to tags with 100 to 1,000 appearances in the learning data. We extracted 901 tags in the corresponding range.
D. Creating A Tag Vector List
In order to vectorize the tags extracted in Section IV-C, we selected novel texts as learning data. The definition of a novel is prose that freely describes the author's emotions. Prose is very flexible, so many parts of the text may be written with emoticons, symbols, or a series of meaningless words.
Such novels need to be removed because their format differs greatly from that of other novels. In addition, because we did not filter novels by the number of characters or similar criteria while collecting, we obtained novels with very short texts, such as those with less than one line of text. These novels are considered to have insufficient features when vectorized and thus are unsuitable as training data. Therefore, in this paper, we extracted only novels whose texts contained more than 1,000 characters.
As you can see from Figure 3 , some novels have no tags. If using a novel without a tag when recommending tags, the accuracy of the recommendation may be reduced. Therefore, novels with no tags were also removed from the training data. Using the above preprocessing, 20,962 novels were used as the training data. From the created Doc2Vec learning model, we generated a 300-dimensional vector of novel text from the learning data. We vectorized the free tags extracted in Section IV-C based on the vector of the generated novel text and stored the vectorized tags in a list. All vectors were normalized to
E. Tag Recommendations
We used the tag list created in Section IV-D to compare against the web novels that were the targets of recommendation and to recommend tags that are effective for searches. We collected the web novels used as recommendation input with the same method used to collect the training novel bodies in Section IV-B; these novels were submitted to "Syousetsuka-ni-narou" under the different world genre, which belongs to the larger romance genre. As with the training data, these bodies had to exceed 1,000 characters so that the features would be sufficient. We used 158 web novels collected under the above conditions as our experimental data. Using the trained Doc2Vec model, we vectorized the 158 novel bodies after morphological analysis with MeCab.
We recommended tags using the proposed method described in Section III-D, which is intended to prevent duplication of meaning among tags. As a baseline, we also used a method that recommends tags in descending order of the cosine similarity between the novel vector and the tag vectors.
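The baseline can be sketched as a simple top-k ranking. This is an illustration under assumptions: the function names, the toy two-dimensional vectors, and the default of five tags (taken from the number recommended per novel in the experiment) are choices made for this example.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b; 0.0 if either is zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def recommend_baseline(novel_vec, tag_vecs, k=5):
    """Return the k tag names most similar to the novel, best first."""
    ranked = sorted(
        tag_vecs,
        key=lambda tag: cosine_similarity(novel_vec, tag_vecs[tag]),
        reverse=True,
    )
    return ranked[:k]


# Toy tag vectors (hypothetical).
tag_vecs = {"a": [1.0, 0.0], "b": [1.0, 1.0], "c": [0.0, 1.0], "d": [-1.0, 0.0]}
top = recommend_baseline([1.0, 0.0], tag_vecs, k=2)
print(top)
```

Because this ranking considers only similarity to the novel, it can return several near-duplicate tags, which is the behavior the proposed method tries to avoid.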
For both the proposed method and the baseline method, we calculated the cosine similarity between each recommended tag and the novel and averaged it over all experimental data. A summary is shown in Table I. The closer the cosine similarity is to 1, the more similar the two vectors are. Therefore, recommended tags are closer to the actual contents of the novel when the value is closer to 1. Compared to the baseline, the average similarity of the tags recommended by the proposed method is generally lower. In addition, the similarity is low overall for both the proposed method and the baseline method, so the recommended tags may differ from the contents of the novels and may not be valid for searches.
Therefore, we checked whether the recommended tags matched the contents of the novel by comparing them with the tags set by the writer. As a result, some of the tags recommended by the proposed method matched the content of the novel. However, many of the recommended tags were unrelated to the contents of the novels.
Next, in order to verify whether the information possessed by the recommended tags differs, we obtained the cosine similarity between the tags recommended by each method. Based on the obtained cosine similarities, we calculated the average value of the cosine similarities between the tags recommended for one novel. We provide the results in Fig. 6. In this graph, the horizontal axis is the cosine similarity of the proposed method and the vertical axis is the cosine similarity of the baseline. Because the points are distributed above the slanted line, the tags recommended by the proposed method are less similar to one another than those of the baseline; that is, the proposed method is able to recommend different tags. These experiments satisfied one of the requirements for effective search tags: the tags recommended by the proposed method carry information that does not overlap. However, in some cases, while tags that matched the content of the novels were recommended, tags that did not relate to the content were also recommended. Also, the cosine similarity between the novel and the recommended tags was reduced overall. Further studies are needed in order to recommend tags that match the content of the novel.
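The per-novel diversity measure described above, the average cosine similarity over all pairs of recommended tags, can be sketched as follows. The function names and toy vectors are hypothetical, and averaging over unordered pairs is an assumption about how the average was taken.

```python
import math
from itertools import combinations


def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b; 0.0 if either is zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mean_pairwise_similarity(vectors):
    """Average cosine similarity over all unordered pairs of tag vectors."""
    pairs = list(combinations(vectors, 2))
    if not pairs:
        raise ValueError("need at least two vectors")
    return sum(cosine_similarity(a, b) for a, b in pairs) / len(pairs)


# Three toy tag vectors: two identical, one orthogonal to both.
score = mean_pairwise_similarity([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]])
print(score)
```

A lower value indicates that the tags recommended for a novel overlap less in meaning.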
V. Discussion
The reason why the similarity between the vector of the recommended tag and the vector of the novel text was low may be that the tag vector could not represent the feature of the tag.
In the proposed method, under the assumption that the tag has the same properties as the novel to which the tag is set, the tag vector is created based on the vector created from the text of the novel. We presumed that when creating a vector from the text of a novel, the feature quantity representing the properties of the tag could not be extracted.
In this paper, the novels used as training data were required to have more than 1,000 characters so that they had sufficient features. However, when we checked the actual number of characters in the novels, some novels were close to 1,000 characters and some exceeded 100,000 characters; the number of characters varied greatly. Since these novels were treated uniformly, it is probable that accurate feature extraction failed. The following can be considered for extracting the feature values of a novel more accurately. This time, we used Doc2Vec to obtain the novel's feature values in order to take into account its content and meaning. We believe that the extraction of features will be improved by using a more accurate method such as Sent2Vec.
In Section IV-C, we set the effective tag range for a search to tags with 100 to 1,000 appearances in the learning data. We will further investigate the most effective range.
We confirmed that tags with different meanings were recommended by the proposed method. However, this method only considers the two previously recommended tags. In the future, we will aim for a method that considers all the tags already recommended.
In the proposed method, the tag to be recommended was calculated using an equation with a fixed value. We want to consider improving the equations by parameterizing angles and distances.
We compared the tag with the actual tag to verify that it has the same meaning as the novel, but this was qualitative. We want to create a quantitative scale and visualize the difference in meaning.
The novel in our experimental data belonged to the larger genre of romance. It is self-evident that the theme of the novel is love and it is also obvious that affection is included as an element of the story. Therefore, it is not necessary to recommend a tag representing elements of love and affection to a novel belonging to the romance genre because this genre overlaps with the meaning of the tag. As mentioned above, when recommending tags, tags that are more effective for search can be obtained by considering not only the similarity with the novels and the duplication of meaning between tags, but also the genre.
When comparing the contents of the novels in the experimental data, we found many similarities in how the stories develop. Keeping this in mind, we want to improve the proposed method.
As shown in Section IV-C, the power-law tendency of the tag distribution in our data set also applies to the distribution of tags throughout "Syousetsuka-ni-narou." In our experiments, we could not recommend effective tags for retrieval, but we believe that the proposed method can be applied to novels belonging to other large genres by improving the accuracy of the recommended tags. If it becomes possible to recommend tags for these novels, we speculate that searches can be performed more smoothly.
VI. Conclusion
In this paper, we proposed a tag recommendation method for enhancing web novel searches. Each novel is vectorized by the Doc2Vec method for its body text. The tag vector is then composed of the vectors of the novel to which it has been attached. Tags having different characteristics or features are recommended based on the proposed calculation algorithm of the vectors.
For evaluating our proposed method, we crawled the web novel site "Syousetsuka-ni-narou" to collect novel body text and tags. Our experiment results demonstrate that we achieved the goal of recommending tags with different characteristics, although tags that were not intuitive or suitable were also included. In this study, different tags could be recommended. In the process, it became clear that there were many novels that were very similar, so we recommended a dynamic tag set indicating the category of a typical novel and a tag set indicating the differences within the category. Combined hierarchical tag recommendation is an issue for the future.
In the future, we would like to further develop the proposed method by refining the features of the novel with reference to the improvements described in Section V. In addition, as in Wang et al. [3], creating a tag ranking based on the popularity of the tags could further support searches. We also aim to recommend useful tags for web novel search by improving the feature calculation of the tags.
1 Syousetsuka-ni-narou is a registered trademark of HinaProject Inc., Japan. In English, the site's name means "Let's be a novelist." https://syosetu.com/
2 https://dev.syosetu.com/man/api/
3 https://dev.syosetu.com/
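Table I. Average cosine similarity between the novel and each recommended tag

| Recommended tag (order) | Proposed method | Baseline method |
|---|---|---|
| 1 | 0.379 | 0.379 |
| 2 | 0.160 | 0.362 |
| 3 | 0.162 | 0.351 |
| 4 | 0.161 | 0.344 |
| 5 | 0.154 | 0.338 |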