Scaling Creative Inspiration with Fine-Grained Functional Facets of Product Ideas

Authors

  • Tom Hope
  • Ronen Tamari
  • Hyeonsu Kang
  • Daniel Hershcovich
  • Joel Chan
  • Aniket Kittur
  • Dafna Shahaf
ArXiv, 2021

Abstract

Web-scale repositories of products, patents and scientific papers offer an opportunity for building automated systems that scour millions of existing ideas and assist users in discovering novel inspirations and solutions to problems. Yet the current way ideas in such repositories are represented is largely in the form of unstructured text, which is not amenable to the kind of user interactions required for creative innovation. Prior work has pointed to the importance of functional representations – capturing the mechanisms and purposes of inventions – for allowing users to discover structural connections across ideas and creatively adapt existing technologies. However, previous work exploring the use of functional representations was either very coarse-grained and limited in expressivity, or dependent on manually curated knowledge bases with poor coverage and significant manual effort from users. To help bridge this gap and unlock the potential of large-scale idea mining, we propose a novel computational representation that automatically breaks up products into fine-grained functional facets. We train a model to extract these facets from a challenging real-world corpus of invention descriptions, and represent each product as a set of facet embeddings. We design similarity metrics that support granular matching between functional facets across ideas, and use them to build a novel functional search capability that enables expressive queries for mechanisms and purposes. We construct a graph capturing hierarchical relations between purposes and mechanisms across an entire corpus of products, and use the graph to help problem-solvers explore the design space around a focal problem and view related problem perspectives. In empirical user studies, our approach leads to a significant boost in search accuracy and in the quality of creative inspirations, outperforming strong baselines and state-of-art representations of product texts by 50-60%.

1 Introduction

A modern-day engineer, scientist or designer has access to online repositories of millions of products, scientific papers and patents, containing descriptions of myriad technologies and their uses; essentially, a huge database of problems and solutions. Combined with rapid advances in algorithms for extracting information from large unstructured databases, this raises the prospect of using machines to augment and scale the process of innovation, helping human problem solvers identify inspirations and solutions across domains.

The human ability to detect abstract relations across ideas and find ways to creatively adapt existing tools for new uses has been a driving force in the history of innovation [10, 28, 32, 34, 40]. Microwave ovens were discovered by repurposing radar technology developed during World War II; Teflon, today chiefly used in non-stick cookware, was first used in armament development; and gigantic organizations such as NASA and Procter & Gamble actively engage in searching for opportunities to adapt existing technologies for new domains and markets [18]. In a very different kind of example, a car mechanic recently invented a simple device to ease childbirths by adapting a trick for extracting a cork stuck in a wine bottle, which he discovered online in a YouTube video [1]. This award-winning vacuum device could save millions of lives in developing countries. Strikingly, according to the World Health Organization, there has been no innovation in this area of work "for almost centuries". These and many other examples suggest a future where automated systems mine web-scale repositories with myriad descriptions of inventions, surfacing pertinent inspirations or solutions to problems. But despite the immense promise for accelerating the pace of innovation, finding inspirations remains mostly a manual, trial-and-error process, or simply the result of serendipity. A key limiting factor is that these large idea repositories cannot support the kinds of user interactions that are required for creative inspiration, because the predominant computational representation of ideas, in the form of unstructured textual descriptions, is unsuitable for these interactions.

Human creativity often relies on detecting structural matches across distant ideas, adapting them by transferring mechanisms from one domain to another [12, 13, 25, 26] -but this human skill is notoriously hard to transfer to machines [36] . A primary reason is that structured representations of ideas are simply not generally available. Repositories of scientific papers, patent publications or product descriptions are typically limited to "structure" in the form of high-level category-focused keywords, which do not support the functional interactions we desire. For example, to identify that a contraption for extracting a cork stuck in a bottle could serve as relevant inspiration for easing childbirth, an automated system would need to figure out that a vacuum-based mechanism can serve the purpose of extraction of physical objects, and match this function to the problem of extracting babies stuck in the birth canal. At the same time, most structured knowledge bases that do provide richer, more structured representations (e.g., [6, 62] ) are hand-crafted and small, and previous efforts to scale-up have been limited in expressivity [17, 22] . General-purpose knowledge bases (e.g., Cyc [44] , NELL [53] , DBpedia [20] ) largely encode categorical knowledge (e.g., is-a, has-a) and rarely functional knowledge (e.g., used-for), and can also suffer from poor coverage [29] .

One promising recent approach [36] trains neural networks to learn one aggregate purpose vector and one aggregate mechanism vector per product as coarse, soft "structure" that can be derived from raw text and used to find analogically related products with similar overall purpose but distant mechanism. The resulting matches led to increased creativity in an empirical ideation study. Further work used the same approach to find analogies in scientific papers [10]. However, in reality products have multiple fine-grained purposes with different mechanisms for achieving each, as demonstrated in Figure 1. As the figure shows, using the single-vector approach of [36] to search for products related to a smart pillow device cannot disentangle its different functional facets (tracking sleep, neck support, etc.). The aggregate approach squashes together multiple purposes and mechanisms into one soft "puddle", losing important information for retrieval of products that have only partial functional matches and limiting the ability to find diverse adaptation opportunities.

Fig. 1. Extracting fine-grained purpose and mechanism functional facets from an online product description, to search for adaptation opportunities. Green spans are mechanisms, red spans are purposes. Left: Standard vector-based search does not enable control for partial functional matches. Retrieval results are typically highly similar to the original product, which is not helpful in creative innovation interactions. Center: The aggregate approach in previous work [36] captures only one overall, coarse purpose/mechanism, limiting the expressivity of the search and losing important information for retrieval of products that have only partial functional matches. Right: Our fine-grained functional facets enable users to discover focused matches based on specific functions, retrieving more diverse inspirations for creative adaptation.

Importantly, this aggregate representation does not only harm retrieval accuracy, but suffers from a fundamental limitation in terms of the interactions it enables. Prior work has demonstrated the importance of interactions for traversing and exploring granular functions. A recent study [29] showed that providing designers with computational tools to express the particular aspects of purposes that they are interested in and to traverse multiple levels of granularity and abstraction, could significantly increase the novelty and usefulness of ideas they generated. An earlier study [65] showed that representing problems in terms of multiple purposes and constraints enabled designers to search for more novel and useful inspirations. The WordTree method [46] -a prominent method in creative engineering design -directs designers to break their problem into subfunctions, and then use the WordNet [51] database to explore abstractions and related functional facets to inspire analogies to products and designs across domains.

However, to date the scope of applicability of these interactions has been limited by the lack of scalable means for modeling ideas in terms of granular purposes and mechanisms. The approach in [65] relied on manually constructed problem representations, and the WordTree method provided instructions to, but not technical scaffolds for, identification of functional facets to use for exploring a design space. The system of [29] required both manual effort from the user in specifying the different purposes, as well as a manually-curated knowledge base (Cyc [45]) in which those purposes were already connected in a concept graph describing their hierarchical relationships, which suffers from poor coverage [29] for real-world product description texts. In addition, even after the user manually specified granular purposes, the system was forced to use the aggregate approach of [36] to retrieve relevant matches from a corpus of products, since no automated tool for extracting granular purposes at scale across all products was available.

To help close this gap and enable interactions between humans and automated systems that facilitate innovation, we develop a new computational representation of idea descriptions based on fine-grained functional facets. Our system automatically identifies multiple purposes and mechanisms within a given product description. We then construct a novel span-based representation of each product in terms of purpose and mechanism functional facets and their corresponding vector embeddings. We demonstrate the utility of our approach for supporting human creativity in two applications: (1) Fine-grained functional search for alternative uses of mechanisms, and (2) Exploring alternative problem perspectives around a focal problem for potential inspirations.

Functional search for alternative uses of mechanisms. Our span-based representation enables innovators to search for ideas with expressive queries for specific functions. Figure 1 shows an example of functional facets automatically extracted by our system and their use for retrieval of potential inspirations for adaptation opportunities. In Section 3, we build a prototype fine-grained functional search tool, and evaluate its utility in an alternative uses task in which users find unconventional applications of given mechanisms, potentially leading to pathways to new markets.

Exploring problem perspectives with a functional concept graph. We further use our representation to automatically generate a functional concept graph that embeds purpose/mechanism facets at different levels of granularity. While the coarse representation in [36] made it hard to pull out discrete and interpretable concepts from product texts, our fine-grained approach allows us to mine recurring functional relations, such as specific problems that are often mentioned together or specific problems and solutions associated with them. This level of detail can enable us to map the landscape of ideas, similarly to manually curated functional ontologies, a core tool used in engineering and design ideation [27, 35].

By automating the graph construction, we take a step toward removing the dependence on manually-constructed KBs that limited previous work [29] . We evaluate the utility of our graph in an application involving problem reformulation [15, 16] : construing an existing problem in terms of other structurally related problems, to explore alternative problem perspectives and the design space around a focal problem. This capability can help users "break out" of fixation on the details of a specific problem and connect to parts of the design space that may superficially look unrelated [11, 40] .

In both applications, our approach leads to a significant boost of 50-60% over the best-performing baselines, including the previous work of [36] .

Our computational representation of idea descriptions and the interactions it enables help address several key challenges to unlocking the potential of large-scale online idea mining, including the bottlenecks in manual construction of structured idea repositories; limited expressivity for users in searching fine-grained purposes and mechanisms; and harnessing idea repositories to flexibly explore alternative problem formulations across levels of abstraction. We believe our representation may serve as a useful building block for novel creativity support tools that can help users find and recombine the inspirations latent in unstructured idea repositories at a scale previously impossible. A summary of our contributions:

• We propose a novel computational representation of ideas with granular functional facets for purposes and mechanisms extracted automatically from product descriptions.

• We use crowd workers to annotate product texts from a challenging real-world corpus, and evaluate several extraction models trained on these annotations. We represent each product as a set of span embeddings, corresponding to the multiple facets, and use similarity metrics over these sets to support partial, focused matching between ideas.

• Using our similarity measures between ideas, we build a novel functional search capability that supports expressive, fine-grained queries for purposes and mechanisms.

• We demonstrate the flexibility and utility of the representation for computational support of core creative tasks: (1) searching for alternative, atypical product uses for potential adaptation opportunities; and (2) creating a functional concept graph that enables exploration of the design space around a focal problem. Through two empirical user studies we demonstrate that our representation significantly outperforms both previous work and state-of-the-art embedding baselines on these tasks. We achieve Mean Average Precision (MAP) of 87% in the alternative product uses search, and 62% of our inspirations for design space exploration are found to be useful and novel, a relative boost of 50-60% over the best-performing baselines, including the coarse representation approach of [36].

2 Learning A Fine-Grained Functional Representation

Our goal in this section is to construct a representation that can support the creative innovation tasks and interactions discussed in the Introduction. Previous work [36] suggested a representation separating an idea into one purpose vector and one mechanism vector. While that approach showed promise, the one-vector representation was coarse, mashing together many different purposes and mechanisms, and limiting interactions that require fine-grained control by the user. Figure 1 shows an example. When searching for products sharing structural relations with a smart pillow product, the aggregate purpose/mechanism vectors squash together multiple concepts such as comfort, sleep, travel, neck support (purposes) or neck pillow, soft material, sensors (mechanisms) -limiting the ability to tease apart different sub-purposes and sub-mechanisms. This results in retrieval of another smart pillow, which is only slightly different in that it is not intended for travel. The aggregate vectors are also not interpretable -leaving the user blind to what is truly being matched as part of the process of idea retrieval, and not enabling targeted focus on specific functional aspects.

In contrast, we propose to use span representations [42] . Given a product text description, we extract tagged spans of text corresponding to purposes and mechanisms (see Figure 2) , and represent the product as a set of span embeddings. By doing so, we are able to employ similarity metrics that support partial, faceted matching between ideas.

Fig. 2. Crowdsourcing interface for fine-grained purposes and mechanisms. Boxes are predefined chunks to annotate.

Continuing our example, we can now represent the smart pillow with a set of purpose and mechanism spans (Figure 1, right). This allows us to retrieve a wider range of products with faceted matches, such as a robotic neck brace for the neck support purpose, or a car seat vital signs monitor which matches on the embedding combination of travel, support neck, sensors. These retrieved products could point to new directions to explore, such as new markets where the smart pillow technology could be adapted (e.g., to increase comfort in robotic neck braces or car seats with sensors).

More technically, we use a standard sequence tagging formulation, with $X = \{x^1, x^2, \ldots, x^N\}$ a training set of texts, each a sequence of tokens $x = (w_1, w_2, \ldots, w_T)$, and $Y$ a corresponding set of label sequences,

$$Y = \{y^1, y^2, \ldots, y^N\}, \quad y = (y_1, y_2, \ldots, y_T),$$

where each $y_t$ indicates token $w_t$'s label (purpose/mechanism/other). In later sections, we represent each product as a set of purpose span embedding vectors and a set of mechanism span embedding vectors.
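To make the formulation concrete, the following minimal example (our own illustration, with a made-up sentence rather than a corpus excerpt) shows one token/label sequence under the IOB encoding used in Section 2.3, and how contiguous tags are grouped back into the purpose/mechanism spans that later become span embeddings:

```python
# Minimal illustration (our own made-up example, not from the corpus): one
# token/label sequence under the IOB encoding described in Section 2.3, and a
# helper that groups contiguous tags back into purpose/mechanism spans.
tokens = ["a", "smart", "pillow", "with", "sensors", "to", "track", "sleep"]
labels = ["O", "B-MECHANISM", "I-MECHANISM", "O", "B-MECHANISM", "O", "B-PURPOSE", "I-PURPOSE"]
assert len(tokens) == len(labels)  # y assigns one label per token

def spans_from_iob(tokens, labels):
    spans, current, kind = [], [], None
    for tok, lab in zip(tokens, labels):
        if lab.startswith("B-"):
            if current:
                spans.append((kind, " ".join(current)))
            current, kind = [tok], lab[2:]
        elif lab.startswith("I-") and current:
            current.append(tok)
        else:
            if current:
                spans.append((kind, " ".join(current)))
            current, kind = [], None
    if current:
        spans.append((kind, " ".join(current)))
    return spans

print(spans_from_iob(tokens, labels))
# [('MECHANISM', 'smart pillow'), ('MECHANISM', 'sensors'), ('PURPOSE', 'track sleep')]
```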

For the reasons discussed above, we view the span-based approach not simply as a more flexible and nuanced model, but as a potential building block that can power new interfaces and paradigms for innovation that we explore later in this paper. We start by describing our data and annotation process; we then discuss and evaluate models to extract spans from product texts, followed by applications and experiments.

2.1 Data

We use real-world product idea descriptions taken from the crowdsourced innovation website Quirky.com and used in [36], including 8,500 user-generated texts describing inventions.

Annotation. To create a dataset annotated with purposes and mechanisms, we collect crowdsourced annotations on Amazon Mechanical Turk (AMT). We observed that in the annotation task of [36] workers tend to annotate long, often irrelevant spans. We thus guided workers to focus on shorter spans. To further improve quality and encourage more granular annotations, we limited the maximal span length that could be annotated, and disabled the annotation of stopwords. Fig. 2 shows our tagging interface; rectangles are taggable chunks. For quality control, we required US-based workers with approval rate over 95% and at least 1000 approved tasks, and filtered unreasonably fast users. Workers were paid $0.1 per task. In total, 400 workers participated. Median completion time was 100 seconds.

While a manual inspection of the annotations revealed they are mostly satisfactory, we observe two main issues: First, there are often multiple correct annotations. Second, workers provide partial tagging; in particular, if similar spans appear in different sentences, very few workers bother tagging more than one instance (despite instructions). These issues would have made computing evaluation metrics problematic. We thus decided to use the crowdsourced annotations as a bronze-standard for training and development sets only. For a reliable evaluation, we collected gold-standard test sets annotated by two CS graduate students. Annotators were instructed to mark all the relevant chunks, resulting in high inter-annotator agreement of 0.71. We collect 22,316 annotated training sentences and 512 gold sentences, for a total of 238,399 tokens (tag proportions: 14.5% mechanism, 15.9% purpose, 69.6% other).

A note on related annotated data. There has been recent work on the related topic of information extraction from scientific papers by classifying sentences, citations, or phrases. Recent supervised approaches [8, 38, 47] use annotations which are often provided by paper authors themselves, NLP experts, or domain experts, or involve elaborate (multi-round) annotation protocols. Sequence tagging models are often trained and evaluated on (relatively) clean, succinct sentences [49, 66]. When trained on noisy texts, results typically suffer drastically [2]. Our corpus of product descriptions is significantly noisier than scientific papers, and our training annotations were collected in a scalable, low-cost manner by non-experts. Using noisy crowdsourced annotation for training and development only is consistent with our quest for a lightweight annotation approach that would still enable training useful models. In a domain closer to ours than scientific texts, [43] classify product review sentences as containing a usage expression or not, over five products only. In contrast, this work focuses on extracting fine-grained purposes and mechanisms from a diverse range of products. Review texts are often written in fairly clean and coherent language, commonly appear in NLP tasks [61], and do not typically describe in detail the mechanisms and purposes of products. In addition, sentence-level classification would not support the user interactions we explore in this paper, which require fine-grained control.

2.2 Extracting Spans

After collecting annotations, we can now train models to extract the spans. We explore several models likely to have sufficient power to learn our proposed novel representation, with the goal of selecting the best performing one. In particular, we chose two approaches that are common for related sequence-tagging problems, such as named entity recognition (NER) and part-of-speech (POS) tagging: a common baseline and a recent state-of-the-art model. We also tried a model-enrichment approach with syntactic relational inputs.

We stress that our goal in this section is to find a reasonable model whose output could support creative downstream tasks; many other architectures are possible and could be considered in future work.

• BiLSTM-CRF. A BiLSTM-CRF [37] neural network, a common baseline approach for NER tasks, enriched with semantic and syntactic input embeddings known to often boost performance [66]. We first pass the input sentence $x = (w_1, w_2, \ldots, w_T)$ through an embedding module, resulting in $\mathbf{v}_{1:T}$, $\mathbf{v}_t \in \mathbb{R}^{d}$, where $d$ is the embedding dimension. We adopt the "multi-channel" strategy as in [66], concatenating input word embeddings (pretrained GloVe vectors [56]) with part-of-speech (POS) and NER embeddings. We additionally add an embedding corresponding to the incoming dependency relation. The sequence of token embeddings is then processed with a BiLSTM layer to obtain contextualized word representations $\mathbf{h}^{(0)}_{1:T}$, $\mathbf{h}^{(0)}_t \in \mathbb{R}^{d_h}$, followed by a linear layer to obtain per-word tag scores $\mathbf{h}^{(L)}_1, \mathbf{h}^{(L)}_2, \ldots, \mathbf{h}^{(L)}_T$. These are used as inputs to a conditional random field (CRF) model which maximizes the tag sequence log likelihood under a pairwise transition model between adjacent tags [5].
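To make the architecture concrete, here is a minimal PyTorch sketch of the multi-channel encoder described above (our own illustrative code, not the paper's implementation; the CRF decoder is omitted, and vocabulary sizes, GloVe initialization, and training code are left out):

```python
import torch
import torch.nn as nn

class BiLSTMTagger(nn.Module):
    """Multi-channel BiLSTM encoder sketch: word, POS, NER and dependency-relation
    embeddings are concatenated, contextualized by a BiLSTM, and projected to
    per-token tag scores; a CRF decoder (omitted here) would sit on top.
    Dimensions follow the paper: 300 + 3 * 30 = 390 input, hidden size 200."""

    def __init__(self, n_words, n_pos, n_ner, n_dep,
                 n_tags=5, d_word=300, d_feat=30, d_hidden=200):
        super().__init__()
        self.word_emb = nn.Embedding(n_words, d_word)   # initialized from GloVe in practice
        self.pos_emb = nn.Embedding(n_pos, d_feat)
        self.ner_emb = nn.Embedding(n_ner, d_feat)
        self.dep_emb = nn.Embedding(n_dep, d_feat)
        self.lstm = nn.LSTM(d_word + 3 * d_feat, d_hidden // 2, num_layers=1,
                            bidirectional=True, batch_first=True)
        self.scorer = nn.Linear(d_hidden, n_tags)       # B/I x Purpose/Mechanism + O

    def forward(self, words, pos, ner, dep):
        x = torch.cat([self.word_emb(words), self.pos_emb(pos),
                       self.ner_emb(ner), self.dep_emb(dep)], dim=-1)
        h, _ = self.lstm(x)                             # contextualized representations
        return self.scorer(h)                           # per-token emission scores
```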

• Pooled Flair. A pre-trained language model [4] based on contextualized string embeddings, recently shown to outperform powerful approaches such as BERT [14] in NER and POS tagging tasks and achieve state-of-the-art results.

Flair uses a character-based language model pre-trained over large corpora, combined with a memory mechanism that dynamically aggregates embeddings of each unique string encountered during training and a pooling operation to distill a global word representation. We follow [4] and concatenate pre-trained GloVe vectors to token embeddings, add a CRF decoder, and freeze the language-model weights rather than fine-tune them [14, 57].

• GCN. We also explore a model-enrichment approach with syntactic relational inputs. We employ a graph convolutional network (GCN) [39] over dependency-parse edges [66] . GCNs are known to be useful for propagating relational information and utilizing syntactic cues [49, 66] . The linguistic cues are of special relevance and interest to us, as they are known to exist for purpose/mechanism mentions in texts [22] .

We used a GCN with the same token embeddings as in the BiLSTM-CRF baseline, with a BiLSTM layer for sequential context and a CRF decoder. For the graph fed into the GCN, we use pre-computed syntactic edges from dependency parsing: for a sentence $x_{1:T}$, we convert its dependency tree to an adjacency matrix $A$, where $A_{ij} = 1$ for any two tokens $i$, $j$ connected by a dependency edge. We also add self-loops $A_{ii} = 1$ (to propagate from $h_i^{(l-1)}$ to $h_i^{(l)}$ [66]). Following [66], we normalize activations to reduce bias toward high-degree nodes. For an $L$-layer GCN, denoting $h_i^{(l)} \in \mathbb{R}^{d_h}$ to be the $l$-th layer output of node $i$, the GCN operation can be written as

$$h_i^{(l)} = \sigma\left( \sum_{r \in \mathcal{R}} \sum_{j=1}^{T} A^{r}_{ij} W^{(l)}_{r} h_j^{(l-1)} / d_i^{r} + b^{(l)} \right), \quad \text{where } \mathcal{R} = \{\text{syn}, \text{self}\}$$

and $d_i^{r} = \sum_{j=1}^{T} A^{r}_{ij}$ is the degree of token $i$ w.r.t. $r$. In the GCN architecture, $L$ layers correspond to propagating information across $L$-order neighborhoods. We set the contextualized word vectors $h^{(0)}_{1:T}$ to be the input to the GCN, and use $h^{(L)}_{1:T}$ as the output word representations. Similarly to [49], we do not model edge directions or dependency types in the GCN layers, to avoid over-parameterization in our data-scarce setting. We also attempted edge-wise gating [49] to mitigate noise propagation but did not see improvements, similarly to [66].

In our experiments, we followed standard GCN training procedures. Specifically, we base our model on the experimental setup detailed in [66] (see also the authors' code, which we adapt for our architecture, at https://github.com/qipeng/gcn-over-pruned-trees). We pre-process the data using the spaCy (https://spacy.io) package for tokenization, dependency parsing, and POS/NER-tagging. We use pretrained GloVe embeddings of dimension 300, and NER, POS and dependency relation embeddings of size 30 each, giving a total embedding dimension $d = 390$. The bi-directional LSTM and GCN layers' hidden dimension is $d_h = 200$, with 1 hidden layer for the LSTM. We find that the setting of 2 hidden layers works best for the GCNs. The semantic similarity threshold was tuned on the development sets, and was found to be 0.4 on Quirky and 0.3 for the patents data. We also tried training with edge label information based on syntactic relations, but found this hurts performance. The training itself was carried out using SGD with gradient clipping (cutoff 5) for 100 epochs, selecting the best model on the development set.
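As a concrete illustration of the propagation rule above, here is a minimal PyTorch sketch of a single GCN layer over the dependency adjacency matrix (our own illustrative code under the stated simplifications, not the authors' implementation):

```python
import torch
import torch.nn as nn

class SyntacticGCNLayer(nn.Module):
    """One GCN layer over dependency edges plus self-loops, mirroring the equation
    above; edge directions and dependency types are ignored, as in the paper."""

    def __init__(self, d_in, d_out):
        super().__init__()
        self.w_syn = nn.Linear(d_in, d_out, bias=False)   # W for dependency edges
        self.w_self = nn.Linear(d_in, d_out, bias=False)  # W for self-loops
        self.bias = nn.Parameter(torch.zeros(d_out))

    def forward(self, h, adj):
        # h: (batch, T, d_in); adj: (batch, T, T) symmetric 0/1 dependency adjacency.
        deg = adj.sum(dim=-1, keepdim=True).clamp(min=1)  # degree normalization
        syn = torch.bmm(adj, self.w_syn(h)) / deg         # aggregate syntactic neighbors
        slf = self.w_self(h)                              # self-loop term
        return torch.relu(syn + slf + self.bias)
```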

For the Pooled-Flair approach [4], we use the FLAIR framework [3], with the settings obtaining SOTA results for CoNLL-2003 as in [4] (see https://github.com/flairNLP/flair/blob/master/resources/docs/EXPERIMENTS.md). We also experiment with non-pooled embeddings and obtain similar results. We experiment with initial learning rate and batch size settings described in [4], finding 0.1 and 32 to work best, respectively.
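For readers who want to reproduce a comparable setup, the following sketch shows how such a tagger could be assembled with the flair library; exact class and method names vary across flair versions, and the data paths, file names and embedding choices here are our assumptions, not the authors' configuration:

```python
# Sketch using the flair library (class/method names vary across flair versions;
# data paths, file names and embedding choices here are our assumptions).
from flair.datasets import ColumnCorpus
from flair.embeddings import WordEmbeddings, PooledFlairEmbeddings, StackedEmbeddings
from flair.models import SequenceTagger
from flair.trainers import ModelTrainer

# Two-column CoNLL-style files: token <tab> IOB purpose/mechanism tag.
corpus = ColumnCorpus("data/quirky", {0: "text", 1: "ner"},
                      train_file="train.txt", dev_file="dev.txt", test_file="gold_test.txt")

embeddings = StackedEmbeddings([
    WordEmbeddings("glove"),                 # pre-trained GloVe, concatenated as in [4]
    PooledFlairEmbeddings("news-forward"),   # pooled contextualized string embeddings
    PooledFlairEmbeddings("news-backward"),
])

tagger = SequenceTagger(hidden_size=256, embeddings=embeddings,
                        tag_dictionary=corpus.make_tag_dictionary(tag_type="ner"),
                        tag_type="ner", use_crf=True)    # CRF decoder on top

ModelTrainer(tagger, corpus).train("models/flair-pm",
                                   learning_rate=0.1, mini_batch_size=32)
```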

2.3 Evaluation Of Extraction Accuracy

In this section we assess extraction accuracy (whether we are able to extract purpose and mechanism spans of text). In the next sections, we evaluate the utility of the extracted spans for enabling creative innovation tasks.

To evaluate raw accuracy of the model's predictions, we use the standard IOB label markup to encode the purpose and mechanism spans (5 possible labels per token, {Beginning, Inside} x {Purpose, Mechanism} plus an "Outside" label). We conduct experiments using a train/development/test split of 18,702/3,614/512 sentences, with pre-processing, embeddings, and training hyperparameters as described in Section 2.2.

Fig. 3. Precision@K results for the best performing model (GCN + self-training).

Due to our challenging setting, we train models on bronze-standard annotations with noisy and partial tagging done by non-experts; for evaluation we use a curated gold-standard test set (Section 2). See Table 1 for results: GCN reaches an F1 score of ∼48%, outperforming the BiLSTM-CRF model (enriched with multi-channel GloVe, POS, NER and dependency relation embeddings) by 6%. GCN also surpasses the strong Pooled-Flair pre-trained language model by nearly 2.5%. A random baseline guessing each token by label frequencies (Section 2) achieves 16.01 F1. We interpret these results as possibly attesting to the utility of graph representations and features capturing syntactic and semantic information when labels are noisy. As a sanity check, we also computed precision@K (Figure 3). As expected, precision is higher for low values of K and gradually degrades. Precision for mechanisms is higher than for purposes. Interestingly, a manual inspection revealed many cases where, despite the noisy training setting, our models managed to correct mistaken or partial annotations (see Figure 4).

Fig. 4. Comparing our GCN model predictions (right) to human annotations (left). Interestingly, our model managed to correct some errors made by the annotator (e.g., "it's", "heated", "coffee warm", "beverages"). Purposes shown in pink, mechanisms in green.

Table 1. Not extracted; please refer to original document.

Self-Training. According to the results, we chose GCN as our best-performing model. We experimented with adding self-training [59] to GCN. Self-training is a common approach in semi-supervised learning where we iteratively re-label "O" tags in training data with model predictions. A large portion of our training sentences are (erroneously) un-annotated by workers, perhaps due to annotation fatigue, introducing bias towards the "O" label.

Self-training with GCN improves F1 by an additional 2.6%, substantially increasing recall (more than 12% over Flair); see Table 1. Self-training stopped after 2 iterations, following no gain in F1 on the development set.
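The procedure can be summarized with the following sketch; train_model, predict_tags and f1_score are hypothetical helpers standing in for the GCN training, inference and evaluation routines described above:

```python
# Hedged sketch of the self-training loop; train_model, predict_tags and f1_score
# stand in for the GCN training, inference and evaluation routines described above.
def self_train(train_data, dev_data, train_model, predict_tags, f1_score, max_iters=5):
    model = train_model(train_data)
    best_f1 = f1_score(model, dev_data)
    for _ in range(max_iters):
        relabeled = []
        for tokens, labels in train_data:
            preds = predict_tags(model, tokens)
            # Re-label only tokens the annotators left as "O"; keep human B-/I- tags.
            labels = [pred if gold == "O" else gold for gold, pred in zip(labels, preds)]
            relabeled.append((tokens, labels))
        candidate = train_model(relabeled)
        f1 = f1_score(candidate, dev_data)
        if f1 <= best_f1:                 # stop once dev F1 no longer improves
            break
        model, best_f1, train_data = candidate, f1, relabeled
    return model
```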

3 Case Study: Fine-Grained Functional Search For Alternative Uses

Our focus in this paper is to study the utility of the extracted purposes and mechanisms, in terms of the user interactions they enable. We explore two tasks demonstrating the value of our novel representation for supporting creative innovation.

We start with a case study involving search for alternative uses.

Our task is inspired by one of the most well-known divergent thinking tests [31] for measuring creative ability, the alternative uses test [33], where participants are asked to think of as many uses as possible for some object. Aside from serving as a measure of creativity, the ability to find alternative uses for technologies has important applications in engineering, science and industry. Technologies developed at NASA, the US space agency, have led to over 2,000 spinoffs, finding new uses in areas such as computer technology, agriculture, health, transportation, and even consumer products 2. Procter & Gamble, the multinational consumer goods company, has invested millions of dollars in systematic search for ideas to re-purpose and adapt from other industries, such as using a compound that speeds up wound healing to treat wrinkles, an idea that led to a new line of anti-wrinkle products [18]. And very recently, the COVID-19 pandemic provided a stark example of human innovation during times of crisis, with many companies actively seeking to pivot their business and re-purpose existing products to fit the new climate [19].

One teaching example is that of John Osher, creator of the popular "Spin Pop", a lollipop with a mechanism for twirling in your mouth. After selling his invention in the late 90's, Osher and a group of fellow inventors proactively searched for ideas, "rather than having an idea come to us" 3. The group drew up a list of dozens of potential ideas, and eventually landed on the "Spin Brush", a cheap electric toothbrush adapted from the same mechanisms behind the twirling lollipop. This case of repurposing an existing technology involved a systematic search process which required a rich, granular understanding of products and their designs rather than pure serendipity. However, Osher and his team still had to rely on human processing power, inherently limited in its ability to scour millions of potential descriptions of problems available online and find relevant and non-obvious candidate problems for which the twirling mechanism could be adapted.

Introducing automation could help accelerate the search process, helping scale human ingenuity by sifting through millions of ideas for relevant inspirations. However, the task is challenging for existing search systems, because it requires a nuanced, multi-aspect understanding of both products and queries. Consider, for example, a company that manufactures some product (e.g., light bulbs). The company is familiar with straightforward usages of their products (lamps, flashlights), and wants to identify non-standard uses and expand to new markets. Finding uses for a lightbulb that are not about the standard purpose of illuminating a space would be difficult to do with a standard search query over an idea repository. To come up with different applications of lights, one may turn to the Web to collect examples. However, it quickly turns out to be a non-trivial task, as the term "lights" or "lighting" will bring back lots of results close to "lamps," "flashlight," and the like. The result of a quick Google search is also inundated with Christmas lights or light bulbs (not to mention light in the sense of "lightweight"). What one might want instead is finding a diverse set of applications other than just building floor lamps or decorative lights.

In contrast, using our representation, each idea in the repository is associated with mechanism spans and purpose spans, and one could form a query such as mechanism="light bulb", purpose=NOT "light". Using our system, the searcher adds "light" as a mechanism and also adds "light" as a negative purpose (i.e., results should not include a "light" purpose).

Our engine returns interesting examples such as billiard laser instructor devices (Table 2), warning signs on food packages to get the attention of kids with allergies, and lights attached to furniture to protect your pinky toes at night (Fig. 5, bottom).

Fig. 5. Applications for light where light is not in the purpose. Two of the results and their automatic annotations (purposes in pink, mechanisms in green).

3.1 Study Design

We have built a prototype search engine supporting our representation. Figure 5 shows the top two results for the light bulb scenario: warning lights on food for kids with allergies, and lights attached to furniture to protect your pinky toe at night. These are non-standard recombinations [21] (light + allergies, light + furniture guard) that could lead the company to new markets. We conduct an experiment simulating scenarios where users wish to find novel/uncommon uses of mechanisms. Table 2 shows the scenarios and examples. To choose these scenarios for the experiment, we find popular/common mechanisms in the dataset and their most typical uses. For example, one frequent mechanism is RFID, which is typically used for purposes such as "locating" and "tracking". We then create queries searching for different uses, purposes that do not include concepts related to the typical uses of a given mechanism. 4 We now describe the methods we use to retrieve results for the scenarios.

3.1.1 Our Approach. We represent each product as a set of purpose span embedding vectors $\tilde{P}$ and a set of mechanism span embedding vectors $\tilde{M}$ (Section 2). A query likewise consists of embedded purpose terms $q_P$ and mechanism terms $q_M$, where terms may be negated; for a negated term we require its distance to the corresponding product facet set to be at least a threshold (e.g., $d(\{q_{\text{GPS}}\}, \tilde{M}) \geq \text{threshold}$), as in Equation (1).

Table 2. Scenarios and example results retrieved by our FineGrained-AVG method. All queries reflect non-trivial uses of mechanisms (e.g., a query for using water not for drinking/cleaning retrieves a lighter running on hydrogen from water and sunlight).

Mechanism: water. Purpose: NOT cleaning, NOT drinking → A lighter that burns hydrogen generated from water and sunlight.
Mechanism: RFID. Purpose: NOT locating, NOT tracking → A digital lock for your luggage with RFID access.
Mechanism: light. Purpose: cleaning → A UV box to clean and sanitize barbells at the gym.

EQUATION (1): Not extracted; please refer to original document.

We explore two alternatives for computing the distance metrics $d_P$, $d_M$:

• FineGrained-AVG. $d_P(q_P, \tilde{P})$ is 1 minus the dot product between the average query vector and the average purpose vector (normalized to unit norm). We define $d_M$ similarly.

• FineGrained-MAXMIN. We match each element in $q_P$ with its nearest neighbor in $\tilde{P}$, and then take the minimum over the (dot-product) similarities between matches; $d_P$ is defined as 1 minus this minimum. All vectors are normalized. We define $d_M$ similarly. This captures cases where queries match only a small subset of product chunks, erring on the side of caution with a max-min approach.
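For concreteness, a minimal NumPy sketch of the two metrics as we read their definitions above (illustrative code, not the authors' implementation; the inputs are sets of span embedding vectors):

```python
import numpy as np

def _unit(v):
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

def dist_avg(query_vecs, facet_vecs):
    """FineGrained-AVG: 1 minus the dot product of the averaged, unit-normalized vectors."""
    q = _unit(_unit(np.asarray(query_vecs)).mean(axis=0))
    f = _unit(_unit(np.asarray(facet_vecs)).mean(axis=0))
    return 1.0 - float(q @ f)

def dist_maxmin(query_vecs, facet_vecs):
    """FineGrained-MAXMIN: match each query vector to its nearest facet vector,
    take the worst-matched query element, and return 1 minus that similarity."""
    q = _unit(np.asarray(query_vecs))
    f = _unit(np.asarray(facet_vecs))
    sims = q @ f.T                       # (|query|, |facets|) cosine similarities
    return 1.0 - float(sims.max(axis=1).min())
```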

3.1.2 Baselines. We test our model against:

• AvgGloVe. A weighted average of GloVe vectors of the entire text (excluding stop words), similar to standard NLP approaches for retrieval and textual similarity. We average query terms and normalize to unit norm. Distance is computed via the dot product.

• Aggregate purpose/mechanism. Representing each document with the model in [36] , using a BiLSTM neural network taking as input raw text and producing two vectors corresponding to aggregate purpose and mechanism. We average and normalize query vectors, and use the dot product.

For all four methods, we handle negative purpose queries by filtering out all products whose distance is greater than $\lambda$, where $\lambda$ is a threshold selected to be the 90th percentile of distances.


3.2 Results

We recruited five engineering students to judge the retrieved product ideas. Each participant provided binary relevance judgments for the top 20 results from each of the four methods, shuffled randomly so that judges are blind to the condition 5.

See Figure 6 for results. We report Normalized Discounted Cumulative Gain (NDCG) and Mean Average Precision (MAP), two common metrics in information retrieval [60]. Our FineGrained-AVG wins for both metrics, followed by FineGrained-MAXMIN. The baselines perform much worse, with the aggregate-vectors approach in [36] outperforming standard embedding-based retrieval with GloVe. Importantly, our approach achieves high MAP (85%-87%) in absolute terms, in addition to a large relative improvement over the baselines (MAP of 40%-60%). For example, for the query of using light for the purpose of cleaning, the aggregate-vector baseline's top results were largely unrelated to the cleaning purpose (e.g., a light strip to change your water color). Only the fifth result (out of the top five) was closer to being related to the query (a LED lamp designed to look like a window that can keep air odorless with an electrostatic air purifier), yet not precisely capturing the purpose of cleaning, due to squashing together multiple concepts in one soft average (this product was also ranked as the top fourth result by the standard search baseline). In contrast, the fifth result found by FineGrained-MAXMIN was a die grinder with a light to see inside when cleaning / fixing root welds inside steel pipe. As another example, for the query of using RFID not for locating or tracking, the top result with both FineGrained-AVG and FineGrained-MAXMIN is a walk-through checkout scanner that uses RFID, a product not captured by the two other baselines in their top five results. The first-ranked result found by the aggregate-vector baseline approach was a customizable luggage system with RFID protection (also the second result retrieved by FineGrained-AVG), but it also retrieved products such as a wifi enabled chip for kids and pets that allows them to go in or out without tripping the alarm, and a case with laser and bluetooth to connect to smart devices, which are of weaker relatedness to RFID technology. Overall, our results demonstrate that fine-grained purposes and mechanisms lead to better functional search expressivity than approaches based on distributional representations or coarse purpose-mechanism vectors.

Fig. 6. Results for search evaluation test case. Mean average precision (MAP) and Normalized Discounted Cumulative Gain (NDCG) by method, averaged across queries. Methods in bold use our model.
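For reference, the two metrics can be computed from the judges' binary relevance lists as follows (a generic sketch, not the authors' evaluation script):

```python
import numpy as np

def average_precision(relevance):
    """Binary relevance judgments in ranked order -> average precision."""
    relevance = np.asarray(relevance, dtype=float)
    if relevance.sum() == 0:
        return 0.0
    precision_at_k = np.cumsum(relevance) / (np.arange(len(relevance)) + 1)
    return float((precision_at_k * relevance).sum() / relevance.sum())

def ndcg(relevance):
    """Normalized Discounted Cumulative Gain for binary relevance."""
    relevance = np.asarray(relevance, dtype=float)
    discounts = 1.0 / np.log2(np.arange(len(relevance)) + 2)
    dcg = float((relevance * discounts).sum())
    ideal = float((np.sort(relevance)[::-1] * discounts).sum())
    return dcg / ideal if ideal > 0 else 0.0

# MAP is the mean of average_precision over the queries (scenarios);
# both metrics are computed on each method's ranked top-20 results.
```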

4 Exploring The Design Space With A Functional Concept Graph

In this section we test the value of our novel representation for supporting users in exploring the design space for solving a given problem. We use our span-based representation to construct a corpus-wide graph of purpose/mechanism concepts. We demonstrate the utility of this approach in an ideation task, helping users identify useful inspirations in the form of problems that are related to their own.

Our goal is to help users "break out" of fixation on a certain domain, a well-known hindrance to innovation [11, 40]. Doing so is challenging because it requires some level of abstraction: being able to go beyond the details of a concrete problem to connect to a part of the design space that may look dissimilar on the surface, but has abstract similarity. Numerous studies in engineering and cognitive psychology have shown the benefits of problem abstractions for ideation [22, 24, 30, 40, 46, 63, 64]. However, these studies either involve non-scalable methods (relying on highly-structured annotations, or on crowd-sourcing) or simple, syntactical pattern-matching heuristics incapable of capturing deeper abstract relations.

In the work closest to ours [36], crowd workers were given a product description from the Quirky database, and asked to come up with ideas for products that solve the same problem in a different way. Soft aggregate vectors representing purposes and mechanisms were used to find near-purpose, far-mechanism analogies. Thus, the ability to find analogs was limited by relying on having a given mechanism to control for structural distance. Unlike [36], in our setup we assume a realistic scenario where we are given only a very short problem title (e.g., generating power for a phone, reminding to take medicine, folding laundry) and aim to find inspirational stimuli [30] in the "sweet spot" for creative ideation: structurally related to the given problem, not too near yet also not too far [23].

To address this challenge, in this section we build a tool inspired by functional modeling, which we call a Functional Concept Graph. A functional model [35] is, roughly put, a hierarchical ontology of functions and ways to achieve them, and is a key concept in engineering design. Such models are especially useful for innovation, allowing problem-solvers to "break out" of a fixed overly-concrete purpose or mechanism and move up and down the hierarchy. Despite their great potential, today's functional models are constructed manually, and thus do not scale. We thus construct a (crude) approximation of a functional representation that would still be useful for exploring the design space and suggesting potentially useful inspirations to users.

In our approach, Functional Concept Graphs consist of nodes corresponding to purposes or mechanisms, and edges encoding semantic (not necessarily hierarchical) relations. Our span-based representation enables us to build this graph ( Figure 7 ) -products that mention certain purposes (e.g., "charge your phone") will often mention other, structurally related problems that could be more general/abstract (e.g., "generate power") or more specific ("wireless phone charging"). These relations between purposes and mechanisms could help find connections across ideas at different levels of abstraction.

Fig. 7. An example of our learned functional concept graph extracted from texts. Mechanism in green, purpose in pink. Titles are tags nearest to cluster centroids (redacted to fit).

Our approach allows us to look at fine-grained co-occurrences of concepts appearing together in products and thus infer relations between them, unlike the coarse representation in [36] that represented entire products with one aggregate purpose/mechanism vector which could not reveal important granular information needed for constructing such a graph.

In other words, we can discover patterns of the form: products that solve problem $p$ also often solve a set of problems $\mathcal{I}$, and suggest $\mathcal{I}$ as potential inspirations to be recommended 6. However, naively looking for co-occurrences of problems may yield $\mathcal{I}$ too near to the original $p$, as many frequently co-occurring purposes tend to be very similar, while we are interested in discovering the more abstract relations. In addition, raw chunks of text extracted from our tagging model have countless variants that are not sufficiently abstract and are thus sparsely co-occurring. We thus design our approach to encourage abstract inspirations, as we describe next.

4.1 Building A Functional Concept Graph

We develop a method to infer this representation from co-occurrence patterns of the fine-grained spans of text. We take the following two steps (see more details in the next section):

I. Concept discretization. Intuitively, nodes in our graph should correspond to groups of related spans ("charging", "charging the battery", "charging a laptop"). To achieve this, we take all purpose and mechanism spans $\tilde{P}$, $\tilde{M}$ in the corpus, extracted using our GCN model, and cluster them (separately), using pre-trained vector representations. We refer to the clusters $C_P$, $C_M$ as concepts.

II. Relations. We employ rule-mining [55] to discover a set of relations R between concepts. Relations are Antecedent =⇒ Consequent, with weights corresponding to rule confidence. To illustrate our intuition, suppose that when "prevent head injury" appears in a product description, the conditional probability of "safety" appearing too is large (but not the other way around). In this case, we can (weakly) infer that preventing head injuries is a sub-purpose of "safety". Indeed, manually observing the purpose-purpose edges, the one-directional relations captured are often sub-purpose, and the bi-directional ones often encode abstract similarity. Similarly, for mechanism concepts the one-directional relations are often part of ("cell phone" and "battery"), and bi-directional are mechanisms that co-occur often. For pairs of purpose and mechanism concepts, the relation is often functionality ("charger", "charge").
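A compact sketch of these two steps follows, assuming a function mapping spans to embedding vectors; it uses scikit-learn's KMeans (which defaults to k-means++ initialization) and computes rule confidence directly from per-product concept co-occurrences rather than calling an off-the-shelf Apriori implementation. The function and parameter names are our own:

```python
import numpy as np
from collections import Counter
from sklearn.cluster import KMeans

def build_concept_graph(products, span_embedding, n_concepts=250, min_conf=0.5):
    """products: list of per-product purpose-span lists; span_embedding: span -> vector.
    Returns a dict of directed edges {(concept_a, concept_b): confidence}."""
    # Step I: discretize spans into concepts by clustering their embeddings.
    spans = sorted({s for p in products for s in p})
    X = np.stack([span_embedding(s) for s in spans])
    concept_of = dict(zip(spans, KMeans(n_clusters=n_concepts).fit_predict(X)))

    # Step II: mine relations from per-product concept co-occurrences.
    product_concepts = [{concept_of[s] for s in p} for p in products]
    single, pair = Counter(), Counter()
    for concepts in product_concepts:
        single.update(concepts)
        pair.update((a, b) for a in concepts for b in concepts if a != b)

    # Rule confidence: P(concept_b appears | concept_a appears).
    return {(a, b): pair[(a, b)] / single[a]
            for (a, b) in pair if pair[(a, b)] / single[a] >= min_conf}
```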

Example. Figure 7 shows a subgraph from our automatically constructed functional concept graph (showing only high-confidence edges). Pink nodes correspond to purposes and green nodes to mechanisms. The figure shows a part of the graph related to electricity, power and charging. A designer could go from the problem of charging batteries to the more general problem of generating power, and from there to another branch (e.g., solar power and mechanical stored energy), to get inspired by structurally related ideas.

Fig. 8. A snippet from our ideation interface for “morning medicine reminder”. Users indicate which inspirations were useful, and what they inspired. For example, seeing “real time health checker” inspired one user to suggest a monitoring device for finding the best time for reminding to take the medicine.

4.2 Study Design

Next, we set out to test the utility of the functional concept graph, based on our nuanced representation, in an ideation task. In our setup we gave participants problems (e.g., reminding people to take their medication) and asked them to think of creative solutions. Participants were also given a list of potential inspirations, and were instructed to mark whether each was novel and helpful. They were encouraged to explain the solution it inspired. See example in Figure 8 : Seeing "real time health checker" inspired one user to suggest monitoring the person to find the best time to remind them to take medicine.

To create a set of seed problems, a graduate student mapped problems from WikiHow.com (a website of how-to guides) to purposes in our data. Using this source allowed us to collect real-world problems that are broadly familiar, with succinct and self-explanatory titles that do not require further reading to understand. The student was tasked with confirming that our Quirky dataset contains idea descriptions that mention these problems. For a given problem in WikiHow (how to remember to take medication), they performed keyword search over 17 purpose spans gleaned by our model from Quirky, and found matching spans (morning medicine reminder). We use those matching spans as our seed problem description given to users (purple text in Figure 8). We collect 25 problems this way. Table 3 shows more examples, such as tracking distance walked, folding laundry, or sensing dryness level.

Table 3. Example inspirations and explanations given by human evaluators. Examples include: "Health trackers to tell if medicine not taken, alert accordingly"; "Heart rate monitoring, continuously monitor glucose"; "Find the best time to take medicine".

Inspirations are other purpose spans from our dataset (see Table 3), selected automatically using our approach and comparing to baselines. For our approach, we explore two common and powerful vector representations of spans, one based on pre-trained word embeddings [56] and the other on a more recent language model representation tuned to capture semantic similarity [58]. Our method is based on clustering related purpose spans into concept nodes in the functional graph; some of these nodes contain tens of spans in them. Thus, we also explore two approaches to "summarize" each concept cluster with representative spans displayed to users.

In more concrete detail, we experiment with the following approaches for selecting inspirations:

• GloVe pre-trained word embeddings, averaged across tokens.

• BERT-based contextualized vectors that have been fine-tuned for semantic similarity tasks 7 [58] .

Each representation is used to cluster the spans. We cluster the spans using K-Means++ 8 [7]. We then apply the Apriori algorithm 9 to automatically mine association rules between clusters [55], and use the confidence metric to select the top rules 10. To use the mined rules between purpose nodes (clusters) for selecting inspirations shown to users, we start from the purpose node corresponding to the given problem and take its consequents; as explained earlier, this captures a weak signal of abstract similarity.

We experiment with two approaches for displaying concepts to users -one that attempts to summarize the cluster independently of the seed problem, and one that takes the seed problem into account:

• TextRank [50]. We construct a graph where nodes are the spans in a cluster and edges represent textual similarity. We run PageRank [54] on this graph, selecting the top $k$ spans to present.

• Nearest spans. Following the findings in [23], we select the top $k$ spans in $C_P$ that are nearest to the query $p$. (For both approaches, we use $k = 5$.)
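A minimal sketch of the two cluster-summarization strategies, assuming unit-normalized span embeddings and using networkx's PageRank for the TextRank variant (illustrative code, not the authors' implementation):

```python
import networkx as nx

def textrank_summary(span_vecs, k=5):
    """span_vecs: dict span -> unit-normalized embedding. Returns k central spans."""
    spans = list(span_vecs)
    g = nx.Graph()
    g.add_nodes_from(spans)
    for i, a in enumerate(spans):
        for b in spans[i + 1:]:
            sim = float(span_vecs[a] @ span_vecs[b])
            g.add_edge(a, b, weight=max(sim, 0.0))      # similarity-weighted edges
    scores = nx.pagerank(g, weight="weight")            # PageRank over the similarity graph
    return sorted(spans, key=scores.get, reverse=True)[:k]

def nearest_spans(span_vecs, query_vec, k=5):
    """Spans in the cluster nearest to the seed problem embedding."""
    return sorted(span_vecs, key=lambda s: float(span_vecs[s] @ query_vec), reverse=True)[:k]
```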

4.2.2 Baselines.

• Purpose span similarity. Given a problem $p$, we find the $k = 5$ nearest purpose spans of text in our corpus (out of 17 purposes). We experiment with the same two vector representations used by our approach: GloVe and BERT. This method is similar to applying the methodology in [36] to our setting, except that here we are given only a problem and no mechanism is available to control for structural distance. While this approach relies on our model for extracting purpose spans, we consider it a baseline to study the added value of our hierarchy.

• Linguistic abstraction. We use the WordNet [52] lexical database to extract hypernyms (for each token in $p$), in order to capture potential abstractions. WordNet is often used in similar fashion for design-by-analogy studies [30, 46] (a minimal sketch of this baseline appears after this list).

7 We use RoBERTa-large-STS-SNLI, available at github.com/UKPLab/sentence-transformers.
8 $K = 250$, selected automatically with elbow-based criteria on silhouette scores.
9 http://www.borgelt.net/pyfim.html
10 We use the top 3 rules in our experiment.

Fig. 9. Example from our Functional Concept Graph, explaining the inspirations shown to users in Figure 8. Nodes represent concepts (clusters of purposes), named by us for readability. Edges are annotated with products containing spans from both concepts. The problem of “medicine morning reminder” is mapped (via embedding) to the Alert/remind concept, which is linked to the concepts of medical monitoring and making hot drinks through products such as “smart medicine injector” and “coffee machine alarm” (among others, not displayed in the figure). These links serve as inspirations in our study.


• Random concepts. Random inspirations are often considered a baseline in ideation studies, since diversity of examples is a known booster for creative ability [36]. For each task, we select a random cluster from $C_P$ and display its TextRank summary.
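As referenced in the linguistic-abstraction baseline above, here is a minimal sketch of per-token hypernym extraction using NLTK's WordNet interface (illustrative only; the function name and the cap on hypernyms per token are our own choices, not the paper's configuration):

```python
from nltk.corpus import wordnet as wn  # requires nltk and its WordNet corpus

def hypernym_abstractions(problem_tokens, max_per_token=3):
    """Collect hypernym lemmas for each token as candidate abstractions."""
    abstractions = {}
    for tok in problem_tokens:
        hypers = []
        for synset in wn.synsets(tok):
            for hyper in synset.hypernyms()[:max_per_token]:
                hypers.extend(l.name().replace("_", " ") for l in hyper.lemmas())
        abstractions[tok] = sorted(set(hypers))
    return abstractions

print(hypernym_abstractions(["medicine", "reminder"]))
```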

4.2.3 Rating Collection.

In our study, each method generated $k = 5$ spans (concept summaries), which are grouped and displayed together in a box (Figure 8). For each problem a rater views 8 boxes in randomized order, to avoid bias. We recruit 10 raters (8 graduate students, a senior engineering professor, and an architect). Raters were instructed to mark inspirations they consider useful and relevant for solving a given problem, while not being about the same problem. Raters were also encouraged to write comments, especially for non-trivial cases which they found of interest (see Table 3). In total, raters viewed 2584 boxes, or 12920 purpose descriptors.

4.3 Results

Qualitative analysis. Table 3 and Figure 8 show examples of problems, inspirations and user explanations from our study. For instance, users facing the "morning medicine reminder" problem were presented with nearby concepts in the Functional Concept Graph that included health monitoring and coffee machines. To explore why these concepts are connected in our graph and why they are potentially useful as inspirations, we make use of the direct interpretability of our approach. We examine the purpose co-occurrences from which the Functional Concept Graph was constructed. Figure 9 shows a graph with concept nodes of making hot drinks, alerting/reminding, health monitoring, and medicine delivery, and edges representing products in which two adjacent purposes were co-mentioned (e.g., a coffee machine alarm product that mentioned the purposes of making hot drinks and alerting/reminding, or a "smart medicine injector" that mentioned both alerting/reminding and medicine delivery). This explains why the concepts are nearby in the graph, as there are multiple products in our dataset that refer to purposes from both concepts. For example, a pill reminder product refers to the problem of forgetting to take medicine at prescribed times (Sends notification if you forgot to take your AM or PM meds), while a smart injector device administers medicine on set time intervals. At the same time, both of these products of course mention purposes of medicine delivery. When our graph construction algorithm observes enough similar co-occurrence patterns between the concepts of alerting and medicine delivery, across multiple products, an edge is added between the two in the graph. Similarly, an alarm coffee maker product mentions the purposes of time management and making coffee at a set time as well as alerting when the coffee is ready, explaining how it emerges as a potential inspiration in our graph. This type of linkage or overlap between an original problem space and inspiration problems helps get at a sweet spot of innovation [12] by finding ideas that are not too near and not too far from the original problem, helping users break out of fixation as discussed earlier in this section. Users used these inspirations to come up with a tracker that alerts the user at the best time to take a medicine, and a coffee machine reminding the user to take their medication with their morning coffee. Those creative directions demonstrate the utility of the Functional Concept Graph for exploring the design space.

Quantitative results. Figure 10 shows the results of the user study. On the left, we show the proportion of inspirations (individual spans) selected by at least two raters, for each method. Our approach significantly outperforms all the baselines. The effect is particularly pronounced for the BERT-based approach, with 51% of inspirations found useful, while the best baseline reaches less than 30%. Interestingly, for both BERT and GloVe representations, the Nearest-span summarization approach fares better, potentially due to striking a balance between being too far/near the initial problem $p$. Figure 10 (right) shows the proportion of inspiration boxes that got at least 2 individual inspirations marked (by at least 2 raters). This metric measures the effect of a box as one unit, as each box is meant to represent a coherent cluster.

Fig. 10. Inspiration user study results. Left: Proportion of inspirations selected by at least 2 raters, per condition. Right: Proportion of boxes (clusters) with at least 2 spans marked by ≥ 2 raters.

Our method is able to reach 62%, while the best baseline (GloVe search on purpose spans) yields only 39%. Again, the nearest-span summarization is preferred to TextRank. Importantly, for both individual inspiration spans and inspiration boxes, 51%-62% are rated as useful, high figures considering the challenging nature of the task.

5 Conclusion

In this paper we introduced a novel span-based representation of ideas in terms of their fine-grained purposes and mechanisms and used it to develop new tools for creative ideation. We trained a model to extract spans from a noisy, real-world corpus of products. We used this representation to help search for alternative, uncommon uses of products and to generate a graph capturing abstract similarities in idea repositories to help problem-solvers explore the design space around their problem. In both ideation studies, we were able to achieve high accuracies, significantly outperform baselines and help boost user creativity.

In future work, we would like to further explore weak supervision approaches to augment annotation in noisy settings. Another direction is learning purposes and mechanisms in an end-to-end fashion. Another exciting prospect is deploying our search engine publicly, allowing scientists, engineers and designers to perform rich queries, discover new similarities, and boost innovation with enhanced capabilities not possible with today's search.

Beyond supporting richer search for creative inspiration, a data-driven approach to extracting functional facets and learning abstractive relationships between the facets could power much more expansive approaches to mapping out design spaces for entire domains or problem areas, identifying key subproblems and constraints and novel paths through the design space. Mapping approaches like this, such as technological roadmapping [9] , have already shown significant promise for reinvigorating research and development in real-world applications such as neural recording [48] . However, these mapping exercises are still highly manual and labor-intensive processes; computational support for such tasks could have transformative impacts on innovation.

2 https://spinoff.nasa.gov/

3 https://www.allbusiness.com/the-man-the-legend-john-osher-inventor-of-the-spin-brush-part-i-2-7665547-1.html

4 To automate scenario selection, we cluster mechanisms (see Section 4.1 for details), select frequent mechanisms from the top 5 largest mechanism clusters, and identify purposes strongly co-occurring with them (e.g., "RFID" co-occurs with "locating", "tracking") to avoid.

5 Inter-rater agreement measured across all scenarios was at 50% by both Fleiss' kappa and Krippendorff's alpha.

6 This also bears certain resemblance to collaborative filtering [41], where recommendations are based on the pattern: people who buy item X also often buy Y; in our case, instead of people and items, we have ideas written by people, and the problems they solve.