RailsConf 2017: Implementing the Web Speech API for Voice Data Entry by Cameron Jacoby

(band music) – Alright, so we’re gonna
go ahead and get started. Thank you guys for coming, I
know it’s the end of the day and everyone’s probably pretty tired, but I really appreciate you
guys coming to this talk. So, I’m gonna start out with a short poll. Can you guys raise your hand for me if you’ve ever used Siri or Google Voice or any other type of
voice dictation software? So, show of hands. Okay, great. So almost everyone in the room. And then I want you to
raise your hand again if you would characterize
that voice dictation software as 100 percent accurate all the time. Anyone (laughs)? Okay, so that’s pretty
much what I expected and that was basically my impression of voice dictation as well. So for example, when I
was back in San Francisco in my apartment putting
together my slides, I tried to get Siri to turn
on the lights above my desk and I had to ask about
four different times. And in the end, she didn’t
even turn on the correct light. But that being said, today
we are gonna talk about voice dictation, specifically
with a technology called the Web Speech API. But first, introductions. Hi, I’m Cameron and I
build expert-use software for Stitch Fix in Ruby on Rails. And you may have also
heard expert-use software described as internal tools. So what this means is
that I build applications for other Stitch Fix employees. And I’m gonna be spending
most of our time today talking about a project that I worked on recently at Stitch Fix. So I thought it would be a good idea to give a brief overview
of what the company does so that everyone can level set and everyone’s on the same page. So, Stitch Fix is an online
personalization company currently focused on
men’s and women’s apparel, accessories, and footwear. The way that it works is you go online, you fill out a survey with
your size and fit preferences and then we match you
with a personal stylist who puts together a box
for you of five items. We pick the five items from our warehouse, we send it to your house,
you try it on at home and you get to keep what you love and send the rest back free of charge. The previous slide showed a picture of a typical Stitch Fix
box, also known as a Fix. And here I wanna show the lifecycle of one of those items in a Fix. Before the item gets to the client, there are several different
steps that it goes through. At the very beginning it
involves a choice by the buyer to actually bring in the
style to sell to our clients. The buyer places the order for the style to come in at a certain date. And then the vendor ships
the items of the style to our warehouse on that date. Next, the warehouse receives and puts the items into inventory and then they’re available for the stylist to send out to the client. Once the stylist picks the
item to go in the client’s Fix, we’re back at the
warehouse and the warehouse picks the items out of
inventory, packs them up, and ships them to the client. Then, like I mentioned before, the client can try on the items at home and then the warehouse is gonna process anything that the client returns. So we’ll come back to that in a second, but now that we’ve talked a
little bit about Stitch Fix, here’s a brief overview of
what we’re gonna cover today. First I’m gonna go through a case study featuring data entry by
associates at our warehouse. Then I’ll show you how you can get started with the Web Speech API to experiment with voice
dictation on your own. I’ll talk about some
voice dictation challenges that we ran into and
solutions that we implemented. And then I’ll answer the question, is voice the right solution for you? Jumping into the case study, like many retail companies,
Stitch Fix takes measurements of the items that we bring into inventory to eventually sell to our clients. This is a diagram of a men’s
long-sleeve woven shirt and you can see six marks across the shirt and these are called points of measure. These are very specific,
technical retail measurements that we would take on the shirt. There’s actually hundreds
of these measurements that can be taken and
they range from something as specific to the width of a button to as generic as the length
or width of the shirt. But at Stitch Fix, for
any given men’s shirt we typically take about
15 to 20 measurements. The part of the process where
we take these measurements, if we go back to the lifecycle of an item, it’s at our warehouse, when we receive the inventory from the vendor. The way that the
measurements are collected is just with a basic
sewing measuring tape. Here you can see one of the men’s shirts laid out flat on a table
and we’re measuring across the shoulders
with the measuring tape. When I started working on this project, the goal of the project
was to build an application to start capturing these measurements that we were taking at our warehouses. And the process was already in place before I started working on the project. Measurements were already
being taken and collected, but the team was using Google Sheets to record these measurements. And you’ll see that as
kind of a recurring theme in internal software:
we’re taking existing processes and making them more
efficient and scalable by building software to support them. And so that’s exactly what
we did in this project. Throughout this project,
I got the opportunity to partner with my coworker
on the user experience or UX design team at Stitch Fix. And we worked together
throughout the entire project from the user research
phase, the prototyping phase, and the development phase. Here’s a picture from our
initial user research session where we went to the warehouse to observe the current measurement process before figuring out what type of tool we were gonna build to
support the process. And we had a couple main takeaways from the first user research session. The pictures on the left and the right show some handmade props
that the warehouse associates had made to aid them in
the measuring process. And you can see from the
diagram a couple slides ago that we really took
inspiration from these props that they made and carried that through into the application. And then the middle photo shows one of the warehouse associates actually taking these measurements. And our main takeaway from that was that they were recording
these measurements on very small laptop screens and there was a lot of hunching
over, a lot of shifting and body language back and forth between measuring the garment and entering it into the keyboard. Before I talk about the rest
of the process we went through to build this application, I wanted to give you guys some context so you can think about
what we ended up building as we go through the rest of the process. So here’s a quick demo
of the final solution that we came up with. – [Narrator] 23. 18 and 1/4. 9 3/4. 8 1/2. 4 7/8. 2 3/4. 16 1/2. 2 5/8. Save. – So, in case you guys haven’t
figured it out already, you are at one of the
JavaScript talks at RailsConf. We ended up going with voice
dictation as our solution. This is a Rails app, but
all of the voice dictation is built on the front end. But in all honesty, this isn’t really a talk about JavaScript or
even about voice dictation. It’s a story about how to
leverage the UX design process in engineering to build the
best product for our users. So, how did we do that? Well let’s finish the story. So after our initial
user research session, we were pretty focused on the fact that the users were hunched
over the small laptop and they had to switch back and forth from measuring and
entering the measurements into the keyboard. So we wanted to test
out measuring in pairs. We asked the associates to pair up so that one of them could
continuously measure and dictate the measurements aloud. And then the other could
type into the laptop. And the reasoning behind that is that our hypothesis was that
if one of the associates could spend 100 percent
of their time measuring, they wouldn’t have to break their flow or their concentration, they wouldn’t have to
reset their body language or their hand position
on the measuring tape, and they would be able
to be more efficient. So what we found from this test is that the associates kind of hated this concept of measuring in pairs. The person who was typing on the laptop felt like he was just
sort of waiting around and wasn’t really doing much, and felt like he could
have been more efficient if he had grabbed another shirt and started measuring himself. But what we did notice that was promising is that the associate who
got to focus completely on measuring did seem
to be more efficient. And she didn’t have to break her flow, she was able to continuously measure without breaking her concentration and without shifting her body language. Because of that finding, we
then wanted to move forward with a voice usability study. These two screenshots show
our initial prototypes that we brought to the warehouse. The one on the left is
basic keyboard entry and the one on the right is
the voice dictation prototype. And they don’t look that much different. Just wanna call that
out, you’re not really supposed to see a difference in these because this isn’t so much of a UI change as an input change. But you can see in the
voice dictation prototype there’s a click to speak button on the top for the associates to press when they’re ready to start speaking
into the application. But aside from that, the
interfaces are pretty similar. In this voice usability study, there were three main questions that we were hoping to answer. And the first one was around efficiency. So would voice entry
affect the overall time to measure a garment? And then the second question that we wanted to answer
was around accuracy. So a little bit of background. Our warehouses are pretty
noisy environments. The associates often like to play music, they sing aloud to the music, and they wanna talk to their
friends during their shifts. And so we were wondering
if this would work out for voice dictation or if it would be hard to capture the input
that the user was saying. And then the last one was a question a little bit around culture and workflow. How would the warehouse
associates feel about voice entry? So a little bit of context into that, any associate who’s working
on the measurements, usually is doing it in
about a four hour shift. So about half of a work
day with breaks in between and we didn’t know if
that would feel exhausting to be talking aloud for hours at a time or if they would prefer to be typing into a keyboard instead. So let’s take a look at the results. Here are the results around efficiency. We tested these prototypes
with two warehouse associates and you can see that Participant 1 had a pretty dramatic
increase in efficiency, shaved about three minutes
off of his measurement time with the voice data entry. And then Participant 2 also saw a bit of a lift in efficiency,
but not quite as dramatic. And the interesting thing
here is that Participant 2 was already the more experienced person doing the measurements. And so he was already
ridiculously fast at taking measurements, which is why he didn’t see quite the same increase in efficiency as the less experienced associate. But we thought these were
really promising results, especially since we knew
that we would be onboarding new people onto this process to be taking these measurements. There seemed to be a huge
efficiency gain here. So the next thing that we wanted to take a look at was the accuracy. And we found that investing
in the right headset was really the key here. And we were able to
mitigate the accuracy issues from the noisy environment. So this is the headset
that we ended up purchasing for our warehouse associates. The microphone has a
pretty narrow input range and then the most important factor here is that the microphone can be
flipped up into the headset and it stops recording
when it’s flipped up. So this was important to us in terms of keeping the culture
going in the warehouse. The associates could move
seamlessly back and forth between measuring and singing along or talking to their friends
and they didn’t have to feel trapped by this
voice dictation device. And then the last thing
that we wanted to know was how the associates would
feel about voice dictation. So here are some photos. The left one shows the
keyboard entry prototype and the right one shows
the voice dictation. And this is Participant 1 in the study and his main comment was
that the voice dictation felt a lot better for his back. And you can’t see as well
in the keyboard picture, but in the dictation
picture he’s standing up definitely straighter and not
as hunched over the laptop. And then this is Participant
2 who was the most experienced and already pretty efficient
at using the keyboard and his main comment is
that he liked that he never had to remove his hands
from the measuring tape. So you can see in the photo on the left, even when he’s using keyboard entry he had kind of a one-handed approach to typing into the keyboard and since he’s more
experienced at measuring, he really capitalized on the fact that if you don’t have to
completely reset your hands on the measuring tape each time, you can move through
the measurements faster. And so that was his main call out. That with the voice data entry he could truly use two hands at all times to do the measurements. So now that you’ve seen how
we utilized voice dictation with our warehouse
associates, I want to talk a little bit about how you can get started with the Web Speech API on your own. First, here’s a bit of JavaScript showing how to initialize
the Web Speech API. And the really cool thing about this API is that there’s no external library or anything that you need to pull in. This is available as part
of the JavaScript environment if you’re using the Chrome browser. So it’s really just as
simple as initializing webkitSpeechRecognition. And on that note, like I mentioned, this is available to use in Chrome with no external libraries,
but the flip side of that is that it’s only available in Chrome. So that’s why internal tools
make a really good candidate for using the Web Speech API because we can fully
control our users’ browsers. But it probably wouldn’t
be the best solution for something that’s customer-facing where you have to be able to support every browser under the sun. And then below the screenshot is just a little code snippet showing that we’re only initializing the speech recognition if it’s defined.
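Roughly, that check and the initialization look something like this. (This is a minimal sketch rather than our exact production code; webkitSpeechRecognition is the real Chrome constructor, but the specific options shown are just the standard ones.)

```javascript
// Feature check: the Web Speech API is only exposed in Chrome,
// as the prefixed webkitSpeechRecognition constructor.
var recognition = null;

if ('webkitSpeechRecognition' in window) {
  recognition = new webkitSpeechRecognition();
  recognition.continuous = true;      // keep listening across pauses
  recognition.interimResults = true;  // surface snippets as the user speaks
  recognition.lang = 'en-US';         // a language option is also available
} else {
  // Not Chrome: fall back to plain keyboard entry.
  console.warn('Web Speech API not available; using keyboard entry.');
}
```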
Then pretty much the only other thing that you have to do is start the recognition and record voice results. You can also see a bit of code here in the middle
code here in the middle where we have logic that
restarts the recognition every time it ends. And this is so that the associates could continuously measure. They could go through every
measurement on the form without having to click any button or actively turn on and
off the voice recognition. And the way that they
could move in and out of a voice dictation was
by flipping the microphone up in the headset, as opposed to clicking anything on the keyboard or
messing with the app at all. And then the last step is
just getting the results back from the API and
returning the transcript. So it’s a pretty straightforward setup.
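Putting those pieces together, the start, restart, and transcript handling looks something like this. (Again, a sketch under the same assumptions; handleTranscript is a placeholder for whatever fills in the form.)

```javascript
if (recognition) {
  // Start listening once the associate is ready.
  recognition.start();

  // Chrome ends recognition on its own after a while, so restart it
  // every time it ends; that way an associate can dictate a whole
  // form without ever clicking a button.
  recognition.onend = function () {
    recognition.start();
  };

  // Each result event carries the snippets captured so far; the final
  // transcript is the last element in the results list.
  recognition.onresult = function (event) {
    var latest = event.results[event.results.length - 1];
    if (latest.isFinal) {
      handleTranscript(latest[0].transcript.trim());
    }
  };
}
```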
Now I wanna go into some of the challenges that we ran into with voice dictation and some of the solutions that we implemented. The first challenge was
around contextual formatting. You may or may not have
noticed a couple slides ago that the results that come
back from the Web Speech API come back as an array. And this is because what the API is doing is it’s recording context along the way as the user is speaking. And it actually returns snippets of speech along with the final result, which is gonna be the
last element in the array. So let’s look at two basic examples here. On the top, you can
see the user will start to speak and they say “two”, they continue speaking and
they say “two and a half”, and then they finish
off their sentence with “two and a half ice creams.” The API determines okay,
this person is speaking in sentence form. There’s other context around, they’re talking about ice cream, we’re just gonna return the
words just as they said them and there’s no need for
additional formatting. But then in the second example, the user starts out the same. They start to say the word two, they then continue on and
they say “two and a half” and then they stop speaking, which is the case for our application since our users are
only recording numbers. And what the API is supposed to do here is infer, from the lack of context, that the user is speaking a number. And so it transforms the text
into the numeric version. And this is pretty awesome. I thought that this was one
of the most fascinating things about the API that you kind
of just get out of the box: this contextual formatting. But unfortunately, it doesn’t really work 100 percent of the time. I think we saw about a 50/50
success rate with this. What that meant for our users was that they were speaking aloud
2 1/2 and they were seeing the words two and a half
come onto the screen, as opposed to the number. And that’s really confusing when they’re using a measuring tape and they’re supposed to be
entering data in fractions and sometimes they get a fraction and sometimes they get words. But we were able to
solve this pretty easily. I think because of the fact that we have such structured data. We were only expecting our
users to dictate numbers, so we were able to account for that and be able to do the
contextual formatting ourselves. So what we did is we set up an object, which is a mapping between
the numbers as words and then the numeric counterparts. And then every time we
got a transcript back from the API, we iterated
through the object, we checked for matches in the keys. And then if we found a match, we just replaced it with the value, which is the numeric version.
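In code, that mapping and replacement look roughly like this. (Heavily abbreviated; the real object covers many more values than the handful shown here.)

```javascript
// Mapping between numbers as words and their numeric counterparts.
var NUMBER_WORDS = {
  'two and a half': '2 1/2',
  'five eighths': '5/8',
  'one quarter': '1/4',
  'eighteen': '18'
};

// Every time a transcript comes back, check each key for a match
// and swap in the numeric version so the form always shows fractions.
function formatTranscript(transcript) {
  var formatted = transcript.toLowerCase();
  Object.keys(NUMBER_WORDS).forEach(function (words) {
    if (formatted.indexOf(words) !== -1) {
      formatted = formatted.replace(words, NUMBER_WORDS[words]);
    }
  });
  return formatted;
}
```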
In addition to contextual formatting, another challenge we ran into was dictation errors from our users, which were a little bit harder to solve. Here are two examples that we ran into. In the first one, the user dictated thirty-five eighths, which came out as 35/8, pretty much just as expected. But what the user was actually trying to say was thirty and five eighths. They just didn’t enunciate, they didn’t actually
physically pronounce the and. And so this was more of a matter of training the users in
how the API would work and how it would record their results. So in this case it was literally recording exactly what they were saying, but the user didn’t say
what they meant to say. And the same thing with
the example on the bottom with four quarters would return 4/4 because that’s four quarters. But what they actually meant to say was four and one quarter. It just didn’t come out of their mouth the way that they were intending it to. So that’s a little challenging because both of the
results are valid fractions so it’s a little hard to catch these. Luckily, we do have a friend in validation in the application that makes sure users
reduce their fractions. Both of these examples
will catch an error, but that’s only because
they’re non-reduced fractions. So you could imagine
other dictation errors that are totally valid
fractions that it may not catch. So that one’s a little bit harder to plan for.
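For reference, the reduced-fraction check itself is the simple part; the idea is something along these lines. (A sketch only, not necessarily how our actual validation is written, and the helper names are made up.)

```javascript
// Greatest common divisor, used to tell whether a fraction is reduced.
function gcd(a, b) {
  return b === 0 ? a : gcd(b, a % b);
}

// "30 5/8" passes; "35/8" and "4/4" both get flagged, because their
// fractional parts are not proper, reduced fractions.
function hasValidFraction(measurement) {
  var match = measurement.match(/(\d+)\/(\d+)\s*$/);
  if (!match) { return true; } // whole numbers pass through
  var numerator = parseInt(match[1], 10);
  var denominator = parseInt(match[2], 10);
  return numerator < denominator && gcd(numerator, denominator) === 1;
}
```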
And then the last challenge that we ran into with voice dictation was around reliability. If you go to the MDN documentation for the Web Speech API, you’ll see a notice at the top that this is an experimental technology, along with a bunch of caveats
about not always being backwards compatible, could
have breaking changes, et cetera, et cetera. And after a few weeks of
using the voice dictation in production, ongoing on a daily basis, we noticed some unexpected behavior. And the main unexpected
behavior we noticed was that the users would get through about half of a page of measurements and the recording would just completely stop working altogether. And this proved pretty
challenging to debug because there wasn’t really a difference between those scenarios and the scenarios when it was working. And by that I mean there were no errors in the JavaScript console,
nothing really indicating that something was wrong. So it was pretty hard to debug and test. We initially turned to hardware
as the potential problem. We thought oh, maybe we made
the wrong choice in headsets. So we tested a few
different headset options, even just regular
earbuds and didn’t really find anything there. We also tested different laptops, Mac versus PC, to see if it could potentially have to do with the laptop or the internal mic in the laptop. And then we made sure our users had up-to-date versions of
Chrome, which they all did. So it’s still a little bit inconclusive and something that
we’re digging into more, but we’re not quite sure what’s causing the reliability issues right now. But the good thing is, when working with an experimental technology, we knew we had to have a
fallback plan from the outset. So we never blocked the users from just entering the data into the
form that you saw in the demo. So that’s what they’re using
for the most part right now, as we work through some of
these reliability issues. And if you remember,
one of the main reasons that we wanted to
implement voice dictation was for the user’s comfort,
in their positioning, in their body language. And so what we ended up
doing was purchasing monitors that had large screens that we could stand up in the warehouse so that the associates could clearly see the form of measurements
in front of them. They didn’t have to hunch over and it was a much better
experience for them. So I wanna call out one other challenge that we’ve run into that’s not
related to voice dictation, but has more to do with users
entering data in the form. You can imagine, if a user’s
typing into the keyboard and they intend to type 10,
they might accidentally slip and type an extra zero and
then we have invalid data. Essentially we have a measurement
of 100, instead of 10. And that’s difficult to catch
because 100 is a valid number. It’s not any more invalid than 10. But what we had to do was
implement suggested ranges for each of our measurements. The way that we did that is for every single point of measure, and again there’s hundreds of these and they differ by type of
item that we’re measuring. We added a minimum and
maximum value to the database that we were able to
use and implement some front end warnings if the
measurements were out of range. Here we show the orange warning, but we’re not ever blocking the users from submitting the form
because it’s definitely possible that a measurement could be out of range. We just want to catch the
extreme outliers like 100 for across the shoulder, which would never happen.
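The check itself is just a lookup and a comparison; something like this. (The numbers and helper names here are made up for illustration; the real ranges come from the database and vary by item type.)

```javascript
// Suggested range for each point of measure (illustrative values only).
var SUGGESTED_RANGES = {
  acrossShoulder: { min: 14, max: 24 },
  sleeveLength:   { min: 20, max: 40 }
};

// Warn, but never block: an out-of-range measurement might still be real.
function checkRange(pointOfMeasure, value) {
  var range = SUGGESTED_RANGES[pointOfMeasure];
  if (!range || (value >= range.min && value <= range.max)) { return; }
  // showRangeWarning is a placeholder for the orange front-end warning.
  showRangeWarning(pointOfMeasure, 'Expected between ' + range.min +
    ' and ' + range.max + ' inches; please double-check this one.');
}
```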
So, is voice the right solution for you? A couple thoughts on that, a few things that I would consider if you wanna look into voice dictation as a potential solution for your users. The first is browser control. For the Web Speech API in particular, it’s only supported in Chrome right now. Because we were building
this as an internal tool, that allowed us to experiment
with it more easily than if you were potentially building a customer-facing tool. I think the fact that
we had structured data was also really helpful, particularly in the contextual formatting. I’m not really sure how we
would solve that problem if the API was returning unexpected data and we were just allowing
any words to come through. So the fact that we were
only expecting numbers, we were only allowed to input numbers really helped us out there in order to quickly solve the problem. And then I think it’s
important since this is a pretty experimental technology that you have a flexible user base and a fallback plan. So building that trust with your users, making sure they’re willing
to experiment with you, and making sure they
have the understanding that it might not be perfect, especially for the first few iterations and communicating that
there’s always a fallback plan and making sure that everyone’s trained on the fallback plan is really key. So when I was thinking
about my key takeaways to make up this slide, I sort
of came to the realization that this talk has served
as a bit of a post mortem on this project for me. There’s a lot to learn here,
but when I think about it there’s a couple things that I’d like you to take away from this story. And those are around UX and
engineering collaboration. So the first one is that the UX and engineering collaboration that we had allowed us to empathetically
build expert-use software. And by that I mean, usually
I’m working on software that’s used by people who are sitting at a desk typing and on a keyboard. This was the first time I
had thought about things like body language and
the user’s comfort level while they were using the app. And that’s something
that I hope to bring to a lot more of my applications
and products down the line. And then the collaboration also allowed us to quickly prototype early on. And we were able to iterate
and quickly solve problems. So the couple prototypes that
I showed at the beginning that we tested with our users, those were built 100 percent in code as a true collaboration
between UX and engineering. The reason that was beneficial
is because we could get out a realistic prototype to our user, test it quickly, and make
iterations directly in the code. And some of that code ended
up in our production version. So I think it allowed us to
move through the process faster. It also allowed us to look at the problem from both the engineering
and the user experience lens. So I wanna thank a few people for their collaboration on this project. First and foremost Sarah Poon, my coworker on the UX design team who was with me every step of
the way during this project. And everyone else on this
list was also instrumental in getting it off the ground. And with that, thank you guys. (applause) Why did we choose speech
versus something else? Well, were there any other options? I mean the only options
we were really considering were speech versus the
traditional keyboard entry. We haven’t really looked
into any technology like a smart measuring tape or
something like that, but … It sounds like a cool idea, yeah (laughs). Are there limits to how long
we can be recording for? So our impression was no, but the sort of intermittent recording that we’re getting might imply otherwise, if that makes sense. It’s really hard
to duplicate the problem if you’re just testing at your laptop or even in the warehouse. And so we have noticed that
it becomes more of a problem with continuous use, like hours at a time. So you’re right, there
could be something there. Although every time you submit
the form it stops recording and then starts back up again when they go to a new shirt to measure. It’s a good point, it’s something
It’s a good point, it’s something to look into, definitely. So, the question was, was there any reason for using the Web Speech API versus
some other speech … Have we played around with any other API? So like I said this is still the very early stages of the project and we went with the Web Speech API pretty much because it was
available quickly and easily for our initial prototyping and we didn’t run into the issues early on. So we were like, “Let’s not fix what’s not broken.” But now we’re at the point where we probably have to
evaluate some other options, but that’s a good question. Oh yes, good question. So the question was around
if the warehouse associates saw something that they didn’t
expect or maybe hit an error and they wanted to go back,
did we implement anything that helps them do that? And yes, I didn’t
mention that in the talk, but we did implement some
sort of keyword triggers
cursor around on the page. The one that was in the demo was the save, which submits the form. And then we did implement a
back and a forward as well, so that if they did hit
a mistake they didn’t have to touch the keyboard
to go back and change it.
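Those triggers were just a keyword check on each transcript before treating it as a measurement; roughly like this. (The field-navigation helpers and enterMeasurement are placeholders for the real handlers.)

```javascript
// A few spoken keywords drive the form so associates never have to
// touch the keyboard; anything else is treated as a measurement.
function handleTranscript(transcript) {
  switch (transcript.toLowerCase()) {
    case 'save':
      document.querySelector('form').submit(); // submits the measurement form
      break;
    case 'back':
      moveToPreviousField();
      break;
    case 'forward':
      moveToNextField();
      break;
    default:
      enterMeasurement(formatTranscript(transcript));
  }
}
```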
Did I say we have a warehouse in France? The question was whether we implemented localization with different languages. We only have warehouses
in the US right now, so that wasn’t an issue. But, you might have seen
on the initialization slide you can set a language option. We haven’t tried it yet, but I’m assuming it would work. Oh, yes. Am I able to share the model of headset it was that we were using? It’s called Jabra, I think
it’s a pretty common company for call centers, anyway
that’s how we found it. How many garments per day
does the company measure? I’m not quite sure and
we’re not fully ramped up since this is such an early project. We started with men’s clothing because it was a little simpler to wrap our minds around
how we would measure it because men’s clothing is primarily based on a box or a square (laughs), not to make any implications,
but it’s a little bit tougher to measure women’s clothing
because of the extreme silhouette variation and different styles of necklines and sleeves,
et cetera, et cetera. So we rolled this out with men’s which is why most of the
examples were for men’s and then a small subset
of women’s blouses. So right now we have a
handful of associates measuring for a few hours a day, but I’m not sure quite how many garments. Anyone else? Alright, thank you guys. (applause)