Google and IBM still trying desperately to move cloud market share needle

When it comes to the cloud market, there are few known knowns. For instance, we know that AWS is the market leader with around 32 percent of market share. We know Microsoft is far back in second place with around 14 percent, the only other company in double digits. We also know that IBM and Google are wallowing in third or fourth place, depending on whose numbers you look at, stuck in single digits. The market keeps expanding, but these two major companies never seem to get a much bigger piece of the pie.

Neither company is satisfied with that, of course. Google was unhappy enough that it moved on from Diane Greene at the end of last year, bringing in Oracle veteran Thomas Kurian to lead the division out of the doldrums. Meanwhile, IBM made an even bigger splash, plucking Red Hat from the market for $34 billion in October.

This week, the two companies made some more noise, letting the cloud market know that they are not ceding the market to anyone. For IBM, which is holding its big IBM Think conference this week in San Francisco, it involved opening up Watson to competitor clouds. For a company like IBM, this was a huge move, akin to when Microsoft started building apps for iOS. It was an acknowledgement that working across platforms matters, and that if you want to gain market share, you had better start thinking outside the box.

While becoming cross-platform compatible isn’t exactly a radical notion in general, it most certainly is for a company like IBM, which if it had its druthers and a bit more market share, would probably have been content to maintain the status quo. But if the majority of your customers are pursuing a multi-cloud strategy, it might be a good idea for you to jump on the bandwagon — and that’s precisely what IBM has done by opening up access to Watson across clouds in this fashion.

Clearly, buying Red Hat was a hybrid cloud play, and if IBM is serious about that approach (and for $34 billion, it had better be), it has to walk the walk, not just talk the talk. As IBM Watson CTO and chief architect Ruchir Puri told my colleague Frederic Lardinois about the move, “It’s in these hybrid environments, they’ve got multiple cloud implementations, they have data in their private cloud as well. They have been struggling because the providers of AI have been trying to lock them into a particular implementation that is not suitable to this hybrid cloud environment.” This plays right into the Red Hat strategy, and I’m betting you’ll see more of this approach in other parts of the product line from IBM this year. (Google also acknowledged this when it announced a hybrid strategy of its own last year.)

Meanwhile, Thomas Kurian had his coming-out party at the Goldman Sachs Technology and Internet Conference in San Francisco earlier today. Bloomberg reports that he announced a plan to increase the number of salespeople and train them to understand specific verticals, ripping a page straight from the playbook of his former employer, Oracle.

He suggested that his company would be more aggressive in pursuing traditional enterprise customers, although I’m sure his predecessor, Diane Greene, wasn’t exactly sitting around counting on inbound marketing interest to grow sales. In fact, rumor had it that she wanted to pursue government contracts much more aggressively than the company was willing to do. Now it’s up to Kurian to grow sales. Of course, given that Google doesn’t report cloud revenue, it’s hard to know what growth would look like, but perhaps if it has more success it will be more forthcoming.

As Bloomberg’s Shira Ovide tweeted today, it’s one thing to turn to the tried and true enterprise playbook, but that doesn’t mean that executing on that approach is going to be simple, or that Google will be successful in the end.

These two companies obviously want desperately to alter their cloud fortunes, which have been fairly dismal to this point. The moves announced today are clearly part of a broader strategy to move the market share needle, but whether they can move it, or whether the market positions hardened long ago, remains to be seen.

Mixtape Podcast: Oracle’s alleged $400M issue with underrepresented groups

Screen time for kids, corporations allegedly not paying people from underrepresented groups and IBM offers some hope for the future of facial recognition technology: These are the topics that Megan Rose Dickey and I dive into on this week’s episode of Mixtape.

According to research by psychologists from the University of Calgary, spending too much time in front of screens can stunt the development of toddlers. The study found that kids 2-5 years old who engage in more screen time received worse scores in developmental screening tests. We talk a bit about this, then wax nostalgic about the “screen time” of yore.

We then turn to a filing against Oracle by the U.S. Department of Labor’s Office of Federal Contract Compliance Programs that states the enterprise company allegedly withheld upwards of $400 million from employees from underrepresented minority groups. The company initially declined to comment, but then thought better of it and returned the very next day with its thoughts on the matter.

And finally, IBM is trying to make facial recognition technology a thing that doesn’t unfairly target people of color. Technology! The positive news comes a week after Amazon shareholders demanded that the company stop selling Rekognition, its very own facial recognition tech that it sells to law enforcement and government agencies.

Click play above to listen to this week’s episode. And if you haven’t subscribed yet, what are you waiting for? Find us on Apple Podcasts, Stitcher, Overcast, CastBox or whatever other podcast platform you can find.

IBM builds a more diverse million-face data set to help reduce bias in AI

Encoding biases into machine learning models, and in general into the constructs we refer to as AI, is nearly inescapable — but we can sure do better than we have in past years. IBM is hoping that a new database of a million faces more reflective of those in the real world will help.

Facial recognition is being relied on for everything from unlocking your phone to your front door, and is being used to estimate your mood or likelihood to commit criminal acts — and we may as well admit many of these applications are bunk. But even the good ones often fail simple tests like working adequately with people of certain skin tones or ages.

This is a multi-layered problem, and of course a major part of it is that many developers and creators of these systems fail to think about, let alone audit for, a failure of representation in their data.

That’s something everyone needs to work harder at, but the actual data matters, as well. How can you train a computer vision algorithm to work well with all people if there’s no set of data that has all people in it?

Every set will necessarily be limited, but building one that has enough of everyone in it that no one is effectively systematically excluded is a worthwhile goal. And with its new million-image Diversity in Faces (DiF) set, that’s what IBM has attempted to create. As the paper introducing the set reads:

For face recognition to perform as desired – to be both accurate and fair – training data must provide sufficient balance and coverage. The training data sets should be large enough and diverse enough to learn the many ways in which faces inherently differ. The images must reflect the diversity of features in faces we see in the world.

The faces are sourced from a huge 100-million-image data set (Flickr Creative Commons), through which another machine learning system prowled and found as many faces as it could. These were then isolated and cropped, and that’s when the real work started.

These sets are meant to be ingested by other machine learning algorithms, so they need to be both diverse and accurately labeled. So the DiF set has a million faces, and each one is accompanied by metadata describing things like the distance between the eyes, the size of the forehead and all that. All these measurements together create the “faceprint” that a system would use to, for example, match one image to another of the same person.
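The matching step described above can be sketched as a simple distance comparison between two measurement vectors. This is a generic illustration of the idea, not IBM’s actual pipeline; the measurement values, vector layout and threshold here are all assumptions.

```python
import math

def faceprint_distance(a, b):
    """Euclidean distance between two fixed-order measurement vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def same_person(a, b, threshold=0.1):
    """Treat two faceprints as a match when their distance is small.
    The threshold is an assumed tuning parameter, not a DiF value."""
    return faceprint_distance(a, b) < threshold

# Hypothetical normalized measurements (e.g. inter-eye distance,
# forehead size) extracted from two images of the same person.
photo_1 = [0.42, 0.31, 0.18]
photo_2 = [0.43, 0.30, 0.18]
print(same_person(photo_1, photo_2))  # a close pair should match
```

Real systems use far larger feature vectors and learned metrics, but the principle — compare measurement vectors, match below a threshold — is the same.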

But any given set of those measurements may or may not be good for identifying people, or accurate for a certain ethnic group, or what have you. So the IBM team put together a revised set that not only includes simple things like distances between features, but also how those measures relate to one another: for example, the ratio of the area above the eyes to the area below the nose. Skin color, as well as contrast and types of coloration, are also included.
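Deriving relational features from raw measurements, as described above, might look something like the following. The measurement names and values are hypothetical illustrations, not DiF’s published coding scheme.

```python
# Turn raw per-face measurements into relational features (ratios),
# capturing how measures relate to one another rather than their
# absolute sizes. Field names here are assumptions for illustration.
def relational_features(m):
    return {
        "eye_to_nose_area_ratio": m["area_above_eyes"] / m["area_below_nose"],
        "eye_spacing_to_face_width": m["inter_eye_distance"] / m["face_width"],
    }

raw = {
    "area_above_eyes": 1200.0,   # hypothetical units
    "area_below_nose": 800.0,
    "inter_eye_distance": 64.0,
    "face_width": 160.0,
}
print(relational_features(raw))
# {'eye_to_nose_area_ratio': 1.5, 'eye_spacing_to_face_width': 0.4}
```

Ratios like these are scale-invariant, which is one reason relational measures can be more robust across images taken at different distances and resolutions.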

In a move that is long overdue, gender in the set is detected and encoded according to a spectrum, not a binary. As gender is itself nonbinary, it makes sense to represent it as any fraction between 0 and 1. So what you really have is a metric describing how individuals present on a scale from feminine to masculine.

Age is also automatically estimated, but for these last two values a sort of “reality check” is also included in the form of a “subjective annotation” field, in which people were asked to label faces male or female and guess at age. Bias may be re-encoded here, as sourcing labels from humans tends to introduce it. All this makes for a considerably broader set of measurements than any other publicly available facial recognition training set.

You may wonder why race or ethnicity isn’t a category — IBM’s John R. Smith, who led the creation of the set, explained in an email to me:

Ethnicity and race are often used interchangeably, although the first is more related to culture and the second is related to biology. The boundaries within either are not distinct, and labeling is highly subjective and noisy as found in prior work. Instead, we chose to focus on coding schemes that could be determined reliably and have some kind of continuous scale that could feed diversity analysis. We may return to some of these subjective categories.

Even with a million faces, however, there’s no guarantee that this set is adequately representative — that enough of all groups and sub-sets are present to prevent bias. In fact, Smith seems sure it isn’t, which is really the only logical position:

We could not ensure this in this first version of the data set. But, it is the goal. First, we need to figure out the dimensions for diversity. We do that by starting with data and coding schemes as in this release. Then we iterate. Hopefully, we bring along the larger research community and industry in the process.

In other words, it’s a work in progress. But so is all of science, and despite the frequent missteps and broken promises, facial recognition is inarguably a technology with which we all will be engaging in the future, whether we like it or not.

Any AI system is only as good as the data on which it’s built, so improvements to the data will trickle down for a long time. Like any other set, DiF will likely go through iterations addressing shortcomings, adding more content and integrating suggestions or requests from researchers using it. You can request access here.