Learning the names of Michigan-based animals from their descriptions

The Michigan Open Data website contains information about different types of animals found in the region (turtles, salamanders, birds, lizards, snakes, frogs), along with descriptions of their characteristics. This lets us learn from the descriptions and then retrieve the name of an animal when a description or a keyword is known.

import re

import numpy as np
import pandas as pd
from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import LabelEncoder

store_data_in = 'all_michigan_animals.csv'
files = ['Turtles', 'Salamanders', 'Birds', 'Lizards', 'Snakes', 'Native_Frogs_And_Toads']

with open(store_data_in, 'w', newline='') as fw:
    for i, fname in enumerate(files):
        df = pd.read_csv('Michigan_%s.csv' % (fname))
        # Copy so that adding a column does not modify a view of df
        res = df[['commonname', 'latinname', 'narrative']].copy()
        res['name'] = res['commonname'] + ' (' + res['latinname'] + ')'
        del res['commonname']
        del res['latinname']
        # Write the header only once so it is not repeated as a data row
        res.to_csv(fw, index=False, header=(i == 0))

This produces a single file containing the descriptions and names of all the animals. The rest of the analysis uses only this file.

df = pd.read_csv(store_data_in)

# Encode each narrative and each name as an integer label
X = LabelEncoder().fit_transform(df['narrative'])
y = LabelEncoder().fit_transform(df['name'])

# Reshape X since we have a single feature; y stays one-dimensional
X = X.reshape(-1, 1)

gnb = GaussianNB()
gnb.fit(X, y)

keywords = ['bird', 'eat small birds', 'black bird', 'frog', 'eel like salamander',
            'loud', 'near water', 'egg', 'threatened', 'endangered', 'hunt',
            'bury from 10 to 96 round eggs in a sunny spot', '15.2 to 27.4 cm', 'pollut']

for keyword in keywords:
    # Rows whose narrative matches the keyword (treated as a regular expression)
    counts = df['narrative'].str.count(keyword, re.IGNORECASE)
    list_idx = counts[counts != 0].index.values.tolist()
    if not list_idx:
        continue  # no narrative matched, so there is nothing to predict

    predicted = gnb.predict(X[list_idx])

    print('\nKeyword: "%s"' % (keyword))
    print('\n'.join(df['name'][np.where(np.isin(y, predicted))[0]].tolist()))

"""
Keyword: "bird"
Sandhill Crane (Grus canadensis)
Great Blue Heron (Ardea herodias)
Upland Sandpiper (Bartramia longicauda)
King Rail (Rallus elegans)
Common Terns (Sterna hirundo)
Trumpeter Swan (Cygnus buccinator)
Common Loon (Gavia immer)
Osprey (Pandion haliaetus)
Broad-winged Hawk (Buteo platypterus)
Bald Eagle (Haliaeetus leucocephalus)
Peregrine Falcon (Falco peregrinus)
Mourning Dove (Zenaida macroura)
American Goldfinch (Carduelis tristis)
Brown-headed Cowbird (Molothrus ater)
Common Crow (Corvus brachyrhynchos)
Common Raven (Corvis corax)
Red crossbill (Loxia curvirostra)
Wild Turkey (Meleagris gallopavo)
Loggerhead Shrike (Lanius ludovicianus)
Pileated Woodpecker (Dryocuopus pileatus)
Scarlet Tanager (Piranga olivacea)
Kirtland's Warbler (Dendroica kirtlandii)
Blue Racer (Coluber constrictor foxi)
Fox Snakeand (Elaphe gloydi)
Bullfrog (Rana catesbeiana)

Keyword: "eat small birds"
Bullfrog (Rana catesbeiana)

Keyword: "black bird"
Common Raven (Corvis corax)

Keyword: "frog"
Sandhill Crane (Grus canadensis)
Great Blue Heron (Ardea herodias)
American Bittern (Botanus lentiginosus)
Common Loon (Gavia immer)
Northern Water Snake (Nerodia sipedon)
Eastern Garter Snake (Thamnophis sirtalis)
Northern Ribbon Snake (Thamnophis sauritus septentrionalis)
Blue Racer (Coluber constrictor foxi)
Fox Snakeand (Elaphe gloydi)
Copper-bellied Water Snake (Nerodia erythrogaster neglecta)
Green Frog (Rana clamitans)
Mink Frog (Rana septentrionalis)
Western Chorus Frog (Pseudacris triseriata triseriata)
Gray Treefrogand (H. chrysoscelis)
Bullfrog (Rana catesbeiana)
Wood Frog (Rana sylvatica)
Northern Leopard Frog (Rana pipiens)
Pickerel Frog (Rana palustris)
Northern Spring Peeper (Pseudacris crucifer)
Blanchard's Cricket Frog (Acris crepitans blanchardi)

Keyword: "eel like salamander"
Western Lesser Siren (Siren intermedia nettingi)

Keyword: "loud"
Common Raven (Corvis corax)
Kirtland's Warbler (Dendroica kirtlandii)
Eastern Hog-nosed Snake (Heterodon platirhinos)
Green Frog (Rana clamitans)
Northern Spring Peeper (Pseudacris crucifer)

Keyword: "near water"
Spiny Soft-shell Turtle (Apalone spinifera spinifera)
Common Map Turtle (Graptemys geographica)
Wood Turtle (Glyptemys insculpta)
Black Rat Snake (Elaphe obsoleta obsoleta)

Keyword: "egg"
Eastern Box Turtle (Terrapene carolina carolina)
Spiny Soft-shell Turtle (Apalone spinifera spinifera)
Common Snapping Turtle (Chelydra serpentina)
Common Musk Turtle (Sternotherus odoratus)
Blanding's Turtle (Emys blandingii)
Painted Turtle (Chrysemys picta)
Red-eared Slider (Trachemys scripta elegans)
Common Map Turtle (Graptemys geographica)
Wood Turtle (Glyptemys insculpta)
Spotted Turtle (Clemmys guttata)
Western Lesser Siren (Siren intermedia nettingi)
Red-backed Salamander (Plethodon cinereus)
Mudpuppy (Necturus maculosus)
Four-toed Salamander (Hemidactylium scutatum)
Spotted Salamander (Ambystoma maculatum)
Eastern Newt (Notophthalmus viridescens)
Marbled Salamander (Ambystoma opacum)
Blue-spotted Salamander (Ambystoma laterale)
Piping Plover (Charadrius melodus)
Sandhill Crane (Grus canadensis)
King Rail (Rallus elegans)
Common Terns (Sterna hirundo)
Trumpeter Swan (Cygnus buccinator)
Common Loon (Gavia immer)
Short-eared owl (Asio flammeus)
Bald Eagle (Haliaeetus leucocephalus)
Peregrine Falcon (Falco peregrinus)
Mourning Dove (Zenaida macroura)
Brown-headed Cowbird (Molothrus ater)
Common Crow (Corvus brachyrhynchos)
Common Raven (Corvis corax)
Spruce Grouse (Canachites canadensis)
Black-backed Woodpecker (Picoides arcticus)
Scarlet Tanager (Piranga olivacea)
Kirtland's Warbler (Dendroica kirtlandii)
Smooth Green Snake (Liochlorophis vernalis)
Eastern Milk Snake (Lampropeltis triangulum triangulum)
Ring-necked Snake (Diadophis punctatus edwardii)
Eastern Hog-nosed Snake (Heterodon platirhinos)
Blue Racer (Coluber constrictor foxi)
Black Rat Snake (Elaphe obsoleta obsoleta)
Fox Snakeand (Elaphe gloydi)
Copper-bellied Water Snake (Nerodia erythrogaster neglecta)
Fowler's Toad (Bufo fowleri)
Green Frog (Rana clamitans)
Mink Frog (Rana septentrionalis)
Western Chorus Frog (Pseudacris triseriata triseriata)
Gray Treefrogand (H. chrysoscelis)
Eastern American Toad (Bufo americanus)
Bullfrog (Rana catesbeiana)
Wood Frog (Rana sylvatica)
Northern Leopard Frog (Rana pipiens)
Pickerel Frog (Rana palustris)
Northern Spring Peeper (Pseudacris crucifer)
Blanchard's Cricket Frog (Acris crepitans blanchardi)

Keyword: "threatened"
Spiny Soft-shell Turtle (Apalone spinifera spinifera)
Common Musk Turtle (Sternotherus odoratus)
Blanding's Turtle (Emys blandingii)
Spotted Turtle (Clemmys guttata)
Marbled Salamander (Ambystoma opacum)
Common Terns (Sterna hirundo)
Common Loon (Gavia immer)
Eastern Garter Snake (Thamnophis sirtalis)
Eastern Hog-nosed Snake (Heterodon platirhinos)
Fox Snakeand (Elaphe gloydi)
Kirtland's Snake (Clonophis kirtlandii)
Eastern Massasauga Rattlesnake (Sistrurus catenatus catenatus)
Copper-bellied Water Snake (Nerodia erythrogaster neglecta)

Keyword: "endangered"
Small-mouthed Salamander (Ambystoma texanum)
Piping Plover (Charadrius melodus)
Upland Sandpiper (Bartramia longicauda)
King Rail (Rallus elegans)
Common Terns (Sterna hirundo)
Bald Eagle (Haliaeetus leucocephalus)
Peregrine Falcon (Falco peregrinus)
Loggerhead Shrike (Lanius ludovicianus)
Kirtland's Warbler (Dendroica kirtlandii)
Northern Water Snake (Nerodia sipedon)
Kirtland's Snake (Clonophis kirtlandii)
Eastern Massasauga Rattlesnake (Sistrurus catenatus catenatus)
Copper-bellied Water Snake (Nerodia erythrogaster neglecta)

Keyword: "hunt"
Wood Turtle (Glyptemys insculpta)
Trumpeter Swan (Cygnus buccinator)
Short-eared owl (Asio flammeus)
Peregrine Falcon (Falco peregrinus)
Mourning Dove (Zenaida macroura)
Black-backed Woodpecker (Picoides arcticus)
Wild Turkey (Meleagris gallopavo)
Loggerhead Shrike (Lanius ludovicianus)
Five-lined Skink (Eumeces fasciatus)
Eastern Massasauga Rattlesnake (Sistrurus catenatus catenatus)
Copper-bellied Water Snake (Nerodia erythrogaster neglecta)

Keyword: "bury from 10 to 96 round eggs in a sunny spot"
Common Snapping Turtle (Chelydra serpentina)

Keyword: "15.2 to 27.4 cm"
Blanding's Turtle (Emys blandingii)

Keyword: "pollut"
Spiny Soft-shell Turtle (Apalone spinifera spinifera)
Common Snapping Turtle (Chelydra serpentina)
Wood Turtle (Glyptemys insculpta)
Eastern Tiger Salamander (Ambystoma tigrinum tigrinum)
Mudpuppy (Necturus maculosus)
Eastern Newt (Notophthalmus viridescens)
Pickerel Frog (Rana palustris)
"""

You may have noticed that "pollut" is not a word. It is a stem: searching for it matches every word that contains it (polluted, pollution), which makes the search more general. We could also pass an entire description as the keyword, but we usually don't know it upfront.
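To make the effect of stem matching concrete, here is a minimal sketch on a toy table. The three rows and the helper names_matching are made up for illustration and are not part of the Michigan data; the helper also escapes the keyword so that it is matched literally, which is an assumption that differs from the regular-expression matching used in the code above.

```python
import re

import pandas as pd

# Toy narratives standing in for the real 'narrative' column (hypothetical data)
toy = pd.DataFrame({
    "name": ["Animal A", "Animal B", "Animal C"],
    "narrative": [
        "Sensitive to water pollution.",
        "Found in polluted streams.",
        "Nests in tall trees.",
    ],
})


def names_matching(frame, keyword):
    """Return names whose narrative contains the keyword, ignoring case."""
    # re.escape makes characters such as '.' match literally
    mask = frame["narrative"].str.contains(re.escape(keyword), case=False, regex=True)
    return frame.loc[mask, "name"].tolist()


stem_hits = names_matching(toy, "pollut")     # matches "pollution" and "polluted"
word_hits = names_matching(toy, "pollution")  # matches only the full word
print(stem_hits)
print(word_hits)
```

The stem finds both toy animals whose narratives mention pollution in some form, while the full word finds only one of them.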

This approach isn't perfect, since in most cases we want to query the data in natural language. For example: "give me all species that are 10-20 cm long, have eggs 4-5 cm in size, and are endangered". This query places several conditions on the description, so to answer it quickly the machine would need to understand natural language as well as a human does. We already need a numeric representation of the descriptions in order to learn from them, and it is not clear how such conditions could be mapped to numbers.
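As a rough illustration of why this is hard, the sketch below handles one narrow piece of such a query: pulling "X to Y cm" ranges out of a description with a regular expression and combining that with a status keyword. The pattern, the function names and the sample sentence are all assumptions made for this example, not something taken from the dataset or the code above.

```python
import re

# Matches ranges such as "15.2 to 27.4 cm" or "10-20 cm" (assumed phrasing)
RANGE_CM = re.compile(
    r"(\d+(?:\.\d+)?)\s*(?:to|-)\s*(\d+(?:\.\d+)?)\s*cm", re.IGNORECASE
)


def length_ranges_cm(text):
    """Return every (low, high) centimetre range mentioned in the text."""
    return [(float(lo), float(hi)) for lo, hi in RANGE_CM.findall(text)]


def overlaps(rng, lo, hi):
    """True if the range rng shares at least one value with [lo, hi]."""
    return rng[0] <= hi and rng[1] >= lo


def matches_query(text, lo, hi, status):
    """True if the text mentions an overlapping cm range and the status word."""
    has_length = any(overlaps(r, lo, hi) for r in length_ranges_cm(text))
    has_status = status.lower() in text.lower()
    return has_length and has_status


# Hypothetical description, written for this example
sample = "Adults are 15.2 to 27.4 cm long. The species is endangered in Michigan."
print(length_ranges_cm(sample))
print(matches_query(sample, 10, 20, "endangered"))
```

Even this small sketch cannot tell whether a range describes the body or the eggs, and the status check would also accept a sentence such as "not endangered", which is exactly the kind of understanding the query requires.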

Still, the example shows how, in a few lines of code, we can find the names of animals from keywords or phrases that appear in their descriptions, without having to query a separate database.