Pandas and DataFrames

In this lesson we will be exploring data analysis using Pandas.

  • College Board talks about ideas like
    • Tools. "the ability to process data depends on users capabilities and their tools"
    • Combining Data. "combine county data sets"
    • Status on Data"determining the artist with the greatest attendance during a particular month"
    • Data poses challenge. "the need to clean data", "incomplete data"
  • From Pandas Overview -- When working with tabular data, such as data stored in spreadsheets or databases, pandas is the right tool for you. pandas will help you to explore, clean, and process your data. In pandas, a data table is called a DataFrame.

DataFrame

'''Pandas is used to gather data sets through its DataFrames implementation'''
import pandas as pd

Cleaning Data

When looking at a data set, check to see what data needs to be cleaned. Examples include:

  • Missing Data Points
  • Invalid Data
  • Inaccurate Data

Run the following code to see what needs to be cleaned

import pandas as pd
# reads the JSON file and converts it to a Pandas DataFrame
df = pd.read_json('files/grade.json')

print(df)
# What part of the data set needs to be cleaned?
# a few values in the school year, "Junior" and "9th Grade". one student ID also isn't valid, it's value is "nil". 
# From PBL learning, what is a good time to clean data?  Hint, remember Garbage in, Garbage out?
# A good time to clean data is when getting an input. If you never let garbage get into the database then you never have to worry about garbage in the database.
   Student ID Year in School   GPA
0         123             12  3.57
1         246             10  4.00
2         578             12  2.78
3         469             11  3.45
4         324         Junior  4.75
5         313             20  3.33
6         145             12  2.95
7         167             10  3.90
8         235      9th Grade  3.15
9         nil              9  2.80
10        469             11  3.45
11        456             10  2.75

Extracting Info

Take a look at some features that the Pandas library has that extracts info from the dataset

DataFrame Extract Column

print(df[['GPA']])

print()

#try two columns and remove the index from print statement
print(df[['Student ID','GPA']].to_string(index=True))
print(df[['Student ID','GPA']])
     GPA
0   3.57
1   4.00
2   2.78
3   3.45
4   4.75
5   3.33
6   2.95
7   3.90
8   3.15
9   2.80
10  3.45
11  2.75

   Student ID   GPA
0         123  3.57
1         246  4.00
2         578  2.78
3         469  3.45
4         324  4.75
5         313  3.33
6         145  2.95
7         167  3.90
8         235  3.15
9         nil  2.80
10        469  3.45
11        456  2.75
   Student ID   GPA
0         123  3.57
1         246  4.00
2         578  2.78
3         469  3.45
4         324  4.75
5         313  3.33
6         145  2.95
7         167  3.90
8         235  3.15
9         nil  2.80
10        469  3.45
11        456  2.75

DataFrame Sort

print(df.sort_values(by=['GPA']))

print()

#sort the values in reverse order
print(df.sort_values(by=['GPA'], ascending=False))
   Student ID Year in School   GPA
11        456             10  2.75
2         578             12  2.78
9         nil              9  2.80
6         145             12  2.95
8         235      9th Grade  3.15
5         313             20  3.33
3         469             11  3.45
10        469             11  3.45
0         123             12  3.57
7         167             10  3.90
1         246             10  4.00
4         324         Junior  4.75

   Student ID Year in School   GPA
4         324         Junior  4.75
1         246             10  4.00
7         167             10  3.90
0         123             12  3.57
3         469             11  3.45
10        469             11  3.45
5         313             20  3.33
8         235      9th Grade  3.15
6         145             12  2.95
9         nil              9  2.80
2         578             12  2.78
11        456             10  2.75

DataFrame Selection or Filter

print(df.GPA)
print(df[["GPA"]])
print(df[df.GPA > 3.00])
0     3.57
1     4.00
2     2.78
3     3.45
4     4.75
5     3.33
6     2.95
7     3.90
8     3.15
9     2.80
10    3.45
11    2.75
Name: GPA, dtype: float64
     GPA
0   3.57
1   4.00
2   2.78
3   3.45
4   4.75
5   3.33
6   2.95
7   3.90
8   3.15
9   2.80
10  3.45
11  2.75
   Student ID Year in School   GPA
0         123             12  3.57
1         246             10  4.00
3         469             11  3.45
4         324         Junior  4.75
5         313             20  3.33
7         167             10  3.90
8         235      9th Grade  3.15
10        469             11  3.45

DataFrame Selection Max and Min

print(df[df.GPA == df.GPA.max()])
print()
print(df[df.GPA == df.GPA.min()])
print()
  Student ID Year in School   GPA
4        324         Junior  4.75

   Student ID Year in School   GPA
11        456             10  2.75

Create your own DataFrame

Using Pandas allows you to create your own DataFrame in Python.

import pandas as pd

database = pd.read_json("files/test.json")

print(database)
             name Coolness  Awesomness            badness
0            Dash       10           5               3.57
1     Aiden Huynh       50         100  9.6 lipbite emoji
2    Aiden Leeder        0           3               2.78
3   Bailey Leeder        1          14               3.45
4          Nathan        4           6               4.75
5          Austin        5          13               3.33
6           Gavin        8           7               2.95
7            Mort        7           8                3.9
8            Pham        2           3               3.15
9      [Redacted]      nil           9                2.8
10           Toby       11          15               3.45
11         Trains        8           1               2.75

Python Dictionary to Pandas DataFrame

import pandas as pd

#the data can be stored as a python dictionary
dict = {
  "calories": [420, 380, 390],
  "duration": [50, 40, 45]
}
#stores the data in a data frame
print("-------------Dict_to_DF------------------")
df = pd.DataFrame(dict)
print(df)

print("----------Dict_to_DF_labels--------------")

#or with the index argument, you can label rows.
df = pd.DataFrame(dict, index = ["day1", "day2", "day3"])
print(df)
-------------Dict_to_DF------------------
   calories  duration
0       420        50
1       380        40
2       390        45
----------Dict_to_DF_labels--------------
      calories  duration
day1       420        50
day2       380        40
day3       390        45

Examine DataFrame Rows

print("-------Examine Selected Rows---------")
#use a list for multiple labels:
print(df.loc[["day1", "day3"]])

#refer to the row index:
print("--------Examine Single Row-----------")
print(df.loc["day1"])
print()
-------Examine Selected Rows---------
      calories  duration
day1       420        50
day3       390        45
--------Examine Single Row-----------
calories    420
duration     50
Name: day1, dtype: int64

Pandas DataFrame Information

print(df.info())
<class 'pandas.core.frame.DataFrame'>
Index: 3 entries, day1 to day3
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   calories  3 non-null      int64
 1   duration  3 non-null      int64
dtypes: int64(2)
memory usage: 180.0+ bytes
None

Example of larger data set

Pandas can read CSV and many other types of files, run the following code to see more features with a larger data set

import pandas as pd

#read csv and sort 'Duration' largest to smallest
df = pd.read_csv('files/data.csv').sort_values(by=['Duration'], ascending=False)

print("--Duration Top 10---------")
print(df.head(10))

print("--Duration Bottom 10------")
print(df.tail(10))

print("Number of rows and columns")
print("Rows", len(df.axes[0]))
print("Columns", len(df.axes[1]))
--Duration Top 10---------
     Duration  Pulse  Maxpulse  Calories
69        300    108       143    1500.2
79        270    100       131    1729.0
109       210    137       184    1860.4
60        210    108       160    1376.0
106       180     90       120     800.3
90        180    101       127     600.1
65        180     90       130     800.4
61        160    110       137    1034.4
62        160    109       135     853.0
67        150    107       130     816.0
--Duration Bottom 10------
     Duration  Pulse  Maxpulse  Calories
68         20    106       136     110.4
100        20     95       112      77.7
89         20     83       107      50.3
135        20    136       156     189.0
94         20    150       171     127.4
95         20    151       168     229.4
139        20    141       162     222.4
64         20    110       130     131.4
112        15    124       139     124.2
93         15     80       100      50.5
Number of rows and columns
Rows 169
Columns 4

APIs are a Source for Writing Programs with Data

3rd Party APIs are a great source for creating Pandas Data Frames.

  • Data can be fetched and resulting json can be placed into a Data Frame
  • Observe output, this looks very similar to a Database
'''Pandas can be used to analyze data'''
import pandas as pd
import requests

def fetch():
    '''Obtain data from an endpoint'''
    url = "https://flask.nighthawkcodingsociety.com/api/covid/"
    fetch = requests.get(url)
    json = fetch.json()

    # filter data for requirement
    df = pd.DataFrame(json['countries_stat'])  # filter endpoint for country stats
    print(df.loc[0:5, 'country_name':'deaths']) # show row 0 through 5 and columns country_name through deaths
    
fetch()
  country_name       cases     deaths
0          USA  82,649,779  1,018,316
1        India  43,057,545    522,193
2       Brazil  30,345,654    662,663
3       France  28,244,977    145,020
4      Germany  24,109,433    134,624
5           UK  21,933,206    173,352

Hacks

Early Seed award

  • Add this Blog to you own Blogging site.
  • Have all lecture files saved to your files directory before Tech Talk starts. Have data.csv open in vscode. Don't tell anyone. Show to Teacher.

AP Prep

  • Add this Blog to you own Blogging site. In the Blog add notes and observations on each code cell.
  • In blog add College Board practice problems for 2.3.

The next 4 weeks, Teachers want you to improve your understanding of data. Look at the blog and others on Unit 2. Your intention is to find some things to differentiate your individual College Board project.

  • Create or Find your own dataset. The suggestion is to use a JSON file, integrating with your PBL project would be Amazing.

  • When choosing a data set, think about the following...

    • Does it have a good sample size?
    • Is there bias in the data?
    • Does the data set need to be cleaned?
    • What is the purpose of the data set?
    • ...
  • Continue this Blog using Pandas extract info from that dataset (ex. max, min, mean, median, mode, etc.)

Hack Helpers

Here is how Mort started on this assignment by asking ChatGPT ... Regarding Python Pandas, what are some data sets that would be good for learning Pandas?

  • There are many data sets that are suitable for learning pandas, depending on your interests and the skills you want to develop.

    Here are some suggestions...

    • Titanic Dataset:This is a classic dataset for data analysis and machine learning, and is often used as an introduction to pandas. The dataset contains information about passengers on the Titanic, including their demographics, ticket class, and survival status.
    • Iris Dataset: This is another classic dataset that is often used in machine learning and data analysis courses. The dataset contains information about iris flowers, including their sepal length, sepal width, petal length, and petal width.

    • NBA Player Stats: This dataset contains information about NBA players, including their stats for various games, their positions, and their salaries. It is a great dataset for learning how to clean and manipulate data using pandas.

    • Housing Prices Dataset: This dataset contains information about housing prices in a particular city or region, including variables such as the number of bedrooms, square footage, and price. It is a great dataset for learning how to manipulate and analyze numerical data using pandas.

    • COVID-19 Dataset: This dataset contains information about the COVID-19 pandemic, including the number of cases, deaths, and recoveries in various countries and regions. It is a great dataset for learning how to manipulate and analyze time-series data using pandas.

    • World Bank Dataset: This dataset contains information about economic indicators from various countries, such as GDP, population, and poverty rates. It is a great dataset for learning how to manipulate and analyze large datasets using pandas.

  • These are just a few examples of the many datasets that are suitable for learning pandas. The best dataset for you will depend on your interests and the specific skills you want to develop.

    Follow up question, "where can I find Titanic data set?"

  • The Titanic dataset is a popular dataset for data analysis and machine learning, and it can be found on various websites and data repositories. Here are a few places where you can find the Titanic dataset...

    • Kaggle:The Titanic dataset is available on Kaggle, which is a popular platform for data scientists and machine learning enthusiasts. You can download the dataset from the Kaggle website after creating an account.
    • UCI Machine Learning Repository: The Titanic dataset is also available on the UCI Machine Learning Repository, which is a collection of datasets that are commonly used for machine learning research. You can download the dataset from the UCI Machine Learning Repository website.

    • Seaborn library: If you have the Seaborn library installed in your Python environment, you can load the Titanic dataset directly from the library using the following code:

      import seaborn as sns
      titanic_data = sns.load_dataset('titanic')
      
import os
# assign directory
directory = 'files/Minecraft_Composting_v2/'
 
# iterate over files in
# that directory
for filename in os.listdir(directory):
    f = os.path.join(directory, filename)
    # checking if it is a file
    if os.path.isfile(f):
        print(f)
files/Minecraft_Composting_v2/Cookie_v2.csv
files/Minecraft_Composting_v2/Sugarcane_v2.csv
files/Minecraft_Composting_v2/Wheatseed_v2.csv
files/Minecraft_Composting_v2/Carrot_v2.csv
files/Minecraft_Composting_v2/Apple_v2.csv
files/Minecraft_Composting_v2/Wheat_v2.csv
files/Minecraft_Composting_v2/Oaksapling_v2.csv
files/Minecraft_Composting_v2/Melonseed_v2.csv
files/Minecraft_Composting_v2/Potato_v2.csv
files/Minecraft_Composting_v2/Oakleaf_v2.csv
files/Minecraft_Composting_v2/Melon_v2.csv
files/Minecraft_Composting_v2/Cactus_v2.csv
files/Minecraft_Composting_v2/Radish_v2.csv
files/Minecraft_Composting_v2/Pumpkin_v2.csv
files/Minecraft_Composting_v2/Pumpkinseed_v2.csv
files/Minecraft_Composting_v2/Radishseed_v2.csv
import pandas as pd
import os

minecraft = pd.DataFrame()
directory = 'files/Minecraft_Composting_v2/'
for filename in os.listdir(directory):
    f = os.path.join(directory, filename)
    if os.path.isfile(f):
        minecraft[f[30:29 + f[29:].find("_")]] = pd.read_csv(f)



# print(minecraft)
print(minecraft.mode())
print("Mean")
print(minecraft.mean())
print("Median:")
print(minecraft.median())
print("Max")
print(minecraft.max())
column = input("What column would you like to sort by?")
ascension = input("would you like it in ascending or descending order? (y/n)")
check = lambda ascension : True if ascension == 'y' else False
print(minecraft.sort_values(by=[column], ascending=(check(ascension))))
   Cookie  Sugarcane  Wheatseed  Carrot  Apple  Wheat  Oaksapling  Melonseed  \
0       7         12         17       9      9      9          19         17   

   Potato  Oakleaf  Melon  Cactus  Radish  Pumpkin  Pumpkinseed  Radishseed  
0       9       18      9      12       9        9           19          20  
Mean
Cookie          8.034667
Sugarcane      13.171667
Wheatseed      20.797333
Carrot         10.278000
Apple          10.243333
Wheat          10.243333
Oaksapling     21.064667
Melonseed      21.052333
Potato         10.229000
Oakleaf        20.938667
Melon          10.302667
Cactus         13.007333
Radish         10.218333
Pumpkin        10.233333
Pumpkinseed    21.133000
Radishseed     20.860000
dtype: float64
Median:
Cookie          8.0
Sugarcane      13.0
Wheatseed      20.0
Carrot         10.0
Apple          10.0
Wheat          10.0
Oaksapling     20.0
Melonseed      20.0
Potato         10.0
Oakleaf        20.0
Melon          10.0
Cactus         12.0
Radish         10.0
Pumpkin        10.0
Pumpkinseed    20.0
Radishseed     20.0
dtype: float64
Max
Cookie         15
Sugarcane      41
Wheatseed      50
Carrot         22
Apple          21
Wheat          22
Oaksapling     56
Melonseed      55
Potato         21
Oakleaf        57
Melon          22
Cactus         28
Radish         23
Pumpkin        21
Pumpkinseed    73
Radishseed     52
dtype: int64
      Cookie  Sugarcane  Wheatseed  Carrot  Apple  Wheat  Oaksapling  \
885        7         10         15       9     10     10          56   
590        8         17         18      10     14     13          53   
2842       8          8         35       8     10      9          53   
102        8         10          9       8      9      7          50   
467        8         17         27       7      9      8          50   
...      ...        ...        ...     ...    ...    ...         ...   
512        9         11         14      18     11     14           8   
415        7         12         22       8     11     11           8   
1349       8         15         29       9     10     12           8   
2638       7         10         16      10     13     15           7   
640        9         14         20      13      7      9           7   

      Melonseed  Potato  Oakleaf  Melon  Cactus  Radish  Pumpkin  Pumpkinseed  \
885          33       9       27     10      13      10       11           17   
590          21       9       12      9      15       9        8           20   
2842         32      11       38     10      13      10        8           26   
102          22      10       24     13      12      10        8           16   
467          18       8       19      7      11      10        9           18   
...         ...     ...      ...    ...     ...     ...      ...          ...   
512          14      11       15      7      14      10        9           23   
415          18      10       28      8      13      15       14           11   
1349         32      10       23      9      10      13       11           15   
2638         24       7       31     11      10      10       11           15   
640          21      11       15      9      11      11       10           19   

      Radishseed  
885           21  
590           26  
2842          17  
102           17  
467           17  
...          ...  
512           24  
415           18  
1349          13  
2638          22  
640           19  

[3000 rows x 16 columns]
import pandas as pd
bank = pd.read_json("files/world-bank-projects.json")


print(bank[bank.countryname == "Republic of Indonesia"])
print(bank[["countryname"]].mode())
print()
print(bank[bank.countryname == "Republic of Indonesia"].axes[0])
                                      _id  approvalfy board_approval_month  \
92   {'$oid': '52b213b38594d8a2be17c7dc'}        2013                 July   
138  {'$oid': '52b213b38594d8a2be17c80a'}        2013                 June   
176  {'$oid': '52b213b38594d8a2be17c830'}        2013                  May   
202  {'$oid': '52b213b38594d8a2be17c84a'}        2013                  May   
210  {'$oid': '52b213b38594d8a2be17c852'}        2013                  May   
212  {'$oid': '52b213b38594d8a2be17c854'}        2013                  May   
232  {'$oid': '52b213b38594d8a2be17c868'}        2013                April   
247  {'$oid': '52b213b38594d8a2be17c877'}        2013                March   
299  {'$oid': '52b213b38594d8a2be17c8ab'}        2013                March   
300  {'$oid': '52b213b38594d8a2be17c8ac'}        2013                March   
364  {'$oid': '52b213b38594d8a2be17c8ec'}        2013             December   
377  {'$oid': '52b213b38594d8a2be17c8f9'}        2013             December   
395  {'$oid': '52b213b38594d8a2be17c90b'}        2013             November   
396  {'$oid': '52b213b38594d8a2be17c90c'}        2013             November   
397  {'$oid': '52b213b38594d8a2be17c90d'}        2013             November   
398  {'$oid': '52b213b38594d8a2be17c90e'}        2013             November   
445  {'$oid': '52b213b38594d8a2be17c93d'}        2013              October   
475  {'$oid': '52b213b38594d8a2be17c95b'}        2013            September   
498  {'$oid': '52b213b38594d8a2be17c972'}        2013               August   

        boardapprovaldate                                  borrower  \
92   2013-07-08T00:00:00Z                     REPUBLIC OF INDONESIA   
138  2013-06-13T00:00:00Z                   GOVERNMENT OF INDONESIA   
176  2013-05-24T00:00:00Z                    TBD (TO BE DETERMINED)   
202  2013-05-10T00:00:00Z                    TBD (TO BE DETERMINED)   
210  2013-05-07T00:00:00Z                   GOVERNMENT OF INDONESIA   
212  2013-05-02T00:00:00Z                   GOVERNMENT OF INDONESIA   
232  2013-04-24T00:00:00Z                    TBD (TO BE DETERMINED)   
247  2013-03-29T00:00:00Z                     REPUBLIC OF INDONESIA   
299  2013-03-04T00:00:00Z                                       NGO   
300  2013-03-04T00:00:00Z  DG COMMUNITY AND VILLAGE EMPOWERMENT, MI   
364  2012-12-19T00:00:00Z                   GOVERNMENT OF INDONESIA   
377  2012-12-11T00:00:00Z                     REPUBLIC OF INDONESIA   
395  2012-11-20T00:00:00Z                     REPUBLIC OF INDONESIA   
396  2012-11-20T00:00:00Z                     REPUBLIC OF INDONESIA   
397  2012-11-20T00:00:00Z                   GOVERNMENT OF INDONESIA   
398  2012-11-20T00:00:00Z                     REPUBLIC OF INDONESIA   
445  2012-10-05T00:00:00Z                                       NGO   
475  2012-09-11T00:00:00Z                   GOVERNMENT OF INDONESIA   
498  2012-08-02T00:00:00Z                     REPUBLIC OF INDONESIA   

              closingdate            country_namecode countrycode  \
92   2018-12-31T00:00:00Z  Republic of Indonesia!$!ID          ID   
138  2015-12-31T00:00:00Z  Republic of Indonesia!$!ID          ID   
176                   NaN  Republic of Indonesia!$!ID          ID   
202                   NaN  Republic of Indonesia!$!ID          ID   
210  2013-12-31T00:00:00Z  Republic of Indonesia!$!ID          ID   
212                   NaN  Republic of Indonesia!$!ID          ID   
232  2013-12-31T00:00:00Z  Republic of Indonesia!$!ID          ID   
247  2020-12-31T00:00:00Z  Republic of Indonesia!$!ID          ID   
299  2013-12-31T00:00:00Z  Republic of Indonesia!$!ID          ID   
300                   NaN  Republic of Indonesia!$!ID          ID   
364  2014-12-31T00:00:00Z  Republic of Indonesia!$!ID          ID   
377  2015-12-31T00:00:00Z  Republic of Indonesia!$!ID          ID   
395  2013-04-30T00:00:00Z  Republic of Indonesia!$!ID          ID   
396  2015-12-31T00:00:00Z  Republic of Indonesia!$!ID          ID   
397  2013-04-30T00:00:00Z  Republic of Indonesia!$!ID          ID   
398  2013-12-31T00:00:00Z  Republic of Indonesia!$!ID          ID   
445  2014-06-30T00:00:00Z  Republic of Indonesia!$!ID          ID   
475  2018-03-31T00:00:00Z  Republic of Indonesia!$!ID          ID   
498  2017-09-30T00:00:00Z  Republic of Indonesia!$!ID          ID   

               countryname countryshortname  ...      sectorcode source  \
92   Republic of Indonesia        Indonesia  ...              LT   IBRD   
138  Republic of Indonesia        Indonesia  ...           LS,LC   IBRD   
176  Republic of Indonesia        Indonesia  ...              BG   IBRD   
202  Republic of Indonesia        Indonesia  ...              JB   IBRD   
210  Republic of Indonesia        Indonesia  ...     BH,BC,EP,JA   IBRD   
212  Republic of Indonesia        Indonesia  ...  JB,BC,BH,WA,WC   IBRD   
232  Republic of Indonesia        Indonesia  ...              EV   IBRD   
247  Republic of Indonesia        Indonesia  ...     EZ,BN,EV,ET   IBRD   
299  Republic of Indonesia        Indonesia  ...           JB,BH   IBRD   
300  Republic of Indonesia        Indonesia  ...              BH   IBRD   
364  Republic of Indonesia        Indonesia  ...           BH,BC   IBRD   
377  Republic of Indonesia        Indonesia  ...  WZ,TI,JA,EZ,BH   IBRD   
395  Republic of Indonesia        Indonesia  ...  LZ,WZ,JA,CT,TZ   IBRD   
396  Republic of Indonesia        Indonesia  ...  BH,JB,LZ,WZ,TI   IBRD   
397  Republic of Indonesia        Indonesia  ...     BK,JB,BZ,BC   IBRD   
398  Republic of Indonesia        Indonesia  ...     FD,FK,FA,FZ   IBRD   
445  Republic of Indonesia        Indonesia  ...              EV   IBRD   
475  Republic of Indonesia        Indonesia  ...  BO,BH,LZ,TZ,WZ   IBRD   
498  Republic of Indonesia        Indonesia  ...        YA,BL,AB   IBRD   

     status  supplementprojectflg  \
92   Active                     N   
138  Active                     N   
176  Active                     N   
202  Active                     N   
210  Active                     N   
212  Active                     Y   
232  Active                     N   
247  Active                     N   
299  Active                     N   
300  Active                     N   
364  Active                     Y   
377  Active                     N   
395  Closed                     N   
396  Active                     N   
397  Closed                     N   
398  Closed                     N   
445  Active                     N   
475  Active                     N   
498  Active                     N   

                                                theme1  \
92           {'Name': 'Climate change', 'Percent': 10}   
138  {'Name': 'Other public sector governance', 'Pe...   
176  {'Name': 'Other accountability/anti-corruption...   
202  {'Name': 'Participation and civic engagement',...   
210       {'Name': 'Education for all', 'Percent': 24}   
212  {'Name': 'Participation and civic engagement',...   
232  {'Name': 'Participation and civic engagement',...   
247  {'Name': 'Education for the knowledge economy'...   
299  {'Name': 'Other rural development', 'Percent':...   
300  {'Name': 'Other public sector governance', 'Pe...   
364  {'Name': 'Vulnerability assessment and monitor...   
377  {'Name': 'Participation and civic engagement',...   
395  {'Name': 'Infrastructure services for private ...   
396  {'Name': 'Other social protection and risk man...   
397  {'Name': 'Participation and civic engagement',...   
398  {'Name': 'International financial standards an...   
445  {'Name': 'Other social development', 'Percent'...   
475  {'Name': 'Infrastructure services for private ...   
498  {'Name': 'Rural services and infrastructure', ...   

                                        theme_namecode       themecode  \
92   [{'name': 'Climate change', 'code': '81'}, {'n...           39,81   
138  [{'name': 'Other public sector governance', 'c...              30   
176  [{'name': 'Other accountability/anti-corruptio...        35,31,29   
202  [{'name': 'Participation and civic engagement'...              57   
210  [{'name': 'Education for all', 'code': '65'}, ...  57,63,69,68,65   
212  [{'name': 'Participation and civic engagement'...           78,57   
232  [{'name': 'Participation and civic engagement'...              57   
247  [{'name': 'Education for the knowledge economy...           59,66   
299  [{'name': 'Other rural development', 'code': '...           62,79   
300  [{'name': 'Other public sector governance', 'c...        29,79,30   
364  [{'name': 'Vulnerability assessment and monito...           52,55   
377  [{'name': 'Participation and civic engagement'...  78,68,65,77,57   
395  [{'name': 'Infrastructure services for private...     78,47,49,39   
396  [{'name': 'Other social protection and risk ma...     59,57,41,56   
397  [{'name': 'Participation and civic engagement'...  27,53,54,28,57   
398  [{'name': 'International financial standards a...     40,41,59,42   
445  [{'name': 'Other social development', 'code': ...              62   
475  [{'name': 'Infrastructure services for private...           72,39   
498  [{'name': 'Rural services and infrastructure',...        77,91,78   

      totalamt totalcommamt                                                url  
92   325000000    325000000  http://www.worldbank.org/projects/P123994/indo...  
138          0      1050000  http://www.worldbank.org/projects/P133729/indo...  
176          0      3960000  http://www.worldbank.org/projects/P144386/pnpm...  
202          0      1800000  http://www.worldbank.org/projects/P143354/id-t...  
210          0     31700000  http://www.worldbank.org/projects/P132585/id-t...  
212   99900000     99900000  http://www.worldbank.org/projects/P129486/add-...  
232          0       880000  http://www.worldbank.org/projects/P132831/pnpm...  
247   95000000     95000000  http://www.worldbank.org/projects/P121842/rese...  
299          0      1670000  http://www.worldbank.org/projects/P131815/id-t...  
300          0      1650000  http://www.worldbank.org/projects/P132945/id-p...  
364          0      2380000  http://www.worldbank.org/projects/P124674?lang=en  
377  650000000    650000000  http://www.worldbank.org/projects/P128832/nati...  
395  100000000    100000000  http://www.worldbank.org/projects/P124006/firs...  
396  266000000    266000000  http://www.worldbank.org/projects/P125405/four...  
397  300000000    300000000  http://www.worldbank.org/projects/P126162/inst...  
398  100000000    100000000  http://www.worldbank.org/projects/P130150/fina...  
445          0      4150000  http://www.worldbank.org/projects/P131387/bare...  
475   29600000     29600000  http://www.worldbank.org/projects/P118916/infr...  
498   80000000     80000000  http://www.worldbank.org/projects/P117243/sust...  

[19 rows x 50 columns]
                  countryname
0  People's Republic of China
1       Republic of Indonesia

Int64Index([ 92, 138, 176, 202, 210, 212, 232, 247, 299, 300, 364, 377, 395,
            396, 397, 398, 445, 475, 498],
           dtype='int64')

Titanic Data

Look at a sample of data.

import seaborn as sns

# Load the titanic dataset
titanic_data = sns.load_dataset('titanic')

print("Titanic Data")


# print(titanic_data.columns) # titanic data set

# print(titanic_data[['survived','pclass', 'sex', 'age', 'sibsp', 'parch', 'class', 'fare', 'embark_town']]) # look at selected columns
# and titanic_data.survived == 0
print("Number of kids on the titanic:", len(titanic_data[titanic_data.age < 18].axes[0]))
child_data = titanic_data[titanic_data.age < 18]

print("Number of kids who died on the titanic:", len(child_data[child_data.survived < 1].axes[0]))
Titanic Data
Number of kids on the titanic: 113
Number of kids who died on the titanic: 52

Use Pandas to clean the data. Most analysis, like Machine Learning or even Pandas in general like data to be in standardized format. This is called 'Training' or 'Cleaning' data.

import seaborn as sns
titanic_data = sns.load_dataset('titanic')

# Preprocess the data
from sklearn.preprocessing import OneHotEncoder


td = titanic_data

td.drop(['alive', 'who', 'adult_male', 'class', 'embark_town', 'deck'], axis=1, inplace=True)
td.dropna(inplace=True)
td['sex'] = td['sex'].apply(lambda x: 1 if x == 'male' else 0)
td['alone'] = td['alone'].apply(lambda x: 1 if x == True else 0)

# Encode categorical variables
print(td)
enc = OneHotEncoder(handle_unknown='ignore')
enc.fit(td[['embarked']])
onehot = enc.transform(td[['embarked']]).toarray()
cols = ['embarked_' + val for val in enc.categories_[0]]
td[cols] = pd.DataFrame(onehot)
td.drop(['embarked'], axis=1, inplace=True)
td.dropna(inplace=True)

print("new table")
     survived  pclass  sex   age  sibsp  parch     fare embarked  alone
0           0       3    1  22.0      1      0   7.2500        S      0
1           1       1    0  38.0      1      0  71.2833        C      0
2           1       3    0  26.0      0      0   7.9250        S      1
3           1       1    0  35.0      1      0  53.1000        S      0
4           0       3    1  35.0      0      0   8.0500        S      1
..        ...     ...  ...   ...    ...    ...      ...      ...    ...
885         0       3    0  39.0      0      5  29.1250        Q      0
886         0       2    1  27.0      0      0  13.0000        S      1
887         1       1    0  19.0      0      0  30.0000        S      1
889         1       1    1  26.0      0      0  30.0000        C      1
890         0       3    1  32.0      0      0   7.7500        Q      1

[712 rows x 9 columns]
     survived  pclass  sex   age  sibsp  parch     fare embarked  alone  \
0           0       3    1  22.0      1      0   7.2500        S      0   
1           1       1    0  38.0      1      0  71.2833        C      0   
2           1       3    0  26.0      0      0   7.9250        S      1   
3           1       1    0  35.0      1      0  53.1000        S      0   
4           0       3    1  35.0      0      0   8.0500        S      1   
..        ...     ...  ...   ...    ...    ...      ...      ...    ...   
885         0       3    0  39.0      0      5  29.1250        Q      0   
886         0       2    1  27.0      0      0  13.0000        S      1   
887         1       1    0  19.0      0      0  30.0000        S      1   
889         1       1    1  26.0      0      0  30.0000        C      1   
890         0       3    1  32.0      0      0   7.7500        Q      1   

     embarked_C  embarked_Q  embarked_S  
0           0.0         0.0         1.0  
1           1.0         0.0         0.0  
2           0.0         0.0         1.0  
3           0.0         0.0         1.0  
4           0.0         0.0         1.0  
..          ...         ...         ...  
885         NaN         NaN         NaN  
886         NaN         NaN         NaN  
887         NaN         NaN         NaN  
889         NaN         NaN         NaN  
890         NaN         NaN         NaN  

[712 rows x 12 columns]
     survived  pclass  sex   age  sibsp  parch     fare  alone  embarked_C  \
0           0       3    1  22.0      1      0   7.2500      0         0.0   
1           1       1    0  38.0      1      0  71.2833      0         1.0   
2           1       3    0  26.0      0      0   7.9250      1         0.0   
3           1       1    0  35.0      1      0  53.1000      0         0.0   
4           0       3    1  35.0      0      0   8.0500      1         0.0   
..        ...     ...  ...   ...    ...    ...      ...    ...         ...   
885         0       3    0  39.0      0      5  29.1250      0         NaN   
886         0       2    1  27.0      0      0  13.0000      1         NaN   
887         1       1    0  19.0      0      0  30.0000      1         NaN   
889         1       1    1  26.0      0      0  30.0000      1         NaN   
890         0       3    1  32.0      0      0   7.7500      1         NaN   

     embarked_Q  embarked_S  
0           0.0         1.0  
1           0.0         0.0  
2           0.0         1.0  
3           0.0         1.0  
4           0.0         1.0  
..          ...         ...  
885         NaN         NaN  
886         NaN         NaN  
887         NaN         NaN  
889         NaN         NaN  
890         NaN         NaN  

[712 rows x 11 columns]
new table

The result of 'Training' data is making it easier to analyze or make conclusions. In looking at the Titanic, as you clean you would probably want to make assumptions on likely chance of survival.

This would involve analyzing various factors (such as age, gender, class, etc.) that may have affected a person's chances of survival, and using that information to make predictions about whether an individual would have survived or not.

  • Data description:- Survival - Survival (0 = No; 1 = Yes). Not included in test.csv file. - Pclass - Passenger Class (1 = 1st; 2 = 2nd; 3 = 3rd)

    • Name - Name
    • Sex - Sex
    • Age - Age
    • Sibsp - Number of Siblings/Spouses Aboard
    • Parch - Number of Parents/Children Aboard
    • Ticket - Ticket Number
    • Fare - Passenger Fare
    • Cabin - Cabin
    • Embarked - Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)
  • Perished Mean/Average

print(titanic_data.query("survived == 0").mean())
survived       0.000000
pclass         2.464072
sex            0.844311
age           31.073353
sibsp          0.562874
parch          0.398204
fare          24.835902
alone          0.616766
embarked_C     0.185629
embarked_Q     0.038922
embarked_S     0.775449
dtype: float64
  • Survived Mean/Average
print(td.query("survived == 1").mean())
survived       1.000000
pclass         1.878261
sex            0.326087
age           28.481522
sibsp          0.504348
parch          0.508696
fare          50.188806
alone          0.456522
embarked_C     0.152174
embarked_Q     0.034783
embarked_S     0.813043
dtype: float64

Survived Max and Min Stats

print(td.query("survived == 1").max())
print(td.query("survived == 1").min())
survived        1.0000
pclass          3.0000
sex             1.0000
age            80.0000
sibsp           4.0000
parch           5.0000
fare          512.3292
alone           1.0000
embarked_C      1.0000
embarked_Q      1.0000
embarked_S      1.0000
dtype: float64
survived      1.00
pclass        1.00
sex           0.00
age           0.75
sibsp         0.00
parch         0.00
fare          0.00
alone         0.00
embarked_C    0.00
embarked_Q    0.00
embarked_S    0.00
dtype: float64

Machine Learning Visit Tutorials Point

Scikit-learn (Sklearn) is the most useful and robust library for machine learning in Python. It provides a selection of efficient tools for machine learning and statistical modeling including classification, regression, clustering and dimensionality reduction via a consistence interface in Python.

  • Description from ChatGPT. The Titanic dataset is a popular dataset for data analysis and machine learning. In the context of machine learning, accuracy refers to the percentage of correctly classified instances in a set of predictions. In this case, the testing data is a subset of the original Titanic dataset that the decision tree model has not seen during training......After training the decision tree model on the training data, we can evaluate its performance on the testing data by making predictions on the testing data and comparing them to the actual outcomes. The accuracy of the decision tree classifier on the testing data tells us how well the model generalizes to new data that it hasn't seen before......For example, if the accuracy of the decision tree classifier on the testing data is 0.8 (or 80%), this means that 80% of the predictions made by the model on the testing data were correct....Chance of survival could be done using various machine learning techniques, including decision trees, logistic regression, or support vector machines, among others.

  • Code Below prepares data for further analysis and provides an Accuracy. IMO, you would insert a new passenger and predict survival. Datasets could be used on various factors like prediction if a player will hit a Home Run, or a Stock will go up or down.

from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Split arrays or matrices into random train and test subsets.
X = td.drop('survived', axis=1)
y = td['survived']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train a decision tree classifier
dt = DecisionTreeClassifier()
dt.fit(X_train, y_train)

# Test the model
y_pred = dt.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print('DecisionTreeClassifier Accuracy:', accuracy)

# Train a logistic regression model
logreg = LogisticRegression()
logreg.fit(X_train, y_train)

# Test the model
y_pred = logreg.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print('LogisticRegression Accuracy:', accuracy)
DecisionTreeClassifier Accuracy: 0.7588235294117647
LogisticRegression Accuracy: 0.788235294117647
/Users/tobyleeder/opt/anaconda3/lib/python3.9/site-packages/sklearn/linear_model/_logistic.py:814: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(