Joining us today is Manjunath Rajmohan who joined the student programs team to support the Minidrone Competition and student challenges. In today’s post, Manjunath chats with the winners of MathWorks Minidrone Competition – EMEA 2022. Over to you, Manjunath…

The MathWorks Minidrone Competition is designed to help students learn key concepts through a fun-filled approach and become industry-ready. The challenge gives students the opportunity to design a minidrone line follower and learn Model-Based Design hands-on using Simulink.

Model-Based Design enables fast and cost-effective development of dynamic systems, including control systems, signal processing systems, and communications systems. For many aspiring engineers, a key avenue toward learning Model-Based Design has been their participation in the MathWorks Minidrone Competition.

Today we’re talking to Team Tabono, from the University of Uyo, Nigeria, the winners of the MathWorks Minidrone Competition – EMEA 2022. We will be interacting with Kingsley Udofia, Emmanuel Ogungbemi, Chidinma Kalu, and Oluwaseun Ilori, the members of Team Tabono.

Hosting the competition in the post-pandemic environment required major adjustments – especially to the format of the competition. With the safety of students as our top priority, MathWorks made the competition fully virtual. This format opened the competition to a larger audience across countries in the Europe, Middle East, and Africa (EMEA) region. However, it also put students’ adaptability to the test.

The students rose to the challenge, and the MathWorks Minidrone Competition – EMEA 2022 witnessed a staggering 300+ team registrations from 44 different countries. The top 9 teams were shortlisted for the finals after the simulation round.

When asked about the driving factor behind their participation in the competition, Oluwaseun Ilori, a member of Team Tabono, says:

“One of our staff members informed us about the contest after receiving a notification about it through her MathWorks account. After some discussion, we decided to throw our hats into the ring. We have been using MATLAB to complete numerical simulation school assignments, so we thought that taking part in this project would be an opportunity to challenge ourselves and learn something new in a competitive environment.”

To enable students to quickly get started, MathWorks provides numerous resources in line with the competition. We spoke to the team to understand how these resources helped them in their journey.

“It was quite helpful to have the early materials made available to us, such as the ‘getting started guide’ and videos on the MathWorks Minidrone Competitions webpage. It gave us a good starting point in understanding the workflow, the image processing system, and the flight control system using Simulink and the Stateflow chart,” says Kingsley Udofia, Captain, Team Tabono.

“The MATLAB and Simulink Onramp introductory tutorials and videos, which could be accessed directly within MATLAB, were also of great help,” he added.

In our interaction with the team, we were intrigued to hear about the learning curve and understand how students found the competition useful.

In students’ words: Has the competition helped you learn new technologies?

“Drones are ubiquitous today, however, this was our first time developing an algorithm for one and simulating it. We also got introduced to working hands-on with Simulink and learning by doing always has a lasting impression on the exploration journey. We gained useful experience, especially on working as a team, because of this competition” says Chidinma Kalu, Member, Team Tabono.

“Starting with something new can be challenging, and that was certainly the case when we first encountered Model-Based Design. However, we persevered and worked through the learning process. We went through multiple iterations to prototype our solution, trying them in simulations to eventually arrive at a result satisfactory to us,” they added.

To help all the participants of future competitions, we asked the team to share some useful tips on how students can better approach the competition.

Team Tabono shares a success formula to ace the competition.

“Participants should use the tools/reference models that MathWorks has made available to them, including the videos, tutorials, and documentation, or else they risk having a challenging beginning. They ought to use a fresh algorithm to address the problem at hand. This will ensure a strong possibility of getting through the simulation round, and eventually succeeding. It gets easier with every step forward.”

The competition has helped the team learn, use, and witness how Simulink helps in rapid virtual prototyping of control systems.

On being asked how Simulink could help the team in the future, Emmanuel Ogungbemi, member of Team Tabono, says, “Simulink, being a very versatile simulation tool, helps bring down the overall product development time. Its capability to simulate complex systems is valuable for testing conditions that might be difficult to reproduce with hardware prototypes alone. One of our team members is currently working on Software Defined Radios using Simulink.”

The team stated, “The experience opened our eyes as much as it confirmed our convictions that problems are chances to find answers and that working together as a team is essential for project success. It has made us more aware of the virtually endless possibilities that come with using MathWorks products.”

We hope this conversation was useful to all our readers. If you are interested in participating, sign up for an upcoming Minidrone Competition near you.

If you would like to host the MathWorks Minidrone Competition on your campus, club, or class, drop us an e-mail at minidronecompetition@mathworks.com.

Joining us today is Keshav Patel, who is an NSF Graduate Research Fellow at the University of Utah. He was part of the team that finished as runners-up in the 2015 MathWorks Math Modeling Challenge (M3C). Keshav will be following up on a previous blog post regarding Part 1 of the 2019 MathWorks Math Modeling Challenge (M3C). If you have not read Part 1 yet, we encourage you to take a look here. Over to you, Keshav…

In the first post, Dr. Wesley Hamilton created a framework for how to tackle this problem. Now, let’s take a look at what real participants submitted for this competition. We have combed through several solutions across the entire range of scores to ask, “what makes a good submission?”

In this post, we will be examining teams for their approach to the problem, their assumptions, their results, and their submission structure/format. As you read this blog, you may ask yourself: which of these models is the “best”? We hope to show through these examples the following fact about modeling in the M3C: model justifications are far more important than the use of high-level mathematics. Good submissions are the ones that make good arguments for their model, providing justification for the choices the team made. Said in a way that might better appeal to the mathematicians out there:

Cool math + poor communication = bad model

While this blog post is geared towards the M3C, the takeaways from our analysis are good pieces of advice for any math modeling work! If you want to think specifically about how an M3C judge may view your work, it may also be helpful to look at the M3C Scoring Guide.

Table of Contents

- Curve Fitting Solutions
  - A Close Look at a “good” Regression Model
    - Assumptions
    - Model
    - Strengths and Weaknesses
    - Extra Practice
  - Examining Other Regression Models
    - Extra Practice
- Other Mathematical Methods
  - Ordinary Differential Equations Model
    - Extra Practice
- Closing Thoughts

A majority of teams approached this problem in a very similar way to Wesley, although the exact methods varied depending on what data they chose to incorporate and what function they chose to fit their data to. First, we will examine a submission that did not win, but came quite close.

The first team we will be examining set up their submission in a manner outlined by other math modeling resources: they restated the problem, wrote down assumptions, defined their variables, and then described their model. First, let’s take a look at one of the most important components of model setup: the assumptions.

Here are a few of the most noteworthy assumptions in this submission:

- Assumption: The percent of the population that uses vaping products is an accurate measure of the spread of nicotine due to vaping products. Justification: It would be unreasonable to determine the exact amount of nicotine used over the past couple years and predict it for the coming years. Each vaping product has a different amount of nicotine in it and, as seen during our investigation, existing data does not record the amount of nicotine each user consumed against a time metric. However, the spread of nicotine can be measured by its popularity in the US market, as the more people use nicotine-based vaping products, the more nicotine is used.
- Assumption: There is no new pertinent information regarding the dangers of nicotine-based vaping products or laws that will affect its popularity. Justification: Many studies and reports have already been released advocating the negatives of using nicotine-based vaping products, but despite this, as our research showed, the popularity of nicotine-based vaping products has continued to increase. Additionally, although the introduction of comprehensive FDA legislation in August 2016 did cause a sharp decrease in the popularity of nicotine-based vaping products [citation], since most relevant legislation regarding the use of nicotine-based vaping products has already been passed, and since these products did regain popularity in the aftermath of the legislation with the surge of vaping use in 2018, it is reasonable to assume that future legislation will not have a significant impact on the popularity of nicotine-based vaping products.
- Assumption: The carrying capacity of the market size for nicotine-based vaping products can be estimated using the historical carrying capacity of the market size for cigarettes. Justification: When analyzing the trends in the popularity of cigarettes, the group noticed that the initial growth in popularity of cigarettes closely mirrored that of nicotine-based vaping products like e-cigarettes.

The first assumption explains to the reader how this team chose to interpret nicotine use data, the second assumption is a simplifying assumption that justifies the extrapolation of their statistical model, and the third assumption gives a justification for using cigarette data to help fill in a missing piece of information for vaping data.

This team not only gave a list of assumptions, but also gave a justification (with citations!) for most assumptions. Some questions your assumptions should help to answer include: How are you interpreting your data? What are you taking into account, or not taking into account, when you extrapolate your data? Why is your choice of function for the regression a valid one? What would happen if you did not make the assumption?

One more note about justifications: their first assumption’s justification is more or less an issue of availability of data. Since the competition is only fourteen hours, this is a perfectly reasonable assumption. It would be a good idea to revisit this assumption in the “Strengths and Weaknesses” section of your submission. Consider how your assumptions would change and how you would change your model setup if you had better access to relevant data.

Next, we will look at the model that was built. This team in particular compiled the e-cigarette use dataset in the same way that Wesley did in last week’s blog:

However, they decided to apply a logistic regression to this dataset:

From this curve, they claimed that the overall use of e-cigarettes will increase from about 15% in 2018 to about 45% around the year 2025, and then level off (but never decrease). From a mathematical perspective, this work and their results are perfectly valid. Now is the time to ask, “Is this a reasonable result?” Well, one of the team’s assumptions is that “there is no new pertinent information regarding the dangers of nicotine-based vaping products or laws that will affect its popularity.” So, unlike the historical data on cigarette usage, there is no change to the laws or to how nicotine is consumed to suggest that e-cigarette usage will ever decline. Because I can connect the team’s assumptions to their results, I believe this is a reasonable result. My personal biases would lead me to believe that their result is pessimistic; thankfully, examining the submission based on argumentation helps take personal biases out of the equation.
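As a rough illustration of this kind of fit: once a carrying capacity is assumed (the team estimated theirs from the historical cigarette plateau), a logistic curve can be fit to usage data by linearizing it. The sketch below uses entirely hypothetical data and a hypothetical 45% plateau, and Python rather than the teams' own tools, so treat it as an analogue, not their method:

```python
import numpy as np

# Hypothetical e-cigarette usage data (percent of population by year)
years = np.array([2011.0, 2013.0, 2015.0, 2017.0, 2018.0])
usage = np.array([1.5, 4.5, 11.0, 12.0, 15.0])

# Carrying capacity L assumed from the historical cigarette plateau (45%)
L = 45.0

# Linearize the logistic y = L / (1 + exp(-k*(t - t0))):
#   log(L/y - 1) = -k*t + k*t0, which is a straight line in t
z = np.log(L / usage - 1)
slope, intercept = np.polyfit(years, z, 1)
k, t0 = -slope, intercept / -slope

# Extrapolate: the fitted curve approaches L and never decreases
predict = lambda t: L / (1 + np.exp(-k * (t - t0)))
print(f"k = {k:.3f}, midpoint year = {t0:.1f}, 2025 estimate = {predict(2025):.1f}%")
```

Plotting `predict` alongside the raw data on the same figure (as suggested below) makes it easy for a reader to judge the quality of the fit at a glance.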

Before we move on to the Strengths and Weaknesses section, there are two more comments worth making. First, it would have been preferable to plot the logistic curve and the data on the same figure. Second, it is worth taking a moment either in this section or in the Strengths and Weaknesses section to mention the impacts of the results on the real world. After all, math modeling is all about answering real world questions!

Given the time restriction in the M3C, you may have had to make a lot of simplifying assumptions that you otherwise would not have, or you may not have had time to research the topic or data more thoroughly. The Strengths and Weaknesses section is a good time to acknowledge limitations to your model and suggest potential improvements.

In the submission we have been analyzing, the team discussed how their model is an improvement over a polynomial regression because of the physical interpretation. While the team does not go into further details on what “physical interpretation” means, I take it to mean that the polynomial regression concludes that the cigarette usage either becomes negative or increases indefinitely. Neither of these results would be reasonable when thinking about numbers of people, so this is a good thing to make a note of. Wesley also mentioned this fact in Part 1 of our blog series.

The team also conceded that there is a limited amount of data they were able to use, and that there was limited benefit to performing a sensitivity analysis. If you are not familiar with sensitivity analysis, check out Chapter 6 of the Math Modeling Handbook, or the sixth installment of our Essentials of Math Modeling Series.

Their point on sensitivity analysis is one that a judge may push back on. Any parameters that are estimated (by using previous literature, regression, intuition, etc.) are inherently not exact. So, it is usually good practice to analyze how your results would change if your parameters are increased or decreased within some range. In the context of the M3C, a range of 5% or 10% around your estimated value is standard, but in practice, your range may be based on other factors, such as the standard deviation of your estimated value.
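A parameter sweep of this kind is only a few lines of code. The sketch below perturbs the plateau of a logistic model by ±5% and ±10%; the "fitted" values here are invented for illustration, not taken from any team's submission:

```python
import math

def logistic(t, L, k, t0):
    """Logistic curve: L is the plateau, k the growth rate, t0 the midpoint year."""
    return L / (1 + math.exp(-k * (t - t0)))

# Hypothetical fitted values: 45% plateau, 0.4/yr growth rate, midpoint 2025
base_L, k, t0 = 45.0, 0.4, 2025.0

# Sweep the estimated plateau over -10%, -5%, 0%, +5%, +10%
for pct in (-0.10, -0.05, 0.0, 0.05, 0.10):
    L = base_L * (1 + pct)
    print(f"L = {L:4.1f}%  ->  predicted 2028 usage: {logistic(2028, L, k, t0):.1f}%")
```

Reporting how much the headline prediction moves across the sweep tells a judge, at a glance, how robust your conclusion is to your estimation error.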

As a whole, this team’s submission was not one of the winners, but scored very well. This method of solving the problem was quite popular; the fourth, fifth, and sixth place teams also fit a logistic regression. You can find their submissions here.

These questions ask you to compare and contrast the answers you gave to the problems in last week’s blog post to the submission we just looked at and the fourth, fifth, and sixth place submissions, which you can find here. Note: the fifth place team starts out with an Ordinary Differential Equation (ODE) model, but through some calculations they find a logistic curve that matches their data.

- What similarities and differences do you note in the assumptions? If you had opposing assumptions to another team, consider a) how well you justified your assumption, b) how your model would have to change if you used a different assumption, and c) which assumption you would rather use.
- What similarities and differences do you note in the figures? How does the formatting look (i.e. is the text big enough, are the different curves clearly labeled, is there spacing between table entries, etc.)? If you looked only at a team’s figures and tables (and their captions), could you understand the team’s results?
- Consider each team’s model. Is it clear what they are trying to do? Are the variables clearly marked or labeled in some way? Is there an aspect of their model or their results that go against the team’s assumptions?

As mentioned earlier, several teams used a regression approach to tackle this problem. However, there are a few similarities that many non-winning submissions share:

- Poor communication of their assumptions and variables
- Poor summary of the mathematical model
- Poor formatting or placement of important components

Notice that none of these have anything to do with the actual mathematics! Recall our handy-dandy formula as you write your submission and analyze others:

Cool math + poor communication = bad model

Let’s look at some examples of regression models that had some communication shortcomings.

In the first example, this team conducted a linear regression on the provided datasets. This is a good approach because it is quite simple, so not a lot of details are needed. Also, if a team argues that ten years is not a long time, then it is reasonable to assume that most other regression models could look quite similar to a line. The following is a screenshot of their results:

The figure is fairly straightforward, is well labeled, and allows us to directly compare cigarette use and e-cigarette use over time. One place where their submission was not as strong was in one of their key assumptions, which is given below:

- Assumption: Nicotine/Tobacco product usage trends will have a linear pattern in the coming decade. Justification: Both Normal Average and Exponential lines-of-best-fit proved to be highly problematic in their ability to project nicotine product usage. Therefore, a linear trend must be assumed.

While it is excellent to include this kind of assumption, their justification is quite vague. As I read this justification, I am left wondering why the methods they mentioned are “problematic” and why a linear trend is not. After looking at Part 1 of our blog series and the submission above, we know that there are issues related to the “physical interpretation” of the regression methods, but the linear regression runs into the same issue. To strengthen the justification, this team should discuss more explicitly what was “highly problematic” about these other methods. Also, there is a plethora of other fits to try (logarithmic, for example), so the team may also want to comment on why the linear model is the best.
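One concrete way to make such a comparison explicit is to report a goodness-of-fit number for each candidate model. The sketch below compares a linear and a logarithmic fit on hypothetical data using the sum of squared errors; the teams' own tools and data differed, so this is purely illustrative:

```python
import numpy as np

# Hypothetical usage data (percent of population by year)
years = np.array([2011.0, 2013.0, 2015.0, 2017.0, 2018.0])
usage = np.array([1.5, 4.5, 11.0, 12.0, 15.0])

# Linear fit: usage ~ a*year + b
a, b = np.polyfit(years, usage, 1)
sse_linear = float(np.sum((usage - (a * years + b)) ** 2))

# Logarithmic fit: usage ~ c*log(year - 2010) + d
x = np.log(years - 2010.0)
c, d = np.polyfit(x, usage, 1)
sse_log = float(np.sum((usage - (c * x + d)) ** 2))

print(f"linear SSE: {sse_linear:.2f}   logarithmic SSE: {sse_log:.2f}")
```

Quoting the two error values in the justification ("the linear fit had SSE of X versus Y for the logarithmic fit") is far more persuasive than calling the alternatives "highly problematic" without evidence.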

- Do you think the above assumption is a good assumption to make for this problem? If yes, rewrite the justification to improve the argument. If no, write down a different assumption and justification and consider how this team’s model might change as a result.

The next submission we will examine was given a mid-level score. The team starts out with a list of very good assumptions along with some very brief justifications, like

- Assumption: Teens had the same access to cigarettes as they do to vaping. Justification: This allows for equal comparisons of the two forms of nicotine transmission.
- Assumption: Once a health issue is discovered, the vaping growth rate will decrease similarly to the decrease of cigarette usage after 1964. Justification: This can be assumed because of the known detrimental effects of nicotine.

The first justification is quite vague. As a reader, I am unsure as to what “equal comparison” means. Are they trying to define what a “cigarette user” and a “vape user” are so the dependent variables can be compared? Or, are they trying to equate a specific type of cigarette purchase to a vape purchase? Also, as I read the rest of the report, I am unsure as to exactly how “access” factors into their model. These are components of assumptions that appear in plenty of other submissions, so having them is not a bad idea at all. However, if we consider our submission as a series of logical arguments, it is important to consider how your assumptions flow into the later parts.

The team then describes their model; they created a compound interest formula for the growth in percentage of both cigarette and vape usage. They then go through the calculations necessary to reach their result, as shown below:

This is a good place to highlight that your submission is not the same as your homework. While your instructors may care about the nuts and bolts of your computations, judges want to see just enough work that your results are reproducible. In this case, a compound interest model is something I feel does not warrant space for computations. This space might better be used in adding a table or figure, or in adding more details to the model explanation or justifications.
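For reference, a compound-growth projection of the kind this team used takes only a few lines, which is part of why lengthy hand calculations add little. The starting percentages and annual rates below are hypothetical, not the team's:

```python
def compound_growth(p0, rate, years):
    """Percentage of the population after `years`, starting at p0
    and growing (or shrinking) at a fixed annual rate."""
    return p0 * (1 + rate) ** years

# Hypothetical starting points and annual rates
vape_2028 = compound_growth(15.0, 0.06, 10)   # vaping grows 6%/yr
cig_2028 = compound_growth(14.0, -0.04, 10)   # smoking shrinks 4%/yr
print(round(vape_2028, 1), round(cig_2028, 1))  # → 26.9 9.3
```

A single line like this, plus a table of projected values, communicates the model far more efficiently than pages of worked arithmetic.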

If you do feel that your model is quite complex, you might consider giving a short sample scenario. It’s good to keep these sample calculations in your notes and then tentatively include them in the report, but if your report is too long then sample calculations are good things to consider removing first.

As mentioned above, a majority of teams tackled this problem using a regression approach. This has many advantages, one of which is that it is easy to implement and write about. However, if you would like your model to more deeply explain how individual factors build to population-level dynamics, then more advanced mathematical methods could be beneficial. For this section of the blog, we will briefly examine more advanced modeling frameworks. When used correctly, these models can allow a team to make meaningful connections between the inputs and outputs of the model.

However, this is not a recommendation to build a highly involved model that your team is not comfortable with. Plenty of high scoring and winning submissions use simple statistical/mathematical approaches and solid arguments. So, if your team is uncomfortable using a specific type of math, then don’t use it!

A few teams created what is known as a Susceptible, Infected, and Recovered (SIR) model to model the spread of nicotine use over time. This is an approach that may be beyond the scope of your mathematical education thus far, so we won’t dwell too much on it. If you would like to know more, consider watching the fourth installment of our Essentials of Math Modeling Series. The important thing to point out about simple infectious disease models is that you can generate functions that look very close to (or in some cases, are exactly) the functions we are using in our regression methods!
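To make the idea concrete, here is a minimal forward-Euler SIR sketch, with "infection" read as taking up vaping and "recovery" as quitting. The rates are invented for illustration, and the teams themselves worked in other tools:

```python
def simulate_sir(beta, gamma, s0, i0, r0, days, dt=0.1):
    """Forward-Euler integration of the classic SIR equations:
       dS/dt = -beta*S*I,  dI/dt = beta*S*I - gamma*I,  dR/dt = gamma*I,
    with S, I, R kept as fractions of the population."""
    s, i, r = s0, i0, r0
    history = [(s, i, r)]
    for _ in range(int(days / dt)):
        new_cases = beta * s * i      # susceptibles who start vaping
        quits = gamma * i             # users who quit ("recover")
        s, i, r = s - new_cases * dt, i + (new_cases - quits) * dt, r + quits * dt
        history.append((s, i, r))
    return history

# Invented rates; 1% of the population vapes initially
traj = simulate_sir(beta=0.3, gamma=0.1, s0=0.99, i0=0.01, r0=0.0, days=200)
peak_users = max(i for _, i, _ in traj)
```

Plotting the "Infected" component of `traj` against time produces exactly the rise-then-decline curves discussed below, which is why these models can reproduce the shapes that regression methods fit directly.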

The first, second, and fifth place teams all utilized differential equations in their solutions, which can be found here. The submission we will be looking at next was not a winning submission, but was quite close. The team has a nice setup of their model and explains it well. For example, they provide the following table for their formulae and parameters:

Next, the team spends quite a bit of the submission (perhaps too much) explaining how they compute an important parameter in SIR models, called R0, from the available data. This parameter is a measurement of how many people, on average, a single infectious person ends up infecting. Finally, they give the following plot as their main result:

Like the figures in the submissions that utilize regression, we can visually see the increase in the “Infected” group (corresponding to active users of e-cigarettes) up to 10 years, then a slight decline. The group does not go into a sensitivity analysis, and they have a short section on strengths and weaknesses for their entire submission at the end, which does not discuss any future directions.

As we mentioned earlier, a sensitivity analysis, particularly on this all-important R0 parameter, might be a good idea to include to show how much variability exists in your model. Also, it is a good idea to refer back to your assumptions and discuss how they match up with your results, and where you could alter your assumptions or conduct further testing in the future. Thinking about the logical arguments you are making, as well as the problem outside the context of the competition, are things readers would love to see!

These questions ask you to compare and contrast the submission we just looked at and the first, second, and fifth place submissions, which you can find here.

- What similarities and differences do you note in the assumptions? What assumptions are made in the SIR models that are not made in the regression models, and vice versa?
- What similarities and differences do you note in the figures? How does the formatting look (i.e. is the text big enough, are the different curves clearly labeled, is there spacing between table entries, etc.)? If you looked only at a team’s figures and tables (and their captions), could you understand the team’s results?
- Consider each team’s model. Is it clear what they are trying to do? Are the variables clearly marked or labeled in some way? Is there an aspect of their model or their results that go against the team’s assumptions?

We hope that these two “You’ve Got to be Modeling Me” blog posts give you a roadmap for getting started with this year’s M3C. Of course, each competition consists of three parts. Parts 2 and 3 are much more open ended, will often require extra research, and really benefit from teams dividing up tasks. Stay tuned for future resources from me and Wesley for those parts as well!

What is certainly applicable to all parts of the M3C (and math modeling in general) is that the arguments that are made in your submission are just as important (if not more important) than the mathematics. Quality submissions help your reader understand how your modeling choices factor into your results and highlight strengths and weaknesses of your work. Who knew writing would be so useful in doing math?

Are you a content creator looking to make your content thumbnails more eye-catching? Joining us today are Nathan Fong and Stuart Fong from Queen's University in Canada! Read on to learn more about how their hack can help you! Over to you, guys…

Hi everybody, Nathan and Stuart Fong here! We are second-year computer science students at Queen’s University located in Ontario, Canada. We are hackathon lovers who enjoy learning about data science and machine learning. On July 15-17, we participated in the SelfieHacks II hackathon hosted by Major League Hacking (MLH) and created a project called YouTube Creator Assistant, which won the prize for best use of MATLAB.

Going into SelfieHacks II, we had no idea what to make, but we knew that we wanted to create a project that empowered content creators. During our brainstorming session, we wondered, “What is something that all content creators struggle with?” and came up with the idea of helping content creators grow their communities. We then narrowed the scope to helping YouTube creators reach a wider audience with their content.

If we wanted to track how many people are actively engaging with a channel, one of the best indicators is the view count of their videos. As views and subscribers are main indicators of the success of a video or channel, we wanted to make a tool that increases these numbers. This then increases the exposure of their videos to new users, allowing the channel to grow. Thinking in this way, we finally came up with our project idea, which we called YouTube Creator Assistant.

We started with looking at the YouTube homepage and identifying what elements would persuade a user to click on a certain video over another, such as the title and thumbnail. In our program, we wanted to take these components of the video to generate a predicted view count. The user can test various combinations of components such as thumbnails, titles, video duration and categories to maximise the number of views. While editing can be done with trial and error after the video has been published, our solution allows it to be done beforehand. Views can then be gained more easily during the time that is most crucial: right at the beginning.

We used the Youtube Thumbnail and Youtubers Saying Things datasets, but before we could use them, we had to clean the data. To start off, some of the columns were unneeded, such as the video link and transcript. While we could use the video itself to pull some features, we decided against it for now and deleted those columns. Moving on, some of the variables were not in usable formats: viewer and subscriber counts were abbreviated, and the video length was in HH:MM:SS format. Fixing this in MATLAB was very convenient, as we could keep the data table open beside us, allowing us to see changes in real time.
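The two conversions described here are standard in any language. Below is a hedged Python sketch of what they might look like; the helper names and the exact abbreviation scheme (K/M/B suffixes) are our assumptions, not the team's MATLAB code:

```python
def parse_count(text):
    """Convert an abbreviated count like '1.2M' or '530K' to an integer."""
    text = text.strip().upper()
    multipliers = {"K": 1_000, "M": 1_000_000, "B": 1_000_000_000}
    if text[-1] in multipliers:
        return int(round(float(text[:-1]) * multipliers[text[-1]]))
    return int(float(text))

def parse_duration(text):
    """Convert 'HH:MM:SS' (or 'MM:SS') to a total number of seconds."""
    seconds = 0
    for part in text.split(":"):
        seconds = seconds * 60 + int(part)
    return seconds

print(parse_count("1.2M"), parse_duration("00:10:30"))  # → 1200000 630
```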

To create our model, we first looked at the types of data we had, which included images for the thumbnails, language data for the titles, and tabular data for the rest of the information. For the thumbnails, we used a convolutional neural network (CNN) to identify eye-catching elements of the image (AKA clickbait). Next, for the titles, we extracted features that we thought were useful, such as the length and the percentage of capital letters. Finally, for the tabular data, we used a fully connected neural network to predict how each variable relates to the resulting number of viewers. Then, we combined the outputs of the two networks, giving us the predicted viewer count.
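The title features mentioned here can be computed in a few lines. This Python sketch mirrors the description above; the function name, the example title, and the exact feature set are illustrative, not the team's implementation:

```python
def title_features(title):
    """Simple numeric features extracted from a video title:
    total length, fraction of letters that are uppercase,
    and number of exclamation marks (a rough 'clickbait' signal)."""
    letters = [c for c in title if c.isalpha()]
    pct_caps = sum(c.isupper() for c in letters) / len(letters) if letters else 0.0
    return {
        "length": len(title),
        "pct_caps": pct_caps,
        "exclamations": title.count("!"),
    }

features = title_features("You WON'T Believe This MATLAB Trick!!")
print(features)
```

Features like these become extra columns in the tabular data, so the fully connected network can weigh them alongside duration, category, and subscriber count.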

After being introduced to MATLAB during a workshop at Local Hack Day: Build 2022, we wanted to try using one of the tools, Deep Network Designer, to build our neural networks. While using it, we saw how easy it was to prototype our model. The drag-and-drop interface allowed us to quickly change or swap out our layers without lowering the readability of our code. The process was as easy as creating how our model looked, choosing the input and output datastores, and then starting the training.

To deploy our model, we wanted to use Gradio as it is a web interface that we were more familiar with. The problem with this is that our model was created using MATLAB while Gradio uses the Python programming language. Luckily, MATLAB offers something called MATLAB Engine, which allows us to run MATLAB code in Python. To do this, we first installed it, and then imported it using the following code:

import matlab.engine

eng = matlab.engine.start_matlab()

We were then able to take inputs from our Gradio web app in Python, feed them into our MATLAB model, and output the predicted view count as a Python integer.

We tested our model by creating a fake thumbnail and filling in some details about our hypothetical video and channel. We then tried changing the thumbnail and title to one that we thought would attract more viewers and as expected, the predicted number of views increased!

Overall, our finished model performed well on new inputs: a more “clickbait-y” thumbnail or title is predicted to have a greater number of views. Despite this, we found during testing that it has some difficulty outputting an accurate prediction of the viewer count for channels with a small number of subscribers. This is fine for the intended purpose of the model, but we feel that it would benefit from some additional data, as the current data only features the most trending and popular creators. In the future, we plan to give the model more data specifically containing YouTube channels with fewer subscribers, so that the model can better identify how a specific feature impacts the resulting viewer count. Watch this video to see how our code works.

Compared to Python, we found that MATLAB was easier to use for prototyping, as there were many built-in functions to make coding quick and easy. The huge amount of documentation reduced the difficulty of trying new things, allowing us to explore more of MATLAB’s many features. YouTube Creator Assistant was a fun project to work on and we learned a ton about MATLAB’s features for data science and machine learning, as well as its Deep Network Designer and MATLAB Engine.

If you have any comments or questions about this project, feel free to reach out to us! Our code is available on GitHub, and you can see more about this project on our Devpost submission page.

]]>

Joining us today is Wesley Hamilton, who is a STEM Outreach Engineer here at MathWorks. Wesley will talk about tackling the 2019 MathWorks Math Modeling Challenge (M3C) problem. Wesley, over to... read more >>

]]>Joining us today is Wesley Hamilton, who is a STEM Outreach Engineer here at MathWorks. Wesley will talk about tackling the 2019 MathWorks Math Modeling Challenge (M3C) problem. Wesley, over to you…

Hello all, I’m Wesley and in this blog post we’ll start tackling the 2019 MathWorks Math Modeling Challenge (M3C) problem. The M3C is a free, annual math modeling competition for high school (HS) juniors and seniors in the U.S. and sixth form students in England and Wales, and is a program of the Society for Industrial and Applied Mathematics (SIAM). More information about the M3C can be found on their website, including two free workbooks aimed at preparing participants and their coaches to take part in the M3C!

Here we’ll tackle the 2019 M3C problem by building an initial model for the first part of the challenge. To do this we will make use of best modeling practices, including identifying variables and making assumptions, as well as use MATLAB’s data analysis and curve fitting toolboxes to support our modeling process. Note that this approach isn’t the only way teams might approach this problem (though some of the top submissions did something similar!); our goal is to see what a start might look like.

M3C problems consist of three parts. The first part is designed to be doable by most, if not all, teams. The next two parts typically build on work from the first part, but feature more open-ended questions that let teams showcase their creativity and technical prowess when developing models and solutions to the stated questions. As I mentioned, we’ll focus in on part 1 of the 2019 M3C problem and set ourselves up with a solid start for tackling the rest of the challenge.

The general outline we’ll follow is

- read the problem in its entirety,
- formulate a plan to answer the problem,
- carry out our plan.

Throughout this process we’ll make use of MATLAB, a scientific and engineering programming language developed by MathWorks. If you’re new to MATLAB (or programming), MathWorks has the free MATLAB Onramp course to help you get started. Teams taking part in the M3C can request complimentary MATLAB licenses for the competition. Some of these resources (and much more!) can be found on the M3C’s “Learn Technical Computing” website.


Let’s start by reading the part 1 problem statement in its entirety (in the actual competition we’d read the entire problem). In addition to setting the stage for why someone might want to know the answer to these questions, the introduction often contains pieces of information that will be helpful to us as we answer the questions.

We’ll start by reading the introduction for context of the problem:

Substances such as tobacco, alcohol, and narcotics can affect the physical and mental health of users. The consequences of substance abuse, both financial (health care, the criminal justice system, workplace productivity, etc) and non-financial (divorce, domestic abuse, etc), ripple through society and affect more than just the user. The effects of substance abuse on individuals and society have come to the forefront recently as opioid addiction has become prominent [1].

We see they provide a reference to an NPR article on Opioid overdoses in the United States. Let’s hold off on looking at the article until we finish reading the problem statement:

Efforts, such as taxes and regulations on cigarettes and the Drug Abuse Resistance Education program, have been made at the local, state, and national level to educate, control, and/or restrict the consumption of such substances. Such efforts need to start with an understanding of how substance abuse spreads and affects some individuals more than others.

Now to part 1, what we are actually being asked to do:

Darth Vapor—Often containing high doses of nicotine, vaping (inhalation of an aerosol created by vaporizing a liquid) is hooking a new generation that might otherwise have chosen not to use tobacco products. Build a mathematical model that predicts the spread of nicotine use due to vaping over the next 10 years. Analyze how the growth of this new form of nicotine use compares to that of cigarettes.

The last two sentences are key, and worth repeating: “Build a mathematical model that predicts the spread of nicotine use due to vaping over the next 10 years. Analyze how the growth of this new form of nicotine use compares to that of cigarettes.”

So, our task is to model the spread of nicotine use due to vaping and compare our model to the usage of cigarettes. Some questions I already have in mind are: how do we measure the spread or use? Do we keep track of vape usage by the number of cartridges used? By number of users?

If you don’t know how vaping works, this would be an opportunity to perform some research to inform some of the questions we’re coming up with. Let’s open a new MATLAB Live Script to start taking notes, starting with the tasks of part 1; let’s convert this to “text”, and then copy over the two key sentences.

Now let’s look at the provided data. Let’s start with the Figure on historical cigarette usage. “Adult* per capita cigarette consumption and major smoking and health events, United States, 1900–2012.”

The first thing we should take note of as we look at this graph is the labels on the axes. The x-axis has time in years, and the y-axis is labeled “per capita number of cigarettes smoked per year”. If you’re not sure what per capita means, you should take a moment to do a search for that term so you can fully understand the meaning of the data you have. A quick search turns up that the phrase per capita is often used to replace the words “per person”. So, this data gives us the average number of cigarettes consumed per person each year.

This data does not tell us what percentage of the population smokes. However, we can make an assumption that if we did have that data, it might “look” similar to this data in terms of shape. Let’s capture that assumption in our Live Script. If we have time later on, we might search for data to support this assumption.

Speaking of shape, next we might examine the shape of the curve. It starts near zero on the y-axis, hits a peak in the 1960s and 1970s, and then decreases. The shape of this curve reminds me of a Bell curve, or Normal distribution, also called a Gaussian, from probability theory.
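As a quick sketch of that shape, here is a Gaussian of the form a*exp(-((x-b)/c)^2) plotted over the same span of years; the peak year and width below are made up for illustration, not fitted to the actual cigarette data:

```matlab
% Illustrative only: a Gaussian with a made-up amplitude (45),
% peak year (1965), and width (25 years)
x = 1900:2012;
y = 45*exp(-((x - 1965)/25).^2);   % a*exp(-((x-b)/c)^2)
plot(x, y)
xlabel('year')
ylabel('per capita cigarettes (illustrative)')
```

It rises from near zero, peaks in the 1960s, and decays back toward zero, matching the qualitative shape of the historical curve.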

Now let’s take a look at the provided data, starting with the “high_school_vaping_data.xlsx” data. The title of the actual spreadsheet is “Percentage of High School Students who used e-cigarettes in the past 30 days, by gender and race/ethnicity; National Youth Tobacco Survey, 2011-2015”.

This data looks to be for all HS students (9th through 12th grade), and splits it up based on gender and race/ethnicity. Already it looks like we’ll need to do a little pre-processing if we want to read in the data, because it includes the percentage along with a confidence interval in parentheses in each cell (think of a confidence interval as the error bounds coming from the survey sampling); for MATLAB to happily read it in, there should be a single number in each cell. We also don’t know (without extra research) what percentage of HS students are White or Hispanic/Latino, so we might have to put in extra work if we want to mix and match the data objects that aren’t in the “Overall” row.
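One way to handle those mixed cells programmatically (a sketch, assuming each cell reads like “24.1 (18.2 - 31.1)”; the exact cell format in the spreadsheet may differ) is to keep only the leading number:

```matlab
% Hypothetical cell contents: a percentage followed by a
% confidence interval in parentheses
raw = ["24.1 (18.2 - 31.1)"; "13.6 (10.9 - 16.9)"];

% Keep only the text before " (" and convert it to a number
pct = str2double(extractBefore(raw, " ("))
```

This gives a numeric column MATLAB can work with directly, though manually re-entering a handful of values into a clean sheet (as we do below) works just as well for a dataset this small.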

Finally, let’s look at the “NIH-DrugTrends-DataSheet.xlsx”.

This contains a lot more data on various drugs, but especially relevant for us we see the row corresponding to “any vaping in the past month”. This data is for the years 2015 to 2018, and is split amongst 8th grade, 10th grade, and 12th grade. Also, because of how the data is formatted, we’ll have to perform a little pre-processing before we can read it into MATLAB (like the other spreadsheet). Since the other data is for the years 2011 – 2015, we might want to merge the two datasets so we have data for HS vape usage from 2011 – 2018, which gives our model more to work with. We can justify this by making an assumption that “the National Youth Tobacco Survey and NIH Drug Trends data are comparable”. We’ll have to decide which dataset we use for the 2015 data, but we can do that after we’ve read the data in and are starting to build the model.

Now that we’ve taken a look at the provided data, let’s formulate an initial plan of attack to get started with a solution. For assumptions so far we have:

- cigarette per capita usage is comparable to number of people smoking (so that when we go to compare e-cigarette usage with cigarette usage, we can tie in that figure of historical cigarette usage), and
- the National Youth Tobacco Survey and NIH Drug Trends data are comparable (so that we can combine the two datasets so we have more data to work with).

For a course of action, let’s:

- Pre-process the data and read it in.
- Try to fit some curves to the data to predict usage in the future. We might hope that fitting a Gaussian to the data works best, so that we have some basis to say that e-cigarette usage will mimic historical cigarette trends, but let’s see how the models behave first.
- Write a short summary, which we could then copy and paste into our final report.

Now let’s preprocess the data. Since the spreadsheet has the percentages with confidence intervals in the same cells, it’s easiest if we manually re-enter the data into a different sheet. This will make importing the data significantly easier later on. I’ve also incorporated the confidence intervals by including the low (lower bound of the interval), medium (reported survey result), and high (upper bound of the interval) estimates in the spreadsheet.

Next, let’s do the same thing with the 10th and 12th grade data from the NIH data. Since we want to compare vaping to cigarette usage, let’s record both the e-cigarette and cigarette data. In this case, let’s actually use the “Any vaping” in the “past month” data, where we’re assuming vaping and e-cigarette usage are comparable data. Otherwise we’re going to do the same thing as before: copy the values into a new spreadsheet page so they’re easy to read in.

With the data ready, let’s use MATLAB to import everything we’ll be using. For this, we’re going to use the “Import Data” app on the home bar, as demonstrated below. For now we’ll just plan to use the “med” values and save the confidence intervals (“low” and “high” values) for later, after we’ve made a solid start and looked at later parts of the challenge.

To summarize the steps, we:

- opened the “Import Data” app from the Home toolbar,
- selected the spreadsheet “high_school_vaping_data.xlsx” which has the data we want to import,
- selected Sheet 2 of the spreadsheet, and selected the row of data we want to start with (since it’s only one row, let’s not bother with importing the column labels, and since we don’t need to worry about row labels, we’ll just import the numeric data),
- clicked the “Import Selection” dropdown menu and selected “Generate Script” (so that we can easily modify the code later and not have to go through the app each time),
- changed the name of the table that is imported into MATLAB, converted it to an array, and then ran the code to verify we imported the data correctly.

Since we didn’t include a semicolon after “HSData = table2array(HSTable)”, MATLAB prints the result of that line of code, which shows us the contents of the array “HSData”, the data we expected to read in.

Now let’s repeat the same process with the NIH data. Here, we’re recording 10th and 12th grader data for vaping and cigarette usage. As before, we select the correct spreadsheet, go to sheet 2, select the data (ignoring the data labels), and click “import selection” and then “generate script”. As before, let’s convert this data to an array, this time called “NIHData”.

Note that we also changed the view style for the Live Script, so that the output of code appears inline and not to the side of the editor. Also, since we’ve copied what we needed from the untitled.m files we can go ahead and close those tabs.

One point worth mentioning here: this is a relatively small dataset, and we could have analyzed the data in Excel directly and/or manually typed the data into MATLAB. The Import Data app is incredibly powerful when you start dealing with much larger datasets, though, especially if you want to keep track of column or row labels, and is worth knowing about and getting comfortable with.

Once all of that is done, we can get started on our model! The code we generated is included here for reference (make sure the excel spreadsheets are located in the same folder as this Live Script):

%% Set up the Import Options and import the data

opts = spreadsheetImportOptions("NumVariables", 6);

% Specify sheet and range

opts.Sheet = "Sheet2";

opts.DataRange = "A3:F3";

% Specify column names and types

opts.VariableNames = ["Var1", "VarName2", "VarName3", "VarName4", "VarName5", "VarName6"];

opts.SelectedVariableNames = ["VarName2", "VarName3", "VarName4", "VarName5", "VarName6"];

opts.VariableTypes = ["char", "double", "double", "double", "double", "double"];

% Specify variable properties

opts = setvaropts(opts, "Var1", "WhitespaceRule", "preserve");

opts = setvaropts(opts, "Var1", "EmptyFieldRule", "auto");

% Import the data

HSTable = readtable("high_school_vaping_data.xlsx", opts, "UseExcel", false);

HSData = table2array(HSTable)

%% Clear temporary variables

clear opts

%% Set up the Import Options and import the data

opts = spreadsheetImportOptions("NumVariables", 5);

% Specify sheet and range

opts.Sheet = "Sheet2";

opts.DataRange = "A2:E5";

% Specify column names and types

opts.VariableNames = ["Var1", "VarName2", "VarName3", "VarName4", "VarName5"];

opts.SelectedVariableNames = ["VarName2", "VarName3", "VarName4", "VarName5"];

opts.VariableTypes = ["char", "double", "double", "double", "double"];

% Specify variable properties

opts = setvaropts(opts, "Var1", "WhitespaceRule", "preserve");

opts = setvaropts(opts, "Var1", "EmptyFieldRule", "auto");

% Import the data

NIHTable = readtable("NIH-DrugTrends-DataSheet.xlsx", opts, "UseExcel", false);

NIHData = table2array(NIHTable)

%% Clear temporary variables

clear opts

Now that our Live Script is growing, let’s add in a few section breaks and headers so it’s easier for us to revisit and use our code. Based on our strategy for tackling this problem, our sections will be:

- Report (where we write our assumptions and conclusions),
- Reading in data
- Vape model (where we predict future vape usage),
- Comparison to cigarettes.

Our plan here will be to model vape usage by fitting a curve to the data we’ve imported, and use the fitted curve to say something about future vape usage. Two considerations arise:

- The NIH data is split between 10th and 12th grade students. How do we use this data to say something about high school students as a whole?
- We have 2 datasets (the high school vaping data and the NIH data) that cover the years 2011 to 2015, and 2015 to 2018. Can we combine the data in some way to have more data points with which to build the model?

For the first consideration, let’s start by averaging the 10th and 12th grade data from the NIH for each year, and use that as an approximation for all of high school. Implicitly we’re assuming that there are approximately the same number of 10th and 12th grade students in the study (so we can average the percentages) and that 10th and 12th grade students provide a reasonable approximation of how all high school students would behave; let’s actually add this assumption to our running list of assumptions, just in case (we may also want to break it up into two assumptions, but we can decide that later).

We’ll use MATLAB’s mean function to quickly average the percentages by columns, which correspond to years, of the first two rows of NIHData.

NIHHSData = mean(NIHData(1:2,:))

For the second consideration, we’re going to take the HS data from 2011 to 2014, and use the NIHData from 2015 to 2018. Here we’re assuming that the methods of surveying are comparable for the two datasets, so it makes sense to merge them. Without looking into the studies themselves, one justification for this we can provide is that the NIH data 2015 value we have (15.25) is within the confidence interval for the HS vaping data (14.1 – 18).

mergedData = [HSData(1:end-1), NIHHSData]

This is a quick and handy trick for combining arrays into new ones, which is something you pick up the more you use MATLAB. More info about working with arrays can be found in MathWorks’ Help Center. As a sanity check, let’s plot the data we’ve imported using the following code:

% visualize the imported data

years = 2011:2018;

plot(years,mergedData)

Our Live Script should show us a plot of vape usage per year, as demonstrated below. In particular, vape usage seems to be on the rise.

Now that we’ve preprocessed and verified our data, let’s use the built-in “Curve Fitter” app. Once the app opens we need to specify what data we want to fit a curve to by clicking the “Select Data” button. We want “years” as the X data and “mergedData” as Y data. The app automatically tries to fit a 1st degree polynomial (line) to the data, and we can play around with some options to see how 2nd and 3rd degree polynomials approximate the data as well.

As soon as we specify the data to be used, we see the data plotted with a line (1st degree polynomial, behind the data selector window) automatically fitted. Is a line a reasonable model for vape usage? Well, it seems to “fit” the data well enough, but what happens as time increases? Since we’re modeling the percentage of HS students vaping, as time increases our model tells us that near 2045, 100% of HS students will be vaping, near 2050 approximately 114% of HS students will be vaping, and so on. This doesn’t make sense for our problem, and 2nd and 3rd degree polynomials don’t seem to do any better: a polynomial fit doesn’t seem like the right way to go.
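To see the problem concretely, here is a small sketch with made-up percentages (not the actual survey values): a straight-line fit to rising data keeps climbing without bound, so the predicted percentage eventually passes 100:

```matlab
% Made-up vape usage percentages rising 2 points per year
years = 2011:2018;
usage = 10:2:24;

% Fit a 1st degree polynomial (a line) and extrapolate far forward
p = polyfit(years, usage, 1);
polyval(p, 2060)   % well above 100%, which is impossible for a percentage
```

Any polynomial eventually runs off to plus or minus infinity, which is why it can’t capture a quantity bounded between 0 and 100 over long time horizons.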

If we think back to our earlier reading of the problem, the historical cigarette usage data looked like a Gaussian. Luckily, Gaussians are a built-in option in the curve fitting app, so let’s try that next.

If we click “Gaussian” in the “Fit Type” menu, our model is automatically generated and looks reasonable! From the plot we see that HS vape usage will peak near 2023 at 30%, and then tend towards 0% usage by around 2040. We’re happy with this for now, so let’s export the code to generate the fitted Gaussian and plot and copy it into our notebook. To do this, we want to click the “Export” button and then “Generate Code”. We can ignore the first few lines and just copy everything underneath “%% Fit: ‘untitled fit 1’.”

Note that this generated code also plots our model. Let’s change some of the settings so the plot is more descriptive:

- Change the title to “Vape usage prediction”,
- Plot the original data for comparison,
- Extend the range of where we want the fitted curve to be plotted, say 2010 to 2040 by specifying those years and then plotting that data,
- Add a legend for the original and fitted data,
- Add a descriptive Y axis label,
- Ensure the extended years are displayed in the plot.

Voila, we have our model for high school vape usage over the next 10+ years, addressing the first of two tasks for part 1 of this challenge. The model in question is a Gaussian, whose equation we see on the right in the “Curve Fitter” app: a1*exp(-((x-b1)/c1)^2), with a1 = 31.9, b1 = 2023, c1 = 7.598. In our report we’ll want to report this model, which we’ll come back to at the end of this post. The code we generated is included here for reference (make sure the excel spreadsheets are located in the same folder as this Live Script):

NIHHSData = mean(NIHData(1:2,:))

mergedData = [HSData(1:end-1),NIHHSData]

years = 2011:2018

plot(years,mergedData)

%% Fit: 'untitled fit 1'.

[xData, yData] = prepareCurveData( years, mergedData );

% Set up fittype and options.

ft = fittype( 'gauss1' );

opts = fitoptions( 'Method', 'NonlinearLeastSquares' );

opts.Display = 'Off';

opts.Lower = [-Inf -Inf 0];

opts.StartPoint = [24.2 2018 1.82425684047535];

% Fit model to data.

[fitresult, gof] = fit( xData, yData, ft, opts );

% Plot fit with data.

figure( 'Name', 'Vape usage prediction' ); % change the title

plot(years,mergedData) % plot the original data

hold on % don't generate a new figure when we plot other stuff

extendedYears = 2010:2040; % extend the range of the prediction

plot(extendedYears,fitresult(extendedYears))

% fitresult is a function, so fitresult(extendedYears) are the values of our

% model on the values in the array extendedYears

legend('HS vaping data', 'model prediction'); % add a legend for the original and fitted data

% Label axes

xlabel( 'years', 'Interpreter', 'none' );

ylabel( 'percentage of HS students vaping', 'Interpreter', 'none' ); % add a descriptive label

xlim([2010 2040]) % ensure the extended years are displayed in the plot

grid on

The other task for part 1 is to compare the HS cigarette usage to the historical trend, and the strategy we’ll start with is fitting a Gaussian to that data and comparing to the historical trend. This would provide further evidence that

- the NIH data is reliable, and
- the Gaussian is a reasonable model to use for cigarette usage and, hence, vape usage.

To do this we’ll use the same pipeline as above:

- use MATLAB’s mean function to average the third and fourth rows of NIHData, which have the cigarette usage data for 10th and 12th graders from 2015 to 2018,
- use the curve fitter app to fit a Gaussian to the data.

The first step looks just like it did before: isolate the 10th and 12th grade cigarette usage data and take the column-wise mean to get our HS cigarette usage estimate.

cigYears = 2015:2018;

cigData = NIHData(3:4,:);

cigDataHS = mean(cigData)

The second step looks similar to before: open up the curve fitter app, specify the x and y data to use, and choose “Gaussian” under “Fit Type”. Note that we also specified a new vector with just the years 2015 – 2018, since we’ll need to tell the curve fitter app which X data we want to use and the Y data in this case isn’t the entire period 2011-2018.

When we zoom out this time, however, the model looks a little less realistic: this Gaussian suggests that HS cigarette usage peaked around 2012, and around 1996 was close to 0 percent usage. We would probably expect to see a higher percentage of HS cigarette usage further back in time, so maybe we won’t include this model in our report just yet. One possible reason for this model not matching our expectations is that the amount of data we used is relatively small: in the vape usage model we had 8 data points to fit our model with, whereas here we only have 4. We’ll go ahead and save the code and figure we generated so they’re easy to return to. As we did above, we’ll change some of the settings so the plot is more descriptive:

- Change the title to “Cigarette usage prediction”
- Plot the original data for comparison
- Extend the range of where we want the fitted curve to be plotted, say 1990 to 2020 by specifying those years and then plotting that data
- Add a legend for the original and fitted data
- Add a descriptive Y axis label

We’re not ready to include this model or plot in our report, but at least we have the code ready to go and modify if/when we find supplementary data to improve our model. The code we generated is included here for reference (make sure the excel spreadsheets are located in the same folder as this Live Script):

cigYears = 2015:2018;

cigData = NIHData(3:4,:);

cigDataHS = mean(cigData)

%% Fit: 'untitled fit 1'.

[xData, yData] = prepareCurveData( cigYears, cigDataHS );

% Set up fittype and options.

ft = fittype( 'gauss1' );

opts = fitoptions( 'Method', 'NonlinearLeastSquares' );

opts.Display = 'Off';

opts.Lower = [-Inf -Inf 0];

opts.StartPoint = [8.85 2015 2.00542812275561];

% Fit model to data.

[fitresult, gof] = fit( xData, yData, ft, opts );

% Plot fit with data.

figure( 'Name', 'Cigarette usage prediction' ); % change the title

plot(cigYears,cigDataHS); % plot the original data

hold on

extendedYears = 1990:2020; % specify an extended plot range earlier in time

plot(extendedYears, fitresult(extendedYears));

% fitresult is a function, so fitresult(extendedYears) are the values of our model on earlier years

legend('HS cigarette data', 'model prediction', 'Location', 'northwest') % add a legend

% Label axes

xlabel( 'cigYears', 'Interpreter', 'none' );

ylabel( 'cigDataHS', 'Interpreter', 'none' ); % add a descriptive Y axis label

grid on

One aspect of the competition not yet discussed is the fact that it’s a team effort, in that (likely) it will be you and 1-3 of your colleagues tackling this question together. So, as you and your teammates continue on in the day, one task someone could take on is to refine the cigarette usage model. This might mean finding more data points to build a more accurate model, or trying a different curve to fit the model, or exploring a completely different approach! We could also revisit the model we’re happy with to show we tried other reasonable models; polynomials didn’t seem to give us reasonable results, but maybe a logistic curve or Weibull curve (whatever that is) would give other reasonable models.

Another task for a teammate could be to revisit and add justification to the two assumptions we wrote down: does per capita cigarette usage correlate with the number of cigarette users (maybe as a percentage of total population)? Can we find data about the number or percentage of cigarette users throughout history, maybe even HS cigarette usage data or charts? Why are the National Youth Tobacco Survey and NIH Drug Trends datasets comparable/mergeable? Etc.

At this point we’ve more-or-less answered part 1, at least enough for us to be comfortable that we have something we can write about and so we can get started on part 2 of the question. Before reading and starting on part 2, let’s collect our thoughts and outline our solution (so far) to part 1:

To model the spread of nicotine usage due to vaping over the next 10 years, we use survey data from the National Youth Tobacco Survey and NIH to fit a Gaussian model on percentage of HS students vaping:

percentage of HS students vaping = a1*exp(-((x-b1)/c1)^2),

where a1 = 31.9, b1 = 2023, c1 = 7.598, and x is the year of interest. Our model predicts vaping usage will peak with approximately 32% of HS students vaping by the year 2023, after which usage will decline until nearly 0% usage by 2040. The rest of this section describes in more detail our methods and assumptions that support this model. [Here we would include assumptions, discuss the model development including the rationale for choosing a Gaussian, etc.]
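Written as code, the fitted model is a one-line anonymous function (using the coefficients reported by the Curve Fitter app), which makes it easy to evaluate at any year of interest:

```matlab
% Gaussian model from the Curve Fitter app: a1*exp(-((x-b1)/c1)^2)
a1 = 31.9; b1 = 2023; c1 = 7.598;
vapeModel = @(x) a1*exp(-((x - b1)./c1).^2);

vapeModel(2023)   % peak usage: 31.9 percent
vapeModel(2040)   % well under 1 percent, i.e. close to 0
```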

We’re not done with part 1, but we have a solid start. Make sure to check out part 2 that examines some student submissions to this challenge, and some of the models their teams built!

In today’s blog, Grace Woolson gives us an insight into how you can get started with using Machine Learning and MATLAB for Weather Forecasting to take on the WiDS Datathon 2023 challenge. Over... read more >>

]]>In today’s blog, Grace Woolson gives us an insight into how you can get started with using Machine Learning and MATLAB for Weather Forecasting to take on the WiDS Datathon 2023 challenge. Over to you, Grace…

Today, I’m going to show an example of how you can use MATLAB for the WiDS Datathon 2023. This year’s challenge tasks participants with creating a model that can predict long-term temperature forecasts, which can help communities adapt to extreme weather events often caused by climate change. WiDS participants will submit their forecasts on Kaggle. This tutorial will walk through the following steps of the model-making process:

- Importing a Tabular Dataset
- Preprocessing Data
- Training and Evaluating a Machine Learning Model
- Making New Predictions and Exporting Predictions

MathWorks is happy to support participants of the Women in Data Science Datathon 2023 by providing complimentary MATLAB licenses, tutorials, workshops, and additional resources. To request complimentary licenses for you and your teammates, go to this MathWorks site, click the “Request Software” button, and fill out the software request form.

To register for the competition and access the dataset, go to the Kaggle page, sign-in or register for an account, and click the ‘Join Competition’ button. By accepting the rules for the competition, you will be able to download the challenge datasets available on the ‘Data’ tab.

First, we need to bring the training data into the MATLAB workspace. For this tutorial, I will be using a subset of the overall challenge dataset, so the files shown below will differ from the ones you are provided. The datasets I will be using are:

- Training data (train.xlsx)
- Testing data (test.xlsx)

The data is in tabular form, so we can use the readtable function to import the data.

trainingData = readtable('train.xlsx', 'VariableNamingRule', 'preserve');

testingData = readtable('test.xlsx', 'VariableNamingRule', 'preserve');

Since the tables are so large, we don’t want to show the whole dataset at once, because it will take up the entire screen! Let’s use the head function to display the top 8 rows of the tables, so we can get a sense of what data we are working with.

head(trainingData)

head(testingData)

Now we can see the names of all of the columns (also known as variables) and get a sense of their datatypes, which will make it much easier to work with these tables. Notice that both datasets have the same variable names. If you look through all of the variable names, you’ll see one called ‘tmp2m’ – this is the column we will be training a model to predict, also called the response variable.
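As a toy illustration of what “response variable” means in table form (the values and the ‘contest_index’ column below are made up, not challenge data), we can separate the column we want to predict from the rest:

```matlab
% Toy stand-in for the challenge table; 'tmp2m' is the response variable
T = table([1;2;3], [20.1;21.5;19.8], ...
    'VariableNames', {'contest_index','tmp2m'});

response = T.tmp2m;                 % the column the model learns to predict
predictors = removevars(T, 'tmp2m') % everything else is a predictor
```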

It is important to have a training and testing set with known outputs, so you can see how well your model performs on unseen data. In this case, it is split ahead of time, but you may need to split your training set manually. For example, if you have one dataset in a 100,000-row table called ‘train_data’, the example code below would randomly split this table into 80% training and 20% testing data. These percentages are relatively standard when distributing training and testing data, but you may want to try out different values when making your datasets!

[trainInd, ~, testInd] = dividerand(100000, .8, 0, .2);

trainingData = train_data(trainInd, :);

testingData = train_data(testInd, :);

Now that the data is in the workspace, we need to take some steps to clean and format it so it can be used to train a machine learning model. We can use the summary function to see the datatype and statistical information about each variable:

summary(trainingData)

This shows that all variables are doubles except for the ‘start_date’ variable, which is a datetime and is not compatible with many machine learning algorithms. Let’s break it up into three separate predictors that may be more helpful when training our algorithms:

trainingData.Day = trainingData.start_date.Day;

trainingData.Month = trainingData.start_date.Month;

trainingData.Year = trainingData.start_date.Year;

trainingData.start_date = [];

I’m also going to move the ‘tmp2m’ variable to the end, which will make it easier to see that this is the variable we want to predict.

trainingData = movevars(trainingData, "tmp2m", "After", "Year");

head(trainingData)

Repeat these steps for the testing data:

testingData.Day = testingData.start_date.Day;

testingData.Month = testingData.start_date.Month;

testingData.Year = testingData.start_date.Year;

testingData.start_date = [];

testingData = movevars(testingData, "tmp2m", "After", "Year");

head(testingData)

Now, the data is ready to be used!

There are many different ways to approach this year’s problem, so it’s important to try out different models! In this tutorial, we will be using a machine learning approach to tackle the problem of weather forecasting, and since the response variable ‘tmp2m’ is a number, we will need to create a regression model. Let’s start by opening the Regression Learner app, which will allow us to rapidly prototype several different models.

regressionLearner

When you first open the app, you’ll need to click on the “New Session” button in the top left corner. Set the “Data Set Variable” to ‘trainingData’, and it will automatically select the correct response variable, because it is the last variable in the table. Then, since this is a pretty big dataset, I changed the validation scheme to “Holdout Validation” and set the percentage held out to 15. I chose these as starting values, but you may want to play around with the validation scheme when making your own model.

After we’ve clicked “Start Session”, the Regression Learner App interface will load.

Step 1: Start A New Session

[Click on “New Session” > “From Workspace”, set the “Data Set Variable” to ‘trainingData’, set the “Validation Scheme” to ‘Holdout Validation’, set “percent held out” to 15, click “Start Session”]

From here, I’m going to choose to train “All Quick-to-Train” model options, so I can see which one performs the best out of these few. The steps for doing this are shown below. Note: this recording is slightly sped up since the training will take several seconds.

Step 2: Train Models

[Click “All Quick-To-Train” in the MODELS section of the Toolstrip, delete the “1. Tree” model in the “Models” panel, click “Train All”, wait for all models to finish training]

I chose the “All Quick-to-Train” option so that I could show the process, but if you have the time, you may want to try selecting “All” instead of the “All Quick-to-Train” option. This will give you more models to work with.

Once those have finished training, you’ll see the RMSE, or Root Mean Squared Error, values shown on the left-hand side. This is a common error metric for regression models, and is what will be used to evaluate your submissions for the competition. RMSE is calculated using the following equation:

RMSE = sqrt( (1/n) * Σᵢ (predictedᵢ − actualᵢ)² )

This value tells you how well the model performed on the validation data. In this case, the Fine Tree model performed the best!
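To make the metric concrete, here is a minimal MATLAB sketch that computes RMSE by hand; the numbers are made up purely for illustration.

```matlab
% Minimal sketch: RMSE for a handful of example predictions (values are made up).
predicted = [21.3 18.7 25.1 19.8];
actual    = [20.9 19.2 24.5 20.3];
rmse = sqrt(mean((predicted - actual).^2));   % root of the mean squared error
```

A lower RMSE means the predictions sit closer to the true values on average.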

The Regression Learner app also lets you import test data to see how well the trained models perform on new data. This will give you an idea of how accurate the model may be when making your final predictions for the competition test set. Let’s import our ‘testingData’ table and see how these models perform.

Step 3: Evaluate Models with Testing Data

[Click on the “Test Data” dropdown, select “From Workspace”. In the window that opens, set “Test Data Set Variable” to ‘testingData’, then click “Import”. Click “Test All” – new RMSE values will be calculated]

This will take a few seconds to run, but once it finishes we can see that even though the Fine Tree model performed best on the validation data, the Linear Regression model performs best on completely new data.

You can also use the ‘PLOT AND INTERPRET’ tab of the Regression Learner app to create visuals that show how the model performed on the test and validation sets. For example, let’s look at the “Predicted vs. Actual (Test)” graph for the Linear Regression model:

Step 4: Plot Results

[Click on the drop-down menu in the PLOT AND INTERPRET section of the Toolstrip, then select “Predicted vs. Actual (Test)”]

Since this model performed relatively well, the blue dots (representing the predictions) stay pretty close to the line (representing the actual values). I’m happy with how well this model performs, so let’s export it to the workspace so we can make predictions on other datasets!

Step 5: Export the Model

[In the EXPORT section of the Toolstrip, click “Export Model” > “Export Model”. In the window that appears, click “OK”]

Now the model is in the MATLAB Workspace as “trainedModel” so I can use it outside of the app.

To learn more about exporting models from the Regression Learner app, check out this documentation page!

Once you have a model that you are happy with, it’s time to make predictions on new data. To show you what this workflow looks like, I’m going to remove the “tmp2m” variable from my testing dataset, because the competition test set will not have this variable.

testingData = removevars(testingData, "tmp2m");

Now we have a dataset that contains the same variables as our training set except for the response variable. To make predictions on this dataset, use predictFcn:

tmp2m = trainedModel.predictFcn(testingData);

This returns an array containing one prediction per row of the test set. To prepare these predictions for submission, we’ll need to create a table with two columns: one containing the index number, and one containing the prediction for that index number. Since the dataset I am using does not provide an index number, I will create an array with index numbers to show you what the resulting table will look like.

index = (1:length(tmp2m))';

outputTable = table(index, tmp2m);

head(outputTable)

Then we can export the results to a CSV file to be read and used by others!

writetable(outputTable, "datathonSubmission.csv");

To learn more about submission and evaluation for the competition, refer to the Kaggle page.

When creating any kind of AI model, it’s important to test out different workflows to see which one performs best for your dataset and challenge! This tutorial was only meant to be an introduction, but there are so many other choices you can make when preprocessing your data or creating your models. There is no one algorithm that suits all problems, so set aside some time to test out different models. Here are some suggestions on how to get started:

- Try other preprocessing techniques, such as normalizing the data or creating new variables
- Play around with the training options available in the app
- Change the variables that you use to train the model
- Try machine and deep learning workflows
- Change the breakdown of training, testing, and validation data
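As a starting point for the first suggestion above, here is a hedged MATLAB sketch that z-score normalizes the numeric columns of a table. It assumes a table like the `trainingData` created earlier; `normalize` on tables requires R2018a or later.

```matlab
% Sketch: z-score normalize the numeric columns of a training table.
% Assumes trainingData exists as created earlier in this tutorial.
% Caution: this also scales the response ('tmp2m'); you may want to
% exclude it, or undo the scaling on your predictions afterwards.
numericVars = vartype('numeric');
trainingData(:, numericVars) = normalize(trainingData(:, numericVars));
```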

If you are training a deep learning network, you can also utilize the Experiment Manager to train the network under different conditions and compare the results!

Thank you for joining me on this tutorial! We are excited to find out how you will take what you have learned to create your own models. I recommend looking at the ‘Additional Resources’ section below for more ideas on how you can improve your models. We are running a few complimentary online workshops going over how MATLAB can be used to solve problems like this. Use this link to sign up for a workshop in your time zone.

Feel free to reach out to us at studentcompetitions@mathworks.com if you have any further questions.

- Overview of Supervised Learning (Video)
- Preprocessing Data Documentation
- Missing Data in MATLAB
- Supervised Learning Workflow and Algorithms
- Train Regression Models in Regression Learner App
- Train Classification Models in Classification Learner App
- 8 MATLAB Cheat Sheets for Data Science
- MATLAB Onramp
- Machine Learning Onramp
- Deep Learning Onramp

In today’s post, Harshita Sharma joins us to talk about how she used MATLAB and Transfer Learning to build an application that helps differently abled children interpret sign language alphabets! Her hack won her the award for Best Use of MATLAB at HackMerced VII! VERY COOL! Harshita, over to you…

My name’s Harshita Sharma and I’m from India. I am a junior at BIT, Mesra majoring in Computer Science. I am a tech enthusiast, always looking for opportunities to learn, develop, and make myself fit for the tech industry. I love learning new technologies and implementing my knowledge to solve real-world problems. One of the main reasons why I am pursuing a degree in Computer Science is to make technology more accessible for people who aren’t part of the tech industry, people like my grandparents, small businessmen, farmers, etc, and accessible to the differently-abled! I enjoy working with code to develop applications, and am an open-source software enthusiast as well! I am also working on my Data Structures and problem-solving skills. Dancing Kathak and listening to music help me relax. I love travelling to places and experiencing their culture and cuisine. You can follow me and my work on my social media links below.

Sign language is a necessity for differently-abled people, especially deaf people, since it is their way of communicating. It is estimated that 70 million deaf people use sign language, and around 1 million people use American Sign Language (ASL) as their primary language of communication. It is one of the oldest and most natural forms of language, but since most people do not know sign language and interpreters are very difficult to come by, I came up with a real-time method using neural networks for fingerspelling based on ASL.

I built this for deaf children especially and people around them so that they can learn using an interactive platform.

Deaf and mute people use hand gestures to express their ideas to other people. Gestures are nonverbally exchanged messages that are understood with vision. This nonverbal communication is called sign language. Sign language is a visual language and consists of 3 major components

The problem was divided into 3 parts:

I created my own dataset for the following reasons: first, I was not able to find a dataset whose image size matched AlexNet’s input layer; second, working with other datasets made me realize that building your own dataset increases accuracy. I took 300 pictures for every letter. While making the dataset, I also kept the background and lighting conditions in mind.

Transfer learning is commonly used in deep learning applications. You can take a pretrained network and use it as a starting point to learn a new task. Fine-tuning a network with transfer learning is usually much faster and requires less data than training a network with randomly initialized weights from scratch. You can use layers from a network trained on a large data set and fine-tune on a new data set to identify new classes of objects.

To create the dataset, I used the MATLAB Support Package for USB Webcams (if you are using MATLAB Online, no additional installation is required; watch this video to learn more). First, a processing area is created and the variable temp is declared. Then a while loop runs until 300 photos have been taken for the dataset folder. In the loop, each image is stored in BMP format, cropped, and resized to 227×227, since that is what AlexNet’s input layer requires. Clearing the camera object c shuts down the connection to the webcam.

c = webcam; % Create the Camera Object

% Processing Area

x = 0;

y = 0;

height = 300;

width = 300;

bboxes=[x y height width];

temp=0;

% Loop to click 300 photos for each letter

while temp<=300

e=c.snapshot;

IFaces = insertObjectAnnotation(e,'rectangle',bboxes,'Processing Area');

imshow(IFaces);

filename=strcat(num2str(temp),'.bmp'); % Image Filename

es=imcrop(e,bboxes);

es=imresize(es,[227 227]); % Resize to meet AlexNet's specs

imwrite(es,filename);

temp=temp+1;

drawnow;

end

clear c;

For this application, as discussed above, I used AlexNet, which is a convolutional neural network that is 8 layers deep. You can load a pretrained version of AlexNet trained on more than a million images from the ImageNet database. The pretrained network can classify images into 1000 object categories, such as keyboard, mouse, pencil, and many animals. As a result, the network has learned rich feature representations for a wide range of images. The network has an image input size of 227-by-227. To download AlexNet, I used the Deep Learning Toolbox Model for AlexNet File Exchange submission.

g = alexnet;

layers = g.Layers; % extract the layers

layers(23) = fullyConnectedLayer(10); % 10 indicates the output size

layers(25) = classificationLayer;

allImages = imageDatastore('testing','IncludeSubfolders',true, 'LabelSource','foldernames');

opts = trainingOptions('sgdm','InitialLearnRate',0.001,'MaxEpochs',20,'MiniBatchSize',64);

myNet1 = trainNetwork(allImages,layers,opts);

save myNet1;

To test my trained network, I start by loading the network and then making a connection to the webcam to stream in images in real time. I then crop out the processing area and resize to fit AlexNet’s input layer requirements.

load myNet1; % Load the trained network

c = webcam; % create camera object

x=0;

y=0;

height=300;

width=300;

bboxes=[x y height width];

% Loop

while true

e=c.snapshot;

IFaces = insertObjectAnnotation(e,‘rectangle’,bboxes,‘Processing Area’);

es=imcrop(e,bboxes);

es=imresize(es,[227 227]);

label=classify(myNet1,es);

imshow(IFaces);

title(char(label));

drawnow;

end

As you can see in the photo below, the network I trained was able to identify the letter ‘a’ in real time. I was able to get this up and running during a weekend of hacking, and you can too! My code is available on this GitHub repository and you can watch this YouTube video I submitted to the hackathon.

My MATLAB journey started during my 2nd year in engineering college when my professor introduced us to MATLAB. I really like how MATLAB can compute things like matrix multiplication in seconds, when by hand or in general-purpose code it would take minutes. MATLAB seemed to be a mystery, a mystery that attracted me. So, I did the MATLAB Onramp and paid more attention to my MATLAB classes. Eventually, my interest in MATLAB grew more and more. I am still learning it and hope to be part of MathWorks one day!

I am a big fan of hackathons, and why not? When you participate in a hackathon you get to meet new people, learn new things, and most importantly build a project! Even better if you win. Believing in that, I decided to participate in one. I was scrolling through Major League Hacking’s website one evening and came across a GitHub repository listing MLH hackathons where MATLAB was a partner, and that was the moment I decided to take part. So yes, funny but true, that’s how I participated in the HackMerced hackathon, built a MATLAB project, and won the “Best Use of MATLAB” category award, yay! Here is the winning swag I received.

Recently, Liping from the Student Programs team at MathWorks had an interview with Lingfeng Lu and Jiangchuan Li, where they talked about their experience on how to use deep learning to accelerate PLL (Phase Locked Loop) design when they participated in one of the MathWorks Excellence in Innovation Projects. You can download the code and data shared by the team on GitHub: https://github.com/lulf0020/Behavior-modeling-of-PLL.

Lingfeng and Jiangchuan’s story started in 2021 when they were students at Shanghai Jiao Tong University in China. They took the school-enterprise cooperation course on “Engineering Practice and Scientific Innovation”, which was taught by Prof. Yuhong Yang and supported by Dr. Yueyi Xu, a MathWorks engineer. They were required to complete a project for this course, so they decided to select one of the MathWorks Excellence in Innovation Projects and complete it within three months.

MathWorks Excellence in Innovation Projects provides students and researchers with different cutting-edge ideas. All the projects are designed by MathWorks engineers who combined current industry needs with the latest technology development trends. The topics cover different areas including 5G, big data, Industry 4.0, artificial intelligence, autonomous driving, robotics, unmanned aerial vehicles (UAVs), computer vision, sustainable development, and renewable energy.

When checking the project list, Lingfeng and Jiangchuan were attracted by one of them named Behavioral Modelling of Phase-Locked Loop using Deep Learning Techniques.

With the significantly increased complexity of chip design, how to explore the design space faster has been becoming more and more challenging! A PLL is usually called the ‘heart’ of a chip, which uses the signal from an external oscillator as a reference and generates an output as a stable clock usually with a higher frequency via a closed-loop control. Designing a stable and robust PLL is important for a chip just like a healthy heart for a human body.

The practicality and novelty of this project attracted us! Behavior-level modeling of PLL can save time and costs in the design process. Specifically, after establishing the behavior-level model of PLL, we can directly obtain the performance of a PLL by importing the device parameters into the model without running a lot of simulations or tests.

Data sets and models are two key factors for deep learning. In this project, the two main problems that we met are:

Problem 1: How to build a data set efficiently?

Problem 2: How to build an effective deep learning model?

In this project, no data set was available for us. Before we started, Mr. Pragati Tiwary, the MathWorks engineer who designed the problem, gave us an in-depth explanation of the problem statement. He told us that the N-division PLL reference model provided in the Mixed-Signal Blockset in MATLAB gave us a way to build data sets through simulations.

One of the reference models, shown below, consists of five modules: Phase Frequency Detector (PFD), charge pump, loop filter, Voltage Controlled Oscillator (VCO), and frequency divider. What we needed to do was constantly change the parameters of the five modules to test the PLL’s performance in terms of operating frequency, lock time, and phase noise.

Besides reference models, a lot of different test benches provided by the Mixed-Signal Blockset made our task easier. Leveraging the PLL Testbench, we can conveniently test the performance of the PLL model with various parameters and then record the results.

In the beginning, to obtain a set of data, we manually changed the parameter settings of the model, ran a simulation, and then manually recorded the output results. However, we found this way of collecting data very time-consuming.

At this point, Pragati gave us patient guidance on how to automatically import data, run simulations, and export performance results in batches. Please refer to this webpage for more information on programmatic model management in Simulink. With Pragati’s help, we changed the model parameters from constants to variables, then used the MATLAB program to adjust the parameter value of the Simulink PLL model, run simulations, and then collected the results automatically.

However, we then found that some parameters that defined the model structure, such as the order of the loop filter, could not be modified simply by changing the value of the variable.

When we were stuck, we were pleased to find that we could always assume a fourth-order loop filter architecture and set some capacitance and resistance values to 0 to achieve a lower-order one. For instance, we set R3=R4=0 (ohm) and C3=C4=0 (F) in the fourth-order loop filter architecture to obtain a second-order loop filter. In this way, we could do a rapid scan of different model settings.

We also hoped that the performance could be automatically recorded by the program. However, the required performance data could not be exported directly, so we had to export the intermediate outputs from the test bench and then calculate the final metrics with MATLAB programs.

Finally, we established a MATLAB program to automatically simulate and test the PLL model. In each round, the program:

- Generated random numbers within a certain range and then set the values of the model parameters as these random numbers.
- Ran simulations and tests of the Simulink model.
- Recorded the intermediate results sent back by Simulink.
- Calculated the final performance metrics based on the recorded intermediate results.
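The loop described above might look roughly like this in MATLAB. This is a sketch only: the model name, parameter name, random range, and logged result are hypothetical placeholders, not the team’s actual code.

```matlab
% Sketch of the automated data-collection loop (all names are hypothetical).
mdl = 'pllModel';                       % placeholder Simulink model name
load_system(mdl);
nSamples = 500;
results = zeros(nSamples, 1);
for k = 1:nSamples
    Kvco = 1e6 + 9e6*rand;              % random parameter within a chosen range
    assignin('base', 'Kvco', Kvco);     % the model block reads the variable Kvco
    out = sim(mdl);                     % run one simulation
    % record an intermediate output, then compute the final metric from it
    results(k) = out.tout(end);         % placeholder calculation
end
```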

Using a MATLAB program to collect data automatically improved the efficiency of building the data set significantly. Once we had a data set, our problem became how to build an effective deep learning model.

Deep learning is generally used for feature extraction and regression or fitting. For example, convolutional neural network models have many convolutional layers and pooling layers for feature extraction.

Through experiments, we found that a two-layer feedforward neural network could already model the mapping between input parameters and output performance metrics well, so we used a simple feedforward neural network structure in our project.

MATLAB provides a Deep Learning Toolbox, where you can build the neural network model from scratch or by modifying a reference model. With this toolbox, MATLAB supports transfer learning for popular pre-trained models such as DarkNet-53, ResNet-50, NASNet, and SqueezeNet. Moreover, you can also import models from TensorFlow and Caffe to MATLAB.

What we used in this project is the Neural Network Fitting app included in the Deep Learning Toolbox. We recommend this app since it provides a two-layer feedforward neural network with a configurable number of neurons, as shown in the figure below.

In our neural network, the classic nonlinear sigmoid function was used as the activation function of the neurons in the hidden layer, while a linear output function was used in the output layer. The performance of the neural network was evaluated using the Mean Squared Error (MSE) and regression analysis.
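Outside the app, a comparable network can be sketched in a few lines with fitnet from Deep Learning Toolbox. The hidden-layer size of 10 and the split ratios here are assumptions for illustration, not the team’s actual settings, and X and T stand for matrices of device parameters and performance metrics.

```matlab
% Sketch: two-layer feedforward fitting network, like the one the
% Neural Network Fitting app generates (sigmoid-like hidden layer,
% linear output). Sizes and ratios below are assumptions.
net = fitnet(10);                    % 10 hidden neurons (assumption)
net.divideParam.trainRatio = 0.70;   % example split ratios
net.divideParam.valRatio   = 0.15;
net.divideParam.testRatio  = 0.15;
[net, tr] = train(net, X, T);        % X: device parameters, T: metrics
Y = net(X);
mseVal = perform(net, T, Y);         % mean squared error
```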

It must be mentioned that the fitting quality of the model was not good at the beginning, so we tried different methods such as data preprocessing, increasing the number of neurons, and adjusting the ratio of the training, test, and validation sets, and finally achieved a good result. The improvement methods that we used in our project include:

- Data preprocessing: For data with large differences in magnitude, we normalized the values with a logarithm function so that the distribution of the output data became more uniform, which reduced the possibility of underfitting or overfitting.
- Increasing the size of the test set: As usual, the data set in our project was divided into a training set, a test set, and a validation set. The training and test sets were used during training, while the validation set was mainly used for the final evaluation. We noted that a test set with at least 200 samples was necessary to ensure the reliability of the model training.
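The log-based preprocessing in the first point can be sketched in one line; the +1 offset to avoid taking the log of zero is my addition, and rawValues stands for any nonnegative data vector.

```matlab
% Sketch: compress values spanning several orders of magnitude.
scaled = log10(rawValues + 1);   % rawValues: a nonnegative data vector
```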

Time flies. Now Lingfeng has been working at Shanghai Mitsubishi Elevator Co., Ltd. (SMEC), and Jiangchuan has been preparing for the postgraduate entrance examination.

They told us this cross-cultural experience was unforgettable. This project has not only broadened their horizons but also boosted their confidence when communicating with people across the world. Through this project, they realized that it is important for students to leverage knowledge and innovation to solve real-world problems.

At last, they would like to thank MathWorks for providing them with the opportunity and thank Prof. Yang, Mr. Tiwary, and Dr. Xu for their help and guidance!

Joining us today is Grace Woolson, who joined the student programs team in June 2022 to support data science challenges and help participants use MATLAB to win! Grace will talk about a new data science challenge that launches today, November 1, 2022, with our partner DrivenData. Grace, over to you…

Hello all, my name is Grace Woolson and today we will be talking about a new Data Science Challenge for you to test your skills! We at MathWorks®, in collaboration with DrivenData, are excited to bring you this challenge. The objective of this challenge is to estimate the annual aboveground biomass (AGBM) in a given patch of Finland when provided satellite imagery of that patch. In this blog we are providing a basic starter example in MATLAB®. In this code, I create a basic image-to-image regression model and train it to predict peak AGBM for each pixel in the input data. Then I use this model on test data and save the results in the format required for the challenge.

This should serve as basic starting code to help you to start analyzing the data and work towards developing a more efficient, optimized, and accurate model using more of the training data available. To request your complimentary MATLAB license and other getting started resources, visit the MathWorks BioMassters challenge homepage.

[Fig 1.1: A visualization of AGBM in color. Image derived from data provided by the Finnish Forest Centre under the CC BY 4.0 license.]

If you want to access and run this code, you can use the ‘Run in your browser’ and ‘Download Live Script’ buttons in the bottom right corner of this page.


Each chip_id represents one patch of land in a given year. For each chip, you are provided approximately 24 satellite images and 1 AGBM image.

The satellite imagery comes from two satellites called Sentinel-1 (S1) and Sentinel-2 (S2), covering nearly 13,000 patches of forest in Finland from 2017 to 2021. Each chip is 2,560 by 2,560 meters, and the images of these chips are 256 by 256 pixels, so each pixel represents a 10 by 10 meter area of land within the chip. You are provided a single image from each satellite for each calendar month. For S1, each image is generated by taking the mean across all images acquired by S1 for the chip during that time. For S2, you are provided the best image for each month.

The AGBM image serves as the label for each chip in a given year. Just like the satellite data, the AGBM data is provided in the form of images that cover 2,560 meter by 2,560 meter areas at 10 meter resolution, which means they are 256 by 256 pixels in size. Each pixel in the satellite imagery corresponds to a pixel in the AGBM image with the same chip ID.

For the competition, you will use this data to train a model that can predict this AGBM value when provided only the satellite imagery. To learn more about the images, features, labels and submission metrics, head over to the challenge’s Problem Description page!

To understand the data that we will be working with, let’s look at a few example images for a specific chip_id. In the sections below, the images correspond to chip_id 0a8b6998.

First, define a variable that points to the S3 bucket so that we can access the data. You can find this path in the ‘biomassters_download_instructions.txt’ file provided on the data download page. Make sure this is the path for the entire bucket, not any specific folder; it should start with ‘s3://’. This will be used throughout the blog.

% Example path, you will need to replace this

s3Path = 's3://competition/file/path/';

For each chip_id, we expect to see 12 images from Sentinel-1 with the naming convention {chip_id}_S1_{month}, where month is a value between 00 and 11. There are cases where there may be missing data, which could result in one or more of these images missing.

Each Sentinel-1 image has four bands, where each band is one 256×256 matrix that contains a specific measurement for the chip. Let’s visualize each band of one of these S1 images:

exampleS1Path = fullfile(s3Path, 'train_features', '0a8b6998_S1_00.tif');

exampleS1 = imread(exampleS1Path);

% To visualize each layer, rescale the values of each pixel to be between 0 and 1

% Darker pixels indicate lower values, lighter pixels indicate higher values

montage(rescale(exampleS1));

Much like Sentinel-1, for each chip_id, we expect to see 12 images from Sentinel-2 with the naming convention {chip_id}_S2_{month}, where month is a value between 00 and 11. There are cases where there may be missing data, which could result in one or more of these images missing.

Each Sentinel-2 image has 11 bands, where each band is one 256×256 matrix that contains a specific measurement for the chip. Let’s visualize each band of one of these S2 images:

exampleS2Path = fullfile(s3Path, 'train_features', '0a8b6998_S2_00.tif');

exampleS2 = imread(exampleS2Path);

% To visualize each layer, rescale the values of each pixel to be between 0 and 1

% Darker pixels indicate lower values, lighter pixels indicate higher values

montage(rescale(exampleS2));

For each chip_id, there will be one AGBM image, with the naming convention {chip_id}_agbm.tif. This image is a 256×256 matrix, where each element is a measurement of aboveground biomass in tonnes for that pixel. For 0a8b6998, it looks like this:

exampleAGBMPath = fullfile(s3Path, 'train_agbm', '0a8b6998_agbm.tif');

exampleAGBM = imread(exampleAGBMPath);

% Since we only have to visualize one layer, we can use imshow

imshow(rescale(exampleAGBM))

Before we can start building a model, we have to find a way to get the data into the MATLAB Workspace. The data for this competition is contained in a public Amazon S3 bucket. The URL for this bucket will be provided once you have registered, so make sure you have signed up for the challenge so you can access the data. In total, all of the imagery provided takes up about 235 GB of storage, which is too much to work with all at once. So that we can work with all of the data, I will be taking advantage of MATLAB’s imageDatastore, which allows us to read the data in one chip_id at a time and will make it easy to train a neural network later on. If you want to learn more about datastores, you can refer to the following resources:

We use the s3Path variable we created earlier to create an agbmFolder variable, which points specifically to the AGBM training data.

agbmFolder = fullfile(s3Path, 'train_agbm');

We can then use agbmFolder to create a datastore for our input (satellite imagery) and output (AGBM imagery) data, named imInput and imOutput respectively. When you use an imageDatastore, you can change the way images from the specified directory are read into the MATLAB Workspace using the ‘ReadFcn‘ option. Since I want to read one AGBM image but 24 satellite images at a time, I define a helper function readTrainingSatelliteData that takes the filename of the AGBM file (which contains the chip_id) and instead reads in and preprocesses all corresponding satellite images. Then I use the built-in splitEachLabel function to divide the dataset into training, testing, and validation data, so that we can evaluate its performance during and after training. For this example, I chose to use 95% of the data for training, 2.5% for validation, and 2.5% for testing because I wanted to use most of the data for training, but you can play around with these numbers.

The readTrainingSatelliteData helper function does the following:

- Extracts the chip_id from the filename of the AGBM image that we will read
- Reads in and orders all satellite images that correspond to this chip_id
- Handles missing data. Since this is just our first model, I have decided to omit any images that contain missing data.
- With the remaining data, finds the average value of each pixel for each band.
- Rescales the values to be between 0 and 1. Each satellite has different units of measurement, which can make it difficult for some algorithms to learn from the data properly. By normalizing the data scale, it may allow the neural network to learn better.

This results in a single input image of size 256x256x15, where each 256×256 matrix represents the average values for one band from S1 or S2 over the course of the year. Since S1 has 4 bands and S2 has 11, this results in 15 matrices. This is a very simplified way to represent the data, as this will only be our starting model.

imInput = imageDatastore(agbmFolder, 'ReadFcn', @(filename)readTrainingSatelliteData(filename, s3Path), 'LabelSource', 'foldernames');

[inputTrain,inputVal,inputTest] = splitEachLabel(imInput,0.95,0.025);

For the output data, we will use the default read function, as we only need to read one image at a time and don’t need to do any preprocessing. Since we are passing the same directory to each datastore, we know that they will read the images in the same chip_id order. Once again, split the data into training, testing, and validation data.

imOutput = imageDatastore(agbmFolder, ‘LabelSource’, ‘foldernames’);

[outputTrain,outputVal,outputTest] = splitEachLabel(imOutput,0.95,0.025);

Once the data has been preprocessed, we combine the input and output sets so they may be used with our neural network later.

dsTrain = combine(inputTrain, outputTrain);

dsVal = combine(inputVal, outputVal);

dsTest = combine(inputTest, outputTest);

The preview function allows me to view the first item in the datastore, so that we can validate that the inputs (the first item) and outputs (the second item) are the sizes we are expecting:

sampleInputOutput = preview(dsTrain);

montage(rescale(sampleInputOutput{1})); % Input Data

imshow(rescale(sampleInputOutput{2})) % Output Data

Now that the data is imported and cleaned up, we can get started on actually developing a neural network! This challenge is interesting, in that the inputs and outputs are images. Often, neural networks will be used to take an image as input and output a class (image classification) or maybe a specific value (image-to-one regression), as shown below:

[Fig 2.1: visualization of an image classification convolutional neural network]

In this challenge, we are tasked with outputting a new image, so our network structure will need to look a little different:

[Fig 2.2: visualization of an image-to-image convolutional neural network]

No matter the type of network you’re using, there are two different ways you can make or edit a deep learning model: with the Deep Network Designer app or programmatically.

First, we have to choose a network architecture. For this blog, I have decided to create a starting network architecture using the unetLayers function. This function provides a network for semantic segmentation (an image-to-image classification problem), so it can be easily adapted for image-to-image regression. If you want to learn more about other starting architectures, check out this documentation page on Example Deep Learning Networks Architectures.

Since the input images will be 256x256x15, this must also be the input size of the network. For the other options, I chose an arbitrary number of classes since we will change the output layers anyway, and a starting depth of 3.

lgraph = unetLayers([256 256 15], 2, 'EncoderDepth', 3);

From here, I can open the Deep Network Designer app and modify the model interactively. I like this option as it lets me visualize what the network looks like and it’s easier to see that I’ve made the changes I want.

deepNetworkDesigner(lgraph)

When the app opens, it should look similar to the image below. If it’s zoomed in on certain layers, you can zoom out to see the full network by pressing the space bar.

[Fig 3.1: Deep Network Designer]

From here, remove the last two layers, and change the “Final-ConvolutionLayer” so that NumFilters is equal to 1. Some tips for using the app for this step:

- To zoom in or out, hold CTRL and scroll up or down on the mouse
- To delete a layer, click on it and hit the Backspace button on your keyboard.
- To modify a property of a layer, click on the layer. This will open a menu on the right that you can interact with.

[Fig 3.2: Removing and Modifying layers in Deep Network Designer]

It’s time to add in the regression layer:

[Fig 3.3: Adding a regression layer in Deep Network Designer]

Now, the model is done! It’s time to export it back into the MATLAB Workspace so it can be trained.

[Fig 3.4: Exporting a model from Deep Network Designer]

Note: it will automatically be exported as lgraph_1.

If you want to get more creative with your model, this documentation page about Deep Network Designer has more details on how to use the app.

Alternatively, the same network can be built programmatically. As before, we create a starting network architecture using the unetLayers function, which provides a network for semantic segmentation (an image-to-image classification problem) that can be easily adapted for image-to-image regression. If you want to learn more about other starting architectures, check out this documentation page on Example Deep Learning Networks Architectures.

Since the input images will be 256x256x15, this must also be the input size of the network. For the other options, I chose an arbitrary number of classes since we will change the output layers anyway, and a starting depth of 3.

lgraph = unetLayers([256 256 15], 2, 'EncoderDepth', 3);

Now we have to change the final few layers so that the model will perform regression instead of classification. I do this by removing the softmax and segmentation layers and replacing them with a new convolution layer and a regression layer. The new convolution layer has a single filter so that the final image output will be a single layer, and the regression layer tells MATLAB how to interpret the output and computes the model’s half-mean-squared-error. To learn more about converting classification networks into regression networks, you can refer to this resource: Convert Classification Network into Regression Network.

lgraph = lgraph.removeLayers('Softmax-Layer');

lgraph = lgraph.removeLayers('Segmentation-Layer');

finalConvolutionLayer = convolution2dLayer([1, 1], 1, 'Name', 'Final-ConvolutionLayer-2D');

lgraph = lgraph.replaceLayer('Final-ConvolutionLayer', finalConvolutionLayer);

lgraph = lgraph.addLayers(regressionLayer('Name', 'regressionLayer'));

lgraph_1 = lgraph.connectLayers('Final-ConvolutionLayer-2D', 'regressionLayer');

Once the network is built, we can use the analyzeNetwork function to check for errors and visualize the network. This will open in a new window.

analyzeNetwork(lgraph_1);

[Fig 4: Analysis and visualization of lgraph_1]

Once all of the layers are sorted out, it’s time to set the training options. The trainingOptions function lets us specify which solver will train the model and how it will be trained, and it’s important to play around with these options when training a model. There are endless combinations you can choose from, but these are the ones that have worked best for me so far:

options = trainingOptions('adam', ...
    'InitialLearnRate', 0.0001, ...
    'MiniBatchSize', 10, ...
    'MaxEpochs', 50, ...
    'ValidationData', dsVal, ...
    'OutputNetwork', 'best-validation-loss', ...
    'Verbose', false);

Note: if you want to see evaluation metrics and visualizations while the model is being trained, set 'Verbose' to true and set 'Plots' to 'training-progress'.

To learn more about what these training options do and how you can optimize them, you can refer to this resource: Set Up Parameters and Train Convolutional Neural Network

This step can be accomplished in only one line of code:

net = trainNetwork(dsTrain,lgraph_1,options)

While this is the shortest section of code, it can take several hours to train a deep learning model. If you have access to a supported GPU, I recommend using it: the trainNetwork function will automatically utilize a supported GPU if one is detected. The following resource contains more information on GPUs and deep learning: Run MATLAB Functions on a GPU
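If you are unsure whether your machine has a usable GPU, a quick check like the sketch below can help before kicking off a long training run. (This is an optional sketch; canUseGPU simply returns false when no supported GPU or required toolbox is available, so it is safe to run on any machine.)

```matlab
% Optional pre-training check: see whether a supported GPU is available.
% canUseGPU returns false if no supported GPU (or Parallel Computing
% Toolbox) is available, so this is safe to call anywhere.
if canUseGPU
    gpu = gpuDevice;    % query the currently selected GPU
    fprintf('Training will run on GPU: %s\n', gpu.Name);
else
    disp('No supported GPU detected; training will run on the CPU.');
end
```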

Now we have a fully trained model that is ready to make predictions on the test data! Please note that the model I created was trained on only a subset of the training data, so the results you see in this section may look different than results you get if you run the same code.

To get output images from the test set, use the predict function.

ypred = predict(net,dsTest);

size(ypred)

The resulting ypred is a 4-D matrix. The first 3 dimensions represent each output image of size 256x256x1, and the last dimension represents how many of these images we have predicted. It is hard to tell how well our model performed just by looking at these numbers, so we can take a few extra steps to evaluate the network.

To access the first pair of satellite and AGBM images from the test set, use the preview function.

testBatch = preview(dsTest);

This will allow us to visualize a sample input image, the actual AGBM, and the associated predicted AGBM from the network to get a sense of how well the network is performing.

idx = 1;

predicted = ypred(:,:,:,idx);

ref = testBatch{idx,2};

montage({ref,predicted})

title('Expected vs Predicted');

While the images aren’t identical, we can definitely see similar shapes and shading! Since the output data is a measure of AGBM and not a representation of color, however, the values for each pixel aren’t between 0 and 1, so anything above 1 is being displayed as a white pixel. Let’s use the rescale function as we did before to get a better representation of the images so we can see more details and ensure that these higher values are still accurate.

rescaledPred = rescale(predicted);

rescaledRef = rescale(ref);

montage({rescaledRef,rescaledPred})

title('Expected vs Predicted');

Now that we can see much more detail, we can confirm that the network does a good job of matching the general shapes and contours of the expected output. We can also see, however, that the image produced by the network is generally brighter than the expected output, indicating that a lot of the values are higher than they should be.

For the competition, your final score will be the average root-mean-square error (RMSE) of each image submitted. RMSE can be represented by the following formula:

$E = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left|A_i - F_i\right|^{2}}$

For a forecast array F and actual array A made up of n scalar observations.

Given this formula, we can calculate the RMSE for a given prediction with the following line of code:

rmse = sqrt(mean((ref(:) - predicted(:)).^2))

The lower the RMSE, the better the model is. As you can see, there is still plenty of room for improvement of this model.
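Since the competition score is the average RMSE over all submitted images, it is also worth computing the mean RMSE across the whole test split rather than a single image. The sketch below assumes the ypred and dsTest variables from earlier; note that readall loads every input/target pair into memory at once.

```matlab
% Average RMSE over the test split (sketch; assumes ypred and dsTest exist).
testData = readall(dsTest);          % each row is {input, target}
numImages = size(ypred, 4);
rmsePerImage = zeros(numImages, 1);
for k = 1:numImages
    target = double(testData{k, 2});
    pred   = double(ypred(:, :, :, k));
    rmsePerImage(k) = sqrt(mean((target(:) - pred(:)).^2));
end
meanRMSE = mean(rmsePerImage)        % lower is better
```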

Keep in mind that this network may not be the best because my main goal with this blog was to show how to use the imageDatastore and how to set up a network. But I do have a network that really tries, and there are lots of ways to keep trying things out and improving this network:

- Create a model that accepts more information! Right now we lose a lot of information from the raw training data, so finding a way to use more of it could result in a more informed model.
- Instead of ignoring data, find ways to fill it in. Do you make a copy of previous satellite images when one is missing? Fill it in with an average? There are lots of ways to approach this.
- Incorporate the cloud cover layer.
- Try out different model structures! One other example structure can be found here.
- Experiment with training options.
- Try different distributions of training, testing, and validation data.
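As one way to approach the second idea, instead of discarding a month with missing values you could impute them from the months that are present. The sketch below is a hypothetical helper (not part of the starter code) that replaces the -9999 sentinel pixels in a stack of monthly images with the per-pixel mean of the valid months; it assumes the images are floating point and all the same size.

```matlab
% Hypothetical imputation helper: monthData is a cell array of HxWxB
% floating-point images (one per month), possibly containing the -9999
% missing-data sentinel. Pixels missing in every month stay NaN.
function monthData = imputeMissing(monthData)
    stack = cat(4, monthData{:});            % H x W x B x months
    stack(stack == -9999) = NaN;             % mark missing pixels
    bandMean = mean(stack, 4, 'omitnan');    % per-pixel mean of valid months
    for m = 1:numel(monthData)
        im = stack(:, :, :, m);
        missing = isnan(im);
        im(missing) = bandMean(missing);     % fill from the monthly mean
        monthData{m} = im;
    end
end
```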

Once you have a model that you are happy with, you will need to use it to make predictions on the test data. To do this, we’ll need to first import and preprocess the data as we did above, then use the predict function to make predictions. Since we don’t have an ‘agbm’ folder to use as reference this time, the way we preprocess the data will have to look a little different.

To start, we will use the ‘features_metadata’ file provided to get a list of all test chip_ids.

featuresMetadataLocation = fullfile(s3Path, 'features_metadata.csv')

featuresMetadata = readtable(featuresMetadataLocation, 'ReadVariableNames', true);

testFeatures = featuresMetadata(strcmp(featuresMetadata.split, 'test'), :);

testChips = testFeatures.chip_id;

[~, uniqueIdx, ~] = unique(testChips);

uniqueTestChips = testChips(uniqueIdx, :);

Then I make a new folder that will hold all of the predictions and a variable that points to this folder:

if ~exist('test_agbm', 'dir')

mkdir test_agbm

end

% Include full file path; this is a placeholder and should NOT be on the S3 bucket

outputFolder = 'C:\DrivenData\..';

Then we iterate through each chip_id, format the input data to match the expected input of our network (256x256x15), make predictions on the input data, and export each prediction as a TIFF file using the Tiff and write functions. For the competition, the expected name of each TIFF file is '{chip_id}_agbm.tif'.

for chipIDNum = 1:length(uniqueTestChips)

chip_id = uniqueTestChips{chipIDNum};

% Format inputs

inputImage = readTestingSatelliteData(chip_id, s3Path);

% Make predictions

pred = predict(net, inputImage);

% Set up TIFF file and export prediction

filename = [outputFolder, chip_id, '_agbm.tif'];

t = Tiff(filename, 'w');

% Need to set tag info of Tiff file

tagstruct.ImageLength = 256;

tagstruct.ImageWidth = 256;

tagstruct.Photometric = Tiff.Photometric.MinIsBlack;

tagstruct.BitsPerSample = 32;

tagstruct.SamplesPerPixel = 1;

tagstruct.SampleFormat = Tiff.SampleFormat.IEEEFP;

tagstruct.PlanarConfiguration = Tiff.PlanarConfiguration.Chunky;

tagstruct.Compression = Tiff.Compression.None;

tagstruct.Software = 'MATLAB';

setTag(t,tagstruct)

% Export your prediction

write(t, pred);

close(t);

end

And just like that, you’ve exported your predictions! To create a TAR file of these predictions, we can simply use the built-in tar function.

tar('test_agbm.tar', 'test_agbm');

The resulting ‘test_agbm.tar’ is what you will submit for the challenge.

Thank you for following along with this starter code! We are excited to see how you will build upon it and create models that are uniquely yours. Feel free to reach out to us in the DrivenData forum or email us at studentcompetitions@mathworks.com if you have any further questions. Good luck!

If you want to learn more about deep learning with MATLAB, check out these resources!

- Deep Learning with MATLAB: Training a Neural Network from Scratch with MATLAB
- Series: Deep Neural Networks
- Experiment Manager: Design and run experiments to train and compare deep learning networks
- Data Preprocessing for Images
- Deep Learning Tips and Tricks

function avgImS1S2 = readTrainingSatelliteData(outputFilename, s3Path)

outputFilenameParts = split(outputFilename, ["_", "\"]);
chip_id = outputFilenameParts{end-1};

inputDir = fullfile(s3Path, 'train_features\');
correspondingFiles = dir([inputDir, chip_id, '*.tif']);

% The satellite images range from 00-11, so preallocate a cell array
s1Data = cell(1, 12);
s2Data = cell(1, 12);

% Compile and order all data
for fileIdx = 1:length(correspondingFiles)
    filename = correspondingFiles(fileIdx).name;
    filenameParts = split(filename, ["_", "\", "."]);
    satellite = filenameParts{end-2};
    fullfilename = strcat(inputDir, filename);
    im = imread(fullfilename);

    % Plus one because MATLAB indexing starts at 1
    idx = str2double(filenameParts{end-1}) + 1;

    % Add each image to the ordered cell array for its satellite
    if strcmp(satellite, 'S1')
        s1Data{idx} = im;
    elseif strcmp(satellite, 'S2')
        s2Data{idx} = im;
    end
end

% Ignore missing data: wrong number of bands, or the -9999 sentinel value
for imgNum = 1:12
    if size(s1Data{imgNum}, 3) ~= 4 || ismember(-9999, s1Data{imgNum})
        s1Data{imgNum} = [];
    end
    if size(s2Data{imgNum}, 3) ~= 11 || ismember(-9999, s2Data{imgNum})
        s2Data{imgNum} = [];
    end
end

% Calculate average S1 data over the images that remain
totalImS1 = zeros(256, 256, 4);
numS1 = 0;
for imgNum1 = 1:length(s1Data)
    currIm = s1Data{imgNum1};
    if ~isempty(currIm)
        totalImS1 = totalImS1 + double(currIm);
        numS1 = numS1 + 1;
    end
end
avgImS1 = totalImS1 ./ max(numS1, 1);

% Calculate average S2 data over the images that remain
totalImS2 = zeros(256, 256, 11);
numS2 = 0;
for imgNum2 = 1:length(s2Data)
    currIm = s2Data{imgNum2};
    if ~isempty(currIm)
        totalImS2 = totalImS2 + double(currIm);
        numS2 = numS2 + 1;
    end
end
avgImS2 = totalImS2 ./ max(numS2, 1);

% Combine all bands into one 15-band image
avgImS1S2 = cat(3, avgImS1, avgImS2);

% Rescale so the values are between 0 and 1
avgImS1S2 = rescale(avgImS1S2);

end

function avgImS1S2 = readTestingSatelliteData(chip_id, s3Path)

inputDir = fullfile(s3Path, 'test_features\');
correspondingFiles = dir([inputDir, chip_id, '*.tif']);

% The satellite images range from 00-11, so preallocate a cell array
s1Data = cell(1, 12);
s2Data = cell(1, 12);

% Compile and order all data
for fileIdx = 1:length(correspondingFiles)
    filename = correspondingFiles(fileIdx).name;
    filenameParts = split(filename, ["_", "\", "."]);
    satellite = filenameParts{end-2};
    fullfilename = strcat(inputDir, filename);
    im = imread(fullfilename);

    % Plus one because MATLAB indexing starts at 1
    idx = str2double(filenameParts{end-1}) + 1;

    % Add each image to the ordered cell array for its satellite
    if strcmp(satellite, 'S1')
        s1Data{idx} = im;
    elseif strcmp(satellite, 'S2')
        s2Data{idx} = im;
    end
end

% Ignore missing data: wrong number of bands, or the -9999 sentinel value
for imgNum = 1:12
    if size(s1Data{imgNum}, 3) ~= 4 || ismember(-9999, s1Data{imgNum})
        s1Data{imgNum} = [];
    end
    if size(s2Data{imgNum}, 3) ~= 11 || ismember(-9999, s2Data{imgNum})
        s2Data{imgNum} = [];
    end
end

% Calculate average S1 data over the images that remain
totalImS1 = zeros(256, 256, 4);
numS1 = 0;
for imgNum1 = 1:length(s1Data)
    currIm = s1Data{imgNum1};
    if ~isempty(currIm)
        totalImS1 = totalImS1 + double(currIm);
        numS1 = numS1 + 1;
    end
end
avgImS1 = totalImS1 ./ max(numS1, 1);

% Calculate average S2 data over the images that remain
totalImS2 = zeros(256, 256, 11);
numS2 = 0;
for imgNum2 = 1:length(s2Data)
    currIm = s2Data{imgNum2};
    if ~isempty(currIm)
        totalImS2 = totalImS2 + double(currIm);
        numS2 = numS2 + 1;
    end
end
avgImS2 = totalImS2 ./ max(numS2, 1);

% Combine all bands into one 15-band image
avgImS1S2 = cat(3, avgImS1, avgImS2);

% Rescale so the values are between 0 and 1
avgImS1S2 = rescale(avgImS1S2);

end


We now have 4 new members located in Natick (US) and Bangalore (India). The Student Programs team's more recent members will be introduced in today's post, including Emily, Ben, Roshan, and Manjunath. This team is tasked with equipping more student teams around the globe with software, training, and mentoring to tackle the same technical issues as professional engineers. MATLAB is one of the fastest-growing skills on LinkedIn profiles at top companies. By using MATLAB and Simulink in your projects you're adding in-demand skills to your resume that could help you get hired. Find out what the newest members of our team are doing to help you be successful!

#traveladdict #redhead #bachelorfan

What is your role in the Student Programs team?

My role in the Student Programs team is to establish the agreements for Academic Support. I assist with the application and payment process when a proposal has been submitted.

What big project are you currently working on?

I am currently crafting a 2023 budget for potential proposals and submissions.

Why do you like working in education?

Students with the opportunity to use MATLAB products open the doors to new discoveries, ideas, and solutions. It is great knowing that my small part can have a greater impact.

Fun Facts

- Trader Joe’s is my happy place.
- I love traveling and experiencing new places.
- I adopted a dog in 2022 and she is now my best bud.

#Beaches #Soccer #Concerts #Technology #Coffeeaddict

What is your role in the Student Programs team?

I support the MATLAB Student Ambassador program here at MathWorks. My role consists of managing the day-to-day operations of our student ambassadors worldwide.

What big project are you currently working on?

I am currently working on supporting our ambassadors and their return to campus for the fall semester. We have several new ambassadors joining us this semester!

Why do you like working in education?

I come from a family of educators in K-12, so education is “in my blood”. I have a strong passion for technology and education, so Ed Tech is the best of both worlds. I truly love working with students and watching them develop their technical and interpersonal skills throughout the program. I am always amazed by the creativity and passion our student ambassadors display every day. They always exceed my expectations and are remarkable to work with. Seeing them succeed in their professional careers after completing the ambassador program gives me joy.

Fun Facts

- I am a true lefty and have zero coordination with the right side of my body.
- Before MathWorks I worked for LEGO, so I have an epic LEGO collection.
- I played soccer in college, and I love Boston Sports.
- Completed my first Boston Marathon in 2018.

#Football #Gamer #AnimeEnthusiast #Music

What is your role in the Student Programs team?

My role on the Student Programs team focuses on the MATLAB Student Ambassador program. I create packaged content that the ambassadors can leverage for organizing competitions and workshops at their colleges.

What big project are you currently working on?

I am currently working on creating a 'Minidrone Modelling and Simulation' workshop, where students learn about and implement control systems, image processing, and path planning through Simulink.

Why do you like working in education?

Students have the most creative and imaginative ideas. Education is the ideal platform to interact with students, learn from them, and bring a real-world perspective to their innovative ideas.

When I was a student participating in such competitions, the learning and hands-on experience I gained was priceless compared to academics alone. By working in education, I hope to pass similar experiences on to the next generation.

Fun Facts

- I represented my state football team in the Nationals
- Love to read books

#Movielover #explorer #trekking

What is your role in the Student Programs team?

My role focuses on supporting the worldwide Minidrone Competitions and the Student Simulink Challenge.

What big project are you currently working on?

Recently, I have been working on launching and promoting the Student Simulink Challenge for the year 2022. I am also supporting the Minidrone India and EMEA competitions.

Why do you like working in education?

I believe thinking outside the box is a necessity in today's world, and students consistently showcase how to do so with innovative ideas. Working on the education team and supporting students in their journey motivates me and helps me grow personally.

Fun Facts

- I am a cinephile and action thrillers are my go-to genre
- Love watching tennis but prefer playing table tennis.



In this blog, Veer Alakshendra will show how you can develop a basic path planning algorithm for Formula Student Driverless competitions.

Before we get started, we just want to mention that you can run this code in your browser or download the complete live script using the buttons at the bottom right corner.


Various Formula Student competitions have introduced a driverless category, where the goal for the teams is to design and build an autonomous vehicle that can compete in different disciplines. In this script, we demonstrate the steps to plan a path through a racing track using Delaunay triangulation. The application is analogous to the first-lap path planning task in the Formula Student Driverless competition: planning a path through the coordinates of the detected cones.

Please note that the Delaunay triangulation is just one of the methods for planning a path for Formula Student Driverless competitions. You can also try to develop a sampling-based planner like RRT, RRT*, etc, or any other custom algorithm that best fulfills your requirements. To develop such planners using MATLAB, please check out the functions listed on the motion planning webpage.

Figure 1

First, let us briefly understand Delaunay triangulations. The fundamental property is the Delaunay criterion: for a set of points in 2-D, a Delaunay triangulation ensures that the circumcircle associated with each triangle contains no other point in its interior. In the figure below, the circumcircle associated with T2 is empty; it does not contain a point in its interior, so this triangulation is a Delaunay triangulation.

In the algorithm below, we have used this property to create a path using the detected cones as vertices.
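To make this concrete, here is a minimal sketch using MATLAB's delaunayTriangulation with a few hypothetical cone positions (not the competition track). A common trick is to keep only the triangulation edges that connect an inner cone to an outer cone; the midpoints of those edges are candidate waypoints down the center of the track.

```matlab
% Minimal Delaunay sketch with hypothetical cone positions.
inner = [0 0; 2 0; 4 0];          % e.g. yellow (inner) cones
outer = [0 2; 2 2; 4 2];          % e.g. blue (outer) cones
pts = [inner; outer];
dt = delaunayTriangulation(pts(:, 1), pts(:, 2));
allEdges = edges(dt);             % unique edges as pairs of vertex indices
% Keep only edges that connect an inner cone to an outer cone
innerCount = size(inner, 1);
isCross = (allEdges(:, 1) <= innerCount) ~= (allEdges(:, 2) <= innerCount);
crossEdges = allEdges(isCross, :);
% Midpoints of those edges are candidate waypoints along the track center
waypoints = (pts(crossEdges(:, 1), :) + pts(crossEdges(:, 2), :)) / 2;
```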

Figure 2

Reference: Working with Delaunay Triangulations

Figure 3 shows the methodology we have implemented to plan the path through the cones. To understand the algorithm, let us go through the code.

Figure 3

- Load cone coordinates

As a first step, we will load the x and y coordinates of the inner and outer cones. It is assumed that the perception algorithm detects the yellow and blue cones. One of the most common approaches in Formula Student competitions is to use a YOLO network to detect the cones; for reference, you can watch this video to learn how to design and train a YOLO network in MATLAB.

clc

clear

innerConePosition = [6.49447171658257,41.7389113024907;8.49189149682204,41.8037451937836;10.4848751821667,41.8690573815958;12.4735170408320,41.9319164105607;14.4579005366844,41.9894100214277;16.4380855350346,42.0386448286094;22.3534294484539,42.1081444836209;24.3165071700586,42.0957886616701;26.2748937812757,42.0609961552005;28.2282468521326,42.0009961057027;30.1761149294924,41.9130441723441;32.0795946202318,41.7896079706255;33.8817199327800,41.5914171121172;37.1238045770298,40.8042280456036;38.5108000329817,40.1631834270328;40.7498148915033,38.2980872410971;41.6428012725483,37.0572740958520;42.3684952713925,35.6541128117207;42.9237412798850,34.1220862330776;43.4562377484922,30.9418382209873;43.4041825643156,29.3682424552239;43.1310900802234,27.8454335309325;42.6373380962399,26.4081334491868;41.9213606895600,25.0723633391567;39.8493389129556,22.6733478226331;37.1330772214184,20.7325660900255;35.6438750732877,19.9947555768891;34.0970198095490,19.4366654165469;32.5117013027651,19.0709696852617;30.9105163807122,18.9081142458350;29.3050197645280,18.9545691109581;27.6851829730976,19.2045278218877;24.4759113345748,20.2570787967781;21.3685688158133,21.9377743574784;19.8072187451783,22.9775182612717;18.2040616100214,24.1054677217171;16.5291821840059,25.2778069272720;14.7490275765791,26.4446189166734;10.7566166245341,28.5122919557571;8.59035703212379,29.2671114310357;6.36970146209175,29.8034933598624;4.11753580708921,30.1252372241127;-0.405973075502234,30.1500578730274;-2.75740378439869,29.8083496207198;-7.53032772904185,27.7630095130657;-9.55882345130782,25.8360602858122;-10.9967405848897,23.4446538263345;-11.7530941810305,20.7728147412547;-11.7945553145768,18.0098876247875;-11.1377067999357,15.3398589997259;-9.84329815385971,12.9293912425976;-4.31864858110937,8.32717777382664;-2.64821833398598,7.33684919790624;-1.23396167898668,6.38669110355496;-0.139159724307982,5.41310565569044;0.583196113153409,4.42232005043196;1.00846697640521,3.25999992647677;1.42452204766312,-0.235409796458984;1.70769766109080,-2.50157264279400;2.43395748729283,-5.02746251429141;5.90118769712182,-9.36241405660719;8.13219227914984,-10.6723062142501;10.4108490166236,-11.5249176387577;14.7174578544944,-12.4162331505645;16.6611999625517,-12.7135418322201;18.4669199253036,-13.0527675263258;20.2513584750154,-13.4902268953088;22.0275043214267,-14.0208096878619;23.7938485314227,-14.6351269687361;29.0380553533840,-16.8864873288738;30.7681622622371,-17.7392626196140;32.3760783979903,-18.6039808584518;33.7897740653103,-19.5166383512928;34.9667330396653,-20.5090458088496;35.8614855279756,-21.5808775016001;36.4612858521334,-22.7451308643890;36.8711477182864,-25.4557910211687;36.6696611457761,-26.8930663715963;35.5175477601575,-29.6867414698392;34.6369025552692,-30.9011171932826;33.5888282300982,-31.9233335254203;32.4026659456202,-32.7184009884738;29.6211021726631,-33.5886007841515;27.9849204584717,-33.7170193136260;26.2183096377485,-33.6848022646400;24.3309596262912,-33.5465676173721;22.3208062313579,-33.3726612899257;18.1328018280879,-33.1964401661521;14.1429922166079,-33.1731958647648;12.2313503871593,-33.1266878316603;10.3778348140314,-33.0132052855498;8.58829435522285,-32.8040580663346;6.84064584789227,-32.4725374962639;5.07009944482550,-32.0172508962124;3.27322533222869,-31.4640545504384;-0.418687595522242,-30.1796696447230;-4.09627739578414,-28.8238238024923;-5.89202160786470,-28.1122858460715;-7.65255868890126,-27.3645667343373;-9.37355826729420,-26.5706223212060;-11.0772549160090,-25.7164049091686;-14.4775165017317,-23.8562440887552;-16.1852335110450,-22.8738429891042;-17.9066984601430,-21.8724503885825;-19.6280899733090,-20.8717543910881;-22.9283852622260,-18.8697018891971;-24.4790807692059,-17.8277052050408;-27.3051265922114,-15.5801090683244;-28.5448479313762,-14.3530174024786;-29.6530218737847,-13.0404842531107;-30.6444406510768,-11.6234683085559;-31.5169602294234,-10.1146356600596;-32.2692264908853,-8.52739205631154;-32.9016624434539,-6.87420998418598;-34.1160057737752,-1.60803776182190;-34.3261608207142,0.237322459171913;-34.4624174228984,2.11880932640591;-34.5407037273032,4.03273612643633;-34.5779532171168,5.97700580252438;-34.5920061096721,7.95066706685753;-34.6203010935088,11.9628384547546;-34.6463516562825,13.9656473217268;-34.6751545298116,15.9616504117489;-34.7230350194217,19.9334733431980;-34.7332354577762,21.9093507078716;-34.7284338451120,23.8784778261844;-34.6466389854161,27.7320065787716;-34.4887912178893,29.4532110956616;-34.1674004338218,30.9833511541837;-33.6523388115164,32.2835000565933;-32.9540435069214,33.3001054449054;-32.0153881487289,34.0792118997941;-28.7867224485036,35.7027232116937;-26.8737074947600,36.7067299514829;-21.7623826032159,39.5958663210330;-20.1418625665089,40.3549987041733;-18.5286281071084,40.9596629960629;-16.9269162480936,41.3805845888608;-15.2947586791386,41.6060043323431;-13.5312107482484,41.6777684016896;-9.64289482747188,41.5853268011948;-7.61112976471737,41.5421552074769;-5.58318659783131,41.5226049040385;-1.53976115651896,41.5427010208205;0.475431094342800,41.5765527022834;2.48619653317858,41.6224387797260;6.49447171658257,41.7389113024907;8.49189149682204,41.8037451937836;10.4848751821667,41.8690573815958;12.4735170408320,41.9319164105607;14.4579005366844,41.9894100214277]; % load inner cone x and y coordinates

outerConePosition = [8.29483356036796,47.8005083348189;10.2903642978411,47.8659036790991;12.2903853701921,47.9291209919643;14.2949817462715,47.9871977358879;16.3042155724446,48.0371512121282;18.3181128601480,48.0759783968909;24.3872198920862,48.0953719564451;26.4188343006777,48.0592693339474;28.4542241392916,47.9967391176812;30.4929111587893,47.9046750145358;32.5715480310539,47.7694057800486;34.7354354110013,47.5303707137414;36.9671887094624,47.1090924863221;41.4608084515852,45.3878796220444;43.5486168233166,43.9466679046951;46.7637769841785,40.1838709296144;47.8711744070693,38.0458741563760;48.6759026786991,35.8287318442088;49.2095764033813,33.5308461962448;49.3717906596030,28.7456240961468;48.9402528124274,26.3442245678724;48.1379348685719,24.0115869446911;46.9718766730536,21.8331831495909;45.5183908761479,19.8937545099424;42.0344522843090,16.7521854290723;38.0024261140352,14.4777602906678;35.7974585974047,13.6826661179064;33.4965697064231,13.1523520909686;31.1302557069132,12.9121393769316;28.7495405480103,12.9803375419597;26.4275819814214,13.3378047377686;24.1935407090117,13.9452300781406;20.0475993937396,15.7534500265274;16.3951857815830,18.0421326577611;14.7344934919394,19.2103582158950;13.1401972893467,20.3265665388672;11.5833537070008,21.3477072094266;10.0428032559345,22.2341839570762;6.90006888859156,23.5101221144020;5.24310276950825,23.9102111346615;3.54614319085073,24.1525066527486;1.82453456285617,24.2384953941222;-1.49646987010121,23.9423419780574;-2.88611819700968,23.5188503619890;-4.87663538345929,22.0841121388762;-5.48860547878733,21.0654842962416;-5.81567263287249,19.9085084532591;-5.83349192339180,18.6923268134878;-5.53917605118359,17.4977407053060;-4.95012865665813,16.4016948398736;-4.05317915721677,15.4107620877673;0.492121533127250,12.4494087837712;2.38653784056378,11.1712478467585;4.26363974073903,9.48929957533922;5.87263326747036,7.25460792254726;6.83628597561633,4.68706884998182;7.22868549746256,2.31823546262226;7.60968340642873,-1.42149659832866;7.97509056184070,-2.72619237583490;8.54802041473498,-3.68009075055991;10.6601317831479,-5.23084300440234;12.1167992046948,-5.77254995615783;13.7805173778787,-6.17261534649334;17.6231089179688,-6.79114948102549;19.7388585053217,-7.18913620702706;21.8298950982966,-7.70159820373082;23.8771264994103,-8.31301696223576;25.8797524901441,-9.00938215804746;27.8385154732156,-9.77726113305095;33.4828656272925,-12.3885256946890;35.3884273243004,-13.4149776838374;37.3243202944926,-14.6682383106842;39.1922962154652,-16.2493960759111;40.8455371260767,-18.2403341823979;42.0705518955054,-20.6153105774461;42.7346589897432,-23.1596769590206;42.5147079384368,-28.2478460553850;41.7538036771420,-30.6144579632713;39.1831183440142,-34.8167170198948;37.3770559671547,-36.5762176430613;35.2523671945937,-37.9984770098583;32.8791324240351,-38.9934826446380;28.1270516052592,-39.7153356388220;25.9046030197368,-39.6765956654354;23.8121031317075,-39.5240911796868;21.8464792201864,-39.3538830619781;19.9742827992146,-39.2418187116909;16.0937484438592,-39.1847191943436;11.9928246424488,-39.1219447460144;9.86352410414552,-38.9911216862446;7.69014987529461,-38.7364552625087;5.51782683234722,-38.3249002543397;3.42382206323132,-37.7869797271760;1.40682879137184,-37.1663842455962;-0.541821138011257,-36.5027177844325;-4.34015621146816,-35.1410426555394;-8.16436000121084,-33.6653461038145;-10.0767778404464,-32.8530238495275;-11.9792230934156,-31.9752972248347;-13.8428086034457,-31.0410375539532;-15.6623928384561,-30.0682796038662;-19.1958077261686,-28.0638760232219;-20.9238240938196,-27.0586776214187;-22.6524765196268,-26.0537507263292;-24.4077080359509,-25.0144494209715;-27.9360780875981,-22.7317004681533;-29.6698538901663,-21.4381474644069;-32.9453564203185,-18.4316844689044;-34.4107565227371,-16.6961594315364;-35.7065080013589,-14.8445665903860;-36.8313109009813,-12.8998963192641;-37.7878917266941,-10.8820330087871;-38.5811746376266,-8.80892992160400;-39.2178699842561,-6.69652639898087;-40.30118073394
99,-0.309611756649864;-40.4533406202231,1.78890139416343;-40.5382789183978,3.86217286202336;-40.5775538214583,5.90777727960811;-40.5919497511314,7.92466131148779;-40.6014196538181,9.91275631233656;-40.6457557289961,13.8810850824754;-40.6745334498720,15.8753221224371;-40.7017394704418,17.8763494567535;-40.7332264185658,21.8989357934230;-40.7282871101318,23.9204396925825;-40.7033369152928,25.9485603009070;-40.4291604069450,30.2970202551981;-39.9260632680985,32.6679289532538;-38.9665731582284,35.0689826553948;-37.4022554152914,37.3266934320522;-35.2924298638551,39.1052438930590;-33.2056190082383,40.2549711124138;-29.7935826692359,41.9483259837501;-28.1179159337869,42.9113732230407;-22.4882559681602,45.8771756304123;-20.3655683666973,46.6715497668858;-18.1093389106216,47.2629205744315;-15.7909195473899,47.5854545071348;-13.5681633690504,47.6776546092619;-11.4644254110510,47.6457593795914;-7.51988574761303,47.5414613781388;-5.55737846449435,47.5225493988029;-3.59058902924480,47.5236739806379;0.355197826404633,47.5753479114301;2.33404449770947,47.6205092826547;4.31684977705784,47.6749328204622;8.29483356036796,47.8005083348189;10.2903642978411,47.8659036790991;12.2903853701921,47.9291209919643;14.2949817462715,47.9871977358879;16.3042155724446,48.0371512121282] % load outer cone x and y coordinates

- Preprocess the data

After loading the data, we merge the inner and outer cone coordinates in alternating order (Figure 4). This step ensures that the input to the function delaunayTriangulation is a matrix whose columns are the x-coordinates and y-coordinates of the triangulation points.

Figure 4

[m,nc] = size(innerConePosition); % size of the inner/outer cone positions data

P = zeros(2*m,nc); % initiate a P matrix consisting of inner and outer coordinates

P(1:2:2*m,:) = innerConePosition;

P(2:2:2*m,:) = outerConePosition; % merge the inner and outer coordinates with alternate values

xp = []; % create an empty numeric xp vector to store the planned x coordinates after each iteration

yp = []; % create an empty numeric yp vector to store the planned y coordinates after each iteration

- Form triangles

In real scenarios, the sensors mounted on the vehicle detect only a certain number of yellow and blue cones as the vehicle travels through the race track. Hence, we have implemented a for loop that creates a Delaunay triangulation object for every nth set of cone positions. For example, if n = 4, the Delaunay triangulation is created from the coordinates of 4 cones. The image below illustrates the procedure.

Figure 5
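For readers following along without MATLAB, the windowing idea can be sketched in Python with SciPy. The interleaved arc coordinates and the window size below are made-up stand-ins for the P matrix and interv, not the competition data:

```python
import numpy as np
from scipy.spatial import Delaunay

# Hypothetical interleaved cone positions (stand-in for the P matrix):
# 12 points on two concentric arcs, inner/outer alternating.
t = np.linspace(0.0, np.pi, 6)
inner = np.column_stack((2.0 * np.cos(t), 2.0 * np.sin(t)))
outer = np.column_stack((3.0 * np.cos(t), 3.0 * np.sin(t)))
P = np.empty((12, 2))
P[0::2], P[1::2] = inner, outer

# Triangulate one window of n cones at a time, mimicking the for loop
# over every interv-th slice of P in the MATLAB code.
n = 4
windows = [Delaunay(P[i:i + n]) for i in range(0, len(P) - n + 1, n)]
```

Each window triangulates only the cones the car would have detected so far, which is why the MATLAB loop rebuilds the triangulation at every interval.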

interv = 10; % interval

for i = interv:interv:2*m

DT = delaunayTriangulation(P(((abs((i-1)-interv)):i),:)); % create Delaunay triangulation for points abs((i-1)-interv):i

Pl = DT.Points; % coordinates of vertices abs((i-1)-interv):i

Cl = DT.ConnectivityList; % triangulation connectivity matrix

[mc,nc] = size(Pl); % size

figure(1) % plot delaunay triangulations

triplot(DT,'k')

grid on

ax = gca;

ax.GridColor = [0, 0, 0]; % [R, G, B]

xlabel('x(m)')

ylabel('y (m)')

set(gca,'Color','#EEEEEE')

title('Delaunay Triangulation')

hold on

- Define constraints

While performing triangulation, the coordinates of the inner and outer cones are bound to create triangles outside the boundary of the track. As an example, Figure 6 shows a case where an exterior triangle is formed.

Figure 6

As these triangles can lead to an incorrect path, we have removed them by imposing constraints, C. These constraints are the vertex IDs of constrained edges, specified as a 2-column matrix. Each row of C corresponds to a constrained edge and contains two IDs:

C(j,1) is the ID of the vertex at the start of the edge.

C(j,2) is the ID of the vertex at the end of the edge.

As an example, Figure 7 shows the vertex IDs of the constrained edges where the matrix C = [2 1;1 3;3 5;5 6;2 4;4 6].

Figure 7
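The alternating inner/outer numbering makes the constraint matrix purely an indexing exercise. As a sketch, the Figure 7 example C can be rebuilt in Python with NumPy (IDs kept 1-based to match MATLAB; mc = 6 is the example's vertex count, not a value from the competition code):

```python
import numpy as np

# Rebuild the Figure 7 example constraint matrix for a window of
# mc = 6 vertices. Odd IDs chain along one boundary, even IDs along
# the other, with [2 1] and [mc-1 mc] closing the window ends.
mc = 6
cIn = np.vstack(([2, 1],
                 np.column_stack((np.arange(1, mc - 2, 2),
                                  np.arange(3, mc + 1, 2))),
                 [mc - 1, mc]))
cOut = np.column_stack((np.arange(2, mc - 1, 2),
                        np.arange(4, mc + 1, 2)))
C = np.vstack((cIn, cOut))
print(C.tolist())  # [[2, 1], [1, 3], [3, 5], [5, 6], [2, 4], [4, 6]]
```

This mirrors the cIn/cOut construction in the MATLAB snippet below for the even-interval case.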

So now, let us define the constraints for the inner and outer boundaries.

% inner and outer constraints when the interval is even

if rem(interv,2) == 0

cIn = [2 1;(1:2:mc-3)' (3:2:(mc))'; (mc-1) mc];

cOut = [(2:2:(mc-2))' (4:2:mc)'];

else

% inner and outer constraints when the interval is odd

cIn = [2 1;(1:2:mc-2)' (3:2:(mc))'; (mc-1) mc];

cOut = [(2:2:(mc-2))' (4:2:mc)'];

end

C = [cIn;cOut]; % create a matrix connecting the constraint boundaries

- Create Delaunay triangulation with constraints

Once the constraints are defined, we use the delaunayTriangulation object to create a constrained 2-D Delaunay triangulation.

TR = delaunayTriangulation(Pl,C); % Delaunay triangulation with constraints

Before we move to the next step, it is important to introduce you to the 'Connectivity List.' This property will be used in the subsequent steps to create a new triangulation by excluding the exterior triangles.

As per the documentation, the triangulation connectivity list is a matrix with the following characteristics:

- Each element in DT.ConnectivityList is a vertex ID.
- Each row represents a triangle or tetrahedron in the triangulation.
- Each row number of DT.ConnectivityList is a triangle or tetrahedron ID.

For example, in Figure 8 the elements of the first row [2 1 3] represent the vertices of the first triangle.

Figure 8
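As a minimal illustration of the same concept outside MATLAB, SciPy's Delaunay exposes an analogous connectivity list via its simplices property (note SciPy vertex IDs are 0-based, whereas MATLAB's are 1-based). The square below is made-up sample data:

```python
import numpy as np
from scipy.spatial import Delaunay

# Four corners of a unit square; Delaunay splits it into two triangles.
pts = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
tri = Delaunay(pts)

# tri.simplices plays the role of the connectivity list: one row per
# triangle, each element a (0-based) vertex ID into pts.
print(tri.simplices.shape)  # (2, 3): two triangles, three vertices each
```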

Now that you understand the meaning of the connectivity list, let us output the connectivity list of TR.

TRC = TR.ConnectivityList; % triangulation connectivity matrix

Once we have the connectivity matrix, we need to remove the rows that form exterior triangles. Figure 9 shows that the second row, [1 5 3], represents an exterior triangle.

Figure 9

With the delaunayTriangulation object, you can perform a variety of topological and geometric queries. For our case, we have used the object function isInterior, which returns a column vector of logical values indicating whether each triangle is inside a bounded geometric domain. The ith triangle in the triangulation is inside the domain if the ith logical flag is true; otherwise, it is outside. For example, as shown in Figure 10, the exterior triangle is assigned the logical value 0 (false).

Figure 10
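SciPy has no constrained triangulation or isInterior, but the filtering idea can be sketched by testing each triangle's centroid against the boundary polygon. The L-shaped polygon below is made-up sample data chosen so that exactly one Delaunay triangle falls in the concave notch, as in Figure 10:

```python
import numpy as np
from scipy.spatial import Delaunay

def in_polygon(px, py, poly):
    """Ray-casting point-in-polygon test (crossing number)."""
    inside = False
    n = len(poly)
    for k in range(n):
        x1, y1 = poly[k]
        x2, y2 = poly[(k + 1) % n]
        if (y1 > py) != (y2 > py) and \
           px < x1 + (py - y1) * (x2 - x1) / (y2 - y1):
            inside = not inside
    return inside

# Concave L-shaped boundary: its Delaunay triangulation also covers
# the notch, so one triangle lies outside the region.
poly = [(0, 0), (2, 0), (2, 1), (1, 1), (1, 2), (0, 2)]
pts = np.array(poly, dtype=float)
tri = Delaunay(pts)

# Keep triangles whose centroid lies inside the boundary,
# mimicking the effect of isInterior.
cent = pts[tri.simplices].mean(axis=1)
keep = np.array([in_polygon(x, y, poly) for x, y in cent])
interior = tri.simplices[keep]
```

The kept triangles tile exactly the L-shaped region (area 3), while the single notch triangle is discarded.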

TL = isInterior(TR); % logical values that indicate whether the triangles are inside the bounded region

TC = TR.ConnectivityList(TL,:); % triangulation connectivity matrix

From the previous step, we have obtained a new connectivity matrix that doesn't contain the exterior triangles. In this step, we use the updated connectivity matrix TC to create a 2-D triangulation from the points in matrix Pl.

[~,pt] = sort(sum(TC,2)); % optional step. The rows of connectivity matrix are arranged in ascending sum of rows...

% This ensures that the triangles are connected in progressive order.

TS = TC(pt,:); % connectivity matrix based on ascending sum of rows

TO = triangulation(TS,Pl); % create triangulations based on sorted connectivity matrix

figure(2) % plot delaunay triangulations

triplot(TO,'k')

grid on

ax = gca;

ax.GridColor = [0, 0, 0]; % [R, G, B]

xlabel('x(m)')

ylabel('y (m)')

set(gca,'Color','#EEEEEE')

title('Delaunay Triangulation without Outliers')

hold on

Once we have removed the outliers, the next step is straightforward: we compute the midpoints of the internal edges.

Figure 11

xPo = TO.Points(:,1);

yPo = TO.Points(:,2);

E = edges(TO); % triangulation edges

iseven = rem(E, 2) == 0; % neglect boundary edges

Eeven = E(any(iseven,2),:);

isodd = rem(Eeven,2) ~=0;

Eodd = Eeven(any(isodd,2),:);

xmp = ((xPo((Eodd(:,1))) + xPo((Eodd(:,2))))/2); % x coordinate midpoints

ymp = ((yPo((Eodd(:,1))) + yPo((Eodd(:,2))))/2); % y coordinate midpoints

Pmp = [xmp ymp]; % midpoint coordinates

Finally, to obtain a smooth path, we interpolate between the midpoints.

Figure 12
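The arc-length parameterization behind this step can be sketched in Python, with scipy's CubicSpline playing the role of interp1 with the 'spline' option. The quarter-circle waypoints are made-up stand-ins for the midpoints Pmp:

```python
import numpy as np
from scipy.interpolate import CubicSpline

# Hypothetical midpoints along a quarter circle (stand-ins for Pmp).
t = np.linspace(0.0, np.pi / 2, 8)
xmp, ymp = np.cos(t), np.sin(t)

# Cumulative distance between consecutive waypoints, matching
# distbp = cumsum([0; distancesteps]) in the MATLAB code.
steps = np.hypot(np.diff(xmp), np.diff(ymp))
distbp = np.concatenate(([0.0], np.cumsum(steps)))

# Resample 100 evenly spaced points along the path, like gradbp,
# and spline-interpolate x and y against the travelled distance.
gradbp = np.linspace(0.0, distbp[-1], 100)
xq = CubicSpline(distbp, xmp)(gradbp)
yq = CubicSpline(distbp, ymp)(gradbp)
```

Parameterizing by distance rather than by waypoint index is what keeps the resampled points evenly spread along the track.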

distancematrix = squareform(pdist(Pmp));

distancesteps = zeros(length(Pmp)-1,1);

for j = 2:length(Pmp)

distancesteps(j-1,1) = distancematrix(j,j-1);

end

totalDistance = sum(distancesteps); % total distance travelled

distbp = cumsum([0; distancesteps]); % distance for each waypoint

gradbp = linspace(0,totalDistance,100);

xq = interp1(distbp,xmp,gradbp,'spline'); % interpolate x coordinates

yq = interp1(distbp,ymp,gradbp,'spline'); % interpolate y coordinates

xp = [xp xq]; % store obtained x midpoints after each iteration

yp = [yp yq]; % store obtained y midpoints after each iteration

- Plot results

figure(3)

% subplot

pos1 = [0.1 0.15 0.5 0.7];

subplot('Position',pos1)

pathPlanPlot(innerConePosition,outerConePosition,P,DT,TO,xmp,ymp,cIn,cOut,xq,yq)

title(['Path planning based on constrained Delaunay' newline ' triangulation'])

% subplot

pos2 = [0.7 0.15 0.25 0.7];

subplot('Position',pos2)

pathPlanPlot(innerConePosition,outerConePosition,P,DT,TO,xmp,ymp,cIn,cOut,xq,yq)

xlim([min(min(xPo(1:2:(mc-1)),xPo(2:2:mc))) max(max(xPo(1:2:(mc-1)),xPo(2:2:mc)))])

ylim([min(min(yPo(1:2:(mc-1)),yPo(2:2:mc))) max(max(yPo(1:2:(mc-1)),yPo(2:2:mc)))])

end

h = legend('yCone','bCone','start','midpoint','internal edges',...

'inner boundary','outer boundary','planned path');

Pp = [xp' yp']; % concatenated planned path

Figure 13

So far, the algorithm only computes the path through the cones. However, in Formula Student Driverless competitions, the vehicle needs to simultaneously plan and track the path in the first lap. Hence, as a next task, you can try to implement a trajectory tracking controller. Here is a tutorial that shows how to implement trajectory tracking controllers in MATLAB and Simulink: Simulating Trajectory Tracking Controllers for Driverless Cars.

Further, if you are interested in generating an optimized raceline, feel free to check out this GitHub repository from Gautam Shetty: Raceline Optimization.

Also, in case of any queries related to this blog please feel free to reach out to us at racinglounge@mathworks.com.

function pathPlanPlot(innerConePosition,outerConePosition,P,DT,TO,xmp,ymp,cIn,cOut,xq,yq) % function to animate the plot

plot(innerConePosition(:,1),innerConePosition(:,2),'.y','MarkerFaceColor','y')

hold on

plot(outerConePosition(:,1),outerConePosition(:,2),'.b','MarkerFaceColor','b')

plot(P(1,1),P(1,2),'|','MarkerEdgeColor','#77AC30','MarkerSize',15, 'LineWidth',5)

grid on

ax = gca;

ax.GridColor = [0, 0, 0]; % [R, G, B]

xlabel('x(m)')

ylabel('y (m)')

set(gca,'Color','#EEEEEE')

hold on

plot(xmp,ymp,'*k')

drawnow

hold on

triplot(TO,'Color','#0072BD')

drawnow

hold on

plot(DT.Points(cOut',1),DT.Points(cOut',2), ...

'Color','#7E2F8E','LineWidth',2)

plot(DT.Points(cIn',1),DT.Points(cIn',2), ...

'Color','#7E2F8E','LineWidth',2)

drawnow

hold on

plot(xq,yq,'Color','#D95319','LineWidth',3)

drawnow

end

Copyright 2022 The MathWorks, Inc.