content.json
1 lines (1 loc) · 22.7 KB
{"meta":{"title":"trile.github.io","subtitle":null,"description":null,"author":"Tri Le","url":"http://trile.github.io"},"pages":[{"title":"","date":"2017-07-11T04:10:26.000Z","updated":"2017-07-11T04:10:26.000Z","comments":true,"path":"resume/index.html","permalink":"http://trile.github.io/resume/index.html","excerpt":"","text":"Tri Le Product Manager - Software Engineer 28 Nam Quoc Cang, District 1 Ho Chi Minh, Vietnam Phone: (+84) 98 244 7987 Email: tri.cao.le@gmail.com Education University of California San Diego B.S. Computer Science - 2007 Skills Product development and management. Java, Python, C, and C++ in UNIX and Windows environment. Web development with HTML5, CSS, JavaScript, NodeJs. Mobile app development and deployments. Web design with Sketch. Windows, UNIX/LINUX and Mac OS with administration experiences. Experiences Webtretho (by IDGVV) - Ho Chi Minh City, VN Product Manager Sep 2012 - Jul 2016 Directly reported to CEO. Helped organize company structure and source new recruitments. Updated and refined crossed functional team processes and communications. Managed cross-functional teams to develop and maintain products. These activities include analyzing, researching and developing product strategies, collecting user/client feedbacks, managing community forums and events, creating new contents, and developing and promoting new features and products (website and apps) for users. Managed products: User Feedback and Complaint, E-commerce and Media Integration, Mom and Kid Portal, Search Engine Optimization, Homepage, and Forum Mobile Web. Refined various product documents such as Community Guideline and Term of Services. Helped acquired social network license from Ministry of Information and Communication. IDG Venture Vietnam - Ho Chi Minh City, VN Senior Investment Analyst Apr 2010 - Aug 2012 Researched and analyzed product and service companies in technology sector. Provided information and recommendation on investments. 
Supported the investment due diligence process. Worked with portfolio managers on product, strategy, business development and funding. Reported portfolio status updates to senior managers and partners. East Agile - Ho Chi Minh City, VN Software Engineer Aug 2009 - Mar 2012 Developed iPhone and Ruby on Rails applications for clients in San Francisco. Vusion Inc. - San Jose, CA Software Engineer Mar 2008 - May 2009 Integrated Rhozet, a third-party transcoding farm, into the Vusion video management system (VMS). Set up, monitored and measured the performance of video transcoding systems. Handled customized transcoding requests and held training sessions for customers and partners on transcoding-related topics. Designed and developed an automated, scalable video and audio transcoding system. Written entirely in Python, the Vusion transcoding system retrieved jobs from a MySQL database, evaluated the properties of input video files, determined the transcoding profile for each input, and created XML job templates to produce thumbnails, video and audio in Vusion format, Flash and H.264 streams. The system fed the XML files to Rhozet to start transcoding, reporting per-job progress to customers along the way. Resulting video and audio streams were encrypted with Vusion tools and the VMS was notified through the database. The system also logged customer transcoding usage and monitored transcoding load. Two or more transcoding systems could run concurrently to ensure service uptime and scalable capacity. Dot Hill System Corp. - Carlsbad, CA Software Engineer Aug 2009 - Mar 2012 Applied RAID and SCSI knowledge to create and enhance the test suite for Dot Hill’s storage technologies. Developed performance-measuring software for RAID arrays. UCSD - San Diego, CA T.A. for Data Structures and OO Design class Jan 2009 - Dec 2012 Assisted students with programming assignments using C, C++, and Java, covering data structures, algorithm concepts and object-oriented design. 
Helped students improve their skills with gdb, IDE debuggers and the Purify profiler. Evaluated programming submissions by running automated grading software, designed test cases to exercise homework requirements and assisted other tutors in resolving technical problems."}],"posts":[{"title":"Machine Learning class note 3 - Logistic Regression","slug":"ml3-logistic-regression","date":"2017-04-28T11:37:54.000Z","updated":"2017-07-11T07:13:04.000Z","comments":true,"path":"2017/04/28/ml3-logistic-regression/","link":"","permalink":"http://trile.github.io/2017/04/28/ml3-logistic-regression/","excerpt":"","text":"II. Logistic regression 0. Presentation Idea: classify $y=0$ (negative class) or $y=1$ (positive class). From linear regression, $h_\\theta(x) = \\theta^TX$. We need to choose a hypothesis function such that $0 \\leq h_\\theta(x) \\leq 1$. 1. Hypothesis function $h_\\theta(x) = g(\\sum_{i=0}^{n}\\theta_ix_i)$ for $g(z) = \\frac{1}{1 + e^{-z}}$. Notes: Octave implementation of the sigmoid function: g = 1 ./ ( 1 + e .^ (-z)); Vectorized form: since $\\sum_{i=0}^{n}\\theta_ix_i = \\theta^TX$, the vectorized form of $h_\\theta(x)$ is $\\frac{1}{1 + e^{-\\theta^TX}}$ or $sigmoid(\\theta^TX)$. Octave implementation: h = sigmoid(theta' * X); $h_\\theta(x)$ is the estimated probability that $y=1$ on input $x$. When $sigmoid(\\theta^TX) \\geq 0.5$ we decide $y=1$, and since $sigmoid(\\theta^TX) \\geq 0.5$ exactly when $\\theta^TX \\geq 0$, we predict $y=1$ whenever $\\theta^TX \\geq 0$. We call $\\theta^TX = 0$ the decision boundary separating the region where $y=0$ from the region where $y=1$. It does not need to be linear, since $X$ can contain polynomial terms. 
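Not part of the original note: a minimal NumPy sketch of the sigmoid hypothesis and the decision rule above, assuming (as in the Octave snippets) that X already carries a leading all-ones column. Function names are my own.

```python
import numpy as np

def sigmoid(z):
    # g(z) = 1 / (1 + e^(-z)), elementwise
    return 1.0 / (1.0 + np.exp(-z))

def hypothesis(theta, X):
    # h_theta(x) = sigmoid(X @ theta); X includes the bias column x_0 = 1
    return sigmoid(X @ theta)

def predict(theta, X):
    # decision boundary: predict y = 1 exactly when theta^T x >= 0,
    # i.e. when sigmoid(theta^T x) >= 0.5
    return (hypothesis(theta, X) >= 0.5).astype(int)
```

With theta = [0, 1] the boundary is simply x_1 = 0, so a point with x_1 = 2 is classed 1 and a point with x_1 = -3 is classed 0.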
The decision boundary is a property of the hypothesis and its parameters $\\theta$, not of the training set. 2. The cost function We need to choose a cost function that is convex, with a single global minimum. Cost function: $J(\\theta) = \\frac{1}{m} \\sum_{i=1}^{m} Cost(h_\\theta(x^{(i)}), y^{(i)})$ with $Cost(h_\\theta(x^{(i)}), y^{(i)}) = -log(h_\\theta(x^{(i)}))$ if $y^{(i)} = 1$, and $-log(1 - h_\\theta(x^{(i)}))$ if $y^{(i)} = 0$. A simplified form of the cost function is: $J(\\theta) = -\\frac{1}{m}\\left[ \\sum_{i=1}^{m} y^{(i)} log(h_\\theta(x^{(i)})) + (1-y^{(i)}) log(1 - h_\\theta(x^{(i)})) \\right]$. Vectorized form: $J(\\theta) = \\frac{1}{m}\\left( -y^T log(sigmoid(X\\theta)) - (1-y^T) log(1-sigmoid(X\\theta)) \\right)$. Code in Octave to compute the cost function: J = (1/m) * ( -y' * log(sigmoid(X*theta) ) - (1-y') * log(1-sigmoid(X*theta)) ); We need the parameter $\\theta$ at which $J(\\theta)$ is minimal. Then we can make a prediction for a new $x$ using $h_\\theta(x) = \\frac{1}{1 + e^{-\\theta^Tx}}$. 3. 
Gradient Descent algorithm Using gradient descent to find the value of $\\theta$ that minimizes $J(\\theta)$, we have: Repeat until convergence: $\\theta_j := \\theta_j - \\alpha \\frac{\\partial}{\\partial \\theta_j} J(\\theta)$, which works out to: Repeat until convergence: $\\theta_j := \\theta_j - \\alpha \\frac{1}{m} \\sum_{i=1}^{m} \\left( h_\\theta(x^{(i)}) - y^{(i)} \\right) x_j^{(i)}$. Vectorized form: $\\theta := \\theta - \\frac{\\alpha}{m} \\left( X^T (sigmoid(X\\theta) - \\vec{y}) \\right)$. Code in Octave to compute the gradient $\\frac{\\partial}{\\partial \\theta_j} J(\\theta)$: grad = (1 / m) * (X' * (sigmoid( X * theta) - y) ); 4. Adding a regularization parameter Regularized cost function: $J(\\theta) = -\\frac{1}{m}\\left[ \\sum_{i=1}^{m} y^{(i)} log(h_\\theta(x^{(i)})) + (1-y^{(i)}) log(1 - h_\\theta(x^{(i)})) \\right] + \\frac{\\lambda}{2m}\\sum_{j=1}^{n}\\theta_j^2$. The second sum, $\\sum_{j=1}^{n}\\theta_j^2$, explicitly excludes the bias term $\\theta_0$: the $\\theta$ vector is indexed from 0 to n (holding n+1 values, $\\theta_0$ through $\\theta_n$), and the sum runs from 1 to n, skipping 0. 
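A useful sanity check on the gradient formula above (my own NumPy sketch, not from the original note) is to compare the analytic gradient against centered finite differences of the cost:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(theta, X, y):
    # J(theta) = (1/m) * (-y' log(h) - (1-y)' log(1-h))
    m = len(y)
    h = sigmoid(X @ theta)
    return (-y @ np.log(h) - (1 - y) @ np.log(1 - h)) / m

def grad(theta, X, y):
    # analytic gradient: (1/m) * X' (h - y)
    m = len(y)
    return X.T @ (sigmoid(X @ theta) - y) / m

def numerical_grad(theta, X, y, eps=1e-6):
    # centered finite differences, one coordinate at a time
    g = np.zeros_like(theta)
    for j in range(len(theta)):
        step = np.zeros_like(theta)
        step[j] = eps
        g[j] = (cost(theta + step, X, y) - cost(theta - step, X, y)) / (2 * eps)
    return g
```

On any small data set the two gradients should agree to several decimal places; a mismatch usually means a bug in the cost or gradient code.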
Octave code to compute the cost function with regularization: J = (1/m) * (-y' * log(sigmoid(X*theta)) - (1-y')*log(1-sigmoid(X*theta))) + lambda/(2*m)*sum(theta(2:end).^2); When running gradient descent we repeatedly apply the two updates below. Regularized gradient descent: Repeat { $\\theta_0 := \\theta_0 - \\alpha \\frac{1}{m} \\sum_{i=1}^{m} \\left( h_\\theta(x^{(i)}) - y^{(i)} \\right) x_0^{(i)}$ and $\\theta_j := \\theta_j - \\alpha \\left[ \\left( \\frac{1}{m} \\sum_{i=1}^{m} \\left( h_\\theta(x^{(i)}) - y^{(i)} \\right) x_j^{(i)} \\right) + \\frac{\\lambda}{m}\\theta_j \\right]$ for $j \\geq 1$ }. Octave code to compute the gradient $\\frac{\\partial}{\\partial \\theta_j} J(\\theta)$: grad = (1 / m) * (X' * (sigmoid( X * theta) - y)) + (lambda/m)*[0; theta(2:end)]; Notice that we do not add the regularization term for $\\theta_0$. 5. Advanced Optimization Prepare a function that can compute $J(\\theta)$ and $\\frac{\\partial}{\\partial \\theta_j} J(\\theta)$ for a given $\\theta$: function [jVal, gradient] = costFunction(theta) jVal = [...code to compute J(theta)...]; gradient = [...code to compute derivative of J(theta)...]; end With this function, Octave provides several advanced algorithms to minimize $J(\\theta)$; we should not implement these algorithms ourselves: Conjugate gradient, BFGS, L-BFGS. options = optimset('GradObj', 'on', 'MaxIter', 100); initialTheta = zeros(2,1); [optTheta, functionVal, exitFlag] = fminunc(@costFunction, initialTheta, options); We give fminunc() our cost function, our initial vector of theta values, and the options object created beforehand. 
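The fminunc workflow has a close analogue in Python, sketched below with scipy.optimize.minimize and BFGS; this is my own translation, and the tiny data set, the lambda value and all names are illustrative. Returning (J, grad) from one function and passing jac=True mirrors Octave's [jVal, gradient] pattern.

```python
import numpy as np
from scipy.optimize import minimize

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost_function(theta, X, y, lam):
    # regularized logistic cost and gradient; theta[0] is not penalized
    m = len(y)
    h = sigmoid(X @ theta)
    J = (-y @ np.log(h) - (1 - y) @ np.log(1 - h)) / m
    J += (lam / (2 * m)) * np.sum(theta[1:] ** 2)
    grad = (X.T @ (h - y)) / m
    grad[1:] += (lam / m) * theta[1:]
    return J, grad

# tiny separable data set: points with x_1 >= 0 belong to class 1
X = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])
res = minimize(cost_function, np.zeros(2), args=(X, y, 0.1),
               jac=True, method='BFGS', options={'maxiter': 100})
```

The small regularization term keeps the optimum finite on separable data; afterwards res.x plays the role of optTheta.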
Advantages: no need to pick $\\alpha$; often faster than gradient descent. Disadvantages: more complex. Practical advice: try a couple of different libraries.","categories":[],"tags":[]},{"title":"Machine Learning class note 2 - Linear Regression","slug":"ml2-linear-regression","date":"2017-04-22T11:23:03.000Z","updated":"2017-07-11T04:10:26.000Z","comments":true,"path":"2017/04/22/ml2-linear-regression/","link":"","permalink":"http://trile.github.io/2017/04/22/ml2-linear-regression/","excerpt":"","text":"I. Linear regression 0. Presentation Idea: try to fit the best line to the training set. 1. Hypothesis function $h_\\theta(x) = \\theta_0 + \\theta_1x_1 + \\theta_2x_2 + ... = \\sum_{i=0}^{n}\\theta_ix_i$. Vectorized form: $h_\\theta(x) = \\begin{bmatrix} \\theta_0 & \\theta_1 & \\theta_2 & ... \\end{bmatrix} \\begin{bmatrix} x_0 \\\\ x_1 \\\\ x_2 \\\\ ... \\end{bmatrix} = \\theta^TX$. Octave implementation (prepping by adding an all-ones column in front of X for $x_0$): h = theta' * X; The line fits best when the distance from our hypothesis to the training samples is minimal. Distance from the hypothesis to a training sample: $h_\\theta(x) - y = \\theta_0 + \\theta_1x_1 + \\theta_2x_2 + ... - y$. 2. The cost function Doing the above for every sample point, we arrive at the cost function below (also called the squared error function or mean squared error): $J(\\theta) = \\frac{1}{2m}\\sum_{i=1}^{m} \\left( h_\\theta(x^{(i)}) - y^{(i)} \\right)^2$. Octave implementation: J = 1 / (2*m) * sum((X * theta - y).^2); Vectorized form: $J(\\theta) = \\frac{1}{2m}(X\\theta - \\vec{y})^T(X\\theta - \\vec{y})$. J = 1 / (2*m) * (X*theta - y)' * (X*theta - y); 3. 
Batch Gradient Descent algorithm To find the value of $\\theta$ that minimizes $J(\\theta)$, we can use the batch gradient descent rule below: Repeat until convergence: $\\theta_j := \\theta_j - \\alpha \\frac{\\partial}{\\partial \\theta_j} J(\\theta)$, and if we take the partial derivative of $J(\\theta)$ with respect to $\\theta_j$ we have: Repeat until convergence: $\\theta_j := \\theta_j - \\alpha \\frac{1}{m} \\sum_{i=1}^{m} \\left( h_\\theta(x^{(i)}) - y^{(i)} \\right) x_j^{(i)}$. Octave implementation (note the temporary copy so that every feature is updated from the same theta): theta_prev = theta; for feature = 1:size(X, 2) theta(feature) = theta_prev(feature) - alpha*(1/m) * sum((X*theta_prev-y) .* X(:,feature)); end Vectorized form: $\\theta := \\theta - \\frac{\\alpha}{m} \\left( X^T (X\\theta - \\vec{y}) \\right)$. Octave implementation: theta = theta - alpha*(1/m) * (X' * (X*theta-y)); Notes on using gradient descent: it needs many iterations, but works well even when the number of features n > 10,000; complexity is $O(n^2)$. Choosing alpha: try steps such as 0.1, 0.3, 0.6, 1, ... Feature normalization: to make gradient descent converge faster, we can normalize the data: $x_i = \\frac{x_i - \\mu_i}{s_i}$ where $\\mu_i$ is the average of all values of feature i and $s_i$ is the range of values or the standard deviation of feature i. After obtaining theta, we can plug X and theta back into the hypothesis function to compute predictions: $h_\\theta(x) = \\theta^TX$. 4. Adding a regularization parameter Why regularization? The more features introduced, the higher the chance that overfitting happens. To address overfitting, we can reduce the number of features or use regularization: keep all the features, but reduce the magnitude of the parameters $\\theta_j$. Regularization works well when we have a lot of slightly useful features. How? 
Adding the regularization parameter changes the following. Regularized cost function: $J(\\theta) = \\frac{1}{2m} \\left[ \\sum_{i=1}^{m} \\left( h_\\theta(x^{(i)}) - y^{(i)} \\right)^2 + \\lambda \\sum_{j=1}^{n}\\theta_j^2 \\right]$. Regularized gradient descent: we modify our gradient descent updates to separate out $\\theta_0$ from the rest of the parameters, because we do not want to penalize $\\theta_0$. Repeat { $\\theta_0 := \\theta_0 - \\alpha \\frac{1}{m} \\sum_{i=1}^{m} \\left( h_\\theta(x^{(i)}) - y^{(i)} \\right) x_0^{(i)}$ and $\\theta_j := \\theta_j - \\alpha \\left[ \\left( \\frac{1}{m} \\sum_{i=1}^{m} \\left( h_\\theta(x^{(i)}) - y^{(i)} \\right) x_j^{(i)} \\right) + \\frac{\\lambda}{m}\\theta_j \\right]$ for $j \\geq 1$ }. 5. Normal equation There is another way to minimize $J(\\theta)$: computing the value of $\\theta$ explicitly, without resorting to an iterative algorithm: $\\theta = (X^TX)^{-1} X^T\\vec{y}$. Sometimes $X^TX$ is non-invertible; reasons include redundant features or too many features ($m \\leq n$). In such cases we should reduce the number of features or switch to gradient descent. Octave implementation: theta = pinv(X' * X) * X' * y; Notes on using the normal equation: no need to choose alpha; no need to run many iterations; feature scaling is also not necessary; complexity is $O(n^3)$, since we need to calculate the inverse of $X^TX$ (pinv(X' * X)). When n is very large, we should switch to gradient descent. 
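The normal equation can be sketched in NumPy as follows (my own translation, not from the original note; the data set is made up). Using a pseudo-inverse keeps the computation robust when redundant features make X^T X non-invertible, echoing the pinv advice above.

```python
import numpy as np

def normal_equation(X, y):
    # theta = (X^T X)^(-1) X^T y, computed with a pseudo-inverse so that
    # a non-invertible X^T X (redundant features) is still handled
    return np.linalg.pinv(X.T @ X) @ X.T @ y

# fit y = 1 + 2*x exactly from four points; X has the bias column of ones
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])
theta = normal_equation(X, y)  # close to [1, 2]
```

No alpha, no iterations: one linear-algebra step recovers theta, at the cost of the cubic-time inverse when n grows.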
We can also apply regularization to the normal equation: $\\theta = (X^TX + \\lambda \\cdot L)^{-1} X^T\\vec{y}$ with $L = \\begin{bmatrix} 0 & & & & \\\\ & 1 & & & \\\\ & & 1 & & \\\\ & & & \\ddots & \\\\ & & & & 1 \\end{bmatrix}$. Recall that if $m < n$ then $X^TX$ is non-invertible (and it may be non-invertible if $m = n$). However, when we add the term $\\lambda \\cdot L$, $X^TX + \\lambda \\cdot L$ becomes invertible.","categories":[],"tags":[]},{"title":"Machine Learning class note 1 - Intro","slug":"ml1-intro","date":"2017-04-12T14:47:15.000Z","updated":"2017-07-11T04:10:26.000Z","comments":true,"path":"2017/04/12/ml1-intro/","link":"","permalink":"http://trile.github.io/2017/04/12/ml1-intro/","excerpt":"","text":"Machine Learning Two definitions: Arthur Samuel: the field of study that gives computers the ability to learn without being explicitly programmed. Tom Mitchell: a computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E. General notations: m = number of training examples; n = number of features; x = input variables; y = output variables; (x, y) = one training example; (x(i), y(i)) = the i-th training example; h = hypothesis function, mapping from x’s to y’s. 1. Supervised Learning In supervised learning, we are given a data set and already know what our correct output should look like, having the idea that there is a relationship between the input and the output. Supervised learning problems are categorized into “regression” and “classification” problems. In a regression problem, we are trying to predict results within a continuous output, meaning that we are trying to map input variables to some continuous function. 
In a classification problem, we are instead trying to predict results in a discrete output. In other words, we are trying to map input variables into discrete categories. 2. Unsupervised Learning Unsupervised learning allows us to approach problems with little or no idea what our results should look like. We can derive structure from data where we don’t necessarily know the effect of the variables, by clustering the data based on relationships among the variables in the data. With unsupervised learning there is no feedback based on the prediction results. Unsupervised learning problems are categorized into “clustering” and “non-clustering” problems. Clustering: take a collection of 1,000,000 different genes, and find a way to automatically group these genes into groups that are somehow similar or related by different variables, such as lifespan, location, roles, and so on. Non-clustering: the “Cocktail Party Algorithm” allows you to find structure in a chaotic environment (i.e. identifying individual voices and music from a mesh of sounds at a cocktail party).","categories":[],"tags":[]}]}